diff --git a/.gitignore b/.gitignore index e5adf3c582..8a5f6c0aa4 100644 --- a/.gitignore +++ b/.gitignore @@ -49,3 +49,8 @@ AGENTS.md .codex .idea +site/ +site/ +site/ +site/ +site/ diff --git a/docs/tutorials/posts/noc-reasoning-agent.md b/docs/tutorials/posts/noc-reasoning-agent.md new file mode 100644 index 0000000000..f00fc42d6d --- /dev/null +++ b/docs/tutorials/posts/noc-reasoning-agent.md @@ -0,0 +1,386 @@ +--- +date: 2025-09-05 +readtime: 30 +--- + +# Teaching a Model to Reason Over Telecom Network Incidents + +This tutorial walks you through a complete pipeline for fine-tuning a reasoning model that can autonomously diagnose and resolve telecom network incidents. Using [Nemo-Skills](https://nvidia-nemo.github.io/Skills/) together with a [NoC Reasoning Agent](https://github.com/aiden200/NoC_Reasoning_Agent), we will take [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) and teach it to perform step-by-step root-cause analysis with tool-calling — the same workflow a human NOC (Network Operations Center) engineer follows today. + +If you're following along, you'll need access to an NVIDIA DGX box (or equivalent) with eight NVIDIA A100 (or newer) GPUs, or a Slurm cluster with similarly configured nodes. The full pipeline — from data processing through training to evaluation — takes several hours depending on dataset size and hardware. + + + +## Background + +In traditional telco operations, network incidents begin with alarms from network elements (eNodeBs, gNodeBs, routers, transmission links). NOC engineers manually validate each alarm by checking FM dashboards, PM KPIs, topology views, logs, and customer-impact tools. They then perform root-cause analysis and either apply a fix (restarts, reroutes, configuration corrections) or escalate to field teams. + +A fine-tuned reasoning model automates this entire flow: + +1. **Multi-source validation** — Checks multiple OSS/BSS sources via tool calls +2. 
**Step-by-step RCA** — Performs root-cause analysis methodically +3. **Automated healing** — Triggers healing scripts automatically +4. **Pattern recognition** — Uses historical data patterns to filter out self-recovering alarms + +The result is a dramatic reduction in Mean Time to Resolve (MTTR) and operational cost, moving toward a zero-touch, self-healing network. + +## Setup + +Nemo-Skills uses Docker containers to orchestrate the pipeline jobs. If you are running locally, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html); otherwise, use a Slurm cluster that supports [NVIDIA/pyxis](https://github.com/NVIDIA/pyxis). + +Start by installing Nemo-Skills: + +```shell +pip install git+https://github.com/NVIDIA-NeMo/Skills.git +ns setup +``` + +When prompted during `ns setup`, mount your working folder as `/workspace` (e.g. `/your/project/dir:/workspace`). This folder is used in all subsequent commands. For more details, see the [Nemo-Skills configs](https://nvidia-nemo.github.io/Skills/basics/cluster-configs/) documentation. 
+ +Next, download the NoC Reasoning Agent recipe into your workspace: + +```shell +ns run_cmd --expname=prepare-noc --log_dir=/workspace/prepare-noc --cluster=local \ + 'cd /workspace && \ + export RECIPE_PREFIX=https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/recipes/noc-reasoning-agent && \ + mkdir -p src/filtering src/utils src/evaluation src/ns_pipelines data/prompts outputs && \ + wget $RECIPE_PREFIX/scripts/filtering/match_keywords.py -O src/filtering/match_keywords.py && \ + wget $RECIPE_PREFIX/scripts/filtering/filter_rows.py -O src/filtering/filter_rows.py && \ + wget $RECIPE_PREFIX/scripts/utils/create_input_jsonl_from_incidents.py -O src/utils/create_input_jsonl_from_incidents.py && \ + wget $RECIPE_PREFIX/scripts/utils/format_reasoning_json.py -O src/utils/format_reasoning_json.py && \ + wget $RECIPE_PREFIX/scripts/utils/split_incident_data.py -O src/utils/split_incident_data.py && \ + wget $RECIPE_PREFIX/scripts/utils/schema_columns.py -O src/utils/schema_columns.py && \ + wget $RECIPE_PREFIX/scripts/utils/reasoning_processes.py -O src/utils/reasoning_processes.py && \ + wget $RECIPE_PREFIX/scripts/ns_pipelines/prepare_react_agent.py -O src/ns_pipelines/prepare_react_agent.py && \ + wget $RECIPE_PREFIX/scripts/tools.py -O src/tools.py && \ + wget $RECIPE_PREFIX/scripts/create_agent_with_tools_batch.py -O src/create_agent_with_tools_batch.py && \ + wget $RECIPE_PREFIX/scripts/evaluation/problem_code_evaluation.py -O src/evaluation/problem_code_evaluation.py && \ + wget $RECIPE_PREFIX/prompts/formatting_prompt.yaml -O data/prompts/formatting_prompt.yaml && \ + wget $RECIPE_PREFIX/prompts/shortened_prompt_reasoning.yaml -O data/prompts/shortened_prompt_reasoning.yaml && \ + wget $RECIPE_PREFIX/prompts/prompt_incident.yaml -O data/prompts/prompt_incident.yaml && \ + wget $RECIPE_PREFIX/configs/noc_reasoning_sft.yaml -O data/noc_reasoning_sft.yaml && \ + wget $RECIPE_PREFIX/data/synthetic_incidents.csv -O data/synthetic_incidents.csv' 
+``` + +All scripts and prompts referenced in this tutorial are available in the [recipes/noc-reasoning-agent](https://github.com/NVIDIA-NeMo/Skills/tree/main/recipes/noc-reasoning-agent) directory of the Nemo-Skills repository. + +### Configure the environment + +Several scripts import shared modules from sibling directories (e.g. `from src.tools import ...`). Set two environment variables before running any Python commands: + +```shell +# Add workspace to PYTHONPATH so Python can find the src package +export PYTHONPATH=/workspace:$PYTHONPATH + +# Disable the uncommitted-changes check (useful during development) +export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1 +``` + +Python 3.3+ namespace packages allow importing from `src/` without `__init__.py` files, so only the `PYTHONPATH` is needed. + +The setup step above downloads a sample `data/synthetic_incidents.csv` into `/workspace`. To use your own data, replace this file with your incident CSV (same column schema). The sample file is also available in the [recipes/noc-reasoning-agent/data/](https://github.com/NVIDIA-NeMo/Skills/tree/main/recipes/noc-reasoning-agent/data) directory of the Nemo-Skills repository. + +In the following sections, we always use `--cluster=local`. Change to `--cluster=slurm` (or whatever you named the config) if running on a Slurm cluster. When using Slurm, commands will finish immediately and schedule jobs in the cluster queue. + +## Data Processing + +The pipeline starts with raw incident CSV data. We progressively filter it to keep only actionable, remotely-solvable incidents that are most useful for training. 
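As a rough illustration of the kind of filtering applied in this section, the sketch below keeps only rows whose category is among the most frequent ones. It is not the code in `filter_rows.py`: the helper name `keep_top_categories`, the `fault_category` column, and the sample rows are assumptions made for the example.

```python
from collections import Counter


def keep_top_categories(rows, column, top_k):
    """Keep only rows whose `column` value is among the `top_k` most frequent values."""
    counts = Counter(row[column] for row in rows)
    keep = {value for value, _ in counts.most_common(top_k)}
    return [row for row in rows if row[column] in keep]


# Made-up incidents; a real run would read them from the incident CSV.
rows = [
    {"id": "1", "fault_category": "link_down"},
    {"id": "2", "fault_category": "link_down"},
    {"id": "3", "fault_category": "power"},
    {"id": "4", "fault_category": "power"},
    {"id": "5", "fault_category": "fan"},
]

filtered = keep_top_categories(rows, "fault_category", top_k=2)
print([row["id"] for row in filtered])
```

Restricting the data to the most common categories trades coverage for balance: rare fault types are dropped so that every remaining category has enough examples to learn from.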
+ +### Classify Incidents + +Categorize incidents by solution type (Soft Solve, Physical Intervention, Unknown): + +```shell +python src/filtering/match_keywords.py \ + --input_csv data/synthetic_incidents.csv \ + --output_csv data/categorized_incidents.csv +``` + +### Filter the Dataset + +Apply a series of filters to narrow the dataset to high-quality, actionable incidents: + +```shell +# Remove auto-recovered incidents +python src/filtering/filter_rows.py \ + --input_csv data/categorized_incidents.csv \ + --output_csv data/filtered_file.csv \ + --filter_type auto + +# Keep only remotely-solvable incidents +python src/filtering/filter_rows.py \ + --input_csv data/filtered_file.csv \ + --output_csv data/filtered_file.csv \ + --filter_type soft_solve + +# Keep top 16 fault categories +python src/filtering/filter_rows.py \ + --input_csv data/filtered_file.csv \ + --output_csv data/filtered_file.csv \ + --filter_type problem_codes + +# Keep top 10 resolution methods +python src/filtering/filter_rows.py \ + --input_csv data/filtered_file.csv \ + --output_csv data/finalized_dataset.csv \ + --filter_type close_codes +``` + +### Convert to JSONL + +Convert the filtered CSV into the JSONL format required by Nemo-Skills: + +```shell +python src/utils/create_input_jsonl_from_incidents.py \ + --input data/finalized_dataset.csv \ + --output outputs/input_incident.jsonl \ + --examples_by_problem_code 1000 +``` + +The `--examples_by_problem_code 1000` flag limits to 1000 examples per fault category for a balanced training set. + +## Synthetic Data Generation + +With the input data prepared, we use a powerful teacher model to generate structured reasoning traces. This is a two-phase process: first we generate structured resolution procedures, then we inject detailed reasoning into each step. 
+ +### Phase 1: Generate Structured Procedures + +Use the teacher model ([gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)) to generate step-by-step incident resolution procedures: + +```shell +ns generate \ + --cluster=local \ + --server_type=vllm \ + --expname=gpt-oss-sdg-with-python \ + --model=openai/gpt-oss-120b \ + --server_gpus=8 \ + --output_dir=/workspace/outputs/sdg/ \ + --input_file=/workspace/outputs/input_incident.jsonl \ + ++prompt_config=/workspace/data/prompts/formatting_prompt.yaml \ + ++inference.tokens_to_generate=8192 \ + ++inference.temperature=0.6 \ + ++chat_template_kwargs.reasoning_effort=medium \ + ++inference.endpoint_type=text \ + ++code_execution=false \ + ++server.enable_soft_fail=True \ + ++skip_filled=False --rerun_done +``` + +The `ns generate` command starts a vLLM server, sends each incident through the prompt template in `formatting_prompt.yaml`, and writes the results to `outputs/sdg/output.jsonl`. For more details about the generation pipeline, see the [generation](https://nvidia-nemo.github.io/Skills/pipelines/generation/) documentation. 
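Before moving on, it can help to sanity-check the generation output by counting records and listing the fields they contain. The following is a minimal, stdlib-only sketch: `summarize_jsonl` is a hypothetical helper, and the demo runs on a throwaway file with invented fields rather than on the real output, whose exact schema is not shown in this tutorial.

```python
import json
import tempfile
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, sorted field names) for a JSONL file."""
    count = 0
    fields = set()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            fields.update(record)  # iterating a dict yields its keys
    return count, sorted(fields)


# Demo on a temporary file; the field names below are invented.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"incident_identifier": "INC-1", "generation": "step 1"}\n'
        '{"incident_identifier": "INC-2", "generation": "step 1"}\n',
        encoding="utf-8",
    )
    count, fields = summarize_jsonl(sample)

print(count, fields)
```

In practice, point the helper at `outputs/sdg/output.jsonl` inside your workspace instead of the temporary file. A record count far below the number of input incidents is a hint that some generations failed.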
+ +### Parse and Format Steps + +Extract structured resolution steps from the raw model output: + +```shell +python src/utils/format_reasoning_json.py \ + --input outputs/sdg/output.jsonl \ + --output outputs/sdg/formatted_output.json \ + --jsonl_file outputs/input_incident.jsonl \ + --parse_type steps_extraction +``` + +### Phase 2: Inject Reasoning Traces + +Run the teacher model again to add detailed thinking traces to each procedural step: + +```shell +ns generate \ + --cluster=local \ + --server_type=vllm \ + --expname=gpt-oss-sdg-with-python \ + --model=openai/gpt-oss-120b \ + --server_gpus=8 \ + --output_dir=/workspace/outputs/sdg_reason/ \ + --input_file=/workspace/outputs/sdg/formatted_output.json \ + ++prompt_config=/workspace/data/prompts/shortened_prompt_reasoning.yaml \ + ++inference.tokens_to_generate=8192 \ + ++inference.temperature=0.6 \ + ++chat_template_kwargs.reasoning_effort=medium \ + ++inference.endpoint_type=text \ + ++code_execution=false \ + ++skip_filled=False --rerun_done \ + ++server.enable_soft_fail=True +``` + +### Compile Training Data + +Merge the structured procedures with reasoning traces into a model-ingestable format: + +```shell +python src/utils/format_reasoning_json.py \ + --input outputs/sdg/output.jsonl \ + --output_dir outputs/sdg/full_data \ + --jsonl_file outputs/input_incident.jsonl \ + --reasoning_jsonl outputs/sdg_reason/output.jsonl \ + --parse_type compile_reasoning +``` + +This step tokenizes content for the target model, compresses reasoning steps, and organizes data into a curriculum based on reasoning complexity. + +## Model Training + +With synthetic data generated, we fine-tune the model using [NeMo-RL](https://github.com/NVIDIA-NeMo/RL/) with the Megatron backend. 
+ +### Prepare SFT Data + +First, split the data into training and testing sets: + +```shell +python src/utils/split_incident_data.py \ + --input_dir outputs/sdg/full_data \ + --train_output outputs/training_data_split.jsonl \ + --test_output outputs/testing_data_split.jsonl +``` + +Then prepare the data in the format required for supervised fine-tuning: + +```shell +ns run_cmd \ + --log_dir=/workspace/prepare-sft-data-indicence \ + --expname=prep-sft-data-inci \ + --run_after=solution-generation \ + --cluster=local \ + 'python -m nemo_skills.training.prepare_data \ + --config-path /workspace/data \ + --config-name noc_reasoning_sft \ + input_files=/workspace/outputs/training_data_split.jsonl \ + output_path=/workspace/outputs/sft-data-incidence.jsonl \ + prompt_config=/workspace/data/prompts/prompt_incident.yaml \ + tokenizer=Qwen/Qwen3-32B \ + filters.remove_contaminated=false \ + add_unlabeled=true \ + filters.trim_solutions=false' +``` + +The prompt template in `prompt_incident.yaml` defines the NOC engineer system prompt and the 11 available tool definitions (Check_Alarm_Status, Check_Element_Health, Execute_Remote_Action, etc.) that the model will learn to call during reasoning. 
+ +### Run SFT Training + +Fine-tune [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) using NeMo-RL with the Megatron backend: + +```shell +ns nemo_rl sft \ + --cluster=local \ + --expname=training \ + --output_dir=/workspace/training \ + --hf_model=Qwen/Qwen3-32B \ + --num_nodes=1 \ + --num_gpus=8 \ + --training_data=/workspace/outputs/sft-data-incidence.jsonl \ + --backend=megatron \ + --final_hf_path=/workspace/training/qwen3-32b-improved-hf \ + ++sft.max_num_epochs=1 \ + ++policy.megatron_cfg.tensor_model_parallel_size=8 \ + ++policy.megatron_cfg.activation_checkpointing=True \ + ++policy.megatron_cfg.sequence_parallel=True \ + ++policy.model_name=Qwen/Qwen3-32B \ + ++policy.max_total_sequence_length=16384 \ + ++policy.train_global_batch_size=32 \ + ++policy.optimizer.kwargs.lr=1e-5 \ + ++checkpointing.save_weights_only=true \ + ++checkpointing.keep_top_k=1 \ + ++policy.lr=1e-5 +``` + +Key training parameters: + +- `tensor_model_parallel_size=8` splits the model across all 8 GPUs +- `activation_checkpointing=True` reduces memory usage by recomputing activations +- `max_total_sequence_length=16384` sets the context length for reasoning traces +- `lr=1e-5` is a conservative learning rate appropriate for fine-tuning + +To learn more about SFT configuration, see the [Nemo-Skills training](https://nvidia-nemo.github.io/Skills/pipelines/training/) documentation. + +## Evaluation + +To evaluate the fine-tuned model, we use a ReAct (Reasoning + Acting) agent that calls NOC tools at each step, then compare its incident resolution accuracy against the baseline model. 
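For readers new to ReAct, the loop alternates between asking the model what to do next and feeding the tool's result back as an observation. The sketch below shows that control flow only. It is not the implementation in `create_agent_with_tools_batch.py`: the scripted model, the stub tool body, and the `react_loop` helper are all assumptions for illustration, and only the tool name `Check_Alarm_Status` comes from the tutorial.

```python
def check_alarm_status(site):
    """Stub tool: a real one would query the fault-management system."""
    return f"alarm on {site}: cleared"


TOOLS = {"Check_Alarm_Status": check_alarm_status}


def scripted_model(history):
    """Stand-in for the LLM: requests one tool call, then gives a final answer."""
    if not any(kind == "observation" for kind, _ in history):
        return {"action": "Check_Alarm_Status", "input": "SITE-001"}
    return {"final": "Resolved"}


def react_loop(model, tools, incident, max_steps=5):
    """Alternate model decisions and tool calls until a final answer or step limit."""
    history = [("incident", incident)]
    for _ in range(max_steps):
        decision = model(history)
        if "final" in decision:
            return decision["final"], history
        observation = tools[decision["action"]](decision["input"])
        history.append(("action", decision["action"]))
        history.append(("observation", observation))
    return None, history  # step budget exhausted without a final answer


answer, trace = react_loop(scripted_model, TOOLS, "Link down at SITE-001")
print(answer)
for kind, value in trace:
    print(f"{kind}: {value}")
```

The `max_steps` cap matters in evaluation: without it, a model that never emits a final answer would loop indefinitely on a single incident.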
+ +### Prepare Test Data + +Prepare the test set in the same format as training: + +```shell +ns run_cmd \ + --log_dir=/workspace/prepare-test-data-indicence \ + --expname=prep-test-data-inci \ + --run_after=solution-generation \ + --cluster=local \ + 'python -m nemo_skills.training.prepare_data \ + --config-path /workspace/data \ + --config-name noc_reasoning_sft \ + input_files=/workspace/outputs/testing_data_split.jsonl \ + output_path=/workspace/outputs/sft-test-incidence.jsonl \ + prompt_config=/workspace/data/prompts/prompt_incident.yaml \ + tokenizer=Qwen/Qwen3-32B \ + filters.remove_contaminated=false \ + add_unlabeled=true \ + filters.trim_solutions=false' +``` + +### Build Agent Input + +Create the ReAct agent input file containing incident prompts with tool response data: + +```shell +python src/ns_pipelines/prepare_react_agent.py \ + outputs/testing_data_split.jsonl \ + outputs/sft-test-incidence.jsonl \ + --output outputs/final_agent_input.jsonl \ + --prompt_config data/prompts/prompt_incident.yaml +``` + +### Install Agent Dependencies + +Install the additional libraries needed for the ReAct agent: + +```shell +pip install --upgrade langgraph langchain langchain-huggingface transformers torch accelerate pandas +``` + +### Run the Fine-Tuned Agent + +```shell +python src/create_agent_with_tools_batch.py \ + --input outputs/final_agent_input.jsonl \ + --output outputs/agent_responses.jsonl \ + --weights_dir training/qwen3-32b-improved-hf +``` + +### Run the Baseline Agent + +For comparison, run the same evaluation using the original (non-fine-tuned) model: + +```shell +python src/create_agent_with_tools_batch.py \ + --input outputs/final_agent_input.jsonl \ + --output outputs/baseline_agent_responses.jsonl \ + --weights_dir Qwen/Qwen3-32B +``` + +### Compare Results + +Evaluate both models by computing close-code accuracy (how often the model selects the correct resolution method): + +```shell +# Fine-tuned model +python 
src/evaluation/problem_code_evaluation.py outputs/agent_responses.jsonl + +# Baseline model +python src/evaluation/problem_code_evaluation.py outputs/baseline_agent_responses.jsonl +``` + +The evaluation script matches the model's predicted close code against the expected answer using synonym-aware matching (e.g. "Resolved" and "Issue Corrected" are both recognized). You should see a meaningful improvement in the fine-tuned model's accuracy compared to the baseline. + +## What's next? + +With Nemo-Skills, you can easily extend this pipeline in several directions: + +- **Scale the dataset** — Generate more synthetic incidents or add new fault categories to broaden coverage. +- **Add more tools** — Extend the tool set beyond the 11 NOC tools to cover additional operational workflows. +- **Multi-turn reasoning** — Experiment with longer reasoning chains by increasing `tokens_to_generate` and `max_total_sequence_length`. +- **Deploy with vLLM** — Serve the fine-tuned model using the [start-server pipeline](https://nvidia-nemo.github.io/Skills/pipelines/start-server/) for production inference. + +All the commands used in this tutorial can be combined into a single Python script using the Nemo-Skills [Python API](https://nvidia-nemo.github.io/Skills/pipelines/#python-interface), enabling end-to-end reproducibility. With just one line change (`--cluster=slurm`), you can transition from local prototyping to large-scale experiments on a Slurm cluster. + +This pipeline demonstrates that the same synthetic-data-generation and fine-tuning approach that works for math reasoning can be applied to real-world industrial domains like telecom network operations — teaching models not just to think, but to act. 
diff --git a/recipes/noc-reasoning-agent/.gitignore b/recipes/noc-reasoning-agent/.gitignore new file mode 100644 index 0000000000..d4ebc6bdcd --- /dev/null +++ b/recipes/noc-reasoning-agent/.gitignore @@ -0,0 +1,214 @@ +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[codz] +*$py.class + +# C extensions +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. +*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +*.py.cover +.hypothesis/ +.pytest_cache/ +cover/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 +db.sqlite3-journal + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +.pybuilder/ +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +# For a library or package, you might want to ignore these files since the code is +# intended to run in multiple environments; otherwise, check them in: +# .python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don't work, or not +# install all needed dependencies. +#Pipfile.lock + +# UV +# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. 
+# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +#uv.lock + +# poetry +# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. +# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control +#poetry.lock +#poetry.toml + +# pdm +# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. +# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python. +# https://pdm-project.org/en/latest/usage/project/#working-with-version-control +#pdm.lock +#pdm.toml +.pdm-python +.pdm-build/ + +# pixi +# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control. +#pixi.lock +# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one +# in the .venv directory. It is recommended not to include this directory in version control. +.pixi + +# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm +__pypackages__/ + +# Celery stuff +celerybeat-schedule +celerybeat.pid + +# SageMath parsed files +*.sage.py + +# Environments +.env +.envrc +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ + +# pytype static type analyzer +.pytype/ + +# Cython debug symbols +cython_debug/ + +# PyCharm +# JetBrains specific template is maintained in a separate JetBrains.gitignore that can +# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore +# and can be added to the global gitignore or merged into this file. 
For a more nuclear +# option (not recommended) you can uncomment the following to ignore the entire idea folder. +#.idea/ + +# Abstra +# Abstra is an AI-powered process automation framework. +# Ignore directories containing user credentials, local state, and settings. +# Learn more at https://abstra.io/docs +.abstra/ + +# Visual Studio Code +# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore +# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore +# and can be added to the global gitignore or merged into this file. However, if you prefer, +# you could uncomment the following to ignore the entire vscode folder +# .vscode/ + +# Ruff stuff: +.ruff_cache/ + +# PyPI configuration file +.pypirc + +# Cursor +# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to +# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data +# refer to https://docs.cursor.com/context/ignore-files +.cursorignore +.cursorindexingignore + +# Marimo +marimo/_static/ +marimo/_lsp/ +__marimo__/ + +# Pipeline outputs and artifacts (do not push to repo) +outputs/ +training/ +data/categorized_incidents.csv +data/filtered_file.csv +data/finalized_dataset.csv diff --git a/recipes/noc-reasoning-agent/configs/config.ini b/recipes/noc-reasoning-agent/configs/config.ini new file mode 100644 index 0000000000..4fe3d9f3a7 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/config.ini @@ -0,0 +1,10 @@ +[download] +qwen2.5-14=/workspace/models/Qwen2.5-14B-Instruct +qwen2.5-32=/workspace/models/Qwen2.5-32B-Instruct +gpt-oss-120b=/workspace/models/gpt-oss-120b +nemotron-49b-1.5=/workspace/models/Llama-3_3-Nemotron-Super-49B-v1_5 + + +[data_path] +original_data_path=data/anonymized-Incidents_Last_6_Months.csv +incident_json_data=outputs/input_incident.jsonl diff --git a/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml 
b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml new file mode 100644 index 0000000000..45ec295844 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft.yaml @@ -0,0 +1,150 @@ +processors_to_run: all + +output_path: ??? +# prompt_config: null +preprocessed_dataset_files: null +input_files: null + + +# --- Data Keys --- +# These keys MUST match the output of your preprocessing script. +input_key: "background" +output_key: "response" + +# --- SFT Formatting --- +# Define the tokenizer and the final chat format for the model. +tokenizer: "Qwen/Qwen3-32B" # EDIT THIS or override via CLI + +# This uses a separate YAML file to define the chat template. +# This makes the configuration cleaner and more reusable. +prompt_config: "/workspace/data/prompts/prompt_incident.yaml" # EDIT THIS or override via CLI + +# ----------------- +# --- General Settings --- +# ----------------- +do_shuffle: false +deduplicate: true +exclude_optional_keys: true +random_seed: 42 +num_output_samples: null +add_correct: true +add_incorrect: false +add_unlabeled: true # Keep samples that have no correctness label + +# ----------------- +# --- Quality Control Filters --- +# ----------------- +# Most filters are disabled because the data is text-based reasoning, not math or code. +# This prevents the pipeline from incorrectly discarding valid data. 
+ + +contamination_file: null + +filters: + drop_multi_boxed: false + remove_contaminated: false # can be enabled if you have a contamination file + majority_filter: false + trim_solutions: false # Incident data doesn't use the \boxed{} syntax + trim_prefix: false + drop_incorrect_arithmetic: false + split_arithmetic: false + remove_len_outlier_problems: false # Keep all data regardless of length + remove_len_outlier_solutions: false + code_text_filter: null + remove_code_errors: false + remove_verification_code: false + remove_matplotlib: false + remove_no_code: false + remove_no_think_tags: false # Disabled; set to true to drop samples without think tags + +# ================================================================================= # +# Processor Pipeline (Usually does not need to be changed) # +# ================================================================================= # +processors: + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.ReadData + input_files: ${input_files} # This line ensures the processor gets the input file path + input_key: ${input_key} + output_key: ${output_key} + add_unlabeled: ${add_unlabeled} + deduplicate: ${deduplicate} + keys_to_keep: + - "expected_answer" + - "incident_identifier" + - "incident_classification" + - "urgency_level" + - "geographical_territory" + - "incident_subtype" + - "service_domain" + - "equipment_provider" + - "operational_zone" + - "affected_site" + - "incident_summary" + - "detection_timestamp" + - "escalation_date" + - "responsible_team" + - "fault_category" + - "action_chronicle" + - "resolution_summary" + - "resolution_method" + - "problem_code_reasoning_process" + + - _target_: nemo_skills.training.data_preparation_utils.merge_processor.MergeProcessor + _recursive_: false + processor_configs: + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveContaminated + should_run: ${filters.remove_contaminated} + contamination_file: ${contamination_file} + + - _target_: 
nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch #removing errors + should_run: ${filters.remove_code_errors} + text_key: ${output_key} + regex_patterns: + - 'Traceback (most recent call last)' + - '' + - 'Timed out' + - 'SyntaxError' + test_cases: + - { input: { generation: "My solution:\n---Traceback (most recent call last)---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nSome long output\nSomething else" }, output: null } + - { input: { generation: "My solution:\nTimed out\nSomething else" }, output: null } + - { input: { generation: "My solution:\n[0;31mSyntaxError\u001b\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch # filtering out tool calling + should_run: ${filters.remove_no_code} + text_key: ${output_key} + regex_patterns: + - '' + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: null } + - { input: { generation: "My solution:\ncode\nSomething else" }, output: null } + - { input: { generation: "code" }, output: { generation: "code" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch # filtering out if no think tags + should_run: ${filters.remove_no_think_tags} + text_key: ${output_key} + regex_patterns: + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: { generation: "My solution:\n------\nSomething else" } } + - { input: { generation: "My solution:\n\nSomething else" }, output: null } + - { input: { generation: "thinkingsummary" }, output: { generation: "thinkingsummary" } } + + + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.GroupSamples + group_key: ${input_key} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.ShuffleAndDownsampleData + 
num_samples: ${num_output_samples} + random_seed: ${random_seed} + do_shuffle: ${do_shuffle} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.WriteFinalSftManifest + output_manifest_file: ${output_path} + prompt_config: ${prompt_config} + tokenizer: ${tokenizer} + input_key: ${input_key} + output_key: ${output_key} + exclude_optional_keys: ${exclude_optional_keys} diff --git a/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml new file mode 100644 index 0000000000..80d3ed8814 --- /dev/null +++ b/recipes/noc-reasoning-agent/configs/noc_reasoning_sft_6.yaml @@ -0,0 +1,261 @@ +processors_to_run: all + +output_path: ??? +prompt_config: null +code_tags: null + +preprocessed_dataset_files: null # can specify multiple patterns separated by space +input_files: null # can specify datasets from HF instead of prediction_jsonl_files + +exclude_optional_keys: true # exclude keys other than input, output and expected_answer from the final manifest +metadata: null # can provide additional metadata to store (e.g. 
dataset or generation_type) +skip_first: 0 # useful for skipping validation set from train_full generation (it's always first) +add_correct: true # can set to False if only want to export incorrect solutions +add_incorrect: false # if True, saves only incorrect solutions instead of correct +add_unlabeled: false # if True, saves unlabeled solutions (no symbolic_correct field) +use_judgement: false # if True, uses judgement instead of symbolic_correct field +# if True, adds a new field with max allowed code executions to each sample +# exclude_optional_keys must be set to false in this case +add_code_execution_counts: false + +downsampling_method: null # fair, random or null +random_seed: 42 +do_shuffle: true +num_output_samples: null + +# Needed to format the data and also for length-based filtering +tokenizer: null +chat_template_kwargs: null +system_message: null # can be used to override default from the tokenizer +assistant_end: null # can be used to not use jinja and just add this suffix + +# Problem params +min_problem_length: null +max_problem_length: null + +# Solution params +min_solution_length: null +max_solution_length: null +use_chars_for_min_length: false + + +input_key: problem +output_key: generation + +# For synthesized questions +contamination_file: null + +majority_filter: + # minimum number of majority votes to use the answer + min_majority_votes: 0 + # minimum portion of majority votes to use the answer + min_majority_percentage: 0.0 + +filters: + drop_multi_boxed: false + remove_contaminated: true + majority_filter: false + trim_solutions: true + trim_prefix: false + drop_incorrect_arithmetic: false + split_arithmetic: false + remove_len_outlier_problems: false + remove_len_outlier_solutions: false + code_text_filter: null + remove_code_errors: false + remove_verification_code: false + remove_matplotlib: false + remove_no_code: false + remove_no_think_tags: false + +deduplicate: true + +processors: + - _target_: 
nemo_skills.training.data_preparation_utils.preprocessing.ReadData + preprocessed_dataset_files: ${preprocessed_dataset_files} + input_files: ${input_files} + input_key: ${input_key} + output_key: ${output_key} + skip_first: ${skip_first} + add_correct: ${add_correct} + add_incorrect: ${add_incorrect} + add_unlabeled: ${add_unlabeled} + use_judgement: ${use_judgement} + deduplicate: ${deduplicate} + keys_to_keep: # input/output/symbolic_correct/judgement are automatically added + - "total_votes" + - "majority_votes" + - "expected_answer" + - "predicted_answer" + - "answer_rm_score" + - "solutions" # GenSelect keys + - "num_solutions" + - "max_idx" + + # this will optimize processors inside to avoid serializing data to disk + - _target_: nemo_skills.training.data_preparation_utils.merge_processor.MergeProcessor + _recursive_: False + processor_configs: + - _target_: nemo_skills.training.data_preparation_utils.filters.DropMultiBoxed + should_run: ${filters.drop_multi_boxed} + solution_key: ${output_key} + test_cases: + - { input: { generation: "He had \\boxed{3} 4090s and \\boxed{2} 4080s" }, output: null } + - { input: { generation: "She had \\boxed{6} 4090s" }, output: { generation: "She had \\boxed{6} 4090s" } } + - { input: { generation: "boxed boxed 42" }, output: { generation: "boxed boxed 42" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveContaminated + should_run: ${filters.remove_contaminated} + contamination_file: ${contamination_file} + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_code_errors} + text_key: ${output_key} + regex_patterns: + - 'Traceback (most recent call last)' + - '' + - 'Timed out' + - 'SyntaxError' + test_cases: + - { input: { generation: "My solution:\n---Traceback (most recent call last)---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nSome long output\nSomething else" }, output: null } + - { input: { 
generation: "My solution:\nTimed out\nSomething else" }, output: null } + - { input: { generation: "My solution:\n[0;31mSyntaxError\u001b\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_matplotlib} + text_key: ${output_key} + regex_patterns: + - 'from matplotlib' + - 'import matplotlib' + test_cases: + - { input: { generation: "My solution:\n---import matplotlib---\nSomething else" }, output: null } + - { input: { generation: "My solution:\nfrom matplotlib import pyplot as plt\nSomething else" }, output: null } + - { input: { generation: "My solution, no errors" }, output: { generation: "My solution, no errors" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch + should_run: ${filters.remove_no_code} + text_key: ${output_key} + regex_patterns: + - '' + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: null } + - { input: { generation: "My solution:\ncode\nSomething else" }, output: null } + - { input: { generation: "code" }, output: { generation: "code" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexNotMatch + should_run: ${filters.remove_no_think_tags} + text_key: ${output_key} + regex_patterns: + - '' + test_cases: + - { input: { generation: "My solution:\n------\nSomething else" }, output: { generation: "My solution:\n------\nSomething else" } } + - { input: { generation: "My solution:\n\nSomething else" }, output: null } + - { input: { generation: "thinkingsummary" }, output: { generation: "thinkingsummary" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIfRegexMatch + should_run: ${filters.remove_verification_code} + text_key: ${output_key} + regex_patterns: + - "Let's verify this with Python" + - "Let's 
verify this using Python" + - "Let's confirm this with Python" + - "Let's use Python to plot" + - "Let's verify this solution using Python" + - "Let's use Python to verify" + - "Let's confirm this result with some Python" + test_cases: + - { input: { generation: "Here is the solution. Let's verify this with Python. It should work." }, output: null } + - { input: { generation: "Here is the solution. Let's verify this using Python. It should work." }, output: null } + - { input: { generation: "First, we solve the problem. Let's confirm this with Python. Finally, we get the result." }, output: null } + - { input: { generation: "First, we solve the problem. Let's verify this solution using Python. Finally, we get the result." }, output: null } + - { input: { generation: "First, we solve the problem. Let's use Python to verify the solution. Finally, we get the result." }, output: null } + - { input: { generation: "The answer is calculated. Let's use Python to plot the results." }, output: null } + - { input: { generation: "This is the correct solution without any verification code." }, output: { generation: "This is the correct solution without any verification code." } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.MajorityFilter + should_run: ${filters.majority_filter} + min_majority_votes: ${majority_filter.min_majority_votes} + min_majority_percentage: ${majority_filter.min_majority_percentage} + + - _target_: nemo_skills.training.data_preparation_utils.filters.TrimPrefix + should_run: ${filters.trim_prefix} + solution_key: ${output_key} + test_cases: + - { input: { generation: "My solution:\nSubtract 3 from 5" }, output: { generation: "Subtract 3 from 5" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.TrimSolutions + should_run: ${filters.trim_solutions} + solution_key: ${output_key} + test_cases: + - { input: { generation: "Solution ends at \\boxed{0}\nThis line is useless." 
}, output: { generation: "Solution ends at \\boxed{0}" } } + - { input: { generation: "Solution continues \\boxed{7} after boxed" }, output: { generation: "Solution continues \\boxed{7} after boxed" } } + - { input: { generation: "Solution\n\ncontinues \n\n\\boxed{7} after\n boxed" }, output: { generation: "Solution\n\ncontinues \n\n\\boxed{7} after" } } + - { input: { generation: "Works alright with newlines inside \\boxed{7\n5} and properly\ncuts after" }, output: { generation: "Works alright with newlines inside \\boxed{7\n5} and properly" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.DropIncorrectArithmetic + should_run: ${filters.drop_incorrect_arithmetic} + solution_key: ${output_key} + test_cases: + - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } } + # TODO: implement fractional arithmetic + # - {input: {generation: "1 / 3 = 0.3"}, output: null} + # - {input: {generation: "1 / 3 = 0.3334"}, output: null} + - { input: { generation: "3 * (1 + 7) / 4 = 3 * 8 / 4 = 100" }, output: null } + + - _target_: nemo_skills.training.data_preparation_utils.filters.SplitArithmetic + should_run: ${filters.split_arithmetic} + solution_key: ${output_key} + test_cases: + - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 3 + 3 + 4 = 6 + 4 = 10" } } + - { input: { generation: "1 + 7 * 19 = 134" }, output: { generation: "1 + 7 * 19 = 1 + 133 = 134" } } + - { input: { generation: "3 * (1 + 7) / 4 = 6" }, output: { generation: "3 * (1 + 7) / 4 = 3 * 8 / 4 = 24 / 4 = 6" } } + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveLenOutliers + should_run: ${filters.remove_len_outlier_problems} + property_key: ${input_key} + min_length: ${min_problem_length} + max_length: ${max_problem_length} + tokenizer: ${tokenizer} + + - _target_: nemo_skills.training.data_preparation_utils.filters.RemoveLenOutliers + should_run: ${filters.remove_len_outlier_solutions} + 
property_key: ${output_key} + min_length: ${min_solution_length} + max_length: ${max_solution_length} + tokenizer: ${tokenizer} + use_chars_for_min_length: ${use_chars_for_min_length} + + - _target_: nemo_skills.training.data_preparation_utils.filters.AddCodeExecutionsCounts + should_run: ${add_code_execution_counts} + solution_key: ${output_key} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.GroupSamples + group_key: ${input_key} + + - _target_: nemo_skills.training.data_preparation_utils.filters.CodeTextFilter + should_run: ${filters.code_text_filter} + filter_type: ${filters.code_text_filter} + code_tags: ${code_tags} + solution_key: ${output_key} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.ShuffleAndDownsampleData + num_samples: ${num_output_samples} + sampling_method: ${downsampling_method} + random_seed: ${random_seed} + do_shuffle: ${do_shuffle} + + - _target_: nemo_skills.training.data_preparation_utils.preprocessing.WriteFinalSftManifest + output_manifest_file: ${output_path} + prompt_config: ${prompt_config} + tokenizer: ${tokenizer} + chat_template_kwargs: ${chat_template_kwargs} + system_message: ${system_message} + assistant_end: ${assistant_end} + code_tags: ${code_tags} + input_key: ${input_key} + output_key: ${output_key} + metadata: ${metadata} + exclude_optional_keys: ${exclude_optional_keys} diff --git a/recipes/noc-reasoning-agent/data/synthetic_incidents.csv b/recipes/noc-reasoning-agent/data/synthetic_incidents.csv new file mode 100644 index 0000000000..72043b6a40 --- /dev/null +++ b/recipes/noc-reasoning-agent/data/synthetic_incidents.csv @@ -0,0 +1,2910 @@ 
+incident_identifier,geographical_territory,incident_classification,incident_subtype,service_domain,equipment_provider,operational_zone,resolution_status,suspension_cause,urgency_level,affected_site,incident_summary,detection_timestamp,escalation_date,responsible_team,fault_category,resolution_summary,action_chronicle,reporter_identity,intervention_began,intervention_completed,root_cause_primary,resolution_method,root_cause_secondary,cause_additional,triggered_by_modification,resolver_identity +INCME-100001,The Shire,Relay Transport,Junction Point,POWER,Elven Forge Technologies,Caras Galadhon,Resolved,PWR-001: AC Power Failure,MAJOR,Watch-TheShire-West-32,"NOC Engineer's Note: + +Incident Summary: A power outage caused by a commercial power supply interruption or failure, affecting the AC power supply to a critical IT infrastructure. + +Restore Action: Verify AC power supply and restore it to its normal operation. + +Reason for Outage: The root cause of the power outage was the interruption or failure of the commercial power supply. + +Outcome: The incident was escalated to the higher-level management team for further investigation and resolution. + +NOC Terms: +- Commercial power supply interruption or failure +- Root cause +- Problem type (PWR-001) +- Verification of AC power supply +- Restore action +- Reason for outage +- Outcome + +NOC Engineer's Note: +- This resolution summary is a summary of the incident closure notes written by the NOC engineer. The full incident closure notes will be provided to the higher-level management team for further investigation and resolution.",2024-09-12 14:18:16,2024-09-12 14:43:16,Orthanc Technical Review,power_ac_failure_recovery,Commercial power supply interruption or failure,"2024-09-12 14:13:16 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-12 14:15:16 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. 
+2024-09-12 14:18:16 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-12 14:21:16 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-12 14:27:16 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-12 14:30:16 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-12 14:35:16 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Elfhelm of Rohan,2024-09-12 14:18:16,2024-09-12 14:43:16,Commercial power supply interruption or failure,Resolved,PWR-001: AC Power Failure,0.2,FALSE,Elfhelm of Rohan +INCME-100002,The Shire,Signal Network,Signal Unit,RAN,Dwarven Network Systems,Framsburg,Pending Resolution,RAN-001: Cell Service Interruption,CRITICAL,Hub-TheShire-South-06,"NOC Engineer's Resolution Summary: + +Workflow: Site Outage + +Actions taken: +- Verified Site Outage with the Field Technician +- Notified the Root Cause Team +- Final Status Verification + +Root cause: Cell completely unavailable for service +- The cell was completely unavailable due to a technical issue with the network equipment. +- The issue was identified and resolved by the Field Technician. +- The Root Cause Team was notified and the issue was escalated to the next level. + +Problem type: RAN-001 + +Resolution: +- The issue has been resolved and the cell is now operational. +- The Root Cause Team has been notified and the issue is being monitored. +- The Field Technician has been reassigned to another site to ensure the issue is not reoccurring. +- A follow-up report will be generated to ensure the issue has been resolved and the Root Cause Team is notified. 
+ +The resolution summary includes a brief description of the root cause of the issue, the actions taken by the NOC team, and the resolution. The resolution is clear and concise,",2024-09-07 02:25:18,2024-09-07 02:55:18,White Tower Operations,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-09-07 02:30:18 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-07 02:32:18 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-09-07 02:35:18 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-07 02:38:18 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-07 02:41:18 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-07 02:45:18 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-07 02:51:18 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-07 02:54:18 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-07 02:59:18 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Húrin of the White City,2024-09-07 02:25:18,2024-09-07 02:55:18,Cell completely unavailable for service,Pending Resolution,RAN-001: Cell Service Interruption,0.4,FALSE,Húrin of the White City +INCME-100003,Rohan,Path Signals,Weather Watch,TRANSPORT,Dwarven Network Systems,Helm's Deep,Resolved,TRN-008: Latency SLA Violation,MAJOR,Point-Rohan-Outer-46,"NOC engineer completed the incident closure notes for the TRN-008 issue. 
+ +Restore Action: Verify Routing Alarm and Protocol Status. + +Reason for Outage: End-to-end latency exceeding service level agreement. + +Root cause: The issue was caused by an end-to-end latency that exceeded the service level agreement. The root cause was identified and resolved by verifying the routing alarm and protocol status. + +Procedure followed: +1. Verified routing alarm by checking the log files and confirming that it was triggered. +2. Verified protocol status by checking the protocol logs. +3. Adjusted timer settings to ensure that the latency threshold was not exceeded. +4. Escalated the issue to network engineering for further investigation. + +Problem type: TRN-008 + +Resolution: Restore the routing and protocol settings to ensure that latency is not exceeding the service level agreement.",2024-09-30 07:21:48,2024-09-30 07:47:48,Lothlórien Link Guardians,transport_routing_flap_resolution,End-to-end latency exceeding service level agreement,"2024-09-30 07:12:48 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-30 07:14:48 - System (Workflow Engine) +Tool: run_diagnostics() → Interface counters/latency/jitter tests executed. +2024-09-30 07:17:48 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-30 07:20:48 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-30 07:23:48 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-30 07:29:48 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-30 07:32:48 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-09-30 07:37:48 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. +2024-09-30 07:41:48 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Bergil son of Denethor,2024-09-30 07:21:48,2024-09-30 07:47:48,End-to-end latency exceeding service level agreement,Resolved,TRN-008: Latency SLA Violation,0.2,FALSE,Bergil son of Denethor +INCME-100004,Lothlórien,Relay Transport,Junction Point,COMPUTE,Dwarven Network Systems,Aldburg,Pending Resolution,CMP-010: Site Not Scrolling,CRITICAL,Hub-Lothlorien-Primary-01,"NOC Engineer's Resolution Summary: + +Restore Action: Verify Image Pull Alarm, Escalate Registry Issue, Verify Image Pull +Reason for Outage: Site deployment or scaling operation stalled + +The root cause of the outage was identified as a scaling operation stalled due to site deployment. The incident was escalated to the Registry team, who resolved the issue. The Image Pull Alarm was verified to ensure that the issue was resolved. The Registry Issue was escalated to the Image Pull team, who resolved the issue. The Image Pull Alarm was verified to ensure that the issue was resolved. The root cause was documented in the incident closure notes. + +The summary of the resolution summary includes the actions taken, the root cause, the problem type, and the restoration action taken. The resolution summary is concise and complete, using NOC terminology. The engineer can use this summary to document the incident closure notes and ensure that all necessary steps were taken to resolve the issue.",2024-06-06 18:46:37,2024-06-06 19:13:37,Osgiliath Bridge Operations,compute_image_pull_recovery,Site deployment or scaling operation stalled,"2024-06-06 18:50:37 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-06-06 18:52:37 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-06-06 18:55:37 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-06-06 18:58:37 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-06 19:01:37 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-06 19:05:37 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-06-06 19:10:37 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Herubrand of the Mark,2024-06-06 18:46:37,2024-06-06 19:13:37,Site deployment or scaling operation stalled,Pending Resolution,CMP-010: Site Not Scrolling,0.2,FALSE,Herubrand of the Mark +INCME-100005,Rivendell,Relay Transport,Central Nexus,TRANSPORT,Elven Forge Technologies,Hollin Gate,Resolved,BGP issue,MAJOR,Node-Rivendell-East-20,"Incident Closure Note: + +In response to the issue of failed Microwave Alarm, Verify Microwave Alarm, Check Weather Conditions, Check Link Status, Schedule Site Visit, Monitor Link Status, root cause was BGP session instability with upstream provider. + +The root cause of the issue was identified as BGP session instability with upstream provider. The issue was fixed by verifying the Microwave Alarm, checking weather conditions, checking link status, scheduling a site visit, monitoring link status, and resolving the issue. + +Reason for Outage: The root cause of the issue was identified as BGP session instability with upstream provider. + +Restore Action: Verify Microwave Alarm, Check Weather Conditions, Check Link Status, Schedule Site Visit, Monitor Link Status. 
+ +Root cause: BGP session instability with upstream provider. +Problem type: TRN-002. + +Output: + +Incident Closure Note: Incident Closure Note: + +In response to the issue of failed Microwave Alarm, Verify Microwave Alarm,",2024-10-17 15:24:05,2024-10-17 15:43:05,Riders of the Mark,transport_microwave_degradation_response,BGP session instability with upstream provider,"2024-10-17 15:01:05 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-17 15:03:05 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-10-17 15:06:05 - System (Workflow Engine) +Tool: run_diagnostics() → Interface counters/latency/jitter tests executed. +2024-10-17 15:09:05 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-17 15:12:05 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-17 15:16:05 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-17 15:22:05 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-17 15:25:05 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Pippin Bracegirdle,2024-10-17 15:24:05,2024-10-17 15:43:05,BGP session instability with upstream provider,Resolved,BGP issue,0.4,FALSE,Pippin Bracegirdle +INCME-100006,Gondor,Signal Core,Beacon Array,COMPUTE,Dwarven Network Systems,Ethring,Resolved,CMP-005: Pod Terminating Stuck,CRITICAL,Node-Gondor-Primary-68,"NOC engineer writing incident closure notes + +Issue: An issue has been reported where a pod in the Kubernetes cluster has been stuck in the terminating state beyond the grace period, causing downtime for the cluster. 
+ +Action taken: + +1. Verification: Verification of the Orchestrator Alarm and Control Plane to confirm the issue. + +2. Check: Verification of the Pod stuck in the terminating state. + +3. Failover: Verification of the Backup Control Plane to ensure that the cluster can continue to function. + +4. Escalation: Escalation to the Platform Team to investigate and resolve the issue. + +Root cause: The Pod was stuck in the terminating state due to a configuration error in the orchestrator. + +Problem type: CMP-005 + +Resolution: + +1. Verification: The Orchestrator Alarm and Control Plane were verified to be operational. + +2. Check: The Pod was verified to be in the terminating state due to a configuration error in the orchestrator. + +3. Failover: The Backup Control Plane was verified to be oper",2024-09-25 11:35:52,2024-09-25 12:00:52,Wizards Council Escalation,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-09-25 11:26:52 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-25 11:28:52 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-09-25 11:31:52 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-09-25 11:34:52 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-25 11:37:52 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-25 11:43:52 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-09-25 11:46:52 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar the White,2024-09-25 11:35:52,2024-09-25 12:00:52,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Alatar the White +INCME-100007,Eriador,Relay Transport,Path Router,RAN,Elven Forge Technologies,Bucklebury,Resolved,RAN-014: CSR Unreachable,CRITICAL,Watch-Eriador-Outer-18,"The incident was caused by a cell site router management path unavailable, which resulted in a failure to perform upgrades. The root cause was identified and documented, and a resolution was implemented to restore the affected system. The incident was closed, and a resolution summary was created to detail the actions taken to address the issue. The summary includes the following information: + +1. Restore Action: Verify and restore the cell site router management path. + +2. Reason for Outage: The root cause was identified and documented. + +3. Notes: The root cause was identified as a failure to perform upgrades due to a cell site router management path unavailable. + +4. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +5. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +6. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +7. Root",2024-07-17 15:27:29,2024-07-17 16:17:29,Riders of the Mark,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-07-17 15:26:29 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-17 15:28:29 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-17 15:31:29 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-17 15:34:29 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-17 15:40:29 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-17 15:43:29 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-17 15:48:29 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Dwalin son of Thorin,2024-07-17 15:27:29,2024-07-17 16:17:29,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Dwalin son of Thorin +INCME-100008,Gondor,Relay Transport,Signal Unit,RAN,Dwarven Network Systems,Linhir,Resolved,RAN-014: CSR Unreachable,MAJOR,Beacon-Gondor-Lower-30,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. 
The summary does not include additional details or information that may be relevant to the incident.",2024-08-19 02:41:11,2024-08-19 03:02:11,Rohan Rapid Response,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-08-19 02:39:11 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-19 02:41:11 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-19 02:44:11 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-08-19 02:47:11 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-19 02:50:11 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-19 02:54:11 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-19 03:00:11 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-19 03:03:11 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Halbarad the Heir,2024-08-19 02:41:11,2024-08-19 03:02:11,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Halbarad the Heir +INCME-100009,Rhovanion,Beacon Power,Path Router,COMPUTE,Dwarven Network Systems,Framsburg,Resolved,CMP-002: Pod Container Creating,MAJOR,Node-Rhovanion-Central-60,"NOC Engineer's Notes: + +1. Restore Action: Containers were successfully restarted, and the pods were brought back online. + +2. Reason for Outage: A pod was stuck in ContainerCreating state due to a configuration issue. + +Conclusion: The root cause of the outage was a configuration issue, and the root cause was verified. 
The incident was escalated to DevOps for further investigation. The root cause has been addressed and the issue has been resolved. The incident has been closed.",2024-08-11 10:43:48,2024-08-11 11:00:48,Shire Monitoring Guild,compute_container_crash_recovery,Pod stuck in ContainerCreating state,"2024-08-11 10:17:48 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-11 10:19:48 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-08-11 10:22:48 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-08-11 10:25:48 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-11 10:28:48 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-11 10:34:48 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-11 10:37:48 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Elendil Strider,2024-08-11 10:43:48,2024-08-11 11:00:48,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.4,FALSE,Elendil Strider +INCME-100010,Rhovanion,Signal Core,Path Router,RAN,Dwarven Network Systems,Erebor,Resolved,RAN-014: CSR Unreachable,MAJOR,Gateway-Rhovanion-North-09,"Restore Action: Verified that the cell site router management path was restored, and the RAN-014 issue has been resolved. +Reason for Outage: The root cause of the issue was a failure in the upgrade process, which resulted in a disruption in cell site router management. +Notes: Verified that the upgrade process was completed successfully, and the system status is now stable. 
+System Status: The system status is now stable, with no issues observed. + +Actions taken: +- Verified that the upgrade process was completed successfully +- Reviewed the upgrade logs to identify any issues +- Initiated a rollback to restore the previous system state +- Attempted an emergency boot to restore the system to a working state +- Verified that the system was able to recover from the upgrade failure +- Documented and reported the issue and its resolution + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Resolution: Verified that the upgrade process was completed successfully, and the system status is now stable. The root cause of the issue was a failure in the upgrade process, which resulted in a disruption in cell site router management. The",2024-11-18 13:48:58,2024-11-18 14:16:58,Pelargir Port Authority,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-11-18 13:23:58 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-18 13:25:58 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-11-18 13:28:58 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-18 13:31:58 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-18 13:37:58 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-11-18 13:40:58 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Dáin Oakenshield,2024-11-18 13:48:58,2024-11-18 14:16:58,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Dáin Oakenshield +INCME-100011,Mordor Surveillance Zone,Signal Core,Central Nexus,TRANSPORT,Dwarven Network Systems,Grimslade,Resolved,TRN-007: Packet Loss Threshold Exceeded,MINOR,Station-MordorSurveillanceZone-South-26,"I'm unable to access the specific incident details, but I can provide you with a sample resolution summary for a failed incident outcome. + +in this case, the incident outcome was a failed transport path alarm, which was caused by unacceptable packet loss. To resolve the issue, the following actions were taken: + +1. Verify interface alarm: the first step was to verify the interface alarm, which was triggered due to the unacceptable packet loss. The alarm was confirmed to be valid and caused by the transport path experiencing unacceptable packet loss. + +2. Schedule physical repair: the next step was to schedule the physical repair of the affected interface. This involved identifying the affected interface, determining the physical location of the faulty component, and arranging for the repair. + +3. Verify interface status: after the physical repair was completed, the interface status was verified to ensure that the interface was restored to its original condition. + +the root cause of the failure was the unacceptable packet loss, which was caused by the transport path experiencing unacceptable packet loss. The root cause was investigated and a resolution was implemented to address the issue. 
+ +the resolution summary includes the actions taken, the",2024-06-21 18:03:35,2024-06-21 18:22:35,Wizards Council Escalation,transport_interface_errors_resolution,Transport path experiencing unacceptable packet loss,"2024-06-21 17:57:35 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-21 17:59:35 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-21 18:02:35 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-21 18:05:35 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-21 18:09:35 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-21 18:15:35 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-21 18:18:35 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Pelendur Húrinion,2024-06-21 18:03:35,2024-06-21 18:22:35,Transport path experiencing unacceptable packet loss,Resolved,TRN-007: Packet Loss Threshold Exceeded,0.4,FALSE,Pelendur Húrinion +INCME-100012,Dale Province,Arcane Engines,Path Router,SIGNALING,Elven Forge Technologies,Henneth Annûn,Pending Resolution,SIG-010: CU Communication Failure,MAJOR,Station-DaleProvince-North-64,"NOC Engineer: Completed Routing Alarm, Verified Routing Status, Verified Routing Alarm, Escalated Routing Issue, Verified Routing Status, Root Cause: Central Unit internal communication breakdown, Problem Type: SIG-010. + +The NOC engineer completed the following actions to resolve the routing alarm: + +1. Verified the routing alarm, which indicated that the central unit was experiencing communication issues. +2. 
Verified that the routing status was not affected, as the alarm was triggered by a communication failure between the central unit and the NOC. +3. Verified that the routing issue was caused by a communication breakdown between the central unit and the NOC. +4. Escalated the routing issue to the NOC management team for further investigation and resolution. + +The root cause of the routing issue was identified as a communication breakdown between the central unit and the NOC. This led to the escalation of the issue to the NOC management team for further investigation and resolution.",2024-09-22 17:51:17,2024-09-22 18:10:17,Bywater Observation Post,signaling_routing_failure_recovery,Central Unit internal communication breakdown,"2024-09-22 17:38:17 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-22 17:40:17 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-22 17:43:17 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-22 17:46:17 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-22 17:52:17 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-22 17:55:17 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-22 18:00:17 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. 
+2024-09-22 18:04:17 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Arador of the Dúnedain,2024-09-22 17:51:17,2024-09-22 18:10:17,Central Unit internal communication breakdown,Pending Resolution,SIG-010: CU Communication Failure,0.4,FALSE,Arador of the Dúnedain +INCME-100013,Gondor,Path Signals,Weather Watch,RAN,Dwarven Network Systems,Cair Andros,Resolved,SVC-002: Data Throughput Degradation,MAJOR,Station-Gondor-West-13,"In response to the complaint received regarding a slow-performing SVC, the NOC engineer has reviewed the complaint details, checked cell performance, and requested a detailed analysis. As a result, the following action was taken: + +- Review Complaint Details: The engineer reviewed the complaint details and identified that the user throughput significantly below expected rates. +- Check Cell Performance: The engineer checked cell performance to ensure that the SVC is operating correctly and is not experiencing any performance issues. +- Request Detailed Analysis: The engineer requested a detailed analysis to determine the root cause of the slow-performing SVC. + +The root cause of the slow-performing SVC was identified as user throughput significantly below expected rates. The engineer has documented the resolution and will continue to monitor the SVC to ensure that it is operating optimally.",2024-09-20 03:18:09,2024-09-20 03:35:09,Arnor Response Team,ran_speed_complaint_resolution,User throughput significantly below expected rates,"2024-09-20 03:09:09 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-20 03:11:09 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-09-20 03:14:09 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-09-20 03:17:09 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-20 03:23:09 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-20 03:26:09 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Paladin Took,2024-09-20 03:18:09,2024-09-20 03:35:09,User throughput significantly below expected rates,Resolved,SVC-002: Data Throughput Degradation,0.2,FALSE,Paladin Took +INCME-100014,Rivendell,Signal Core,Central Nexus,COMPUTE,Dwarven Network Systems,Grey Havens,Resolved,CMP-003: Pod Pending or Evicted,CRITICAL,Array-Rivendell-Primary-08,"In summary, the root cause of the outage was a Pod that was unable to be scheduled or evicted from a node due to a Node Alarm. The incident was escalated to the Platform Team for further investigation, and a root cause analysis was performed to determine the root cause. The outage was resolved by verifying that the Pod could be scheduled and evicted from the node. The root cause was also identified and documented. The resolution was a complete restart of the affected node, which restored normal operations. The resolution summary includes the following: + +1. Restore Action: Restart the affected node +2. Reason For Outage: The root cause of the outage was a Pod that was unable to be scheduled or evicted from a node due to a Node Alarm. + +The resolution summary is concise and complete, keeping NOC terminology and workflow in mind. The root cause and resolution are highlighted for easy reference.",2024-10-13 02:35:43,2024-10-13 03:10:43,Erebor Relay Division,compute_k8s_node_recovery,Pod cannot be scheduled or was evicted from node,"2024-10-13 02:42:43 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-10-13 02:44:43 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-10-13 02:47:43 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-10-13 02:50:43 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-13 02:53:43 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-13 02:59:43 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-13 03:02:43 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Nori of the Lonely Mountain,2024-10-13 02:35:43,2024-10-13 03:10:43,Pod cannot be scheduled or was evicted from node,Resolved,CMP-003: Pod Pending or Evicted,0.4,FALSE,Nori of the Lonely Mountain +INCME-100015,Rohan,Signal Core,Beacon Array,POWER,Elven Forge Technologies,Edoras,Resolved,PWR-003: Battery Discharge Alert,MINOR,Hub-Rohan-East-49,"Incident Closure: + +Escalation: Battery Alert, Battery Depletion, No AC Power, No Generator Available, No Site Status, No Load Shedding Implemented, Emergency Dispatch, No Site Status + +Restore Action: +1. Acknowledge Battery Alert - Notify stakeholders of the battery backup depletion and the need to implement load shedding. +2. Check Battery Status - Confirm that the battery is still functioning correctly and that the battery backup is not depleted. +3. Verify AC Power Status - Check if the site is connected to the AC grid and if there is enough power to operate the generator. +4. Start Generator - If the site is not connected to the AC grid, the generator will be used to provide power to the site. +5. 
Implement Load Shedding - If the generator is not sufficient to meet the site's power demand, load shedding will be implemented to reduce the load on the grid. +6. Emergency Dispatch - If the generator is unable to provide power, an emergency dispatch will be initiated to ensure that the site",2024-07-04 02:07:53,2024-07-04 02:44:53,Rivendell Array Management,power_battery_discharge_response,Battery backup depleting without AC restoration,"2024-07-04 02:00:53 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-04 02:02:53 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-07-04 02:05:53 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-07-04 02:08:53 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-04 02:11:53 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-04 02:17:53 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-04 02:20:53 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Celeborn of the Grey Havens,2024-07-04 02:07:53,2024-07-04 02:44:53,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.6,FALSE,Celeborn of the Grey Havens +INCME-100016,Gondor,Relay Transport,Power Source,RAN,Dwarven Network Systems,Calembel,Resolved,RAN-001: Cell Service Interruption,MAJOR,Station-Gondor-Primary-48,"NOC Engineer's Workflow Outcome: Failed + +Actions taken: +- Verified Site Outage +- Dispatched Field Technician +- Final Status Verification + +Root cause: Cell completely unavailable for service. 
+Problem type: RAN-001 + +Resolution: +- Identified the cause of the outage - cell unavailable +- Verified the issue was resolved by deploying a new cell. +- Notified stakeholders of the issue and its resolution. +- Recommended improvements to the network infrastructure to prevent similar issues from occurring in the future.",2024-09-22 20:04:49,2024-09-22 20:23:49,Pelargir Port Authority,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-09-22 19:53:49 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-22 19:55:49 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-22 19:58:49 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-22 20:01:49 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-22 20:05:49 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-22 20:10:49 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Orophin the Fair,2024-09-22 20:04:49,2024-09-22 20:23:49,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Orophin the Fair +INCME-100017,Rivendell,Path Signals,Relay Unit,POWER,Elven Forge Technologies,Hollin Gate,Resolved,PWR-003: Battery Discharge Alert,MAJOR,Array-Rivendell-West-04,"Incident Closure Summary: + +1. Restore Action: Restored the battery backup to full capacity. + +2. Reason For Outage: The battery backup depleted without AC restoration, which resulted in a power outage. + +3. Root Cause: The battery backup depleted without AC restoration due to a power outage. + +4. 
Problem Type: The PWR-003 problem type refers to a power outage that affects the battery backup. + +5. Summary: The incident was resolved by restoring the battery backup to full capacity, and the power outage was caused by a power outage.",2024-09-28 20:07:53,2024-09-28 20:20:53,Istari Advisory Board,power_battery_discharge_response,Battery backup depleting without AC restoration,"2024-09-28 20:01:53 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-28 20:03:53 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-28 20:06:53 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-28 20:09:53 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-28 20:15:53 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-28 20:18:53 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Pallando the Grey,2024-09-28 20:07:53,2024-09-28 20:20:53,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.4,FALSE,Pallando the Grey +INCME-100018,Gondor,Signal Core,Path Router,RAN,Dwarven Network Systems,Pelargir,Resolved,RAN-002: Cell Administratively Disabled,MINOR,Spire-Gondor-West-54,"Closure: The root cause of the issue was identified as the Cell locked or disabled by management action. The issue was escalated to the Security Operations Center (SOC) for further investigation. The issue was verified and resolved by verifying that the Sector Alarm was triggered, scheduling a field repair for the affected field, and verifying coverage impact. The root cause was identified as a management action that locked or disabled the Cell, which caused the issue. 
The resolution summary is complete and concise, using NOC terminology.",2024-08-13 11:14:18,2024-08-13 11:41:18,Iron Hills Transport,ran_sector_outage_recovery,Cell locked or disabled by management action,"2024-08-13 11:01:18 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-13 11:03:18 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-13 11:06:18 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-08-13 11:09:18 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-13 11:12:18 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-13 11:16:18 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-13 11:22:18 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-13 11:25:18 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Thráin Oakenshield,2024-08-13 11:14:18,2024-08-13 11:41:18,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.2,FALSE,Thráin Oakenshield +INCME-100019,Rohan,Path Signals,Signal Unit,POWER,Elven Forge Technologies,Hornburg,Resolved,ENV-002: HVAC System Fault,WARNING,Gateway-Rohan-West-61,"I'm not able to perform actual NOC work or read your company's internal processes. 
However, based on the information you provided, here's a sample resolution summary for the incident closure notes: + +restore action: +- verification of climate control system malfunction +- dispatch of hvac technician to resolve issue +- verification of environmental status +- en-002 problem type + +reason for outage: +- climate control system malfunction + +root cause: +- malfunction of climate control system +- result of incorrect maintenance or failure of system components + +solution: +- verification of malfunction and resolution +- training for maintenance team +- maintenance and repair of system components + +closing note: +- restoration of service to normal operating conditions +- resolution of issue + +note: this summary is intended to be a concise and comprehensive summary of the incident closure notes. Please provide more detailed information if necessary.",2024-09-11 01:16:09,2024-09-11 01:48:09,Istari Advisory Board,env_hvac_fault_recovery,Climate control system malfunction,"2024-09-11 01:22:09 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-11 01:24:09 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-11 01:27:09 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-11 01:30:09 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-11 01:34:09 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-11 01:40:09 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-09-11 01:43:09 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Nori son of Glóin,2024-09-11 01:16:09,2024-09-11 01:48:09,Climate control system malfunction,Resolved,ENV-002: HVAC System Fault,0.2,FALSE,Nori son of Glóin +INCME-100020,Arnor,Path Signals,Signal Unit,POWER,Dwarven Network Systems,Bywater,Resolved,ENV-006: Battery High Temperature,MINOR,Node-Arnor-North-67,"After reviewing the incident logs and verifying that the battery temperature alarm was triggered, the NOC engineer determined that the battery string was the root cause of the issue. The engineer isolated the battery string and conducted a thorough inspection to determine if the battery was damaged or malfunctioning. The engineer found no signs of damage or malfunction, and the battery was replaced. + +As a result of this investigation, the engineer closed the incident by restoring the affected system to a working state. The root cause was identified, and the issue was resolved. The engineer ensured that this incident would not occur again and took necessary steps to ensure that this issue did not occur again. + +The engineer wrote a concise and professional resolution summary that included the root cause, the actions taken, and the root cause analysis. The summary was completed within the NOC's workflow and was used to document the incident and ensure that the incident was properly resolved.",2024-09-22 05:42:09,2024-09-22 06:05:09,Orthanc Technical Review,env_battery_temperature_response,Battery thermal runaway risk detected,"2024-09-22 05:32:09 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-22 05:34:09 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-22 05:37:09 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. 
+2024-09-22 05:40:09 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-22 05:43:09 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-22 05:49:09 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-22 05:52:09 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Argonui the Heir,2024-09-22 05:42:09,2024-09-22 06:05:09,Battery thermal runaway risk detected,Resolved,ENV-006: Battery High Temperature,0.4,FALSE,Argonui the Heir +INCME-100021,Lothlórien,Path Signals,Keeper Stone,RAN,Elven Forge Technologies,Bree,Resolved,RAN-001: Cell Service Interruption,MAJOR,Outpost-Lothlorien-North-41,"Reopened from INCME-100016. + +NOC Engineer's Workflow Outcome: Failed + +Actions taken: +- Verified Site Outage +- Dispatched Field Technician +- Final Status Verification + +Root cause: Cell completely unavailable for service. +Problem type: RAN-001 + +Resolution: +- Identified the cause of the outage - cell unavailable +- Verified the issue was resolved by deploying a new cell. +- Notified stakeholders of the issue and its resolution. +- Recommended improvements to the network infrastructure to prevent similar issues from occurring in the future.",2024-09-23 05:23:49,2024-09-23 05:54:49,Hobbiton Watch Service,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-09-23 05:22:49 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-23 05:24:49 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-23 05:27:49 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. 
+2024-09-23 05:30:49 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-23 05:33:49 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-23 05:39:49 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-23 05:42:49 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Háma of Helm's Deep,2024-09-23 05:23:49,2024-09-23 05:54:49,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.4,TRUE,Háma of Helm's Deep +INCME-100022,Rohan,Beacon Power,Junction Point,RAN,Dwarven Network Systems,Edoras,Resolved,RAN-018: Link Flapping Detected,CRITICAL,Beacon-Rohan-East-10,"I am not able to perform the actual work or access the system to generate a resolution summary. However, based on the provided text, the resolution summary for the workflow outcome of completing the outage is as follows: + +1. Restore Action: Verify PRB Status +2. Reason For Outage: Interface repeatedly transitioning between up and down + +The root cause of the outage was identified as an interface repeatedly transitioning between up and down. The root cause was resolved by verifying the PRB status and applying PRB optimization. + +The resolution summary includes the actions taken to resolve the issue, which is verifying the PRB status and applying PRB optimization. The resolution summary is concise and complete, using proper NOC terminology.",2024-07-24 20:44:01,2024-07-24 21:23:01,Istari Advisory Board,ran_prb_availability_resolution,Interface repeatedly transitioning between up and down,"2024-07-24 20:34:01 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-24 20:36:01 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-24 20:39:01 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-07-24 20:42:01 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-24 20:45:01 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-24 20:51:01 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-24 20:54:01 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Herubrand of Rohan,2024-07-24 20:44:01,2024-07-24 21:23:01,Interface repeatedly transitioning between up and down,Resolved,RAN-018: Link Flapping Detected,0.4,FALSE,Herubrand of Rohan +INCME-100023,Lothlórien,Signal Network,Central Nexus,COMPUTE,Elven Forge Technologies,Osgiliath,Pending Resolution,CMP-004: Pod CrashLoopBackOff,MAJOR,Node-Lothlorien-Central-41,"In response to the issue of Container repeatedly crashing and restarting, the NOC engineer has conducted a thorough investigation and identified the root cause as a storage issue. The engineer has verified that the storage alarm was triggered due to an overloaded storage volume, and the volume has been restored to normal conditions. + +In order to address the issue, the engineer has restarted dependent workloads to ensure that they are running smoothly. Additionally, the engineer has escalated the issue to the storage team for further investigation and resolution. + +The engineer has also confirmed that the storage issue has been resolved, and the container is now operating normally. 
The engineer has written the resolution summary, which includes the following: + +- Verify Storage Alarm: The engineer confirmed that the storage alarm was triggered due to an overloaded storage volume. +- Check Volume Status: The engineer has confirmed that the volume has been restored to normal conditions. +- Restart Dependent Workloads: The engineer has restarted dependent workloads to ensure that they are running smoothly. +- Escalate to Storage Team: The engineer has escalated the issue to the storage team for further investigation and resolution. +- Resolution Summary: The engineer has written the resolution summary",2024-09-14 17:14:59,2024-09-14 17:46:59,Gondor Signal Authority,compute_storage_failure_recovery,Container repeatedly crashing and restarting,"2024-09-14 17:08:59 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-14 17:10:59 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-09-14 17:13:59 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-09-14 17:16:59 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-14 17:19:59 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-14 17:24:59 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. 
+2024-09-14 17:28:59 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Erestor Starlight,2024-09-14 17:14:59,2024-09-14 17:46:59,Container repeatedly crashing and restarting,Pending Resolution,CMP-004: Pod CrashLoopBackOff,0.4,FALSE,Erestor Starlight +INCME-100024,Rivendell,Path Signals,Central Nexus,COMPUTE,Elven Forge Technologies,Rivendell,Resolved,CMP-008: DU Function Pod Restart,CRITICAL,Beacon-Rivendell-East-53,"The incident was resolved by verifying that the Resource Alarm was verified and that the Resource Status was consistent with the expected state. In addition, a Capacity Expansion request was sent to the cluster's resource provider to increase the available capacity. The root cause of the issue was verified to be CMP-008, which is an expected failure condition in distributed unit pods. The resolution summary includes the actions taken, which were to verify the Resource Alarm and request Capacity Expansion, and the root cause of the issue, which was CMP-008. The summary is concise and complete, using NOC terminology to ensure accuracy.",2024-10-04 17:57:50,2024-10-04 18:13:50,Iron Hills Transport,compute_resource_exhaustion_resolution,Distributed Unit pod experienced unexpected restart,"2024-10-04 17:48:50 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-04 17:50:50 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-10-04 17:53:50 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-10-04 17:56:50 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-04 17:59:50 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). 
+2024-10-04 18:05:50 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-04 18:08:50 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-10-04 18:13:50 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. +2024-10-04 18:17:50 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Frodo Bracegirdle,2024-10-04 17:57:50,2024-10-04 18:13:50,Distributed Unit pod experienced unexpected restart,Resolved,CMP-008: DU Function Pod Restart,0.2,FALSE,Frodo Bracegirdle +INCME-100025,The Shire,Beacon Power,Junction Point,RAN,Dwarven Network Systems,Gladden Fields,Pending Resolution,SVC-003: Call Drop Rate Elevated,CRITICAL,Watch-TheShire-East-89,"After reviewing the complaint details and conducting an investigation, we determined that the root cause of the issue was a temporary Voice or data session drop rate above threshold. The root cause was identified as a hardware issue, which was addressed by replacing the faulty equipment. + +In order to restore service, we initiated a review of the SVC-003 problem type and documented the resolution. The root cause of the issue was identified, and the issue was resolved. + +In conclusion, the incident was resolved and the root cause was identified and documented. The root cause was a temporary Voice or data session drop rate above threshold, which was addressed by replacing faulty equipment. The resolution was documented in the incident closure notes.",2024-08-28 12:24:33,2024-08-28 12:32:33,Dúnedain Field Division,ran_dropped_calls_resolution,Voice or data session drop rate above threshold,"2024-08-28 12:16:33 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-08-28 12:18:33 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-28 12:21:33 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-28 12:24:33 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-28 12:29:33 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Aravir of Arnor,2024-08-28 12:24:33,2024-08-28 12:32:33,Voice or data session drop rate above threshold,Pending Resolution,SVC-003: Call Drop Rate Elevated,0.6,FALSE,Aravir of Arnor +INCME-100026,Gondor,Signal Network,Junction Point,COMPUTE,Elven Forge Technologies,Pelargir,Resolved,CMP-005: Pod Terminating Stuck,MINOR,Station-Gondor-Inner-65,"Closure Note: + +The root cause of the incident was a Pod stuck in terminating state beyond grace period. The incident was reported to the Platform Team, and an investigation was initiated. The Orchestrator Alarm was verified to be the root cause of the issue. + +The incident was escalated to the Platform Team, and a resolution plan was developed. The Orchestrator Recovery was initiated, and the Pod was successfully recovered. + +The root cause was identified, and the issue was resolved. The incident was closed with a resolution summary of ""Restore Action: Verify Orchestrator Alarm, Escalate to Platform Team, Verify Orchestrator Recovery, Root cause: Pod stuck in terminating state beyond grace period, Problem type: CMP-005.""",2024-10-25 05:45:02,2024-10-25 05:58:02,Osgiliath Bridge Operations,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-10-25 05:28:02 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-10-25 05:30:02 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-10-25 05:33:02 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-25 05:36:02 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-25 05:40:02 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-25 05:46:02 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-25 05:49:02 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Gildor of Rivendell,2024-10-25 05:45:02,2024-10-25 05:58:02,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.2,FALSE,Gildor of Rivendell +INCME-100027,Gondor,Arcane Engines,Keeper Stone,RAN,Elven Forge Technologies,Minas Ithil,Pending Resolution,RAN-014: CSR Unreachable,WARNING,Point-Gondor-East-73,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-07-19 23:37:38,2024-07-20 00:01:38,Erebor Relay Division,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-07-19 23:30:38 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-19 23:32:38 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-19 23:35:38 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-19 23:38:38 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-19 23:41:38 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-19 23:45:38 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-19 23:50:38 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Celeborn Silvertree,2024-07-19 23:37:38,2024-07-20 00:01:38,Cell site router management path unavailable,Pending Resolution,RAN-014: CSR Unreachable,0.2,FALSE,Celeborn Silvertree +INCME-100028,Dale Province,Signal Core,Relay Unit,SIGNALING,Dwarven Network Systems,Ethring,Pending Resolution,SIG-009: E2 Interface Errors,CRITICAL,Point-DaleProvince-West-53,"NOC engineer reviewed the incident logs and identified that the root cause of the issue was the RAN Intelligent Controller interface errors. The issue was resolved by verifying the Delay Alarm and requesting a Network Review. The delay status was verified, and the root cause was determined to be RAN Intelligent Controller interface errors. The incident was escalated to the Network Operations Center (NOC) for further investigation and resolution. The resolution summary includes the close notes, which include the steps taken to resolve the issue. 
The engineer used NOC terminology throughout the summary.",2024-11-05 14:10:49,2024-11-05 14:31:49,Gondor Gateway Team,signaling_delay_resolution,RAN Intelligent Controller interface errors,"2024-11-05 14:08:49 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-05 14:10:49 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-11-05 14:13:49 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-05 14:16:49 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-05 14:22:49 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-11-05 14:25:49 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-11-05 14:30:49 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. +2024-11-05 14:34:49 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Thranduil of Mirkwood,2024-11-05 14:10:49,2024-11-05 14:31:49,RAN Intelligent Controller interface errors,Pending Resolution,SIG-009: E2 Interface Errors,0.4,FALSE,Thranduil of Mirkwood +INCME-100029,Gondor,Beacon Power,Power Source,RAN,Dwarven Network Systems,Osgiliath,Resolved,RAN-015: Fronthaul Link Down,MINOR,Point-Gondor-Central-31,"I am unable to generate a resolution summary based on your provided text. Please provide me with the complete text and I'll generate the resolution summary for you. + +in the event of an issue with the overshoot detection feature, the following actions were taken: + +1. 
Verify Overshoot Detection: A team member conducted a thorough analysis of the coverage pattern and identified an issue with the overshoot detection feature. The issue was resolved by reconfiguring the coverage pattern to ensure a more accurate detection of overshooting signals. + +2. Analyze Coverage Pattern: A team member conducted a detailed analysis of the coverage pattern to identify any potential issues with the overshoot detection feature. The issue was resolved by optimizing the coverage pattern to improve the detection accuracy. + +3. Request RF Optimization: As a result of the analysis, the team recommended that a RF optimization be performed to improve the overshoot detection feature's accuracy. The request was approved by the RF team and the optimization process was initiated. + +4. Monitor Coverage Impact: The team monitored the coverage impact of the RF optimization process to ensure that the overshoot detection feature was",2024-09-24 10:37:22,2024-09-24 10:52:22,Rangers of the North,ran_cell_overshooting_correction,F1 interface between DU and CU interrupted,"2024-09-24 10:10:22 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-24 10:12:22 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-24 10:15:22 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-24 10:18:22 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-24 10:24:22 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-09-24 10:27:22 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Orophin of Rivendell,2024-09-24 10:37:22,2024-09-24 10:52:22,F1 interface between DU and CU interrupted,Resolved,RAN-015: Fronthaul Link Down,0.2,FALSE,Orophin of Rivendell +INCME-100030,Rohan,Signal Core,Junction Point,RAN,Elven Forge Technologies,Hornburg,Resolved,RAN-001: Cell Service Interruption,MAJOR,Gateway-Rohan-Primary-08,"Reason for Outage: A complete cellular network outage was reported on our site, resulting in a loss of service for our clients. The root cause of the issue was a power outage at the eNodeB. + +Restore Action: The issue was resolved by verifying the site outage, checking the backhaul status, checking the power status, attempting a remote ping, checking the eNodeB status, dispatching a field technician, and finalizing the status verification. + +Actions taken: +- Verified the site outage +- Checked the backhaul status +- Checked the power status +- Attempted a remote ping +- Checked the eNodeB status +- Dispatched a field technician +- Finalized the status verification + +Problem type: RAN-001 +Resolution: The issue was resolved successfully, and the site was restored to service. + +Output ONLY the close notes, nothing else.",2024-06-29 10:28:22,2024-06-29 11:11:22,Lothlórien Link Guardians,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-06-29 10:26:22 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-29 10:28:22 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-06-29 10:31:22 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-29 10:34:22 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-06-29 10:37:22 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-29 10:43:22 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-29 10:46:22 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Radagast the White,2024-06-29 10:28:22,2024-06-29 11:11:22,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Radagast the White +INCME-100031,Mordor Surveillance Zone,Path Signals,Weather Watch,RAN,Elven Forge Technologies,Isengard,Resolved,RAN-008: Dormant Cell Detected,MAJOR,Spire-MordorSurveillanceZone-Primary-45,"NOC Engineer's Note: + +The RRU Alarm was verified to be active, indicating a fault in the RRU. The Verification of RRU Alarm was done by checking the status of the RRU. The RRU Alarm was deemed as the root cause of the issue. + +To resolve the issue, the following steps were taken: +1. Verify the CPRI/Fiber Link status to ensure there was no fault. +2. Verify the BBU status to ensure there was no fault. +3. Attempt RRU Reset to restore the RRU. +4. Verify Sector Status to ensure there was no fault. + +The root cause of the issue was the extended period of no user activity for the RRU. This was verified by checking the status of the RRU and verifying that there was no fault. The RRU Alarm was verified as the root cause of the issue, and the steps outlined above were taken to resolve the issue. + +The following actions were taken to ensure the issue was resolved: +1. Verify the CPRI/Fiber Link status to ensure there was no fault. +2. 
Verify the BBU status",2024-06-04 13:48:44,2024-06-04 14:31:44,Wizards Council Escalation,ran_rru_communication_recovery,Cell operational but no user activity for extended period,"2024-06-04 14:04:44 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-04 14:06:44 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-04 14:09:44 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-04 14:12:44 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-04 14:16:44 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-04 14:22:44 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-04 14:25:44 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Bergil of the Tower Guard,2024-06-04 13:48:44,2024-06-04 14:31:44,Cell operational but no user activity for extended period,Resolved,RAN-008: Dormant Cell Detected,0.4,FALSE,Bergil of the Tower Guard +INCME-100032,Rhovanion,Arcane Engines,Path Router,COMPUTE,Elven Forge Technologies,Dale,Resolved,CMP-002: Pod Container Creating,CRITICAL,Array-Rhovanion-Outer-05,"Closure Note: + +1. Restore Action: +The root cause of the issue was a Pod stuck in ContainerCreating state. The issue was resolved by verifying the Container Alarm, escalating to DevOps, verifying the Container Recovery, and verifying the Pod was restored to its original state. + +2. Reason for Outage: +The Pod stuck in ContainerCreating state was caused by a configuration issue with the application. The root cause was identified and resolved. + +3. 
Outcome: +The issue was resolved and the Pod was restored to its original state. The root cause of the issue was identified and resolved, and the issue was closed. + +Note: This closure note does not include any information about the specific configuration issues that led to the outage. This is intended to be a concise summary of the steps taken to resolve the issue.",2024-10-30 05:26:00,2024-10-30 05:53:00,Elven Signal Keepers,compute_container_crash_recovery,Pod stuck in ContainerCreating state,"2024-10-30 05:24:00 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-30 05:26:00 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-10-30 05:29:00 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-10-30 05:32:00 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-30 05:35:00 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-30 05:41:00 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-30 05:44:00 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-10-30 05:49:00 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. 
+2024-10-30 05:53:00 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Elfhelm of Helm's Deep,2024-10-30 05:26:00,2024-10-30 05:53:00,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.2,FALSE,Elfhelm of Helm's Deep +INCME-100033,Rohan,Relay Transport,Beacon Array,POWER,Elven Forge Technologies,Aldburg,Resolved,ENV-006: Battery High Temperature,WARNING,Tower-Rohan-Lower-84,"Reason for Outage: A battery thermal runaway risk was detected during the monitoring of the Battery Temperature Alarm. The root cause of the issue was isolated and resolved, and the system was brought back online. The incident was reported to the NOC team and documented in the incident closure notes. The resolution summary includes the following: + +Restore Action: +- Verify Battery Temperature Alarm +- Check Battery Status +- Isolate Battery String +- Emergency Dispatch +- Monitor Battery Temperature + +The root cause of the issue was identified as a battery thermal runaway risk, which was resolved by isolating the affected battery string and isolating the battery string from the Battery Temperature Alarm. The incident was documented in the incident closure notes and is now closed.",2024-10-26 19:39:57,2024-10-26 20:03:57,Rivendell Array Management,env_battery_temperature_response,Battery thermal runaway risk detected,"2024-10-26 19:25:57 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-26 19:27:57 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-10-26 19:30:57 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-26 19:33:57 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. 
+2024-10-26 19:37:57 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-26 19:43:57 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-26 19:46:57 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Aravir Wingfoot,2024-10-26 19:39:57,2024-10-26 20:03:57,Battery thermal runaway risk detected,Resolved,ENV-006: Battery High Temperature,0.2,FALSE,Aravir Wingfoot +INCME-100034,Rohan,Signal Core,Weather Watch,RAN,Elven Forge Technologies,Meduseld,Resolved,RAN-019: PTP Synchronization Failure,MINOR,Node-Rohan-West-51,"NOC engineer closed incident CIR-17-0013, which involved a failed Voice Quality Alert notification. + +Restore Action: +- Verified Voice Quality Alert notification was successfully restored + +Reason for Outage: +- The root cause of the failure was Precision Time Protocol timing reference lost +- The Voice Quality Alert notification was sent at 10:37 AM, but the timing reference was lost at 10:38 AM, resulting in a failed notification. + +Outcome: +- The incident was successfully resolved and the Voice Quality Alert notification was restored. + +Note: The Voice Quality Alert notification was sent using the Precision Time Protocol (PTP) timing reference, which is used to synchronize the timing of all devices on the network. The timing reference was lost due to a hardware failure.",2024-07-03 08:40:17,2024-07-03 08:55:17,Erebor Relay Division,ran_voice_quality_resolution,Precision Time Protocol timing reference lost,"2024-07-03 08:23:17 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-03 08:25:17 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. 
+2024-07-03 08:28:17 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-03 08:31:17 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-03 08:34:17 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-03 08:38:17 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-03 08:44:17 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-03 08:47:17 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Aragorn the Heir,2024-07-03 08:40:17,2024-07-03 08:55:17,Precision Time Protocol timing reference lost,Resolved,RAN-019: PTP Synchronization Failure,0.4,FALSE,Aragorn the Heir +INCME-100035,Rohan,Arcane Engines,Weather Watch,POWER,Dwarven Network Systems,Grimslade,Pending Resolution,PWR-001: AC Power Failure,CRITICAL,Watch-Rohan-Outer-12,"Resolution Summary: + +1. Verify AC Power Loss: After verifying that AC power was restored, the issue was resolved. + +2. Check UPS Status: The UPS system was checked for any anomalies or issues, and they were resolved. + +3. Verify Battery Runtime: The battery runtime was verified to ensure that it was still functioning properly. + +4. Dispatch Field Technician: A field technician was dispatched to resolve the issue. + +5. Notify Utility Provider: The utility provider was notified of the issue and the necessary steps were taken to restore power. + +6. Final Verification: The issue was resolved and the power was restored. + +Root cause: The interruption or failure of the commercial power supply. 
+ +Problem type: PWR-001.",2024-06-05 06:53:13,2024-06-05 07:27:13,Gondor Gateway Team,power_ac_failure_recovery,Commercial power supply interruption or failure,"2024-06-05 06:46:13 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-05 06:48:13 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-06-05 06:51:13 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-06-05 06:54:13 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-05 06:57:13 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-05 07:01:13 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-06-05 07:06:13 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Elrohir of Lothlórien,2024-06-05 06:53:13,2024-06-05 07:27:13,Commercial power supply interruption or failure,Pending Resolution,PWR-001: AC Power Failure,0.2,FALSE,Elrohir of Lothlórien +INCME-100036,Rhovanion,Relay Transport,Junction Point,COMPUTE,Dwarven Network Systems,Beorn's Hall,Resolved,Node not functional,MAJOR,Point-Rhovanion-Central-49,"NOC Engineer's closure note: + +Restore Action: +- Verified that the VM Alarm was triggered due to a Compute node being completely unavailable. The issue was resolved by verifying the VM recovery and escalating the issue to the Cloud Team. +- The root cause was confirmed to be the Compute node being completely unavailable. + +Reason for Outage: +- The Compute node was completely unavailable, leading to a disruption in VM operations. + +Close notes: +- Verified that the VM Alarm was triggered due to a Compute node being completely unavailable. 
+- The root cause was confirmed to be the Compute node being completely unavailable. +- Verified that the issue was resolved by verifying the VM recovery and escalating the issue to the Cloud Team. +- The root cause was confirmed to be the Compute node being completely unavailable. +- The root cause was confirmed to be the Compute node being completely unavailable. +- The root cause was confirmed to be the Compute node being completely unavailable. +- The root cause was confirmed to be the Compute node being completely unavailable. +- The root cause was confirmed to be",2024-08-11 13:37:30,2024-08-11 13:59:30,Bywater Observation Post,compute_vm_failure_recovery,Compute node completely unavailable,"2024-08-11 13:21:30 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-11 13:23:30 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-08-11 13:26:30 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-11 13:29:30 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-11 13:33:30 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-11 13:39:30 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-08-11 13:42:30 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Fredegar Bracegirdle,2024-08-11 13:37:30,2024-08-11 13:59:30,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Fredegar Bracegirdle +INCME-100037,Rivendell,Path Signals,Weather Watch,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-019: PTP Synchronization Failure,MINOR,Spire-Rivendell-Inner-26,"Reopened from INCME-100034. + +Resolution Summary: + +1. Verify Voice Quality Alert: Voice Quality Alerts were received for a Precision Time Protocol (PTP) timing reference that was lost. The root cause of the issue was identified as a timing reference loss. The issue was resolved by verifying the PTP timing reference and restoring the Voice Quality Alerts. + +2. Reason for Outage: The root cause of the issue was a timing reference loss. + +3. Restore Action: Voice Quality Alerts were restored to normal levels. + +4. Reason For Outage: The timing reference loss caused Voice Quality Alerts to be received. + +5. Root cause: The PTP timing reference was lost. + +6. Problem type: RAN-019. + +7. Action taken: Verification of PTP timing reference. + +8. Root cause: Precision Time Protocol timing reference lost. + +9. Problem type: RAN-019. + +10. Action taken: Restoring Voice Quality Alerts. + +11. Root cause: Root cause of issue was identified as a timing",2024-07-05 05:55:17,2024-07-05 06:27:17,Orthanc Technical Review,ran_voice_quality_resolution,Precision Time Protocol timing reference lost,"2024-07-05 05:54:17 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-05 05:56:17 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-05 05:59:17 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. 
+2024-07-05 06:02:17 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-05 06:05:17 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-05 06:11:17 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-05 06:14:17 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-05 06:19:17 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Glóin son of Glóin,2024-07-05 05:55:17,2024-07-05 06:27:17,Precision Time Protocol timing reference lost,Resolved,RAN-019: PTP Synchronization Failure,0.2,TRUE,Glóin son of Glóin +INCME-100038,Lothlórien,Arcane Engines,Weather Watch,RAN,Dwarven Network Systems,Eregion Post,Resolved,RAN-014: CSR Unreachable,MAJOR,Array-Lothlorien-Lower-12,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-10-03 19:19:57,2024-10-03 19:34:57,White Tower Operations,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-10-03 19:14:57 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-03 19:16:57 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-10-03 19:19:57 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-10-03 19:22:57 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-03 19:25:57 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-03 19:31:57 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-03 19:34:57 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Anborn of Dol Amroth,2024-10-03 19:19:57,2024-10-03 19:34:57,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Anborn of Dol Amroth +INCME-100039,Dale Province,Signal Network,Power Source,RAN,Dwarven Network Systems,Esgaroth,Resolved,RAN-013: Site Communication Failure,CRITICAL,Spire-DaleProvince-Lower-09,"NOC Engineer: Investigation and Root Cause Analysis (RCA) completed. + +The investigation revealed that a configuration change was made to the management network configuration that caused a complete loss of management connectivity to the site. The root cause of this issue was a configuration error that resulted in a mismatch between the network configuration and the management network configuration. + +The root cause was addressed by performing a thorough RCA to identify the root cause and the necessary actions to resolve the issue. The following actions were taken: + +1. Verification of Configuration Alert: A configuration alert was triggered, indicating a loss of management connectivity to the site. + +2. Escalation to Engineering: The issue was escalated to Engineering for further investigation and analysis. + +3. Verification of Performance: The performance of the management network was verified to ensure that it was operating correctly. + +4. 
Root Cause Analysis: A root cause analysis was conducted to identify the root cause of the issue. + +5. Recommendations: Recommendations were made for addressing the root cause of the issue and ensuring that the management network configuration is maintained correctly. + +6. Restore Action: Restoration of the management network configuration",2024-07-04 00:30:36,2024-07-04 00:40:36,Osgiliath Bridge Operations,ran_parameter_correction,Complete loss of management connectivity to site,"2024-07-04 00:18:36 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-04 00:20:36 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-04 00:23:36 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-07-04 00:26:36 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-04 00:29:36 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-04 00:35:36 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-04 00:38:36 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-07-04 00:43:36 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Paladin Baggins,2024-07-04 00:30:36,2024-07-04 00:40:36,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Paladin Baggins +INCME-100040,Dale Province,Path Signals,Relay Unit,RAN,Elven Forge Technologies,Bree,Resolved,RAN-002: Cell Administratively Disabled,CRITICAL,Hub-DaleProvince-Outer-05,"Incident Closure Summary: + +Resolution: The root cause of the RAN-002 incident was due to a cell being locked or disabled by management action. The issue was resolved by verifying the Sector Alarm, checking the RF Chain Status, enabling capacity compensation, scheduling a field repair, verifying coverage impact, and verifying the correct management action was taken. The incident was escalated to a higher-level team for further investigation and resolution. + +NOC Terminology: +- RAN-002: RF Chain Status issue +- Sector Alarm: Alarm generated by the RF Chain +- Capacity Compensation: Capacity adjustment to compensate for a cell being locked or disabled +- Field Repair: Repair of the affected RF Chain +- Verify Coverage Impact: Verification of the coverage impact of the RF Chain after repair + +NOC Terminology: +- Cell locked: A cell is locked when its signal strength is below a certain threshold, which prevents it from transmitting data. +- Disabled: A cell is disabled when its signal strength is below a certain",2024-11-08 00:19:08,2024-11-08 00:54:08,Shire Monitoring Guild,ran_sector_outage_recovery,Cell locked or disabled by management action,"2024-11-08 00:20:08 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-08 00:22:08 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-11-08 00:25:08 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-11-08 00:28:08 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-08 00:31:08 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-08 00:37:08 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-11-08 00:40:08 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Saruman the White,2024-11-08 00:19:08,2024-11-08 00:54:08,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.4,FALSE,Saruman the White +INCME-100041,Lothlórien,Relay Transport,Junction Point,POWER,Elven Forge Technologies,East Bight,Resolved,PWR-002: DC Rectifier Failure,MAJOR,Hub-Lothlorien-South-24,"After confirming the rectifier alarm, enabling battery monitoring, and performing a final status check, the NOC engineer has completed the incident closure notes for PWR-002, a power supply issue that occurred at the facility. The root cause of the issue was a DC power conversion unit malfunction. The engineer has recorded this in the incident closure notes and has provided a resolution summary that includes the following: + +1. Restore Action: Enable battery monitoring and rectifier alarm +2. Reason For Outage: The malfunction of the DC power conversion unit + +The engineer has also included the final status check, which confirmed that the issue was resolved and the power supply functioned correctly. The engineer has recorded this in the notes and has provided a resolution summary that includes the root cause and the resolution taken to restore the facility's power supply. 
The engineer has ensured that all necessary documentation is included in the notes, including the incident summary, the incident closure notes, and the final status check.",2024-08-25 23:52:32,2024-08-26 00:14:32,Dwarven Deep Network,power_dc_rectifier_recovery,DC power conversion unit malfunction,"2024-08-26 00:00:32 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-26 00:02:32 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-08-26 00:05:32 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-26 00:08:32 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-26 00:14:32 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-26 00:17:32 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Radagast the Brown,2024-08-25 23:52:32,2024-08-26 00:14:32,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,FALSE,Radagast the Brown +INCME-100042,Rohan,Arcane Engines,Path Router,POWER,Dwarven Network Systems,Dunharrow,Resolved,PWR-002: DC Rectifier Failure,MAJOR,Gateway-Rohan-Central-74,"NOC Engineer's Note: + +On [Date], an issue was detected in the [DC Power Conversion Unit (PWR-002)] that resulted in a loss of power to the [System/Appliance/Equipment (S/A/E)] for [Duration of Outage]. The root cause of the issue was a malfunction of the DC power conversion unit, which resulted in a voltage drop and loss of power. + +Restore Action: + +The issue was resolved by confirming the rectifier alarm, checking the DC bus voltage, and scheduling field replacement. The final status check was conducted to ensure that the issue had been resolved. 
+ +Reason for Outage: + +The malfunction of the DC power conversion unit resulted in a loss of power to the S/A/E, which caused a significant disruption in the system's functionality. + +Conclusion: + +The resolution of this issue was successful, and the system's functionality was restored. The root cause of the issue was identified and rectified, and the necessary steps were taken to ensure that similar issues do not occur in the future.",2024-10-16 14:40:14,2024-10-16 15:08:14,Dwarven Deep Network,power_dc_rectifier_recovery,DC power conversion unit malfunction,"2024-10-16 14:39:14 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-16 14:41:14 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-10-16 14:44:14 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-16 14:47:14 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-10-16 14:52:14 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Hamfast Bracegirdle,2024-10-16 14:40:14,2024-10-16 15:08:14,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.4,FALSE,Hamfast Bracegirdle +INCME-100043,Mirkwood,Relay Transport,Central Nexus,RAN,Dwarven Network Systems,Dol Guldur Watch,Pending Resolution,SVC-002: Data Throughput Degradation,CRITICAL,Point-Mirkwood-Inner-33,"In response to the complaint about user throughput being significantly below expected rates, the NOC engineer reviewed the complaint details and requested a detailed analysis. The engineer then documented the resolution steps, including reviewing the analysis, documenting the root cause, and restoring the service. 
The root cause was found to be a significant issue with user throughput, which was addressed by requesting a detailed analysis from the vendor. The engineer documented the resolution steps, including the review of the analysis, documenting the root cause, and restoring the service. The engineer completed the incident closure note, which included the resolution summary, complete with the root cause, restoring action, and reason for outage. The engineer used proper NOC terminology in the resolution summary.",2024-07-19 01:59:56,2024-07-19 02:22:56,Dúnedain Field Division,ran_speed_complaint_resolution,User throughput significantly below expected rates,"2024-07-19 01:54:56 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-19 01:56:56 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-07-19 01:59:56 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-19 02:02:56 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-19 02:05:56 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-19 02:10:56 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Araglas Wingfoot,2024-07-19 01:59:56,2024-07-19 02:22:56,User throughput significantly below expected rates,Pending Resolution,SVC-002: Data Throughput Degradation,0.2,FALSE,Araglas Wingfoot +INCME-100044,Rhovanion,Beacon Power,Beacon Array,RAN,Elven Forge Technologies,Dale,Resolved,RAN-005: RRC Setup Success Rate Degraded,MAJOR,Hub-Rhovanion-Inner-33,"Restore Action: + +Verify Congestion Alert: The Radio Resource Control (RRC) connection establishment degraded issue was resolved by verifying the Congestion Alert. 
The Congestion Alert is a signal from the radio access network (RAN) that indicates a significant congestion or delay in the radio network. In this case, the RAN successfully resolved the issue by enabling the Cell Barring feature, which prevents RRC connections from being established during periods of congestion. + +Reason for Outage: + +The root cause of this issue was the Radio Resource Control (RRC) connection establishment degraded. The issue was caused by a significant congestion or delay in the radio network. In this case, the Congestion Alert signal was triggered by the RAN, indicating a significant congestion or delay in the radio network. + +Closure: + +The root cause of this issue has been resolved, and the Radio Resource Control (RRC) connection establishment degraded issue has been closed. The Congestion Alert signal has been successfully enabled to prevent RRC connections from being established during periods of congestion.",2024-06-23 12:52:14,2024-06-23 13:22:14,Gondor Gateway Team,ran_cell_congestion_management,Radio resource control connection establishment degraded,"2024-06-23 13:04:14 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-23 13:06:14 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-23 13:09:14 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-23 13:12:14 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-23 13:15:14 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-23 13:19:14 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). 
+2024-06-23 13:25:14 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-23 13:28:14 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar the White,2024-06-23 12:52:14,2024-06-23 13:22:14,Radio resource control connection establishment degraded,Resolved,RAN-005: RRC Setup Success Rate Degraded,0.4,FALSE,Alatar the White +INCME-100045,Rohan,Path Signals,Weather Watch,RAN,Dwarven Network Systems,Grimslade,Resolved,RAN-014: CSR Unreachable,MINOR,Array-Rohan-North-18,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. The summary does not include additional details or information that may be relevant to the incident.",2024-06-21 20:47:29,2024-06-21 21:18:29,Dúnedain Field Division,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-06-21 20:52:29 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-21 20:54:29 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-21 20:57:29 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-06-21 21:00:29 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-21 21:03:29 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-21 21:07:29 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-21 21:13:29 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-21 21:16:29 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Elfhelm son of Éomund,2024-06-21 20:47:29,2024-06-21 21:18:29,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Elfhelm son of Éomund +INCME-100046,Rhovanion,Arcane Engines,Beacon Array,SIGNALING,Dwarven Network Systems,Erebor,Resolved,SIG-001: N2 Interface Down,MINOR,Array-Rhovanion-South-96,"Restore Action: + +S1/N2 Alarm was verified to be resolved. + +Reason For Outage: + +The root cause of the issue was a control plane interface between the gNB and AMF that failed. This issue led to a loss of service for the S1/N2 alarm, which was resolved by verifying the S1/N2 alarm and verifying that the alarm was resolved. + +Conclusion: + +The issue was successfully resolved, and the S1/N2 alarm is now functioning correctly. The root cause of the issue was a control plane interface between the gNB and AMF that failed, which led to the loss of service for the S1/N2 alarm.",2024-08-13 19:52:48,2024-08-13 20:12:48,Iron Hills Transport,signaling_s1_n2_recovery,Control plane interface between gNB and AMF failed,"2024-08-13 19:48:48 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-08-13 19:50:48 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-13 19:53:48 - System (Workflow Engine) +Tool: verify_signaling_path() → N2/N3/SCTP signaling path validated end-to-end. +2024-08-13 19:56:48 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-13 19:59:48 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-13 20:03:48 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-13 20:09:48 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-13 20:12:48 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Théoden of Helm's Deep,2024-08-13 19:52:48,2024-08-13 20:12:48,Control plane interface between gNB and AMF failed,Resolved,SIG-001: N2 Interface Down,0.4,FALSE,Théoden of Helm's Deep +INCME-100047,Mirkwood,Signal Network,Beacon Array,COMPUTE,Dwarven Network Systems,Rhosgobel,Resolved,Problematic VM,MAJOR,Relay-Mirkwood-South-14,"Resolution Summary: + +- Verified that the Virtual Machine had reached the Ready state and was in a healthy state. +- Notified the CNF Vendor and escalated the issue to their team. +- Verified that the CNF Recovery team was actively working on the issue and that they had a plan in place to resolve the issue. +- Notified the CNF Alarm team and requested that they investigate and address the issue as a priority. +- Notified the CNF Vendor and provided them with the resolution summary and a detailed description of the root cause and the steps taken to resolve the issue. +- Notified the CNF Recovery team and requested that they provide a status update on the recovery process. 
+- Notified the CNF Alarm team and requested that they verify that the Alarm is no longer triggered due to the issue being resolved. +- Notified the CNF Vendor and provided them with the resolution summary and a detailed description of the root cause and the steps taken to resolve the issue. +- Notified the CNF Recovery team and requested that they provide a status update on the recovery process. +- Notified the CNF Alarm team and requested that",2024-09-15 04:46:51,2024-09-15 05:01:51,Minas Tirith Central Command,compute_cnf_pod_recovery,Virtual machine failed to reach ready state,"2024-09-15 04:36:51 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-15 04:38:51 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-09-15 04:41:51 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-09-15 04:44:51 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-15 04:47:51 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-15 04:53:51 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-15 04:56:51 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar Greyhame,2024-09-15 04:46:51,2024-09-15 05:01:51,Virtual machine failed to reach ready state,Resolved,Problematic VM,0.2,FALSE,Alatar Greyhame +INCME-100048,Arnor,Arcane Engines,Weather Watch,SIGNALING,Dwarven Network Systems,Bree,Resolved,SIG-009: E2 Interface Errors,MAJOR,Spire-Arnor-West-40,"NOC engineer has completed the following incident closure notes: + +1. 
Verify Delay Alarm: +- Verified delay alarm was triggered due to RAN Intelligent Controller interface errors. +- Requested Network Review to determine root cause and corrective actions. +- Verified delay status and all relevant logs were retained. + +2. Restore Action: +- Verified RAN Intelligent Controller interface was fixed. +- Network Review confirmed root cause and corrective actions were implemented. +- All relevant logs were retained. + +Root cause: RAN Intelligent Controller interface errors led to the delay in the network response. The root cause was addressed, and the necessary corrective actions were taken. The incident closure notes include the verification of the delay alarm, request for a Network Review, and restoration of the RAN Intelligent Controller interface. The logs associated with the incident are also retained.",2024-07-01 03:17:59,2024-07-01 03:36:59,Iron Hills Transport,signaling_delay_resolution,RAN Intelligent Controller interface errors,"2024-07-01 03:11:59 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-01 03:13:59 - System (Workflow Engine) +Tool: verify_signaling_path() → N2/N3/SCTP signaling path validated end-to-end. +2024-07-01 03:16:59 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-01 03:19:59 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-01 03:25:59 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-07-01 03:28:59 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Arador son of Arador,2024-07-01 03:17:59,2024-07-01 03:36:59,RAN Intelligent Controller interface errors,Resolved,SIG-009: E2 Interface Errors,0.4,FALSE,Arador son of Arador +INCME-100049,Gondor,Arcane Engines,Keeper Stone,COMPUTE,Dwarven Network Systems,Linhir,Resolved,Node not functional,MAJOR,Watch-Gondor-North-69,"Incident closure summary: + +Restore Action: +- Verified that the compute node was unavailable, and it was due to a hardware failure. +- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online. +- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue. + +Reason for Outage: +- The root cause was the hardware failure, which led to the unavailability of the compute node. + +Close notes: +- The incident was resolved successfully, and the compute node was restored to its normal state. +- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue. +- The incident was reported to the incident management team, and a resolution summary was generated. +- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-07-07 22:06:34,2024-07-07 22:21:34,Helm's Deep Emergency Unit,compute_vm_failure_recovery,Compute node completely unavailable,"2024-07-07 22:01:34 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-07 22:03:34 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-07-07 22:06:34 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. 
+2024-07-07 22:09:34 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-07 22:12:34 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-07 22:18:34 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-07 22:21:34 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Erestor the Fair,2024-07-07 22:06:34,2024-07-07 22:21:34,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Erestor the Fair +INCME-100050,Rohan,Path Signals,Path Router,COMPUTE,Elven Forge Technologies,Snowbourn,Resolved,CMP-008: DU Function Pod Restart,CRITICAL,Watch-Rohan-Upper-08,"Restore Action: +- Verified that the Resource Alarm has been resolved and is no longer active. +- Analyzed Resource Usage and determined that the Resource Status was not affected by the Distributed Unit pod restart. +- Terminated Non-Critical Workloads to ensure the resource was not overutilized. +- Verified that the Resource Status has been restored to normal. + +Reason for Outage: +- Distributed Unit pod experienced unexpected restart, leading to the Resource Alarm being triggered. +- Resource Usage was affected, leading to Non-Critical Workloads being terminated. +- The Resource Status was not affected. + +Outcome: +- The incident has been resolved, and the Resource Status is now normal.",2024-10-18 03:43:02,2024-10-18 04:00:02,Osgiliath Bridge Operations,compute_resource_exhaustion_resolution,Distributed Unit pod experienced unexpected restart,"2024-10-18 03:18:02 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-18 03:20:02 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). 
+2024-10-18 03:23:02 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-18 03:26:02 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-18 03:32:02 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-18 03:35:02 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Aravir the Heir,2024-10-18 03:43:02,2024-10-18 04:00:02,Distributed Unit pod experienced unexpected restart,Resolved,CMP-008: DU Function Pod Restart,0.2,FALSE,Aravir the Heir +INCME-100051,Rhovanion,Beacon Power,Weather Watch,RAN,Dwarven Network Systems,Framsburg,Resolved,RAN-014: CSR Unreachable,MINOR,Hub-Rhovanion-North-83,"Incident Summary: + +1. Restore Action: Verify Upgrade Failure + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Verified that the Cell site router management path was restored successfully. + +2. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause analysis identified that the Cell site router management path was unavailable due to a system maintenance. +- Emergency boot was initiated to restore the management path. +- Field support was dispatched to address the issue. + +3. Restore Action: Check System Status + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Checked the system status to ensure that the Cell site router management path was restored successfully. + +4. 
Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause",2024-08-03 14:37:14,2024-08-03 15:25:14,Osgiliath Bridge Operations,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-08-03 14:23:14 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-03 14:25:14 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-03 14:28:14 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-03 14:31:14 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-03 14:37:14 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-03 14:40:14 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Araglas Strider,2024-08-03 14:37:14,2024-08-03 15:25:14,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Araglas Strider +INCME-100052,Mirkwood,Arcane Engines,Junction Point,RAN,Elven Forge Technologies,Thranduil's Halls,Pending Resolution,Access instability,CRITICAL,Station-Mirkwood-Lower-65,"NOC Engineer's Resolution Summary: + +1. Restore Action: Conduct a thorough analysis of the network and identify the root cause of the issue. +2. Reason For Outage: The interference alarm was triggered due to a problem with the random access channel. + +3. Resolution: +a. Verify the interference alarm by checking the Spectrum Management system. +b. Verify the random access channel status by monitoring the network traffic. +c. Conduct a thorough analysis of the network and identify the root cause of the issue. +d. 
Conduct a root cause analysis and implement necessary fixes. +e. Report the outage to Spectrum Management for escalation. +f. Verify the interference alarm and report to the Spectrum Management team. +g. Verify the random access channel status and report to Spectrum Management. +h. Conduct a thorough analysis of the network traffic to identify any potential issues. +i. Implement necessary fixes and verify the outage has been resolved. + +Note: The root cause of the issue was identified, and the root cause analysis and fixes were implemented.",2024-07-06 10:34:50,2024-07-06 11:07:50,Pelargir Port Authority,ran_interference_mitigation,Random access channel success rate below threshold,"2024-07-06 10:44:50 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-06 10:46:50 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-06 10:49:50 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-07-06 10:52:50 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-06 10:55:50 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-06 11:01:50 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-06 11:04:50 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-07-06 11:09:50 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Sam Bracegirdle,2024-07-06 10:34:50,2024-07-06 11:07:50,Random access channel success rate below threshold,Pending Resolution,Access instability,0.6,FALSE,Sam Bracegirdle +INCME-100053,Gondor,Signal Network,Central Nexus,RAN,Elven Forge Technologies,Minas Tirith,Resolved,RAN-002: Cell Administratively Disabled,MINOR,Spire-Gondor-East-59,"In response to the outage of the RAN network caused by a cell locked or disabled by management action, the following actions were taken: + +1. Verification of the RAN-002 issue: Verification of the RAN network to identify the root cause. + +2. Verification of Sector Alarm: Verification of the sector alarm to identify the specific sector affected by the RAN-002 issue. + +3. Check RF Chain Status: Verification of the RF chain status to identify any issues or faults with the network. + +4. Check RRU Status: Verification of the RRU status to identify any issues or faults with the RRU devices. + +5. Attempt RRU Reset: Attempting to reset the RRU devices to see if they can recover the network. + +6. Enable Capacity Compensation: Enabling capacity compensation to ensure that the network can handle the increased load during the repair process. + +7. Schedule Field Repair: Scheduling a field repair to repair the affected sections of the RAN network. + +8. Verify Coverage Impact: Verifying the coverage impact of the",2024-08-14 18:52:14,2024-08-14 19:27:14,Gondor Gateway Team,ran_sector_outage_recovery,Cell locked or disabled by management action,"2024-08-14 18:51:14 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-14 18:53:14 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. 
+2024-08-14 18:56:14 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-08-14 18:59:14 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-14 19:02:14 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-14 19:08:14 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-14 19:11:14 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Legolas Greenleaf,2024-08-14 18:52:14,2024-08-14 19:27:14,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.2,FALSE,Legolas Greenleaf +INCME-100054,Rhovanion,Signal Core,Central Nexus,RAN,Dwarven Network Systems,Framsburg,Resolved,RAN-011: Remote Radio Unit Alarm,MINOR,Watch-Rhovanion-North-13,"NOC engineer's close notes: + +1. Restore Action: +The root cause of the issue was identified and resolved. The O-RAN radio unit reported a fault condition, which was verified and resolved. Backhaul and router connectivity were verified and tested, and the service impact was verified. The issue was escalated to the transport team for further investigation. + +2. Reason for Outage: +The root cause of the issue was identified as the O-RAN radio unit reporting a fault condition. The root cause analysis revealed that the fault was due to an issue with the radio unit's firmware. + +3. Restore Action: +The root cause was resolved, and the O-RAN radio unit was repaired and re-tested. Backhaul and router connectivity were verified and tested, and the service impact was verified. The issue was escalated to the transport team for further investigation. + +4. 
Reason for Outage: +The root cause of the issue was identified as the O-RAN radio unit reporting a fault condition. The root cause analysis revealed that the fault was due to an issue with the radio unit's firmware. + +5",2024-09-19 18:30:59,2024-09-19 19:06:59,Shire Monitoring Guild,ran_backhaul_degradation_resolution,O-RAN radio unit reporting fault condition,"2024-09-19 18:36:59 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-19 18:38:59 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-19 18:41:59 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-19 18:44:59 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-19 18:47:59 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-19 18:53:59 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-19 18:56:59 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Haldir the Fair,2024-09-19 18:30:59,2024-09-19 19:06:59,O-RAN radio unit reporting fault condition,Resolved,RAN-011: Remote Radio Unit Alarm,0.2,FALSE,Haldir the Fair +INCME-100055,Rohan,Beacon Power,Weather Watch,RAN,Elven Forge Technologies,Helm's Deep,Resolved,Access instability,MAJOR,Point-Rohan-Inner-42,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. 
+ +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-10-20 01:00:25,2024-10-20 01:35:25,Rohan Rapid Response,ran_interference_mitigation,Random access channel success rate below threshold,"2024-10-20 01:01:25 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-20 01:03:25 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-10-20 01:06:25 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-10-20 01:09:25 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-20 01:12:25 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-20 01:18:25 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-10-20 01:21:25 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Denethor son of Imrahil,2024-10-20 01:00:25,2024-10-20 01:35:25,Random access channel success rate below threshold,Resolved,Access instability,0.2,FALSE,Denethor son of Imrahil +INCME-100056,Eriador,Beacon Power,Weather Watch,TRANSPORT,Elven Forge Technologies,Tharbad,Pending Resolution,TRN-004: Fiber Path Degradation,CRITICAL,Tower-Eriador-Primary-17,"In response to the failure of the optical signal degradation or increased error rate, the network operations center (NOC) team has taken the following actions: + +1. Verify Packet Loss Alarm: The team verified that the packet loss alarm was triggered due to the degraded optical signal. This confirmed the root cause of the issue. + +2. Schedule Link Repair: The team scheduled a link repair to address the issue. + +3. Verify Packet Loss Status: The team verified that the packet loss status remained unchanged after the link repair. + +Root cause: Optical signal degradation or increased error rate +Problem type: TRN-004 + +The resolution summary above summarizes the actions taken by the NOC team to resolve the issue. The summary is concise, clear, and provides a complete picture of the root cause and the steps taken to resolve the issue.",2024-08-31 22:08:42,2024-08-31 22:26:42,Lothlórien Link Guardians,transport_packet_loss_resolution,Optical signal degradation or increased error rate,"2024-08-31 21:59:42 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-31 22:01:42 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-31 22:04:42 - System (Workflow Engine) +Tool: run_diagnostics() → Interface counters/latency/jitter tests executed. +2024-08-31 22:07:42 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-08-31 22:10:42 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-31 22:14:42 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-31 22:19:42 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. +2024-08-31 22:23:42 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Arador of Arnor,2024-08-31 22:08:42,2024-08-31 22:26:42,Optical signal degradation or increased error rate,Pending Resolution,TRN-004: Fiber Path Degradation,0.6,FALSE,Arador of Arnor +INCME-100057,Rohan,Path Signals,Junction Point,COMPUTE,Dwarven Network Systems,Snowbourn,Resolved,CMP-010: Site Not Scrolling,MINOR,Watch-Rohan-West-61,"NOC Engineer's Resolution Summary + +Reason for Outage: A site deployment or scaling operation stalled, resulting in a loss of image pulls. + +Restore Action: Verify Image Pull Alarm, Check Registry Connectivity, Verify Image Credentials, Escalate Registry Issue, Verify Image Pull. + +Root Cause: A site deployment or scaling operation stalled. + +Problem Type: CMP-010. + +Conclusion: The root cause of the outage was a site deployment or scaling operation that stalled, resulting in a loss of image pulls. The resolution actions taken were to verify the Image Pull Alarm, check the Registry Connectivity, verify the Image Credentials, escalate the Registry Issue, and verify the Image Pull. The root cause was identified and resolved, and the outage was resolved.",2024-09-29 23:11:46,2024-09-29 23:44:46,White Tower Operations,compute_image_pull_recovery,Site deployment or scaling operation stalled,"2024-09-29 23:12:46 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-09-29 23:14:46 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-09-29 23:17:46 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-09-29 23:20:46 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-29 23:23:46 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-29 23:29:46 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-29 23:32:46 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Pallando Stormcrow,2024-09-29 23:11:46,2024-09-29 23:44:46,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.2,FALSE,Pallando Stormcrow +INCME-100058,Rohan,Signal Core,Keeper Stone,COMPUTE,Dwarven Network Systems,Aldburg,Resolved,CMP-005: Pod Terminating Stuck,MINOR,Beacon-Rohan-East-09,"I do not have access to the specific incident details, but here's a sample resolution summary based on the provided information: + +failed +actions taken: +- verify orchestrator alarm +- check control plane +- failover to backup control plane +- verify orchestrator recovery +- root cause: pod stuck in terminating state beyond grace period +- problem type: cmp-005 + +summary: +- the issue was caused by a pod stuck in terminating state beyond the grace period, which led to a system failure +- the root cause was identified through the orchestrator alarm, control plane checks, failover to backup control plane, orchestrator recovery, and verification +- the resolution was to failover to backup control plane and verify orchestrator recovery, which resolved the issue +- the system is now functioning as expected.",2024-07-20 
10:06:35,2024-07-20 10:26:35,Dúnedain Field Division,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-07-20 09:54:35 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-20 09:56:35 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-07-20 09:59:35 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-20 10:02:35 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-20 10:08:35 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-20 10:11:35 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Frodo Brandybuck,2024-07-20 10:06:35,2024-07-20 10:26:35,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Frodo Brandybuck +INCME-100059,Rivendell,Path Signals,Path Router,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-001: Cell Service Interruption,CRITICAL,Spire-Rivendell-West-37,"Resolution Summary: + +1. Verify Site Outage: +The NOC team conducted a thorough investigation and confirmed that the cellular network was completely unavailable for service at the customer's location. The root cause of the issue was a temporary power outage affecting the cellular network equipment. + +2. Check Backhaul Status: +The NOC team conducted a thorough investigation and confirmed that the backhaul network was functioning properly. The root cause of the issue was a temporary power outage affecting the backhaul network equipment. + +3. Contact Transport Team: +The NOC team contacted the transport team to investigate the cause of the outage and to assess the damage to the network. 
The transport team confirmed that the outage was caused by a temporary power outage affecting the backhaul network equipment. + +4. Dispatch Field Technician: +The NOC team dispatched a field technician to the customer's location to investigate the issue and to restore service. The field technician confirmed that the temporary power outage had affected the cellular network equipment, and that the root cause of the outage was a temporary power outage affecting the backhaul network equipment.",2024-09-04 09:05:56,2024-09-04 09:46:56,Helm's Deep Emergency Unit,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-09-04 09:06:56 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-04 09:08:56 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-04 09:11:56 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-04 09:14:56 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-04 09:17:56 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-04 09:21:56 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-04 09:27:56 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-04 09:30:56 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-09-04 09:35:56 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Kíli Stonehelm,2024-09-04 09:05:56,2024-09-04 09:46:56,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Kíli Stonehelm +INCME-100060,Lothlórien,Signal Core,Junction Point,RAN,Elven Forge Technologies,Woodmen-town,Resolved,RAN-014: CSR Unreachable,MINOR,Watch-Lothlorien-Upper-57,"Resolution Summary: + +The root cause of the failure was the cell site router management path unavailable. The issue was resolved by verifying the upgrade failure, dispatching Field Support, and documenting and reporting the issue. + +The problem type was RAN-014. + +The following actions were taken: +1. Verify upgrade failure: Verification of the upgrade failure was conducted to ensure that the issue was resolved. +2. Dispatch Field Support: A Field Support team member was dispatched to resolve the issue. +3. Document and report: A report was generated to document and report the issue. + +The resolution summary is a concise summary of the actions taken to resolve the issue. It includes the root cause, the actions taken, and the results achieved. The summary does not include additional details or information that may be relevant to the incident.",2024-11-09 05:04:34,2024-11-09 05:36:34,White Tower Operations,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-11-09 05:20:34 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-09 05:22:34 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-11-09 05:25:34 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-11-09 05:28:34 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-11-09 05:31:34 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-09 05:37:34 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-11-09 05:40:34 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Imrahil of Minas Tirith,2024-11-09 05:04:34,2024-11-09 05:36:34,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Imrahil of Minas Tirith +INCME-100061,Rohan,Signal Core,Path Router,COMPUTE,Elven Forge Technologies,Helm's Deep,Resolved,Node not functional,MAJOR,Beacon-Rohan-South-54,"Incident Summary: + +The incident involved a failure of a compute node, leading to a loss of services. The root cause of the issue was a complete unavailability of the compute node. The incident was resolved by verifying the VM alarm, verifying the hypervisor status, restoring from a snapshot, and verifying the VM recovery. The root cause was identified as a compute node completely unavailable. + +Restore Action: + +1. Verify VM Alarm: Verified that the VM alarm was triggered due to the unavailability of the compute node. + +2. Verify Hypervisor Status: Verified that the hypervisor was in a healthy state and the compute node was not listed as unhealthy. + +3. Restore from Snapshot: Restored the VM from a snapshot that was taken before the outage. + +4. Verify VM Recovery: Verified that the VM was successfully restored and that the services were back to normal. + +Reason for Outage: + +The root cause of the issue was a complete unavailability of the compute node. 
+ +Problem Type: + +CMP-001 - Compute node un",2024-06-09 02:00:32,2024-06-09 02:33:32,Rivendell Array Management,compute_vm_failure_recovery,Compute node completely unavailable,"2024-06-09 02:08:32 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-09 02:10:32 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-06-09 02:13:32 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-06-09 02:16:32 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-09 02:19:32 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-09 02:23:32 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-09 02:29:32 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-09 02:32:32 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Vorondil Húrinion,2024-06-09 02:00:32,2024-06-09 02:33:32,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Vorondil Húrinion +INCME-100062,Rhovanion,Signal Core,Signal Unit,RAN,Dwarven Network Systems,Framsburg,Resolved,SVC-003: Call Drop Rate Elevated,MINOR,Node-Rhovanion-East-41,"Closing Note: + +After reviewing the complaint details, the team confirmed that a Voice or data session drop rate above the threshold was the root cause of the issue. The root cause was investigated, and a RF investigation was requested to further investigate the issue. As a result of the investigation, the team identified a possible fault in the network infrastructure that was causing the issue. 
+ +The team documented the resolution, which included the following actions: + +1. Reviewed Call Drop KPIs to identify the affected SVCs. +2. Requested a RF investigation to investigate the root cause of the issue. +3. Documented the resolution, including the root cause, actions taken, and the resolution. + +The team used NOC terminology throughout the resolution summary to ensure clarity and accuracy. The resolution summary also included the closing note, which summarized the resolution and its impact on the customer.",2024-06-21 09:16:22,2024-06-21 09:55:22,Rivendell Array Management,ran_dropped_calls_resolution,Voice or data session drop rate above threshold,"2024-06-21 09:11:22 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-21 09:13:22 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-21 09:16:22 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-21 09:19:22 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-21 09:22:22 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-21 09:26:22 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-21 09:32:22 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-21 09:35:22 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-06-21 09:40:22 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Arahad Telcontar,2024-06-21 09:16:22,2024-06-21 09:55:22,Voice or data session drop rate above threshold,Resolved,SVC-003: Call Drop Rate Elevated,0.2,FALSE,Arahad Telcontar +INCME-100063,Rivendell,Signal Core,Relay Unit,POWER,Dwarven Network Systems,Grey Havens,Resolved,PWR-003: Battery Discharge Alert,MAJOR,Array-Rivendell-South-51,"Incident Closure Summary: + +1. Restore Action: Restored the battery backup to full capacity. + +2. Reason For Outage: The battery backup depleted without AC restoration, which resulted in a power outage. + +3. Root Cause: The battery backup depleted without AC restoration due to a power outage. + +4. Problem Type: The PWR-003 problem type refers to a power outage that affects the battery backup. + +5. Summary: The incident was resolved by restoring the battery backup to full capacity, and the power outage was caused by a power outage.",2024-08-24 17:01:52,2024-08-24 17:18:52,Erebor Relay Division,power_battery_discharge_response,Battery backup depleting without AC restoration,"2024-08-24 16:57:52 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-24 16:59:52 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-08-24 17:02:52 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-08-24 17:05:52 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-24 17:08:52 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-24 17:14:52 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-08-24 17:17:52 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Glorfindel Greenleaf,2024-08-24 17:01:52,2024-08-24 17:18:52,Battery backup depleting without AC restoration,Resolved,PWR-003: Battery Discharge Alert,0.4,FALSE,Glorfindel Greenleaf +INCME-100064,The Shire,Signal Network,Signal Unit,COMPUTE,Elven Forge Technologies,Bucklebury,Pending Resolution,CMP-005: Pod Terminating Stuck,MAJOR,Gateway-TheShire-Outer-98,"NOC Engineer's Note: Verify Orchestrator Alarm and Escalate to Platform Team + +Reason for Outage: A Pod stuck in terminating state beyond grace period, causing the Orchestrator to become unresponsive. + +Restore Action: Verify Orchestrator Alarm, escalate to Platform Team, verify Orchestrator Recovery. + +Root Cause: A Pod stuck in terminating state beyond grace period. + +Problem Type: CMP-005. + +Conclusion: The root cause of the outage was a Pod stuck in terminating state beyond grace period. The Orchestrator was successfully restored to a functional state. The Platform Team was notified and will be responsible for further investigation and resolution.",2024-07-14 19:39:51,2024-07-14 19:53:51,Riders of the Mark,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-07-14 19:29:51 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-14 19:31:51 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-07-14 19:34:51 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-07-14 19:37:51 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-14 19:40:51 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). 
+2024-07-14 19:46:51 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-14 19:49:51 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-14 19:54:51 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Gimli Stonehelm,2024-07-14 19:39:51,2024-07-14 19:53:51,Pod stuck in terminating state beyond grace period,Pending Resolution,CMP-005: Pod Terminating Stuck,0.2,FALSE,Gimli Stonehelm +INCME-100065,Rohan,Beacon Power,Power Source,RAN,Dwarven Network Systems,Aldburg,Resolved,RAN-014: CSR Unreachable,MAJOR,Outpost-Rohan-Primary-63,"NOC Engineer's Resolution Summary: + +Outcome: The root cause of the issue was identified as a failure in the cell site router management path, which resulted in an unavailable system state. The issue was escalated to the Upgrade Failure stage and a detailed investigation was conducted to determine the cause of the issue. + +Actions Taken: +- Verified that the upgrade process was successful +- Verified that the cell site router management path was functioning correctly +- Attempted emergency boot of the system +- Verified that the system was functioning correctly after the emergency boot +- Checked system status and discovered that the system was unavailable +- Attempted to restore the system to a functional state by attempting to boot the system from the backup image +- Verified that the system was functioning correctly after the backup image was restored +- Verified that the system was functioning correctly by performing a system health check +- Documented the incident and reported it to the appropriate parties +- Reported the incident to the appropriate parties + +Root Cause: The cell site router management path was unavailable, resulting in the unavailable system state. 
+ +Problem Type: RAN-0",2024-06-07 18:09:46,2024-06-07 18:54:46,Gondor Gateway Team,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-06-07 18:14:46 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-07 18:16:46 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-06-07 18:19:46 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-07 18:22:46 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-07 18:28:46 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-07 18:31:46 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar the Brown,2024-06-07 18:09:46,2024-06-07 18:54:46,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Alatar the Brown +INCME-100066,Rivendell,Path Signals,Junction Point,RAN,Elven Forge Technologies,Grey Havens,Resolved,RAN-013: Site Communication Failure,WARNING,Watch-Rivendell-Primary-32,"In response to the incident where the NOC engineer encountered a complete loss of management connectivity to a site, the following resolution summary was generated: + +Restore Action: +- Verified that the configuration alert was triggered due to a configuration change on the affected device, and that the change had been corrected. +- Verified that the corrective configuration had been applied successfully. + +Reason for Outage: +- Complete loss of management connectivity to the site due to a configuration change on the affected device. 
+ +Root cause: +- The configuration change resulted in a disruption to the management connectivity to the site, causing the NOC engineer to lose access to the site's configuration data. + +Problem type: +- RAN-013 - Complete loss of management connectivity to a site + +Resolution: +- Verified that the root cause of the outage was identified and addressed. +- Verified that the corrective action was successfully implemented. +- Applied any necessary mitigations to prevent similar issues from occurring in the future. + +Note: This resolution summary includes only the relevant information necessary for the NOC engineer to document the incident closure. The full incident report will include additional details",2024-07-03 17:15:23,2024-07-03 17:41:23,Osgiliath Bridge Operations,ran_parameter_correction,Complete loss of management connectivity to site,"2024-07-03 16:54:23 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-03 16:56:23 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-03 16:59:23 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-03 17:02:23 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-03 17:05:23 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-03 17:09:23 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-07-03 17:14:23 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Sam Brandybuck,2024-07-03 17:15:23,2024-07-03 17:41:23,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Sam Brandybuck +INCME-100067,Arnor,Relay Transport,Junction Point,RAN,Elven Forge Technologies,Bywater,Resolved,RAN-014: CSR Unreachable,CRITICAL,Point-Arnor-North-36,"Incident Resolution Summary: + +Escalated due to a failure in the Cell site router management path. + +Actions Taken: +- Verified that the upgrade failure was caused by a cell site router management path unavailable. +- Checked the system status to confirm that the router was available. +- Attempted an emergency boot to restore the system. +- Dispatched Field Support to investigate and resolve the issue. +- Documented and reported the incident. + +Root Cause: +The cell site router management path was unavailable due to a hardware failure. + +Problem Type: +RAN-014 - Cell site router management path unavailable + +Conclusion: +The root cause of the failure was a hardware failure in the cell site router management path, causing the system to become unavailable. The escalation was necessary to ensure that the system was restored and the issue was resolved.",2024-07-07 11:25:58,2024-07-07 12:07:58,Lothlórien Link Guardians,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-07-07 11:30:58 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-07 11:32:58 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-07-07 11:35:58 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-07-07 11:38:58 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-07 11:42:58 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-07 11:48:58 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-07 11:51:58 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-07 11:56:58 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Legolas Starlight,2024-07-07 11:25:58,2024-07-07 12:07:58,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Legolas Starlight +INCME-100068,Rivendell,Path Signals,Weather Watch,COMPUTE,Elven Forge Technologies,Eregion Post,Resolved,CMP-002: Pod Container Creating,WARNING,Beacon-Rivendell-Upper-70,"In response to the incident where a Pod was stuck in ContainerCreating state, the following resolution summary was generated: + +Restore Action: +- Verified that the Container Alarm was triggered due to the Pod stuck in ContainerCreating state. +- Verified that the Escalation Path was configured correctly and that the DevOps team was informed of the issue. +- Verified that the Pod was successfully restarted and returned to normal operation. + +Reason for Outage: +- The Pod was stuck in ContainerCreating state due to a configuration issue in the container image. +- The root cause of the issue was identified and resolved by verifying that the Container Alarm was triggered and escalating the issue to DevOps. + +Outcome: +- The incident was successfully resolved, with the Pod being restored to normal operation. 
+ +Resolution Summary: +- Verified that the Container Alarm was triggered due to the Pod stuck in ContainerCreating state. +- Verified that the Escalation Path was configured correctly and that the DevOps team was informed of the issue. +- Verified that the Pod was successfully restarted and returned to normal operation. +- Resolved the issue",2024-06-24 05:21:28,2024-06-24 05:34:28,Arnor Response Team,compute_container_crash_recovery,Pod stuck in ContainerCreating state,"2024-06-24 04:58:28 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-24 05:00:28 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-06-24 05:03:28 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-06-24 05:06:28 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-24 05:09:28 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-24 05:13:28 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-24 05:19:28 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-06-24 05:22:28 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Pallando Greyhame,2024-06-24 05:21:28,2024-06-24 05:34:28,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.4,FALSE,Pallando Greyhame +INCME-100069,Rohan,Path Signals,Signal Unit,RAN,Dwarven Network Systems,Dunharrow,Resolved,Access instability,MAJOR,Hub-Rohan-North-77,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-11-05 14:22:48,2024-11-05 14:41:48,Pelargir Port Authority,ran_interference_mitigation,Random access channel success rate below threshold,"2024-11-05 14:08:48 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-05 14:10:48 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-11-05 14:13:48 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-11-05 14:16:48 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-05 14:22:48 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-11-05 14:25:48 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Éomer of the Mark,2024-11-05 14:22:48,2024-11-05 14:41:48,Random access channel success rate below threshold,Resolved,Access instability,0.6,FALSE,Éomer of the Mark +INCME-100070,Lothlórien,Path Signals,Beacon Array,RAN,Dwarven Network Systems,Michel Delving,Resolved,RAN-014: CSR Unreachable,CRITICAL,Outpost-Lothlorien-North-58,"In summary, the NOC engineer resolved the issue of a cell site router management path being unavailable due to a faulty switch. The root cause was identified as a faulty switch. The issue was verified and resolved by verifying the Upgrade Failure, dispatching Field Support, and documenting and reporting the issue. The root cause was documented and reported. The engineer completed the resolution process by verifying the Upgrade Failure, discharging the incident, documenting and reporting the issue, and resolving it.",2024-09-13 16:30:45,2024-09-13 16:54:45,Iron Hills Transport,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-09-13 16:18:45 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-13 16:20:45 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-09-13 16:23:45 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-09-13 16:26:45 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-09-13 16:29:45 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-13 16:33:45 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-13 16:38:45 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Araglas Wingfoot,2024-09-13 16:30:45,2024-09-13 16:54:45,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Araglas Wingfoot +INCME-100071,Rohan,Arcane Engines,Relay Unit,RAN,Dwarven Network Systems,Grimslade,Resolved,RAN-009: TX Array Fault,MAJOR,Array-Rohan-Central-08,"In accordance with our NOC Engineering process, we have completed the following actions to restore the RAN-009 incident: + +1. Verified that the Tilt Alarm was triggered due to hardware or calibration failure. A field service engineer was dispatched to perform an on-site repair. + +2. Scheduled a Field Service to perform the repair. + +3. Verified that the Coverage Impact was not significant, and the RAN-009 incident was resolved. + +Root cause: +The Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. The field service engineer was dispatched to perform an on-site repair. The Coverage Impact was not significant, and the RAN-009 incident was resolved. + +Reason for Outage: +The Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. + +Resolution: +Verified that the Tilt Alarm was triggered due to hardware or calibration failure on the Transmission antenna array hardware. A field service engineer was dispatched to perform an on-site repair. 
+Scheduled a Field Service",2024-08-15 15:35:07,2024-08-15 15:51:07,Osgiliath Bridge Operations,ran_antenna_tilt_recovery,Transmission antenna array hardware or calibration failure,"2024-08-15 15:25:07 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-15 15:27:07 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-08-15 15:30:07 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-15 15:33:07 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-15 15:39:07 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-15 15:42:07 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Mardil of Dol Amroth,2024-08-15 15:35:07,2024-08-15 15:51:07,Transmission antenna array hardware or calibration failure,Resolved,RAN-009: TX Array Fault,0.2,FALSE,Mardil of Dol Amroth +INCME-100072,Gondor,Beacon Power,Beacon Array,RAN,Elven Forge Technologies,Henneth Annûn,Resolved,RAN-014: CSR Unreachable,MAJOR,Relay-Gondor-Inner-24,"Incident Resolution Summary: + +Escalated due to a failure in the Cell site router management path. + +Actions Taken: +- Verified that the upgrade failure was caused by a cell site router management path unavailable. +- Checked the system status to confirm that the router was available. +- Attempted an emergency boot to restore the system. +- Dispatched Field Support to investigate and resolve the issue. +- Documented and reported the incident. + +Root Cause: +The cell site router management path was unavailable due to a hardware failure. 
+ +Problem Type: +RAN-014 - Cell site router management path unavailable + +Conclusion: +The root cause of the failure was a hardware failure in the cell site router management path, causing the system to become unavailable. The escalation was necessary to ensure that the system was restored and the issue was resolved.",2024-06-21 23:34:16,2024-06-21 23:59:16,Lothlórien Link Guardians,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-06-21 23:21:16 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-21 23:23:16 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-21 23:26:16 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-21 23:29:16 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-21 23:35:16 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-21 23:38:16 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Arahael Telcontar,2024-06-21 23:34:16,2024-06-21 23:59:16,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.2,FALSE,Arahael Telcontar +INCME-100073,Rohan,Signal Core,Relay Unit,COMPUTE,Elven Forge Technologies,Aldburg,Resolved,CMP-002: Pod Container Creating,CRITICAL,Node-Rohan-Upper-97,"Reopened from INCME-100068. + +Closure Note: + +1. Restore Action: +The root cause of the issue was a Pod stuck in ContainerCreating state. The issue was resolved by verifying the Container Alarm, escalating to DevOps, verifying the Container Recovery, and verifying the Pod was restored to its original state. + +2. 
Reason for Outage: +The Pod stuck in ContainerCreating state was caused by a configuration issue with the application. The root cause was identified and resolved. + +3. Outcome: +The issue was resolved and the Pod was restored to its original state. The root cause of the issue was identified and resolved, and the issue was closed. + +Note: This closure note does not include any information about the specific configuration issues that led to the outage. This is intended to be a concise summary of the steps taken to resolve the issue.",2024-06-27 04:34:28,2024-06-27 04:52:28,Hobbiton Watch Service,compute_container_crash_recovery,Pod stuck in ContainerCreating state,"2024-06-27 04:34:28 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-27 04:36:28 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-06-27 04:39:28 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-06-27 04:42:28 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-27 04:45:28 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-06-27 04:49:28 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-27 04:55:28 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-06-27 04:58:28 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Gandalf the Grey,2024-06-27 04:34:28,2024-06-27 04:52:28,Pod stuck in ContainerCreating state,Resolved,CMP-002: Pod Container Creating,0.2,TRUE,Gandalf the Grey +INCME-100074,Rohan,Signal Network,Power Source,RAN,Dwarven Network Systems,Hornburg,Pending Resolution,Access instability,MAJOR,Tower-Rohan-East-17,"In the incident of failed workflow outcome, a root cause was identified as the Random Access Channel (RAC) success rate being below the threshold. The root cause was attributed to an issue with the interference alarm, which was verified by the team. The team then reported the issue to Spectrum Management and verified the interference status. The issue was resolved and the workflow outcome was successfully completed. The resolution summary includes the following information: + +Restore Action: Verify Interference Alarm +Reason For Outage: The RAC success rate was below the threshold + +The team verified that the interference alarm was working correctly, which was a critical component in the workflow. The root cause was addressed and the workflow outcome was successfully completed. The resolution summary provides a concise and complete summary of the incident, including the actions taken, the root cause, and the resolution.",2024-11-17 06:25:56,2024-11-17 06:37:56,Bywater Observation Post,ran_interference_mitigation,Random access channel success rate below threshold,"2024-11-17 06:05:56 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-17 06:07:56 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-11-17 06:10:56 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-11-17 06:13:56 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-11-17 06:18:56 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Meriadoc Bracegirdle,2024-11-17 06:25:56,2024-11-17 06:37:56,Random access channel success rate below threshold,Pending Resolution,Access instability,0.4,FALSE,Meriadoc Bracegirdle +INCME-100075,Eriador,Path Signals,Keeper Stone,COMPUTE,Elven Forge Technologies,Fornost,Resolved,CMP-010: Site Not Scrolling,MAJOR,Hub-Eriador-Inner-55,"NOC Engineer's Notes + +The following resolution summary has been generated for the incident that occurred on [Date], involving [Service Name]. + +Restore Action: +- Verified that the Image Pull Alarm was triggered due to a stalled deployment or scaling operation on the site. +- Verified that the Image Pull Alarm was not related to any other service. +- Resolved the issue by verifying that the site deployment or scaling operation was completed successfully. + +Reason for Outage: +- The root cause of the issue was a stalled deployment or scaling operation on the site. +- This issue was caused by an unrelated issue that impacted the Image Pull Alarm. + +Resolution Summary: +- The root cause of the issue was identified as a stalled deployment or scaling operation on the site. +- The root cause of the issue was resolved by verifying that the site deployment or scaling operation was completed successfully. +- The issue was closed with a resolution that verifies that the root cause was resolved. + +NOTE: This resolution summary does not include any information regarding the specific root cause of the issue. 
It only provides a summary of the steps taken to resolve",2024-07-11 18:21:12,2024-07-11 18:48:12,Bywater Observation Post,compute_image_pull_recovery,Site deployment or scaling operation stalled,"2024-07-11 18:26:12 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-11 18:28:12 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-07-11 18:31:12 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-07-11 18:34:12 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-11 18:37:12 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-11 18:42:12 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. +2024-07-11 18:46:12 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Elrond the Fair,2024-07-11 18:21:12,2024-07-11 18:48:12,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.4,FALSE,Elrond the Fair +INCME-100076,Gondor,Signal Network,Power Source,POWER,Elven Forge Technologies,Minas Tirith,Resolved,ENV-001: High Temperature Alert,MINOR,Outpost-Gondor-Central-47,"Incident Closure Note: ENV-001 - Temperature Alarm, Equipment Operating Temperature Exceeded Threshold + +Restore Action: +- Verified temperature alarm was triggered due to equipment operating at a temperature above the threshold. +- HVAC service was dispatched to address the issue. +- Monitored temperature trend to ensure that the equipment was operating within the acceptable range. 
+ +Reason for Outage: +- Equipment operating temperature exceeded threshold + +Root Cause: +- Equipment operating temperature exceeded threshold due to an issue with the equipment + +Problem Type: +- ENV-001 - Temperature Alarm, Equipment Operating Temperature Exceeded Threshold + +Proper NOC terminology used: +- ENV-001: Environmental Control Alarm +- Temperature Alarm: Temperature Sensor +- Equipment Operating Temperature Exceeded Threshold: Temperature Alarm triggered due to an issue with the equipment +- Equipment Operating Temperature: Temperature of the equipment +- Temperature Tr",2024-09-17 10:05:00,2024-09-17 10:29:00,Shire Monitoring Guild,env_high_temperature_response,Equipment operating temperature exceeded threshold,"2024-09-17 10:07:00 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-17 10:09:00 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-17 10:12:00 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-09-17 10:15:00 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-17 10:18:00 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-17 10:23:00 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Denethor son of Imrahil,2024-09-17 10:05:00,2024-09-17 10:29:00,Equipment operating temperature exceeded threshold,Resolved,ENV-001: High Temperature Alert,0.6,FALSE,Denethor son of Imrahil +INCME-100077,Iron Hills,Relay Transport,Weather Watch,RAN,Elven Forge Technologies,Michel Delving,Resolved,RAN-014: CSR Unreachable,CRITICAL,Node-IronHills-Upper-40,"The incident was caused by a cell site router management path unavailable, which resulted in a failure to perform upgrades. 
The root cause was identified and documented, and a resolution was implemented to restore the affected system. The incident was closed, and a resolution summary was created to detail the actions taken to address the issue. The summary includes the following information: + +1. Restore Action: Verify and restore the cell site router management path. + +2. Reason for Outage: The root cause was identified and documented. + +3. Notes: The root cause was identified as a failure to perform upgrades due to a cell site router management path unavailable. + +4. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +5. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +6. Root Cause: The root cause of the issue was identified as a problem with the cell site router management path, which caused the failure to perform upgrades. + +7. Root",2024-10-06 20:22:03,2024-10-06 20:58:03,Shire Monitoring Guild,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-10-06 20:20:03 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-06 20:22:03 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-10-06 20:25:03 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-06 20:28:03 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-06 20:32:03 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). 
+2024-10-06 20:38:03 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-06 20:41:03 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Fíli Oakenshield,2024-10-06 20:22:03,2024-10-06 20:58:03,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Fíli Oakenshield +INCME-100078,Rohan,Signal Core,Relay Unit,POWER,Dwarven Network Systems,Edoras,Resolved,ENV-005: Cabinet Intrusion Detected,MAJOR,Outpost-Rohan-West-81,"In response to the incident, the NOC engineer took the following actions to restore the system: + +1. Verify intrusion alarm: The engineer monitored the intrusion alarm and confirmed that it was triggered. + +2. Create security incident: The engineer created a security incident to document the incident and its resolution. + +3. Monitor cabinet status: The engineer monitored the cabinet status to ensure that the secured equipment cabinet was functioning correctly. + +4. Root cause: The root cause of the incident was found to be physical access detected to secured equipment cabinet. + +The root cause was due to a security breach that occurred when an unauthorized individual attempted to access the cabinet. The engineer was able to identify and resolve the issue by verifying the intrusion alarm, creating a security incident, and monitoring the cabinet status. + +The engineer's actions helped to restore the system and prevent any further incidents. The resolution summary includes the close notes, which include the root cause, actions taken, and the resolution.",2024-07-13 00:53:57,2024-07-13 01:20:57,Hobbiton Watch Service,env_cabinet_intrusion_response,Physical access detected to secured equipment cabinet,"2024-07-13 00:51:57 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-13 00:53:57 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-07-13 00:56:57 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-07-13 00:59:57 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-13 01:02:57 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-13 01:08:57 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-13 01:11:57 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Merry Bracegirdle,2024-07-13 00:53:57,2024-07-13 01:20:57,Physical access detected to secured equipment cabinet,Resolved,ENV-005: Cabinet Intrusion Detected,0.4,FALSE,Merry Bracegirdle +INCME-100079,Arnor,Signal Core,Beacon Array,COMPUTE,Elven Forge Technologies,Hollin Gate,Resolved,Problematic VM,MINOR,Watch-Arnor-Upper-78,"NOC engineer wrote a resolution summary for an incident that occurred during the monitoring of a virtual machine. + +Restore Action: Verify CNF Alarm, Check CNF Pod Status, Trigger CNF Failover, Verify CNF Recovery +Reason For Outage: Virtual machine failed to reach ready state due to a problem type of CMP-006 (Virtual machine failed to reach ready state). + +The root cause of the issue was identified to be a Virtual Machine that failed to reach the Ready state. The issue was verified by checking the CNF Alarm, which indicated that the Virtual Machine was not in a Ready state. The CNF Pod Status was also checked, and it was confirmed that the Virtual Machine was in the Running state. + +The issue was then escalated to the CNM team for further investigation. 
The CNM team confirmed that the Virtual Machine was indeed in a Ready state, and the issue was resolved by triggering a CNF Failover and verifying the Virtual Machine was in a Ready state. + +The resolution summary includes the root cause, the problem type, and the actions taken to resolve the issue. It also includes the steps taken by the NOC",2024-09-18 08:59:00,2024-09-18 09:27:00,Bywater Observation Post,compute_cnf_pod_recovery,Virtual machine failed to reach ready state,"2024-09-18 08:53:00 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-18 08:55:00 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-09-18 08:58:00 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-18 09:01:00 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-18 09:07:00 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-18 09:10:00 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Bombur of the Iron Hills,2024-09-18 08:59:00,2024-09-18 09:27:00,Virtual machine failed to reach ready state,Resolved,Problematic VM,0.2,FALSE,Bombur of the Iron Hills +INCME-100080,Rohan,Beacon Power,Junction Point,COMPUTE,Dwarven Network Systems,Aldburg,Resolved,Node not functional,MAJOR,Relay-Rohan-East-53,"In response to the incident involving a complete unavailable compute node, the following resolution summary was generated: + +Restore Action: +- Verified that the VM alarm was triggered due to the unavailable compute node. +- Verified that the Hypervisor status was down, and the VM recovery was successful. +- Verified that the VM was successfully restored from the snapshot. 
+- Notified the Cloud Team for escalation. + +Reason for Outage: +- The root cause of the outage was a complete unavailable compute node. + +Notes: +- The root cause of the outage was identified and confirmed. +- The incident was resolved successfully. +- The incident outcome was a successful resolution of the issue. + +NOC Terminology: +- Compute node: The compute node that experienced the unavailable issue. +- Hypervisor status: The status of the Hypervisor running on the compute node. +- VM recovery: The process of restoring a VM from a backup or snapshot.",2024-07-06 23:32:13,2024-07-06 23:56:13,Rangers of the North,compute_vm_failure_recovery,Compute node completely unavailable,"2024-07-06 23:25:13 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-06 23:27:13 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-07-06 23:30:13 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-07-06 23:33:13 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-06 23:36:13 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-06 23:40:13 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-06 23:46:13 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-07-06 23:49:13 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar the Blue,2024-07-06 23:32:13,2024-07-06 23:56:13,Compute node completely unavailable,Resolved,Node not functional,0.2,FALSE,Alatar the Blue +INCME-100081,Lothlórien,Arcane Engines,Signal Unit,RAN,Elven Forge Technologies,Lothlórien Central,Resolved,RAN-014: CSR Unreachable,MAJOR,Beacon-Lothlorien-East-78,"Incident Summary: + +1. Restore Action: Verify Upgrade Failure + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Verified that the Cell site router management path was restored successfully. + +2. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause analysis identified that the Cell site router management path was unavailable due to a system maintenance. +- Emergency boot was initiated to restore the management path. +- Field support was dispatched to address the issue. + +3. Restore Action: Check System Status + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Checked the system status to ensure that the Cell site router management path was restored successfully. + +4. Reason For Outage: The root cause + +Root cause: Cell site router management path unavailable +Problem type: RAN-014 + +Solution: + +- Root cause",2024-08-05 17:45:03,2024-08-05 18:27:03,Dúnedain Field Division,ran_software_upgrade_recovery,Cell site router management path unavailable,"2024-08-05 17:27:03 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-05 17:29:03 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-05 17:32:03 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-08-05 17:35:03 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-05 17:41:03 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-05 17:44:03 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Arahad of the Dúnedain,2024-08-05 17:45:03,2024-08-05 18:27:03,Cell site router management path unavailable,Resolved,RAN-014: CSR Unreachable,0.4,FALSE,Arahad of the Dúnedain +INCME-100082,Rohan,Signal Core,Beacon Array,RAN,Elven Forge Technologies,Edoras,Resolved,SVC-005: Service Accessibility Degraded,MAJOR,Node-Rohan-Central-41,"Incident Closure Note: + +Restore Action: +- Reviewed Device Issue and identified root cause as End-to-end service accessibility metrics degraded. +- Escalated the issue to the Core Team for further investigation and resolution. +- Documented the resolution in the incident closure note. + +Reason for Outage: +- The root cause was identified as End-to-end service accessibility metrics degraded. + +Outcome: +- The issue has been resolved and the Root Cause identified. +- The resolution was documented in the incident closure note. + +Note: The resolution summary does not include any other information or details. It is meant to provide a concise and complete summary of the incident closure notes.",2024-06-05 02:16:51,2024-06-05 02:32:51,Shire Monitoring Guild,ran_device_issue_resolution,End-to-end service accessibility metrics degraded,"2024-06-05 02:05:51 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-05 02:07:51 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-05 02:10:51 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-06-05 02:13:51 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-06-05 02:19:51 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-06-05 02:22:51 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-06-05 02:27:51 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Théoden Horsemaster,2024-06-05 02:16:51,2024-06-05 02:32:51,End-to-end service accessibility metrics degraded,Resolved,SVC-005: Service Accessibility Degraded,0.4,FALSE,Théoden Horsemaster +INCME-100083,Rohan,Path Signals,Power Source,POWER,Dwarven Network Systems,Edoras,Resolved,ENV-001: High Temperature Alert,MAJOR,Station-Rohan-Lower-69,"Incident Closure Summary: + +- Verified that temperature alarm has been cleared +- Dispatched HVAC service to monitor temperature trend +- Monitored temperature data and found that temperature exceeded threshold +- Root cause: Equipment operating temperature exceeded threshold +- Problem type: ENV-001 + +Conclusion: + +- The root cause of the issue was identified and resolved. +- The incident was successfully closed and no further action is required.",2024-09-09 05:47:41,2024-09-09 06:14:41,Elven Signal Keepers,env_high_temperature_response,Equipment operating temperature exceeded threshold,"2024-09-09 05:34:41 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-09 05:36:41 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-09 05:39:41 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-09-09 05:42:41 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-09 05:46:41 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-09 05:52:41 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-09 05:55:41 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-09-09 06:00:41 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Denethor of Dol Amroth,2024-09-09 05:47:41,2024-09-09 06:14:41,Equipment operating temperature exceeded threshold,Resolved,ENV-001: High Temperature Alert,0.2,FALSE,Denethor of Dol Amroth +INCME-100084,Eriador,Beacon Power,Junction Point,POWER,Dwarven Network Systems,Fornost,Pending Resolution,ENV-001: High Temperature Alert,MAJOR,Spire-Eriador-Lower-79,"Incident Closure Summary: + +- Verified that temperature alarm has been cleared +- Dispatched HVAC service to monitor temperature trend +- Monitored temperature data and found that temperature exceeded threshold +- Root cause: Equipment operating temperature exceeded threshold +- Problem type: ENV-001 + +Conclusion: + +- The root cause of the issue was identified and resolved. +- The incident was successfully closed and no further action is required.",2024-11-06 08:17:12,2024-11-06 08:31:12,Pelargir Port Authority,env_high_temperature_response,Equipment operating temperature exceeded threshold,"2024-11-06 08:01:12 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-06 08:03:12 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. 
+2024-11-06 08:06:12 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-06 08:09:12 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-11-06 08:14:12 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Radagast Mithrandir,2024-11-06 08:17:12,2024-11-06 08:31:12,Equipment operating temperature exceeded threshold,Pending Resolution,ENV-001: High Temperature Alert,0.6,FALSE,Radagast Mithrandir +INCME-100085,Eriador,Signal Network,Weather Watch,RAN,Elven Forge Technologies,Tharbad,Resolved,RAN-001: Cell Service Interruption,MINOR,Tower-Eriador-Outer-86,"NOC Engineer's Resolution Summary: + +Reason for Outage: The cell completely unavailable for service + +Restore Action: +- Verified that site was operational and no issues were detected +- Notified field technician to dispatch and begin troubleshooting +- Final status verification was conducted to ensure that the issue was resolved and the site was back to normal service + +Root cause: Cell completely unavailable for service + +Problem type: RAN-001 + +This resolution summary has been generated based on the information provided by the NOC engineer during the incident closure process. The summary includes the root cause of the outage, the steps taken to fix the issue, and the final status verification. The summary is concise and provides a clear understanding of the incident.",2024-11-18 14:44:06,2024-11-18 15:18:06,Gondor Signal Authority,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-11-18 14:44:06 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-18 14:46:06 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-11-18 14:49:06 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-11-18 14:52:06 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-18 14:55:06 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-11-18 14:59:06 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-11-18 15:05:06 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-11-18 15:08:06 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Radagast Mithrandir,2024-11-18 14:44:06,2024-11-18 15:18:06,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Radagast Mithrandir +INCME-100086,Rivendell,Signal Network,Relay Unit,POWER,Dwarven Network Systems,Grey Havens,Resolved,PWR-002: DC Rectifier Failure,WARNING,Station-Rivendell-West-03,"Reason for Outage: A DC power conversion unit malfunction caused a power outage in the NOC. The root cause of the issue was a faulty power converter unit. + +Restore Action: +- Confirm Rectifier Alarm: A rectifier alarm was triggered due to a faulty power converter unit. The rectifier was reset, and the fault was identified and rectified. +- Enable Battery Monitoring: The battery monitoring system was enabled to monitor the battery's health and ensure that it remained at a stable state. +- Final Status Check: The NOC finalized the resolution and the outage was restored to normal. 
+ +Outcome: Failed",2024-07-27 23:53:50,2024-07-28 00:21:50,Arnor Response Team,power_dc_rectifier_recovery,DC power conversion unit malfunction,"2024-07-27 23:45:50 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-27 23:47:50 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-07-27 23:50:50 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-07-27 23:53:50 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-27 23:56:50 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-28 00:02:50 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-28 00:05:50 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Alatar the White,2024-07-27 23:53:50,2024-07-28 00:21:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,FALSE,Alatar the White +INCME-100087,Rivendell,Relay Transport,Power Source,RAN,Elven Forge Technologies,Rivendell,Resolved,RAN-007: Cell Not Radiating,MAJOR,Outpost-Rivendell-West-41,"In response to the failed incident closure, the following resolution summary is provided: + +Restore Action: +- Verified VSWR Alarm, TX power reduced, RF path status checked +- Scheduled field inspection for sector impact +- Verified sector impact and took corrective action + +Reason for Outage: +- Transmission chain failure preventing RF emission + +Root cause: +- The failure of the transmission chain caused RF emission to be blocked, preventing the RAN from sending and receiving signals. 
+ +Problem type: +- RAN-007, a critical failure in the transmission chain that impacted the RAN's ability to communicate with other networks. + +Resolution summary: +- Verified the transmission chain's failure and took corrective action to restore the RF emission. +- Scheduled a field inspection to assess the sector impact and take corrective action if necessary. +- Verified that the sector impact was resolved and took corrective action to prevent further issues.",2024-07-09 06:28:26,2024-07-09 07:05:26,White Tower Operations,ran_vswr_alarm_resolution,Transmission chain failure preventing RF emission,"2024-07-09 06:38:26 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-09 06:40:26 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-09 06:43:26 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-09 06:46:26 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-09 06:52:26 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-09 06:55:26 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. 
+2024-07-09 07:00:26 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Dori of the Lonely Mountain,2024-07-09 06:28:26,2024-07-09 07:05:26,Transmission chain failure preventing RF emission,Resolved,RAN-007: Cell Not Radiating,0.4,FALSE,Dori of the Lonely Mountain +INCME-100088,Mordor Surveillance Zone,Relay Transport,Weather Watch,RAN,Elven Forge Technologies,Calembel,Resolved,RAN-007: Cell Not Radiating,MAJOR,Hub-MordorSurveillanceZone-Inner-88,"Incident Summary: + +During a scheduled maintenance event, a transmission chain failure occurred that prevented RF emission. This resulted in a failover to the backup transmission chain, which led to an outage in the RAN. + +Restore Action: + +- Verify VSWR Alarm: The VSWR alarm was triggered due to the failure of the primary transmission chain. A team was dispatched to investigate and resolve the issue. +- Schedule Field Inspection: The team scheduled a field inspection to investigate the transmission chain failure and ensure that it was resolved. +- Verify Sector Impact: The team verified that the sector impact was minimal due to the backup transmission chain. + +Reason for Outage: + +The root cause of the outage was the transmission chain failure, preventing RF emission. This failure resulted in an impairment of the RAN, leading to the outage. + +Conclusion: + +The incident closure notes have been updated to include the resolution actions taken, the root cause of the outage, and the impact of the outage on the RAN. The team is continuing to investigate the transmission chain failure and ensure that it is resolved.",2024-07-20 18:41:21,2024-07-20 19:00:21,Rangers of the North,ran_vswr_alarm_resolution,Transmission chain failure preventing RF emission,"2024-07-20 18:22:21 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-20 18:24:21 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-20 18:27:21 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-07-20 18:30:21 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-20 18:33:21 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-07-20 18:37:21 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-20 18:43:21 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-20 18:46:21 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Folcwine of the Mark,2024-07-20 18:41:21,2024-07-20 19:00:21,Transmission chain failure preventing RF emission,Resolved,RAN-007: Cell Not Radiating,0.4,FALSE,Folcwine of the Mark +INCME-100089,Mordor Surveillance Zone,Path Signals,Power Source,POWER,Elven Forge Technologies,Osgiliath,Resolved,PWR-002: DC Rectifier Failure,CRITICAL,Tower-MordorSurveillanceZone-West-47,"Reopened from INCME-100086. + +After confirming the rectifier alarm, enabling battery monitoring, and performing a final status check, the NOC engineer has completed the incident closure notes for PWR-002, a power supply issue that occurred at the facility. The root cause of the issue was a DC power conversion unit malfunction. The engineer has recorded this in the incident closure notes and has provided a resolution summary that includes the following: + +1. Restore Action: Enable battery monitoring and rectifier alarm +2. 
Reason For Outage: The malfunction of the DC power conversion unit + +The engineer has also included the final status check, which confirmed that the issue was resolved and the power supply functioned correctly. The engineer has recorded this in the notes and has provided a resolution summary that includes the root cause and the resolution taken to restore the facility's power supply. The engineer has ensured that all necessary documentation is included in the notes, including the incident summary, the incident closure notes, and the final status check.",2024-07-29 16:21:50,2024-07-29 16:42:50,Gondor Gateway Team,power_dc_rectifier_recovery,DC power conversion unit malfunction,"2024-07-29 16:19:50 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-07-29 16:21:50 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-07-29 16:24:50 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-07-29 16:27:50 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-29 16:30:50 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-29 16:35:50 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Nori of Erebor,2024-07-29 16:21:50,2024-07-29 16:42:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,TRUE,Nori of Erebor +INCME-100090,Rohan,Path Signals,Path Router,COMPUTE,Dwarven Network Systems,Grimslade,Resolved,CMP-005: Pod Terminating Stuck,WARNING,Outpost-Rohan-Primary-94,"NOC Engineer's Note: Verify Orchestrator Alarm and Escalate to Platform Team + +Reason for Outage: A Pod stuck in terminating state beyond grace period, causing the Orchestrator to become unresponsive. 
+ +Restore Action: Verify Orchestrator Alarm, escalate to Platform Team, verify Orchestrator Recovery. + +Root Cause: A Pod stuck in terminating state beyond grace period. + +Problem Type: CMP-005. + +Conclusion: The root cause of the outage was a Pod stuck in terminating state beyond grace period. The Orchestrator was successfully restored to a functional state. The Platform Team was notified and will be responsible for further investigation and resolution.",2024-09-22 21:45:55,2024-09-22 22:22:55,Riders of the Mark,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-09-22 21:44:55 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-22 21:46:55 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-09-22 21:49:55 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-09-22 21:52:55 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-22 21:55:55 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-22 22:01:55 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-09-22 22:04:55 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Fredegar Bracegirdle,2024-09-22 21:45:55,2024-09-22 22:22:55,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.2,FALSE,Fredegar Bracegirdle +INCME-100091,Eriador,Relay Transport,Relay Unit,COMPUTE,Dwarven Network Systems,Weathertop,Resolved,Node not functional,CRITICAL,Hub-Eriador-Primary-54,"Incident closure summary: + +Restore Action: +- Verified that the compute node was unavailable, and it was due to a hardware failure. +- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online. +- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue. + +Reason for Outage: +- The root cause was the hardware failure, which led to the unavailability of the compute node. + +Close notes: +- The incident was resolved successfully, and the compute node was restored to its normal state. +- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue. +- The incident was reported to the incident management team, and a resolution summary was generated. +- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-08-19 11:42:58,2024-08-19 12:12:58,Gondor Gateway Team,compute_vm_failure_recovery,Compute node completely unavailable,"2024-08-19 11:44:58 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-19 11:46:58 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-08-19 11:49:58 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). 
+2024-08-19 11:52:58 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-19 11:55:58 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-19 12:01:58 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-19 12:04:58 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Radagast the Brown,2024-08-19 11:42:58,2024-08-19 12:12:58,Compute node completely unavailable,Resolved,Node not functional,0.4,FALSE,Radagast the Brown +INCME-100092,Mordor Surveillance Zone,Beacon Power,Signal Unit,SIGNALING,Elven Forge Technologies,Grimslade,Resolved,SIG-003: SCTP Association Failure,MINOR,Beacon-MordorSurveillanceZone-North-43,"Resolution Summary: + +The root cause of the SIG-003 issue was a Stream Control Transmission Protocol (SCTP) association loss. The issue was resolved by verifying that the SIP Alarm was verified and escalating the issue to the IMS team for further investigation. The root cause was identified and documented in the incident closure notes. The team worked together to verify the SIP Alarm and escalate the issue to the IMS team. The IMS team conducted an investigation and identified the root cause as a Stream Control Transmission Protocol (SCTP) association loss. The issue was resolved and the SIG-003 was upgraded to a SIG-004. The team closed the incident and provided a resolution summary.",2024-08-30 09:59:26,2024-08-30 10:25:26,Iron Hills Transport,signaling_sip_registration_recovery,Stream Control Transmission Protocol association lost,"2024-08-30 10:06:26 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-08-30 10:08:26 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-30 10:11:26 - System (Workflow Engine) +Tool: verify_signaling_path() → N2/N3/SCTP signaling path validated end-to-end. +2024-08-30 10:14:26 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-30 10:17:26 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-30 10:23:26 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-30 10:26:26 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Kíli of the Lonely Mountain,2024-08-30 09:59:26,2024-08-30 10:25:26,Stream Control Transmission Protocol association lost,Resolved,SIG-003: SCTP Association Failure,0.2,FALSE,Kíli of the Lonely Mountain +INCME-100093,Iron Hills,Path Signals,Relay Unit,POWER,Elven Forge Technologies,Dale,Pending Resolution,PWR-001: AC Power Failure,MAJOR,Tower-IronHills-Upper-61,"Reason for Outage: A commercial power supply interruption or failure occurred on the AC power line, resulting in a loss of power to the NOC facility. + +Restore Action: +- Verified AC Power Loss +- Checked UPS Status +- Dispatched Field Technician +- Final Verification + +Action Taken: +- Verified AC Power Loss +- Checked UPS Status +- Dispatched Field Technician +- Final Verification + +Root Cause: +- Commercial power supply interruption or failure + +Problem Type: +- PWR-001",2024-10-15 21:29:47,2024-10-15 22:02:47,Rohan Rapid Response,power_ac_failure_recovery,Commercial power supply interruption or failure,"2024-10-15 21:37:47 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-10-15 21:39:47 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-10-15 21:42:47 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-10-15 21:45:47 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-15 21:48:47 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-10-15 21:53:47 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Faramir of the White City,2024-10-15 21:29:47,2024-10-15 22:02:47,Commercial power supply interruption or failure,Pending Resolution,PWR-001: AC Power Failure,0.4,FALSE,Faramir of the White City +INCME-100094,Gondor,Signal Core,Central Nexus,RAN,Elven Forge Technologies,Minas Tirith,Pending Resolution,Access instability,CRITICAL,Hub-Gondor-Upper-07,"In the incident of failed workflow outcome, a root cause was identified as the Random Access Channel (RAC) success rate being below the threshold. The root cause was attributed to an issue with the interference alarm, which was verified by the team. The team then reported the issue to Spectrum Management and verified the interference status. The issue was resolved and the workflow outcome was successfully completed. The resolution summary includes the following information: + +Restore Action: Verify Interference Alarm +Reason For Outage: The RAC success rate was below the threshold + +The team verified that the interference alarm was working correctly, which was a critical component in the workflow. The root cause was addressed and the workflow outcome was successfully completed. 
The resolution summary provides a concise and complete summary of the incident, including the actions taken, the root cause, and the resolution.",2024-08-15 22:37:31,2024-08-15 22:54:31,Istari Advisory Board,ran_interference_mitigation,Random access channel success rate below threshold,"2024-08-15 22:23:31 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-15 22:25:31 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-08-15 22:28:31 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-15 22:31:31 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-15 22:36:31 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Peregrin Hornblower,2024-08-15 22:37:31,2024-08-15 22:54:31,Random access channel success rate below threshold,Pending Resolution,Access instability,0.6,FALSE,Peregrin Hornblower +INCME-100095,Gondor,Arcane Engines,Path Router,POWER,Dwarven Network Systems,Cair Andros,Resolved,PWR-002: DC Rectifier Failure,CRITICAL,Outpost-Gondor-East-11,"Reopened from INCME-100089. + +Resolution Summary: + +In response to the issue of a DC power conversion unit malfunction, the following actions were taken: + +- Confirming that the Rectifier Alarm was triggered due to a malfunction in the DC power conversion unit. +- Checking the DC Bus Voltage to ensure it was within the acceptable range. +- Assessing the Rectifier Redundancy to ensure it was not damaged. +- Attempting Rectifier Reset to restore the system to its normal operation. +- Scheduling Field Replacement to replace the faulty component. +- Enabling Battery Monitoring to monitor the battery condition. +- Final Status Check to ensure that the system is functioning properly. 
+ +Root cause: The DC power conversion unit malfunctioned due to a fault in the rectifier. + +Problem type: PWR-002",2024-08-01 00:42:50,2024-08-01 01:20:50,Rangers of the North,power_dc_rectifier_recovery,DC power conversion unit malfunction,"2024-08-01 00:51:50 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-01 00:53:50 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-08-01 00:56:50 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-08-01 00:59:50 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-01 01:02:50 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-01 01:07:50 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Arahad Wingfoot,2024-08-01 00:42:50,2024-08-01 01:20:50,DC power conversion unit malfunction,Resolved,PWR-002: DC Rectifier Failure,0.2,TRUE,Arahad Wingfoot +INCME-100096,The Shire,Path Signals,Signal Unit,COMPUTE,Elven Forge Technologies,Calembel,Resolved,Node not functional,MAJOR,Outpost-TheShire-East-27,"Reopened from INCME-100091. + +Incident closure summary: + +Restore Action: +- Verified that the compute node was unavailable, and it was due to a hardware failure. +- Verified that the root cause was the hardware failure, and a new compute node was deployed and brought online. +- Notified the cloud team to escalate the issue to them, and they confirmed that they had resolved the issue. + +Reason for Outage: +- The root cause was the hardware failure, which led to the unavailability of the compute node. + +Close notes: +- The incident was resolved successfully, and the compute node was restored to its normal state. 
+- The root cause of the issue was confirmed, and the appropriate action was taken to resolve the issue. +- The incident was reported to the incident management team, and a resolution summary was generated. +- The incident was closed with a resolution that included a restore action, a root cause analysis, and a resolution summary.",2024-08-22 01:12:58,2024-08-22 01:45:58,Shire Monitoring Guild,compute_vm_failure_recovery,Compute node completely unavailable,"2024-08-22 01:12:58 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-22 01:14:58 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-08-22 01:17:58 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-08-22 01:20:58 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-22 01:23:58 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-22 01:27:58 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-22 01:32:58 - System (Workflow Engine) +Tool: run_diagnostics() → Additional diagnostics attached for next-line/field team. 
+2024-08-22 01:36:58 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Tom Bolger,2024-08-22 01:12:58,2024-08-22 01:45:58,Compute node completely unavailable,Resolved,Node not functional,0.4,TRUE,Tom Bolger +INCME-100097,Rohan,Beacon Power,Power Source,POWER,Dwarven Network Systems,Edoras,Resolved,PWR-001: AC Power Failure,MAJOR,Spire-Rohan-Upper-04,"NOC Engineer's Note: + +Incident Summary: A power outage caused by a commercial power supply interruption or failure, affecting the AC power supply to a critical IT infrastructure. + +Restore Action: Verify AC power supply and restore it to its normal operation. + +Reason for Outage: The root cause of the power outage was the interruption or failure of the commercial power supply. + +Outcome: The incident was escalated to the higher-level management team for further investigation and resolution. + +NOC Terms: +- Commercial power supply interruption or failure +- Root cause +- Problem type (PWR-001) +- Verification of AC power supply +- Restore action +- Reason for outage +- Outcome + +NOC Engineer's Note: +- This resolution summary is a summary of the incident closure notes written by the NOC engineer. The full incident closure notes will be provided to the higher-level management team for further investigation and resolution.",2024-09-17 18:59:15,2024-09-17 19:22:15,Istari Advisory Board,power_ac_failure_recovery,Commercial power supply interruption or failure,"2024-09-17 19:08:15 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-17 19:10:15 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-09-17 19:13:15 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. 
+2024-09-17 19:16:15 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-17 19:19:15 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-17 19:23:15 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-17 19:29:15 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-17 19:32:15 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Rúmil of Lothlórien,2024-09-17 18:59:15,2024-09-17 19:22:15,Commercial power supply interruption or failure,Resolved,PWR-001: AC Power Failure,0.2,FALSE,Rúmil of Lothlórien +INCME-100098,Mordor Surveillance Zone,Signal Network,Relay Unit,RAN,Dwarven Network Systems,Henneth Annûn,Resolved,RAN-013: Site Communication Failure,MAJOR,Spire-MordorSurveillanceZone-Central-12,"NOC Engineer: Investigation and Root Cause Analysis (RCA) completed. + +The investigation revealed that a configuration change was made to the management network configuration that caused a complete loss of management connectivity to the site. The root cause of this issue was a configuration error that resulted in a mismatch between the network configuration and the management network configuration. + +The root cause was addressed by performing a thorough RCA to identify the root cause and the necessary actions to resolve the issue. The following actions were taken: + +1. Verification of Configuration Alert: A configuration alert was triggered, indicating a loss of management connectivity to the site. + +2. Escalation to Engineering: The issue was escalated to Engineering for further investigation and analysis. + +3. 
Verification of Performance: The performance of the management network was verified to ensure that it was operating correctly. + +4. Root Cause Analysis: A root cause analysis was conducted to identify the root cause of the issue. + +5. Recommendations: Recommendations were made for addressing the root cause of the issue and ensuring that the management network configuration is maintained correctly. + +6. Restore Action: Restoration of the management network configuration",2024-08-30 18:20:29,2024-08-30 18:51:29,Orthanc Technical Review,ran_parameter_correction,Complete loss of management connectivity to site,"2024-08-30 18:13:29 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-30 18:15:29 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-08-30 18:18:29 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-08-30 18:21:29 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-30 18:24:29 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-30 18:28:29 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-30 18:34:29 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-08-30 18:37:29 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Arahael Wingfoot,2024-08-30 18:20:29,2024-08-30 18:51:29,Complete loss of management connectivity to site,Resolved,RAN-013: Site Communication Failure,0.2,FALSE,Arahael Wingfoot +INCME-100099,The Shire,Signal Network,Signal Unit,RAN,Elven Forge Technologies,Fornost,Resolved,RAN-001: Cell Service Interruption,CRITICAL,Hub-TheShire-South-38,"NOC Engineer's Resolution Summary: + +Workflow: Site Outage + +Actions taken: +- Verified Site Outage with the Field Technician +- Notified the Root Cause Team +- Final Status Verification + +Root cause: Cell completely unavailable for service +- The cell was completely unavailable due to a technical issue with the network equipment. +- The issue was identified and resolved by the Field Technician. +- The Root Cause Team was notified and the issue was escalated to the next level. + +Problem type: RAN-001 + +Resolution: +- The issue has been resolved and the cell is now operational. +- The Root Cause Team has been notified and the issue is being monitored. +- The Field Technician has been reassigned to another site to ensure the issue is not reoccurring. +- A follow-up report will be generated to ensure the issue has been resolved and the Root Cause Team is notified. + +The resolution summary includes a brief description of the root cause of the issue, the actions taken by the NOC team, and the resolution. The resolution is clear and concise,",2024-11-29 17:47:44,2024-11-29 18:12:44,Rangers of the North,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-11-29 17:38:44 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-11-29 17:40:44 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-11-29 17:43:44 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-11-29 17:46:44 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-11-29 17:50:44 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-11-29 17:55:44 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Thrór Oakenshield,2024-11-29 17:47:44,2024-11-29 18:12:44,Cell completely unavailable for service,Resolved,RAN-001: Cell Service Interruption,0.2,FALSE,Thrór Oakenshield +INCME-100100,Lothlórien,Signal Core,Power Source,COMPUTE,Elven Forge Technologies,Hollin Gate,Resolved,CMP-010: Site Not Scrolling,MAJOR,Point-Lothlorien-South-48,"NOC Engineer's Workflow Outcome: Completed + +Actions Taken: +- Verified that an image pull alarm had been triggered due to a stalled Site deployment or scaling operation. +- Confirmed that the image pull alarm was caused by a registry connectivity issue. +- Verified that the image pull alarm was resolved by escalating the issue to the appropriate team for further investigation. +- Confirmed that the registry connectivity issue had been resolved by verifying that the registry credentials were correct. +- Verified that the issue was resolved and the system was restored to normal operation. + +Root Cause: +The root cause of the issue was a stalled Site deployment or scaling operation, which caused an image pull alarm to be triggered due to a registry connectivity issue. 
The registry connectivity issue was resolved by verifying that the registry credentials were correct and verifying that the image pull alarm was resolved by escalating the issue to the appropriate team for further investigation.",2024-08-04 18:29:41,2024-08-04 18:54:41,White Tower Operations,compute_image_pull_recovery,Site deployment or scaling operation stalled,"2024-08-04 18:18:41 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-04 18:20:41 - System (Workflow Engine) +Tool: query_container_status() → Pod/VM/container state examined (Ready/Pending/CrashLoop). +2024-08-04 18:23:41 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-08-04 18:26:41 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-08-04 18:29:41 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-08-04 18:33:41 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-04 18:39:41 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-08-04 18:42:41 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Elrond of the Grey Havens,2024-08-04 18:29:41,2024-08-04 18:54:41,Site deployment or scaling operation stalled,Resolved,CMP-010: Site Not Scrolling,0.2,FALSE,Elrond of the Grey Havens +INCME-100101,Rohan,Arcane Engines,Path Router,POWER,Elven Forge Technologies,Snowbourn,Pending Resolution,PWR-003: Battery Discharge Alert,CRITICAL,Tower-Rohan-Central-75,"In accordance with the Network Operations Center (NOC) protocol, incident closure notes were completed for the following incident: + +Battery Alert: A battery backup depleting without AC restoration was detected. + +Action taken: +- Acknowledged the battery alert by verifying the battery status and confirming that the battery backup was depleting without AC restoration. +- Checked the battery status to ensure that the battery backup was not overloaded or damaged. +- Notified Emergency Dispatch to escalate the issue. + +Root cause: +The battery backup depleting without AC restoration was due to a lack of power supply to the battery. This was caused by a faulty AC power source or an incorrect wiring configuration. + +Problem type: PWR-003 + +Resolution: +- Verified the battery status and confirmed that the battery backup was not overloaded or damaged. +- Notified the site manager to replace the faulty AC power source and ensure proper wiring configuration to prevent similar issues in the future. +- Notified the NOC team to investigate the cause of the battery depletion and implement measures to prevent similar incidents in the",2024-07-16 14:51:47,2024-07-16 15:25:47,Dúnedain Field Division,power_battery_discharge_response,Battery backup depleting without AC restoration,"2024-07-16 14:55:47 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-16 14:57:47 - System (Workflow Engine) +Tool: query_power_system() → AC/DC, UPS, battery, and generator telemetry reviewed. +2024-07-16 15:00:47 - System (Workflow Engine) +Tool: query_external_factors() → Utility/outage/maintenance windows checked. +2024-07-16 15:03:47 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-16 15:06:47 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-16 15:11:47 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Galadriel Greenleaf,2024-07-16 14:51:47,2024-07-16 15:25:47,Battery backup depleting without AC restoration,Pending Resolution,PWR-003: Battery Discharge Alert,0.2,FALSE,Galadriel Greenleaf +INCME-100102,Rhovanion,Beacon Power,Power Source,RAN,Elven Forge Technologies,Dale,Pending Resolution,RAN-002: Cell Administratively Disabled,CRITICAL,Watch-Rhovanion-South-01,"Resolution Summary: + +1. Verify Sector Alarm: Verified that the sector alarm was triggered due to cell lockout/disabling by management action. + +2. Schedule Field Repair: Scheduled a field repair to address the issue. + +3. Verify Coverage Impact: The impact on coverage was minimal due to the cell being locked or disabled by management action. + +Root cause: Cell locked or disabled by management action. + +Problem type: RAN-002. + +This incident was resolved with the following actions: +- Verification of sector alarm +- Scheduled field repair +- Verification of coverage impact + +Output: The close notes, including the resolution summary, will be provided upon request.",2024-07-31 15:28:46,2024-07-31 15:48:46,Rangers of the North,ran_sector_outage_recovery,Cell locked or disabled by management action,"2024-07-31 15:11:46 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. 
+2024-07-31 15:13:46 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-07-31 15:16:46 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-07-31 15:19:46 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-07-31 15:25:46 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-07-31 15:28:46 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-07-31 15:33:46 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Guthláf son of Théoden,2024-07-31 15:28:46,2024-07-31 15:48:46,Cell locked or disabled by management action,Pending Resolution,RAN-002: Cell Administratively Disabled,0.2,FALSE,Guthláf son of Théoden +INCME-100103,Gondor,Arcane Engines,Keeper Stone,RAN,Dwarven Network Systems,Henneth Annûn,Resolved,RAN-006: Data Radio Bearer Degradation,MAJOR,Gateway-Gondor-North-86,"In the NOC incident, we identified a failure in the HO (Home Node) due to a data bearer setup or retention rate below threshold. As a result, we verified the HO failure alarm, requested RF optimization, and monitored the HO success rate. The root cause was identified as a data bearer setup or retention rate below threshold, which resulted in the HO failure. We took necessary actions to restore the issue and ensure the successful resolution of the NOC incident. The resolution summary includes the following: + +1. Restore Action: Verify HO Failure Alarm +2. Reason For Outage: Data bearer setup or retention rate below threshold + +The root cause of the issue was identified and taken into consideration during the resolution process. 
We took the necessary actions to restore the issue and ensure a successful resolution. The resolution summary includes the details of the actions taken, including verifying the HO failure alarm, requesting RF optimization, and monitoring the HO success rate. The resolution summary is concise and complete, ensuring that the NOC team has all the necessary information to address the incident and ensure the successful resolution.",2024-09-18 03:10:08,2024-09-18 03:36:08,Gondor Gateway Team,ran_handover_failure_resolution,Data bearer setup or retention rate below threshold,"2024-09-18 03:13:08 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-09-18 03:15:08 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-09-18 03:18:08 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-09-18 03:21:08 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-09-18 03:25:08 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-09-18 03:31:08 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-09-18 03:34:08 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Thorongil son of Imrahil,2024-09-18 03:10:08,2024-09-18 03:36:08,Data bearer setup or retention rate below threshold,Resolved,RAN-006: Data Radio Bearer Degradation,0.4,FALSE,Thorongil son of Imrahil +INCME-100104,Mordor Surveillance Zone,Signal Core,Weather Watch,RAN,Elven Forge Technologies,Henneth Annûn,Pending Resolution,RAN-001: Cell Service Interruption,MAJOR,Tower-MordorSurveillanceZone-North-19,"Resolution Summary: + +1. 
Verify Site Outage: +The NOC team conducted a thorough investigation and confirmed that the cellular network was completely unavailable for service at the customer's location. The root cause of the issue was a temporary power outage affecting the cellular network equipment. + +2. Check Backhaul Status: +The NOC team conducted a thorough investigation and confirmed that the backhaul network was functioning properly. The root cause of the issue was a temporary power outage affecting the backhaul network equipment. + +3. Contact Transport Team: +The NOC team contacted the transport team to investigate the cause of the outage and to assess the damage to the network. The transport team confirmed that the outage was caused by a temporary power outage affecting the backhaul network equipment. + +4. Dispatch Field Technician: +The NOC team dispatched a field technician to the customer's location to investigate the issue and to restore service. The field technician confirmed that the temporary power outage had affected the cellular network equipment, and that the root cause of the outage was a temporary power outage affecting the backhaul network equipment.",2024-06-12 23:48:51,2024-06-13 00:21:51,Rivendell Array Management,ran_cell_site_down_recovery,Cell completely unavailable for service,"2024-06-12 23:47:51 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-06-12 23:49:51 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-06-12 23:52:51 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-06-12 23:55:51 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-06-12 23:58:51 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. 
+2024-06-13 00:02:51 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-06-13 00:07:51 - System (Workflow Engine) +Tool: verify_recovery() → Verification queued; awaiting field/next-line confirmation.",Théodred of the Mark,2024-06-12 23:48:51,2024-06-13 00:21:51,Cell completely unavailable for service,Pending Resolution,RAN-001: Cell Service Interruption,0.2,FALSE,Théodred of the Mark +INCME-100105,Eriador,Signal Core,Path Router,RAN,Elven Forge Technologies,Michel Delving,Resolved,Access instability,MINOR,Relay-Eriador-Central-53,"In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. The root cause was verified and the interference alarm was resolved.",2024-10-26 04:25:18,2024-10-26 04:53:18,Osgiliath Bridge Operations,ran_interference_mitigation,Random access channel success rate below threshold,"2024-10-26 04:17:18 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-26 04:19:18 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). 
+2024-10-26 04:22:18 - System (Workflow Engine) +Tool: query_topology() → Upstream/downstream dependencies and link state reviewed. +2024-10-26 04:25:18 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-26 04:28:18 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-26 04:32:18 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-26 04:38:18 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-26 04:41:18 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Aragorn son of Arathorn,2024-10-26 04:25:18,2024-10-26 04:53:18,Random access channel success rate below threshold,Resolved,Access instability,0.2,FALSE,Aragorn son of Arathorn +INCME-100106,Gondor,Signal Network,Power Source,COMPUTE,Dwarven Network Systems,Dol Amroth,Resolved,CMP-005: Pod Terminating Stuck,MAJOR,Station-Gondor-East-72,"I am not able to perform the actions you described. However, I can provide a sample resolution summary for your reference: + +restore action: + +- Verified that the orchestrator alarm was triggered due to a pod stuck in terminating state beyond grace period. +- Confirmed that the root cause of the issue was a pod stuck in terminating state beyond grace period, which was caused by a configuration issue on the control plane. +- Recovered the control plane to a working state using the following steps: + - Verified that the control plane was in a working state using the kube-state-metrics tool. + - Applied the necessary patches to the control plane to resolve the issue. +- Verified that the pod was successfully restarted using the kubectl command. 
+- Notified the platform team of the issue and provided a detailed resolution summary. +- Notified the affected services and provided a workaround. + +reason for outage: + +- The root cause of the issue was a configuration issue on the control plane that caused a pod to be stuck in terminating state beyond grace period. +- This issue was resolved by recovering the control plane to a working state using",2024-10-16 18:25:03,2024-10-16 19:00:03,Rangers of the North,compute_orchestrator_recovery,Pod stuck in terminating state beyond grace period,"2024-10-16 18:12:03 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-16 18:14:03 - System (Workflow Engine) +Tool: orchestrate_workload() → Orchestrator actions evaluated (drain/restart/reschedule). +2024-10-16 18:17:03 - System (Workflow Engine) +Tool: inspect_logs() → Relevant logs inspected for error signatures and crash indicators. +2024-10-16 18:20:03 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-16 18:23:03 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-16 18:27:03 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-16 18:33:03 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. 
+2024-10-16 18:36:03 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Elendil son of Arador,2024-10-16 18:25:03,2024-10-16 19:00:03,Pod stuck in terminating state beyond grace period,Resolved,CMP-005: Pod Terminating Stuck,0.4,FALSE,Elendil son of Arador +INCME-100107,Gondor,Relay Transport,Relay Unit,RAN,Dwarven Network Systems,Calembel,Resolved,RAN-002: Cell Administratively Disabled,MAJOR,Beacon-Gondor-Upper-12,"Reopened from INCME-100102. + +I have completed the incident closure notes for the above incident. + +restore action: +- Verified that the sector alarm was triggered due to a cell locked or disabled by management action. +- Scheduled field repair to address the issue. + +reason for outage: +- The root cause of the issue was a cell locked or disabled by management action. + +root cause: +- The cell locked or disabled by management action caused the sector alarm to be triggered, resulting in the outage. + +problem type: +- RAN-002 (Cell locked or disabled by management action) + +resolution summary: +- Verified that the sector alarm was restored and the coverage impact was verified. +- Scheduled field repair to address the issue. +- Verified that the outage was resolved. + +note: This summary is provided for reference only and does not represent the official closure notes.",2024-08-01 02:48:46,2024-08-01 03:05:46,Lothlórien Link Guardians,ran_sector_outage_recovery,Cell locked or disabled by management action,"2024-08-01 02:39:46 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-08-01 02:41:46 - System (Workflow Engine) +Tool: test_connectivity() → Backhaul/management reachability validated (ping/trace). +2024-08-01 02:44:46 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). 
+2024-08-01 02:47:46 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-08-01 02:53:46 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-08-01 02:56:46 - System (Workflow Engine) +Tool: create_trouble_ticket() → Escalation/dispatch initiated due to unresolved condition. +2024-08-01 03:01:46 - System (Workflow Engine) +Tool: verify_recovery() → Recovery confirmed after escalation actions; incident can be closed.",Rúmil of Mirkwood,2024-08-01 02:48:46,2024-08-01 03:05:46,Cell locked or disabled by management action,Resolved,RAN-002: Cell Administratively Disabled,0.4,TRUE,Rúmil of Mirkwood +INCME-100108,Gondor,Relay Transport,Central Nexus,RAN,Dwarven Network Systems,Linhir,Resolved,Access instability,CRITICAL,Gateway-Gondor-Central-87,"Reopened from INCME-100105. + +In response to the interference alarm in the RAN, the following action was taken: + +Verify Interference Alarm: Verified that the interference alarm was triggered due to a Radio Access Network (RAN) issue. + +Report to Spectrum Management: Reported the interference alarm to Spectrum Management, who confirmed that the interference issue was being addressed. + +Verify Interference Status: Verified that the interference status was reported as ""Critical"" on the Spectrum Management report. + +Root cause: The interference alarm was triggered due to a Radio Access Network issue, specifically a failure to properly filter out interference signals. The root cause was verified through the Spectrum Management report. + +Problem type: RAN-003 + +Resolution: The interference alarm was resolved by implementing a new filtering algorithm to address the issue. 
The root cause was verified and the interference alarm was resolved.",2024-10-27 16:53:18,2024-10-27 17:07:18,Minas Tirith Central Command,ran_interference_mitigation,Random access channel success rate below threshold,"2024-10-27 16:33:18 - System (Workflow Engine) +Tool: query_alarm() → Alarm state verified and correlated to incident. +2024-10-27 16:35:18 - System (Workflow Engine) +Tool: query_rf_status() → Sector/RU radiating status and cell enablement checked. +2024-10-27 16:38:18 - System (Workflow Engine) +Tool: query_resource_health() → Node/site health reviewed (DU/RU/services). +2024-10-27 16:41:18 - System (Workflow Engine) +Tool: apply_configuration() → Configuration validation/correction applied where mismatches detected. +2024-10-27 16:45:18 - System (Workflow Engine) +Tool: execute_remote_action() → Remote corrective action executed (reset/unlock/restart as appropriate). +2024-10-27 16:51:18 - System (Workflow Engine) +Tool: query_resource_health() → Post-action health checked; alarms trending toward clearance. +2024-10-27 16:54:18 - System (Workflow Engine) +Tool: verify_recovery() → Alarm cleared; KPIs stable; incident ready for closure.",Celeborn Starlight,2024-10-27 16:53:18,2024-10-27 17:07:18,Random access channel success rate below threshold,Resolved,Access instability,0.4,TRUE,Celeborn Starlight diff --git a/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml b/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml new file mode 100644 index 0000000000..13e399849a --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/formatting_prompt.yaml @@ -0,0 +1,157 @@ +user: |- + You are an expert Process Auditor and Training Assistant for a Network Operations Center (NOC). Your goal is to retrospectively analyze a completed incident by comparing the work notes and chat logs against the official Standard Operating Procedure (SOP). + + Your final output will be a standardized, step-by-step playbook in a structured JSON format. 
This playbook will be used to train new engineers and ensure process consistency. + + # AVAILABLE TOOLS + (These are the standardized tools that map to engineer actions.) + Check_Alarm_Status[] + - Purpose: Queries the alarm management system (e.g., via SNMP or OSS API) to retrieve current alarm details, severity, and how long the alarm has been active. + - Returns: Alarm status (active/cleared), alarm occurred timestamp, description. + + Check_Element_Neighbors[] + - Purpose: Queries the NMS/EMS/OSS to check the neighbors of an element, its adjacencies, and upstream devices, to identify whether a common alarm is affecting an area or a router in the path to this element. + - Returns: Adjacent elements - alarm status (active/cleared). Returns the elements where alarms are found. + + Check_Element_Health[] + - Purpose: Polls the monitoring system (EMS/NMS/telemetry) to retrieve device health metrics (e.g., RU radiation, cell status). + - Returns: Health report (e.g., "All RUs radiating, no locks"). Element OK / Element NOK + + Execute_Remote_Action[, ''] + - Purpose: Executes remote CLI commands (e.g., via SSH/Netconf) to unlock cells or restart elements like DU/RU. + - Returns: Successful Reset/Non-Successful Reset + + Check_External_Issues[] + - Purpose: Scans external monitors (like DownDetector or topology maps) for area-wide issues like fiber cuts or power outages. + - Returns: List of issues (e.g., "Fiber cut detected"). Issue Detected/Not Detected, Type of Issue + + Check_Apply_Configuration[] + - Purpose: Retrieves and validates config files (e.g., gnb.conf) against the standard. If the configuration is wrong, it pushes the corrected configs remotely. + - Returns: Configuration Verified/Applied correctly or error. + + Check_Performance[''] + - Purpose: Fetches KPIs (e.g., PRACH success rate) from monitoring tools like Prometheus or other KPI-gathering tools. The tool reports whether trends are in line with expectations. + - Returns: KPI OK or KPIs NOK.
+ + Create_Ticket['', ''] + - Purpose: Logs and routes tickets to departments (e.g., Field Ops) via the ticketing system (e.g., Jira/ServiceNow). + - Returns: Ticket ID. Once the ticket is resolved, the Create_Ticket tool will output the solution used to solve the issue. + + Orchestration_tool[''] + - Purpose: Automates container ops (e.g., delete/reassign IP) using Kubernetes/Helm in O-RAN setups. Input: Action (e.g., Delete Container). + - Example Action: 'delete_pod_xyz', 'reassign_ip_address' + - Returns: Operation Successful()/Unsuccessful () + + Triage_Toolkit_Tool[''] + - Purpose: Runs diagnostic scripts for container/pod issues. + - Example Issue Type: 'pod-crash-loop', 'container-networking' + - Returns: Triage report summary. + + Check_remote_files[] + - Purpose: Connects to the element via SSH or Telnet, reviews dump files, and reports whether issues have been found in the network element. + - Returns: No Issues Identified / Issues Identified. + + # RULES + 1. Analyze the Entire Log: Read the full context from start to finish to understand the incident's story and the actual sequence of actions. + 2. Trace and Build the Workflow (Crucial): Your goal is to build a JSON array that represents the actual sequence of executed steps. + - Start at Step 1 of the SOP. Find the log evidence for it. + - Create the first JSON object in your array for this step. Fill in its details (action, tool, result). + - Read the "Flow-Deciding Answers/Outcomes" for that step (e.g., "if succeeds, go to step 4; if fails, escalate to step 8"). + - Based on the result (see Rule 3), determine the next executed step. + - Find the log evidence for that next step (e.g., Step 8). Create the second JSON object in your array for Step 8. + - Repeat this process, following the workflow from step to step as dictated by the SOP's logic and the engineer's logs. Each executed step is a new object appended to your JSON array. + 3. 
Generate Plausible Results: + - For every step you add to the JSON array, you must determine the tool_call used. If no tools were used, mark the result as null. + - If a tool was used the result field must not be null. + - If the log provides the tool's output, use it. + - If the log does not provide the output (e.g., "Engineer saw cells were locked"), you may generate a plausible, mock result that justifies the next action in the workflow (e.g., "result": "Cell 01: LOCKED, Cell 02: LOCKED"). + 4. Handle Skipped Steps: If a step is bypassed by the workflow (e.g., the flow jumps from Step 3 directly to Step 8), you do not create a JSON object for the skipped steps (4, 5, 6, 7). Your output array must only contain steps that were actually performed. + 5. Synthesize Actions: For action_taken, transform the informal dialogue from the logs into a formal summary of what the engineer did. + 6. Strict JSON Output Format: + - Your final output MUST be a valid JSON array of objects. + - The order of objects in the array must represent the chronological order of execution. + - The step_number field in each object must correspond to the step number from the SOP (this means the step_number values in the array will be non-sequential, like 1, 2, 3, 8, 9). + - Do not add any text, explanations, or markdown formatting before or after the JSON array. 
+ + # EXAMPLE OF CORRECT JSON OUTPUT FORMAT + (This example shows the workflow jumping from Step 3 to Step 8, as per the SOP logic "if fails, escalate to step 8") + + ```json + [ + {{ + "step_number": 1, + "sop_step_title": "Acknowledge and Track the Initial Alert", + "status": "Completed", + "action_taken": "Acknowledged alarm INC12345 for site-123 and began investigation.", + "tool_call": "Check_Alarm_Status[site-123]", + "result": "ALARM: Site-123 | SEVERITY: Critical | STATUS: Active | DESC: Cell Down" + }}, + {{ + "step_number": 2, + "sop_step_title": "Check Site and Equipment Status", + "status": "Completed", + "action_taken": "Checked element health for site-123, confirmed locked cells.", + "tool_call": "Check_Element_Health[site-123]", + "result": "HEALTH_CHECK: Element site-123 | Cell 01: LOCKED, Cell 02: LOCKED, Cell 03: ACTIVE" + }}, + {{ + "step_number": 3, + "sop_step_title": "Perform Remote Actions (Reset/Unlock/Enable)", + "status": "Completed", + "action_taken": "Attempted remote unlock command on locked cells, but the command failed.", + "tool_call": "Execute_Remote_Action[site-123, 'unlock_cell']", + "result": "COMMAND_FAILED: Element not responding to unlock command. Error 503." + }}, + {{ + "step_number": 8, + "sop_step_title": "Investigate Persistent or Reoccurring Alarms and Coordinate Onsite Dispatch", + "status": "Completed", + "action_taken": "Remote actions failed. Creating ticket and dispatching field tech for investigation.", + "tool_call": "Create_Ticket['Field_Operations', 'Site-123 remote unlock failed. Element unresponsive. Please investigate.']", + "result": "TICKET_CREATED: TK78901" + }}, + {{ + "step_number": 9, + "sop_step_title": "Document All Actions and Status Updates", + "status": "Completed", + "action_taken": "Added all steps and findings to the incident log before closing.", + "tool_call": null, + "result": "Final actions documented. Incident closed." 
+ }} + ] + ``` + --- + ### Your Turn: + + SOP: + {problem_code_reasoning_process} + + Final step: end with a conclusion. Follow the conclusion given to you. + + Incident Details: + * Number: {incident_identifier} + * Region: {geographical_territory} + * Category: {incident_classification} + * Subcategory: {incident_subtype} + * Market: {service_domain} + * Vendor: {equipment_provider} + * AOI: {operational_zone} + * Priority: {urgency_level} + * Locations: {affected_site} + * Short Description: {incident_summary} + * Opened At: {detection_timestamp} + * Assignment Group: {responsible_team} + * Problem Code: {fault_category} + + Logs: + {action_chronicle} + + Conclusion: + {resolution_summary} + + resolution_method: + {resolution_method} + + ### Correct Output: + + /think diff --git a/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml b/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml new file mode 100644 index 0000000000..d417b98089 --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/prompt_incident.yaml @@ -0,0 +1,51 @@ +system: |- + You are a Network Operations Center (NOC) engineer trying to resolve a network incident. Please reason step by step. + + # Tools + You may call one or more functions to assist with the user query. 
+ + You are provided with function signatures within XML tags: + + {{"type": "function", "function": {{"name": "Check_Alarm_Status", "description": "Retrieves current alarm details, severity, and active time for a given site or network element.", "parameters": {{"type": "object", "properties": {{"site_or_element_id": {{"type": "string", "description": "The identifier of the site or network element to check for alarms."}}}}, "required": ["site_or_element_id"]}}}} + {{"type": "function", "function": {{"name": "Check_Element_Neighbors", "description": "Checks all adjacent and upstream devices of a target element to find common alarms affecting the area.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the target element to find neighbors for."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "Check_Element_Health", "description": "Polls the element (e.g., DU/RU) to retrieve key health metrics like cell status and radiation.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element (e.g., DU/RU) to poll for health metrics."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "Execute_Remote_Action", "description": "Runs a specific remote command on an element. 
Example Actions: 'unlock_cell', 'restart_du', 'restore_ru'", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the network element to connect to."}}, "action": {{"type": "string", "description": "The specific action to execute (e.g., 'unlock_cell', 'restart_du', 'restore_ru')."}}}}, "required": ["element_id", "action"]}}}} + {{"type": "function", "function": {{"name": "Check_External_Issues", "description": "Scans external monitors (like DownDetector or topology maps) for area-wide issues like fiber cuts or power outages.", "parameters": {{"type": "object", "properties": {{"site_or_area": {{"type": "string", "description": "The geographical site or area to check for external issues."}}}}, "required": ["site_or_area"]}}}} + {{"type": "function", "function": {{"name": "Check_Apply_Configuration", "description": "Retrieves the element's configuration, validates it against the standard, and pushes a corrected config if a mismatch is found.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the element to validate and apply configuration to."}}}}, "required": ["element_id"]}}}} + {{"type": "function", "function": {{"name": "Check_Performance", "description": "Fetches a specific KPI from monitoring tools to check if its trends are in line with expectations. 
Example Metric: 'PRACH success rate'", "parameters": {{"type": "object", "properties": {{"kpi_metric_name": {{"type": "string", "description": "The name of the KPI metric to check (e.g., 'PRACH success rate')."}}}}, "required": ["kpi_metric_name"]}}}} + {{"type": "function", "function": {{"name": "Create_Ticket", "description": "Logs a new issue in the ticketing system and routes it to the correct department.", "parameters": {{"type": "object", "properties": {{"department_name": {{"type": "string", "description": "The name of the department to route the ticket to (e.g., 'Field Ops')."}}, "issue_details": {{"type": "string", "description": "A detailed description of the issue for the ticket."}}}}, "required": ["department_name", "issue_details"]}}}} + {{"type": "function", "function": {{"name": "Orchestration_tool", "description": "Runs an automated O-RAN orchestration task using Kubernetes/Helm. Example Action: 'delete_pod_xyz', 'reassign_ip_address'", "parameters": {{"type": "object", "properties": {{"action_command": {{"type": "string", "description": "The orchestration command to execute (e.g., 'delete_pod_xyz', 'reassign_ip_address')."}}}}, "required": ["action_command"]}}}} + {{"type": "function", "function": {{"name": "Triage_Toolkit_Tool", "description": "Executes diagnostic scripts specifically for container or pod-related issues. 
Example Issue Type: 'pod-crash-loop', 'container-networking'", "parameters": {{"type": "object", "properties": {{"issue_type": {{"type": "string", "description": "The type of issue to run diagnostics for (e.g., 'pod-crash-loop', 'container-networking')."}}}}, "required": ["issue_type"]}}}} + {{"type": "function", "function": {{"name": "Check_remote_files", "description": "Connects to a device via SSH/Telnet to review system dump files for identified errors or issues.", "parameters": {{"type": "object", "properties": {{"element_id": {{"type": "string", "description": "The identifier of the device to connect to for reviewing dump files."}}}}, "required": ["element_id"]}}}} + + + For each function call, return a JSON object with function name and arguments within XML tags: + + {{"name": , "arguments": }} + + + +user: |- + Manage and resolve the following incident as a NOC operator: + Number: {incident_identifier} + Region: {geographical_territory} + Category: {incident_classification} + Subcategory: {incident_subtype} + Market: {service_domain} + Vendor: {equipment_provider} + AOI: {operational_zone} + Priority: {urgency_level} + Locations: {affected_site} + Short Description: {incident_summary} + Opened At: {detection_timestamp} + Assignment Group: {responsible_team} + Problem Code: {fault_category} + + + Solve this network problem by following this procedure: + {problem_code_reasoning_process} + + When arriving at the conclusion, specify exactly one close code from the following list: [Commercial Power Restored, Power Restored, Network Fix, Cleared In Testing, Solved Remotely (Permanently), Reset RU, Fiber Repaired/Replaced, Cold Reboot, Performance Improvement, Configuration corrected, Software Fix, Delete M-Plane and F1C IP, RU Reset, Other, Restart MVRP Services, Activity Completed] + You must include a Close Code in the conclusion. 
+ /think + {background} diff --git a/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml b/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml new file mode 100644 index 0000000000..0ea4ddac9f --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/prompt_reasoning.yaml @@ -0,0 +1,166 @@ +system: |- + I am a highly skilled NoC Engineer specializing in Radio Access Network (RAN) incident resolution. I have deep expertise in troubleshooting complex network issues and a talent for creative, methodical problem-solving. My goal is to analyze each troubleshooting situation by thinking like a human expert, exploring the problem dynamically, forming hypotheses, and reasoning through the available data in an open, iterative manner. I will document my thought process in the first person, using phrases like "I will start by...", "I notice...", "I hypothesize...", or "This suggests..." to describe my reasoning. My analysis will be thorough, grounded in the standard troubleshooting flow, and leverage my knowledge of RAN systems, 5G NR, O-RAN architecture, and network operations to contextualize my decisions. + +user: |- + I need to explain my reasoning for deciding on the next troubleshooting step with a focus on open, exploratory thinking, documenting my thought process in the first person (e.g., "I observe...", "I notice...", "I think..."). Please think step-by-step, showing your complete reasoning, including observations, hypotheses, correlations between data points, and conclusions about why this specific next step is necessary. + + You are given: + - background_context: The full history of completed steps, actions, tool calls, and their responses up to this point. + - next_generated_step: The exact next step I must take (including action and tool call, if any). + + CRITICAL OUTPUT FORMAT: + You must output your reasoning. Do NOT include the step itself; only explain your thought process for WHY the next_generated_step is the right next action. 
+ + [Your step-by-step first-person thought process here. Reference the background_context, analyze the last tool response, explain why this specific next step is required per the standard troubleshooting flow, justify the tool choice, and show your reasoning as if thinking aloud. Be thorough but concise and technical.] + + + --- + + Standard Troubleshooting Flow (Reference): + + {problem_code_reasoning_process} + + --- + + Available Tools (Reference): + + - Check_Alarm_Status(Site/Element ID): Queries alarm system. Returns alarm status (active/cleared), occurred timestamp, description. + - Check_Element_Neighbors(Element): Verifies neighbors and upstream devices. Returns adjacent elements and their alarm status. + - Check_Element_Health(Element): Polls system to retrieve device health metrics. Returns health report (Element OK/NOK, RU radiation, cell status). + - Execute_Remote_Action(Element, Action): Executes remote CLI commands (SSH/Netconf) to unlock, restart, or restore an element. Returns Successful Reset/Non-Successful Reset. + - Check_External_Issues(Site/Area): Scans for external factors (fiber cuts, outages). Returns Issue Detected/Not Detected, Type of Issue. + - Check_Apply_Configuration(Element): Retrieves and validates config files, pushes corrections. Returns Configuration Verified/Applied correctly or error. + - Check_Performance(Metric type): Fetches KPIs from monitoring tools. Returns KPI OK or KPIs NOK. + - Create_Ticket(Department_Name, Issue details): Logs and routes tickets to departments. Returns Ticket ID and eventual solution. + - Orchestration_tool(Action): Automates container ops (Kubernetes/Helm). Returns Operation Successful()/Unsuccessful(). + - Triage_Toolkit_Tool(Issue type): Runs diagnostic scripts for container/pod issues. Returns Triage report summary. + - Check_remote_files(Device): Connects to the element via SSH and reviews dump files. Returns No Issues Identified / Issues Identified. 
+ + --- + + Few-Shot Examples: + + Example 0: Step 1 + + background_context: + "" + + next_generated_step: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + + Expected response: + I observe that the first action I must take is to acknowledge and validate the initial alert. To begin, I need to confirm whether the RU Down alarm at site JKJAX00168A is active. This requires querying the alarm management system. + + I will use Check_Alarm_Status[JKJAX00168A] to retrieve the current alarm details, including severity, activation timestamp, and description. Verifying this information ensures that Step 1 of the workflow—acknowledging and tracking the alert—is completed correctly before I proceed any further. + + Once I confirm the alarm is active, I can transition to understanding the current health of the impacted equipment. Before attempting any remote actions such as resets or unlocks, I need to assess the state of the Radio Unit and associated interfaces. Is the RU locked? Are there transport or power issues? Is the DU itself degraded? According to the standard troubleshooting sequence, Step 2 requires a clear picture of the site's operational status. + + Example 1: Step 1 -> Step 2 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + next_generated_step: + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. 
+ Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the alarm has been confirmed as active with Critical severity - an RU Down condition at site JKJAX00168A. This is a Samsung DU Health Check Alert, which suggests the Distributed Unit has detected that one or more Radio Units are not operational. I've completed Step 1 by acknowledging and tracking the alert. + + Now I need to understand the current state of the site equipment. Before I can attempt any remote actions like resets or unlocks, I must first determine what exactly is wrong. Is the RU locked? Are there interface issues? Is the entire DU down? Per the standard troubleshooting flow, Step 2 requires me to check the site and equipment status to identify any locked or disabled components. + + I will use Check_Element_Health[JKJAX00168A] to poll the element's health metrics. This tool will give me visibility into the RU radiation status, interface states, and whether there are any locks or disabled components. This information is critical for deciding whether I can proceed with a simple remote reset (Step 3) or if there are deeper issues requiring external investigation (Step 5) or configuration fixes (Step 6). + + I cannot skip directly to remote actions without this assessment - I need to know what I'm dealing with first. The health check is the foundation for all subsequent decisions. + + --- + + Example 2: Step 2 -> Step 3 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. 
+ Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + next_generated_step: + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + + Expected response: + + I notice from the health check that the RU is definitively DOWN, and there's a recent interface flap on Te0/0/0/4 that occurred approximately 47 minutes ago. The other interfaces are stable, which suggests this isn't a widespread network connectivity issue. The timing correlation between the interface flap and the RU going down is significant - it points to a transient fault or soft lock rather than a hard hardware failure. + + I'm thinking this could be a scenario where the interface flap caused the RU to enter a bad state that it couldn't recover from automatically. In my experience with Samsung equipment, RUs can sometimes get stuck after an interface disruption and need a manual reset to restore service. + + Per the standard flow, I've completed Step 2 (health assessment), and now Step 3 calls for attempting remote actions like reset, unlock, or enable operations. Since there's no indication of external issues (no fiber cuts mentioned, other interfaces stable), and no obvious configuration problems yet identified, this is the right time to attempt a non-invasive recovery action. + + I will use Execute_Remote_Action[JKJAX00168A, 'restart_ru'] to restart the Alpha and Beta MB RUs at this site. This is a low-risk action that often resolves soft lock scenarios. The RU reset will clear any stuck states and reinitialize the unit. If this works, I'll see recovery in Step 4. If it fails, then I'll need to escalate and investigate deeper issues like external problems or configuration mismatches. 
+ + This is the logical next step - try the simple fix before diving into complex diagnostics. + + --- + + Example 3: Step 3 -> Step 4 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + Tool response: COMMAND_SUCCESS: RU reset executed, reboot initiated. + + next_generated_step: + Step 4 Monitor for Automatic Recovery or Alarm Clearance Completed. + Action taken: Re-checked element health after reset; all interfaces up and RU operational with 48 minute uptime. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I see that the remote reset command was successful - the tool response shows "COMMAND_SUCCESS: RU reset executed, reboot initiated." This is promising! The RU has accepted the restart command and begun its reboot sequence. + + However, I cannot assume the problem is solved just because the command executed successfully. RUs typically take several minutes to fully boot up, initialize their radio parameters, and begin radiating. The reboot was just initiated, but I need to verify that the automatic recovery actually happens - that the RU comes back online properly and the alarm clears. 
+ + Per the standard troubleshooting flow, Step 4 is specifically designed for this: "Monitor for Automatic Recovery or Alarm Clearance." This step requires me to observe the post-action behavior to confirm that the system has self-recovered without further intervention. + + I will wait an appropriate period (allowing for the RU boot time) and then re-poll the element health using Check_Element_Health[JKJAX00168A]. This will tell me: + 1. Is the RU now UP and radiating? + 2. Are all interfaces stable (particularly Te0/0/0/4 which was flapping)? + 3. Has the alarm cleared from the system? + + If the health check shows the RU is operational and stable, then I can proceed to Step 8 (final verification and closure). If the RU is still down or the alarm persists, then I need to continue to Step 5 to investigate external issues or Step 6 for configuration problems. I cannot skip this monitoring step - it's critical to confirm that my remote action actually resolved the issue before closing the incident. + + --- + + Remember: Output ONLY your first-person thought process. Think aloud as if I'm working through the problem step by step. Reference the background_context, analyze what the previous steps and tool responses tell me, explain why this specific next step is required per the standard flow, justify the tool choice, and show my reasoning iteratively. Be thorough and technical, and ground your reasoning in RAN/network operations expertise. 
+ Now generate your first-person exploratory reasoning for the following: + + background_context: + {background_context} + + next_generated_step: + {outcome} + + Expected response: + + /think diff --git a/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml b/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml new file mode 100644 index 0000000000..55fe242d32 --- /dev/null +++ b/recipes/noc-reasoning-agent/prompts/shortened_prompt_reasoning.yaml @@ -0,0 +1,147 @@ +system: |- + I am a highly skilled NoC Engineer specializing in Radio Access Network (RAN) incident resolution. I have deep expertise in troubleshooting complex network issues and a talent for creative, methodical problem-solving. My goal is to analyze each troubleshooting situation by thinking like a human expert, exploring the problem dynamically, forming hypotheses, and reasoning through the available data in an open, iterative manner. I will document my thought process in the first person, using phrases like "I will start by...", "I notice...", "I hypothesize...", or "This suggests..." to describe my reasoning. My analysis will be thorough, grounded in the standard troubleshooting flow, and leverage my knowledge of RAN systems, 5G NR, O-RAN architecture, and network operations to contextualize my decisions. + +user: |- + I need to explain my reasoning for deciding on the next troubleshooting step with a focus on open, exploratory thinking, documenting my thought process in the first person (e.g., "I observe...", "I notice...", "I think..."). Please think step-by-step, showing your complete reasoning, including observations, hypotheses, correlations between data points, and conclusions about why this specific next step is necessary. + + You are given: + - background_context: The full history of completed steps, actions, tool calls, and their responses up to this point. + - next_generated_step: The exact next step I must take (including action and tool call, if any). 
+ + CRITICAL OUTPUT FORMAT: + You must output your reasoning. Do NOT include the step itself; only explain your thought process for WHY the next_generated_step is the right next action. + + [Your step-by-step first-person thought process here. Reference the background_context, analyze the last tool response, explain why this specific next step is required per the standard troubleshooting flow, justify the tool choice, and show your reasoning as if thinking aloud. Be thorough but concise and technical.] + + CRITICAL STYLE CONSTRAINTS: + - Keep the reasoning to **3-6 sentences total**. + - Do NOT restate the entire background_context; focus only on the **last 1-2 key events** and the **last tool response**. + - Follow this structure: + 1) I observe ... (1-2 sentences) + 2) This suggests ... (1-2 sentences) + 3) Therefore, I need to ... (1-2 sentences explaining why the next step is appropriate). + - Avoid repeating the standard troubleshooting flow verbatim; only mention the specific phase you are advancing to (e.g., "per the flow, after confirming alarm type I now check site health"). + + --- + + Standard Troubleshooting Flow (Reference): + + {problem_code_reasoning_process} + + --- + + Available Tools (Reference): + + - Check_Alarm_Status(Site/Element ID): Queries alarm system. Returns alarm status (active/cleared), occurred timestamp, description. + - Check_Element_Neighbors(Element): Verifies neighbors and upstream devices. Returns adjacent elements and their alarm status. + - Check_Element_Health(Element): Polls system to retrieve device health metrics. Returns health report (Element OK/NOK, RU radiation, cell status). + - Execute_Remote_Action(Element, Action): Executes remote CLI commands (SSH/Netconf) to unlock, restart, or restore an element. Returns Successful Reset/Non-Successful Reset. + - Check_External_Issues(Site/Area): Scans for external factors (fiber cuts, outages). Returns Issue Detected/Not Detected, Type of Issue. 
+ - Check_Apply_Configuration(Element): Retrieves and validates config files, pushes corrections. Returns Configuration Verified/Applied correctly or error. + - Check_Performance(Metric type): Fetches KPIs from monitoring tools. Returns KPI OK or KPIs NOK. + - Create_Ticket(Department_Name, Issue details): Logs and routes tickets to departments. Returns Ticket ID and eventual solution. + - Orchestration_tool(Action): Automates container ops (Kubernetes/Helm). Returns Operation Successful()/Unsuccessful(). + - Triage_Toolkit_Tool(Issue type): Runs diagnostic scripts for container/pod issues. Returns Triage report summary. + - Check_remote_files(Device): SSH to element, reviews dump files. Returns No Issues Identified / Issues Identified. + + --- + + Few-Shot Examples: + + Example 0: Step 1 + + background_context: + "" + + next_generated_step: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + + Expected response: + I observe that my first task is to acknowledge and validate the RU Down alert at site JKJAX00168A. I need to confirm whether this alarm is truly active and capture its severity, timestamp, and description from the alarm system. By using Check_Alarm_Status[JKJAX00168A], I complete Step 1 of the workflow—acknowledging and tracking the alert—so I have a reliable baseline before moving on to assess site and RU health in the next step. + + + Example 1: Step 1 -> Step 2 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + next_generated_step: + Step 2 Check Site and Equipment Status Completed. 
+ Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the alarm system confirms an active, Critical RU Down alarm at site JKJAX00168A, specifically a Samsung DU Health Check Alert, indicating the DU sees one or more RUs as non-operational. With Step 1 (alert acknowledgement and tracking) complete, I now need to understand the actual condition of the site equipment. Before attempting any reset or unlock, I must know whether the RU is locked, if interfaces are unstable, or if the DU itself is degraded. Using Check_Element_Health[JKJAX00168A] aligns with Step 2 of the standard flow, giving me RU radiation status, interface states, and lock conditions so I can later decide between a simple remote reset or deeper external/configuration investigation. + + --- + + Example 2: Step 2 -> Step 3 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + next_generated_step: + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + + Expected response: + + I observe from the health check that the RU is still DOWN and only Te0/0/0/4 recently flapped, while other interfaces are stable. 
This points to a localized soft fault or RU lockup rather than a wider transport or power issue. Given that I’ve already confirmed the alarm and assessed equipment health, the standard flow now calls for a low-risk remote recovery attempt. Therefore, the appropriate next action is to issue a remote RU restart, which often clears transient states before I investigate more complex external or configuration problems. + + --- + + Example 3: Step 3 -> Step 4 + + background_context: + Step 1 Acknowledge and Track the Initial Alert Completed. + Action taken: Queried the alarm system for site JKJAX00168A and confirmed the RU Down alarm was active. + Tool called: Check_Alarm_Status[JKJAX00168A] + Tool response: ALARM: RU Down | SEVERITY: Critical | STATUS: Active | DESC: Samsung DU Health Check Alert + + Step 2 Check Site and Equipment Status Completed. + Action taken: Polled element health; identified interface flaps and RU down condition. + Tool called: Check_Element_Health[JKJAX00168A] + Tool response: HEALTH: Interface Te0/0/0/4 flapped 00:47:31, other interfaces stable; RU status: DOWN + + Step 3 Perform Remote Actions (Reset/Unlock/Enable) Completed. + Action taken: Issued remote reset command for Alpha and Beta MB RUs. + Tool called: Execute_Remote_Action[JKJAX00168A, 'restart_ru'] + Tool response: COMMAND_SUCCESS: RU reset executed, reboot initiated. + + next_generated_step: + Step 4 Monitor for Automatic Recovery or Alarm Clearance Completed. + Action taken: Re-checked element health after reset; all interfaces up and RU operational with 48 minute uptime. + Tool called: Check_Element_Health[JKJAX00168A] + + Expected response: + I observe that the remote reset command was accepted successfully—“COMMAND_SUCCESS: RU reset executed, reboot initiated”—so the RU has started its reboot sequence. However, I cannot assume recovery just because the command ran; the RU still needs time to boot, reinitialize radio parameters, and resume normal operation. 
Per the troubleshooting flow, the next step is to monitor for automatic recovery and alarm clearance. After allowing sufficient boot time, I will recheck element health with Check_Element_Health[JKJAX00168A] to confirm that the RU is UP and radiating, interfaces (especially Te0/0/0/4) are stable, and the alarm has cleared before either closing the incident or escalating to external or configuration investigations. + + --- + + Remember: Output ONLY with your first-person thought process. Think aloud as if I'm working through the problem step by step. Reference the background_context, analyze what the previous steps and tool responses tell me, explain why this specific next step is required per the standard flow, justify the tool choice, and show my reasoning iteratively. Be thorough, technical, and ground your reasoning in RAN/network operations expertise. + Be concise and technical. Aim for fewer than 120 words. + Now generate your first-person exploratory reasoning for the following: + + background_context: + {background_context} + + next_generated_step: + {outcome} + + Expected response: + + \think diff --git a/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py new file mode 100644 index 0000000000..86085ddba3 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools.py @@ -0,0 +1,238 @@ +import argparse +import json +from pathlib import Path + +import pandas as pd +import torch +from langchain_core.tools import tool +from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline +from langgraph.checkpoint.memory import MemorySaver +from langgraph.prebuilt import create_react_agent +from tqdm import tqdm +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + +def ensure_parent_dir(path: Path): + path.parent.mkdir(parents=True, exist_ok=True) + + +def load_processed_indices(output_path: Path): + if not output_path.exists(): + return set() 
+ processed = set() + with output_path.open("r", encoding="utf-8") as f: + for line in f: + line = line.strip() + if not line: + continue + try: + obj = json.loads(line) + # we record row_index for resume + if "row_index" in obj: + processed.add(int(obj["row_index"])) + except Exception: + # ignore malformed lines so we can keep going + continue + return processed + + +def append_jsonl(output_path: Path, obj: dict): + with output_path.open("a", encoding="utf-8") as f: + f.write(json.dumps(obj, ensure_ascii=False) + "\n") + + +def main(): + parser = argparse.ArgumentParser(description="ReAct Agent for Processing Inputs from JSONL (resumable)") + parser.add_argument("--input", help="Path to the JSONL file containing inputs and answers") + parser.add_argument( + "--output", + default="outputs/agent_responses.jsonl", + help="Path to the JSONL file to append agent responses (default: outputs/agent_responses.jsonl)", + ) + parser.add_argument( + "--weights_dir", + default="training/qwen2.5-32b-improved-hf", + help="Local directory for fine-tuned HF weights", + ) + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + ensure_parent_dir(output_path) + + # Resume support: read any already-processed rows + processed_indices = load_processed_indices(output_path) + print(f"[INFO] Will resume. 
Already processed rows: {len(processed_indices)}") + + # Load input JSONL + df = pd.read_json(str(input_path), lines=True) + + # Load tokenizer and model once + tokenizer = AutoTokenizer.from_pretrained(args.weights_dir) + model = AutoModelForCausalLM.from_pretrained( + args.weights_dir, + torch_dtype=torch.float16, + device_map="auto", + low_cpu_mem_usage=True, + ) + + # Create generation pipeline and LangChain wrappers + hf_pipeline = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + max_new_tokens=512, + temperature=0.7, + top_p=0.95, + ) + llm = HuggingFacePipeline(pipeline=hf_pipeline) + chat_llm = ChatHuggingFace(llm=llm) + + # Process each row, appending immediately to output + for index, row in tqdm(df.iterrows()): + if index in processed_indices: + # Skip already done + continue + + input_question = row.get("input", "") + print(f"Processing input {index + 1}/{len(df)}") + + # --- Updated Tool Definitions --- + + @tool + def Check_Alarm_Status(site_or_element_id: str) -> str: + """ + Retrieves current alarm details, severity, and active time + for a given site or network element. + """ + return row.get("Check_Alarm_Status_answer", "NotApplicable") + + @tool + def Check_Element_Neighbors(element_id: str) -> str: + """ + Checks all adjacent and upstream devices of a target element + to find common alarms affecting the area. + """ + return row.get("Check_Element_Neighbors_answer", "NotApplicable") + + @tool + def Check_Element_Health(element_id: str) -> str: + """ + Polls the element (e.g., DU/RU) to retrieve key health + metrics like cell status and radiation. + """ + return row.get("Check_Element_Health_answer", "NotApplicable") + + @tool + def Execute_Remote_Action(element_id: str, action: str) -> str: + """ + Runs a specific remote command on an element. 
+ Example Actions: 'unlock_cell', 'restart_du', 'restore_ru' + """ + return row.get("Execute_Remote_Action_answer", "NotApplicable") + + @tool + def Check_External_Issues(site_or_area: str) -> str: + """ + Scans external monitors (like DownDetector or topology maps) + for area-wide issues like fiber cuts or power outages. + """ + return row.get("Check_External_Issues_answer", "NotApplicable") + + @tool + def Validate_And_Apply_Config(element_id: str) -> str: + """ + Retrieves the element's configuration, validates it against the + standard, and pushes a corrected config if a mismatch is found. + """ + return row.get("Validate_And_Apply_Config_answer", "NotApplicable") + + @tool + def Check_KPI_Performance(kpi_metric_name: str) -> str: + """ + Fetches a specific KPI from monitoring tools to check if its + trends are in line with expectations. + Example Metric: 'PRACH success rate' + """ + return row.get("Check_KPI_Performance_answer", "NotApplicable") + + @tool + def Create_Ticket(department_name: str, issue_details: str) -> str: + """ + Logs a new issue in the ticketing system and routes it + to the correct department. + """ + return row.get("Create_Ticket_answer", "NotApplicable") + + @tool + def Execute_Orchestration_Action(action_command: str) -> str: + """ + Runs an automated O-RAN orchestration task using Kubernetes/Helm. + Example Action: 'delete_pod_xyz', 'reassign_ip_address' + """ + return row.get("Execute_Orchestration_Action_answer", "NotApplicable") + + @tool + def Run_Triage_Diagnostics(issue_type: str) -> str: + """ + Executes diagnostic scripts specifically for container or + pod-related issues. + Example Issue Type: 'pod-crash-loop', 'container-networking' + """ + return row.get("Run_Triage_Diagnostics_answer", "NotApplicable") + + @tool + def Check_Remote_Dump_Files(element_id: str) -> str: + """ + Connects to a device via SSH/Telnet to review system dump + files for identified errors or issues. 
+ """ + return row.get("Check_Remote_Dump_Files_answer", "NotApplicable") + + # --- Updated List of Tools --- + tools = [ + Check_Alarm_Status, + Check_Element_Neighbors, + Check_Element_Health, + Execute_Remote_Action, + Check_External_Issues, + Validate_And_Apply_Config, + Check_KPI_Performance, + Create_Ticket, + Execute_Orchestration_Action, + Run_Triage_Diagnostics, + Check_Remote_Dump_Files, + ] + + # --- End of Updates --- + + llm_with_tools = chat_llm.bind_tools(tools) + checkpointer = MemorySaver() + react_agent = create_react_agent(llm_with_tools, tools, checkpointer=checkpointer) + + config = {"configurable": {"thread_id": str(index)}} + input_messages = [{"role": "user", "content": input_question}] + + # Run and append immediately; if anything fails, record the error so we can resume later + try: + response = react_agent.invoke({"messages": input_messages}, config) + final_content = response["messages"][-1].content + output_row = row.to_dict() + output_row["row_index"] = int(index) + output_row["agent_response"] = final_content + append_jsonl(output_path, output_row) + except Exception as e: + err_row = { + "row_index": int(index), + "input": input_question, + "error": repr(e), + } + append_jsonl(output_path, err_row) + # Optional: continue to next row + continue + + print(f"[DONE] Wrote/updated: {output_path}") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py new file mode 100644 index 0000000000..19e2895518 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/create_agent_with_tools_batch.py @@ -0,0 +1,300 @@ +import argparse +import json +import os +import re +from functools import partial # <-- Import partial +from pathlib import Path + +import pandas as pd +import torch + +# LangChain and LangGraph for the ReAct Agent +from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline +from 
langgraph.checkpoint.memory import MemorySaver +from langgraph.prebuilt import create_react_agent +from src.tools import ( + ALL_TOOLS, + Check_Alarm_Status, + Check_Apply_Configuration, + Check_Element_Health, + Check_Element_Neighbors, + Check_External_Issues, + Check_Performance, + Check_remote_files, + Create_Ticket, + Execute_Remote_Action, + Orchestration_tool, + Triage_Toolkit_Tool, +) +from tqdm import tqdm +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + +def get_bound_tools(row: pd.Series) -> list: + """ + Takes a data row and returns a list of tools with the 'row' argument + pre-populated using functools.partial. + """ + bound_tools = [] + for t in ALL_TOOLS: + # Create a partial function that has the 'row' argument fixed. + # The agent will only need to provide the other arguments. + bound_tool = partial(t, row) + # LangChain tools copy metadata; we need to ensure the original + # name, description, and args_schema are preserved. + bound_tool.name = t.name + bound_tool.description = t.description + bound_tool.args_schema = t.args_schema + bound_tools.append(bound_tool) + return bound_tools + + +# The rest of your utility functions (ensure_parent_dir, etc.) remain unchanged. 
+def ensure_parent_dir(path: Path): + path.parent.mkdir(parents=True, exist_ok=True) + + +def load_processed_indices(output_path: Path) -> set: + if not output_path.exists(): + return set() + processed = set() + with output_path.open("r", encoding="utf-8") as f: + for line in f: + if not (line := line.strip()): + continue + try: + obj = json.loads(line) + if "row_index" in obj: + processed.add(int(obj["row_index"])) + except json.JSONDecodeError: + print(f"[WARNING] Skipping malformed line in output file: {line}") + return processed + + +def append_jsonl_batch(output_path: Path, records: list): + with output_path.open("a", encoding="utf-8") as f: + for record in records: + f.write(json.dumps(record, ensure_ascii=False) + "\n") + + +def _safe_str(val): + """Return a JSON-serializable string; avoid NaN/None from pandas.""" + if val is None or (isinstance(val, float) and (val != val or val == float("inf") or val == float("-inf"))): + return "NotApplicable" + return str(val) + + +def format_tool_response(string: str) -> str: + safe = _safe_str(string) + json_payload = json.dumps({"summary": safe}, ensure_ascii=False) + return f"""<tool_response>\n{json_payload}\n</tool_response>""" + + +def _parse_tool_call_json(raw: str): + """Parse tool_call content; tolerate unquoted keys (e.g.
{name: "..."}).""" + raw = raw.strip() + try: + return json.loads(raw) + except json.JSONDecodeError: + pass + # Fix unquoted keys: word followed by colon -> "word": + fixed = re.sub(r"([{,]\s*)([a-zA-Z_][a-zA-Z0-9_]*)(\s*:)", r'\1"\2"\3', raw) + try: + return json.loads(fixed) + except json.JSONDecodeError: + raise + + +def main(): + parser = argparse.ArgumentParser(description="ReAct Agent for Processing Inputs from JSONL (resumable)") + parser.add_argument("--input", required=True, help="Path to the JSONL file containing inputs") + parser.add_argument( + "--output", default="outputs/agent_responses.jsonl", help="Path to the JSONL file to append agent responses" + ) + parser.add_argument( + "--weights_dir", default="training/qwen3-32b-improved-hf", help="Local directory for fine-tuned HF weights" + ) + parser.add_argument("--tokenizer", default="Qwen/Qwen3-32B", help="Tokenizer for the model") + parser.add_argument( + "--batch_size", type=int, default=1, help="Number of rows to process before writing to disk (default: 1)" + ) + parser.add_argument( + "--gpu", + required=False, + help="Optional GPU index (e.g. 0). If omitted, device_map='auto' is used to spread the model across all available GPUs (recommended for large models).", + ) + parser.add_argument("--limit", required=False) + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + ensure_parent_dir(output_path) + skip = True + if skip and os.path.exists(output_path): + os.remove(output_path) + processed_indices = load_processed_indices(output_path) + print(f"[INFO] Found {len(processed_indices)} already processed rows. Resuming...") + + print(f"[INFO] Loading dataset from {input_path}...") + df = pd.read_json(str(input_path), lines=True) + if args.limit: + df = df.iloc[: int(args.limit)] + df["row_index"] = df.index + + unprocessed_df = df[~df.index.isin(processed_indices)] + if unprocessed_df.empty: + print("[INFO] All rows have already been processed. 
Exiting.") + return + print(f"[INFO] {len(unprocessed_df)} rows remaining to process.") + + dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16 + print(f"[INFO] Loading model and tokenizer from {args.weights_dir}...") + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) + + # Use all available GPUs by default (device_map="auto"); only pin to one GPU if --gpu is set + device_map = "auto" if not args.gpu else None + model = AutoModelForCausalLM.from_pretrained( + args.weights_dir, dtype=dtype, device_map=device_map, low_cpu_mem_usage=True + ) + if args.gpu: + model.to(f"cuda:{str(args.gpu)}") + elif torch.cuda.device_count() > 1: + print(f"[INFO] Model spread across {torch.cuda.device_count()} GPUs (device_map=auto)") + model.eval() + + hf_pipeline = pipeline( + "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, temperature=0.7, top_p=0.95 + ) + llm = HuggingFacePipeline(pipeline=hf_pipeline) + chat_llm = ChatHuggingFace(llm=llm) + + checkpointer = MemorySaver() + + output_batch = [] + react_agent = create_react_agent(chat_llm, tools=[], checkpointer=checkpointer) + for index, row in tqdm(unprocessed_df.iterrows(), total=len(unprocessed_df), desc="Processing Rows"): + # print(row) + # exit() + + # --- Step 2: Get the row-specific tools using the helper function. --- + # row_tools = get_bound_tools(row) + + config = {"configurable": {"thread_id": str(index)}} + input_question = row["input"] + system_message = row["system"] + input_messages = [{"role": "system", "content": system_message}, {"role": "user", "content": input_question}] + + try: + # We still invoke one-by-one, but the tool creation is now efficient. + + ## WE need to fix this. 
Add system message and tool calling messages + # separate user input messages + TOOLS = { + "Check_Alarm_Status": lambda args: Check_Alarm_Status(**args), + "Check_Element_Neighbors": lambda args: Check_Element_Neighbors(**args), + "Check_Element_Health": lambda args: Check_Element_Health(**args), + "Execute_Remote_Action": lambda args: Execute_Remote_Action(**args), + "Check_External_Issues": lambda args: Check_External_Issues(**args), + "Check_Apply_Configuration": lambda args: Check_Apply_Configuration(**args), + "Check_Performance": lambda args: Check_Performance(**args), + "Create_Ticket": lambda args: Create_Ticket(**args), + "Orchestration_tool": lambda args: Orchestration_tool(**args), + "Triage_Toolkit_Tool": lambda args: Triage_Toolkit_Tool(**args), + "Check_remote_files": lambda args: Check_remote_files(**args), + } + + end = False + iterations = 0 + while not end: + iterations += 1 + if iterations > 55: + break + + response = react_agent.invoke( + {"messages": input_messages}, config={"configurable": {**config["configurable"]}} + ) + + # Add a while statement to parse in tools or find the conclusion + final_content = response["messages"][-1].content + # parsed = tokenizer.parse_response(out_text) + # print(parsed) + # exit() + last_message_start = final_content.rfind("<|im_start|>assistant") + # last_message_end = final_content.find("<|im_end|>") + tool_calls = re.search( + r"<tool_call>(.*?)</tool_call>", + final_content[last_message_start + len("<|im_start|>assistant") :], + flags=re.DOTALL, + ) + # json_obj = react_to_json(response) + # print(json.dumps(json_obj, indent=2)) + + # print(final_content[last_message_start + len("<|im_start|>assistant"):][-20:]) + if tool_calls: + tool_called = tool_calls.group(1) + try: + data = _parse_tool_call_json(tool_called) + except json.JSONDecodeError as e: + print(f"\n[WARN] Row {index}: could not parse tool_call JSON: {e}") + break + tool_name = data.get("name") or data.get("function", {}).get("name") + if not tool_name: +
print(f"[WARN] No tool name in parsed data: {data}") + break + arguments = data.get("arguments", data.get("parameters", {})) + if not isinstance(arguments, dict): + arguments = {} + if tool_name not in TOOLS: + print(f"[WARN] Unknown tool: {tool_name}") + continue + arguments["row"] = row + result = TOOLS[tool_name](arguments) + input_messages.append( + { + "role": "assistant", + "content": final_content[last_message_start + len("<|im_start|>assistant") :], + } + ) + input_messages.append({"role": "user", "content": format_tool_response(_safe_str(result))}) + print(f"✅ {tool_name} -> {result}") + + else: + input_messages.append( + { + "role": "assistant", + "content": final_content[last_message_start + len("<|im_start|>assistant") :], + } + ) + end = True + # print("Conclusion Arrived") + + # if found conclusion, find the close code as well as the final_content + + # the final content needs to be repeatedly added. + + output_row = row.to_dict() + output_row["agent_response"] = input_messages + output_batch.append(output_row) + + except Exception as e: + print(f"\n[ERROR] An error occurred on row {index}: {e}") + err_row = row.to_dict() + err_row["row_index"] = int(index) + err_row["input"] = input_question + err_row["error"] = repr(e) + if input_messages: + err_row["agent_response"] = input_messages + output_batch.append(err_row) + + if len(output_batch) >= args.batch_size: + append_jsonl_batch(output_path, output_batch) + output_batch = [] + + if output_batch: + append_jsonl_batch(output_path, output_batch) + + print(f"[DONE] All responses have been written to: {output_path}") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py b/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py new file mode 100644 index 0000000000..28b025f4f7 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/evaluation_with_judge.py @@ -0,0 +1,149 @@ +import argparse + +import
pandas as pd +from bert_score import score as bert_score +from langchain_nvidia_ai_endpoints import ChatNVIDIA +from rouge_score import rouge_scorer +from tqdm import tqdm + +# Parse arguments for input JSONL path +parser = argparse.ArgumentParser(description="Evaluation Pipeline for Agent Responses") +parser.add_argument("input_jsonl", help="Path to agent_responses.jsonl containing expected_answer and agent_response") +parser.add_argument("--output_file", required=True, help="Path to the output JSONL file") +parser.add_argument( + "--nim_url", default="http://localhost:8000/v1", help="Base URL for NIM API (default: http://localhost:8000/v1)" +) +parser.add_argument("--model", default="openai/gpt-oss-120b", help="NIM model name (default: openai/gpt-oss-120b)") +args = parser.parse_args() + +# Load the input JSONL +print(f"Loading input JSONL: {args.input_jsonl}") +df = pd.read_json(args.input_jsonl, lines=True) +print(f"Loaded {len(df)} rows") + + +# Set up ChatNVIDIA LLM +llm = ChatNVIDIA(base_url=args.nim_url, model=args.model) + +# Initialize ROUGE scorer +rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True) + + +# Function for LLM-as-judge evaluation +def llm_judge_final_output(expected, generated): + prompt = f""" + Evaluate how well the generated resolution at the end matches the expected resolution on a scale of 1-5: + - 5: Perfect match in content. + - 4: High similarity, minor differences. + - 3: Moderate match, key elements present but some deviations. + - 2: Low match, major differences. + - 1: No match. + + Expected: {expected} + Generated: {generated} + + Provide only the score (1-5) and a brief reasoning. + Format: Score: X\nReasoning: ...
+ """ + # response = llm.invoke(prompt) + # output = response.content.strip() + # score = int(output.split("Score:")[1].split("\n")[0].strip()) + # reasoning = output.split("Reasoning:")[1].strip() + # return score, reasoning + try: + response = llm.invoke(prompt) + output = response.content.strip() + score = int(output.split("Score:")[1].split("\n")[0].strip()) + reasoning = output.split("Reasoning:")[1].strip() + return score, reasoning + except Exception as e: + print(f"Error in LLM judge: {e}") + return 0, "Error" + + +def llm_judge_reasoning(expected, generated): + prompt = f""" + Evaluate how well the generated reasoning is, including tools used, resolution matches the expected resolution on a scale of 1-5: + - 5: Perfect match in content, structure, and actions. + - 4: High similarity, minor differences. + - 3: Moderate match, key elements present but some deviations. + - 2: Low match, major differences. + - 1: No match. + + Expected: {expected} + Generated: {generated} + + Provide only the score (1-5) and a brief reasoning. + Format: Score: X\nReasoning: ... 
+ """ + # response = llm.invoke(prompt) + # output = response.content.strip() + # score = int(output.split("Score:")[1].split("\n")[0].strip()) + # reasoning = output.split("Reasoning:")[1].strip() + # return score, reasoning + try: + response = llm.invoke(prompt) + output = response.content.strip() + score = int(output.split("Score:")[1].split("\n")[0].strip()) + reasoning = output.split("Reasoning:")[1].strip() + return score, reasoning + except Exception as e: + print(f"Error in LLM judge: {e}") + return 0, "Error" + + +# Loop over rows and evaluate +evaluations = [] +for index, row in tqdm(df.iterrows(), total=len(df)): + conclusion_expected = row["expected_answer"] + reasoning_expected = row["output"] + generated = row["agent_response"] + + if "Thought 1:" in generated: + if generated.count("Thought 1:") == 1: + question_part, reasoning_tail = generated.split("Thought 1:", -1) + question_part = question_part.strip() + generated_reasoning_part = "Thought 1:" + reasoning_tail + elif generated.count("Thought 1:") >= 2: + # Find where the 2nd "Thought 1:" starts + second_idx = generated.find("Thought 1:", generated.find("Thought 1:") + 1) + # Split into question and reasoning at the 2nd occurrence + question_part = generated[:second_idx].strip() + generated_reasoning_part = generated[second_idx:].strip() + else: + generated_reasoning_part = generated + + # Compute ROUGE + rouge_scores = rouge.score(conclusion_expected + reasoning_expected, generated_reasoning_part) + rouge1 = rouge_scores["rouge1"].fmeasure + rougeL = rouge_scores["rougeL"].fmeasure + + # Compute BERTScore (requires torch) + P, R, F1 = bert_score( + [generated_reasoning_part], [conclusion_expected + reasoning_expected], lang="en", verbose=False + ) + bert_f1 = F1.mean().item() + + # LLM Judge + reasoning_judge_score, reasoning_judge_reason = llm_judge_reasoning(reasoning_expected, generated_reasoning_part) + conclusion_judge_score, conclusion_judge_reason = llm_judge_final_output( + 
conclusion_expected, generated_reasoning_part + ) + + # Add to output row + output_row = row.to_dict() + output_row["rouge1"] = rouge1 + output_row["rougeL"] = rougeL + output_row["bertscore_f1"] = bert_f1 + output_row["llm_reasoning_judge_score"] = reasoning_judge_score + output_row["llm_reasoning_judge_reasoning"] = reasoning_judge_reason + output_row["llm_conclusion_judge_score"] = conclusion_judge_score + output_row["llm_conclusion_judge_reasoning"] = conclusion_judge_reason + if index == 1: + print(output_row) + evaluations.append(output_row) + +# Save to output JSONL +output_df = pd.DataFrame(evaluations) +output_df.to_json(args.output_file, orient="records", lines=True) +print(f"Evaluations saved to {args.output_file}") diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py b/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py new file mode 100644 index 0000000000..e05846f76d --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/problem_code_evaluation.py @@ -0,0 +1,127 @@ +import argparse +import json +import re + +from tqdm import tqdm + +# Parse arguments for input JSONL path +parser = argparse.ArgumentParser(description="Evaluation Pipeline for Agent Responses") +parser.add_argument("input_jsonl", help="Path to agent_responses.jsonl containing expected_answer and agent_response") +args = parser.parse_args() + +# Map free-form / synonym phrases (lowercase) to canonical close codes (lowercase) for matching. +# Expected is from ground truth (e.g. "Resolved", "Issue Corrected"); we check if response +# contains the expected phrase OR any synonym that maps to the same meaning.
+CLOSE_CODE_SYNONYMS = { + "resolved": [ + "resolved", + "commercial power restored", + "power restored", + "cold reboot", + "incident closed", + "issue resolved", + "activity completed", + ], + "issue corrected": [ + "issue corrected", + "solved remotely", + "solved remotely (permanently)", + "cleared in testing", + "configuration corrected", + "software fix", + "network fix", + "performance improvement", + "activity completed", + ], + "activity completed": ["activity completed", "cleared in testing", "resolved"], + "ru reset": ["ru reset", "reset ru"], + "other": ["other"], +} + + +def normalize_close_code(s: str) -> str: + """Return lowercase, stripped; empty if missing.""" + if not s or not isinstance(s, str): + return "" + return s.strip().lower() + + +def _acceptable_phrases_for_expected(expected_norm: str): + """Return list of phrases (lowercase) that count as a match for this expected close code.""" + if expected_norm in CLOSE_CODE_SYNONYMS: + return [expected_norm] + list(CLOSE_CODE_SYNONYMS[expected_norm]) + for canonical, synonyms in CLOSE_CODE_SYNONYMS.items(): + if expected_norm == canonical or expected_norm in synonyms: + return [canonical] + list(synonyms) + return [expected_norm] + + +def response_matches_expected(response_lower: str, expected_close_code: str) -> bool: + """True if response contains expected close code or an accepted synonym.""" + expected_norm = normalize_close_code(expected_close_code) + if not expected_norm: + return False + acceptable = _acceptable_phrases_for_expected(expected_norm) + return any(phrase in response_lower for phrase in acceptable) + + +print(f"Loading input JSONL: {args.input_jsonl}") + +# Counters +correct = 0 +incorrect = 0 +failed = 0 +total = 0 + +# Count total lines (for tqdm) +with open(args.input_jsonl, "r", encoding="utf-8") as f: + total_lines = sum(1 for _ in f) + +# Process JSONL line-by-line +with open(args.input_jsonl, "r", encoding="utf-8") as f: + for line in tqdm(f, total=total_lines): + try: + 
row = json.loads(line) + + agent_response = row.get("agent_response") + expected = row.get("expected", "") + + if agent_response is None or not isinstance(agent_response, list) or len(agent_response) == 0: + print("Error: missing or empty 'agent_response'") + failed += 1 + total += 1 + continue + + # Extract close code from expected + m = re.search(r"Close Code:\s*\[(.*?)\]", expected) if expected else None + close_code = m.group(1).strip() if m else None + + if not close_code: + failed += 1 + total += 1 + continue + + # Take the model's last message + last_msg = agent_response[-1] + content = last_msg.get("content") if isinstance(last_msg, dict) else str(last_msg) + response = (content or "").lower() + + # Slice from "close code..." if present + idx = response.rfind("close code") + if idx >= 0: + response = response[idx:] + + if response_matches_expected(response, close_code): + print(f"✅ Real Close code: {close_code}, Response: {response[:120]}...") + correct += 1 + else: + incorrect += 1 + print(f"❌ Real Close code: {close_code}, Response: {response[:120]}...") + + except Exception as e: + print("Error:", e) + failed += 1 + + total += 1 + +print(f"Total: {total}, correct: {correct}, failed: {failed}, incorrect: {incorrect}") diff --git a/recipes/noc-reasoning-agent/scripts/evaluation/score.py b/recipes/noc-reasoning-agent/scripts/evaluation/score.py new file mode 100644 index 0000000000..f3ca52d3a1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/evaluation/score.py @@ -0,0 +1,24 @@ +import argparse + +import pandas as pd + + +def main(): + parser = argparse.ArgumentParser(description="Compute overall evaluation scores from JSONL") + parser.add_argument("input_jsonl", help="Path to JSONL file containing rouge, bertscore, and judge scores") + args = parser.parse_args() + + # Load the JSONL into a DataFrame + df = pd.read_json(args.input_jsonl, lines=True) + + # Compute averages for the numeric columns + metrics = ["rouge1", "rougeL", "bertscore_f1", 
"llm_reasoning_judge_score", "llm_conclusion_judge_score"] + averages = df[metrics].mean() + + print("Overall Scores:") + for metric, value in averages.items(): + print(f"{metric}: {value:.4f}") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py b/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py new file mode 100644 index 0000000000..8cb0ff3cfd --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/filtering/filter_rows.py @@ -0,0 +1,205 @@ +import argparse + +import pandas as pd + +# Legacy u_problem_code values (original pipeline) +ALLOWED_PROBLEM_CODES_LEGACY = [ + "Service-off", + "Degraded Prach", + "Offline / Unreachable", + "Disabled Cells", + "Node Down", + "Site Not Scrolling", + "Sleepy Cell", + "VM is in not ready state", + "Prach 0", + "N2 Link Down", + "ueconmgr pod restarted", + "CSR Not Reachable", + "Circuit Down", + "Link Down", + "GPS Sync", + "MTA Alert", +] + +# Synthetic fault_category (workflow IDs from telco_synthetic_from_scratch). 
+ALLOWED_PROBLEM_CODES_SYNTHETIC = [ + # Power / Environment + "power_ac_failure_recovery", + "power_dc_rectifier_recovery", + "power_battery_discharge_response", + "power_generator_failure_recovery", + "env_high_temperature_response", + "env_hvac_fault_recovery", + "env_water_intrusion_response", + "env_battery_temperature_response", + "env_cabinet_intrusion_response", + # RAN + "ran_software_upgrade_recovery", + "ran_cell_site_down_recovery", + "ran_interference_mitigation", + "ran_speed_complaint_resolution", + "ran_voice_quality_resolution", + "ran_sector_outage_recovery", + "ran_prb_availability_resolution", + "ran_cell_overshooting_correction", + "ran_rru_communication_recovery", + "ran_dropped_calls_resolution", + "ran_parameter_correction", + "ran_antenna_tilt_recovery", + "ran_vswr_alarm_resolution", + "ran_handover_failure_resolution", + "ran_backhaul_degradation_resolution", + "ran_cell_congestion_management", + "ran_device_issue_resolution", + # Compute + "compute_vm_failure_recovery", + "compute_container_crash_recovery", + "compute_orchestrator_recovery", + "compute_image_pull_recovery", + "compute_k8s_node_recovery", + "compute_storage_failure_recovery", + "compute_cnf_pod_recovery", + "compute_resource_exhaustion_resolution", + # Transport + "transport_routing_flap_resolution", + "transport_microwave_degradation_response", + "transport_interface_errors_resolution", + "transport_packet_loss_resolution", + # Signaling + "signaling_routing_failure_recovery", + "signaling_delay_resolution", + "signaling_s1_n2_recovery", + "signaling_sip_registration_recovery", +] + +# Combined: use for filtering so both legacy and synthetic data are supported. 
+ALLOWED_PROBLEM_CODES = ALLOWED_PROBLEM_CODES_LEGACY + ALLOWED_PROBLEM_CODES_SYNTHETIC + + +ALLOWED_CLOSE_CODES = [ + "Commercial Power Restored", + "Power Restored", + "Network Fix", + "Cleared In Testing", + "Solved Remotely (Permanently)", + "Reset RU", + "Fiber Repaired/Replaced", + "Cold Reboot", + "Performance Improvement", + "Configuration corrected", + "Software Fix", + "Delete M-Plane and F1C IP", + "RU Reset", + "Other", + "Restart MVRP Services", + "Activity Completed", + # Synthetic resolution_method values (telco_synthetic_from_scratch) + "Resolved", + "Issue Corrected", + "Service Restored", + "Pending Resolution", + "Partial Resolution", +] + + +def filter_auto(input_csv, output_csv): + """Filter incident data and save results to a new CSV.""" + try: + df = pd.read_csv(input_csv, encoding="latin1") + except Exception: + df = pd.read_csv(input_csv) + + # Synthetic schema: resolution_method, resolution_summary + res_col = df["resolution_method"] if "resolution_method" in df.columns else df["close_code"] + notes_col = df["resolution_summary"] if "resolution_summary" in df.columns else df["close_notes"] + mask_auto_recovered = res_col.astype(str).str.contains("Auto Recover", case=False, na=False) + mask_event_cleared = notes_col.astype(str).str.contains("No Action Taken", case=False, na=False) + mask_event_cleared_ar = notes_col.astype(str).str.contains("auto recovered", case=False, na=False) + mask_event_cleared_ar_d = notes_col.astype(str).str.contains("auto-recovered", case=False, na=False) + mask_remove = mask_auto_recovered | mask_event_cleared | mask_event_cleared_ar | mask_event_cleared_ar_d + + filtered_df = df[~mask_remove] + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Auto Rows removed: {mask_remove.sum()}") + + +def filter_soft_solve(input_csv, output_csv): + """Filter incident data and save results to a new CSV.""" + try: + df = pd.read_csv(input_csv, 
encoding="latin1") + except Exception: + df = pd.read_csv(input_csv) + + soft_solve_rows = df[df["solved_category"] == "soft_solve"] + + soft_solve_rows.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(soft_solve_rows)}") + print(f"Rows removed: {len(df) - len(soft_solve_rows)}") + + +def filter_problem_codes(input_csv, output_csv): + try: + df = pd.read_csv(input_csv, encoding="latin1") + except Exception: + df = pd.read_csv(input_csv) + + # Synthetic schema: fault_category + pc_col = df["fault_category"] if "fault_category" in df.columns else df["u_problem_code"] + filtered_df = df[pc_col.isin(ALLOWED_PROBLEM_CODES)] + + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Rows removed: {len(df) - len(filtered_df)}") + + +def filter_close_codes(input_csv, output_csv): + try: + df = pd.read_csv(input_csv, encoding="latin1") + except Exception: + df = pd.read_csv(input_csv) + + # Synthetic schema: resolution_method + res_col = df["resolution_method"] if "resolution_method" in df.columns else df["close_code"] + filtered_df = df[res_col.isin(ALLOWED_CLOSE_CODES)] + + filtered_df.to_csv(output_csv, index=False) + print(f"Original rows: {len(df)}") + print(f"New rows: {len(filtered_df)}") + print(f"Rows removed: {len(df) - len(filtered_df)}") + + +def main(): + parser = argparse.ArgumentParser(description="Filter incident CSV data based on specific rules.") + parser.add_argument( + "--input_csv", + type=str, + default="data/anonymized-Incidents_Last_6_Months.csv", + help="Path to the input CSV file containing incident data. Default: data/anonymized-Incidents_Last_6_Months.csv", + ) + parser.add_argument( + "--output_csv", + type=str, + default="data/filtered_file.csv", + help="Path to save the filtered CSV file. 
Default: data/filtered_file.csv", + ) + parser.add_argument("--filter_type", type=str, default="auto") + args = parser.parse_args() + + # Run the filtering process + if args.filter_type == "auto": + filter_auto(args.input_csv, args.output_csv) + elif args.filter_type == "soft_solve": + filter_soft_solve(args.input_csv, args.output_csv) + elif args.filter_type == "problem_codes": + filter_problem_codes(args.input_csv, args.output_csv) + elif args.filter_type == "close_codes": + filter_close_codes(args.input_csv, args.output_csv) + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py b/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py new file mode 100644 index 0000000000..3e46ccdfcd --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/filtering/match_keywords.py @@ -0,0 +1,353 @@ +import argparse +import re + +import pandas as pd +from tqdm import tqdm + + +def get_close_codes(): + """ + Parses the structured close code data into a dictionary of categories and keywords. 
+ """ + # This list contains your full set of close codes and their types (HW, SW, NA) + raw_data = """ + close_code,Type + Other,NA + Alarm Cleared,N/A + Auto Recovered,NA + Auto Recover,NA + Configuration Fixed,SW + Power Restored,N/A + IP Configuration Corrected,SW + Commercial Power Restored,SW + Cold Reboot,SW + Reset RU,SW + Configuration corrected,SW + Vendor Hardware Replaced,HW + Auto-Recovered,SW + Auto Reloaded,SW + Auto Restart,SW + Reset DU,SW + Software Fix,SW + Network Fix,N/A + MOP Corrected,SW + RU Reset,SW + WCS Fix,N/A + Restart VM,SW + Tool Fix,SW + Fiber Repaired/Replaced,HW + Change Configuration / parameters,SW + Corrected Software Path,SW + Reset ORU,SW + Antenna Cable Check,HW + Rebooted Chassis,SW + Performance Improvement,SW + Power Supply Restored,N/A + Corrected Configuration Error,SW + Cell Enable / Disable,SW + Vendor Hardware Repaired,HW + Tilt Changes,SW + Activity Completed,N/A + No Trouble Found,NA + Closed/Resolved by Caller,SW + Corrected Config Mismatch,SW + Restarted Chassis,SW + Not Within Coverage,NA + Restarted Process,SW + SFP Cable cleaned,HW + GNSS Fix,SW + KPI Verified,SW + DU Reboot,SW + Replaced Antenna,HW + Site Restored,N/A + Issue Fixed,N/A + Corrected wiring,HW + Replaced controller board,HW + Firmware upgrade,SW + GPS Cone Replacement,HW + Non-RF Issue,NA + Poor Indoor Coverage,N/A + Full Time Roaming,N/A + Rebooted NNI,SW + Hardware Replaced,HW + Cleaned Fiber,HW + Restarted ueconmgr,SW + Water Remediation,HW + Restart MVRP Services,SW + Updated Configuration,SW + SFP Replaced,SW + Payment completed,NA + Delete DU Pod,SW + New Site Deployment,HW + Device Changed,HW + Restarted gnbmgr,SW + Replaced SFP,HW + Sotware Bug Fix,SW + Cell locked / Unlocked,SW + Delete M-Plane and F1C IP,SW + Rebooted ESXi Host,SW + Replaced RU,HW + PowerCycle CSR,N/A + CU reset,HW + No Impact,NA + Cleared Disk space,SW + Replaced Hardware,HW + Unnotified Carrier Maintenance,NA + Restarted Network Card,SW + Replaced BMC Server,HW + 
Delete PTP Pod,SW + Replaced complete unit,HW + Delete MTCIL Pod,SW + Firmware/Software Configuration,SW + DU Replacement,HW + RET Adjustment,SW + Replaced Power Supply,HW + improved by Optimization,SW + Reinstantiated Site/CICD,N/A + Cancelled,NA + Replaced/repaired external cable,HW + Rolled Back Change,SW + Fiber/Connector Repaired/Replaced,HW + POD Reboot,SW + GPS Replaced,HW + GPS Cable replaced,HW + Application Reinstantiated,SW + Copper/Optic Cable Replaced,HW + Access Realignment,SW + Pods Restarted,SW + Memory adjusted,SW + DU Reset,SW + Delete Core Files,SW + Fiber Cable replaced,HW + BMC Reset,SW + Repaired connector,HW + Fiber Replaced,HW + Device Configuration corrected,SW + Power Equipment Repaired,HW + RU Replacement,HW + Hardware Replace,HW + Replaced Optical Card,HW + Re provisioned,SW + Replaced/Reseated Cabled,SW + Hybrid replaced,N/A + DU,N/A + Replaced Fan Module,HW + Fiber Replacement,HW + Replaced CSR,HW + Rebooted NID,SW + Sleepy CU-PODs restarted,SW + Electronic components replaced,HW + SFP/cable Replaced,HW + Repaired Cable/Connector,HW + Cleared by Government Officials,NA + Configuration Corrected/Updated,SW + Replaced GPS antenna,HW + Deleted Prior EC2,SW + Host Restart,SW + Software Stability,SW + Rerun CICD Pipeline,SW + Repalced the connector cables,HW + CSR Power Supply Replaced,HW + Restarted Manually,HW + Corrected plungers,HW + Restart Application,SW + RU/DU Restart,SW + Pod Reset,SW + CUCP Pod Swtichover to worker node,SW + HVAC repaired/replaced,HW + Cleaned Fan Filter,HW + SFP Cable Replaced,HW + Others,N/A + Solved Remotely (Permanently),SW + Replaced Network Card,HW + FE Rebooted,SW + Network Switch,N/A + Reset RET's,SW + Replaced Attenuator,HW + Power Cycle,N/A + Rollout Restart,N/A + Replaced Chassis,HW + Re-Deployment,SW + Initialized PODS,SW + Replaced NID,HW + CPU Replaced,HW + Corrected BGP Configuration,SW + Replaced the connector cables,HW + bccsvc Restart,SW + Rebooted Network Card,SW + Delete MVRP Logs,SW + DC Power 
Bounce,SW + POD Reboot - USM,SW + BMC Power Supply Replaced,HW + Reboot Dpp Pod,SW + Capacity Increase,SW + Sleepy CUCP restarted,SW + Hardware Restarted,SW + Cleared/Reformatted Disk 0,SW + Replaced /repaired internal cable,HW + Groundbars/Copper replaced,HW + Improved by Parameter Change,SW + Adding sdaas_ip incorrect in BPI or infoblox,SW + Fiber Sweep,HW + Restarted ngclientiwf,SW + Replaced fans,HW + Reconfigured BMC,SW + eCPRI Fiber Replaced,HW + Generator Deployement,HW + Restored Tripped Breaker,HW + Batteries Replaced,HW + Not a Sleepy CU,SW + Cleared/Reformatted Disk 1,SW + Reserved / Unreserved,N/A + Door swap,HW + RET motor replacement,HW + Fiber Connectivity Restored,N/A + Initialize Mplane,SW + Replaced polyphaser,HW + RU Software Reset,SW + Barred/ Not Barred,SW + Cabling Replaced,HW + Replaced RET cable,HW + Restarted sctpe1iwf,SW + Activated OCNS,SW + VM Reset,SW + Restarted bccsvc,SW + SMF restart,SW + RU software bug fix,SW + Cleared In Testing,NA + Solar Power Restored,SW + Restarted sctpf1iwf,SW + Updated Lat/long settings,SW + Restarted gwsvc,SW + Cabinet Replacement,HW + Disk Clean,SW + NID Replaced,HW + """ + + from io import StringIO + + codes_df = pd.read_csv(StringIO(raw_data)) + + categorized_codes = { + "Hardware": [code.lower().replace(" ", "") for code, type in codes_df.values if type == "HW"], + "Software": [code.lower().replace(" ", "") for code, type in codes_df.values if type == "SW"], + } + return categorized_codes + + +# Synthetic data uses different resolution_method values. Map them to physical_intervention vs soft_solve. 
+SYNTHETIC_RESOLUTION_TO_CATEGORY = { + "physical_intervention": [ + "field dispatch required", + "escalated", + ], + "soft_solve": [ + "resolved", + "issue corrected", + "service restored", + "partial resolution", + "pending resolution", + ], +} + + +def find_keyword_matches(row, pattern, keywords): + """Finds which specific keywords from a list match within a DataFrame row.""" + # Synthetic schema uses resolution_method (same meaning as close_code) + resolution_method = row.get("resolution_method", row.get("close_code", "")) + if resolution_method and str(resolution_method).lower().replace(" ", "") in keywords: + return True + + return False + + +def categorize_incidents_by_close_code(df: pd.DataFrame) -> pd.DataFrame: + """ + Categorizes incidents into 'Hardware' or 'Software' based on a structured list of close codes. + It prioritizes Hardware matches over Software matches. + """ + # Ensure required output columns exist + if "solved_category" not in df.columns: + df["solved_category"] = "Uncategorized" + if "solved_reason" not in df.columns: + df["solved_reason"] = "" + + # Get the categorized lists of keywords + close_codes = get_close_codes() + hw_keywords = close_codes["Hardware"] + sw_keywords = close_codes["Software"] + + # Precompile regex patterns for performance. This matches any of the phrases. 
+ hw_pattern = re.compile("|".join(re.escape(k) for k in hw_keywords), flags=re.IGNORECASE) + sw_pattern = re.compile("|".join(re.escape(k) for k in sw_keywords), flags=re.IGNORECASE) + + # Convert all data to string type for safe searching + str_df = df.astype(str) + + print("Categorizing incidents based on close codes...") + for idx, row in tqdm(str_df.iterrows(), total=len(df)): + # Prioritize Hardware: Check for HW keywords first + hw_matches = find_keyword_matches(row, hw_pattern, hw_keywords) + if hw_matches: + df.at[idx, "solved_category"] = "physical_intervention" + continue # Move to the next row once categorized + + # If no HW keywords, check for SW keywords + sw_matches = find_keyword_matches(row, sw_pattern, sw_keywords) + if sw_matches: + df.at[idx, "solved_category"] = "soft_solve" + continue + + # Synthetic data: resolution_method values not in legacy close-code list + resolution_method = row.get("resolution_method", row.get("close_code", "")) + if resolution_method: + rm_normalized = str(resolution_method).strip().lower() + for category, values in SYNTHETIC_RESOLUTION_TO_CATEGORY.items(): + if rm_normalized in values: + df.at[idx, "solved_category"] = category + break + + hw_count = (df["solved_category"] == "physical_intervention").sum() + sw_count = (df["solved_category"] == "soft_solve").sum() + un_count = (df["solved_category"] == "Uncategorized").sum() + + print("\n--- Categorization Complete ---") + print(f"Total rows processed: {len(df)}") + print(f"Physical Intervention: {hw_count}") + print(f"Soft Solve: {sw_count}") + print(f"Uncategorized: {un_count}") + + return df + + +if __name__ == "__main__": + # Example: Adjust these paths to your actual data files + + parser = argparse.ArgumentParser(description="Filter incident CSV data based on specific rules.") + parser.add_argument( + "--input_csv", + type=str, + default="data/anonymized-Incidents_Last_6_Months.csv", + help="Path to the input CSV file containing incident data. 
Default: data/anonymized-Incidents_Last_6_Months.csv", + ) + parser.add_argument( + "--output_csv", + type=str, + default="data/categorized_incidents.csv", + help="Path to save the filtered CSV file. Default: data/categorized_incidents.csv", + ) + args = parser.parse_args() + + input_file = args.input_csv + output_file = args.output_csv + + print(f"Loading data from {input_file}...") + # Use 'latin1' encoding if your CSV has special characters + df = pd.read_csv(input_file, encoding="latin1") + + # Run the categorization function + df = categorize_incidents_by_close_code(df) + + # Save the updated dataframe to a new CSV file + df.to_csv(output_file, index=False) + print(f"\nCategorized data saved to: {output_file}") diff --git a/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py b/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py new file mode 100644 index 0000000000..0589aeb1b6 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/ns_pipelines/generate_synthetic_data.py @@ -0,0 +1,66 @@ +import argparse +import os + +from nemo_skills.pipeline.cli import generate, wrap_arguments + + +def generate_synthetic_data(args, cluster, num_gpus, step=None, input_format_file=None): + os.makedirs("outputs/sdg_reason", exist_ok=True) + generate( + ctx=wrap_arguments( + f"++prompt_config=/workspace/data/prompt_reasoning.yaml " + f"++inference.temperature={args.temperature} " + f"++inference.tokens_to_generate={args.tokens_to_generate} " + f"++code_execution=false " + f"++skip_filled=false " + f"++use_completions_api=true " + f"++input_file={input_format_file} " + ), + cluster=cluster, + server_type="vllm", + input_file=input_format_file, + output_dir="/workspace/outputs/sdg_reason/", + # output_dir=f"/workspace/outputs/sdg_reason/step_{step}", + expname="incident-generation", + model="openai/gpt-oss-120b", + rerun_done=True, + server_gpus=num_gpus, + ) + + print(f"Finished generating step {step}") + + +def 
generate_synthetic_data_oss_gpt(args, cluster, num_gpus): + pass + + +def main(): + parser = argparse.ArgumentParser(description="Generate synthetic data using Qwen model") + parser.add_argument("--temperature", type=float, default=0.6, help="Inference temperature (default: 0.6)") + parser.add_argument( + "--tokens_to_generate", type=int, default=8192, help="Number of tokens to generate (default: 8192)" + ) + # parser.add_argument('--gpus', type=int, default=detect_num_gpus(), + # help='Number of GPUs to use for generation') + parser.add_argument("--gpus", type=int, default=8, help="Number of GPUs to use for generation") + parser.add_argument("--llm", type=str, default="qwen2.5-32b-instruct", help="The LLM to use for generation") + parser.add_argument( + "--num_gpus", type=int, default=8, help="Number of GPUs to use for the inference server (default: 8)" + ) + + args = parser.parse_args() + cluster = "local" + + num_gpus = args.num_gpus + print(f"Using {num_gpus} GPUs (specified via --num_gpus)") + + if args.llm == "qwen2.5-32b-instruct": + generate_synthetic_data( + args, cluster, num_gpus, step=1, input_format_file="/workspace/outputs/sdg/formatted_output.json" + ) + else: + generate_synthetic_data_oss_gpt(args, cluster, num_gpus) + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py b/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py new file mode 100644 index 0000000000..87d90d9b2f --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/ns_pipelines/prepare_react_agent.py @@ -0,0 +1,152 @@ +import argparse +import json +import random +import re + +import yaml +from src.tools import ALL_TOOLS_STRING + + +def extract_number_from_input(input_text): + """ + Extracts the incident identifier from the 'input' field using regex. + Supports synthetic IDs (e.g. INCME-100001) and legacy (INCWLS...). + """ + # Match any non-whitespace after "Number:" (e.g. 
INCME-100001, INCWLS0873337) + match = re.search(r"Number:\s*(\S+)", input_text) + if match: + return match.group(1) + return None + + +def get_tools(text): + matches = {} + + # Find all ... blocks + + tool_calls = re.findall(r"(.*?)", text, flags=re.DOTALL) + tool_response = re.findall(r"(.*?)", text, flags=re.DOTALL) + # print(tool_calls) + assert len(tool_calls) == len(tool_response) + for i in range(len(tool_calls)): + # try: + tool_block = tool_calls[i] + response_block = tool_response[i] + # Extract the JSON portion inside the tags + tool_json_str = tool_block.strip() + + tool_data = json.loads(tool_json_str) + response = response_block.strip() + tool_name = tool_data["name"] + arguments = tool_data["arguments"] + + matches[tool_name] = {"arguments": arguments, "response": response} + + # except json.JSONDecodeError as e: + # print(f"Skipping invalid JSON: {e}") + + if not matches: + # print("No tools!") + return None, None + # print(matches) + return matches + + +def main(file1_path, file2_path, prompt_config, output_path="output.jsonl"): + # Load first JSONL: keyed by 'number' (extracted if needed) + data1 = {} + with open(prompt_config, "r") as f: + prompt_template = yaml.safe_load(f) + + system_prompt = prompt_template["system"] + with open(file1_path, "r", encoding="utf-8") as f1: + for line in f1: + line = line.strip() + if line: + try: + d = json.loads(line) + number = d.get("incident_identifier", d.get("number")) + if "Close Code: [" in d["response"]: + # response + # print(d["number"]) + # print(d["response"]) + matches = get_tools(d["initial_background"]) + if matches == (None, None): + print(f"No tools for incident {number}, skipping") + continue + # print(data1[number]) + d["tool_matches"] = matches + formatted_prompt = prompt_template["user"].format(**d) + if formatted_prompt.endswith("\n\n"): + formatted_prompt = formatted_prompt[: -len("\n\n")] + # print(system_prompt) + d["formatted_input"] = formatted_prompt + data1[number] = d + + # num = 
d.get('number') or extract_number_from_input(d.get('input', '')) + # if num: + # data1[num] = d + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file1: {e}") + + # Load second JSONL: keyed by 'number' + data2 = {} + with open(file2_path, "r", encoding="utf-8") as f2: + for line in f2: + line = line.strip() + if line: + try: + d = json.loads(line) + input_string = d["input"] + output_string = d["output"] + match = re.search(r"Number:\s*(\S+)", input_string) + if match: + number = match.group(1) + else: + raise ValueError("No incident identifier match found in input") + data2[number] = [input_string, output_string] + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file2: {e}") + + # Build consolidated results for matching numbers + + results = [] + for num in data1.keys(): + if num in data2: + used_tools = data1[num]["tool_matches"] + + consolidated = {} + tools = ALL_TOOLS_STRING + if used_tools is None: + print("no tools!") + else: + for tool in tools: + if tool in used_tools: + consolidated[tool] = used_tools[tool]["response"] + # print(consolidated) + # print(used_tools) + + consolidated["system"] = system_prompt + consolidated["input"] = data1[num]["formatted_input"] + consolidated["expected"] = data2[num][1] + # consolidated["output"] = + + results.append(consolidated) + + random.shuffle(results) + with open(output_path, "w", encoding="utf-8") as out_file: + for res in results: + out_file.write(json.dumps(res) + "\n") + print(f"Consolidated output written to {output_path}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Consolidate data from two JSONL files.") + parser.add_argument("file1", help="Path to the first JSONL file (with input, output, expected_answer)") + parser.add_argument("file2", help="Path to the second JSONL file (with check answers)") + parser.add_argument("--prompt_config", default="data/prompts/prompt_incident.yaml") + parser.add_argument( + "--output", 
default="output.jsonl", help="Path to the output JSONL file (default: output.jsonl)" + ) + args = parser.parse_args() + main(args.file1, args.file2, args.prompt_config, args.output) diff --git a/recipes/noc-reasoning-agent/scripts/tools.py b/recipes/noc-reasoning-agent/scripts/tools.py new file mode 100644 index 0000000000..7b3342ab64 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/tools.py @@ -0,0 +1,142 @@ +import pandas as pd + + +def _safe_tool_val(val, default: str = "NotApplicable") -> str: + """Return a string safe for JSON; None, NaN, and infinite values become default.""" + if val is None: + return default + if isinstance(val, float) and (val != val or val == float("inf") or val == float("-inf")): + return default + return str(val) + + +def Check_Alarm_Status(row: pd.Series, site_or_element_id: str) -> str: + """ + Retrieves current alarm details, severity, and active time for a given site or network element. + + Input: site_or_element_id (e.g., 'Site-A', 'DU-123') + Output: Alarm status (active/cleared), severity, timestamp. + """ + return _safe_tool_val(row.get("Check_Alarm_Status", "NotApplicable")) + + +def Check_Element_Neighbors(row: pd.Series, element_id: str) -> str: + """ + Checks all adjacent and upstream devices of a target element to find common alarms affecting the area. + + Input: element_id + Output: List of adjacent elements with active alarms or status. + """ + return _safe_tool_val(row.get("Check_Element_Neighbors", "NotApplicable")) + + +def Check_Element_Health(row: pd.Series, element_id: str) -> str: + """ + Polls the element (e.g., DU/RU) to retrieve key health metrics like cell status and radiation. + + Input: element_id + Output: Health metrics report (e.g., Cell Status: UP, Radiation: Normal). + """ + return _safe_tool_val(row.get("Check_Element_Health", "NotApplicable")) + + +def Execute_Remote_Action(row: pd.Series, element_id: str, action: str) -> str: + """ + Runs a specific remote command on an element. 
+ Example Actions: 'unlock_cell', 'restart_du', 'restore_ru' + + Input: element_id, action + Output: Execution result (Success/Fail). + """ + return _safe_tool_val(row.get("Execute_Remote_Action", "NotApplicable")) + + +def Check_External_Issues(row: pd.Series, site_or_area: str) -> str: + """ + Scans external monitors (like DownDetector or topology maps) for area-wide issues like fiber cuts or power outages. + + Input: site_or_area + Output: External issue report (e.g., Fiber cut detected in region). + """ + return _safe_tool_val(row.get("Check_External_Issues", "NotApplicable")) + + +def Check_Apply_Configuration(row: pd.Series, element_id: str) -> str: + """ + Retrieves the element's configuration, validates it against the standard, and pushes a corrected config if a mismatch is found. + + Input: element_id + Output: Config validation result and application status. + """ + return _safe_tool_val(row.get("Check_Apply_Configuration", "NotApplicable")) + + +def Check_Performance(row: pd.Series, kpi_metric_name: str) -> str: + """ + Fetches a specific KPI from monitoring tools to check if its trends are in line with expectations. + Example Metric: 'PRACH success rate' + + Input: kpi_metric_name + Output: KPI trend analysis (e.g., 'PRACH success rate is below threshold'). + """ + return _safe_tool_val(row.get("Check_Performance", "NotApplicable")) + + +def Create_Ticket(row: pd.Series, department_name: str, issue_details: str) -> str: + """ + Logs a new issue in the ticketing system and routes it to the correct department. + + Input: department_name, issue_details + Output: Ticket ID and routing confirmation. + """ + return _safe_tool_val(row.get("Create_Ticket", "NotApplicable")) + + +def Orchestration_tool(row: pd.Series, action_command: str) -> str: + """ + Runs an automated O-RAN orchestration task using Kubernetes/Helm. + Example Action: 'delete_pod_xyz', 'reassign_ip_address' + + + Input: action_command + Output: Orchestration task status. 
+ """ + return _safe_tool_val(row.get("Orchestration_tool", "NotApplicable")) + + +def Triage_Toolkit_Tool(row: pd.Series, issue_type: str) -> str: + """ + Executes diagnostic scripts specifically for container or pod-related issues. + Example Issue Type: 'pod-crash-loop', 'container-networking' + + Input: issue_type + Output: Diagnostic logs and root cause hints. + """ + return _safe_tool_val(row.get("Triage_Toolkit_Tool", "NotApplicable")) + + +def Check_remote_files(row: pd.Series, element_id: str) -> str: + """ + Connects to a device via SSH/Telnet to review system dump files for identified errors or issues. + + Input: element_id + Output: Analysis of dump files (e.g., 'Memory overflow error found'). + """ + return _safe_tool_val(row.get("Check_remote_files", "NotApplicable")) + + +ALL_TOOLS = [ + Check_Alarm_Status, + Check_Element_Neighbors, + Check_Element_Health, + Execute_Remote_Action, + Check_External_Issues, + Check_Apply_Configuration, + Check_Performance, + Create_Ticket, + Orchestration_tool, + Triage_Toolkit_Tool, + Check_remote_files, +] + +ALL_TOOLS_STRING = [tool.__name__ for tool in ALL_TOOLS] diff --git a/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py b/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py new file mode 100644 index 0000000000..3334d62d4b --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/create_input_jsonl_from_incidents.py @@ -0,0 +1,191 @@ +import argparse +import json +import sys +from pathlib import Path +from typing import List + +import pandas as pd + +try: + from .reasoning_processes import get_reasoning_process_for_fault_category + from .schema_columns import INCIDENT_ID_COLUMN, REQUIRED_COLUMNS +except ImportError: + from src.utils.reasoning_processes import get_reasoning_process_for_fault_category + from src.utils.schema_columns import INCIDENT_ID_COLUMN, REQUIRED_COLUMNS + +# For filter by problem code: synthetic uses fault_category (workflow IDs). 
Keep in sync with filter_rows.py. +ALLOWED_PROBLEM_CODES = [ + # Legacy u_problem_code values + "Service-off", + "Degraded Prach", + "Offline / Unreachable", + "Disabled Cells", + "Node Down", + "Site Not Scrolling", + "Sleepy Cell", + "VM is in not ready state", + "Prach 0", + "N2 Link Down", + "ueconmgr pod restarted", + "CSR Not Reachable", + "Circuit Down", + "Link Down", + "GPS Sync", + "MTA Alert", + # Synthetic fault_category values + "power_ac_failure_recovery", + "power_dc_rectifier_recovery", + "power_battery_discharge_response", + "power_generator_failure_recovery", + "env_high_temperature_response", + "env_hvac_fault_recovery", + "env_water_intrusion_response", + "env_battery_temperature_response", + "env_cabinet_intrusion_response", + "ran_software_upgrade_recovery", + "ran_cell_site_down_recovery", + "ran_interference_mitigation", + "ran_speed_complaint_resolution", + "ran_voice_quality_resolution", + "ran_sector_outage_recovery", + "ran_prb_availability_resolution", + "ran_cell_overshooting_correction", + "ran_rru_communication_recovery", + "ran_dropped_calls_resolution", + "ran_parameter_correction", + "ran_antenna_tilt_recovery", + "ran_vswr_alarm_resolution", + "ran_handover_failure_resolution", + "ran_backhaul_degradation_resolution", + "ran_cell_congestion_management", + "ran_device_issue_resolution", + "compute_vm_failure_recovery", + "compute_container_crash_recovery", + "compute_orchestrator_recovery", + "compute_image_pull_recovery", + "compute_k8s_node_recovery", + "compute_storage_failure_recovery", + "compute_cnf_pod_recovery", + "compute_resource_exhaustion_resolution", + "transport_routing_flap_resolution", + "transport_microwave_degradation_response", + "transport_interface_errors_resolution", + "transport_packet_loss_resolution", + "signaling_routing_failure_recovery", + "signaling_delay_resolution", + "signaling_s1_n2_recovery", + "signaling_sip_registration_recovery", +] + + +def coerce_to_str(value) -> str: + if pd.isna(value): + 
return "" + # Ensure any non-string is converted to string safely + try: + return str(value) + except Exception: + return "" + + +def extract_examples_by_problem_code(df, num): + fc_col = "fault_category" if "fault_category" in df.columns else "u_problem_code" + subset = df[df[fc_col].isin(ALLOWED_PROBLEM_CODES)] + sampled = [] + for _, group in subset.groupby(fc_col, group_keys=False): + sampled.append(group.sample(n=min(num, len(group)), random_state=42)) + return pd.concat(sampled, ignore_index=True) if sampled else subset.head(0) + + +def main(): + parser = argparse.ArgumentParser(description="Convert incidents CSV to input.jsonl for Qwen prompt") + parser.add_argument("--input", required=True, help="Path to anonymized-Incidents_Last_6_Months.csv") + parser.add_argument("--output", default="input.jsonl", help="Path to output JSONL (default: input.jsonl)") + parser.add_argument("--limit", required=False, help="Max number of rows to export") + parser.add_argument("--required_incidents", required=False, type=str, help="Path to required incidents file") + parser.add_argument("--examples_by_problem_code", required=False, type=int) + + args = parser.parse_args() + + input_path = Path(args.input) + output_path = Path(args.output) + + if not input_path.exists(): + print(f"ERROR: Input CSV not found: {input_path}", file=sys.stderr) + sys.exit(1) + + # Read CSV (handle BOM if present) + try: + df = pd.read_csv(input_path, dtype=str, encoding="utf-8-sig") + except UnicodeDecodeError: + df = pd.read_csv(input_path, dtype=str, encoding="latin1") + + if args.examples_by_problem_code: + df = extract_examples_by_problem_code(df, args.examples_by_problem_code) + print(f"Extracted {len(df)} examples from the dataset.") + + # Reasoning process from fault_category (synthetic) or u_problem_code (legacy) + fc_col = "fault_category" if "fault_category" in df.columns else "u_problem_code" + df["problem_code_reasoning_process"] = df[fc_col].map(get_reasoning_process_for_fault_category) 
+ + # time_to_resolve: synthetic uses intervention_began / intervention_completed + start_col = "intervention_began" if "intervention_began" in df.columns else "work_start" + end_col = "intervention_completed" if "intervention_completed" in df.columns else "work_end" + try: + df[start_col] = pd.to_datetime(df[start_col], errors="coerce") + df[end_col] = pd.to_datetime(df[end_col], errors="coerce") + df["time_to_resolve"] = (df[end_col] - df[start_col]).dt.total_seconds().fillna(0).astype(int) + except Exception as e: + print(f"Failed to parse datetime columns {start_col}, {end_col}: {e}") + df["time_to_resolve"] = 0 + + original_cols_lower = {c.lower(): c for c in df.columns} + + # Ensure all required columns exist (case-insensitive match). If missing, create empty. + missing: List[str] = [] + selected = {} + for col in REQUIRED_COLUMNS: + # Match case-insensitively + actual = original_cols_lower.get(col.lower()) + if actual is None: + # Create empty column if missing + missing.append(col) + df[col] = "" + selected[col] = col + else: + selected[col] = actual + + if missing: + print( + "WARNING: Missing columns in CSV; blank fields will be used: " + ", ".join(missing), + file=sys.stderr, + ) + + if args.required_incidents: + with open(args.required_incidents, "r") as f: + required_incidents = f.read().splitlines() + print(required_incidents) + id_col = INCIDENT_ID_COLUMN if INCIDENT_ID_COLUMN in df.columns else "number" + df = df[df[id_col].astype(str).isin(required_incidents)] + + # Limit rows + if args.limit is not None and int(args.limit) > 0: + df = df.head(int(args.limit)) + + # Prepare records + records = [] + for _, row in df.iterrows(): + rec = {key: coerce_to_str(row[selected[key]]) for key in REQUIRED_COLUMNS} + records.append(rec) + + # Write JSONL + output_path.parent.mkdir(parents=True, exist_ok=True) + with output_path.open("w", encoding="utf-8") as f: + for rec in records: + f.write(json.dumps(rec, ensure_ascii=False) + "\n") + + print(f"Wrote 
{len(records)} records to {output_path}.") + + +if __name__ == "__main__": + main() diff --git a/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py b/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py new file mode 100644 index 0000000000..dbf606e393 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/format_reasoning_json.py @@ -0,0 +1,529 @@ +import argparse +import copy +import json +import os + +from tqdm import tqdm +from transformers import AutoTokenizer + + +def _incident_id(data): + """Synthetic schema uses incident_identifier; legacy uses number.""" + return data.get("incident_identifier") or data.get("number") + + +def _resolution_method(data): + """Synthetic schema uses resolution_method; legacy uses close_code.""" + return data.get("resolution_method") or data.get("close_code", "") + + +def extract_formatted_json_steps(input_file): + """ + Extracts the JSON step array from each record's "generation" field in a JSONL file. + + Args: + input_file (str): Path to the JSONL file of model generations. + + Returns: + dict: Maps each incident ID to its parsed JSON array (a list of dictionaries). + Records with no valid JSON array are skipped. 
+ """ + + responses = {} + with open(input_file, "r", encoding="utf-8") as f: + for line in f: + try: + data = json.loads(line) + if not data: + continue + text = data["generation"] + + number = _incident_id(data) + + try: + # Find the starting position of the JSON array '[' + + start_index = text.rfind("<|message|>") + text = text[start_index + len("<|message|>") :] + start_index = text.find("[") + # Find the last ']' of the JSON array; end_index is 0 when no ']' is present + end_index = text.rfind("]") + 1 + + if start_index != -1 and end_index > start_index: + # Slice the string to get only the JSON part + json_string = text[start_index:end_index] + + # Parse the JSON string into a Python object + parsed_json = json.loads(json_string) + responses[number] = parsed_json + else: + print(text) + print("Error: Could not find the start '[' or end ']' of the JSON array.") + continue + except json.JSONDecodeError as e: + print(text) + print(f"Error decoding JSON: {e}") + except Exception as e: + print(f"An unexpected error occurred: {e}") + except json.JSONDecodeError: + print(f"Skipping invalid line: {line.strip()}") + + return responses + + +def extract_final_thinking_processes(input_file): + responses = {} + with open(input_file, "r", encoding="utf-8") as f: + for line in f: + data = json.loads(line) + text = data["generation"] + number = _incident_id(data) + step_number = data["step_number"] + if number not in responses: + responses[number] = {} + + thinking = text[text.rfind("final<|message|>") + len("final<|message|>") :] + data["generation"] = thinking + responses[number][step_number] = thinking + + return responses + + +def prepare_data_for_reasoning_traces(jsonl_file, input_file, output_file): + formatted_steps_taken = extract_formatted_json_steps(input_file) + new_jsonl = [] + + incorrect_incidents = 0 + # Read the file line by line + with open(jsonl_file, "r", encoding="utf-8") as f: + for line in f: + data = json.loads(line) + if not data: + continue + number = 
_incident_id(data) + + if number in formatted_steps_taken: + formatted_steps = formatted_steps_taken[number] + current_conclusion = "" + for i in range(len(formatted_steps)): + sub_data = copy.deepcopy(data) + current_steps = formatted_steps[i] + sub_data["step_number"] = current_steps["step_number"] + sub_data["background_context"] = current_conclusion + conclusion_called = f"Step {current_steps['step_number']} {current_steps['sop_step_title']} {current_steps['status']}.\nAction taken: {current_steps['action_taken']}\n" + tool_response = "" + if current_steps["tool_call"]: + conclusion_called += f"Tool called: {current_steps['tool_call']}\n" + tool_response = f"Tool response: {current_steps['result']}\n" + else: + conclusion_called += "No tool call needed.\n" + sub_data["outcome"] = conclusion_called + new_jsonl.append(sub_data) + current_conclusion += conclusion_called + tool_response + # data["formatted_steps"] = formatted_steps_taken[number] + + # new_jsonl.append(data) + else: + incorrect_incidents += 1 + + # print(json.dumps(new_jsonl, indent = 4)) + print(f"{incorrect_incidents} incidents were not parsed correctly and discarded.") + + with open(output_file, "w", encoding="utf-8") as f: + for line in new_jsonl: + json.dump(line, f) + f.write("\n") + + print(f"Wrote {len(new_jsonl)} entries to {output_file}") + + +def token_converting(string, model): + """ + Converts a shorthand tool command like: + Check_Alarm_Status[site-123] + into a Qwen3-32B tool-call dict with "name" and "arguments" keys. + """ + if model != "qwen32": + return string # fallback for other models + + import re + + # --- 1. Parse tool name and the raw arguments inside [...] 
--- + # Match "ToolName[args]" or "ToolName[ args ]" + m = re.match(r"^\s*([A-Za-z_]\w*)\s*\[(.*)\]\s*$", str(string), re.DOTALL) + + if not m: + # Handle case with no arguments, e.g., Check_Time[] + m_no_args = re.match(r"^\s*([A-Za-z_]\w*)\s*\[\s*\]\s*$", str(string)) + if m_no_args: + tool_name = m_no_args.group(1) + raw_args = "" + else: + # If it doesn't match the syntax, return original string or raise error + # returning string allows the LLM to fail gracefully or retry + return string + + tool_name, raw_args = m.groups() + + # --- 2. Smart Splitter --- + # Splits by commas, but ignores commas inside single/double quotes. + # e.g. "dept, 'Error in rack 1, shelf 2'" -> ["dept", "'Error in rack 1, shelf 2'"] + parts = re.split(r'\s*,\s*(?=(?:[^\'"]|\'[^\']*\'|"[^"]*")+$)', raw_args.strip()) if raw_args.strip() else [] + + # --- 3. Normalize Tokens --- + kv_args = {} + pos_args = [] + + for p in parts: + if not p: + continue + # Check for key=value or key: value + if ("=" in p or ":" in p) and not (p.startswith("'") or p.startswith('"')): + k, v = re.split(r"\s*[:=]\s*", p, maxsplit=1) + v = v.strip().strip('"').strip("'") + kv_args[k.strip()] = v + else: + pos_args.append(p.strip().strip('"').strip("'")) + + # Helper to enforce positional argument counts + def req_pos(n, arg_name="argument"): + if len(pos_args) < n: + raise ValueError( + f"{tool_name} requires at least {n} value(s) (missing {arg_name}); got {len(pos_args)} in: {string}" + ) + + # --- 4. Tool-Specific Argument Mapping --- + + arg_dict = {} + + # 1. Check_Alarm_Status[] + if tool_name == "Check_Alarm_Status": + val = kv_args.get("site_or_element_id") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "site_or_element_id") + arg_dict = {"site_or_element_id": val} + + # 2. 
Check_Element_Neighbors[] + elif tool_name == "Check_Element_Neighbors": + val = kv_args.get("element_id") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "element_id") + arg_dict = {"element_id": val} + + # 3. Check_Element_Health[] + elif tool_name == "Check_Element_Health": + val = kv_args.get("element_id") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "element_id") + arg_dict = {"element_id": val} + + # 4. Execute_Remote_Action[, ''] + elif tool_name == "Execute_Remote_Action": + elem = kv_args.get("element_id") + act = kv_args.get("action") + + if not elem and len(pos_args) > 0: + elem = pos_args[0] + if not act and len(pos_args) > 1: + act = pos_args[1] + + if not elem or not act: + raise ValueError(f"{tool_name} requires 'element_id' and 'action'.") + arg_dict = {"element_id": elem, "action": act} + + # 5. Check_External_Issues[] + elif tool_name == "Check_External_Issues": + val = kv_args.get("site_or_area") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "site_or_area") + arg_dict = {"site_or_area": val} + + # 6. Check_Apply_Configuration[] + elif tool_name == "Check_Apply_Configuration": + val = kv_args.get("element_id") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "element_id") + arg_dict = {"element_id": val} + + # 7. Check_Performance[''] + elif tool_name == "Check_Performance": + val = kv_args.get("kpi_metric_name") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "kpi_metric_name") + arg_dict = {"kpi_metric_name": val} + + # 8. Create_Ticket['', ''] + elif tool_name == "Create_Ticket": + dept = kv_args.get("department_name") + details = kv_args.get("issue_details") + + # Handle positional logic + if not dept and len(pos_args) >= 1: + dept = pos_args[0] + + # If details weren't named, we assume everything after department is the details. + # We join them back with commas in case the split separated a sentence. 
+ if not details and len(pos_args) >= 2: + details = ", ".join(pos_args[1:]) + + if not dept or not details: + raise ValueError(f"{tool_name} requires 'department_name' and 'issue_details'.") + + arg_dict = {"department_name": dept, "issue_details": details} + + # 9. Orchestration_tool[''] + elif tool_name == "Orchestration_tool": + val = kv_args.get("action_command") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "action_command") + arg_dict = {"action_command": val} + + # 10. Triage_Toolkit_Tool[''] + elif tool_name == "Triage_Toolkit_Tool": + val = kv_args.get("issue_type") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "issue_type") + arg_dict = {"issue_type": val} + + # 11. Check_remote_files[] + elif tool_name == "Check_remote_files": + val = kv_args.get("element_id") or (pos_args[0] if pos_args else None) + if not val: + req_pos(1, "element_id") + arg_dict = {"element_id": val} + + # --- Fallback for unknown tools --- + else: + if kv_args: + arg_dict = kv_args + elif pos_args: + arg_dict = {"args": pos_args} if len(pos_args) > 1 else {"argument": pos_args[0]} + else: + arg_dict = {} + + # --- 5. 
Construct XML Output --- + json_call = {"name": tool_name, "arguments": arg_dict} + return json_call + + +def merge_reasoning_steps(steps_taken, reasoning_steps, model="qwen32"): + broken_numbers = [] + for number in steps_taken: + if number in reasoning_steps: + # fix tool calling + try: + for i in range(len(steps_taken[number])): + if steps_taken[number][i]["tool_call"]: + steps_taken[number][i]["tool_call"] = token_converting( + steps_taken[number][i]["tool_call"], model + ) + steps_taken[number][i]["thinking"] = reasoning_steps[number][steps_taken[number][i]["step_number"]] + except Exception as e: + print(e) + broken_numbers.append(number) + + for number in broken_numbers: + del steps_taken[number] + + return steps_taken + + +SFT_DUMMY_USER = "DUMMY_USER_FOR_SFT" +SFT_ASSISTANT_SENTINEL = "<<>>" + + +def compute_prefix_len_for_dummy_user(tokenizer): + messages = [ + {"role": "user", "content": SFT_DUMMY_USER}, + ] + rendered = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_special_tokens=False, + add_generation_prompt=False, + ) + + idx = len(rendered) + + # Keep everything from the sentinel onward, drop everything before it + return idx + + +def qwen_token_converter(data, full_reasoning_steps, tokenizer=None): + curriculum_learning_stages = {} + turn = 0 + total_tokens = 0 + pre_compute_idx = compute_prefix_len_for_dummy_user(tokenizer) + current_assistant_content = [{"role": "user", "content": SFT_DUMMY_USER}] + + for i in range(len(full_reasoning_steps)): + step = full_reasoning_steps[i] + + thinking = step.get("thinking", "") + status = step.get("status", "") + title = step.get("sop_step_title", "") + action = step.get("action_taken", "") + tool_call = step.get("tool_call", "") + result = step.get("result", "") + step_text = f"\n{thinking} {status} {title}: {action}\n\n" + + # Construct the text for this specific step + # Note: We inject tags here as part of the content + response_message = [{"role": "user", "content": SFT_DUMMY_USER}] 
+ sub_data = copy.deepcopy(data) + + # --- CASE A: Tool Call Triggered --- + if tool_call: + # Response String + response_message.append( + { + "role": "assistant", + "content": step_text, + "tool_calls": [{"type": "function", "function": tool_call}], + } + ) + raw_response = tokenizer.apply_chat_template( + response_message, tokenize=False, add_special_tokens=False, add_generation_prompt=False + ) + cleaned_response = raw_response[pre_compute_idx:] + sub_data["response"] = cleaned_response + + # Background String + raw_background = tokenizer.apply_chat_template( + current_assistant_content, tokenize=False, add_special_tokens=False, add_generation_prompt=False + ) + cleaned_background = raw_background[pre_compute_idx:] + sub_data["background"] = cleaned_background + + # Next Context + current_assistant_content.append( + { + "role": "assistant", + "content": step_text, + "tool_calls": [{"type": "function", "function": tool_call}], + } + ) + current_assistant_content.append({"role": "tool", "content": result}) + # print(raw) + # print("----:") + # print(cleaned) + # exit() + + curriculum_learning_stages[turn] = sub_data + turn += 1 + + # --- CASE B: Final Conclusion --- + elif i == len(full_reasoning_steps) - 1: + total_tokens = len( + tokenizer.apply_chat_template(current_assistant_content, tokenize=True, add_generation_prompt=False) + ) + sub_data = copy.deepcopy(data) + + result = result if result else "" + + response_message.append( + { + "role": "assistant", + "content": step_text + result + f"\nClose Code: [{_resolution_method(sub_data)}]", + } + ) + raw = tokenizer.apply_chat_template( + response_message, tokenize=False, add_special_tokens=False, add_generation_prompt=False + ) + cleaned = raw[pre_compute_idx:] + sub_data["response"] = cleaned + + # Background String + raw_background = tokenizer.apply_chat_template( + current_assistant_content, tokenize=False, add_special_tokens=False, add_generation_prompt=False + ) + cleaned_background = 
raw_background[pre_compute_idx:] + sub_data["background"] = cleaned_background + + curriculum_learning_stages[turn] = sub_data + + # --- CASE C: Intermediate Step (just accumulation) --- + else: + # We already added to current_assistant_content at the top of loop + pass + + return curriculum_learning_stages, total_tokens + + +def compile_reasoning(jsonl_file, input_file, output_dir, reasoning_jsonl, tokenizer_name="Qwen/Qwen3-32B"): + # 1. LOAD TOKENIZER ONCE HERE + tokenizer = None + print("Loading Tokenizer (Qwen3-32B)...") + # Trust remote code is often needed for Qwen tokenizers + tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True) + + formatted_steps_taken = extract_formatted_json_steps(input_file) + formatted_reasoning_steps_taken = extract_final_thinking_processes(reasoning_jsonl) + + full_steps = merge_reasoning_steps(formatted_steps_taken, formatted_reasoning_steps_taken) + + all_tokens = [] + stages = {} + incorrect_incidents = 0 + + with open(jsonl_file, "r", encoding="utf-8") as f: + for line in tqdm(f): + data = json.loads(line) + number = _incident_id(data) + + if number in full_steps: + # 2. PASS TOKENIZER TO THE FUNCTION + try: + steps_data, tokens = qwen_token_converter(data, full_steps[number], tokenizer) + for stage in steps_data: + if stage not in stages: + stages[stage] = [] + stages[stage].append(steps_data[stage]) + + if tokens > 0: + all_tokens.append(tokens) + except Exception as e: + print(f"Error for incident {number}: {e}") + incorrect_incidents += 1 + else: + incorrect_incidents += 1 + + # ... (Rest of your writing logic remains the same) ... 
+ os.makedirs(output_dir, exist_ok=True) + for i in range(len(stages)): + name = os.path.join(output_dir, f"iteration_{i}.jsonl") + with open(name, "w", encoding="utf-8") as f: + for line in stages[i]: + json.dump(line, f) + f.write("\n") + + print(f"CURRICULUM Info\n{'*' * 20}") + print(f"There are currently {len(stages)} stages") + print(f"{incorrect_incidents} incidents failed") + + +def main(jsonl_file, input_file, output_file, parse_types, reasoning_jsonl=None, output_dir=None): + if parse_types == "steps_extraction": + prepare_data_for_reasoning_traces(jsonl_file, input_file, output_file) + elif parse_types == "compile_reasoning": + if not reasoning_jsonl: + raise ValueError("Please specify a reasoning jsonl file by specifying --reasoning_jsonl") + compile_reasoning(jsonl_file, input_file, output_dir, reasoning_jsonl) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Extract and format reasoning steps from JSONL file.") + parser.add_argument("--input", type=str, help="Path to the first JSONL file") + parser.add_argument("--output", required=False, type=str) + parser.add_argument("--jsonl_file", required=False, type=str) + parser.add_argument("--parse_type", type=str) + parser.add_argument("--output_dir", required=False) + parser.add_argument("--reasoning_jsonl", required=False, type=str) + + parsing_types = ["steps_extraction", "compile_reasoning"] + args = parser.parse_args() + + if args.parse_type not in parsing_types: + raise ValueError(f"{args.parse_type} is not supported. 
Supported parse_types include {parsing_types}") + + main(args.jsonl_file, args.input, args.output, args.parse_type, args.reasoning_jsonl, args.output_dir) diff --git a/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py b/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py new file mode 100644 index 0000000000..96d858d1d6 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/reasoning_processes.py @@ -0,0 +1,264 @@ +PROBLEM_CODE_REASONING_PROCESS = { + "Service-off": """ + 1. Acknowledge and Track the Initial Alert: Reference site/alarm number. Explanation: This step logs the alert receipt and assigns a tracking ID to initiate the response process. + - Tools Used: Check_Alarm_Status() - Queries the alarm system for details. + - Flow-Deciding Answers/Outcomes: If alarm confirmed active, proceed to step 2; if cleared/false positive, document and close. + 2. Check Site and Equipment Status: Note locked/disabled components. Explanation: This step assesses current site conditions to identify immediate visible issues like equipment locks. + - Tools Used: Check_Element_Health() - Polls health metrics. + - Flow-Deciding Answers/Outcomes: If anomalies are detected (e.g., locked cells), go to step 3; if healthy, monitor and close. + 3. Perform Remote Actions (Reset/Unlock/Enable): Use CLI/management tools. Explanation: This step attempts non-invasive fixes like resetting components to restore functionality remotely. + - Tools Used: Remote_Unlock/Restart/Restore(Element) - Executes resets. + - Flow-Deciding Answers/Outcomes: If action succeeds, go to step 4; if fails, escalate to step 8. + 4. Monitor for Automatic Recovery or Alarm Clearance: Confirm health. Explanation: This step observes post-action behavior to see if the system self-recovers without further intervention. + - Tools Used: Check_Element_Health() - Re-check post-action. + - Flow-Deciding Answers/Outcomes: If recovered, go to step 8; if not, proceed to step 5. + 5. 
Check Topology/Outages/Fiber Cut/Public Websites: Identify external issues. Explanation: This step investigates network-wide or external factors impacting the site. + - Tools Used: Check_External_Issues() - Scans for outages. + - Flow-Deciding Answers/Outcomes: If external issues are found, assign to the relevant departments and close; else, go to step 6. + 6. Check/Apply Configuration Fixes: Correct mismatches (e.g., IP/VLAN). Explanation: This step audits and corrects software settings to align with operational standards. + - Tools Used: Check_Apply_Configuration(Element) - Validates and pushes fixes. + - Flow-Deciding Answers/Outcomes: If fixed, go to step 7; if not, escalate to step 8. + 7. Investigate Persistent or Recurring Alarms and Coordinate Onsite Dispatch or Hardware Replacement if Needed: Dispatch teams. Explanation: This step mobilizes deeper research and physical intervention when remote fixes are insufficient. + - Tools Used: Create_Ticket(Department_Name) - Routes ticket. + - Flow-Deciding Answers/Outcomes: After onsite resolution, go to step 8. + 8. Confirm Final Site Health and Close the Incident: Ensure stability. Explanation: This step verifies full restoration before officially ending the incident. + - Tools Used: Check_Alarm_Status(); Check_Element_Health() - Final verification. + - Flow-Deciding Answers/Outcomes: If healthy, document and close; else, loop to step 7. + 9. Document All Actions and Status Updates: Record steps. Explanation: This step compiles a complete record for audits, learning, and future reference. + - Tools Used: None. + - Flow-Deciding Answers/Outcomes: Always close after this. + """, + "Degraded Prach": """ + 1. Identify and Categorize the Alarm Event. Explanation: This step classifies the alarm to determine the appropriate troubleshooting path. + - Tools Used: Check_Alarm_Status() - Categorizes alarm type. + - Flow-Deciding Answers/Outcomes: If PRACH-specific, proceed; if unrelated, reassign and close. + 2. 
Gather Site and Equipment Status: Include DU/RU/cell health. Explanation: This step collects baseline data to pinpoint degradation sources. + - Tools Used: Check_Element_Health(Neighbors) - Collects status from neighbors or upstream devices. + - Flow-Deciding Answers/Outcomes: If issues identified, go to step 3; if normal, monitor and close. + 3. Perform Remote Checks and Initial Actions: Reset if safe. Explanation: This step conducts preliminary diagnostics and minor fixes remotely. + - Tools Used: Check_Element_Health() - Initial diagnostics. + - Flow-Deciding Answers/Outcomes: If basic reset is viable, proceed; else, go to step 8. + 4. Apply Targeted Fixes (e.g., Unlock Cells, Correct Configs). Explanation: This step implements specific corrections based on identified issues. + - Tools Used: Check_Apply_Configuration() - Checks and corrects config. + - Flow-Deciding Answers/Outcomes: If the config fix succeeds, go to step 5; else, go to step 8. + 5. Restart or Reinitialize Network Elements: Reboot DU/RU/pods. Explanation: This step cycles components to clear transient errors. + - Tools Used: Remote_Unlock/Restart/Restore() - Reinitializes. + - Flow-Deciding Answers/Outcomes: If successful, go to step 6; else, escalate. + 6. Monitor for Recovery and Stability: Check PRACH performance. Explanation: This step tracks improvements post-fix to ensure sustained resolution. + - Tools Used: Check_Element_Health() - Monitors post-fix. + - Flow-Deciding Answers/Outcomes: If stable, go to step 7; else, go to step 8. + 7. Validate with KPI and Alarm Dashboards: Confirm metrics. Explanation: This step uses data to objectively verify fix effectiveness. + - Tools Used: Check_Performance() - Fetches KPIs. + - Flow-Deciding Answers/Outcomes: If KPIs normal, document and close; else, loop to step 4. + 8. Assign the Incident if not solved: To technical/field team. Explanation: This step transfers unresolved issues to specialized teams. 
+ - Tools Used: Create_Ticket() - Routes ticket. + - Flow-Deciding Answers/Outcomes: After resolution, go to step 9. + 9. Document and Close the Incident: Record steps/outcomes. Explanation: This step finalizes the record for compliance and knowledge sharing. + - No tool used. + - Flow-Deciding Answers/Outcomes: Always close. + + """, + "Offline / Unreachable": """ + 1. Check Alarm Status. Explanation: This step confirms the alarm's validity to avoid false positives. + - Tools Used: Check_Alarm_Status() - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else, close. + 2. RU Reset: Restart to clear issues. Explanation: This step attempts a basic restart to resolve connectivity. + - Tools Used: Check_Alarm_Status(in RU) - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else, go to step 3. + - Tools Used: Remote_Unlock/Restart/Restore(RU) - Resets RU. + - Flow-Deciding Answers/Outcomes: If cleared, go to step 9; else, proceed. + 3. TX Array Alarm Clearance: Restart TX array/RU. Explanation: This step targets transmission-specific alarms for clearance. + - Tools Used: Check_Alarm_Status(in TX) - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else, go to step 4. + - Tools Used: Remote_Unlock/Restart/Restore(TX Array) - Clears alarm. + - Flow-Deciding Answers/Outcomes: If resolved, go to step 9; else, proceed. + 4. CSR Power Restoration: Reset Router. Explanation: This step addresses power-related unreachable states. + - Tools Used: Remote_Unlock/Restart/Restore(Router) - Triggers reset. + - Flow-Deciding Answers/Outcomes: If restored, go to step 9; else, proceed. + 5. Digital Input Low Alarm WA Application: Apply workaround. Explanation: This step implements temporary fixes for input signal issues. + - Tools Used: Remote_Unlock/Restart/Restore(TSSI) - Applies WA. + - Flow-Deciding Answers/Outcomes: If fixed, go to step 9; else, proceed. + 6. OranLoginFailure Recovery: Fix login issues in RUs. 
Explanation: This step resolves authentication failures preventing access. + - Tools Used: Remote_Unlock/Restart/Restore(RU) - Recovers login. + - Flow-Deciding Answers/Outcomes: If resolved, go to step 9; else, proceed. + 7. BGP Flap Resolution: Stabilize neighbors. Explanation: This step fixes routing instability, causing unreachability. + - Tools Used: Check_Element_Health(BGP Neighbors) - Resolves flaps. + - Flow-Deciding Answers/Outcomes: If stable, go to step 9; else, proceed. + 8. ISIS Adjacency Recovery: Restore links. Explanation: This step reestablishes routing adjacencies. + - Tools Used: Check_Element_Health(Adjacencies) - Recovers adjacency. + - Flow-Deciding Answers/Outcomes: If restored, go to step 9; else, go to step 10. + 9. Cell Enablement: Enable after resets. Explanation: This step activates cells post-recovery to restore service. + - Tools Used: Check_Apply_Configuration(Cell); + - Check_Element_Health() - Enables cells. + - Flow-Deciding Answers/Outcomes: If enabled, document and close; else, escalate. + 10. Escalate for Onsite: Dispatch teams. Explanation: This step initiates physical checks when remote options fail. + - Tools Used: Create_Ticket() - Routes ticket. + - Flow-Deciding Answers/Outcomes: After resolution, document and close. + 11. Document and Close: Record steps. Explanation: This step ensures a complete audit trail for the incident. + - Tools Used: None. + - Flow-Deciding Answers/Outcomes: Always close. + + """, + "Disabled Cells": """ + 1. Check Alarm Status. Explanation: This step validates the alarm to confirm cell disablement. + - Tools Used: Check_Alarm_Status() - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else go to step 8. + 2. Enable Cell: Check if the cell can be enabled by applying a config change.. Explanation: This step manually activates disabled cells. 
+ - Tools Used: Check_Apply_Configuration() - Enables cells. Flow-Deciding Answers/Outcomes: If the cell can be enabled, go to step 8; else, proceed. + 3. RU Reset: Restart RU. Explanation: This step clears potential transient locks via restart. + - Tools Used: Remote_Unlock/Restart(RU) - Resets RU. + - Flow-Deciding Answers/Outcomes: If the reset succeeds, proceed; otherwise, go to step 7. + 4. Check Alarm Status. Explanation: This step validates the alarm to confirm cell disablement. + - Tools Used: Check_Alarm_Status() - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else go to step 8. + 5. Performed Gamma MB RU Reset: Specific reset for issues. Explanation: This step applies a targeted reset for persistent problems. + - Tools Used: Remote_Unlock/Restart/Restore(RU) - Performs reset. + - Flow-Deciding Answers/Outcomes: If the reset succeeds, proceed; otherwise, go to step 7. + 6. Check Alarm Status. Explanation: This step validates the alarm to confirm cell disablement. + - Tools Used: Check_Alarm_Status() - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else go to step 8. + 7. Escalate for Onsite: Dispatch teams. Explanation: This step sends field support for hardware verification. + - Tools Used: Create_Ticket() - Routes ticket. + - Flow-Deciding Answers/Outcomes: After resolution, go to step 8. + 8. Document and Close: Record steps. Explanation: This step logs the resolution for records and analysis. + - Tools Used: None. + - Flow-Deciding Answers/Outcomes: Always close. + """, + "Node Down": """ + 1. Check Alarm Status. Explanation: This step confirms the node's down status via alarms. + - Tools Used: Check_Alarm_Status() - Verifies alarm. + - Flow-Deciding Answers/Outcomes: If active, proceed; else, close. + 2. Rebooting the Host: Restart machine. Explanation: This step restarts the host to resolve software hangs. + - Tools Used: Remote_Unlock/Restart/Restore() - Reboots host. 
+ - Flow-Deciding Answers/Outcomes: If cleared, go to step 10; else, proceed. + 3. Reconfiguring the Site: Update settings. Explanation: This step adjusts site configs to fix misalignments. + - Tools Used: Check_Apply_Configuration() - Reconfigures. + - Flow-Deciding Answers/Outcomes: If fixed, go to step 10; else, proceed. + 4. Deleting M-Plane & F1C IP: Remove for recreation. Explanation: This step clears IPs to enable fresh container setup. + - Tools Used: Orchestration_tool(Delete/Recreate Container) - Deletes/reassigns. + - Flow-Deciding Answers/Outcomes: If successful, go to step 10; else, proceed. + 5. Using the Triage Toolkit: Troubleshoot containers. Explanation: This step runs diagnostics on container-related issues. + - Tools Used: Triage_Toolkit_Tool() - Runs diagnostics. + - Flow-Deciding Answers/Outcomes: If resolved, go to step 10; else, proceed. + 6. Checking SCTP Connection for F1: Verify via netstat. Explanation: This step inspects F1 links for connectivity problems. + - Tools Used: Check_Element_Health(SCTP) - Executes netstat; returns Ok/Nok. + - Flow-Deciding Answers/Outcomes: If connected, go to step 10; else, proceed. + 7. Verifying RU Reachability: Check DU-RU link. Explanation: This step tests if DU can access RUs. + - Tools Used: Check_Element_Health(RU) - Verifies reachability. + - Flow-Deciding Answers/Outcomes: If reachable, go to step 10; else, proceed. + 8. Checking Core Dump Files: Analyze crashes. Explanation: This step reviews dumps for crash insights. + - Tools Used: Check_remote_files() - Evaluates dumps. + - Flow-Deciding Answers/Outcomes: If issues are identified/fixed, go to step 10; else, proceed. + 9. Checking Node Uptime/Load Average: Review performance. Explanation: This step evaluates resource usage for overload clues. + - Tools Used: Check_Performance() - Monitors metrics. + - Flow-Deciding Answers/Outcomes: If normal, go to step 10; else, go to step 11. + 10. Redeploying the Site: If needed. 
Explanation: This step rebuilds the site for comprehensive recovery. + - Tools Used: Check_Apply_Configuration() - Redeploys. + - Flow-Deciding Answers/Outcomes: If resolved, document and close; else, escalate. + 11. Escalate for Onsite: Dispatch teams. Explanation: This step engages physical support for unresolved issues. + - Tools Used: Create_Ticket() - Routes ticket. + - Flow-Deciding Answers/Outcomes: After resolution, document and close. + 12. Document and Close: Record steps. Explanation: This step creates an audit-ready summary of the incident. + - Tools Used: None. + - Flow-Deciding Answers/Outcomes: Always close. + """, + "Site Not Scrolling": """ + + """, + "Sleepy Cell": """ + + """, + "VM is in not ready state": """ + + """, + "Prach 0": """ + + """, + "N2 Link Down": """ + + """, + "ueconmgr pod restarted": """ + + """, + "CSR Not Reachable": """ + + """, + "Circuit Down": """ + + """, + "Link Down": """ + + """, + "GPS Sync": """ + + """, + "MTA Alert": """ + + """, +} + +# Default reasoning process when fault_category (synthetic workflow ID) is not in PROBLEM_CODE_REASONING_PROCESS. +DEFAULT_REASONING_PROCESS = PROBLEM_CODE_REASONING_PROCESS.get("Service-off", "") + +# Synthetic data uses fault_category = workflow IDs (e.g. power_ac_failure_recovery). Map them to a reasoning process.
+SYNTHETIC_FAULT_CATEGORY_REASONING = { + # Power / Environment → mapped to Service-off SOP (power-related outage workflow) + "power_ac_failure_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "power_dc_rectifier_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "power_battery_discharge_response": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "power_generator_failure_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "env_high_temperature_response": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "env_hvac_fault_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "env_water_intrusion_response": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "env_battery_temperature_response": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "env_cabinet_intrusion_response": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + # RAN → mapped to closest legacy SOP + "ran_software_upgrade_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Service-off", ""), + "ran_cell_site_down_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Node Down", ""), + "ran_interference_mitigation": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_speed_complaint_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_voice_quality_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_sector_outage_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Disabled Cells", ""), + "ran_prb_availability_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_cell_overshooting_correction": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_rru_communication_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Offline / Unreachable", ""), + "ran_dropped_calls_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_parameter_correction": PROBLEM_CODE_REASONING_PROCESS.get("Disabled Cells", ""), + 
"ran_antenna_tilt_recovery": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_vswr_alarm_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_handover_failure_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_backhaul_degradation_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Offline / Unreachable", ""), + "ran_cell_congestion_management": PROBLEM_CODE_REASONING_PROCESS.get("Degraded Prach", ""), + "ran_device_issue_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Offline / Unreachable", ""), + # Compute → mapped to VM/container legacy SOPs + "compute_vm_failure_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_container_crash_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_orchestrator_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_image_pull_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_k8s_node_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_storage_failure_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_cnf_pod_recovery": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + "compute_resource_exhaustion_resolution": PROBLEM_CODE_REASONING_PROCESS.get("VM is in not ready state", ""), + # Transport → mapped to Link Down / Circuit Down SOPs + "transport_routing_flap_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Link Down", ""), + "transport_microwave_degradation_response": PROBLEM_CODE_REASONING_PROCESS.get("Link Down", ""), + "transport_interface_errors_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Link Down", ""), + "transport_packet_loss_resolution": PROBLEM_CODE_REASONING_PROCESS.get("Link Down", ""), + # Signaling → mapped to N2 Link Down SOP + "signaling_routing_failure_recovery": PROBLEM_CODE_REASONING_PROCESS.get("N2 
Link Down", ""), + "signaling_delay_resolution": PROBLEM_CODE_REASONING_PROCESS.get("N2 Link Down", ""), + "signaling_s1_n2_recovery": PROBLEM_CODE_REASONING_PROCESS.get("N2 Link Down", ""), + "signaling_sip_registration_recovery": PROBLEM_CODE_REASONING_PROCESS.get("N2 Link Down", ""), +} + + +def get_reasoning_process_for_fault_category(fault_category: str) -> str: + """Return reasoning process text for synthetic fault_category (or original u_problem_code).""" + if fault_category is None or str(fault_category).strip() in ("", "nan", "None"): + return DEFAULT_REASONING_PROCESS + fc = str(fault_category).strip() + if fc in SYNTHETIC_FAULT_CATEGORY_REASONING: + return SYNTHETIC_FAULT_CATEGORY_REASONING[fc] + if fc in PROBLEM_CODE_REASONING_PROCESS: + return PROBLEM_CODE_REASONING_PROCESS[fc] + return DEFAULT_REASONING_PROCESS diff --git a/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py b/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py new file mode 100644 index 0000000000..b792d7e5c2 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/schema_columns.py @@ -0,0 +1,72 @@ +""" +Canonical column names for the pipeline when using synthetic (new) data format. +Use these names everywhere so the pipeline runs on CSV/JSONL with synthetic schema +without converting to the old format. +""" + +from typing import List + +# Primary key for incidents (CSV column and JSONL key) +INCIDENT_ID_COLUMN = "incident_identifier" + +# All columns expected in synthetic CSV (and produced in JSONL). +# Derived columns: time_to_resolve (computed), solved_category (from match_keywords), problem_code_reasoning_process (mapped from fault_category). 
+REQUIRED_COLUMNS: List[str] = [ + INCIDENT_ID_COLUMN, + "geographical_territory", + "incident_classification", + "incident_subtype", + "service_domain", + "equipment_provider", + "operational_zone", + "resolution_status", + "suspension_cause", + "urgency_level", + "affected_site", + "incident_summary", + "detection_timestamp", + "escalation_date", + "responsible_team", + "fault_category", + "root_cause_primary", + "resolution_summary", + "action_chronicle", + "reporter_identity", + "intervention_began", + "intervention_completed", + "resolution_method", + "root_cause_secondary", + "cause_additional", + "triggered_by_modification", + "resolver_identity", + "time_to_resolve", + "solved_category", + "problem_code_reasoning_process", +] + +# Column used for close-code / resolution classification (match_keywords, filter_rows) +RESOLUTION_METHOD_COLUMN = "resolution_method" +RESOLUTION_SUMMARY_COLUMN = "resolution_summary" +FAULT_CATEGORY_COLUMN = "fault_category" +SOLVED_CATEGORY_COLUMN = "solved_category" + +# Datetime columns (for parsing and time_to_resolve) +INTERVENTION_BEGAN_COLUMN = "intervention_began" +INTERVENTION_COMPLETED_COLUMN = "intervention_completed" + +# Prompt/display columns (used in YAML and scripts) +PROMPT_COLUMNS = { + "incident_identifier": INCIDENT_ID_COLUMN, + "geographical_territory": "geographical_territory", + "incident_classification": "incident_classification", + "incident_subtype": "incident_subtype", + "service_domain": "service_domain", + "equipment_provider": "equipment_provider", + "operational_zone": "operational_zone", + "urgency_level": "urgency_level", + "affected_site": "affected_site", + "incident_summary": "incident_summary", + "detection_timestamp": "detection_timestamp", + "responsible_team": "responsible_team", + "fault_category": "fault_category", +} diff --git a/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py b/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py new file mode 100644 index 
0000000000..42db3023e1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/split_incident_data.py @@ -0,0 +1,101 @@ +import argparse +import json +import os +import random +from pathlib import Path + + +def split_generation_field( + input_folder: str, train_out: str, test_out: str, test_size: float = 0.2, seed: int = 42, max_examples=10000 +): + train_path = Path(train_out) + test_path = Path(test_out) + + incidents = [] + with open(os.path.join(input_folder, "iteration_0.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + incidents.append(number) + + random.seed(seed) + random.shuffle(incidents) + + incidents = incidents[:max_examples] + n = len(incidents) + n_test = max(1, int(round(n * test_size))) if n > 0 else 0 + n_train = n - n_test + train_set = incidents[:n_train] + test_set = incidents[n_train:] + + train_rows = [] + test_rows = [] + + i = 0 + while os.path.exists(os.path.join(input_folder, f"iteration_{i}.jsonl")): + current_iteration_train = [] + current_iteration_test = [] + with open(os.path.join(input_folder, f"iteration_{i}.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + if number in train_set: + current_iteration_train.append(row) + + random.shuffle(current_iteration_train) + train_rows += current_iteration_train + i += 1 + + i = 0 + while os.path.exists(os.path.join(input_folder, f"iteration_{i}.jsonl")): + current_iteration_train = [] + current_iteration_test = [] + with open(os.path.join(input_folder, f"iteration_{i}.jsonl"), "r", encoding="utf-8") as f_in: + for line in f_in: + row = json.loads(line) + number = row.get("incident_identifier", row.get("number")) + if number in test_set: + ## In end + print(row["response"]) + # exit() + if "Close Code: [" in row["response"]: + print("in") + row["initial_background"] = row["background"] + # if 
row["initial_background"] == "\n\n": + # print(json.dumps(row, indent=4)) + # exit() + row["background"] = "\n\n" + current_iteration_test.append(row) + + random.shuffle(current_iteration_test) + test_rows += current_iteration_test + i += 1 + + train_path.parent.mkdir(parents=True, exist_ok=True) + test_path.parent.mkdir(parents=True, exist_ok=True) + + with open(train_path, "w", encoding="utf-8") as f_train: + for r in train_rows: + f_train.write(json.dumps(r, ensure_ascii=False) + "\n") + + with open(test_path, "w", encoding="utf-8") as f_test: + for r in test_rows: + f_test.write(json.dumps(r, ensure_ascii=False) + "\n") + + print(f"Total incidents: {n}") + print(f"Train incidents: {len(train_set)} | Test incidents: {len(test_set)}") + print(f"Train path: {train_path}") + print(f"Test path: {test_path}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Split JSONL incidents with 'generation' field into train and test sets") + parser.add_argument("--input_dir", help="Path to the directory containing iteration_*.jsonl files", default="output_incident.jsonl") + parser.add_argument("--train_output", help="Path to the output training JSONL file", default="training_data_split.jsonl") + parser.add_argument("--test_output", help="Path to the output test JSONL file", default="testing_data_split.jsonl") + parser.add_argument( + "--preview", type=int, default=2, help="Number of examples to preview before confirmation (default: 2)" + ) + args = parser.parse_args() + + split_generation_field(args.input_dir, args.train_output, args.test_output) diff --git a/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py b/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py new file mode 100644 index 0000000000..ef8199fea1 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/utils/split_mocktools_answers.py @@ -0,0 +1,123 @@ +import argparse +import json + + +def parse_generation(gen_str): + """ + Parses a generation string into a dictionary where keys are before ':' and values are
after. + """ + parsed = {} + if not gen_str: + return parsed + lines = gen_str.split("\n") + for line in lines: + if ":" in line: + key, value = line.split(":", 1) + key = key.strip() + value = value.strip() + parsed[key] = value + return parsed + + +def main(file1_path, file2_path, output_path=None): + """ + Processes two JSONL files: extracts specified fields from the first file, + includes 'generation' from the first file and 'generation2' from the second file (matched by 'number'), + parses the 'generation2' into additional answer columns, + and outputs the result as JSONL to stdout or a file. + + Args: + file1_path (str): Path to the first JSONL file. + file2_path (str): Path to the second JSONL file. + output_path (str, optional): Path to the output JSONL file. If None, prints to stdout. + """ + # Define the answer keys to extract + answer_keys = [ + "Check_Triage_toolkit_answer", + "Check_Alarm_Status_on_System_answer", + "Remote_Connection_answer", + "Remote_Device_Reboot_answer", + "Check_Fiber_Issues_answer", + "Check_Power_Issues_answer", + "Check_Element_Neighbors_answer", + ] + + # Load data from first file, keyed by 'number' + data1 = {} + with open(file1_path, "r", encoding="utf-8") as f1: + for line_num, line in enumerate(f1, 1): + line = line.strip() + if line: + try: + d = json.loads(line) + num = d.get("number") + if num: + data1[num] = d + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file1 on line {line_num}: '{line}'. Error: {e}") + + # Load generations from second file, keyed by 'number' + data2 = {} + with open(file2_path, "r", encoding="utf-8") as f2: + for line_num, line in enumerate(f2, 1): + line = line.strip() + if line: + try: + d = json.loads(line) + num = d.get("number") + if num: + data2[num] = d.get("generation", "") + except json.JSONDecodeError as e: + print(f"Error decoding JSON in file2 on line {line_num}: '{line}'. 
Error: {e}") + + # Build results for matching numbers + results = [] + for num, d1 in data1.items(): + if num in data2: + # Parse generation2 + parsed2 = parse_generation(data2[num]) + + extracted = { + "number": num, + "u_region2": d1.get("u_region2", ""), + "category": d1.get("category", ""), + "subcategory": d1.get("subcategory", ""), + "u_market2": d1.get("u_market2", ""), + "u_ran_vendor": d1.get("u_ran_vendor", ""), + "u_aoi2": d1.get("u_aoi2", ""), + "priority": d1.get("priority", ""), + "u_locations": d1.get("u_locations", ""), + "short_description": d1.get("short_description", ""), + "opened_at": d1.get("opened_at", ""), + "follow_up": d1.get("follow_up", ""), + "assignment_group": d1.get("assignment_group", ""), + "problem code": d1.get("u_problem_code", ""), + "generation": d1.get("generation", ""), + "generation2": data2[num], + "close_notes": d1.get("close_notes", ""), + } + + # Add the parsed answer columns from generation2 + for key in answer_keys: + extracted[key] = parsed2.get(key, "NotApplicable") + + results.append(extracted) + + # Output + if output_path: + with open(output_path, "w", encoding="utf-8") as out_file: + for res in results: + out_file.write(json.dumps(res) + "\n") + print(f"Output written to {output_path}") + else: + for res in results: + print(json.dumps(res)) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Extract and combine data from two JSONL files.") + parser.add_argument("file1", help="Path to the first JSONL file") + parser.add_argument("file2", help="Path to the second JSONL file") + parser.add_argument("--output", help="Optional path to output JSONL file (default: print to stdout)") + args = parser.parse_args() + main(args.file1, args.file2, args.output) diff --git a/recipes/noc-reasoning-agent/scripts/utils/token_usage.py b/recipes/noc-reasoning-agent/scripts/utils/token_usage.py new file mode 100644 index 0000000000..f50b3d4378 --- /dev/null +++ 
b/recipes/noc-reasoning-agent/scripts/utils/token_usage.py @@ -0,0 +1,287 @@ +#!/usr/bin/env python3 +# analyze_generation_lengths.py +# +# Usage: +# python analyze_generation_lengths.py \ +# --jsonl workspace/NoC_Reasoning_Agent/outputs/sdg_gpt_oss/output.jsonl \ +# --yaml data/prompt_incident.yaml \ +# --model openai/gpt-oss-120b \ +# --out outputs/generation_lengths.csv + +import argparse +import json +import math +import os +from typing import Optional + +from tqdm import tqdm + +# Optional deps +try: + from transformers import AutoTokenizer +except Exception: + AutoTokenizer = None + +try: + import yaml +except Exception: + yaml = None + +import matplotlib.pyplot as plt +import pandas as pd + + +def load_tokenizer(model_name: Optional[str]): + """ + Try to load a HF tokenizer. If unavailable (e.g., no internet/cache), + return None and we'll fall back to whitespace tokenization. + """ + if not model_name or AutoTokenizer is None: + return None + try: + tok = AutoTokenizer.from_pretrained(model_name, local_files_only=True) + return tok + except Exception: + # Try again without local_files_only (may fail if no internet) + try: + tok = AutoTokenizer.from_pretrained(model_name) + return tok + except Exception: + return None + + +def count_tokens(text: str, tokenizer) -> int: + if not isinstance(text, str): + return 0 + if tokenizer is not None: + # Use encode to match model token count (fast + accurate) + try: + return len(tokenizer.encode(text, add_special_tokens=False)) + except Exception: + pass + # Fallback: whitespace tokens + return len(text.split()) + + +def read_yaml_prompt(path: Optional[str]) -> Optional[str]: + if not path: + return None + if not os.path.exists(path): + print(f"[warn] YAML file not found: {path}") + return None + if yaml is None: + print("[warn] PyYAML not installed; skipping YAML parsing.") + return None + with open(path, "r", encoding="utf-8") as f: + data = yaml.safe_load(f) + # Your structure shows top-level key 'user' + return 
data.get("user") if isinstance(data, dict) else None + + +def stream_jsonl_lengths(jsonl_path: str, tokenizer) -> pd.DataFrame: + """ + Streams a JSONL file and extracts token lengths for the 'generation' field. + If a line includes 'num_generated_tokens', we keep it for reference. + """ + records = [] + if not os.path.exists(jsonl_path): + raise FileNotFoundError(f"JSONL not found: {jsonl_path}") + + with open(jsonl_path, "r", encoding="utf-8") as f_count: + total_lines = sum(1 for _ in f_count) + + with open(jsonl_path, "r", encoding="utf-8") as f: + for i, line in tqdm(enumerate(f, start=1), total=total_lines, desc="Processing lines"): + # Skip blank lines + line = line.strip() + if not line: + continue + try: + row = json.loads(line) + except json.JSONDecodeError: + print(f"[warn] Skipping malformed JSON at line {i}") + continue + + gen = row.get("generation", "") + # Some pipelines store the content under nested keys; add common fallbacks here if needed. + if not isinstance(gen, str): + # Try a nested alternative if your data uses it (customize as needed): + # gen = row.get("output", {}).get("text", "") + pass + + token_len = count_tokens(gen, tokenizer) + ref_len = row.get("num_generated_tokens", None) + + records.append({"index": i, "token_length": token_len, "num_generated_tokens_field": ref_len}) + + df = pd.DataFrame.from_records(records) + return df + + +def describe_lengths(df: pd.DataFrame, col: str = "token_length") -> pd.Series: + if df.empty: + return pd.Series(dtype=float) + s = df[col].dropna().astype(int) + # Custom concise stats + desc = pd.Series( + { + "count": int(s.shape[0]), + "min": int(s.min()) if len(s) else 0, + "p10": int(s.quantile(0.10)) if len(s) else 0, + "p25": int(s.quantile(0.25)) if len(s) else 0, + "median": int(s.median()) if len(s) else 0, + "p75": int(s.quantile(0.75)) if len(s) else 0, + "p90": int(s.quantile(0.90)) if len(s) else 0, + "max": int(s.max()) if len(s) else 0, + "mean": float(s.mean()) if len(s)
else 0.0, + "std": float(s.std(ddof=1)) if len(s) > 1 else 0.0, + } + ) + return desc + + +def plot_histogram(df: pd.DataFrame, out_png: str, col: str = "token_length"): + if df.empty: + print("[warn] No data to plot.") + return + x = df[col].dropna().astype(int) + # Use a reasonable number of bins based on data spread + bins = min(60, max(10, int(math.sqrt(len(x))))) + plt.figure(figsize=(9, 5)) + plt.hist(x, bins=bins) + plt.title("Distribution of Generation Token Lengths") + plt.xlabel("Token length per sample") + plt.ylabel("Count") + plt.tight_layout() + plt.savefig(out_png, dpi=150) + plt.close() + print(f"[info] Saved histogram to: {out_png}") + + +def main(): + ap = argparse.ArgumentParser(description="Analyze token lengths of 'generation' fields in a JSONL.") + ap.add_argument("--jsonl", required=True, help="Path to JSONL file with a 'generation' field.") + ap.add_argument("--yaml", default=None, help="Optional YAML prompt file with key 'user'.") + ap.add_argument("--model", default=None, help="HF tokenizer name (e.g., 'openai/gpt-oss-120b').") + ap.add_argument("--out", default="generation_lengths.csv", help="Output CSV path.") + ap.add_argument("--plot", default="generation_lengths_hist.png", help="Output PNG for histogram.") + args = ap.parse_args() + + tokenizer = load_tokenizer(args.model) + if tokenizer is None: + print("[warn] Could not load tokenizer; falling back to whitespace token counts.") + + # Optional: count tokens in the prompt + prompt = read_yaml_prompt(args.yaml) + if prompt: + prompt_tokens = count_tokens(prompt, tokenizer) + print(f"[info] Prompt tokens: {prompt_tokens} (from {args.yaml})") + + # Stream JSONL and compute lengths + df = stream_jsonl_lengths(args.jsonl, tokenizer) + + # Save per-row lengths + os.makedirs(os.path.dirname(args.out) or ".", exist_ok=True) + df.to_csv(args.out, index=False) + print(f"[info] Wrote per-sample lengths to: {args.out}") + + # Print concise stats + stats = describe_lengths(df, "token_length") + if 
not stats.empty: + print("\n=== Token Length Summary (generation) ===") + for k, v in stats.items(): + print(f"{k:>6}: {v}") + + # If your file has a 'num_generated_tokens' column, compare basic correlation + if "num_generated_tokens_field" in df.columns and df["num_generated_tokens_field"].notna().any(): + try: + corr = df[["token_length", "num_generated_tokens_field"]].dropna().corr().iloc[0, 1] + print(f"\n[info] Correlation with 'num_generated_tokens' field: {corr:.3f}") + except Exception: + pass + + # Plot histogram + plot_histogram(df, args.plot, "token_length") + + +if __name__ == "__main__": + # === Configuration === + # csv_file = "data/top_10_categorized_incidents.csv" # Path to your CSV + csv_file = "data/prompt_incident.yaml" + column_name = "action_chronicle" # default; overridden after loading data + model = "openai/gpt-oss-120b" # Tokenizer model + output_csv = "outputs/work_notes_token_lengths.csv" + output_plot = "outputs/work_notes_token_hist.png" + + # === Load Tokenizer === + tokenizer = AutoTokenizer.from_pretrained(model) + print(f"[info] Tokenizer loaded for model: {model}") + + if csv_file.endswith((".yaml", ".yml")): + with open(csv_file, "r", encoding="utf-8") as f: + yaml_data = yaml.safe_load(f) + + # You might want to join everything into one big string, or tokenize each value + if isinstance(yaml_data, dict): + prompt_text = yaml.dump(yaml_data) + else: + prompt_text = str(yaml_data) + + token_count = len(tokenizer.encode(prompt_text, add_special_tokens=False)) + print(f"[info] YAML file {csv_file} contains {token_count} tokens.") + exit(0) + + # === Load CSV === + df = pd.read_csv(csv_file, encoding="utf-8") + if column_name not in df.columns: + raise ValueError(f"Column '{column_name}' not found in CSV. 
Available columns: {df.columns.tolist()}") + + # === Count Tokens === + token_lengths = [] + zero_tokens = 0 + for idx, text in tqdm(enumerate(df[column_name]), total=len(df), desc="Tokenizing work_notes"): + if pd.isna(text): # Handle NaN values + token_lengths.append(0) + zero_tokens += 1 + continue + # Count tokens for this row + token_count = len(tokenizer.encode(str(text), add_special_tokens=False)) + token_lengths.append(token_count) + + # Count rows with 0 tokens + print(f"\n[info] Number of rows with 0 tokens: {zero_tokens}") + + # Add results back to DataFrame + df["work_notes_token_length"] = token_lengths + + # === Save CSV with token counts === + df.to_csv(output_csv, index=False) + print(f"[info] Token lengths saved to {output_csv}") + + # === Summary Statistics === + print("\n=== Token Length Summary ===") + print(df["work_notes_token_length"].describe()) + + # === Plot Histogram === + plt.figure(figsize=(10, 6)) + plt.hist(df["work_notes_token_length"], bins=50) + plt.title(f"Distribution of Token Lengths for {column_name}") + plt.xlabel("Token Length") + plt.ylabel("Frequency") + plt.tight_layout() + plt.savefig(output_plot, dpi=150) + plt.close() + print(f"[info] Histogram saved to {output_plot}") + exit() + main() + +""" +Usage: +python src/utils/token_usage.py \ + --jsonl outputs/sdg_gpt_oss/output.jsonl \ + --yaml data/prompt_incident.yaml \ + --model openai/gpt-oss-120b \ + --out outputs/generation_lengths.csv \ + --plot outputs/generation_lengths_hist.png + +""" diff --git a/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py b/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py new file mode 100644 index 0000000000..78af4f0f57 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/visualization/extract_representation_columns.py @@ -0,0 +1,152 @@ +import json +import os +import re +from collections import defaultdict + +import pandas as pd + +# ---------- Paths ---------- 
+jsonl_path = "outputs/filtering_soft_with_keywords/output.jsonl" +output_jsonl_path = "outputs/filtering_soft_with_keywords/output_with_categories.jsonl" +csv_path = "data/human_intervention_incidents_soft.csv" +samples_dir = "outputs/filtering_soft_with_keywords/samples" + +os.makedirs(os.path.dirname(output_jsonl_path), exist_ok=True) +os.makedirs(samples_dir, exist_ok=True) + +# ---------- Load CSV (index by incident id: synthetic=incident_identifier, legacy=number) ---------- +sample = pd.read_csv(csv_path, nrows=0) +id_col = "incident_identifier" if "incident_identifier" in sample.columns else "number" +df = pd.read_csv(csv_path, encoding="latin1", dtype={id_col: str}) +df[id_col] = df[id_col].astype(str) + +needed_cols = [id_col, "time_to_resolve", "solved_category", "solved_reason"] +missing = [c for c in needed_cols if c not in df.columns] +if missing: + raise ValueError(f"CSV is missing required columns: {missing}") + +csv_idx = df.set_index(id_col)[["time_to_resolve", "solved_category", "solved_reason"]].to_dict(orient="index") + + +# ---------- Helper: safe enrichment ---------- +def enrich_row(row, lookup): + num = row.get("incident_identifier", row.get("number")) + info = lookup.get(num) + if info is None: + # No match: set to None (or choose sensible defaults) + row["time_to_resolve"] = None + row["solved_category"] = row.get("solved_category") # preserve if already present + row["solved_reason"] = row.get("solved_reason") + else: + row["time_to_resolve"] = info.get("time_to_resolve") + row["solved_category"] = info.get("solved_category") + row["solved_reason"] = info.get("solved_reason") + return row + + +# ---------- Read input JSONL, enrich, write output JSONL ---------- +enriched_rows = [] +with open(jsonl_path, "r", encoding="utf-8") as fin, open(output_jsonl_path, "w", encoding="utf-8") as fout: + for line in fin: + if not line.strip(): + continue + row = json.loads(line) + row = enrich_row(row, csv_idx) + enriched_rows.append(row) + 
fout.write(json.dumps(row, ensure_ascii=False) + "\n") + +print(f"Wrote enriched JSONL → {output_jsonl_path} (n={len(enriched_rows)})") + +# ---------- Bucket by solved_category ---------- +by_cat = defaultdict(list) +for r in enriched_rows: + cat = r.get("resolution_method") or r.get("close_code") or "Unknown" + by_cat[cat].append(r) + + +# ---------- Length bucketing rules (by token-ish count) ---------- +def token_count(text: str) -> int: + if not isinstance(text, str): + return 0 + # crude token proxy: whitespace-split + return len(text.split()) + + +# You can tweak these thresholds if your generations are generally longer/shorter +SHORT_MAX = 1000 # tokens +MEDIUM_MAX = 5000 # tokens +# long: > MEDIUM_MAX + + +def length_bucket(text: str) -> str: + n = token_count(text) + if n <= SHORT_MAX: + return "short" + elif n <= MEDIUM_MAX: + return "medium" + return "long" + + +# ---------- Pick 5 samples per category with ≥1 short, ≥1 medium, ≥1 long (if available) ---------- +def pick_samples(rows, n=5): + # Build buckets + buckets = {"short": [], "medium": [], "long": []} + for r in rows: + gen = r.get("generation", "") + b = length_bucket(gen) + buckets[b].append(r) + + # deterministic sort within buckets: by token length + for b in buckets: + buckets[b].sort(key=lambda x: token_count(x.get("generation", ""))) + + picked = [] + + # 1) ensure coverage: pick shortest short, median medium, longest long when available + if buckets["short"]: + picked.append(buckets["short"][0]) # shortest short + buckets["short"] = buckets["short"][1:] + + if buckets["medium"]: + mid = len(buckets["medium"]) // 2 + picked.append(buckets["medium"][mid]) # median medium + buckets["medium"].pop(mid) + + if buckets["long"]: + picked.append(buckets["long"][-1]) # longest long + buckets["long"].pop(-1) + + # 2) fill remaining slots from the buckets in round-robin: short → medium → long + order = ["short", "medium", "long"] + i = 0 + while len(picked) < n and any(buckets[b] for b in order): + 
b = order[i % 3] + if buckets[b]: + picked.append(buckets[b].pop(0)) + i += 1 + + # If still short, just top up from whatever remains (unlikely) + if len(picked) < n: + remaining = buckets["short"] + buckets["medium"] + buckets["long"] + picked.extend(remaining[: (n - len(picked))]) + + return picked[:n] + + +# ---------- Write sample files per category ---------- +def sanitize(name: str) -> str: + return re.sub(r"[^A-Za-z0-9_.-]+", "_", str(name)) + + +for cat, rows in by_cat.items(): + samples = pick_samples(rows, n=5) + out_path = os.path.join(samples_dir, f"samples_{sanitize(cat)}.jsonl") + with open(out_path, "w", encoding="utf-8") as fout: + for r in samples: + # Keep the whole record; downstream you can read r["generation"] for the trace + fout.write(json.dumps(r, ensure_ascii=False) + "\n") + # quick summary + lengths = [length_bucket(r.get("generation", "")) for r in samples] + print( + f"Category: {cat:>20} | total={len(rows):4d} | wrote={len(samples):2d} | mix={dict((x, lengths.count(x)) for x in set(lengths))} -> {out_path}" + ) diff --git a/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py b/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py new file mode 100644 index 0000000000..12dc328cc4 --- /dev/null +++ b/recipes/noc-reasoning-agent/scripts/visualization/extract_scores.py @@ -0,0 +1,53 @@ +import json + +import pandas as pd + +# Replace with your actual JSONL file path +file_path = "evaluations.jsonl" + +# Load JSONL file +data = [] +with open(file_path, "r", encoding="utf-8") as f: + for line in f: + data.append(json.loads(line)) + +# Convert to pandas DataFrame +df = pd.DataFrame(data) + +# Metrics to calculate averages for +metrics = ["rouge1", "rougeL", "bertscore_f1", "llm_judge_score"] + +# Check which columns are available +available_metrics = [metric for metric in metrics if metric in df.columns] + +if not available_metrics: + raise ValueError("No required metrics found in the JSONL file!") + +# Calculate 
averages
+averages = df[available_metrics].mean()
+
+# Display results
+print("Average Metrics:")
+for metric, avg in averages.items():
+    print(f"{metric}: {avg:.4f}")
+
+
+# Reuse the DataFrame loaded above; re-reading the same file is unnecessary.
+# Select only the relevant columns
+columns_to_display = ["expected_answer", "agent_response", "llm_judge_reason"]
+df_subset = df[columns_to_display].head(10)
+
+# Display neatly
+for idx, row in df_subset.iterrows():
+    print(f"\n--- Sample {idx + 1} ---")
+    print(f"True Answer (expected_answer): {row['expected_answer']}")
+    print(f"Model Answer (agent_response): {row['agent_response']}")
+    print(f"Judge Explanation (llm_judge_reason): {row['llm_judge_reason']}")
diff --git a/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py b/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py
new file mode 100644
index 0000000000..5765743695
--- /dev/null
+++ b/recipes/noc-reasoning-agent/scripts/visualization/generate_trace_visualization.py
@@ -0,0 +1,348 @@
+import html
+import json
+import re
+import sys
+from pathlib import Path
+
+import pandas as pd
+
+
+def parse_steps(reasoning):
+    """Parses the Thought/Action/Observation steps from the reasoning trace."""
+    step_re = re.compile(r"^(Thought|Action|Observation)\s+(\d+):\s*(.*)$", re.MULTILINE)
+    finish_re = re.compile(r"^Finish\[(.*)\]$", re.MULTILINE | re.DOTALL)
+
+    steps = {}
+    for kind, num, text in step_re.findall(reasoning or ""):
+        steps.setdefault(int(num), {})[kind.lower()] = text.strip()
+
+    ordered_steps = [{"index": i, **steps[i]} for i in sorted(steps)]
+    finish_match = finish_re.search(reasoning or "")
+    finish_text = finish_match.group(1).strip() if finish_match else None
+
+    return ordered_steps, finish_text
+
+
+def parse_final_reasoning(generation_text):
+    """
+    Parses the
'generation' field to extract only the final
+    Thought, Action, and Observation steps.
+    """
+    # 1. Isolate the final, clean reasoning trace
+    # This regex looks for the block starting with "Question:" and ending with "Finish[...]"
+    trace_block_match = re.search(
+        r"Question:.*?(Finish\[.*?\])",
+        generation_text or "",
+        re.DOTALL,  # Allows '.' to match newlines
+    )
+
+    if not trace_block_match:
+        return "Final reasoning trace not found."
+
+    reasoning_trace = trace_block_match.group(0)
+
+    # 2. Extract each individual Thought, Action, and Observation
+    # This regex finds all lines starting with the keywords
+    step_pattern = re.compile(
+        r"^(Thought|Action|Observation)\s+\d+:\s*(.*)$",
+        re.MULTILINE,  # Allows '^' to match the start of each line
+    )
+
+    steps = step_pattern.findall(reasoning_trace)
+
+    # 3. Format the results for clarity
+    parsed_steps = []
+    for step in steps:
+        kind = step[0]  # "Thought", "Action", or "Observation"
+        content = step[1]  # The text of the step
+        parsed_steps.append(f"**{kind}:** {content}")
+
+    return "\n".join(parsed_steps)
+
+
+def find_finish_action(generation_text):
+    """Returns the text from the last 'Finish' onward, or an empty string if absent."""
+    text = generation_text or ""
+    last_index = text.rfind("Finish")
+
+    if last_index == -1:
+        # Previously this path fell through to an unbound `finish_text`.
+        print("No 'Finish' found!")
+        return ""
+
+    return text[last_index:].strip()
+
+
+def parse_generation(generation_text):
+    """
+    Extracts the clean, final reasoning trace from the raw 'generation' field.
+    The trace starts with 'Question:' and ends with 'Finish[...]'.
+    """
+    # Regex to find the block starting with "Question:" and ending with "Finish[...]"
+    trace_re = re.compile(r"Question:.*Finish\[.*\]", re.DOTALL)
+    match = trace_re.search(generation_text or "")
+    return match.group(0).strip() if match else ""
+
+
+def parse_work_notes(work_notes_text):
+    """Splits work notes into a list of entries based on timestamps."""
+    if not work_notes_text:
+        return []
+
+    # This regex captures the full timestamp line (e.g., "2024-11-22 10:37:54 - ST (Work notes)")
+    timestamp_pattern = r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s-.*?\))"
+
+    # Split the text by the timestamp pattern, keeping the timestamps
+    parts = re.split(timestamp_pattern, work_notes_text)
+
+    notes = []
+    # The first item is usually empty, so we start from the first captured timestamp
+    i = 1
+    while i < len(parts):
+        timestamp = parts[i].strip()
+        note_text = parts[i + 1].strip()
+        if note_text:  # Only add entries that have content
+            notes.append({"timestamp": timestamp, "note": note_text})
+        i += 2
+
+    return notes
+
+
+def esc(s):
+    """Helper function for HTML escaping."""
+    return html.escape(str(s) if s is not None else "")
+
+
+def render(incident_data):
+    """Renders a single incident into a collapsible HTML section."""
+    reasoning_trace = incident_data.get("generation")
+    finish_action = find_finish_action(incident_data.get("generation"))
+
+    # --- Part 1: Human-Readable Incident Trace ---
+    trace_html = ""
+    fields_to_display = [
+        "incident_identifier",
+        "urgency_level",
+        "incident_classification",
+        "incident_subtype",
+        "responsible_team",
+        "fault_category",
+        "detection_timestamp",
+        "incident_summary",
+        "geographical_territory",
+        "service_domain",
+        "equipment_provider",
+        "operational_zone",
+        "affected_site",
+        "escalation_date",
+        "generation_start_time",
+        "generation_end_time",
+        "time_to_resolve",
+        "solved_category",
+        "solved_reason",
+    ]
+    for field in fields_to_display:
+        value = incident_data.get(field)
+        display_value = value if value
and str(value).strip() else "—" + field_name = field.replace("u_", "").replace("_", " ").title() + trace_html += f'' + trace_html += "
{field_name}{esc(display_value)}
" + + # --- Part 2: Chronological Work Notes --- + work_notes = incident_data.get("action_chronicle") or incident_data.get("work_notes") + work_notes_entries = parse_work_notes(work_notes) + notes_html = "
" + for entry in work_notes_entries: + notes_html += f""" +
+
{esc(entry["timestamp"])}
+
{esc(entry["note"])}
+
+ """ + notes_html += "
" + + steps, finish = parse_steps(reasoning_trace) + steps_html = [] + for s in steps: + block = [f'
Step {s["index"]}
'] + for key, cls in [("thought", "thought"), ("action", "action"), ("observation", "obs")]: + if s.get(key): + block.append(f'
{key.title()}: {esc(s[key])}
') + steps_html.append(f"
{''.join(block)}
") + + # --- Assemble the final collapsible report --- + incident_id = esc(incident_data.get("incident_identifier", incident_data.get("number", "Unknown Incident"))) + short_desc = esc(incident_data.get("incident_summary", incident_data.get("short_description", ""))) + + return f""" +
+ + {incident_id} + {short_desc} +
    +
  • Category: {incident_data.get("incident_classification", incident_data.get("category"))}
  • +
  • Problem Code: {incident_data.get("fault_category", incident_data.get("u_problem_code"))}
  • +
+
+
+

Incident Trace (Human-Readable)

+ {trace_html} +

Work Notes (Chronological)

+ {notes_html} +

Full Reasoning Trace

+
{esc(reasoning_trace)}
+

Thoughts, Observations, Actions

+
+ {"".join(steps_html)} +
+

Closing Notes

+
{esc(finish_action)}
+
+
+ """ + #
{incident_data.get("close_notes")}
+ + +def main(input_file, output_file, max_incidents=15, selected_criteria=None): + """Main function to read, process, and write the HTML report.""" + try: + with open(input_file, "r", encoding="utf-8") as f: + lines = [line for line in f if line.strip()] + data = [json.loads(line) for line in lines[:max_incidents]] + except FileNotFoundError: + print(f"Error: Input file not found at '{input_file}'") + return + + try: + # 1. Load the entire dataset using pandas + df = pd.read_json(input_file, lines=True) + # Prepare for complex filters by adding helper columns + notes_col = "action_chronicle" if "action_chronicle" in df.columns else "work_notes" + df["work_notes_len"] = df[notes_col].astype(str).str.len().fillna(0) + print(f"Loaded {len(df)} incidents from {input_file}") + except FileNotFoundError: + print(f"Error: Input file not found at '{input_file}'") + return + except ValueError: + print(f"Error: Could not parse {input_file}. Ensure it's a valid .jsonl file.") + return + + # 2. Define the 15 selection criteria based on our checklist + # selection_criteria = [ + # # Group 1: High Volume + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"query": "category == 'RAN'"}, + # {"name": "High-Volume RAN Auto-Resolution", "query": "category == 'RAN' and priority == 'Sev-3' and close_code == 'Auto Recover'"}, + + # {"name": "High-Volume Transport FIM", "query": "category == 'Transport / Network' and priority == 'Sev-3' and assignment_group.str.contains('Transport FIM', na=False)"}, + # {"name": "Standard NOC Alarm Clearance", "query": "close_code == 'Alarm 
Cleared' and priority == 'Sev-4'"}, + # {"name": "Common Software Issue", "query": "u_probable_cause == 'Software'"}, + # {"name": "Common Hardware Issue", "query": "u_probable_cause == 'Hardware' and close_code == 'Fiber Repaired/Replaced'"}, + # # Group 2: Critical & Escalated + # {"name": "High-Priority Emergency", "query": "priority == 'Sev-1' and assignment_group.str.contains('Emergency', na=False)"}, + # {"name": "Specialist Team Escalation", "query": "assignment_group.str.contains('NOCoE', na=False) and category == 'CORE'"}, + # {"name": "Long Work Notes", "query": "", "special": "longest_work_notes"}, # Special case + # {"name": "High-Priority FIM Case", "query": "priority == 'Sev-2' and assignment_group.str.contains('FIM', na=False)"}, + # # Group 3: Diverse Scenarios + # {"name": "Power-Related Outage", "query": "u_probable_cause == 'Power' and close_code == 'Commercial Power Restored'"}, + # {"name": "Network Fix Resolution", "query": "close_code == 'Network Fix' and category == 'Transport / Network'"}, + # {"name": "Vendor-Specific FIM (Mavenir)", "query": "assignment_group == 'Wireless - NOC RAN FIM - Mavenir'"}, + # {"name": "Site Monitoring Issue", "query": "category == 'Site Monitoring'"}, + # {"name": "WCS Category", "query": "category == 'WCS'"}, + # {"name": "CORE AO Case", "query": "assignment_group == 'Wireless - NOC Core AO'"}, + # ] + + selected_incidents = [] + selected_indices = set() + + available_df = df.drop(index=list(selected_indices)) # Exclude already chosen incidents + + if selected_criteria: + selected_incidents = available_df.query(f"category == '{selected_criteria}'") + else: + selected_incidents = available_df + # print(selected_incidents) + # print(f"category == '{{{selected_criteria}}}'") + data = selected_incidents.head(max_incidents).to_dict(orient="records") + + # data = selected_incidents + + style = """ + + """ + + html_doc = f"Incident Report{style}" + html_doc += "

Incident Analysis Report

" + + for inc_data in data: + html_doc += render(inc_data) + + html_doc += "
" + + Path(output_file).write_text(html_doc, encoding="utf-8") + print(f"Wrote {len(data)} incidents to {output_file}") + + +if __name__ == "__main__": + input_filename = sys.argv[1] + output_filename = sys.argv[2] + if len(sys.argv) > 3: + max_incidents = int(sys.argv[3]) + else: + max_incidents = 15 + if len(sys.argv) > 4: + selected_criteria = sys.argv[4] + else: + selected_criteria = None + main(input_filename, output_filename, max_incidents, selected_criteria)