@hsin-c hsin-c commented Oct 14, 2025

This PR updates the alert triage agent to include a new config for the optimizer, and makes the system prompts (for both the core agent and the sub-agent) optimizable.

Test run of the optimizer config succeeded.
All unit tests passed.

Summary by CodeRabbit

  • New Features

    • Added a full offline optimizer workflow for the alert triage agent, including prompt and numeric parameter tuning, evaluators, and GA-style optimization.
    • Expanded offline tooling and model configurations to support multiple analysis and maintenance checks and LLM variants.
    • Agent and telemetry prompts are now optimization-aware; reusable prompt-purpose constants added.
  • Documentation

    • Updated guide with single-run, offline execution, dataset-driven optimization, example outputs, and optimizer command examples.

@hsin-c hsin-c requested a review from a team as a code owner October 14, 2025 00:16

coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds an offline optimizer YAML and README updates; converts several agent prompt fields to OptimizableField/SearchSpace with optimizer prompt-purpose metadata; adds an optimizer_prompts module; and wires optimizer/eval settings for an end-to-end offline alert-triage optimization workflow.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Offline optimizer configuration** `examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml` | New YAML defining workflow tool blocks, llms, workflow settings (offline_mode, offline/benign data paths), evaluators, and an optimizer section (numeric trials, GA prompt-population settings, reps, output paths). |
| **Prompt optimization integration** `examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py`, `examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py` | Replaces plain prompt fields with OptimizableField and attached SearchSpace (is_prompt, prompt, prompt_purpose). Adds imports for OptimizableField, SearchSpace, and OptimizerPrompts. tool_names made a Field(default_factory=list, ...). No runtime control-flow changes. |
| **New prompts module** `examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py` | New module defining OptimizerPrompts with AGENT_PROMPT_PURPOSE and TELEMETRY_AGENT_PROMPT_PURPOSE string constants used as prompt-purpose metadata. |
| **Documentation updates** `examples/advanced_agents/alert_triage_agent/README.md` | Expands README with an Optimization section and examples for nat optimize/offline runs, updated commands, output descriptions, and references to optimizer config and results. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer (CLI)
  participant NAT as nat optimize
  participant CFG as Config YAML
  participant DS as Offline Dataset
  participant WF as Alert Triage Workflow
  participant LLM as LLM(s)
  participant EVAL as Evaluators

  Dev->>NAT: nat optimize -c config_offline_optimizer.yml
  NAT->>CFG: Load workflow, tools, llms, optimizer settings
  NAT->>DS: Load offline dataset & benign fallbacks
  loop GA generations
    NAT->>NAT: Generate/mutate parameter sets (including prompts)
    par Parallel evaluations
      NAT->>WF: Instantiate workflow with parameter set (offline_mode)
      WF->>DS: Read sample input / fallbacks
      WF->>LLM: Invoke configured LLMs/tools per sample
      WF-->>NAT: Return outputs
      NAT->>EVAL: Score outputs (rag_accuracy, classification_accuracy)
      EVAL-->>NAT: Return scores
    end
    NAT->>NAT: Select/retain best parameter sets
  end
  NAT-->>Dev: Persist best params, reports to optimizer output_path
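
The generational loop in the diagram can be sketched in a few lines of Python. This is a toy illustration only, not the `nat optimize` implementation: `mutate`, `run_ga`, and `toy_score` are hypothetical names, a "parameter set" is reduced to a dict of numbers, and the scoring function stands in for running the workflow offline and collecting evaluator scores.

```python
import random
from typing import Callable

# Hypothetical stand-in: a "parameter set" is just a dict of numeric values here.
ParamSet = dict[str, float]


def mutate(params: ParamSet, rng: random.Random, scale: float = 0.1) -> ParamSet:
    """Return a copy of params with small Gaussian noise added to each value."""
    return {name: value + rng.gauss(0.0, scale) for name, value in params.items()}


def run_ga(seed_params: ParamSet,
           score: Callable[[ParamSet], float],
           generations: int = 5,
           population_size: int = 8,
           keep: int = 2,
           seed: int = 0) -> ParamSet:
    """Toy generational loop: mutate, score, and retain the best candidates."""
    rng = random.Random(seed)
    survivors = [seed_params]
    for _ in range(generations):
        # Generate/mutate parameter sets from the current survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in range(population_size)]
        # Rank survivors together with children, so the best score never regresses.
        ranked = sorted(survivors + children, key=score, reverse=True)
        # Select/retain the best parameter sets.
        survivors = ranked[:keep]
    return survivors[0]


def toy_score(params: ParamSet) -> float:
    """Stand-in for 'run the workflow offline and average the evaluator scores'."""
    return -(params["temperature"] - 0.3) ** 2


best = run_ga({"temperature": 1.0}, toy_score)
```

Because the survivors are ranked alongside their children, `best` always scores at least as well as the seed parameter set.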

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Update alert traige agent to work with nat optimizer" is fully related to the main change in the changeset and clearly summarizes the primary objective. The PR adds comprehensive optimizer support to the alert triage agent through a new offline optimizer configuration file, makes system prompts optimizable via code updates to register.py and telemetry_metrics_analysis_agent.py, introduces optimizer prompt purpose constants, and documents the optimization workflow in the README. The title uses imperative mood ("Update"), is concise at 52 characters (well under the ~72 character limit), and accurately represents the changeset content. However, the title contains a minor spelling error: "traige" should be "triage". |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6358d1 and 87cedc8.

📒 Files selected for processing (1)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
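
The exception-handling rule in the `**/*.py` instructions above can be sketched as follows. This is a minimal illustration using only the standard-library `logging` module; `parse_port` and `parse_port_or_default` are hypothetical functions, not code from this PR.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(raw: str) -> int:
    """Re-raise case: log with logger.error() and use a bare raise."""
    try:
        return int(raw)
    except ValueError:
        # logger.error() records the message without a traceback; the bare
        # raise keeps the original traceback for whoever handles it upstream.
        logger.error("Invalid port value: %r", raw)
        raise


def parse_port_or_default(raw: str, default: int = 8080) -> int:
    """Swallow case: log with logger.exception() because nothing is re-raised."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception() logs at ERROR level and appends the traceback.
        logger.exception("Invalid port value %r, falling back to %d", raw, default)
        return default
```

Splitting the two cases this way means each failure produces exactly one traceback in the logs: either from the upstream handler or from `logger.exception()`, never both.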
**/*

⚙️ CodeRabbit configuration file

**/*: Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit, at a minimum an example should contain a README.md or file README.ipynb.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should be checked into git-lfs.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
🧬 Code graph analysis (1)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (2)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)
  • OptimizerPrompts (22-56)
🔇 Additional comments (3)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (3)

27-28: LGTM! Necessary imports for optimizer integration.

These imports enable the agent_prompt field to support optimization metadata, aligning with the PR's objective to add optimizer configuration for the alert triage agent.


46-46: Import is now correct—past issue resolved.

The import correctly references .optimizer_prompts (plural), matching the actual module filename. The previously flagged ImportError concern has been addressed.


67-73: Well-structured OptimizableField configuration.

The conversion to OptimizableField with SearchSpace properly enables prompt optimization while maintaining the original default behavior. The is_prompt=True flag and prompt_purpose metadata correctly integrate with the offline optimizer workflow described in the PR objectives.


@hsin-c hsin-c added the non-breaking Non-breaking change label Oct 14, 2025
@coderabbitai coderabbitai bot added the feature request New feature or request label Oct 14, 2025
@dnandakumar-nv dnandakumar-nv left a comment

This looks great! Could we also add a section to the README letting developers know that an optimizable config is available and how to run optimization for this workflow?

@AnuradhaKaruppiah

> This looks great! Could we also add a section to the README letting developers know that an optimizable config is available and how to run optimization for this workflow?

+1

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py (1)

2-12: Consider using Final for immutable constants.

The class constants AGENT_PROMPT_PURPOSE and TELEMETRY_AGENT_PROMPT_PURPOSE are immutable and could benefit from explicit type annotations.

+from typing import Final
+
 class OptimizerPrompts:
-    AGENT_PROMPT_PURPOSE = """This is the system prompt that instructs the Alert Triage Agent on how to behave and respond to system alerts. It is used as a SystemMessage that's prepended to every LLM conversation, providing the agent with its role and behavior guidelines.
+    AGENT_PROMPT_PURPOSE: Final[str] = """This is the system prompt that instructs the Alert Triage Agent on how to behave and respond to system alerts. It is used as a SystemMessage that's prepended to every LLM conversation, providing the agent with its role and behavior guidelines.
 
 The prompt should be well-structured and provide specific instructions to help the agent:
 - Analyze incoming alerts and identify their type (e.g., InstanceDown, HighCPUUsage)
 - Select and use the appropriate diagnostic tools for each alert type (hardware_check, host_performance_check, network_connectivity_check, telemetry_metrics_analysis_agent, monitoring_process_check)
 - Avoid calling the same tool repeatedly during a single alert investigation
 - Correlate collected data from multiple tools to determine root causes
 - Distinguish between true issues, false positives, and benign anomalies
 - Generate structured markdown triage reports with clear sections: Alert Summary, Collected Metrics, Analysis, Recommended Actions, and Alert Status
 
 The prompt should give the agent clear security context and explicit instructions on the expected final report format to ensure consistent, actionable output for system analysts."""
-    TELEMETRY_AGENT_PROMPT_PURPOSE = """This is the system prompt for the Telemetry Metrics Analysis Agent, a specialized sub-agent within the alert triage system. It is used as a SystemMessage for a nested agent that the main Alert Triage Agent can call to analyze remotely collected telemetry data.
+    TELEMETRY_AGENT_PROMPT_PURPOSE: Final[str] = """This is the system prompt for the Telemetry Metrics Analysis Agent, a specialized sub-agent within the alert triage system. It is used as a SystemMessage for a nested agent that the main Alert Triage Agent can call to analyze remotely collected telemetry data.

Also applies to: 13-28

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (1)

67-75: Remove redundant prompt parameter from SearchSpace.

The OptimizableField implementation automatically uses the field's default value as the base prompt when space.prompt is not specified. Since both default and space.prompt are set to ALERT_TRIAGE_AGENT_PROMPT, the explicit prompt parameter is redundant.

Apply this diff to simplify the configuration:

     agent_prompt: str = OptimizableField(
         default=ALERT_TRIAGE_AGENT_PROMPT,
         description="The system prompt to use for the alert triage agent.",
         space=SearchSpace(
             is_prompt=True,
-            prompt=ALERT_TRIAGE_AGENT_PROMPT,
             prompt_purpose=OptimizerPrompts.AGENT_PROMPT_PURPOSE,
         )
     )

Based on the OptimizableField implementation in src/nat/data_models/optimizable.py (lines 78-82), which automatically falls back to the field's default when space.prompt is None.
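
The fallback this comment relies on can be illustrated with a small stand-in. The sketch does not import or reproduce `src/nat/data_models/optimizable.py`; `SketchSearchSpace` and `resolve_base_prompt` are hypothetical names that only model the behavior described above, under the assumption that an unset `prompt` falls back to the field default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSearchSpace:
    """Hypothetical stand-in for a prompt search space (not the NAT class)."""
    is_prompt: bool = False
    prompt: Optional[str] = None
    prompt_purpose: Optional[str] = None


def resolve_base_prompt(default: str, space: SketchSearchSpace) -> str:
    """Return the prompt an optimizer would start from under the assumed fallback."""
    # Assumption taken from the review comment: an unset space.prompt
    # falls back to the field's default value.
    return default if space.prompt is None else space.prompt


DEFAULT_PROMPT = "You are an alert triage agent."

explicit = SketchSearchSpace(is_prompt=True, prompt=DEFAULT_PROMPT, prompt_purpose="demo")
implicit = SketchSearchSpace(is_prompt=True, prompt_purpose="demo")

# Both configurations resolve to the same starting prompt.
same = resolve_base_prompt(DEFAULT_PROMPT, explicit) == resolve_base_prompt(DEFAULT_PROMPT, implicit)
```

If the real implementation behaves this way, passing `prompt=` with the same value as `default=` adds nothing but a second place to keep in sync.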

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (2)

35-43: Remove redundant prompt parameter from SearchSpace.

Similar to the main agent configuration, the space.prompt parameter is redundant since OptimizableField automatically uses the field's default value as the base prompt.

Apply this diff:

     prompt: str | None = OptimizableField(
         default=TelemetryMetricsAnalysisAgentPrompts.PROMPT,
         description="The system prompt to use for the alert triage agent.",
         space=SearchSpace(
             is_prompt=True,
-            prompt=TelemetryMetricsAnalysisAgentPrompts.PROMPT,
             prompt_purpose=OptimizerPrompts.TELEMETRY_AGENT_PROMPT_PURPOSE,
         )
     )

Based on the OptimizableField implementation in src/nat/data_models/optimizable.py.


33-33: Pre-existing issue: Mutable default for tool_names.

Ruff correctly flags that tool_names: list[str] = [] uses a mutable default, which can cause unexpected behavior if the list is modified. While this is a pre-existing issue not introduced by this PR, consider addressing it.

Apply this diff to fix the issue:

 class TelemetryMetricsAnalysisAgentConfig(FunctionBaseConfig, name="telemetry_metrics_analysis_agent"):
     description: str = Field(default=TelemetryMetricsAnalysisAgentPrompts.TOOL_DESCRIPTION,
                              description="Description of the tool for the triage agent.")
-    tool_names: list[str] = []
+    tool_names: list[str] = Field(default_factory=list)

Based on static analysis hints.
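
For background, the general Python pitfall behind this lint rule is that a mutable class attribute is a single object shared by every instance. The sketch below uses a plain class and the standard-library `dataclasses` module rather than the `Field` used in the diff, so it illustrates the idea only and makes no claim about how the config base class treats defaults; `SharedDefault` and `FreshDefault` are hypothetical names.

```python
from dataclasses import dataclass, field


class SharedDefault:
    # One list object lives on the class and is shared by all instances.
    tool_names: list[str] = []


@dataclass
class FreshDefault:
    # default_factory builds a new list for every instance.
    tool_names: list[str] = field(default_factory=list)


a, b = SharedDefault(), SharedDefault()
a.tool_names.append("hardware_check")
shared_leak = b.tool_names  # ["hardware_check"]: b sees a's mutation

c, d = FreshDefault(), FreshDefault()
c.tool_names.append("hardware_check")
fresh_isolated = d.tool_names  # []: d is unaffected
```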

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml (1)

69-69: Consider extracting the duplicated system objective.

The system_objective string is duplicated between prompt_init (line 69) and prompt_recombination (line 73). This long description could be extracted to a YAML anchor for better maintainability.

Apply this diff to use YAML anchors:

+  system_objective_description: &system_objective The alert triage agent autonomously investigates infrastructure monitoring alerts, performs root cause analysis, and generates structured diagnostic reports by dynamically selecting and orchestrating diagnostic tools including IPMI hardware checks, network connectivity tests, host performance monitoring, process status verification, and telemetry analysis, then correlating multi-source data through LLM-powered reasoning to classify issues into predefined categories (hardware, software, network, false positive, or requiring investigation), helping security analysts reduce manual triage workload, accelerate incident response times, and maintain consistent investigation quality through standardized evidence collection and automated documentation of findings and recommended remediation actions.
+
   prompt_init:
     _type: prompt_init
     optimizer_llm: optimizer_llm
-    system_objective: The alert triage agent autonomously investigates infrastructure monitoring alerts, performs root cause analysis, and generates structured diagnostic reports by dynamically selecting and orchestrating diagnostic tools including IPMI hardware checks, network connectivity tests, host performance monitoring, process status verification, and telemetry analysis, then correlating multi-source data through LLM-powered reasoning to classify issues into predefined categories (hardware, software, network, false positive, or requiring investigation), helping security analysts reduce manual triage workload, accelerate incident response times, and maintain consistent investigation quality through standardized evidence collection and automated documentation of findings and recommended remediation actions.
+    system_objective: *system_objective
   prompt_recombination:
     _type: prompt_recombiner
     optimizer_llm: optimizer_llm
-    system_objective: The alert triage agent autonomously investigates infrastructure monitoring alerts, performs root cause analysis, and generates structured diagnostic reports by dynamically selecting and orchestrating diagnostic tools including IPMI hardware checks, network connectivity tests, host performance monitoring, process status verification, and telemetry analysis, then correlating multi-source data through LLM-powered reasoning to classify issues into predefined categories (hardware, software, network, false positive, or requiring investigation), helping security analysts reduce manual triage workload, accelerate incident response times, and maintain consistent investigation quality through standardized evidence collection and automated documentation of findings and recommended remediation actions.
+    system_objective: *system_objective

Also applies to: 73-73

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83434fc and 01c0dec.

📒 Files selected for processing (4)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml (1 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py (1 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (3 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml
**/*.py

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
**/*

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml
examples/**/*

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml
**/*.{yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

In workflow/config YAML, set llms.._type: nat_test_llm to stub responses.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml
**/configs/**

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Configuration files consumed by code must be stored next to that code in a configs/ folder

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml
🧬 Code graph analysis (2)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (3)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py (1)
  • OptimizerPrompts (1-28)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-27)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (2)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompt.py (1)
  • OptimizerPrompts (1-28)
🪛 Ruff (0.14.0)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py

33-33: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (5)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (1)

27-28: LGTM: Imports are correctly added.

The new imports for optimization support (OptimizableField, SearchSpace, OptimizerPrompts) are properly organized and align with the changes to make the agent prompt optimizable.

Also applies to: 46-46

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1)

24-25: LGTM: Imports are correctly added.

The new imports for optimization support are properly organized and necessary for the prompt optimization functionality.

Also applies to: 28-28

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_optimizer.yml (3)

1-15: LGTM: License header is correct.

The Apache License 2.0 header is properly formatted and includes the correct copyright year (2025).

As per coding guidelines.


45-45: Placeholder URL in metrics configuration.

The metrics_url fields contain placeholder values (http://your-monitoring-server:9090) with a comment indicating they should be replaced when running in live mode. Since offline_mode: true is set for these functions, these placeholders are safe but could be clarified.

The placeholder URLs are acceptable given that offline mode is enabled. However, verify that the system gracefully handles these placeholder URLs if offline mode is accidentally disabled.

Also applies to: 50-50
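
One way to make that failure mode explicit is a small guard that rejects the placeholder when offline mode is off. This is a sketch only, not code from the PR: `check_metrics_url` and `PLACEHOLDER_HOSTS` are hypothetical, and the placeholder host is taken from the example URL quoted above.

```python
from urllib.parse import urlparse

# Hypothetical placeholder host, taken from the example URL in the comment above.
PLACEHOLDER_HOSTS = {"your-monitoring-server"}


def check_metrics_url(metrics_url: str, offline_mode: bool) -> str:
    """Return metrics_url, or raise if a placeholder would be used in live mode."""
    if offline_mode:
        # Offline runs never contact the server, so the placeholder is harmless.
        return metrics_url
    host = urlparse(metrics_url).hostname
    if host is None or host in PLACEHOLDER_HOSTS:
        raise ValueError(f"metrics_url {metrics_url!r} is a placeholder; set a real server for live mode")
    return metrics_url
```

Failing at configuration time like this surfaces the mistake before any tool attempts a network call against a host that does not exist.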


151-189: LGTM: Optimizer configuration is well-structured.

The optimizer section properly configures:

  • Both numeric and prompt optimization
  • Evaluation metrics with appropriate directions (maximize)
  • GA parameters for prompt optimization
  • References to the prompt initialization and recombination functions

The configuration aligns with the code changes that make prompts optimizable via OptimizableField.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)

24-50: Consider adding constant docstrings.

While the class docstring provides context, individual docstrings for AGENT_PROMPT_PURPOSE and TELEMETRY_AGENT_PROMPT_PURPOSE would improve clarity and follow best practices for public constants.

Example:

     """
     AGENT_PROMPT_PURPOSE = """This is the system prompt that instructs the Alert Triage Agent on how to behave and respond to system alerts. It is used as a SystemMessage that's prepended to every LLM conversation, providing the agent with its role and behavior guidelines.

could become:

    """
    
    #: Prompt purpose for the main Alert Triage Agent system prompt.
    AGENT_PROMPT_PURPOSE = """This is the system prompt that instructs the Alert Triage Agent on how to behave and respond to system alerts. It is used as a SystemMessage that's prepended to every LLM conversation, providing the agent with its role and behavior guidelines.
examples/advanced_agents/alert_triage_agent/README.md (2)

531-542: Fix list indentation.

The unordered list items should use 0 spaces of indentation instead of 3, as markdownlint rule MD007 (ul-indent) expects.

Apply this diff to fix the indentation:

-#### 1. **Set required environment variables**
+#### 1. Set required environment variables
 
    Make sure `offline_mode: true` is set in both the `workflow` section and individual tool sections of your config file (see [Understanding the configuration](#understanding-the-configuration) section).
 
-#### 2. **How offline mode works:**
+#### 2. How offline mode works:
 
-   - The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
-   - The **JSON offline dataset** (`eval.general.dataset.filepath` in the config) contains a subset of the information from the main CSV: the alert inputs and their associated ground truth root causes. It is used to run `nat eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock environment context.
-   - At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
-   - The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
+- The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
+- The **JSON offline dataset** (`eval.general.dataset.filepath` in the config) contains a subset of the information from the main CSV: the alert inputs and their associated ground truth root causes. It is used to run `nat eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock environment context.
+- At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
+- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
 
-#### 3. **Run the agent in offline mode**
+#### 3. Run the agent in offline mode

631-642: Fix list indentation for optimization workflow.

The list items under the optimization section should also use 0 spaces of indentation.

Apply this diff:

   The agent will:
-   - Load alerts from the JSON dataset specified in the config `eval.general.dataset.filepath`
-   - Run optimization for the metrics specified in the config `optimizer.eval_metrics`
-   - Save the optimization results to the path specified by `optimizer.output_dir`
+- Load alerts from the JSON dataset specified in the config `eval.general.dataset.filepath`
+- Run optimization for the metrics specified in the config `optimizer.eval_metrics`
+- Save the optimization results to the path specified by `optimizer.output_dir`
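
The two indentation findings above can be checked locally with a short script. The sketch below is a rough approximation of the idea behind MD007 for top-level lists only; it is not markdownlint's implementation, it ignores fenced and indented code blocks, and `find_indented_top_level_bullets` is a hypothetical helper name.

```python
import re

# Leading spaces followed by a bullet marker (-, * or +) and a space.
_BULLET = re.compile(r"^( *)[-*+] ")


def find_indented_top_level_bullets(markdown: str) -> list[int]:
    """Return 1-based line numbers of list items whose list starts indented."""
    flagged: list[int] = []
    base_indent = None  # indentation of the current list's first bullet
    for number, line in enumerate(markdown.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines do not end a list
        match = _BULLET.match(line)
        if match is None:
            # Non-bullet text at or left of the list's indentation ends the list.
            indent = len(line) - len(line.lstrip(" "))
            if base_indent is not None and indent <= base_indent:
                base_indent = None
            continue
        indent = len(match.group(1))
        if base_indent is None:
            base_indent = indent
        # Only bullets at the list's own (non-zero) base indentation are
        # reported; deeper bullets are treated as legitimate nesting.
        if indent == base_indent and indent > 0:
            flagged.append(number)
    return flagged
```

Run against the README text, a check like this would be expected to report the same kind of lines that markdownlint lists, though the authoritative result is still the linter's own output.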
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01c0dec and 10cfa06.

📒 Files selected for processing (3)
  • examples/advanced_agents/alert_triage_agent/README.md (5 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

# Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
    words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
    spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

# Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
    and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/README.md
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
    contain a README.md or file README.ipynb.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/README.md
**/README.@(md|ipynb)

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Ensure READMEs follow the naming convention; avoid deprecated names; use “NeMo Agent Toolkit” (capital T) in headings

Files:

  • examples/advanced_agents/alert_triage_agent/README.md
🧬 Code graph analysis (1)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (2)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)
  • OptimizerPrompts (16-50)
🪛 markdownlint-cli2 (0.18.1)
examples/advanced_agents/alert_triage_agent/README.md

537-537: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


538-538: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


539-539: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


540-540: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


639-639: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


640-640: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


641-641: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)

🪛 Ruff (0.14.0)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py

33-33: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (2)
examples/advanced_agents/alert_triage_agent/README.md (1)

298-393: LGTM! Excellent optimization documentation.

The new Optimization section is comprehensive, well-structured, and provides clear guidance on both numeric and prompt optimization. The examples and configuration snippets effectively demonstrate how to enable and configure the optimizer.

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1)

35-43: LGTM! Well-structured optimization configuration.

The conversion to OptimizableField with proper SearchSpace configuration is correct. The field maintains the non-None type as requested in past reviews, and properly wires the prompt purpose from OptimizerPrompts.

Signed-off-by: Hsin Chen <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10cfa06 and c6358d1.

📒 Files selected for processing (3)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (3 hunks)
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

# Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
    words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
    spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

# Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
    and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
    contain a README.md or file README.ipynb.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py
  • examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py
🧬 Code graph analysis (2)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (2)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)
  • OptimizerPrompts (22-56)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (2)
src/nat/data_models/optimizable.py (2)
  • OptimizableField (68-107)
  • SearchSpace (33-65)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)
  • OptimizerPrompts (22-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (3)
examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/optimizer_prompts.py (1)

1-56: LGTM - all documentation requirements met.

The file now includes the required Apache License header, module-level docstring, and class docstring. The prompt purpose constants are detailed and self-documenting, providing clear guidance for the optimizer.

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/register.py (1)

67-73: OptimizableField configuration looks correct.

The agent_prompt field is properly configured with SearchSpace metadata for prompt optimization, including the prompt purpose from OptimizerPrompts. The implementation follows the pattern established in the codebase.

examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/telemetry_metrics_analysis_agent.py (1)

24-25: All previous review issues have been addressed.

The changes correctly implement optimizer support:

  • Import path uses .optimizer_prompts (plural) ✓
  • tool_names uses Field(default_factory=list) to avoid mutable default ✓
  • prompt type is non-optional (str) ✓
  • OptimizableField configuration with SearchSpace is properly structured ✓

Also applies to: 28-28, 35-36, 38-44
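
The mutable-default point behind the `tool_names` change can be illustrated with the standard library alone. The sketch below uses `dataclasses.field(default_factory=list)` as a stand-in for the `Field(default_factory=list)` call in the PR; the class names are hypothetical, and pydantic's own handling of defaults may differ from a plain class attribute.

```python
from dataclasses import dataclass, field


class SharedDefault:
    # Anti-pattern: a single list object lives on the class
    # and is shared by every instance that does not rebind it.
    tool_names: list[str] = []


@dataclass
class PerInstanceDefault:
    # default_factory calls list() once per new instance,
    # so each instance gets its own independent list.
    tool_names: list[str] = field(default_factory=list)


a, b = SharedDefault(), SharedDefault()
a.tool_names.append("x")  # also visible through b, because the list is shared

c, d = PerInstanceDefault(), PerInstanceDefault()
c.tool_names.append("x")  # d keeps its own empty list
```

Using a factory keeps the declaration explicit about ownership, which is also what the RUF012 finding reported earlier in the review is nudging toward.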

Signed-off-by: Hsin Chen <[email protected]>

Labels

  • feature request: New feature or request
  • non-breaking: Non-breaking change

3 participants