ROB-699: prometheus + graph generating capability #295

Merged
merged 30 commits into master from add_promql_capabilities
Mar 9, 2025
43234f4
feat: prometheus + graph generating capability
nherment Feb 20, 2025
6b42ea5
feat: fix prometheus tests
nherment Feb 20, 2025
3a6aa3b
test: split prometheus unit and integration tests
nherment Feb 20, 2025
b11e3e1
chore: ruff
nherment Feb 20, 2025
2602388
feat: prometheus toolset no longer returns query result butv alidate …
nherment Feb 21, 2025
c92a9c0
test: add test and comment to openai formatting for toolset typeS
nherment Feb 21, 2025
bbceaf4
feat: tweak the prompt for prometheus queries
nherment Feb 21, 2025
2b3a55a
doc: add link to docs for datetime toolset
nherment Feb 21, 2025
45c5617
feat: update icon for datetime toolset
nherment Feb 21, 2025
494cbfd
fix: remove unused var
nherment Feb 21, 2025
31ec106
fix: revert change to labels query for testing
nherment Feb 21, 2025
58be19c
Add type to promql embed, document api for promql results
nherment Feb 25, 2025
e22e41d
Merge branch 'master' into add_promql_capabilities
nherment Feb 25, 2025
8b7cad0
remove unused code
nherment Feb 25, 2025
7991c8e
doc: fix typo, missing code block closure
nherment Feb 25, 2025
447ff94
feat: prometheus toolset llm mention use json
nherment Mar 5, 2025
48fe2bd
Merge remote-tracking branch 'origin/master' into add_promql_capabili…
nherment Mar 5, 2025
fca568b
feat: rmove PUSH_EVALS_TO_BRAINTRUST b/c of complex conflicts with ma…
nherment Mar 5, 2025
03d17ed
fix: prometehus unstable test 33
nherment Mar 5, 2025
51531a6
fix: mock use kubectl_get_by_name instead of kubectl_get which no lon…
nherment Mar 5, 2025
22cf9fa
fix: improve prompt
nherment Mar 5, 2025
b55cb88
fix: simplify prometheus eval
nherment Mar 5, 2025
e8434c2
fix: simplify prometheus eval
nherment Mar 5, 2025
ee27e7d
feat: rmove PUSH_EVALS_TO_BRAINTRUST b/c of complex conflicts with ma…
nherment Mar 5, 2025
df4c1c5
Merge branch 'master' into add_promql_capabilities
nherment Mar 6, 2025
43baaaa
test: fix typo in test_investigate/05_crashpod
nherment Mar 6, 2025
48c2e91
Merge branch 'master' into add_promql_capabilities
nherment Mar 6, 2025
2cd34ba
Merge branch 'master' into add_promql_capabilities
nherment Mar 7, 2025
619e53a
chore: address PR comments
nherment Mar 8, 2025
4a06177
Merge branch 'master' into add_promql_capabilities
nherment Mar 8, 2025
136 changes: 136 additions & 0 deletions FEATURES.md
@@ -0,0 +1,136 @@

# Features

This page documents and describes HolmesGPT's behaviour across its features.


## Root Cause Analysis

Also called Investigation, Root Cause Analysis (RCA) is HolmesGPT's ability to investigate alerts,
typically from Prometheus Alertmanager.

### Sectioned output

HolmesGPT generates structured output by default. It can also generate custom sections on request.

Here is an example of a request payload to run an investigation:

```json
{
    "source": "prometheus",
    "source_instance_id": "some-instance",
    "title": "Pod is crash looping.",
    "description": "Pod default/oomkill-deployment-696dbdbf67-d47z6 (main2) is in waiting state (reason: 'CrashLoopBackOff').",
    "subject": {
        "name": "oomkill-deployment-696dbdbf67-d47z6",
        "subject_type": "deployment",
        "namespace": "default",
        "node": "some-node",
        "container": "main2",
        "labels": {
            "x": "y",
            "p": "q"
        },
        "annotations": {}
    },
    "context": {
        "robusta_issue_id": "5b3e2fb1-cb83-45ea-82ec-318c94718e44"
    },
    "include_tool_calls": true,
    "include_tool_call_results": true,
    "sections": {
        "Alert Explanation": "1-2 sentences explaining the alert itself - note don't say \"The alert indicates a warning event related to a Kubernetes pod doing blah\" rather just say \"The pod XYZ did blah\" because that is what the user actually cares about",
        "Conclusions and Possible Root causes": "What conclusions can you reach based on the data you found? what are possible root causes (if you have enough conviction to say) or what uncertainty remains. Don't say root cause but 'possible root causes'. Be clear to distinguish between what you know for certain and what is a possible explanation",
        "Related logs": "Truncate and share the most relevant logs, especially if these explain the root cause. For example: \nLogs from pod robusta-holmes:\n```\n<logs>```\n. Always embed the surrounding +/- 5 log lines to any relevant logs."
    }
}
```

Notice that the `sections` field contains three different sections. The text value for each section is a prompt telling the LLM what that section should contain.
You can then expect the following in return:

```
{
    "analysis": <monolithic text response. Contains all the sections aggregated together>,
    "sections": {
        "Alert Explanation": <A markdown text with the explanation of the alert>,
        "Conclusions and Possible Root causes": <Conclusions reached by the LLM>,
        "Related logs": <Any related logs the LLM could find through tools>
    },
    "tool_calls": <tool calls>,
    "instructions": <Specific instructions used for this investigation>
}
```

In some cases, the LLM may decide to set a section to `null` or even add or ignore some sections.
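
Such an investigation request can be assembled programmatically. Here is a minimal sketch; the endpoint URL and port are assumptions for illustration only, not taken from this PR:

```python
import json
import urllib.request

def build_investigation_payload(title, description, subject, sections):
    """Assemble a request payload matching the structure shown above."""
    return {
        "source": "prometheus",
        "title": title,
        "description": description,
        "subject": subject,
        "include_tool_calls": True,
        "include_tool_call_results": True,
        "sections": sections,
    }

payload = build_investigation_payload(
    title="Pod is crash looping.",
    description="Pod default/oomkill-deployment-696dbdbf67-d47z6 is in CrashLoopBackOff.",
    subject={"name": "oomkill-deployment-696dbdbf67-d47z6", "namespace": "default"},
    sections={
        "Alert Explanation": "1-2 sentences explaining the alert itself.",
    },
)

# Hypothetical endpoint, for illustration only.
request = urllib.request.Request(
    "http://localhost:5050/api/investigate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
```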


## PromQL

If the `prometheus/metrics` toolset is enabled, HolmesGPT can embed graphs in conversations (ask Holmes).

For example, here is a scenario in which the LLM answers with a graph:


User question:

```
Show me the http request latency over time for the service customer-orders-service?
```


HolmesGPT text response:
```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "9kLK"} >>
```

In addition to this text response, the returned JSON contains one or more tool calls, including the Prometheus query:

```json
"tool_calls": [
    {
        "tool_call_id": "call_lKI7CQW6Y2n1ZQ5dlxX79TcM",
        "tool_name": "execute_prometheus_range_query",
        "description": "Prometheus query_range. query=rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m]), start=1739705559, end=1739791959, step=300, description=HTTP request latency for customer-orders-service",
        "result": "{\n \"status\": \"success\",\n \"random_key\": \"9kLK\",\n \"tool_name\": \"execute_prometheus_range_query\",\n \"description\": \"Average HTTP request latency for customer-orders-service\",\n \"query\": \"rate(http_request_duration_seconds_sum{service=\\\"customer-orders-service\\\"}[5m]) / rate(http_request_duration_seconds_count{service=\\\"customer-orders-service\\\"}[5m])\",\n \"start\": \"1739705559\",\n \"end\": \"1739791959\",\n \"step\": 60\n}"
    }
],
```

The result of this tool call contains details about the [prometheus query](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) to build the graph returned by HolmesGPT:

```json
{
    "status": "success",
    "random_key": "9kLK",
    "tool_name": "execute_prometheus_range_query",
    "description": "Average HTTP request latency for customer-orders-service",
    "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])",
    "start": "1739705559", // Can be rfc3339 or a unix timestamp
    "end": "1739791959", // Can be rfc3339 or a unix timestamp
    "step": 60 // Query resolution step width in seconds
}
```
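
A client can re-run this query against Prometheus's range-query HTTP API to render the chart. A minimal sketch of building that request (the Prometheus base URL is an assumption):

```python
import json
import urllib.parse

def build_range_query_url(base_url, tool_result):
    """Build a Prometheus /api/v1/query_range URL from a tool-call result."""
    params = urllib.parse.urlencode({
        "query": tool_result["query"],
        "start": tool_result["start"],  # rfc3339 or unix timestamp
        "end": tool_result["end"],      # rfc3339 or unix timestamp
        "step": tool_result["step"],    # resolution step width, in seconds
    })
    return f"{base_url}/api/v1/query_range?{params}"

# Tool-call result, as in the example above.
tool_result = json.loads("""
{
    "status": "success",
    "random_key": "9kLK",
    "tool_name": "execute_prometheus_range_query",
    "query": "rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])",
    "start": "1739705559",
    "end": "1739791959",
    "step": 60
}
""")

url = build_range_query_url("http://prometheus:9090", tool_result)
```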

In addition to `execute_prometheus_range_query`, HolmesGPT can produce similar results with `execute_prometheus_instant_query`, which runs an [instant query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries):

```
Here's the average HTTP request latency for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_instant_query", "random_key": "2KiL"} >>
```

```json
{
    "status": "success",
    "random_key": "2KiL",
    "tool_name": "execute_prometheus_instant_query",
    "description": "Average HTTP request latency for customer-orders-service",
    "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])"
}
```

Unlike the range query, the instant query result lacks the `start`, `end` and `step` arguments.
51 changes: 51 additions & 0 deletions holmes/core/openai_formatting.py
@@ -0,0 +1,51 @@
import re

# parses both simple types: "int", "array", "string"
# but also arrays of those simpler types: "array[int]", "array[string]", etc.
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"


def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())

    if not match:
        raise ValueError(f"Invalid type format: {type_value}")

    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    else:
        return {"type": match.group("simple_type")}


def format_tool_to_open_ai_standard(
    tool_name: str, tool_description: str, tool_parameters: dict
):
    tool_properties = {}
    for param_name, param_attributes in tool_parameters.items():
        tool_properties[param_name] = type_to_open_ai_schema(param_attributes.type)
        if param_attributes.description is not None:
            tool_properties[param_name]["description"] = param_attributes.description

    result = {
        "type": "function",
        "function": {
            "name": tool_name,
            "description": tool_description,
            "parameters": {
                "properties": tool_properties,
                "required": [
                    param_name
                    for param_name, param_attributes in tool_parameters.items()
                    if param_attributes.required
                ],
                "type": "object",
            },
        },
    }

    # Gemini doesn't accept a "parameters" object when the tool has no params
    if not tool_properties:
        result["function"].pop("parameters")

    return result
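
A quick illustration of what the type parser above produces, re-declared here so the snippet is self-contained:

```python
import re

# Same pattern as in holmes/core/openai_formatting.py above.
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"

def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())
    if not match:
        raise ValueError(f"Invalid type format: {type_value}")
    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    return {"type": match.group("simple_type")}

print(type_to_open_ai_schema("string"))          # {'type': 'string'}
print(type_to_open_ai_schema("array[integer]"))  # {'type': 'array', 'items': {'type': 'integer'}}
```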
33 changes: 22 additions & 11 deletions holmes/core/performance_timing.py
@@ -49,14 +49,25 @@ def end(self):
)


def log_function_timing(func):
    @wraps(func)
    def function_timing_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = int((end_time - start_time) * 1000)
        logging.info(f'Function "{func.__name__}()" took {total_time}ms')
        return result

    return function_timing_wrapper
def log_function_timing(label=None):
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            end_time = time.perf_counter()
            total_time = int((end_time - start_time) * 1000)

            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result

        return function_timing_wrapper

    if callable(label):
        func = label
        label = None
        return decorator(func)
    return decorator
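
The updated decorator supports both a bare and a labelled form. A self-contained sketch of how the two call sites behave (the decorated functions are hypothetical examples):

```python
import logging
import time
from functools import wraps

def log_function_timing(label=None):
    """Dual-form timing decorator: @log_function_timing or @log_function_timing("label")."""
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            total_time = int((time.perf_counter() - start_time) * 1000)
            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result
        return function_timing_wrapper

    if callable(label):
        # Bare usage: the decorated function arrives as `label`.
        func, label = label, None
        return decorator(func)
    return decorator

@log_function_timing
def investigate():
    return "done"

@log_function_timing("prometheus")
def run_query():
    return 42
```

Note that `@wraps` preserves the wrapped function's name, so log lines and introspection still show `investigate` rather than `function_timing_wrapper`.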
39 changes: 8 additions & 31 deletions holmes/core/tools.py
@@ -19,6 +19,8 @@
    model_validator,
)

from holmes.core.openai_formatting import format_tool_to_open_ai_standard


ToolsetPattern = Union[Literal["*"], List[str]]

@@ -81,36 +83,11 @@ class Tool(ABC, BaseModel):
    additional_instructions: Optional[str] = None

    def get_openai_format(self):
        tool_properties = {}
        for param_name, param_attributes in self.parameters.items():
            tool_properties[param_name] = {"type": param_attributes.type}
            if param_attributes.description is not None:
                tool_properties[param_name]["description"] = (
                    param_attributes.description
                )

        result = {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": {
                    "properties": tool_properties,
                    "required": [
                        param_name
                        for param_name, param_attributes in self.parameters.items()
                        if param_attributes.required
                    ],
                    "type": "object",
                },
            },
        }

        # gemini doesnt have parameters object if it is without params
        if tool_properties is None:
            result["function"].pop("parameters")

        return result
        return format_tool_to_open_ai_standard(
            tool_name=self.name,
            tool_description=self.description,
            tool_parameters=self.parameters,
        )

    def invoke(self, params: Dict) -> str:
        logging.info(
@@ -423,7 +400,7 @@ def invoke(self, tool_name: str, params: Dict) -> str:
        tool = self.get_tool_by_name(tool_name)
        return tool.invoke(params) if tool else ""

    def get_tool_by_name(self, name: str) -> Optional[YAMLTool]:
    def get_tool_by_name(self, name: str) -> Optional[Tool]:
        if name in self.tools_by_name:
            return self.tools_by_name[name]
        logging.warning(f"could not find tool {name}. skipping")
19 changes: 19 additions & 0 deletions holmes/plugins/prompts/generic_ask_conversation.jinja2
@@ -8,6 +8,25 @@ Use conversation history to maintain continuity when appropriate, ensuring effic

{% include '_general_instructions.jinja2' %}

Prometheus/PromQL queries
* Use prometheus to execute promql queries with the tools `execute_prometheus_instant_query` and `execute_prometheus_range_query`
* ALWAYS embed the execution results into your answer
* You only need to embed the partial result in your response. Include the "tool_name" and "random_key". For example: << {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "92jf2hf"} >>
* Use these tools to generate charts that users can see. Here are standard metrics but you can use different ones:
** For memory consumption: `container_memory_working_set_bytes`
** For CPU usage: `container_cpu_usage_seconds_total`
** For CPU throttling: `container_cpu_cfs_throttled_periods_total`
** For latencies, prefer using `<metric>_sum` / `<metric>_count` over a sliding window
** Avoid using `<metric>_bucket` unless you know the bucket's boundaries are configured correctly
** Prefer individual averages like `rate(<metric>_sum) / rate(<metric>_count)`
** Avoid global averages like `sum(rate(<metric>_sum)) / sum(rate(<metric>_count))` because it hides data and is not generally informative
* Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
* Only generate and execute a prometheus query after checking what metrics are available with the `list_available_metrics` tool
* Check that any node, service, pod, container, app, namespace, etc. mentioned in the query exist in the kubernetes cluster before making a query. Use any appropriate kubectl tool(s) for this
* The tool call will return no data to you. That is expected. You MUST however ensure that the query is successful.
* You can get the current time before executing a prometheus range query
* ALWAYS embed the execution results into your answer
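
Post-processing of the answer text can recover these `<< {...} >>` embeds. Here is a minimal sketch of such a parser; this is an assumption for illustration, and the PR's actual parser may differ:

```python
import json
import re

# Matches << { ... } >> embeds emitted by the LLM in its text answer.
EMBED_PATTERN = re.compile(r"<<\s*(\{.*?\})\s*>>")

def extract_promql_embeds(text):
    """Return the parsed JSON payload of every promql embed in the answer."""
    return [json.loads(match) for match in EMBED_PATTERN.findall(text)]

answer = (
    "Here's the average HTTP request latency:\n\n"
    '<< {"type": "promql", "tool_name": "execute_prometheus_range_query", '
    '"random_key": "9kLK"} >>'
)
embeds = extract_promql_embeds(answer)
```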

Style guide:
* Reply with terse output.
* Be painfully concise.
6 changes: 4 additions & 2 deletions holmes/plugins/toolsets/__init__.py
@@ -10,12 +10,13 @@
from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
from holmes.plugins.toolsets.internet.internet import InternetToolset
from holmes.plugins.toolsets.internet.notion import NotionToolset
from holmes.plugins.toolsets.prometheus import PrometheusToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
from holmes.plugins.toolsets.kafka import KafkaToolset

from holmes.core.tools import Toolset, YAMLToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
import yaml

from holmes.plugins.toolsets.kafka import KafkaToolset

THIS_DIR = os.path.abspath(os.path.dirname(__file__))

@@ -52,6 +53,7 @@ def load_python_toolsets(dal: Optional[SupabaseDal]) -> List[Toolset]:
        GrafanaTempoToolset(),
        NotionToolset(),
        KafkaToolset(),
        PrometheusToolset(),
        DatetimeToolset(),
    ]
