Commit be4fdf9

ROB-699: prometheus + graph generating capability (#295)
1 parent 7a4b202 commit be4fdf9


50 files changed: +1514 -68 lines

FEATURES.md (+136)
@@ -0,0 +1,136 @@
# Features

This page documents and describes HolmesGPT's behaviour when it comes to its features.

## Root Cause Analysis

Also called Investigation, Root Cause Analysis (RCA) is HolmesGPT's ability to investigate alerts,
typically from Prometheus' Alertmanager.

### Sectioned output

HolmesGPT generates structured output by default. It is also capable of generating sections based on request.

Here is an example of a request payload to run an investigation:

```json
{
  "source": "prometheus",
  "source_instance_id": "some-instance",
  "title": "Pod is crash looping.",
  "description": "Pod default/oomkill-deployment-696dbdbf67-d47z6 (main2) is in waiting state (reason: 'CrashLoopBackOff').",
  "subject": {
    "name": "oomkill-deployment-696dbdbf67-d47z6",
    "subject_type": "deployment",
    "namespace": "default",
    "node": "some-node",
    "container": "main2",
    "labels": {
      "x": "y",
      "p": "q"
    },
    "annotations": {}
  },
  "context": {
    "robusta_issue_id": "5b3e2fb1-cb83-45ea-82ec-318c94718e44"
  },
  "include_tool_calls": true,
  "include_tool_call_results": true,
  "sections": {
    "Alert Explanation": "1-2 sentences explaining the alert itself - note don't say \"The alert indicates a warning event related to a Kubernetes pod doing blah\" rather just say \"The pod XYZ did blah\" because that is what the user actually cares about",
    "Conclusions and Possible Root causes": "What conclusions can you reach based on the data you found? What are possible root causes (if you have enough conviction to say) or what uncertainty remains? Don't say 'root cause' but 'possible root causes'. Be clear to distinguish between what you know for certain and what is a possible explanation",
    "Related logs": "Truncate and share the most relevant logs, especially if these explain the root cause. For example: \nLogs from pod robusta-holmes:\n```\n<logs>```\n. Always embed the surrounding +/- 5 log lines to any relevant logs."
  }
}
```

Notice that the "sections" field contains 3 different sections. The text value for each section should be a prompt telling the LLM what the section should contain.
You can then expect the following in return:

```
{
  "analysis": <monolithic text response. Contains all the sections aggregated together>,
  "sections": {
    "Alert Explanation": <A markdown text with the explanation of the alert>,
    "Conclusions and Possible Root causes": <Conclusions reached by the LLM>,
    "Related logs": <Any related logs the LLM could find through tools>
  },
  "tool_calls": <tool calls>,
  "instructions": <Specific instructions used for this investigation>
}
```

In some cases, the LLM may decide to set a section to `null`, or even to add or ignore some sections.

## PromQL

If the `prometheus/metrics` toolset is enabled, HolmesGPT can generate and embed graphs in conversations (ask Holmes).

For example, here is a scenario in which the LLM answers with a graph:

User question:

```
Show me the http request latency over time for the service customer-orders-service?
```

HolmesGPT text response:

```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "9kLK"} >>
```

In addition to this text response, the returned JSON will contain one or more tool calls, including the Prometheus query:

```json
"tool_calls": [
  {
    "tool_call_id": "call_lKI7CQW6Y2n1ZQ5dlxX79TcM",
    "tool_name": "execute_prometheus_range_query",
    "description": "Prometheus query_range. query=rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m]), start=1739705559, end=1739791959, step=300, description=HTTP request latency for customer-orders-service",
    "result": "{\n \"status\": \"success\",\n \"random_key\": \"9kLK\",\n \"tool_name\": \"execute_prometheus_range_query\",\n \"description\": \"Average HTTP request latency for customer-orders-service\",\n \"query\": \"rate(http_request_duration_seconds_sum{service=\\\"customer-orders-service\\\"}[5m]) / rate(http_request_duration_seconds_count{service=\\\"customer-orders-service\\\"}[5m])\",\n \"start\": \"1739705559\",\n \"end\": \"1739791959\",\n \"step\": 60\n}"
  }
],
```

The result of this tool call contains details about the [Prometheus range query](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) used to build the graph returned by HolmesGPT:

```json
{
  "status": "success",
  "random_key": "9kLK",
  "tool_name": "execute_prometheus_range_query",
  "description": "Average HTTP request latency for customer-orders-service",
  "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])",
  "start": "1739705559", // Can be rfc3339 or a unix timestamp
  "end": "1739791959", // Can be rfc3339 or a unix timestamp
  "step": 60 // Query resolution step width in seconds
}
```

In addition to `execute_prometheus_range_query`, HolmesGPT can generate similar results with `execute_prometheus_instant_query`, which runs an [instant query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries):

```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_instant_query", "random_key": "2KiL"} >>
```

```json
{
  "status": "success",
  "random_key": "2KiL",
  "tool_name": "execute_prometheus_instant_query",
  "description": "Average HTTP request latency for customer-orders-service",
  "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])"
}
```

Unlike the range query, the instant query result lacks the `start`, `end` and `step` arguments.
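The sectioned-output request above can also be assembled programmatically. The following is a minimal sketch, not part of this commit: the field values are placeholders taken from the example payload, and the HTTP transport (client, endpoint, auth) is deliberately left out because it depends on your deployment.

```python
import json

def build_investigation_request(title: str, description: str, sections: dict) -> dict:
    """Build an investigation payload shaped like the documented example."""
    return {
        "source": "prometheus",
        "source_instance_id": "some-instance",  # placeholder
        "title": title,
        "description": description,
        "subject": {
            "name": "oomkill-deployment-696dbdbf67-d47z6",  # placeholder
            "namespace": "default",
        },
        "include_tool_calls": True,
        "include_tool_call_results": True,
        # Each value is a prompt telling the LLM what the section should contain
        "sections": sections,
    }

payload = build_investigation_request(
    "Pod is crash looping.",
    "Pod default/oomkill-deployment-696dbdbf67-d47z6 is in CrashLoopBackOff.",
    {"Alert Explanation": "1-2 sentences explaining the alert itself."},
)
print(json.dumps(payload, indent=2))
```

The dict round-trips through JSON as-is, so it can be handed to whatever HTTP client your setup uses.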

holmes/core/openai_formatting.py (+51)
@@ -0,0 +1,51 @@
import re

# parses both simple types: "int", "array", "string"
# but also arrays of those simpler types: "array[int]", "array[string]", etc.
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"


def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())

    if not match:
        raise ValueError(f"Invalid type format: {type_value}")

    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    else:
        return {"type": match.group("simple_type")}


def format_tool_to_open_ai_standard(
    tool_name: str, tool_description: str, tool_parameters: dict
):
    tool_properties = {}
    for param_name, param_attributes in tool_parameters.items():
        tool_properties[param_name] = type_to_open_ai_schema(param_attributes.type)
        if param_attributes.description is not None:
            tool_properties[param_name]["description"] = param_attributes.description

    result = {
        "type": "function",
        "function": {
            "name": tool_name,
            "description": tool_description,
            "parameters": {
                "properties": tool_properties,
                "required": [
                    param_name
                    for param_name, param_attributes in tool_parameters.items()
                    if param_attributes.required
                ],
                "type": "object",
            },
        },
    }

    # gemini doesn't accept a "parameters" object when there are no params.
    # `tool_properties` is initialised to {} and is never None, so check for
    # emptiness rather than `is None`
    if not tool_properties:
        result["function"].pop("parameters")

    return result
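To make the type mapping concrete, here is a small self-contained sketch that reproduces the regex and `type_to_open_ai_schema` from the file above and shows how the two supported shapes resolve:

```python
import re

# Same pattern as holmes/core/openai_formatting.py: matches simple types
# ("int", "string") and arrays of them ("array[int]", "array[string]").
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"

def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())
    if not match:
        raise ValueError(f"Invalid type format: {type_value}")
    if match.group("inner_type"):
        # "array[X]" becomes an OpenAI-style array schema with item type X
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    return {"type": match.group("simple_type")}

print(type_to_open_ai_schema("string"))      # {'type': 'string'}
print(type_to_open_ai_schema("array[int]"))  # {'type': 'array', 'items': {'type': 'int'}}
```

Malformed inputs such as `"array["` match neither branch and raise `ValueError`.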

holmes/core/performance_timing.py (+22 -11)
@@ -49,14 +49,25 @@ def end(self):
         )


-def log_function_timing(func):
-    @wraps(func)
-    def function_timing_wrapper(*args, **kwargs):
-        start_time = time.perf_counter()
-        result = func(*args, **kwargs)
-        end_time = time.perf_counter()
-        total_time = int((end_time - start_time) * 1000)
-        logging.info(f'Function "{func.__name__}()" took {total_time}ms')
-        return result
-
-    return function_timing_wrapper
+def log_function_timing(label=None):
+    def decorator(func):
+        @wraps(func)
+        def function_timing_wrapper(*args, **kwargs):
+            start_time = time.perf_counter()
+            result = func(*args, **kwargs)
+            end_time = time.perf_counter()
+            total_time = int((end_time - start_time) * 1000)
+
+            function_identifier = (
+                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
+            )
+            logging.info(f"Function {function_identifier} took {total_time}ms")
+            return result
+
+        return function_timing_wrapper
+
+    if callable(label):
+        func = label
+        label = None
+        return decorator(func)
+    return decorator
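The change makes the decorator usable both bare (`@log_function_timing`) and with an optional label (`@log_function_timing("label")`): in the bare form, Python passes the function itself as `label`, which the `callable(label)` check detects. A self-contained sketch of the same pattern, with throwaway example functions:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)

def log_function_timing(label=None):
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            total_time = int((time.perf_counter() - start_time) * 1000)
            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result
        return function_timing_wrapper

    # Bare usage: @log_function_timing passes the function as `label`
    if callable(label):
        func = label
        label = None
        return decorator(func)
    return decorator

@log_function_timing          # logs: Function "add()" took Nms
def add(a, b):
    return a + b

@log_function_timing("math")  # logs: Function "math: add2()" took Nms
def add2(a, b):
    return a + b

print(add(2, 3))   # 5
print(add2(2, 3))  # 5
```

Thanks to `@wraps`, the wrapped functions keep their original `__name__`, which the log message relies on.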

holmes/core/tools.py (+8 -31)
@@ -19,6 +19,8 @@
     model_validator,
 )
 
+from holmes.core.openai_formatting import format_tool_to_open_ai_standard
+
 
 ToolsetPattern = Union[Literal["*"], List[str]]
 
@@ -81,36 +83,11 @@ class Tool(ABC, BaseModel):
     additional_instructions: Optional[str] = None
 
     def get_openai_format(self):
-        tool_properties = {}
-        for param_name, param_attributes in self.parameters.items():
-            tool_properties[param_name] = {"type": param_attributes.type}
-            if param_attributes.description is not None:
-                tool_properties[param_name]["description"] = (
-                    param_attributes.description
-                )
-
-        result = {
-            "type": "function",
-            "function": {
-                "name": self.name,
-                "description": self.description,
-                "parameters": {
-                    "properties": tool_properties,
-                    "required": [
-                        param_name
-                        for param_name, param_attributes in self.parameters.items()
-                        if param_attributes.required
-                    ],
-                    "type": "object",
-                },
-            },
-        }
-
-        # gemini doesnt have parameters object if it is without params
-        if tool_properties is None:
-            result["function"].pop("parameters")
-
-        return result
+        return format_tool_to_open_ai_standard(
+            tool_name=self.name,
+            tool_description=self.description,
+            tool_parameters=self.parameters,
+        )
 
     def invoke(self, params: Dict) -> str:
         logging.info(
@@ -423,7 +400,7 @@ def invoke(self, tool_name: str, params: Dict) -> str:
         tool = self.get_tool_by_name(tool_name)
         return tool.invoke(params) if tool else ""
 
-    def get_tool_by_name(self, name: str) -> Optional[YAMLTool]:
+    def get_tool_by_name(self, name: str) -> Optional[Tool]:
         if name in self.tools_by_name:
             return self.tools_by_name[name]
         logging.warning(f"could not find tool {name}. skipping")

holmes/plugins/prompts/generic_ask_conversation.jinja2 (+19)
@@ -8,6 +8,25 @@ Use conversation history to maintain continuity when appropriate, ensuring effic
 
 {% include '_general_instructions.jinja2' %}
 
+Prometheus/PromQL queries
+* Use prometheus to execute promql queries with the tools `execute_prometheus_instant_query` and `execute_prometheus_range_query`
+* ALWAYS embed the execution results into your answer
+* You only need to embed the partial result in your response. Include the "tool_name" and "random_key". For example: << {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "92jf2hf"} >>
+* Use these tools to generate charts that users can see. Here are standard metrics but you can use different ones:
+** For memory consumption: `container_memory_working_set_bytes`
+** For CPU usage: `container_cpu_usage_seconds_total`
+** For CPU throttling: `container_cpu_cfs_throttled_periods_total`
+** For latencies, prefer using `<metric>_sum` / `<metric>_count` over a sliding window
+** Avoid using `<metric>_bucket` unless you know the bucket's boundaries are configured correctly
+** Prefer individual averages like `rate(<metric>_sum) / rate(<metric>_count)`
+** Avoid global averages like `sum(rate(<metric>_sum)) / sum(rate(<metric>_count))` because it hides data and is not generally informative
+* Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
+* Only generate and execute a prometheus query after checking what metrics are available with the `list_available_metrics` tool
+* Check that any node, service, pod, container, app, namespace, etc. mentioned in the query exist in the kubernetes cluster before making a query. Use any appropriate kubectl tool(s) for this
+* The toolcall will return no data to you. That is expected. You MUST however ensure that the query is successful.
+* You can get the current time before executing a prometheus range query
+* ALWAYS embed the execution results into your answer
+
 Style guide:
 * Reply with terse output.
 * Be painfully concise.
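The prompt instructs the model to emit `<< {...} >>` markers that post-processing replaces with charts by matching the `random_key` against the recorded tool calls. The commit does not show that parser, but a hypothetical sketch of extracting the markers could look like this (the marker JSON is flat, so a non-greedy match up to the first `}` suffices):

```python
import json
import re

# Matches << { ... } >> embed markers emitted per the prompt above.
EMBED_RE = re.compile(r"<<\s*(\{.*?\})\s*>>")

def extract_embeds(text: str) -> list[dict]:
    """Return the parsed JSON payload of every embed marker in `text`."""
    return [json.loads(m) for m in EMBED_RE.findall(text)]

response = (
    "Here's the average HTTP request latency:\n"
    '<< {"type": "promql", "tool_name": "execute_prometheus_range_query", '
    '"random_key": "9kLK"} >>'
)
embeds = extract_embeds(response)
print(embeds[0]["random_key"])  # 9kLK
```

A renderer would then look up the tool call whose result carries the same `random_key` and re-run its query to draw the chart.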

holmes/plugins/toolsets/__init__.py (+4 -2)
@@ -10,12 +10,13 @@
 from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
 from holmes.plugins.toolsets.internet.internet import InternetToolset
 from holmes.plugins.toolsets.internet.notion import NotionToolset
+from holmes.plugins.toolsets.prometheus import PrometheusToolset
+from holmes.plugins.toolsets.opensearch import OpenSearchToolset
+from holmes.plugins.toolsets.kafka import KafkaToolset
 
 from holmes.core.tools import Toolset, YAMLToolset
-from holmes.plugins.toolsets.opensearch import OpenSearchToolset
 import yaml
 
-from holmes.plugins.toolsets.kafka import KafkaToolset
 
 THIS_DIR = os.path.abspath(os.path.dirname(__file__))
 
@@ -52,6 +53,7 @@ def load_python_toolsets(dal: Optional[SupabaseDal]) -> List[Toolset]:
         GrafanaTempoToolset(),
         NotionToolset(),
         KafkaToolset(),
+        PrometheusToolset(),
         DatetimeToolset(),
     ]
 
