ROB-699: prometheus + graph generating capability #295

Merged
merged 30 commits into master from add_promql_capabilities
Mar 9, 2025
43234f4
feat: prometheus + graph generating capability
nherment Feb 20, 2025
6b42ea5
feat: fix prometheus tests
nherment Feb 20, 2025
3a6aa3b
test: split prometheus unit and integration tests
nherment Feb 20, 2025
b11e3e1
chore: ruff
nherment Feb 20, 2025
2602388
feat: prometheus toolset no longer returns query result butv alidate …
nherment Feb 21, 2025
c92a9c0
test: add test and comment to openai formatting for toolset typeS
nherment Feb 21, 2025
bbceaf4
feat: tweak the prompt for prometheus queries
nherment Feb 21, 2025
2b3a55a
doc: add link to docs for datetime toolset
nherment Feb 21, 2025
45c5617
feat: update icon for datetime toolset
nherment Feb 21, 2025
494cbfd
fix: remove unused var
nherment Feb 21, 2025
31ec106
fix: revert change to labels query for testing
nherment Feb 21, 2025
58be19c
Add type to promql embed, document api for promql results
nherment Feb 25, 2025
e22e41d
Merge branch 'master' into add_promql_capabilities
nherment Feb 25, 2025
8b7cad0
remove unused code
nherment Feb 25, 2025
7991c8e
doc: fix typo, missing code block closure
nherment Feb 25, 2025
447ff94
feat: prometheus toolset llm mention use json
nherment Mar 5, 2025
48fe2bd
Merge remote-tracking branch 'origin/master' into add_promql_capabili…
nherment Mar 5, 2025
fca568b
feat: rmove PUSH_EVALS_TO_BRAINTRUST b/c of complex conflicts with ma…
nherment Mar 5, 2025
03d17ed
fix: prometehus unstable test 33
nherment Mar 5, 2025
51531a6
fix: mock use kubectl_get_by_name instead of kubectl_get which no lon…
nherment Mar 5, 2025
22cf9fa
fix: improve prompt
nherment Mar 5, 2025
b55cb88
fix: simplify prometheus eval
nherment Mar 5, 2025
e8434c2
fix: simplify prometheus eval
nherment Mar 5, 2025
ee27e7d
feat: rmove PUSH_EVALS_TO_BRAINTRUST b/c of complex conflicts with ma…
nherment Mar 5, 2025
df4c1c5
Merge branch 'master' into add_promql_capabilities
nherment Mar 6, 2025
43baaaa
test: fix typo in test_investigate/05_crashpod
nherment Mar 6, 2025
48c2e91
Merge branch 'master' into add_promql_capabilities
nherment Mar 6, 2025
2cd34ba
Merge branch 'master' into add_promql_capabilities
nherment Mar 7, 2025
619e53a
chore: address PR comments
nherment Mar 8, 2025
4a06177
Merge branch 'master' into add_promql_capabilities
nherment Mar 8, 2025
136 changes: 136 additions & 0 deletions FEATURES.md
@@ -0,0 +1,136 @@

# Features

This page documents and describes HolmesGPT's behaviour across its features.


## Root Cause Analysis

Also called Investigation, Root Cause Analysis (RCA) is HolmesGPT's ability to investigate alerts,
typically from Prometheus Alertmanager.

### Sectioned output

HolmesGPT generates structured output by default. It can also generate custom sections on request.

Here is an example of a request payload to run an investigation:

```json
{
    "source": "prometheus",
    "source_instance_id": "some-instance",
    "title": "Pod is crash looping.",
    "description": "Pod default/oomkill-deployment-696dbdbf67-d47z6 (main2) is in waiting state (reason: 'CrashLoopBackOff').",
    "subject": {
        "name": "oomkill-deployment-696dbdbf67-d47z6",
        "subject_type": "deployment",
        "namespace": "default",
        "node": "some-node",
        "container": "main2",
        "labels": {
            "x": "y",
            "p": "q"
        },
        "annotations": {}
    },
    "context": {
        "robusta_issue_id": "5b3e2fb1-cb83-45ea-82ec-318c94718e44"
    },
    "include_tool_calls": true,
    "include_tool_call_results": true,
    "sections": {
        "Alert Explanation": "1-2 sentences explaining the alert itself - note don't say \"The alert indicates a warning event related to a Kubernetes pod doing blah\" rather just say \"The pod XYZ did blah\" because that is what the user actually cares about",
        "Conclusions and Possible Root causes": "What conclusions can you reach based on the data you found? what are possible root causes (if you have enough conviction to say) or what uncertainty remains. Don't say root cause but 'possible root causes'. Be clear to distinguish between what you know for certain and what is a possible explanation",
        "Related logs": "Truncate and share the most relevant logs, especially if these explain the root cause. For example: \nLogs from pod robusta-holmes:\n```\n<logs>```\n. Always embed the surrounding +/- 5 log lines to any relevant logs."
    }
}
```

Notice that the `sections` field contains three different sections. The text value for each section is a prompt telling the LLM what that section should contain.
You can then expect the following in return:

```
{
    "analysis": <monolithic text response. Contains all the sections aggregated together>,
    "sections": {
        "Alert Explanation": <A markdown text with the explanation of the alert>,
        "Conclusions and Possible Root causes": <Conclusions reached by the LLM>,
        "Related logs": <Any related logs the LLM could find through tools>
    },
    "tool_calls": <tool calls>,
    "instructions": <Specific instructions used for this investigation>
}
```

In some cases, the LLM may decide to set a section to `null` or even add or ignore some sections.
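
Such an investigation request can be assembled programmatically. Here is a minimal sketch; the endpoint URL and port are assumptions for illustration only, not taken from this PR:

```python
import json
import urllib.request

def build_investigation_payload(title, description, subject, sections):
    """Assemble a request payload matching the structure shown above."""
    return {
        "source": "prometheus",
        "title": title,
        "description": description,
        "subject": subject,
        "include_tool_calls": True,
        "include_tool_call_results": True,
        "sections": sections,
    }

payload = build_investigation_payload(
    title="Pod is crash looping.",
    description="Pod default/oomkill-deployment-696dbdbf67-d47z6 is in CrashLoopBackOff.",
    subject={"name": "oomkill-deployment-696dbdbf67-d47z6", "namespace": "default"},
    sections={
        "Alert Explanation": "1-2 sentences explaining the alert itself.",
    },
)

# Hypothetical endpoint, for illustration only.
request = urllib.request.Request(
    "http://localhost:5050/api/investigate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
```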


## PromQL

If the `prometheus/metrics` toolset is enabled, HolmesGPT can embed graphs in conversations (ask Holmes).

For example, here is a scenario in which the LLM answers with a graph:


User question:

```
Show me the http request latency over time for the service customer-orders-service?
```


HolmesGPT text response:
```
Here's the average HTTP request latency over time for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "9kLK"} >>
```

In addition to this text response, the returned JSON contains one or more tool calls, including the Prometheus query:

```json
"tool_calls": [
    {
        "tool_call_id": "call_lKI7CQW6Y2n1ZQ5dlxX79TcM",
        "tool_name": "execute_prometheus_range_query",
        "description": "Prometheus query_range. query=rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m]), start=1739705559, end=1739791959, step=300, description=HTTP request latency for customer-orders-service",
        "result": "{\n \"status\": \"success\",\n \"random_key\": \"9kLK\",\n \"tool_name\": \"execute_prometheus_range_query\",\n \"description\": \"Average HTTP request latency for customer-orders-service\",\n \"query\": \"rate(http_request_duration_seconds_sum{service=\\\"customer-orders-service\\\"}[5m]) / rate(http_request_duration_seconds_count{service=\\\"customer-orders-service\\\"}[5m])\",\n \"start\": \"1739705559\",\n \"end\": \"1739791959\",\n \"step\": 60\n}"
    }
],
```

The result of this tool call contains details about the [prometheus query](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) to build the graph returned by HolmesGPT:

```json
{
    "status": "success",
    "random_key": "9kLK",
    "tool_name": "execute_prometheus_range_query",
    "description": "Average HTTP request latency for customer-orders-service",
    "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])",
    "start": "1739705559", // Can be rfc3339 or a unix timestamp
    "end": "1739791959", // Can be rfc3339 or a unix timestamp
    "step": 60 // Query resolution step width in seconds
}
```
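
A client can re-run this query against Prometheus's range-query HTTP API to render the chart. A minimal sketch of building that request (the Prometheus base URL is an assumption):

```python
import json
import urllib.parse

def build_range_query_url(base_url, tool_result):
    """Build a Prometheus /api/v1/query_range URL from a tool-call result."""
    params = urllib.parse.urlencode({
        "query": tool_result["query"],
        "start": tool_result["start"],  # rfc3339 or unix timestamp
        "end": tool_result["end"],      # rfc3339 or unix timestamp
        "step": tool_result["step"],    # resolution step width, in seconds
    })
    return f"{base_url}/api/v1/query_range?{params}"

# Tool-call result, as in the example above.
tool_result = json.loads("""
{
    "status": "success",
    "random_key": "9kLK",
    "tool_name": "execute_prometheus_range_query",
    "query": "rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])",
    "start": "1739705559",
    "end": "1739791959",
    "step": 60
}
""")

url = build_range_query_url("http://prometheus:9090", tool_result)
```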

In addition to `execute_prometheus_range_query`, HolmesGPT can produce similar results with `execute_prometheus_instant_query`, which runs an [instant query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries):

```
Here's the average HTTP request latency for the `customer-orders-service`:

<< {"type": "promql", "tool_name": "execute_prometheus_instant_query", "random_key": "2KiL"} >>
```

```json
{
    "status": "success",
    "random_key": "2KiL",
    "tool_name": "execute_prometheus_instant_query",
    "description": "Average HTTP request latency for customer-orders-service",
    "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])"
}
```

Unlike the range query, the instant query result lacks the `start`, `end` and `step` arguments.
51 changes: 51 additions & 0 deletions holmes/core/openai_formatting.py
@@ -0,0 +1,51 @@
import re

# parses both simple types: "int", "array", "string"
# but also arrays of those simpler types: "array[int]", "array[string]", etc.
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"


def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())

    if not match:
        raise ValueError(f"Invalid type format: {type_value}")

    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    else:
        return {"type": match.group("simple_type")}


def format_tool_to_open_ai_standard(
    tool_name: str, tool_description: str, tool_parameters: dict
):
    tool_properties = {}
    for param_name, param_attributes in tool_parameters.items():
        tool_properties[param_name] = type_to_open_ai_schema(param_attributes.type)
        if param_attributes.description is not None:
            tool_properties[param_name]["description"] = param_attributes.description

    result = {
        "type": "function",
        "function": {
            "name": tool_name,
            "description": tool_description,
            "parameters": {
                "properties": tool_properties,
                "required": [
                    param_name
                    for param_name, param_attributes in tool_parameters.items()
                    if param_attributes.required
                ],
                "type": "object",
            },
        },
    }

    # Gemini doesn't accept a "parameters" object when the tool has no params
    if not tool_properties:
        result["function"].pop("parameters")

    return result
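
A quick illustration of what the type parser above produces, re-declared here so the snippet is self-contained:

```python
import re

# Same pattern as in holmes/core/openai_formatting.py above.
pattern = r"^(array\[(?P<inner_type>\w+)\])|(?P<simple_type>\w+)$"

def type_to_open_ai_schema(type_value):
    match = re.match(pattern, type_value.strip())
    if not match:
        raise ValueError(f"Invalid type format: {type_value}")
    if match.group("inner_type"):
        return {"type": "array", "items": {"type": match.group("inner_type")}}
    return {"type": match.group("simple_type")}

print(type_to_open_ai_schema("string"))          # {'type': 'string'}
print(type_to_open_ai_schema("array[integer]"))  # {'type': 'array', 'items': {'type': 'integer'}}
```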
33 changes: 22 additions & 11 deletions holmes/core/performance_timing.py
@@ -49,14 +49,25 @@ def end(self):
)


def log_function_timing(func):
    @wraps(func)
    def function_timing_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = int((end_time - start_time) * 1000)
        logging.info(f'Function "{func.__name__}()" took {total_time}ms')
        return result

    return function_timing_wrapper
def log_function_timing(label=None):
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            end_time = time.perf_counter()
            total_time = int((end_time - start_time) * 1000)

            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result

        return function_timing_wrapper

    if callable(label):
        func = label
        label = None
        return decorator(func)
    return decorator
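
The updated decorator supports both a bare and a labelled form. A self-contained sketch of how the two call sites behave (the decorated functions are hypothetical examples):

```python
import logging
import time
from functools import wraps

def log_function_timing(label=None):
    """Dual-form timing decorator: @log_function_timing or @log_function_timing("label")."""
    def decorator(func):
        @wraps(func)
        def function_timing_wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            total_time = int((time.perf_counter() - start_time) * 1000)
            function_identifier = (
                f'"{label}: {func.__name__}()"' if label else f'"{func.__name__}()"'
            )
            logging.info(f"Function {function_identifier} took {total_time}ms")
            return result
        return function_timing_wrapper

    if callable(label):
        # Bare usage: the decorated function arrives as `label`.
        func, label = label, None
        return decorator(func)
    return decorator

@log_function_timing
def investigate():
    return "done"

@log_function_timing("prometheus")
def run_query():
    return 42
```

Note that `@wraps` preserves the wrapped function's name, so log lines and introspection still show `investigate` rather than `function_timing_wrapper`.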
39 changes: 8 additions & 31 deletions holmes/core/tools.py
@@ -19,6 +19,8 @@
    model_validator,
)

from holmes.core.openai_formatting import format_tool_to_open_ai_standard


ToolsetPattern = Union[Literal["*"], List[str]]

@@ -81,36 +83,11 @@ class Tool(ABC, BaseModel):
    additional_instructions: Optional[str] = None

    def get_openai_format(self):
        tool_properties = {}
        for param_name, param_attributes in self.parameters.items():
            tool_properties[param_name] = {"type": param_attributes.type}
            if param_attributes.description is not None:
                tool_properties[param_name]["description"] = (
                    param_attributes.description
                )

        result = {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": {
                    "properties": tool_properties,
                    "required": [
                        param_name
                        for param_name, param_attributes in self.parameters.items()
                        if param_attributes.required
                    ],
                    "type": "object",
                },
            },
        }

        # gemini doesnt have parameters object if it is without params
        if tool_properties is None:
            result["function"].pop("parameters")

        return result
        return format_tool_to_open_ai_standard(
            tool_name=self.name,
            tool_description=self.description,
            tool_parameters=self.parameters,
        )

    def invoke(self, params: Dict) -> str:
        logging.info(
@@ -423,7 +400,7 @@ def invoke(self, tool_name: str, params: Dict) -> str:
        tool = self.get_tool_by_name(tool_name)
        return tool.invoke(params) if tool else ""

    def get_tool_by_name(self, name: str) -> Optional[YAMLTool]:
    def get_tool_by_name(self, name: str) -> Optional[Tool]:
        if name in self.tools_by_name:
            return self.tools_by_name[name]
        logging.warning(f"could not find tool {name}. skipping")
19 changes: 19 additions & 0 deletions holmes/plugins/prompts/generic_ask_conversation.jinja2
@@ -8,6 +8,25 @@ Use conversation history to maintain continuity when appropriate, ensuring effic

{% include '_general_instructions.jinja2' %}

Prometheus/PromQL queries
* Use prometheus to execute promql queries with the tools `execute_prometheus_instant_query` and `execute_prometheus_range_query`
* ALWAYS embed the execution results into your answer
* You only need to embed the partial result in your response. Include the "tool_name" and "random_key". For example: << {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "92jf2hf"} >>
* Use these tools to generate charts that users can see. Here are standard metrics but you can use different ones:
** For memory consumption: `container_memory_working_set_bytes`
** For CPU usage: `container_cpu_usage_seconds_total`
** For CPU throttling: `container_cpu_cfs_throttled_periods_total`
** For latencies, prefer using `<metric>_sum` / `<metric>_count` over a sliding window
** Avoid using `<metric>_bucket` unless you know the bucket's boundaries are configured correctly
** Prefer individual averages like `rate(<metric>_sum) / rate(<metric>_count)`
** Avoid global averages like `sum(rate(<metric>_sum)) / sum(rate(<metric>_count))` because it hides data and is not generally informative
* Post processing will parse your response, re-run the query from the tool output and create a chart visible to the user
* Only generate and execute a prometheus query after checking what metrics are available with the `list_available_metrics` tool
* Check that any node, service, pod, container, app, namespace, etc. mentioned in the query exist in the kubernetes cluster before making a query. Use any appropriate kubectl tool(s) for this
* The tool call will return no data to you. That is expected. You MUST however ensure that the query is successful.
* You can get the current time before executing a prometheus range query
* ALWAYS embed the execution results into your answer
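
Post-processing of the answer text can recover these `<< {...} >>` embeds. Here is a minimal sketch of such a parser; this is an assumption for illustration, and the PR's actual parser may differ:

```python
import json
import re

# Matches << { ... } >> embeds emitted by the LLM in its text answer.
EMBED_PATTERN = re.compile(r"<<\s*(\{.*?\})\s*>>")

def extract_promql_embeds(text):
    """Return the parsed JSON payload of every promql embed in the answer."""
    return [json.loads(match) for match in EMBED_PATTERN.findall(text)]

answer = (
    "Here's the average HTTP request latency:\n\n"
    '<< {"type": "promql", "tool_name": "execute_prometheus_range_query", '
    '"random_key": "9kLK"} >>'
)
embeds = extract_promql_embeds(answer)
```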

Style guide:
* Reply with terse output.
* Be painfully concise.
6 changes: 4 additions & 2 deletions holmes/plugins/toolsets/__init__.py
@@ -10,12 +10,13 @@
from holmes.plugins.toolsets.grafana.toolset_grafana_tempo import GrafanaTempoToolset
from holmes.plugins.toolsets.internet.internet import InternetToolset
from holmes.plugins.toolsets.internet.notion import NotionToolset
from holmes.plugins.toolsets.prometheus import PrometheusToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
from holmes.plugins.toolsets.kafka import KafkaToolset

from holmes.core.tools import Toolset, YAMLToolset
from holmes.plugins.toolsets.opensearch import OpenSearchToolset
import yaml

from holmes.plugins.toolsets.kafka import KafkaToolset

THIS_DIR = os.path.abspath(os.path.dirname(__file__))

@@ -52,6 +53,7 @@ def load_python_toolsets(dal: Optional[SupabaseDal]) -> List[Toolset]:
        GrafanaTempoToolset(),
        NotionToolset(),
        KafkaToolset(),
        PrometheusToolset(),
        DatetimeToolset(),
    ]
