|
| 1 | + |
| 2 | +# Features |
| 3 | + |
| 4 | +This page documents HolmesGPT's features and describes how each one behaves. |
| 5 | + |
| 6 | + |
| 7 | +## Root Cause Analysis |
| 8 | + |
| 9 | +Also called Investigation, Root Cause Analysis (RCA) is HolmesGPT's ability to investigate alerts, |
| 10 | +typically from Prometheus Alertmanager. |
| 11 | + |
| 12 | +### Sectioned output |
| 13 | + |
| 14 | +HolmesGPT generates structured output by default. It can also generate custom sections on request. |
| 15 | + |
| 16 | +Here is an example of a request payload to run an investigation: |
| 17 | + |
| 18 | +```json |
| 19 | +{ |
| 20 | + "source": "prometheus", |
| 21 | + "source_instance_id": "some-instance", |
| 22 | + "title": "Pod is crash looping.", |
| 23 | + "description": "Pod default/oomkill-deployment-696dbdbf67-d47z6 (main2) is in waiting state (reason: 'CrashLoopBackOff').", |
| 24 | + "subject": { |
| 25 | + "name": "oomkill-deployment-696dbdbf67-d47z6", |
| 26 | + "subject_type": "deployment", |
| 27 | + "namespace": "default", |
| 28 | + "node": "some-node", |
| 29 | + "container": "main2", |
| 30 | + "labels": { |
| 31 | + "x": "y", |
| 32 | + "p": "q" |
| 33 | + }, |
| 34 | + "annotations": {} |
| 35 | + }, |
| 36 | + "context": |
| 37 | + { |
| 38 | + "robusta_issue_id": "5b3e2fb1-cb83-45ea-82ec-318c94718e44" |
| 39 | + }, |
| 40 | + "include_tool_calls": true, |
| 41 | + "include_tool_call_results": true, |
| 42 | + "sections": { |
| 43 | + "Alert Explanation": "1-2 sentences explaining the alert itself - note don't say \"The alert indicates a warning event related to a Kubernetes pod doing blah\" rather just say \"The pod XYZ did blah\" because that is what the user actually cares about", |
| 44 | + "Conclusions and Possible Root causes": "What conclusions can you reach based on the data you found? what are possible root causes (if you have enough conviction to say) or what uncertainty remains. Don't say root cause but 'possible root causes'. Be clear to distinguish between what you know for certain and what is a possible explanation", |
| 45 | + "Related logs": "Truncate and share the most relevant logs, especially if these explain the root cause. For example: \nLogs from pod robusta-holmes:\n```\n<logs>```\n. Always embed the surrounding +/- 5 log lines to any relevant logs. " |
| 46 | + } |
| 47 | +} |
| 48 | +``` |
| 49 | + |
| 50 | +Notice that the "sections" field contains three sections. The value of each section is a prompt telling the LLM what that section should contain. |
| 51 | +You can then expect the following in return: |
| 52 | + |
| 53 | +``` |
| 54 | +{ |
| 55 | + "analysis": <monolithic text response. Contains all the sections aggregated together>, |
| 56 | + "sections": { |
| 57 | + "Alert Explanation": <A markdown text with the explanation of the alert>, |
| 58 | + "Conclusions and Possible Root causes": <Conclusions reached by the LLM>, |
| 59 | + "Related logs": <Any related logs the LLM could find through tools> |
| 60 | + }, |
| 61 | + "tool_calls": <tool calls>, |
| 62 | + "instructions": <Specific instructions used for this investigation> |
| 63 | +} |
| 64 | +``` |
| 65 | + |
| 66 | +In some cases, the LLM may decide to set a section to `null` or even add or ignore some sections. |
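
Because sections can come back as `null`, be missing, or be added by the LLM, a client may want to read the `sections` field defensively. The following is a minimal sketch, assuming the response has already been parsed into a Python dict shaped like the example above. The helper `read_sections` and the sample data (including the "Next Steps" section) are hypothetical and not part of HolmesGPT.

```python
from typing import Any, Optional


def read_sections(
    response: dict[str, Any], requested: list[str]
) -> tuple[dict[str, Optional[str]], dict[str, str]]:
    """Split the returned sections into requested and unrequested ones."""
    returned = response.get("sections") or {}
    # Requested sections that are missing or null are both reported as None.
    found = {title: returned.get(title) for title in requested}
    # Sections the LLM added on its own, with null values dropped.
    extra = {
        title: text
        for title, text in returned.items()
        if title not in requested and text is not None
    }
    return found, extra


# Hypothetical response: one section is null, one is missing, one was added.
sample_response = {
    "analysis": "...",
    "sections": {
        "Alert Explanation": "The pod was OOM killed.",
        "Related logs": None,
        "Next Steps": "Raise the memory limit.",
    },
}
requested = [
    "Alert Explanation",
    "Conclusions and Possible Root causes",
    "Related logs",
]

found, extra = read_sections(sample_response, requested)
for title in requested:
    print(title, "->", found[title] or "(not provided)")
for title, text in extra.items():
    print("extra:", title, "->", text)
```

Treating missing and `null` sections the same way keeps the calling code simple. A client that needs to tell them apart could check `title in response["sections"]` instead.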
| 67 | + |
| 68 | + |
| 69 | +## PromQL |
| 70 | + |
| 71 | +If the `prometheus/metrics` toolset is enabled, HolmesGPT can embed graphs in conversations (ask holmes). |
| 72 | + |
| 73 | +For example, here is a scenario in which the LLM answers with a graph: |
| 74 | + |
| 75 | + |
| 76 | +User question: |
| 77 | + |
| 78 | +``` |
| 79 | +Show me the http request latency over time for the service customer-orders-service? |
| 80 | +``` |
| 81 | + |
| 82 | + |
| 83 | +HolmesGPT text response: |
| 84 | +``` |
| 85 | +Here's the average HTTP request latency over time for the `customer-orders-service`: |
| 86 | +
|
| 87 | +<< {"type": "promql", "tool_name": "execute_prometheus_range_query", "random_key": "9kLK"} >> |
| 88 | +``` |
| 89 | + |
| 90 | +In addition to this text response, the returned JSON will contain one or more tool calls, including the Prometheus query: |
| 91 | + |
| 92 | +```json |
| 93 | +"tool_calls": [ |
| 94 | + { |
| 95 | + "tool_call_id": "call_lKI7CQW6Y2n1ZQ5dlxX79TcM", |
| 96 | + "tool_name": "execute_prometheus_range_query", |
| 97 | + "description": "Prometheus query_range. query=rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m]), start=1739705559, end=1739791959, step=300, description=HTTP request latency for customer-orders-service", |
| 98 | + "result": "{\n \"status\": \"success\",\n \"random_key\": \"9kLK\",\n \"tool_name\": \"execute_prometheus_range_query\",\n \"description\": \"Average HTTP request latency for customer-orders-service\",\n \"query\": \"rate(http_request_duration_seconds_sum{service=\\\"customer-orders-service\\\"}[5m]) / rate(http_request_duration_seconds_count{service=\\\"customer-orders-service\\\"}[5m])\",\n \"start\": \"1739705559\",\n \"end\": \"1739791959\",\n \"step\": 60\n}" |
| 99 | + } |
| 100 | +], |
| 101 | +``` |
| 102 | + |
| 103 | +The result of this tool call contains the details of the [Prometheus query](https://prometheus.io/docs/prometheus/latest/querying/api/#range-queries) used to build the graph returned by HolmesGPT: |
| 104 | + |
| 105 | +```json |
| 106 | +{ |
| 107 | + "status": "success", |
| 108 | + "random_key": "9kLK", |
| 109 | + "tool_name": "execute_prometheus_range_query", |
| 110 | + "description": "Average HTTP request latency for customer-orders-service", |
| 111 | + "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])", |
| 112 | + "start": "1739705559", // Can be rfc3339 or a unix timestamp |
| 113 | + "end": "1739791959", // Can be rfc3339 or a unix timestamp |
| 114 | + "step": 60 // Query resolution step width in seconds |
| 115 | +} |
| 116 | +``` |
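
In the example above, the `<< ... >>` placeholder in the text response and the tool call result share the same `random_key`. A client could use that key to find the query behind each graph. The sketch below assumes that this pairing holds in general, that each placeholder is a flat JSON object on a single line, and that `result` is a JSON-encoded string as shown. The helper names and the shortened sample data are hypothetical.

```python
import json
import re
from typing import Any

# Matches placeholders such as << {"type": "promql", ...} >> in the text.
PLACEHOLDER = re.compile(r"<<\s*(\{.*?\})\s*>>")


def find_graph_placeholders(text: str) -> list[dict[str, Any]]:
    """Return the JSON object embedded in each << ... >> placeholder."""
    placeholders = []
    for match in PLACEHOLDER.finditer(text):
        try:
            placeholders.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # Not a JSON placeholder; leave the text untouched.
    return placeholders


def index_results_by_key(tool_calls: list[dict[str, Any]]) -> dict[str, Any]:
    """Decode each tool call result and index it by its random_key."""
    results = {}
    for call in tool_calls:
        try:
            result = json.loads(call["result"])
        except (KeyError, TypeError, json.JSONDecodeError):
            continue  # Skip tool calls without a JSON result.
        if isinstance(result, dict) and "random_key" in result:
            results[result["random_key"]] = result
    return results


# Shortened, hypothetical versions of the response fields shown above.
text = (
    "Here's the average HTTP request latency:\n\n"
    '<< {"type": "promql", "tool_name": "execute_prometheus_range_query", '
    '"random_key": "9kLK"} >>'
)
tool_calls = [
    {
        "tool_name": "execute_prometheus_range_query",
        "result": json.dumps(
            {"status": "success", "random_key": "9kLK", "query": "up",
             "start": "1739705559", "end": "1739791959", "step": 60}
        ),
    }
]

placeholders = find_graph_placeholders(text)
results = index_results_by_key(tool_calls)
for placeholder in placeholders:
    matched = results.get(placeholder["random_key"])
    print(placeholder["tool_name"], "->", matched["query"] if matched else None)
```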
| 117 | + |
| 118 | +In addition to `execute_prometheus_range_query`, HolmesGPT can generate similar results with `execute_prometheus_instant_query`, which runs an [instant query](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries): |
| 119 | + |
| 120 | +``` |
| 121 | +Here's the average HTTP request latency over time for the `customer-orders-service`: |
| 122 | +
|
| 123 | +<< {"type": "promql", "tool_name": "execute_prometheus_instant_query", "random_key": "2KiL"} >> |
| 124 | +``` |
| 125 | + |
| 126 | +```json |
| 127 | +{ |
| 128 | + "status": "success", |
| 129 | + "random_key": "2KiL", |
| 130 | + "tool_name": "execute_prometheus_instant_query", |
| 131 | + "description": "Average HTTP request latency for customer-orders-service", |
| 132 | + "query": "rate(http_request_duration_seconds_sum{service=\"customer-orders-service\"}[5m]) / rate(http_request_duration_seconds_count{service=\"customer-orders-service\"}[5m])" |
| 133 | +} |
| 134 | +``` |
| 135 | + |
| 136 | +Unlike the range query, the instant query result lacks the `start`, `end` and `step` arguments. |
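
A client that wants to re-run the query itself, for example to render the graph, could map each result onto the matching Prometheus HTTP API endpoint. This is a sketch under assumptions: the source does not say how clients should render graphs, and the function name is hypothetical. The endpoints `/api/v1/query_range` and `/api/v1/query` and their parameters come from the Prometheus HTTP API linked above.

```python
from typing import Any
from urllib.parse import urlencode


def prometheus_request_path(result: dict[str, Any]) -> str:
    """Build the Prometheus HTTP API path and query string for a result."""
    params = {"query": result["query"]}
    if result["tool_name"] == "execute_prometheus_range_query":
        # Range queries also need the time window and the resolution step.
        params.update(start=result["start"], end=result["end"], step=result["step"])
        endpoint = "/api/v1/query_range"
    else:
        # Instant queries only need the expression.
        endpoint = "/api/v1/query"
    return f"{endpoint}?{urlencode(params)}"


# Shortened, hypothetical results using the field names shown above.
range_result = {
    "tool_name": "execute_prometheus_range_query",
    "query": "up",
    "start": "1739705559",
    "end": "1739791959",
    "step": 60,
}
instant_result = {
    "tool_name": "execute_prometheus_instant_query",
    "query": "up",
}

range_path = prometheus_request_path(range_result)
instant_path = prometheus_request_path(instant_result)
print(range_path)
print(instant_path)
```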