Add job labels to monitoring alerts (#1065)
[comment]: # (Note that your PR title should follow the conventional
commit format: https://conventionalcommits.org/en/v1.0.0/#summary)
# PR Description

[comment]: # (The below checklist is for PRs adding new features. If a
box is not checked, add a reason why it's not needed.)
# New Feature Checklist

- [ ] List telemetry added about the feature.
- [ ] Link to the one-pager about the feature.
- [ ] List any tasks necessary for release (3P docs, AKS RP chart
changes, etc.) after merging the PR.
- [ ] Attach results of scale and perf testing.

[comment]: # (The below checklist is for code changes. Not all boxes
necessarily need to be checked. Build, doc, and template changes do not
need to fill out the checklist.)
# Tests Checklist

- [ ] Have end-to-end Ginkgo tests been run on your cluster and passed?
To bootstrap your cluster to run the tests, follow [these
instructions](/otelcollector/test/README.md#bootstrap-a-dev-cluster-to-run-ginkgo-tests).
  - Labels used when running the tests on your cluster:
    - [ ] `operator`
    - [ ] `windows`
    - [ ] `arm64`
    - [ ] `arc-extension`
    - [ ] `fips`
- [ ] Have new tests been added? For features, have tests been added for
this feature? For fixes, is there a test that could have caught this
issue and could validate that the fix works?
  - [ ] Is a new scrape job needed?
    - [ ] The scrape job was added to the folder [test-cluster-yamls](/otelcollector/test/test-cluster-yamls/) in the correct configmap or as a CR.
  - [ ] Was a new test label added?
    - [ ] A string constant for the label was added to [constants.go](/otelcollector/test/utils/constants.go).
    - [ ] The label and description were added to the [test README](/otelcollector/test/README.md).
    - [ ] The label was added to this [PR checklist](/.github/pull_request_template).
    - [ ] The label was added as needed to [testkube-test-crs.yaml](/otelcollector/test/testkube/testkube-test-crs.yaml).
  - [ ] Are additional API server permissions needed for the new tests?
    - [ ] These permissions have been added to [api-server-permissions.yaml](/otelcollector/test/testkube/api-server-permissions.yaml).
  - [ ] Was a new test suite (a new folder under `/tests`) added?
    - [ ] The new test suite is included in [testkube-test-crs.yaml](/otelcollector/test/testkube/testkube-test-crs.yaml).
vishiy authored Feb 20, 2025
1 parent 47fac27 commit 924ea21
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions internal/alerts/example-alert-template.json
@@ -20,7 +20,7 @@
"rules": [
{
"alert": "Amd64 metric missing in cluster ci-dev-aks-mac-eus",
-"expression": "absent(node_uname_info{machine=\"x86_64\"}) == 1 or node_uname_info{machine=\"x86_64\"} == 0",
+"expression": "absent(node_uname_info{job=\"node\",machine=\"x86_64\"}) == 1 or node_uname_info{job=\"node\",machine=\"x86_64\"} == 0",
"for": "PT30M",
"annotations": {
"description": "Amd64 metric missing in cluster ci-dev-aks-mac-eus"
@@ -200,7 +200,7 @@
},
{
"alert": "CPU usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
-"expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) *100 > 75",
+"expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{job=\"kube-state-metrics\",node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) *100 > 75",
"for": "PT3M",
"annotations": {
"description": "CPU usage greater than 75% for prometheus-collector on cluster ci-dev-aks-mac-eus"
@@ -218,7 +218,7 @@
},
{
"alert": "Memory usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
-"expression": "(sum(container_memory_working_set_bytes{namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_limits{namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) * 100> 75",
+"expression": "(sum(container_memory_working_set_bytes{job=\"cadvisor\",namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_limits{job=\"kube-state-metrics\",namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) * 100> 75",
"for": "PT3M",
"annotations": {
"description": "Memory usage greater than 75% for prometheus-collector containers on cluster ci-dev-aks-mac-eus"
@@ -254,7 +254,7 @@
},
{
"alert": "New agent version found for prometheus collector",
-"expression": "count(count (kube_pod_container_info{image=~\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector.*\"}) by (image)) > 4",
+"expression": "count(count (kube_pod_container_info{job=\"kube-state-metrics\",image=~\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector.*\"}) by (image)) > 4",
"for": "PT60S",
"annotations": {
"description": "New agent version found for prometheus collector. This alert is only used in near ring regions for prod monitoring clusters"
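The change in this commit is mechanical: each metric selector in an alert expression gains a `job="..."` matcher. A rough way to spot selectors that still lack one is sketched below. This is not part of the commit or the repository; the helper name `selectors_missing_job` and both regular expressions are illustrative assumptions, and the pattern is a simplification that only handles selectors whose label values contain no `}` character.

```python
import re

# Hypothetical helper, not part of the repository.
# Matches `metric_name{...}` selectors; assumes no "}" inside label values.
SELECTOR = re.compile(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}')
# Matches a positive `job=` or `job=~` matcher at the start of the label list
# or after a comma; `job!=` and `job!~` are deliberately not counted.
JOB_MATCHER = re.compile(r'(?:^|,)\s*job\s*=~?\s*"')


def selectors_missing_job(expression):
    """Return the metric names whose selector has no job matcher."""
    missing = []
    for metric, labels in SELECTOR.findall(expression):
        if not JOB_MATCHER.search(labels):
            missing.append(metric)
    return missing


# The first expression from the diff, before and after the change.
old = 'absent(node_uname_info{machine="x86_64"}) == 1 or node_uname_info{machine="x86_64"} == 0'
new = 'absent(node_uname_info{job="node",machine="x86_64"}) == 1 or node_uname_info{job="node",machine="x86_64"} == 0'

print(selectors_missing_job(old))  # ['node_uname_info', 'node_uname_info']
print(selectors_missing_job(new))  # []
```

Running such a check over every `expression` value in the template could catch selectors that were overlooked, though a regex is no substitute for a real PromQL parser.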
