Regression: Fatal Error Due to Concurrent Map Access in Custom Gauge Metrics #14076

Open
3 of 4 tasks
ryancurrah opened this issue Jan 13, 2025 · 0 comments
Labels
area/metrics type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

Contributor

ryancurrah commented Jan 13, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

After updating to the latest version of Argo Workflows (3.6.2), I am encountering a fatal error related to concurrent map access when using custom gauges and histograms. The issue appears to stem from the gauge metrics, leading to a panic and crashing the workflow-controller.

  • Upgraded 5 days ago to 3.6
  • 4 panics observed since the update
  • Over a thousand workflows have run since the upgrade
  • The issue is rare, which suggests it is only reproducible under certain conditions, such as when a lot of workflows are running

Version(s)

v3.6.x

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Here is an example of custom metrics we are using.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: "gomodule-pr"
spec:
  metrics:
    prometheus:
      - name: workflow_duration_gauge
        help: "Workflow duration gauge"
        gauge:
          realtime: false
          value: "{{workflow.duration}}"
        labels:
          - key: build_strategy
            value: pullrequest
          - key: git_project
            value: "{{workflow.parameters.git_project}}"
          - key: git_repository
            value: "{{workflow.parameters.git_repository}}"
      - name: workflow_duration_histogram
        help: "Workflow duration histogram"
        histogram:
          buckets:
            - 60
            - 120
            - 180
            - 300
            - 600
            - 900
            - 1800
            - 2700
            - 3600
          value: "{{workflow.duration}}"
        labels:
          - key: build_strategy
            value: pullrequest
          - key: git_project
            value: "{{workflow.parameters.git_project}}"
          - key: git_repository
            value: "{{workflow.parameters.git_repository}}"
---

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: gomodule-shared-templates
spec:
  templates:
    - name: setup
      metrics:
        prometheus:
          - name: task_duration_gauge
            help: "Task duration gauge"
            gauge:
              realtime: false
              value: "{{duration}}"
            labels:
              - key: task_name
                value: setup
              - key: git_project
                value: "{{workflow.parameters.git_project}}"
              - key: git_repository
                value: "{{workflow.parameters.git_repository}}"
          - name: task_result_counter
            help: "Count of task execution by result status"
            counter:
              value: "1"
            labels:
              - key: status
                value: "{{status}}"
              - key: task_name
                value: setup
              - key: git_project
                value: "{{workflow.parameters.git_project}}"
              - key: git_repository
                value: "{{workflow.parameters.git_repository}}"

Logs from the workflow controller

2025-01-13T08:30:55.362712527Z fatal error: concurrent map iteration and map write
2025-01-13T08:30:55.365840614Z 
2025-01-13T08:30:55.365848385Z goroutine 716395 [running]:
2025-01-13T08:30:55.365851452Z github.com/argoproj/argo-workflows/v3/workflow/metrics.(*customInstrument).customCallback(0xc0065320e0, {0x20?, 0x271da80?}, {0x2c79bd0, 0xc002732360})
2025-01-13T08:30:55.365854724Z  /go/src/github.com/argoproj/argo-workflows/workflow/metrics/metrics_custom.go:77 +0x99
2025-01-13T08:30:55.365856887Z go.opentelemetry.io/otel/sdk/metric.(*meter).RegisterCallback.func1({0x2c87308, 0x41d0cc0})
2025-01-13T08:30:55.365859316Z  /go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/meter.go:443 +0x55
2025-01-13T08:30:55.365861878Z go.opentelemetry.io/otel/sdk/metric.(*pipeline).produce(0xc0005ee090, {0x2c87308, 0x41d0cc0}, 0xc0027320a0)
2025-01-13T08:30:55.365864079Z  /go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:134 +0x302
2025-01-13T08:30:55.365869685Z go.opentelemetry.io/otel/sdk/metric.(*ManualReader).Collect(0xc0003d1ea0, {0x2c87308, 0x41d0cc0}, 0xc0027320a0)
2025-01-13T08:30:55.36587227Z   /go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/manual_reader.go:123 +0xe2
2025-01-13T08:30:55.365874354Z go.opentelemetry.io/otel/exporters/prometheus.(*collector).Collect(0xc0003e9680, 0xc002b4e3f0)
2025-01-13T08:30:55.365876493Z  /go/pkg/mod/go.opentelemetry.io/otel/exporters/[email protected]/exporter.go:158 +0x72
2025-01-13T08:30:55.365886554Z github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
2025-01-13T08:30:55.365909405Z  /go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:457 +0xe5
2025-01-13T08:30:55.365912725Z created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 678
2025-01-13T08:30:55.365922767Z  /go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:466 +0x568

Logs from your workflow's wait container

N/A
@ryancurrah ryancurrah added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Jan 13, 2025
@Joibel Joibel self-assigned this Jan 13, 2025