
Using targetAllocator, seeing duplicate metric scraping across Collectors #3654

Open · jlcrow opened this issue Jan 23, 2025 · 1 comment
Labels: bug (Something isn't working), needs triage

jlcrow commented Jan 23, 2025

Component(s)

target allocator

What happened?

Description

I was seeing errors from my Mimir installation indicating duplicate timestamps, so I added an attribute in the OpenTelemetry Collector pipeline to identify which collector each metric was coming from. After adding this attribute, the volume of metrics being ingested tripled: I was seeing scrapes for the same pod coming from multiple collectors managed by the targetAllocator.

Steps to Reproduce

  1. Create a statefulset collector deployment that uses the target allocator with the consistent-hashing allocation strategy, the relabel-config filter strategy, and a Prometheus scrape config alongside ServiceMonitors
  2. Deploy
  3. Observe metrics
  4. Add an attribute and an env var to identify the collector pod name in the metrics (see the sketch below this list)
  5. Observe metrics again (counts become duplicated, and the same targets show up as scraped by multiple collectors)
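
For reference, the change made in step 4 amounts to adding an attributes processor that tags each metric with the scraping collector's pod name, injected via the Kubernetes Downward API (a minimal sketch of just that change; it appears in full in the example config below):

processors:
  attributes:
    actions:
      - key: collector_instance   # label used to tell collectors apart in Mimir
        value: "${MY_POD_NAME}"
        action: insert
env:
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name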

Example config

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: monitoring
spec:
  config:
    processors:
      batch:
        send_batch_size: 1000
        send_batch_max_size: 2000
        timeout: 10s           
      memory_limiter: 
        check_interval: 5s
        limit_percentage: 90
      attributes:
        actions:
          - key: collector_instance
            value: "${MY_POD_NAME}"
            action: insert
    extensions:
      health_check:
        endpoint: ${MY_POD_IP}:13133
      k8s_observer:
        auth_type: serviceAccount
        node: ${K8S_NODE_NAME}        
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:               
              # Include only pods annotated for scraping
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
              # Replace path and port annotations
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: pod
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            - action: drop
              regex: Pending|Succeeded|Failed
              source_labels:
              - __meta_kubernetes_pod_phase
            scrape_interval: 30s
            scrape_timeout: 10s      
     
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir-tools.staging.twmlabs.com/api/v1/push
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
        metrics:
          address: "${MY_POD_IP}:8888"
          level: basic    
        logs:
          level: "warn"  
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          processors:
          - memory_limiter        
          - attributes
          - batch
          exporters:
          - prometheusremotewrite
  env:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  mode: statefulset
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
  autoscaler:
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 30
    maxReplicas: 10
    minReplicas: 3
    targetCPUUtilization: 70
    targetMemoryUtilization: 70
  resources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 1
      memory: 1Gi
  targetAllocator:
    allocationStrategy: consistent-hashing
    enabled: true
    filterStrategy: relabel-config
    observability:
      metrics:
        enableMetrics: false
    prometheusCR:
      enabled: true
      podMonitorSelector: {}
      scrapeInterval: 30s
      serviceMonitorSelector: {}
    replicas: 1
    resources:
      limits:
        cpu: 250m
        memory: 500Mi
      requests:
        cpu: 250m
        memory: 500Mi

Expected Result

I would expect each scrape target to be assigned to only one collector, so that every target is scraped exactly once.

Actual Result

Scrape targets appear to be duplicated across multiple collectors: the same pod is scraped by more than one collector, producing duplicate series.

Kubernetes Version

1.30.5

Operator version

0.78.2

Collector version

0.116.0

Environment information

Environment

OS: Container-Optimized OS (GKE)
Compiler (if manually compiled): n/a (using the otel-collector-contrib distribution)

Log output

No relevant log output; only info-level logs.

Additional context

After removing the attribute:

[Screenshot: metric volume after removing the attribute]
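
For clarity, "removing the attribute" just means dropping the attributes processor from the metrics pipeline; a sketch of the reverted pipeline section, with everything else unchanged from the config above:

service:
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - memory_limiter
      - batch
      exporters:
      - prometheusremotewrite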
@yuriolisa (Contributor) commented:

@jlcrow, did you have the opportunity to check whether this issue might have the same root cause as this one?
