failing loki remote backend prevents working backends from receiving data regularly while forwarding logs to multiple loki clients at once #6963

makeittotop · 2024-03-18T20:37:24Z

What's wrong?

In a setup that forwards logs to multiple Loki clients, if one of the clients starts failing for some reason (for example, no process is listening on the specified port), it starves the other, working Loki endpoints of data until the failing client exhausts all of its max_retries (default = 10). Once the retry loop resets, the same issue repeats.
As a result, the working clients only receive data every 6 minutes or so, depending on what max_period is set to (default = 5m). This also leads to "gaps" in the Grafana dashboard when looking at the data for those clients.

Steps to reproduce

Take a look at this nominal config -

./agent-local-config.yaml

server:
  log_level: info

logs:
  configs:
  - clients:
    - tls_config:
        insecure_skip_verify: true
      basic_auth:
        password: xxxx
        username: loki
      url: https://logs.my-loki-instance.net/loki/api/v1/push
    - tls_config:
        insecure_skip_verify: true
      url: https://localhost:13100/loki/api/v1/push
      # backoff_config:
      #   # max_retries: 10
      #   max_period: 10s
    name: default
    positions:
      filename: /data/grafana_agent/log-positions.yml
    scrape_configs:
    - job_name: nginx
      pipeline_stages:
      - regex:
          expression: (?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^]]+)\]
            "(?P<request_method>[A-Z]+) (?P<request_url>[^? ]+)[?]*(?P<request_url_params>\S*)
            (?P<request_http_version>[^"]+)" (?P<status_code>\d+) (?P<body_bytes_sent>\d+)
            "(?P<http_referer>[^"]+)" "(?P<http_user_agent>[^"]+)" "(?P<http_x_forwarded_for>[^"]+)"
      - labels:
          remote_user: null
          request_http_version: null
          request_method: null
          request_url: null
          status_code: null
      - timestamp:
          format: 02/Jan/2006:15:04:05 -0700
          source: time_local
      static_configs:
      - labels:
          __path__: /var/log/nginx.log
          instance: dist1.foobar.com
          job: nginx
        targets:
        - dist1.foobar.com

Start the agent as

# /tmp/agent: ./grafana-agent --config.file ./agent-local-config.yaml

Now, assume that the localhost:13100 instance is unavailable for some reason. In that case I expected the other endpoint (logs.my-loki-instance.net) to keep receiving data at the configured scrape interval (60s), but that doesn't happen, as explained above.

System information

Linux 6.5.0-15-generic

Software version

Grafana Agent 0.35.0 and current master

Configuration

server:
  log_level: info

logs:
  configs:
  - clients:
    - tls_config:
        insecure_skip_verify: true
      basic_auth:
        password: xxxx
        username: loki
      url: https://logs.my-loki-instance.net/loki/api/v1/push
    - tls_config:
        insecure_skip_verify: true
      url: https://localhost:13100/loki/api/v1/push
      # backoff_config:
      #   # max_retries: 10
      #   max_period: 10s
    name: default
    positions:
      filename: /data/grafana_agent/log-positions.yml
    scrape_configs:
    - job_name: nginx
      pipeline_stages:
      - regex:
          expression: (?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^]]+)\]
            "(?P<request_method>[A-Z]+) (?P<request_url>[^? ]+)[?]*(?P<request_url_params>\S*)
            (?P<request_http_version>[^"]+)" (?P<status_code>\d+) (?P<body_bytes_sent>\d+)
            "(?P<http_referer>[^"]+)" "(?P<http_user_agent>[^"]+)" "(?P<http_x_forwarded_for>[^"]+)"
      - labels:
          remote_user: null
          request_http_version: null
          request_method: null
          request_url: null
          status_code: null
      - timestamp:
          format: 02/Jan/2006:15:04:05 -0700
          source: time_local
      static_configs:
      - labels:
          __path__: /var/log/nginx.log
          instance: dist1.foobar.com
          job: nginx
        targets:
        - dist1.foobar.com



Logs

```text
Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.36522416Z caller=client.go:430 level=error component=logs logs_config=default component=client host=localhost:13100 msg="final error sending batch" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.507835563Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:23 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:23.271720016Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:25 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:25.123445134Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:28 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:28.795872338Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:35 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:35.337596441Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:51 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:51.028375765Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:08 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:08.033159675Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:40 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:40.383066904Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:31:09 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:31:09.086003766Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
```

makeittotop · 2024-03-18T20:46:07Z

From what I can tell with my limited knowledge of Go and channels, it appears that there are 2 goroutines in the grafana-agent process (in this case): one for localhost:13100 and one for logs.my-loki-instance.net. Both of them read from the same channel (api.Entry), which is populated in the promtail package by the readLines() function in grafana/clients/pkg/promtail/targets/file/tailer.go. When the localhost:13100 goroutine gets blocked by falling into retries and exponential backoff, it delays the other my-loki goroutine from receiving data too; at least, my tests confirm this.
Is this because the underlying api.Entry channel is "full" while 1 of the 2 receivers is tied up elsewhere? My tests show that as soon as the failing goroutine unblocks after exhausting its retries, both receivers receive data pretty much immediately.

github-actions · 2024-05-18T00:01:37Z

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

makeittotop added the bug Something isn't working label Mar 18, 2024

rfratto transferred this issue from grafana/agent Apr 11, 2024

github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label May 18, 2024

rfratto transferred this issue from grafana/alloy Jun 14, 2024

rfratto added the variant/static Related to Grafana Agent Static. label Jun 14, 2024

github-actions bot removed the needs-attention An issue or PR has been sitting around and needs attention. label Jun 15, 2024

github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label Jul 15, 2024
