Health monitor keeps triggering scan & fix tasks due to slow NATS client

Stemcell: bosh-openstack-kvm-ubuntu-jammy-go_agent-raw/1.803
Bosh version: v282.0.0
bosh-openstack-cpi: 55.0.1
Managing 731 deployments, 1273 agents

We ran into the following situation: 

The number of bosh scan and fix tasks stays at roughly the number of deployments (500-700 tasks). As soon as a task finishes, a new scan and fix task is queued immediately. According to the metrics, the VMs of that director were unresponsive, but when checking with `bosh vms` or `bosh instances`, all the VMs were reported as healthy.

In the health_monitor logs the following line appears repeatedly:
`ERROR : NATS client error: nats: slow consumer, messages dropped`

Restarting the health_monitor process resolves the stuck state, and the number of bosh scan & fix tasks decreases.
After the restart we now see 1277 NATS connections, checked with `netstat -anp | grep 4222`.
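
For anyone trying to reproduce the connection count: a plain `grep 4222` also matches the LISTEN socket and any row that merely contains the string `4222` (for example port 42221 or a PID), so the number can be slightly off. Below is a minimal sketch of a stricter count. It assumes the usual Linux `netstat -an` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and that NATS listens on port 4222; the function name `count_nats_connections` is made up for illustration and is not part of bosh.

```python
def count_nats_connections(netstat_output: str, port: int = 4222) -> int:
    """Count ESTABLISHED TCP rows whose local address ends in the given port.

    Assumes Linux net-tools column order:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State [PID/Program]
    """
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP/unix sockets, and rows that are too short.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        # Only server-side rows: the local end is the NATS port.
        if state == "ESTABLISHED" and local_addr.endswith(suffix):
            count += 1
    return count
```

The text passed in could be the captured output of `netstat -an` on the director VM. Matching only on the local address counts each connection once from the NATS server side, even when a client (such as the health monitor) runs on the same VM.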

Before and after the huge queue of scan and fix tasks, the health_monitor logs also show numerous lines like:

`I, [2025-08-08T06:47:27.906126 #7]  INFO : [ALERT] Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running`
`I, [2025-08-08T06:47:27.906275 #7]  INFO : (Event logger) notifying director about event: Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running`

