Description
Stemcell: bosh-openstack-kvm-ubuntu-jammy-go_agent-raw/1.803
Bosh version: v282.0.0
bosh-openstack-cpi: 55.0.1
Managing 731 deployments, 1273 agents
We ran into the following situation:
The number of bosh scan and fix tasks stays close to the number of deployments (500-700 tasks). As soon as one task finishes, a new scan and fix task is queued. According to our metrics, the VMs of that director were unresponsive, but bosh vms and bosh instances reported all VMs as healthy.
In the health_monitor logs the following line appears repeatedly:
ERROR : NATS client error: nats: slow consumer, messages dropped
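To gauge how often this error occurs, the log lines can be counted with a small script. This is only a minimal sketch: the function name `count_slow_consumer` is hypothetical, and it assumes the message text appears in the log exactly as quoted above.

```python
# The error text as quoted in this report; assumed to appear verbatim in the log.
SLOW_CONSUMER = "NATS client error: nats: slow consumer, messages dropped"


def count_slow_consumer(lines):
    """Count log lines that contain the NATS slow consumer error."""
    return sum(1 for line in lines if SLOW_CONSUMER in line)
```

Passing an open log file (or any iterable of lines) to `count_slow_consumer` returns the number of matching lines.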
Restarting the health_monitor process unsticks the situation: the number of bosh scan and fix tasks decreases.
After the restart we now see 1277 NATS connections, counted with netstat -anp | grep 4222.
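A plain grep for 4222 also matches listening sockets and any line where 4222 appears elsewhere (for example as part of another port or a PID), so the raw count can be slightly off. The following is a rough sketch for breaking the count down by TCP state. It assumes the usual Linux `netstat -anp` column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name `count_nats_connections` is hypothetical.

```python
from collections import Counter

NATS_PORT = "4222"  # NATS port used in this report


def count_nats_connections(netstat_output, port=NATS_PORT):
    """Count TCP sockets per state where the local or foreign port equals `port`.

    Assumes Linux `netstat -anp` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State [PID/Program]
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP and unix socket lines, and anything too short.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # Compare the exact port so that e.g. 42220 does not match.
        if local.rsplit(":", 1)[-1] == port or remote.rsplit(":", 1)[-1] == port:
            states[state] += 1
    return states
```

Feeding it the captured output of `netstat -anp` gives a per-state count, so established agent connections can be compared against the number of agents separately from the listening socket.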
Both before and after the large queue of scan and fix tasks built up, the health_monitor logs also show numerous lines like:
I, [2025-08-08T06:47:27.906126 #7] INFO : [ALERT] Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running
I, [2025-08-08T06:47:27.906275 #7] INFO : (Event logger) notifying director about event: Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running