
pool_sv2: Stale client entries in /api/v1/clients after TCP disconnection #319

@plebhash

Description

The /api/v1/clients endpoint reports clients that no longer have an active TCP connection to the pool. This produces a mismatch between what the monitoring API reports and what is actually connected.

Reproduction Evidence

Running the following commands on the pool server shows a clear discrepancy:

$ ss -tn | grep 3333
ESTAB 0 0 <POOL_IP>:3333  <CLIENT_IP_1>:60497
ESTAB 0 0 <POOL_IP>:3333  <CLIENT_IP_1>:54639
ESTAB 0 0 <POOL_IP>:3333  <CLIENT_IP_2>:56512
ESTAB 0 0 <POOL_IP>:3333  <CLIENT_IP_1>:58306

Four active TCP connections, yet the monitoring API reports five clients, two of which have zero channels and a `total_hashrate` of `-0`:

$ curl -s http://0.0.0.0:9090/api/v1/clients | jq
{
  "offset": 0, "limit": 25, "total": 5,
  "items": [
    { "client_id": 174, "extended_channels_count": 1, "standard_channels_count": 0, "total_hashrate": 626961400000 },
    { "client_id": 2,   "extended_channels_count": 0, "standard_channels_count": 0, "total_hashrate": -0 },
    { "client_id": 175, "extended_channels_count": 0, "standard_channels_count": 1, "total_hashrate": 944576860000 },
    { "client_id": 146, "extended_channels_count": 1, "standard_channels_count": 0, "total_hashrate": 5661616000000 },
    { "client_id": 7,   "extended_channels_count": 0, "standard_channels_count": 0, "total_hashrate": -0 }
  ]
}

Client IDs 2 and 7 have no active TCP connections, no channels, and zero hashrate — yet they persist in the API response.

Hypothesis (needs deeper analysis)

The pool has a remove_downstream() function and a DownstreamShutdown state handler that should clean up disconnected clients:

```rust
pub fn remove_downstream(
    &self,
    downstream_id: DownstreamId,
) -> PoolResult<(), error::ChannelManager> {
    self.channel_manager_data.super_safe_lock(|cm_data| {
        cm_data.downstream.remove(&downstream_id);
        cm_data
            .vardiff
            .retain(|key, _| key.downstream_id != downstream_id);
    });
    Ok(())
}
```
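The cleanup logic itself looks correct. A minimal sketch (using hypothetical, simplified stand-in types, not the real pool_sv2 structs) shows that once `remove_downstream` is actually invoked, both maps are purged; the open question is whether it is invoked at all on abrupt disconnects:

```rust
use std::collections::HashMap;

// Simplified stand-in types to illustrate the cleanup contract:
// removing a downstream must drop both its `downstream` entry and
// every `vardiff` key that references it.
type DownstreamId = u32;

#[derive(Hash, PartialEq, Eq)]
struct VardiffKey {
    downstream_id: DownstreamId,
    channel_id: u32,
}

struct ChannelManagerData {
    downstream: HashMap<DownstreamId, String>, // payload simplified to a label
    vardiff: HashMap<VardiffKey, u64>,
}

fn remove_downstream(cm_data: &mut ChannelManagerData, downstream_id: DownstreamId) {
    cm_data.downstream.remove(&downstream_id);
    cm_data
        .vardiff
        .retain(|key, _| key.downstream_id != downstream_id);
}

fn main() {
    let mut cm = ChannelManagerData {
        downstream: HashMap::new(),
        vardiff: HashMap::new(),
    };
    cm.downstream.insert(2, "client 2".into());
    cm.downstream.insert(174, "client 174".into());
    cm.vardiff.insert(VardiffKey { downstream_id: 2, channel_id: 1 }, 1_000);

    remove_downstream(&mut cm, 2);

    assert!(!cm.downstream.contains_key(&2)); // entry gone after cleanup
    assert!(cm.vardiff.is_empty()); // vardiff keys for that downstream gone
    assert!(cm.downstream.contains_key(&174)); // unrelated client untouched
    println!("ok");
}
```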

```rust
message = status_receiver.recv() => {
    if let Ok(status) = message {
        match status.state {
            State::DownstreamShutdown { downstream_id, .. } => {
                warn!("Downstream {downstream_id:?} disconnected — cleaning up channel manager.");
                // Remove downstream from channel manager to prevent memory leak
                if let Err(e) = channel_manager_for_cleanup.remove_downstream(downstream_id) {
                    error!("Failed to remove downstream {downstream_id:?}: {e:?}");
                    cancellation_token.cancel();
                    break;
                }
            }
```

The monitoring API's get_sv2_clients() reads directly from the downstream HashMap:

```rust
fn get_sv2_clients(&self) -> Vec<Sv2ClientInfo> {
    // Clone Downstream references and release lock immediately to avoid contention
    // with template distribution and message handling
    let downstream_refs: Vec<Downstream> = self
        .channel_manager_data
        .safe_lock(|data| data.downstream.values().cloned().collect())
        .unwrap_or_default();
    downstream_refs
        .iter()
        .filter_map(downstream_to_sv2_client_info)
        .collect()
}
```
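As an interim mitigation, independent of the root-cause fix, the monitoring layer could defensively drop entries with zero channels before serializing the response. A minimal sketch with simplified, hypothetical field names; note this only hides stale entries from `/api/v1/clients` and does not fix the underlying HashMap leak:

```rust
// Hypothetical, simplified version of the API's client-info struct.
#[derive(Debug, Clone, PartialEq)]
struct Sv2ClientInfo {
    client_id: u32,
    extended_channels_count: u32,
    standard_channels_count: u32,
    total_hashrate: f64,
}

// Drop clients with no open channels of either kind. This masks the
// symptom (zombie API entries) but the stale HashMap entry remains.
fn filter_stale(clients: Vec<Sv2ClientInfo>) -> Vec<Sv2ClientInfo> {
    clients
        .into_iter()
        .filter(|c| c.extended_channels_count + c.standard_channels_count > 0)
        .collect()
}

fn main() {
    let clients = vec![
        Sv2ClientInfo { client_id: 174, extended_channels_count: 1, standard_channels_count: 0, total_hashrate: 6.269614e11 },
        Sv2ClientInfo { client_id: 2, extended_channels_count: 0, standard_channels_count: 0, total_hashrate: -0.0 },
    ];
    let live = filter_stale(clients);
    assert_eq!(live.len(), 1); // zombie client_id 2 filtered out
    assert_eq!(live[0].client_id, 174);
    println!("ok");
}
```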

One possible hypothesis: `State::DownstreamShutdown` is only emitted on graceful disconnections, so abrupt TCP disconnections (RST, network timeout, etc.) never trigger the cleanup path, leaving stale entries in the downstream HashMap. This needs deeper analysis to confirm; there may be other explanations.
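If that hypothesis holds, one fix direction would be to classify every socket read outcome so that graceful EOF and abrupt errors funnel into the same shutdown path that emits `DownstreamShutdown`. A hedged sketch; the names here are hypothetical, not the pool_sv2 API:

```rust
use std::io;

#[derive(Debug, PartialEq)]
enum ReadOutcome {
    Data(usize),
    Disconnect, // must trigger the DownstreamShutdown cleanup path
}

// Both Ok(0) (peer sent FIN) and any read error (RST -> ConnectionReset,
// timeout, broken pipe, ...) are treated as a disconnect, so abrupt
// teardowns cannot bypass cleanup.
fn classify_read(result: io::Result<usize>) -> ReadOutcome {
    match result {
        Ok(0) => ReadOutcome::Disconnect, // graceful close
        Ok(n) => ReadOutcome::Data(n),
        Err(_) => ReadOutcome::Disconnect, // abrupt close
    }
}

fn main() {
    assert_eq!(classify_read(Ok(0)), ReadOutcome::Disconnect);
    assert_eq!(classify_read(Ok(42)), ReadOutcome::Data(42));
    let rst = io::Error::from(io::ErrorKind::ConnectionReset);
    assert_eq!(classify_read(Err(rst)), ReadOutcome::Disconnect);
    println!("ok");
}
```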

Impact

• Monitoring API reports inflated client counts
• Zombie entries with total_hashrate: -0 suggest uninitialized/stale state
• Makes it harder to diagnose real connectivity issues

Environment

• pool_sv2 running on Linux
• Observed with multiple miner clients connecting simultaneously

Labels: bug, pool