call() retry loop cannot recover from transport errors; bitreq pool poisoning causes permanent failure after rpcservertimeout #101

Description

@paratoxick

Describe the bug

After switching from reqwest to bitreq in #82 (0.10.0), this crate's retry
loop in src/client/mod.rs classifies IoError/AddressNotFound/
RustlsCreateConnection as recoverable and retries up to max_retries times,
but the retries are functionally dead: each retry calls
self.http_client.send_async(request) on the same BitreqClient, which
caches a dead Arc<AsyncConnection> and returns it again. All retries fail on
the same dead socket, MaxRetriesExceeded bubbles up to the caller, and every
subsequent RPC call on that Client also fails indefinitely. Only a process
restart recovers.

The root cause is in bitreq: its connection pool never evicts on error. See
rust-bitcoin/corepc#562. But users depend on this crate's retry loop to ride
out transient transport failures, so it's worth addressing here too.

Steps to reproduce

Against a stock bitcoind with the default rpcservertimeout=30:

  1. Client starts, opens a connection, issues an RPC, pool caches the socket.
  2. No RPC traffic for 30+ seconds.
  3. bitcoind closes the idle socket server-side.
  4. Next RPC call: transport error (dead socket). Retries hit the same cached
    dead socket. MaxRetriesExceeded returned.
  5. Every subsequent call: same failure, forever.
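
The sequence above can be modelled without any networking. The following is a minimal in-memory sketch, not bitreq's actual implementation: StickyPool, Conn, call and server_idle_timeout are hypothetical names, and the only behaviour assumed is what this issue describes, namely a pool that keeps handing out its cached connection after that connection has died.

```rust
/// Stand-in for a pooled socket. `alive` becomes false once the
/// server has closed the connection.
struct Conn {
    alive: bool,
}

/// Hypothetical pool that caches one connection and never evicts it on error.
struct StickyPool {
    cached: Option<Conn>,
    /// Number of connections opened so far.
    opened: u32,
}

impl StickyPool {
    fn new() -> Self {
        StickyPool { cached: None, opened: 0 }
    }

    /// Sends one request over the cached connection, opening it on first use.
    fn send(&mut self) -> Result<(), &'static str> {
        if self.cached.is_none() {
            self.opened += 1;
            self.cached = Some(Conn { alive: true });
        }
        let conn = self.cached.as_ref().unwrap();
        if conn.alive {
            Ok(())
        } else {
            // The dead connection stays in `cached`; nothing evicts it.
            Err("io error: dead socket")
        }
    }
}

/// Retry loop with the same shape as the one quoted in this issue
/// (the sleep between attempts is omitted).
fn call(pool: &mut StickyPool, max_retries: u32) -> Result<(), &'static str> {
    let mut retries = 0;
    loop {
        if pool.send().is_ok() {
            return Ok(());
        }
        retries += 1;
        if retries >= max_retries {
            return Err("MaxRetriesExceeded");
        }
    }
}

/// Simulates the server closing the idle socket (steps 2 and 3).
fn server_idle_timeout(pool: &mut StickyPool) {
    if let Some(conn) = pool.cached.as_mut() {
        conn.alive = false;
    }
}
```

In this model, a first call succeeds, server_idle_timeout kills the cached connection, and every later call returns MaxRetriesExceeded while opened stays at 1: no retry ever gets a fresh socket.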

In our logs this looks like:

WARN Error calling bitcoin client err=IoError(…)
WARN connection error, retrying... err=IoError(…)
WARN Error calling bitcoin client err=IoError(…)
WARN connection error, retrying... err=IoError(…)
...

bitcoin-cli from the same host continues to work throughout, because each
bitcoin-cli invocation is a fresh process with a fresh connection. This
isolates the bug to the pooled HTTP client.

Offending code

src/client/mod.rs:197-244:

let response = self.http_client.send_async(request).await;

match response {
    Ok(resp) => {
        // …parse, handle status, return…
    }
    Err(err) => {
        warn!(err = %err, "Error calling bitcoin client");

        // Classify bitreq errors for retry logic
        let should_retry = Self::is_error_recoverable(&err);
        if !should_retry {
            return Err(err.into());
        }
    }
}
retries += 1;
if retries >= self.max_retries {
    return Err(ClientError::MaxRetriesExceeded(self.max_retries));
}
sleep(Duration::from_millis(self.retry_interval)).await;

There is nothing between the error and the next iteration that would force
self.http_client to discard the pooled connection. The retry reuses the same
BitreqClient, which returns the same dead Arc<AsyncConnection> from its
cache.
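
One possible shape for a crate-side workaround is to rebuild the transport before retrying a recoverable error, so the next attempt cannot pick up the cached socket. The sketch below is synchronous and uses only hypothetical names (Transport, TransportError, CallError, RetryingClient, FakeTransport); it does not use the bitreq API, and it assumes that constructing a fresh client is cheap enough to do on the error path.

```rust
/// Hypothetical transport abstraction standing in for the HTTP client.
trait Transport {
    fn send(&mut self, body: &str) -> Result<String, TransportError>;
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TransportError {
    /// Recoverable: the connection is unusable.
    Io,
    /// Not recoverable by reconnecting.
    Other,
}

#[derive(Debug, PartialEq)]
enum CallError {
    Transport(TransportError),
    MaxRetriesExceeded(u32),
}

/// Client that owns a transport plus a factory for building a new one.
struct RetryingClient<T, F> {
    transport: T,
    make_transport: F,
    max_retries: u32,
}

impl<T: Transport, F: FnMut() -> T> RetryingClient<T, F> {
    fn new(mut make_transport: F, max_retries: u32) -> Self {
        let transport = make_transport();
        RetryingClient { transport, make_transport, max_retries }
    }

    fn call(&mut self, body: &str) -> Result<String, CallError> {
        let mut retries = 0;
        loop {
            match self.transport.send(body) {
                Ok(resp) => return Ok(resp),
                Err(TransportError::Io) => {
                    // Drop the old transport, and with it any pooled socket,
                    // so the next attempt starts from a fresh connection.
                    self.transport = (self.make_transport)();
                }
                Err(err) => return Err(CallError::Transport(err)),
            }
            retries += 1;
            if retries >= self.max_retries {
                return Err(CallError::MaxRetriesExceeded(self.max_retries));
            }
            // A real loop would sleep for the retry interval here.
        }
    }
}

/// Test double: fails every send with `fail_with` if it is set.
struct FakeTransport {
    fail_with: Option<TransportError>,
}

impl Transport for FakeTransport {
    fn send(&mut self, body: &str) -> Result<String, TransportError> {
        match self.fail_with {
            Some(err) => Err(err),
            None => Ok(format!("ok: {}", body)),
        }
    }
}
```

With a factory that produces healthy transports, a call made after the current transport has died succeeds on the first retry, because the dead transport is replaced before the next attempt. If the factory itself keeps producing dead transports, the loop still terminates with MaxRetriesExceeded.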

Platform(s)

Linux (x86)

Code of Conduct

  • I agree to follow the Code of Conduct
