Skip to content

fix(cluster-hashring): heartbeat/full-sync never reconnect after Redis endpoint change#237

Merged
dcadenas merged 3 commits into
mainfrom
fix/906-heartbeat-full-sync-never-reconnect-after-redis-endp
May 19, 2026
Merged

fix(cluster-hashring): heartbeat/full-sync never reconnect after Redis endpoint change#237
dcadenas merged 3 commits into
mainfrom
fix/906-heartbeat-full-sync-never-reconnect-after-redis-endp

Conversation

@dcadenas
Copy link
Copy Markdown
Contributor

@dcadenas dcadenas commented May 19, 2026

Summary

Redis command connections could get stuck after Redis closed an old socket.
This change makes cluster membership and API Redis helper commands reconnect, so later heartbeats, full syncs, polling writes, and admin Redis calls can recover without restarting the process.

Avoid naming corporate partners, customers, or other sensitive external brands in this PR title or body. Use generic descriptors unless a maintainer explicitly approves the public reference.

Motivation

The root cause was that command paths reused long-lived multiplexed Redis connections.
When Redis closed one of those sockets, later commands kept using the broken connection.

  • Switched registry and prefixed API Redis commands to redis::aio::ConnectionManager.
  • Removed the unused multiplexed factory path so new command callers use the reconnecting path.
  • Left Pub/Sub on its existing reconnect and resubscribe loop.
  • Kept IAM token refresh by recreating the connection manager when needed.

Related Issue

Testing

I tested the broken-socket case directly and ran the Rust workspace checks.
The regression tests kill the active Redis client connection, confirm the first command can see the closed socket, then confirm the next command reconnects and succeeds.

  • cargo test --workspace --verbose
  • cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated
  • cargo fmt --all -- --check
  • Manual verification completed: TEST_REDIS_URL=redis://localhost:16379 cargo test -p cluster-hashring registry::tests::test_registry_recovers_after_connection_killed -- --ignored --exact and TEST_REDIS_URL=redis://localhost:16379 cargo test -p keycast_api redis::tests::test_prefixed_redis_recovers_after_connection_killed -- --ignored --exact

Visuals

  • UI/web change with screenshots/video attached
  • No visual change
  • Visuals and text avoid sensitive external brand or partner names unless explicitly approved

Risks

The command that first discovers a killed socket can still fail once.
That matches the existing retry/backoff behavior, and the following command should use the reconnected manager.

  • git fetch origin main could not run here because SSH auth was unavailable.
  • The branch rebased cleanly onto the local origin/main ref.

@dcadenas dcadenas marked this pull request as ready for review May 19, 2026 14:12
@NotThatKindOfDrLiz
Copy link
Copy Markdown
Member

Pushed follow-up cleanup in commit bdbc033.

What changed:

  • clarified RedisRegistry::connection() to describe the command-connection/PUBLISH usage
  • documented the ConnectionManager IAM caveat: socket reconnect reuses the client credentials, so token rotation still relies on rebuilding the manager through ValkeyConnectionFactory

Validation I ran on the updated branch:

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated
  • cargo test -p cluster-hashring --lib
  • cargo test -p keycast_api redis::tests --lib

cargo test --workspace --verbose still hits the existing local Postgres setup failure in atproto_http_test (PoolTimedOut in api/tests/common/mod.rs), so I’m using the GitHub test check as the merge gate for the full suite on this push.

Copy link
Copy Markdown
Member

@NotThatKindOfDrLiz NotThatKindOfDrLiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final pass looks good. The follow-up commit only tightens the Redis reconnect/IAM documentation and fixes the stale registry connection comment. GitHub checks are green on bdbc033, and I do not see any remaining blockers.

@dcadenas dcadenas merged commit f88365d into main May 19, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(redis): reconnect command connections after socket loss

2 participants