fix(cluster-hashring): heartbeat/full-sync never reconnect after Redis endpoint change by dcadenas · Pull Request #237 · divinevideo/keycast

dcadenas · 2026-05-19T14:01:56Z

Summary

Redis command connections could get stuck after Redis closed an old socket.
This change makes cluster membership and API Redis helper commands reconnect, so later heartbeats, full syncs, polling writes, and admin Redis calls can recover without restarting the process.

Avoid naming corporate partners, customers, or other sensitive external brands in this PR title or body. Use generic descriptors unless a maintainer explicitly approves the public reference.

Motivation

The root cause was that command paths reused long-lived multiplexed Redis connections.
When Redis closed one of those sockets, later commands kept using the broken connection.

Switched registry and prefixed API Redis commands to redis::aio::ConnectionManager.
Removed the unused multiplexed factory path so new command callers use the reconnecting path.
Left Pub/Sub on its existing reconnect and resubscribe loop.
Kept IAM token refresh by recreating the connection manager when needed.

Related Issue

Closes fix(redis): reconnect command connections after socket loss #238

Testing

I tested the broken-socket case directly and ran the Rust workspace checks.
The regression tests kill the active Redis client connection, confirm the first command can see the closed socket, then confirm the next command reconnects and succeeds.

cargo test --workspace --verbose
cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated
cargo fmt --all -- --check
Manual verification completed: TEST_REDIS_URL=redis://localhost:16379 cargo test -p cluster-hashring registry::tests::test_registry_recovers_after_connection_killed -- --ignored --exact and TEST_REDIS_URL=redis://localhost:16379 cargo test -p keycast_api redis::tests::test_prefixed_redis_recovers_after_connection_killed -- --ignored --exact

Visuals

UI/web change with screenshots/video attached
No visual change
Visuals and text avoid sensitive external brand or partner names unless explicitly approved

Risks

The command that first discovers a killed socket can still fail once.
That matches the existing retry/backoff behavior, and the following command should use the reconnected manager.

git fetch origin main could not run here because SSH auth was unavailable.
The branch rebased cleanly onto the local origin/main ref.

… loss

NotThatKindOfDrLiz · 2026-05-19T16:29:48Z

Pushed follow-up cleanup in commit bdbc033.

What changed:

clarified RedisRegistry::connection() to describe the command-connection/PUBLISH usage
documented the ConnectionManager IAM caveat: socket reconnect reuses the client credentials, so token rotation still relies on rebuilding the manager through ValkeyConnectionFactory

Validation I ran on the updated branch:

cargo fmt --all -- --check
cargo clippy --workspace --all-targets --all-features -- -D warnings -A deprecated
cargo test -p cluster-hashring --lib
cargo test -p keycast_api redis::tests --lib

cargo test --workspace --verbose still hits the existing local Postgres setup failure in atproto_http_test (PoolTimedOut in api/tests/common/mod.rs), so I’m using the GitHub test check as the merge gate for the full suite on this push.

NotThatKindOfDrLiz

Final pass looks good. The follow-up commit only tightens the Redis reconnect/IAM documentation and fixes the stale registry connection comment. GitHub checks are green on bdbc033, and I do not see any remaining blockers.

dcadenas added 2 commits May 19, 2026 10:27

fix(cluster-hashring): reconnect registry Redis commands after socket…

8715ae1

… loss

fix(api): reconnect prefixed Redis commands after socket loss

8f4feb3

dcadenas marked this pull request as ready for review May 19, 2026 14:12

dcadenas requested review from NotThatKindOfDrLiz, irab and rabble May 19, 2026 14:13

fix(cluster-hashring): document redis reconnect caveats

bdbc033

NotThatKindOfDrLiz approved these changes May 19, 2026

View reviewed changes

dcadenas merged commit f88365d into main May 19, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cluster-hashring): heartbeat/full-sync never reconnect after Redis endpoint change#237

fix(cluster-hashring): heartbeat/full-sync never reconnect after Redis endpoint change#237
dcadenas merged 3 commits into
mainfrom
fix/906-heartbeat-full-sync-never-reconnect-after-redis-endp

dcadenas commented May 19, 2026 •

edited

Loading

Uh oh!

NotThatKindOfDrLiz commented May 19, 2026

Uh oh!

NotThatKindOfDrLiz left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dcadenas commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Related Issue

Testing

Visuals

Risks

Uh oh!

NotThatKindOfDrLiz commented May 19, 2026

Uh oh!

NotThatKindOfDrLiz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dcadenas commented May 19, 2026 •

edited

Loading