- 
                Notifications
    You must be signed in to change notification settings 
- Fork 663
chore: better error logging for "failed to join reader and writer tasks" #3910 #3913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
| WalkthroughEnhanced error handling and logging in the TCP client implementation. Captures peer address early from TcpStream, clones subject field to prevent value moves, and refines join-time failure handling to differentiate and log three distinct failure scenarios with contextual information. Changes
 Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 
 Poem
 Pre-merge checks❌ Failed checks (2 warnings)
 ✅ Passed checks (1 passed)
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/runtime/src/pipeline/network/tcp/client.rs (2)
104-104: Consider optimizing the double clone of subject.The subject is cloned at both line 104 (for the handshake) and line 131 (for error logging in the spawned task). While safe and functionally correct, you could reduce this to a single clone by destructuring
infoor reordering operations.Example:
let subject = info.subject.clone(); let handshake = CallHomeHandshake { subject: subject.clone(), stream_type: StreamType::Response, };However, the current approach is clear and the performance impact is negligible for typical subject strings.
Also applies to: 131-131
171-194: Excellent contextual error messages, but consider reducing log duplication.The enhanced error handling clearly distinguishes the three failure scenarios and includes valuable context (peer address and subject). However, each branch both logs the error with
tracing::error!and then bails with the same message, creating duplicate log entries.Consider removing the explicit
tracing::error!calls and letting the error propagate, or use a different log level (e.g.,debugortrace) for the explicit log if you need to track the error at this location specifically.Example:
(Err(reader_err), Ok(_)) => { - tracing::error!( - "reader task failed to join (peer: {peer_addr:?}, subject: {subject}): {reader_err:?}" - ); anyhow::bail!( "reader task failed to join (peer: {peer_addr:?}, subject: {subject}): {reader_err:?}" ); }That said, the current approach ensures the error is visible in logs even if the spawned task's result is dropped, so it may be intentional.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- lib/runtime/src/pipeline/network/tcp/client.rs(5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3184
File: docs/architecture/kv_cache_routing.md:70-73
Timestamp: 2025-09-23T20:08:37.105Z
Learning: PeaBrane prefers to keep documentation diagrams simplified to avoid visual overload, even when this means sacrificing some technical precision for the sake of clarity and comprehension. They prioritize pedagogical effectiveness over exhaustive technical detail in architectural diagrams.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
🧬 Code graph analysis (1)
lib/runtime/src/pipeline/network/tcp/client.rs (4)
lib/runtime/src/transports/nats.rs (1)
subject(852-854)lib/runtime/src/component/namespace.rs (1)
subject(15-17)lib/runtime/src/component/component.rs (1)
subject(15-17)lib/runtime/src/traits/events.rs (1)
subject(21-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang
- GitHub Check: operator (amd64)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
🔇 Additional comments (2)
lib/runtime/src/pipeline/network/tcp/client.rs (2)
87-87: LGTM! Good defensive capture of peer address.Capturing the peer address immediately after connection is a solid practice for diagnostics. Using
.ok()handles edge cases gracefully.
252-256: LGTM! Enhanced panic message improves diagnostics.Including the error details (
{e:?}) in the panic message significantly improves debuggability when decode failures occur. This aligns well with the PR's objective of better error logging.
Signed-off-by: PeaBrane <[email protected]>
Overview:
as titled
Summary by CodeRabbit