@tommyv1987 tommyv1987 commented Oct 23, 2025

When the mixnet client's mix_tx channel closes during network drops, OutQueueControl
would retry sending packets through the closed channel, flooding logs and
hanging the daemon. This affects all clients (VPN, SOCKS5, native clients), not just VPN.

Solution

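Cancel the root shutdown token from the MixTrafficController once the gateway is judged dead, so that the remaining tasks stop instead of retrying on the closed channel.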

Additionally, reduced MAX_FAILURE_COUNT from 100 → 20 to detect dead gateways faster
(~1-2s instead of ~6s), improving reconnection speed during mobile network drops and
sleep/wake cycles.
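
A minimal sketch of the mechanism described above, using only the standard library. `FailureTracker` and `record` are hypothetical names, and the `AtomicBool` is only a stand-in for the root shutdown token; the real client uses its own token type and send loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Threshold taken from the PR description (previously 100).
pub const MAX_FAILURE_COUNT: usize = 20;

/// Hypothetical helper: counts consecutive send failures and flips a shared
/// "cancelled" flag (a stand-in for the root shutdown token) once the
/// threshold is reached.
pub struct FailureTracker {
    consecutive_failures: usize,
    cancelled: Arc<AtomicBool>,
}

impl FailureTracker {
    pub fn new(cancelled: Arc<AtomicBool>) -> Self {
        Self {
            consecutive_failures: 0,
            cancelled,
        }
    }

    /// Record the outcome of one send attempt.
    /// Returns `true` if the gateway is now considered dead.
    pub fn record(&mut self, send_succeeded: bool) -> bool {
        if send_succeeded {
            // Any success resets the streak.
            self.consecutive_failures = 0;
            return false;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= MAX_FAILURE_COUNT {
            // Tell every task sharing the flag to stop, instead of letting
            // them keep retrying on a closed channel.
            self.cancelled.store(true, Ordering::SeqCst);
            return true;
        }
        false
    }
}
```

Any task that checks the shared flag before sending would stop as soon as the tracker trips, which is the behaviour missing in the logs below.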

Example Logs (Before Fix)

2025-10-23T14:18:26.772232Z ERROR nym_client_core::client::mix_traffic: Failed to send sphinx packet to the gateway 100 times in a row - assuming the gateway is dead
2025-10-23T14:18:26.772239Z DEBUG nym_client_core::client::mix_traffic: MixTrafficController: Exiting
2025-10-23T14:18:26.773309Z ERROR nym_client_core::client::real_messages_control::real_traffic_stream: failed to send mixnet packet due to closed channel (outside of shutdown!)
2025-10-23T14:18:26.773382Z ERROR nym_client_core::client::real_messages_control::real_traffic_stream: failed to send mixnet packet due to closed channel (outside of shutdown!)
2025-10-23T14:18:26.774273Z ERROR nym_client_core::client::real_messages_control::real_traffic_stream: failed to send mixnet packet due to closed channel (outside of shutdown!)
... (continues infinitely)

- pub fn get_sdk_shutdown_tracker() -> Result<ShutdownTracker, RegistryAccessError> {
-     Ok(runtime_registry::RuntimeRegistry::get_or_create_sdk()?.shutdown_tracker_owned())
+ pub fn create_sdk_shutdown_tracker() -> Result<ShutdownTracker, RegistryAccessError> {
+     Ok(runtime_registry::RuntimeRegistry::create_sdk()?.shutdown_tracker_owned())

Why did you change the existing behaviour to overwrite a pre-existing shutdown manager? This could be dangerous, especially if it had already registered some signals, tasks, etc.

Having a global shutdown means that if you cancel the underlying token, you can never spin up a new mixnet client again, because the underlying token will be cancelled to start with.

The SDK manager is only used if no other shutdown manager is provided
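
To illustrate the point about a cancelled global token, here is a small sketch using only the standard library. `Token` and `Registry` are hypothetical stand-ins, not the real `RuntimeRegistry` or shutdown types; only the `get_or_create` versus `create` distinction is taken from the diff above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for a cancellation token: once cancelled, it stays cancelled.
#[derive(Clone, Default)]
pub struct Token(Arc<AtomicBool>);

impl Token {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical registry holding at most one SDK-wide token.
#[derive(Default)]
pub struct Registry {
    slot: Mutex<Option<Token>>,
}

impl Registry {
    /// Reuse the existing token if there is one (the old behaviour).
    pub fn get_or_create(&self) -> Token {
        let mut slot = self.slot.lock().unwrap();
        slot.get_or_insert_with(Token::default).clone()
    }

    /// Always install a fresh token, replacing any previous one.
    pub fn create(&self) -> Token {
        let fresh = Token::default();
        *self.slot.lock().unwrap() = Some(fresh.clone());
        fresh
    }
}
```

With `get_or_create`, a client started after a cancellation receives a token that is already cancelled. With `create`, it receives a fresh one, but anything still holding the old token is not attached to the new one, which is the risk raised in the question above.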

.clone())
}

/// Get the ShutdownManager for SDK use.

  1. if it's not yet used, just remove it
  2. don't leak SDK needs into the common task library (it's like exposing VPN-specific methods in the monorepo)

  1. I merely modified the existing SDK shutdown manager. It's more of an internal shutdown manager than an SDK-specific one, imo. Either you give your client a custom one, or it creates one for internal use. We could have an entire discussion on the shutdown process, which I still think deserves a proper look.

// Use custom shutdown if provided, otherwise the sdk one will be used later down the line
if let Some(shutdown_tracker) = self.custom_shutdown {
    base_builder = base_builder.with_shutdown(shutdown_tracker);
}

so why do we no longer get the default static one?

We still need it, but it was set once here and then again later in the base client startup. There is no point in setting it in two places.

Reference:

let shutdown_tracker = match self.shutdown {

I'd be fine with setting it up there and removing it here, as mentioned. Then BaseClientBuilder.shutdown should no longer be optional.
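
A rough sketch of the "resolve the fallback in one place" idea from this thread. `BaseClientBuilder`, `with_shutdown` and `ShutdownTracker` are named in the PR, but their definitions here are simplified stand-ins; `default_sdk_tracker`, `resolve_shutdown`, `build` and the `label` field are hypothetical and exist only for illustration.

```rust
/// Stand-in for the real tracker type; `label` only shows which one was chosen.
#[derive(Debug, Clone, PartialEq)]
pub struct ShutdownTracker {
    pub label: &'static str,
}

/// Hypothetical fallback, standing in for the SDK-provided tracker.
fn default_sdk_tracker() -> ShutdownTracker {
    ShutdownTracker { label: "sdk" }
}

pub struct BaseClientBuilder {
    shutdown: Option<ShutdownTracker>,
}

impl BaseClientBuilder {
    pub fn new() -> Self {
        Self { shutdown: None }
    }

    pub fn with_shutdown(mut self, tracker: ShutdownTracker) -> Self {
        self.shutdown = Some(tracker);
        self
    }

    /// The single place where the fallback is applied.
    pub fn resolve_shutdown(self) -> ShutdownTracker {
        match self.shutdown {
            Some(custom) => custom,
            None => default_sdk_tracker(),
        }
    }
}

/// Caller side, mirroring the snippet under review: only forward a custom
/// tracker, and leave the default to the builder.
pub fn build(custom_shutdown: Option<ShutdownTracker>) -> ShutdownTracker {
    let mut base_builder = BaseClientBuilder::new();
    if let Some(tracker) = custom_shutdown {
        base_builder = base_builder.with_shutdown(tracker);
    }
    base_builder.resolve_shutdown()
}
```

The alternative suggested above would move the fallback to the caller instead, so that the builder's field holds a plain `ShutdownTracker` rather than an `Option`.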

@pronebird pronebird left a comment

@pronebird reviewed 6 of 7 files at r2.
Reviewable status: 6 of 7 files reviewed, 4 unresolved discussions (waiting on @durch and @jstuczyn)


common/client-core/src/client/real_messages_control/real_traffic_stream.rs line 283 at r2 (raw file):

        };

        let sending_res = tokio::select! {

NIT: This could be written as:

let sending_res = self.shutdown_token.run_until_cancelled(self.mix_tx.send(vec![next_message])).await;

@pronebird pronebird left a comment

Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @durch, @jstuczyn, and @tommyv1987)


common/client-core/src/client/mix_traffic/mod.rs line 166 at r2 (raw file):

                            // Gateway is dead, we have to shut down currently
                            error!("Signalling shutdown from the MixTrafficController");
                            self.shutdown_token.cancel();

Would it not be easier to pass a channel from parent task and bubble error back up? Then let parent cancel everything and exit cleanly.

@simonwicky simonwicky left a comment

Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @durch, @jstuczyn, @pronebird, and @tommyv1987)


common/client-core/src/client/mix_traffic/mod.rs line 166 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Would it not be easier to pass a channel from parent task and bubble error back up? Then let parent cancel everything and exit cleanly.

Easier, no; much better, yes, hence my comment about it. We need to come back to that; this PR is just a band-aid over a bigger issue.

@pronebird pronebird left a comment

Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @durch, @jstuczyn, and @tommyv1987)


common/client-core/src/client/mix_traffic/mod.rs line 166 at r2 (raw file):

Previously, simonwicky (Simon Wicky) wrote…

Easier, no; much better, yes, hence my comment about it. We need to come back to that; this PR is just a band-aid over a bigger issue.

If we have a clear parent task, it seems logical to me to eliminate the band-aid and just add a simple channel, bubble the message up, and then cancel all child tasks from above rather than from below.

@simonwicky simonwicky left a comment

Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions (waiting on @durch, @jstuczyn, @pronebird, and @tommyv1987)


common/client-core/src/client/mix_traffic/mod.rs line 166 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

If we have a clear parent task, it seems logical to me to eliminate the band-aid and just add a simple channel, bubble the message up, and then cancel all child tasks from above rather than from below.

Bubbling up messages from everywhere is gonna be real ugly, we need a signalling channel within the custom shutdown mechanism we have, and I don't have the time to do it properly
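
For reference, the "bubble the error up and cancel from above" alternative discussed in this thread could look roughly like the sketch below. It uses plain threads, `std::sync::mpsc` and an `AtomicBool` instead of async tasks and the project's shutdown token, and `ChildError` and `run_parent` are hypothetical names. It is not what this PR merged.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ChildError {
    GatewayDead,
}

/// Spawns two children and returns the error the parent received, plus
/// whether the sibling stopped after the parent cancelled it.
pub fn run_parent() -> (ChildError, bool) {
    let cancelled = Arc::new(AtomicBool::new(false));
    let (err_tx, err_rx) = mpsc::channel::<ChildError>();

    // Child 1: detects the dead gateway and only reports it upwards.
    // It does not cancel anything itself.
    let reporter = thread::spawn(move || {
        let _ = err_tx.send(ChildError::GatewayDead);
    });

    // Child 2: a sibling that keeps working until the parent cancels it.
    let flag = cancelled.clone();
    let sibling = thread::spawn(move || {
        while !flag.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
        true
    });

    // Parent: wait for the first reported error, then cancel from above.
    let err = err_rx.recv().expect("reporter exited without sending");
    cancelled.store(true, Ordering::SeqCst);

    reporter.join().unwrap();
    let sibling_stopped = sibling.join().unwrap();
    (err, sibling_stopped)
}
```

The trade-off raised above still applies: every task that can fail needs a sender back to the parent, which is why a signalling channel inside the shared shutdown mechanism was suggested instead.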

@tommyv1987

Closing with more work being undertaken here: https://nymtech.atlassian.net/browse/NET-688

@tommyv1987 tommyv1987 merged commit fb3f550 into develop Oct 27, 2025
18 of 19 checks passed
@tommyv1987 tommyv1987 deleted the bugfix/mix-tx-closed-v2 branch October 27, 2025 15:45
tommyv1987 added a commit that referenced this pull request Oct 29, 2025
tommyv1987 added a commit that referenced this pull request Oct 29, 2025
Cherry pick - request #6143 from nymtech/bugfix/mix-tx-closed-v2
benedettadavico pushed a commit that referenced this pull request Oct 30, 2025
jstuczyn pushed a commit that referenced this pull request Oct 30, 2025
jstuczyn pushed a commit that referenced this pull request Oct 31, 2025
jstuczyn pushed a commit that referenced this pull request Oct 31, 2025
jstuczyn pushed a commit that referenced this pull request Oct 31, 2025