Skip to content

Conversation

@Frando
Copy link
Member

@Frando Frando commented Nov 20, 2025

Description

This adds a simple test: one server that accepts connections, and then 3 clients that are created consecutively and each connect to the server, exchange data, close connection, close endpoint. After a client is finished the next client is created (new endpoint, new secret key).

The test comes in three variations that share their impl:

  • test_one_server_two_clients_local_relay: uses a local relay
  • test_one_server_two_clients_public_relay: uses the public relay
  • test_one_server_two_clients_no_relay: doesn't use relays at all

The no relay and local relay versions pass for me. However the public relay one times out but only for the second client. The first one succeeds, the second connect attempt times out at the client.

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

@Frando Frando changed the base branch from main to feat-multipath November 20, 2025 17:17
@github-actions
Copy link

github-actions bot commented Nov 20, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3689/docs/iroh/

Last updated: 2025-11-25T13:44:56Z

@github-actions
Copy link

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 2c0802e

@n0bot n0bot bot added this to iroh Nov 20, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Nov 20, 2025
@Frando
Copy link
Member Author

Frando commented Nov 20, 2025

initial investigation for test_one_server_two_clients_public_relay:
once the second client starts to connect, I do get the logs for it sending datagrams via the RemoteStateActor on all paths (ip and relay). on the server, when browsing the logs from that send timestamp forward, however I get these logs only:

2025-11-20T20:54:02.226494Z TRACE server: netwatch::udp: UDP recv src=192.168.18.55:32827 len=1200 count=1 dst=192.168.18.55
2025-11-20T20:54:02.226522Z TRACE server: iroh::magicsock: UDP recv: 1 packets
2025-11-20T20:54:02.554427Z TRACE server:relay-actor:active-relay{url=https://euc1-1.relay.n0.iroh-canary.iroh.link./}:connected: iroh::magicsock::transports::relay::actor: received msg len=1200
2025-11-20T20:54:02.554634Z TRACE server: iroh::magicsock::mapped_addrs: generated new addr addr=RelayMappedAddr(fd15:70a:510b:1::3) key=(RelayUrl("https://euc1-1.relay.n0.iroh-canary.iroh.link./"), PublicKey(5965936c12b5b298af175fed203e78ef0f173a60786efb89575464463f9cdea7))

and .. that's it? I get some further logs for a connection that is can be identified as the older (closed!) connection from the first client.

I don't see any quinn_proto recv lines for this datagram received from the second client, which is weird - it seems these initial packets are received by the server, get up to the point where "generated new addr" is logged, but don't make its way into the quinn endpoint.

@Frando
Copy link
Member Author

Frando commented Nov 20, 2025

Interestingly, CI fails only for ubuntu-stable features none or default, all others pass! I can reproduce the failure reliably locally though.

@Frando
Copy link
Member Author

Frando commented Nov 22, 2025

Some findings. Looking at the logs of the test with the public relay.
First client works fine. Now second client starts. Looking at the logs of the server for the second client:

  • Client initial packet arrives over the IP path quinn_proto::Endpoint::handle emits Some(DatagramEvent::NewConnection).
  • Client initial packet arrives over the relay path, Endpoint::handle emits None, internally RouteTo::Initial(0) is the case that leads to handle returning None
  • Client packet packet arrives over the IP. Strangely, quinn_proto::Endpoint::handle now returns None instead of a DatagramEvent::ConnectionEvent, in handle it hits again the RouteTo::Initial(0) case.

Comparing to the first successfull flow, in the first client attempt we have this flow:

  • First packet over IP arrives -> DatagramEvent::NewConnection
  • Second packet over IP arrives -> DatagramEvent::ConnectionEvent

So in the working case the second packet over IP also has a different dst_cid than the first packet, whereas in the second (broken) client the second packet over IP has the same dst_cid than the first client.

Attaching logs here (with additional debug log statements inserted by me in iroh_quinn).
log-public.txt

@Frando
Copy link
Member Author

Frando commented Nov 25, 2025

Ooh I think I found the reason!
What happens is this (thanks qlog!!):

  1. Server and client create endpoint, connect to public relay
  2. Client connects to server, has its relay and ip addr
  3. Client sends Initial over all paths (ip and relay)
  4. Server receives Initial over IP path, new connection is created, flow continues as it should
  5. Server receives Initial again over Relay path quite a bit later. It creates another Connection
  6. Test hangs because it sequentially accepts incomings, and this second connection never proceeds because no further packets arrive via that path

So this step 5) seems wrong - it should discard the second Initial instead of initiating a new connection.

@matheus23
Copy link
Member

So this step 5) seems wrong - it should discard the second Initial instead of initiating a new connection.

Is the original connection closed before it receives the second initial, or are there two simultaneous connections?

@Frando
Copy link
Member Author

Frando commented Nov 25, 2025

Is the original connection closed before it receives the second initial, or are there two simultaneous connections?

not sure yet, rerunning with more logs now.

@Frando
Copy link
Member Author

Frando commented Nov 25, 2025

Is the original connection closed before it receives the second initial, or are there two simultaneous connections?

The original connection is already closed at the server when the server receives the second, duplicate initial. So maybe there's not much we can do about this apart from advising to always, even in tests, spawn new tasks for awaiting incomings, and not care about them timing out. It's still a bit unfortunate, because the server will keep sending its initial until the 30s timeout hits on this bogus connection.

@Frando Frando force-pushed the Frando/mp-relay-test branch from 1652829 to 11edf1b Compare November 25, 2025 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

3 participants