Engine Threads >1 "Failed to connect to remote host" #49

Open
GeorgKreuzmayr opened this issue Nov 14, 2024 · 2 comments

@GeorgKreuzmayr

Hello everyone,

When I set engine_threads to any value greater than 1, connecting from a client to a server fails with an error.

With engine_threads=2, the error occurs only occasionally, i.e. not for every port combination. With engine_threads=16, it fails more often: on every port combination I tried.

The error message on the client is:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:50:38.682282     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
ERROR: Failed to dequeue response from control queue.
F20241103 18:50:49.975369     1 main.cc:346] Check failed: ret == 0 Failed to connect to remote host. machnet_connect() error: Unknown error -1
*** Check failure stack trace: ***
    @     0x7fa3d8ce3f03  google::LogMessage::Fail()
    @     0x7fa3d8ce793c  google::LogMessage::SendToLog()
    @     0x7fa3d8ce39e7  google::LogMessage::Flush()
    @     0x7fa3d8ce509f  google::LogMessageFatal::~LogMessageFatal()
    @     0x562d0c932a28  main
    @     0x7fa3d8866d90  (unknown)

The server runs on another EC2 instance, started with this command:

ubuntu@ip-172-31-32-20:~$ ${MSG_GEN} --local_ip 172.31.32.120 --msg_size 64 

With engine_threads=1, on the other hand, the same client command succeeds:

ubuntu@ip-172-31-32-21:~$ ${MSG_GEN} --local_ip 172.31.32.121 --remote_ip 172.31.32.120 --msg_size 64 --msg_window 32
I20241103 18:06:00.837787     1 main.cc:332] Starting in client mode, request size 64
Checking for file descriptor...
Got a file descriptor!
I20241103 18:06:03.949545     1 main.cc:350] [CONNECTED] [172.31.32.121:1024 <-> 172.31.32.120:888]
I20241103 18:06:03.972815     7 main.cc:294] Client Loop: Starting.
TX/RX (msg/sec, Gbps): (0.0K/0.0K, 0.000/0.000). RTT (p50/99/99.9 us): 144/144/144
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/195
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/194
TX/RX (msg/sec, Gbps): (217.4K/217.4K, 0.111/0.111). RTT (p50/99/99.9 us): 143/179/543
TX/RX (msg/sec, Gbps): (220.0K/220.0K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/193
TX/RX (msg/sec, Gbps): (220.2K/220.2K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/190
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/191
TX/RX (msg/sec, Gbps): (220.1K/220.1K, 0.113/0.113). RTT (p50/99/99.9 us): 143/177/189

In the commands above, MSG_GEN is defined as:

MSG_GEN="docker run -v /var/run/machnet:/var/run/machnet ghcr.io/microsoft/machnet/machnet:latest release_build/src/apps/msg_gen/msg_gen"

Setup: Two EC2 instances of type c5n.18xlarge running Kernel 6.5.0-1014-aws on Ubuntu 23.10.
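
To quantify "occasionally" versus "more often" across engine_threads settings, saved client logs could be tallied with a small script. The following is a minimal sketch, not part of Machnet: the function names and log file names are hypothetical, and the only strings it relies on are the two markers visible in the outputs above (`[CONNECTED]` and `Failed to connect to remote host`).

```shell
#!/bin/sh
# Classify one saved msg_gen client log using the two markers seen in this report.
# Prints "connected", "failed", or "unknown" (neither marker present).
classify_log() {
  if grep -qF '[CONNECTED]' "$1"; then
    echo "connected"
  elif grep -qF 'Failed to connect to remote host' "$1"; then
    echo "failed"
  else
    echo "unknown"
  fi
}

# Tally the classification over several saved logs (file names are hypothetical).
tally_logs() {
  n_connected=0
  n_failed=0
  n_unknown=0
  for log_file in "$@"; do
    case "$(classify_log "$log_file")" in
      connected) n_connected=$((n_connected + 1)) ;;
      failed)    n_failed=$((n_failed + 1)) ;;
      *)         n_unknown=$((n_unknown + 1)) ;;
    esac
  done
  echo "connected=$n_connected failed=$n_failed unknown=$n_unknown"
}
```

A possible workflow, assuming each client run is captured with both stdout and stderr redirected to its own file (for example `> run1.log 2>&1`), is to call `tally_logs run*.log` once per engine_threads setting. Note that a successfully connected client keeps running and has to be stopped manually before the next attempt.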

@sarsanaee
Collaborator

Hi Georg,

Thanks for using Machnet.

This behavior is expected. The master branch does not support running with an arbitrary number of engines on Amazon VMs.

Could you try the rss_blast branch (https://github.com/microsoft/machnet/tree/rss_blast) to see whether it eliminates the issue?

Thanks,
Alireza
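
Fetching the suggested branch could look like the sketch below. The branch name comes from the URL above; the clone directory name and the function name are arbitrary choices, and `--recurse-submodules` is included on the assumption that the repository uses submodules. The prebuilt image tagged `latest` used in the original report is presumably not built from this branch, so Machnet would need to be built from this checkout following the repository's own build instructions.

```shell
#!/bin/sh
# Repository and branch as referenced in this thread.
REPO_URL="https://github.com/microsoft/machnet.git"
BRANCH="rss_blast"
# Hypothetical local directory name for the checkout.
CLONE_DIR="machnet-${BRANCH}"

# Clone only when called, so sourcing this file has no side effects.
clone_branch() {
  git clone --branch "$BRANCH" --recurse-submodules "$REPO_URL" "$CLONE_DIR"
}
```

Calling `clone_branch` performs the checkout into `machnet-rss_blast` in the current directory.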

@marinosi
Collaborator

Hi @GeorgKreuzmayr,

As @sarsanaee pointed out, we have an experimental branch for achieving connectivity when using multiple engines. We have not yet tried it on AWS; if you could give it a spin and let us know how it goes, that would be helpful!

Ilias
