MPI Spawn jobs don't work on multinode LSF cluster #9041
From my reading of the error log, I don't think this is related to LSF specifically; it is more likely something with spawn or the machine. In an LSF environment we only use LSF to launch the ORTE daemons (or PRRTE daemons if you are using v5.x or later), and those daemons then handle the MPI spawn and wireup mechanisms. The log shows the UCX adapter failing to wire up properly. To try to eliminate spawn from the diagnosis: are you able to run a "hello world" and a "ring" program across multiple nodes in the allocation?
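A minimal "hello world" along these lines (an illustrative sketch added here, not code from this thread; the Open MPI tarball also ships ready-made hello_c.c and ring_c.c under examples/) is enough to exercise cross-node wireup without spawn:

// hello.cpp -- print one line per rank with its host name
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    std::printf("hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Submitted with the same pattern used later in the thread, for example bsub -n 3 -R "span[ptile=1]" -o $HOME/log mpirun $HOME/hello, one line per node would confirm that plain (non-spawn) jobs wire up correctly under LSF.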
I don't think UCX is to blame here, but let's try to eliminate UCX from the diagnosis: Can you add the following to your default environment:
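(The specific settings did not survive in this excerpt; judging from what the user exports in the reply below, they were presumably:

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,vader,self

that is, force the ob1 point-to-point layer over the TCP, shared-memory, and self BTLs so that UCX is taken out of the picture.)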
If you are able to ssh between hosts in your allocation you can eliminate the LSF daemon launch mechanism by setting the following environment variable:
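(Again, the variable itself is missing from this excerpt; based on the user's reply it was presumably:

export OMPI_MCA_plm=^lsf

The ^ excludes the LSF launch component, so the ORTE daemons are started over rsh/ssh instead of through LSF.)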
Give those a try and let us know how it goes.
Thank you very much for your answer, @jjhursey!
So this may point to a general issue with spawn. I'm not certain of the stability of comm_spawn on the release you are using. Are you able to correctly run your spawn test with the following variables?
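(The variable list is also missing here; matching what the user exports in the next comment, it presumably adds BTL verbosity on top of the settings above, i.e. export OMPI_MCA_btl_base_verbose=100.)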
If not, then can you post the debug output?
Thank you for your quick response! I set the following:
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,vader,self
export OMPI_MCA_plm=^lsf
export OMPI_MCA_btl_base_verbose=100
and submitted:
bsub -n 3 -R "span[ptile=1]" -o $HOME/log mpirun -n 1 $HOME/producer
The output:
Sender: LSF System <[email protected]>
Subject: Job 4102: <mpirun -n 1 /home/user/producer> in cluster <r_cluster> Exited
Job <mpirun -n 1 /home/user/producer> was submitted from host <node-001.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*node-003.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
<1*node-002.cm.cluster>
<1*node-001.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Mon Jun 7 18:44:02 2021
Results reported on Mon Jun 7 18:44:08 2021
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 /home/user/producer
------------------------------------------------------------
Exited with exit code 17.
Resource usage summary:
CPU time : 0.22 sec.
Max Memory : 27 MB
Average Memory : 12.50 MB
Total Requested Memory : -
Delta Memory : -
Max Processes : 5
Max Threads : 9
Run time : 5 sec.
Turnaround time : 7 sec.
The output (if any) follows:
[node-003:25380] mca: base: components_register: registering framework btl components
[node-003:25380] mca: base: components_register: found loaded component self
[node-003:25380] mca: base: components_register: component self register function successful
[node-003:25380] mca: base: components_register: found loaded component tcp
[node-003:25380] mca: base: components_register: component tcp register function successful
[node-003:25380] mca: base: components_register: found loaded component vader
[node-003:25380] mca: base: components_register: component vader register function successful
[node-003:25380] mca: base: components_open: opening btl components
[node-003:25380] mca: base: components_open: found loaded component self
[node-003:25380] mca: base: components_open: component self open function successful
[node-003:25380] mca: base: components_open: found loaded component tcp
[node-003:25380] mca: base: components_open: component tcp open function successful
[node-003:25380] mca: base: components_open: found loaded component vader
[node-003:25380] mca: base: components_open: component vader open function successful
[node-003:25380] select: initializing btl component self
[node-003:25380] select: init of component self returned success
[node-003:25380] select: initializing btl component tcp
[node-003:25380] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-003:25380] btl: tcp: Found match: 127.0.0.1 (lo)
[node-003:25380] btl:tcp: Attempting to bind to AF_INET port 1024
[node-003:25380] btl:tcp: Successfully bound to AF_INET port 1024
[node-003:25380] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-003:25380] btl:tcp: examining interface eth0
[node-003:25380] btl:tcp: using ipv6 interface eth0
[node-003:25380] btl:tcp: examining interface eth1
[node-003:25380] btl:tcp: using ipv6 interface eth1
[node-003:25380] select: init of component tcp returned success
[node-003:25380] select: initializing btl component vader
[node-003:25380] select: init of component vader returned failure
[node-003:25380] mca: base: close: component vader closed
[node-003:25380] mca: base: close: unloading component vader
[node-003:25380] mca: bml: Using self btl for send to [[53723,1],0] on node node-003
0 1
[node-002:01434] mca: base: components_register: registering framework btl components
[node-002:01434] mca: base: components_register: found loaded component self
[node-002:01434] mca: base: components_register: component self register function successful
[node-002:01434] mca: base: components_register: found loaded component tcp
[node-002:01434] mca: base: components_register: component tcp register function successful
[node-002:01434] mca: base: components_register: found loaded component vader
[node-002:01434] mca: base: components_register: component vader register function successful
[node-002:01434] mca: base: components_open: opening btl components
[node-002:01434] mca: base: components_open: found loaded component self
[node-002:01434] mca: base: components_open: component self open function successful
[node-002:01434] mca: base: components_open: found loaded component tcp
[node-002:01434] mca: base: components_open: component tcp open function successful
[node-002:01434] mca: base: components_open: found loaded component vader
[node-002:01434] mca: base: components_open: component vader open function successful
[node-002:01434] select: initializing btl component self
[node-002:01434] select: init of component self returned success
[node-002:01434] select: initializing btl component tcp
[node-002:01434] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-002:01434] btl: tcp: Found match: 127.0.0.1 (lo)
[node-002:01434] btl:tcp: Attempting to bind to AF_INET port 1024
[node-002:01434] btl:tcp: Successfully bound to AF_INET port 1024
[node-002:01434] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-002:01434] btl:tcp: examining interface eth0
[node-002:01434] btl:tcp: using ipv6 interface eth0
[node-002:01434] btl:tcp: examining interface eth1
[node-002:01434] btl:tcp: using ipv6 interface eth1
[node-002:01434] select: init of component tcp returned success
[node-002:01434] select: initializing btl component vader
[node-002:01434] select: init of component vader returned failure
[node-002:01434] mca: base: close: component vader closed
[node-002:01434] mca: base: close: unloading component vader
[node-001:12771] mca: base: components_register: registering framework btl components
[node-001:12771] mca: base: components_register: found loaded component self
[node-001:12771] mca: base: components_register: component self register function successful
[node-001:12771] mca: base: components_register: found loaded component tcp
[node-001:12771] mca: base: components_register: component tcp register function successful
[node-001:12771] mca: base: components_register: found loaded component vader
[node-001:12771] mca: base: components_register: component vader register function successful
[node-001:12771] mca: base: components_open: opening btl components
[node-001:12771] mca: base: components_open: found loaded component self
[node-001:12771] mca: base: components_open: component self open function successful
[node-001:12771] mca: base: components_open: found loaded component tcp
[node-001:12771] mca: base: components_open: component tcp open function successful
[node-001:12771] mca: base: components_open: found loaded component vader
[node-001:12771] mca: base: components_open: component vader open function successful
[node-001:12771] select: initializing btl component self
[node-001:12771] select: init of component self returned success
[node-001:12771] select: initializing btl component tcp
[node-001:12771] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-001:12771] btl: tcp: Found match: 127.0.0.1 (lo)
[node-001:12771] btl:tcp: Attempting to bind to AF_INET port 1024
[node-001:12771] btl:tcp: Successfully bound to AF_INET port 1024
[node-001:12771] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-001:12771] btl:tcp: examining interface eth0
[node-001:12771] btl:tcp: using ipv6 interface eth0
[node-001:12771] btl:tcp: examining interface eth1
[node-001:12771] btl:tcp: using ipv6 interface eth1
[node-001:12771] select: init of component tcp returned success
[node-001:12771] select: initializing btl component vader
[node-001:12771] select: init of component vader returned failure
[node-001:12771] mca: base: close: component vader closed
[node-001:12771] mca: base: close: unloading component vader
[node-002:01434] mca: bml: Using self btl for send to [[53723,2],0] on node node-002
[node-001:12771] mca: bml: Using self btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] mca: bml: Using tcp btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] mca: bml: Using tcp btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl: tcp: attempting to connect() to [[53723,2],1] address 168.124.126.58 on port 1024
[node-002:01434] btl:tcp: would block, so allowing background progress
[node-002:01434] btl:tcp: connect() to 168.124.126.58:1024 completed (complete_connect), sending connect ACK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl: tcp: Match incoming connection from [[53723,2],0] 168.124.126.151 with locally known IP 168.124.218.151 failed (iface 0/2)!
[node-001:12771] btl:tcp: now connected to 168.124.126.151, process [[53723,2],0]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[53723,1],0]) is on host: node-003
Process 2 ([[53723,2],0]) is on host: unknown!
BTLs attempted: self tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[node-003:25380] [[53723,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:12771] [[53723,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:25380] *** An error occurred in MPI_Comm_spawn
[node-003:25380] *** reported by process [3520790529,0]
[node-003:25380] *** on communicator MPI_COMM_WORLD
[node-003:25380] *** MPI_ERR_INTERN: internal error
[node-003:25380] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-003:25380] *** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[node-001:12771] *** An error occurred in MPI_Init
[node-001:12771] *** reported by process [3520790530,1]
[node-001:12771] *** on a NULL communicator
[node-001:12771] *** Unknown error
[node-001:12771] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-001:12771] *** and potentially your MPI job)
[node-002:01434] [[53723,2],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:25374] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[node-003:25374] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node-003:25374] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node-003:25374] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Do you think an upgrade would be beneficial, @jjhursey? Thanks!
You might try v4.1.1 to see if the behavior changes. From what I'm seeing in these logs, this points to an issue with wireup around spawn.
So I just built v4.1.1 and ran the same example under the same conditions; it seems to be more talkative. I get this output:
Thank you for your time, @jjhursey!
I still have the issue even if I force the Ethernet interface choice, which is very strange...
bsub -n 3 -R "span[ptile=1]" -o output.log mpirun -n 1 --mca btl_tcp_if_include eth1 $HOME/producer
Output:
Job <4168> is submitted to default queue <STANDARD_BATCH>.
bash-4.2$ more loog4
Sender: LSF System <[email protected]>
Subject: Job 4168: <mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer> in cluster <r_cluster> Exited
Job <mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer> was submitted from host <node-001.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*node-003.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
<1*node-002.cm.cluster>
<1*node-001.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Wed Jun 9 12:37:46 2021
Results reported on Wed Jun 9 12:37:52 2021
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer
------------------------------------------------------------
Exited with exit code 17.
Resource usage summary:
CPU time : 0.40 sec.
Max Memory : 33 MB
Total Requested Memory : -
Delta Memory : -
Run time : 5 sec.
Turnaround time : 6 sec.
The output (if any) follows:
[node-003:10140] mca: base: components_register: registering framework btl components
[node-003:10140] mca: base: components_register: found loaded component self
[node-003:10140] mca: base: components_register: component self register function successful
[node-003:10140] mca: base: components_register: found loaded component tcp
[node-003:10140] mca: base: components_register: component tcp register function successful
[node-003:10140] mca: base: components_register: found loaded component sm
[node-003:10140] mca: base: components_register: found loaded component usnic
[node-003:10140] mca: base: components_register: component usnic register function successful
[node-003:10140] mca: base: components_register: found loaded component ofi
[node-003:10140] mca: base: components_register: component ofi register function successful
[node-003:10140] mca: base: components_register: found loaded component vader
[node-003:10140] mca: base: components_register: component vader register function successful
[node-003:10140] mca: base: components_open: opening btl components
[node-003:10140] mca: base: components_open: found loaded component self
[node-003:10140] mca: base: components_open: component self open function successful
[node-003:10140] mca: base: components_open: found loaded component tcp
[node-003:10140] mca: base: components_open: component tcp open function successful
[node-003:10140] mca: base: components_open: found loaded component usnic
[node-003:10140] mca: base: components_open: component usnic open function successful
[node-003:10140] mca: base: components_open: found loaded component ofi
[node-003:10140] mca: base: components_open: component ofi open function successful
[node-003:10140] mca: base: components_open: found loaded component vader
[node-003:10140] mca: base: components_open: component vader open function successful
[node-003:10140] select: initializing btl component self
[node-003:10140] select: init of component self returned success
[node-003:10140] select: initializing btl component tcp
[node-003:10140] btl:tcp: Attempting to bind to AF_INET port 1024
[node-003:10140] btl:tcp: Successfully bound to AF_INET port 1024
[node-003:10140] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-003:10140] btl:tcp: examining interface eth1
[node-003:10140] btl:tcp: using ipv6 interface eth1
[node-003:10140] select: init of component tcp returned success
[node-003:10140] select: initializing btl component usnic
[node-003:10140] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-003:10140] select: init of component usnic returned failure
[node-003:10140] mca: base: close: component usnic closed
[node-003:10140] mca: base: close: unloading component usnic
[node-003:10140] select: initializing btl component ofi
[node-003:10140] select: init of component ofi returned success
[node-003:10140] select: initializing btl component vader
[node-003:10140] select: init of component vader returned failure
[node-003:10140] mca: base: close: component vader closed
[node-003:10140] mca: base: close: unloading component vader
[node-003:10140] mca: bml: Using self btl for send to [[38226,1],0] on node node-003
0 1
[node-001:30432] mca: base: components_register: registering framework btl components
[node-001:30432] mca: base: components_register: found loaded component self
[node-001:30432] mca: base: components_register: component self register function successful
[node-001:30432] mca: base: components_register: found loaded component tcp
[node-001:30432] mca: base: components_register: component tcp register function successful
[node-001:30432] mca: base: components_register: found loaded component sm
[node-001:30432] mca: base: components_register: found loaded component usnic
[node-001:30432] mca: base: components_register: component usnic register function successful
[node-001:30432] mca: base: components_register: found loaded component ofi
[node-001:30432] mca: base: components_register: component ofi register function successful
[node-001:30432] mca: base: components_register: found loaded component vader
[node-001:30432] mca: base: components_register: component vader register function successful
[node-001:30432] mca: base: components_open: opening btl components
[node-001:30432] mca: base: components_open: found loaded component self
[node-001:30432] mca: base: components_open: component self open function successful
[node-001:30432] mca: base: components_open: found loaded component tcp
[node-001:30432] mca: base: components_open: component tcp open function successful
[node-001:30432] mca: base: components_open: found loaded component usnic
[node-001:30432] mca: base: components_open: component usnic open function successful
[node-001:30432] mca: base: components_open: found loaded component ofi
[node-001:30432] mca: base: components_open: component ofi open function successful
[node-001:30432] mca: base: components_open: found loaded component vader
[node-001:30432] mca: base: components_open: component vader open function successful
[node-001:30432] select: initializing btl component self
[node-001:30432] select: init of component self returned success
[node-001:30432] select: initializing btl component tcp
[node-001:30432] btl:tcp: Attempting to bind to AF_INET port 1024
[node-001:30432] btl:tcp: Successfully bound to AF_INET port 1024
[node-001:30432] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-001:30432] btl:tcp: examining interface eth1
[node-001:30432] btl:tcp: using ipv6 interface eth1
[node-001:30432] select: init of component tcp returned success
[node-001:30432] select: initializing btl component usnic
[node-001:30432] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-001:30432] select: init of component usnic returned failure
[node-001:30432] mca: base: close: component usnic closed
[node-001:30432] mca: base: close: unloading component usnic
[node-001:30432] select: initializing btl component ofi
[node-001:30432] select: init of component ofi returned success
[node-001:30432] select: initializing btl component vader
[node-001:30432] select: init of component vader returned failure
[node-001:30432] mca: base: close: component vader closed
[node-001:30432] mca: base: close: unloading component vader
[node-002:19988] mca: base: components_register: registering framework btl components
[node-002:19988] mca: base: components_register: found loaded component self
[node-002:19988] mca: base: components_register: component self register function successful
[node-002:19988] mca: base: components_register: found loaded component tcp
[node-002:19988] mca: base: components_register: component tcp register function successful
[node-002:19988] mca: base: components_register: found loaded component sm
[node-002:19988] mca: base: components_register: found loaded component usnic
[node-002:19988] mca: base: components_register: component usnic register function successful
[node-002:19988] mca: base: components_register: found loaded component ofi
[node-002:19988] mca: base: components_register: component ofi register function successful
[node-002:19988] mca: base: components_register: found loaded component vader
[node-002:19988] mca: base: components_register: component vader register function successful
[node-002:19988] mca: base: components_open: opening btl components
[node-002:19988] mca: base: components_open: found loaded component self
[node-002:19988] mca: base: components_open: component self open function successful
[node-002:19988] mca: base: components_open: found loaded component tcp
[node-002:19988] mca: base: components_open: component tcp open function successful
[node-002:19988] mca: base: components_open: found loaded component usnic
[node-002:19988] mca: base: components_open: component usnic open function successful
[node-002:19988] mca: base: components_open: found loaded component ofi
[node-002:19988] mca: base: components_open: component ofi open function successful
[node-002:19988] mca: base: components_open: found loaded component vader
[node-002:19988] mca: base: components_open: component vader open function successful
[node-002:19988] select: initializing btl component self
[node-002:19988] select: init of component self returned success
[node-002:19988] select: initializing btl component tcp
[node-002:19988] btl:tcp: Attempting to bind to AF_INET port 1024
[node-002:19988] btl:tcp: Successfully bound to AF_INET port 1024
[node-002:19988] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-002:19988] btl:tcp: examining interface eth1
[node-002:19988] btl:tcp: using ipv6 interface eth1
[node-002:19988] select: init of component tcp returned success
[node-002:19988] select: initializing btl component usnic
[node-002:19988] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-002:19988] select: init of component usnic returned failure
[node-002:19988] mca: base: close: component usnic closed
[node-002:19988] mca: base: close: unloading component usnic
[node-002:19988] select: initializing btl component ofi
[node-002:19988] select: init of component ofi returned success
[node-002:19988] select: initializing btl component vader
[node-002:19988] select: init of component vader returned failure
[node-002:19988] mca: base: close: component vader closed
[node-002:19988] mca: base: close: unloading component vader
[node-002:19988] mca: bml: Using self btl for send to [[38226,2],0] on node node-002
[node-001:30432] mca: bml: Using self btl for send to [[38226,2],1] on node node-001
[node-002:19988] btl:tcp: path from 169.124.126.151 to 169.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:19988] mca: bml: Using tcp btl for send to [[38226,2],1] on node node-001
[node-002:19988] btl: tcp: attempting to connect() to [[38226,2],1] address 169.124.126.58 on port 1024
[node-002:19988] btl:tcp: would block, so allowing background progress
[node-002:19988] btl:tcp: connect() to 169.124.126.58:1024 completed (complete_connect), sending connect ACK
[node-001:30432] btl:tcp: path from 169.124.126.58 to 169.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:30432] btl:tcp: now connected to 169.124.126.151, process [[38226,2],0]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[38226,1],0]) is on host: node-003
Process 2 ([[38226,2],0]) is on host: unknown!
BTLs attempted: self tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[node-003:10140] [[38226,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:30432] [[38226,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-002:19988] [[38226,2],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:10140] *** An error occurred in MPI_Comm_spawn
[node-003:10140] *** reported by process [2505179137,0]
[node-003:10140] *** on communicator MPI_COMM_WORLD
[node-003:10140] *** MPI_ERR_INTERN: internal error
[node-003:10140] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-003:10140] *** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[node-001:30432] *** An error occurred in MPI_Init
[node-001:30432] *** reported by process [2505179138,1]
[node-001:30432] *** on a NULL communicator
[node-001:30432] *** Unknown error
[node-001:30432] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-001:30432] *** and potentially your MPI job)
[node-003:10135] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[node-003:10135] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node-003:10135] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node-003:10135] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
@Extremys, do you still have access to this system? Could you please try rerunning with Open MPI v4.0.6?
Hi there, have you found a solution to this issue? I have the same problem with Open MPI 4.0.2 and 4.1.1: MPI_COMM_SPAWN() cannot spawn across nodes. I am testing on a cluster with CentOS 7.9 and the LSF batch system, built with GCC 6.3.0. I used this code for testing:
Running on one node, it looked fine:
But on 2 nodes, errors occurred:
We recently made several spawn-related fixes to Open MPI. I just tried the example from this comment and the original post; both passed with a current build. Please re-try your examples, and re-open the issue if the problem persists.
Background information
Version of Open MPI used
Open MPI v4.0.5
Open MPI installation
Installed from the EasyBuild recipe with the GCC 10.2 toolchain
System description
Details of the problem
I am trying to run a simple MPI spawn program on an LSF cluster. When the scheduler allocates a single node the execution works fine, but when the allocation spans multiple nodes, the MPI processes spawned on separate hosts cannot talk to each other, resulting in an abort. What am I doing wrong? Is it an Open MPI bug? Thank you for your help!
producer.cpp source:
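The original source was not captured in this excerpt. A hypothetical reconstruction of a minimal producer, with the worker executable name and the spawn count as assumptions, could look like this:

// producer.cpp -- hypothetical reconstruction, not the issue author's code
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // The worker path is an assumption; the thread only shows $HOME/producer being launched.
    char worker_cmd[] = "worker";

    // Spawn two workers from the single producer rank.
    MPI_Comm intercomm;
    MPI_Comm_spawn(worker_cmd, MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    int nworkers = 0;
    MPI_Comm_remote_size(intercomm, &nworkers);
    std::printf("spawned %d workers\n", nworkers);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Spawning two workers from a single producer would be consistent with the logs above, where the spawned job's ranks [[...,2],0] and [[...,2],1] land on node-002 and node-001 while the producer runs on node-003.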
worker.cpp source:
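Likewise hypothetical; a minimal worker that reports itself and disconnects from its parent might be:

// worker.cpp -- hypothetical reconstruction, not the issue author's code
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("worker %d of %d started\n", rank, size);

    // Detach from the parent intercommunicator before finalizing.
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}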
launching commands:
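The commands were not captured here either, but elsewhere in the thread the job is submitted as:

bsub -n 3 -R "span[ptile=1]" -o $HOME/log mpirun -n 1 $HOME/producer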
output.log content: