Hi Team,
We have installed Open MPI 4.1.6 and Open MPI v5 using EasyBuild scripts on a VM-based cluster running RHEL 9.5.
https://docs.easybuild.io/version-specific/supported-software/o/OpenMPI/
Each server in the VM-based cluster has the following 3 interfaces:
NAME           TYPE      DEVICE
System ens192  ethernet  ens192
System ens161  ethernet  ens161
System ens256  ethernet  ens256
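To record the IPv6 state of these three interfaces alongside the report, a small Python sketch could be used. It is only relevant under one unconfirmed hypothesis: that the "Address family not supported by protocol" socket errors reported further down involve IPv6 being unavailable. The sketch lists interfaces under /sys/class/net and checks which ones have an IPv6 address in /proc/net/if_inet6. It assumes that file is absent when IPv6 is disabled at boot (for example with the ipv6.disable=1 kernel parameter); the helper names are hypothetical.

```python
import os
from pathlib import Path


def parse_if_inet6(text):
    """Return the set of interface names that have at least one IPv6 address.

    Each line of /proc/net/if_inet6 has six whitespace-separated fields;
    the last field is the interface name.
    """
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6:
            names.add(fields[5])
    return names


def ipv6_report(interfaces, if_inet6_path="/proc/net/if_inet6"):
    """Map each interface to True/False (has an IPv6 address).

    Returns None per interface when the if_inet6 file is missing, which is
    ASSUMED here to mean IPv6 is disabled in the kernel (e.g. ipv6.disable=1).
    """
    path = Path(if_inet6_path)
    if not path.exists():
        return {name: None for name in interfaces}
    with_v6 = parse_if_inet6(path.read_text())
    return {name: name in with_v6 for name in interfaces}


if __name__ == "__main__":
    sysnet = "/sys/class/net"
    ifaces = sorted(os.listdir(sysnet)) if os.path.isdir(sysnet) else []
    for name, has_v6 in ipv6_report(ifaces).items():
        if has_v6 is None:
            status = "IPv6 unavailable in kernel (assumed)"
        else:
            status = "has IPv6 address" if has_v6 else "no IPv6 address"
        print(f"{name}: {status}")
```

Note that disabling IPv6 per interface through sysctl (disable_ipv6=1) leaves the file in place with no addresses listed, so the two cases report differently.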
When we run Open MPI v4, multinode runs complete, but many error messages appear before and after the expected output.
[user@server1 mpi]$ mpirun --version
mpirun (Open MPI) 4.1.6
Report bugs to http://www.open-mpi.org/community/help/
[suser@server1 mpi]$ mpirun -np 4 ./a.out
<error messages>
Hello world from processor server1 rank 1 out of 4 processors
Hello world from processor server2, rank 3 out of 4 processors
Hello world from processor server3 rank 2 out of 4 processors
Hello world from processor server4 rank 0 out of 4 processors
<error messages>
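For triage, a rough Python sketch can bucket the lines of a captured run (for example the attached ompiv4_error.txt) into the expected "Hello world" lines, PSM3 socket/NIC errors, and PMIx errors, and check that every rank reported. The regular expressions are built only from the messages quoted in this report; anything else lands in "other". The function name classify is hypothetical.

```python
import re
import sys
from collections import Counter

# Patterns derived from the messages quoted in this report only.
# The rank line appears both with and without a comma after the host name.
HELLO = re.compile(r"Hello world from processor (\S+?),? rank (\d+) out of (\d+) processors")
PSM3 = re.compile(r"PSM3 can't open nic unit|Unable to (create UDP socket|initialize sockets NIC)")
PMIX = re.compile(r"PMIX ERROR")


def classify(lines):
    """Count lines per category and report whether ranks 0..size-1 all printed."""
    counts = Counter()
    ranks = set()
    size = None
    for line in lines:
        m = HELLO.search(line)
        if m:
            counts["hello"] += 1
            ranks.add(int(m.group(2)))
            size = int(m.group(3))
        elif PMIX.search(line):
            counts["pmix"] += 1
        elif PSM3.search(line):
            counts["psm3"] += 1
        elif line.strip():
            counts["other"] += 1
    all_ranks = size is not None and ranks == set(range(size))
    return counts, all_ranks


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        counts, all_ranks = classify(fh)
    print(dict(counts), "all ranks reported:", all_ranks)
```

Usage, assuming the script is saved as classify_output.py: `python3 classify_output.py ompiv4_error.txt`.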
Based on the error output, I see two categories of issues:
Issue 1) At the start of the run, PSM3 fails to open the network interfaces:
server1:rank1: PSM3 can't open nic unit: -1 (err=23)
[1749130772.508535285] server1:rank0.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.508552855] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.510515510] server1:rank0.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.510530731] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.512386528] server1:rank0.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.512401122] server1:rank0.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server1:rank0: PSM3 can't open nic unit: -1 (err=23)
[1749130772.542387747] server2:rank2.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.542512255] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.543819831] server2:rank3.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.543838127] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
[1749130772.544649294] server2:rank2.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.544700905] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.545667705] server2:rank3.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.545685370] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
[1749130772.546689089] server2:rank2.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.546706079] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server2:rank2: PSM3 can't open nic unit: -1 (err=23)
[1749130772.547558723] server2:rank3.a.out: Unable to create UDP socket for ens256: Address family not supported by protocol
[1749130772.547571518] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens256 (unit 2:0)
server2:rank3: PSM3 can't open nic unit: -1 (err=23)
[1749130772.549656387] server2:rank2.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.549732373] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
server2:rank2: PSM3 can't open nic unit: 0 (err=23)
[1749130772.550315258] server2:rank3.a.out: Unable to create UDP socket for ens161: Address family not supported by protocol
[1749130772.550334655] server2:rank3.a.out: Unable to initialize sockets NIC /sys/class/net/ens161 (unit 0:0)
server2:rank3: PSM3 can't open nic unit: 0 (err=23)
[1749130772.552612768] server2:rank2.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
[1749130772.552688696] server2:rank2.a.out: Unable to initialize sockets NIC /sys/class/net/ens192 (unit 1:0)
server2:rank2: PSM3 can't open nic unit: 1 (err=23)
[1749130772.552796054] server2:rank3.a.out: Unable to create UDP socket for ens192: Address family not supported by protocol
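"Address family not supported by protocol" is the C library's text for the errno EAFNOSUPPORT. The log does not show which address family PSM3 requests, so the Python sketch below (an assumption-driven probe, not PSM3's own code) simply tries to open a UDP socket for both AF_INET and AF_INET6 on a node and prints any errno it gets. The helper name probe_udp is hypothetical.

```python
import errno
import os
import socket


def probe_udp(family):
    """Try to create a UDP socket of the given family.

    Returns None on success, otherwise the errno from the failed call.
    Which family PSM3 itself uses is an ASSUMPTION not shown in the log.
    """
    try:
        s = socket.socket(family, socket.SOCK_DGRAM)
    except OSError as exc:
        return exc.errno
    s.close()
    return None


if __name__ == "__main__":
    for name, fam in (("AF_INET", socket.AF_INET), ("AF_INET6", socket.AF_INET6)):
        err = probe_udp(fam)
        if err is None:
            print(f"{name}: ok")
        else:
            tag = " (EAFNOSUPPORT)" if err == errno.EAFNOSUPPORT else ""
            print(f"{name}: errno {err}{tag}: {os.strerror(err)}")
```

If AF_INET6 fails with EAFNOSUPPORT on every server while AF_INET succeeds, that would support the IPv6 hypothesis; it would not by itself prove that PSM3 is requesting IPv6.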
Issue 2) At the end of the run, I see the following message:
Hello world from processor server1, rank 0 out of 4 processors
[server1:3799223] PMIX ERROR: PMIX_ERR_NO_PERMISSIONS in file dstore_base.c at line 238
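The log only names dstore_base.c line 238, so the cause of PMIX_ERR_NO_PERMISSIONS is not established. Assuming the error concerns the temporary directory where PMIx's shared-memory data store keeps its files (an assumption, as is the candidate list of TMPDIR, /tmp and /dev/shm), a Python sketch like the one below could confirm on each node whether the user running mpirun can create files there. The helper names are hypothetical.

```python
import os
import stat


def check_dir(path):
    """Summarize whether the current user can create files in path."""
    if not os.path.isdir(path):
        return {"path": path, "exists": False}
    mode = os.stat(path).st_mode
    return {
        "path": path,
        "exists": True,
        # Creating a file needs both write and search (execute) permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
        "sticky": bool(mode & stat.S_ISVTX),
        "mode": oct(stat.S_IMODE(mode)),
    }


def candidate_dirs():
    """HYPOTHETICAL candidate list: TMPDIR (if set), /tmp and /dev/shm, deduplicated."""
    result = []
    for d in (os.environ.get("TMPDIR"), "/tmp", "/dev/shm"):
        if d and d not in result:
            result.append(d)
    return result


if __name__ == "__main__":
    for d in candidate_dirs():
        print(check_dir(d))
```

Running it as the same user on server1 (where the message was printed) and on the other servers would show whether any of these directories differ in ownership or mode.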
I am attaching the complete stdout as ompiv4_error.txt.
Please let me know if you need any further information from my end.