
MPI Spawn jobs don't work on multinode LSF cluster #9041


Closed
Extremys opened this issue Jun 7, 2021 · 10 comments



Extremys commented Jun 7, 2021

Background information

Version of Open MPI used

Open MPI v4.0.5

Open MPI installation

Installed from the EasyBuild recipe for the GCC 10.2 toolchain

[lsf-host:31240] mca: base: components_register: registering framework btl components
[lsf-host:31240] mca: base: components_register: found loaded component self
[lsf-host:31240] mca: base: components_register: component self register function successful
[lsf-host:31240] mca: base: components_register: found loaded component tcp
[lsf-host:31240] mca: base: components_register: component tcp register function successful
[lsf-host:31240] mca: base: components_register: found loaded component sm
[lsf-host:31240] mca: base: components_register: found loaded component usnic
[lsf-host:31240] mca: base: components_register: component usnic register function successful
[lsf-host:31240] mca: base: components_register: found loaded component vader
[lsf-host:31240] mca: base: components_register: component vader register function successful
                 Package: Open MPI easybuild@lsf-host Distribution
                Open MPI: 4.0.5
  Open MPI repo revision: v4.0.5
   Open MPI release date: Aug 26, 2020
                Open RTE: 4.0.5
  Open RTE repo revision: v4.0.5
   Open RTE release date: Aug 26, 2020
                    OPAL: 4.0.5
      OPAL repo revision: v4.0.5
       OPAL release date: Aug 26, 2020
                 MPI API: 3.1.0
            Ident string: 4.0.5
                  Prefix: /cm/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: lsf-host
           Configured by: easybuild
           Configured on: Fri Jun  4 17:58:42 CEST 2021
          Configure host: lsf-host
  Configure command line: '--prefix=/cm/easybuild/software/OpenMPI/4.0.5-GCC-10.2.0'
                          '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu'
                          '--with-lsf=/pss/lsf/9.1/'
                          '--with-lsf-libdir=/pss/lsf/9.1/linux2.6-glibc2.3-x86_64/lib'
                          '--with-pmix=/cm/easybuild/software/PMIx/3.1.5-GCCcore-10.2.0'
                          '--enable-mpirun-prefix-by-default'
                          '--enable-shared'
                          '--with-hwloc=/cm/easybuild/software/hwloc/2.2.0-GCCcore-10.2.0'
                          '--with-libevent=/cm/easybuild/software/libevent/2.1.12-GCCcore-10.2.0'
                          '--with-ucx=/cm/easybuild/software/UCX/1.9.0-GCCcore-10.2.0'
                          '--without-verbs'
                Built by: easybuild
                Built on: Fri Jun  4 18:10:10 CEST 2021
              Built host: lsf-host
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /cm/easybuild/software/GCCcore/10.2.0/bin/gcc
  C compiler family name: GNU
      C compiler version: 10.2.0
            C++ compiler: g++
   C++ compiler absolute: /cm/easybuild/software/GCCcore/10.2.0/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /cm/easybuild/software/GCCcore/10.2.0/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.5)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: usnic (MCA v2.1.0, API v3.1.0, Component v4.0.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.5)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.5)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.5)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.0.5)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.5)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.5)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: lsf (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.0.5)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.0.5)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: lsf (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA ras: lsf (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.0.5)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.0.5)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.0.5)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.0.5)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.0.5)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.0.5)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.0.5)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.0.5)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.0.5)
[lsf-host:31240] mca: base: close: unloading component self
[lsf-host:31240] mca: base: close: unloading component tcp
[lsf-host:31240] mca: base: close: unloading component usnic
[lsf-host:31240] mca: base: close: unloading component vader

System description

  • Operating system/version: Red Hat Enterprise Linux 7.3
  • Computer hardware: Intel 64-bit Broadwell-generation nodes
  • Network type: Ethernet
  • iptables rules: empty
  • Job scheduler: LSF 9.1, cluster of 3 Intel nodes

Details of the problem

I am trying to run a simple MPI spawn program on an LSF cluster. When the scheduler allocates a single node, the execution works fine, but when the allocation spans multiple nodes, the MPI processes spawned on separate hosts cannot talk to each other and the job aborts. What am I doing wrong? Is it an Open MPI bug? Thank you for your help!

producer.cpp source:

#include "mpi.h"
int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int rank;
  int size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::cout << rank << " " << size << '\n';
  const char* command = "/home/user/worker";
  MPI_Comm everyone;
  int nslaves = 2;
  MPI_Comm_spawn(command, MPI_ARGV_NULL, nslaves, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &everyone,
                 MPI_ERRCODES_IGNORE);
  std::cout << "END" << std::endl;
  MPI_Finalize();
}

worker.cpp source:

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm com;
  MPI_Comm_get_parent(&com);
  std::cout << "Hello" << std::endl;
  MPI_Finalize();
}
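
For what it's worth, a slightly extended worker along these lines (only a sketch, not the binary used in the runs below) would also report the parent intercommunicator's remote size, which helps confirm whether the spawn wire-up actually completed:

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);
  if (parent == MPI_COMM_NULL) {
    std::cout << "not spawned" << std::endl;   // launched directly, not via MPI_Comm_spawn
  } else {
    int nparents;
    MPI_Comm_remote_size(parent, &nparents);   // size of the spawning (producer) group
    std::cout << "Hello, spawned by " << nparents << " parent(s)" << std::endl;
  }
  MPI_Finalize();
}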

launching commands:

bash-4.2$ module load OpenMPI/4.0.5-GCC-10.2.0
bash-4.2$ export OMPI_MCA_orte_base_help_aggregate=0
bash-4.2$ export OMPI_MCA_btl_base_verbose=100
bash-4.2$ mpic++ -o producer producer.cpp
bash-4.2$ mpic++ -o worker worker.cpp
bash-4.2$ bsub -n 3 -R "span[ptile=1]" -o output.log mpirun -np 1 ./producer # the -R option forces the job to be spread across multiple nodes

output.log content:

Sender: LSF System <[email protected]>
Subject: Job 4088: <mpirun -n 1 /home/user/producer> in cluster <r_cluster> Exited

Job <mpirun -n 1 /home/user/producer> was submitted from host <lsf-host.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*lsf-host-001.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
                            <1*lsf-host-002.cm.cluster>
                            <1*lsf-host.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Mon Jun  7 11:42:40 2021
Results reported on Mon Jun  7 11:42:50 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 /home/user/producer
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   0.47 sec.
    Max Memory :                                 53 MB
    Average Memory :                             7.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Processes :                              3
    Max Threads :                                7
    Run time :                                   9 sec.
    Turnaround time :                            10 sec.

The output (if any) follows:

[lsf-host-001:03013] mca: base: components_register: registering framework btl components
[lsf-host-001:03013] mca: base: components_register: found loaded component self
[lsf-host-001:03013] mca: base: components_register: component self register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component tcp
[lsf-host-001:03013] mca: base: components_register: component tcp register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component sm
[lsf-host-001:03013] mca: base: components_register: found loaded component usnic
[lsf-host-001:03013] mca: base: components_register: component usnic register function successful
[lsf-host-001:03013] mca: base: components_register: found loaded component vader
[lsf-host-001:03013] mca: base: components_register: component vader register function successful
[lsf-host-001:03013] mca: base: components_open: opening btl components
[lsf-host-001:03013] mca: base: components_open: found loaded component self
[lsf-host-001:03013] mca: base: components_open: component self open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component tcp
[lsf-host-001:03013] mca: base: components_open: component tcp open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component usnic
[lsf-host-001:03013] mca: base: components_open: component usnic open function successful
[lsf-host-001:03013] mca: base: components_open: found loaded component vader
[lsf-host-001:03013] mca: base: components_open: component vader open function successful
[lsf-host-001:03013] select: initializing btl component self
[lsf-host-001:03013] select: init of component self returned success
[lsf-host-001:03013] select: initializing btl component tcp
[lsf-host-001:03013] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host-001:03013] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host-001:03013] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host-001:03013] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host-001:03013] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host-001:03013] btl:tcp: examining interface eth0
[lsf-host-001:03013] btl:tcp: using ipv6 interface eth0
[lsf-host-001:03013] btl:tcp: examining interface eth1
[lsf-host-001:03013] btl:tcp: using ipv6 interface eth1
[lsf-host-001:03013] select: init of component tcp returned success
[lsf-host-001:03013] select: initializing btl component usnic
[lsf-host-001:03013] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host-001:03013] select: init of component usnic returned failure
[lsf-host-001:03013] mca: base: close: component usnic closed
[lsf-host-001:03013] mca: base: close: unloading component usnic
[lsf-host-001:03013] select: initializing btl component vader
[lsf-host-001:03013] select: init of component vader returned failure
[lsf-host-001:03013] mca: base: close: component vader closed
[lsf-host-001:03013] mca: base: close: unloading component vader
0 1
[lsf-host:14462] mca: base: components_register: registering framework btl components
[lsf-host:14462] mca: base: components_register: found loaded component self
[lsf-host:14462] mca: base: components_register: component self register function successful
[lsf-host:14462] mca: base: components_register: found loaded component tcp
[lsf-host:14462] mca: base: components_register: component tcp register function successful
[lsf-host:14462] mca: base: components_register: found loaded component sm
[lsf-host:14462] mca: base: components_register: found loaded component usnic
[lsf-host:14462] mca: base: components_register: component usnic register function successful
[lsf-host:14462] mca: base: components_register: found loaded component vader
[lsf-host:14462] mca: base: components_register: component vader register function successful
[lsf-host:14462] mca: base: components_open: opening btl components
[lsf-host:14462] mca: base: components_open: found loaded component self
[lsf-host:14462] mca: base: components_open: component self open function successful
[lsf-host:14462] mca: base: components_open: found loaded component tcp
[lsf-host:14462] mca: base: components_open: component tcp open function successful
[lsf-host:14462] mca: base: components_open: found loaded component usnic
[lsf-host:14462] mca: base: components_open: component usnic open function successful
[lsf-host:14462] mca: base: components_open: found loaded component vader
[lsf-host:14462] mca: base: components_open: component vader open function successful
[lsf-host:14462] select: initializing btl component self
[lsf-host:14462] select: init of component self returned success
[lsf-host:14462] select: initializing btl component tcp
[lsf-host:14462] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host:14462] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host:14462] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host:14462] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host:14462] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host:14462] btl:tcp: examining interface eth0
[lsf-host:14462] btl:tcp: using ipv6 interface eth0
[lsf-host:14462] btl:tcp: examining interface eth1
[lsf-host:14462] btl:tcp: using ipv6 interface eth1
[lsf-host:14462] select: init of component tcp returned success
[lsf-host:14462] select: initializing btl component usnic
[lsf-host-002:27348] mca: base: components_register: registering framework btl components
[lsf-host-002:27348] mca: base: components_register: found loaded component self
[lsf-host-002:27348] mca: base: components_register: component self register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component tcp
[lsf-host-002:27348] mca: base: components_register: component tcp register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component sm
[lsf-host-002:27348] mca: base: components_register: found loaded component usnic
[lsf-host-002:27348] mca: base: components_register: component usnic register function successful
[lsf-host-002:27348] mca: base: components_register: found loaded component vader
[lsf-host-002:27348] mca: base: components_register: component vader register function successful
[lsf-host-002:27348] mca: base: components_open: opening btl components
[lsf-host-002:27348] mca: base: components_open: found loaded component self
[lsf-host-002:27348] mca: base: components_open: component self open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component tcp
[lsf-host-002:27348] mca: base: components_open: component tcp open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component usnic
[lsf-host-002:27348] mca: base: components_open: component usnic open function successful
[lsf-host-002:27348] mca: base: components_open: found loaded component vader
[lsf-host-002:27348] mca: base: components_open: component vader open function successful
[lsf-host-002:27348] select: initializing btl component self
[lsf-host-002:27348] select: init of component self returned success
[lsf-host-002:27348] select: initializing btl component tcp
[lsf-host-002:27348] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[lsf-host-002:27348] btl: tcp: Found match: 127.0.0.1 (lo)
[lsf-host-002:27348] btl:tcp: Attempting to bind to AF_INET port 1024
[lsf-host-002:27348] btl:tcp: Successfully bound to AF_INET port 1024
[lsf-host-002:27348] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[lsf-host-002:27348] btl:tcp: examining interface eth0
[lsf-host-002:27348] btl:tcp: using ipv6 interface eth0
[lsf-host-002:27348] btl:tcp: examining interface eth1
[lsf-host-002:27348] btl:tcp: using ipv6 interface eth1
[lsf-host-002:27348] select: init of component tcp returned success
[lsf-host-002:27348] select: initializing btl component usnic
[lsf-host:14462] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host:14462] select: init of component usnic returned failure
[lsf-host:14462] mca: base: close: component usnic closed
[lsf-host:14462] mca: base: close: unloading component usnic
[lsf-host:14462] select: initializing btl component vader
[lsf-host:14462] select: init of component vader returned failure
[lsf-host:14462] mca: base: close: component vader closed
[lsf-host:14462] mca: base: close: unloading component vader
[lsf-host-002:27348] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[lsf-host-002:27348] select: init of component usnic returned failure
[lsf-host-002:27348] mca: base: close: component usnic closed
[lsf-host-002:27348] mca: base: close: unloading component usnic
[lsf-host-002:27348] select: initializing btl component vader
[lsf-host-002:27348] select: init of component vader returned failure
[lsf-host-002:27348] mca: base: close: component vader closed
[lsf-host-002:27348] mca: base: close: unloading component vader
[lsf-host-001:03013] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host-001:03013] [[59089,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[lsf-host-002:27348] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host-002:27348] [[59089,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[lsf-host:14462] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host:14462] [[59089,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[lsf-host:14462] *** An error occurred in MPI_Init
[lsf-host:14462] *** reported by process [3872456706,1]
[lsf-host:14462] *** on a NULL communicator
[lsf-host:14462] *** Unknown error
[lsf-host:14462] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host:14462] ***    and potentially your MPI job)
[lsf-host-001:03013] *** An error occurred in MPI_Comm_spawn
[lsf-host-001:03013] *** reported by process [3872456705,0]
[lsf-host-001:03013] *** on communicator MPI_COMM_WORLD
[lsf-host-001:03013] *** MPI_ERR_OTHER: known error not in list
[lsf-host-001:03013] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host-001:03013] ***    and potentially your MPI job)
[lsf-host-001:03008] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[lsf-host-002:27348] *** An error occurred in MPI_Init
[lsf-host-002:27348] *** reported by process [3872456706,0]
[lsf-host-002:27348] *** on a NULL communicator
[lsf-host-002:27348] *** Unknown error
[lsf-host-002:27348] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lsf-host-002:27348] ***    and potentially your MPI job)
jjhursey (Member) commented Jun 7, 2021

From my reading of the error log, I don't think this is related to LSF specifically; it might be something with spawn or the machine. In an LSF environment, we only use LSF to launch the ORTE daemons (or PRRTE daemons if you were using v5.x or later); those daemons then handle the MPI spawn and wire-up mechanisms.

To try to eliminate spawn from the diagnosis: Are you able to run a "hello world" and "ring" program across multiple nodes in the allocation?
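
For reference, a minimal "hello world" of the kind I mean might look like this (just a sketch; printing the host name makes it obvious whether the ranks really land on different nodes):

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int rank, size, len;
  char host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(host, &len);          // shows which node each rank runs on
  std::cout << "rank " << rank << "/" << size << " on " << host << std::endl;
  MPI_Finalize();
}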

The UCX adapter is failing to wire up properly:

[lsf-host-001:03013] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[lsf-host-001:03013] [[59089,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493

I don't think UCX is to blame here, but let's try to eliminate UCX from the diagnosis: Can you add the following to your default environment:

OMPI_MCA_pml=ob1
OMPI_MCA_btl=tcp,vader,self

If you are able to ssh between hosts in your allocation, you can eliminate the LSF daemon launch mechanism by setting the following environment variable:

OMPI_MCA_plm=^lsf

Give those a try and let us know how it goes.
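
Putting it together, a submission along these lines should exercise both changes (a sketch reusing the module and bsub line from your report; adjust paths as needed):

module load OpenMPI/4.0.5-GCC-10.2.0
export OMPI_MCA_pml=ob1                 # force the ob1 PML instead of UCX
export OMPI_MCA_btl=tcp,vader,self      # restrict BTLs to tcp/vader/self
export OMPI_MCA_plm=^lsf                # only if password-less ssh works between the allocated hosts
bsub -n 3 -R "span[ptile=1]" -o output.log mpirun -n 1 ./producer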

Extremys (Author) commented Jun 7, 2021

Thank you very much for your answer @jjhursey!
I tested the MPI hello world and ring examples, first with the two original export vars and then with the additional ones; all of those jobs run fine, so ssh between the nodes appears to be OK, but my spawn test still fails in both configurations. :)

jjhursey (Member) commented Jun 7, 2021

So this may point to a general issue with spawn. I'm not certain of the stability of comm_spawn on the v4.0.5 release.

Are you able to correctly run your spawn test with the following variables?

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,vader,self
export OMPI_MCA_plm=^lsf
export OMPI_MCA_btl_base_verbose=100

If not, can you post the debug output?

Extremys (Author) commented Jun 7, 2021

Thank you for your quick response!
I have set the exports; the launch command is:

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,vader,self
export OMPI_MCA_plm=^lsf
export OMPI_MCA_btl_base_verbose=100
bsub -n 3 -R "span[ptile=1]" -o $HOME/log mpirun -n 1 $HOME/producer

the output:

Sender: LSF System <[email protected]>
Subject: Job 4102: <mpirun -n 1 /home/user/producer> in cluster <r_cluster> Exited

Job <mpirun -n 1 /home/user/producer> was submitted from host <node-001.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*node-003.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
                            <1*node-002.cm.cluster>
                            <1*node-001.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Mon Jun  7 18:44:02 2021
Results reported on Mon Jun  7 18:44:08 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 /home/user/producer
------------------------------------------------------------

Exited with exit code 17.

Resource usage summary:

    CPU time :                                   0.22 sec.
    Max Memory :                                 27 MB
    Average Memory :                             12.50 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Processes :                              5
    Max Threads :                                9
    Run time :                                   5 sec.
    Turnaround time :                            7 sec.

The output (if any) follows:

[node-003:25380] mca: base: components_register: registering framework btl components
[node-003:25380] mca: base: components_register: found loaded component self
[node-003:25380] mca: base: components_register: component self register function successful
[node-003:25380] mca: base: components_register: found loaded component tcp
[node-003:25380] mca: base: components_register: component tcp register function successful
[node-003:25380] mca: base: components_register: found loaded component vader
[node-003:25380] mca: base: components_register: component vader register function successful
[node-003:25380] mca: base: components_open: opening btl components
[node-003:25380] mca: base: components_open: found loaded component self
[node-003:25380] mca: base: components_open: component self open function successful
[node-003:25380] mca: base: components_open: found loaded component tcp
[node-003:25380] mca: base: components_open: component tcp open function successful
[node-003:25380] mca: base: components_open: found loaded component vader
[node-003:25380] mca: base: components_open: component vader open function successful
[node-003:25380] select: initializing btl component self
[node-003:25380] select: init of component self returned success
[node-003:25380] select: initializing btl component tcp
[node-003:25380] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-003:25380] btl: tcp: Found match: 127.0.0.1 (lo)
[node-003:25380] btl:tcp: Attempting to bind to AF_INET port 1024
[node-003:25380] btl:tcp: Successfully bound to AF_INET port 1024
[node-003:25380] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-003:25380] btl:tcp: examining interface eth0
[node-003:25380] btl:tcp: using ipv6 interface eth0
[node-003:25380] btl:tcp: examining interface eth1
[node-003:25380] btl:tcp: using ipv6 interface eth1
[node-003:25380] select: init of component tcp returned success
[node-003:25380] select: initializing btl component vader
[node-003:25380] select: init of component vader returned failure
[node-003:25380] mca: base: close: component vader closed
[node-003:25380] mca: base: close: unloading component vader
[node-003:25380] mca: bml: Using self btl for send to [[53723,1],0] on node node-003
0 1
[node-002:01434] mca: base: components_register: registering framework btl components
[node-002:01434] mca: base: components_register: found loaded component self
[node-002:01434] mca: base: components_register: component self register function successful
[node-002:01434] mca: base: components_register: found loaded component tcp
[node-002:01434] mca: base: components_register: component tcp register function successful
[node-002:01434] mca: base: components_register: found loaded component vader
[node-002:01434] mca: base: components_register: component vader register function successful
[node-002:01434] mca: base: components_open: opening btl components
[node-002:01434] mca: base: components_open: found loaded component self
[node-002:01434] mca: base: components_open: component self open function successful
[node-002:01434] mca: base: components_open: found loaded component tcp
[node-002:01434] mca: base: components_open: component tcp open function successful
[node-002:01434] mca: base: components_open: found loaded component vader
[node-002:01434] mca: base: components_open: component vader open function successful
[node-002:01434] select: initializing btl component self
[node-002:01434] select: init of component self returned success
[node-002:01434] select: initializing btl component tcp
[node-002:01434] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-002:01434] btl: tcp: Found match: 127.0.0.1 (lo)
[node-002:01434] btl:tcp: Attempting to bind to AF_INET port 1024
[node-002:01434] btl:tcp: Successfully bound to AF_INET port 1024
[node-002:01434] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-002:01434] btl:tcp: examining interface eth0
[node-002:01434] btl:tcp: using ipv6 interface eth0
[node-002:01434] btl:tcp: examining interface eth1
[node-002:01434] btl:tcp: using ipv6 interface eth1
[node-002:01434] select: init of component tcp returned success
[node-002:01434] select: initializing btl component vader
[node-002:01434] select: init of component vader returned failure
[node-002:01434] mca: base: close: component vader closed
[node-002:01434] mca: base: close: unloading component vader
[node-001:12771] mca: base: components_register: registering framework btl components
[node-001:12771] mca: base: components_register: found loaded component self
[node-001:12771] mca: base: components_register: component self register function successful
[node-001:12771] mca: base: components_register: found loaded component tcp
[node-001:12771] mca: base: components_register: component tcp register function successful
[node-001:12771] mca: base: components_register: found loaded component vader
[node-001:12771] mca: base: components_register: component vader register function successful
[node-001:12771] mca: base: components_open: opening btl components
[node-001:12771] mca: base: components_open: found loaded component self
[node-001:12771] mca: base: components_open: component self open function successful
[node-001:12771] mca: base: components_open: found loaded component tcp
[node-001:12771] mca: base: components_open: component tcp open function successful
[node-001:12771] mca: base: components_open: found loaded component vader
[node-001:12771] mca: base: components_open: component vader open function successful
[node-001:12771] select: initializing btl component self
[node-001:12771] select: init of component self returned success
[node-001:12771] select: initializing btl component tcp
[node-001:12771] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-001:12771] btl: tcp: Found match: 127.0.0.1 (lo)
[node-001:12771] btl:tcp: Attempting to bind to AF_INET port 1024
[node-001:12771] btl:tcp: Successfully bound to AF_INET port 1024
[node-001:12771] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-001:12771] btl:tcp: examining interface eth0
[node-001:12771] btl:tcp: using ipv6 interface eth0
[node-001:12771] btl:tcp: examining interface eth1
[node-001:12771] btl:tcp: using ipv6 interface eth1
[node-001:12771] select: init of component tcp returned success
[node-001:12771] select: initializing btl component vader
[node-001:12771] select: init of component vader returned failure
[node-001:12771] mca: base: close: component vader closed
[node-001:12771] mca: base: close: unloading component vader
[node-002:01434] mca: bml: Using self btl for send to [[53723,2],0] on node node-002
[node-001:12771] mca: bml: Using self btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] mca: bml: Using tcp btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] btl:tcp: path from 168.124.218.151 to 168.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:01434] btl:tcp: path from 168.124.126.151 to 168.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:01434] mca: bml: Using tcp btl for send to [[53723,2],1] on node node-001
[node-002:01434] btl: tcp: attempting to connect() to [[53723,2],1] address 168.124.126.58 on port 1024
[node-002:01434] btl:tcp: would block, so allowing background progress
[node-002:01434] btl:tcp: connect() to 168.124.126.58:1024 completed (complete_connect), sending connect ACK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl:tcp: path from 168.124.218.58 to 168.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:12771] btl:tcp: path from 168.124.126.58 to 168.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:12771] btl: tcp: Match incoming connection from [[53723,2],0] 168.124.126.151 with locally known IP 168.124.218.151 failed (iface 0/2)!
[node-001:12771] btl:tcp: now connected to 168.124.126.151, process [[53723,2],0]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[53723,1],0]) is on host: node-003
  Process 2 ([[53723,2],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[node-003:25380] [[53723,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:12771] [[53723,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:25380] *** An error occurred in MPI_Comm_spawn
[node-003:25380] *** reported by process [3520790529,0]
[node-003:25380] *** on communicator MPI_COMM_WORLD
[node-003:25380] *** MPI_ERR_INTERN: internal error
[node-003:25380] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-003:25380] ***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[node-001:12771] *** An error occurred in MPI_Init
[node-001:12771] *** reported by process [3520790530,1]
[node-001:12771] *** on a NULL communicator
[node-001:12771] *** Unknown error
[node-001:12771] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-001:12771] ***    and potentially your MPI job)
[node-002:01434] [[53723,2],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:25374] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[node-003:25374] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node-003:25374] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node-003:25374] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Do you think an upgrade would be beneficial? Thanks, @jjhursey!

jjhursey (Member) commented Jun 7, 2021

You might try v4.1.1 to see if the behavior changes. From what I'm seeing in these logs, this points to an issue with wire-up around spawn.

Extremys (Author) commented Jun 7, 2021

I just built v4.1.1 and ran the same example under the same conditions; it seems to be more talkative. I get this output:

Sender: LSF System <[email protected]>
Subject: Job 4104: <mpirun -n 1 /home/user/producer> in cluster <r_cluster> Exited

Job <mpirun -n 1 /home/user/producer> was submitted from host <node-001.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*node-003.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
                            <1*node-002.cm.cluster>
                            <1*node-001.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Mon Jun  7 19:46:22 2021
Results reported on Mon Jun  7 19:46:28 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 /home/user/producer
------------------------------------------------------------

Exited with exit code 17.

Resource usage summary:

    CPU time :                                   0.18 sec.
    Max Memory :                                 32 MB
    Average Memory :                             18.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Processes :                              5
    Max Threads :                                9
    Run time :                                   6 sec.
    Turnaround time :                            6 sec.

The output (if any) follows:

[node-003:07107] mca: base: components_register: registering framework btl components
[node-003:07107] mca: base: components_register: found loaded component self
[node-003:07107] mca: base: components_register: component self register function successful
[node-003:07107] mca: base: components_register: found loaded component tcp
[node-003:07107] mca: base: components_register: component tcp register function successful
[node-003:07107] mca: base: components_register: found loaded component vader
[node-003:07107] mca: base: components_register: component vader register function successful
[node-003:07107] mca: base: components_open: opening btl components
[node-003:07107] mca: base: components_open: found loaded component self
[node-003:07107] mca: base: components_open: component self open function successful
[node-003:07107] mca: base: components_open: found loaded component tcp
[node-003:07107] mca: base: components_open: component tcp open function successful
[node-003:07107] mca: base: components_open: found loaded component vader
[node-003:07107] mca: base: components_open: component vader open function successful
[node-003:07107] select: initializing btl component self
[node-003:07107] select: init of component self returned success
[node-003:07107] select: initializing btl component tcp
[node-003:07107] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-003:07107] btl: tcp: Found match: 127.0.0.1 (lo)
[node-003:07107] btl:tcp: Attempting to bind to AF_INET port 1024
[node-003:07107] btl:tcp: Successfully bound to AF_INET port 1024
[node-003:07107] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-003:07107] btl:tcp: examining interface eth0
[node-003:07107] btl:tcp: using ipv6 interface eth0
[node-003:07107] btl:tcp: examining interface eth1
[node-003:07107] btl:tcp: using ipv6 interface eth1
[node-003:07107] select: init of component tcp returned success
[node-003:07107] select: initializing btl component vader
[node-003:07107] select: init of component vader returned failure
[node-003:07107] mca: base: close: component vader closed
[node-003:07107] mca: base: close: unloading component vader
[node-003:07107] mca: bml: Using self btl for send to [[43381,1],0] on node node-003
0 1
[node-001:23861] mca: base: components_register: registering framework btl components
[node-001:23861] mca: base: components_register: found loaded component self
[node-001:23861] mca: base: components_register: component self register function successful
[node-001:23861] mca: base: components_register: found loaded component tcp
[node-001:23861] mca: base: components_register: component tcp register function successful
[node-001:23861] mca: base: components_register: found loaded component vader
[node-001:23861] mca: base: components_register: component vader register function successful
[node-001:23861] mca: base: components_open: opening btl components
[node-001:23861] mca: base: components_open: found loaded component self
[node-001:23861] mca: base: components_open: component self open function successful
[node-001:23861] mca: base: components_open: found loaded component tcp
[node-001:23861] mca: base: components_open: component tcp open function successful
[node-001:23861] mca: base: components_open: found loaded component vader
[node-001:23861] mca: base: components_open: component vader open function successful
[node-001:23861] select: initializing btl component self
[node-001:23861] select: init of component self returned success
[node-001:23861] select: initializing btl component tcp
[node-001:23861] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-001:23861] btl: tcp: Found match: 127.0.0.1 (lo)
[node-001:23861] btl:tcp: Attempting to bind to AF_INET port 1024
[node-001:23861] btl:tcp: Successfully bound to AF_INET port 1024
[node-001:23861] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-001:23861] btl:tcp: examining interface eth0
[node-001:23861] btl:tcp: using ipv6 interface eth0
[node-001:23861] btl:tcp: examining interface eth1
[node-001:23861] btl:tcp: using ipv6 interface eth1
[node-001:23861] select: init of component tcp returned success
[node-001:23861] select: initializing btl component vader
[node-001:23861] select: init of component vader returned failure
[node-001:23861] mca: base: close: component vader closed
[node-001:23861] mca: base: close: unloading component vader
[node-002:15418] mca: base: components_register: registering framework btl components
[node-002:15418] mca: base: components_register: found loaded component self
[node-002:15418] mca: base: components_register: component self register function successful
[node-002:15418] mca: base: components_register: found loaded component tcp
[node-002:15418] mca: base: components_register: component tcp register function successful
[node-002:15418] mca: base: components_register: found loaded component vader
[node-002:15418] mca: base: components_register: component vader register function successful
[node-002:15418] mca: base: components_open: opening btl components
[node-002:15418] mca: base: components_open: found loaded component self
[node-002:15418] mca: base: components_open: component self open function successful
[node-002:15418] mca: base: components_open: found loaded component tcp
[node-002:15418] mca: base: components_open: component tcp open function successful
[node-002:15418] mca: base: components_open: found loaded component vader
[node-002:15418] mca: base: components_open: component vader open function successful
[node-002:15418] select: initializing btl component self
[node-002:15418] select: init of component self returned success
[node-002:15418] select: initializing btl component tcp
[node-002:15418] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[node-002:15418] btl: tcp: Found match: 127.0.0.1 (lo)
[node-002:15418] btl:tcp: Attempting to bind to AF_INET port 1024
[node-002:15418] btl:tcp: Successfully bound to AF_INET port 1024
[node-002:15418] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-002:15418] btl:tcp: examining interface eth0
[node-002:15418] btl:tcp: using ipv6 interface eth0
[node-002:15418] btl:tcp: examining interface eth1
[node-002:15418] btl:tcp: using ipv6 interface eth1
[node-002:15418] select: init of component tcp returned success
[node-002:15418] select: initializing btl component vader
[node-002:15418] select: init of component vader returned failure
[node-002:15418] mca: base: close: component vader closed
[node-002:15418] mca: base: close: unloading component vader
[node-002:15418] mca: bml: Using self btl for send to [[43381,2],0] on node node-002
[node-001:23861] mca: bml: Using self btl for send to [[43381,2],1] on node node-001
[node-002:15418] btl:tcp: path from 169.124.218.151 to 169.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:15418] btl:tcp: path from 169.124.218.151 to 169.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:15418] btl:tcp: path from 169.124.126.151 to 169.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:15418] btl:tcp: path from 169.124.126.151 to 169.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:15418] mca: bml: Using tcp btl for send to [[43381,2],1] on node node-001
[node-002:15418] btl:tcp: path from 169.124.218.151 to 169.124.218.58: IPV4 PUBLIC SAME NETWORK
[node-002:15418] btl:tcp: path from 169.124.218.151 to 169.124.126.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:15418] btl:tcp: path from 169.124.126.151 to 169.124.218.58: IPV4 PUBLIC DIFFERENT NETWORK
[node-002:15418] btl:tcp: path from 169.124.126.151 to 169.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:15418] mca: bml: Using tcp btl for send to [[43381,2],1] on node node-001
[node-002:15418] btl: tcp: attempting to connect() to [[43381,2],1] address 169.124.126.58 on port 1024
[node-002:15418] btl:tcp: would block, so allowing background progress
[node-002:15418] btl:tcp: connect() to 169.124.126.58:1024 completed (complete_connect), sending connect ACK
[node-001:23861] btl:tcp: path from 169.124.218.58 to 169.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:23861] btl:tcp: path from 169.124.218.58 to 169.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:23861] btl:tcp: path from 169.124.126.58 to 169.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:23861] btl:tcp: path from 169.124.126.58 to 169.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:23861] btl:tcp: path from 169.124.218.58 to 169.124.218.151: IPV4 PUBLIC SAME NETWORK
[node-001:23861] btl:tcp: path from 169.124.218.58 to 169.124.126.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:23861] btl:tcp: path from 169.124.126.58 to 169.124.218.151: IPV4 PUBLIC DIFFERENT NETWORK
[node-001:23861] btl:tcp: path from 169.124.126.58 to 169.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:23861] btl: tcp: Match incoming connection from [[43381,2],0] 169.124.126.151 with locally known IP 169.124.218.151 failed (iface 0/2)!
[node-001:23861] btl:tcp: now connected to 169.124.126.151, process [[43381,2],0]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[43381,1],0]) is on host: node-003
  Process 2 ([[43381,2],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[node-003:07107] [[43381,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:23861] [[43381,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:07107] *** An error occurred in MPI_Comm_spawn
[node-003:07107] *** reported by process [2843017217,0]
[node-003:07107] *** on communicator MPI_COMM_WORLD
[node-003:07107] *** MPI_ERR_INTERN: internal error
[node-003:07107] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-003:07107] ***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[node-002:15418] [[43381,2],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:23861] *** An error occurred in MPI_Init
[node-001:23861] *** reported by process [2843017218,1]
[node-001:23861] *** on a NULL communicator
[node-001:23861] *** Unknown error
[node-001:23861] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-001:23861] ***    and potentially your MPI job)
[node-003:07088] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[node-003:07088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node-003:07088] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node-003:07088] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Thank you for your time! @jjhursey

@Extremys
Author

Extremys commented Jun 9, 2021

I still have the issue even when I force the eth interface choice, which is very strange ...

bsub -n 3 -R "span[ptile=1]" -o output.log mpirun -n 1 --mca btl_tcp_if_include eth1 $HOME/producer

output:

Job <4168> is submitted to default queue <STANDARD_BATCH>.
bash-4.2$ more loog4
Sender: LSF System <[email protected]>
Subject: Job 4168: <mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer> in cluster <r_cluster> Exited

Job <mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer> was submitted from host <node-001.cm.cluster> by user <user> in cluster <r_cluster>.
Job was executed on host(s) <1*node-003.cm.cluster>, in queue <STANDARD_BATCH>, as user <user> in cluster <r_cluster>.
                            <1*node-002.cm.cluster>
                            <1*node-001.cm.cluster>
</home/user> was used as the home directory.
</home/user> was used as the working directory.
Started at Wed Jun  9 12:37:46 2021
Results reported on Wed Jun  9 12:37:52 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 --mca btl_tcp_if_include eth1 /home/user/producer
------------------------------------------------------------

Exited with exit code 17.

Resource usage summary:

    CPU time :                                   0.40 sec.
    Max Memory :                                 33 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Run time :                                   5 sec.
    Turnaround time :                            6 sec.

The output (if any) follows:

[node-003:10140] mca: base: components_register: registering framework btl components
[node-003:10140] mca: base: components_register: found loaded component self
[node-003:10140] mca: base: components_register: component self register function successful
[node-003:10140] mca: base: components_register: found loaded component tcp
[node-003:10140] mca: base: components_register: component tcp register function successful
[node-003:10140] mca: base: components_register: found loaded component sm
[node-003:10140] mca: base: components_register: found loaded component usnic
[node-003:10140] mca: base: components_register: component usnic register function successful
[node-003:10140] mca: base: components_register: found loaded component ofi
[node-003:10140] mca: base: components_register: component ofi register function successful
[node-003:10140] mca: base: components_register: found loaded component vader
[node-003:10140] mca: base: components_register: component vader register function successful
[node-003:10140] mca: base: components_open: opening btl components
[node-003:10140] mca: base: components_open: found loaded component self
[node-003:10140] mca: base: components_open: component self open function successful
[node-003:10140] mca: base: components_open: found loaded component tcp
[node-003:10140] mca: base: components_open: component tcp open function successful
[node-003:10140] mca: base: components_open: found loaded component usnic
[node-003:10140] mca: base: components_open: component usnic open function successful
[node-003:10140] mca: base: components_open: found loaded component ofi
[node-003:10140] mca: base: components_open: component ofi open function successful
[node-003:10140] mca: base: components_open: found loaded component vader
[node-003:10140] mca: base: components_open: component vader open function successful
[node-003:10140] select: initializing btl component self
[node-003:10140] select: init of component self returned success
[node-003:10140] select: initializing btl component tcp
[node-003:10140] btl:tcp: Attempting to bind to AF_INET port 1024
[node-003:10140] btl:tcp: Successfully bound to AF_INET port 1024
[node-003:10140] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-003:10140] btl:tcp: examining interface eth1
[node-003:10140] btl:tcp: using ipv6 interface eth1
[node-003:10140] select: init of component tcp returned success
[node-003:10140] select: initializing btl component usnic
[node-003:10140] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-003:10140] select: init of component usnic returned failure
[node-003:10140] mca: base: close: component usnic closed
[node-003:10140] mca: base: close: unloading component usnic
[node-003:10140] select: initializing btl component ofi
[node-003:10140] select: init of component ofi returned success
[node-003:10140] select: initializing btl component vader
[node-003:10140] select: init of component vader returned failure
[node-003:10140] mca: base: close: component vader closed
[node-003:10140] mca: base: close: unloading component vader
[node-003:10140] mca: bml: Using self btl for send to [[38226,1],0] on node node-003
0 1
[node-001:30432] mca: base: components_register: registering framework btl components
[node-001:30432] mca: base: components_register: found loaded component self
[node-001:30432] mca: base: components_register: component self register function successful
[node-001:30432] mca: base: components_register: found loaded component tcp
[node-001:30432] mca: base: components_register: component tcp register function successful
[node-001:30432] mca: base: components_register: found loaded component sm
[node-001:30432] mca: base: components_register: found loaded component usnic
[node-001:30432] mca: base: components_register: component usnic register function successful
[node-001:30432] mca: base: components_register: found loaded component ofi
[node-001:30432] mca: base: components_register: component ofi register function successful
[node-001:30432] mca: base: components_register: found loaded component vader
[node-001:30432] mca: base: components_register: component vader register function successful
[node-001:30432] mca: base: components_open: opening btl components
[node-001:30432] mca: base: components_open: found loaded component self
[node-001:30432] mca: base: components_open: component self open function successful
[node-001:30432] mca: base: components_open: found loaded component tcp
[node-001:30432] mca: base: components_open: component tcp open function successful
[node-001:30432] mca: base: components_open: found loaded component usnic
[node-001:30432] mca: base: components_open: component usnic open function successful
[node-001:30432] mca: base: components_open: found loaded component ofi
[node-001:30432] mca: base: components_open: component ofi open function successful
[node-001:30432] mca: base: components_open: found loaded component vader
[node-001:30432] mca: base: components_open: component vader open function successful
[node-001:30432] select: initializing btl component self
[node-001:30432] select: init of component self returned success
[node-001:30432] select: initializing btl component tcp
[node-001:30432] btl:tcp: Attempting to bind to AF_INET port 1024
[node-001:30432] btl:tcp: Successfully bound to AF_INET port 1024
[node-001:30432] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-001:30432] btl:tcp: examining interface eth1
[node-001:30432] btl:tcp: using ipv6 interface eth1
[node-001:30432] select: init of component tcp returned success
[node-001:30432] select: initializing btl component usnic
[node-001:30432] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-001:30432] select: init of component usnic returned failure
[node-001:30432] mca: base: close: component usnic closed
[node-001:30432] mca: base: close: unloading component usnic
[node-001:30432] select: initializing btl component ofi
[node-001:30432] select: init of component ofi returned success
[node-001:30432] select: initializing btl component vader
[node-001:30432] select: init of component vader returned failure
[node-001:30432] mca: base: close: component vader closed
[node-001:30432] mca: base: close: unloading component vader
[node-002:19988] mca: base: components_register: registering framework btl components
[node-002:19988] mca: base: components_register: found loaded component self
[node-002:19988] mca: base: components_register: component self register function successful
[node-002:19988] mca: base: components_register: found loaded component tcp
[node-002:19988] mca: base: components_register: component tcp register function successful
[node-002:19988] mca: base: components_register: found loaded component sm
[node-002:19988] mca: base: components_register: found loaded component usnic
[node-002:19988] mca: base: components_register: component usnic register function successful
[node-002:19988] mca: base: components_register: found loaded component ofi
[node-002:19988] mca: base: components_register: component ofi register function successful
[node-002:19988] mca: base: components_register: found loaded component vader
[node-002:19988] mca: base: components_register: component vader register function successful
[node-002:19988] mca: base: components_open: opening btl components
[node-002:19988] mca: base: components_open: found loaded component self
[node-002:19988] mca: base: components_open: component self open function successful
[node-002:19988] mca: base: components_open: found loaded component tcp
[node-002:19988] mca: base: components_open: component tcp open function successful
[node-002:19988] mca: base: components_open: found loaded component usnic
[node-002:19988] mca: base: components_open: component usnic open function successful
[node-002:19988] mca: base: components_open: found loaded component ofi
[node-002:19988] mca: base: components_open: component ofi open function successful
[node-002:19988] mca: base: components_open: found loaded component vader
[node-002:19988] mca: base: components_open: component vader open function successful
[node-002:19988] select: initializing btl component self
[node-002:19988] select: init of component self returned success
[node-002:19988] select: initializing btl component tcp
[node-002:19988] btl:tcp: Attempting to bind to AF_INET port 1024
[node-002:19988] btl:tcp: Successfully bound to AF_INET port 1024
[node-002:19988] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[node-002:19988] btl:tcp: examining interface eth1
[node-002:19988] btl:tcp: using ipv6 interface eth1
[node-002:19988] select: init of component tcp returned success
[node-002:19988] select: initializing btl component usnic
[node-002:19988] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-002:19988] select: init of component usnic returned failure
[node-002:19988] mca: base: close: component usnic closed
[node-002:19988] mca: base: close: unloading component usnic
[node-002:19988] select: initializing btl component ofi
[node-002:19988] select: init of component ofi returned success
[node-002:19988] select: initializing btl component vader
[node-002:19988] select: init of component vader returned failure
[node-002:19988] mca: base: close: component vader closed
[node-002:19988] mca: base: close: unloading component vader
[node-002:19988] mca: bml: Using self btl for send to [[38226,2],0] on node node-002
[node-001:30432] mca: bml: Using self btl for send to [[38226,2],1] on node node-001
[node-002:19988] btl:tcp: path from 169.124.126.151 to 169.124.126.58: IPV4 PUBLIC SAME NETWORK
[node-002:19988] mca: bml: Using tcp btl for send to [[38226,2],1] on node node-001
[node-002:19988] btl: tcp: attempting to connect() to [[38226,2],1] address 169.124.126.58 on port 1024
[node-002:19988] btl:tcp: would block, so allowing background progress
[node-002:19988] btl:tcp: connect() to 169.124.126.58:1024 completed (complete_connect), sending connect ACK
[node-001:30432] btl:tcp: path from 169.124.126.58 to 169.124.126.151: IPV4 PUBLIC SAME NETWORK
[node-001:30432] btl:tcp: now connected to 169.124.126.151, process [[38226,2],0]
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38226,1],0]) is on host: node-003
  Process 2 ([[38226,2],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[node-003:10140] [[38226,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-001:30432] [[38226,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-002:19988] [[38226,2],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[node-003:10140] *** An error occurred in MPI_Comm_spawn
[node-003:10140] *** reported by process [2505179137,0]
[node-003:10140] *** on communicator MPI_COMM_WORLD
[node-003:10140] *** MPI_ERR_INTERN: internal error
[node-003:10140] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-003:10140] ***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[node-001:30432] *** An error occurred in MPI_Init
[node-001:30432] *** reported by process [2505179138,1]
[node-001:30432] *** on a NULL communicator
[node-001:30432] *** Unknown error
[node-001:30432] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-001:30432] ***    and potentially your MPI job)
[node-003:10135] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[node-003:10135] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node-003:10135] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[node-003:10135] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
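
For reference, a minimal sketch of a further variant one could try, pinning Open MPI's runtime out-of-band (OOB) traffic to the same interface as the BTL; oob_tcp_if_include is a standard MCA parameter in the 4.x series, but whether it helps on this cluster is untested, and the output file name below is arbitrary:

# Same job shape as above, additionally restricting the ORTE out-of-band channel to eth1
# (interface name and binary path taken from the command above; log file name is arbitrary)
bsub -n 3 -R "span[ptile=1]" -o output_oob.log \
    mpirun -n 1 --mca btl_tcp_if_include eth1 --mca oob_tcp_if_include eth1 $HOME/producer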

@gpaulsen
Member

@Extremys, do you still have access to this system? Could you please try rerunning with Open MPI v4.0.6?
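
A minimal sketch of such a re-run, reusing the job shape from the commands above; the module name is an assumption based on the EasyBuild naming used for the existing 4.0.5 install:

# Hypothetical retry with an Open MPI 4.0.6 build; the module name is an assumption
module load OpenMPI/4.0.6-GCC-10.2.0
bsub -n 3 -R "span[ptile=1]" -o output_406.log mpirun -n 1 $HOME/producer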

@jarunan

jarunan commented Dec 6, 2021

Hi there,

Have you found a solution to this issue? I have the same problem here with Open MPI 4.0.2 and 4.1.1: MPI_Comm_spawn() cannot spawn across nodes. I am testing on a cluster running CentOS 7.9 with the LSF batch system, built with GCC 6.3.0.

I used this code for testing:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NUM_SPAWNS 3

int main( int argc, char *argv[] )
{
  int np = NUM_SPAWNS;
  int errcodes[NUM_SPAWNS];
  MPI_Comm parentcomm, intercomm;

  MPI_Init( &argc, &argv );
  MPI_Comm_get_parent( &parentcomm );
  if (parentcomm == MPI_COMM_NULL)
    {
      /* No parent: spawn NUM_SPAWNS more copies of this executable (argv[0]). */
      MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, np, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes );
      printf("I'm the parent.\n");
    }
  else
    {
      printf("I'm the spawned.\n");
    }
  fflush(stdout);
  MPI_Finalize();
  return 0;
}
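
(Not shown above is the build step; a minimal sketch, assuming the source is saved as spawn_example.c and the Open MPI compiler wrapper mpicc from the loaded toolchain is on the PATH:)

# Build the reproducer with the Open MPI C wrapper; the source file name is an
# assumption, and the binary name matches the bsub commands below.
mpicc -o spawn_example spawn_example.c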

Running on one node, it looked fine:

$ bsub -n 6 -I "mpirun -n 1 spawn_example"
MPI job.
Job <195486300> is submitted to queue <normal.4h>.
<<Waiting for dispatch ...>>
<<Starting on eu-a2p-154>>
I'm the spawned.
I'm the spawned.
I'm the spawned.
I'm the parent.

But on two nodes, errors occurred:

$ bsub -n 6 -R "span[ptile=3]" -I "mpirun -n 1 spawn_example"
MPI job.
Job <195486678> is submitted to queue <normal.4h>.
<<Waiting for dispatch ...>>
<<Starting on eu-a2p-274>>
[eu-a2p-217:30058] pml_ucx.c:175  Error: Failed to receive UCX worker address: Not found (-13)
[eu-a2p-217:30058] [[18089,2],2] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[eu-a2p-217:30058] *** An error occurred in MPI_Init
[eu-a2p-217:30058] *** reported by process [1185480706,2]
[eu-a2p-217:30058] *** on a NULL communicator
[eu-a2p-217:30058] *** Unknown error
[eu-a2p-217:30058] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[eu-a2p-217:30058] ***    and potentially your MPI job)
[eu-a2p-274:107025] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2147

@jjhursey
Member

We recently made several Spawn-related fixes to Open MPI main and v5.0.x (see PR #10688 for the bulk of the fixes).

I just tried the example from the comment above and the one from the original post. Both passed with a build of Open MPI main in an LSF allocation.

Please re-try your examples, and re-open the issue if the problem persists.
