mpi4py: Regression in spawn tests #10631

Closed
dalcinl opened this issue Aug 6, 2022 · 12 comments

@dalcinl
Contributor

dalcinl commented Aug 6, 2022

I believe changes over the last week may have introduced issues in spawn support. Two successive runs of the mpi4py test suite both failed at the same point. From the traceback, it looks like the issue happens while the spawned children run MPI_Init_thread.

https://github.com/mpi4py/mpi4py-testing/runs/7703615156?check_suite_focus=true#step:17:1365

Traceback from the link above:
testArgsOnlyAtRootMultiple (test_spawn.TestSpawnSelf) ... [fv-az292-337:164868] *** Process received signal ***
[fv-az292-337:164868] Signal: Segmentation fault (11)
[fv-az292-337:164868] Signal code: Address not mapped (1)
[fv-az292-337:164868] Failing at address: 0x55a66b9ee180
[fv-az292-337:164868] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fdecf9c8090]
[fv-az292-337:164868] [ 1] /usr/local/lib/libopen-pal.so.0(+0xc8fc4)[0x7fdecebcffc4]
[fv-az292-337:164868] [ 2] /usr/local/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x45)[0x7fdecebd1733]
[fv-az292-337:164868] [ 3] /usr/local/lib/libopen-pal.so.0(+0xca9ab)[0x7fdecebd19ab]
[fv-az292-337:164868] [ 4] /usr/local/lib/libopen-pal.so.0(+0xcacab)[0x7fdecebd1cab]
[fv-az292-337:164868] [ 5] /usr/local/lib/libopen-pal.so.0(opal_progress+0x43)[0x7fdeceb3bd6f]
[fv-az292-337:164868] [ 6] /usr/local/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1ef)[0x7fdecebf1d3f]
[fv-az292-337:164868] [ 7] /usr/local/lib/libmpi.so.0(+0xa813e)[0x7fdececec13e]
[fv-az292-337:164868] [ 8] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x2b)[0x7fdececec385]
[fv-az292-337:164868] [ 9] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_generic+0x760)[0x7fdecedd304b]
[fv-az292-337:164868] [10] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_pipeline+0x1a3)[0x7fdecedd3551]
[fv-az292-337:164868] [11] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_do_this+0x126)[0x7fdecee0bd76]
[fv-az292-337:164868] [12] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_dec_fixed+0x43c)[0x7fdecee02832]
[fv-az292-337:164868] [13] /usr/local/lib/libmpi.so.0(ompi_dpm_connect_accept+0x8a8)[0x7fdececbf3b7]
[fv-az292-337:164868] [14] /usr/local/lib/libmpi.so.0(ompi_dpm_dyn_init+0xd6)[0x7fdececccb28]
[fv-az292-337:164868] [15] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x837)[0x7fdececeeece]
[fv-az292-337:164868] [16] /usr/local/lib/libmpi.so.0(PMPI_Init_thread+0xdd)[0x7fdeced59548]
[fv-az292-337:164868] [17] /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x33f67)[0x7fdecf14af67]
[fv-az292-337:164868] [18] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyModule_ExecDef+0x73)[0x7fdecfdcc0c3]
[fv-az292-337:164868] [19] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x274460)[0x7fdecfdfa460]
[fv-az292-337:164868] [20] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x19745e)[0x7fdecfd1d45e]
[fv-az292-337:164868] [21] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyObject_Call+0x8e)[0x7fdecfceeffe]
[fv-az292-337:164868] [22] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x630b)[0x7fdecfd6bddb]
[fv-az292-337:164868] [23] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [24] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5021)[0x7fdecfd6aaf1]
[fv-az292-337:164868] [25] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [26] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x773)[0x7fdecfd66243]
[fv-az292-337:164868] [27] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [28] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x33e)[0x7fdecfd65e0e]
[fv-az292-337:164868] [29] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] *** End of error message ***
[fv-az292-337:164866] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
[fv-az292-337:164855] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
ERROR
[fv-az292-337:164867] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
testCommSpawn (test_spawn.TestSpawnSelf) ... [fv-az292-337:00000] *** An error occurred in MPI_Init_thread
[fv-az292-337:00000] *** reported by process [1431306243,1]
[fv-az292-337:00000] *** on a NULL communicator
[fv-az292-337:00000] *** Unknown error
[fv-az292-337:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az292-337:00000] ***    and MPI will try to terminate your MPI job as well)
ok
testCommSpawnMultiple (test_spawn.TestSpawnSelf) ... 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Extra bits from valgrind (local run with a debug build):
==1494211== Conditional jump or move depends on uninitialised value(s)
==1494211==    at 0x167996FC: pmix_bfrops_base_value_unload (bfrop_base_fns.c:409)
==1494211==    by 0x16798687: PMIx_Value_unload (bfrop_base_fns.c:54)
==1494211==    by 0x16065326: ompi_dpm_connect_accept (dpm.c:423)
==1494211==    by 0x160CCFD5: PMPI_Comm_spawn_multiple (comm_spawn_multiple.c:199)
==1494211==    by 0x15F1DD69: __pyx_pf_6mpi4py_3MPI_9Intracomm_38Spawn_multiple (MPI.c:149745)
==1494211==    by 0x15F1D6F8: __pyx_pw_6mpi4py_3MPI_9Intracomm_39Spawn_multiple (MPI.c:149423)
==1494211==    by 0x4991160: cfunction_call (methodobject.c:543)
==1494211==    by 0x498D262: _PyObject_MakeTpCall (call.c:215)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:112)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:99)
==1494211==    by 0x498C590: UnknownInlinedFun (abstract.h:123)
==1494211==    by 0x498C590: call_function (ceval.c:5869)
==1494211==    by 0x4985D92: _PyEval_EvalFrameDefault (ceval.c:4231)
==1494211==    by 0x49838D2: UnknownInlinedFun (pycore_ceval.h:46)
==1494211==    by 0x49838D2: _PyEval_Vector (ceval.c:5065)
==1494211==    by 0x4998FD7: UnknownInlinedFun (call.c:342)
==1494211==    by 0x4998FD7: UnknownInlinedFun (abstract.h:114)
==1494211==    by 0x4998FD7: method_vectorcall (classobject.c:53)
@jsquyres
Member

jsquyres commented Aug 6, 2022

@awlauria Is this due to PMIx / PRTE updates?

@dalcinl
Contributor Author

dalcinl commented Aug 6, 2022

@jsquyres Looks like that's the case. The two builds below show that the regression comes from 4896db1.

@dalcinl
Contributor Author

dalcinl commented Aug 17, 2022

@awlauria Can you provide an ETA for looking into this?

@dalcinl
Contributor Author

dalcinl commented Aug 17, 2022

An additional pointer: using a local debug build, the issue seems to happen only with MPI_Comm_spawn_multiple(). All of my tests involving MPI_Comm_spawn() are successful.

dalcinl added a commit to mpi4py/mpi4py that referenced this issue Aug 17, 2022
dalcinl added a commit to mpi4py/mpi4py that referenced this issue Aug 17, 2022
dalcinl added a commit to mpi4py/mpi4py that referenced this issue Aug 17, 2022
@rhc54
Contributor

rhc54 commented Aug 18, 2022

A quick glance at the trace shows the failure is in the btl/sm component. A grep of that code shows the only PMIx dependency is on a modex_recv of PMIX_LOCAL_RANK, with the module subsequently attempting to connect/send to the proc of that local rank.

The problem is clearly that the btl/sm is looking for the wrong value here. It needs to look for PMIX_NODE_RANK. I've told you folks this multiple times, and it has indeed been fixed before - but it seems to keep getting re-broken.

Just curious: am I the only one doing any triage on these issues? I don't look at many, nor very often, but when I do look at one, the reason for the problem seems very quick and easy to identify.

@rhc54
Contributor

rhc54 commented Aug 18, 2022

A simple print statement is all that is required to immediately show the problem - printing out the backing file:

[Ralphs-iMac-2.local:83492] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.2
[Ralphs-iMac-2.local:83491] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.1
[Ralphs-iMac-2.local:83490] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.0
Parent [pid 83490] about to spawn!
Parent [pid 83492] about to spawn!
Parent [pid 83491] about to spawn!
[Ralphs-iMac-2.local:83494] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/2/sm_segment.Ralphs-iMac-2.1000.12790002.0
[Ralphs-iMac-2.local:83493] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/2/sm_segment.Ralphs-iMac-2.1000.12790002.0

where the last digit of the filename is the local rank. You can see that the spawned procs step on each other's backing files because they use their local rank, which is the same since they come from two app_contexts. The connection to the local rank is made by:

#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_local_rank

All you need do is change it to the node rank.
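
As a concrete sketch of that one-line change (not a vetted patch; the opal_process_info.my_node_rank field name matches the debug output quoted later in this thread):

/* current definition in btl/sm (as quoted above): keyed off the local rank */
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_local_rank

/* suggested replacement: key off the node rank instead */
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_node_rank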

@jjhursey
Member

I created a small test program: here

If I run with ucx it passes both of the following:

shell$ export OMPI_MCA_pml=ucx
shell$ mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (A)
Hello from a Child (B)
Hello from a Child (B)
Spawning Multiple './simple_spawn_multiple' ... OK
shell$ ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK

We don't get the IO from the child processes in the second example (singleton spawn multiple), but that's a separate issue.
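
The linked program is not reproduced above. As a rough sketch only (the two-app-context layout and the "A"/"B" argv marking are assumptions inferred from the output, not the actual source), a spawn-multiple test along these lines:

/* simple_spawn_multiple.c -- hedged sketch, not the linked test program.
 * Usage: ./simple_spawn_multiple /path/to/simple_spawn_multiple
 * The parent spawns 1 "A" child and 2 "B" children of the same binary. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent = MPI_COMM_NULL, intercomm = MPI_COMM_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL != parent) {
        /* Child: the single argv entry ("A" or "B") identifies its app context. */
        printf("Hello from a Child (%s)\n", argc > 1 ? argv[1] : "?");
        fflush(stdout);
        MPI_Comm_disconnect(&parent);
    } else if (argc > 1) {
        /* Parent: two app contexts of the same command, 1 proc and 2 procs. */
        char *cmds[2]     = { argv[1], argv[1] };
        char *argvA[]     = { "A", NULL };
        char *argvB[]     = { "B", NULL };
        char **argvs[2]   = { argvA, argvB };
        int maxprocs[2]   = { 1, 2 };
        MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };

        MPI_Comm_spawn_multiple(2, cmds, argvs, maxprocs, infos, 0,
                                MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        printf("Spawning Multiple '%s' ... OK\n", argv[1]);
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}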

However, if I use ob1 then I can reproduce this issue, ending in a segv. Making the change Ralph suggested instead resulted in a hang. Investigating the hang further led to the conclusion that there is a bug in PRRTE.

I took a look at the suggestion from @rhc54 and I don't think that will work on its own. It will address the backing file problem, but the processes are still confused because they are getting incorrect values (it seems to me) for local_rank and local_peers. OpenPMIx/PRRTE is returning values relative to their app context, not to the single spawn operation.

To help with debugging, I added the following towards the end of ompi/runtime/ompi_rte.c:

+    opal_output(0, "JJH DEBUG) %d is [%s:%d] / [%d:%d] local_rank = %d, local_peers = %d, node_rank = %d",
+                getpid(),
+                opal_process_info.myprocid.nspace, opal_process_info.myprocid.rank,
+                opal_process_info.my_name.jobid, opal_process_info.my_name.vpid,
+                opal_process_info.my_local_rank,
+                opal_process_info.num_local_peers,
+                opal_process_info.my_node_rank);
[jjhursey@f5n17 mpi] mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
[f5n17:3484068] JJH DEBUG) 3484068 is [prterun-f5n17-3484059@1:0] / [1625489409:0] local_rank = 0, local_peers = 0, node_rank = 0
[f5n17:3484072] JJH DEBUG) 3484072 is [prterun-f5n17-3484059@2:1] / [1625489410:1] local_rank = 0, local_peers = 2, node_rank = 2
[f5n17:3484071] JJH DEBUG) 3484071 is [prterun-f5n17-3484059@2:0] / [1625489410:0] local_rank = 0, local_peers = 2, node_rank = 1
[f5n17:3484073] JJH DEBUG) 3484073 is [prterun-f5n17-3484059@2:2] / [1625489410:2] local_rank = 1, local_peers = 2, node_rank = 3
  • PID 3484068 is the parent (the one calling MPI_Comm_spawn_multiple)
    • local_rank = 0, local_peers = 0, node_rank = 0
    • Its PMIx namespace (prterun-f5n17-3484059@1) is distinct from the children's (prterun-f5n17-3484059@2), which is expected
  • Children:
    • PID 3484071 is the 1 process in the first appcontext passed to MPI_Comm_spawn_multiple
      • PMIx name: [prterun-f5n17-3484059@2:0]
    • PIDs 3484072 and 3484073 are the 2 processes in the second appcontext passed to MPI_Comm_spawn_multiple
      • PMIx names: [prterun-f5n17-3484059@2:1] and [prterun-f5n17-3484059@2:2]
    • So the rank in the PMIx name is correct, and the namespace is shared by the full set of 3 spawned processes.
    • However, the local_rank (PMIX_LOCAL_RANK) and local_peers (PMIX_LOCAL_SIZE) values are not relative to the namespace, but relative to the app context.
      • It seems that their values correspond to PMIX_APP_RANK and PMIX_APP_SIZE instead.

From the PMIx standard 4.1

  • PMIX_LOCAL_RANK
  • Rank of the specified process on its node - refers to the numerical location (starting from zero) of the process on its node when counting only those processes from the same job that share the node, ordered by their overall rank within that job.

  • PMIX_LOCAL_SIZE
  • Number of processes in the specified job or application realm on the caller’s node. Defaults to job realm unless the PMIX_APP_INFO and the PMIX_APPNUM qualifiers are given.

This seems to indicate that there is a bug in PRRTE that needs fixing.
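
As a side note, the same keys can be queried directly through the PMIx client API. A rough sketch (assumptions: the helper is hypothetical, run under mpirun/prterun so a PMIx server is available; value types per PMIx v4.1, i.e. uint16 for the rank keys and uint32 for PMIX_LOCAL_SIZE):

/* pmix_rank_check.c -- hypothetical helper, not part of the OMPI/PRRTE trees. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) return 1;

    /* Rank of this process on its node, counted within the job realm. */
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_LOCAL_RANK, NULL, 0, &val)) {
        printf("[%s:%u] PMIX_LOCAL_RANK = %u\n", myproc.nspace,
               (unsigned) myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }

    /* Node rank: not reused on that node after a failure, which is why btl/sm wants it. */
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_NODE_RANK, NULL, 0, &val)) {
        printf("[%s:%u] PMIX_NODE_RANK = %u\n", myproc.nspace,
               (unsigned) myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }

    /* Number of local peers in the job realm, queried with a wildcard rank. */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_LOCAL_SIZE, NULL, 0, &val)) {
        printf("[%s:%u] PMIX_LOCAL_SIZE = %u\n", myproc.nspace,
               (unsigned) myproc.rank, (unsigned) val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}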

@jjhursey
Member

jjhursey commented Aug 18, 2022

I filed an issue on the PRRTE side to gain visibility: openpmix/prrte#1445

I think I found the problem in PRRTE, but I'll need @rhc54 to help with the fix. See the note here

@rhc54
Contributor

rhc54 commented Aug 19, 2022

The fix has been committed to PRRTE master and ported to v3.0. It will fix this immediate problem, but the broader question still stands.

Let me explain my comments about node vs local rank. The rationale behind node rank lies in the fault tolerance area. If a proc from a given app dies on a node, and then a proc from that app (either the one that died or some migration) is restarted on that node, then the local rank gets reused - but the node rank does not. If you are using local rank, the app has the potential to crash on that node as the conflict will take down all the procs that were "connected" via the btl/sm to that local rank. Before you had FT, it didn't really make much difference - now that OMPI is supporting FT, it is problematic.

If you have added logic elsewhere in OMPI to correct the problem, then perhaps this is not as critical as it used to be. Nathan and I had spent a fair amount of time on this issue and concluded that using node rank was the best solution, but perhaps that has changed.

@jjhursey
Member

For reference:

@awlauria We will need to pick this PRRTE change up as well.

@jjhursey
Member

FYI: I can confirm that the PRRTE fix addresses this issue. I changed my prrte submodule to the v3.0 branch (including 1447) and was able to run successfully without any OMPI modifications:

[jjhursey@f5n17 mpi] ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK
[jjhursey@f5n17 mpi] mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (B)
Hello from a Child (B)
Hello from a Child (A)
Spawning Multiple './simple_spawn_multiple' ... OK

What Ralph mentions about using the node rank vs. the local rank makes sense. I filed PR #10690 to make that change, but I want someone supporting FT to review it, so I flagged @abouteiller.

I filed Issue #10691 to track the missing IO.

Once the PRRTE submodule is updated, this ticket can be closed.

@jjhursey jjhursey self-assigned this Aug 19, 2022
dalcinl added a commit to mpi4py/mpi4py that referenced this issue Aug 22, 2022
@jjhursey
Member

jjhursey commented Sep 7, 2022

The fixes were merged into PRRTE v3, and the submodule pointer for Open MPI v5.0.x has been updated. I think we are good to close this issue.

@jjhursey jjhursey closed this as completed Sep 7, 2022
awlauria pushed a commit to awlauria/ompi that referenced this issue Jan 19, 2023
 * Ref Issue open-mpi#10631

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit 9cef06a)
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024