mpi4py: Regression in spawn tests #10631
@awlauria Is this due to PMIx / PRTE updates?
@jsquyres Looks like that's the case. The two builds below show that the regression comes from 4896db1.
@awlauria Can you provide an ETA for when you will be able to look into this?
An additional pointer: using a local debug build, the issue seems to happen only with
A quick glance at the trace shows the failure is in the btl/sm component. A grep of that code shows the only PMIx dependency is on a modex_recv of

The problem is clearly that the btl/sm is looking for the wrong value here. It needs to look for

Just curious: am I the only one doing any triage on these issues? I don't look at many, nor very often, but when I do look at one, the reason for the problem seems very quick/easy to identify.
A simple print statement is all that is required to immediately show the problem - printing out the backing file:

where the last digit of the filename is the local rank. You can see that the spawned procs step on each other's backing file because they use their local rank, which is the same since they come from two app_contexts. The connection to the local rank is made by:

```c
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_local_rank
```

All you need do is change it to the node rank.
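In code form, the suggested change amounts to roughly the following. This is a sketch only: the actual patch and its exact location in the btl/sm code are not shown in this thread, though the `opal_process_info.my_node_rank` field does appear in the debug output later in the discussion.

```c
/* Sketch of the suggested fix (not the actual upstream patch): key the btl/sm
 * shared-memory setup off the node rank, which is unique across all
 * app_contexts on a node, instead of the per-job local rank. */
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_node_rank
```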
I created a small test program: here (a rough sketch of such a program is included after this comment). If I run with:

```
shell$ export OMPI_MCA_pml=ucx
shell$ mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (A)
Hello from a Child (B)
Hello from a Child (B)
Spawning Multiple './simple_spawn_multiple' ... OK
shell$ ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK
```

We don't get the IO from the child processes in the second example (singleton spawn multiple), but that's a separate issue. However, if I use

I took a look at the suggestion from @rhc54 and I don't think that will work. It will address the backing file problem, but the processes are now confused because they are getting incorrect values (it seems to me) for

A bit of debugging to help - I added the following towards the end of

```diff
+    opal_output(0, "JJH DEBUG) %d is [%s:%d] / [%d:%d] local_rank = %d, local_peers = %d, node_rank = %d",
+                getpid(),
+                opal_process_info.myprocid.nspace, opal_process_info.myprocid.rank,
+                opal_process_info.my_name.jobid, opal_process_info.my_name.vpid,
+                opal_process_info.my_local_rank,
+                opal_process_info.num_local_peers,
+                opal_process_info.my_node_rank);
```

```
[jjhursey@f5n17 mpi] mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
[f5n17:3484068] JJH DEBUG) 3484068 is [prterun-f5n17-3484059@1:0] / [1625489409:0] local_rank = 0, local_peers = 0, node_rank = 0
[f5n17:3484072] JJH DEBUG) 3484072 is [prterun-f5n17-3484059@2:1] / [1625489410:1] local_rank = 0, local_peers = 2, node_rank = 2
[f5n17:3484071] JJH DEBUG) 3484071 is [prterun-f5n17-3484059@2:0] / [1625489410:0] local_rank = 0, local_peers = 2, node_rank = 1
[f5n17:3484073] JJH DEBUG) 3484073 is [prterun-f5n17-3484059@2:2] / [1625489410:2] local_rank = 1, local_peers = 2, node_rank = 3
```
From the PMIx standard 4.1:

This seems to indicate that there is a bug in PRRTE that needs fixing.
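For reference, a minimal sketch of what a simple_spawn_multiple-style reproducer can look like. This is an assumption about the linked test program, not its actual source; the one-"A"/two-"B" layout is inferred from the output shown above.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent = MPI_COMM_NULL, child = MPI_COMM_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL != parent) {
        /* Spawned child: argv[1] identifies the app_context ("A" or "B"). */
        printf("Hello from a Child (%s)\n", argc > 1 ? argv[1] : "?");
        MPI_Comm_disconnect(&parent);
    } else {
        /* Parent: spawn two app_contexts of the binary named on the command
         * line - one "A" process and two "B" processes. */
        if (argc < 2) {
            fprintf(stderr, "usage: %s <binary-to-spawn>\n", argv[0]);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        char *cmds[2]        = { argv[1], argv[1] };
        char *argv_a[]       = { "A", NULL };
        char *argv_b[]       = { "B", NULL };
        char **spawn_argv[2] = { argv_a, argv_b };
        int maxprocs[2]      = { 1, 2 };
        MPI_Info infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };

        MPI_Comm_spawn_multiple(2, cmds, spawn_argv, maxprocs, infos,
                                0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        printf("Spawning Multiple '%s' ... OK\n", argv[1]);
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
```

Such a program can be launched either under mpirun or as a singleton, as in the transcript above.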
I filed an issue on the PRRTE side to gain visibility: openpmix/prrte#1445

I think I found the problem in PRRTE, but I'll need @rhc54 to help with the fix. See the note here.
The fix has been committed to PRRTE master and ported to v3.0. It will fix this immediate problem, but the broader question remains. Let me explain my comments about node rank vs local rank.

The rationale behind the node rank lies in the fault tolerance area. If a proc from a given app dies on a node, and a proc from that app (either the one that died or some migration) is later restarted on that node, the local rank gets reused - but the node rank does not. If you are using the local rank, the app has the potential to crash on that node, as the conflict will take down all the procs that were "connected" via the btl/sm to that local rank. Before OMPI had FT it didn't really make much difference - now that OMPI supports FT, it is problematic.

If you have added logic elsewhere in OMPI to correct the problem, then perhaps this is not as critical as it used to be. Nathan and I spent a fair amount of time on this issue and concluded that using the node rank was the best solution, but perhaps that has changed.
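A side note for anyone who wants to inspect the two attributes being discussed: a minimal sketch using the standard PMIx client API. This is illustrative only and is not part of the fix; the semantics in the comments follow the PMIx standard definitions referenced above.

```c
/* Illustrative sketch: print this process's PMIX_LOCAL_RANK (rank among procs
 * of the same job on this node) and PMIX_NODE_RANK (rank among all procs on
 * this node, spanning jobs). */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_LOCAL_RANK, NULL, 0, &val)) {
        printf("%s.%u: local rank = %u\n", myproc.nspace,
               (unsigned) myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_NODE_RANK, NULL, 0, &val)) {
        printf("%s.%u: node rank = %u\n", myproc.nspace,
               (unsigned) myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```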
For reference:
@awlauria We will need to pick this PRRTE change up as well.
* Ref Issue open-mpi#10631 Signed-off-by: Joshua Hursey <[email protected]>
FYI: I can confirm that the PRRTE fix addresses this issue. I changed my prrte submodule to the
What Ralph mentions about using the node rank vs the local rank makes sense. I filed PR #10690 to make that change, but I want someone supporting FT to review it, so I flagged @abouteiller. I filed Issue #10691 to track the missing IO. Once the PRRTE submodule is updated, that should close this ticket.
The fixes were merged into PRRTE.
* Ref Issue open-mpi#10631 Signed-off-by: Joshua Hursey <[email protected]> (cherry picked from commit 9cef06a)
* Ref Issue open-mpi#10631 Signed-off-by: Joshua Hursey <[email protected]>
I believe changes over the last week may have introduced issues in spawn support. Two successive runs of the mpi4py testsuite both failed at the same point. From the traceback, it looks like the issue happens while the children run MPI_Init_thread.

https://github.com/mpi4py/mpi4py-testing/runs/7703615156?check_suite_focus=true#step:17:1365

Traceback from link above:

Extra bits from valgrind (local run with debug build):