"mpirun --leave-session-attached" hangs in Cisco MTT runs under SLURM #3726

Open
@jsquyres

Description

Over the past week or so, Cisco's MTT runs have been hanging (i.e., timing out) 100% of the time across master, v2.0.x, v2.1.x, and v3.0.x. Jeff+Ralph narrowed the problem down to the use of --leave-session-attached in Cisco's MTT setup (a recent addition, made in an attempt to track down a different problem). Removing --leave-session-attached fixed the timeouts.

We did a bunch of investigation to try to figure out why --leave-session-attached was causing hangs. Here are some notes (in no particular order) on what we found:

  • Jeff could only get the problem to manifest when running through MTT. Running the exact same commands outside MTT -- with the same MTT-compiled MPI install and the same MTT-compiled MPI executable -- did not result in a hang (!).
    • Jeff compared linked libraries, environment variables, etc., and could not find any difference between running through MTT and running manually. Something must be different, but we haven't figured out what it is (see the comparison sketch at the end of these notes).
  • However, running MTT inside of an salloc did reproduce the problem 100% of the time (but still: running the same /path/to/mpirun --leave-session-attached ... commands manually inside that same salloc did not reproduce the problem. Maddening!).
    • Hence, this is where most of the investigation focused -- e.g., getting an salloc and manually invoking the MTT client to MPI get, MPI install, Test get, and Test build, and then repeatedly invoking the MTT client to Test run (e.g., the trivial tests -- which are especially helpful because they have a short MTT timeout). A sketch of this recipe follows the slurmd log below.
    • These Test Run invocations will all time out, and you can use them to investigate what is going on.
  • It did not seem to matter if MTT was run via salloc or sbatch (one key difference being the location of mpirun: on the head node, or on the first node of the allocation).
  • When the hang occurs, we can see that mpirun is still running and that it has forked an srun to launch the remote orteds. However, no orteds are running on the remote nodes -- yet the srun is still running. Totally weird. (See the inspection sketch at the end of these notes.)
    • It's not clear if the remote nodes launched and immediately died, or if srun itself somehow hung and never launched anything on the remote nodes.
  • Per a suggestion from Brian, Jeff ran slurmd in the foreground in verbose mode. We did get a clue here, but don't yet know what to make of it. Here's the log from the foreground slurmd when an MTT mpirun was invoked (this was the 10th mpirun that had run in this particular MTT run):
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug:  task_p_slurmd_launch_request: 1634454.10 0
slurmd: launch task 1634454.10 request from [email protected] (port 33517)
slurmd: debug3: state for jobid 1546119: ctime:1490332582 revoked:0 expires:0
slurmd: debug3: state for jobid 1549242: ctime:1490643927 revoked:0 expires:0
slurmd: debug3: state for jobid 1552625: ctime:1490923698 revoked:0 expires:0
slurmd: debug3: state for jobid 1622251: ctime:1496935165 revoked:0 expires:0
slurmd: debug3: state for jobid 1634452: ctime:1497786308 revoked:0 expires:0
slurmd: debug3: state for jobid 1634453: ctime:1497889737 revoked:1497889860 expires:1497889860
slurmd: debug3: state for jobid 1634454: ctime:1497889905 revoked:0 expires:0
slurmd: debug:  Checking credential with 276 bytes of sig data
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (mpi001), parent rank -1 (NONE), children 1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 1634454 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug:  Entering stepd_completion, range_first = 1, range_last = 1
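
For reference, here is a rough sketch of the reproduction recipe described in the notes above (and used to capture this log). The node count, hostnames, paths, the trivial-test executable name, and especially the MTT client invocation are placeholders, not a verbatim copy of what was run -- the real phase selection lives in Cisco's MTT .ini setup:

# 1. Get an allocation; the hang only reproduces when MTT itself runs
#    inside the allocation (node count here is just an example).
salloc -N 2

# 2. (For the log above) On one compute node, with the regular slurmd
#    stopped, run a slurmd in the foreground (as root) with extra
#    verbosity: -D keeps it in the foreground, each -v adds debug output.
slurmd -D -vvv

# 3. Inside the allocation, run the MTT client once through the MPI get,
#    MPI install, Test get, and Test build phases, then re-run just the
#    Test run phase (the trivial tests) repeatedly.  This line is only a
#    placeholder for however Cisco's setup invokes the client:
./client/mtt --file cisco-mtt.ini ...

# 4. For comparison: the same command run by hand inside the same
#    allocation does NOT hang (hypothetical -np and executable shown).
/path/to/mpirun --leave-session-attached -np 2 ./hello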

What is that REQUEST_SIGNAL_TASKS RPC near the end of the log? We can see slurmd sending signal 9 to the step -- but who asked for that? And why? And then why did srun just hang?

This is probably a good place to start when resuming the investigation.
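
When picking this back up, a few standard commands should narrow down whether the step ever launched and died, or whether srun got stuck waiting for something. This is only a sketch: the pgrep pattern, the remote hostname, and the job/step IDs (taken from the log above) are examples, and none of this identifies who sent the SIGKILL -- it just shows what SLURM thinks happened to the step.

# Process tree on the node where mpirun runs: confirm mpirun -> srun and
# see what that srun is (or is not) doing.
pstree -ap $(pgrep -f 'mpirun --leave-session-attached')

# Is there actually an orted on a remote node?  (mpi001 from the log.)
ssh mpi001 pgrep -lf orted

# What does SLURM think happened to the step srun launched (1634454.10)?
squeue --steps=1634454.10                      # still known to slurmctld?
sacct -j 1634454.10 -o JobID,State,ExitCode,Start,End
scontrol show step 1634454.10                  # only while the step exists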
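
Separately, the "works manually, hangs under MTT" difference in the notes above is still unexplained. One low-tech way to re-check it is to dump the environment and the linked libraries from both contexts and diff them; the paths and file names below are placeholders:

# From inside the MTT Test run (e.g., via a small wrapper script that MTT
# invokes instead of mpirun):
env | sort > /tmp/env.mtt
ldd /path/to/mpirun > /tmp/ldd.mtt

# From the manual invocation inside the same salloc:
env | sort > /tmp/env.manual
ldd /path/to/mpirun > /tmp/ldd.manual

# Compare:
diff /tmp/env.mtt /tmp/env.manual
diff /tmp/ldd.mtt /tmp/ldd.manual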
