"mpirun --leave-session-attached" hangs in Cisco MTT runs under SLURM #3726

Open
@jsquyres

Description

Over the past week or so, Cisco's MTT runs have been hanging (i.e., timing out) 100% of the time across master, v2.0.x, v2.1.x, and v3.0.x. Jeff+Ralph narrowed the problem down to the use of --leave-session-attached in Cisco's MTT setup (a recent addition, made in an attempt to track down a different problem). Removing --leave-session-attached fixed the timeouts.

We did a bunch of investigation to try to figure out why --leave-session-attached was causing hangs. Here are some notes (in no particular order) on what we found:

  • Jeff could only get the problem to manifest when running through MTT. Running the exact same commands outside MTT -- with the same MTT-compiled MPI install and the same MTT-compiled MPI executable -- did not result in a hang (!).
    • Jeff compared linked libraries, environment variables, etc., and could not find any difference between running through MTT and running manually. Something must be different, but we haven't figured out what it is (see the comparison sketch at the end of these notes).
  • However, running MTT inside of an salloc did reproduce the problem 100% of the time (but still: running the same /path/to/mpirun --leave-session-attached ... commands manually inside that same salloc did not reproduce the problem. Maddening!).
    • Hence, this is where most of the investigation focused -- e.g., getting an salloc and manually invoking the MTT client to MPI get, MPI install, Test get, and Test build, and then repeatedly invoking the MTT client to Test run (e.g., the trivial tests -- which are especially helpful because they have a short MTT timeout). A sketch of this recipe follows the slurmd log below.
    • These Test Run invocations will all time out, and you can use them to investigate what is going on.
  • It did not seem to matter if MTT was run via salloc or sbatch (one key difference being the location of mpirun: on the head node, or on the first node of the allocation).
  • When the hang occurs, we can see that mpirun is still running and that it has forked an srun to launch the remote orteds. However, no orteds are running on the remote nodes -- yet the srun is still running. Totally weird. (See the inspection sketch at the end of these notes.)
    • It's not clear if the remote nodes launched and immediately died, or if srun itself somehow hung and never launched anything on the remote nodes.
  • Per a suggestion from Brian, Jeff ran slurmd in the foreground in verbose mode. We did get a clue here, but don't yet know what to make of it. Here's the log from the foreground slurmd when an MTT mpirun was invoked (this was the 10th mpirun that had run in this particular MTT run):
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6001
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug:  task_p_slurmd_launch_request: 1634454.10 0
slurmd: launch task 1634454.10 request from [email protected] (port 33517)
slurmd: debug3: state for jobid 1546119: ctime:1490332582 revoked:0 expires:0
slurmd: debug3: state for jobid 1549242: ctime:1490643927 revoked:0 expires:0
slurmd: debug3: state for jobid 1552625: ctime:1490923698 revoked:0 expires:0
slurmd: debug3: state for jobid 1622251: ctime:1496935165 revoked:0 expires:0
slurmd: debug3: state for jobid 1634452: ctime:1497786308 revoked:0 expires:0
slurmd: debug3: state for jobid 1634453: ctime:1497889737 revoked:1497889860 expires:1497889860
slurmd: debug3: state for jobid 1634454: ctime:1497889905 revoked:0 expires:0
slurmd: debug:  Checking credential with 276 bytes of sig data
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (mpi001), parent rank -1 (NONE), children 1, depth 0, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug:  task_p_slurmd_reserve_resources: 1634454 0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug:  Sending signal 9 to step 1634454.10
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 5016
slurmd: debug3: Entering _rpc_step_complete
slurmd: debug:  Entering stepd_completion, range_first = 1, range_last = 1
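
For reference, here is a rough sketch of the reproduction recipe described in the notes above (and used to capture this log). The node count, hostnames, paths, the trivial-test executable name, and especially the MTT client invocation are placeholders, not a verbatim copy of what was run -- the real phase selection lives in Cisco's MTT .ini setup:

# 1. Get an allocation; the hang only reproduces when MTT itself runs
#    inside the allocation (node count here is just an example).
salloc -N 2

# 2. (For the log above) On one compute node, with the regular slurmd
#    stopped, run a slurmd in the foreground (as root) with extra
#    verbosity: -D keeps it in the foreground, each -v adds debug output.
slurmd -D -vvv

# 3. Inside the allocation, run the MTT client once through the MPI get,
#    MPI install, Test get, and Test build phases, then re-run just the
#    Test run phase (the trivial tests) repeatedly.  This line is only a
#    placeholder for however Cisco's setup invokes the client:
./client/mtt --file cisco-mtt.ini ...

# 4. For comparison: the same command run by hand inside the same
#    allocation does NOT hang (hypothetical -np and executable shown).
/path/to/mpirun --leave-session-attached -np 2 ./hello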

What is that REQUEST_SIGNAL_TASKS RPC near the end of the log? We can see slurmd sending signal 9 to the step -- but who asked for that? And why? And then why did srun just hang?

This is probably a good place to start when resuming the investigation.
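
When picking this back up, a few standard commands should narrow down whether the step ever launched and died, or whether srun got stuck waiting for something. This is only a sketch: the pgrep pattern, the remote hostname, and the job/step IDs (taken from the log above) are examples, and none of this identifies who sent the SIGKILL -- it just shows what SLURM thinks happened to the step.

# Process tree on the node where mpirun runs: confirm mpirun -> srun and
# see what that srun is (or is not) doing.
pstree -ap $(pgrep -f 'mpirun --leave-session-attached')

# Is there actually an orted on a remote node?  (mpi001 from the log.)
ssh mpi001 pgrep -lf orted

# What does SLURM think happened to the step srun launched (1634454.10)?
squeue --steps=1634454.10                      # still known to slurmctld?
sacct -j 1634454.10 -o JobID,State,ExitCode,Start,End
scontrol show step 1634454.10                  # only while the step exists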
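
Separately, the "works manually, hangs under MTT" difference in the notes above is still unexplained. One low-tech way to re-check it is to dump the environment and the linked libraries from both contexts and diff them; the paths and file names below are placeholders:

# From inside the MTT Test run (e.g., via a small wrapper script that MTT
# invokes instead of mpirun):
env | sort > /tmp/env.mtt
ldd /path/to/mpirun > /tmp/ldd.mtt

# From the manual invocation inside the same salloc:
env | sort > /tmp/env.manual
ldd /path/to/mpirun > /tmp/ldd.manual

# Compare:
diff /tmp/env.mtt /tmp/env.manual
diff /tmp/ldd.mtt /tmp/ldd.manual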
