spawn: the final one to fix in ibm test suite #13127

Open
@hppritcha

With much effort, we've fixed up problems in Open MPI and PRRTe/pmix so that, at least on one node, all but one of our ibm/dynamic tests run.

The last one, which hangs, is the no-disconnect test. This test spawns a tree of processes: a grandparent spawns parents, then the parents spawn grandkids.

The problem is those parents! They hang around and have to be killed off manually.

Why is this, you may ask?

Well, the reason lies in this code:

int ompi_dpm_dyn_finalize(void)
{
    int i,j=0, max=0, num_dyns = 0;
    ompi_dpm_disconnect_obj **objs=NULL;
    ompi_communicator_t *comm=NULL;

    fprintf(stderr, "process %d had %d dyncomms to disconnect\n", getpid(), ompi_comm_num_dyncomm);
    if (1 <ompi_comm_num_dyncomm) {
        objs = (ompi_dpm_disconnect_obj**)malloc(ompi_comm_num_dyncomm *
                               sizeof(ompi_dpm_disconnect_obj*));
        if (NULL == objs) {
            return OMPI_ERR_OUT_OF_RESOURCE;
        }

        max = ompi_comm_get_num_communicators();
        for (i=3; i<max; i++) {
            comm = ompi_comm_lookup(i);
            if (NULL != comm &&  OMPI_COMM_IS_DYNAMIC(comm)) {
                num_dyns++;
                if (comm->c_name != NULL) {
                     fprintf(stderr, "process %d parent comm %d being counted as dynmaic %s\n", getpid(), i, comm->c_name);
                }

            }
        }

        fprintf(stderr, "process %d had %d dyncomms to disconnect num dyn comms %d\n", getpid(), ompi_comm_num_dyncomm, num_dyns);

        for (i=3; i<max; i++) {
            comm = ompi_comm_lookup(i);
            if (NULL != comm &&  OMPI_COMM_IS_DYNAMIC(comm)) {
                objs[j++] = disconnect_init(comm);
            }
        }

        if (j != ompi_comm_num_dyncomm) {
            cleanup_dpm_disconnect_objs(objs, j);
            return OMPI_ERROR;
        }

        disconnect_waitall(ompi_comm_num_dyncomm, objs);
        cleanup_dpm_disconnect_objs(objs, ompi_comm_num_dyncomm);
    }

    return OMPI_SUCCESS;
}

The grandparent and the grandchildren each have only a single dyn intercomm, while the parents have 3!
So the grandparent and the grandchildren zoom past this sync-up operation: the only dyn comm they think they have is the parent comm set up by the spawn, and a count of 1 does not pass the 1 < ompi_comm_num_dyncomm check. The parents do pass it, and they hang there.
This routine appears to have had this structure since the dawn of the OMPI era. Notice the magic numbers, like the '3' in the loop over dyn comms.
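
To make the mismatch concrete, here is a minimal sketch that models only the guard at the top of ompi_dpm_dyn_finalize. The helper names enters_disconnect_path and peers_agree are hypothetical and not part of Open MPI, and the sketch assumes the disconnect step only completes when the processes on both sides of an intercomm take part in it.

```c
#include <stdbool.h>

/* Hypothetical model of the guard in ompi_dpm_dyn_finalize():
 * true when a process with this many dynamic communicators
 * would enter the disconnect/wait path. */
bool enters_disconnect_path(int num_dyncomm)
{
    return 1 < num_dyncomm; /* same comparison as the quoted code */
}

/* Hypothetical check: true when the two sides of an intercomm make
 * the same decision. Assumption: if they disagree, the side that
 * entered the path waits for a peer that never shows up. */
bool peers_agree(int local_dyncomm, int remote_dyncomm)
{
    return enters_disconnect_path(local_dyncomm)
        == enters_disconnect_path(remote_dyncomm);
}
```

With the counts reported above, enters_disconnect_path(1) is false for the grandparent and the grandchildren while enters_disconnect_path(3) is true for the parents, so peers_agree(1, 3) is false on both the grandparent/parent and the parent/grandchild links.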

Anyway, this points to a basic problem with the way OMPI itself handles cleaning up dynamic communicators within MPI_Finalize. If the app does the disconnecting itself, this hang is not observed.
