Cannot return in multi-node MPI applications #429

@parrotsky

Description

Hi, first I would like to thank the contributors for providing such an elegant and easy-to-use library for profiling MPI programs.
My problem:
I built an MPI cluster on a LAN with up to 8 devices (Linux Ubuntu 20.04) following the MPI tutorial.
I want to use Caliper to profile my applications across multiple devices. Before that, I wrote a simple hello world to test whether it works.
The code is as below:

#include <mpi.h>
#include <stdio.h>
#include <caliper/cali.h>
#include <caliper/cali-manager.h>
// ...
// ...
int main(int argc, char** argv) {

	// Initialize the MPI environment
	cali::ConfigManager mgr;
	mgr.add("runtime-report,event-trace(output=trace.cali)");
	int provided;
	MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
	if (provided < MPI_THREAD_MULTIPLE) {
		// Error: MPI does not provide the needed threading level
		fprintf(stderr, "xxx MPI does not provide needed thread support!\n");
		MPI_Finalize();
		return -1;
	}

	//     MPI_Init(&argc, &argv);

	mgr.start(); 
	// ...
	// Get the number of processes
	int world_size;
	MPI_Comm_size(MPI_COMM_WORLD, &world_size);


	// Get the rank of the process
	int world_rank;
	//   CALI_MARK_BEGIN("iemann_slice_precompute");
	MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
	//CALI_MARK_END("iemann_slice_precompute");
	// Get the name of the processor
	char processor_name[MPI_MAX_PROCESSOR_NAME];
	int name_len;
	MPI_Get_processor_name(processor_name, &name_len);

	// Print off a hello world message
	printf("Hello world from processor %s, rank %d out of %d processors\n",
			processor_name, world_rank, world_size);

	// Flush Caliper output and finalize the MPI environment
	mgr.flush();
	mgr.stop();
	MPI_Finalize();
}

The program works perfectly with multiple processes on a single device:

sky@nx01:~/cloud$ mpirun -np 2 ./hello
Hello world from processor nx01, rank 0 out of 2 processors
Hello world from processor nx01, rank 1 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.000952      0.001182      0.001067 13.165525 
MPI_Get_processor_name      0.000133      0.000193      0.000163  2.011228 
Function               Count (min) Count (max) Time (min) Time (max) Time (avg) Time %    
                                 9          13   0.040653   0.040994   0.040823 92.516799 
MPI_Comm_dup                     2           2   0.001527   0.002249   0.001888  4.278705 
MPI_Recv                         4           4   0.000935   0.000935   0.000935  1.059478 
MPI_Comm_free                    1           1   0.000170   0.000287   0.000228  0.517841 
MPI_Get_processor_name           1           1   0.000170   0.000285   0.000228  0.515575 
MPI_Send                         4           4   0.000421   0.000421   0.000421  0.477048 
MPI_Finalize                     1           1   0.000069   0.000134   0.000102  0.230026 
MPI_Probe                        2           2   0.000186   0.000186   0.000186  0.210762 
MPI_Get_count                    2           2   0.000171   0.000171   0.000171  0.193766 

When I run it across two devices (nodes), the program cannot return normally and gets stuck somewhere:

sky@nx01:~/cloud$ mpirun -np 2 --host nx01,nx02 ./hello
Hello world from processor nx02, rank 1 out of 2 processors
Hello world from processor nx01, rank 0 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.003007      0.003007      0.003007 29.905520 
MPI_Get_processor_name      0.000132      0.000132      0.000132  1.312780 

Has anybody encountered the same issue or figured out where the bug lies?
Thanks a lot for answering.
