Cannot return in multi-node MPI applications #429

@parrotsky

Description

Hi, first I would like to thank the contributors for providing such an elegant and easy-to-use library for profiling MPI programs.
My problem:
I built an MPI cluster on a LAN with up to 8 devices (Linux Ubuntu 20.04) following the MPI tutorial.
I want to use Caliper to profile my applications across multiple devices. Before that, I wrote a simple hello world to test whether it works.
The code is as below:

#include <mpi.h>
#include <stdio.h>
#include <caliper/cali.h>
#include <caliper/cali-manager.h>
// ...
// ...
int main(int argc, char** argv) {

	// Initialize the MPI environment
	cali::ConfigManager mgr;
	mgr.add("runtime-report,event-trace(output=trace.cali)");
	int provided;
	MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
	if (provided < MPI_THREAD_MULTIPLE) {
		// Error: MPI does not provide the needed threading level
		fprintf(stderr, "xxx MPI does not provide needed thread support!\n");
		MPI_Finalize();
		return -1;
	}

	//     MPI_Init(&argc, &argv);

	mgr.start(); 
	// ...
	// Get the number of processes
	int world_size;
	MPI_Comm_size(MPI_COMM_WORLD, &world_size);


	// Get the rank of the process
	int world_rank;
	//   CALI_MARK_BEGIN("iemann_slice_precompute");
	MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
	//CALI_MARK_END("iemann_slice_precompute");
	// Get the name of the processor
	char processor_name[MPI_MAX_PROCESSOR_NAME];
	int name_len;
	MPI_Get_processor_name(processor_name, &name_len);

	// Print off a hello world message
	printf("Hello world from processor %s, rank %d out of %d processors\n",
			processor_name, world_rank, world_size);

	// Flush Caliper output and finalize the MPI environment
	mgr.flush();
	mgr.stop();
	MPI_Finalize();
}

The program works perfectly with multiple processes on a single device:

sky@nx01:~/cloud$ mpirun -np 2 ./hello
Hello world from processor nx01, rank 0 out of 2 processors
Hello world from processor nx01, rank 1 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.000952      0.001182      0.001067 13.165525 
MPI_Get_processor_name      0.000133      0.000193      0.000163  2.011228 
Function               Count (min) Count (max) Time (min) Time (max) Time (avg) Time %    
                                 9          13   0.040653   0.040994   0.040823 92.516799 
MPI_Comm_dup                     2           2   0.001527   0.002249   0.001888  4.278705 
MPI_Recv                         4           4   0.000935   0.000935   0.000935  1.059478 
MPI_Comm_free                    1           1   0.000170   0.000287   0.000228  0.517841 
MPI_Get_processor_name           1           1   0.000170   0.000285   0.000228  0.515575 
MPI_Send                         4           4   0.000421   0.000421   0.000421  0.477048 
MPI_Finalize                     1           1   0.000069   0.000134   0.000102  0.230026 
MPI_Probe                        2           2   0.000186   0.000186   0.000186  0.210762 
MPI_Get_count                    2           2   0.000171   0.000171   0.000171  0.193766 

When I run it across two devices (nodes), the program cannot return normally and gets stuck somewhere:

sky@nx01:~/cloud$ mpirun -np 2 --host nx01,nx02 ./hello
Hello world from processor nx02, rank 1 out of 2 processors
Hello world from processor nx01, rank 0 out of 2 processors
Path                   Min time/rank Max time/rank Avg time/rank Time %    
MPI_Comm_dup                0.003007      0.003007      0.003007 29.905520 
MPI_Get_processor_name      0.000132      0.000132      0.000132  1.312780 

Has anybody encountered the same issue or figured out where the bug lies?
Thanks a lot for answering.
