There are applications that require scalable global collective communication (e.g., allreduce for matrix-vector products, as in CG). Currently, these reductions are not efficient in TTG with reduction terminals, because they form a star topology and carry no notion of collectiveness. TTG should expand its set of collective operations and could even integrate MPI collectives for scalability. It could look like this:
```cpp
ttg::Edge<void, double> rin, rout;
auto reduce_tt   = ttg::coll::reduce(MPI_COMM_WORLD, rin, rout, 1, MPI_SUM, root); // sum over 1 element of type double
auto producer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(rin));
auto consumer_tt = ttg::make_tt(..., ttg::edges(rout), ...); // may distribute the value further
```
The input and output edges must have key type `void` because there can be only one concurrent instance per collective TT. When creating the TT we duplicate the communicator, so multiple collective TTs can be active at the same time. The backend will need a way to suspend the task and check for the operation to complete, so as not to block the thread in MPI.
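The suspend-and-poll requirement maps naturally onto MPI's nonblocking collectives: the backend starts e.g. `MPI_Iallreduce` on the duplicated communicator and periodically probes the request with `MPI_Test` from the scheduler loop instead of blocking in `MPI_Allreduce`. A minimal sketch of that pattern follows; everything except the MPI calls themselves (`PendingCollective`, `start_reduce`, `poll`) is an illustrative assumption, not existing TTG code.

```cpp
#include <mpi.h>

// State for one in-flight collective; buffers must stay alive until the
// request completes, so both the local contribution and the result live here.
struct PendingCollective {
  double local = 0.0;
  double result = 0.0;
  MPI_Request req = MPI_REQUEST_NULL;
};

// Start the nonblocking reduction on the (duplicated) communicator.
// Returns immediately; the calling task can now be suspended.
void start_reduce(double local, PendingCollective& pc, MPI_Comm comm) {
  pc.local = local;
  MPI_Iallreduce(&pc.local, &pc.result, 1, MPI_DOUBLE, MPI_SUM, comm, &pc.req);
}

// Called periodically by the scheduler; returns true once the collective
// has completed and the suspended consumer task can be resumed with pc.result.
bool poll(PendingCollective& pc) {
  int done = 0;
  MPI_Test(&pc.req, &done, MPI_STATUS_IGNORE);
  return done != 0;
}
```

The key property is that `MPI_Test` never blocks, so a worker thread can interleave polling with running other ready tasks.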
Straightforward operations to consider:
- Reduce and allreduce
- Broadcast (we have `ttg::bcast`, but it is not implemented on top of the underlying MPI collective)
There should probably be an overload taking `std::vector` for count > 1.
The following would need some more thought on how to describe the difference between input and output counts (and a use case):
- Gather and scatter
- Alltoall
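For gather and scatter the count asymmetry could surface directly in the factory signature, analogous to the `ttg::coll::reduce` example above. The following signatures are purely hypothetical and only illustrate how the per-rank count relates to the root-side `std::vector` size:

```cpp
// Hypothetical extensions of the ttg::coll namespace; not part of TTG today.
// Scatter: the root's input is a std::vector<double> of size count * comm_size,
// and every rank's output is a std::vector<double> of size count.
auto scatter_tt = ttg::coll::scatter(MPI_COMM_WORLD, sin, sout, count, root);

// Gather is the mirror image: each rank contributes count elements, and only
// the root's output vector holds count * comm_size elements.
auto gather_tt = ttg::coll::gather(MPI_COMM_WORLD, gin, gout, count, root);
```

Alltoall would need both counts to be per-rank, which is another reason the input/output count description deserves its own design discussion.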