There are applications that require scalable global collective communication (e.g., allreduce for matrix-vector products, as in CG). Currently, these reductions are not efficient in TTG with reduction terminals, because they form a star topology and carry no notion of collectiveness. TTG should expand its set of collective operations and could even integrate MPI collectives for scalability. It could look like this:
```cpp
ttg::Edge<void, double> rin, rout;
auto reduce_tt   = ttg::coll::reduce(MPI_COMM_WORLD, rin, rout, 1, MPI_SUM, root); // sum over 1 element of type double
auto producer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(rin));
auto consumer_tt = ttg::make_tt(..., ttg::edges(rout), ...); // may distribute the value further
```
The input and output edges must have key type `void` because there can be only one concurrent instance per collective TT. When creating the TT we duplicate the communicator, so multiple collective TTs can be active at the same time. The backend will need a way to suspend the task and check for the operation to complete, so as not to block the thread in MPI.
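The suspend-and-poll requirement maps naturally onto MPI's nonblocking collectives: the backend starts e.g. `MPI_Iallreduce` on the duplicated communicator and periodically probes the request with `MPI_Test` from the scheduler loop instead of blocking in `MPI_Allreduce`. A minimal sketch of that pattern follows; everything except the MPI calls themselves (`PendingCollective`, `start_reduce`, `poll`) is an illustrative assumption, not existing TTG code.

```cpp
#include <mpi.h>

// State for one in-flight collective; buffers must stay alive until the
// request completes, so both the local contribution and the result live here.
struct PendingCollective {
  double local = 0.0;
  double result = 0.0;
  MPI_Request req = MPI_REQUEST_NULL;
};

// Start the nonblocking reduction on the (duplicated) communicator.
// Returns immediately; the calling task can now be suspended.
void start_reduce(double local, PendingCollective& pc, MPI_Comm comm) {
  pc.local = local;
  MPI_Iallreduce(&pc.local, &pc.result, 1, MPI_DOUBLE, MPI_SUM, comm, &pc.req);
}

// Called periodically by the scheduler; returns true once the collective
// has completed and the suspended consumer task can be resumed with pc.result.
bool poll(PendingCollective& pc) {
  int done = 0;
  MPI_Test(&pc.req, &done, MPI_STATUS_IGNORE);
  return done != 0;
}
```

The key property is that `MPI_Test` never blocks, so a worker thread can interleave polling with running other ready tasks.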
Straightforward operations to consider:
- Reduce and allreduce
- Broadcast (we have `ttg::bcast`, but it is not implemented on top of the underlying MPI collective)
There should probably be an overload taking `std::vector` for count > 1.
The following would need some more thought on how to describe the difference between input and output counts (and a use case):
- Gather and scatter
- Alltoall
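For gather and scatter the count asymmetry could surface directly in the factory signature, analogous to the `ttg::coll::reduce` example above. The following signatures are purely hypothetical and only illustrate how the per-rank count relates to the root-side `std::vector` size:

```cpp
// Hypothetical extensions of the ttg::coll namespace; not part of TTG today.
// Scatter: the root's input is a std::vector<double> of size count * comm_size,
// and every rank's output is a std::vector<double> of size count.
auto scatter_tt = ttg::coll::scatter(MPI_COMM_WORLD, sin, sout, count, root);

// Gather is the mirror image: each rank contributes count elements, and only
// the root's output vector holds count * comm_size elements.
auto gather_tt = ttg::coll::gather(MPI_COMM_WORLD, gin, gout, count, root);
```

Alltoall would need both counts to be per-rank, which is another reason the input/output count description deserves its own design discussion.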