Skip to content

Conversation

@devreal
Copy link
Contributor

@devreal devreal commented Dec 23, 2024

Make PARSEC_MAX_DEVICE_FLOWS configurable and select a proper integer type, up to int128_t. Make sure the flow mask is properly checked.

This is an attempt at allowing an increase of the number of device inputs up to 128. The use of 128bit integers for flows requires proper handling of integer types so this PR introduces parsec_flow_mask_t and provides macros to set and test the flow mask.

Related: #694 where the flows are separated out of the gpu task structure.

@devreal devreal requested a review from a team as a code owner December 23, 2024 18:07
Make PARSEC_MAX_DEVICE_FLOWS configurable and select a proper integer
type, up to int128_t. Make sure the flow mask is properly checked.


Signed-off-by: Joseph Schuchart <[email protected]>
@abouteiller abouteiller added this to the v4.1 milestone Mar 14, 2025
int elem_sz;
int i;
for(i = 0; i < task->task_class->nb_flows; i++){
if(flow_mask & (1U << i)){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API break?

Copy link
Contributor

@abouteiller abouteiller Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this changes public interfaces, stricto sensu would require bump to v5; however, with the default build parameters, this is strictly equivalent to the 4.0 API.

@abouteiller abouteiller added API Change Change to the public API, backward incompatible (version major bump) enhancement New feature or request labels Mar 14, 2025
@abouteiller
Copy link
Contributor

abouteiller commented Mar 14, 2025

I am willing to let this go in v4.1 despite the in-principle API breakage since it is backward compatible as long as you don't compile with custom parameters (notably Dplasma still works as intended).

Note that in the end, even if we were to bump the API to v5, compiling with different sizes would make the ABI of stage_in/stage_out inconsistent between builds. But we don't have an ABI so we don't care(?).

Please chime in.

abouteiller added a commit to abouteiller/dplasma that referenced this pull request Mar 14, 2025
abouteiller added a commit to abouteiller/dplasma that referenced this pull request Mar 14, 2025
Copy link
Contributor

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should merge this because it creates an unstable API.

Moreover, I think it is unreasonable to expect a function call with 128 arguments because at the the a task is just a delegated function call. I could understand extreme cases, but over 64 arguments (and I'm being overly generous) it is costly to manage the copies and tranfers for such tasks.

Instead, we can change all the places we use dependency masks to allow for 64 bits masks, but then keep it there. This will give us a stable API and you a little more giggle rooms with tasks with large number of inputs.

gpu_task->sources[flow->flow_index] = candidate; /* save the candidate for release on transfer completion */
/* Push data into the GPU from the source device */
int rc = gpu_task->stage_in ? gpu_task->stage_in(gpu_task, (1U << flow->flow_index), gpu_stream): PARSEC_SUCCESS;
int rc = gpu_task->stage_in ? gpu_task->stage_in(gpu_task, PARSEC_FLOW_MASK(flow->flow_index), gpu_stream): PARSEC_SUCCESS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely problematic, as we are looking to call a function that the user potentially registered. This is well worst than an API breakage, this make the API dependent on parsec build type parameters.

@devreal
Copy link
Contributor Author

devreal commented Mar 17, 2025

I can see that this the API change is a problem. I'll need to find a different way of expressing this then. The argument that this is in any way connected to the number of arguments to the task function is overly simplistic and not at all relevant for TTG (we can create them inside the task and pass them as array to the kernel as long as we stay below the 2kB argument cap). I'm not arguing for 128 inputs for the giggles but because I expect them to become relevant, e.g., for combining two levels of tree traversal in 3D MRA or for implementing 6D MRA, which will require ~70 separate device memory buffers.

Closing this, will think of something else.

@devreal devreal closed this Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API Change Change to the public API, backward incompatible (version major bump) enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants