-
Notifications
You must be signed in to change notification settings - Fork 19
Decouple device flow count from parameters #722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
91b0ef2 to
561854a
Compare
Make PARSEC_MAX_DEVICE_FLOWS configurable and select a proper integer type, up to int128_t. Make sure the flow mask is properly checked. Signed-off-by: Joseph Schuchart <[email protected]>
561854a to
e3a7829
Compare
| int elem_sz; | ||
| int i; | ||
| for(i = 0; i < task->task_class->nb_flows; i++){ | ||
| if(flow_mask & (1U << i)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this changes public interfaces, stricto sensu would require bump to v5; however, with the default build parameters, this is strictly equivalent to the 4.0 API.
|
I am willing to let this go in v4.1 despite the in-principle API breakage since it is backward compatible as long as you don't compile with custom parameters (notably Dplasma still works as intended). Note that in the end, even if we were to bump the API to v5, compiling with different sizes would make the ABI of stage_in/stage_out inconsistent between builds. But we don't have an ABI so we don't care(?). Please chime in. |
bosilca
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should merge this because it creates an unstable API.
Moreover, I think it is unreasonable to expect a function call with 128 arguments because at the the a task is just a delegated function call. I could understand extreme cases, but over 64 arguments (and I'm being overly generous) it is costly to manage the copies and tranfers for such tasks.
Instead, we can change all the places we use dependency masks to allow for 64 bits masks, but then keep it there. This will give us a stable API and you a little more giggle rooms with tasks with large number of inputs.
| gpu_task->sources[flow->flow_index] = candidate; /* save the candidate for release on transfer completion */ | ||
| /* Push data into the GPU from the source device */ | ||
| int rc = gpu_task->stage_in ? gpu_task->stage_in(gpu_task, (1U << flow->flow_index), gpu_stream): PARSEC_SUCCESS; | ||
| int rc = gpu_task->stage_in ? gpu_task->stage_in(gpu_task, PARSEC_FLOW_MASK(flow->flow_index), gpu_stream): PARSEC_SUCCESS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely problematic, as we are looking to call a function that the user potentially registered. This is well worst than an API breakage, this make the API dependent on parsec build type parameters.
|
I can see that this the API change is a problem. I'll need to find a different way of expressing this then. The argument that this is in any way connected to the number of arguments to the task function is overly simplistic and not at all relevant for TTG (we can create them inside the task and pass them as array to the kernel as long as we stay below the 2kB argument cap). I'm not arguing for 128 inputs for the giggles but because I expect them to become relevant, e.g., for combining two levels of tree traversal in 3D MRA or for implementing 6D MRA, which will require ~70 separate device memory buffers. Closing this, will think of something else. |
Make PARSEC_MAX_DEVICE_FLOWS configurable and select a proper integer type, up to int128_t. Make sure the flow mask is properly checked.
This is an attempt at allowing an increase of the number of device inputs up to 128. The use of 128bit integers for flows requires proper handling of integer types so this PR introduces
parsec_flow_mask_tand provides macros to set and test the flow mask.Related: #694 where the flows are separated out of the gpu task structure.