Allocate host copy on demand once data is moved to the host #711
abouteiller merged 1 commit into ICLDisco:master
Conversation
Force-pushed from 2e69d6e to 240cb52
We could replace the special handling of arenas in https://github.com/ICLDisco/parsec/blob/master/parsec/data.c#L49 with this mechanism.
Provide callbacks on the parsec_data_copy_t to allocate and release the host memory. If a data has been evicted and there is only one device using it, we can safely release the memory back to the application. Otherwise the memory will remain allocated. Signed-off-by: Joseph Schuchart <[email protected]>
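A minimal sketch of what such callbacks could look like from the application side (the callback signatures and function names here are assumptions for illustration; only the alloc_cb field is visible in the diff hunks below):

```c
#include <stdlib.h>
#include <stddef.h>

/* Forward declaration standing in for the real type from parsec/data.h. */
typedef struct parsec_data_copy_s parsec_data_copy_t;

/* Hypothetical application-side allocator, e.g. backed by a pinned
 * memory pool; invoked only when a host copy's backing memory is
 * actually needed (eviction or pushout). */
static void *my_host_alloc(parsec_data_copy_t *copy, size_t size)
{
    (void)copy;
    return malloc(size);
}

/* Hypothetical release hook: per the commit message, only called when
 * a single device still uses the data, so the memory can safely be
 * handed back to the application. */
static void my_host_release(parsec_data_copy_t *copy, void *ptr)
{
    (void)copy;
    free(ptr);
}
```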
Force-pushed from 240cb52 to 04f24df
bosilca left a comment
If I understand this PR correctly, you are proposing a storage mechanism outside of device memory for data copies (while the data copy remains valid). In this case I would not use the main memory as the backend storage, but a new specialized device for storage. If such a device exists, then long-lasting copies shall be stowed there; otherwise they will go in main memory.
```c
for(int i = 0; i < task->task_class->nb_flows; i++) {
    if( !(flow_mask & (1U << i)) ) continue;
    source = gtask->sources[i];
    assert(source->device_private != NULL);
```
Why is this needed? A predecessor can provide a NULL data on an otherwise valid dependency. How is that case handled?
We cannot stage in data that does not exist. If we get here (because of RW or RO), the application requested data to be pushed onto the device. If the source has no memory, then we cannot make up data. I consider this an error.
```c
    dir = parsec_device_gpu_transfer_direction_d2d;
} else {
    dir = parsec_device_gpu_transfer_direction_d2h;
    if (dest->device_private == NULL && dest->alloc_cb != NULL) {
```
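A hedged reading of how this branch might continue (only the condition above is in the hunk; the callback invocation and its signature below are assumptions, matching the sketch after the commit message):

```c
/* Sketch, not the PR's code: before a d2h transfer, a host copy whose
 * backing memory was released on eviction (device_private == NULL) has
 * its memory re-allocated on demand via the application callback. */
if (dest->device_private == NULL && dest->alloc_cb != NULL) {
    dest->device_private = dest->alloc_cb(dest, dest->original->nb_elts); /* assumed signature */
    assert(dest->device_private != NULL);
}
```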
So the copy on the main memory somehow exists, but it points to non-existing private data (i.e., it has never been used before). Why does it exist then? And why do you want to allocate it via a mechanism different from the traditional arena (which is more or less similar to this, but instead of having a specialized alloc/free per copy we have it per arena)?
Yes. We always have a host copy, which may not have device_private memory allocated. That is the way we currently provide inputs for flows (instead of data). Unless we overhaul PaRSEC, we need to pass a data copy. And I don't want to allocate device_private memory for all of them.
The only reason we currently pass the CPU copy between tasks is because PaRSEC lacks a proper mechanism to acquire the accelerator data from the CPU (other than the pushout mechanism at the end of the execution of the GPU task). But with the PR that allows data to be sent and received directly from/to GPU memory, there is no reason to ever need the data on the CPU, except for the final step of the computation before returning it to the user (in which case we could force a pushout).
So then an application creates GPU copies and allocates memory on randomly selected devices for all submitted tasks? That defeats the automatic device selection and creates massive pressure on the zone allocators, for no good reason. The clean way is to pass parsec_data_t* instead of data copies as inputs for device tasks and have the runtime figure out which data copies are current and where to allocate any new copies. But that requires a significant rewrite for which no one has time.
If we pass the parsec_data_t* between tasks, then we can only have a single version of the data at any moment, because only the most up-to-date version can be found via the data_t.
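For readers following the thread, an abridged sketch of the two structures under discussion (field list trimmed and approximate, not the verbatim PaRSEC definitions):

```c
#include <stddef.h>
#include <stdint.h>

/* One copy of a data item on one device; this PR adds host alloc/release
 * callbacks to it (names approximate). */
typedef struct parsec_data_copy_s {
    void    *device_private;  /* backing memory, may be NULL once evicted */
    uint8_t  device_index;    /* device this copy lives on */
    uint32_t version;         /* coherence version */
} parsec_data_copy_t;

/* The logical data item: one copy slot per device. Only the copy with
 * the highest version is current, which is bosilca's single-version
 * point above when passing parsec_data_t* between tasks. */
typedef struct parsec_data_s {
    size_t              nb_elts;          /* size of the data in bytes */
    parsec_data_copy_t *device_copies[];  /* flexible array, one per device */
} parsec_data_t;
```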
What would be such a device? Evicting to other devices is an option, but they might suffer the same memory pressure as the device from which we are evicting, so I'm not sure that is useful... Host memory is the memory that is least used when everything is offloaded to devices, so evicting there makes the most sense.
It would be a specialized storage device, such as disk or NVMe, something an NVLink away from the accelerator, or, at worst, the main memory. No computational support, but a larger memory zone that can be used to relieve pressure from the other, compute-capable but memory-limited, devices.
devreal left a comment
This is now used in TTG for the MRA implementation to cut out almost all host allocations and allocate host memory on demand for evicted data or data that is pushed out. There, the host memory is allocated from a pinned memory pool.
Changes needed to make it work are in https://github.com/devreal/ttg/compare/serialize-buffer-query...devreal:ttg:serialize-buffer-query-and-copy-alloc?expand=1.
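As an illustration of the pinned-pool idea (a hedged sketch: cudaMallocHost/cudaFreeHost are real CUDA runtime calls, but the function names here are made up, and a real pool would recycle buffers rather than allocate on every request):

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Page-locked (pinned) host memory keeps H2D/D2H copies fast and
 * async-capable, which matters on the eviction/pushout paths. */
static void *pinned_host_alloc(size_t size)
{
    void *ptr = NULL;
    if( cudaMallocHost(&ptr, size) != cudaSuccess )
        return NULL;
    return ptr;
}

static void pinned_host_release(void *ptr)
{
    cudaFreeHost(ptr);
}
```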