Describe the bug
The Balanced KMeans implementation uses a RAFT `get_workspace()` resource to allocate arrays on the order of `minibatch_size` within the `build_fine_clusters()` function (passed as `device_memory`), which then allocates `mc_trainset_buf` [`mesocluster_size_max x dim`]. That buffer is on the order of dataset size / n_clusters, i.e., orders of magnitude larger than the ~1 GB minibatch size. This triggers a `limiting_resource_adaptor.hpp:152: Exceeded memory limit` exception, because the default allocation limit is set to total device memory / 4.
To avoid this problem for large datasets (~ device memory size), the user must increase the number of (mesoscale) clusters. While increasing the number of clusters commensurate with the dataset size is generally advisable, I believe we should not artificially limit the allocation size when the user explicitly uses managed memory. Even if we do not remove the resource limiter on the workspace resource in general, we should at least remove it specifically for the `mc_trainset_buf` allocation, since there is no expectation that it is on the order of the minibatch size that is otherwise used to estimate the expected workspace resource needs.
Steps/Code to reproduce bug
The issue can be reproduced with the test script posted in this issue.
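That script is not reproduced here. Below is a minimal standalone sketch of the same failure mode using only RMM, assuming a `limiting_resource_adaptor` capped at total device memory / 4 like the workspace default described above; the allocation size `big_alloc` is a hypothetical stand-in for `mc_trainset_buf`.

```cpp
// Minimal sketch (not the attached test script): a limiting_resource_adaptor
// capped at total_device_memory / 4 rejects a single large allocation, even
// though the upstream resource is managed memory and could satisfy it.
#include <rmm/mr/device/limiting_resource_adaptor.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <exception>
#include <iostream>

int main()
{
  std::size_t free_mem{}, total_mem{};
  cudaMemGetInfo(&free_mem, &total_mem);

  rmm::mr::managed_memory_resource upstream;
  // Cap allocations at 1/4 of device memory, mirroring the workspace default.
  rmm::mr::limiting_resource_adaptor<rmm::mr::managed_memory_resource> limited{
    &upstream, total_mem / 4};

  // Hypothetical stand-in for mc_trainset_buf: larger than the cap.
  std::size_t const big_alloc = total_mem / 2;

  try {
    void* p = limited.allocate(big_alloc);  // throws "Exceeded memory limit"
    limited.deallocate(p, big_alloc);
  } catch (std::exception const& e) {
    std::cerr << "allocation failed: " << e.what() << std::endl;
  }
}
```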
Expected behavior
I would expect not to run into a device resource allocation failure before device memory is actually exhausted, and I would expect not to encounter any OOM or resource limiter issues when using a managed memory allocator.
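For illustration, here is a sketch of that expectation under the assumption that the large buffer is allocated from a plain `rmm::mr::managed_memory_resource` with no limiting adaptor: managed memory can oversubscribe the device and page as needed, so the same allocation goes through.

```cpp
// Sketch of the expected behaviour (assumption: no limiting adaptor wraps the
// managed resource). cudaMallocManaged-backed allocations can exceed the
// device_memory / 4 cap -- and even free device memory -- by paging to host.
#include <rmm/mr/device/managed_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <iostream>

int main()
{
  std::size_t free_mem{}, total_mem{};
  cudaMemGetInfo(&free_mem, &total_mem);

  rmm::mr::managed_memory_resource managed;

  // Same hypothetical size as in the reproduction sketch above.
  std::size_t const big_alloc = total_mem / 2;

  void* p = managed.allocate(big_alloc);
  std::cout << "allocated " << big_alloc << " bytes of managed memory\n";
  managed.deallocate(p, big_alloc);
}
```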
Environment details (please complete the following information):
`docker pull` & `docker run` commands used
Additional context
Add any other context about the problem here.