We got a couple of issues, notably with `SMOTENC`, where large datasets lead to a `MemoryError`.
Here I will add a couple of points that could be addressed in the future:
- Check in the `SMOTENC` class whether converting a dataset from sparse to dense is required (#752, #768, #688, #667). A sketch of such a guard is shown after this list.
- A subset of the samplers could be implemented in Dask. We should probably prototype in `imblearn` before contributing it upstream (dask/dask-ml#317). A rough Dask sketch follows as well.
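
For the first point, a minimal sketch of what the check could look like: estimate the memory a dense copy would need and refuse the conversion when it exceeds a budget. The helper name `densify_if_affordable` and the `max_bytes` threshold are hypothetical, not part of `imblearn`'s API:

```python
# Minimal sketch: guard a sparse-to-dense conversion behind a memory
# estimate. `densify_if_affordable` and `max_bytes` are hypothetical
# names, not imblearn API.
import numpy as np
from scipy import sparse


def densify_if_affordable(X, max_bytes=2 * 1024**3):
    """Return a dense copy of X only if it would fit within max_bytes."""
    if not sparse.issparse(X):
        return np.asarray(X)
    n_rows, n_cols = X.shape
    # Memory a dense ndarray of the same shape and dtype would occupy.
    dense_bytes = n_rows * n_cols * X.dtype.itemsize
    if dense_bytes > max_bytes:
        raise MemoryError(
            f"Dense copy would need ~{dense_bytes / 1024**3:.1f} GiB "
            f"(> {max_bytes / 1024**3:.1f} GiB budget); keep X sparse."
        )
    return X.toarray()
```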
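For the Dask point, a rough prototype could wrap an existing sampler with `map_partitions`, assuming that balancing each partition locally (rather than globally) is acceptable and that every partition contains samples of all classes. `dask_random_under_sample` is an illustrative name, not an existing `imblearn` or `dask-ml` function:

```python
# Rough sketch of a Dask-backed sampler prototype: apply imblearn's
# RandomUnderSampler independently to each partition. Balancing is
# local to each partition (an approximation of global resampling),
# and every partition is assumed to contain all classes.
import dask.dataframe as dd
from imblearn.under_sampling import RandomUnderSampler


def _resample_partition(df, target):
    # Under-sample the majority class(es) within one pandas partition.
    X, y = RandomUnderSampler().fit_resample(df.drop(columns=[target]), df[target])
    return X.assign(**{target: y})


def dask_random_under_sample(ddf, target):
    """Balance each partition of `ddf` on its `target` column."""
    # meta tells dask that the output schema matches the input schema.
    return ddf.map_partitions(_resample_partition, target, meta=ddf._meta)


# Example usage (hypothetical data):
# ddf = dd.read_parquet("data.parquet")
# balanced = dask_random_under_sample(ddf, target="label")
```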