Description
I've been having an issue using multiprocessing to filter the entire 2005-2022 dataset, and I won't be able to limit it to just one subreddit. I'm currently working through a problem where combine_folder_multiprocess hangs. I ran into that a few times with smaller chunks of the Reddit data, but there I could just kill and restart it and it would pick up where it left off; that hasn't worked with the 2TB dataset. Debugging this is made harder by multiprocessing.Pool's tendency to fail silently (especially if the OOM killer kicks in), whereas ProcessPoolExecutor raises a BrokenProcessPool exception. The two have effectively the same features, but ProcessPoolExecutor is probably what will get the most updates going forward (a rough sketch of the port is below these links).
https://stackoverflow.com/questions/65115092/occasional-deadlock-in-multiprocessing-pool
https://bugs.python.org/issue22393#msg315684
https://stackoverflow.com/questions/24896193/whats-the-difference-between-pythons-multiprocessing-and-concurrent-futures
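To illustrate what I mean, here's a minimal sketch of the ProcessPoolExecutor pattern. The process_file function and run_all wrapper are hypothetical stand-ins, not the actual combine_folder_multiprocess internals; the point is just that a worker dying (e.g. to the OOM killer) surfaces as BrokenProcessPool instead of a silent hang.

```python
import concurrent.futures
from concurrent.futures.process import BrokenProcessPool


def process_file(path):
    # Hypothetical per-file work; the real script would filter/combine here.
    return path, "ok"


def run_all(paths, workers):
    results = []
    # Unlike multiprocessing.Pool, ProcessPoolExecutor raises
    # BrokenProcessPool if a worker process dies unexpectedly.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(process_file, p): p for p in paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results.append(future.result())
            except BrokenProcessPool:
                # A worker was killed (OOM, segfault, etc.); fail loudly
                # rather than hanging so the run can be restarted.
                print(f"worker pool broke while processing {path}")
                raise
            except Exception as exc:
                print(f"{path} failed: {exc}")
    return results
```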
Other than that suggestion (and I'll send a PR if I end up porting it over and it works well), I'll update this with what ends up working. But how much RAM does the system where you process the entire dataset have? Right now the machine I'm using has 32 GB, and I gave it 20 workers because I have 24 cores and wanted to keep using my computer while it was running. I could easily give the machine more; it's a WSL VM currently assigned half my system memory. Would you expect 10 vs. 20 workers, 32 vs. 64 GB of RAM, etc., to have a major effect on whether the script completes?
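In the meantime I'm thinking of capping the worker count by memory as well as cores, along these lines. The per_worker_gb number is just a guess at peak RSS per worker (I'd measure it on a small run first), and the memory probe is Linux/WSL-only:

```python
import os


def pick_worker_count(per_worker_gb=1.5, reserve_cores=4):
    """Cap workers by both CPU and memory so the OOM killer is less
    likely to take a worker out mid-run. per_worker_gb is an assumed
    peak memory per worker, not a measured value."""
    cores = os.cpu_count() or 1
    # Total physical memory in GiB (Linux/WSL only).
    total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    by_cpu = max(1, cores - reserve_cores)
    by_mem = max(1, int(total_gb // per_worker_gb))
    return min(by_cpu, by_mem)


if __name__ == "__main__":
    # On my 24-core / 32 GB setup this would suggest ~20 workers by CPU
    # but only ~21 by memory at 1.5 GB each, so memory becomes the
    # limit quickly if per-worker usage is higher than that.
    print(pick_worker_count())
```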