
multiprocessing.pool vs ProcessPoolExecutor #14

@taygetea

Description
I've been having an issue using multiprocessing to filter down the entire 2005-2022 dataset, and I won't be able to limit it to just one subreddit. I'm currently working through an issue where combine_folder_multiprocess hangs. I ran into that a few times with smaller chunks of the reddit data, but there I was able to just kill and restart it and it would pick up where it left off; not so with the 2TB dataset. Debugging this is made harder by multiprocessing.Pool's tendency to fail silently (especially if the OOM killer kicks in), whereas ProcessPoolExecutor raises a BrokenProcessPool exception. The two have effectively the same features, but ProcessPoolExecutor is probably what's going to get the most updates going forward.
https://stackoverflow.com/questions/65115092/occasional-deadlock-in-multiprocessing-pool
https://bugs.python.org/issue22393#msg315684
https://stackoverflow.com/questions/24896193/whats-the-difference-between-pythons-multiprocessing-and-concurrent-futures
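To illustrate the failure-mode difference: here's a minimal sketch where a worker process dies abruptly (os._exit stands in for the OOM killer). ProcessPoolExecutor surfaces this as BrokenProcessPool instead of hanging; the `crash` function and worker count are just placeholders for the demo.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash(_):
    # Simulate a worker being killed abruptly (e.g. by the OOM killer).
    # os._exit skips all cleanup, like a SIGKILL would.
    os._exit(1)

if __name__ == "__main__":
    try:
        with ProcessPoolExecutor(max_workers=2) as pool:
            # With multiprocessing.Pool, a dead worker can leave map()
            # waiting forever; here the executor detects the dead process.
            list(pool.map(crash, range(4)))
    except BrokenProcessPool:
        print("pool broke: a worker died unexpectedly")
```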

Other than that suggestion (and I'll send a PR if I end up porting it over and it works well), I'll update this with what works. But how much RAM does the system where you process the entire dataset have? Right now the machine I'm using has 32GB, and I gave it 20 workers because I have 24 cores and wanted to use my computer while it was running. I could easily give the machine more; it's a WSL VM currently assigned half my system memory. Would you expect 10 vs 20 workers, 32 vs 64GB of RAM, etc., to have major effects on whether the script completes?
