Hey everyone! 😊
I've been diving into something interesting about the NemotronCC pipeline (nemo curator) and got a bit stuck—hoping some brilliant mind here can help me out! 🌟
So here's the deal: when creating the quality buckets (0-19), the pipeline needs to rank ALL documents by their quality score before splitting them into buckets. That makes total sense for fair distribution, but Common Crawl is MASSIVE. We're talking petabytes of data that definitely can't fit on one machine, so processing has to be distributed across many servers. And sorting normally requires comparing items globally, which gets insanely complicated when your data is spread across thousands of machines!
I've been thinking:
✨ How do they actually pull off this global sort at such an enormous scale?
✨ Is there some clever trick to avoid a full sort?
✨ How do they handle the shuffle phase without killing the network?
✨ Does Nemotron use some special distributed algorithm I don't know about?
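On point two, here's my rough mental model of how a full global sort *might* be avoided (this is purely my own sketch, not anything I've confirmed in the NemotronCC code): each worker samples its local scores, a coordinator computes approximate quantile boundaries from the combined sample, and then every worker buckets its own documents locally with no global shuffle of the raw data. All the names below are made up for illustration:

```python
# Hypothetical sketch of sample-based quantile bucketing (my assumption,
# not NemotronCC's actual implementation).
import numpy as np

rng = np.random.default_rng(0)

# Pretend these arrays live on 8 different workers.
worker_scores = [rng.random(100_000) for _ in range(8)]

# Step 1: each worker sends only a small random sample to a coordinator.
samples = np.concatenate(
    [rng.choice(s, size=1_000, replace=False) for s in worker_scores]
)

# Step 2: the coordinator derives 19 cut points -> 20 buckets (0-19)
# from the sample's empirical quantiles.
boundaries = np.quantile(samples, np.linspace(0, 1, 21)[1:-1])

# Step 3: each worker assigns buckets locally; no raw data ever moves.
def bucket_of(score: float) -> int:
    """Map a quality score to its approximate quantile bucket (0-19)."""
    return int(np.searchsorted(boundaries, score))

local_buckets = [np.searchsorted(boundaries, s) for s in worker_scores]
```

The appeal (if this is even close to what they do) is that the network only carries the tiny samples plus 19 boundary values, and the bucket sizes are approximately equal as long as the sample is representative. But I have no idea whether that approximation is acceptable for their use case, hence the questions above!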
I'm genuinely curious about the real engineering magic behind this! If anyone has insights into how large-scale sorting works for web-scale datasets (or specifically for NemotronCC), I'd be forever grateful! 🙏
Also totally okay with paper recommendations, engineering blog links, or just rough ideas!
Thanks so much in advance,