
How does Nemo Curator sort petabytes of Common Crawl data for quality bucketing? #1585

@XevWright

Description

Hey everyone! 😊

I've been diving into something interesting about the NemotronCC pipeline (NeMo Curator) and got a bit stuck—hoping some brilliant mind here can help me out! 🌟

So here's the deal: when creating quality buckets (0-19), the process requires sorting ALL documents by their quality score before splitting them into buckets. That makes total sense for fair distribution... but Common Crawl is MASSIVE. We're talking petabytes of data that can't possibly fit on one machine, so processing has to be distributed across many servers. And sorting normally requires comparing items globally, which gets insanely complicated when your data is spread across thousands of machines!
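For context on what I mean: my naive mental model is that you don't literally need a full global sort to form equal-sized buckets—estimating the 19 quantile boundaries from a sample would be enough, and then every worker can bucket its own partition locally. Here's a minimal sketch of that idea (sample-based splitter selection, TeraSort-style); to be clear, this is just my own toy illustration, the function name and structure are hypothetical and not from NeMo Curator:

```python
import numpy as np

def bucket_by_quantiles(scores_per_partition, num_buckets=20, sample_size=10_000):
    """Assign each document a quality bucket (0..num_buckets-1) without a
    full global sort: sample locally, estimate global quantile boundaries,
    then bucket each partition independently. (Toy sketch, not the real
    NeMo Curator implementation.)"""
    rng = np.random.default_rng(0)
    # 1) Each "worker" samples its partition; only samples cross the network.
    samples = np.concatenate([
        rng.choice(p, size=min(sample_size, len(p)), replace=False)
        for p in scores_per_partition
    ])
    # 2) Estimate the 19 interior bucket boundaries from the pooled sample.
    boundaries = np.quantile(samples, np.linspace(0, 1, num_buckets + 1)[1:-1])
    # 3) Each worker buckets its own data locally -- no shuffle of the full
    #    dataset is needed just to assign bucket IDs.
    return [np.searchsorted(boundaries, p, side="right") for p in scores_per_partition]
```

So my real question is whether the actual pipeline does something like this, or whether it performs a true distributed sort with a shuffle.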

I've been thinking:

✨ How do they actually pull off this global sort at such an enormous scale?

✨ Is there some clever trick to avoid a full sort?

✨ How do they handle the shuffle phase without killing the network?

✨ Does Nemotron use some special distributed algorithm I don't know about?

I'm genuinely curious about the real engineering magic behind this! If anyone has insights into how large-scale sorting works for web-scale datasets (or specifically for NemotronCC), I'd be forever grateful! 🙏

Also totally okay with paper recommendations, engineering blog links, or just rough ideas!

Thanks so much in advance,
