Pile-CC Size #105

Open
KeremTurgutlu opened this issue Feb 15, 2023 · 0 comments


I am writing a data pipeline to process Common Crawl and am using the code in the pile-cc repo as a reference. In this repo, the raw Pile-CC component accounts for about 200 GB; however, the pile-cc repo README states:

> 3.5PB of network ingress in total is required. The final dataset should be (warning: this number is very rough and extrapolated; leave some slack space to be safe!) about 200TB. About 40k core days (non-hyperthreaded) are also required (again, a very rough estimate from extrapolation).

I am a bit confused by the gap between 200 TB and 200 GB, which is roughly a factor of 1000. Was there another pipeline step that reduced the size from 200 TB to 200 GB? If so, I am not able to find it. Thanks!
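
To make the gap concrete, here is a quick back-of-the-envelope sketch using only the figures quoted above. It assumes decimal units (1 TB = 10^12 bytes), which neither source states explicitly, and the variable names are purely illustrative.

```python
# Back-of-the-envelope comparison of the sizes quoted in this issue.
# Assumption: decimal units (GB = 10**9 bytes, etc.); the sources do not
# say whether they mean TB or TiB, and all figures are rough estimates.
GB = 10**9
TB = 10**12
PB = 10**15

network_ingress = 3.5 * PB   # "3.5PB of network ingress" (pile-cc README)
readme_output = 200 * TB     # "about 200TB" final dataset (pile-cc README)
pile_cc_size = 200 * GB      # approximate Pile-CC size cited in this issue

# How much smaller Pile-CC is than the README's extrapolated output.
reduction_factor = readme_output / pile_cc_size

# How much smaller the README's output is than the raw data downloaded.
ingress_to_output = network_ingress / readme_output

print(f"README output vs Pile-CC: {reduction_factor:.0f}x")
print(f"Ingress vs README output: {ingress_to_output:.1f}x")
```

So the README's own numbers imply about a 17.5x reduction from downloaded data to output, while getting from there to 200 GB would need a further reduction of about 1000x that I cannot account for.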
