I am writing a data pipeline to process Common Crawl and am referencing your code here in the pile-cc repo. In this repo, the Pile-CC raw version accounts for 200 GB; however, in the pile-cc repo we see that:

"3.5PB of network ingress in total is required. The final dataset should be (warning: this number is very rough and extrapolated; leave some slack space to be safe!) about 200TB. About 40k core days (non-hyperthreaded) are also required (again, a very rough estimate from extrapolation)."
I am a bit confused about the difference between 200TB and 200GB. Was there another pipeline that reduced the size from 200TB down to 200GB? If so, I am not able to find it. Thanks!
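For reference, here is a quick back-of-the-envelope comparison of the quoted figures, just to make the size gap explicit. This is only arithmetic on the numbers above (decimal units assumed); the README itself calls them rough extrapolations, so the ratios are approximate too.

```python
# Rough comparison of the quoted sizes (decimal units assumed; all figures
# are the README's own rough extrapolations, not exact measurements).
ingress_tb = 3.5 * 1000      # 3.5 PB of network ingress from Common Crawl
pipeline_output_tb = 200     # ~200 TB final dataset per the pile-cc README
pile_cc_tb = 200 / 1000      # ~200 GB Pile-CC raw version reported for this repo

print(f"ingress -> pipeline output: {ingress_tb / pipeline_output_tb:.1f}x reduction")
print(f"pipeline output -> Pile-CC: {pipeline_output_tb / pile_cc_tb:.0f}x reduction")
```

So the unexplained step is roughly a further 1000x reduction from the pipeline's output to the 200 GB figure.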