I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" and I am very interested in exploring it further. However, it seems that the pipeline is still in preparation. Is there any information on when the "Web Pipeline" might be released for public use?
codefly13 changed the title from "Inquiry about CommonCrawl WARC Pipeline Availability" to "Inquiry about Web Pipeline Availability" on Apr 22, 2024
@dumitrac I'm interested in this as well. I'd like to use the Dolma toolkit to filter CC data (which I assume is what @codefly13 was attempting as well). However, I don't see an example of how to do this in the repo, and the following pipeline is marked as WIP: https://github.com/allenai/dolma/tree/main/sources/cc_warc
I'm very new to Dolma so there's a good chance I'm just missing something. Would appreciate some pointers. Thanks!