Inquiry about Web Pipeline Availability #151

codefly13 · 2024-04-22T01:29:15Z

I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" and I am very interested in exploring it further. However, it seems that the pipeline is still in preparation. I would like to kindly inquire about the availability of the "Web Pipeline". Is there any information on when it might be released for public use?

dumitrac · 2024-04-30T16:51:57Z

Hi @codefly13 - all of it is already available in the dolma toolkit (i.e. this repo).
Please let me know if you're looking for something different.

OxxoCodes · 2024-06-26T00:54:31Z

@dumitrac I'm interested in this as well. I'd like to utilize the Dolma toolkit to perform some filtering on CC data (which is what I assume @codefly13 was attempting to perform as well). However, I don't see an example of how to do this in the repo, and the following pipeline is just marked as being WIP: https://github.com/allenai/dolma/tree/main/sources/cc_warc

I'm very new to Dolma so there's a good chance I'm just missing something. Would appreciate some pointers. Thanks!

codefly13 changed the title ~~Inquiry about CommonCrawl WARC Pipeline Availability~~ Inquiry about Web Pipeline Availability Apr 22, 2024
