If I understand correctly, due to the copyrighted nature of some of their datasets, they don't host direct links to all of them.
However, many of the links in the README point to scripts that will download them. I have only used Project Gutenberg so far, but I assume that if you run pile.py with the --force-download flag it will download all 1.2 TB of data, minus the Books3 dataset (sourced from Bibliotik), which must be commented out of the code for the download to work.
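For reference, this is roughly the sequence I followed; it's only a sketch, and the exact paths, flag spelling, and the name of the Books3 entry inside pile.py may differ depending on the repo revision you check out:

```bash
# Sketch only -- verify paths and flags against the current repo README.
git clone https://github.com/EleutherAI/the-pile.git
cd the-pile
# Install dependencies as described in the repo README, then comment out the
# Books3 (Bibliotik) entry in the dataset list inside pile.py before running:
python pile.py --force-download   # downloads the remaining ~1.2 TB of data
```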
Hi @dboggs95, thanks for the response. I was interested in more fine-grained website information rather than links to the actual dataset. For example, for the YouTube captions dataset, I am interested in the URLs of the YouTube videos used to collect the data. This GitHub repo currently contains scraping scripts to collect data, but it does not specify the links used to create the Pile. Furthermore, even if the URLs do exist, it is not clear to me that a mapping exists from each URL to the scraped text.
Is it possible to gain access to the URLs (or any website information) from which the data was scraped to generate the Pile?