-The code above will download and process Wikipedia articles in the `simple` language from the October 1, 2023 wikipedia dump.
-After running it, you will have a directory called `wikipedia/v0` with Wikipedia articles in it.
-Wikipedia articles are going to be grouped in compressed JSONL files in dolma
+This script will download and process Wikipedia articles in the `simple` language from the October 1, 2023 Wikipedia dump. After running it, you will find the articles stored in a directory named `wikipedia/v0`. The articles will be grouped into compressed JSONL files suitable for dolma.
+
+Note: Update the `--date 20231001` argument by selecting a specific dump date from the Wikimedia dump website. Make sure to use the date format `YYYYMMDD`.
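As a small illustration of the `YYYYMMDD` format mentioned in the note, the following Python sketch turns a calendar date into the string expected by `--date`. The helper name `dump_date_arg` is hypothetical and not part of dolma; whether a dump actually exists for a given date still has to be checked on the Wikimedia dump website.

```python
from datetime import date


def dump_date_arg(d: date) -> str:
    """Format a date as YYYYMMDD, the format the note asks for in `--date`.

    Hypothetical helper for illustration only; it is not a dolma function.
    """
    return d.strftime("%Y%m%d")


# The dump date used in this example: October 1, 2023.
print(dump_date_arg(date(2023, 10, 1)))  # prints "20231001"
```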

### Step 1: Run Taggers

@@ -105,74 +108,19 @@ The above command will create an attribute directory called `bff_duplicate_parag

### Step 3: Run Mixer

-After running the taggers and and marking which paragraphs are duplicates, we can run the mixer to create a dataset with a subset of the languages and documents.
+After running the taggers and marking which paragraphs are duplicates, we can run the mixer to create a dataset with a subset of the languages and documents.

-For this step, we will pass a configuration file to the mix command instead of passing all the options on the command line. CLI invocation looks like this:
+For this step, we will pass a configuration file to the `mix` command instead of passing all the options on the command line. The CLI invocation looks like this:
-# this process option is overridden by the command line flag
-"processes": 1
-}
-```
+In this case, the configuration is provided via a JSON file, though a YAML file would also work. Additionally, we override the number of processes to 16 using the `--processes` flag.
+
+You can find the configuration file [`wikipedia-mixer.json`](examples/wikipedia-mixer.json) in the examples repository, along with its YAML-equivalent version at [`wikipedia-mixer.yaml`](examples/wikipedia-mixer.yaml).

-The above configuration will create a directory called `wikipedia/example0/documents` with a set of files that contain the documents that pass the filters.
+The configuration will create a directory named `wikipedia/example0/documents` with a set of files containing the documents that pass the filters.
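To spot-check the mixer output, something like the following Python sketch could be used. It assumes that the files under `wikipedia/example0/documents` are gzip-compressed JSONL (one JSON object per line, with a `.gz` suffix) and that each document carries a `text` field. Both are assumptions about the output layout rather than something stated above, and `iter_documents` is a hypothetical helper, not a dolma API.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_documents(directory: str) -> Iterator[dict]:
    """Yield one dict per JSONL line from every *.gz file in `directory`.

    Assumes gzip-compressed JSONL files; adjust the glob pattern if the
    mixer output uses a different suffix.
    """
    for path in sorted(Path(directory).glob("*.gz")):
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


if __name__ == "__main__":
    # Print the first 80 characters of up to three documents.
    for i, doc in enumerate(iter_documents("wikipedia/example0/documents")):
        if i >= 3:
            break
        print(doc.get("text", "")[:80])
```

Reading lazily with a generator keeps memory use flat even when the output directory holds many large files.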