V1.0 candidate; new deduper options, new taggers #100

Merged 99 commits on Feb 1, 2024
Commits
5634389
added more runs
soldni Nov 27, 2023
936bae3
new plots
soldni Nov 28, 2023
11be06d
tokenizer fix
soldni Nov 28, 2023
4e43dbe
squatted
soldni Nov 28, 2023
05a6656
new lang id
soldni Nov 29, 2023
997cf7d
all fasttext lang id
soldni Nov 29, 2023
dde3bb5
plots
soldni Nov 29, 2023
5bcbcd8
further plots
soldni Dec 1, 2023
bebd46f
wip
soldni Dec 1, 2023
907de63
progress!
soldni Dec 1, 2023
747eb52
style
soldni Dec 1, 2023
e6a1fd0
fixed format
soldni Dec 1, 2023
fdc9b13
added configs
soldni Dec 1, 2023
4040b55
dts
soldni Dec 1, 2023
876b9d4
configs
soldni Dec 2, 2023
6d874dd
more
soldni Dec 2, 2023
172172d
refine
soldni Dec 2, 2023
fdcb9bc
fix
soldni Dec 2, 2023
fa8ae25
fix
soldni Dec 2, 2023
5a59215
adding new features to deduper
soldni Dec 3, 2023
ed7c990
accidentally removed tests
soldni Dec 3, 2023
9c45b91
added cli options
soldni Dec 3, 2023
a6c89d0
big commit
soldni Dec 3, 2023
4d0ef02
improvement to tokenizer
soldni Dec 3, 2023
87d2801
bumping version
soldni Dec 3, 2023
f8da3db
fix error in empty
soldni Dec 3, 2023
430f7f2
new dedupe docs
soldni Dec 8, 2023
8d1f1f6
names
soldni Dec 8, 2023
fca1bae
configs
soldni Dec 19, 2023
4808b15
fixed paths
soldni Dec 19, 2023
0d49ec4
stack
soldni Dec 19, 2023
c80ca46
switched to v2
soldni Dec 19, 2023
486a350
fixed dedupe config
soldni Dec 19, 2023
4e25e4d
updated
soldni Dec 20, 2023
6ad8b1c
middle dedupe
soldni Dec 20, 2023
9c80a8b
mix text length
soldni Dec 20, 2023
b9dca47
Reddit processing code (#74)
drschwenk Nov 30, 2023
729d2e4
Merge branch 'main' into soldni/paper
soldni Dec 20, 2023
e8e2e98
more plots
soldni Dec 20, 2023
fd6b730
fixed version
soldni Dec 20, 2023
266548f
names
soldni Dec 20, 2023
0e83e52
different path
soldni Dec 20, 2023
4df8bff
added support for retries
soldni Dec 20, 2023
9541957
wip test
soldni Dec 21, 2023
be42570
fixed tests
soldni Dec 21, 2023
d2ab428
fixed
soldni Dec 21, 2023
ced2a2d
removing repetitions
soldni Dec 21, 2023
62a8d8c
dedupe docs
soldni Dec 21, 2023
1c86ee5
Merge branch 'main' into soldni/paper
soldni Dec 21, 2023
7335601
reddit stats
soldni Dec 21, 2023
785ac9e
paths
soldni Dec 21, 2023
63a1d1d
bugfix
soldni Dec 21, 2023
698a968
base
soldni Dec 21, 2023
357a740
version of pycld2 that compiles on M macs
soldni Dec 22, 2023
f4c3b9e
new config middle
soldni Dec 22, 2023
1f5f7d2
3 parts
soldni Dec 22, 2023
cad2030
further s3 tests
soldni Dec 23, 2023
f5fa8e6
decode
soldni Dec 23, 2023
1505c83
still write empty docs to attributes when skip_empty is True
soldni Dec 23, 2023
f2f1008
wiki adjusted
soldni Dec 27, 2023
c7dfbc7
wiki config
soldni Dec 27, 2023
9b6a526
simple counts
soldni Dec 28, 2023
1b88496
changed path
soldni Dec 30, 2023
170e0af
added new features
soldni Jan 2, 2024
a94d38f
plots
soldni Jan 9, 2024
e5f6f09
added new digits vocab
soldni Jan 9, 2024
4af1ef3
added config to sample
soldni Jan 4, 2024
378641d
small
soldni Jan 9, 2024
c740b8e
added tokenizer script
soldni Jan 10, 2024
2133298
merging
soldni Jan 15, 2024
13d809e
code abl
soldni Jan 15, 2024
35a21cd
cargo
soldni Jan 15, 2024
898374e
version bump
soldni Jan 17, 2024
586cc32
made it stable
soldni Jan 17, 2024
eb58c57
topics
soldni Jan 17, 2024
2dd17ac
sampling
soldni Jan 18, 2024
1afe414
rename
soldni Jan 18, 2024
9afab09
new config for 1.6
soldni Jan 20, 2024
9679158
Merge branch 'main' into soldni/paper
soldni Jan 20, 2024
5acff2f
llama config
soldni Jan 20, 2024
4bcaaa8
llama config (fix)
soldni Jan 20, 2024
f3dae82
Merge branch 'main' into soldni/paper
soldni Jan 23, 2024
b938db8
figures
soldni Jan 25, 2024
887a9b0
adding docs dedupe
soldni Jan 28, 2024
b4d70a8
added more dedup configs
soldni Jan 28, 2024
2e27442
style
soldni Jan 28, 2024
aeaf924
added counts
soldni Jan 28, 2024
7f93446
more cli
soldni Jan 28, 2024
a3ab54b
style
soldni Jan 29, 2024
2063ef0
style
soldni Jan 29, 2024
dd1a848
removed autopep8
soldni Jan 29, 2024
4cabad8
resorted
soldni Jan 29, 2024
d4e1b9b
testing change
soldni Jan 29, 2024
80898bb
corner cases
soldni Jan 31, 2024
625fc44
Merge branch 'main' into soldni/paper
soldni Jan 31, 2024
e46a0b6
figures
soldni Jan 31, 2024
d1e0975
added current paper
soldni Feb 1, 2024
93b4651
reverted cli
soldni Feb 1, 2024
fc3754d
documentation
soldni Feb 1, 2024
1,588 changes: 1,202 additions & 386 deletions Cargo.lock

Large diffs are not rendered by default.

23 changes: 17 additions & 6 deletions Cargo.toml
```diff
@@ -1,6 +1,6 @@
 [package]
 name = "dolma"
-version = "0.9.4"
+version = "1.0.0"
 edition = "2021"
 license = "Apache-2.0"

@@ -11,26 +11,37 @@ crate-type = ["cdylib"]

 [dependencies]
 ahash = { version = "0.8.1", features = ["runtime-rng"] }
+anyhow = "1.0"
+atomic-traits = "0.3"
 aws-config = { version = "0.55.0"}
 aws-sdk-s3 = "0.25.0"
+byteorder = "1"
 clap = { version = "4.1.11", features = ["derive"] }
 console = "0.15"
 env_logger = "0.10.0"
 flate2 = { version = "1.0.28", features = ["zlib-ng"], default-features = false }
+glob = "0.3.1"
+humantime = "2.1"
 indicatif = "0.17"
 jsonpath-rust = "0.3.0"
 log = "0.4.17"
-regex = "1.8.4"
+num_cpus = "1.0"
+num-traits = "0.2"
+parse-size = "1.0"
 pyo3 = { version = "0.19.0", features = ["extension-module"] }
 rand = "0.8.4"
 rayon = "1.7.0"
-serde = {version = "1.0.160", features = ["derive"]}
-serde_json = "1.0"
+regex = "1.8.4"
+serde = { version = "1.0.160", features = ["derive", "rc"] }
+serde_json = "1.0.108"
 simple_logger = { version = "3.0", features = ["stderr", "colors"], default-features = false, optional = true }
 structopt = { version = "0.3", optional = true }
 thousands = "0.2"
 threadpool = "1.8.1"
+tokenizers = {version = "0.15.0", features = ["http"]}
 tokio = {version = "1.27.0", features = ["full"]}
 tokio-util = "0.7.7"
 unicode-segmentation = "1.7"
-glob = "0.3.1"
+
+
+# [target.'cfg(target_arch = "aarch64")'.dependencies]
+# flate2 = "1.0.28"
```
3 changes: 3 additions & 0 deletions configs/dolma-v1_5/README.md
# Dolma 1.5

This directory contains the configuration files used to build Dolma v1.5.
96 changes: 96 additions & 0 deletions configs/dolma-v1_5/decontamination/README.md
# Decontamination Runbook

## Step 1: Create decontamination bloom filter

> Okay, I think everything is ready for decon testing now. The finalized ppl suite v3 is in `s3://ai2-llm/eval-data/perplexity/v3/`. And here is my proposed plan for decon testing, if you agree and it's not too much compute. The following is the sequence of things to try. At each step, if the document removal rate is >0.1% or so, we back off to the next step and hope the removal rate is lower:
>
> - **Option 1** Decon against PPL Suite v3 (`s3://ai2-llm/eval-data/perplexity/v3/`) + PPL Suite v2 (`s3://ai2-llm/eval-data/perplexity/v2/`) for full backwards compatibility.
> - **Option 2** Decon against PPL Suite v3 (`s3://ai2-llm/eval-data/perplexity/v3/`) + PPL Suite v2-small (`s3://ai2-llm/eval-data/perplexity/v2_small/`) for at least full backwards compatibility for the in-loop metrics the model team was using.
> - **Option 3** Decon against PPL Suite v3 (`s3://ai2-llm/eval-data/perplexity/v3/`) + a subset of PPL Suite v2-small requested by Dirk and Iz (`s3://ai2-llm/eval-data/perplexity/v2_small/c4_en/`, `s3://ai2-llm/eval-data/perplexity/v2_small/pile/`, `s3://ai2-llm/eval-data/perplexity/v2_small/m2d2_s2orc/`, `s3://ai2-llm/eval-data/perplexity/v2_small/ice/`)
>
> Let me know if you disagree with any of this, or if there's anything I can do to help run the decon trials!


### Step 1.1: copy data locally

We copy data locally since the directory structure of the eval data in S3 is slightly different from the one we need.
In particular, we need all documents to be under `documents/` directory.

```bash
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2 $HOME/perplexity/v2/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small $HOME/perplexity/v2_small/documents
aws s3 sync s3://ai2-llm/eval-data/perplexity/v3 $HOME/perplexity/v3/documents

aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small/c4_en $HOME/perplexity/v2_small_subset/documents/c4_en
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small/pile $HOME/perplexity/v2_small_subset/documents/pile
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small/m2d2_s2orc $HOME/perplexity/v2_small_subset/documents/m2d2_s2orc
aws s3 sync s3://ai2-llm/eval-data/perplexity/v2_small/ice $HOME/perplexity/v2_small_subset/documents/ice
```

### Step 1.1b: change type of IDs in v3 subset (TEMPORARY FIX)

v3 accidentally contains ids that are integers instead of strings. Until that's fixed, run:

```bash
python configs/dolma-v1_5/decontamination/fix_ids_type.py "${HOME}/perplexity/v3/documents/*/*/*.gz"
```

### Step 1.2: tag out paragraphs by uniseg length

For dolma, we want to decontaminate against paragraphs that are at least 13 uniseg words long,
so we need to compute their length first.

```bash
dolma tag --documents "${HOME}/perplexity/v2/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v2_small/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v3/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
dolma tag --documents "${HOME}/perplexity/v2_small_subset/documents/*/*/*.gz" --taggers uniseg_length_paragraphs_with_empty_v1 not_alphanum_paragraph_v1 --processes 188
```
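The tagger names above come from dolma itself; as a rough illustration of the rule they implement, the sketch below flags paragraphs shorter than 13 words so they can later be dropped. It uses a plain whitespace split as a stand-in for uniseg word segmentation, so its counts are an approximation of what the real tagger produces:

```python
# Hedged sketch of the paragraph-length rule described above (not the actual
# dolma tagger): flag every paragraph shorter than 13 words, returning
# character spans so the paragraphs can be blanked out in a later mix step.
MIN_WORDS = 13


def short_paragraph_spans(text: str) -> list[tuple[int, int]]:
    spans, offset = [], 0
    for para in text.split("\n"):
        # Approximate uniseg word counting with a whitespace split.
        if len(para.split()) < MIN_WORDS:
            spans.append((offset, offset + len(para)))
        offset += len(para) + 1  # +1 for the newline separator
    return spans
```

Paragraphs with at least 13 words are left untouched, matching the "at least 13 uniseg words" criterion above.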

### Step 1.3: filter out paragraphs that are too short

After tagging, we filter out short paragraphs to produce the eval sets for options 1, 2, and 3.

```bash
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option1.yaml mix
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml mix
dolma -c configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option3.yaml mix
```

### Step 1.4: create bloom filter

First, we cat the contents of each dataset to get the number of documents:

```bash
zcat $HOME/perplexity/option1/documents/* | jq '.text' -cr | wc -l
>>> 3681169
zcat $HOME/perplexity/option2/documents/* | jq '.text' -cr | wc -l
>>> 2336120
zcat $HOME/perplexity/option3/documents/* | jq '.text' -cr | wc -l
>>> 2020471
```

We use these numbers for `bloom_filter.estimated_doc_count` in the config files. For all three options, we set `bloom_filter.desired_false_positive_rate` to 0.00001.

```bash
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option1.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option2.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step1_4-create-bloom-filter/option3.yaml dedupe
```
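The `estimated_doc_count` and `desired_false_positive_rate` pair determines how large the filter must be. As a sanity check, the standard Bloom-filter sizing formulas (m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions; not necessarily dolma's exact internal sizing logic) can be computed directly:

```python
import math


def bloom_params(n_docs: int, fp_rate: float) -> tuple[int, int]:
    """Standard Bloom-filter sizing: number of bits m and hash
    functions k for n_docs items at false-positive rate fp_rate."""
    m = math.ceil(-n_docs * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_docs * math.log(2)))
    return m, k


# Option 1 above: 3,681,169 documents at a 0.00001 false-positive rate
# needs roughly 88 million bits (~11 MB) and 17 hash functions.
bits, hashes = bloom_params(3_681_169, 0.00001)
```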

## Step 2: Run decontamination

Tag content for Dolma V1.5 for decontamination:


```bash
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/cc.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/c4.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/stack.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/reddit.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/peS2o.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/books.yaml dedupe
dolma -c configs/dolma-v1_5/decontamination/step2-run-decontamination/wiki.yaml dedupe
```
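Each dedupe run writes attribute files next to the tagged documents. As a sketch of how the flagged spans might be consumed downstream — assuming a JSONL layout with `[start, end, score]` span triples, inferred from the span names in the configs above rather than taken from dolma's documentation:

```python
import json


# Hedged sketch (assumed attribute layout, not an official dolma API):
# collect the ids of documents that have at least one positively-scored
# span for the given attribute, i.e. documents touched by decontamination.
def contaminated_ids(attr_lines, attr_name):
    ids = set()
    for line in attr_lines:
        row = json.loads(line)
        spans = row.get("attributes", {}).get(attr_name, [])
        if any(score > 0 for _start, _end, score in spans):
            ids.add(row["id"])
    return ids
```

The attribute name passed in would be the full `<tagger>__<tagger>__<span>` key used by the configs in this directory.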
33 changes: 33 additions & 0 deletions configs/dolma-v1_5/decontamination/fix_ids_type.py
import argparse
import json
from dolma.core.paths import glob_path
import tqdm

import smart_open


def fix_path(p: str):
    # Read the whole file into memory, then rewrite it in place,
    # casting every integer `id` to a string.
    with smart_open.open(p, 'rt') as f:
        data = [json.loads(line) for line in f]

    with smart_open.open(p, 'wt') as f:
        for d in data:
            if 'id' in d:
                d['id'] = str(d['id'])
            f.write(json.dumps(d) + '\n')


def main():
ap = argparse.ArgumentParser()
ap.add_argument('path', nargs='+')
args = ap.parse_args()

with tqdm.tqdm(desc='Files') as pbar:
for p in args.path:
for sp in glob_path(p):
fix_path(sp)
pbar.update()


if __name__ == '__main__':
main()
86 changes: 86 additions & 0 deletions configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option1.yaml
streams:
- name: "v2"
documents:
- ${oc.env:HOME}/perplexity/v2/documents/c4_100_domains/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/c4_100_domains/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/c4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/c4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/gab/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/gab/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/ice/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/ice/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/m2d2_s2orc/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/m2d2_s2orc/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/m2d2_wiki/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/m2d2_wiki/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/manosphere/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/manosphere/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/mc4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/mc4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/pile/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/pile/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/ptb/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/ptb/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/twitterAEE/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/twitterAEE/test/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/wikitext_103/val/*.gz
- ${oc.env:HOME}/perplexity/v2/documents/wikitext_103/test/*.gz

output: &output
path: ${oc.env:HOME}/perplexity/option1/documents
max_size_in_bytes: 500000000
discard_fields:
- attributes

attributes: &attributes
- uniseg_length_paragraphs_with_empty_v1
- not_alphanum_paragraph_v1

span_replacement: &span_replacement
- span: $.attributes.uniseg_length_paragraphs_with_empty_v1__uniseg_length_paragraphs_with_empty_v1__negative_paragraph
min_score: -12
replacement: ""
- span: $.attributes.not_alphanum_paragraph_v1__not_alphanum_paragraph_v1__all_punct
min_score: 0.5
replacement: ""

- name: "v3"
documents:
- ${oc.env:HOME}/perplexity/v3/documents/4chan_meta_sep/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/4chan_meta_sep/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_100_domains/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_100_domains/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma_100_subreddits/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma_100_subreddits/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma-v1_5/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma-v1_5/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/falcon-refinedweb/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/falcon-refinedweb/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/gab/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/gab/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ice_fixed/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ice_fixed/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_s2orc_unsplit/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_s2orc_unsplit/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_wikipedia_unsplit/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_wikipedia_unsplit/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/manosphere_meta_sep/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/manosphere_meta_sep/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/mc4/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/mc4/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/pile/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/pile/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ptb/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ptb/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/redpajama/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/redpajama/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/twitterAAE_HELM_fixed/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/twitterAAE_HELM_fixed/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/wikitext_103/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/wikitext_103/test/*.gz

output: *output
attributes: *attributes
span_replacement: *span_replacement
86 changes: 86 additions & 0 deletions configs/dolma-v1_5/decontamination/step1_3-make-eval-set/option2.yaml
streams:
- name: "v2_small"
documents:
- ${oc.env:HOME}/perplexity/v2_small/documents/c4_100_domains/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/c4_100_domains/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/c4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/c4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/gab/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/gab/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/ice/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/ice/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/m2d2_s2orc/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/m2d2_s2orc/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/m2d2_wiki/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/m2d2_wiki/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/manosphere/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/manosphere/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/mc4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/mc4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/pile/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/pile/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/ptb/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/ptb/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/twitterAEE/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/twitterAEE/test/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/wikitext_103/val/*.gz
- ${oc.env:HOME}/perplexity/v2_small/documents/wikitext_103/test/*.gz

output: &output
path: ${oc.env:HOME}/perplexity/option2/documents
max_size_in_bytes: 500000000
discard_fields:
- attributes

attributes: &attributes
- uniseg_length_paragraphs_with_empty_v1
- not_alphanum_paragraph_v1

span_replacement: &span_replacement
- span: $.attributes.uniseg_length_paragraphs_with_empty_v1__uniseg_length_paragraphs_with_empty_v1__negative_paragraph
min_score: -12
replacement: ""
- span: $.attributes.not_alphanum_paragraph_v1__not_alphanum_paragraph_v1__all_punct
min_score: 0.5
replacement: ""

- name: "v3"
documents:
- ${oc.env:HOME}/perplexity/v3/documents/4chan_meta_sep/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/4chan_meta_sep/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_100_domains/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_100_domains/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_en/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/c4_en/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma_100_subreddits/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma_100_subreddits/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma-v1_5/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/dolma-v1_5/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/falcon-refinedweb/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/falcon-refinedweb/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/gab/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/gab/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ice_fixed/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ice_fixed/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_s2orc_unsplit/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_s2orc_unsplit/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_wikipedia_unsplit/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/m2d2_wikipedia_unsplit/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/manosphere_meta_sep/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/manosphere_meta_sep/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/mc4/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/mc4/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/pile/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/pile/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ptb/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/ptb/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/redpajama/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/redpajama/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/twitterAAE_HELM_fixed/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/twitterAAE_HELM_fixed/test/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/wikitext_103/val/*.gz
- ${oc.env:HOME}/perplexity/v3/documents/wikitext_103/test/*.gz

output: *output
attributes: *attributes
span_replacement: *span_replacement