File conventions and cleanup
silnlp uses the SIL_NLP_DATA_PATH environment variable to specify the path of a root folder (e.g., SIL_NLP_DATA_PATH="C:/silnlp"). All of the reference data files and experiment files (configuration, models, predictions, etc.) expected by the NMT scripts are found under this root folder.
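As a minimal sketch of how a script might resolve this root folder, the helper below reads the variable and returns it as a path. The function name `get_data_root` and the error message are hypothetical and are not taken from silnlp itself; only the variable name SIL_NLP_DATA_PATH comes from the text above.

```python
import os
from pathlib import Path


def get_data_root(env=None):
    """Return the root folder named by SIL_NLP_DATA_PATH.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    This helper is illustrative only and is not part of silnlp.
    """
    env = os.environ if env is None else env
    value = env.get("SIL_NLP_DATA_PATH")
    if not value:
        # Fail early with a clear message instead of guessing a default.
        raise RuntimeError("SIL_NLP_DATA_PATH is not set")
    return Path(value)
```

Calling `get_data_root({"SIL_NLP_DATA_PATH": "C:/silnlp"})` returns a `Path` for the example root shown above; subfolders can then be joined onto it with the `/` operator.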
The subfolder structure that silnlp requires under this root folder is described in the table below.
| Folder | Description |
|---|---|
| | Data and experiments subfolder supporting Alignment experiments. |
| | Experiments subfolder with multiple subfolders, one per experiment. |
| | Subfolder for a single experiment. |
| | Data and experiments subfolder supporting Machine Translation experiments. |
| | Non-Scripture training data files (WMT '20, NewsTest, MultiCCAligned, etc.). Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Experiments subfolder with multiple subfolders, one per experiment. |
| | Subfolder for a single experiment. |
| | Scripture training data files. When the extract_corpora script is run on a Paratext project, the extracted Scripture content is written to a file in this subfolder. Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Canonical list of verse references (e.g., "GEN 1:1"), in order, for all Scripture training data files extracted from the Paratext project. The verse references appear in this file in the same order as the verse text appears in all Scripture training data files. This file can be generated by running the extract_corpora script on the Ref project (see below). |
| | Key Biblical Terms (KBT) data files. When the extract_corpora script is run on a Paratext project with populated KBTs, the extracted KBTs are written to file(s) in this subfolder. Refer to the next section for information on the naming conventions for the files in this subfolder. |
| | Subfolder with Paratext projects and related Paratext supporting data. |
| | Subfolder with one or more Paratext projects. |
| | Subfolder with the files from an unzipped Paratext project. |
| | Subfolder containing a Reference Paratext project with versification that all other Paratext projects are aligned to when they are extracted. |
| | Reference files for processing Paratext KBTs. |
To be provided ...
Currently, checkpoints (in <experiment>/run/checkpoint-<ckpt-num>/ folders) that are older than one month are automatically deleted every Sunday at 1am CT. To keep checkpoints from being deleted, put a "keep_until" file in the <experiment> directory alongside the experiment's config.yml file. The file name should follow the format keep_until_YYYY-MM-DD.lock. To double-check that the keep_until file is working correctly, run scripts/clean_s3.py with the --dry-run option and verify that the checkpoint is listed as "Protected" in the "Extra Info" column of the output spreadsheet.
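The naming format above can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, the sketch assumes an empty file is sufficient (the text does not say whether the file's contents matter), and it assumes the experiment directory is reachable as a local path. It does not reproduce the logic of scripts/clean_s3.py. If the experiment lives in remote storage, the file would need to be created there by whatever means that storage provides.

```python
import re
from datetime import date
from pathlib import Path

# Matches names of the form keep_until_YYYY-MM-DD.lock
KEEP_UNTIL_PATTERN = re.compile(r"^keep_until_(\d{4}-\d{2}-\d{2})\.lock$")


def keep_until_name(until):
    """Build the file name for a given date (hypothetical helper)."""
    return f"keep_until_{until.isoformat()}.lock"


def parse_keep_until(file_name):
    """Return the date encoded in a keep_until file name, or None if
    the name does not follow the format or the date is invalid."""
    match = KEEP_UNTIL_PATTERN.match(file_name)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        # Right shape but not a real calendar date, e.g. month 13.
        return None


def write_keep_until(experiment_dir, until):
    """Create an empty keep_until file in the experiment directory,
    i.e. next to config.yml (assumes a local, writable directory)."""
    lock_path = Path(experiment_dir) / keep_until_name(until)
    lock_path.touch()
    return lock_path
```

Checking a name with `parse_keep_until` before creating the file is a cheap way to catch typos such as a missing zero-padded month; the --dry-run check described above remains the way to confirm that the cleanup script actually treats the checkpoint as protected.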