
File conventions and cleanup

mshannon-sil edited this page Mar 4, 2025 · 1 revision

Folder Structure

silnlp uses the SIL_NLP_DATA_PATH environment variable to specify the path of a root folder (e.g., SIL_NLP_DATA_PATH="C:/silnlp"). All of the reference data files and experiment files (configuration, models, predictions, etc.) expected by the NMT scripts are stored under this root folder.
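As a minimal sketch of how a script might resolve this root folder, the helper below reads SIL_NLP_DATA_PATH and fails loudly when it is unset. The function name get_data_root is hypothetical and used only for illustration; silnlp's own lookup logic may differ.

```python
import os
from pathlib import Path


def get_data_root(env=None) -> Path:
    """Return the root folder named by SIL_NLP_DATA_PATH.

    Hypothetical helper for illustration only; not part of silnlp.
    `env` defaults to the process environment and can be replaced
    with a plain dict when testing.
    """
    env = os.environ if env is None else env
    value = env.get("SIL_NLP_DATA_PATH")
    if not value:
        raise RuntimeError("SIL_NLP_DATA_PATH is not set")
    return Path(value)
```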

The subfolder structure that silnlp requires under this root folder is described in the table below.

Folder Description
  • Alignment
Data and experiments subfolder supporting Alignment experiments.
    • experiments
Experiments subfolder with multiple subfolders, one per experiment.
      • <experiment>
Subfolder for a single experiment.
  • MT
Data and experiments subfolder supporting Machine Translation experiments.
    • corpora
Non-Scripture training data files (WMT '20, NewsTest, MultiCCAligned, etc.).

Refer to the next section for information on the naming conventions for the files in this subfolder.
    • experiments
Experiments subfolder with multiple subfolders, one per experiment.
      • <experiment>
Subfolder for a single experiment.
    • scripture
Scripture training data files.
When the extract_corpora script is run on a Paratext project, the extracted Scripture content is written to a file in this subfolder.

Refer to the next section for information on the naming conventions for the files in this subfolder.
      • vref.txt
Canonical list of verse references (e.g., "GEN 1:1"), in order, for all Scripture training data files extracted from the Paratext project. The order in which the verse references appear in this file is the same order in which the verse text appears in all Scripture training data files.
This file can be generated by running the extract_corpora script on the Ref project (see below).
    • terms
Key Biblical Terms (KBT) data files.
When the extract_corpora script is run on a Paratext project with populated KBTs, the extracted KBTs are written to one or more files in this subfolder.

Refer to the next section for information on the naming conventions for the files in this subfolder.
  • Paratext
Subfolder with Paratext projects and related Paratext supporting data.
    • projects
Subfolder with one or more Paratext projects.
      • <project>
Subfolder with the files from an unzipped Paratext project.
      • Ref
Subfolder containing a Reference Paratext project with versification that all other Paratext projects are aligned to when they are extracted.
    • terms
Reference files for processing Paratext KBTs.
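The layout in the table above can be checked with a short script. The sketch below lists the fixed subfolders from the table and reports which ones are missing under a given root. The names EXPECTED_SUBFOLDERS and missing_subfolders are assumptions made for this example and are not part of silnlp. Per-experiment and per-project subfolders (<experiment>, <project>, Ref) are not checked, because their names vary.

```python
from pathlib import Path
from typing import List

# Fixed subfolders from the table above, relative to the root folder.
# Hypothetical constant for illustration; silnlp does not define it.
EXPECTED_SUBFOLDERS = [
    "Alignment/experiments",
    "MT/corpora",
    "MT/experiments",
    "MT/scripture",
    "MT/terms",
    "Paratext/projects",
    "Paratext/terms",
]


def missing_subfolders(root: Path) -> List[str]:
    """Return the expected subfolders that do not exist under `root`."""
    return [rel for rel in EXPECTED_SUBFOLDERS if not (root / rel).is_dir()]
```

Passing the folder named by SIL_NLP_DATA_PATH as root gives a quick check that a new machine or storage bucket mirror is set up as the scripts expect.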

File naming conventions

To be provided ...

Checkpoint cleanup

Currently, checkpoints (in <experiment>/run/checkpoint-<ckpt-num>/ folders) older than one month are deleted automatically every Sunday at 1 a.m. CT. To keep the checkpoints for an experiment, place a "keep_until" file in the <experiment> directory, alongside the experiment's config.yml file. The file name must follow the format keep_until_YYYY-MM-DD.lock. To confirm that the keep_until file is working, run scripts/clean_s3.py with the --dry-run option and verify that the checkpoint is listed as "Protected" in the "Extra Info" column of the output spreadsheet.
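Because the file name carries the date, a typo in it is easy to miss. The sketch below builds a name in the keep_until_YYYY-MM-DD.lock format and checks whether an existing name matches it. Both helper names are hypothetical, and the pattern reflects only the format stated above; how scripts/clean_s3.py itself parses the name is not shown here, so the --dry-run check remains the authoritative test.

```python
import re
from datetime import date
from typing import Optional

# Pattern for the documented format keep_until_YYYY-MM-DD.lock.
KEEP_UNTIL_PATTERN = re.compile(r"^keep_until_(\d{4})-(\d{2})-(\d{2})\.lock$")


def keep_until_filename(until: date) -> str:
    """Build a keep_until file name for the given date (illustrative helper)."""
    return f"keep_until_{until.isoformat()}.lock"


def parse_keep_until(filename: str) -> Optional[date]:
    """Return the date encoded in a keep_until file name.

    Returns None if the name does not match the documented format
    or does not encode a real calendar date.
    """
    match = KEEP_UNTIL_PATTERN.match(filename)
    if match is None:
        return None
    try:
        return date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        return None
```

For example, keep_until_filename(date(2025, 6, 30)) produces keep_until_2025-06-30.lock, and parse_keep_until returns None for a name such as keep_until_2025-6-30.lock, which lacks the zero-padded month.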