Added support for MinIO and B2 buckets #620

Merged · 5 commits · Mar 12, 2025
1 change: 0 additions & 1 deletion .devcontainer/Dockerfile
@@ -43,6 +43,5 @@ ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set environment variables
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV SIL_NLP_DATA_PATH=/silnlp
ENV EFLOMAL_PATH=/workspaces/silnlp/.venv/lib/python3.10/site-packages/eflomal/bin
CMD ["bash"]
8 changes: 6 additions & 2 deletions .devcontainer/devcontainer.json
@@ -12,14 +12,18 @@
"--gpus",
"all",
"-v",
"${env:HOME}/.aws:/root/.aws", // Mount user's AWS credentials into the container
"-v",
"${env:HOME}/clearml/.clearml/hf-cache:/root/.cache/huggingface"
],
"containerEnv": {
"AWS_REGION": "${localEnv:AWS_REGION}",
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
"MINIO_ENDPOINT_URL": "${localEnv:MINIO_ENDPOINT_URL}",
"MINIO_ACCESS_KEY": "${localEnv:MINIO_ACCESS_KEY}",
"MINIO_SECRET_KEY": "${localEnv:MINIO_SECRET_KEY}",
"B2_ENDPOINT_URL": "${localEnv:B2_ENDPOINT_URL}",
"B2_KEY_ID": "${localEnv:B2_KEY_ID}",
"B2_APPLICATION_KEY": "${localEnv:B2_APPLICATION_KEY}",
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
"CLEARML_API_SECRET_KEY": "${localEnv:CLEARML_API_SECRET_KEY}"
},
50 changes: 28 additions & 22 deletions README.md
@@ -62,15 +62,18 @@ These are the main requirements for the SILNLP code to run on a local machine. S
Create a text file with the following content and edit as necessary:
```
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
CLEARML_API_ACCESS_KEY=xxxxx
CLEARML_API_SECRET_KEY=xxxxx
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
SIL_NLP_DATA_PATH="/silnlp"
```
* If you do not intend to use SILNLP with ClearML and/or AWS, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
CLEARML_API_ACCESS_KEY=xxxxxxx
CLEARML_API_SECRET_KEY=xxxxxxx
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
MINIO_ACCESS_KEY=xxxxxxxxx
MINIO_SECRET_KEY=xxxxxxx
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
B2_KEY_ID=xxxxxxxx
B2_APPLICATION_KEY=xxxxxxxx
```
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using MinIO or B2 and will be storing files locally.
* If you do not intend to use SILNLP with ClearML, MinIO, and/or B2, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to a MinIO or B2 bucket from within the Docker container; it only allows you to run scripts referencing files in the bucket.
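These credentials are consumed as standard S3-compatible settings. As a sketch of how they map onto a boto3 client (the helper below is illustrative, not part of SILNLP; `scripts/clean_s3.py` in this PR builds its MinIO client the same way):

```python
import os

# Env var names per service, matching the file above; any service you
# don't use can simply be left unset.
_CREDS = {
    "minio": ("MINIO_ENDPOINT_URL", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY"),
    "b2": ("B2_ENDPOINT_URL", "B2_KEY_ID", "B2_APPLICATION_KEY"),
}


def s3_client_kwargs(service: str) -> dict:
    """Keyword arguments for boto3.client("s3", **kwargs), service = "minio" or "b2"."""
    endpoint, key, secret = (os.environ[name] for name in _CREDS[service])
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": key,
        "aws_secret_access_key": secret,
    }


# Usage (requires boto3 and valid credentials):
#   import boto3
#   s3 = boto3.client("s3", **s3_client_kwargs("minio"))
```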

6. Start container

@@ -129,22 +132,25 @@ These are the main requirements for the SILNLP code to run on a local machine. S
poetry install
```

10. If using ClearML and/or AWS, set the following environment variables:
10. If using ClearML, MinIO, and/or B2, set the following environment variables:
```
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
CLEARML_API_ACCESS_KEY=xxxxx
CLEARML_API_SECRET_KEY=xxxxx
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
SIL_NLP_DATA_PATH="/silnlp"
```
CLEARML_API_ACCESS_KEY=xxxxxxx
CLEARML_API_SECRET_KEY=xxxxxxx
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
MINIO_ACCESS_KEY=xxxxxxxxx
MINIO_SECRET_KEY=xxxxxxx
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
B2_KEY_ID=xxxxxxxx
B2_APPLICATION_KEY=xxxxxxxx
```
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using MinIO or B2 and will be storing files locally.
* If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
* Note that this does not give you direct access to a MinIO or B2 bucket from within the Docker container; it only allows you to run scripts referencing files in the bucket.
* For instructions on how to permanently set up environment variables for your operating system, see the corresponding section under the Development Environment Setup header below.

11. If using AWS, there are two options:
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/s3_bucket_setup.md#install-and-configure-rclone).
11. If using MinIO or B2, there are two options:
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/bucket_setup.md#install-and-configure-rclone).
* Option 2: Create a local cache for the bucket following the instructions under [Create SILNLP cache](https://github.com/sillsdev/silnlp/blob/master/manual_setup.md#create-silnlp-cache).

## Development Environment Setup
@@ -177,7 +183,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the

4. Define environment variables.

Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. Additionally, set AWS_REGION. The typical value is "us-east-1".
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, B2_KEY_ID, and B2_APPLICATION_KEY. Also set MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000 and B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com, with no quotation marks around the values.
* Linux / macOS users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file (Linux) or `.profile` file (macOS) in your home directory with the format
```
export VAR="VAL"
@@ -210,7 +216,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the
10. Install and activate Poetry environment.
* In the VS Code terminal, run `poetry install` to install the necessary Python libraries, and then run `poetry shell` to enter the environment in the terminal.

11. (Optional) Locally mount the S3 bucket. This will allow you to interact directly with the S3 bucket from your local terminal (outside of the dev container). See instructions [here](s3_bucket_setup.md).
11. (Optional) Locally mount the MinIO and/or B2 bucket(s). This will allow you to interact directly with the bucket(s) from your local terminal (outside of the dev container). See instructions [here](bucket_setup.md).

To get back into the dev container and poetry environment each subsequent time, open the silnlp folder in VS Code, select the "Reopen in Container" option from the Remote Connection menu (bottom left corner), and use the `poetry shell` command in the terminal.

63 changes: 63 additions & 0 deletions bucket_setup.md
@@ -0,0 +1,63 @@
# MinIO/B2 bucket setup

We use MinIO and Backblaze B2 buckets to store our experiment data. The setup below configures your workspace to work with them smoothly.

### Note For MinIO setup

To access the MinIO bucket locally, you must be connected to its network via VPN. If you need VPN access, please reach out to an SILNLP dev team member.

### Note For Backblaze B2 usage

Backblaze B2 is only used as a backup storage option when the MinIO bucket is unavailable or when running experiments from the ORU Titan Server.

### Install and configure rclone

**Windows**

The following will mount /silnlp on your B: drive or /nlp-research on your M: drive and allow you to explore, read, and write.
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer", not the "SSHFS-Win (x64)" installer)
* Download rclone from: https://rclone.org/downloads/
* Unzip to your desktop (or some convenient location).
* Add the folder that contains rclone.exe to your PATH environment variable.
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~\AppData\Roaming\rclone\rclone.conf`
* Take the `scripts/rclone/mount_minio_to_m.bat` and `scripts/rclone/mount_b2_to_b.bat` files from this SILNLP repo and copy them to the folder that contains the unzipped rclone.
* Double-click either .bat file. A command window should open and remain open. If running mount_minio_to_m.bat, you should see something like:
```
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research M:
The service rclone has been started.
```

**Linux / macOS**

The following will mount /nlp-research to an M folder or /silnlp to a B folder in your home directory and allow you to explore, read, and write.
* For macOS, first download and install macFUSE: https://osxfuse.github.io/
* Download rclone from: https://rclone.org/install/
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
* Create a folder called "M" or "B" in your user directory
* Run the following command for MinIO:
```
rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M
```
* OR run the following command for B2:
```
rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B
```
### To start the M: and/or B: drive on startup

**Windows**

Put a shortcut to the mount_minio_to_m.bat and/or mount_b2_to_b.bat file in the Startup folder.
* In Windows Explorer, enter `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
* Right-click to add a new shortcut. Choose `mount_minio_to_m.bat` and/or `mount_b2_to_b.bat` as the target; you can leave the name as the default.

Now your MinIO or B2 bucket should be mounted as M: or B: drive, respectively, when you start Windows.

**Linux / macOS**
* Run `crontab -e`
* For MinIO, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M` into the file, save and exit
* For B2, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B` into the file, save and exit
* Reboot Linux / macOS

Now your MinIO or B2 bucket should be mounted as ~/M or ~/B, respectively, when you start Linux / macOS.
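A quick way to confirm a mount is usable is to check that the mount point can actually be listed; a stale FUSE mount typically raises an error. The helper below is a sketch (not part of SILNLP):

```python
import os


def mount_ready(path: str) -> bool:
    """Return True if `path` (e.g. "~/M" or "~/B") is a directory that can be listed.

    A dropped rclone/FUSE mount usually raises OSError on os.listdir.
    """
    path = os.path.expanduser(path)
    try:
        os.listdir(path)
    except OSError:
        return False
    return os.path.isdir(path)
```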
11 changes: 7 additions & 4 deletions manual_setup.md
@@ -73,9 +73,9 @@ __Download and install__ the following before creating any projects or starting
"editor.formatOnSave": true,
```

### S3 bucket setup
### MinIO and/or B2 bucket(s) setup

See [S3 bucket setup](s3_bucket_setup.md).
See [Bucket setup](bucket_setup.md).

### ClearML setup

@@ -88,8 +88,11 @@ See [ClearML setup](clear_ml_setup.md).
* Create the directory "$HOME/.cache/silnlp/projects" and set the environment variable SIL_NLP_CACHE_PROJECT_DIR to that path.

### Additional Environment Variables
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
* Set SIL_NLP_DATA_PATH to "/silnlp" and CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, B2_KEY_ID, and B2_APPLICATION_KEY.
* Set SIL_NLP_DATA_PATH to "/silnlp" if you are not using MinIO or B2 and will be storing files locally.
* Set CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
* Set MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000
* Set B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com
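Before running experiments, it can help to verify that the variables for the services you use are actually set. A minimal sketch (the helper and its name are illustrative, not part of SILNLP):

```python
import os

# Variable groups as described above; skip any group you don't use.
REQUIRED_VARS = {
    "clearml": ("CLEARML_API_HOST", "CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY"),
    "minio": ("MINIO_ENDPOINT_URL", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY"),
    "b2": ("B2_ENDPOINT_URL", "B2_KEY_ID", "B2_APPLICATION_KEY"),
}


def missing_vars(service: str) -> list:
    """Return the names of unset (or empty) variables for one service."""
    return [name for name in REQUIRED_VARS[service] if not os.environ.get(name)]
```

For example, `missing_vars("minio")` returns an empty list once all three MinIO variables are exported.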

### Setting Up and Running Experiments

56 changes: 0 additions & 56 deletions s3_bucket_setup.md

This file was deleted.

33 changes: 25 additions & 8 deletions scripts/clean_s3.py
@@ -1,6 +1,7 @@
import argparse
import csv
import datetime
import os
import re
import time
from typing import Tuple
@@ -39,30 +40,46 @@ def clean_research(max_months: int, dry_run: bool) -> Tuple[int, int]:
)
# create a csv filename to store the deleted files that includes the current datetime
output_csv = f"deleted_research_files_{time.strftime('%Y%m%d-%H%M%S')}" + ("_dryrun" if dry_run else "") + ".csv"
return _delete_data(max_months, dry_run, regex_to_delete, output_csv, checkpoint_protection=True)
return _delete_data(
max_months, dry_run, regex_to_delete, output_csv, bucket_service="minio", checkpoint_protection=True
)


def clean_production(max_months: int, dry_run: bool) -> Tuple[int, int]:
print("Cleaning production")
regex_to_delete = re.compile(r"^(production|dev|int-qa|ext-qa)/builds/.+")
output_csv = f"deleted_production_files_{time.strftime('%Y%m%d-%H%M%S')}" + ("_dryrun" if dry_run else "") + ".csv"
return _delete_data(max_months, dry_run, regex_to_delete, output_csv)
return _delete_data(max_months, dry_run, regex_to_delete, output_csv, bucket_service="aws")


def _delete_data(
max_months: int, dry_run: bool, regex_to_delete: str, output_csv: str, checkpoint_protection: bool = False
max_months: int,
dry_run: bool,
regex_to_delete: str,
output_csv: str,
bucket_service: str,
checkpoint_protection: bool = False,
) -> Tuple[int, int]:
max_age = max_months * MONTH_IN_SECONDS

s3 = boto3.client("s3")
if bucket_service == "minio":
s3 = boto3.client(
"s3",
endpoint_url=os.getenv("MINIO_ENDPOINT_URL"),
aws_access_key_id=os.getenv("MINIO_ACCESS_KEY"),
aws_secret_access_key=os.getenv("MINIO_SECRET_KEY"),
)
bucket_name = "nlp-research"
else:
s3 = boto3.client("s3")
bucket_name = "silnlp"
paginator = s3.get_paginator("list_objects_v2")
total_deleted = 0
storage_space_freed = 0
keep_until_dates = {}
# First pass, identify keep until files
# which must follow the format keep_until_YYYY-MM-DD.lock and be located in the same folder
# as the experiment's config.yml file
for page in paginator.paginate(Bucket="silnlp"):
for page in paginator.paginate(Bucket=bucket_name):
for obj in page["Contents"]:
s3_filename = obj["Key"]
parts = s3_filename.split("/")
@@ -83,7 +100,7 @@ def _delete_data(
csv_writer.writerow(["Filename", "LastModified", "Eligible for Deletion", "Extra Info"])
else:
csv_writer.writerow(["Filename", "LastModified", "Deleted", "Extra Info"])
for page in paginator.paginate(Bucket="silnlp"):
for page in paginator.paginate(Bucket=bucket_name):
for obj in page["Contents"]:
s3_filename = obj["Key"]
if regex_to_delete.search(s3_filename) is None:
@@ -126,7 +143,7 @@ def _delete_data(
print(s3_filename)
print(f"{(now - last_modified) / MONTH_IN_SECONDS} months old")
if not dry_run:
s3.delete_object(Bucket="silnlp", Key=s3_filename)
s3.delete_object(Bucket=bucket_name, Key=s3_filename)
print("Deleted")
total_deleted += 1
storage_space_freed += obj["Size"]
@@ -10,4 +10,4 @@ rem copy your key and secret to rclone.conf

rem run rclone - execute this file in the rclone folder

call rclone mount --vfs-cache-mode full --use-server-modtime s3silnlp:silnlp S:
call rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp B:
13 changes: 13 additions & 0 deletions scripts/rclone/mount_minio_to_m.bat
@@ -0,0 +1,13 @@
rem Install rclone
rem get rclone from https://rclone.org/downloads/
rem extract the files to a folder
rem then move this bat file to the extracted rclone folder; run this bat file to start the service
rem --use-server-modtime flag speeds up displaying large numbers of files. Not exactly mod time, but close enough.

rem configure rclone
rem copy the adjacent file "rclone.conf" to: C:\Users\<username>\AppData\Roaming\rclone\rclone.conf
rem copy your key and secret to rclone.conf

rem run rclone - execute this file in the rclone folder

call rclone mount --vfs-cache-mode full --use-server-modtime --no-check-certificate miniosilnlp:nlp-research M: