Added support for MinIO and B2 buckets #620

Open
wants to merge 2 commits into
base: master
1 change: 0 additions & 1 deletion .devcontainer/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,5 @@ ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set environment variables
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
ENV SIL_NLP_DATA_PATH=/silnlp
ENV EFLOMAL_PATH=/workspaces/silnlp/.venv/lib/python3.10/site-packages/eflomal/bin
CMD ["bash"]
8 changes: 6 additions & 2 deletions .devcontainer/devcontainer.json
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,18 @@
"--gpus",
"all",
"-v",
"${env:HOME}/.aws:/root/.aws", // Mount user's AWS credentials into the container
"-v",
"${env:HOME}/clearml/.clearml/hf-cache:/root/.cache/huggingface"
],
"containerEnv": {
"AWS_REGION": "${localEnv:AWS_REGION}",
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
"MINIO_ENDPOINT_URL": "${localEnv:MINIO_ENDPOINT_URL}",
"MINIO_ACCESS_KEY": "${localEnv:MINIO_ACCESS_KEY}",
"MINIO_SECRET_KEY": "${localEnv:MINIO_SECRET_KEY}",
"B2_ENDPOINT_URL": "${localEnv:B2_ENDPOINT_URL}",
"B2_KEY_ID": "${localEnv:B2_KEY_ID}",
"B2_APPLICATION_KEY": "${localEnv:B2_APPLICATION_KEY}",
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
"CLEARML_API_SECRET_KEY": "${localEnv:CLEARML_API_SECRET_KEY}"
},
50 changes: 28 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,15 +62,18 @@ These are the main requirements for the SILNLP code to run on a local machine. S
Create a text file with the following content and edit as necessary:
```
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
CLEARML_API_ACCESS_KEY=xxxxx
CLEARML_API_SECRET_KEY=xxxxx
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
SIL_NLP_DATA_PATH="/silnlp"
```
* If you do not intend to use SILNLP with ClearML and/or AWS, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
CLEARML_API_ACCESS_KEY=xxxxxxx
CLEARML_API_SECRET_KEY=xxxxxxx
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
B2_KEY_ID=xxxxxxxx
B2_APPLICATION_KEY=xxxxxxxx
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
MINIO_ACCESS_KEY=xxxxxxxxx
MINIO_SECRET_KEY=xxxxxxx
```
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using B2 or MinIO and will be storing files locally.
* If you do not intend to use SILNLP with ClearML and/or B2/MinIO, you can leave out the respective variables. If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to a B2 or MinIO bucket from within the Docker container; it only allows you to run scripts referencing files in the bucket.

6. Start container

Expand Down Expand Up @@ -129,22 +132,25 @@ These are the main requirements for the SILNLP code to run on a local machine. S
poetry install
```

10. If using ClearML and/or AWS, set the following environment variables:
10. If using ClearML and/or B2/MinIO, set the following environment variables:
```
CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
CLEARML_API_ACCESS_KEY=xxxxx
CLEARML_API_SECRET_KEY=xxxxx
AWS_REGION="us-east-1"
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
SIL_NLP_DATA_PATH="/silnlp"
```
CLEARML_API_ACCESS_KEY=xxxxxxx
CLEARML_API_SECRET_KEY=xxxxxxx
B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
B2_KEY_ID=xxxxxxxx
B2_APPLICATION_KEY=xxxxxxxx
MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
MINIO_ACCESS_KEY=xxxxxxxxx
MINIO_SECRET_KEY=xxxxxxx
```
* Include SIL_NLP_DATA_PATH="/silnlp" if you are not using B2 or MinIO and will be storing files locally.
* If you need to generate ClearML credentials, see [ClearML setup](clear_ml_setup.md).
* Note that this does not give you direct access to an AWS S3 bucket from within the Docker container, it only allows you to run scripts referencing files in the bucket.
* Note that this does not give you direct access to a B2 or MinIO bucket from within the Docker container; it only allows you to run scripts referencing files in the bucket.
* For instructions on how to permanently set up environment variables for your operating system, see the corresponding section under the Development Environment Setup header below.

11. If using AWS, there are two options:
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/s3_bucket_setup.md#install-and-configure-rclone).
11. If using B2/MinIO, there are two options:
* Option 1: Mount the bucket to your filesystem following the instructions under [Install and Configure Rclone](https://github.com/sillsdev/silnlp/blob/master/bucket_setup.md#install-and-configure-rclone).
* Option 2: Create a local cache for the bucket following the instructions under [Create SILNLP cache](https://github.com/sillsdev/silnlp/blob/master/manual_setup.md#create-silnlp-cache).

## Development Environment Setup
Expand Down Expand Up @@ -177,7 +183,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the

4. Define environment variables.

Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. Additionally, set AWS_REGION. The typical value is "us-east-1".
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, B2_KEY_ID, B2_APPLICATION_KEY, MINIO_ACCESS_KEY, and MINIO_SECRET_KEY. Also set B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com and MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000, both without quotation marks.
* Linux / macOS users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file (Linux) or `.profile` file (macOS) in your home directory with the format
```
export VAR="VAL"
Expand Down Expand Up @@ -210,7 +216,7 @@ Follow the instructions below to set up a Dev Container in VS Code. This is the
10. Install and activate Poetry environment.
* In the VS Code terminal, run `poetry install` to install the necessary Python libraries, and then run `poetry shell` to enter the environment in the terminal.

11. (Optional) Locally mount the S3 bucket. This will allow you to interact directly with the S3 bucket from your local terminal (outside of the dev container). See instructions [here](s3_bucket_setup.md).
11. (Optional) Locally mount the B2 and/or MinIO bucket(s). This will allow you to interact directly with the bucket(s) from your local terminal (outside of the dev container). See instructions [here](bucket_setup.md).

To get back into the dev container and poetry environment each subsequent time, open the silnlp folder in VS Code, select the "Reopen in Container" option from the Remote Connection menu (bottom left corner), and use the `poetry shell` command in the terminal.

59 changes: 59 additions & 0 deletions bucket_setup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# B2/MinIO bucket setup

We use Backblaze B2 and MinIO to store our experiment data. The steps below set up your workspace so that you can explore, read, and write the buckets from your local machine.

### Note For MinIO setup

In order to access the MinIO bucket locally, you must have a VPN connected to its network. If you need VPN access, please reach out to an SILNLP dev team member.

### Install and configure rclone

**Windows**

The following will mount /silnlp on your B drive or /nlp-research on your M drive and allow you to explore, read and write.
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer" not the "SSHFS-Win (x64)" installer)
* Download rclone from: https://rclone.org/downloads/
* Unzip to your desktop (or some convenient location).
* Add the folder that contains rclone.exe to your PATH environment variable.
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~\AppData\Roaming\rclone\rclone.conf`
* Take the `scripts/rclone/mount_b2_to_b.bat` and `scripts/rclone/mount_minio_to_m.bat` files from this SILNLP repo and copy them to the folder that contains the unzipped rclone.
* Double-click either bat file. A command window should open and remain open. If you ran mount_b2_to_b.bat, you should see something like:
```
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp B:
The service rclone has been started.
```

**Linux / macOS**

The following will mount /silnlp to a B folder or /nlp-research to a M folder in your home directory and allow you to explore, read and write.
* For macOS, first download and install macFUSE: https://osxfuse.github.io/
* Download rclone from: https://rclone.org/install/
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
* Create a folder called "B" or "M" in your home directory
* Run the following command for B2:
```
rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B
```
* OR run the following command for MinIO:
```
rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M
```
### To start B: and/or M: drive on startup

**Windows**

Put a shortcut to the mount_b2_to_b.bat and/or mount_minio_to_m.bat file in the Startup folder.
* In Windows Explorer put `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
* Right-click to add a new shortcut. Choose `mount_b2_to_b.bat` and/or `mount_minio_to_m.bat` as the target; you can leave the name as the default.

Now your B2 and/or MinIO bucket should be mounted as the B: or M: drive, respectively, when you start Windows.

**Linux / macOS**
* Run `crontab -e`
* For B2, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp ~/B` into the file, save and exit
* For MinIO, paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime miniosilnlp:nlp-research ~/M` into the file, save and exit
* Reboot Linux / macOS

Now your B2 and/or MinIO bucket should be mounted as ~/B or ~/M respectively when you start Linux / macOS.
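
The Linux / macOS mount commands and `@reboot` crontab entries above differ only in the remote name and the target folder. As a rough sketch (not part of this PR), the snippet below assembles both from one table. The helper names `build_mount_command` and `crontab_line` are hypothetical, and the home directory is passed in explicitly as an absolute path rather than relying on `~` expansion.

```python
from pathlib import PurePosixPath

# Remote names and bucket paths copied from the mount commands in this document.
# The short keys ("b2", "minio") are illustrative labels, not rclone settings.
REMOTES = {
    "b2": ("b2silnlp:silnlp", "B"),
    "minio": ("miniosilnlp:nlp-research", "M"),
}


def build_mount_command(backend: str, home: PurePosixPath) -> list[str]:
    """Return the rclone mount argument list for a backend (hypothetical helper)."""
    remote, folder = REMOTES[backend]
    return [
        "rclone", "mount",
        "--vfs-cache-mode", "full",
        "--use-server-modtime",
        remote,
        str(home / folder),
    ]


def crontab_line(backend: str, home: PurePosixPath) -> str:
    """Return an @reboot crontab entry that runs the mount command."""
    return "@reboot " + " ".join(build_mount_command(backend, home))


if __name__ == "__main__":
    home = PurePosixPath("/home/user")  # assumed example home directory
    print(crontab_line("b2", home))
    print(crontab_line("minio", home))
```

The argument list can be passed to `subprocess.run` as is; the crontab string is only for pasting into `crontab -e`.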
11 changes: 7 additions & 4 deletions manual_setup.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,9 +73,9 @@ __Download and install__ the following before creating any projects or starting
"editor.formatOnSave": true,
```

### S3 bucket setup
### B2 and/or MinIO bucket(s) setup

See [S3 bucket setup](s3_bucket_setup.md).
See [Bucket setup](bucket_setup.md).

### ClearML setup

Expand All @@ -88,8 +88,11 @@ See [ClearML setup](clear_ml_setup.md).
* Create the directory "$HOME/.cache/silnlp/projects" and set the environment variable SIL_NLP_CACHE_PROJECT_DIR to that path.

### Additional Environment Variables
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
* Set SIL_NLP_DATA_PATH to "/silnlp" and CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
* Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, B2_KEY_ID, B2_APPLICATION_KEY, MINIO_ACCESS_KEY, and MINIO_SECRET_KEY.
* Set SIL_NLP_DATA_PATH to "/silnlp" if you are not using B2 or MinIO and will be storing files locally.
* Set CLEARML_API_HOST to "https://api.sil.hosted.allegro.ai".
* Set B2_ENDPOINT_URL to https://s3.us-east-005.backblazeb2.com
* Set MINIO_ENDPOINT_URL to https://truenas.psonet.languagetechnology.org:9000
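
Because the variable list is now longer, a quick check before running experiments can help catch a missing value. The following is a minimal sketch and is not code from the SILNLP repository: the variable names come from the list above, while the grouping by service and the helper name `missing_variables` are assumptions made for illustration.

```python
import os

# Variable names taken from the setup notes above; the grouping is illustrative.
REQUIRED = {
    "ClearML": ["CLEARML_API_HOST", "CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY"],
    "B2": ["B2_ENDPOINT_URL", "B2_KEY_ID", "B2_APPLICATION_KEY"],
    "MinIO": ["MINIO_ENDPOINT_URL", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY"],
}


def missing_variables(env=None):
    """Map each service to its variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    report = {}
    for service, names in REQUIRED.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            report[service] = missing
    return report


if __name__ == "__main__":
    for service, names in missing_variables().items():
        print(f"{service}: missing {', '.join(names)}")
```

A service you do not use will show all of its variables as missing, which is expected since those variables are optional.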

### Setting Up and Running Experiments

56 changes: 0 additions & 56 deletions s3_bucket_setup.md

This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,4 @@ rem copy your key and secret to rclone.conf

rem run rclone - execute this file in the rclone folder

call rclone mount --vfs-cache-mode full --use-server-modtime s3silnlp:silnlp S:
call rclone mount --vfs-cache-mode full --use-server-modtime b2silnlp:silnlp B:
13 changes: 13 additions & 0 deletions scripts/rclone/mount_minio_to_m.bat
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
rem Install rclone
rem get rclone from https://rclone.org/downloads/
rem extract the files to a folder
rem then move this bat file to that folder and run it from there to start the service
rem --use-server-modtime flag speeds up displaying large numbers of files. Not exactly mod time, but close enough.

rem configure rclone
rem copy the adjacent file "rclone.conf" to: C:\Users\<username>\AppData\Roaming\rclone\rclone.conf
rem copy your key and secret to rclone.conf

rem run rclone - execute this file in the rclone folder

call rclone mount --vfs-cache-mode full --use-server-modtime --no-check-certificate miniosilnlp:nlp-research M:
17 changes: 11 additions & 6 deletions scripts/rclone/rclone.conf
Original file line number Diff line number Diff line change
@@ -1,7 +1,12 @@
[s3silnlp]
type = s3
provider = AWS
access_key_id = xxxxxxxxxxxxxxxxxx
secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
region = us-east-1
[b2silnlp]
type = b2
account = xxxxxxxxx
key = xxxxxxxxxxxx

[miniosilnlp]
type = s3
provider = Other
access_key_id = xxxxxxxx
secret_access_key = xxxxxxxxxx
endpoint = https://truenas.psonet.languagetechnology.org:9000
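
The template above ships with `x` placeholders that each user must replace after copying it. As a hedged sketch (not part of this PR), the check below reads the file with Python's `configparser`, assuming the file is plain INI-style text as shown here, and lists the fields that still hold placeholders. The helper name `unfilled_fields` and the rule that a value made only of `x` characters counts as a placeholder are assumptions for illustration.

```python
import configparser


def unfilled_fields(conf_text: str) -> dict[str, list[str]]:
    """Return, per section, the keys whose values are still all-'x' placeholders."""
    # interpolation=None so that a '%' inside a real secret is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    result = {}
    for section in parser.sections():
        placeholders = [
            key
            for key, value in parser[section].items()
            if value and set(value) == {"x"}
        ]
        if placeholders:
            result[section] = placeholders
    return result


if __name__ == "__main__":
    from pathlib import Path

    # Linux / macOS location named in bucket_setup.md.
    conf_path = Path.home() / ".config" / "rclone" / "rclone.conf"
    if conf_path.exists():
        for section, keys in unfilled_fields(conf_path.read_text()).items():
            print(f"[{section}] still has placeholders: {', '.join(keys)}")
```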
