ClearML Agent Setup

On Linux w/Docker

Log in as root
```
sudo -i
```
Install the GPU driver
```
ubuntu-drivers install
```
Install Docker
Install NVIDIA Container Toolkit and configure NVIDIA Container Runtime for Docker
If using MIG partitions:
1. Install nvidia-mig-manager.service
2. Configure /etc/nvidia-mig-manager/config.yaml with your MIG configuration, e.g.
```
version: v1
mig-configs:
  sil-config:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        3g.47gb: 2
    - devices: [1]
      mig-enabled: false
      mig-devices: {}
```
3. Configure /etc/systemd/system/nvidia-mig-manager.service.d/override.conf to use your mig-config, e.g.
```
[Service]
Environment="MIG_PARTED_SELECTED_CONFIG=sil-config"
```
4. Run nvidia-mig-parted apply or reboot the server
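Once the configuration has been applied, it can help to confirm that the partitions exist. The following is a minimal sketch, assuming `nvidia-smi -L` prints one indented line beginning with `MIG` per MIG device (the usual layout, but worth checking against your driver version). `count_mig_devices` is a hypothetical helper name, not part of any NVIDIA tool.

```shell
# Count MIG devices in `nvidia-smi -L` style output read from stdin.
# Assumption: each MIG device appears on its own line starting with "MIG".
count_mig_devices() {
    grep -c '^[[:space:]]*MIG ' || true
}

# Typical use on the server (requires the NVIDIA driver):
#   nvidia-smi -L | count_mig_devices
```

With the example sil-config above (two 3g.47gb devices on GPU 0, MIG disabled on GPU 1), the count would be expected to be 2.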
Create clearml user
```
adduser clearml
```
Add clearml user to docker group: https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user
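The linked Docker page comes down to adding the user to the `docker` group. A sketch, assuming the `docker` group already exists (Docker's packages normally create it); `in_group` is a hypothetical helper for checking the result.

```shell
# Return success if user $1 belongs to group $2.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# As root, add the user and then verify:
#   usermod -aG docker clearml
#   in_group clearml docker && echo "clearml can use docker"
```

The group change only takes effect in new login sessions, so check it after logging in as clearml again.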
Log in as clearml user. IMPORTANT: Do not create/modify any files in the clearml user directory as root.
```
su - clearml
```
Install clearml-agent
```
pip install clearml-agent
```
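The scripts later in this section call clearml-agent by its full path under /home/clearml/.local/bin, which is where pip's per-user installs usually land. A small sketch for checking whether that directory is on the PATH; `on_path` is a hypothetical helper name.

```shell
# Return success if directory $1 is an entry in $PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_path "$HOME/.local/bin"; then
    echo "clearml-agent should be callable by name"
else
    echo "use the full path: $HOME/.local/bin/clearml-agent"
fi
```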

Add a clearml.conf file

Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under scripts/clearml_agent/clearml.conf and fill out the ClearML credentials, git credentials, worker id, and worker name sections. Also add the following lines to the extra_docker_arguments section and fill out the access key and secret access key sections.

extra_docker_arguments: [
      "--env","SIL_NLP_DATA_PATH=/silnlp",
      "--env","B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com",
      "--env","B2_KEY_ID=***your B2 access key***",
      "--env","B2_APPLICATION_KEY=***your B2 secret key***",
      "--env","MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000",
      "--env","MINIO_ACCESS_KEY=***your MinIO access key***",
      "--env","MINIO_SECRET_KEY=***your MinIO secret key***",
      "--env","TOKENIZERS_PARALLELISM=false",
      "-v","/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
    ]
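
The last entry bind-mounts a Hugging Face cache directory from the host. When the source of a `-v` bind mount does not exist, Docker typically creates it itself, owned by root, which would leave a root-owned directory inside the clearml home directory. A sketch for creating it beforehand as the clearml user; `ensure_cache_dir` is a hypothetical helper name.

```shell
# Create the host side of a bind mount if it does not exist yet,
# then print the path.
ensure_cache_dir() {
    mkdir -p "$1" && echo "$1"
}

# Run as the clearml user, e.g.:
#   ensure_cache_dir /home/clearml/.clearml/hf-cache
```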

Create a startup script called start-agents.sh, e.g.

#!/bin/sh
# Kill all clearml-agents running
ps -A | grep clearml-agent | awk '{print $1}' | xargs -r kill -9
# GPU 0
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:0   --queue 47gb_queue
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:1   --queue 47gb_queue
# GPU 1
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 1   --queue 94gb_queue
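
Running the script as ./start-agents.sh requires the executable bit to be set. A minimal sketch; `make_executable` is a hypothetical helper name.

```shell
# Mark a script as executable and confirm the bit is set.
make_executable() {
    chmod +x "$1" && [ -x "$1" ]
}

# e.g. in the directory that holds the script:
#   make_executable start-agents.sh
```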

Set up the AppArmor profile
1. Copy the docker-apparmor file in the silnlp repo to the /etc/apparmor.d/ directory.
2. Load the profile with sudo apparmor_parser -r -W /etc/apparmor.d/docker-apparmor
Start the agents
```
./start-agents.sh
```
To configure the agents to restart after a reboot:
1. Become the root user again
```
exit
```
2. Create a file called clearml-agent in the /etc/init.d/ directory, e.g.

#!/bin/sh
set -e

### BEGIN INIT INFO
# Provides:           clearml-agents
# Required-Start:     $syslog $remote_fs $local_fs $syslog mountall
# Required-Stop:      $syslog $remote_fs $local_fs $syslog
# Should-Start:
# Should-Stop:
# Default-Start:      2 3 4 5
# Default-Stop:       0 1 6
# Short-Description:  ClearML Agents and queues to service GPUs
# Description:
#  "ClearML is an open source platform that automates and simplifies
#  developing and managing machine learning solutions.  ClearML Agent
#  is a virtual environment and execution manager for DL/ML solutions
#  on GPU machines."  --https://clear.ml
### END INIT INFO

export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

NAME="clearml-agents"

# Get lsb functions
. /lib/lsb/init-functions

fail_unless_root() {
        if [ "$(id -u)" != '0' ]; then
                log_failure_msg "$NAME must be run as root"
                exit 1
        fi
}

do_start_stop() {
        STOP=""
        if [ "$1" = "stop" ]; then
                STOP="--stop"
        fi

        # Half GPUs 0:0 and 0:1 and Full GPU 1
        su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:0   --queue cheetah_47gb ${STOP}" clearml
        su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:1   --queue cheetah_47gb ${STOP}" clearml
        su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 1   --queue cheetah_94gb ${STOP}" clearml
}

case "$1" in
        start)
                fail_unless_root

                log_begin_msg "Starting $NAME"
                do_start_stop
                log_end_msg $?
                ;;

        stop)
                fail_unless_root
                do_start_stop "stop"
                ;;

        restart)
                fail_unless_root
                do_start_stop "stop"
                do_start_stop
                ;;

        status)
                ps -ef | head -1
                ps -ef | grep clearml-agent | grep -v grep
                ;;

        *)
                echo "Usage: service clearml-agents {start|stop|restart|status}"
                exit 1
                ;;
esac
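
An init script only runs at boot once it is executable and registered for the runlevels in its header. The following is a sketch under the assumption of a Debian/Ubuntu system where `update-rc.d` is available; `run` and `register_init_script` are hypothetical helper names, and the `DRY_RUN` variable is only a convention of this sketch.

```shell
# Echo the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Make the init script executable and register it for its default runlevels.
# $1 is the file name under /etc/init.d/ (whatever you called it above).
register_init_script() {
    run chmod +x "/etc/init.d/$1"
    run update-rc.d "$1" defaults
}

# Preview as any user:  DRY_RUN=1 register_init_script clearml-agent
# Apply as root:        register_init_script clearml-agent
```

After registering, service clearml-agent status (using the file name chosen above) should list the running agents.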

On SLURM servers

Install Miniconda

Clone and enter the SILNLP repo

git clone https://github.com/sillsdev/silnlp.git
cd silnlp

Create a new conda environment using the environment.yml file in the repo
```
conda env create --file environment.yml
```
Activate the conda environment
```
conda activate silnlp
```
Install Poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
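The version match can be scripted. A sketch, assuming the first line of poetry.lock has the usual form "# This file is automatically @generated by Poetry X.Y.Z and should not be changed by hand."; `poetry_lock_version` is a hypothetical helper name.

```shell
# Print the Poetry version recorded in the first line of a lock file.
# Assumption: the line contains "generated by Poetry X.Y.Z".
poetry_lock_version() {
    sed -n '1s/.*generated by Poetry \([0-9][0-9A-Za-z.]*\).*/\1/p' "$1"
}

# From the repo root, with the silnlp environment active:
#   version=$(poetry_lock_version poetry.lock)
#   curl -sSL https://install.python-poetry.org | python3 - --version "$version"
```

If the helper prints nothing, the lock file header has a different format and the version should be read by hand.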
Install clearml-agent-slurm
```
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple "clearml-agent-slurm==0.4.0"
```
- The credentials can be found by clicking on the question mark in the upper right corner of the ClearML dashboard, then clicking ClearML Python Package setup and copying the credentials in step 1.
Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under scripts/clearml_agent/clearml.conf and fill out the ClearML credentials, git credentials, worker id, and worker name.

Set environment variables in .bashrc

export PYTHONPATH=
export MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
export MINIO_ACCESS_KEY=xxxxxx
export MINIO_SECRET_KEY=xxxxxx
export B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
export B2_KEY_ID=xxxxxx
export B2_APPLICATION_KEY=xxxxxx

export SIL_NLP_DATA_PATH="/silnlp"
export TOKENIZERS_PARALLELISM=false
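
A missing variable tends to surface only when a task fails, so it can be worth checking them up front. A sketch; `missing_env` is a hypothetical helper name, and the list of names is taken from the exports above.

```shell
# Print the names of any listed variables that are unset or empty;
# return non-zero if at least one is missing.
missing_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# After reloading .bashrc (source ~/.bashrc):
#   missing_env MINIO_ENDPOINT_URL MINIO_ACCESS_KEY MINIO_SECRET_KEY \
#       B2_ENDPOINT_URL B2_KEY_ID B2_APPLICATION_KEY SIL_NLP_DATA_PATH
```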

Create a batch template file called slurm.clearml.template
- You'll need to update the --account and --partition parameters for your use case in the example below


#!/bin/bash
# available template variables (default value separator ":")
# ${CLEARML_QUEUE_NAME}
# ${CLEARML_QUEUE_ID}
# ${CLEARML_WORKER_ID}.
# complex template variables  (default value separator ":")
# ${CLEARML_TASK.id}
# ${CLEARML_TASK.name}
# ${CLEARML_TASK.project.id}
# ${CLEARML_TASK.hyperparams.properties.user_key.value}


# example
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id}       # Job name DO NOT CHANGE
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
#SBATCH --account ***your account name***
#SBATCH --partition ***partition to use***
#SBATCH --time=${CLEARML_TASK.hyperparams.properties.time_limit.value:18:00:00}             # Time limit hrs:min:sec
#SBATCH --nodes=1


conda activate silnlp

${CLEARML_PRE_SETUP}

echo whoami $(whoami)

${CLEARML_AGENT_EXECUTE}

${CLEARML_POST_SETUP}

Start the agent

nohup clearml-agent-slurm --template-files slurm.clearml.template --queue ***queue_name***

Press Ctrl + Z to suspend the process
Move the process to the background
```
bg
```
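As an alternative to suspending with Ctrl + Z and running bg, the agent can be started in the background directly with its output sent to a log file. A sketch; `start_in_background` is a hypothetical helper name and agent.log is an arbitrary file name.

```shell
# Start a command detached from the terminal, log to a file, print its PID.
# $1 is the log file; the remaining arguments are the command to run.
start_in_background() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo $!
}

# e.g.
#   start_in_background agent.log clearml-agent-slurm \
#       --template-files slurm.clearml.template --queue ***queue_name***
```

The printed PID can be kept for stopping the agent later with kill.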

On Linux w/conda

Log in as root
```
sudo -i
```
Create clearml user
```
adduser clearml
```
Log in as clearml user
```
su - clearml
```
Install and initialize Miniconda

Clone and enter the SILNLP repo

git clone https://github.com/sillsdev/silnlp.git
cd silnlp

Create a new conda environment using the environment.yml file in the repo
```
conda env create --file environment.yml
```
Activate the conda environment
```
conda activate silnlp
```
Install Poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
Install clearml-agent
```
pip install clearml-agent
```
Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under scripts/clearml_agent/clearml.conf and fill out the ClearML credentials, git credentials, worker id, worker name, and python binary (use the conda python path).

Set environment variables in .bashrc

export PYTHONPATH=
export MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
export MINIO_ACCESS_KEY=xxxxxx
export MINIO_SECRET_KEY=xxxxxx
export B2_ENDPOINT_URL=https://s3.us-east-005.backblazeb2.com
export B2_KEY_ID=xxxxxx
export B2_APPLICATION_KEY=xxxxxx
export SIL_NLP_DATA_PATH="/silnlp"
export TOKENIZERS_PARALLELISM=false

Create a startup script called start-agents.sh, e.g.

#!/bin/sh
# Kill all clearml-agents running
ps -A | grep clearml-agent | awk '{print $1}' | xargs -r kill -9
# GPU 0
/home/clearml/miniconda3/envs/silnlp/bin/clearml-agent daemon --use-owner-token --detached  --create-queue --gpus 0   --queue 24gb_queue

Start the agents
```
./start-agents.sh
```
Configure agents to restart on reboot
- Follow the instructions for restarting agents on reboot in the On Linux w/Docker section, and modify the clearml-agent commands enclosed in quotation marks in the init script to match the clearml-agent command in your start-agents.sh script.

On Windows w/conda

Install Miniconda

Clone and enter the SILNLP repo

git clone https://github.com/sillsdev/silnlp.git
cd silnlp

Create a new conda environment using the environment.yml file in the repo
```
conda env create --file environment.yml
```
Activate the conda environment
```
conda activate silnlp
```
Follow these instructions to disable Git Credential Manager for Windows
Install clearml-agent
```
pip install clearml-agent
```
Install pywin32
```
pip install pywin32
```
Install Poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under scripts/clearml_agent/clearml.conf and fill out the ClearML credentials, git credentials, worker id, worker name, and python binary (use the conda python path).

Set the following environment variables

setx MINIO_ENDPOINT_URL "https://truenas.psonet.languagetechnology.org:9000"
setx MINIO_ACCESS_KEY "xxxxxx"
setx MINIO_SECRET_KEY "xxxxxx"
setx B2_ENDPOINT_URL "https://s3.us-east-005.backblazeb2.com"
setx B2_KEY_ID "xxxxxx"
setx B2_APPLICATION_KEY "xxxxxx"
setx SIL_NLP_DATA_PATH "/silnlp"
setx TOKENIZERS_PARALLELISM "false"

Create a start-agents.bat script

There is no --detached option since it's not supported on Windows.
Replace <username> with the name of the user running the clearml agent.

@echo off
REM Kill all clearml-agent processes running
for /f "tokens=2" %%i in ('tasklist /FI "IMAGENAME eq python.exe" /FO LIST ^| findstr clearml-agent') do (
    echo Killing clearml-agent with PID %%i
    taskkill /PID %%i /F
)

REM GPU 0
C:\Users\<username>\miniconda3\envs\silnlp\Scripts\clearml-agent daemon --use-owner-token --create-queue --gpus 0 --queue 24gb_queue

Run the script
```
start-agents.bat
```
Troubleshooting
- If you get import errors such as "ImportError: cannot import name 'ssl' from 'urllib3.util.ssl_'" or "ImportError: DLL load failed while importing _sqlite3: The specified module could not be found.", add the DLLs and Library/bin folders inside your conda environment folder to the Path, copy libcrypto-1_1-x64.dll, libssl-1_1-x64.dll, and sqlite3.dll from the Library/bin folder to the DLLs folder, or do both.
