
[Gym] Tutorial contains steps that are cluster/user dependent #1933

@jbaczek

Describe the bug

This setup tutorial contains steps that are brittle or nearly impossible to reproduce across different clusters.

  1. This Tip section can very easily fail depending on the cluster setup:
mkdir -p "$(dirname "$CONTAINER_IMAGE_PATH")"
enroot import -o "$CONTAINER_IMAGE_PATH" "docker://${CONTAINER_IMAGE_PATH}"
# Swap to local container path
CONTAINER_IMAGE_PATH=./$CONTAINER_IMAGE_PATH

In my case, enroot used $HOME/.cache as the cache directory for container layers. The container image is 28G, while the NFS share mounted as the user's home is only 10G on my cluster, so the import fails with an out-of-disk-space error.
Even with a different cache directory, the extraction stage can still fail because of per-user limits on the number of threads on a given cluster. The solution is to run the import on a compute node with the storage mounted.

  2. Using MOUNTS="$PWD:$PWD" goes against the containerisation philosophy: the code should already be present inside the container. In addition, mounting home as in the setup tutorial led to a runtime error:
ERROR unit/environments/test_nemo_gym.py::test_nemo_gym_sanity - ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::NemoGym.__init__() (pid=3905017, ip=10.65.18.29, actor_id=9d4200400d583b7144594fd801000000, repr=<nemo_rl.environments.nemo_gym.NemoGym object at 0x155555151850>)
  3. This line: echo "hf_token: {your HF token}" >> env.yaml is dangerous! If you follow the instructions from the setup tutorial, this file is saved on the external filesystem, which is a security issue. Depending on the access permissions on the cluster's filesystem, the token can be visible to anyone.
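
For the first item above, a workaround could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `has_free_gb` and `prepare_enroot_dirs` and the scratch path are hypothetical, and I am assuming enroot honours the `ENROOT_CACHE_PATH` and `ENROOT_TEMP_PATH` environment variables on the cluster in question. The `enroot import` line is copied from the tutorial's Tip section unchanged.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the filesystem holding DIR has at
# least NEED_GB gigabytes available (df -Pk reports 1024-byte blocks).
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Hypothetical helper: move enroot's layer cache and temporary files
# off $HOME onto a scratch filesystem (assumed to be large enough).
prepare_enroot_dirs() {
    local scratch="$1"
    export ENROOT_CACHE_PATH="$scratch/enroot-cache"
    export ENROOT_TEMP_PATH="$scratch/enroot-tmp"
    mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_TEMP_PATH"
}

# Usage sketch, left commented out because it needs Slurm and enroot.
# SCRATCH_DIR is a placeholder for cluster-specific storage.
#
#   SCRATCH_DIR=/path/to/scratch
#   prepare_enroot_dirs "$SCRATCH_DIR"
#   has_free_gb "$SCRATCH_DIR" 28 || { echo "not enough space" >&2; exit 1; }
#   srun --nodes=1 --ntasks=1 \
#       enroot import -o "$CONTAINER_IMAGE_PATH" "docker://${CONTAINER_IMAGE_PATH}"
```

Running the import through `srun` moves the extraction onto a compute node, which is what avoided the thread limit for me. The 28G threshold is just the image size from my run.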

Steps/Code to reproduce bug

Follow this setup tutorial

Expected behavior

The tutorial should be general, work on a Slurm cluster regardless of its setup, and should not pose security issues.
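
As one possible direction for the token step, the tutorial could create env.yaml with owner-only permissions and read the token from the environment. This is a minimal sketch, not a vetted fix: the function name `write_hf_token` and the `HF_TOKEN` variable are assumptions made for illustration, and POSIX mode bits do not help on filesystems that ignore them or against privileged users.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append the hf_token entry to TARGET while keeping
# the file readable and writable by its owner only.
write_hf_token() {
    local target="$1"
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
    # Create the file with a restrictive umask in a subshell so the
    # caller's umask is left untouched.
    ( umask 077; touch "$target" )
    # Tighten permissions in case the file already existed.
    chmod 600 "$target"
    printf 'hf_token: %s\n' "$HF_TOKEN" >> "$target"
}

# Usage sketch (the token comes from the environment, not the command line):
#   export HF_TOKEN=...        # set elsewhere, e.g. by a secrets manager
#   write_hf_token env.yaml
```

Keeping the file outside shared mounts where possible would reduce exposure further.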

Additional context

This was discovered during an effort to run an example script from the docs.
@bxyu-nvidia for vis

Labels: bug