trainstack is an orchestration repository for building production training workflows on top of upstream slime, while keeping upstream code clean and easy to upgrade.
Chinese README: README_zh.md
Teams often need to:
- integrate custom environments (for example HTTP-based environments like LiveWeb-Arena),
- run commander/worker style distributed training on Lium,
- keep deployment scripts and operational docs close to code.
Doing all of that by directly modifying upstream slime makes upgrades painful.
trainstack solves this by putting project logic in separate modules and keeping slime/ as a submodule dependency.
- Run baseline slime training through wrappers: python train.py ... and python train_async.py ...
- Run commander/worker orchestration with relay-trainer/.
- Extend training with project plugins under trainstack_plugins/.
- Integrate external environments over HTTP (decoupled from the trainer runtime).
git clone git@github.com:wang-tong0/trainstack.git
cd trainstack
git submodule update --init --recursive
Optional install:
python -m pip install -e .
python -m pip install -e relay-trainer
This section assumes you start from zero and want to:
- run SFT on Lium,
- run RL on Lium,
- sync checkpoints to Hugging Face.
- Python 3.10+ and git
- Docker installed and logged in: see Get Docker, then run docker login.
  Verify:
  docker --version
  docker info | grep -i Username
- Lium CLI installed and initialized: see the Lium docs and lium-cli on PyPI.
  Verify:
  lium --version
  test -f ~/.lium/config.ini && echo "lium config ok"
- Hugging Face token configured: see HF access tokens.
  Verify:
  test -s ~/.cache/huggingface/token && echo "hf token cache ok"
- A Lium template pointing to your relay-trainer image.
  Verify:
  lium templates relay-trainer
  The template output should show your expected image/tag (for example <dockerhub_user>/relay-trainer:<tag>).
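The file and executable checks above can be bundled into one preflight script. The following is a minimal sketch, not part of the repository: the function name `preflight` and its `which`/`home` parameters are hypothetical, and it only covers what can be verified locally without calling Docker or Lium.

```python
import shutil
from pathlib import Path

# Executables named in the prerequisite list above.
REQUIRED_TOOLS = ("git", "docker", "lium")

# (label, path relative to the home directory, must be non-empty)
# Mirrors `test -f ~/.lium/config.ini` and `test -s ~/.cache/huggingface/token`.
REQUIRED_FILES = (
    ("lium config", ".lium/config.ini", False),
    ("hf token cache", ".cache/huggingface/token", True),
)


def preflight(which=shutil.which, home=None):
    """Return a list of problems; an empty list means every check passed.

    `which` and `home` are injectable so the checks can be exercised
    against a temporary directory (hypothetical parameters).
    """
    home_dir = Path(home) if home is not None else Path.home()
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"missing executable: {tool}")
    for label, rel_path, must_be_non_empty in REQUIRED_FILES:
        path = home_dir / rel_path
        if not path.is_file():
            problems.append(f"{label} missing: {path}")
        elif must_be_non_empty and path.stat().st_size == 0:
            problems.append(f"{label} is empty: {path}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "preflight ok")
```

This sketch does not confirm that `docker login` succeeded or that the Lium template points at the right image; those still need the manual `docker info` and `lium templates` checks.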
cd trainstack
docker build -f relay-trainer/docker/Dockerfile -t <dockerhub_user>/relay-trainer:<tag> .
docker push <dockerhub_user>/relay-trainer:<tag>
Update your Lium template image/tag to this pushed image.
cd relay-trainer
cp configs/launch_stack.example.yaml configs/launch_stack.yaml
Edit configs/launch_stack.yaml:
- commander.public_url: public URL reachable from the Lium pod
- commander.shared_secret: shared secret for commander/worker
- lium.template_id: your Lium template
- lium.volume: persistent volume (new:name=... or id:<huid>)
- run.hf_repo: your HF model repo (e.g. yourname/trainstack-demo)
- run.hf_dry_run: set false for real upload
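Before launching, it can help to confirm that the edited file actually fills in these fields. The sketch below is illustrative only: it assumes the dotted names map to nested YAML mappings (for example `commander:` containing `public_url:`), and the helpers `lookup`, `missing_keys`, and `volume_looks_valid` are hypothetical names, not part of relay-trainer.

```python
# Fields listed above that have no usable default (assumption: all are required).
REQUIRED_KEYS = (
    "commander.public_url",
    "commander.shared_secret",
    "lium.template_id",
    "lium.volume",
    "run.hf_repo",
)


def lookup(config, dotted_key):
    """Walk a nested dict along 'a.b.c'; return None if any part is absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_keys(config):
    """Return the required dotted keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if lookup(config, key) in (None, "")]


def volume_looks_valid(volume):
    """Check the two volume forms mentioned above: new:name=... or id:<huid>."""
    return isinstance(volume, str) and volume.startswith(("new:name=", "id:"))
```

The `config` argument is a plain dict, so the file can be loaded with any YAML parser first, for example `yaml.safe_load` from PyYAML.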
For real training (instead of mock trainer), set your slime commands under worker.env:
worker:
  env:
    SLIME_SFT_CMD: "bash /workspace/slime/scripts/run-qwen3-4B-base-sft.sh"
    SLIME_RL_CMD: "bash /workspace/slime/scripts/run-qwen2.5-0.5B-reproducibility.sh"
Set:
- run.mode: sft
- run.run_id: <your-sft-run-id>
Then run:
python tools/relayctl.py launch-stack configs/launch_stack.yaml
Check status:
python tools/relayctl.py status --commander-url http://127.0.0.1:8080
lium ps
lium exec <pod_name> 'tail -n 200 /tmp/relay-worker.log'
Reuse the same file and update:
- run.mode: rl
- run.run_id: <your-rl-run-id>
- optional: run.save_rollout_trajectories: true
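Switching between SFT and RL changes only a few keys under `run`. A small sketch of deriving one config from the other, assuming the config is loaded as a nested dict with a `run` mapping; the helper name `with_run` is hypothetical:

```python
import copy

# The two run modes used in this guide.
VALID_MODES = ("sft", "rl")


def with_run(config, mode, run_id, save_rollout_trajectories=None):
    """Return a copy of `config` with run.mode and run.run_id replaced.

    The input dict is left untouched so the SFT settings stay available.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown run mode: {mode!r}")
    updated = copy.deepcopy(config)
    run = updated.setdefault("run", {})
    run["mode"] = mode
    run["run_id"] = run_id
    if save_rollout_trajectories is not None:
        # Optional key from the list above; only written when requested.
        run["save_rollout_trajectories"] = save_rollout_trajectories
    return updated
```

Copying rather than mutating keeps every other setting (commander, lium, worker.env) identical between the two runs.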
Then run again:
python tools/relayctl.py launch-stack configs/launch_stack.yaml
If you want to push explicitly after run completion:
lium exec <pod_name> 'cd /workspace/slime/relay-trainer && /root/venv/bin/python tools/push_latest_to_hf.py --run-root /mnt/relay/runs/<run_id> --repo-id <hf_repo> --branch main'
Use:
- relay-trainer/configs/launch_stack.liveweb_sft.example.yaml
- relay-trainer/configs/launch_stack.liveweb_rl.example.yaml
These are templates for HTTP-env-based workflows with LiveWeb-Arena.
# forwards to slime/train.py
python train.py --help
# forwards to slime/train_async.py
python train_async.py --help
Use this when you only need normal slime training and minimal orchestration.
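To illustrate what "forwards to" can mean for a wrapper like this, here is one possible shape. It is a sketch under assumptions, not the repository's actual train.py: the helper names are hypothetical, and the real wrappers may import slime directly or set up plugins first.

```python
import subprocess
import sys
from pathlib import Path


def build_forward_command(target, argv, repo_root=None):
    """Build the command that runs slime/<target> with the caller's arguments."""
    root = Path(repo_root) if repo_root is not None else Path.cwd()
    # Reuse the current interpreter so the same virtualenv is used.
    return [sys.executable, str(root / "slime" / target), *argv]


def forward(target, argv, repo_root=None):
    """Run the upstream script and return its exit code."""
    return subprocess.call(build_forward_command(target, argv, repo_root))


if __name__ == "__main__":
    # Prints the command instead of running it, so the sketch has no side effects.
    print(build_forward_command("train.py", sys.argv[1:]))
```

Keeping the wrapper this thin is what lets the slime/ submodule be upgraded without touching project code.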
Use relay-trainer/ when you need Lium pod orchestration, checkpoint lifecycle, and optional HF sync.
cd relay-trainer
python tools/relayctl.py launch-stack configs/launch_stack.example.yaml
For LiveWeb-Arena examples:
- relay-trainer/configs/launch_stack.liveweb_sft.example.yaml
- relay-trainer/configs/launch_stack.liveweb_rl.example.yaml
trainstack/
├── slime/ # Upstream framework (git submodule)
├── relay-trainer/ # Commander/Worker project
├── trainstack_plugins/ # Project-specific plugins (e.g. HTTP env adapter)
├── docs/ # Documentation index + language-specific docs
├── scripts/ # Utility scripts / smoke tests
├── train.py # Wrapper -> slime/train.py
└── train_async.py # Wrapper -> slime/train_async.py
Start here:
docs/README.md
English:
- docs/en/getting_started.md
- docs/en/training_quickstart.md
- docs/en/project_structure.md
Chinese:
- docs/zh/development_guide.md
- docs/zh/project_structure.md
- docs/zh/commander_worker_usage.md
- docs/zh/http_env_decoupling.md