
Conversation


@jbarnes850 jbarnes850 commented Jan 8, 2026

This PR adds an OSWorld VLM training cookbook and reference integration for multi‑turn desktop automation in Slime. It is written for a reviewer who has not seen OSWorld before and explains how Slime connects to a VM‑backed GUI environment, why an HTTP bridge is required, and how training is run end‑to‑end without touching Slime core code.

Architecture

OSWorld depends on desktop‑env (torch 2.5.1), while Slime’s training stack uses torch 2.9+, so the environment runs on the host and training runs in the container.
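For context, here is a minimal sketch of what the host-side bridge looks like: a stateless HTTP server that runs with the desktop-env dependencies and proxies reset/step calls into the VM. The endpoint names, DesktopEnv constructor arguments, and JSON layout are illustrative assumptions, not the exact interface in examples/osworld/.

```python
# Host-side bridge sketch (runs on the host with torch 2.5.1 / desktop-env installed).
# Endpoint names and DesktopEnv arguments below are assumptions for illustration.
from flask import Flask, jsonify, request
from desktop_env.desktop_env import DesktopEnv  # assumed import path

app = Flask(__name__)
env = None  # one VM-backed environment per bridge process


@app.route("/reset", methods=["POST"])
def reset():
    """Reset the VM to a task's initial state and return the first observation."""
    global env
    task_config = request.get_json()
    if env is None:
        env = DesktopEnv(action_space="pyautogui")  # assumed constructor args
    obs = env.reset(task_config=task_config)
    return jsonify({
        "screenshot": obs["screenshot"].hex(),  # bytes -> hex so it fits in JSON
        "instruction": task_config.get("instruction", ""),
    })


@app.route("/step", methods=["POST"])
def step():
    """Execute one agent action in the VM and return the next observation and reward."""
    payload = request.get_json()
    obs, reward, done, info = env.step(payload["action"])
    return jsonify({
        "screenshot": obs["screenshot"].hex(),
        "reward": reward,
        "done": done,
        "info": info,
    })


if __name__ == "__main__":
    # Bind on the host so the training container can reach it over the network.
    app.run(host="0.0.0.0", port=8000)
```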

The cookbook makes that split explicit, proxies environment calls through a stateless HTTP server, and drives a multi‑turn VLM rollout loop that produces Qwen3‑VL multimodal tokens and aligned loss masks. Experience replay is used to avoid advantage collapse in sparse reward tasks, and reward shaping provides partial credit for valid actions, execution, and UI changes.
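As a rough illustration of the shaping described above, the per-step reward can combine the environment's sparse task score with small bonuses for a parseable action, successful execution, and an observed screen change. The signal names and weights here are assumptions for the sketch, not the cookbook's exact values.

```python
def shaped_reward(task_score: float,
                  action_parsed: bool,
                  action_executed: bool,
                  ui_changed: bool) -> float:
    """Partial-credit shaping for sparse OSWorld rewards.

    task_score is the environment's sparse success signal (0.0 or 1.0);
    the bonus weights below are illustrative, not the cookbook's values.
    """
    reward = task_score
    if action_parsed:      # the model emitted a syntactically valid action
        reward += 0.05
    if action_executed:    # the bridge executed it without an error
        reward += 0.05
    if ui_changed:         # the screenshot changed after the action
        reward += 0.10
    return reward
```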

Reproducible artifacts are linked so the pipeline can be validated without hunting for external dependencies: the task registry and replay buffer are in Jarrodbarnes/osworld-union-v1, the SFT warmup checkpoint is Jarrodbarnes/osworld-vlm-sft-step25, and the GSPO checkpoint is Jarrodbarnes/osworld-vlm-gspo-step25. Metrics for the training runs are tracked in W&B at jbarnes850-near-protocol/osworld-grpo.

The README documents the OSWorld subset used for training (Ubuntu only), how to launch the host VM, how the HTTP bridge is wired, and how to run training. Verification was done by running pre‑commit and compiling the example module.
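To make the wiring concrete, a hedged sketch of how the container-side rollout loop might talk to the bridge over HTTP during a multi-turn episode. BRIDGE_URL, the endpoint names, and the generate_action helper are placeholders, not the cookbook's actual interface.

```python
import requests

# Placeholder address; the actual host/port depends on how the VM host and container are networked.
BRIDGE_URL = "http://host.docker.internal:8000"


def run_episode(task_config: dict, max_turns: int = 15) -> list[dict]:
    """Drive one multi-turn rollout against the host bridge and collect turns for training."""
    obs = requests.post(f"{BRIDGE_URL}/reset", json=task_config, timeout=300).json()
    turns = []
    for _ in range(max_turns):
        action = generate_action(obs)  # hypothetical helper: VLM inference on the current screenshot
        result = requests.post(f"{BRIDGE_URL}/step", json={"action": action}, timeout=300).json()
        turns.append({"observation": obs, "action": action, "reward": result["reward"]})
        if result["done"]:
            break
        obs = result
    return turns
```

Because the bridge is stateless from the trainer's point of view, each rollout worker only needs the URL, which keeps the torch 2.9+ training container fully decoupled from the torch 2.5.1 environment stack on the host.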

This PR keeps all code confined to examples/osworld/ and does not modify Slime core paths.

@jbarnes850 jbarnes850 marked this pull request as ready for review January 8, 2026 18:46
@jbarnes850 jbarnes850 force-pushed the feature/osworld-vlm-cookbook branch from edff4f7 to 983cd40 Compare January 8, 2026 23:10
@GuanxingLu (Contributor)

This looks awesome. Any chance you could share some training logs (e.g., reward curve) or a quick demo?

@jbarnes850 jbarnes850 marked this pull request as draft January 12, 2026 01:20