examples/opencua_vlm_multi_turn/README.md
# VLM Multi-Turn (FSDP backend, OpenCUA click dataset)
Training a VLM with FSDP on the [OpenCUA click dataset](https://huggingface.co/datasets/mlfoundations-cua-dev/agentnet-clicks) with multi-turn reasoning and interactive environment feedback, using GRPO.

## Data preprocessing
The preprocessing script makes minor modifications to the prompt, converting absolute coordinates to relative coordinates to better suit Qwen3-VL, and encodes images as Base64 so that slime can read them correctly. You therefore need to run it before training or evaluation:
```bash
python examples/opencua_vlm_multi_turn/data_prepocessing.py
```

Note that this script currently processes only a single shard of the dataset. For better results, it is recommended to preprocess all dataset files, for example along the lines of the sketch below.
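
A minimal variant of the loading step in `data_prepocessing.py` that picks up every shard might look like this (the shard-name pattern and paths are assumptions based on the download step in the Reproduce section):

```python
from glob import glob

from datasets import load_dataset

# Load every downloaded shard instead of a single file; afterwards apply
# the same `process_batch` mapping used in data_prepocessing.py.
files = sorted(glob("/root/datasets/opencua/train-*-of-00162.parquet"))
dataset = load_dataset("parquet", data_files=files)["train"]
```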

## Rollout
The multi-turn rollout is implemented through a custom generate function, `examples.opencua_vlm_multi_turn.rollout.generate`, which overrides the default generate function.
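
As a rough sketch of the control flow (`llm_generate` stands in for the SGLang call; the actual `rollout.py` additionally builds loss masks and log probs and early-stops on `max_new_tokens`):

```python
# Simplified multi-turn rollout loop (illustration only, not slime's API).
def multi_turn_generate(env, llm_generate, messages: list[dict], max_turns: int) -> list[dict]:
    env.reset()
    for _ in range(max_turns):
        response_text = llm_generate(messages)  # token generation via SGLang in the real code
        messages.append({"role": "assistant", "content": response_text})
        observation, done, info = env.step(response_text)
        if done:
            break
        # Env feedback is appended as a chat message for the next turn.
        messages.append(env.format_observation(observation))
    env.close()
    return messages
```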

For the environment interaction, this example initializes a custom interactive environment in `examples/opencua_vlm_multi_turn/env_opencua.py` with the APIs below.

<details>
<summary>Environment API (opencua click)</summary>

- `build_env(sample: Sample | None = None, args: Any | None = None, **_) -> OpencuaEnv`: constructs the env.
- `reset() -> tuple[dict, dict]`: clears internal state.
- `step(response_text: str) -> tuple[dict, bool, dict]`: parses the actor’s response text, updates the environment state, and returns a new observation. Like on-screen click feedback, if the model’s predicted location is inaccurate, the feedback indicates the point's direction relative to the true bounding box.
- `format_observation(observation: dict) -> dict`: converts an env observation into a chat message.
</details><br>
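
A minimal usage sketch of this API, constructing the env directly with an arbitrary relative bbox (during training, `build_env` pulls the ground truth from `sample.label`):

```python
import json

from examples.opencua_vlm_multi_turn.env_opencua import OpencuaEnv

# build_env() normally resolves the ground truth from the sample; here we
# construct the env directly with an arbitrary bbox and max_turns for illustration.
env = OpencuaEnv(ground_truth=json.dumps([0.40, 0.40, 0.60, 0.60]), max_turns=3)
observation, reset_info = env.reset()

# A canned "model response" stands in for actual generation.
observation, done, info = env.step(r"\boxed{(0.30,0.55)}")
print(observation["obs_str"])  # Point is outside the box: slightly left and vertically aligned. (score=0.2)
env.close()
```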

## Reward
The reward model is the OpenCUA Click RM in `slime/rollout/rm_hub/opencua_click.py`. It evaluates a predicted (x, y) location against a ground-truth bounding box and provides a reward based on accuracy and validity:

- Hard constraint: If the predicted coordinates are outside the [0, 1] range, a penalty of -0.5 is returned.

- Inside bounding box: If the prediction lies within the ground-truth box, the reward is 1.0.

- Soft distance penalty: For predictions outside the box but still within [0, 1], a smooth L-infinity distance (d_soft) to the box is computed, and the reward decays exponentially with distance: exp(-5.0 * d_soft).


The `compute_opencua_reward` function parses the model’s output text to extract coordinates in the `\boxed{(x, y)}` format. If parsing fails or the solution is invalid, a negative reward of -1.0 is returned. Otherwise, the reward is computed with the `grounding_reward` function.
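
For reference, a minimal sketch of the scoring rule described above (the authoritative implementation is `grounding_reward` in `slime/rollout/rm_hub/opencua_click.py`, whose exact smoothing may differ):

```python
import math


def grounding_reward(x: float, y: float, bbox: list[float]) -> float:
    # Hard constraint: coordinates must be normalized to [0, 1].
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return -0.5
    x_min, y_min, x_max, y_max = bbox
    # Per-axis distance to the box (0.0 when inside along that axis).
    dx = max(x_min - x, 0.0, x - x_max)
    dy = max(y_min - y, 0.0, y - y_max)
    d_soft = max(dx, dy)  # L-infinity distance to the box
    # 1.0 inside the box, otherwise exponential decay with distance.
    return 1.0 if d_soft == 0.0 else math.exp(-5.0 * d_soft)
```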

![VLM multi-turn opencua reward](opencua_click_reward.png)


## Reproduce
```bash
# 1) Set environment variables
export WANDB_API_KEY=...
export SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-2B-Instruct
export SLIME_SCRIPT_NUM_GPUS=4
export SLIME_SCRIPT_TRAIN_BACKEND=fsdp

# 2) Download and preprocess the dataset
hf download mlfoundations-cua-dev/agentnet-clicks --repo-type=dataset --local-dir /root/datasets/opencua
python examples/opencua_vlm_multi_turn/data_prepocessing.py

# 3) Run the script
cd /root/slime
python examples/opencua_vlm_multi_turn/run_opencua_vlm_multi_turn.py
```

## What each file does
- `examples/opencua_vlm_multi_turn/data_prepocessing.py`: adapts the data format so that it can be correctly recognized and processed by slime.
- `examples/opencua_vlm_multi_turn/run_opencua_vlm_multi_turn.py`: downloads model, sets training/rollout args, and launches the run.
- `examples/opencua_vlm_multi_turn/opencua_vlm_multi_turn_config.yaml`: specifies `max_turns` and `rollout_interaction_env_path` for the multi-turn rollout.
- `examples/opencua_vlm_multi_turn/rollout.py`: custom multi-turn rollout that calls SGLang for token generation, builds loss masks/log probs, enforces `max_turns`, and early-stops on `max_new_tokens`.
- `examples/opencua_vlm_multi_turn/env_opencua.py`: OpenCUA click env that parses `\boxed{(x, y)}`, scores position answers, and returns env feedback per turn.
- `slime/rollout/rm_hub/opencua_click.py`: computes rewards based on the OpenCUA click environment.
examples/opencua_vlm_multi_turn/data_prepocessing.py
import base64
import json
from io import BytesIO

from datasets import load_dataset
from PIL import Image

# -------------------------
# 1. Load dataset
# -------------------------
dataset = load_dataset("parquet", data_files="/root/datasets/opencua/train-00000-of-00162.parquet")["train"]

# -------------------------
# 2. Helper functions
# -------------------------
NEW_EASYR1_PREFIX = """You are an expert UI element locator.

Given a GUI image and a user's element description, identify the specified element
and output its location as a normalized coordinate.

The coordinate system is defined as:
- x and y are normalized to the range [0, 1]
- (0, 0) corresponds to the top-left corner of the image
- (1, 1) corresponds to the bottom-right corner of the image

For elements with area, return the center point.

Output the final answer in \\boxed{(x,y)}.
Do not include any explanation outside the box.
"""


def pil_to_base64(image: Image.Image) -> str:
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    img_bytes = buffered.getvalue()
    img_base64 = base64.b64encode(img_bytes).decode("utf-8")
    # b64encode already pads, but keep the guard for safety.
    while len(img_base64) % 4 != 0:
        img_base64 += "="
    return f"data:image/png;base64,{img_base64}"


# -------------------------
# 3. Processing function (batched)
# -------------------------
def process_batch(batch):
    new_easyr1_prompts = []
    relative_answers = []
    relative_bboxes = []
    images_base64 = []

    for i in range(len(batch["easyr1_prompt"])):
        ex = {k: batch[k][i] for k in batch.keys()}

        # -------- easyr1_prompt --------
        prompt = ex["easyr1_prompt"]
        if "<image>" in prompt:
            _, after = prompt.split("<image>", 1)
            prompt = NEW_EASYR1_PREFIX + "\n\n<image>" + after
        new_easyr1_prompts.append(prompt)

        # -------- relative answer --------
        x_px, y_px = ex["click_point"]
        w, h = ex["image_width"], ex["image_height"]
        rx = x_px / w
        ry = y_px / h
        relative_answers.append(f"\\boxed{{({rx:.4f},{ry:.4f})}}")

        # -------- relative bbox --------
        x1, y1, x2, y2 = ex["bbox"]
        rel_bbox = [x1 / w, y1 / h, x2 / w, y2 / h]
        relative_bboxes.append(json.dumps(rel_bbox))

        # -------- images -> base64 list --------
        img_entry = ex["images"]
        if isinstance(img_entry, list):
            base64_list = []
            for im in img_entry:
                if isinstance(im, Image.Image):
                    base64_list.append(pil_to_base64(im))
                elif isinstance(im, str):
                    base64_list.append(im)
                else:
                    raise ValueError(f"Unsupported image type in list: {type(im)}")
            images_base64.append(base64_list)
        elif isinstance(img_entry, Image.Image):
            images_base64.append([pil_to_base64(img_entry)])
        else:
            raise ValueError(f"Unsupported image type: {type(img_entry)}")

    return {
        "easyr1_prompt": new_easyr1_prompts,
        "relative_answer": relative_answers,
        "relative_bbox": relative_bboxes,
        "images_base64": images_base64,
    }


# -------------------------
# 4. Map dataset
# -------------------------

# If the dataset is too large, you can select a small subset to test.
dataset = dataset.select(range(500))


dataset = dataset.map(
    process_batch,
    batched=True,
    batch_size=16,
    num_proc=1,
)


# -------------------------
# 5. Save
# -------------------------
print(dataset[0])
dataset.to_parquet("/root/datasets/opencua/train_relative_base64.parquet")
print("Saved train_relative_base64.parquet")
examples/opencua_vlm_multi_turn/env_opencua.py
from __future__ import annotations

import json
import logging
import re
from typing import Any

try:
    import orjson  # type: ignore
except Exception:  # pragma: no cover - optional dependency
    orjson = None
from examples.opencua_vlm_multi_turn.base_env import BaseInteractionEnv

from slime.utils.types import Sample

logger = logging.getLogger(__name__)

# Matches the JSON payload emitted between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# Accept either name; verl uses `calc_opencua_reward` while the instruction refers to `calc_score`.
SUPPORTED_TOOL_NAMES = {"calc_score", "calc_opencua_reward"}


class OpencuaEnv(BaseInteractionEnv):
    r"""
    Minimal environment for OpenCUA tasks where the model outputs a \boxed{(x,y)} response.

    Reward is 1.0 if the predicted point is inside the ground-truth relative_bbox,
    0.2 if it is a valid relative point outside the box, and 0.0 otherwise.
    Feedback is provided in English indicating if the point is
    left/right/up/down relative to the target bbox.
    """

    def __init__(self, *, ground_truth: str | None = None, max_turns: int | None = None):
        """
        Args:
            ground_truth (str): JSON string of relative bbox [x_min, y_min, x_max, y_max]
            max_turns (int | None): maximum allowed steps per episode
        """
        self.ground_truth = str(ground_truth) if ground_truth is not None else None
        self.last_score: float | None = None
        self.turn = 0
        self.max_turns = max_turns

    def reset(self):
        """Reset the environment to start a new episode."""
        self.turn = 0
        self.last_score = None
        observation = {}  # No initial observation needed
        reset_info = {"ground_truth_available": self.ground_truth is not None}
        return observation, reset_info

    def close(self):
        """No resources to release in this minimal environment."""
        return

    # -----------------------------
    # Reward computation
    # -----------------------------
    def _score_answer(self, answer: str) -> float:
        r"""
        Parse the answer string \boxed{(x,y)} and check if the point lies
        inside the ground-truth bbox.
        """
        if not self.ground_truth:
            return 0.0
        try:
            bbox = json.loads(self.ground_truth)
            x_min, y_min, x_max, y_max = bbox
        except Exception:
            return 0.0

        # Parse the \boxed{(x,y)} pattern
        m = re.search(r"\\boxed\s*\{\s*\(\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\)\s*\}", answer)
        logger.info(f"Parse the boxed pattern, m={m}, answer={answer}")
        if not m:
            return 0.0
        try:
            x = float(m.group(1))
            y = float(m.group(2))
            logger.info(f"score_answer, boxed x={x}, y={y}")
        except Exception:
            logger.info("Parsed coordinates could not be converted to float")
            return 0.0

        # Reward = 1 if point is inside bbox
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return 1.0
        elif x > 1 or y > 1 or x < 0 or y < 0:
            return 0.0  # not relative grounding
        else:
            return 0.2  # relative grounding, format reward

    # -----------------------------
    # Directional feedback
    # -----------------------------
    def _directional_feedback(self, x: float, y: float, bbox: list[float]) -> str:
        """
        Provide English feedback about the predicted point relative to bbox.

        Returns a string like:
            "Point is outside the box: slightly left and slightly above."
            "Point is inside the box."
        """
        x_min, y_min, x_max, y_max = bbox
        horz, vert = "", ""

        # Horizontal feedback
        if x < x_min:
            horz = "slightly left"
        elif x > x_max:
            horz = "slightly right"
        else:
            horz = "horizontally aligned"

        # Vertical feedback
        if y < y_min:
            vert = "slightly above"
        elif y > y_max:
            vert = "slightly below"
        else:
            vert = "vertically aligned"

        if horz == "horizontally aligned" and vert == "vertically aligned":
            return "Point is inside the box."
        if x < 0 or x > 1 or y < 0 or y > 1:
            return "x and y must be in the range [0, 1]."
        return f"Point is outside the box: {horz} and {vert}."

    # -----------------------------
    # Step function
    # -----------------------------
    def step(self, response_text: str):
        r"""
        Step function for the environment.

        Args:
            response_text (str): model response, must contain \boxed{(x,y)}

        Returns:
            obs (dict): observation dictionary with 'obs_str' and 'tool_score'
            done (bool): whether the episode is finished
            info (dict): additional information including parsed_answer and score
        """
        self.turn += 1
        done = self.max_turns is not None and self.turn >= self.max_turns

        parsed_answer = response_text.strip()
        score = self._score_answer(parsed_answer)
        logger.info(f"parsed_answer={parsed_answer}, score={score}")
        self.last_score = score

        # Parse the predicted point for feedback (same \boxed{(x,y)} pattern as scoring)
        try:
            bbox = json.loads(self.ground_truth)
            m = re.search(r"\\boxed\s*\{\s*\(\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\)\s*\}", parsed_answer)
            if m:
                x_pred = float(m.group(1))
                y_pred = float(m.group(2))
                direction_feedback = self._directional_feedback(x_pred, y_pred, bbox)
            else:
                direction_feedback = "No valid point found in the answer."
        except Exception:
            direction_feedback = "Failed to parse ground-truth bbox."

        obs_str = f"{direction_feedback} (score={score})"
        obs = {"obs_str": obs_str, "role": "tool", "tool_score": score}

        info = {
            "parsed_answer": parsed_answer,
            "score": score,
            "turn": self.turn,
            "tool_executed": True,
            "answer_missing": score == 0.0,
        }

        return obs, done, info


def _extract_ground_truth(sample: Sample | None) -> str | None:
    """Resolve the ground-truth answer from the sample label."""
    if sample is None:
        return None
    if sample.label is not None:
        return str(sample.label)
    # metadata = sample.metadata
    # for key in ("answer", "ground_truth", "label"):
    #     if key in metadata and metadata[key] is not None:
    #         return str(metadata[key])
    return None


def build_env(sample: Sample | None = None, args: Any | None = None, **_: Any) -> OpencuaEnv:
    """
    Construct an OpencuaEnv. Ground truth is pulled from sample.label.
    """
    ground_truth = _extract_ground_truth(sample)
    max_turns = args.max_turns
    if max_turns is None:
        raise ValueError("max_turns must be set via --custom-config-path in the custom config file.")
    if ground_truth is None:
        logger.warning("Ground truth answer missing; calc_score tool will always return 0.")
    return OpencuaEnv(ground_truth=ground_truth, max_turns=max_turns)
examples/opencua_vlm_multi_turn/opencua_click_reward.png (binary file, not shown)

examples/opencua_vlm_multi_turn/opencua_vlm_multi_turn_config.yaml
max_turns: 1
rollout_interaction_env_path: examples.opencua_vlm_multi_turn.env_opencua