examples/opencua_vlm_multi_turn/README.md
# VLM Multi-Turn (FSDP backend, OpenCUA click dataset)
Training a VLM with FSDP on the [OpenCUA click dataset](https://huggingface.co/datasets/mlfoundations-cua-dev/agentnet-clicks) with multi-turn reasoning and interactive environment feedback, using GRPO.

## Data preprocessing
The preprocessing script makes minor modifications to the prompt, converting absolute coordinates to relative coordinates to better suit Qwen3-VL, and encodes images as Base64 so that slime can read them correctly. You therefore need to run it before training or evaluation:
```bash
python examples/opencua_vlm_multi_turn/data_prepocessing.py
```

Note that this script currently processes only a single shard of the dataset. For better results, it is recommended to preprocess all dataset files, for example along the lines of the sketch below.
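
A minimal variant of the loading step in `data_prepocessing.py` that picks up every shard might look like this (the shard-name pattern and paths are assumptions based on the download step in the Reproduce section):

```python
from glob import glob

from datasets import load_dataset

# Load every downloaded shard instead of a single file; afterwards apply
# the same `process_batch` mapping used in data_prepocessing.py.
files = sorted(glob("/root/datasets/opencua/train-*-of-00162.parquet"))
dataset = load_dataset("parquet", data_files=files)["train"]
```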

## Rollout
The multi-turn rollout is implemented through a custom generate function, `examples.opencua_vlm_multi_turn.rollout.generate`, which overrides the default generate function.
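
As a rough sketch of the control flow (`llm_generate` stands in for the SGLang call; the actual `rollout.py` additionally builds loss masks and log probs and early-stops on `max_new_tokens`):

```python
# Simplified multi-turn rollout loop (illustration only, not slime's API).
def multi_turn_generate(env, llm_generate, messages: list[dict], max_turns: int) -> list[dict]:
    env.reset()
    for _ in range(max_turns):
        response_text = llm_generate(messages)  # token generation via SGLang in the real code
        messages.append({"role": "assistant", "content": response_text})
        observation, done, info = env.step(response_text)
        if done:
            break
        # Env feedback is appended as a chat message for the next turn.
        messages.append(env.format_observation(observation))
    env.close()
    return messages
```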

For the environment interaction, this example initializes a custom interactive environment in `examples/opencua_vlm_multi_turn/env_opencua.py` with the APIs below.

<details>
<summary>Environment API (opencua click)</summary>

- `build_env(sample: Sample | None = None, args: Any | None = None, **_) -> OpencuaEnv`: constructs the env.
- `reset() -> tuple[dict, dict]`: clears internal state.
- `step(response_text: str) -> tuple[dict, bool, dict]`: parses the actor’s response text, updates the environment state, and returns a new observation. Like on-screen click feedback, if the model’s predicted location is inaccurate, the feedback indicates the point's direction relative to the true bounding box.
- `format_observation(observation: dict) -> dict`: converts an env observation into a chat message.
</details><br>
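
A minimal usage sketch of this API, constructing the env directly with an arbitrary relative bbox (during training, `build_env` pulls the ground truth from `sample.label`):

```python
import json

from examples.opencua_vlm_multi_turn.env_opencua import OpencuaEnv

# build_env() normally resolves the ground truth from the sample; here we
# construct the env directly with an arbitrary bbox and max_turns for illustration.
env = OpencuaEnv(ground_truth=json.dumps([0.40, 0.40, 0.60, 0.60]), max_turns=3)
observation, reset_info = env.reset()

# A canned "model response" stands in for actual generation.
observation, done, info = env.step(r"\boxed{(0.30,0.55)}")
print(observation["obs_str"])  # Point is outside the box: slightly left and vertically aligned. (score=0.2)
env.close()
```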

## Reward
The reward model is the OpenCUA Click RM in `slime/rollout/rm_hub/opencua_click.py`. It evaluates a predicted (x, y) location against a ground-truth bounding box and provides a reward based on accuracy and validity:

- Hard constraint: If the predicted coordinates are outside the [0, 1] range, a penalty of -0.5 is returned.

- Inside bounding box: If the prediction lies within the ground-truth box, the reward is 1.0.

- Soft distance penalty: For predictions outside the box but still within [0, 1], a smooth L-infinity distance (d_soft) to the box is computed, and the reward decays exponentially with distance: exp(-5.0 * d_soft).


The `compute_opencua_reward` function parses the model’s output text to extract coordinates in the `\boxed{(x, y)}` format. If parsing fails or the solution is invalid, a negative reward of -1.0 is returned. Otherwise, the reward is computed with the `grounding_reward` function.
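
For reference, a minimal sketch of the scoring rule described above (the authoritative implementation is `grounding_reward` in `slime/rollout/rm_hub/opencua_click.py`, whose exact smoothing may differ):

```python
import math


def grounding_reward(x: float, y: float, bbox: list[float]) -> float:
    # Hard constraint: coordinates must be normalized to [0, 1].
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return -0.5
    x_min, y_min, x_max, y_max = bbox
    # Per-axis distance to the box (0.0 when inside along that axis).
    dx = max(x_min - x, 0.0, x - x_max)
    dy = max(y_min - y, 0.0, y - y_max)
    d_soft = max(dx, dy)  # L-infinity distance to the box
    # 1.0 inside the box, otherwise exponential decay with distance.
    return 1.0 if d_soft == 0.0 else math.exp(-5.0 * d_soft)
```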

![VLM multi-turn opencua reward](opencua_click_reward.png)


## Reproduce
```bash
# 1) Set environment variables
export WANDB_API_KEY=...
export SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-2B-Instruct
export SLIME_SCRIPT_NUM_GPUS=4
export SLIME_SCRIPT_TRAIN_BACKEND=fsdp

# 2) Download and preprocess the dataset
hf download mlfoundations-cua-dev/agentnet-clicks --repo-type=dataset --local-dir /root/datasets/opencua
python examples/opencua_vlm_multi_turn/data_prepocessing.py

# 3) Run the script
cd /root/slime
python examples/opencua_vlm_multi_turn/run_opencua_vlm_multi_turn.py
```

## What each file does
- `examples/opencua_vlm_multi_turn/data_prepocessing.py`: adapts the data format so that it can be correctly recognized and processed by slime.
- `examples/opencua_vlm_multi_turn/run_opencua_vlm_multi_turn.py`: downloads model, sets training/rollout args, and launches the run.
- `examples/opencua_vlm_multi_turn/opencua_vlm_multi_turn_config.yaml`: specifies `max_turns` and `rollout_interaction_env_path` for the multi-turn rollout.
- `examples/opencua_vlm_multi_turn/rollout.py`: custom multi-turn rollout that calls SGLang for token generation, builds loss masks/log probs, enforces `max_turns`, and early-stops on `max_new_tokens`.
- `examples/opencua_vlm_multi_turn/env_opencua.py`: OpenCUA click env that parses `\boxed{(x, y)}`, scores position answers, and returns env feedback per turn.
- `slime/rollout/rm_hub/opencua_click.py`: computes rewards based on the OpenCUA click environment.
examples/opencua_vlm_multi_turn/data_prepocessing.py
import base64
import json
from io import BytesIO

from datasets import load_dataset
from PIL import Image

# -------------------------
# 1. Load dataset
# -------------------------
dataset = load_dataset("parquet", data_files="/root/datasets/opencua/train-00000-of-00162.parquet")["train"]

# -------------------------
# 2. Helper functions
# -------------------------
NEW_EASYR1_PREFIX = """You are an expert UI element locator.

Given a GUI image and a user's element description, identify the specified element
and output its location as a normalized coordinate.

The coordinate system is defined as:
- x and y are normalized to the range [0, 1]
- (0, 0) corresponds to the top-left corner of the image
- (1, 1) corresponds to the bottom-right corner of the image

For elements with area, return the center point.

Output the final answer in \\boxed{(x,y)}.
Do not include any explanation outside the box.
"""


def pil_to_base64(image: Image.Image) -> str:
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    img_bytes = buffered.getvalue()
    img_base64 = base64.b64encode(img_bytes).decode("utf-8")
    # b64encode already pads, but keep the guard for safety.
    while len(img_base64) % 4 != 0:
        img_base64 += "="
    return f"data:image/png;base64,{img_base64}"


# -------------------------
# 3. Processing function (batched)
# -------------------------
def process_batch(batch):
    new_easyr1_prompts = []
    relative_answers = []
    relative_bboxes = []
    images_base64 = []

    for i in range(len(batch["easyr1_prompt"])):
        ex = {k: batch[k][i] for k in batch.keys()}

        # -------- easyr1_prompt --------
        prompt = ex["easyr1_prompt"]
        if "<image>" in prompt:
            _, after = prompt.split("<image>", 1)
            prompt = NEW_EASYR1_PREFIX + "\n\n<image>" + after
        new_easyr1_prompts.append(prompt)

        # -------- relative answer --------
        x_px, y_px = ex["click_point"]
        w, h = ex["image_width"], ex["image_height"]
        rx = x_px / w
        ry = y_px / h
        relative_answers.append(f"\\boxed{{({rx:.4f},{ry:.4f})}}")

        # -------- relative bbox --------
        x1, y1, x2, y2 = ex["bbox"]
        rel_bbox = [x1 / w, y1 / h, x2 / w, y2 / h]
        relative_bboxes.append(json.dumps(rel_bbox))

        # -------- images -> base64 list --------
        img_entry = ex["images"]
        if isinstance(img_entry, list):
            base64_list = []
            for im in img_entry:
                if isinstance(im, Image.Image):
                    base64_list.append(pil_to_base64(im))
                elif isinstance(im, str):
                    base64_list.append(im)
                else:
                    raise ValueError(f"Unsupported image type in list: {type(im)}")
            images_base64.append(base64_list)
        elif isinstance(img_entry, Image.Image):
            images_base64.append([pil_to_base64(img_entry)])
        else:
            raise ValueError(f"Unsupported image type: {type(img_entry)}")

    return {
        "easyr1_prompt": new_easyr1_prompts,
        "relative_answer": relative_answers,
        "relative_bbox": relative_bboxes,
        "images_base64": images_base64,
    }


# -------------------------
# 4. Map dataset
# -------------------------

# If the dataset is too large, you can select a small subset to test.
dataset = dataset.select(range(500))


dataset = dataset.map(
    process_batch,
    batched=True,
    batch_size=16,
    num_proc=1,
)


# -------------------------
# 5. Save
# -------------------------
print(dataset[0])
dataset.to_parquet("/root/datasets/opencua/train_relative_base64.parquet")
print("Saved train_relative_base64.parquet")
examples/opencua_vlm_multi_turn/env_opencua.py
from __future__ import annotations

import json
import logging
import re
from typing import Any

try:
    import orjson  # type: ignore
except Exception:  # pragma: no cover - optional dependency
    orjson = None
from examples.opencua_vlm_multi_turn.base_env import BaseInteractionEnv

from slime.utils.types import Sample

logger = logging.getLogger(__name__)

# Matches the JSON payload emitted between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# Accept either name; verl uses `calc_opencua_reward` while the instruction refers to `calc_score`.
SUPPORTED_TOOL_NAMES = {"calc_score", "calc_opencua_reward"}


class OpencuaEnv(BaseInteractionEnv):
    r"""
    Minimal environment for OpenCUA tasks where the model outputs a \boxed{(x,y)} response.

    Reward is 1.0 if the predicted point is inside the ground-truth relative_bbox,
    0.2 if it is a valid relative point outside the box, and 0.0 otherwise.
    Feedback is provided in English indicating if the point is
    left/right/up/down relative to the target bbox.
    """

    def __init__(self, *, ground_truth: str | None = None, max_turns: int | None = None):
        """
        Args:
            ground_truth (str): JSON string of relative bbox [x_min, y_min, x_max, y_max]
            max_turns (int | None): maximum allowed steps per episode
        """
        self.ground_truth = str(ground_truth) if ground_truth is not None else None
        self.last_score: float | None = None
        self.turn = 0
        self.max_turns = max_turns

    def reset(self):
        """Reset the environment to start a new episode."""
        self.turn = 0
        self.last_score = None
        observation = {}  # No initial observation needed
        reset_info = {"ground_truth_available": self.ground_truth is not None}
        return observation, reset_info

    def close(self):
        """No resources to release in this minimal environment."""
        return

    # -----------------------------
    # Reward computation
    # -----------------------------
    def _score_answer(self, answer: str) -> float:
        r"""
        Parse the answer string \boxed{(x,y)} and check if the point lies
        inside the ground-truth bbox.
        """
        if not self.ground_truth:
            return 0.0
        try:
            bbox = json.loads(self.ground_truth)
            x_min, y_min, x_max, y_max = bbox
        except Exception:
            return 0.0

        # Parse the \boxed{(x,y)} pattern
        m = re.search(r"\\boxed\s*\{\s*\(\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\)\s*\}", answer)
        logger.info(f"Parse the boxed pattern, m={m}, answer={answer}")
        if not m:
            return 0.0
        try:
            x = float(m.group(1))
            y = float(m.group(2))
            logger.info(f"score_answer, boxed x={x}, y={y}")
        except Exception:
            logger.info("Parsed coordinates could not be converted to float")
            return 0.0

        # Reward = 1 if point is inside bbox
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return 1.0
        elif x > 1 or y > 1 or x < 0 or y < 0:
            return 0.0  # not relative grounding
        else:
            return 0.2  # relative grounding, format reward

    # -----------------------------
    # Directional feedback
    # -----------------------------
    def _directional_feedback(self, x: float, y: float, bbox: list[float]) -> str:
        """
        Provide English feedback about the predicted point relative to bbox.

        Returns a string like:
            "Point is outside the box: slightly left and slightly above."
            "Point is inside the box."
        """
        x_min, y_min, x_max, y_max = bbox
        horz, vert = "", ""

        # Horizontal feedback
        if x < x_min:
            horz = "slightly left"
        elif x > x_max:
            horz = "slightly right"
        else:
            horz = "horizontally aligned"

        # Vertical feedback
        if y < y_min:
            vert = "slightly above"
        elif y > y_max:
            vert = "slightly below"
        else:
            vert = "vertically aligned"

        if horz == "horizontally aligned" and vert == "vertically aligned":
            return "Point is inside the box."
        if x < 0 or x > 1 or y < 0 or y > 1:
            return "x and y must be in the range [0, 1]."
        return f"Point is outside the box: {horz} and {vert}."

    # -----------------------------
    # Step function
    # -----------------------------
    def step(self, response_text: str):
        r"""
        Step function for the environment.

        Args:
            response_text (str): model response, must contain \boxed{(x,y)}

        Returns:
            obs (dict): observation dictionary with 'obs_str' and 'tool_score'
            done (bool): whether the episode is finished
            info (dict): additional information including parsed_answer and score
        """
        self.turn += 1
        done = self.max_turns is not None and self.turn >= self.max_turns

        parsed_answer = response_text.strip()
        score = self._score_answer(parsed_answer)
        logger.info(f"parsed_answer={parsed_answer}, score={score}")
        self.last_score = score

        # Parse the predicted point for feedback (same \boxed{(x,y)} pattern as scoring)
        try:
            bbox = json.loads(self.ground_truth)
            m = re.search(r"\\boxed\s*\{\s*\(\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\)\s*\}", parsed_answer)
            if m:
                x_pred = float(m.group(1))
                y_pred = float(m.group(2))
                direction_feedback = self._directional_feedback(x_pred, y_pred, bbox)
            else:
                direction_feedback = "No valid point found in the answer."
        except Exception:
            direction_feedback = "Failed to parse ground-truth bbox."

        obs_str = f"{direction_feedback} (score={score})"
        obs = {"obs_str": obs_str, "role": "tool", "tool_score": score}

        info = {
            "parsed_answer": parsed_answer,
            "score": score,
            "turn": self.turn,
            "tool_executed": True,
            "answer_missing": score == 0.0,
        }

        return obs, done, info


def _extract_ground_truth(sample: Sample | None) -> str | None:
    """Resolve the ground-truth answer from the sample label."""
    if sample is None:
        return None
    if sample.label is not None:
        return str(sample.label)
    # metadata = sample.metadata
    # for key in ("answer", "ground_truth", "label"):
    #     if key in metadata and metadata[key] is not None:
    #         return str(metadata[key])
    return None


def build_env(sample: Sample | None = None, args: Any | None = None, **_: Any) -> OpencuaEnv:
    """
    Construct an OpencuaEnv. Ground truth is pulled from sample.label.
    """
    ground_truth = _extract_ground_truth(sample)
    max_turns = args.max_turns
    if max_turns is None:
        raise ValueError("max_turns must be set via --custom-config-path in the custom config file.")
    if ground_truth is None:
        logger.warning("Ground truth answer missing; calc_score tool will always return 0.")
    return OpencuaEnv(ground_truth=ground_truth, max_turns=max_turns)
examples/opencua_vlm_multi_turn/opencua_click_reward.png (binary file, not shown)

examples/opencua_vlm_multi_turn/opencua_vlm_multi_turn_config.yaml
max_turns: 1
rollout_interaction_env_path: examples.opencua_vlm_multi_turn.env_opencua