Qredence · Zochory · Jun 7, 2026 · Jun 2, 2026 · Jun 4, 2026 · Jun 6, 2026
diff --git a/.gitignore b/.gitignore
@@ -54,6 +54,8 @@ docs/superpowers/
 docs/plans/
 docs/archive/
 
+qa-results/
+
 plan/
 
 # Frontend

diff --git a/docs/agent-harness/architecture-invariants.md b/docs/agent-harness/architecture-invariants.md
@@ -21,6 +21,27 @@ Transport code may call runtime services and schemas. Runtime code should not im
 FastAPI route modules, or test-only helpers. Configuration/package-root modules must not pull in
 heavy runtime providers such as DSPy, MLflow, PostHog, or Daytona at import time.
 
+## Async Execution Boundary
+
+The sandbox interpreters (Daytona, Modal) expose a synchronous, blocking `execute(...)` that
+performs a network round-trip per code iteration. `dspy.RLM.aforward` only awaits the LM predictor
+calls — it still runs sandbox code through the **synchronous** `repl.execute(...)` (verified in
+dspy 3.2.1). Therefore the heavy RLM turn is driven sync-in-a-thread via
+`asyncio.to_thread(self.agent, ...)` in `runtime/agent/runtime.py`, which offloads both the LM
+calls and the blocking sandbox I/O to a worker thread and keeps the event loop free.
+
+Do not replace this `asyncio.to_thread` wrapping with a direct `await agent.acall(...)`/`aforward`
+on the RLM heavy path while the interpreter's `execute` stays synchronous — doing so would block the
+event loop on every code-execution iteration and regress server concurrency. The native chat
+streaming path is the exception: it interleaves per-token streaming through `async_planner_step`
+(which uses `acall` on the planner predictor only, not sandbox execution).
+
+MCP-backed ReAct tools are the other async exception. Tools converted with
+`dspy.Tool.from_mcp_tool(session, tool)` are bound to a live MCP `ClientSession` and must be invoked
+through an async ReAct path (`acall`) while that session remains open. Keep MCP tools out of sync
+ReAct calls, close the provider when the runtime shuts down, and rebuild the agent from base tools
+plus the current MCP attachment when servers are reattached.
+
 ## Frontend Boundaries
 
 Keep shared UI primitives reusable:

diff --git a/docs/how-to-guides/dspy-integration.md b/docs/how-to-guides/dspy-integration.md
@@ -468,6 +468,16 @@ When MLflow is enabled, RLM execution automatically captures:
 - Reasoning trajectories
 - Timing and token usage
 
+For variable-mode `dspy.RLM` runs, Fleet records each REPL/code trajectory step
+as a child MLflow `TOOL` span named `repl_execute`, with bounded inputs and
+outputs. This keeps the MLflow trace tree aligned with the compact trace rows
+rendered in the chat surface.
+
+Fleet also records an `rlm_available_tools` `LLM` span that advertises the
+`repl_execute` schema through `mlflow.chat.tools`. MLflow tool-call judges use
+that schema as the available-tool set and then evaluate the concrete
+`repl_execute` `TOOL` spans as the calls that actually ran.
+
 ### Optimize with MIPROv2
 
 Use collected traces for DSPy optimization:

diff --git a/docs/how-to-guides/mlflow-workflows.md b/docs/how-to-guides/mlflow-workflows.md
@@ -99,6 +99,21 @@ Inspect active scorers before debugging unexpected assessment warnings:
 uv run python scripts/mlflow_cli.py scorers list
 ```
 
+Stop a stale scheduled scorer without deleting its registration:
+
+```bash
+# from repo root
+uv run python scripts/mlflow_cli.py scorers stop --name "Trace Judge"
+```
+
+Restart a stopped scorer only after confirming its judge model and trace inputs
+are correct:
+
+```bash
+# from repo root
+uv run python scripts/mlflow_cli.py scorers start --name "Trace Judge" --sample-rate 1.0
+```
+
 Remove a stale scorer only as an explicit maintenance action:
 
 ```bash
@@ -127,6 +142,17 @@ uv run fleet web
 ```
 
 As you use the app, MLflow traces are recorded in the configured experiment.
+For RLM document-analysis and variable-mode runs, Fleet also materializes
+trajectory code execution as MLflow `TOOL` spans named `repl_execute`. Those
+spans include bounded `mlflow.spanInputs` / `mlflow.spanOutputs` payloads so the
+MLflow trace tree, external scorers, and the chat transcript all describe the
+same REPL actions.
+
+Fleet also emits a compact `rlm_available_tools` `LLM` span with
+`mlflow.chat.tools` metadata for the RLM REPL. MLflow's built-in tool-call
+judges read available tool schemas from `LLM` / `CHAT_MODEL` spans and read
+actual calls from `TOOL` spans; keeping both in the trace prevents judges from
+falling back to model-based tool extraction.
 
 ## 4. Record Human Feedback and Ground Truth
 

diff --git a/docs/reference/adr/001-rlm-runtime-architecture.md b/docs/reference/adr/001-rlm-runtime-architecture.md
@@ -27,10 +27,21 @@ The architecture consists of these layers:
 The primary runtime (`src/fleet_rlm/runtime/agent/runtime.py`) owns session state, tool binding, streaming, and persistence. Its default agent module (`src/fleet_rlm/runtime/modules/escalating.py`) extends `dspy.Module` to provide:
 
 - **Stateful conversation**: `dspy.History` for persistent chat memory
-- **Lightweight-to-heavy escalation**: `dspy.ChainOfThought` for simple turns, escalating to the Daytona-backed RLM path when needed
-- **Tool orchestration**: Dynamic tool registration and dispatch
+- **Lightweight-to-heavy escalation**: `dspy.ChainOfThought` for simple turns; the
+  `[TOOLS NEEDED]` sentinel routes to a real `dspy.ReAct` tool loop (`FleetAgent`), while
+  forced `rlm`/`rlm_only` modes and auto-detected URL-document analysis route to the
+  Daytona-backed `dspy.RLM` heavy path
+- **Tool orchestration**: Dynamic tool registration and dispatch (including optional
+  DSPy-native MCP tools discovered from `FLEET_RLM_MCP_SERVERS`)
 - **Recursive delegation**: `runtime/tools/rlm_delegate.py` and `integrations/daytona/isolation.py` build bounded child RLM runs
 
+MCP tools are opt-in and session-backed. `AgentRuntime.attach_mcp_tools(...)` connects the
+configured MCP servers, converts discovered tools with `dspy.Tool.from_mcp_tool(...)`, and rebuilds
+the agent from the stable base tool set plus the current MCP attachment. Reattaching MCP servers
+replaces the previous MCP tools and closes their provider; runtime shutdown closes any remaining
+MCP sessions. Because these tools are async, the sentinel ReAct route is driven through an async
+ReAct call, while forced/url RLM routes remain sync-in-thread for Daytona sandbox execution.
+
 ### 2. Signature-Based Contracts
 
 Agent behavior is defined through DSPy signatures

diff --git a/docs/reference/frontend-backend-integration.md b/docs/reference/frontend-backend-integration.md
@@ -124,6 +124,26 @@ The frontend keeps the following runtime controls aligned with backend requests:
 - `context_paths`
 - `batch_concurrency`
 
+When `execution_mode` is `auto`, prompts that combine a public HTTP(S) URL
+with documentation-analysis intent (`analyze`, `summarize`, `read`, `docs`, or
+`documentation`) route directly to the Daytona-backed RLM document path. The
+backend fetches the document through the redirect-validating document helpers
+and passes `source_url`, `document_text`, and `source_metadata` as separate
+variable-mode `dspy.RLM` inputs. That keeps large documentation bodies in REPL
+variables instead of folding them into the prompt text.
+`execution_mode="rlm_only"` still forces RLM execution, while
+`execution_mode="tools_only"` bypasses the automatic URL-to-RLM route.
+
+Fleet's RLM prompt envelope follows the Fast-RLM usage pattern for large
+variable-mode tasks: repeat the task at the top and bottom, keep bulk data in
+REPL variables, make available tools ordinary Python callables, and keep
+intermediate printed output bounded. The server runtime settings feed the chat
+agent's RLM wrappers directly:
+
+- `rlm_max_iterations` -> `dspy.RLM(max_iterations=...)`
+- `rlm_max_llm_calls` -> `dspy.RLM(max_llm_calls=...)`
+- `agent_max_output_chars` -> `dspy.RLM(max_output_chars=...)`
+
 The backend enriches frames with runtime context. The frontend treats these keys
 as stable when present:
 
@@ -138,6 +158,11 @@ as stable when present:
 - `sandbox_id`
 - `workspace_path`
 - `sandbox_transition`
+- `selected_skills`
+- `routing_decision`
+- `source_url`
+- `trajectory_index`
+- `rlm_limits`
 
 ### Transcript Stream
 
@@ -148,9 +173,15 @@ The frontend reduces frames into:
 - user and assistant messages
 - reasoning and trajectory rows
 - tool and sandbox cards
+- selected-skill and routing status rows
 - HITL / clarification cards
 - summary rows and warnings
 
+RLM trajectories that include `{reasoning, code, output}` are normalized into
+`execution_step` frames with `step.type="repl"`. The transcript renders these
+as compact expandable sandbox rows; large code/output payloads stay summarized
+in chat while the workbench receives the structured step payload.
+
 The adapter stack is:
 
 1. `ws-frame-parser.ts` normalizes raw websocket frames.

diff --git a/pyproject.toml b/pyproject.toml
@@ -34,7 +34,7 @@ dependencies = [
     "dspy[optuna]==3.2.1",
 
     # Core orchestration and CLI/runtime ergonomics.
-    "daytona>=0.181.0,<1",
+    "daytona>=0.184.0,<1",
     "hydra-core>=1.3,<2",
     "prompt-toolkit>=3.0.50,<4",
     "rich>=14.3.3,<15",
@@ -56,7 +56,7 @@ dependencies = [
     "tomli>=2.0.0; python_version < '3.11'",
 
     # Web API surface.
-    "fastapi[standard]==0.136.1",
+    "fastapi[standard]==0.136.3",
     "joserfc>=1.0.1",
     "uvicorn>=0.47.0,<1",
 

diff --git a/scripts/capture_phase0_baseline.sh b/scripts/capture_phase0_baseline.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# Phase 0 baseline capture script
+# Run this to capture golden payloads and openapi.yaml baseline before refactoring
+
+set -e
+
+echo "=== Phase 0: Capturing golden payloads and openapi.yaml baseline ==="
+
+# Clean up old golden payloads and recreate the directory
+rm -rf tests/contracts/golden_payloads
+mkdir -p tests/contracts/golden_payloads
+
+# Capture openapi.yaml baseline
+echo "Capturing openapi.yaml baseline..."
+cp openapi.yaml tests/contracts/golden_payloads/openapi_baseline.yaml
+
+# Capture frontend generated client baseline
+echo "Capturing frontend generated client baseline..."
+cp src/frontend/src/lib/rlm-api/generated/openapi.ts tests/contracts/golden_payloads/openapi_client_baseline.ts
+
+# Run golden payload capture tests
+echo "Running golden payload capture tests..."
+uv run pytest tests/contracts/test_golden_payloads.py::test_capture_chat_websocket_golden_payloads -v
+uv run pytest tests/contracts/test_golden_payloads.py::test_capture_passive_events_websocket_golden_payloads -v
+
+echo "=== Phase 0 baseline capture complete ==="
+echo "Golden payloads saved to: tests/contracts/golden_payloads/"
+echo "OpenAPI baseline saved to: tests/contracts/golden_payloads/openapi_baseline.yaml"
+echo "Client baseline saved to: tests/contracts/golden_payloads/openapi_client_baseline.ts"
diff --git a/scripts/live_daytona_verify.py b/scripts/live_daytona_verify.py
@@ -22,7 +22,7 @@
 # Ensure repo root is on path for imports
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
-from fleet_rlm.api.runtime_services.chat_persistence import (
+from fleet_rlm.api.runtime_services.session_manifest import (
     ensure_session_volume_layout,
     load_manifest_from_volume,
 )

diff --git a/scripts/mlflow_cli.py b/scripts/mlflow_cli.py
@@ -129,6 +129,52 @@ def _delete_scorer(delete_scorer: Any, *, name: str, experiment_id: str | None,
     delete_scorer(name)
 
 
+def _get_scorer(mlflow: Any, *, name: str, experiment_id: str | None, version: str | None = None) -> Any:
+    get_scorer = getattr(getattr(mlflow, "genai", None), "get_scorer", None)
+    if not callable(get_scorer):
+        raise RuntimeError("mlflow.genai.get_scorer is not available in this MLflow version.")
+
+    parameters = inspect.signature(get_scorer).parameters
+    kwargs: dict[str, Any] = {"name": name}
+    if "experiment_id" in parameters:
+        kwargs["experiment_id"] = experiment_id
+    if "version" in parameters and version:
+        kwargs["version"] = int(version) if str(version).isdigit() else version
+    return get_scorer(**kwargs)
+
+
+def do_scorers_stop(args: argparse.Namespace) -> int:
+    mlflow, config, active_experiment_id = _configure_mlflow_tracking()
+    experiment_id = args.experiment_id or active_experiment_id
+    scorer = _get_scorer(mlflow, name=args.name, experiment_id=experiment_id)
+    stop_scorer = getattr(scorer, "stop", None)
+    if not callable(stop_scorer):
+        raise RuntimeError("This MLflow scorer does not expose stop().")
+    stop_scorer()
+    print(f"stopped_scorer={args.name}")
+    print(f"experiment={config.experiment}")
+    print(f"experiment_id={experiment_id or ''}")
+    return 0
+
+
+def do_scorers_start(args: argparse.Namespace) -> int:
+    mlflow, config, active_experiment_id = _configure_mlflow_tracking()
+    experiment_id = args.experiment_id or active_experiment_id
+    scorer = _get_scorer(mlflow, name=args.name, experiment_id=experiment_id)
+    start_scorer = getattr(scorer, "start", None)
+    if not callable(start_scorer):
+        raise RuntimeError("This MLflow scorer does not expose start().")
+
+    from mlflow.genai.scorers import ScorerSamplingConfig
+
+    start_scorer(sampling_config=ScorerSamplingConfig(sample_rate=args.sample_rate, filter_string=args.filter_string))
+    print(f"started_scorer={args.name}")
+    print(f"sample_rate={args.sample_rate}")
+    print(f"experiment={config.experiment}")
+    print(f"experiment_id={experiment_id or ''}")
+    return 0
+
+
 def do_scorers_delete(args: argparse.Namespace) -> int:
     if not args.yes:
         print("Refusing to delete scorer without --yes.")
@@ -199,6 +245,18 @@ def main() -> int:
     psl.add_argument("--experiment-id", default=None)
     psl.set_defaults(func=do_scorers_list)
 
+    pss = scorer_subparsers.add_parser("stop", help="Stop a persisted scorer schedule without deleting it")
+    pss.add_argument("--name", required=True)
+    pss.add_argument("--experiment-id", default=None)
+    pss.set_defaults(func=do_scorers_stop)
+
+    psr = scorer_subparsers.add_parser("start", help="Start or resume a persisted scorer schedule")
+    psr.add_argument("--name", required=True)
+    psr.add_argument("--experiment-id", default=None)
+    psr.add_argument("--sample-rate", type=float, default=1.0)
+    psr.add_argument("--filter-string", default=None)
+    psr.set_defaults(func=do_scorers_start)
+
     psd = scorer_subparsers.add_parser("delete", help="Delete a persisted scorer by name")
     psd.add_argument("--name", required=True)
     psd.add_argument("--experiment-id", default=None)

diff --git a/scripts/validate_rlm_e2e_trace.py b/scripts/validate_rlm_e2e_trace.py
@@ -125,6 +125,9 @@ async def _collect_chat_until_terminal(
         if payload.get("type") == "error":
             raise RuntimeError(f"Chat websocket error: {payload}")
 
+        if payload.get("type") == "execution_completed":
+            return events, payload
+
         if payload.get("type") != "event":
             continue
         kind = payload.get("data", {}).get("kind")
@@ -331,8 +334,8 @@ async def _run_validation(args: argparse.Namespace) -> ValidationResult:
         chat_ws_url = _make_ws_url(args.server_url, "/api/v1/ws/execution")
         execution_ws_url = _make_ws_url(
             args.server_url,
-            "/api/v1/ws/execution",
-            query=(f"workspace_id={args.workspace_id}&user_id={args.user_id}&session_id={session_id}"),
+            "/api/v1/ws/execution/events",
+            query=f"session_id={session_id}",
         )
 
         async with websockets.connect(
@@ -394,9 +397,12 @@ async def _run_validation(args: argparse.Namespace) -> ValidationResult:
                     if event.get("session_id") != session_id:
                         raise RuntimeError("Execution event session_id mismatch.")
 
-                terminal_kind = terminal_chat_payload.get("data", {}).get("kind")
-                if terminal_kind != "final":
-                    raise RuntimeError(f"Terminal chat event kind is {terminal_kind!r}; expected 'final'.")
+                terminal_kind = terminal_chat_payload.get("data", {}).get("kind") or terminal_chat_payload.get("type")
+                if terminal_kind not in {"final", "execution_completed"}:
+                    raise RuntimeError(
+                        f"Terminal chat event kind is {terminal_kind!r}; expected 'final' or "
+                        "'execution_completed'."
+                    )
 
                 await _persist_artifact_via_command(
                     chat_ws,

diff --git a/scripts/verify_phase0_regression.sh b/scripts/verify_phase0_regression.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# Phase 0 regression verification script
+# Run this after refactoring to verify against golden payload baseline
+
+set -e
+
+echo "=== Phase 0: Verifying regression against baseline ==="
+
+# Run regression tests
+echo "Running regression tests against golden payloads..."
+uv run pytest tests/contracts/test_golden_payloads.py::test_regression_chat_websocket_events -v
+uv run pytest tests/contracts/test_golden_payloads.py::test_regression_passive_events_websocket_events -v
+
+# Compare openapi.yaml
+echo "Comparing openapi.yaml against baseline..."
+if ! diff -u tests/contracts/golden_payloads/openapi_baseline.yaml openapi.yaml; then
+    echo "WARNING: openapi.yaml has changed from baseline"
+    echo "Review the diff above. If changes are intentional, update the baseline."
+fi
+
+# Compare generated client
+echo "Comparing generated client against baseline..."
+if ! diff -u tests/contracts/golden_payloads/openapi_client_baseline.ts src/frontend/src/lib/rlm-api/generated/openapi.ts; then
+    echo "WARNING: Generated client has changed from baseline"
+    echo "Review the diff above. If changes are intentional, update the baseline."
+fi
+
+echo "=== Phase 0 regression verification complete ==="