Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ docs/superpowers/
docs/plans/
docs/archive/

qa-results/

plan/

# Frontend
Expand Down
21 changes: 21 additions & 0 deletions docs/agent-harness/architecture-invariants.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,27 @@ Transport code may call runtime services and schemas. Runtime code should not im
FastAPI route modules, or test-only helpers. Configuration/package-root modules must not pull in
heavy runtime providers such as DSPy, MLflow, PostHog, or Daytona at import time.

## Async Execution Boundary

The sandbox interpreters (Daytona, Modal) expose a synchronous, blocking `execute(...)` that
performs a network round-trip per code iteration. `dspy.RLM.aforward` only awaits the LM predictor
calls — it still runs sandbox code through the **synchronous** `repl.execute(...)` (verified in
dspy 3.2.1). Therefore the heavy RLM turn is driven sync-in-a-thread via
`asyncio.to_thread(self.agent, ...)` in `runtime/agent/runtime.py`, which offloads both the LM
calls and the blocking sandbox I/O to a worker thread and keeps the event loop free.

Do not replace this `asyncio.to_thread` wrapping with a direct `await agent.acall(...)`/`aforward`
on the RLM heavy path while the interpreter's `execute` stays synchronous — doing so would block the
event loop on every code-execution iteration and regress server concurrency. The native chat
streaming path is the exception: it interleaves per-token streaming through `async_planner_step`
(which uses `acall` on the planner predictor only, not sandbox execution).

MCP-backed ReAct tools are the other async exception. Tools converted with
`dspy.Tool.from_mcp_tool(session, tool)` are bound to a live MCP `ClientSession` and must be invoked
through an async ReAct path (`acall`) while that session remains open. Keep MCP tools out of sync
ReAct calls, close the provider when the runtime shuts down, and rebuild the agent from base tools
plus the current MCP attachment when servers are reattached.

## Frontend Boundaries

Keep shared UI primitives reusable:
Expand Down
10 changes: 10 additions & 0 deletions docs/how-to-guides/dspy-integration.md
Original file line number Diff line number Diff line change
Expand Up @@ -468,6 +468,16 @@ When MLflow is enabled, RLM execution automatically captures:
- Reasoning trajectories
- Timing and token usage

For variable-mode `dspy.RLM` runs, Fleet records each REPL/code trajectory step
as a child MLflow `TOOL` span named `repl_execute`, with bounded inputs and
outputs. This keeps the MLflow trace tree aligned with the compact trace rows
rendered in the chat surface.

Fleet also records an `rlm_available_tools` `LLM` span that advertises the
`repl_execute` schema through `mlflow.chat.tools`. MLflow tool-call judges use
that schema as the available-tool set and then evaluate the concrete
`repl_execute` `TOOL` spans as the calls that actually ran.

### Optimize with MIPROv2

Use collected traces for DSPy optimization:
Expand Down
26 changes: 26 additions & 0 deletions docs/how-to-guides/mlflow-workflows.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,21 @@ Inspect active scorers before debugging unexpected assessment warnings:
uv run python scripts/mlflow_cli.py scorers list
```

Stop a stale scheduled scorer without deleting its registration:

```bash
# from repo root
uv run python scripts/mlflow_cli.py scorers stop --name "Trace Judge"
```

Restart a stopped scorer only after confirming its judge model and trace inputs
are correct:

```bash
# from repo root
uv run python scripts/mlflow_cli.py scorers start --name "Trace Judge" --sample-rate 1.0
```

Remove a stale scorer only as an explicit maintenance action:

```bash
Expand Down Expand Up @@ -127,6 +142,17 @@ uv run fleet web
```

As you use the app, MLflow traces are recorded in the configured experiment.
For RLM document-analysis and variable-mode runs, Fleet also materializes
trajectory code execution as MLflow `TOOL` spans named `repl_execute`. Those
spans include bounded `mlflow.spanInputs` / `mlflow.spanOutputs` payloads so the
MLflow trace tree, external scorers, and the chat transcript all describe the
same REPL actions.

Fleet also emits a compact `rlm_available_tools` `LLM` span with
`mlflow.chat.tools` metadata for the RLM REPL. MLflow's built-in tool-call
judges read available tool schemas from `LLM` / `CHAT_MODEL` spans and read
actual calls from `TOOL` spans; keeping both in the trace prevents judges from
falling back to model-based tool extraction.

## 4. Record Human Feedback and Ground Truth

Expand Down
15 changes: 13 additions & 2 deletions docs/reference/adr/001-rlm-runtime-architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,10 +27,21 @@ The architecture consists of these layers:
The primary runtime (`src/fleet_rlm/runtime/agent/runtime.py`) owns session state, tool binding, streaming, and persistence. Its default agent module (`src/fleet_rlm/runtime/modules/escalating.py`) extends `dspy.Module` to provide:

- **Stateful conversation**: `dspy.History` for persistent chat memory
- **Lightweight-to-heavy escalation**: `dspy.ChainOfThought` for simple turns, escalating to the Daytona-backed RLM path when needed
- **Tool orchestration**: Dynamic tool registration and dispatch
- **Lightweight-to-heavy escalation**: `dspy.ChainOfThought` for simple turns; the
`[TOOLS NEEDED]` sentinel routes to a real `dspy.ReAct` tool loop (`FleetAgent`), while
forced `rlm`/`rlm_only` modes and auto-detected URL-document analysis route to the
Daytona-backed `dspy.RLM` heavy path
- **Tool orchestration**: Dynamic tool registration and dispatch (including optional
DSPy-native MCP tools discovered from `FLEET_RLM_MCP_SERVERS`)
- **Recursive delegation**: `runtime/tools/rlm_delegate.py` and `integrations/daytona/isolation.py` build bounded child RLM runs

MCP tools are opt-in and session-backed. `AgentRuntime.attach_mcp_tools(...)` connects the
configured MCP servers, converts discovered tools with `dspy.Tool.from_mcp_tool(...)`, and rebuilds
the agent from the stable base tool set plus the current MCP attachment. Reattaching MCP servers
replaces the previous MCP tools and closes their provider; runtime shutdown closes any remaining
MCP sessions. Because these tools are async, the sentinel ReAct route is driven through an async
ReAct call, while forced/url RLM routes remain sync-in-thread for Daytona sandbox execution.

### 2. Signature-Based Contracts

Agent behavior is defined through DSPy signatures
Expand Down
31 changes: 31 additions & 0 deletions docs/reference/frontend-backend-integration.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,26 @@ The frontend keeps the following runtime controls aligned with backend requests:
- `context_paths`
- `batch_concurrency`

When `execution_mode` is `auto`, prompts that combine a public HTTP(S) URL
with documentation-analysis intent (`analyze`, `summarize`, `read`, `docs`, or
`documentation`) route directly to the Daytona-backed RLM document path. The
backend fetches the document through the redirect-validating document helpers
and passes `source_url`, `document_text`, and `source_metadata` as separate
variable-mode `dspy.RLM` inputs. That keeps large documentation bodies in REPL
variables instead of folding them into the prompt text.
`execution_mode="rlm_only"` still forces RLM execution, while
`execution_mode="tools_only"` bypasses the automatic URL-to-RLM route.

Fleet's RLM prompt envelope follows the Fast-RLM usage pattern for large
variable-mode tasks: repeat the task at the top and bottom, keep bulk data in
REPL variables, make available tools ordinary Python callables, and keep
intermediate printed output bounded. The server runtime settings feed the chat
agent's RLM wrappers directly:

- `rlm_max_iterations` -> `dspy.RLM(max_iterations=...)`
- `rlm_max_llm_calls` -> `dspy.RLM(max_llm_calls=...)`
- `agent_max_output_chars` -> `dspy.RLM(max_output_chars=...)`

The backend enriches frames with runtime context. The frontend treats these keys
as stable when present:

Expand All @@ -138,6 +158,11 @@ as stable when present:
- `sandbox_id`
- `workspace_path`
- `sandbox_transition`
- `selected_skills`
- `routing_decision`
- `source_url`
- `trajectory_index`
- `rlm_limits`

### Transcript Stream

Expand All @@ -148,9 +173,15 @@ The frontend reduces frames into:
- user and assistant messages
- reasoning and trajectory rows
- tool and sandbox cards
- selected-skill and routing status rows
- HITL / clarification cards
- summary rows and warnings

RLM trajectories that include `{reasoning, code, output}` are normalized into
`execution_step` frames with `step.type="repl"`. The transcript renders these
as compact expandable sandbox rows; large code/output payloads stay summarized
in chat while the workbench receives the structured step payload.

The adapter stack is:

1. `ws-frame-parser.ts` normalizes raw websocket frames.
Expand Down
4 changes: 2 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ dependencies = [
"dspy[optuna]==3.2.1",

# Core orchestration and CLI/runtime ergonomics.
"daytona>=0.181.0,<1",
"daytona>=0.184.0,<1",
"hydra-core>=1.3,<2",
"prompt-toolkit>=3.0.50,<4",
"rich>=14.3.3,<15",
Expand All @@ -56,7 +56,7 @@ dependencies = [
"tomli>=2.0.0; python_version < '3.11'",

# Web API surface.
"fastapi[standard]==0.136.1",
"fastapi[standard]==0.136.3",
"joserfc>=1.0.1",
"uvicorn>=0.47.0,<1",

Expand Down
29 changes: 29 additions & 0 deletions scripts/capture_phase0_baseline.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#!/bin/bash
# Phase 0 baseline capture script
# Run this to capture golden payloads and openapi.yaml baseline before refactoring

set -e

echo "=== Phase 0: Capturing golden payloads and openapi.yaml baseline ==="

# Clean up old golden payloads and recreate the directory
rm -rf tests/contracts/golden_payloads
mkdir -p tests/contracts/golden_payloads

# Capture openapi.yaml baseline
echo "Capturing openapi.yaml baseline..."
cp openapi.yaml tests/contracts/golden_payloads/openapi_baseline.yaml

# Capture frontend generated client baseline
echo "Capturing frontend generated client baseline..."
cp src/frontend/src/lib/rlm-api/generated/openapi.ts tests/contracts/golden_payloads/openapi_client_baseline.ts

# Run golden payload capture tests
echo "Running golden payload capture tests..."
uv run pytest tests/contracts/test_golden_payloads.py::test_capture_chat_websocket_golden_payloads -v
uv run pytest tests/contracts/test_golden_payloads.py::test_capture_passive_events_websocket_golden_payloads -v

echo "=== Phase 0 baseline capture complete ==="
echo "Golden payloads saved to: tests/contracts/golden_payloads/"
echo "OpenAPI baseline saved to: tests/contracts/golden_payloads/openapi_baseline.yaml"
echo "Client baseline saved to: tests/contracts/golden_payloads/openapi_client_baseline.ts"
2 changes: 1 addition & 1 deletion scripts/live_daytona_verify.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
# Ensure repo root is on path for imports
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from fleet_rlm.api.runtime_services.chat_persistence import (
from fleet_rlm.api.runtime_services.session_manifest import (
ensure_session_volume_layout,
load_manifest_from_volume,
)
Expand Down
58 changes: 58 additions & 0 deletions scripts/mlflow_cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,52 @@ def _delete_scorer(delete_scorer: Any, *, name: str, experiment_id: str | None,
delete_scorer(name)


def _get_scorer(mlflow: Any, *, name: str, experiment_id: str | None, version: str | None = None) -> Any:
get_scorer = getattr(getattr(mlflow, "genai", None), "get_scorer", None)
if not callable(get_scorer):
raise RuntimeError("mlflow.genai.get_scorer is not available in this MLflow version.")

parameters = inspect.signature(get_scorer).parameters
kwargs: dict[str, Any] = {"name": name}
if "experiment_id" in parameters:
kwargs["experiment_id"] = experiment_id
if "version" in parameters and version:
kwargs["version"] = int(version) if str(version).isdigit() else version
return get_scorer(**kwargs)


def do_scorers_stop(args: argparse.Namespace) -> int:
mlflow, config, active_experiment_id = _configure_mlflow_tracking()
experiment_id = args.experiment_id or active_experiment_id
scorer = _get_scorer(mlflow, name=args.name, experiment_id=experiment_id)
stop_scorer = getattr(scorer, "stop", None)
if not callable(stop_scorer):
raise RuntimeError("This MLflow scorer does not expose stop().")
stop_scorer()
print(f"stopped_scorer={args.name}")
print(f"experiment={config.experiment}")
print(f"experiment_id={experiment_id or ''}")
return 0


def do_scorers_start(args: argparse.Namespace) -> int:
mlflow, config, active_experiment_id = _configure_mlflow_tracking()
experiment_id = args.experiment_id or active_experiment_id
scorer = _get_scorer(mlflow, name=args.name, experiment_id=experiment_id)
start_scorer = getattr(scorer, "start", None)
if not callable(start_scorer):
raise RuntimeError("This MLflow scorer does not expose start().")

from mlflow.genai.scorers import ScorerSamplingConfig

start_scorer(sampling_config=ScorerSamplingConfig(sample_rate=args.sample_rate, filter_string=args.filter_string))
print(f"started_scorer={args.name}")
print(f"sample_rate={args.sample_rate}")
print(f"experiment={config.experiment}")
print(f"experiment_id={experiment_id or ''}")
return 0


def do_scorers_delete(args: argparse.Namespace) -> int:
if not args.yes:
print("Refusing to delete scorer without --yes.")
Expand Down Expand Up @@ -199,6 +245,18 @@ def main() -> int:
psl.add_argument("--experiment-id", default=None)
psl.set_defaults(func=do_scorers_list)

pss = scorer_subparsers.add_parser("stop", help="Stop a persisted scorer schedule without deleting it")
pss.add_argument("--name", required=True)
pss.add_argument("--experiment-id", default=None)
pss.set_defaults(func=do_scorers_stop)

psr = scorer_subparsers.add_parser("start", help="Start or resume a persisted scorer schedule")
psr.add_argument("--name", required=True)
psr.add_argument("--experiment-id", default=None)
psr.add_argument("--sample-rate", type=float, default=1.0)
psr.add_argument("--filter-string", default=None)
psr.set_defaults(func=do_scorers_start)

psd = scorer_subparsers.add_parser("delete", help="Delete a persisted scorer by name")
psd.add_argument("--name", required=True)
psd.add_argument("--experiment-id", default=None)
Expand Down
16 changes: 11 additions & 5 deletions scripts/validate_rlm_e2e_trace.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,9 @@ async def _collect_chat_until_terminal(
if payload.get("type") == "error":
raise RuntimeError(f"Chat websocket error: {payload}")

if payload.get("type") == "execution_completed":
return events, payload

if payload.get("type") != "event":
continue
kind = payload.get("data", {}).get("kind")
Expand Down Expand Up @@ -331,8 +334,8 @@ async def _run_validation(args: argparse.Namespace) -> ValidationResult:
chat_ws_url = _make_ws_url(args.server_url, "/api/v1/ws/execution")
execution_ws_url = _make_ws_url(
args.server_url,
"/api/v1/ws/execution",
query=(f"workspace_id={args.workspace_id}&user_id={args.user_id}&session_id={session_id}"),
"/api/v1/ws/execution/events",
query=f"session_id={session_id}",
)

async with websockets.connect(
Expand Down Expand Up @@ -394,9 +397,12 @@ async def _run_validation(args: argparse.Namespace) -> ValidationResult:
if event.get("session_id") != session_id:
raise RuntimeError("Execution event session_id mismatch.")

terminal_kind = terminal_chat_payload.get("data", {}).get("kind")
if terminal_kind != "final":
raise RuntimeError(f"Terminal chat event kind is {terminal_kind!r}; expected 'final'.")
terminal_kind = terminal_chat_payload.get("data", {}).get("kind") or terminal_chat_payload.get("type")
if terminal_kind not in {"final", "execution_completed"}:
raise RuntimeError(
f"Terminal chat event kind is {terminal_kind!r}; expected 'final' or "
"'execution_completed'."
)

await _persist_artifact_via_command(
chat_ws,
Expand Down
28 changes: 28 additions & 0 deletions scripts/verify_phase0_regression.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
#!/bin/bash
# Phase 0 regression verification script
# Run this after refactoring to verify against golden payload baseline

set -e

echo "=== Phase 0: Verifying regression against baseline ==="

# Run regression tests
echo "Running regression tests against golden payloads..."
uv run pytest tests/contracts/test_golden_payloads.py::test_regression_chat_websocket_events -v
uv run pytest tests/contracts/test_golden_payloads.py::test_regression_passive_events_websocket_events -v

# Compare openapi.yaml
echo "Comparing openapi.yaml against baseline..."
if ! diff -u tests/contracts/golden_payloads/openapi_baseline.yaml openapi.yaml; then
echo "WARNING: openapi.yaml has changed from baseline"
echo "Review the diff above. If changes are intentional, update the baseline."
fi

# Compare generated client
echo "Comparing generated client against baseline..."
if ! diff -u tests/contracts/golden_payloads/openapi_client_baseline.ts src/frontend/src/lib/rlm-api/generated/openapi.ts; then
echo "WARNING: Generated client has changed from baseline"
echo "Review the diff above. If changes are intentional, update the baseline."
fi

echo "=== Phase 0 regression verification complete ==="
Loading
Loading