Skip to content

Refactor RLM runtime, event streaming, and frontend chat adapters#271

Merged
Zochory merged 7 commits into
mainfrom
improvement/rlm
Jun 7, 2026
Merged

Refactor RLM runtime, event streaming, and frontend chat adapters#271
Zochory merged 7 commits into
mainfrom
improvement/rlm

Conversation

@Zochory
Copy link
Copy Markdown
Member

@Zochory Zochory commented Jun 6, 2026

Summary

  • Reworked the Daytona-backed RLM runtime and websocket event pipeline to make chat, tool execution, and session lifecycle handling more explicit and testable.
  • Updated observability and MLflow integration paths, plus supporting scripts for phase regression, trace validation, and live Daytona verification.
  • Aligned frontend conversation adapters and workspace message rendering with the backend event model.
  • Refreshed architecture docs and agent guidance to match the new runtime boundaries and workflows.

Testing

  • Added and updated unit, contract, and integration coverage across API, runtime, Daytona, MLflow, and frontend event-adapter layers.
  • Validation status depends on the local lane; routine checks were covered where applicable, with broader trace and regression scripts added for end-to-end verification.

Summary by CodeRabbit

  • New Features

    • Browser page fetching for rendered content and link extraction
    • MCP (Model Context Protocol) server integration for extensible tool discovery
    • Improved RLM recursion tracking with bounded history snapshots
    • Model picker UI for selecting planner and delegate models
    • Enhanced document bundling with auxiliary index files
  • Bug Fixes

    • Fixed stale sandbox slot reconciliation during concurrent execution
    • Improved broker startup failure handling with cooldown and caching
    • Better streaming event terminal detection and classification
  • Improvements

    • Enhanced MLflow tracing with RLM trajectory spans and tool-call judges
    • Refined websocket event structure with typed runtime events
    • Better persistence and session state management

Zochory added 3 commits June 2, 2026 06:47
Add browser-aware document/runtime tooling, Daytona broker and sandbox slot recovery, richer MLflow trace context, and frontend rendering for trajectory/tool state.

Cover the review follow-ups around SSRF guarding, broker cooldown retry, browser skill selection, scoped trajectory de-duplication, and safe sandbox reconciliation.
feat(phase7): align RLM recursion with reference implementation and tighten persistence contracts

- Update daytona to 0.184.0 and fastapi to 0.136.3
- Add Phase 7 documentation covering history management, bounded child snapshots, token-aware compaction, explicit depth tracking, and fallback behavior
- Tighten PersistenceDep type from Any to PersistenceProtocol and raise RuntimeError when backend not initialized
- Remove ServerState compatibility proxy attributes (_SERVER_STATE_PROXY_ATTRS)
- Move RuntimeEventContext from api/events/
Copilot AI review requested due to automatic review settings June 6, 2026 20:59
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 6, 2026

Review Change Stack

Warning

Review limit reached

@Zochory, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 12 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 12784544-60c2-4558-aa37-d9231a20a267

📥 Commits

Reviewing files that changed from the base of the PR and between a28a8b8 and 860e3d1.

📒 Files selected for processing (26)
  • scripts/capture_phase0_baseline.sh
  • scripts/mlflow_cli.py
  • src/fleet_rlm/api/events/project_graph.py
  • src/fleet_rlm/api/routers/ws/turn_runner.py
  • src/fleet_rlm/api/runtime_services/run_lifecycle.py
  • src/fleet_rlm/api/runtime_services/session_persistence.py
  • src/fleet_rlm/integrations/observability/mlflow_context.py
  • src/fleet_rlm/integrations/observability/mlflow_runtime.py
  • src/fleet_rlm/runtime/agent/runtime.py
  • src/fleet_rlm/runtime/modules/factory.py
  • src/fleet_rlm/runtime/tools/binding.py
  • src/frontend/src/components/agent-elements/input/model-picker.tsx
  • src/frontend/src/features/workspace/conversation/transcript/__tests__/workspace-message-list.agent-elements.test.tsx
  • src/frontend/src/features/workspace/conversation/transcript/workspace-message-list.tsx
  • src/frontend/src/features/workspace/conversation/workspace-agent-input-bar.tsx
  • src/frontend/src/features/workspace/screen/workspace-screen.tsx
  • src/frontend/src/lib/workspace/__tests__/backend-chat-event-adapter.test.ts
  • src/frontend/src/lib/workspace/backend-chat-event-adapter.ts
  • src/frontend/src/lib/workspace/backend-chat-event-trajectory.ts
  • tests/unit/api/test_chat_persistence.py
  • tests/unit/api/test_runtime_diagnostics.py
  • tests/unit/api/test_ws_turn_runner.py
  • tests/unit/cli/test_mlflow_cli.py
  • tests/unit/integrations/test_daytona_sandbox_executor.py
  • tests/unit/integrations/test_mlflow_context.py
  • tests/unit/runtime/test_tools.py
📝 Walkthrough

Walkthrough

Refactors websocket streaming into modular services, introduces a typed RuntimeEvent schema and projections, overhauls lifecycle/persistence, adds Daytona browser tooling and slot reconciliation, extends MCP integration, updates frontend adapters/UI, and broadens tests and docs. Minor dependency/version and scripts updates.

Changes

WebSocket runtime, RLM, tools, and frontend integration

Layer / File(s) Summary
Runtime events schema and projections
src/fleet_rlm/runtime/events.py, src/fleet_rlm/api/events/*
Adds RuntimeEvent schema and chat/graph projectors; step builder consumes runtime events.
WebSocket pipeline: endpoint, loop, stream, turn
src/fleet_rlm/api/routers/ws/*
Replaces monolith with connection loop, stream events, turn runner, and summary helpers; endpoint rewired.
Runtime services: lifecycle, session manifest/persistence, diagnostics
src/fleet_rlm/api/runtime_services/*
Adds ExecutionLifecycleManager, splits manifest/persistence, improves diagnostics and error classification.
Daytona runtime and tooling
src/fleet_rlm/integrations/daytona/*
Semaphore reconciliation, reconciled sandbox acquisition, stdlib broker, executor failure latch, browser snapshots, VFS filtering.
Agent runtime and RLM modules
src/fleet_rlm/runtime/agent/*, .../modules/*
FleetAgent→ReAct, Runtime emits RuntimeEvents, unified escalation (COT/ReAct/RLM), variable-mode tool toggles, signatures expanded.
Tools: MCP and browser/doc fetch
src/fleet_rlm/runtime/tools/*
MCP tool provider/configs; browser_fetch_page tool; document fetch bundles same-origin indexes; registry/binding updates.
Observability: MLflow context/runtime/traces
src/fleet_rlm/integrations/observability/*
Trajectory TOOL spans, resilient import, persisted scorers surface/skip, correlation tags.
Execution helpers
src/fleet_rlm/runtime/execution/streaming_events.py
Focuses on final payload assembly and HITL helpers; uses shared schema version.
API dependencies and transports
src/fleet_rlm/api/dependencies.py, .../transport.py, .../artifacts.py
Typed persistence dep, removes ServerState proxying, stricter get_persistence, import fixes.
Frontend adapters and UI
src/frontend/src/**/*
ModelPicker and SubagentTool; adapters render canonical repl steps and routing notes; root document wrapper.
Workspace UI wiring
src/frontend/src/features/workspace/**/*
Active model context and settings hook; trajectory finalization dedup; HITL rollback; tests updated.
CLI/scripts/docs and minor configs
scripts/*, docs/*, .gitignore, pyproject.toml, ui/build.py
MLflow scorer start/stop CLI; validation scripts; docs expanded; ignore and versions updated.
Tests suite
tests/**/*
Golden payloads; API/runtime/service/Daytona/MCP/MLflow/frontend tests aligned with new architecture.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant WS_Endpoint
  participant ConnLoop
  participant TurnRunner
  participant AgentRuntime
  participant Lifecycle
  participant RepoDB
  Client->>WS_Endpoint: Connect / send message
  WS_Endpoint->>ConnLoop: _chat_message_loop(...)
  ConnLoop->>TurnRunner: run_streaming_turn(...)
  TurnRunner->>Lifecycle: initialize_turn_lifecycle()
  TurnRunner->>AgentRuntime: aiter_chat_turn_stream(...)
  AgentRuntime-->>TurnRunner: RuntimeEvent (tool/reasoning/status)
  TurnRunner->>Lifecycle: emit_step(event)
  Lifecycle->>RepoDB: append_step (async batch)
  TurnRunner-->>Client: WS frame (project_chat)
  AgentRuntime-->>TurnRunner: RuntimeEvent DONE/ERROR
  TurnRunner->>Lifecycle: complete_run(status, summary)
  Lifecycle->>RepoDB: update_run_status
  TurnRunner-->>Client: execution_completed
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • Qredence/fleet-rlm#268 — Prior Daytona persistence/runtime maintenance overlaps the new session_manifest/session_persistence/run_lifecycle split.
  • Qredence/fleet-rlm#269 — Extends the same skill-selection module that this PR augments with a browser-interaction skill.
  • Qredence/fleet-rlm#255 — Touches conversation adapters also modified here for canonical repl/routing rendering.

Poem

A rabbit taps the socket drum,
Events parade, in tidy strum—
ReAct or RLM, choose the way,
Tools assemble, browser plays.
MCP sings, MLflow writes,
Slots reconciled, green traffic lights—
Ship it swift, with carrot bytes! 🥕✨

Comment thread tests/unit/integrations/test_daytona_sandbox_executor.py Fixed
Comment thread src/fleet_rlm/api/runtime_services/run_lifecycle.py Fixed
Comment thread src/fleet_rlm/api/routers/ws/turn_runner.py Fixed
Comment thread tests/unit/api/test_chat_persistence.py
Comment thread tests/unit/runtime/test_tools.py
Comment thread tests/unit/runtime/test_tools.py Fixed
Comment thread tests/unit/runtime/test_tools.py Fixed
Comment thread src/fleet_rlm/integrations/observability/auto_assessment.py
Comment thread tests/unit/integrations/test_mlflow_context.py Fixed
Comment thread tests/unit/integrations/test_mlflow_context.py Fixed
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a Phase 0 safety net with golden-payload capture and regression verification, integrates DSPy-native MCP tools, adds browser automation via Playwright, refactors the WebSocket connection loop, and aligns RLM recursion and history management with the reference implementation. The review feedback highlights several critical bugs and improvements: a baseline deletion issue in the capture script, a missing _interpreter assignment in EscalatingFleetModule, TypeError risks when calling scorer methods in mlflow_cli.py, potential synchronous/asynchronous mismatches in session_persistence.py, robustness issues in the browser fetch sandbox code, and unhandled orphan <think> tags in mlflow_runtime.py.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread scripts/capture_phase0_baseline.sh Outdated
Comment thread src/fleet_rlm/runtime/modules/escalating.py
Comment thread scripts/mlflow_cli.py Outdated
Comment thread scripts/mlflow_cli.py Outdated
Comment thread src/fleet_rlm/api/runtime_services/session_persistence.py Outdated
Comment thread src/fleet_rlm/api/runtime_services/session_persistence.py Outdated
Comment thread src/fleet_rlm/runtime/tools/binding.py
Comment thread src/fleet_rlm/integrations/observability/mlflow_runtime.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/frontend/src/lib/workspace/__tests__/backend-chat-event-adapter.test.ts (1)

1-1302: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix frontend formatting before merge (vp check is failing in CI).

Please run vp check --fix (or the repo’s formatting lane) and commit the result so the frontend check passes.

As per coding guidelines, frontend TS/TSX changes must go through the formatting/lint workflow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/src/lib/workspace/__tests__/backend-chat-event-adapter.test.ts`
around lines 1 - 1302, Formatting check failed for frontend TypeScript files;
run the repo formatter and commit the fixes. Run the project's frontend
formatting/lint pipeline (e.g., vp check --fix or the repo's formatting lane),
reformat the changed test file(s) that reference applyWsFrameToMessages and
related tests, then commit the updated files so the vp check passes in CI.

Sources: Coding guidelines, Pipeline failures

src/fleet_rlm/api/runtime_services/chat_persistence.py (1)

131-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disallow persisting while the run is intentionally left alive.

If cancel_active_run=False but persist_on_disconnect stays True, this path leaves stream_task running and then immediately calls local_persist(... release_idle_session=True). That can snapshot stale session state and release the Daytona session underneath the background execution. Make this combination impossible, or force persist_on_disconnect=False whenever the run is allowed to continue.

Suggested guard
 async def handle_chat_disconnect(
     *,
     pending_receive_task: asyncio.Task[object] | None,
     stream_task: asyncio.Task[str | None] | None,
     cancel_flag: dict[str, bool],
     local_persist: Callable[..., Awaitable[None]],
     lifecycle: ExecutionLifecycleManager | None,
     cancel_active_run: bool = True,
     persist_on_disconnect: bool = True,
 ) -> None:
     """Cleanly stop the active websocket loop after a client disconnect."""
+    if not cancel_active_run and persist_on_disconnect:
+        raise ValueError(
+            "persist_on_disconnect must be False when background execution is left running"
+        )
     if cancel_active_run:
         cancel_flag["cancelled"] = True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/runtime_services/chat_persistence.py` around lines 131 -
149, When cancel_active_run is False we must not persist because stream_task
continues running; update the logic in the shutdown function (references:
cancel_active_run, persist_on_disconnect, stream_task, local_persist,
cancel_task, _log_background_disconnect_task_result) to prevent calling
local_persist when a run is intentionally left alive — either by forcing
persist_on_disconnect = False before the early/else branch where stream_task is
left running or by adding a guard that returns/sets persist_on_disconnect=False
whenever cancel_active_run is False, ensuring local_persist(...) with
release_idle_session=True is never invoked while stream_task may still be
active.
src/fleet_rlm/runtime/agent/runtime.py (1)

503-508: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the public stream on a single event type.

aiter_chat_turn_stream() still promises AsyncIterator[StreamEvent], but both the post-hoc path and the native tool/clarification path now yield RuntimeEvent. That already breaks type-checking in CI, and it also means callers can receive a mixed event object shape within one stream depending on which branch executes. Convert everything at this boundary to one event class before yielding, or update the signature and all downstream consumers in the same change.

Also applies to: 805-805, 822-931

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/agent/runtime.py` around lines 503 - 508, The post-hoc
path in _aiter_chat_turn_stream_posthoc is yielding RuntimeEvent objects while
aiter_chat_turn_stream promises AsyncIterator[StreamEvent], which mixes event
shapes; modify _aiter_chat_turn_stream_posthoc (and the other post-hoc/native
branches noted around the other ranges) to convert/normalize every RuntimeEvent
into the public StreamEvent shape before yielding (map fields, types, and any
nested payloads to StreamEvent), keeping the aiter_chat_turn_stream signature
unchanged; alternatively, if you prefer to expose RuntimeEvent, update
aiter_chat_turn_stream's return type and all downstream consumers in the same
change—pick one approach and apply consistently across
_aiter_chat_turn_stream_posthoc, the native tool/clarification branch, and the
other affected ranges so the stream always emits a single event class.

Source: Pipeline failures

🟠 Major comments (24)
src/fleet_rlm/integrations/observability/mlflow_runtime.py-515-523 (1)

515-523: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Orphan </think> cleanup is currently unreachable in callback flow.

_strip_think_tags() now handles orphan closings, but the callback only invokes it when a full <think>...</think> pair is detected. Orphan-only </think> payloads still bypass sanitization.

Proposed fix
-                if isinstance(content, str) and _THINK_TAG_RE.search(content):
+                if isinstance(content, str) and (
+                    _THINK_TAG_RE.search(content) or _ORPHAN_THINK_CLOSE_RE.search(content)
+                ):
                     message["content"] = _strip_think_tags(content)
@@
-            if isinstance(text, str) and _THINK_TAG_RE.search(text):
+            if isinstance(text, str) and (
+                _THINK_TAG_RE.search(text) or _ORPHAN_THINK_CLOSE_RE.search(text)
+            ):
                 choice["text"] = _strip_think_tags(text)

Also applies to: 555-560

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/observability/mlflow_runtime.py` around lines 515
- 523, The callback path currently only strips think tags when a full
<think>...</think> pair is found, so orphan-only payloads like "</think>" bypass
sanitization; update the callback(s) around the existing tag-handling logic (the
branches referenced around the regions that call _THINK_TAG_RE processing at
~515 and ~555) to invoke _strip_think_tags(text) before any further handling (or
at minimum when _ORPHAN_THINK_CLOSE_RE.search(text) is true), ensuring orphan
closing tags are removed; keep using the existing _strip_think_tags,
_THINK_TAG_RE and _ORPHAN_THINK_CLOSE_RE symbols so the single sanitization
routine is the canonical cleanup for all branches.
src/fleet_rlm/integrations/observability/mlflow_context.py-155-169 (1)

155-169: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix _coerce_trajectory_steps typing so CI typecheck passes.

Line 157 and Line 166 currently trigger mypy/pyright overload errors from dict(step, index=step.get("index", index)). Build normalized step dicts first, then set index explicitly.

Proposed fix
+def _normalize_trajectory_step(step: dict[Any, Any], index: int) -> dict[str, Any]:
+    normalized: dict[str, Any] = {str(key): value for key, value in step.items()}
+    if normalized.get("index") is None:
+        normalized["index"] = index
+    return normalized
+
+
 def _coerce_trajectory_steps(raw: Any) -> list[dict[str, Any]]:
     if isinstance(raw, list):
-        return [dict(step, index=step.get("index", index)) for index, step in enumerate(raw) if isinstance(step, dict)]
+        steps: list[dict[str, Any]] = []
+        for index, step in enumerate(raw):
+            if isinstance(step, dict):
+                steps.append(_normalize_trajectory_step(step, index))
+        return steps
 
     if not isinstance(raw, dict):
         return []
 
     for key in ("trajectory", "steps"):
         nested = raw.get(key)
         if isinstance(nested, list):
-            return [
-                dict(step, index=step.get("index", index))
-                for index, step in enumerate(nested)
-                if isinstance(step, dict)
-            ]
+            steps: list[dict[str, Any]] = []
+            for index, step in enumerate(nested):
+                if isinstance(step, dict):
+                    steps.append(_normalize_trajectory_step(step, index))
+            return steps
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/observability/mlflow_context.py` around lines 155
- 169, In _coerce_trajectory_steps, mypy/pyright complain about using dict(step,
index=step.get("index", index)); instead, first ensure each raw step is a dict
by calling dict(step) or copy into a new variable (e.g., normalized =
dict(step)), then set normalized["index"] = normalized.get("index", index") or
index, and return those normalized dicts; apply this change in both branches
where dict(step, index=...) is used so the function returns list[dict[str, Any]]
without type errors.

Source: Pipeline failures

src/fleet_rlm/api/routers/ws/connection_loop.py-411-420 (1)

411-420: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A received message can be dropped in the streaming branch.

When _handle_message_while_streaming(...) returns False, the unconditional continue still discards that message instead of deferring it to the idle path.

Proposed fix
-                    if await _handle_message_while_streaming(
+                    handled = await _handle_message_while_streaming(
                         websocket=self.websocket,
                         msg=msg,
                         agent=self.agent,
                         runtime=self.runtime,
                         session=self.session,
                         local_persist=self.local_persist,
-                    ):
+                    )
+                    if handled:
                         continue
+                    self.pending_message = msg
                     continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/connection_loop.py` around lines 411 - 420, The
streaming branch currently discards messages because it calls
_handle_message_while_streaming(...) and then always executes an unconditional
continue; change the control flow so you only continue the outer loop when
_handle_message_while_streaming(...) returns True (keep the existing continue
inside the True branch) and let execution fall through to the idle handling path
when it returns False so the same msg is processed by the idle logic; locate the
call to _handle_message_while_streaming in connection_loop.py and remove or
guard the extra unconditional continue so that only the True result causes the
loop to continue.
scripts/capture_phase0_baseline.sh-23-30 (1)

23-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Baseline artifacts are deleted before script completion.

Line 23 removes tests/contracts/golden_payloads after baselines are copied, so openapi_baseline.yaml and openapi_client_baseline.ts are not actually preserved even though Lines 28-30 report success.

Proposed fix
 # Run golden payload capture tests
 echo "Running golden payload capture tests..."
 # Temporarily remove golden payloads directory to trigger capture
 rm -rf tests/contracts/golden_payloads
 uv run pytest tests/contracts/test_golden_payloads.py::test_capture_chat_websocket_golden_payloads -v
 uv run pytest tests/contracts/test_golden_payloads.py::test_capture_passive_events_websocket_golden_payloads -v
+
+# Recreate directory and persist OpenAPI/client baselines after recapture
+mkdir -p tests/contracts/golden_payloads
+cp openapi.yaml tests/contracts/golden_payloads/openapi_baseline.yaml
+cp src/frontend/src/lib/rlm-api/generated/openapi.ts tests/contracts/golden_payloads/openapi_client_baseline.ts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/capture_phase0_baseline.sh` around lines 23 - 30, The script
currently executes "rm -rf tests/contracts/golden_payloads" after the baseline
capture steps, which deletes the artifacts that the later echo lines claim were
saved; remove or relocate that command so baselines are preserved: either delete
the "rm -rf tests/contracts/golden_payloads" line entirely or move it to run
before the pytest commands (so it clears old artifacts first), ensuring the
pytest invocations (uv run pytest ... test_capture_*_golden_payloads) produce
and retain openapi_baseline.yaml and openapi_client_baseline.ts referenced by
the echo messages.
scripts/verify_phase0_regression.sh-16-26 (1)

16-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regression diffs don’t fail the script.

When baselines drift, the script only warns and exits successfully. That makes this lane non-blocking even on real contract regressions.

Proposed fix
+failed=0
+
 # Compare openapi.yaml
 echo "Comparing openapi.yaml against baseline..."
 if ! diff -u tests/contracts/golden_payloads/openapi_baseline.yaml openapi.yaml; then
     echo "WARNING: openapi.yaml has changed from baseline"
     echo "Review the diff above. If changes are intentional, update the baseline."
+    failed=1
 fi
 
 # Compare generated client
 echo "Comparing generated client against baseline..."
 if ! diff -u tests/contracts/golden_payloads/openapi_client_baseline.ts src/frontend/src/lib/rlm-api/generated/openapi.ts; then
     echo "WARNING: Generated client has changed from baseline"
     echo "Review the diff above. If changes are intentional, update the baseline."
+    failed=1
 fi
 
+if [ "$failed" -ne 0 ]; then
+    exit 1
+fi
+
 echo "=== Phase 0 regression verification complete ==="
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/verify_phase0_regression.sh` around lines 16 - 26, The script
currently only warns when generated artifacts differ (the two diff checks in
verify_phase0_regression.sh) but does not fail the run; update both diff failure
branches that start with "if ! diff -u
tests/contracts/golden_payloads/openapi_baseline.yaml ..." and "if ! diff -u
tests/contracts/golden_payloads/openapi_client_baseline.ts ..." to cause a
non-zero exit (e.g., add an exit 1 in each then-branch or set a failing flag and
exit non-zero after checks) so the script fails CI on regression diffs.
src/fleet_rlm/api/routers/ws/stream_summary.py-66-78 (1)

66-78: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Namespace the runtime failure metadata consistently.

These four fields are emitted as runtime_degraded / runtime_failure_*, while the rest of this helper writes fleet_rlm.*. That splits one trace schema in two, so degraded/failure filters built against the fleet_rlm. namespace will miss these terminal cases.

Suggested fix
     for key in (
         "runtime_degraded",
         "runtime_failure_category",
         "runtime_failure_phase",
         "runtime_fallback_used",
     ):
         value = payload.get(key, runtime.get(key))
         if value in (None, "", False):
             if key in {"runtime_degraded", "runtime_fallback_used"} and value is False:
-                metadata[key] = False
+                metadata[f"fleet_rlm.{key}"] = False
             continue
-        metadata[key] = value
+        metadata[f"fleet_rlm.{key}"] = value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/stream_summary.py` around lines 66 - 78, The
runtime failure keys ("runtime_degraded", "runtime_failure_category",
"runtime_failure_phase", "runtime_fallback_used") must be stored under the
fleet_rlm namespace so traces remain consistent: when reading a value check both
the namespaced and legacy keys (e.g. payload.get("fleet_rlm.runtime_degraded")
or payload.get("runtime_degraded") then fallback to runtime.get(...)), and when
writing to metadata set the namespaced key
(metadata["fleet_rlm.runtime_degraded"], etc.). Preserve the special-case
handling for False on runtime_degraded and runtime_fallback_used but assign the
boolean to the namespaced key. Update the loop that iterates those keys and uses
payload, runtime and metadata accordingly.
src/fleet_rlm/api/routers/ws/stream_summary.py-247-273 (1)

247-273: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Backfill payload sources and attachments on the run_result path.

When run_result is present, this branch merges warnings and final_artifact but never copies top-level payload["sources"] or payload["attachments"]. The fallback branch does, so the completion shape changes depending on whether the runtime returned run_result, and citations/artifacts disappear on the common path.

Suggested fix
         normalized.setdefault("duration_ms", summary_payload.get("duration_ms"))
         normalized.setdefault("warnings", warnings)
+        normalized.setdefault("sources", list(payload.get("sources") or []))
+        normalized.setdefault("attachments", list(payload.get("attachments") or []))
         nested_summary = _as_record(normalized.get("summary"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/stream_summary.py` around lines 247 - 273, The
run_result branch never preserves top-level payload["sources"] and
payload["attachments"], causing citations/artifacts to vanish when runtime
returns run_result; in the block handling run_result (where normalized is
created and final_artifact is set) add code to backfill sources and attachments
into normalized (e.g., normalized.setdefault("sources", payload.get("sources"))
and normalized.setdefault("attachments", payload.get("attachments"))) so
existing run_result values are preserved but payload sources/attachments are
present when missing; keep this logic alongside the existing setting of
"final_artifact" and "warnings".
src/fleet_rlm/api/routers/ws/turn_runner.py-59-66 (1)

59-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep persisted run status aligned with the terminal summary.

A done event that requires human review is normalized to status="needs_human_review" in build_execution_completion_summary(), but _terminal_run_status() still returns RunStatus.COMPLETED. That gives the same run two terminal truths depending on whether consumers read the DB status or the completion payload.

Also applies to: 95-99

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/turn_runner.py` around lines 59 - 66,
_terminal_run_status currently maps a terminal "done" event to COMPLETED/FAILED
but doesn't account for the "needs_human_review" normalization done by
build_execution_completion_summary; update _terminal_run_status to inspect the
final payload (same way build_execution_completion_summary does) and return
RunStatus.NEEDS_HUMAN_REVIEW when the payload indicates human review is required
(e.g. payload.get("status") == "needs_human_review" or via the same helper used
to detect that state), and make the same change in the other terminal-status
mapping at the later block referenced (the similar logic around lines 95-99) so
DB run status and completion summary remain consistent.
scripts/validate_rlm_e2e_trace.py-128-129 (1)

128-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not treat execution_completed as a conversational terminal event.

execution_completed is the passive execution-stream terminal signal; accepting it as the chat terminal (Line 128 / Line 401) can let this harness pass even when /api/v1/ws/execution never emits a final event.

Suggested fix
-        if payload.get("type") == "execution_completed":
-            return events, payload
-
         if payload.get("type") != "event":
             continue
         kind = payload.get("data", {}).get("kind")
         if kind in {"final", "error", "cancelled"}:
             return events, payload
@@
-                terminal_kind = terminal_chat_payload.get("data", {}).get("kind") or terminal_chat_payload.get("type")
-                if terminal_kind not in {"final", "execution_completed"}:
+                terminal_kind = terminal_chat_payload.get("data", {}).get("kind") or terminal_chat_payload.get("type")
+                if terminal_kind != "final":
                     raise RuntimeError(
-                        f"Terminal chat event kind is {terminal_kind!r}; expected 'final' or "
-                        "'execution_completed'."
+                        f"Terminal chat event kind is {terminal_kind!r}; expected 'final'."
                     )

Based on learnings: /api/v1/ws/execution is conversational, while /api/v1/ws/execution/events is subscription-only/passive.

Also applies to: 400-405

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate_rlm_e2e_trace.py` around lines 128 - 129, Do not treat
payload.get("type") == "execution_completed" as a conversational terminal;
update the event handling in the code that currently returns on
payload.get("type") == "execution_completed" so it only returns/terminates on
the conversational "final" event (e.g., payload.get("type") == "final") and keep
"execution_completed" as a passive/subscription-only signal (handle but do not
use to short-circuit/accept the harness). Locate references to
payload.get("type") == "execution_completed" (including the occurrence around
the current check and the similar block later) and replace the return condition
to check for "final" instead, leaving any logging/collection of
"execution_completed" intact. Ensure the behavior for /api/v1/ws/execution
(conversational) relies on "final" while subscription/event-stream handling
still observes "execution_completed" without treating it as terminal.

Sources: Coding guidelines, Learnings

src/fleet_rlm/api/events/project_chat.py-23-29 (1)

23-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Map status/progress runtime events to execution_started, not generic execution_step.

This mapping currently downgrades non-terminal events to execution_step, but the frontend adapter only runs sandbox-progress projection and status-tone inference on execution_started. As a result, sandbox progress/status events can lose their canonical rendering path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/events/project_chat.py` around lines 23 - 29, The
_frame_kind function currently maps non-terminal RuntimeEventKind values to
"execution_step", which downgrades status/progress events; update _frame_kind to
map status/progress runtime events to "execution_started" instead of
"execution_step". Concretely, add (or use) a set like _STATUS_PROGRESS_KINDS
containing the appropriate RuntimeEventKind members (e.g., status/progress
kinds) and change the logic in _frame_kind to return "execution_started" when
event_kind is in _TURN_STARTED_KINDS OR in _STATUS_PROGRESS_KINDS, keep
returning "execution_completed" for _TERMINAL_KINDS, and only return
"execution_step" for true intermediate steps not in those two sets.
src/frontend/src/lib/workspace/backend-chat-event-adapter.ts-671-673 (1)

671-673: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

resolve_hitl failure detection is too narrow.

Only treating "error" as failure will misclassify statuses like "failed", "rejected", or "cancelled" as success and incorrectly apply resolution.

Suggested fix
-      const succeeded =
-        asOptionalText(result?.status)?.toLowerCase() !== "error";
+      const status = asOptionalText(result?.status)?.toLowerCase();
+      const succeeded =
+        status == null ||
+        !["error", "failed", "failure", "rejected", "cancelled"].includes(status);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/src/lib/workspace/backend-chat-event-adapter.ts` around lines
671 - 673, The current detection for resolve_hitl failures only treats
result?.status === "error" as a failure; update the succeeded computation in
backend-chat-event-adapter.ts (the succeeded variable that uses
asOptionalText(result?.status)) to normalize (trim().toLowerCase()) and treat a
broader set of failure statuses (e.g., "error","failed","rejected","cancelled")
as failures—or alternatively only treat an explicit success set (e.g.,
"ok","success","succeeded") as success—and use that check when applying
resolve_hitl logic so statuses like "failed" or "rejected" are not misclassified
as successful.
tests/contracts/test_golden_payloads.py-54-57 (1)

54-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

skipif predicates are inverted for capture vs regression modes.

Right now capture tests are skipped when no baseline exists, while regression tests run and fail on missing golden files. Flip the predicates so capture runs only when baseline is absent, and regression runs only when baseline exists.

💡 Proposed fix
 `@pytest.mark.skipif`(
-    not GOLDEN_PAYLOADS_DIR.exists(),
-    reason="Golden payloads directory exists - run once to capture baseline",
+    GOLDEN_PAYLOADS_DIR.exists(),
+    reason="Golden payloads already captured - skip baseline capture",
 )
@@
 `@pytest.mark.skipif`(
-    not GOLDEN_PAYLOADS_DIR.exists(),
-    reason="Golden payloads directory exists - run once to capture baseline",
+    GOLDEN_PAYLOADS_DIR.exists(),
+    reason="Golden payloads already captured - skip baseline capture",
 )
@@
 `@pytest.mark.skipif`(
-    GOLDEN_PAYLOADS_DIR.exists(),
-    reason="Golden payloads already captured - use for regression testing",
+    not GOLDEN_PAYLOADS_DIR.exists(),
+    reason="Golden payloads missing - run baseline capture first",
 )
@@
 `@pytest.mark.skipif`(
-    GOLDEN_PAYLOADS_DIR.exists(),
-    reason="Golden payloads already captured - use for regression testing",
+    not GOLDEN_PAYLOADS_DIR.exists(),
+    reason="Golden payloads missing - run baseline capture first",
 )

Also applies to: 88-91, 126-129, 158-161

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/contracts/test_golden_payloads.py` around lines 54 - 57, The skipif
predicates are inverted: for capture tests flip the condition to skip when
GOLDEN_PAYLOADS_DIR.exists() (so capture runs only when baseline is absent) and
update the reason text accordingly, and for regression tests flip their
condition to skip when not GOLDEN_PAYLOADS_DIR.exists() (so regression runs only
when baseline exists); update each pytest.mark.skipif(...) instance in this file
(the three other occurrences shown) to use the corrected boolean and matching
reason string.
src/fleet_rlm/api/runtime_services/run_lifecycle.py-214-224 (1)

214-224: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid potential deadlock when stopping the persistence worker.

Line 218 can block indefinitely if the worker already exited and the queue is full (no consumer left), which can hang complete_run().

💡 Proposed fix
     async def _stop_persist_worker(self) -> None:
         if self._persist_worker_task is None:
             return
-        if self._persist_queue is not None:
-            await self._persist_queue.put(None)
+        task = self._persist_worker_task
+        queue = self._persist_queue
+        if queue is not None and not task.done():
+            try:
+                queue.put_nowait(None)
+            except asyncio.QueueFull:
+                task.cancel()
         try:
-            await self._persist_worker_task
+            await task
         except asyncio.CancelledError:
             pass
         self._persist_worker_task = None
         self._persist_queue = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/runtime_services/run_lifecycle.py` around lines 214 - 224,
The stop routine _stop_persist_worker can deadlock on "await
self._persist_queue.put(None)" if the worker has already exited and the queue is
full; change the code to do a non-blocking enqueue of the sentinel (use
self._persist_queue.put_nowait(None)) and catch asyncio.QueueFull (or Exception)
to skip sending the sentinel when the queue is full, then proceed to await the
worker task and clear _persist_worker_task and _persist_queue as before.
src/fleet_rlm/api/runtime_services/session_persistence.py-157-173 (1)

157-173: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The local-store fallback is a silent no-op for the UUID-based backend shape described here.

This helper only writes when update_chat_session exposes external_session_id. The else branch explicitly skips the tenant_id/session_id signature called out in the comment for LocalStore / FleetRepository, so the "durable fallback" never persists anything for those implementations. Please thread the canonical session identifiers into this helper or add a persistence API that can update by external session id instead of silently dropping the write.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/runtime_services/session_persistence.py` around lines 157 -
173, The helper currently only calls update_chat_session when it accepts
external_session_id, leaving implementations like LocalStore.update_chat_session
/ FleetRepository (which expect tenant_id and session_id UUIDs) as a silent
no-op; change the helper to accept or derive the canonical tenant_id and
session_id and call update_fn(tenant_id=..., session_id=...,
metadata_json={...}) in the else branch (or add a persistence API on the
repository that accepts external_session_id and use that). Locate the code using
update_fn (the inspect.signature block) and either thread the
tenant_id/session_id into this helper from the caller or add/consume a new
persistence method that maps external_session_id => tenant_id/session_id, then
call update_fn with the UUID params and metadata_json to ensure the durable
fallback actually persists.
src/fleet_rlm/api/routers/ws/stream_events.py-135-137 (1)

135-137: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid unconditionally awaiting prepare hooks.

prepare is typed as Callable[[], Any], but it is always awaited. A synchronous hook will raise TypeError and terminate streaming.

Suggested fix
+import inspect
@@
-    if request.prepare is not None:
-        await request.prepare()
+    if request.prepare is not None:
+        prepared = request.prepare()
+        if inspect.isawaitable(prepared):
+            await prepared
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/stream_events.py` around lines 135 - 137, The
code unconditionally awaits request.prepare() which fails for synchronous hooks;
change the call in stream handling (around request.prepare and the async for
using request.agent.aiter_chat_turn_stream and _build_agent_stream_kwargs) to
call request.prepare(), capture its return value, and only await it if it's
awaitable (use inspect.isawaitable or asyncio.iscoroutine to detect); ensure
synchronous prepare hooks are invoked without awaiting and preserve the existing
behavior for async prepare hooks.
src/fleet_rlm/api/routers/ws/stream_events.py-123-124 (1)

123-124: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard payload normalization to mappings only.

dict(getattr(event, "payload", {}) or {}) can throw for string/list/object payloads, which converts malformed event data into a stream crash.

Suggested fix
+from collections.abc import Mapping
@@
-    return WorkspaceEvent(
+    raw_payload = getattr(event, "payload", None)
+    payload = dict(raw_payload) if isinstance(raw_payload, Mapping) else {}
+
+    return WorkspaceEvent(
         kind=kind,
         text=str(getattr(event, "text", "") or ""),
-        payload=dict(getattr(event, "payload", {}) or {}),
+        payload=payload,
         timestamp=timestamp,
         terminal=is_terminal_stream_event_kind(kind),
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/api/routers/ws/stream_events.py` around lines 123 - 124, The
current payload normalization line in stream_events.py uses dict(getattr(event,
"payload", {}) or {}), which will raise when payload is a non-mapping
(string/list/etc.); change it to guard for mapping types: fetch payload =
getattr(event, "payload", None), if isinstance(payload, Mapping) then use
dict(payload) else use {}. Import Mapping from collections.abc and replace the
direct dict(...) call with this guarded logic in the code path that constructs
the event payload (the variable/argument named payload in the event creation).
src/fleet_rlm/integrations/daytona/runtime.py-96-109 (1)

96-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Browser skill snapshot selection can degrade to a non-browser image on cold environments

Line 96 introduces browser snapshot selection, but when that snapshot is missing/inactive, resolve_sandbox_spec_snapshot() falls back through fallback_to_declarative_image() which currently builds only the base image. That drops Playwright/Chromium deps and can make browser skill runs fail despite selecting browser capabilities.

💡 Proposed fix
diff --git a/src/fleet_rlm/integrations/daytona/snapshots.py b/src/fleet_rlm/integrations/daytona/snapshots.py
@@
 def fallback_to_declarative_image(spec: SandboxSpec) -> SandboxSpec:
     """Replace a snapshot-based spec with a declarative image build."""
-    image = build_base_snapshot_image()
+    if spec.snapshot == BROWSER_SNAPSHOT_NAME:
+        image = build_browser_snapshot_image()
+    else:
+        image = build_base_snapshot_image()
     return dataclasses.replace(spec, image=image, snapshot=None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/daytona/runtime.py` around lines 96 - 109,
resolve_snapshot_for_skills can pick the browser snapshot but downstream
fallback_to_declarative_image currently only builds the base image, dropping
Playwright/Chromium; update the fallback path so the selected snapshot flavor is
preserved: modify fallback_to_declarative_image to accept a snapshot_name or a
browser_flag and, when called from resolve_sandbox_spec_snapshot with a
browser-selected name (BROWSER_SNAPSHOT_NAME returned by
resolve_snapshot_for_skills), build the declarative fallback that includes
browser deps (the same layers/packages as the browser snapshot) instead of
always using DEFAULT_SNAPSHOT_NAME; update resolve_sandbox_spec_snapshot to pass
the chosen snapshot (or browser flag) into fallback_to_declarative_image so
browser-capable runs still get Playwright/Chromium when the original snapshot is
missing.
src/fleet_rlm/runtime/tools/document_tools.py-361-367 (1)

361-367: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce an aggregate bundle budget.

Each downloaded document still gets its own 50 MiB cap, so a single root fetch can now concatenate roughly 150 MiB of text into one tool response. That's enough to overwhelm the sandbox and any downstream LLM context. Stop bundling once a small combined text budget is reached, or skip auxiliary documents when the primary document is already large.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/tools/document_tools.py` around lines 361 - 367, The
bundling currently concatenates all combined_parts into text without an overall
size cap, so update the bundling logic that builds text, metadata,
bundled_sources and bundled_source_count to enforce a global combined budget
(e.g., a small byte/char limit) by iterating combined_parts and adding them
until adding the next part would exceed the budget; skip or truncate auxiliary
parts (auxiliary_errors sources) when the primary document already uses most of
the budget, and ensure bundled_sources and bundled_source_count only reflect the
actually included parts and that metadata["auxiliary_errors"] is adjusted to
reflect skipped/truncated items; operate on the same symbols (combined_parts,
bundled_sources, auxiliary_errors, metadata) and keep the per-document cap but
stop further concatenation once the aggregate budget is reached.
src/fleet_rlm/runtime/tools/document_tools.py-175-193 (1)

175-193: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Auxiliary bundling is not actually limited to root URLs.

The docstring says this is for root documentation URLs, but the current check only excludes paths with a suffix. URLs like https://docs.example/api or https://docs.example/guide/ still qualify and will silently pull in /llms.txt and /sitemap.xml, changing a page fetch into a site-wide bundle. Gate this on / or a stricter root heuristic before adding auxiliary sources.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/tools/document_tools.py` around lines 175 - 193, The
function _documentation_auxiliary_urls currently treats non-file paths like
"/api" or "/guide/" as roots and will add site-level companions; change the
gating logic to only allow auxiliary bundling for true root pages by normalizing
and checking the path equals "/" (or empty) before constructing candidates.
Specifically, replace the current heuristic that only tests Path(path).suffix
with a stricter check on parsed.path (e.g., normalize with parsed.path or
parsed.path.rstrip("/") and require it to be "/" or empty) before using origin
and _AUXILIARY_DOC_PATHS to build candidates (keep variables current,
candidates, origin, and _AUXILIARY_DOC_PATHS) and update the docstring to
reflect the stricter root-only behavior.
src/fleet_rlm/integrations/daytona/bridge_callbacks.py-42-46 (1)

42-46: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

semantic_callbacks_enabled is bypassable from sandbox code.

Hiding llm_query* from bridge_tools() does not actually disable them: code running inside the sandbox can still POST tool_name="llm_query" to the local broker, and invoke_tool() will execute it because the dispatcher never checks this flag. If this flag is meant to be a policy boundary, enforce it in invoke_tool() or whitelist only the names returned by bridge_tools().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/daytona/bridge_callbacks.py` around lines 42 - 46,
The current check of semantic_callbacks_enabled in bridge_callbacks.py only
omits adding "llm_query" and "llm_query_batched" to the tools map but does not
prevent sandbox code from calling them; update the runtime enforcement by making
invoke_tool() validate requested tool names against the allowed toolset returned
by bridge_tools() (or an explicit allowed_tools list created in
bridge_callbacks.py), and if semantic_callbacks_enabled is False reject or raise
on any invocation of "llm_query" / "llm_query_batched" (or any tool not present
in the allowed tools). Concretely, ensure invoke_tool() consults the tools dict
produced by bridge_tools() (and the semantic_callbacks_enabled flag or an
allowed_tools whitelist) before executing interpreter.llm_query or
interpreter.llm_query_batched to enforce the policy boundary.
src/fleet_rlm/integrations/daytona/bridge.py-143-158 (1)

143-158: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the broker and wrapper timeouts in sync.

/tool_call now waits broker_tool_call_timeout, but the generated wrapper still hard-times out its HTTP POST at 130 seconds later in this file. With the default 180-second broker timeout, callbacks that take 130-180 seconds will fail client-side first and surface as transport errors even though the broker is still waiting. Use the same timeout in both places or clamp the broker timeout to the wrapper limit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/daytona/bridge.py` around lines 143 - 158, The
wrapper currently uses a hardcoded __DAYTONA_TOOL_CALL_TIMEOUT_S__ for waiting
on call_id results, which can be shorter than the broker's
broker_tool_call_timeout and cause client-side timeouts; update the loop in
bridge.py (the code that waits on call_id, touches _results, _pending_requests,
and calls _send_json) to use the same broker timeout value
(broker_tool_call_timeout) or clamp broker_tool_call_timeout to
__DAYTONA_TOOL_CALL_TIMEOUT_S__ so both sides share the same timeout; locate
where __DAYTONA_TOOL_CALL_TIMEOUT_S__ is referenced and replace/derive it from
broker_tool_call_timeout (or apply min(broker_tool_call_timeout,
__DAYTONA_TOOL_CALL_TIMEOUT_S__)) so callbacks that take up to the broker
timeout do not fail client-side.
src/fleet_rlm/integrations/daytona/sandbox_executor.py-848-867 (1)

848-867: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The 5-minute broker latch survives explicit resets.

_owner_broker_start_error() checks _BROKER_START_FAILURES before the owner field, so after one broker start failure the next run still short-circuits here for the full cooldown even if soft_reset() or close_bridge() was called. That makes the new reset-time _bridge_start_error = None ineffective for same-sandbox recovery. Clear the session-wide cache on explicit reset/close, or scope the cache to the executor instance instead of the sandbox id.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/integrations/daytona/sandbox_executor.py` around lines 848 -
867, The session-wide broker failure cache (_BROKER_START_FAILURES) causes
_owner_broker_start_error(session, ...) to short-circuit new runs even after
owner._bridge_start_error is cleared; update the reset/close flow (e.g., in
soft_reset() and close_bridge()) to also clear the session entry from
_BROKER_START_FAILURES by calling _clear_broker_start_error(session) (or remove
the session key), or alternatively refactor the cache to be scoped to the
executor instance rather than the sandbox id so that setting
owner._bridge_start_error = None actually permits immediate recovery; ensure
changes reference the existing symbols _BROKER_START_FAILURES,
_owner_broker_start_error, _clear_broker_start_error, soft_reset, close_bridge,
and owner._bridge_start_error.
src/fleet_rlm/runtime/agent/runtime.py-125-138 (1)

125-138: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't unwrap the escalating runtime into its inner ReAct here.

When self.agent is the default EscalatingFleetModule, this probe can return its inner .react program and the native path then drives that child directly with only chat_history/user_message. That skips the outer module’s routing/escalation flow and drops the inputs built by _escalation_call_args(...) (execution_mode, conversation_summary, core_memory, bounded history), so URL-document RLM routing and other escalation behavior can be silently bypassed during streaming. Keep native streaming restricted to a true FleetAgent, or make the escalating module itself implement the stepwise interface instead of unwrapping it.

Also applies to: 822-823, 847-850

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/agent/runtime.py` around lines 125 - 138, The probe in
_get_streamable_react_program incorrectly unwraps an EscalatingFleetModule by
returning its .react child; change the logic so streaming is only allowed for
true FleetAgent instances (or objects that implement the full stepwise streaming
interface) and refuse to return the inner .react when the outer program is an
EscalatingFleetModule (or when the original program exposes escalation-specific
APIs like _escalation_call_args / routing). Concretely, update
_get_streamable_react_program to detect EscalatingFleetModule (or presence of
escalation attributes) and skip returning candidate if the outer program is an
escalation module; ensure the same check is applied in the other similar probes
referenced around lines 822-823 and 847-850 so native streaming cannot bypass
escalation inputs.
src/fleet_rlm/runtime/modules/escalating.py-276-278 (1)

276-278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

RLM fallback object is incompatible with the _run_rlm call contract

Line 277 sets self._rlm to ChainOfThought(RLMReActChatSignature), but Lines 579–595 invoke that object with task/prompt kwargs. In forced RLM mode without an interpreter, this will raise at runtime instead of cleanly degrading.

Suggested fix
         else:
-            self._rlm = dspy.ChainOfThought(RLMReActChatSignature)
+            self._rlm = None

Also applies to: 579-595

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/modules/escalating.py` around lines 276 - 278, The
fallback assigned to self._rlm (ChainOfThought(RLMReActChatSignature)) doesn't
accept the keyword args used by _run_rlm (it is invoked with task= / prompt= in
the 579–595 call site), so replace the direct ChainOfThought assignment with an
adapter that matches the _run_rlm call contract: either construct a wrapper
callable (or small adapter class with __call__(self, *, task=None, prompt=None,
**kwargs)) that forwards task/prompt into the ChainOfThought invocation in the
expected positional form, or change the _run_rlm invocation to use the
ChainOfThought-compatible signature; ensure the symbol names referenced are
self._rlm, ChainOfThought, RLMReActChatSignature and _run_rlm so the adapter
preserves the original behavior while accepting task/prompt kwargs.
🧹 Nitpick comments (4)
tests/unit/api/test_dependencies.py (1)

95-113: ⚡ Quick win

Add a regression test for the uninitialized persistence branch.

get_persistence now raises when both repository and local_store are absent, but that new branch isn’t asserted here.

Suggested test addition
+def test_get_persistence_raises_when_backend_uninitialized(clean_runtime_env):
+    dependencies_module = importlib.import_module("fleet_rlm.api.dependencies")
+    app = SimpleNamespace(state=SimpleNamespace(persistence_deps=dependencies_module.PersistenceDeps()))
+    request = _build_request(app)
+    try:
+        dependencies_module.get_persistence(request)
+        assert False, "Expected RuntimeError when persistence backend is uninitialized"
+    except RuntimeError as exc:
+        assert "Persistence backend not initialized" in str(exc)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/api/test_dependencies.py` around lines 95 - 113, Extend the
existing test_get_persistence_prefers_repository_then_local_store to cover the
new uninitialized branch: create an app whose state.persistence_deps is
dependencies_module.PersistenceDeps() with neither repository nor local_store,
build the request with _build_request(app), and assert that calling
dependencies_module.get_persistence(request) raises the expected exception (use
pytest.raises to check the raise). Ensure you reference get_persistence and
PersistenceDeps when locating the code under test.
src/fleet_rlm/runtime/modules/skill_selection.py (1)

68-99: ⚡ Quick win

Reduce overlap between long-context and browser-interaction keywords.

http://, https://, fetch, browse, and scrape in long-context can overshadow browser-intent routing, making browser-interaction less likely to be selected when it should be.

Suggested direction
     "long-context": [
@@
-        "http://",
-        "https://",
-        "fetch",
-        "browse",
-        "scrape",
-        "analyze https",
-        "analyze http",
+        "analyze https",
+        "analyze http",
     ],
     "browser-interaction": [
+        "http://",
+        "https://",
+        "fetch",
+        "browse",
+        "scrape",
         "render the page",
         "rendered page",
         "javascript page",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fleet_rlm/runtime/modules/skill_selection.py` around lines 68 - 99, The
keyword lists in skill_selection.py have overlapping tokens causing
browser-intent to be underselected; remove the URL- and browser-specific tokens
('http://', 'https://', 'fetch', 'browse', 'scrape') from the "long-context"
list and ensure they remain (or are added) only in the "browser-interaction"
list so routing prefers the browser skill; update the "long-context" and
"browser-interaction" arrays accordingly in the file (referencing the
"long-context" and "browser-interaction" keys) and run any selection/unit tests
to verify correct intent routing.
src/frontend/src/components/agent-elements/input/model-picker.tsx (2)

13-13: ⚡ Quick win

Use the canonical cn() import path.

Line [13] uses a relative cn import; this should come from the shared alias to keep frontend utilities consistent.

Proposed change
-import { cn } from "../utils/cn";
+import { cn } from "`@/lib/utils`";

As per coding guidelines, use @/lib/utils as the canonical cn() import path for frontend TypeScript files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/src/components/agent-elements/input/model-picker.tsx` at line
13, The import for cn in model-picker.tsx uses a relative path ("../utils/cn");
update it to use the canonical shared alias by importing cn from "`@/lib/utils`"
instead, ensuring the component continues to use cn() for className
concatenation and keeping frontend utility imports consistent across the
codebase.

Source: Coding guidelines


65-65: ⚡ Quick win

Replace arbitrary Tailwind pixel classes with shared utilities/tokens.

Lines [65], [84], [101], and [131] use bracket pixel values (e.g., text-[12px], rounded-[6px], rounded-[8px]). These should be converted to canonical typography/radius utilities (or added via shared @utility tokens) to match the design-system contract.

As per coding guidelines, arbitrary Tailwind values should be eliminated in favor of semantic/shared utilities.

Also applies to: 84-84, 101-101, 131-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend/src/components/agent-elements/input/model-picker.tsx` at line
65, Replace the arbitrary Tailwind pixel classes in the model-picker.tsx file
with canonical typography and radius utilities. Specifically, convert
text-[12px] to the appropriate shared typography utility token, rounded-[6px] to
the corresponding canonical radius utility, and rounded-[8px] at line 131 to its
shared equivalent. Apply these replacements across all four occurrences
mentioned at lines 65, 84, 101, and 131 to ensure consistency with the design
system contract and eliminate arbitrary bracket notation values in favor of
semantic shared utilities defined via `@utility` tokens.

Source: Coding guidelines

Comment thread src/fleet_rlm/api/events/project_graph.py Outdated
Comment thread src/fleet_rlm/api/routers/ws/turn_runner.py
Comment thread tests/unit/runtime/test_tools.py
Comment thread tests/unit/integrations/test_mlflow_context.py Fixed
Comment thread tests/unit/integrations/test_mlflow_context.py Fixed
Comment thread tests/unit/api/test_runtime_diagnostics.py Fixed
@Zochory Zochory merged commit a54d84f into main Jun 7, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants