GetSmallAI · morganlinton · Jun 10, 2026 · Jun 8, 2026 · Jun 9, 2026 · Jun 10, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -4,6 +4,8 @@ on:
   push:
     branches: [main]
   pull_request:
+  schedule:
+    - cron: "0 6 * * *"
 
 jobs:
   build:
@@ -20,6 +22,22 @@ jobs:
       - run: cargo build --release --verbose
       - run: cargo test --verbose
 
+  agent-eval:
+    name: live agent eval (optional)
+    runs-on: macos-latest
+    if: github.event_name == 'schedule' || contains(github.event.head_commit.message, '[eval]')
+    continue-on-error: true
+    steps:
+      - uses: actions/checkout@v6
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+      - run: brew install ollama
+      - run: brew services start ollama
+      - run: ollama pull qwen2.5-coder:7b
+      - run: cargo build --release
+      - run: cargo run --release -- --eval read-and-explain --model qwen2.5-coder:7b
+      - run: cargo run --release -- --eval fix-failing-test --model qwen2.5-coder:7b
+
   lint:
     name: clippy + fmt
     runs-on: ubuntu-latest

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,45 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## [Unreleased]
 
+## [0.7.0] - 2026-06-09
+
+### Added
+
+- **Turn tracing and session event log.** Every turn appends structured events
+  (tool calls with redacted args, approvals, output compaction, warmup, timing)
+  to a sidecar at `.sessions/<session-id>.events.jsonl`, enabled by default via
+  `display.eventLog.enabled`. `/trace on|off` shows nested subagent/critic tool
+  calls as indented lines in the TUI — previously invisible — and
+  `/export <session> events` copies the sidecar. The end-of-turn status line
+  gains a timing breakdown (TTFT, model, tools, approval, total), the loader
+  names the tool currently running, and compaction of oversized tool output is
+  reported with the original size.
+- **Agent eval CLI.** `small-harness --eval <fixture> [--model M] [--json]`
+  runs a bundled eval fixture from the shell and exits 0 on pass / 1 on fail,
+  for CI scripts. An optional `agent-eval` CI job runs two fixtures against
+  Ollama nightly or on `[eval]` in a commit message (continue-on-error, so a
+  flaky local model never blocks merges). New integration tests drive the real
+  agent loop against a mock OpenAI-compatible SSE server — no live LLM needed.
+
+### Fixed
+
+- `file_edit` can create new files via the empty-`old_text` convention used by
+  Claude Code and similar harnesses.
+- Tool responses for three model-facing edge cases: `file_read` with an offset
+  past EOF returns a clear error instead of silently-empty content, `list_dir`
+  reports the real entry `total` when a listing is truncated, and `grep` drops
+  unparseable ripgrep output lines instead of emitting malformed matches.
+- The rubric heading parser matches `(weight:` case-insensitively on raw bytes,
+  fixing potential mis-parses of criterion names containing certain Unicode
+  characters.
+- The HTTP client now uses a 10-second connect timeout so a dead backend fails
+  fast instead of hanging, without capping long streaming completions.
+
+### Changed
+
+- Internal: the 3,000-line commands module was split into focused submodules
+  (config, context, memory, session). No behavior change.
+
 ## [0.6.1] - 2026-06-07
 
 ### Added

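The 0.7.0 `Fixed` entry on the connect timeout separates two budgets: a short limit on establishing the connection, and no limit on how long a streaming response may run. The diff does not show the HTTP client code, so the following is only a std-only sketch of that distinction, not the project's implementation. The helper name `connect_fail_fast` and the constant `CONNECT_TIMEOUT` are hypothetical; only the 10-second value comes from the changelog.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Connect budget quoted in the changelog entry (10 seconds).
const CONNECT_TIMEOUT: Duration = Duration::from_secs(10);

/// Hypothetical helper: bound only the TCP connect phase.
/// A dead backend surfaces as an error within CONNECT_TIMEOUT
/// instead of hanging on the OS default.
fn connect_fail_fast(addr: &SocketAddr) -> io::Result<TcpStream> {
    let stream = TcpStream::connect_timeout(addr, CONNECT_TIMEOUT)?;
    // Deliberately leave reads unbounded: `None` means a read blocks
    // until data arrives, so a long streaming completion is not cut off.
    stream.set_read_timeout(None)?;
    Ok(stream)
}
```

A single whole-request timeout would also fail fast on a dead backend, but it would abort slow generations too. Bounding only the connect phase avoids that.
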
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "small-harness"
-version = "0.6.1"
+version = "0.7.0"
 edition = "2021"
 
 [[bin]]

diff --git a/Quickstart.md b/Quickstart.md
@@ -143,8 +143,20 @@ Useful commands:
 /sessions search dispatch
 /new          start a clean conversation
 /export current markdown
+/export current events   copy the session event log sidecar
 ```
 
+**Transparent mode** (see everything the agent did):
+
+```text
+/verbose on
+/trace on
+```
+
+The event log lives beside each transcript:
+`.sessions/<session-id>.events.jsonl` (tool calls, approvals, compaction,
+warmup, per-turn timing summary).
+
 Good habits:
 
 - Ask for one focused change at a time.
@@ -233,6 +245,13 @@ printf 'What changed in this branch?\n' | cargo run --release --
 Approval-gated write and shell tools are denied in one-shot mode unless you pass
 `--allow-tools`.
 
+Run a bundled agent eval from the shell (exit code 0 on pass):
+
+```bash
+cargo run --release -- --eval read-and-explain --model qwen2.5-coder:7b
+cargo run --release -- --eval fix-failing-test --json
+```
+
 ## A Good First Session
 
 Here is a simple sequence that exercises the whole product:
@@ -307,6 +326,7 @@ Small Harness keeps local state under `.sessions/`:
 .sessions/
   history.jsonl          input history
   *.jsonl                session transcripts
+  *.events.jsonl         per-session structured event logs (tools, timing, approvals)
   project-memory/
     index.json           safe metadata-only repo index
     notes.jsonl          durable project notes from /remember

diff --git a/README.md b/README.md
@@ -19,7 +19,7 @@
 <p align="center">
   <a href="https://github.com/GetSmallAI/SmallHarness/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/GetSmallAI/SmallHarness/actions/workflows/ci.yml/badge.svg"></a>
   <img alt="Rust" src="https://img.shields.io/badge/Rust-1.75%2B-dea584">
-  <img alt="Version" src="https://img.shields.io/badge/version-0.6.1-111827">
+  <img alt="Version" src="https://img.shields.io/badge/version-0.7.0-111827">
   <img alt="Backends" src="https://img.shields.io/badge/backends-Ollama%20%7C%20LM%20Studio%20%7C%20MLX%20%7C%20llama.cpp%20%7C%20OpenRouter%20%7C%20OpenAI-2563eb">
   <img alt="Apple Silicon" src="https://img.shields.io/badge/Apple%20Silicon-optimized-111827">
   <img alt="License MIT" src="https://img.shields.io/badge/license-MIT-111827">
@@ -328,6 +328,7 @@ this exact call`. The session cache resets on `/new`.
 /image <path>          attach an image to the next user turn
 /reasoning on|off      toggle the streaming reasoning panel
 /verbose on|off        show every tool call with its full args + result
+/trace on|off          show nested subagent/critic tool calls (indented)
 /compare [model]       re-send the last prompt against OpenRouter for A/B
 ```
 
@@ -593,6 +594,10 @@ root. Common shape:
   "tools": ["file_read", "grep", "list_dir", "file_edit", "file_write", "shell", "update_plan", "task"],
   "toolSelection": "auto",
   "maxSteps": 20,
+  "display": {
+    "toolDisplay": "grouped",
+    "eventLog": { "enabled": true }
+  },
   "workspaceRoot": "/path/to/project",
   "outsideWorkspace": "prompt",
   "context": {
@@ -639,6 +644,14 @@ runtime.
 - **`/verbose on|off`** switches to a debug tool view: every tool call is
   printed with its full arguments and a large result preview, so you can see
   exactly what the agent is doing. `/verbose off` restores the normal view.
+- **`/trace on|off`** shows nested subagent and critic tool activity as
+  indented lines in the TUI (without flooding the parent context). Every turn
+  is also logged to a sidecar at `.sessions/<session-id>.events.jsonl` with
+  tool calls, approvals, compaction, warmup, and timing — enabled by default
+  via `display.eventLog.enabled` in `agent.config.json`.
+- **Turn footer timing.** After each turn the status line includes step count
+  and a breakdown when available: `TTFT`, `model`, `tools`, `approval`, and
+  `total` seconds alongside the existing token and cost stats.
 - **Slash-command completion.** Type `/` and a menu of matching commands (with
   descriptions) appears beneath the prompt; the best match also shows as dim
   ghost text. **↑/↓** select, **Tab** accepts (with a trailing space), **→**
@@ -652,6 +665,8 @@ runtime.
 - **One-shot mode** — `small-harness --print "summarize this repo"` or
   `printf '…\n' | small-harness` for scripts and CI. Approval-gated tools
   are denied by default; pass `--allow-tools` to allow them.
+- **Agent eval CLI** — `small-harness --eval fix-failing-test [--model M] [--json]`
+  runs a bundled eval fixture and exits 0/1 (for CI scripts).
 - **Warmup.** Small Harness sends a 1-token request with the full system
   prompt + tools at startup so llama.cpp-derived engines have a hot
   prompt-eval cache before your first prompt. Disable with `WARMUP=false`.
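
The README and CHANGELOG entries for the agent eval CLI state that `small-harness --eval <fixture>` exits 0 on pass and 1 on fail. A minimal sketch of how a CI helper might consume that contract is below. The names `EvalOutcome`, `outcome_from_exit_code`, and `run_eval` are hypothetical, and treating any other exit status as `Abnormal` is an assumption, since the diff only documents codes 0 and 1.

```rust
use std::io;
use std::process::Command;

/// Result of one eval run as seen from the calling script.
#[derive(Debug, PartialEq, Eq)]
enum EvalOutcome {
    Pass,
    Fail,
    /// Assumption: anything other than 0 or 1 (including death by signal)
    /// is treated as an infrastructure problem, not an eval verdict.
    Abnormal,
}

/// Map a process exit code onto the documented 0 = pass / 1 = fail contract.
fn outcome_from_exit_code(code: Option<i32>) -> EvalOutcome {
    match code {
        Some(0) => EvalOutcome::Pass,
        Some(1) => EvalOutcome::Fail,
        _ => EvalOutcome::Abnormal,
    }
}

/// Run `<binary> --eval <fixture> [--model <model>]` and classify the result.
/// The flags are the ones shown in the README; `binary` is supplied by the caller.
fn run_eval(binary: &str, fixture: &str, model: Option<&str>) -> io::Result<EvalOutcome> {
    let mut cmd = Command::new(binary);
    cmd.arg("--eval").arg(fixture);
    if let Some(m) = model {
        cmd.arg("--model").arg(m);
    }
    // `status()` waits for the child; `code()` is None if a signal killed it.
    let status = cmd.status()?;
    Ok(outcome_from_exit_code(status.code()))
}
```

A caller could use `run_eval("small-harness", "fix-failing-test", Some("qwen2.5-coder:7b"))` and block a merge only on `Fail`, reporting `Abnormal` separately so that a backend that never started is not mistaken for a failed fixture.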