18 changes: 18 additions & 0 deletions .github/workflows/ci.yml
@@ -4,6 +4,8 @@ on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: "0 6 * * *"

jobs:
  build:
@@ -20,6 +22,22 @@ jobs:
      - run: cargo build --release --verbose
      - run: cargo test --verbose

  agent-eval:
    name: live agent eval (optional)
    runs-on: macos-latest
    if: github.event_name == 'schedule' || contains(github.event.head_commit.message, '[eval]')
    continue-on-error: true
    steps:
      - uses: actions/checkout@v6
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: brew install ollama
      # ollama pull talks to the local server, so start it first
      - run: brew services start ollama
      - run: ollama pull qwen2.5-coder:7b
      - run: cargo build --release
      - run: cargo run --release -- --eval read-and-explain --model qwen2.5-coder:7b
      - run: cargo run --release -- --eval fix-failing-test --model qwen2.5-coder:7b

  lint:
    name: clippy + fmt
    runs-on: ubuntu-latest
39 changes: 39 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,45 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [Unreleased]

## [0.7.0] - 2026-06-09

### Added

- **Turn tracing and session event log.** Every turn appends structured events
(tool calls with redacted args, approvals, output compaction, warmup, timing)
to a sidecar at `.sessions/<session-id>.events.jsonl`, enabled by default via
`display.eventLog.enabled`. `/trace on|off` shows nested subagent/critic tool
calls as indented lines in the TUI — previously invisible — and
`/export <session> events` copies the sidecar. The end-of-turn status line
gains a timing breakdown (TTFT, model, tools, approval, total), the loader
names the tool currently running, and compaction of oversized tool output is
reported with the original size.
- **Agent eval CLI.** `small-harness --eval <fixture> [--model M] [--json]`
runs a bundled eval fixture from the shell and exits 0 on pass / 1 on fail,
for CI scripts. An optional `agent-eval` CI job runs two fixtures against
Ollama nightly or on `[eval]` in a commit message (continue-on-error, so a
flaky local model never blocks merges). New integration tests drive the real
agent loop against a mock OpenAI-compatible SSE server — no live LLM needed.
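
The sidecar is JSON Lines, so it can be summarized with a short script. A minimal sketch, assuming each line is a JSON object with a `type` field. The changelog does not document the event schema, so that field name and the sample events below are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Tally events by their (assumed) `type` field, skipping blank or malformed lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # A partially written last line is possible while a session is live.
            continue
        if isinstance(event, dict):
            counts[str(event.get("type", "unknown"))] += 1
    return counts


# Hypothetical sample events; real field names may differ.
sample = [
    '{"type": "tool_call", "tool": "grep"}',
    '{"type": "approval"}',
    '{"type": "tool_call", "tool": "file_read"}',
    "",
    '{"type": "tool_ca',  # truncated line, ignored
]
summary = count_event_types(sample)
print(summary.most_common())
```

To summarize a real session, pass an open file handle for `.sessions/<session-id>.events.jsonl` instead of `sample`.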

### Fixed

- `file_edit` can create new files via the empty-`old_text` convention used by
Claude Code and similar harnesses.
- Tool responses for three model-facing edge cases: `file_read` with an offset
past EOF returns a clear error instead of silently-empty content, `list_dir`
reports the real entry `total` when a listing is truncated, and `grep` drops
unparseable ripgrep output lines instead of emitting malformed matches.
- The rubric heading parser matches `(weight:` case-insensitively on raw bytes,
fixing potential mis-parses of criterion names containing certain Unicode
characters.
- The HTTP client now uses a 10-second connect timeout so a dead backend fails
fast instead of hanging, without capping long streaming completions.
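
One way a case-insensitive search can mis-parse Unicode text: lowercasing the whole heading first can change its length, so an offset found in the lowered copy no longer lines up with the original. The sketch below illustrates that failure mode and a byte-level, ASCII-only match that avoids it. It is written in Python for brevity and is not the project's parser. The changelog does not describe the original bug in detail, and `find_weight_marker`, `split_heading`, and the sample heading are hypothetical.

```python
from typing import Tuple

MARKER = b"(weight:"


def find_weight_marker(heading: str) -> int:
    """Return the byte offset of `(weight:` in the UTF-8 heading, or -1.

    Only ASCII letters are case-folded, so no byte is added or removed and
    the offset stays valid for the original bytes.
    """
    raw = heading.encode("utf-8")
    n = len(MARKER)
    for i in range(len(raw) - n + 1):
        if raw[i:i + n].lower() == MARKER:  # bytes.lower() touches ASCII A-Z only
            return i
    return -1


def split_heading(heading: str) -> Tuple[str, str]:
    """Split a heading into (criterion name, remainder starting at the marker)."""
    raw = heading.encode("utf-8")
    at = find_weight_marker(heading)
    if at < 0:
        return heading.strip(), ""
    # `(` is ASCII, so the cut never lands inside a multi-byte sequence.
    return raw[:at].decode("utf-8").strip(), raw[at:].decode("utf-8")


heading = "İstanbul clarity (Weight: 2)"

# Naive approach: the offset comes from a lowered copy that is one character longer.
naive_at = heading.lower().find("(weight:")
naive_name = heading[:naive_at].strip()

name, rest = split_heading(heading)
print(repr(naive_name), repr(name), repr(rest))
```

The naive slice keeps a stray `(` in the criterion name because `İ` lowercases to two code points, while the byte-level match cuts exactly at the marker.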

### Changed

- Internal: the 3,000-line commands module was split into focused submodules
(config, context, memory, session). No behavior change.

## [0.6.1] - 2026-06-07

### Added
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "small-harness"
-version = "0.6.1"
+version = "0.7.0"
edition = "2021"

[[bin]]
20 changes: 20 additions & 0 deletions Quickstart.md
@@ -143,8 +143,20 @@ Useful commands:
/sessions search dispatch
/new                     start a clean conversation
/export current markdown
/export current events   copy the session event log sidecar
```

**Transparent mode** (see everything the agent did):

```text
/verbose on
/trace on
```

The event log lives beside each transcript:
`.sessions/<session-id>.events.jsonl` (tool calls, approvals, compaction,
warmup, per-turn timing summary).

Good habits:

- Ask for one focused change at a time.
@@ -233,6 +245,13 @@ printf 'What changed in this branch?\n' | cargo run --release --
Approval-gated write and shell tools are denied in one-shot mode unless you pass
`--allow-tools`.

Run a bundled agent eval from the shell (exit code 0 on pass):

```bash
cargo run --release -- --eval read-and-explain --model qwen2.5-coder:7b
cargo run --release -- --eval fix-failing-test --json
```
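
Because the eval reports its result through the exit code, a CI script only needs to check the process status. A minimal sketch in Python: `eval_passed` is a hypothetical helper, and the demo substitutes stand-in commands for the real binary so it runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def eval_passed(cmd: Sequence[str]) -> bool:
    """Run an eval command and report pass/fail from its exit code (0 = pass)."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    return result.returncode == 0


# Stand-ins that exit 0 and 1, mimicking a passing and a failing fixture.
fake_pass = [sys.executable, "-c", "raise SystemExit(0)"]
fake_fail = [sys.executable, "-c", "raise SystemExit(1)"]
print(eval_passed(fake_pass), eval_passed(fake_fail))

# With the real CLI built, the command would instead be something like:
# ["cargo", "run", "--release", "--", "--eval", "fix-failing-test", "--json"]
```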

## A Good First Session

Here is a simple sequence that exercises the whole product:
@@ -307,6 +326,7 @@ Small Harness keeps local state under `.sessions/`:
.sessions/
  history.jsonl          input history
  *.jsonl                session transcripts
  *.events.jsonl         per-session structured event logs (tools, timing, approvals)
  project-memory/
    index.json           safe metadata-only repo index
    notes.jsonl          durable project notes from /remember
17 changes: 16 additions & 1 deletion README.md
@@ -19,7 +19,7 @@
<p align="center">
<a href="https://github.com/GetSmallAI/SmallHarness/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/GetSmallAI/SmallHarness/actions/workflows/ci.yml/badge.svg"></a>
<img alt="Rust" src="https://img.shields.io/badge/Rust-1.75%2B-dea584">
-<img alt="Version" src="https://img.shields.io/badge/version-0.6.1-111827">
+<img alt="Version" src="https://img.shields.io/badge/version-0.7.0-111827">
<img alt="Backends" src="https://img.shields.io/badge/backends-Ollama%20%7C%20LM%20Studio%20%7C%20MLX%20%7C%20llama.cpp%20%7C%20OpenRouter%20%7C%20OpenAI-2563eb">
<img alt="Apple Silicon" src="https://img.shields.io/badge/Apple%20Silicon-optimized-111827">
<img alt="License MIT" src="https://img.shields.io/badge/license-MIT-111827">
@@ -328,6 +328,7 @@ this exact call`. The session cache resets on `/new`.
/image <path>       attach an image to the next user turn
/reasoning on|off   toggle the streaming reasoning panel
/verbose on|off     show every tool call with its full args + result
/trace on|off       show nested subagent/critic tool calls (indented)
/compare [model]    re-send the last prompt against OpenRouter for A/B
```

@@ -593,6 +594,10 @@ root. Common shape:
  "tools": ["file_read", "grep", "list_dir", "file_edit", "file_write", "shell", "update_plan", "task"],
  "toolSelection": "auto",
  "maxSteps": 20,
  "display": {
    "toolDisplay": "grouped",
    "eventLog": { "enabled": true }
  },
  "workspaceRoot": "/path/to/project",
  "outsideWorkspace": "prompt",
  "context": {
@@ -639,6 +644,14 @@ runtime.
- **`/verbose on|off`** switches to a debug tool view: every tool call is
printed with its full arguments and a large result preview, so you can see
exactly what the agent is doing. `/verbose off` restores the normal view.
- **`/trace on|off`** shows nested subagent and critic tool activity as
indented lines in the TUI (without flooding the parent context). Every turn
is also logged to a sidecar at `.sessions/<session-id>.events.jsonl` with
tool calls, approvals, compaction, warmup, and timing — enabled by default
via `display.eventLog.enabled` in `agent.config.json`.
- **Turn footer timing.** After each turn the status line includes step count
and a breakdown when available: `TTFT`, `model`, `tools`, `approval`, and
`total` seconds alongside the existing token and cost stats.
- **Slash-command completion.** Type `/` and a menu of matching commands (with
descriptions) appears beneath the prompt; the best match also shows as dim
ghost text. **↑/↓** select, **Tab** accepts (with a trailing space), **→**
@@ -652,6 +665,8 @@ runtime.
- **One-shot mode** — `small-harness --print "summarize this repo"` or
`printf '…\n' | small-harness` for scripts and CI. Approval-gated tools
are denied by default; pass `--allow-tools` to allow them.
- **Agent eval CLI** — `small-harness --eval fix-failing-test [--model M] [--json]`
runs a bundled eval fixture and exits 0/1 (for CI scripts).
- **Warmup.** Small Harness sends a 1-token request with the full system
prompt + tools at startup so llama.cpp-derived engines have a hot
prompt-eval cache before your first prompt. Disable with `WARMUP=false`.