This harness runs manual integration evaluations through the programmatic API (AskyClient.run_turn).
- Primary use: evaluate research pipeline behavior and answer quality.
- Also supports non-research runs (`research_mode=false`) with the same dataset and scoring.
- Designed for manual benchmark/evaluation runs, not default unit-test flow.
- Dataset: documents + test cases + expected checks.
- Matrix: one or more run profiles (model + mode + params + knobs).
- Snapshots: pinned local copies of source docs for reproducibility.
- Run Output: per-case JSONL plus summary and markdown report.
- Dataset seed: `/Users/evren/code/asky/evals/research_pipeline/datasets/rfc_http_nist_v1.yaml`
- Matrix seed: `/Users/evren/code/asky/evals/research_pipeline/matrices/default.toml`
- Snapshot root (default): `/Users/evren/code/asky/temp/research_eval/snapshots`
- Run output root (default): `/Users/evren/code/asky/temp/research_eval/runs`
# 1) Prepare pinned local documents
uv run python -m asky.evals.research_pipeline.run prepare \
--dataset evals/research_pipeline/datasets/rfc_http_nist_v1.yaml
# 2) Execute all runs from a matrix
uv run python -m asky.evals.research_pipeline.run run \
--matrix evals/research_pipeline/matrices/default.toml
# 3) Execute only selected run IDs
uv run python -m asky.evals.research_pipeline.run run \
--matrix evals/research_pipeline/matrices/default.toml \
--run research-glmflash-local
# 4) Rebuild report from existing outputs
uv run python -m asky.evals.research_pipeline.run report \
--dataset evals/research_pipeline/datasets/rfc_http_nist_v1.yaml \
--results-dir temp/research_eval/runs/<timestamp>
Add a new file under `/Users/evren/code/asky/evals/research_pipeline/datasets/`, for example:
id: my_policy_eval_v1
docs:
  - id: policy-1
    title: "Policy A"
    url: "https://example.com/policy-a.pdf"
  - id: policy-2
    title: "Policy B"
    url: "https://example.com/policy-b.pdf"
tests:
  - id: policy-a-claim-1
    doc_id: policy-1
    query: "What is the minimum review interval?"
    expected:
      type: contains
      text: "at least every 90 days"
  - id: compare-a-b
    doc_ids: [policy-1, policy-2]
    query: "Which document requires annual external audit?"
    expected:
      type: regex
      pattern: "(?i)annual\\s+external\\s+audit"
Notes:
- `doc_id` and `doc_ids` are both supported.
- `doc_ids` enables multi-document questions in one test case.
- Supported expectation types right now: `contains`, `regex`.
Add a matrix under /Users/evren/code/asky/evals/research_pipeline/matrices/, for example:
dataset = "../datasets/my_policy_eval_v1.yaml"
snapshot_root = "temp/research_eval/snapshots"
output_root = "temp/research_eval/runs"
[[runs]]
id = "research-glmflash-local"
model_alias = "glmflash"
research_mode = true
source_provider = "local_snapshot"
lean = true
preload_local_sources = true
preload_shortlist = false
save_history = false
[runs.parameters]
temperature = 0.1
[[runs]]
id = "standard-gf-live"
model_alias = "gf"
research_mode = false
source_provider = "live_web"
lean = false
preload_local_sources = false
preload_shortlist = true
save_history = false
uv run python -m asky.evals.research_pipeline.run prepare --dataset evals/research_pipeline/datasets/my_policy_eval_v1.yaml
uv run python -m asky.evals.research_pipeline.run run --matrix evals/research_pipeline/matrices/my_policy_eval_v1.toml
Dataset top-level fields:
- `id`: optional string (defaults to filename stem)
- `docs`: required non-empty list
- `tests`: required non-empty list
docs[] fields:
- `id` (required, unique)
- `title` (required)
- `url` (required)
tests[] fields:
- `id` (required, unique)
- `query` (required)
- `doc_id` or `doc_ids` (required, references `docs.id`)
- `expected` (required)
expected fields:
- `type="contains"` with `text`
- `type="regex"` with `pattern`
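The dataset rules listed above can be checked before a run with a small validator. The sketch below is illustrative only: `validate_dataset` is a hypothetical helper, not part of the harness, and it assumes the YAML file has already been parsed into a plain mapping.

```python
def validate_dataset(dataset: dict) -> list[str]:
    """Return schema problems for a parsed dataset; an empty list means none found."""
    problems: list[str] = []
    docs = dataset.get("docs") or []
    tests = dataset.get("tests") or []
    if not docs:
        problems.append("docs: required non-empty list")
    if not tests:
        problems.append("tests: required non-empty list")

    doc_ids: set[str] = set()
    for index, doc in enumerate(docs):
        for key in ("id", "title", "url"):
            if not doc.get(key):
                problems.append(f"docs[{index}]: missing {key}")
        doc_id = doc.get("id")
        if doc_id in doc_ids:
            problems.append(f"docs[{index}]: duplicate id {doc_id!r}")
        elif doc_id:
            doc_ids.add(doc_id)

    test_ids: set[str] = set()
    for index, test in enumerate(tests):
        label = f"tests[{index}]"
        test_id = test.get("id")
        if not test_id:
            problems.append(f"{label}: missing id")
        elif test_id in test_ids:
            problems.append(f"{label}: duplicate id {test_id!r}")
        else:
            test_ids.add(test_id)
        if not test.get("query"):
            problems.append(f"{label}: missing query")

        # Accept either the single-document or the multi-document reference form.
        refs = list(test.get("doc_ids") or [])
        if test.get("doc_id"):
            refs.append(test["doc_id"])
        if not refs:
            problems.append(f"{label}: doc_id or doc_ids is required")
        for ref in refs:
            if ref not in doc_ids:
                problems.append(f"{label}: unknown doc reference {ref!r}")

        # Each supported expectation type needs its own payload key.
        expected = test.get("expected") or {}
        required_key = {"contains": "text", "regex": "pattern"}.get(expected.get("type"))
        if required_key is None:
            problems.append(f"{label}: unsupported expected.type {expected.get('type')!r}")
        elif not expected.get(required_key):
            problems.append(f"{label}: expected.{required_key} is required")
    return problems
```

Running it on a parsed dataset and printing the returned list gives a quick pre-flight check before `prepare`; the harness itself may report these problems differently.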
Matrix top-level fields:
- `dataset`: optional path (if omitted, pass `--dataset` at runtime)
- `snapshot_root`: optional path
- `output_root`: optional path
- `[[runs]]`: required, at least one
runs[] fields:
- `id` (required, unique)
- `model_alias` (required)
- `research_mode` (default `true`)
- `source_provider` (`auto`, `local_snapshot`, `live_web`, `mock_web`)
- `lean` (default `false`)
- `preload_local_sources` (default `true`)
- `preload_shortlist` (default `true`)
- `save_history` (default `false`)
- `disabled_tools` (string CSV or string list)
- `additional_source_context` (optional string)
- `query_prefix` / `query_suffix` (optional strings)
- `[runs.parameters]` (optional key/value model parameters)
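To make the defaults above concrete, one `[[runs]]` table can be modeled as a dataclass. This is a sketch under assumptions: `RunProfile` and `profile_from_table` are hypothetical names, and the `"auto"` default for `source_provider` is a guess, since the field list does not state one.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class RunProfile:
    """Hypothetical in-memory form of one [[runs]] table."""

    id: str
    model_alias: str
    research_mode: bool = True
    source_provider: str = "auto"  # assumption: the default is not documented
    lean: bool = False
    preload_local_sources: bool = True
    preload_shortlist: bool = True
    save_history: bool = False
    disabled_tools: Union[str, list[str]] = field(default_factory=list)
    additional_source_context: Optional[str] = None
    query_prefix: Optional[str] = None
    query_suffix: Optional[str] = None
    parameters: dict[str, Any] = field(default_factory=dict)


def profile_from_table(table: dict) -> RunProfile:
    """Build a profile from a parsed TOML table; unknown keys raise TypeError."""
    return RunProfile(**table)


profile = profile_from_table({"id": "standard-gf-live", "model_alias": "gf", "research_mode": False})
print(profile.preload_shortlist, profile.save_history)
```

Rejecting unknown keys is a design choice of this sketch that catches typos in field names early; whether the harness is equally strict is not stated.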
Disable tools per run:
- `disabled_tools` accepts CSV string or string list.
- Example:
[[runs]]
id = "research-gf-no-web-search"
model_alias = "gf"
research_mode = true
source_provider = "local_snapshot"
disabled_tools = ["web_search", "get_url_content"]
This is useful for A/B comparisons of pass rate, latency, and tool-mix impact.
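Because `disabled_tools` accepts either a CSV string or a string list, both forms have to be reduced to one shape somewhere. A minimal sketch of that normalization follows; `normalize_disabled_tools` is a hypothetical helper, and the whitespace trimming and de-duplication are assumptions rather than documented harness behavior.

```python
from typing import Iterable, Optional, Union


def normalize_disabled_tools(value: Optional[Union[str, Iterable[str]]]) -> list[str]:
    """Turn a CSV string or a list of names into an ordered, de-duplicated list."""
    if value is None:
        return []
    items = value.split(",") if isinstance(value, str) else list(value)
    names: list[str] = []
    for item in items:
        name = str(item).strip()
        # Skip empty fragments (for example from a trailing comma) and repeats.
        if name and name not in names:
            names.append(name)
    return names


print(normalize_disabled_tools("web_search, get_url_content"))
print(normalize_disabled_tools(["web_search", "get_url_content"]))
```

Both calls print the same two-element list, so a matrix can use whichever form reads better.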
Provider behavior:
- `source_provider="auto"`:
  - research mode -> `local_snapshot`
  - standard mode -> `live_web`
- `mock_web` is a placeholder for future stubbed-network mode (not yet implemented).
- Guardrail: generic URL/content tools reject local filesystem targets (`local://`, `file://`, direct paths). If you need local-source reads, use explicit local-source tooling/adapters.
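The `auto` rule described above amounts to a two-way switch on `research_mode`. The sketch below shows that mapping; `resolve_source_provider` is a hypothetical function name, and passing the other provider values through unchanged is an assumption.

```python
KNOWN_PROVIDERS = {"auto", "local_snapshot", "live_web", "mock_web"}


def resolve_source_provider(source_provider: str, research_mode: bool) -> str:
    """Map "auto" to a concrete provider; return explicit providers unchanged."""
    if source_provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown source_provider: {source_provider!r}")
    if source_provider == "auto":
        # Research runs read pinned snapshots; standard runs use the live web.
        return "local_snapshot" if research_mode else "live_web"
    return source_provider


print(resolve_source_provider("auto", research_mode=True))
print(resolve_source_provider("auto", research_mode=False))
```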
Matrix paths (dataset, snapshot_root, output_root) resolve as:
- `./...` or `../...`: relative to matrix file location.
- other relative paths (for example `temp/...`, `evals/...`): relative to current working directory.
- absolute paths: used directly.
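The three resolution rules can be expressed in a few lines of `pathlib`. This is a sketch of the rules as written, not the harness's own code: `resolve_matrix_path` is a hypothetical name, and the `/repo/...` paths in the usage lines are placeholders.

```python
import os
from pathlib import Path


def resolve_matrix_path(raw: str, matrix_file: Path, cwd: Path) -> Path:
    """Resolve a dataset/snapshot_root/output_root value from a matrix file."""
    path = Path(raw)
    if path.is_absolute():
        return path
    # Explicit ./ or ../ prefixes are anchored at the matrix file's directory.
    base = matrix_file.parent if raw.startswith(("./", "../")) else cwd
    # normpath collapses "." and ".." lexically, without touching the filesystem.
    return Path(os.path.normpath(base / path))


matrix = Path("/repo/evals/research_pipeline/matrices/default.toml")
cwd = Path("/repo")
print(resolve_matrix_path("../datasets/my_policy_eval_v1.yaml", matrix, cwd))
print(resolve_matrix_path("temp/research_eval/runs", matrix, cwd))
```

The practical consequence is that values such as `temp/research_eval/runs` depend on where the command is launched, while `../datasets/...` does not.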
Best for normative phrases you want exactly:
expected:
  type: contains
  text: "no more than 30 days"
Guidelines:
- Keep target substring short but specific.
- Avoid punctuation/capitalization-sensitive long phrases unless needed.
- If the model often paraphrases, prefer `regex`.
expected:
  type: regex
  pattern: "(?i)\\bhead\\b.*\\boptions\\b.*\\btrace\\b"
Guidelines:
- Use `(?i)` for case-insensitive matches.
- Use `(?s)` (or `(?is)`) when answers can span multiple lines.
- Prefer lookahead patterns for keyword/number presence checks when order is not important:
expected:
  type: regex
  pattern: "(?is)(?=.*\\bAAL1\\b)(?=.*\\b30\\s*days\\b).*"
- Prefer robust patterns over exact long sentence matching.
- Keep pattern focused on required facts.
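The difference between the ordered pattern and the lookahead pattern above is easy to check locally with Python's `re` module before committing an expectation. The sketch below is an assumption about how a checker might work: `check_expectation` is a hypothetical helper, `contains` is treated as a case-sensitive substring test, and `regex` is applied with `re.search`; the harness may differ on either point.

```python
import re


def check_expectation(expected: dict, answer: str) -> bool:
    """Evaluate a contains/regex expectation against an answer string."""
    kind = expected["type"]
    if kind == "contains":
        return expected["text"] in answer
    if kind == "regex":
        return re.search(expected["pattern"], answer) is not None
    raise ValueError(f"unsupported expectation type: {kind!r}")


# The YAML value "\\b" becomes a single backslash, hence the raw strings here.
ordered = {"type": "regex", "pattern": r"(?i)\bhead\b.*\boptions\b.*\btrace\b"}
unordered = {"type": "regex", "pattern": r"(?is)(?=.*\bAAL1\b)(?=.*\b30\s*days\b).*"}

print(check_expectation(ordered, "Safe methods: HEAD, OPTIONS and TRACE."))
print(check_expectation(ordered, "Safe methods: TRACE, OPTIONS and HEAD."))
print(check_expectation(unordered, "Reauthenticate every 30 days.\nThis applies at AAL1."))
```

The ordered pattern only accepts the first answer, while the lookahead pattern accepts its two facts in either order and across a line break.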
- Run matrix.
- Inspect failed cases in `results.jsonl`.
- Compare `expected` vs actual `answer` content.
- Adjust expectation strictness (shorter `contains` or better `regex`).
- Re-run only target profile with `--run`.
Per-run artifacts:
- `<output>/<run_id>/artifacts/results.jsonl`
- `<output>/<run_id>/artifacts/results.md`
- `<output>/<run_id>/artifacts/summary.json`
Session artifacts:
- `<output>/summary.json`
- `<output>/report.md`
Meaning of key summary fields:
- `passed_cases` / `failed_cases`: assertion outcomes.
- `error_cases`: execution errors (infra/runtime), not answer-quality misses.
- `halted_cases`: run-turn halted before normal answer completion.
- `avg_elapsed_ms`: average latency per case.
- `token_usage_totals`: role-based token totals:
  - `main`: primary model calls for `run_turn`.
  - `summarizer`: summarization model calls.
  - `audit_planner`: reserved for future pipeline stage (currently `0/0/0` unless implemented).
- `timing_totals_ms`: per-run timing totals including:
  - `run_wall_ms`, `case_total_ms`, `source_prepare_ms`, `client_init_ms`, `run_turn_ms`
  - `llm_total_ms`, `tool_total_ms`, `local_ingestion_ms`, `shortlist_ms`
- `timing_averages_ms`: per-case averages for timing metrics.
- `timing_counts`: call counts for `llm`, `tool`, `local_ingestion`, `shortlist`.
- `disabled_tools`: tools disabled for this run profile.
- `tool_call_counts`: aggregate counts by tool name.
- `tool_call_breakdown`: aggregate counts by tool name + argument payload.
Per-case results.jsonl rows also include token_usage with the same roles and
input_tokens / output_tokens / total_tokens.
Per-case rows also include timings_ms for detailed phase timing breakdown.
Per-case rows also include tool_calls (tool name + parsed arguments).
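The per-case `token_usage` field can be re-aggregated by hand, for example to cross-check `token_usage_totals` or to total a subset of cases. The sketch below assumes each row stores `token_usage` as a mapping from role name to a dict of the three counters; that exact layout is an assumption, and `sum_token_usage` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Iterable

COUNTERS = ("input_tokens", "output_tokens", "total_tokens")


def sum_token_usage(jsonl_lines: Iterable[str]) -> dict[str, dict[str, int]]:
    """Sum per-role token counters across results.jsonl rows."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: dict.fromkeys(COUNTERS, 0))
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        for role, usage in (row.get("token_usage") or {}).items():
            for counter in COUNTERS:
                totals[role][counter] += int(usage.get(counter, 0))
    return dict(totals)


# Two hand-written rows standing in for a real results.jsonl file.
sample = [
    '{"token_usage": {"main": {"input_tokens": 100, "output_tokens": 20, "total_tokens": 120}}}',
    '{"token_usage": {"main": {"input_tokens": 50, "output_tokens": 10, "total_tokens": 60}}}',
]
print(sum_token_usage(sample))
```

For a real run, pass an open file handle for `<output>/<run_id>/artifacts/results.jsonl` instead of the sample list.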
results.md is an automatic markdown conversion of results.jsonl with:
- case summary table (`PASS`/`FAIL`/`ERROR`/`HALTED`)
- fail-focused detail sections including query, expected assertion payload, answer text, and captured tool calls with arguments.
report.md now also includes:
- per-run tool call totals by tool type
- per-run case failure details (query/expected/answer/tool calls)
This gives a single top-level markdown file for most analysis, while keeping
results.md as per-run drill-down.
Cause: --run filter did not match any [[runs]].id (or run is commented out in matrix).
Check error field in results.jsonl.
If error_cases > 0, failures are runtime/integration errors, not model-quality misses.
artifacts/results.md usually gives the quickest failure triage view.
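When `error_cases` is non-zero, it helps to separate rows that carry an `error` value from the rest before looking at answer quality. The sketch below relies only on the `error` field mentioned above; `split_error_rows` is a hypothetical helper, and treating an empty or missing `error` as "no error" is an assumption.

```python
import json
from typing import Iterable


def split_error_rows(jsonl_lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Return (rows with a non-empty error field, all other rows)."""
    errored: list[dict] = []
    clean: list[dict] = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        (errored if row.get("error") else clean).append(row)
    return errored, clean


# Hand-written rows standing in for a real results.jsonl file.
sample_rows = [
    '{"answer": "at least every 90 days", "error": null}',
    '{"answer": "", "error": "connection refused"}',
]
errored, clean = split_error_rows(sample_rows)
print(len(errored), "errored,", len(clean), "clean")
```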
run now prints live progress for:
- run lifecycle (`run_start`, `run_end`)
- case lifecycle (`case_start`, `case_end`)
- external invocation transitions (`run_turn`, `llm`, `tool` start/end)
- preload/summarizer status messages when callbacks emit updates
This means long waits should still show heartbeat-style lines while each case executes.
run_end now includes run_wall_ms, and per-run summary lines include timing totals.
Each role reports input/output/total token counts:
- `main`: request/response tokens consumed by the selected run model alias.
- `summarizer`: tokens consumed by the configured summarization model.
- `audit_planner`: currently a placeholder counter for future planner/audit stages.
You can compare these fields across runs to evaluate quality/cost tradeoffs when
changing model aliases or [runs.parameters].
Key timing metrics:
- `source_prepare_ms`: source-provider query construction and query-affix assembly.
- `client_init_ms`: `AskyClient` creation overhead.
- `run_turn_ms`: total orchestration/model/tool duration for one case.
- `llm_total_ms`: sum of measured LLM request windows (`llm_start` -> `llm_end`).
- `tool_total_ms`: sum of measured tool dispatch windows (`tool_start` -> `tool_end`).
- `local_ingestion_ms`: timed local corpus preload execution.
- `shortlist_ms`: timed shortlist preload execution.
- `run_wall_ms`: total wall clock per run profile.
- `session_wall_ms` (session summary): total wall clock for the whole matrix execution.
report.md now includes:
- Summary table column `Disabled Tools`
- A dedicated `Tool Call Breakdown` section per run:
  - tool name
  - number of calls
  - normalized argument payload
Use this to verify how tool usage patterns changed between run profiles and how that affects pass/fail outcomes.
Run prepare first for datasets that use local_snapshot provider.
Runs now create unique timestamp dirs; if same-second collision occurs, suffixes are appended (_001, _002, ...).
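The collision handling described above can be pictured as "try the timestamp name, then append `_001`, `_002`, ... until directory creation succeeds". The sketch below is an illustration only: `create_run_dir` is a hypothetical helper and the `%Y%m%d_%H%M%S` timestamp format is an assumption, since the actual directory naming is not shown here.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def create_run_dir(output_root: Path, now: Optional[datetime] = None) -> Path:
    """Create and return a unique run directory under output_root."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")  # assumed format
    output_root.mkdir(parents=True, exist_ok=True)
    candidate = output_root / stamp
    suffix = 0
    while True:
        try:
            # mkdir without exist_ok fails if the name is taken, so the
            # existence check and the creation happen in one step.
            candidate.mkdir()
            return candidate
        except FileExistsError:
            suffix += 1
            candidate = output_root / f"{stamp}_{suffix:03d}"
```

Relying on `mkdir` raising `FileExistsError`, rather than checking `exists()` first, avoids a race between two runs started in the same second.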
- Runs are isolated with per-run DB/Chroma runtime dirs to avoid cross-run contamination.
- Model parameter sweeps are supported via `[runs.parameters]` and merged into `AskyConfig.model_parameters_override`.
- Non-research runs (`research_mode=false`) can still be evaluated against the same test dataset.