From 50db38ae3955e5771365089e9c7ad8e42f2d001a Mon Sep 17 00:00:00 2001 From: Algis Dumbris Date: Sun, 31 May 2026 17:07:22 +0300 Subject: [PATCH] =?UTF-8?q?feat(065):=20D1=20retrieval=20datasets=20?= =?UTF-8?q?=E2=80=94=20frozen=20corpus,=20golden=20set,=20baseline=20(A2)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84): - corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012). - corpus_v1.source.json: secret-free, reproducible mcpproxy source config. - retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation. - baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004). - README.md: documented, repeatable regeneration procedure (FR-012). Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003). 
Related #MCP-740 Co-Authored-By: Paperclip --- .../datasets/README.md | 45 ++ .../datasets/baseline_v1.json | 26 + .../datasets/corpus_v1.json | 279 ++++++++ .../datasets/corpus_v1.source.json | 335 ++++++++++ .../datasets/retrieval_golden_v1.json | 595 ++++++++++++++++++ 5 files changed, 1280 insertions(+) create mode 100644 specs/065-evaluation-foundation/datasets/README.md create mode 100644 specs/065-evaluation-foundation/datasets/baseline_v1.json create mode 100644 specs/065-evaluation-foundation/datasets/corpus_v1.json create mode 100644 specs/065-evaluation-foundation/datasets/corpus_v1.source.json create mode 100644 specs/065-evaluation-foundation/datasets/retrieval_golden_v1.json diff --git a/specs/065-evaluation-foundation/datasets/README.md b/specs/065-evaluation-foundation/datasets/README.md new file mode 100644 index 000000000..70cdeb92f --- /dev/null +++ b/specs/065-evaluation-foundation/datasets/README.md @@ -0,0 +1,45 @@ +# Spec 065 — D1 retrieval datasets + +Versioned, frozen evaluation artifacts for the tool-retrieval benchmark (CN-002, +FR-012). Generated by A1's harness (`~/repos/mcp-eval`, `mcp-eval datasets` / +`mcp-eval retrieval`). **Immutable once committed — a refresh is `*_v2.json`, never +an edit of `*_v1.json`.** + +| File | What it is | Committed? | +|------|------------|------------| +| `corpus_v1.source.json` | mcpproxy config of 7 no-auth reference servers used to freeze the corpus (secret-free, reproducible) | yes | +| `corpus_v1.json` | Frozen snapshot of 45 tools (`GET /api/v1/tools`) — the universe the eval scores against | yes | +| `retrieval_golden_v1.json` | 47 graded queries → tool(s), relevance 0\|1\|2, ≥8 hard-negatives (FR-001); R-C (queries never name the tool) | yes | +| `baseline_v1.json` | Reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance — the CI regression anchor (FR-009). 
`security` section filled by D2 (CN-004) | yes | +| score reports (`report.json` / `.html`) | Per-run output | **no** (CN-003 — stay local) | + +## Regenerate (documented + repeatable — FR-012) + +```bash +# 1. Boot a throwaway mcpproxy over the committed source config (fresh data-dir). +mcpproxy serve --config specs/065-evaluation-foundation/datasets/corpus_v1.source.json \ + --data-dir /tmp/mcpproxy-corpus-snapshot --listen 127.0.0.1:8092 +# (all 7 servers connect via npx/uvx, no tokens; quarantine disabled for a clean index) + +# 2. Freeze the corpus (only when intentionally cutting corpus_v2). +cd ~/repos/mcp-eval && PYTHONPATH=src uv run python -m mcp_eval.cli datasets snapshot \ + --out /specs/065-evaluation-foundation/datasets/corpus_v1.json \ + --base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot + +# 3. Validate the golden set (schema + INV-1: every tool_id ∈ corpus). +PYTHONPATH=src uv run python -m mcp_eval.cli datasets validate \ + --corpus .../corpus_v1.json --golden .../retrieval_golden_v1.json + +# 4. Score + gate against the baseline (deterministic; gate = Recall@5 ≥ baseline−0.05). +PYTHONPATH=src uv run python -m mcp_eval.cli retrieval \ + --corpus .../corpus_v1.json --golden .../retrieval_golden_v1.json \ + --baseline .../baseline_v1.json --tolerance 0.05 \ + --base-url http://127.0.0.1:8092 --api-key eval-corpus-snapshot +``` + +The golden set was seeded by intent and **hand-curated** for graded relevance and +cross-server hard-negatives (e.g. `filesystem:search_files` vs `memory:search_nodes`; +`sqlite:read_query` vs `filesystem:read_text_file`; `fetch:fetch` vs +`filesystem:read_text_file`), then validated. Invariants: **INV-1** (no dangling +labels), **INV-2** (removing a labeled tool drives that query's Recall→0 — proven by +the harness scorer tests). 
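The INV-1 check and the Recall@5 gate described above can be sketched in a few lines of Python. This is a minimal illustration, not the mcp-eval implementation: the function names (`dangling_labels`, `recall_at_k`, `gate_passes`) are hypothetical, and treating every label with relevance above 0 as "relevant" is an assumption about how the scorer binarizes graded labels. Only the JSON field names (`tools`, `tool_id`, `queries`, `labels`, `relevance`, `recall_at`, `tolerance`) are taken from the committed files.

```python
from typing import List, Set


def dangling_labels(corpus: dict, golden: dict) -> List[str]:
    """INV-1 sketch: return labeled tool_ids that are missing from the corpus."""
    known = {tool["tool_id"] for tool in corpus["tools"]}
    missing = {
        label["tool_id"]
        for query in golden["queries"]
        for label in query["labels"]
        if label["tool_id"] not in known
    }
    return sorted(missing)


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant tools that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def gate_passes(fresh_recall_at_5: float, baseline: dict) -> bool:
    """Gate sketch: pass when fresh Recall@5 >= baseline Recall@5 - tolerance."""
    floor = baseline["recall_at"]["5"] - baseline["tolerance"]["recall_at_5"]
    return fresh_recall_at_5 >= floor


# Tiny excerpt shaped like corpus_v1.json / retrieval_golden_v1.json.
corpus = {"tools": [{"tool_id": "filesystem:search_files"},
                    {"tool_id": "memory:search_nodes"}]}
golden = {"queries": [{
    "id": "q-fs-search",
    "labels": [{"tool_id": "filesystem:search_files", "relevance": 2},
               {"tool_id": "memory:search_nodes", "relevance": 0}],
}]}
baseline = {"recall_at": {"5": 0.6808510638297872},
            "tolerance": {"recall_at_5": 0.05}}

# Assumption: relevance 0 marks a hard-negative, so it is excluded here.
relevant = {label["tool_id"]
            for label in golden["queries"][0]["labels"]
            if label["relevance"] > 0}
# Hypothetical ranking where the hard-negative outranks the correct tool.
ranked = ["memory:search_nodes", "filesystem:search_files"]

print(dangling_labels(corpus, golden))   # no dangling labels in this excerpt
print(recall_at_k(ranked, relevant, 1))  # hard-negative took rank 1
print(gate_passes(0.681, baseline))      # the verified run from the commit message
```

In this excerpt the hard-negative at rank 1 costs the query its Recall@1 while Recall@5 is unaffected, which is the kind of confusion the cross-server hard-negatives are meant to expose. Removing `filesystem:search_files` from `ranked` would drive that query's recall to 0, mirroring INV-2.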
diff --git a/specs/065-evaluation-foundation/datasets/baseline_v1.json b/specs/065-evaluation-foundation/datasets/baseline_v1.json new file mode 100644 index 000000000..d7847ea48 --- /dev/null +++ b/specs/065-evaluation-foundation/datasets/baseline_v1.json @@ -0,0 +1,26 @@ +{ + "__doc__": "Spec 065 D1+D2 regression baseline. Retrieval metrics are TOP-LEVEL because mcp-eval RetrievalScorer reads baseline[\"recall_at\"][\"5\"] etc. directly (quickstart \u00a74: --baseline datasets/baseline_v1.json). The CI gate (FR-009/MCP-742) fails if a fresh Recall@5 < baseline.recall_at[5] - tolerance.recall_at_5. The \"security\" section is appended by the D2 issue (CN-004); it is intentionally empty here.", + "corpus_version": "corpus_v1", + "golden_version": "retrieval_golden_v1", + "generated_from": { + "harness": "mcp-eval retrieval @ cb37f84", + "source_config": "datasets/corpus_v1.source.json", + "mcpproxy": "BM25 index over corpus_v1 (45 tools, 7 no-auth reference servers)", + "runs": 1, + "note": "Reference = current BM25 behavior, a regression anchor (NOT a quality target). Refresh requires re-freezing corpus + re-review." 
+ }, + "recall_at": { + "1": 0.4184397163120567, + "3": 0.5602836879432624, + "5": 0.6808510638297872, + "10": 0.7907801418439717 + }, + "mrr": 0.5684903748733535, + "ndcg_at_10": 0.6094872517781414, + "map": 0.5435916919959473, + "tolerance": { + "recall_at_5": 0.05 + }, + "runs_averaged": 1, + "security": {} +} diff --git a/specs/065-evaluation-foundation/datasets/corpus_v1.json b/specs/065-evaluation-foundation/datasets/corpus_v1.json new file mode 100644 index 000000000..98216d9ae --- /dev/null +++ b/specs/065-evaluation-foundation/datasets/corpus_v1.json @@ -0,0 +1,279 @@ +{ + "generated_from": { + "note": "Spec 065 D1 frozen corpus: 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking) via corpus_v1.source.json (MCP-740/A2).", + "source": "GET /api/v1/tools" + }, + "tools": [ + { + "description": "Fetches a URL from the internet and optionally extracts its contents as markdown.\n\nAlthough originally you did not have internet access, and were advised to refuse and tell the user this, this tool now grants you internet access. Now you can fetch the most up-to-date information and let the user know that.", + "server": "fetch", + "tool": "fetch", + "tool_id": "fetch:fetch" + }, + { + "description": "Create a new directory or ensure a directory exists. Can create multiple nested directories in one operation. If the directory already exists, this operation will succeed silently. Perfect for setting up directory structures for projects or ensuring required paths exist. Only works within allowed directories.", + "server": "filesystem", + "tool": "create_directory", + "tool_id": "filesystem:create_directory" + }, + { + "description": "Get a recursive tree view of files and directories as a JSON structure. Each entry includes 'name', 'type' (file/directory), and 'children' for directories. Files have no children array, while directories always have a children array (which may be empty). 
The output is formatted with 2-space indentation for readability. Only works within allowed directories.", + "server": "filesystem", + "tool": "directory_tree", + "tool_id": "filesystem:directory_tree" + }, + { + "description": "Make line-based edits to a text file. Each edit replaces exact line sequences with new content. Returns a git-style diff showing the changes made. Only works within allowed directories.", + "server": "filesystem", + "tool": "edit_file", + "tool_id": "filesystem:edit_file" + }, + { + "description": "Retrieve detailed metadata about a file or directory. Returns comprehensive information including size, creation time, last modified time, permissions, and type. This tool is perfect for understanding file characteristics without reading the actual content. Only works within allowed directories.", + "server": "filesystem", + "tool": "get_file_info", + "tool_id": "filesystem:get_file_info" + }, + { + "description": "Returns the list of directories that this server is allowed to access. Subdirectories within these allowed directories are also accessible. Use this to understand which directories and their nested paths are available before trying to access files.", + "server": "filesystem", + "tool": "list_allowed_directories", + "tool_id": "filesystem:list_allowed_directories" + }, + { + "description": "Get a detailed listing of all files and directories in a specified path. Results clearly distinguish between files and directories with [FILE] and [DIR] prefixes. This tool is essential for understanding directory structure and finding specific files within a directory. Only works within allowed directories.", + "server": "filesystem", + "tool": "list_directory", + "tool_id": "filesystem:list_directory" + }, + { + "description": "Get a detailed listing of all files and directories in a specified path, including sizes. Results clearly distinguish between files and directories with [FILE] and [DIR] prefixes. 
This tool is useful for understanding directory structure and finding specific files within a directory. Only works within allowed directories.", + "server": "filesystem", + "tool": "list_directory_with_sizes", + "tool_id": "filesystem:list_directory_with_sizes" + }, + { + "description": "Move or rename files and directories. Can move files between directories and rename them in a single operation. If the destination exists, the operation will fail. Works across different directories and can be used for simple renaming within the same directory. Both source and destination must be within allowed directories.", + "server": "filesystem", + "tool": "move_file", + "tool_id": "filesystem:move_file" + }, + { + "description": "Read the complete contents of a file as text. DEPRECATED: Use read_text_file instead.", + "server": "filesystem", + "tool": "read_file", + "tool_id": "filesystem:read_file" + }, + { + "description": "Read an image or audio file. Returns the base64 encoded data and MIME type. Only works within allowed directories.", + "server": "filesystem", + "tool": "read_media_file", + "tool_id": "filesystem:read_media_file" + }, + { + "description": "Read the contents of multiple files simultaneously. This is more efficient than reading files one by one when you need to analyze or compare multiple files. Each file's content is returned with its path as a reference. Failed reads for individual files won't stop the entire operation. Only works within allowed directories.", + "server": "filesystem", + "tool": "read_multiple_files", + "tool_id": "filesystem:read_multiple_files" + }, + { + "description": "Read the complete contents of a file from the file system as text. Handles various text encodings and provides detailed error messages if the file cannot be read. Use this tool when you need to examine the contents of a single file. Use the 'head' parameter to read only the first N lines of a file, or the 'tail' parameter to read only the last N lines of a file. 
Operates on the file as text regardless of extension. Only works within allowed directories.", + "server": "filesystem", + "tool": "read_text_file", + "tool_id": "filesystem:read_text_file" + }, + { + "description": "Recursively search for files and directories matching a pattern. The patterns should be glob-style patterns that match paths relative to the working directory. Use pattern like '*.ext' to match files in current directory, and '**/*.ext' to match files in all subdirectories. Returns full paths to all matching items. Great for finding files when you don't know their exact location. Only searches within allowed directories.", + "server": "filesystem", + "tool": "search_files", + "tool_id": "filesystem:search_files" + }, + { + "description": "Create a new file or completely overwrite an existing file with new content. Use with caution as it will overwrite existing files without warning. Handles text content with proper encoding. Only works within allowed directories.", + "server": "filesystem", + "tool": "write_file", + "tool_id": "filesystem:write_file" + }, + { + "description": "Adds file contents to the staging area", + "server": "git", + "tool": "git_add", + "tool_id": "git:git_add" + }, + { + "description": "List Git branches", + "server": "git", + "tool": "git_branch", + "tool_id": "git:git_branch" + }, + { + "description": "Switches branches", + "server": "git", + "tool": "git_checkout", + "tool_id": "git:git_checkout" + }, + { + "description": "Records changes to the repository", + "server": "git", + "tool": "git_commit", + "tool_id": "git:git_commit" + }, + { + "description": "Creates a new branch from an optional base branch", + "server": "git", + "tool": "git_create_branch", + "tool_id": "git:git_create_branch" + }, + { + "description": "Shows differences between branches or commits", + "server": "git", + "tool": "git_diff", + "tool_id": "git:git_diff" + }, + { + "description": "Shows changes that are staged for commit", + "server": "git", + 
"tool": "git_diff_staged", + "tool_id": "git:git_diff_staged" + }, + { + "description": "Shows changes in the working directory that are not yet staged", + "server": "git", + "tool": "git_diff_unstaged", + "tool_id": "git:git_diff_unstaged" + }, + { + "description": "Shows the commit logs", + "server": "git", + "tool": "git_log", + "tool_id": "git:git_log" + }, + { + "description": "Unstages all staged changes", + "server": "git", + "tool": "git_reset", + "tool_id": "git:git_reset" + }, + { + "description": "Shows the contents of a commit", + "server": "git", + "tool": "git_show", + "tool_id": "git:git_show" + }, + { + "description": "Shows the working tree status", + "server": "git", + "tool": "git_status", + "tool_id": "git:git_status" + }, + { + "description": "Add new observations to existing entities in the knowledge graph", + "server": "memory", + "tool": "add_observations", + "tool_id": "memory:add_observations" + }, + { + "description": "Create multiple new entities in the knowledge graph", + "server": "memory", + "tool": "create_entities", + "tool_id": "memory:create_entities" + }, + { + "description": "Create multiple new relations between entities in the knowledge graph. 
Relations should be in active voice", + "server": "memory", + "tool": "create_relations", + "tool_id": "memory:create_relations" + }, + { + "description": "Delete multiple entities and their associated relations from the knowledge graph", + "server": "memory", + "tool": "delete_entities", + "tool_id": "memory:delete_entities" + }, + { + "description": "Delete specific observations from entities in the knowledge graph", + "server": "memory", + "tool": "delete_observations", + "tool_id": "memory:delete_observations" + }, + { + "description": "Delete multiple relations from the knowledge graph", + "server": "memory", + "tool": "delete_relations", + "tool_id": "memory:delete_relations" + }, + { + "description": "Open specific nodes in the knowledge graph by their names", + "server": "memory", + "tool": "open_nodes", + "tool_id": "memory:open_nodes" + }, + { + "description": "Read the entire knowledge graph", + "server": "memory", + "tool": "read_graph", + "tool_id": "memory:read_graph" + }, + { + "description": "Search for nodes in the knowledge graph based on a query", + "server": "memory", + "tool": "search_nodes", + "tool_id": "memory:search_nodes" + }, + { + "description": "A detailed tool for dynamic and reflective problem-solving through thoughts.\nThis tool helps analyze problems through a flexible thinking process that can adapt and evolve.\nEach thought can build on, question, or revise previous insights as understanding deepens.\n\nWhen to use this tool:\n- Breaking down complex problems into steps\n- Planning and design with room for revision\n- Analysis that might need course correction\n- Problems where the full scope might not be clear initially\n- Problems that require a multi-step solution\n- Tasks that need to maintain context over multiple steps\n- Situations where irrelevant information needs to be filtered out\n\nKey features:\n- You can adjust total_thoughts up or down as you progress\n- You can question or revise previous thoughts\n- You can add 
more thoughts even after reaching what seemed like the end\n- You can express uncertainty and explore alternative approaches\n- Not every thought needs to build linearly - you can branch or backtrack\n- Generates a solution hypothesis\n- Verifies the hypothesis based on the Chain of Thought steps\n- Repeats the process until satisfied\n- Provides a correct answer\n\nParameters explained:\n- thought: Your current thinking step, which can include:\n * Regular analytical steps\n * Revisions of previous thoughts\n * Questions about previous decisions\n * Realizations about needing more analysis\n * Changes in approach\n * Hypothesis generation\n * Hypothesis verification\n- nextThoughtNeeded: True if you need more thinking, even if at what seemed like the end\n- thoughtNumber: Current number in sequence (can go beyond initial total if needed)\n- totalThoughts: Current estimate of thoughts needed (can be adjusted up/down)\n- isRevision: A boolean indicating if this thought revises previous thinking\n- revisesThought: If is_revision is true, which thought number is being reconsidered\n- branchFromThought: If branching, which thought number is the branching point\n- branchId: Identifier for the current branch (if any)\n- needsMoreThoughts: If reaching end but realizing more thoughts needed\n\nYou should:\n1. Start with an initial estimate of needed thoughts, but be ready to adjust\n2. Feel free to question or revise previous thoughts\n3. Don't hesitate to add more thoughts if needed, even at the \"end\"\n4. Express uncertainty when present\n5. Mark thoughts that revise previous thinking or branch into new paths\n6. Ignore information that is irrelevant to the current step\n7. Generate a solution hypothesis when appropriate\n8. Verify the hypothesis based on the Chain of Thought steps\n9. Repeat the process until satisfied with the solution\n10. Provide a single, ideally correct answer as the final output\n11. 
Only set nextThoughtNeeded to false when truly done and a satisfactory answer is reached", + "server": "sequential-thinking", + "tool": "sequentialthinking", + "tool_id": "sequential-thinking:sequentialthinking" + }, + { + "description": "Add a business insight to the memo", + "server": "sqlite", + "tool": "append_insight", + "tool_id": "sqlite:append_insight" + }, + { + "description": "Create a new table in the SQLite database", + "server": "sqlite", + "tool": "create_table", + "tool_id": "sqlite:create_table" + }, + { + "description": "Get the schema information for a specific table", + "server": "sqlite", + "tool": "describe_table", + "tool_id": "sqlite:describe_table" + }, + { + "description": "List all tables in the SQLite database", + "server": "sqlite", + "tool": "list_tables", + "tool_id": "sqlite:list_tables" + }, + { + "description": "Execute a SELECT query on the SQLite database", + "server": "sqlite", + "tool": "read_query", + "tool_id": "sqlite:read_query" + }, + { + "description": "Execute an INSERT, UPDATE, or DELETE query on the SQLite database", + "server": "sqlite", + "tool": "write_query", + "tool_id": "sqlite:write_query" + }, + { + "description": "Convert time between timezones", + "server": "time", + "tool": "convert_time", + "tool_id": "time:convert_time" + }, + { + "description": "Get current time in a specific timezones", + "server": "time", + "tool": "get_current_time", + "tool_id": "time:get_current_time" + } + ], + "version": "corpus_v1" +} diff --git a/specs/065-evaluation-foundation/datasets/corpus_v1.source.json b/specs/065-evaluation-foundation/datasets/corpus_v1.source.json new file mode 100644 index 000000000..4f0a91c78 --- /dev/null +++ b/specs/065-evaluation-foundation/datasets/corpus_v1.source.json @@ -0,0 +1,335 @@ +{ + "listen": "127.0.0.1:8092", + "enable_socket": false, + "data_dir": "/tmp/mcp740/data", + "debug_search": false, + "mcpServers": [ + { + "name": "filesystem", + "protocol": "stdio", + "command": "npx", + "args": 
[ + "-y", + "@modelcontextprotocol/server-filesystem", + "/tmp" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.376369+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "memory", + "protocol": "stdio", + "command": "npx", + "args": [ + "-y", + "@modelcontextprotocol/server-memory" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.376369+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "sequential-thinking", + "protocol": "stdio", + "command": "npx", + "args": [ + "-y", + "@modelcontextprotocol/server-sequential-thinking" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.376369+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "git", + "protocol": "stdio", + "command": "uvx", + "args": [ + "mcp-server-git" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.37637+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "fetch", + "protocol": "stdio", + "command": "uvx", + "args": [ + "mcp-server-fetch" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.37637+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "time", + "protocol": "stdio", + "command": "uvx", + "args": [ + "mcp-server-time" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.37637+03:00", + "updated": "0001-01-01T00:00:00Z" + }, + { + "name": "sqlite", + "protocol": "stdio", + "command": "uvx", + "args": [ + "mcp-server-sqlite", + "--db-path", + "/tmp/mcpproxy-corpus-snapshot/snapshot.db" + ], + "oauth": null, + "enabled": true, + "quarantined": false, + "created": "2026-05-31T17:02:24.37637+03:00", + "updated": "0001-01-01T00:00:00Z" + } + ], + "tools_limit": 15, + "tool_response_limit": 20000, + "call_tool_timeout": "2m0s", + "max_result_size_chars": 
500000, + "environment": { + "inherit_system_safe": true, + "allowed_system_vars": [ + "PATH", + "HOME", + "TMPDIR", + "TEMP", + "TMP", + "SHELL", + "TERM", + "LANG", + "USER", + "USERNAME", + "XDG_CONFIG_HOME", + "XDG_DATA_HOME", + "XDG_CACHE_HOME", + "XDG_RUNTIME_DIR", + "LC_ALL", + "LC_CTYPE", + "LC_NUMERIC", + "LC_TIME", + "LC_COLLATE", + "LC_MONETARY", + "LC_MESSAGES", + "LC_PAPER", + "LC_NAME", + "LC_ADDRESS", + "LC_TELEPHONE", + "LC_MEASUREMENT", + "LC_IDENTIFICATION" + ], + "custom_vars": {}, + "enhance_path": false + }, + "logging": { + "level": "info", + "enable_file": true, + "enable_console": true, + "filename": "main.log", + "max_size": 10, + "max_backups": 5, + "max_age": 30, + "compress": true, + "json_format": false + }, + "api_key": "eval-corpus-snapshot", + "require_mcp_auth": false, + "read_only_mode": false, + "disable_management": false, + "allow_server_add": true, + "allow_server_remove": true, + "enable_prompts": true, + "check_server_repo": true, + "docker_isolation": { + "enabled": false, + "enable_cache_volume": true, + "default_images": { + "bash": "alpine:3.18", + "binary": "alpine:3.18", + "cargo": "rust:1.75-slim", + "composer": "php:8.2-cli-alpine", + "gem": "ruby:3.2-alpine", + "go": "golang:1.21-alpine", + "node": "node:22", + "npm": "node:22", + "npx": "node:22", + "php": "php:8.2-cli-alpine", + "pip": "ghcr.io/astral-sh/uv:python3.13-bookworm-slim", + "pipx": "ghcr.io/astral-sh/uv:python3.13-bookworm-slim", + "python": "ghcr.io/astral-sh/uv:python3.13-bookworm-slim", + "python3": "ghcr.io/astral-sh/uv:python3.13-bookworm-slim", + "ruby": "ruby:3.2-alpine", + "rustc": "rust:1.75-slim", + "sh": "alpine:3.18", + "uvx": "ghcr.io/astral-sh/uv:python3.13-bookworm-slim", + "yarn": "node:22" + }, + "registry": "docker.io", + "network_mode": "bridge", + "memory_limit": "512m", + "cpu_limit": "1.0", + "timeout": "30s", + "log_max_size": "100m", + "log_max_files": "3" + }, + "registries": [ + { + "id": "pulse", + "name": "Pulse MCP", + 
"description": "Browse and discover MCP use-cases, servers, clients, and news", + "url": "https://www.pulsemcp.com/", + "servers_url": "https://api.pulsemcp.com/v0beta/servers", + "tags": [ + "verified" + ], + "protocol": "custom/pulse" + }, + { + "id": "docker-mcp-catalog", + "name": "Docker MCP Catalog", + "description": "A collection of secure, high-quality MCP servers as docker images", + "url": "https://hub.docker.com/catalogs/mcp", + "servers_url": "https://hub.docker.com/v2/repositories/mcp/", + "tags": [ + "verified" + ], + "protocol": "custom/docker" + }, + { + "id": "fleur", + "name": "Fleur", + "description": "Fleur is the app store for Claude", + "url": "https://www.fleurmcp.com/", + "servers_url": "https://raw.githubusercontent.com/fleuristes/app-registry/refs/heads/main/apps.json", + "tags": [ + "verified" + ], + "protocol": "custom/fleur" + }, + { + "id": "azure-mcp-demo", + "name": "Azure MCP Registry Demo", + "description": "A reference implementation of MCP registry using Azure API Center", + "url": "https://demo.registry.azure-mcp.net/", + "servers_url": "https://demo.registry.azure-mcp.net/v0/servers", + "tags": [ + "verified", + "demo", + "azure", + "reference" + ], + "protocol": "mcp/v0" + }, + { + "id": "remote-mcp-servers", + "name": "Remote MCP Servers", + "description": "Community-maintained list of remote Model Context Protocol servers", + "url": "https://remote-mcp-servers.com/", + "servers_url": "https://remote-mcp-servers.com/api/servers", + "tags": [ + "verified", + "community", + "remote" + ], + "protocol": "custom/remote" + } + ], + "features": { + "enable_runtime": true, + "enable_event_bus": true, + "enable_sse": true, + "enable_observability": true, + "enable_health_checks": true, + "enable_metrics": true, + "enable_tracing": false, + "enable_oauth": true, + "enable_quarantine": true, + "enable_docker_isolation": false, + "enable_search": true, + "enable_caching": true, + "enable_async_storage": true, + "enable_web_ui": true, + 
"enable_tray": true, + "enable_debug_logging": false, + "enable_contract_tests": false + }, + "tls": { + "enabled": false, + "require_client_cert": false, + "hsts": true + }, + "tokenizer": { + "enabled": true, + "default_model": "gpt-4", + "encoding": "cl100k_base" + }, + "enable_code_execution": false, + "code_execution_timeout_ms": 120000, + "code_execution_pool_size": 10, + "activity_retention_days": 90, + "activity_max_records": 100000, + "activity_max_response_size": 65536, + "activity_cleanup_interval_min": 60, + "intent_declaration": { + "strict_server_validation": true + }, + "sensitive_data_detection": { + "enabled": true, + "scan_requests": true, + "scan_responses": true, + "max_payload_size_kb": 1024, + "entropy_threshold": 4.5, + "categories": { + "api_token": true, + "auth_token": true, + "cloud_credentials": true, + "credit_card": true, + "database_credential": true, + "high_entropy": true, + "private_key": true, + "sensitive_file": true + } + }, + "output_validation": { + "mode": "warn", + "max_bytes": 5242880, + "max_depth": 64, + "missing_structured_content": "allow" + }, + "output_sanitisation": { + "response_action": "spotlight", + "strip_classes": [ + "ansi", + "c0c1", + "bidi", + "zero_width" + ], + "max_redactions": 100 + }, + "telemetry": { + "enabled": false, + "last_startup_outcome": "success", + "notice_shown": true + }, + "routing_mode": "retrieve_tools", + "quarantine_enabled": false +} diff --git a/specs/065-evaluation-foundation/datasets/retrieval_golden_v1.json b/specs/065-evaluation-foundation/datasets/retrieval_golden_v1.json new file mode 100644 index 000000000..d5f45dfd2 --- /dev/null +++ b/specs/065-evaluation-foundation/datasets/retrieval_golden_v1.json @@ -0,0 +1,595 @@ +{ + "corpus_version": "corpus_v1", + "queries": [ + { + "id": "q-fs-read", + "query": "Show me the full text content stored inside a file on disk", + "labels": [ + { + "tool_id": "filesystem:read_text_file", + "relevance": 2 + }, + { + "tool_id": 
"filesystem:read_file", + "relevance": 1 + }, + { + "tool_id": "filesystem:read_multiple_files", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-read-many", + "query": "Open several documents at once and return all of their contents together", + "labels": [ + { + "tool_id": "filesystem:read_multiple_files", + "relevance": 2 + }, + { + "tool_id": "filesystem:read_text_file", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-write", + "query": "Save this text as a brand-new file, replacing whatever was there before", + "labels": [ + { + "tool_id": "filesystem:write_file", + "relevance": 2 + } + ] + }, + { + "id": "q-fs-edit", + "query": "Change a handful of specific lines inside an existing text document", + "labels": [ + { + "tool_id": "filesystem:edit_file", + "relevance": 2 + }, + { + "tool_id": "filesystem:write_file", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-list", + "query": "Give me a listing of everything contained in a particular folder", + "labels": [ + { + "tool_id": "filesystem:list_directory", + "relevance": 2 + }, + { + "tool_id": "filesystem:list_directory_with_sizes", + "relevance": 1 + }, + { + "tool_id": "filesystem:directory_tree", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-list-sizes", + "query": "List a folder's contents and tell me how large each item is", + "labels": [ + { + "tool_id": "filesystem:list_directory_with_sizes", + "relevance": 2 + }, + { + "tool_id": "filesystem:list_directory", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-tree", + "query": "Produce a nested map of a directory and all of its subfolders", + "labels": [ + { + "tool_id": "filesystem:directory_tree", + "relevance": 2 + }, + { + "tool_id": "filesystem:list_directory", + "relevance": 1 + } + ] + }, + { + "id": "q-fs-search", + "query": "Locate files anywhere beneath a folder whose names match a wildcard pattern", + "labels": [ + { + "tool_id": "filesystem:search_files", + "relevance": 2 + }, + { + "tool_id": "memory:search_nodes", + "relevance": 0 + } + ], + 
"notes": "hard-negative: near-dup of memory:search_nodes (both 'search', but this is a filename lookup on disk, not a graph query)" + }, + { + "id": "q-fs-mkdir", + "query": "Make sure a nested folder path exists, creating any missing parents", + "labels": [ + { + "tool_id": "filesystem:create_directory", + "relevance": 2 + } + ] + }, + { + "id": "q-fs-move", + "query": "Rename a file or relocate it into a different directory", + "labels": [ + { + "tool_id": "filesystem:move_file", + "relevance": 2 + } + ] + }, + { + "id": "q-fs-info", + "query": "Get detailed metadata such as size and timestamps for one file", + "labels": [ + { + "tool_id": "filesystem:get_file_info", + "relevance": 2 + } + ] + }, + { + "id": "q-fs-media", + "query": "Load an image and return its base64 data along with the mime type", + "labels": [ + { + "tool_id": "filesystem:read_media_file", + "relevance": 2 + } + ] + }, + { + "id": "q-fs-allowed", + "query": "Which directories is this sandbox actually permitted to touch?", + "labels": [ + { + "tool_id": "filesystem:list_allowed_directories", + "relevance": 2 + } + ] + }, + { + "id": "q-git-stage", + "query": "Put my modified files into the staging area for the next commit", + "labels": [ + { + "tool_id": "git:git_add", + "relevance": 2 + } + ] + }, + { + "id": "q-git-commit", + "query": "Permanently record my staged changes into the project's revision history", + "labels": [ + { + "tool_id": "git:git_commit", + "relevance": 2 + }, + { + "tool_id": "memory:add_observations", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of memory:add_observations (both 'record/add', but this writes a VCS commit, not graph facts)" + }, + { + "id": "q-git-status", + "query": "Tell me which files are currently modified, staged, or untracked", + "labels": [ + { + "tool_id": "git:git_status", + "relevance": 2 + } + ] + }, + { + "id": "q-git-log", + "query": "Show the history of past commits in this repository", + "labels": [ + { + "tool_id": 
"git:git_log", + "relevance": 2 + } + ] + }, + { + "id": "q-git-diff-unstaged", + "query": "What edits have I made in my working copy that aren't staged yet?", + "labels": [ + { + "tool_id": "git:git_diff_unstaged", + "relevance": 2 + }, + { + "tool_id": "git:git_diff", + "relevance": 1 + }, + { + "tool_id": "git:git_status", + "relevance": 1 + } + ] + }, + { + "id": "q-git-diff-staged", + "query": "Show only the changes I've already added to the index", + "labels": [ + { + "tool_id": "git:git_diff_staged", + "relevance": 2 + }, + { + "tool_id": "git:git_diff", + "relevance": 1 + } + ] + }, + { + "id": "q-git-diff", + "query": "Compare two branches and show how they differ", + "labels": [ + { + "tool_id": "git:git_diff", + "relevance": 2 + } + ] + }, + { + "id": "q-git-branch-list", + "query": "List all of the branches that exist in this repository", + "labels": [ + { + "tool_id": "git:git_branch", + "relevance": 2 + } + ] + }, + { + "id": "q-git-create-branch", + "query": "Start a new branch based off the current one", + "labels": [ + { + "tool_id": "git:git_create_branch", + "relevance": 2 + } + ] + }, + { + "id": "q-git-checkout", + "query": "Switch my working copy over to a different branch", + "labels": [ + { + "tool_id": "git:git_checkout", + "relevance": 2 + } + ] + }, + { + "id": "q-git-show", + "query": "Display the full contents and changes of one specific commit", + "labels": [ + { + "tool_id": "git:git_show", + "relevance": 2 + } + ] + }, + { + "id": "q-git-reset", + "query": "Unstage everything I've added, putting it back into the working directory", + "labels": [ + { + "tool_id": "git:git_reset", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-create-entity", + "query": "Add brand-new nodes to my knowledge graph", + "labels": [ + { + "tool_id": "memory:create_entities", + "relevance": 2 + }, + { + "tool_id": "memory:create_relations", + "relevance": 1 + } + ] + }, + { + "id": "q-mem-create-rel", + "query": "Connect two existing nodes together with a 
labeled relationship", + "labels": [ + { + "tool_id": "memory:create_relations", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-add-obs", + "query": "Attach a few additional facts to a node I already stored", + "labels": [ + { + "tool_id": "memory:add_observations", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-search", + "query": "Find stored notes in my knowledge base that mention a particular topic", + "labels": [ + { + "tool_id": "memory:search_nodes", + "relevance": 2 + }, + { + "tool_id": "filesystem:search_files", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of filesystem:search_files (both 'search', but this queries the graph, not the filesystem)" + }, + { + "id": "q-mem-read-graph", + "query": "Dump out my entire knowledge graph at once", + "labels": [ + { + "tool_id": "memory:read_graph", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-open", + "query": "Retrieve a couple of specific nodes by their exact names", + "labels": [ + { + "tool_id": "memory:open_nodes", + "relevance": 2 + }, + { + "tool_id": "memory:search_nodes", + "relevance": 1 + } + ] + }, + { + "id": "q-mem-del-entity", + "query": "Remove some nodes and every link attached to them", + "labels": [ + { + "tool_id": "memory:delete_entities", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-del-rel", + "query": "Delete particular relationships between stored nodes", + "labels": [ + { + "tool_id": "memory:delete_relations", + "relevance": 2 + } + ] + }, + { + "id": "q-mem-del-obs", + "query": "Strip specific facts off a node without deleting the node itself", + "labels": [ + { + "tool_id": "memory:delete_observations", + "relevance": 2 + } + ] + }, + { + "id": "q-sql-select", + "query": "Run a query to pull matching rows out of my local relational database", + "labels": [ + { + "tool_id": "sqlite:read_query", + "relevance": 2 + }, + { + "tool_id": "filesystem:read_text_file", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of filesystem:read_text_file (both 
'read', but this is a SQL SELECT, not reading a file)" + }, + { + "id": "q-sql-write", + "query": "Insert, update, or delete records in the database", + "labels": [ + { + "tool_id": "sqlite:write_query", + "relevance": 2 + }, + { + "tool_id": "filesystem:write_file", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of filesystem:write_file (both 'write', but this mutates DB rows, not a file)" + }, + { + "id": "q-sql-create-table", + "query": "Define a new table with its columns in the database", + "labels": [ + { + "tool_id": "sqlite:create_table", + "relevance": 2 + } + ] + }, + { + "id": "q-sql-list-tables", + "query": "Show me every table that exists in this database", + "labels": [ + { + "tool_id": "sqlite:list_tables", + "relevance": 2 + } + ] + }, + { + "id": "q-sql-describe", + "query": "What columns and types does a given table have?", + "labels": [ + { + "tool_id": "sqlite:describe_table", + "relevance": 2 + } + ] + }, + { + "id": "q-sql-insight", + "query": "Record a business takeaway into the running analysis memo", + "labels": [ + { + "tool_id": "sqlite:append_insight", + "relevance": 2 + }, + { + "tool_id": "memory:add_observations", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of memory:add_observations (both append a fact, but this targets the SQLite insight memo)" + }, + { + "id": "q-fetch", + "query": "Download a web page from a URL and give me its content as markdown", + "labels": [ + { + "tool_id": "fetch:fetch", + "relevance": 2 + }, + { + "tool_id": "filesystem:read_text_file", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of filesystem:read_text_file (both return text, but this retrieves a remote URL, not a local file)" + }, + { + "id": "q-time-now", + "query": "What is the current wall-clock time in a given city's timezone?", + "labels": [ + { + "tool_id": "time:get_current_time", + "relevance": 2 + }, + { + "tool_id": "time:convert_time", + "relevance": 0 + } + ], + "notes": "hard-negative: 
near-dup of time:convert_time (same server, but this reports 'now', it does not translate a supplied timestamp)" + }, + { + "id": "q-time-convert", + "query": "Translate a specific timestamp from one timezone into another", + "labels": [ + { + "tool_id": "time:convert_time", + "relevance": 2 + }, + { + "tool_id": "time:get_current_time", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of time:get_current_time (same server, but this converts a given time rather than reading the clock)" + }, + { + "id": "q-seqthink", + "query": "Help me reason through a hard problem step by step, revising my thoughts as I go", + "labels": [ + { + "tool_id": "sequential-thinking:sequentialthinking", + "relevance": 2 + } + ] + }, + { + "id": "q-hn-persist-fact", + "query": "I want to persist a new piece of structured information so I can recall it later", + "labels": [ + { + "tool_id": "memory:create_entities", + "relevance": 2 + }, + { + "tool_id": "sqlite:write_query", + "relevance": 1 + }, + { + "tool_id": "filesystem:write_file", + "relevance": 1 + } + ], + "notes": "graded: 'persist information' is genuinely ambiguous across graph/DB/file backends" + }, + { + "id": "q-hn-history", + "query": "Look back over the recorded history of what changed and when", + "labels": [ + { + "tool_id": "git:git_log", + "relevance": 2 + }, + { + "tool_id": "memory:read_graph", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of memory:read_graph (both surface stored state, but 'history of changes' means the VCS log)" + }, + { + "id": "q-hn-delete-records", + "query": "Get rid of some stored records I no longer need", + "labels": [ + { + "tool_id": "sqlite:write_query", + "relevance": 2 + }, + { + "tool_id": "memory:delete_entities", + "relevance": 1 + }, + { + "tool_id": "filesystem:move_file", + "relevance": 0 + } + ], + "notes": "hard-negative: near-dup of filesystem:move_file (deletion is not relocation); graded across DB/graph" + } + ] +}