A reinforcement-learning benchmark that drops an AI agent into a git-backed codebase seeded with realistic secret leaks — hardcoded API keys, base64-encoded tokens, credentials buried in git history — and grades how quickly and safely the agent remediates every one. The environment runs as a stateless FastAPI server; the agent interacts over HTTP with bash commands and structured inspection actions, receiving a composite reward after every step that captures security progress, code health, and time efficiency.
Production secret leaks remain a top-5 cause of cloud breaches. Developers commit API keys, push .env files, or leave credentials in migration scripts — then scramble to rotate and rewrite history. Existing linting tools flag secrets but don't fix them; LLM agents can, but there's no standardized benchmark to measure how well. SecretsAuditEnv fills that gap with a 13-task curriculum spanning trivial single-file fixes to multi-service cascading leaks with git-history rewriting, complete with a deterministic grading pipeline and an anti-gaming reward function.
All 13 tasks are defined in tasks/ with full metadata in each task.json. Secret types are detected by graders/security.py using regex patterns for AWS keys, GitHub tokens, Firebase keys, connection strings, private keys, SQL passwords, assignment secrets, and base64-encoded variants.
| ID | Difficulty | Title | Description | Scan Mode | Visibility Tiers | Conflict Map |
|---|---|---|---|---|---|---|
| 1 | Easy | Cloud Provisioning | Hardcoded AWS Access Key in config.py | dir | surface, surface, shallow | — |
| 2 | Easy | Database Layer | Password embedded in a raw SQL connection string | dir | surface, surface, shallow | — |
| 3 | Easy | Frontend Config | Firebase API key exposed in a client-side config file | dir | surface, surface, shallow | — |
| 4 | Easy | System Logging | Debug logging leaks a user token | dir | surface, surface, shallow | — |
| 5 | Easy | Git Basics | A tracked .env file leaks credentials in the working tree | dir | surface, surface, shallow | — |
| 6 | Medium | Utility Module | A base64-encoded auth token hides in utils.py | dir | surface, shallow, deep | — |
| 7 | Medium | CI/CD Pipeline | A deployment workflow prints a secret directly to logs | dir | surface, shallow, deep | — |
| 8 | Medium | Noise Filtering | High-entropy dummy values are mixed with one real secret in TOML | dir | surface, shallow, deep | — |
| 9 | Medium | DB Migration | A legacy migration embeds administrator credentials | dir | surface, shallow, deep | — |
| 10 | Medium | Deployment | A multiline RSA private key is embedded in a shell script | dir | surface, shallow, deep | — |
| 11 | Hard | Microservices | The same API key is duplicated across five services | dir | surface, deep, cascading | ✓ |
| 12 | Hard | Deep Logic | A secret is embedded as a local variable inside a function | dir | surface, deep, cascading | ✓ |
| 13 | Hard | Legacy Audit | A secret was committed in v1.0 and still exists in Git history | git | surface, deep, cascading | ✓ |
Scan modes: dir scans the working directory only. git scans the full git commit history (all revisions), so agents must use git filter-repo to clean history — deleting files won't work.
Secrets are not all visible at episode start. The environment implements a 4-tier progressive disclosure system defined per-secret in task.json:
| Tier | When Visible | Typical Use |
|---|---|---|
| SURFACE | Immediately on `/reset` | Obvious hardcoded keys in source files |
| SHALLOW | After agent calls `inspect_file <path>` | Secrets that require reading the file to notice |
| DEEP | After `inspect_git_history` or `inspect_encoded` | Secrets in git commits or base64-encoded blobs |
| CASCADING | After a specified trigger secret is fixed | Secrets that only become relevant after another is remediated |
Hard tasks (11–13) include conflict maps that encode dependencies: fixing secret s1 may reveal s3, while s2 may block s3 until resolved.
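The tier rules above can be sketched as a small visibility filter. This is a minimal illustration, not the code in server/environment.py: the `Secret` and `Session` classes and every field name on them (`tier`, `path`, `deep_kind`, `trigger`) are hypothetical, and the real task.json schema may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Secret:
    # Hypothetical fields; the real task.json schema may use other names.
    id: str
    tier: str                      # "surface" | "shallow" | "deep" | "cascading"
    path: str = ""
    deep_kind: str = ""            # "git" or "encoded" (only used for deep secrets)
    trigger: Optional[str] = None  # id of the secret that must be fixed first


@dataclass
class Session:
    inspected_files: set = field(default_factory=set)    # paths passed to inspect_file
    git_history_inspected: bool = False                   # set by inspect_git_history
    encoded_inspected: set = field(default_factory=set)  # paths passed to inspect_encoded
    fixed: set = field(default_factory=set)               # ids of remediated secrets


def is_visible(secret: Secret, session: Session) -> bool:
    """Apply the four tier rules from the table above."""
    if secret.tier == "surface":
        return True
    if secret.tier == "shallow":
        return secret.path in session.inspected_files
    if secret.tier == "deep":
        if secret.deep_kind == "git":
            return session.git_history_inspected
        return secret.path in session.encoded_inspected
    if secret.tier == "cascading":
        return secret.trigger in session.fixed
    return False


def visible_secrets(secrets, session):
    """Return the secrets the agent is currently allowed to see."""
    return [s for s in secrets if is_visible(s, session)]
```

In this sketch a cascading secret stays hidden until its trigger id appears in `session.fixed`, which is one simple way to model the "fixing s1 may reveal s3" dependency.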
Defined in graders/reward.py. Computed fresh after every /step:
detection_score = (initial_leaks - current_leaks) / initial_leaks
base = 0.4 × detection_score + 0.6 × detection_score
health_score = pytest_passed / pytest_total
efficiency = 0.15 × max(0, 1 - steps_taken / step_budget)
if base > 0:
    total_reward = min(0.999, base × health_score + efficiency)
else:
    total_reward = 0.001
# Final clamp: validator requires strict (0, 1)
total_reward = max(0.001, min(0.999, total_reward))
Key properties:
- Reward is strictly in (0, 1) — never exactly 0.0 or 1.0 (required by OpenEnv validator)
- Reward starts at 0.001 until the agent actually fixes something
- Health gate: breaking the test suite multiplies reward toward the floor
- Efficiency bonus (0.0–0.15): rewards agents that solve in fewer steps
- Step budgets: Easy = 10, Medium = 20, Hard = 30
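The formula above translates directly into a runnable function. The sketch below mirrors the listed steps; the function name `composite_reward` and the guards against zero denominators are assumptions added for illustration and are not confirmed to exist in graders/reward.py.

```python
def composite_reward(initial_leaks, current_leaks,
                     pytest_passed, pytest_total,
                     steps_taken, step_budget):
    """Composite reward following the formula in the text (illustrative sketch)."""
    # Zero-denominator guards are an assumption, not taken from the source.
    detection = (initial_leaks - current_leaks) / initial_leaks if initial_leaks else 0.0
    base = 0.4 * detection + 0.6 * detection
    health = pytest_passed / pytest_total if pytest_total else 0.0
    efficiency = 0.15 * max(0.0, 1 - steps_taken / step_budget)

    if base > 0:
        total = min(0.999, base * health + efficiency)
    else:
        total = 0.001

    # Final clamp keeps the reward strictly inside (0, 1).
    return max(0.001, min(0.999, total))


if __name__ == "__main__":
    # Nothing fixed yet: reward sits at the floor.
    print(composite_reward(2, 2, 10, 10, 1, 10))
    # One of two leaks fixed, tests green, half the Easy budget used.
    print(composite_reward(2, 1, 10, 10, 5, 10))
    # Everything fixed but the whole test suite is broken, budget exhausted.
    print(composite_reward(2, 0, 0, 10, 10, 10))
```

Because `health` multiplies `base`, a fully remediated workspace with a broken test suite earns only whatever efficiency bonus remains.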
Actions are sent as the action field in POST /step. Two categories:
| Action | Format | Effect |
|---|---|---|
| `inspect_file` | `inspect_file <path>` | Marks file as inspected → unlocks SHALLOW secrets for that path |
| `inspect_git_history` | `inspect_git_history [path]` | Scans git history → unlocks all DEEP secrets requiring git inspection |
| `inspect_encoded` | `inspect_encoded <path> [line]` | Decodes base64 blobs → unlocks DEEP encoded secrets for that path |
Any string that doesn't match a structured prefix is executed as a bash command in the task workspace. Common patterns:
- `cat config.py`: read file contents
- `sed -i 's/AKIA.../os.getenv("AWS_KEY")/' config.py`: redact a secret
- `git filter-repo --replace-text <(echo 'ghp_xxx==>REDACTED') --force`: clean git history
- `gitleaks detect --no-git --source .`: run leak scanner
Commands time out after 90 seconds. Exit code, stdout, and stderr are returned in the observation.
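The routing described above (structured prefix versus raw bash, with a 90-second timeout) might look roughly like the following. The function names `parse_action` and `run_bash` are hypothetical; only the three action names, the timeout, and the returned fields come from this document.

```python
import subprocess

STRUCTURED_PREFIXES = ("inspect_file", "inspect_git_history", "inspect_encoded")
COMMAND_TIMEOUT_S = 90  # timeout stated in the text


def parse_action(action: str) -> dict:
    """Classify an action string as a structured inspection or a bash command."""
    parts = action.strip().split()
    if parts and parts[0] in STRUCTURED_PREFIXES:
        return {"kind": parts[0], "args": parts[1:]}
    return {"kind": "bash", "command": action}


def run_bash(command: str, workspace: str) -> dict:
    """Run a command with bash inside the workspace and capture its result."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=COMMAND_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        # Partial output is discarded here to keep the sketch short.
        return {"action": command, "exit_code": None,
                "stdout": "", "stderr": "", "timed_out": True}
    return {"action": command, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr, "timed_out": False}
```

Matching on the first whitespace-separated token means a command such as `inspect_file_backup.sh` would still be treated as bash rather than as a structured action.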
Every /step and /reset response returns a session object with these fields (also listed in openenv.yaml):
| Key | Type | Description |
|---|---|---|
| `visible_secrets` | `list[dict]` | Secrets the agent can currently see (filtered by visibility tier) |
| `hidden_count_hint` | `int` | Number of secrets not yet visible, which tells the agent that more exist |
| `ranked_actions` | `list[dict]` | Top-5 heuristic action suggestions sorted by priority (0.0–1.0) |
| `top_blocker` | `string` | One-sentence description of the highest-priority next action |
| `step_budget` | `int` | Total step budget for this difficulty tier |
| `steps_taken` | `int` | Steps consumed so far |
| `steps_remaining` | `int` | `step_budget - steps_taken` |
| `efficiency_bonus` | `float` | Current efficiency bonus value (decays each step) |
| `conflict_map` | `dict` | Dependency graph between visible secrets (reveals/blocks relationships) |
| `security_score` | `float` | Fraction of initial leaks fixed (0.0–1.0) |
| `health_score` | `float` | Fraction of pytest tests passing (0.0–1.0) |
| `reward` | `float` | Composite reward after this step |
| `observation` | `string` | Combined stdout/stderr from last command + health/security messages |
| `last_result` | `dict` | Raw action, exit_code, stdout, stderr, timed_out from last command |
Generated by server/observation.py using 5 heuristics:
- Visible unfixed secrets → priority 0.92 (fix these first)
- Uninspected high-risk files → priority based on filename suspicion score
- Git history not yet scanned → priority 0.75 (medium/hard only)
- Hidden secrets remaining → priority 0.68 (suggest encoded inspection)
- All visible fixed → priority 0.85 (run gitleaks validation)
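A rough sketch of how these five heuristics could be combined into a top-5 list follows. The function name, its parameters, the suggestion strings, and the default suspicion score of 0.5 are all hypothetical; only the fixed priorities (0.92, 0.85, 0.75, 0.68) come from the list above.

```python
def rank_actions(visible_unfixed, uninspected_files, git_scanned,
                 hidden_count, difficulty, suspicion=None):
    """Return up to five suggestions sorted by descending priority."""
    suspicion = suspicion or {}  # hypothetical per-filename suspicion scores
    suggestions = []

    # 1. Visible unfixed secrets come first.
    for secret_id in visible_unfixed:
        suggestions.append({"action": f"fix {secret_id}", "priority": 0.92})

    # 2. Uninspected files, weighted by how suspicious the filename looks.
    for path in uninspected_files:
        suggestions.append({"action": f"inspect_file {path}",
                            "priority": suspicion.get(path, 0.5)})

    # 3. Git history scan, for medium and hard tasks only.
    if not git_scanned and difficulty in ("medium", "hard"):
        suggestions.append({"action": "inspect_git_history", "priority": 0.75})

    # 4. Hidden secrets remain: suggest encoded inspection.
    if hidden_count > 0:
        suggestions.append({"action": "inspect_encoded <path>", "priority": 0.68})

    # 5. Nothing visible left to fix: validate with gitleaks.
    if not visible_unfixed:
        suggestions.append({"action": "gitleaks detect --no-git --source .",
                            "priority": 0.85})

    suggestions.sort(key=lambda s: s["priority"], reverse=True)
    return suggestions[:5]
```

Under these assumptions, the first entry of the returned list would be a natural source for `top_blocker`.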
| Method | Path | Tag | Description |
|---|---|---|---|
| GET | `/health` | OpenEnv | Returns `{"status": "healthy"}` |
| GET | `/metadata` | OpenEnv | Environment metadata (name, version, task count) |
| GET | `/schema` | OpenEnv | Action/observation/state JSON schemas |
| GET | `/tasks` | OpenEnv | Lists all 13 tasks with task_id, action_schema |
| POST | `/grader` | OpenEnv | Grades a completed episode: `{task_id, step_rewards, step_infos}` → `{score}` |
| POST | `/baseline` | OpenEnv | Runs baseline evaluation across 3 tasks |
| POST | `/mcp` | OpenEnv | JSON-RPC 2.0 initialize support |
| POST | `/reset` | Core | Reset to a task: `{task_id: N}` → full session state |
| POST | `/step` | Core | Execute an action: `{action: "..."}` → updated session state |
| GET | `/state` | Core | Current session state |
| GET | `/web` | Debug | Browser-based debug dashboard |
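For callers who want to talk to these endpoints without extra dependencies, a minimal standard-library client could look like this. The helper names `build_request` and `call` are hypothetical, and the payload shapes are taken from the table above rather than from the server's schema, so check `GET /schema` before relying on them.

```python
import json
import urllib.request


def build_request(base_url, path, payload=None):
    """Build a GET request (no payload) or a JSON POST request (with payload)."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request, timeout=120):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a running server, so it is left commented out):
# env_url = "http://localhost:7860"
# session = call(build_request(env_url, "/reset", {"task_id": 1}))
# session = call(build_request(env_url, "/step", {"action": "inspect_file config.py"}))
# print(session["reward"], session["hidden_count_hint"])
```

The client timeout is set above the 90-second command limit so that a slow bash command is reported by the server instead of being cut off by the caller.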
# Install dependencies
pip install -r requirements.txt
# Generate task workspaces
python tools/generate_tasks.py
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860
# Open the debug UI
open http://localhost:7860/web
# Run the agent (loops over 3 tasks by default)
export API_BASE_URL="https://openrouter.ai/api/v1"
export HF_TOKEN="your-api-key"
export MODEL_NAME="Qwen/Qwen2.5-72b-instruct"
export ENV_URL="http://localhost:7860"
python inference.py
# Or specify tasks explicitly
python inference.py --task-id task_1,task_2,task_3,task_6
docker build -t secretsauditenv .
docker run -p 7860:7860 secretsauditenv
# Server auto-generates tasks and starts on port 7860
The Dockerfile installs Python 3.11-slim with bash, git, git-filter-repo, and all pip dependencies.
The baseline agent (inference.py) loops over 3+ tasks (required by OpenEnv validator), running a ReAct loop for each:
- For each task (`task_1`, `task_2`, `task_3`, ...):
  - Emit `[START] task=<id> env=secrets_audit model=<model>`
  - `POST /reset` with the task ID
  - Auto-inject: automatically runs `cat <primary_file>` to give the LLM context before it starts
  - Per-difficulty step cap: Easy=8, Medium=15, Hard=25 (maximizes efficiency bonus)
  - For each step:
    - Build structured prompt with task-specific hints → call LLM → normalize action → `POST /step`
    - Emit `[STEP] step=N action=... reward=0.XX done=false error=null`
  - Call `POST /grader` with `{task_id, step_rewards, step_infos}` to get official score
  - Emit `[END] success=true steps=N score=0.XXX rewards=0.XX,0.XX,...`
- The validator counts `[START]`/`[END]` pairs and checks each score is in `(0, 1)`
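The log protocol above can be checked locally before submitting. The sketch below formats a `[STEP]` line and applies the two checks the text attributes to the validator (matching `[START]`/`[END]` counts, scores strictly inside (0, 1)). The real validator's logic is not shown in this document, so `validate_log` is an approximation under that assumption.

```python
import re

# Matches an [END] line and captures the numeric score field.
END_RE = re.compile(r"^\[END\] .*\bscore=([0-9.]+)")


def format_step(step, action, reward, done=False, error=None):
    """Format one [STEP] line in the layout shown above."""
    error_text = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error_text}")


def validate_log(lines, min_tasks=3):
    """Approximate the validator: paired markers and scores strictly in (0, 1)."""
    starts = sum(1 for line in lines if line.startswith("[START]"))
    scores = []
    for line in lines:
        match = END_RE.match(line)
        if match:
            scores.append(float(match.group(1)))
    return (
        starts >= min_tasks
        and starts == len(scores)
        and all(0.0 < score < 1.0 for score in scores)
    )
```

Piping the output of `inference.py` through a check like this catches a missing `[END]` line or a score of exactly 0.0 or 1.0 before the official validator does.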
Environment variables:
| Variable | Required | Description |
|---|---|---|
| `API_BASE_URL` | Yes | OpenAI-compatible API endpoint |
| `HF_TOKEN` | Yes | API key for the model provider |
| `MODEL_NAME` | Yes | Model identifier |
| `ENV_URL` | No | Server URL, defaults to `http://localhost:7860` |
| `TASK_ID` | No | Comma-separated task list, defaults to `task_1,task_2,task_3` |
Prompt engineering:
- Task-specific hints for Task 13 (git history leak: instructs the agent to use `git filter-repo`)
- Partial-fix detection: when reward is ~0.5, tells the agent that git history still leaks
- Urgency escalation: after 5+ steps with no progress, forces a direct fix attempt
Anti-loop features:
- Detects 3 consecutive identical actions with identical rewards → forces a different command
- Smart atomic enforcement: rejects `&&` and `;` chaining but allows `${VAR}` in sed patterns
- Ensures minimum 3 tasks are always run (auto-pads if fewer specified)
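The two guards above, repeated-action detection and single-command enforcement, can be approximated as follows. This sketch assumes the chaining operators in question are `&&` and `;`, and it treats anything inside quotes as part of a single argument so that sed scripts pass. Both function names are hypothetical and the logic in inference.py may differ.

```python
def is_stuck(history, window=3):
    """True when the last `window` (action, reward) pairs are all identical.

    `history` is a list of (action, reward) tuples, oldest first.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(item == recent[0] for item in recent)


def is_atomic(command):
    """True when the command has no unquoted `&&` or `;` chaining.

    Backslash escapes are ignored to keep the sketch short.
    """
    quote = None
    for index, char in enumerate(command):
        if quote:
            if char == quote:
                quote = None      # closing quote
        elif char in "'\"":
            quote = char          # opening quote
        elif char == ";" or command.startswith("&&", index):
            return False
    return True
```

For example, `is_atomic("sed -i 's/KEY/${VAR}/' config.py")` passes, while `is_atomic("cd src && cat a.py")` is rejected.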
Custom regex-based scanner (no external tools required). Detects:
- AWS Access Keys (`AKIA...`)
- GitHub tokens (`ghp_...`)
- Firebase API keys (`AIza...`)
- Service tokens (`tok_live_...`, `sk_test_...`)
- Private keys (PEM format)
- SQL connection strings (`postgres://user:pass@host/db`)
- SQL passwords (`PASSWORD 'value'`)
- High-entropy assignment secrets (Shannon entropy ≥ 3.2)
- Base64-encoded variants of all the above
Supports .gitleaks.toml allowlists. For scan_mode: git, scans all commits via git rev-list --all.
Anti-gaming: if an agent deletes .git, the grader recovers by creating a fresh snapshot and scanning that — the secret still gets found.
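To illustrate the pattern-plus-entropy approach, here is a small scanner sketch covering three of the listed token types and the entropy check. The regexes are widely used public patterns for these prefixes and are not necessarily the ones in graders/security.py; `ASSIGN_RE`, `scan_text`, and the minimum value length of 8 are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Commonly used public patterns; the grader's own regexes may differ.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "firebase_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
}

# Hypothetical pattern for `name = "value"` style assignments.
ASSIGN_RE = re.compile(r"""(\w+)\s*=\s*["']([^"']{8,})["']""")


def shannon_entropy(text):
    """Shannon entropy of the string in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    length = len(text)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def scan_text(text, entropy_threshold=3.2):
    """Return (kind, value) findings from known patterns and entropy checks."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((kind, match.group(0)))

    already_found = {value for _, value in findings}
    for match in ASSIGN_RE.finditer(text):
        value = match.group(2)
        if value not in already_found and shannon_entropy(value) >= entropy_threshold:
            findings.append(("high_entropy_assignment", value))
    return findings
```

A repetitive placeholder such as `"aaaaaaaaaaaa"` has zero entropy and is ignored, which is the kind of filtering that Task 8 (dummy values mixed with one real secret) depends on.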
Runs pytest -q --junitxml in the task workspace and parses the JUnit XML report. Score = passed / total. Returns 0.0 if the report contains any errors or if pytest times out (60s default).
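The parsing step described above can be sketched with the standard library. This version reads the `tests`, `failures`, `errors`, and `skipped` attributes that pytest writes on each `<testsuite>` element. Counting skipped tests as not passed is an assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def health_score_from_junit(xml_text):
    """Compute passed / total from a JUnit XML report; 0.0 on errors or no tests."""
    root = ET.fromstring(xml_text)
    # The root may be a single <testsuite> or a <testsuites> wrapper.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")

    total = sum(int(suite.get("tests", 0)) for suite in suites)
    failures = sum(int(suite.get("failures", 0)) for suite in suites)
    errors = sum(int(suite.get("errors", 0)) for suite in suites)
    skipped = sum(int(suite.get("skipped", 0)) for suite in suites)

    if total == 0 or errors > 0:
        return 0.0
    # Assumption: skipped tests do not count as passed.
    passed = total - failures - skipped
    return passed / total
```

A timeout would be handled one level up, by the code that launches pytest, and mapped to the same 0.0 result.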
Available at GET /web. A single-page dark-mode dashboard served from server/web_ui.html that lets you:
- Start any of the 13 tasks with one click
- Send structured actions or raw bash commands
- View real-time metrics (reward, leaks, hidden count, steps, efficiency, health)
- See ranked action suggestions and conflict maps
- Browse visible secrets and full observation text
- Review action history
No external dependencies — pure HTML/CSS/JS.
# Run the environment test suite
python -m pytest tests/ -v
# Run openenv validation (6/6 checks)
openenv validate http://localhost:7860
# Run the full submission validator
bash validate-submission.sh
OpenEnv validator checks (all passing ✅):
- `GET /health` returns `{"status": "healthy"}`
- `GET /metadata` returns valid metadata
- `GET /schema` returns action/observation schemas
- `GET /tasks` returns task list with `task_id` fields
- `POST /grader` accepts `{task_id, step_rewards}` and returns `{score}` in `(0, 1)`
- `POST /mcp` responds to JSON-RPC 2.0 `initialize`
Deep validation checks (Phase 2):
- At least 3 tasks with working graders
- All task scores strictly between 0 and 1 (not 0.0 or 1.0)
- `inference.py` loops over 3+ tasks, emitting `[START]`/`[END]` per task
.
├── server/
│ ├── app.py # FastAPI routes (OpenEnv + Core + Debug)
│ ├── environment.py # Core environment logic, visibility tiers, action parsing
│ ├── observation.py # Ranked action heuristics and top_blocker computation
│ └── web_ui.html # Debug dashboard (served at /web)
├── graders/
│ ├── security.py # Regex-based secret scanner with git history support
│ ├── health.py # Pytest-based health grader with JUnit parsing
│ ├── reward.py # Composite reward: security × health + efficiency
│ ├── grader.py # OpenEnv grader interface
│ ├── gitleaks_eval.py # Spec-aligned wrapper with git integrity reporting
│ └── health_eval.py # Spec-aligned wrapper with failure messaging
├── tasks/
│ ├── easy/task_01..05/ # 5 easy tasks (single-file, surface+shallow secrets)
│ ├── medium/task_06..10/ # 5 medium tasks (multi-format, surface+shallow+deep)
│ └── hard/task_11..13/ # 3 hard tasks (cascading, conflict maps, git history)
├── tests/
│ ├── test_hidden_leaks.py # 11 tests: visibility tier logic
│ ├── test_ranked_actions.py # 12 tests: heuristic action suggestions
│ └── test_reward.py # 11 tests: reward formula and efficiency bonus
├── tools/
│ └── generate_tasks.py # Generates all 13 task workspaces with seeded secrets
├── inference.py # Baseline agent: loops 3+ tasks, calls POST /grader
├── openenv.yaml # OpenEnv spec (tasks, required_endpoints, entry_points)
├── Dockerfile # Python 3.11-slim + git + HEALTHCHECK
├── hf_space_entrypoint.sh # Docker entrypoint: generate tasks → start uvicorn
├── validate-submission.sh # Submission validator
├── requirements.txt # Pip dependencies
└── pyproject.toml # Package metadata
MIT