SecretsAuditEnv

A reinforcement-learning benchmark that drops an AI agent into a git-backed codebase seeded with realistic secret leaks — hardcoded API keys, base64-encoded tokens, credentials buried in git history — and grades how quickly and safely the agent remediates every one. The environment runs as a stateless FastAPI server; the agent interacts over HTTP with bash commands and structured inspection actions, receiving a composite reward after every step that captures security progress, code health, and time efficiency.

Why This Exists

Production secret leaks remain a top-5 cause of cloud breaches. Developers commit API keys, push .env files, or leave credentials in migration scripts — then scramble to rotate and rewrite history. Existing linting tools flag secrets but don't fix them; LLM agents can, but there's no standardized benchmark to measure how well. SecretsAuditEnv fills that gap with a 13-task curriculum spanning trivial single-file fixes to multi-service cascading leaks with git-history rewriting, complete with a deterministic grading pipeline and an anti-gaming reward function.

Task Curriculum

All 13 tasks are defined in tasks/ with full metadata in each task.json. Secret types are detected by graders/security.py using regex patterns for AWS keys, GitHub tokens, Firebase keys, connection strings, private keys, SQL passwords, high-entropy assignment secrets, and base64-encoded variants.

ID	Difficulty	Title	Description	Scan Mode	Visibility Tiers	Conflict Map
1	Easy	Cloud Provisioning	Hardcoded AWS Access Key in config.py	`dir`	surface, surface, shallow	—
2	Easy	Database Layer	Password embedded in a raw SQL connection string	`dir`	surface, surface, shallow	—
3	Easy	Frontend Config	Firebase API key exposed in a client-side config file	`dir`	surface, surface, shallow	—
4	Easy	System Logging	Debug logging leaks a user token	`dir`	surface, surface, shallow	—
5	Easy	Git Basics	A tracked .env file leaks credentials in the working tree	`dir`	surface, surface, shallow	—
6	Medium	Utility Module	A base64-encoded auth token hides in utils.py	`dir`	surface, shallow, deep	—
7	Medium	CI/CD Pipeline	A deployment workflow prints a secret directly to logs	`dir`	surface, shallow, deep	—
8	Medium	Noise Filtering	High-entropy dummy values are mixed with one real secret in TOML	`dir`	surface, shallow, deep	—
9	Medium	DB Migration	A legacy migration embeds administrator credentials	`dir`	surface, shallow, deep	—
10	Medium	Deployment	A multiline RSA private key is embedded in a shell script	`dir`	surface, shallow, deep	—
11	Hard	Microservices	The same API key is duplicated across five services	`dir`	surface, deep, cascading	✓
12	Hard	Deep Logic	A secret is embedded as a local variable inside a function	`dir`	surface, deep, cascading	✓
13	Hard	Legacy Audit	A secret was committed in v1.0 and still exists in Git history	`git`	surface, deep, cascading	✓

Scan modes: dir scans the working directory only. git scans the full git commit history (all revisions), so agents must rewrite history with git filter-repo; deleting the file from the working tree leaves the secret in earlier commits.

Secret Visibility Tiers

Secrets are not all visible at episode start. The environment implements a 4-tier progressive disclosure system defined per-secret in task.json:

Tier	When Visible	Typical Use
`SURFACE`	Immediately on `/reset`	Obvious hardcoded keys in source files
`SHALLOW`	After agent calls `inspect_file <path>`	Secrets that require reading the file to notice
`DEEP`	After `inspect_git_history` or `inspect_encoded`	Secrets in git commits or base64-encoded blobs
`CASCADING`	After a specified trigger secret is fixed	Secrets that only become relevant after another is remediated

Hard tasks (11–13) include conflict maps that encode dependencies: fixing secret s1 may reveal s3, while s2 may block s3 until resolved.
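The tier rules in the table above can be sketched as a small visibility filter. This is an illustration only: the `Secret` and `EpisodeState` classes and all of their field names are hypothetical, and the real task.json schema and server/environment.py logic may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Secret:
    # Hypothetical fields; not the actual task.json schema.
    id: str
    path: str
    tier: str                         # "SURFACE", "SHALLOW", "DEEP", or "CASCADING"
    deep_kind: Optional[str] = None   # "git" or "encoded" (DEEP only)
    trigger: Optional[str] = None     # id of the secret to fix first (CASCADING only)


@dataclass
class EpisodeState:
    inspected_files: set = field(default_factory=set)
    encoded_inspected: set = field(default_factory=set)
    git_history_scanned: bool = False
    fixed_ids: set = field(default_factory=set)


def is_visible(secret: Secret, state: EpisodeState) -> bool:
    """Apply the four tier rules to decide whether a secret is shown."""
    if secret.tier == "SURFACE":
        return True
    if secret.tier == "SHALLOW":
        return secret.path in state.inspected_files
    if secret.tier == "DEEP":
        if secret.deep_kind == "git":
            return state.git_history_scanned
        return secret.path in state.encoded_inspected
    if secret.tier == "CASCADING":
        return secret.trigger in state.fixed_ids
    return False


secrets = [
    Secret("s1", "config.py", "SURFACE"),
    Secret("s2", "utils.py", "SHALLOW"),
    Secret("s3", "utils.py", "DEEP", deep_kind="encoded"),
    Secret("s4", "svc/a.py", "CASCADING", trigger="s1"),
]
state = EpisodeState()
visible_at_reset = [s.id for s in secrets if is_visible(s, state)]

# Simulate `inspect_file utils.py` followed by fixing s1.
state.inspected_files.add("utils.py")
state.fixed_ids.add("s1")
visible_later = [s.id for s in secrets if is_visible(s, state)]
hidden_count_hint = len(secrets) - len(visible_later)
```

In this sketch only `s1` is visible at reset; inspecting `utils.py` and fixing `s1` reveals `s2` and `s4`, while `s3` stays hidden until the encoded blob is inspected.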

Reward Formula

Defined in graders/reward.py. Computed fresh after every /step:

detection_score = (initial_leaks - current_leaks) / initial_leaks
base            = 0.4 × detection_score + 0.6 × detection_score   # both terms use detection_score, so base equals detection_score
health_score    = pytest_passed / pytest_total
efficiency      = 0.15 × max(0, 1 - steps_taken / step_budget)

if base > 0:
    total_reward = min(0.999, base × health_score + efficiency)
else:
    total_reward = 0.001

# Final clamp: validator requires strict (0, 1)
total_reward = max(0.001, min(0.999, total_reward))

Key properties:

Reward is strictly in (0, 1) — never exactly 0.0 or 1.0 (required by OpenEnv validator)
Reward starts at 0.001 until the agent actually fixes something
Health gate: base is multiplied by health_score, so breaking the test suite drags reward down to at most the efficiency bonus
Efficiency bonus (0.0–0.15): rewards agents that solve in fewer steps
Step budgets: Easy = 10, Medium = 20, Hard = 30
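The formula above translates directly into Python. The sketch below follows the listed steps term by term; the function name `compute_reward` and the guards against zero denominators are assumptions, not something graders/reward.py is known to contain.

```python
def compute_reward(initial_leaks, current_leaks, pytest_passed, pytest_total,
                   steps_taken, step_budget):
    """Composite reward following the formula in this section."""
    # Zero-denominator guards are an assumption of this sketch.
    detection = (initial_leaks - current_leaks) / initial_leaks if initial_leaks else 0.0
    base = 0.4 * detection + 0.6 * detection
    health = pytest_passed / pytest_total if pytest_total else 0.0
    efficiency = 0.15 * max(0.0, 1 - steps_taken / step_budget)

    if base > 0:
        total = min(0.999, base * health + efficiency)
    else:
        total = 0.001

    # Final clamp into the open interval (0, 1).
    return max(0.001, min(0.999, total))


# Nothing fixed yet: reward sits at the floor.
r_floor = compute_reward(3, 3, 10, 10, steps_taken=1, step_budget=10)

# One of two leaks fixed, all tests pass, half of a Medium budget used:
# 0.5 * 1.0 + 0.15 * 0.5 = 0.575
r_partial = compute_reward(2, 1, 10, 10, steps_taken=10, step_budget=20)

# Everything fixed but the whole test suite is broken: only the
# efficiency bonus (0.15 * 0.5 = 0.075) survives.
r_broken = compute_reward(2, 0, 0, 10, steps_taken=5, step_budget=10)

# Everything fixed, tests pass, 4 of 10 steps: 1.0 + 0.09 is capped at 0.999.
r_capped = compute_reward(2, 0, 10, 10, steps_taken=4, step_budget=10)
```

The third case shows why the health gate matters: a full security fix that breaks every test scores lower than a half fix that keeps the suite green.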

Action Space

Actions are sent as the action field in POST /step. Two categories:

Structured Actions (intercepted before bash)

Action	Format	Effect
`inspect_file`	`inspect_file <path>`	Marks file as inspected → unlocks SHALLOW secrets for that path
`inspect_git_history`	`inspect_git_history [path]`	Scans git history → unlocks all DEEP secrets requiring git inspection
`inspect_encoded`	`inspect_encoded <path> [line]`	Decodes base64 blobs → unlocks DEEP encoded secrets for that path

Bash Commands (executed in workspace via `/usr/bin/bash -lc`)

Any string that doesn't match a structured prefix is executed as a bash command in the task workspace. Common patterns:

cat config.py — read file contents
sed -i 's/AKIA.../os.getenv("AWS_KEY")/' config.py — redact a secret
git filter-repo --replace-text <(echo 'ghp_xxx==>REDACTED') --force — clean git history
gitleaks detect --no-git --source . — run leak scanner

Commands time out after 90 seconds. Exit code, stdout, and stderr are returned in the observation.
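One way to picture the split between structured actions and bash is a prefix check followed by a subprocess call with a timeout. The sketch below is hypothetical: `classify_action` and `run_bash` are illustrative names, and the actual parsing in server/environment.py may handle arguments differently.

```python
import subprocess

STRUCTURED_PREFIXES = ("inspect_file", "inspect_git_history", "inspect_encoded")


def classify_action(action: str):
    """Return (kind, args): a structured action name or "bash"."""
    parts = action.strip().split()
    if parts and parts[0] in STRUCTURED_PREFIXES:
        return parts[0], parts[1:]
    # Anything else is passed to bash unchanged.
    return "bash", [action]


def run_bash(command: str, workspace: str, timeout: int = 90) -> dict:
    """Run a command via /usr/bin/bash -lc and mirror the last_result keys."""
    try:
        proc = subprocess.run(
            ["/usr/bin/bash", "-lc", command],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "timed_out": False,
    }


kind_a, args_a = classify_action("inspect_file config.py")
kind_b, args_b = classify_action("cat config.py")
```

Splitting on whitespace is a simplification; paths containing spaces would need real argument parsing.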

Observation Keys

Every /step and /reset response returns a session object with these fields (also listed in openenv.yaml):

Key	Type	Description
`visible_secrets`	`list[dict]`	Secrets the agent can currently see (filtered by visibility tier)
`hidden_count_hint`	`int`	Number of secrets not yet visible — tells agent more exist
`ranked_actions`	`list[dict]`	Top-5 heuristic action suggestions sorted by priority (0.0–1.0)
`top_blocker`	`string`	One-sentence description of the highest-priority next action
`step_budget`	`int`	Total step budget for this difficulty tier
`steps_taken`	`int`	Steps consumed so far
`steps_remaining`	`int`	`step_budget - steps_taken`
`efficiency_bonus`	`float`	Current efficiency bonus value (decays each step)
`conflict_map`	`dict`	Dependency graph between visible secrets (reveals/blocks relationships)
`security_score`	`float`	Fraction of initial leaks fixed (0.0–1.0)
`health_score`	`float`	Fraction of pytest tests passing (0.0–1.0)
`reward`	`float`	Composite reward after this step
`observation`	`string`	Combined stdout/stderr from last command + health/security messages
`last_result`	`dict`	Raw action, exit_code, stdout, stderr, timed_out from last command

Ranked Actions

Generated by server/observation.py using 5 heuristics:

Visible unfixed secrets → priority 0.92 (fix these first)
Uninspected high-risk files → priority based on filename suspicion score
Git history not yet scanned → priority 0.75 (medium/hard only)
Hidden secrets remaining → priority 0.68 (suggest encoded inspection)
All visible fixed → priority 0.85 (run gitleaks validation)
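The five heuristics can be read as "collect candidates, sort by priority, keep the top five". The following is a rough sketch under that reading; the function signature, the action strings, and the `suspicion` mapping are all hypothetical, and how server/observation.py scores filenames is not specified here.

```python
def rank_actions(visible_unfixed, uninspected_files, suspicion,
                 git_scanned, difficulty, hidden_count):
    """Build up to five suggestions, highest priority first."""
    suggestions = []

    # 1. Visible unfixed secrets come first.
    for secret_id in visible_unfixed:
        suggestions.append({"action": f"fix {secret_id}", "priority": 0.92})

    # 2. Uninspected files, scored by a caller-supplied suspicion value.
    for path in uninspected_files:
        suggestions.append({"action": f"inspect_file {path}",
                            "priority": suspicion.get(path, 0.0)})

    # 3. Git history scan, medium and hard tasks only.
    if not git_scanned and difficulty in ("medium", "hard"):
        suggestions.append({"action": "inspect_git_history", "priority": 0.75})

    # 4. Hidden secrets remain: suggest encoded inspection.
    if hidden_count > 0:
        suggestions.append({"action": "inspect_encoded <path>", "priority": 0.68})

    # 5. Everything visible is fixed: suggest a validation scan.
    if not visible_unfixed:
        suggestions.append({"action": "gitleaks detect --no-git --source .",
                            "priority": 0.85})

    suggestions.sort(key=lambda s: s["priority"], reverse=True)
    return suggestions[:5]


ranked = rank_actions(
    visible_unfixed=["s1"],
    uninspected_files=["utils.py", ".env"],
    suspicion={"utils.py": 0.4, ".env": 0.9},
    git_scanned=False,
    difficulty="medium",
    hidden_count=2,
)
priorities = [s["priority"] for s in ranked]
```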

API Endpoints

Method	Path	Tag	Description
`GET`	`/health`	OpenEnv	Returns `{"status": "healthy"}`
`GET`	`/metadata`	OpenEnv	Environment metadata (name, version, task count)
`GET`	`/schema`	OpenEnv	Action/observation/state JSON schemas
`GET`	`/tasks`	OpenEnv	Lists all 13 tasks with `task_id`, `action_schema`
`POST`	`/grader`	OpenEnv	Grades a completed episode: `{task_id, step_rewards, step_infos}` → `{score}`
`POST`	`/baseline`	OpenEnv	Runs baseline evaluation across 3 tasks
`POST`	`/mcp`	OpenEnv	JSON-RPC 2.0 `initialize` support
`POST`	`/reset`	Core	Reset to a task: `{task_id: N}` → full session state
`POST`	`/step`	Core	Execute an action: `{action: "..."}` → updated session state
`GET`	`/state`	Core	Current session state
`GET`	`/web`	Debug	Browser-based debug dashboard
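A minimal client for the two Core endpoints might look like the sketch below. It uses only the standard library; `http_post` and `run_episode` are illustrative names, and the payload shapes are taken from the table above (`{task_id: N}` for /reset, `{action: "..."}` for /step). The transport is passed in as a callable so the loop can be exercised without a running server.

```python
import json
import urllib.request
from functools import partial


def http_post(base_url: str, path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(post, task_id, actions):
    """Reset to a task, send each action, and collect per-step rewards.

    `post` is any callable taking (path, payload) and returning a dict.
    """
    post("/reset", {"task_id": task_id})
    rewards = []
    for action in actions:
        session = post("/step", {"action": action})
        rewards.append(session["reward"])
    return rewards


# Against a local server (assumes it is listening on port 7860):
#   post = partial(http_post, "http://localhost:7860")
#   run_episode(post, 1, ["inspect_file config.py", "cat config.py"])
```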

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Generate task workspaces
python tools/generate_tasks.py

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Open the debug UI (open is macOS-specific; on Linux, xdg-open is the usual equivalent)
open http://localhost:7860/web

# Run the agent (loops over 3 tasks by default)
export API_BASE_URL="https://openrouter.ai/api/v1"
export HF_TOKEN="your-api-key"
export MODEL_NAME="Qwen/Qwen2.5-72b-instruct"
export ENV_URL="http://localhost:7860"
python inference.py

# Or specify tasks explicitly
python inference.py --task-id task_1,task_2,task_3,task_6

Docker

docker build -t secretsauditenv .
docker run -p 7860:7860 secretsauditenv
# Server auto-generates tasks and starts on port 7860

The Dockerfile builds on the Python 3.11-slim base image and installs bash, git, git-filter-repo, and all pip dependencies.

inference.py — Agent Loop

The baseline agent (inference.py) loops over 3+ tasks (required by OpenEnv validator), running a ReAct loop for each:

For each task (task_1, task_2, task_3, ...):
- Emit [START] task=<id> env=secrets_audit model=<model>
- POST /reset with the task ID
- Auto-inject: automatically runs cat <primary_file> to give the LLM context before it starts
- Per-difficulty step cap: Easy=8, Medium=15, Hard=25 (maximizes efficiency bonus)
- For each step:
  - Build structured prompt with task-specific hints → call LLM → normalize action → POST /step
  - Emit [STEP] step=N action=... reward=0.XX done=false error=null
- Call POST /grader with {task_id, step_rewards, step_infos} to get official score
- Emit [END] success=true steps=N score=0.XXX rewards=0.XX,0.XX,...
The validator counts [START]/[END] pairs and checks each score is in (0, 1)
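To sanity-check a run before submitting, the log rules described above (paired [START]/[END] lines, at least three tasks, every score strictly inside (0, 1)) can be approximated with a few lines of Python. This is a hypothetical checker written from the description in this README, not the validator's own code, and `check_log` is an illustrative name.

```python
import re

# Captures the numeric value of score= on an [END] line.
END_RE = re.compile(r"^\[END\].*\bscore=([0-9.]+)")


def check_log(lines):
    """Return (ok, scores) for a list of log lines."""
    starts = sum(1 for line in lines if line.startswith("[START]"))
    scores = []
    for line in lines:
        match = END_RE.match(line)
        if match:
            scores.append(float(match.group(1)))
    ok = (
        starts == len(scores)
        and starts >= 3
        and all(0.0 < s < 1.0 for s in scores)
    )
    return ok, scores


sample_log = [
    "[START] task=task_1 env=secrets_audit model=demo",
    "[STEP] step=1 action=cat config.py reward=0.00 done=false error=null",
    "[END] success=true steps=1 score=0.575 rewards=0.00",
    "[START] task=task_2 env=secrets_audit model=demo",
    "[END] success=true steps=2 score=0.999 rewards=0.50,0.99",
    "[START] task=task_3 env=secrets_audit model=demo",
    "[END] success=true steps=3 score=0.001 rewards=0.00,0.00,0.00",
]
ok, scores = check_log(sample_log)
```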

Environment variables:

Variable	Required	Description
`API_BASE_URL`	Yes	OpenAI-compatible API endpoint
`HF_TOKEN`	Yes	API key for the model provider
`MODEL_NAME`	Yes	Model identifier
`ENV_URL`	No	Server URL, defaults to `http://localhost:7860`
`TASK_ID`	No	Comma-separated task list, defaults to `task_1,task_2,task_3`

Prompt engineering:

Task-specific hints for Task 13 (git history leak — instructs agent to use git filter-repo)
Partial-fix detection — when reward is ~0.5, tells agent git history still leaks
Urgency escalation — after 5+ steps with no progress, forces direct fix attempt

Anti-loop features:

Detects 3 consecutive identical actions with identical rewards → forces a different command
Smart atomic enforcement — rejects && and ; chaining but allows ${VAR} in sed patterns
Ensures at least 3 tasks are always run (auto-pads if fewer are specified)
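The first two anti-loop checks are simple enough to sketch. The helpers below are hypothetical (`is_stuck` and `is_atomic` are not names taken from inference.py) and deliberately naive: the chaining check is a plain substring test, so a `;` inside a quoted sed script would also be rejected.

```python
def is_stuck(history, window=3):
    """True if the last `window` (action, reward) pairs are all identical."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(item == recent[0] for item in recent)


def is_atomic(command: str) -> bool:
    """Reject && and ; chaining; ${VAR} contains neither, so it passes."""
    return "&&" not in command and ";" not in command


history = [
    ("cat config.py", 0.001),
    ("cat config.py", 0.001),
    ("cat config.py", 0.001),
]
stuck = is_stuck(history)
sed_ok = is_atomic("sed -i 's/KEY/${VAR}/' config.py")
chained_ok = is_atomic("cd src && ls")
```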

Grading Pipeline

Security Grader (`graders/security.py`)

Custom regex-based scanner (no external tools required). Detects:

AWS Access Keys (AKIA...)
GitHub tokens (ghp_...)
Firebase API keys (AIza...)
Service tokens (tok_live_..., sk_test_...)
Private keys (PEM format)
SQL connection strings (postgres://user:pass@host/db)
SQL passwords (PASSWORD 'value')
High-entropy assignment secrets (Shannon entropy ≥ 3.2)
Base64-encoded variants of all the above

Supports .gitleaks.toml allowlists. For scan_mode: git, scans all commits via git rev-list --all.

Anti-gaming: if an agent deletes .git, the grader recovers by creating a fresh snapshot and scanning that — the secret still gets found.
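As an illustration of the regex-plus-entropy approach, here is a small scanner for a single line of text. The patterns are commonly used approximations for these token formats and are assumptions of this sketch, not the expressions in graders/security.py; only the 3.2 entropy threshold comes from the list above. The sample value is the placeholder access key ID used in AWS documentation.

```python
import math
import re
from collections import Counter

# Approximate patterns; the grader's real expressions may differ.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# name = "value" or name = 'value' with a value of at least 8 characters.
ASSIGNMENT_RE = re.compile(r"""(\w+)\s*=\s*["']([^"']{8,})["']""")


def shannon_entropy(value: str) -> float:
    """Shannon entropy in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def scan_line(line: str, threshold: float = 3.2):
    """Return the names of all detectors that fire on this line."""
    findings = [name for name, pattern in PATTERNS.items() if pattern.search(line)]
    match = ASSIGNMENT_RE.search(line)
    if match and shannon_entropy(match.group(2)) >= threshold:
        findings.append("high_entropy_assignment")
    return findings


leak = scan_line('AWS_KEY = "AKIAIOSFODNN7EXAMPLE"')
clean = scan_line('name = "aaaaaaaaaaaa"')
```

A repeated-character string has zero entropy and is ignored, while the sample key trips both the prefix pattern and the entropy check.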

Health Grader (`graders/health.py`)

Runs pytest -q with the --junitxml option in the task workspace and parses the resulting JUnit XML report. Score = passed / total. Returns 0.0 if the run reports any errors or if pytest times out (60s default).
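The scoring rule can be sketched against the JUnit XML that pytest writes, where each `testsuite` element carries `tests`, `failures`, `errors`, and `skipped` counts. The function name `health_score_from_junit` is hypothetical, and treating skipped tests as not passed is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def health_score_from_junit(xml_text: str) -> float:
    """Compute passed / total from a JUnit XML report string."""
    root = ET.fromstring(xml_text)
    # The root may be <testsuites> wrapping suites, or a bare <testsuite>.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")

    total = failures = errors = skipped = 0
    for suite in suites:
        total += int(suite.get("tests", 0))
        failures += int(suite.get("failures", 0))
        errors += int(suite.get("errors", 0))
        skipped += int(suite.get("skipped", 0))

    # Any error (or an empty run) zeroes the score.
    if total == 0 or errors > 0:
        return 0.0

    passed = total - failures - skipped  # assumption: skipped tests do not count as passed
    return passed / total


report_ok = '<testsuites><testsuite tests="4" failures="1" errors="0" skipped="0"/></testsuites>'
report_err = '<testsuites><testsuite tests="4" failures="0" errors="1" skipped="0"/></testsuites>'
score_ok = health_score_from_junit(report_ok)
score_err = health_score_from_junit(report_err)
```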

Web Debug UI

Available at GET /web. A single-page dark-mode dashboard served from server/web_ui.html that lets you:

Start any of the 13 tasks with one click
Send structured actions or raw bash commands
View real-time metrics (reward, leaks, hidden count, steps, efficiency, health)
See ranked action suggestions and conflict maps
Browse visible secrets and full observation text
Review action history

No external dependencies — pure HTML/CSS/JS.

Validation

# Run the environment test suite
python -m pytest tests/ -v

# Run openenv validation (6/6 checks)
openenv validate http://localhost:7860

# Run the full submission validator
bash validate-submission.sh

OpenEnv validator checks (all passing ✅):

GET /health returns {"status": "healthy"}
GET /metadata returns valid metadata
GET /schema returns action/observation schemas
GET /tasks returns task list with task_id fields
POST /grader accepts {task_id, step_rewards} and returns {score} in (0, 1)
POST /mcp responds to JSON-RPC 2.0 initialize

Deep validation checks (Phase 2):

At least 3 tasks with working graders
All task scores strictly between 0 and 1 (not 0.0 or 1.0)
inference.py loops over 3+ tasks, emitting [START]/[END] per task

Project Structure

.
├── server/
│   ├── app.py              # FastAPI routes (OpenEnv + Core + Debug)
│   ├── environment.py      # Core environment logic, visibility tiers, action parsing
│   ├── observation.py      # Ranked action heuristics and top_blocker computation
│   └── web_ui.html         # Debug dashboard (served at /web)
├── graders/
│   ├── security.py         # Regex-based secret scanner with git history support
│   ├── health.py           # Pytest-based health grader with JUnit parsing
│   ├── reward.py           # Composite reward: security × health + efficiency
│   ├── grader.py           # OpenEnv grader interface
│   ├── gitleaks_eval.py    # Spec-aligned wrapper with git integrity reporting
│   └── health_eval.py      # Spec-aligned wrapper with failure messaging
├── tasks/
│   ├── easy/task_01..05/   # 5 easy tasks (single-file, surface+shallow secrets)
│   ├── medium/task_06..10/ # 5 medium tasks (multi-format, surface+shallow+deep)
│   └── hard/task_11..13/   # 3 hard tasks (cascading, conflict maps, git history)
├── tests/
│   ├── test_hidden_leaks.py    # 11 tests: visibility tier logic
│   ├── test_ranked_actions.py  # 12 tests: heuristic action suggestions
│   └── test_reward.py         # 11 tests: reward formula and efficiency bonus
├── tools/
│   └── generate_tasks.py   # Generates all 13 task workspaces with seeded secrets
├── inference.py            # Baseline agent: loops 3+ tasks, calls POST /grader
├── openenv.yaml            # OpenEnv spec (tasks, required_endpoints, entry_points)
├── Dockerfile              # Python 3.11-slim + git + HEALTHCHECK
├── hf_space_entrypoint.sh  # Docker entrypoint: generate tasks → start uvicorn
├── validate-submission.sh  # Submission validator
├── requirements.txt        # Pip dependencies
└── pyproject.toml          # Package metadata

License

MIT
