SecretsAuditEnv

A reinforcement-learning benchmark that drops an AI agent into a git-backed codebase seeded with realistic secret leaks — hardcoded API keys, base64-encoded tokens, credentials buried in git history — and grades how quickly and safely the agent remediates every one. The environment runs as a stateless FastAPI server; the agent interacts over HTTP with bash commands and structured inspection actions, receiving a composite reward after every step that captures security progress, code health, and time efficiency.

Why This Exists

Production secret leaks remain a top-5 cause of cloud breaches. Developers commit API keys, push .env files, or leave credentials in migration scripts — then scramble to rotate and rewrite history. Existing linting tools flag secrets but don't fix them; LLM agents can, but there's no standardized benchmark to measure how well. SecretsAuditEnv fills that gap with a 13-task curriculum spanning trivial single-file fixes to multi-service cascading leaks with git-history rewriting, complete with a deterministic grading pipeline and an anti-gaming reward function.


Task Curriculum

All 13 tasks are defined in tasks/ with full metadata in each task.json. Secret types are detected by graders/security.py using regex patterns for AWS keys, GitHub tokens, Firebase keys, connection strings, private keys, SQL passwords, assignment secrets, and base64-encoded variants.

| ID | Difficulty | Title | Description | Scan Mode | Visibility Tiers | Conflict Map |
|---|---|---|---|---|---|---|
| 1 | Easy | Cloud Provisioning | Hardcoded AWS Access Key in config.py | dir | surface, surface, shallow | |
| 2 | Easy | Database Layer | Password embedded in a raw SQL connection string | dir | surface, surface, shallow | |
| 3 | Easy | Frontend Config | Firebase API key exposed in a client-side config file | dir | surface, surface, shallow | |
| 4 | Easy | System Logging | Debug logging leaks a user token | dir | surface, surface, shallow | |
| 5 | Easy | Git Basics | A tracked .env file leaks credentials in the working tree | dir | surface, surface, shallow | |
| 6 | Medium | Utility Module | A base64-encoded auth token hides in utils.py | dir | surface, shallow, deep | |
| 7 | Medium | CI/CD Pipeline | A deployment workflow prints a secret directly to logs | dir | surface, shallow, deep | |
| 8 | Medium | Noise Filtering | High-entropy dummy values are mixed with one real secret in TOML | dir | surface, shallow, deep | |
| 9 | Medium | DB Migration | A legacy migration embeds administrator credentials | dir | surface, shallow, deep | |
| 10 | Medium | Deployment | A multiline RSA private key is embedded in a shell script | dir | surface, shallow, deep | |
| 11 | Hard | Microservices | The same API key is duplicated across five services | dir | surface, deep, cascading | |
| 12 | Hard | Deep Logic | A secret is embedded as a local variable inside a function | dir | surface, deep, cascading | |
| 13 | Hard | Legacy Audit | A secret was committed in v1.0 and still exists in Git history | git | surface, deep, cascading | |

Scan modes: dir scans the working directory only. git scans the full git commit history (all revisions), so agents must use git filter-repo to clean history — deleting files won't work.


Secret Visibility Tiers

Secrets are not all visible at episode start. The environment implements a 4-tier progressive disclosure system defined per-secret in task.json:

| Tier | When Visible | Typical Use |
|---|---|---|
| SURFACE | Immediately on /reset | Obvious hardcoded keys in source files |
| SHALLOW | After agent calls inspect_file <path> | Secrets that require reading the file to notice |
| DEEP | After inspect_git_history or inspect_encoded | Secrets in git commits or base64-encoded blobs |
| CASCADING | After a specified trigger secret is fixed | Secrets that only become relevant after another is remediated |

Hard tasks (11–13) include conflict maps that encode dependencies: fixing secret s1 may reveal s3, while s2 may block s3 until resolved.
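The tier rules above can be sketched as a small visibility filter. This is an illustration only: the `Secret` and `EpisodeState` classes, their field names, and the single `deep_unlocked` flag are assumptions made for the example, not the schema used in task.json or server/environment.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Secret:
    # Hypothetical per-secret record; the real task.json layout may differ.
    id: str
    tier: str                      # "surface", "shallow", "deep" or "cascading"
    path: str
    trigger: Optional[str] = None  # id of the secret that must be fixed first


@dataclass
class EpisodeState:
    inspected_files: set = field(default_factory=set)
    deep_unlocked: bool = False    # simplification: one flag for all DEEP secrets
    fixed: set = field(default_factory=set)


def is_visible(secret: Secret, state: EpisodeState) -> bool:
    """Apply the four tier rules to a single secret."""
    if secret.tier == "surface":
        return True
    if secret.tier == "shallow":
        return secret.path in state.inspected_files
    if secret.tier == "deep":
        return state.deep_unlocked
    if secret.tier == "cascading":
        return secret.trigger in state.fixed
    return False


def visible_secrets(secrets, state):
    return [s for s in secrets if is_visible(s, state)]


def hidden_count_hint(secrets, state):
    return len(secrets) - len(visible_secrets(secrets, state))
```

In this sketch, fixing `s1` is what makes a cascading secret with `trigger="s1"` appear, which mirrors the "s1 may reveal s3" dependency described above.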


Reward Formula

Defined in graders/reward.py. Computed fresh after every /step:

detection_score = (initial_leaks - current_leaks) / initial_leaks
base            = 0.4 × detection_score + 0.6 × detection_score
health_score    = pytest_passed / pytest_total
efficiency      = 0.15 × max(0, 1 - steps_taken / step_budget)

if base > 0:
    total_reward = min(0.999, base × health_score + efficiency)
else:
    total_reward = 0.001

# Final clamp: validator requires strict (0, 1)
total_reward = max(0.001, min(0.999, total_reward))

Key properties:

  • Reward is strictly in (0, 1) — never exactly 0.0 or 1.0 (required by OpenEnv validator)
  • Reward starts at 0.001 until the agent actually fixes something
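  • Health gate: health_score multiplies base, so a broken test suite drags the reward down toward the efficiency bonus alone (and to the 0.001 floor once that bonus is spent)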
  • Efficiency bonus (0.0–0.15): rewards agents that solve in fewer steps
  • Step budgets: Easy = 10, Medium = 20, Hard = 30
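A minimal Python sketch of the formula above may make the clamping behaviour easier to check by hand. The function name, its signature, and the fallback for degenerate inputs are assumptions for illustration; the actual code lives in graders/reward.py and may differ.

```python
def composite_reward(initial_leaks, current_leaks, pytest_passed, pytest_total,
                     steps_taken, step_budget):
    """Composite reward following the formula in this section (illustrative)."""
    if initial_leaks <= 0 or pytest_total <= 0 or step_budget <= 0:
        # Assumption: degenerate inputs fall back to the reward floor.
        return 0.001

    detection_score = (initial_leaks - current_leaks) / initial_leaks
    # As written, both weighted terms use the same quantity,
    # so base is numerically equal to detection_score.
    base = 0.4 * detection_score + 0.6 * detection_score
    health_score = pytest_passed / pytest_total
    efficiency = 0.15 * max(0.0, 1 - steps_taken / step_budget)

    if base > 0:
        total = base * health_score + efficiency
    else:
        total = 0.001

    # Final clamp keeps the reward strictly inside (0, 1).
    return max(0.001, min(0.999, total))
```

For example, fixing one of two leaks with all tests passing and the step budget exhausted yields 0.5, while a full fix with a healthy suite saturates at the 0.999 ceiling.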

Action Space

Actions are sent as the action field in POST /step. Two categories:

Structured Actions (intercepted before bash)

| Action | Format | Effect |
|---|---|---|
| inspect_file | inspect_file <path> | Marks file as inspected → unlocks SHALLOW secrets for that path |
| inspect_git_history | inspect_git_history [path] | Scans git history → unlocks all DEEP secrets requiring git inspection |
| inspect_encoded | inspect_encoded <path> [line] | Decodes base64 blobs → unlocks DEEP encoded secrets for that path |

Bash Commands (executed in workspace via /usr/bin/bash -lc)

Any string that doesn't match a structured prefix is executed as a bash command in the task workspace. Common patterns:

  • cat config.py — read file contents
  • sed -i 's/AKIA.../os.getenv("AWS_KEY")/' config.py — redact a secret
  • git filter-repo --replace-text <(echo 'ghp_xxx==>REDACTED') --force — clean git history
  • gitleaks detect --no-git --source . — run leak scanner

Commands time out after 90 seconds. Exit code, stdout, and stderr are returned in the observation.
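The split between structured actions and bash commands could look roughly like the sketch below. It assumes that a structured action is recognised by its first whitespace-separated token, and the `route_action` and `run_bash` helpers and their return shapes are hypothetical; the real parsing happens in server/environment.py.

```python
import subprocess

STRUCTURED_ACTIONS = ("inspect_file", "inspect_git_history", "inspect_encoded")


def route_action(action):
    """Classify an action string as structured or bash (illustrative)."""
    tokens = action.strip().split()
    if tokens and tokens[0] in STRUCTURED_ACTIONS:
        return {"kind": "structured", "name": tokens[0], "args": tokens[1:]}
    return {"kind": "bash", "command": action}


def run_bash(command, workspace, timeout=90):
    """Run a command the way this section describes: bash -lc with a 90 s limit."""
    try:
        proc = subprocess.run(
            ["/usr/bin/bash", "-lc", command],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Assumption: partial output is dropped when the command times out.
        return {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "timed_out": False,
    }
```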


Observation Keys

Every /step and /reset response returns a session object with these fields (also listed in openenv.yaml):

| Key | Type | Description |
|---|---|---|
| visible_secrets | list[dict] | Secrets the agent can currently see (filtered by visibility tier) |
| hidden_count_hint | int | Number of secrets not yet visible; tells the agent more exist |
| ranked_actions | list[dict] | Top-5 heuristic action suggestions sorted by priority (0.0–1.0) |
| top_blocker | string | One-sentence description of the highest-priority next action |
| step_budget | int | Total step budget for this difficulty tier |
| steps_taken | int | Steps consumed so far |
| steps_remaining | int | step_budget - steps_taken |
| efficiency_bonus | float | Current efficiency bonus value (decays each step) |
| conflict_map | dict | Dependency graph between visible secrets (reveals/blocks relationships) |
| security_score | float | Fraction of initial leaks fixed (0.0–1.0) |
| health_score | float | Fraction of pytest tests passing (0.0–1.0) |
| reward | float | Composite reward after this step |
| observation | string | Combined stdout/stderr from last command + health/security messages |
| last_result | dict | Raw action, exit_code, stdout, stderr, timed_out from last command |

Ranked Actions

Generated by server/observation.py using 5 heuristics:

  1. Visible unfixed secrets → priority 0.92 (fix these first)
  2. Uninspected high-risk files → priority based on filename suspicion score
  3. Git history not yet scanned → priority 0.75 (medium/hard only)
  4. Hidden secrets remaining → priority 0.68 (suggest encoded inspection)
  5. All visible fixed → priority 0.85 (run gitleaks validation)
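The five heuristics above can be sketched as a single ranking function. Only the fixed priorities (0.92, 0.75, 0.68, 0.85) come from this section; the `suspicion_score` keyword list, the function signature, and the shape of each suggestion dict are assumptions made for illustration and are not taken from server/observation.py.

```python
def suspicion_score(path):
    """Hypothetical filename heuristic: more suspicious names score higher."""
    hints = (".env", "config", "secret", "credential", "deploy", "migration")
    hits = sum(hint in path.lower() for hint in hints)
    return min(0.9, 0.5 + 0.1 * hits)


def rank_actions(visible_unfixed, uninspected_files, git_scanned,
                 difficulty, hidden_count, top_n=5):
    """Return up to top_n suggestions sorted by descending priority."""
    out = []
    for secret_id in visible_unfixed:                      # heuristic 1
        out.append({"priority": 0.92,
                    "suggestion": f"fix visible secret {secret_id}"})
    for path in uninspected_files:                         # heuristic 2
        out.append({"priority": suspicion_score(path),
                    "suggestion": f"inspect_file {path}"})
    if not git_scanned and difficulty in ("medium", "hard"):  # heuristic 3
        out.append({"priority": 0.75, "suggestion": "inspect_git_history"})
    if hidden_count > 0:                                   # heuristic 4
        out.append({"priority": 0.68, "suggestion": "inspect_encoded <path>"})
    if not visible_unfixed:                                # heuristic 5
        out.append({"priority": 0.85,
                    "suggestion": "gitleaks detect --no-git --source ."})
    return sorted(out, key=lambda a: a["priority"], reverse=True)[:top_n]
```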

API Endpoints

| Method | Path | Tag | Description |
|---|---|---|---|
| GET | /health | OpenEnv | Returns {"status": "healthy"} |
| GET | /metadata | OpenEnv | Environment metadata (name, version, task count) |
| GET | /schema | OpenEnv | Action/observation/state JSON schemas |
| GET | /tasks | OpenEnv | Lists all 13 tasks with task_id, action_schema |
| POST | /grader | OpenEnv | Grades a completed episode: {task_id, step_rewards, step_infos} → {score} |
| POST | /baseline | OpenEnv | Runs baseline evaluation across 3 tasks |
| POST | /mcp | OpenEnv | JSON-RPC 2.0 initialize support |
| POST | /reset | Core | Reset to a task: {task_id: N} → full session state |
| POST | /step | Core | Execute an action: {action: "..."} → updated session state |
| GET | /state | Core | Current session state |
| GET | /web | Debug | Browser-based debug dashboard |
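As a rough illustration of how a client might drive the Core endpoints, the sketch below uses only the Python standard library. The helper names (`build_request`, `call`, `run_episode`) are hypothetical, and the payload shapes follow the table above ({task_id: N} for /reset, {action: "..."} for /step); consult GET /schema for the authoritative schemas.

```python
import json
import urllib.request

ENV_URL = "http://localhost:7860"  # default server URL used in this README


def build_request(method, path, payload=None, base_url=ENV_URL):
    """Build a JSON request for one of the endpoints listed above."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def call(method, path, payload=None, base_url=ENV_URL, timeout=120):
    """Send the request and decode the JSON response body."""
    request = build_request(method, path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(task_id, actions):
    """Reset to a task, replay a fixed list of actions, collect rewards."""
    state = call("POST", "/reset", {"task_id": task_id})
    rewards = []
    for action in actions:
        state = call("POST", "/step", {"action": action})
        rewards.append(state["reward"])
    return state, rewards
```

With the server running locally, `run_episode(1, ["cat config.py"])` would reset to task 1 and send a single bash action.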

Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Generate task workspaces
python tools/generate_tasks.py

# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Open the debug UI
open http://localhost:7860/web

# Run the agent (loops over 3 tasks by default)
export API_BASE_URL="https://openrouter.ai/api/v1"
export HF_TOKEN="your-api-key"
export MODEL_NAME="Qwen/Qwen2.5-72b-instruct"
export ENV_URL="http://localhost:7860"
python inference.py

# Or specify tasks explicitly
python inference.py --task-id task_1,task_2,task_3,task_6

Docker

docker build -t secretsauditenv .
docker run -p 7860:7860 secretsauditenv
# Server auto-generates tasks and starts on port 7860

The Dockerfile builds on the Python 3.11-slim base image and installs bash, git, git-filter-repo, and all pip dependencies.


inference.py — Agent Loop

The baseline agent (inference.py) loops over 3+ tasks (required by OpenEnv validator), running a ReAct loop for each:

  1. For each task (task_1, task_2, task_3, ...):
    • Emit [START] task=<id> env=secrets_audit model=<model>
    • POST /reset with the task ID
    • Auto-inject: automatically runs cat <primary_file> to give the LLM context before it starts
    • Per-difficulty step cap: Easy=8, Medium=15, Hard=25 (below the environment budgets of 10/20/30, so a run that finishes within the cap still earns some efficiency bonus)
    • For each step:
      • Build structured prompt with task-specific hints → call LLM → normalize action → POST /step
      • Emit [STEP] step=N action=... reward=0.XX done=false error=null
    • Call POST /grader with {task_id, step_rewards, step_infos} to get official score
    • Emit [END] success=true steps=N score=0.XXX rewards=0.XX,0.XX,...
  2. The validator counts [START]/[END] pairs and checks each score is in (0, 1)
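The log lines in the loop above can be sketched as three formatters plus a simple check in the spirit of the validator. The numeric precision (two decimals for rewards, three for the score) is inferred from the placeholders 0.XX and 0.XXX, and `scores_valid` is a hypothetical stand-in, not the validator's actual logic.

```python
import re


def format_start(task_id, model, env="secrets_audit"):
    return f"[START] task={task_id} env={env} model={model}"


def format_step(step, action, reward, done, error=None):
    error_text = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error_text}")


def format_end(success, steps, score, rewards):
    reward_list = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.3f} rewards={reward_list}")


def scores_valid(log_text):
    """Check that [START]/[END] lines pair up and every score is in (0, 1)."""
    starts = re.findall(r"^\[START\]", log_text, flags=re.M)
    scores = [float(s) for s in
              re.findall(r"^\[END\].*\bscore=([0-9.]+)", log_text, flags=re.M)]
    return len(starts) == len(scores) and all(0 < s < 1 for s in scores)
```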

Environment variables:

| Variable | Required | Description |
|---|---|---|
| API_BASE_URL | Yes | OpenAI-compatible API endpoint |
| HF_TOKEN | Yes | API key for the model provider |
| MODEL_NAME | Yes | Model identifier |
| ENV_URL | No | Server URL, defaults to http://localhost:7860 |
| TASK_ID | No | Comma-separated task list, defaults to task_1,task_2,task_3 |

Prompt engineering:

  • Task-specific hints for Task 13 (git history leak — instructs agent to use git filter-repo)
  • Partial-fix detection — when reward is ~0.5, tells agent git history still leaks
  • Urgency escalation — after 5+ steps with no progress, forces direct fix attempt

Anti-loop features:

  • Detects 3 consecutive identical actions with identical rewards → forces a different command
  • Smart atomic enforcement — rejects && and ; chaining but allows ${VAR} in sed patterns
  • Ensures minimum 3 tasks are always run (auto-pads if fewer specified)
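Two of the anti-loop checks above can be sketched in a few lines. This is one possible approach, not the logic in inference.py: `is_stuck` and `has_chaining` are hypothetical names, and tokenising with `shlex` is an assumed way to ignore `;` and `${VAR}` that appear inside quoted sed patterns.

```python
import shlex


def is_stuck(history, window=3):
    """True when the last `window` (action, reward) pairs are all identical."""
    if len(history) < window:
        return False
    return len(set(history[-window:])) == 1


def has_chaining(command):
    """True when an unquoted `&&` or `;` operator appears in the command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    try:
        return any(token in ("&&", ";") for token in lexer)
    except ValueError:
        # Assumption: commands with unbalanced quotes are rejected as well.
        return True
```

Because quoted text stays inside a single token, `sed -i 's/a;b/${VAR}/' f.py` passes the check while `ls; pwd` does not.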

Grading Pipeline

Security Grader (graders/security.py)

Custom regex-based scanner (no external tools required). Detects:

  • AWS Access Keys (AKIA...)
  • GitHub tokens (ghp_...)
  • Firebase API keys (AIza...)
  • Service tokens (tok_live_..., sk_test_...)
  • Private keys (PEM format)
  • SQL connection strings (postgres://user:pass@host/db)
  • SQL passwords (PASSWORD 'value')
  • High-entropy assignment secrets (Shannon entropy ≥ 3.2)
  • Base64-encoded variants of all the above

Supports .gitleaks.toml allowlists. For scan_mode: git, scans all commits via git rev-list --all.

Anti-gaming: if an agent deletes .git, the grader recovers by creating a fresh snapshot and scanning that — the secret still gets found.
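A stripped-down version of such a scanner might look like the sketch below. The regular expressions are widely used community patterns for these token formats, not the ones in graders/security.py, and the `scan_text` and `looks_high_entropy` helpers are hypothetical; only the 3.2 entropy threshold is taken from the list above.

```python
import base64
import math
import re
from collections import Counter

# Assumed patterns for illustration; the real grader may use different ones.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "firebase_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "connection_string": re.compile(r"postgres://[^:\s]+:[^@\s]+@\S+"),
}
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def shannon_entropy(value):
    """Shannon entropy in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def looks_high_entropy(value, threshold=3.2):
    return shannon_entropy(value) >= threshold


def _match_all(text, prefix=""):
    return [(prefix + name, m.group(0))
            for name, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]


def scan_text(text):
    """Return (kind, match) pairs for plain and base64-encoded secrets."""
    findings = _match_all(text)
    for blob in BASE64_BLOB.findall(text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8")
        except ValueError:  # covers invalid base64 and non-UTF-8 payloads
            continue
        findings.extend(_match_all(decoded, prefix="base64:"))
    return findings
```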

Health Grader (graders/health.py)

Runs pytest -q --junitxml in the task workspace and parses the JUnit XML report. Score = passed / total. The score is 0.0 if the report contains any errors or if pytest times out (60s default).
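The scoring rule can be sketched as a small parser over the JUnit XML that pytest writes. The function name is hypothetical, and counting skipped tests as not passed is an assumption; graders/health.py may treat them differently.

```python
import xml.etree.ElementTree as ET


def health_from_junit(xml_text):
    """Compute passed / total from a JUnit XML report (illustrative)."""
    root = ET.fromstring(xml_text)
    # pytest wraps results in <testsuites>; accept a bare <testsuite> too.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")

    total = sum(int(s.get("tests", 0)) for s in suites)
    errors = sum(int(s.get("errors", 0)) for s in suites)
    failures = sum(int(s.get("failures", 0)) for s in suites)
    skipped = sum(int(s.get("skipped", 0)) for s in suites)

    if total == 0 or errors > 0:
        return 0.0  # any error zeroes the score, as described above
    passed = total - failures - skipped  # assumption: skipped is not passed
    return passed / total
```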


Web Debug UI

Available at GET /web. A single-page dark-mode dashboard served from server/web_ui.html that lets you:

  • Start any of the 13 tasks with one click
  • Send structured actions or raw bash commands
  • View real-time metrics (reward, leaks, hidden count, steps, efficiency, health)
  • See ranked action suggestions and conflict maps
  • Browse visible secrets and full observation text
  • Review action history

No external dependencies — pure HTML/CSS/JS.


Validation

# Run the environment test suite
python -m pytest tests/ -v

# Run openenv validation (6/6 checks)
openenv validate http://localhost:7860

# Run the full submission validator
bash validate-submission.sh

OpenEnv validator checks (all passing ✅):

  1. GET /health returns {"status": "healthy"}
  2. GET /metadata returns valid metadata
  3. GET /schema returns action/observation schemas
  4. GET /tasks returns task list with task_id fields
  5. POST /grader accepts {task_id, step_rewards} and returns {score} in (0, 1)
  6. POST /mcp responds to JSON-RPC 2.0 initialize

Deep validation checks (Phase 2):

  • At least 3 tasks with working graders
  • All task scores strictly between 0 and 1 (not 0.0 or 1.0)
  • inference.py loops over 3+ tasks, emitting [START]/[END] per task

Project Structure

.
├── server/
│   ├── app.py              # FastAPI routes (OpenEnv + Core + Debug)
│   ├── environment.py      # Core environment logic, visibility tiers, action parsing
│   ├── observation.py      # Ranked action heuristics and top_blocker computation
│   └── web_ui.html         # Debug dashboard (served at /web)
├── graders/
│   ├── security.py         # Regex-based secret scanner with git history support
│   ├── health.py           # Pytest-based health grader with JUnit parsing
│   ├── reward.py           # Composite reward: security × health + efficiency
│   ├── grader.py           # OpenEnv grader interface
│   ├── gitleaks_eval.py    # Spec-aligned wrapper with git integrity reporting
│   └── health_eval.py      # Spec-aligned wrapper with failure messaging
├── tasks/
│   ├── easy/task_01..05/   # 5 easy tasks (single-file, surface+shallow secrets)
│   ├── medium/task_06..10/ # 5 medium tasks (multi-format, surface+shallow+deep)
│   └── hard/task_11..13/   # 3 hard tasks (cascading, conflict maps, git history)
├── tests/
│   ├── test_hidden_leaks.py    # 11 tests: visibility tier logic
│   ├── test_ranked_actions.py  # 12 tests: heuristic action suggestions
│   └── test_reward.py         # 11 tests: reward formula and efficiency bonus
├── tools/
│   └── generate_tasks.py   # Generates all 13 task workspaces with seeded secrets
├── inference.py            # Baseline agent: loops 3+ tasks, calls POST /grader
├── openenv.yaml            # OpenEnv spec (tasks, required_endpoints, entry_points)
├── Dockerfile              # Python 3.11-slim + git + HEALTHCHECK
├── hf_space_entrypoint.sh  # Docker entrypoint: generate tasks → start uvicorn
├── validate-submission.sh  # Submission validator
├── requirements.txt        # Pip dependencies
└── pyproject.toml          # Package metadata

License

MIT
