AI-powered test generation agent that analyzes GitHub PR diffs, identifies untested code paths, generates targeted tests, runs them, and self-corrects through a reflexion loop.
Two modes: coverage (fill gaps in existing tests) and mutation (find bugs your tests miss).
Built with LangGraph (Python) using a Plan-and-Execute + Reflexion architecture, inspired by Meta's Automated Compliance Hardening (ACH) research.
Generates tests targeting uncovered code paths. Measures line coverage and improves it.

```
agent-forge run owner/repo --pr 42
# or explicitly:
agent-forge run owner/repo --pr 42 --mode coverage
```

Injects realistic bugs into changed code, runs existing tests to find which bugs go undetected (surviving mutants), then generates targeted killing tests to catch them. Reports a mutation score — a better quality metric than line coverage.

```
agent-forge run owner/repo --pr 42 --mode mutation
```

PR Diff → Code Analysis → Coverage Check → Test Generation → Test Execution → Self-Correction → Report
- Fetches the PR diff from GitHub
- Parses changed files with tree-sitter to extract method signatures, annotations, and dependencies
- Detects the build tool (Gradle/Maven) and existing test coverage
- Generates targeted JUnit 5 tests using GPT-4o, based on uncovered code paths
- Compiles and runs tests using `./gradlew test`
- Self-corrects — if tests fail, a critic analyzes the errors and the generator fixes them (up to 3 iterations)
[Coverage phase completes]
        ↓
Mutation Generator → Equivalence Detector → Mutation Runner
        ↓
Killing Test Generator → 3-Stage Filter → Mutation Critic → Report
- Generates mutations — realistic bugs in changed code (off-by-one, wrong operator, missing null check, etc.) using `gpt-4o-mini` at temperature 0.7, a different model/temperature than the test generator to prevent AI blind spots
- Filters equivalents — LLM-as-judge removes mutations that are logically identical to the original code
- Runs mutation tests — injects each mutation, runs all tests, and records killed vs. survived
- Generates killing tests — for each surviving mutant, writes a test that passes on the original code but fails on the mutant
- 3-stage filter — each killing test must: compile → pass original → fail mutant
- Reflexion loop — up to 2 iterations to fix rejected killing tests
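The 3-stage filter amounts to three sequential checks. In the sketch below, "code" is modeled as a callable and `compiles`/`passes_on` are stand-in predicates; none of these names come from the actual codebase.

```python
# Sketch of the 3-stage killing-test filter: a candidate test is accepted
# only if it compiles, passes on the original code, and fails on the mutant.

def three_stage_filter(test: str, original, mutant,
                       compiles, passes_on) -> tuple[bool, str]:
    if not compiles(test):
        return False, "rejected: does not compile"
    if not passes_on(test, original):
        return False, "rejected: fails on original code"
    if passes_on(test, mutant):
        return False, "rejected: does not kill the mutant"
    return True, "accepted: killing test"

# Toy example: code behavior is a function the test asserts on.
original = lambda x: x + 1
mutant = lambda x: x - 1                      # injected off-by-one-style bug
compiles = lambda test: True                  # pretend javac succeeded
passes_on = lambda test, code: code(1) == 2   # the test asserts f(1) == 2

print(three_stage_filter("testIncrement", original, mutant, compiles, passes_on))
# (True, 'accepted: killing test')
```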
- Python 3.12+
- GitHub CLI (`gh`) — authenticated
- Java 21 — for running generated tests against Java projects
- OpenAI API key — for LLM-powered test generation
```
git clone https://github.com/rkp4u/test_agent.git
cd test_agent
python3 -m venv .venv
source .venv/bin/activate
make dev
```

```
cp .env.example .env
```

Edit `.env`:

```
OPENAI_API_KEY=sk-proj-your-key-here
GITHUB_TOKEN=ghp_your-token-here  # optional, gh CLI preferred
```

```
brew install gh   # macOS
gh auth login
gh auth status    # verify
```

```
agent-forge run owner/repo --pr 42
agent-forge run owner/repo --pr 42 --mode mutation -v
```

| Option | Description | Default |
|---|---|---|
| `--mode` | `coverage` or `mutation` | `coverage` |
| `--pr`, `-p` | PR number (required) | — |
| `--max-iterations`, `-m` | Max reflexion iterations | 3 |
| `--verbose`, `-v` | Show detailed output | off |
| `--dry-run` | Analyze without running tests | off |
```
╭──────────────── Agent Forge — Test Generation ─────────────────╮
│ Repository: rkp4u/agent_demo                                   │
│ PR: #1 — Add TransactionMetrics for investigation performance  │
╰────────────────────────────────────────────────────────────────╯
✓ [1/7] Planning
✓ [2/7] Fetching PR diff (1 files changed)
✓ [3/7] Analyzing code (tree-sitter)
    └── 6 new/modified methods found
✓ [4/7] Checking existing coverage
✓ [5/7] Generating tests
    ├── 1 test files generated
    └── 7 test methods targeting uncovered paths
✓ [6/7] Running tests
    ├── 5 passed, 2 failed
    └── Entering reflexion loop...
✓ [5/7] Regenerating tests (iteration 2/3)
✓ [6/7] Running tests — 7 passed, 0 failed
✓ [7/7] Generating report
╭────────────────── Results ────────────────────╮
│ Tests generated: 7 (in 1 files)               │
│ Iterations: 2/3                               │
│ Results: 7 passed, 0 failed ✓                 │
╰───────────────────────────────────────────────╯
```
```
✓ [1-7] Coverage phase (7 tests, all passing)
✓ [8/12] Generating mutations (6 mutants)
✓ [9/12] Filtering equivalent mutants (6 remain, 0 removed)
✓ [10/12] Running tests against mutants
    ├── 6 killed, 0 survived, 0 build failures
    └── All mutants caught by existing tests!
✓ [11/12] Generating killing tests (0 needed)
✓ [12/12] Generating report
╭──────────────── Mutation Testing Results ─────────────────╮
│ Mutation score: 100% (6/6 mutants killed)                 │
│ Killed by existing: 6                                     │
│ Killed by new tests: 0                                    │
│ Still surviving: 0                                        │
│ Equivalent (filtered): 0                                  │
╰───────────────────────────────────────────────────────────╯
```
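The score in the sample run above is the fraction of non-equivalent mutants that some test killed. A minimal helper (written here for illustration; not taken from the codebase):

```python
# Mutation score = killed mutants / (killed + surviving). Mutants removed by
# the equivalence detector never enter the denominator.

def mutation_score(killed: int, survived: int) -> float:
    total = killed + survived
    return 100.0 * killed / total if total else 0.0

print(f"{mutation_score(6, 0):.0f}%")   # 100% — matches the sample run
print(f"{mutation_score(9, 3):.0f}%")   # 75%
```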
```
START → Planner → Diff Fetcher → Code Analyzer → Coverage Checker
  → Test Generator → Test Runner → Critic
         ↑ (fail, max 3)       ↓ (pass)
   Test Generator         [MODE ROUTER]
                          ↓           ↓
                    (coverage)   (mutation)
                          ↓           ↓
                     Reporter    Mutation Generator
                          ↓           ↓
                         END     Equivalence Detector
                                      ↓
                                 Mutation Runner
                                      ↓
                                 Killing Test Generator
                                      ↓
                                 Killing Test Runner
                                      ↓
                                 Mutation Critic
                           ↑ (retry, max 2)   ↓ (done)
                   Killing Test Generator   Reporter → END
```
Key design decisions:

- LangGraph for a stateful workflow with conditional edges and two independent reflexion loops
- Different models per phase — the mutation generator uses `gpt-4o-mini` at temp 0.7, the test generator uses `gpt-4o` at temp 0.2, which prevents AI blind spots
- String-based mutation injection — replaces exact code snippets rather than line numbers for robustness
- 3-stage filter for killing tests — compile → pass original → fail mutant (eliminates false positives)
- tree-sitter for multi-language AST parsing
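The `mode_router` conditional edge can be illustrated with a hand-rolled sketch: after the coverage phase, the chosen mode decides whether the run ends at the reporter or continues into the mutation pipeline. This is illustrative only; the real implementation is a LangGraph `StateGraph`, and the node table below is a toy stand-in for its edges.

```python
# Hand-rolled sketch of the mode_router branch and the mutation pipeline's
# linear spine. Not the actual graph.py, which uses LangGraph.

def mode_router(state: dict) -> str:
    return "mutation_generator" if state["mode"] == "mutation" else "reporter"

NEXT = {
    "mutation_generator": "equivalence_detector",
    "equivalence_detector": "mutation_runner",
    "mutation_runner": "killing_test_generator",
    "killing_test_generator": "killing_test_runner",
    "killing_test_runner": "mutation_critic",
    "mutation_critic": "reporter",
    "reporter": "END",
}

def run_from_router(state: dict) -> list[str]:
    node, path = mode_router(state), []
    while node != "END":
        path.append(node)
        node = NEXT[node]
    return path

print(run_from_router({"mode": "coverage"}))     # ['reporter']
print(run_from_router({"mode": "mutation"}))     # full mutation pipeline
```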
```
src/agent_forge/
├── cli/
│   ├── app.py                       # Commands: run, analyze, version. --mode flag
│   └── display.py                   # Rich panels, mutation score tables, surviving mutants
├── config/
│   └── settings.py                  # All settings including mutation model/temperature
├── engine/
│   ├── graph.py                     # LangGraph StateGraph with mode_router conditional edge
│   ├── state.py                     # AgentState — 20+ fields including mutation state
│   ├── nodes/
│   │   ├── planner.py
│   │   ├── diff_fetcher.py
│   │   ├── code_analyzer.py
│   │   ├── coverage_checker.py
│   │   ├── test_generator.py
│   │   ├── test_runner.py
│   │   ├── critic.py
│   │   ├── reporter.py
│   │   ├── mutation_generator.py    # Generates realistic bugs via LLM
│   │   ├── equivalence_detector.py  # LLM-as-judge filtering
│   │   ├── mutation_runner.py       # Injects mutations, runs tests
│   │   ├── killing_test_generator.py # Generates bug-catching tests
│   │   ├── killing_test_runner.py   # 3-stage filter pipeline
│   │   └── mutation_critic.py       # Rule-based reflexion feedback
│   └── prompts/
│       ├── test_generation.py
│       └── mutation.py              # 6 prompt builders for mutation pipeline
├── tools/
│   ├── github/                      # GitHub client (gh CLI + PyGithub fallback)
│   ├── analysis/                    # tree-sitter AST analyzer
│   └── runners/
│       ├── gradle.py
│       └── mutation_injector.py     # Context manager: inject → test → restore
└── models/
    ├── ...
    └── mutation.py                  # Mutant, MutationRunResult, KillingTestResult
```
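The inject → test → restore idea behind `mutation_injector.py` can be sketched as a context manager that swaps an exact code snippet and guarantees the file is restored even if the test run raises. This is a minimal sketch of the pattern, not the actual implementation.

```python
# String-based mutation injection as a context manager: replace the exact
# original snippet, yield for the test run, always restore the file.
import contextlib
import pathlib
import tempfile

@contextlib.contextmanager
def inject_mutation(path: pathlib.Path, original: str, mutated: str):
    source = path.read_text()
    if original not in source:
        raise ValueError("original snippet not found; cannot inject")
    path.write_text(source.replace(original, mutated, 1))
    try:
        yield
    finally:
        path.write_text(source)   # restore the pristine file, no matter what

# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    f = pathlib.Path(tmp) / "Calc.java"
    f.write_text("int next = i + 1;")
    with inject_mutation(f, "i + 1", "i - 1"):
        print(f.read_text())      # int next = i - 1;
    print(f.read_text())          # int next = i + 1;
```

Matching on the exact snippet rather than a line number keeps injection robust when surrounding lines shift between the diff and the checked-out file.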
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | (required) |
| `GITHUB_TOKEN` | GitHub PAT (fallback if gh CLI unavailable) | (optional) |
| `AGENT_FORGE_MODEL` | Model for test generation | `gpt-4o` |
| `AGENT_FORGE_TEMPERATURE` | Temperature for test generation | 0.2 |
| `AGENT_FORGE_MAX_REFLEXION_ITERATIONS` | Coverage reflexion iterations | 3 |
| `AGENT_FORGE_TEST_TIMEOUT_SECONDS` | Test execution timeout (seconds) | 300 |
| `AGENT_FORGE_MUTATION_MODEL` | Model for mutation generation | `gpt-4o-mini` |
| `AGENT_FORGE_MUTATION_TEMPERATURE` | Temperature for mutation generation | 0.7 |
| `AGENT_FORGE_EQUIVALENCE_MODEL` | Model for equivalence detection | `gpt-4o-mini` |
| `AGENT_FORGE_MAX_MUTANTS_PER_PR` | Cap on mutations generated per PR | 12 |
| `AGENT_FORGE_MAX_MUTATION_ITERATIONS` | Killing test reflexion iterations | 2 |
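The variables above follow the usual env-with-default pattern; a minimal sketch of reading them is shown below. This is illustrative only: the real `config/settings.py` may well use pydantic-settings or another mechanism, and only the variable names and defaults come from the table.

```python
# Sketch of env-var settings with the documented defaults.
import os

def setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

MODEL = setting("AGENT_FORGE_MODEL", "gpt-4o")
MUTATION_MODEL = setting("AGENT_FORGE_MUTATION_MODEL", "gpt-4o-mini")
MUTATION_TEMPERATURE = float(setting("AGENT_FORGE_MUTATION_TEMPERATURE", "0.7"))
MAX_MUTANTS = int(setting("AGENT_FORGE_MAX_MUTANTS_PER_PR", "12"))

print(MODEL, MUTATION_MODEL, MUTATION_TEMPERATURE, MAX_MUTANTS)
```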
| Node | Model | Temp | Rationale |
|---|---|---|---|
| Test generator | gpt-4o | 0.2 | Precision — tests must be syntactically correct |
| Mutation generator | gpt-4o-mini | 0.7 | Creative — a different model prevents AI blind spots |
| Equivalence detector | gpt-4o-mini | 0.0 | Cheap binary classification |
| Killing test generator | gpt-4o | 0.2 | Precision — must compile and pass the 3-stage filter |
Using the same model that writes tests to also generate mutations creates a systematic blind spot: it tends to generate bugs the model already "knows" to avoid. Separating the models is key to mutation-testing effectiveness.
| Language | AST Parsing | Test Generation | Mutation Testing |
|---|---|---|---|
| Java (JUnit 5 + Gradle) | ✅ tree-sitter | ✅ GPT-4o | ✅ |
| Python (pytest) | Planned | Planned | Planned |
| TypeScript (Jest) | Planned | Planned | Planned |
| Kotlin (JUnit 5) | Planned | Planned | Planned |
- JaCoCo coverage collection (real line-level coverage deltas)
- Maven runner support
- Python + TypeScript language handlers
- `--output json` for CI integration
- GitHub Actions integration — post the mutation score as a PR comment
- Report persistence and an `agent-forge report <id>` command
- Penetration testing profile (semgrep-based static analysis)
```
make test       # Unit tests
make lint       # Check with ruff
make format     # Auto-fix with ruff
make typecheck  # mypy
```

Apache 2.0