@honix (Owner) commented Jan 1, 2026

No description provided.

claude bot commented Jan 1, 2026

PR Review: E2E Testing with Claude CLI & Docker

This is an excellent addition to the project! The implementation of AI-powered E2E testing using Claude CLI with Chrome MCP is innovative and well-structured. Here's my detailed feedback:


✅ Strengths

Architecture & Design

  • Clean separation: Test infrastructure (Docker), test runner (claude_runner.py), and test cases are well-organized
  • Reusable fixtures: The conftest.py provides excellent Docker Compose fixtures with proper lifecycle management
  • Structured output: Using JSON schema for test results ensures consistent, parseable output
  • Comprehensive test fixtures: The wiki fixtures in tests/fixtures/wiki/ provide good coverage for testing

Docker Implementation

  • Proper layer caching: Dependencies installed before code copy in all Dockerfiles
  • Health checks: All services have appropriate health checks with reasonable intervals
  • Security: Test backend uses non-root user (appuser) - good practice
  • Resource cleanup: .dockerignore files prevent bloat and speed up builds

Code Quality

  • Type hints: Good use of type annotations (Optional, dataclasses)
  • Documentation: Clear docstrings and inline comments
  • Error handling: Comprehensive exception handling in claude_runner.py

🐛 Potential Issues

1. Hard-coded URLs (tests/conftest.py:63-64)

frontend_url = "http://localhost:5173"
backend_url = "http://localhost:8000/health"

Issue: These should be configurable or derived from environment variables.

Risk: If the ports are already in use, tests will fail silently or connect to the wrong services.

Recommendation: Use testcontainers' port mapping or make configurable:

# Get actual mapped ports from containers
backend_port = compose.get_service_port("backend", 8000)
frontend_port = compose.get_service_port("frontend", 5173)
frontend_url = f"http://localhost:{frontend_port}"

2. Missing subprocess error output (tests/claude_runner.py:76)

output = result.stdout.strip()

Issue: stderr is captured but never used or logged, making debugging difficult when tests fail.

Recommendation: Include stderr in raw_output or log it:

output = result.stdout.strip()
if result.stderr:
    print(f"stderr: {result.stderr}", file=sys.stderr)
# Or include in raw_output:
raw_output = f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"

3. Race condition in wait_for_url (tests/conftest.py:61)

The 5-second sleep after healthchecks is a smell:

# Give a bit more time after healthchecks pass
time.sleep(5)

Issue: This suggests healthchecks might not be fully reliable or application startup has additional delays.

Recommendation: Either:

  • Improve healthchecks to check application readiness (e.g., check database connection, not just HTTP 200)
  • Use polling with exponential backoff instead of fixed sleep (see the sketch after this list)
  • Document why this is needed
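
For example, something along these lines (the helper name mirrors wait_for_url above; the defaults and the 8-second backoff cap are illustrative, not the PR's values):

import time
import urllib.error
import urllib.request


def wait_for_url(url: str, timeout: float = 60.0, initial_interval: float = 0.5) -> None:
    """Poll url until a 2xx response, backing off exponentially; raise on deadline."""
    deadline = time.monotonic() + timeout
    interval = initial_interval
    last_error: Exception | None = None
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if 200 <= response.status < 300:
                    return
        except (urllib.error.URLError, OSError) as e:
            last_error = e
        time.sleep(interval)
        interval = min(interval * 2, 8.0)  # cap the backoff so polling stays responsive
    raise RuntimeError(f"{url} not ready after {timeout}s (last error: {last_error})")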

4. Overly permissive error handling (tests/conftest.py:24-26)

except urllib.error.HTTPError as e:
    # Any HTTP response (including 404) means server is up
    return True

Issue: A 404 or 500 error doesn't mean the service is actually ready for testing.

Recommendation: Only treat a 2xx response as ready. Since urlopen returns normally for 2xx and raises HTTPError for everything else, the handler should record the error and keep polling rather than return True:

except urllib.error.HTTPError as e:
    # urlopen raises HTTPError for non-2xx responses, so the
    # service is not ready yet; record the error and retry
    last_error = e
    time.sleep(interval)

5. Git config in Docker build (tests/Dockerfile.backend:22-23)

RUN git config --global user.email "[email protected]" && \
    git config --global user.name "Test User"

Issue: This runs as root before switching to appuser, so the config is written to /root/.gitconfig, which git never reads when running as appuser.

Recommendation: Run git config after USER directive or use --system flag:

USER appuser
RUN git config --global user.email "[email protected]" && \
    git config --global user.name "Test User"

⚠️ Security Concerns

1. API Keys in Environment (Moderate)

  • API keys passed via environment variables in docker-compose
  • tests/.env added to .gitignore (good!)
  • .env.example shows example key format (could encourage copy-paste of real keys)

Recommendation: Add warning comments in .env.example files not to commit real keys.

2. No Input Validation on Test Prompts

The run_claude_test() function doesn't validate or sanitize the prompt parameter before passing it to subprocess.

Risk: Low (internal test code), but could be exploited if prompts come from untrusted sources.

Recommendation: If prompts ever come from external sources, add validation; a minimal sketch follows.
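
Something like this would be enough (the length limit, the allowlist, and the function name are all illustrative):

import string

MAX_PROMPT_LENGTH = 10_000  # assumption; tune to the longest legitimate prompt
ALLOWED_CHARS = set(string.printable)


def validate_prompt(prompt: str) -> str:
    """Reject prompts that are empty, oversized, or contain non-printable bytes."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_LENGTH:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_LENGTH} characters")
    if not set(prompt) <= ALLOWED_CHARS:
        raise ValueError("prompt contains non-printable characters")
    return prompt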


🚀 Performance Considerations

1. Long Test Timeouts

  • Tests have 180-300 second timeouts
  • This is reasonable for AI-driven browser automation but could slow CI/CD

Recommendation:

  • Add timeout configuration via environment variable (see the sketch after this list)
  • Consider test parallelization (pytest-xdist)
  • Document expected runtime in README
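
For example (E2E_TEST_TIMEOUT and E2E_CLI_TIMEOUT are suggested names, not existing variables; the defaults mirror the currently hard-coded values):

import os

# Read timeouts from the environment so CI can tune them without code changes.
DEFAULT_TEST_TIMEOUT = int(os.environ.get("E2E_TEST_TIMEOUT", "180"))
CLAUDE_CLI_TIMEOUT = int(os.environ.get("E2E_CLI_TIMEOUT", "120"))

With pytest-xdist installed, parallelization is then just pytest -n auto, provided the session-scoped containers are safe to share across workers.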

2. Docker Image Size

  • Frontend: node:20-alpine (good choice - small base)
  • Backend: python:3.11-slim (good choice)
  • Test backend builds on each run

Recommendation: Consider caching test Docker images in CI/CD to speed up test runs.

3. No Test Isolation

@pytest.fixture(scope="session")
def wiki_app(docker_compose_file):

All tests share the same containers (session scope). If tests modify state, they could affect each other.

Recommendation: Either:

  • Document that tests should be read-only
  • Use function scope for true isolation (slower but safer; see the sketch below)
  • Add cleanup between tests
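
A function-scoped variant could look like this (assuming the same DockerCompose arguments the existing session fixture passes):

import pytest
from testcontainers.compose import DockerCompose


@pytest.fixture(scope="function")
def wiki_app(docker_compose_file):
    # Fresh containers for every test: slower, but no state leaks between tests.
    # Pass whatever arguments the current session-scoped fixture passes.
    with DockerCompose(docker_compose_file) as compose:
        yield compose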

🧪 Test Coverage

Current Coverage

✅ Chat functionality (list pages, clickable links)
✅ Page navigation
✅ Raw vs. formatted view comparison
✅ Markdown rendering

Missing Coverage

  • Thread system (spawn_thread, thread status, accept/reject)
  • Git operations (branch creation, checkout, merge)
  • WebSocket connectivity (connection errors, reconnection)
  • Error states (network failures, API errors)
  • Page editing (create, edit, delete pages)
  • Concurrent operations (multiple users, race conditions)

Recommendation: Add tests for these critical features in future PRs.


📝 Code Quality Suggestions

1. Magic Numbers

Multiple hard-coded values throughout:

  • Timeouts: 120, 180, 300 seconds
  • Intervals: 2.0, 5 seconds
  • Retries: 3, 5

Recommendation: Extract to constants at module level:

DEFAULT_TIMEOUT = 120
HEALTHCHECK_TIMEOUT = 60
POLL_INTERVAL = 2.0

2. Duplicate Dockerfiles

backend/Dockerfile and tests/Dockerfile.backend are similar with small differences.

Recommendation: Consider multi-stage builds or parameterized builds to reduce duplication.

3. Test Naming

Test class names (TestChat, TestPageViews) are good, but the haiku-tester/ directory name uses a hyphen, which is inconsistent with Python naming conventions.

Recommendation: Rename to haiku_tester/ or e2e_tests/ to match Python naming conventions.


🎯 Best Practices Alignment

Aligned with CLAUDE.md ✅

  • Uses project structure conventions
  • Follows Python/TypeScript style
  • Proper git integration awareness
  • Good documentation

Minor Deviations

  • Test directory uses haiku-tester (hyphenated) vs. Python snake_case convention
  • No mention of tests in main CLAUDE.md (should document test strategy)

📋 Recommendations Summary

High Priority (Fix before merge):

  1. Fix git config ownership issue in Dockerfile.backend:22-23
  2. Make URLs configurable instead of hard-coded
  3. Include stderr in error output for debugging

Medium Priority (Consider for this PR):

  4. Improve error handling in wait_for_url (don't accept 404/500)
  5. Extract magic numbers to constants
  6. Rename haiku-tester to haiku_tester

Low Priority (Future improvements):

  7. Add tests for thread system, git operations, editing
  8. Add test isolation or document read-only requirement
  9. Consider multi-stage Dockerfiles to reduce duplication
  10. Document test strategy in main CLAUDE.md


🎉 Conclusion

This is a high-quality PR that adds significant value to the project. The innovative use of Claude CLI for E2E testing is well-executed and the infrastructure is solid. The issues I've identified are mostly minor and don't block merging - they're opportunities for improvement.

Recommendation: ✅ Approve with minor fixes

Great work! This sets a strong foundation for comprehensive E2E testing. 🚀

claude bot commented Jan 1, 2026

PR Review: Haiku Tester Integration

Great work on adding a comprehensive testing infrastructure! This is a solid foundation for automated browser UI testing. Here's my detailed review:


Strengths

  1. Well-structured test infrastructure

    • Clean separation of concerns with Docker containers
    • Proper use of testcontainers for container lifecycle management
    • Health checks ensure services are ready before tests run
  2. Good documentation

    • CLAUDE.md updates clearly explain the testing setup
    • .env.example files provide guidance for configuration
    • README in test fixtures explains their purpose
  3. Proper gitignore updates

    • tests/.env correctly added to .gitignore:79
  4. Docker best practices mostly followed

    • Multi-stage awareness with layer caching
    • Non-root user in test backend for security
    • Health checks with reasonable intervals

🐛 Issues Found

Critical

  1. Missing dependency installation (tests/Dockerfile.backend:15)

    • The Dockerfile copies requirements.txt but pytest dependencies are in tests/requirements.txt
    • Tests will fail because testcontainers won't be available
    • Fix: Add COPY tests/requirements.txt test-requirements.txt and RUN pip install --no-cache-dir -r test-requirements.txt
  2. Unused return value in wait_for_url (tests/conftest.py:29)

    • Function returns bool but the return value isn't checked
    • If the URL fails, tests continue anyway and fail with confusing errors
    • Fix: the function already raises RuntimeError on lines 67-70, so behavior is correct; still, consider removing the bool return type for clarity

High Priority

  1. Hardcoded timeout values scattered across files

    • tests/haiku-tester/test_views.py has timeouts: 300, 180, 300, 180
    • tests/claude_runner.py:46 has default timeout=120
    • Recommendation: Define timeout constants at module level for easier tuning
  2. Docker healthcheck inconsistency (tests/docker-compose.yml)

    • Backend healthcheck uses Python command (line 19)
    • Frontend uses wget (line 39)
    • Backend Dockerfile uses the same Python command (backend/Dockerfile:19)
    • Recommendation: Be consistent - use wget for both or Python for both
  3. Incomplete error handling in claude_runner.py

    • Lines 87-92: a JSON parsing failure returns a failed result but doesn't log the raw output
    • Makes debugging failed tests difficult
    • Fix: Consider logging output to stderr before returning

Medium Priority

  1. Duplicate pytest in requirements (backend/requirements.txt:21-22)

    • pytest already in backend/requirements.txt
    • Also in tests/requirements.txt:1
    • Recommendation: Keep test dependencies only in tests/requirements.txt and remove from backend/requirements.txt
  2. .dockerignore improvements needed

    • Root .dockerignore:38 excludes entire Sviter-wiki submodule
    • This is correct for tests (using fixtures instead)
    • But comments say "Submodule (use fixtures instead)" - could be clearer about when this applies
    • Recommendation: Add comment: "# Submodule (tests use fixtures/ instead; production builds need real wiki)"
  3. Git config in container (tests/Dockerfile.backend:22-23)

    • Uses placeholder [email protected]
    • Not a security issue for tests, but worth noting
    • Recommendation: Add comment explaining this is for test git operations only

Low Priority / Code Quality

  1. Type hints inconsistency

    • tests/claude_runner.py uses dataclasses and type hints
    • tests/conftest.py has some type hints but not all
    • Recommendation: Add type hints to conftest.py functions for consistency
  2. Magic numbers in health checks

    • Interval=5s, timeout=3s, retries=5/3, start_period=10s/15s
    • Not necessarily wrong, but not documented why these values were chosen
    • Recommendation: Add comments explaining the rationale
  3. Schema validation not enforced (tests/claude_runner.py:13-30)

    • TEST_RESULT_SCHEMA defined but not used to validate the response
    • Claude CLI validates it, but the Python code doesn't double-check
    • Recommendation: Consider using the jsonschema library to validate (see the sketch after this list) or add a comment that the CLI handles validation
  4. Long test prompts could be externalized

    • test_views.py has 10-30 line prompts embedded in code
    • Makes tests harder to read
    • Recommendation: Consider moving prompts to separate files or using triple-quoted strings at module level
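
For the schema double-check, something like this would do (parse_test_result is a suggested name; TEST_RESULT_SCHEMA is the dict already defined in claude_runner.py):

import json

from jsonschema import ValidationError, validate


def parse_test_result(raw_output: str, schema: dict) -> dict:
    """Parse CLI output and verify it matches the expected schema."""
    data = json.loads(raw_output)
    try:
        validate(instance=data, schema=schema)
    except ValidationError as e:
        raise RuntimeError(f"CLI returned JSON that fails the schema: {e.message}") from e
    return data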

🔒 Security Review

No significant security issues found

  • API keys properly excluded from git (tests/.env)
  • Non-root user in Docker containers
  • No credential leakage in code
  • Proper use of environment variables for secrets

Minor notes:

  • tests/.env.example shows API key format - this is fine for documentation
  • Docker containers expose ports only to localhost (good)

📊 Test Coverage

Current tests cover:

  • ✅ Chat functionality (list pages, click links)
  • ✅ Page navigation
  • ✅ Raw vs Formatted view comparison
  • ✅ Markdown rendering

Gaps to consider for future PRs:

  • Page editing functionality
  • Thread spawning and management
  • Git branch operations
  • WebSocket real-time updates
  • Error states and edge cases

🚀 Performance Considerations

  1. Container startup time

    • testcontainers starts fresh containers for each test session
    • This is correct for isolation but slow
    • Recommendation: Document expected test run time (currently ~3-5 minutes)
  2. Health check intervals

    • 5-second intervals are reasonable for tests
    • Could be tuned lower for faster startup if needed
  3. Claude model choice

    • Using haiku (fast, cheap) is perfect for UI tests
    • Good tradeoff between cost and capability

📝 Documentation Quality

Excellent documentation overall

  • CLAUDE.md updates are clear and comprehensive
  • .env.example files well-commented
  • Test fixtures have README
  • Makefile help is self-explanatory

Minor suggestions:

  • Add expected test runtime to CLAUDE.md
  • Document what happens if tests fail (how to debug)

✏️ Recommendations

Must fix before merge:

  1. Fix missing test dependencies in tests/Dockerfile.backend

Should fix before merge:

  2. Add timeout constants instead of magic numbers
  3. Improve error logging in claude_runner.py

Nice to have:

  4. Consolidate pytest dependency location
  5. Add type hints to conftest.py
  6. Externalize long test prompts

Future improvements:

  7. Add tests for editing, threads, git operations
  8. Document test debugging procedures
  9. Consider CI/CD integration


🎯 Overall Assessment

Rating: 8/10 - Solid implementation with one critical fix needed

This PR adds valuable testing infrastructure with proper Docker isolation and good documentation. The main issue is the missing dependency installation in the test Dockerfile. Once fixed, this will be a great foundation for automated UI testing.

The use of Claude haiku with Chrome MCP is innovative and well-suited for this use case. The test structure is clean and the fixtures are minimal and appropriate.

Recommendation: Request changes - Fix the critical dependency issue, then merge.


🔧 Files to Review

Key files reviewed:

  • tests/Dockerfile.backend (needs dependency fix)
  • tests/conftest.py (good structure)
  • tests/haiku-tester/test_views.py (comprehensive tests)
  • tests/claude_runner.py (clean abstraction)
  • tests/docker-compose.yml (well configured)
  • CLAUDE.md (excellent documentation updates)

Great work overall! Let me know if you'd like me to elaborate on any of these points.

@honix marked this pull request as ready for review on January 1, 2026, 11:04
claude bot commented Jan 1, 2026

PR Review: Haiku Tester Test Infrastructure

This PR adds a comprehensive browser-based UI testing framework using Claude haiku model with Chrome MCP. Overall quality is high with good architectural decisions.

Strengths

  • Excellent separation of concerns with test infrastructure in tests/ directory
  • Smart use of Docker for test isolation and reproducibility
  • Proper testcontainers integration with session-scoped fixtures
  • Robust exponential backoff in wait_for_url() implementation
  • Good documentation throughout with clear docstrings
  • Proper error handling in claude_runner.py for all failure modes
  • Clean Dockerfile layering for efficient caching

Critical Issues

  1. Missing subprocess error handling (claude_runner.py:74) - result.stdout.strip() is called without checking result.returncode. Fix: Check returncode before parsing output (see the sketch after this list).

  2. Port conflicts (docker-compose.yml) - Hard-coded ports 8000 and 5173 may conflict with running dev servers. Fix: Document that dev servers must be stopped or use dynamic ports.

  3. Incomplete cleanup (conftest.py:84) - compose.stop() doesn't remove containers. Fix: Use compose.down() instead.
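
Sketch of the returncode check (cmd stands in for the argument list claude_runner.py actually builds; the timeout is illustrative):

import subprocess

cmd = ["claude"]  # plus the prompt and flags that claude_runner.py assembles

result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
if result.returncode != 0:
    raise RuntimeError(
        f"claude CLI exited with {result.returncode}:\n{result.stderr.strip()}"
    )
output = result.stdout.strip()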

Medium Priority

  1. Incomplete .dockerignore missing common patterns (*.log, *.tmp, coverage/)
  2. No Docker prerequisite check in Makefile
  3. Hardcoded model name in tests - should be configurable via environment (see the sketch after this list)
  4. Inconsistent timeouts (180s vs 300s) without documentation

Minor Issues

  1. Missing troubleshooting section in documentation
  2. No retry logic for transient CLI failures (see the sketch after this list)
  3. Magic numbers in conftest.py (0.5s, 4.0s delays) not explained
  4. Unused docker_compose_file fixture parameter
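
A simple retry wrapper could look like this (run_with_retry and MAX_ATTEMPTS are suggested names; the backoff schedule is illustrative):

import subprocess
import time

MAX_ATTEMPTS = 3  # assumption; tune to CI flakiness tolerance


def run_with_retry(cmd: list[str], timeout: float) -> subprocess.CompletedProcess:
    """Retry the CLI on timeouts and non-zero exits, with exponential backoff."""
    last_exc: Exception | None = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            if result.returncode == 0:
                return result
            last_exc = RuntimeError(f"exit {result.returncode}: {result.stderr.strip()}")
        except subprocess.TimeoutExpired as e:
            last_exc = e
        if attempt < MAX_ATTEMPTS:
            time.sleep(2 ** attempt)  # 2s, then 4s between attempts
    raise RuntimeError(f"claude CLI failed after {MAX_ATTEMPTS} attempts") from last_exc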

Security

  • Good: Non-root user, gitignored secrets, environment variables
  • Minor: No API key format validation

Test Coverage

Tested: Chat, navigation, markdown rendering, clickable links
Not tested: Editing, git ops, threads, WebSocket failures, error states

Recommendations

Before merge:

  1. Fix subprocess returncode handling
  2. Change compose.stop() to compose.down()
  3. Remove unused fixture parameter
  4. Document port conflicts

Verdict: Approve with minor changes

Solid foundation for browser UI testing. The architecture is sound, the code quality is good, and the Docker setup is well thought out. The critical issues are straightforward to fix.

Rating: 8/10

  • Code Quality: 9/10
  • Architecture: 9/10
  • Documentation: 8/10
  • Test Coverage: 7/10
  • Production Readiness: 8/10

@honix merged commit a8a8128 into main on Jan 1, 2026 (1 check passed)