
enhancement: validate crawl-sim JSON outputs against JSON Schema in CI #29

@BraedenBDev

Follow-up from docs/issues/2026-04-12-accuracy-from-live-audit.md (M4 remainder) and docs/plans/2026-04-18-live-audit-follow-through.md.

Problem

docs/output-schemas.md exists, but the contracts are still only human-readable. There are no *.schema.json files and CI does not validate actual script outputs against machine-readable schemas. That means output drift can still land silently.

Goal

Make output contracts enforceable.

Scope

  • Add JSON Schema files for the core script outputs, at minimum:
    • fetch-as-bot.sh
    • extract-meta.sh
    • extract-jsonld.sh
    • extract-links.sh
    • check-robots.sh
    • check-llmstxt.sh
    • check-sitemap.sh
    • diff-render.sh
    • compute-score.sh
    • build-report.sh
  • Add CI validation that runs representative fixtures through those schemas.
  • Document the schema directory and validation workflow in docs/output-schemas.md.
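As a starting point for the CI step, a coverage check along the following lines could confirm that every script in the list above has a schema file before any fixture is validated. This is only a sketch: the `<script>.schema.json` naming, the `schemas` default directory, and the function names `schema_problems` and `main` are assumptions for illustration, not something the repository defines today.

```python
import json
from pathlib import Path

# Script names come from the Scope list; pairing each one with
# "<name>.schema.json" is an assumed naming convention.
SCRIPTS = [
    "fetch-as-bot", "extract-meta", "extract-jsonld", "extract-links",
    "check-robots", "check-llmstxt", "check-sitemap",
    "diff-render", "compute-score", "build-report",
]


def schema_problems(schema_dir):
    """Return one message per script whose schema is missing or unusable."""
    problems = []
    for name in SCRIPTS:
        path = Path(schema_dir) / f"{name}.schema.json"
        if not path.is_file():
            problems.append(f"{path.name}: missing")
            continue
        try:
            schema = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(schema, dict):
            problems.append(f"{path.name}: top level is not an object")
    return problems


def main(schema_dir="schemas"):
    """Print every problem and return an exit code for the CI job."""
    problems = schema_problems(schema_dir)
    for line in problems:
        print(line)
    return 1 if problems else 0  # non-zero fails the job
```

A CI job would pass the result of `main()` to `sys.exit`, so a missing or unparsable schema fails the build before fixture validation starts.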

Acceptance criteria

  • A schemas/*.schema.json file exists for each of the script outputs above
  • CI fails if fixture output no longer matches its schema
  • docs/output-schemas.md points to the machine-readable schemas, not just prose examples
  • At least one regression test proves schema validation catches a malformed fixture
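The regression test in the last criterion could look roughly like the sketch below. To stay dependency-free it uses a small hand-written checker that understands only the `type`, `required`, and `properties` keywords; it stands in for whichever full JSON Schema validator the project ends up choosing. The `url` and `status` fields are hypothetical and do not describe the real output of any crawl-sim script.

```python
# Stand-in for a real JSON Schema validator: handles only "type",
# "required", and "properties", and only the type names listed here.
TYPES = {
    "object": dict, "array": list, "string": str,
    "integer": int, "number": (int, float), "boolean": bool,
}


def violations(instance, schema, path="$"):
    """Return a list of human-readable schema violations (empty if valid)."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(violations(instance[key], sub, f"{path}.{key}"))
    return errors


# Hypothetical schema and fixtures, for illustration only.
SCHEMA = {
    "type": "object",
    "required": ["url", "status"],
    "properties": {"url": {"type": "string"}, "status": {"type": "integer"}},
}
GOOD = {"url": "https://example.com/", "status": 200}
BAD = {"url": "https://example.com/", "status": "200"}  # status drifted to a string


def test_malformed_fixture_is_rejected():
    assert violations(GOOD, SCHEMA) == []
    assert violations(BAD, SCHEMA) == ["$.status: expected integer"]
```

The point of keeping a deliberately malformed fixture in the suite is that the test fails if validation is ever disabled or loosened by accident.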

Why this matters

The live audit showed that crawl-sim's raw data can be correct while the interpretation layer drifts. Locking the JSON contracts reduces the surface area where that drift can occur.

Labels

documentation, enhancement
