feat(parser): PDF spec ingest — extraction, OCR fallback, hierarchy inference #65

@thewrz

Context

PDF is the most common format spec writers receive from manufacturers and owners. Inferring CSI hierarchy from PDF shares the same core challenge as plaintext (no structural metadata — must reconstruct from visual layout), but compounds it with a class of PDF-specific pathologies that make reliable extraction genuinely hard.

Value

Without PDF support, SpecR requires manual conversion for a large percentage of real-world specs. This is the highest-friction gap between current capability and practical daily use.

Scope

Shared with plaintext: hierarchy inference

  • Must infer PART/ARTICLE/paragraph hierarchy from text patterns alone
  • Same signals: indent, numbering prefix, ALL-CAPS headers, blank-line grouping
  • Target: reuse plaintext inference pipeline, PDF adapter feeds normalized text into it
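
As a rough illustration of the signals listed above, the sketch below classifies a single normalized line by numbering prefix, ALL-CAPS header, and indent. The level names, the regexes, and `classifyLine` are all hypothetical; they assume common CSI SectionFormat numbering ("PART 1", "1.1 SUMMARY", "A.", "1.") and are not the pipeline from #64.

```typescript
// Hypothetical level names; the real inference pipeline may use different ones.
type Level = "PART" | "ARTICLE" | "PARAGRAPH" | "SUBPARAGRAPH" | "TEXT";

interface ClassifiedLine {
  level: Level;
  indent: number; // number of leading whitespace characters
  text: string;   // trimmed line content
}

// Assumed patterns, based on typical CSI-style numbering.
const PART_RE = /^PART\s+\d+\b/;                           // "PART 1 - GENERAL"
const ARTICLE_RE = /^\d+\.\d+\s+[A-Z][A-Z0-9 ,&\/()-]*$/;  // "1.1 SUMMARY" (ALL-CAPS title)
const PARAGRAPH_RE = /^[A-Z]\.\s+\S/;                      // "A. Section includes"
const SUBPARAGRAPH_RE = /^\d+\.\s+\S/;                     // "1. Steel doors"

function classifyLine(raw: string): ClassifiedLine {
  const indent = raw.search(/\S|$/); // index of first non-space char, or line length
  const text = raw.trim();
  let level: Level = "TEXT";
  if (PART_RE.test(text)) level = "PART";
  else if (ARTICLE_RE.test(text)) level = "ARTICLE";
  else if (PARAGRAPH_RE.test(text)) level = "PARAGRAPH";
  else if (SUBPARAGRAPH_RE.test(text)) level = "SUBPARAGRAPH";
  return { level, indent, text };
}
```

The indent is returned alongside the level so a later pass can use it to break ties when the numbering prefix is missing or ambiguous.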

PDF-specific pathologies (the hard part)

Font encoding corruption
PDF fonts can use custom encodings, and a missing or incorrect ToUnicode mapping leaves glyph IDs pointing at the wrong Unicode codepoints. Extraction libraries then return garbage characters (e.g., an "fi" ligature rendered as two separate unmapped bytes, or entire sections decoded as Symbol/ZapfDingbats glyphs). Must detect and handle via character frequency analysis, encoding fingerprinting, and fallback to heuristic re-mapping.
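
One possible form of the character frequency check is sketched below: it measures the share of non-whitespace characters that are control codes, U+FFFD replacement characters, or private-use codepoints, all of which are rare in healthy spec text. The function names and the 0.1 default threshold are assumptions for illustration, not tuned values.

```typescript
// Share of non-whitespace UTF-16 code units that look like extraction debris.
function suspiciousRatio(text: string): number {
  let total = 0;
  let suspicious = 0;
  for (let i = 0; i < text.length; i++) {
    const cp = text.charCodeAt(i);
    // Skip ordinary whitespace (space, tab, LF, CR).
    if (cp === 0x20 || cp === 0x09 || cp === 0x0a || cp === 0x0d) continue;
    total++;
    const isControl = cp < 0x20 || (cp >= 0x7f && cp <= 0x9f);
    const isReplacement = cp === 0xfffd;                 // U+FFFD
    const isPrivateUse = cp >= 0xe000 && cp <= 0xf8ff;   // BMP private-use area
    if (isControl || isReplacement || isPrivateUse) suspicious++;
  }
  return total === 0 ? 0 : suspicious / total;
}

// Hypothetical gate: the 0.1 default is a placeholder, not a measured threshold.
function looksCorrupted(text: string, threshold: number = 0.1): boolean {
  return suspiciousRatio(text) > threshold;
}
```

A check like this only flags pages that decode to obviously invalid codepoints. Text that is remapped to plausible but wrong letters would need the fingerprinting step instead.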

Scanned / image-based PDFs
PDFs generated by scanning paper specs contain no text layer, only embedded raster images. These require OCR (Tesseract or cloud fallback). Detection: check for embedded text streams; if absent or < N chars/page, trigger the OCR path.
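
The OCR gate could look roughly like the sketch below. `PageText` and `pagesNeedingOcr` are hypothetical names, and the threshold N is deliberately left as a required parameter because the issue does not fix a value.

```typescript
// Hypothetical per-page output of the text extractor.
interface PageText {
  pageNumber: number;
  text: string; // empty string when the page has no text stream
}

// Returns the page numbers whose extracted text falls below the threshold,
// counting only non-whitespace characters.
function pagesNeedingOcr(pages: PageText[], minCharsPerPage: number): number[] {
  return pages
    .filter((p) => p.text.replace(/\s+/g, "").length < minCharsPerPage)
    .map((p) => p.pageNumber);
}
```

Gating per page rather than per document lets a mostly machine-generated PDF with a few scanned inserts send only those pages through OCR.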

Malformed PDF structure
Cross-reference tables corrupt, object streams truncated, linearization broken. pdf-parse and pdfjs-dist handle many of these but not all. Need graceful degradation: try primary extractor → fallback extractor → partial result with warnings → hard error with actionable message.
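
A minimal sketch of that degradation chain follows. The `Extractor` interface and `extractWithFallback` are hypothetical, and the extractors are kept synchronous for brevity; wrappers around pdf-parse or pdfjs-dist would be promise-based in practice.

```typescript
interface ExtractionResult {
  text: string;
  complete: boolean; // false when only part of the document was recovered
}

// Hypothetical wrapper around a concrete library; extract() throws on failure.
interface Extractor {
  name: string;
  extract(data: Uint8Array): ExtractionResult;
}

interface PipelineResult {
  text: string;
  extractor: string;
  warnings: string[];
}

function extractWithFallback(data: Uint8Array, extractors: Extractor[]): PipelineResult {
  const warnings: string[] = []; // shared, so the final result carries every warning
  let partial: PipelineResult | undefined;
  for (const ex of extractors) {
    try {
      const result = ex.extract(data);
      if (result.complete) {
        return { text: result.text, extractor: ex.name, warnings };
      }
      warnings.push(`${ex.name}: partial extraction`);
      // Keep the longest partial result seen so far.
      if (!partial || result.text.length > partial.text.length) {
        partial = { text: result.text, extractor: ex.name, warnings };
      }
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warnings.push(`${ex.name} failed: ${reason}`);
    }
  }
  if (partial) return partial;
  // Hard error with an actionable message once every extractor has failed.
  throw new Error(
    `All extractors failed (${warnings.join("; ")}). Try re-saving or re-exporting the PDF.`
  );
}
```

Passing the extractors as an ordered list keeps the primary/fallback order a configuration choice rather than something hard-coded in the pipeline.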

Reading order
PDF content streams have no guaranteed reading order. Multi-column layouts, sidebars, headers/footers, and footnotes get interleaved into the main text stream. Must strip page furniture (headers/footers via position heuristics) and detect/skip columnar layouts.
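
One way to approach the header/footer part is sketched below: a line is treated as page furniture when it sits in the top or bottom margin band and its digit-normalized text repeats across pages. The `PositionedLine` shape, the normalized `y` coordinate, and both default thresholds are assumptions; column detection is not covered here.

```typescript
// Hypothetical extractor output: y is normalized so 0 = top of page, 1 = bottom.
interface PositionedLine {
  text: string;
  y: number;
}
type Page = PositionedLine[];

function stripPageFurniture(pages: Page[], margin: number = 0.08, minRepeatShare: number = 0.5): Page[] {
  // Replace digit runs so "Page 1 of 3" and "Page 2 of 3" compare equal.
  const normalize = (s: string): string => s.replace(/\d+/g, "#").trim().toLowerCase();
  const inMargin = (l: PositionedLine): boolean => l.y < margin || l.y > 1 - margin;

  // Count on how many pages each normalized margin line appears.
  const counts: Record<string, number> = Object.create(null);
  for (const page of pages) {
    const seen: Record<string, boolean> = Object.create(null);
    for (const line of page) {
      if (!inMargin(line)) continue;
      const key = normalize(line.text);
      if (seen[key]) continue; // count each page at most once
      seen[key] = true;
      counts[key] = (counts[key] || 0) + 1;
    }
  }

  // A margin line must repeat on at least two pages and on the given share of pages.
  const threshold = Math.max(2, Math.ceil(pages.length * minRepeatShare));
  return pages.map((page) =>
    page.filter((l) => !(inMargin(l) && (counts[normalize(l.text)] || 0) >= threshold))
  );
}
```

Requiring both position and repetition avoids deleting a real article heading that happens to be the first line on a single page.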

Hyphenation artifacts
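Soft hyphens inserted at line breaks appear as literal hyphens in extracted text, splitting words across lines. Must rejoin them without merging genuine hyphenated compounds (e.g., "fire-rated") that happen to break at the hyphen.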
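
A simple version of the rejoin step might look like the sketch below. `rejoinHyphenated` is a hypothetical helper, and the lowercase-to-lowercase rule is a heuristic: it will also merge a genuine compound that breaks at its hyphen, so a dictionary check would likely be needed on top of it.

```typescript
function rejoinHyphenated(text: string): string {
  return text
    // A true soft hyphen (U+00AD) at a line end always marks a split word.
    .replace(/\u00AD\r?\n[ \t]*/g, "")
    // Soft hyphens elsewhere are invisible hints and can be dropped.
    .replace(/\u00AD/g, "")
    // Heuristic: a literal hyphen between two lowercase letters across a
    // line break is assumed to be a hyphenation artifact.
    .replace(/([a-z])-\r?\n[ \t]*([a-z])/g, "$1$2");
}
```

Hyphens followed by an uppercase letter or a digit on the next line (for example "Type-" then "A") are left untouched, since those are more likely to be real identifiers than split words.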

Deliverables

  • src/parser/pdf/index.ts — PDF adapter
  • src/parser/pdf/extract.ts — extraction pipeline (primary + fallback + OCR gate)
  • src/parser/pdf/normalize.ts — reading-order repair, header/footer strip, hyphen rejoin
  • Reuse src/parser/text/ inference pipeline for hierarchy (depends on #64, the plaintext spec ingest issue)
  • Fixtures: machine-generated PDF, scanned PDF, font-corrupted PDF, multi-column PDF
  • POST /parse accepts application/pdf
  • API response includes warnings[] for degraded extraction quality

Dependencies

Risks

  • OCR accuracy on low-DPI scans may be too low for reliable section detection
  • Font encoding corruption may be unrecoverable without source document
  • This issue is intentionally not scoped to Phase 2 — complexity warrants its own phase gate

Priority

High value, high complexity. Do not start until plaintext adapter ships and proves the text inference pipeline is solid (#64). Phase 3 earliest.

Metadata

    Labels

    phase:3 (Phase 3 issues)
