feat(parser): PDF spec ingest — extraction, OCR fallback, hierarchy inference

## Context

PDF is the most common format spec writers receive from manufacturers and owners. Inferring CSI hierarchy from PDF shares the same core challenge as plaintext (no structural metadata — must reconstruct from visual layout), but compounds it with a class of PDF-specific pathologies that make reliable extraction genuinely hard.

## Value

Without PDF support, SpecR requires manual conversion for a large percentage of real-world specs. This is the highest-friction gap between current capability and practical daily use.

## Scope

### Shared with plaintext: hierarchy inference
- Must infer PART/ARTICLE/paragraph hierarchy from text patterns alone
- Same signals: indent, numbering prefix, ALL-CAPS headers, blank-line grouping
- Target: reuse plaintext inference pipeline, PDF adapter feeds normalized text into it
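
A minimal sketch of the numbering-prefix and indent signals listed above, to make the adapter contract concrete. The type names, regexes, and `classifyLine` are hypothetical and are not the #64 pipeline; the ALL-CAPS and blank-line grouping signals are omitted here.

```typescript
type LineKind = "part" | "article" | "paragraph" | "subparagraph" | "blank" | "text";

interface ClassifiedLine {
  kind: LineKind;
  label?: string; // numbering prefix, e.g. "1", "1.1", "A"
  text: string;   // line content with the prefix removed
  indent: number; // count of leading whitespace characters
}

// Assumed CSI-style patterns: "PART 1 - GENERAL", "1.1 SUMMARY", "A. ...", "1. ...".
const PART_RE = /^PART\s+(\d+)\b[\s.:\-\u2013\u2014]*(.*)$/;
const ARTICLE_RE = /^(\d+\.\d+)\s+(.+)$/;
const PARAGRAPH_RE = /^([A-Z])\.\s+(.+)$/;
const SUBPARAGRAPH_RE = /^(\d+)\.\s+(.+)$/;

function classifyLine(raw: string): ClassifiedLine {
  const indent = raw.length - raw.replace(/^\s+/, "").length;
  const text = raw.trim();
  if (text === "") return { kind: "blank", text, indent };

  // Order matters: "1.1 SUMMARY" must be tried before "1. item".
  const rules: Array<[LineKind, RegExp]> = [
    ["part", PART_RE],
    ["article", ARTICLE_RE],
    ["paragraph", PARAGRAPH_RE],
    ["subparagraph", SUBPARAGRAPH_RE],
  ];
  for (const [kind, re] of rules) {
    const m = re.exec(text);
    if (m) return { kind, label: m[1], text: m[2], indent };
  }
  return { kind: "text", text, indent };
}
```

If the PDF adapter emits lines in this normalized shape (prefix intact, indent preserved as leading spaces), the plaintext pipeline should not need to know the input came from a PDF.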

### PDF-specific pathologies (the hard part)

**Font encoding corruption**
PDF fonts can use custom encodings whose glyph IDs map to incorrect Unicode code points, typically because the font's ToUnicode map is missing or wrong. Extraction libraries then return garbage characters (e.g., the "ﬁ" ligature extracted as two separate unmapped bytes, or entire sections in Symbol/ZapfDingbats encoding). Must detect and handle: character frequency analysis, encoding fingerprinting, fallback to heuristic re-mapping.
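
One way the character-frequency check could look, as a rough sketch. `fingerprintEncoding`, the report shape, and both thresholds are assumptions for illustration; the Latin-letter ratio only makes sense for English-language specs and will misfire on number-heavy pages such as schedules.

```typescript
interface EncodingReport {
  suspiciousRatio: number;  // share of control, private-use, or U+FFFD characters
  latinLetterRatio: number; // share of A-Z / a-z among non-whitespace characters
  likelyCorrupt: boolean;
}

function fingerprintEncoding(
  text: string,
  maxSuspicious = 0.05, // assumed threshold, needs tuning against fixtures
  minLatin = 0.5,       // assumed threshold, English-language specs only
): EncodingReport {
  let total = 0;
  let suspicious = 0;
  let latin = 0;

  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    // Skip space, tab, LF, CR so layout whitespace does not skew the ratios.
    if (c === 0x20 || c === 0x09 || c === 0x0a || c === 0x0d) continue;
    total++;

    const isControl = c < 0x20 || (c >= 0x7f && c <= 0x9f);
    const isPrivateUse = c >= 0xe000 && c <= 0xf8ff;
    const isReplacement = c === 0xfffd;
    if (isControl || isPrivateUse || isReplacement) suspicious++;

    if ((c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a)) latin++;
  }

  // No text at all is not evidence of corruption; that case belongs to the OCR gate.
  if (total === 0) {
    return { suspiciousRatio: 0, latinLetterRatio: 0, likelyCorrupt: false };
  }

  const suspiciousRatio = suspicious / total;
  const latinLetterRatio = latin / total;
  return {
    suspiciousRatio,
    latinLetterRatio,
    likelyCorrupt: suspiciousRatio > maxSuspicious || latinLetterRatio < minLatin,
  };
}
```

Running this per page or per font run rather than per document would localize the damage, since corruption is often confined to one embedded font.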

**Scanned / image-based PDFs**
PDFs produced by scanning paper specs typically contain no text layer, only embedded raster images, so they require OCR (Tesseract or cloud fallback). Detection: check for embedded text streams; if absent or < N chars/page, trigger the OCR path.
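
The gate itself can stay small. A sketch, assuming the extractor already yields text per page; `PageText` and `pagesNeedingOcr` are hypothetical names, and the threshold is a required parameter because N is still undecided.

```typescript
interface PageText {
  pageNumber: number;
  text: string; // text layer as returned by the extractor, possibly empty
}

// Returns the page numbers whose text layer is too thin to trust.
// minCharsPerPage is the "N" from this issue and is deliberately left to the caller.
function pagesNeedingOcr(pages: PageText[], minCharsPerPage: number): number[] {
  return pages
    // Count non-whitespace characters only, so blank-padded pages do not pass.
    .filter((p) => p.text.replace(/\s+/g, "").length < minCharsPerPage)
    .map((p) => p.pageNumber);
}
```

Gating per page rather than per document is a design choice in this sketch: it lets a mostly digital PDF with a few scanned inserts send only those pages through OCR.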

**Malformed PDF structure**
Cross-reference tables corrupt, object streams truncated, linearization broken. `pdf-parse` and `pdfjs-dist` handle many of these but not all. Need graceful degradation: try primary extractor → fallback extractor → partial result with warnings → hard error with actionable message.
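
The degradation chain above could be sketched as follows. Everything here is hypothetical: the `Extractor` interface stands in for wrappers around whichever libraries win the eval, the `complete` flag assumes an extractor can report a partial recovery, and the error wording is a placeholder.

```typescript
interface ExtractorOutput {
  text: string;
  complete: boolean; // false when the extractor recovered only part of the document
}

interface Extractor {
  name: string;
  run: (data: Uint8Array) => Promise<ExtractorOutput>;
}

interface ExtractionResult {
  text: string;
  partial: boolean;
  warnings: string[];
}

async function extractWithFallback(
  data: Uint8Array,
  extractors: Extractor[],
): Promise<ExtractionResult> {
  const warnings: string[] = [];
  let bestPartial = "";

  // Try extractors in priority order: primary first, then fallbacks.
  for (const extractor of extractors) {
    try {
      const out = await extractor.run(data);
      if (out.complete && out.text.trim().length > 0) {
        return { text: out.text, partial: false, warnings };
      }
      warnings.push(`${extractor.name}: incomplete or empty extraction`);
      if (out.text.length > bestPartial.length) bestPartial = out.text;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warnings.push(`${extractor.name} failed: ${reason}`);
    }
  }

  // No extractor fully succeeded: return the longest partial text, flagged as such.
  if (bestPartial.trim().length > 0) {
    return { text: bestPartial, partial: true, warnings };
  }

  // Nothing usable: hard error that lists every attempt.
  throw new Error(
    `Could not extract text from PDF. Attempts: ${warnings.join("; ")}. ` +
      "Try re-exporting the PDF from its source application.",
  );
}
```

The accumulated `warnings` array maps directly onto the `warnings[]` field planned for the API response.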

**Reading order**
PDF content streams have no guaranteed reading order. Multi-column layouts, sidebars, headers/footers, and footnotes get interleaved into the main text stream. Must strip page furniture (headers/footers via position heuristics) and detect/skip columnar layouts.
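
For the header/footer part, one possible position heuristic: treat a line as page furniture when it sits in the top or bottom band of the page and recurs, modulo digits, on most pages. The sketch assumes the extractor can supply a normalized vertical position per line; `PositionedLine`, the band size, and the share threshold are illustrative assumptions. Column detection is not covered.

```typescript
interface PositionedLine {
  text: string;
  y: number; // assumed normalized position: 0 = top of page, 1 = bottom
}

type Page = PositionedLine[];

function stripPageFurniture(pages: Page[], band = 0.08, minShare = 0.6): Page[] {
  // Replace digit runs so "Page 3" and "Page 4" compare equal.
  const key = (s: string): string => s.replace(/\d+/g, "#").trim().toLowerCase();
  const inBand = (l: PositionedLine): boolean => l.y <= band || l.y >= 1 - band;

  // Count, per normalized line, how many pages carry it inside a margin band.
  const pageCounts = new Map<string, number>();
  for (const page of pages) {
    const seen = new Set<string>();
    for (const line of page) {
      if (inBand(line)) seen.add(key(line.text));
    }
    seen.forEach((k) => pageCounts.set(k, (pageCounts.get(k) || 0) + 1));
  }

  // Require at least two pages so a single-page document is never stripped.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));

  return pages.map((page) =>
    page.filter((l) => !(inBand(l) && (pageCounts.get(key(l.text)) || 0) >= threshold)),
  );
}
```

A section title that happens to sit near the top of one page survives, because it does not repeat across pages.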

**Hyphenation artifacts**
Soft hyphens inserted at line breaks appear as literal hyphens in extracted text, splitting words across lines. Must rejoin.
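
A conservative rejoin heuristic, as a sketch; `rejoinHyphenated` is a hypothetical name. It only joins when lowercase letters sit on both sides of the break, which leaves cases like "Type-" followed by "A" alone. It will still wrongly merge genuine compounds that happen to break at their hyphen (e.g. "fire-" / "rated"), so a dictionary check is probably needed on top.

```typescript
function rejoinHyphenated(text: string): string {
  return text
    // Real soft-hyphen characters (U+00AD) are never content: drop them,
    // along with a following line break and indent if present.
    .replace(/\u00AD(?:\r?\n[ \t]*)?/g, "")
    // Literal hyphen at end of line between lowercase letters: join the halves.
    .replace(/([a-z])-\r?\n[ \t]*([a-z])/g, "$1$2");
}
```

For example, `rejoinHyphenated("gyp-\nsum board")` yields `"gypsum board"`.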

## Deliverables

- `src/parser/pdf/index.ts` — PDF adapter
- `src/parser/pdf/extract.ts` — extraction pipeline (primary + fallback + OCR gate)
- `src/parser/pdf/normalize.ts` — reading-order repair, header/footer strip, hyphen rejoin
- Reuse `src/parser/text/` inference pipeline for hierarchy (depends on #64)
- Fixtures: machine-generated PDF, scanned PDF, font-corrupted PDF, multi-column PDF
- `POST /parse` accepts `application/pdf`
- API response includes `warnings[]` for degraded extraction quality

## Dependencies

- #64 (plaintext inference pipeline) — PDF feeds into it after normalization
- OCR library eval needed (Tesseract.js vs. cloud API vs. native binary)
- PDF extraction library eval: `pdfjs-dist` vs `pdf-parse` vs `unpdf` vs `pdf2json`

## Risks

- OCR accuracy on low-DPI scans may be too low for reliable section detection
- Font encoding corruption may be unrecoverable without source document
- This issue is intentionally not scoped to Phase 2 — complexity warrants its own phase gate

## Priority

High value, high complexity. Do not start until plaintext adapter ships and proves the text inference pipeline is solid (#64). Phase 3 earliest.
