Context
PDF is the most common format spec writers receive from manufacturers and owners. Inferring CSI hierarchy from PDF shares the same core challenge as plaintext (no structural metadata — must reconstruct from visual layout), but compounds it with a class of PDF-specific pathologies that make reliable extraction genuinely hard.
Value
Without PDF support, SpecR requires manual conversion for a large percentage of real-world specs. This is the highest-friction gap between current capability and practical daily use.
Scope
Shared with plaintext: hierarchy inference
Must infer PART/ARTICLE/paragraph hierarchy from text patterns alone
Same signals: indent, numbering prefix, ALL-CAPS headers, blank-line grouping
Target: reuse plaintext inference pipeline, PDF adapter feeds normalized text into it
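The signals above can be sketched as a per-line classifier. This is a minimal illustration, not the pipeline from #64: the level names, the `classifyLine` function, and the exact numbering patterns (`PART 1`, `1.1`, `A.`, `1.`) are assumptions about a typical CSI three-part layout.

```typescript
// Hypothetical level names; the real inference pipeline may use different ones.
type Level = "part" | "article" | "paragraph" | "subparagraph" | "text";

interface ClassifiedLine {
  level: Level;
  indent: number; // number of leading whitespace characters
  allCaps: boolean; // true when the line has letters and none are lowercase
  text: string;
}

const PART_RE = /^PART\s+\d+\b/; // e.g. "PART 1 - GENERAL"
const ARTICLE_RE = /^\d+\.\d+\s+\S/; // e.g. "1.1 SUMMARY"
const PARAGRAPH_RE = /^[A-Z]\.\s+\S/; // e.g. "A. Section Includes"
const SUBPARAGRAPH_RE = /^\d+\.\s+\S/; // e.g. "1. Steel doors"

function classifyLine(raw: string): ClassifiedLine {
  const indent = raw.search(/\S|$/); // index of first non-whitespace character
  const text = raw.trim();
  const allCaps = /[A-Z]/.test(text) && text === text.toUpperCase();

  let level: Level = "text";
  if (PART_RE.test(text)) level = "part";
  else if (ARTICLE_RE.test(text)) level = "article";
  else if (PARAGRAPH_RE.test(text)) level = "paragraph";
  else if (SUBPARAGRAPH_RE.test(text)) level = "subparagraph";

  return { level, indent, allCaps, text };
}
```

A real pipeline would likely combine `level`, `indent`, and `allCaps` rather than trust the numbering prefix alone, since indent is often the only reliable signal once numbering restarts.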
PDF-specific pathologies (the hard part)
Font encoding corruption
PDF fonts can use custom encodings, and when a font's ToUnicode map is missing or wrong, glyph IDs resolve to incorrect Unicode codepoints. Extraction libraries then return garbage characters (e.g., an "fi" ligature rendered as two separate unmapped bytes, or entire sections in Symbol/Zapf Dingbats encoding). Must detect and handle: character frequency analysis, encoding fingerprinting, fallback to heuristic re-mapping.
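One cheap form of character frequency analysis is to measure how much of the extracted text falls in ranges that a clean English spec should almost never contain. The sketch below is an assumption about how detection could start; the function names and the 5% threshold are illustrative and would need tuning against real corrupted files.

```typescript
// Share of UTF-16 code units that are suspicious: U+FFFD replacement characters,
// C0 control characters (other than tab, LF, CR), and Private Use Area codepoints.
function suspiciousCharRatio(text: string): number {
  if (text.length === 0) return 0;
  let suspicious = 0;
  for (let i = 0; i < text.length; i++) {
    const cu = text.charCodeAt(i);
    const isControl = cu < 0x20 && cu !== 0x09 && cu !== 0x0a && cu !== 0x0d;
    const isReplacement = cu === 0xfffd;
    const isPrivateUse = cu >= 0xe000 && cu <= 0xf8ff;
    if (isControl || isReplacement || isPrivateUse) suspicious++;
  }
  return suspicious / text.length;
}

// Assumed threshold: flag the text when more than 5% of it is suspicious.
const CORRUPTION_THRESHOLD = 0.05;

function looksCorrupted(text: string): boolean {
  return suspiciousCharRatio(text) > CORRUPTION_THRESHOLD;
}
```

This only catches corruption that lands in obviously wrong ranges. Text remapped onto ordinary letters would pass this check and needs the fingerprinting or letter-frequency comparison mentioned above.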
Scanned / image-based PDFs
PDFs generated by scanning paper specs contain no text layer — only embedded raster images. Require OCR (Tesseract or cloud fallback). Must detect: check for embedded text streams; if absent or < N chars/page, trigger OCR path.
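The OCR gate described above could look roughly like this. The function name and the default of 50 characters are assumptions (the issue leaves N unspecified), and the input is assumed to be one extracted text string per page.

```typescript
// Returns the zero-based indices of pages whose text layer is missing or too thin,
// counting only non-whitespace characters.
function pagesNeedingOcr(pageTexts: string[], minCharsPerPage = 50): number[] {
  const result: number[] = [];
  pageTexts.forEach((text, index) => {
    const visibleChars = text.replace(/\s/g, "").length;
    if (visibleChars < minCharsPerPage) result.push(index);
  });
  return result;
}
```

Gating per page rather than per document lets a mostly digital PDF with a few scanned inserts send only those pages through OCR.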
Malformed PDF structure
Cross-reference tables corrupt, object streams truncated, linearization broken. pdf-parse and pdfjs-dist handle many of these but not all. Need graceful degradation: try primary extractor → fallback extractor → partial result with warnings → hard error with actionable message.
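The degradation chain can be sketched independently of any particular library. Everything here is hypothetical: the `Extractor` interface, the convention that `null` marks an unreadable page, and the wording of the final error. Real adapters around pdf-parse or pdfjs-dist would sit behind `extract`.

```typescript
type PageText = string | null; // null marks a page the extractor could not read

interface Extractor {
  name: string;
  extract(data: Uint8Array): Promise<PageText[]>;
}

interface ExtractResult {
  pages: PageText[];
  warnings: string[];
}

async function extractWithFallback(
  data: Uint8Array,
  extractors: Extractor[]
): Promise<ExtractResult> {
  const warnings: string[] = [];
  let bestPartial: PageText[] | null = null;
  const readable = (pages: PageText[]): number => pages.filter((p) => p !== null).length;

  for (const extractor of extractors) {
    try {
      const pages = await extractor.extract(data);
      const missing = pages.length - readable(pages);
      // A complete result ends the chain immediately.
      if (pages.length > 0 && missing === 0) return { pages, warnings };
      warnings.push(`${extractor.name}: ${missing} of ${pages.length} pages unreadable`);
      if (bestPartial === null || readable(pages) > readable(bestPartial)) bestPartial = pages;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warnings.push(`${extractor.name} failed: ${reason}`);
    }
  }

  // No extractor fully succeeded: return the best partial result, if any.
  if (bestPartial !== null && readable(bestPartial) > 0) return { pages: bestPartial, warnings };

  throw new Error(
    `No extractor could read this PDF (${warnings.join("; ")}). ` +
      "Try re-saving or re-exporting the file, then upload it again."
  );
}
```

The accumulated `warnings` array is meant to map onto the `warnings[]` field listed under Deliverables.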
Reading order
PDF content streams have no guaranteed reading order. Multi-column layouts, sidebars, headers/footers, and footnotes get interleaved into the main text stream. Must strip page furniture (headers/footers via position heuristics) and detect/skip columnar layouts.
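A position heuristic for page furniture can be sketched as follows: a line counts as a header or footer only if it sits in the top or bottom margin band and the same text, with digits masked, repeats across enough pages. The `PositionedLine` shape, the normalized `y` coordinate, and both default thresholds are assumptions. Column detection is not covered here.

```typescript
interface PositionedLine {
  text: string;
  y: number; // assumed normalized position: 0 = top of page, 1 = bottom
}

function stripPageFurniture(
  pages: PositionedLine[][],
  margin = 0.08, // assumed height of the header and footer bands
  minShare = 0.5 // assumed share of pages a line must repeat on
): PositionedLine[][] {
  const inMargin = (l: PositionedLine): boolean => l.y < margin || l.y > 1 - margin;
  // Mask digit runs so "Page 3" and "Page 4" count as the same line.
  const key = (l: PositionedLine): string => l.text.trim().replace(/\d+/g, "#");

  // Count, per masked text, how many pages carry it inside a margin band.
  const counts: Record<string, number> = Object.create(null);
  for (const page of pages) {
    const seen: Record<string, boolean> = Object.create(null);
    for (const line of page) {
      if (!inMargin(line)) continue;
      const k = key(line);
      if (seen[k]) continue;
      seen[k] = true;
      counts[k] = (counts[k] || 0) + 1;
    }
  }

  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page.filter((l) => !(inMargin(l) && (counts[key(l)] || 0) >= threshold))
  );
}
```

Requiring repetition as well as position keeps a legitimate first line such as a PART heading from being dropped just because it sits near the top of one page.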
Hyphenation artifacts
Soft hyphens inserted at line breaks appear as literal hyphens in extracted text, splitting words across lines. Must rejoin.
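A minimal rejoin pass might look like the sketch below. The function name is hypothetical, and the rule (join only when a lowercase letter sits on both sides of the break) is a deliberately conservative assumption.

```typescript
function rejoinHyphenated(text: string): string {
  return (
    text
      // Remove U+00AD soft hyphens that survived extraction, plus any line break after them.
      .replace(/\u00AD(\r?\n[ \t]*)?/g, "")
      // Join "galva-\nnized" into "galvanized" when both sides are lowercase letters.
      .replace(/([a-z])-\r?\n[ \t]*([a-z])/g, "$1$2")
  );
}
```

This also merges genuine compounds that happen to break at their hyphen ("fire-" / "rated" becomes "firerated"), so a dictionary or word-list check on the joined result would be a reasonable refinement.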
Deliverables
src/parser/pdf/index.ts: PDF adapter
src/parser/pdf/extract.ts: extraction pipeline (primary + fallback + OCR gate)
src/parser/pdf/normalize.ts: reading-order repair, header/footer strip, hyphen rejoin
src/parser/text/ inference pipeline for hierarchy (depends on feat(parser): plaintext spec ingest - hierarchy inference from indent + numbering patterns #64)
POST /parse accepts application/pdf
warnings[] for degraded extraction quality
Dependencies
pdfjs-dist vs pdf-parse vs unpdf vs pdf2json
Risks
Priority
High value, high complexity. Do not start until plaintext adapter ships and proves the text inference pipeline is solid (#64). Phase 3 earliest.