feat: PDF watermark removal via --remove-watermarks by arcaputo3 · Pull Request #70 · TJC-LP/xlcr

arcaputo3 · 2026-03-16T19:51:50Z

Summary

- Adds PDF-to-PDF conversion with --remove-watermarks flag (CLI + HTTP)
- Three-layer watermark removal: artifacts, annotations, and content-stream vector paths
- Vector path removal uses two strategies:
  - Marked-content: strips trailing blocks wrapped in explicit BMC/BDC Watermark tags
  - Cross-page heuristic: for unmarked stamps (e.g., CorpTax), strips trailing path blocks only when 2+ pages share the exact same operator count (watermark signature)
- Wires all ConvertOptions through HTTP server query params (fixes pre-existing gap where password, strip-masters, etc. were ignored)
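
To illustrate the query-param wiring described in the last item, here is a minimal Python sketch (not the project's code) of reading conversion options from a request URL. The option names come from this PR; the function name `parse_convert_options`, the returned dict shape, and treating `strip-masters` and `fixed-layout` as booleans are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Option names are taken from the PR text. Treating these three as
# boolean flags is an assumption of this sketch.
BOOLEAN_FLAGS = ("remove-watermarks", "strip-masters", "fixed-layout")


def parse_convert_options(url: str) -> dict:
    """Collect conversion options from a request URL's query string."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    options = {}
    for flag in BOOLEAN_FLAGS:
        values = query.get(flag, [])
        # "?flag", "?flag=true" and "?flag=1" enable the option; anything else disables it.
        options[flag] = bool(values) and values[-1].lower() in ("", "true", "1")
    if "password" in query:
        options["password"] = query["password"][-1]
    # Target format, e.g. "pdf" in /convert?to=pdf
    options["to"] = query.get("to", [None])[-1]
    return options
```

For example, `parse_convert_options("/convert?to=pdf&remove-watermarks=true")` yields a dict with `remove-watermarks` enabled and the other flags disabled, so an option can no longer be dropped silently between the endpoint and the converter.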

Test plan

Unit tests: artifact removal, annotation removal, dispatch guard, capability checks
Safety tests: password-protected PDF preservation, legitimate trailing vector art preservation
Integration: 80-page watermarked PDF — watermarks removed, text fidelity verified
Batch: 18 tax PDFs processed, 100% fidelity confirmed via text diff
Full test suite green (all modules)

🤖 Generated with Claude Code

Add PDF-to-PDF conversion support with three-layer watermark removal:

1. Artifact-based: removes standard PDF Artifact.ArtifactSubtype.Watermark
2. Annotation-based: removes WatermarkAnnotation instances
3. Content-stream vector paths: detects and strips trailing glyph-outline watermarks appended after page text (e.g., CorpTax "Confidential" stamps)

The vector path strategy scans backwards from the last text operator (ET), verifies trailing ops are path/graphics-state only, then bulk-removes them using suppressUpdate for performance on large documents.

Also wires ConvertOptions through the HTTP server: previously all conversion options (password, strip-masters, fixed-layout, etc.) were silently ignored by the /convert and /split endpoints.

CLI: xlcr convert -i watermarked.pdf -o clean.pdf --remove-watermarks
HTTP: POST /convert?to=pdf&remove-watermarks=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
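
The backward scan described in this commit can be sketched in Python over a plain list of operator names. This is an illustration under stated assumptions, not the project's implementation: the operator names are standard PDF content-stream operators, but the exact contents of the safe sets and the helper names `trailing_block_start` and `strip_trailing_block` are hypothetical.

```python
from typing import List, Optional

# Standard PDF path construction/painting operators. Which operators the
# project actually treats as "safe" is an assumption of this sketch.
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re",
            "f", "F", "f*", "S", "s", "B", "B*", "b", "b*", "n", "W", "W*"}
# Standard graphics-state and color operators.
GRAPHICS_STATE_OPS = {"q", "Q", "cm", "w", "J", "j", "M", "d", "gs",
                      "g", "G", "rg", "RG", "k", "K"}
SAFE_TRAILING_OPS = PATH_OPS | GRAPHICS_STATE_OPS


def trailing_block_start(ops: List[str]) -> Optional[int]:
    """Return the index where a removable trailing block starts, or None."""
    # Scan backwards for the last end-of-text operator.
    last_et = None
    for i in range(len(ops) - 1, -1, -1):
        if ops[i] == "ET":
            last_et = i
            break
    if last_et is None:
        return None
    tail = ops[last_et + 1:]
    # Only a non-empty tail made purely of path/graphics-state ops qualifies.
    if not tail or any(op not in SAFE_TRAILING_OPS for op in tail):
        return None
    return last_et + 1


def strip_trailing_block(ops: List[str]) -> List[str]:
    """Drop the trailing block in one slice, or return ops unchanged."""
    start = trailing_block_start(ops)
    return ops if start is None else ops[:start]
```

Removing the tail in a single slice mirrors the bulk removal mentioned above; a tail containing anything else (for instance an image `Do`) leaves the page untouched.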

…euristics

Three-layer safety improvements to the vector watermark removal strategy:

- Strategy 3a: only remove trailing vector blocks explicitly wrapped in BMC/BDC marked-content tags with Watermark semantics
- Strategy 3b: for unmarked watermarks (e.g., CorpTax stamps), use cross-page consistency: only strip when 2+ pages share the exact same trailing path operator count (watermark stamp signature). Single-page docs and unique trailing vector art are preserved.
- Use getCommandName() instead of toString() for operator matching
- Extract shared safe-command sets for clarity

Also:

- Add password-protected PDF preservation test
- Add legitimate trailing vector artwork preservation test
- Fix repeated query param parsing for sheet names (getQueryParams)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
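
The cross-page consistency rule of Strategy 3b can be illustrated with a short Python sketch. It assumes the per-page trailing path operator counts have already been computed (0 meaning no trailing block); the function name `pages_to_strip` is hypothetical and the real implementation may differ.

```python
from collections import Counter
from typing import List, Set


def pages_to_strip(trailing_counts: List[int]) -> Set[int]:
    """Return indices of pages whose trailing op count is shared by 2+ pages."""
    # Count how many pages share each non-zero trailing operator count.
    freq = Counter(c for c in trailing_counts if c > 0)
    return {
        i
        for i, c in enumerate(trailing_counts)
        if c > 0 and freq[c] >= 2
    }
```

With counts `[412, 412, 0, 37, 412]`, pages 0, 1 and 4 are selected, while the page with a unique count of 37 is left alone. A single-page document never qualifies, which matches the preservation behaviour described above.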

Normal mode (--remove-watermarks) now uses two heuristics in sequence:

- Common-prefix: strips identical trailing op sequences across pages, preserving page-specific vector art (handles Seminole-type files)
- Mode-count fallback: when prefix is too short (watermark varies per page), strips pages sharing the most common trailing op count (handles L3Harris/CorpTax-type files with per-page variation)

Aggressive mode (--remove-watermarks-aggressive) strips ALL trailing path ops after the last text block on every page unconditionally. Use when normal mode can't distinguish watermark from legitimate art.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
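
One possible reading of how the two normal-mode heuristics and aggressive mode fit together is sketched below in Python. Everything here is an assumption made for illustration: each page's trailing block is given as a list of operator names, the result is the number of ops to drop from the start of each trailing block, and the `min_prefix` threshold and the names `common_prefix_len` and `plan_removal` are hypothetical, since the commit does not say what counts as "too short".

```python
from collections import Counter
from typing import List


def common_prefix_len(blocks: List[List[str]]) -> int:
    """Length of the longest operator prefix shared by every block."""
    n = 0
    for group in zip(*blocks):
        if any(op != group[0] for op in group):
            break
        n += 1
    return n


def plan_removal(blocks: List[List[str]], min_prefix: int = 8,
                 aggressive: bool = False) -> List[int]:
    """Return, per page, how many leading ops of the trailing block to drop."""
    if aggressive:
        # Aggressive mode: drop every trailing block unconditionally.
        return [len(b) for b in blocks]
    nonempty = [b for b in blocks if b]
    if len(nonempty) < 2:
        return [0] * len(blocks)
    # Heuristic 1: a long shared prefix is treated as the stamp; any
    # page-specific ops after it are kept.
    prefix = common_prefix_len(nonempty)
    if prefix >= min_prefix:
        return [prefix if b else 0 for b in blocks]
    # Heuristic 2: fall back to the most common trailing op count.
    mode_count, pages = Counter(len(b) for b in nonempty).most_common(1)[0]
    if pages < 2:
        return [0] * len(blocks)
    return [len(b) if len(b) == mode_count else 0 for b in blocks]
```

In this sketch the common-prefix path keeps whatever follows the shared sequence on each page, while the fallback removes whole blocks only on pages matching the dominant count, so a page with unusual trailing art survives unless aggressive mode is requested.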

arcaputo3 and others added 3 commits March 12, 2026 17:22

arcaputo3 marked this pull request as ready for review March 17, 2026 12:13

arcaputo3 merged commit e1ed0de into main Mar 17, 2026
17 checks passed

arcaputo3 deleted the feat/pdf-watermark-removal branch March 17, 2026 12:43

arcaputo3 merged 3 commits into main from feat/pdf-watermark-removal
