
feat: PDF watermark removal via --remove-watermarks #70

Merged
arcaputo3 merged 3 commits into main from
feat/pdf-watermark-removal
Mar 17, 2026

Conversation

Contributor

@arcaputo3 arcaputo3 commented Mar 16, 2026

Summary

  • Adds PDF-to-PDF conversion with --remove-watermarks flag (CLI + HTTP)
  • Three-layer watermark removal: artifacts, annotations, and content-stream vector paths
  • Vector path removal uses two strategies:
    • Marked-content: strips trailing blocks wrapped in explicit BMC/BDC Watermark tags
    • Cross-page heuristic: for unmarked stamps (e.g., CorpTax), strips trailing path blocks only when 2+ pages share the exact same operator count (watermark signature)
  • Wires all ConvertOptions through HTTP server query params (fixes pre-existing gap where password, strip-masters, etc. were ignored)

Test plan

  • Unit tests: artifact removal, annotation removal, dispatch guard, capability checks
  • Safety tests: password-protected PDF preservation, legitimate trailing vector art preservation
  • Integration: 80-page watermarked PDF — watermarks removed, text fidelity verified
  • Batch: 18 tax PDFs processed, 100% fidelity confirmed via text diff
  • Full test suite green (all modules)

🤖 Generated with Claude Code

arcaputo3 and others added 3 commits March 12, 2026 17:22
Add PDF-to-PDF conversion support with three-layer watermark removal:

1. Artifact-based: removes standard PDF Artifact.ArtifactSubtype.Watermark
2. Annotation-based: removes WatermarkAnnotation instances
3. Content-stream vector paths: detects and strips trailing glyph-outline
   watermarks appended after page text (e.g., CorpTax "Confidential" stamps)

The vector path strategy scans backwards from the last text operator (ET),
verifies trailing ops are path/graphics-state only, then bulk-removes them
using suppressUpdate for performance on large documents.
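The backward scan described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: a page is modeled as a plain list of PDF operator names, and the `SAFE_OPS` set and both function names are assumptions made for the example.

```python
# Hypothetical "safe" operator sets; the PR's actual sets are not shown here.
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re",
            "f", "F", "f*", "S", "s", "B", "B*", "b", "b*", "n", "W", "W*"}
STATE_OPS = {"q", "Q", "cm", "gs", "w", "J", "j", "M", "d",
             "rg", "RG", "g", "G", "k", "K"}
SAFE_OPS = PATH_OPS | STATE_OPS


def trailing_block_start(ops):
    """Return the index where a removable trailing block starts, or None.

    `ops` is the ordered list of operator names in one page's content stream.
    """
    # Scan backwards for the last end-of-text operator (ET).
    last_et = None
    for i in range(len(ops) - 1, -1, -1):
        if ops[i] == "ET":
            last_et = i
            break
    if last_et is None:
        return None  # no text on the page, nothing to anchor on

    tail = ops[last_et + 1:]
    # Only treat the tail as a candidate if it is non-empty and made up
    # purely of path and graphics-state operators.
    if not tail or not all(op in SAFE_OPS for op in tail):
        return None
    return last_et + 1


def strip_trailing_block(ops):
    """Return a copy of `ops` with the trailing vector block removed, if any."""
    start = trailing_block_start(ops)
    return list(ops) if start is None else list(ops[:start])
```

A tail containing anything outside the safe set (for example `Do`, which paints an image or form XObject) leaves the page untouched, which is the conservative choice for a destructive operation.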

Also wires ConvertOptions through the HTTP server — previously all conversion
options (password, strip-masters, fixed-layout, etc.) were silently ignored
by the /convert and /split endpoints.

CLI: xlcr convert -i watermarked.pdf -o clean.pdf --remove-watermarks
HTTP: POST /convert?to=pdf&remove-watermarks=true
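As a rough illustration of what wiring options through query params involves, the sketch below maps a request URL onto an options object. It is written in Python for brevity and is not the server code from this PR: the `ConvertOptions` fields shown, the `password` and `strip-masters` parameter spellings, and the helper names are assumptions; only `to` and `remove-watermarks` appear in the example request above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class ConvertOptions:
    # Hypothetical subset of options, for illustration only.
    to: Optional[str] = None
    password: Optional[str] = None
    strip_masters: bool = False
    remove_watermarks: bool = False


def _last(params, name):
    """Return the last value given for `name`, or None if absent."""
    values = params.get(name)
    return values[-1] if values else None


def _flag(params, name):
    """Treat a parameter as a boolean flag that is set only by 'true'."""
    value = _last(params, name)
    return value is not None and value.lower() == "true"


def options_from_url(url):
    """Build ConvertOptions from the query string of a request URL."""
    params = parse_qs(urlsplit(url).query)
    return ConvertOptions(
        to=_last(params, "to"),
        password=_last(params, "password"),
        strip_masters=_flag(params, "strip-masters"),
        remove_watermarks=_flag(params, "remove-watermarks"),
    )
```

Parsing every option in one place makes it harder for an endpoint to silently drop a flag, which is the gap this commit describes.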

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…euristics

Three-layer safety improvements to the vector watermark removal strategy:

- Strategy 3a: only remove trailing vector blocks explicitly wrapped in
  BMC/BDC marked-content tags with Watermark semantics
- Strategy 3b: for unmarked watermarks (e.g., CorpTax stamps), use
  cross-page consistency — only strip when 2+ pages share the exact same
  trailing path operator count (watermark stamp signature). Single-page
  docs and unique trailing vector art are preserved.
- Use getCommandName() instead of toString() for operator matching
- Extract shared safe-command sets for clarity
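The cross-page consistency rule in Strategy 3b can be sketched as below. This is a simplified Python model, not the PR's implementation: each page is reduced to the number of trailing path operators found after its last text block, and the function name is an assumption.

```python
from collections import Counter


def pages_to_strip(trailing_counts):
    """Return indices of pages whose trailing block looks like a shared stamp.

    `trailing_counts[i]` is the number of trailing path operators on page i
    (0 means the page has no trailing vector block).
    """
    # Count how many pages share each non-zero trailing operator count.
    freq = Counter(count for count in trailing_counts if count > 0)
    # A count is a watermark signature only if 2+ pages share it exactly.
    shared = {count for count, pages in freq.items() if pages >= 2}
    return [i for i, count in enumerate(trailing_counts) if count in shared]
```

With counts `[120, 120, 37, 0, 120]`, pages 0, 1 and 4 are selected, while the page with a unique 37-operator tail is left alone. A single-page document never qualifies, since no count can be shared.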

Also:
- Add password-protected PDF preservation test
- Add legitimate trailing vector artwork preservation test
- Fix repeated query param parsing for sheet names (getQueryParams)
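The PR does not show what the repeated-parameter bug looked like, so the following is only a guess at the general failure mode, sketched in Python. The `sheet` parameter name and both function names are assumptions. Collapsing a query string into a plain key/value map keeps only one value per key, whereas a multi-valued parse keeps them all.

```python
from urllib.parse import parse_qs


def naive_params(query):
    """Collapse the query into a dict; later duplicates overwrite earlier ones."""
    return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


def multi_params(query):
    """Keep every value for each key, in the order it appeared."""
    return parse_qs(query, keep_blank_values=True)
```

For `sheet=Q1&sheet=Q2&to=pdf`, `naive_params` yields only `Q2` for `sheet`, while `multi_params` yields `["Q1", "Q2"]`.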

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Normal mode (--remove-watermarks) now uses two heuristics in sequence:
- Common-prefix: strips identical trailing op sequences across pages,
  preserving page-specific vector art (handles Seminole-type files)
- Mode-count fallback: when prefix is too short (watermark varies per
  page), strips pages sharing the most common trailing op count
  (handles L3Harris/CorpTax-type files with per-page variation)
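One possible reading of the two heuristics in sequence is sketched below in Python. It is an interpretation under stated assumptions, not the PR's code: each page is represented by its trailing block of operator names, the shared stamp is assumed to sit at the start of that block, and `MIN_PREFIX`, `common_prefix_len` and `plan_removal` are hypothetical names and values.

```python
from collections import Counter

MIN_PREFIX = 4  # hypothetical threshold for a "long enough" shared prefix


def common_prefix_len(blocks):
    """Length of the longest operator prefix shared by every block."""
    n = 0
    for column in zip(*blocks):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def plan_removal(trailing_blocks):
    """Return {page_index: number_of_trailing_block_ops_to_remove}."""
    candidates = {i: b for i, b in enumerate(trailing_blocks) if b}
    if len(candidates) < 2:
        return {}  # nothing to compare against, so remove nothing

    # Heuristic 1: strip only the sequence that is identical on all pages,
    # leaving any page-specific operators that follow it.
    shared = common_prefix_len(list(candidates.values()))
    if shared >= MIN_PREFIX:
        return {i: shared for i in candidates}

    # Heuristic 2: fall back to the most common trailing operator count.
    counts = Counter(len(b) for b in candidates.values())
    mode_count, pages = counts.most_common(1)[0]
    if pages < 2:
        return {}
    return {i: len(b) for i, b in candidates.items() if len(b) == mode_count}
```

Under this reading, a page that carries the stamp followed by its own artwork loses only the stamp in the first branch, and pages whose tails match no other page are left intact in the second.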

Aggressive mode (--remove-watermarks-aggressive) strips ALL trailing
path ops after the last text block on every page unconditionally. Use
when normal mode can't distinguish watermark from legitimate art.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@arcaputo3 arcaputo3 marked this pull request as ready for review March 17, 2026 12:13
@arcaputo3 arcaputo3 merged commit e1ed0de into main Mar 17, 2026
17 checks passed
@arcaputo3 arcaputo3 deleted the feat/pdf-watermark-removal branch March 17, 2026 12:43