feat: PDF watermark removal via --remove-watermarks #70
Merged
Add PDF-to-PDF conversion support with three-layer watermark removal:

1. Artifact-based: removes standard PDF Artifact.ArtifactSubtype.Watermark
2. Annotation-based: removes WatermarkAnnotation instances
3. Content-stream vector paths: detects and strips trailing glyph-outline watermarks appended after page text (e.g., CorpTax "Confidential" stamps)

The vector path strategy scans backwards from the last text operator (ET), verifies that the trailing ops are path/graphics-state only, then bulk-removes them using suppressUpdate for performance on large documents.

Also wires ConvertOptions through the HTTP server. Previously all conversion options (password, strip-masters, fixed-layout, etc.) were silently ignored by the /convert and /split endpoints.

CLI: xlcr convert -i watermarked.pdf -o clean.pdf --remove-watermarks
HTTP: POST /convert?to=pdf&remove-watermarks=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
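The backward scan described in this commit can be sketched roughly as follows. This is a minimal Python illustration, not the PR's code: it models a page's content stream as a plain list of PDF operator names, and the names `SAFE_TRAILING_OPS`, `trailing_vector_range`, and `strip_trailing_vectors` are hypothetical. The exact operator sets the PR treats as safe are not shown in the commit message, so the sets below are an assumption built from standard PDF path and graphics-state operators.

```python
# Assumed operator sets (standard PDF path and graphics-state operators).
# The PR's real "safe command" sets are not shown, so these are illustrative.
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re",
            "f", "F", "f*", "S", "s", "B", "B*", "b", "b*", "n", "W", "W*"}
GSTATE_OPS = {"q", "Q", "cm", "gs", "w", "J", "j", "M", "d",
              "g", "G", "rg", "RG", "k", "K"}
SAFE_TRAILING_OPS = PATH_OPS | GSTATE_OPS


def trailing_vector_range(ops):
    """Return (start, end) of the removable trailing block, or None.

    `ops` is one page's content stream as a list of operator names.
    """
    # Scan backwards for the last end-of-text operator (ET).
    last_et = None
    for i in range(len(ops) - 1, -1, -1):
        if ops[i] == "ET":
            last_et = i
            break
    if last_et is None or last_et == len(ops) - 1:
        return None  # no text block, or nothing follows it

    # Leave the page alone if the tail holds anything other than
    # path or graphics-state operators (e.g. an image "Do").
    tail = ops[last_et + 1:]
    if not all(op in SAFE_TRAILING_OPS for op in tail):
        return None
    return (last_et + 1, len(ops))


def strip_trailing_vectors(ops):
    """Return a copy of `ops` with the trailing vector block removed."""
    rng = trailing_vector_range(ops)
    if rng is None:
        return list(ops)
    return list(ops[:rng[0]])
```

Returning a single index range mirrors the bulk removal the commit describes: the caller can delete the whole block in one pass rather than operator by operator.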
…euristics

Three-layer safety improvements to the vector watermark removal strategy:

- Strategy 3a: only remove trailing vector blocks explicitly wrapped in BMC/BDC marked-content tags with Watermark semantics
- Strategy 3b: for unmarked watermarks (e.g., CorpTax stamps), use cross-page consistency: only strip when 2+ pages share the exact same trailing path operator count (watermark stamp signature). Single-page docs and unique trailing vector art are preserved.
- Use getCommandName() instead of toString() for operator matching
- Extract shared safe-command sets for clarity

Also:

- Add password-protected PDF preservation test
- Add legitimate trailing vector artwork preservation test
- Fix repeated query param parsing for sheet names (getQueryParams)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
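The cross-page consistency rule of Strategy 3b can be illustrated with a small sketch. It assumes each page has already been reduced to a count of trailing path operators (0 when there is no trailing block); the function name `pages_to_strip` and the `min_pages` parameter are hypothetical and only encode the "2+ pages" rule stated above.

```python
from collections import Counter


def pages_to_strip(trailing_counts, min_pages=2):
    """Return indices of pages whose trailing block looks like a shared stamp.

    `trailing_counts[i]` is the number of trailing path operators on page i,
    or 0 if the page has no trailing vector block.
    """
    # Count how many pages share each non-zero trailing operator count.
    freq = Counter(c for c in trailing_counts if c > 0)
    # A count is treated as a watermark signature only if enough pages share it.
    shared = {count for count, pages in freq.items() if pages >= min_pages}
    return [i for i, c in enumerate(trailing_counts) if c in shared]
```

Under this rule a single-page document never qualifies, and a page whose trailing count is unique (for example, a one-off diagram) is left untouched.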
Normal mode (--remove-watermarks) now uses two heuristics in sequence:

- Common-prefix: strips identical trailing op sequences across pages, preserving page-specific vector art (handles Seminole-type files)
- Mode-count fallback: when the prefix is too short (watermark varies per page), strips pages sharing the most common trailing op count (handles L3Harris/CorpTax-type files with per-page variation)

Aggressive mode (--remove-watermarks-aggressive) strips ALL trailing path ops after the last text block on every page unconditionally. Use when normal mode can't distinguish watermark from legitimate art.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
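One way to picture the two normal-mode heuristics running in sequence is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: each page is represented by its list of trailing operator names, sequences are compared from the start of each tail, and the `min_prefix` threshold for "too short" is a made-up value, since the commit message does not state one. The names `common_prefix_len` and `ops_to_strip` are hypothetical.

```python
from collections import Counter


def common_prefix_len(seqs):
    """Length of the longest prefix shared by every sequence in `seqs`."""
    n = 0
    for column in zip(*seqs):  # stops at the shortest sequence
        if len(set(column)) != 1:
            break
        n += 1
    return n


def ops_to_strip(tails, min_prefix=4):
    """Return, per page, how many trailing ops to strip from the tail's start.

    `tails[i]` is the list of operator names after the last text block on
    page i (empty if none). `min_prefix` is an assumed threshold.
    """
    nonempty = [t for t in tails if t]
    if len(nonempty) < 2:
        return [0] * len(tails)  # nothing to compare against

    # Heuristic 1: identical leading run across all pages with a tail.
    prefix = common_prefix_len(nonempty)
    if prefix >= min_prefix:
        return [prefix if t else 0 for t in tails]

    # Heuristic 2: fall back to the most common tail length.
    mode_len, pages = Counter(len(t) for t in nonempty).most_common(1)[0]
    if pages < 2:
        return [0] * len(tails)
    return [len(t) if len(t) == mode_len else 0 for t in tails]
```

In the first branch only the shared run is stripped, so extra operators that appear on a single page survive; in the fallback, pages whose tail length differs from the mode are left alone. Aggressive mode would correspond to stripping `len(t)` for every page regardless of either check.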
Summary
- --remove-watermarks flag (CLI + HTTP)
- BMC/BDC Watermark tags
- ConvertOptions through HTTP server query params (fixes pre-existing gap where password, strip-masters, etc. were ignored)

Test plan
🤖 Generated with Claude Code