Improve image-heavy PDF compression #20

Merged
rohan-patnaik merged 4 commits into main from codex/compress-image-heavy on Feb 4, 2026


@rohan-patnaik (Owner) commented Feb 4, 2026

Summary

  • add staged image-heavy compression with early Ghostscript, validation, and candidate selection
  • record compression metrics in job metadata and document new env flags
  • add compression tests and update worker/convex wiring

Testing

  • Not run (python3 -m pytest apps/worker/tests/test_tools.py; pytest not installed)

Summary by CodeRabbit

  • New Features

    • Enhanced PDF compression with multi-stage optimization pipeline
    • Image optimization and detection capabilities
    • Configurable compression profiles and settings via environment variables
    • Improved validation and error handling for PDF processing
  • Documentation

    • Updated compression pipeline documentation and configuration guidance
  • Chores

    • Updated Convex dependency

@coderabbitai (bot, Contributor) commented Feb 4, 2026

📝 Walkthrough

This PR refactors the PDF compression functionality into a comprehensive, environment-driven staged pipeline with enhanced validation, instrumentation, and configuration. It updates the worker's tool result structure to propagate compression metadata, adds extensive test coverage, documents the new pipeline, and bumps a web dependency.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Compression Configuration: apps/worker/.env.example | Adds 22 lines defining ZENPDF_COMPRESS_* environment variables controlling compression profiles, image optimization, timeouts, savings thresholds, and tool-specific settings. |
| Compression Core Implementation: apps/worker/zenpdf_worker/tools.py | Refactors compress_pdf into a staged, environment-driven pipeline with preflight checks, multi-step optimization (mutool, qpdf, ghostscript, etc.), image-heavy detection, dynamic ghostscript presets, result tracking, and finalized metadata output. Updates signature to return tuple[Path, dict]. |
| Worker Integration: apps/worker/zenpdf_worker/worker.py | Renames ToolRunResult field from result to tool_result (type Optional[Dict[str, Any]]); updates compress tool handling and job processing to use the new field name. |
| Compression Tests: apps/worker/tests/test_tools.py | Adds 69 lines including helper function _make_image_pdf, tests for image/text-heavy PDF compression, environment variable configuration validation, and encrypted PDF rejection. |
| Pipeline Documentation: docs/TOOL_TECHNIQUES.md | Expands Compress PDF section from terse notes to detailed multi-step pipeline including goals, preflight, normalization, image detection, conditional ghostscript downsampling, optimization stages, and comprehensive configuration guidance. |
| Dependencies: apps/web/package.json | Bumps convex dependency from ^1.31.6 to ^1.31.7. |
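
The signature change and the renamed field described above can be pictured with a minimal sketch. Only `tool_result` and the `tuple[Path, dict]` return shape come from the walkthrough; the `outputs` field, the stub function, the metadata keys, and `run_compress` are hypothetical stand-ins, not the code in tools.py or worker.py.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ToolRunResult:
    # "outputs" is a hypothetical field; only "tool_result" is named in the PR.
    outputs: List[Path]
    tool_result: Optional[Dict[str, Any]] = None


def compress_pdf_stub(input_path: Path, output_path: Path) -> Tuple[Path, Dict[str, Any]]:
    """Stand-in for compress_pdf: copies the input and returns (path, metadata)."""
    data = input_path.read_bytes()
    output_path.write_bytes(data)
    # Metadata keys here are illustrative assumptions.
    metadata = {
        "method": "passthrough",
        "status": "no_change",
        "original_bytes": len(data),
        "output_bytes": len(data),
    }
    return output_path, metadata


def run_compress(input_path: Path, output_path: Path) -> ToolRunResult:
    """Unpack the tuple and carry the metadata along in tool_result."""
    out_path, metadata = compress_pdf_stub(input_path, output_path)
    return ToolRunResult(outputs=[out_path], tool_result=metadata)
```

Keeping the metadata in a separate optional field means tools that return no metrics can leave `tool_result` as `None` without changing the result type.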

Estimated Code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #18: Both modify apps/worker/zenpdf_worker/tools.py to add environment-driven parallel execution and ZENPDF_COMPRESS_* settings for staging compression steps.
  • PR #17: Both implement the same staged compression pipeline refactoring with compress_pdf returning (Path, dict) and ToolRunResult/tool_result plumbing changes in worker.py.
  • PR #16: Both modify the compress_pdf implementation in tools.py, with this PR refactoring the core function and the related PR adding a ghostscript optimization path.

Poem

🐰 A compression quest, stage by stage we go,
Ghostscript, QPDF dancing through the flow,
Environment variables tune each knob and dial,
Metrics bloom, and PDFs shrink with style! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 34.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Improve image-heavy PDF compression' directly and clearly summarizes the main change: enhancing compression specifically for PDFs containing significant image content through staged processing with early Ghostscript invocation. |


@rohan-patnaik (Owner, Author) commented

CodeRabbit review (committed diff vs main) output:

Starting CodeRabbit review in plain text mode...

Connecting to review service
Setting up
Analyzing
Reviewing

Review completed ✔

@rohan-patnaik force-pushed the codex/compress-image-heavy branch from acdaef6 to 95be773 on February 4, 2026 15:36

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/worker/zenpdf_worker/tools.py (1)

923-1018: ⚠️ Potential issue | 🟠 Major

Recompute status after final output bytes are known.

Status and the “no smaller output” warning are decided before later post-processing can replace the output. If a later pass reduces size beyond thresholds, tool_result can still report no_change. Recalculate status (and the warning) after output_bytes is computed.

🛠️ Suggested fix
-    if method == "original":
-        method = "passthrough"
-        warnings.append("No smaller output found; preserving original content.")
+    if method == "original":
+        method = "passthrough"
@@
-    savings_percent = round((savings_bytes / size_bytes) * 100, 2) if size_bytes else 0.0
+    savings_ratio = (savings_bytes / size_bytes) if size_bytes else 0.0
+    status = (
+        "success"
+        if (savings_bytes >= min_savings_bytes and savings_ratio >= savings_threshold)
+        else "no_change"
+    )
+    if status == "no_change":
+        warnings.append("No smaller output found; preserving original content.")
+    savings_percent = round(savings_ratio * 100, 2) if size_bytes else 0.0
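
As a standalone illustration of the suggested fix, the sketch below decides the status only once the final output size is known. The function name, the returned keys, the clamping of negative savings to zero, and the default thresholds (taken from the values in .env.example) are assumptions; the actual variables in tools.py may be wired differently.

```python
from typing import Any, Dict, List


def finalize_result(
    size_bytes: int,
    output_bytes: int,
    method: str,
    min_savings_bytes: int = 200_000,
    savings_threshold: float = 0.08,
) -> Dict[str, Any]:
    """Compute status and warnings from the final output size."""
    warnings: List[str] = []
    if method == "original":
        method = "passthrough"

    # Assumption: a larger output counts as zero savings.
    savings_bytes = max(size_bytes - output_bytes, 0)
    savings_ratio = (savings_bytes / size_bytes) if size_bytes else 0.0

    # Both the absolute and the relative threshold must be met.
    meets_thresholds = (
        savings_bytes >= min_savings_bytes and savings_ratio >= savings_threshold
    )
    status = "success" if meets_thresholds else "no_change"
    if status == "no_change":
        warnings.append("No smaller output found; preserving original content.")

    return {
        "method": method,
        "status": status,
        "savings_bytes": savings_bytes,
        "savings_percent": round(savings_ratio * 100, 2),
        "warnings": warnings,
    }
```

Because the warning is appended in the same place the status is decided, a late pass that shrinks the file past both thresholds can no longer leave a stale "no smaller output" warning behind.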
🤖 Fix all issues with AI agents
In `@apps/worker/.env.example`:
- Around line 28-48: The .env.example contains duplicated compression
environment variables (e.g., ZENPDF_COMPRESS_PROFILE,
ZENPDF_COMPRESS_AUTO_IMAGE_HEAVY, ZENPDF_COMPRESS_USE_ZOPFLI,
ZENPDF_COMPRESS_GS_PASSTHROUGH_JPEG, ZENPDF_COMPRESS_SAVINGS_THRESHOLD_PCT,
ZENPDF_COMPRESS_MIN_SAVINGS_BYTES, ZENPDF_COMPRESS_TIMEOUT_*,
ZENPDF_COMPRESS_ENABLE_IMAGE_OPT, ZENPDF_QPDF_OI_*,
ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT, ZENPDF_COMPRESS_ENABLE_JBIG2,
ZENPDF_COMPRESS_PDFSIZEOPT_ARGS) causing ambiguous defaults; remove the
duplicate definitions so each variable appears only once, keeping the intended
canonical values (either the earlier block or this block) and consolidating
timeout variables (e.g., ZENPDF_COMPRESS_TIMEOUT_SECONDS vs timeout
base/per-page/max) into a single clear set of keys, ensuring comments explain
defaults where appropriate.

Comment on lines +28 to +48
# Compression tuning (defaults)
ZENPDF_COMPRESS_PROFILE=balanced
ZENPDF_COMPRESS_AUTO_IMAGE_HEAVY=1
ZENPDF_COMPRESS_USE_ZOPFLI=0
ZENPDF_COMPRESS_GS_PASSTHROUGH_JPEG=0
ZENPDF_COMPRESS_SAVINGS_THRESHOLD_PCT=0.08
ZENPDF_COMPRESS_MIN_SAVINGS_BYTES=200000
ZENPDF_COMPRESS_TIMEOUT_BASE_SECONDS=120
ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS=3
ZENPDF_COMPRESS_TIMEOUT_PER_PAGE_SECONDS=1.5
ZENPDF_COMPRESS_TIMEOUT_MAX_SECONDS=900
# Leave empty to use calculated timeout; set to a positive integer to override.
ZENPDF_COMPRESS_TIMEOUT_SECONDS=
ZENPDF_COMPRESS_ENABLE_IMAGE_OPT=0
ZENPDF_QPDF_OI_QUALITY=75
ZENPDF_QPDF_OI_MIN_WIDTH=128
ZENPDF_QPDF_OI_MIN_HEIGHT=128
ZENPDF_QPDF_OI_MIN_AREA=16384
ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT=0
ZENPDF_COMPRESS_ENABLE_JBIG2=0
ZENPDF_COMPRESS_PDFSIZEOPT_ARGS=

⚠️ Potential issue | 🟡 Minor

Remove duplicated compression keys to avoid ambiguous defaults.

This block redefines keys already set above, which can lead to confusion about which value is used. Please consolidate to a single definition per key (either move the earlier entries into this block or trim duplicates here).

🛠️ Suggested cleanup (trim duplicates in the new block)
 ZENPDF_COMPRESS_SAVINGS_THRESHOLD_PCT=0.08
 ZENPDF_COMPRESS_MIN_SAVINGS_BYTES=200000
-ZENPDF_COMPRESS_TIMEOUT_BASE_SECONDS=120
-ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS=3
-ZENPDF_COMPRESS_TIMEOUT_PER_PAGE_SECONDS=1.5
-ZENPDF_COMPRESS_TIMEOUT_MAX_SECONDS=900
 # Leave empty to use calculated timeout; set to a positive integer to override.
 ZENPDF_COMPRESS_TIMEOUT_SECONDS=
-ZENPDF_COMPRESS_ENABLE_IMAGE_OPT=0
-ZENPDF_QPDF_OI_QUALITY=75
-ZENPDF_QPDF_OI_MIN_WIDTH=128
-ZENPDF_QPDF_OI_MIN_HEIGHT=128
-ZENPDF_QPDF_OI_MIN_AREA=16384
-ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT=0
-ZENPDF_COMPRESS_ENABLE_JBIG2=0
-ZENPDF_COMPRESS_PDFSIZEOPT_ARGS=
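
The comment kept in the cleanup above says that an empty ZENPDF_COMPRESS_TIMEOUT_SECONDS falls back to a calculated timeout. The PR page does not show how that timeout is calculated, so the following is only a guess at one plausible shape: a base value plus per-MB and per-page terms, capped at the maximum. The function names and the formula are assumptions; the variable names and defaults are the ones listed in .env.example.

```python
import os
from typing import Mapping, Optional


def _float_env(env: Mapping[str, str], key: str, default: float) -> float:
    """Read a float setting, falling back to the default when unset or invalid."""
    raw = (env.get(key) or "").strip()
    try:
        return float(raw) if raw else default
    except ValueError:
        return default


def compress_timeout_seconds(
    size_bytes: int, page_count: int, env: Optional[Mapping[str, str]] = None
) -> float:
    """Return the explicit override if set, else a size- and page-based estimate."""
    env = os.environ if env is None else env

    # A positive integer override wins over the calculated value.
    override = (env.get("ZENPDF_COMPRESS_TIMEOUT_SECONDS") or "").strip()
    if override.isdigit() and int(override) > 0:
        return float(int(override))

    base = _float_env(env, "ZENPDF_COMPRESS_TIMEOUT_BASE_SECONDS", 120.0)
    per_mb = _float_env(env, "ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS", 3.0)
    per_page = _float_env(env, "ZENPDF_COMPRESS_TIMEOUT_PER_PAGE_SECONDS", 1.5)
    cap = _float_env(env, "ZENPDF_COMPRESS_TIMEOUT_MAX_SECONDS", 900.0)

    # Assumed formula: linear in file size (MiB) and page count, capped at the max.
    size_mb = size_bytes / (1024 * 1024)
    return min(base + per_mb * size_mb + per_page * page_count, cap)
```

Under these assumptions, a 10 MiB file with 20 pages and default settings would get 120 + 30 + 30 = 180 seconds.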
🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 30-30: [UnorderedKey] The ZENPDF_COMPRESS_AUTO_IMAGE_HEAVY key should go before the ZENPDF_COMPRESS_PROFILE key
[warning] 32-32: [UnorderedKey] The ZENPDF_COMPRESS_GS_PASSTHROUGH_JPEG key should go before the ZENPDF_COMPRESS_PROFILE key
[warning] 33-33: [UnorderedKey] The ZENPDF_COMPRESS_SAVINGS_THRESHOLD_PCT key should go before the ZENPDF_COMPRESS_USE_ZOPFLI key
[warning] 34-34: [UnorderedKey] The ZENPDF_COMPRESS_MIN_SAVINGS_BYTES key should go before the ZENPDF_COMPRESS_PROFILE key
[warning] 35-35: [DuplicatedKey] The ZENPDF_COMPRESS_TIMEOUT_BASE_SECONDS key is duplicated
[warning] 35-35: [UnorderedKey] The ZENPDF_COMPRESS_TIMEOUT_BASE_SECONDS key should go before the ZENPDF_COMPRESS_USE_ZOPFLI key
[warning] 36-36: [DuplicatedKey] The ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS key is duplicated
[warning] 36-36: [UnorderedKey] The ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS key should go before the ZENPDF_COMPRESS_USE_ZOPFLI key
[warning] 37-37: [DuplicatedKey] The ZENPDF_COMPRESS_TIMEOUT_PER_PAGE_SECONDS key is duplicated
[warning] 37-37: [UnorderedKey] The ZENPDF_COMPRESS_TIMEOUT_PER_PAGE_SECONDS key should go before the ZENPDF_COMPRESS_USE_ZOPFLI key
[warning] 38-38: [DuplicatedKey] The ZENPDF_COMPRESS_TIMEOUT_MAX_SECONDS key is duplicated
[warning] 38-38: [UnorderedKey] The ZENPDF_COMPRESS_TIMEOUT_MAX_SECONDS key should go before the ZENPDF_COMPRESS_TIMEOUT_PER_MB_SECONDS key
[warning] 40-40: [UnorderedKey] The ZENPDF_COMPRESS_TIMEOUT_SECONDS key should go before the ZENPDF_COMPRESS_USE_ZOPFLI key
[warning] 41-41: [DuplicatedKey] The ZENPDF_COMPRESS_ENABLE_IMAGE_OPT key is duplicated
[warning] 41-41: [UnorderedKey] The ZENPDF_COMPRESS_ENABLE_IMAGE_OPT key should go before the ZENPDF_COMPRESS_GS_PASSTHROUGH_JPEG key
[warning] 42-42: [DuplicatedKey] The ZENPDF_QPDF_OI_QUALITY key is duplicated
[warning] 43-43: [DuplicatedKey] The ZENPDF_QPDF_OI_MIN_WIDTH key is duplicated
[warning] 43-43: [UnorderedKey] The ZENPDF_QPDF_OI_MIN_WIDTH key should go before the ZENPDF_QPDF_OI_QUALITY key
[warning] 44-44: [DuplicatedKey] The ZENPDF_QPDF_OI_MIN_HEIGHT key is duplicated
[warning] 44-44: [UnorderedKey] The ZENPDF_QPDF_OI_MIN_HEIGHT key should go before the ZENPDF_QPDF_OI_MIN_WIDTH key
[warning] 45-45: [DuplicatedKey] The ZENPDF_QPDF_OI_MIN_AREA key is duplicated
[warning] 45-45: [UnorderedKey] The ZENPDF_QPDF_OI_MIN_AREA key should go before the ZENPDF_QPDF_OI_MIN_HEIGHT key
[warning] 46-46: [DuplicatedKey] The ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT key is duplicated
[warning] 46-46: [UnorderedKey] The ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT key should go before the ZENPDF_COMPRESS_GS_PASSTHROUGH_JPEG key
[warning] 47-47: [DuplicatedKey] The ZENPDF_COMPRESS_ENABLE_JBIG2 key is duplicated
[warning] 47-47: [UnorderedKey] The ZENPDF_COMPRESS_ENABLE_JBIG2 key should go before the ZENPDF_COMPRESS_ENABLE_PDFSIZEOPT key
[warning] 48-48: [DuplicatedKey] The ZENPDF_COMPRESS_PDFSIZEOPT_ARGS key is duplicated
[warning] 48-48: [UnorderedKey] The ZENPDF_COMPRESS_PDFSIZEOPT_ARGS key should go before the ZENPDF_COMPRESS_PROFILE key
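
The DuplicatedKey warnings above can be reproduced without the linter using a few lines of Python. This is a rough sketch, not how dotenv-linter works internally: `duplicated_env_keys` is a hypothetical helper that treats every non-comment line containing `=` as a key definition and ignores quoting and `export` prefixes.

```python
from collections import Counter
from typing import List


def duplicated_env_keys(text: str) -> List[str]:
    """Return keys defined more than once, in order of first appearance."""
    keys: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        keys.append(stripped.split("=", 1)[0].strip())
    counts = Counter(keys)
    # dict.fromkeys preserves first-seen order while dropping repeats.
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]
```

Running it over the contents of apps/worker/.env.example, for example `duplicated_env_keys(Path("apps/worker/.env.example").read_text())`, should list any key that still appears in both blocks.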

@rohan-patnaik merged commit 97fcbe6 into main on Feb 4, 2026
3 checks passed