Empower your AI agents with the ability to securely read and extract information from PDF files using the Model Context Protocol (MCP).
- Extract text content from PDF files (full document or specific pages)
- Extract embedded images from PDF pages as base64-encoded data
- Preserve content order: text and images are returned in document layout order (new in v1.2.0)
- Get metadata (author, title, creation date, etc.)
- Count pages in PDF documents
- Support for both local files and URLs
- Secure: confines file access to the project root directory
- Fast: parallel processing of pages
- Batch processing: handle multiple PDFs in a single request
- Multiple deployment options: npm or Smithery
- Y-Coordinate Based Ordering: Text and images returned in exact document order
- Natural Reading Flow: Content parts preserve the layout sequence as it appears in the PDF
- Intelligent Grouping: Automatically groups text items on the same line
- Optimized for AI: Enables AI models to understand content in natural reading order
- Image Extraction: Extract embedded images from PDF pages as base64-encoded data
- Performance Optimization: Parallel page processing for 5-10x speedup
- Deep Refactoring: Modular architecture with 98.9% test coverage (91 tests)
- Fixed critical bugs: Buffer/Uint8Array compatibility for PDF.js v5.x
- Fixed schema validation: Resolved exclusiveMinimum issue affecting Windsurf, Mistral API, and other tools
- Improved metadata extraction: Robust fallback handling for PDF.js compatibility
- Updated dependencies: All packages updated to latest versions
- Migrated to Biome: 50x faster linting and formatting with unified tooling
Install automatically for Claude Desktop:
npx -y @smithery/cli install @sylphxltd/pdf-reader-mcp --client claude

Install the package:
pnpm add @sylphx/pdf-reader-mcp
# or
npm install @sylphx/pdf-reader-mcp

Configure your MCP client (e.g., Claude Desktop, Cursor):
{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"]
    }
  }
}

Important: Make sure your MCP client sets the working directory (cwd) to your project root.
git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install
pnpm run build

Then configure your MCP client to use node dist/index.js.
Once configured, your AI agent can read PDFs using the read_pdf tool:
{
  "sources": [
    {
      "path": "documents/report.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_metadata": true
}

{
  "sources": [{ "path": "documents/report.pdf" }],
  "include_metadata": true,
  "include_page_count": true,
  "include_full_text": false
}

{
  "sources": [
    {
      "url": "https://example.com/document.pdf"
    }
  ],
  "include_full_text": true
}

{
  "sources": [
    { "path": "doc1.pdf", "pages": "1-5" },
    { "path": "doc2.pdf" },
    { "url": "https://example.com/doc3.pdf" }
  ],
  "include_full_text": true
}

{
  "sources": [
    {
      "path": "presentation.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_images": true,
  "include_full_text": true
}

Response includes:
- Text content from each page
- Embedded images as base64-encoded data with metadata (width, height, format)
- Each image includes page number and index
Note: Image extraction works best with JPEG and PNG images. Large PDFs with many images may produce large responses.
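The example requests above share a common shape. As a rough sketch, that shape can be written down as TypeScript types with a small pre-flight check. The type names and the `checkRequest` helper are hypothetical and inferred only from the examples in this README; they are not the server's published schema, and the "exactly one of path or url" rule is an assumption.

```typescript
// Hypothetical types inferred from the example requests; not the official schema.
type PageSpec = number[] | string;

interface PdfSource {
  path?: string; // relative path inside the project root
  url?: string; // remote PDF
  pages?: PageSpec; // omit to read every page
}

interface ReadPdfRequest {
  sources: PdfSource[];
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_full_text?: boolean;
  include_images?: boolean;
}

// Returns a list of problems; an empty list means the request looks well formed.
function checkRequest(req: ReadPdfRequest): string[] {
  const problems: string[] = [];
  if (req.sources.length === 0) {
    problems.push("sources must not be empty");
  }
  req.sources.forEach((src, i) => {
    const hasPath = typeof src.path === "string";
    const hasUrl = typeof src.url === "string";
    // Assumption: each source names exactly one location.
    if (hasPath === hasUrl) {
      problems.push(`sources[${i}]: give exactly one of path or url`);
    }
  });
  return problems;
}

const request: ReadPdfRequest = {
  sources: [{ path: "presentation.pdf", pages: [1, 2, 3] }],
  include_images: true,
  include_full_text: true,
};
```

Running a check like this on the client side can catch a malformed request before it reaches the read_pdf tool; the server's own validation remains authoritative.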
You can specify pages in multiple ways:
- Array of page numbers: [1, 3, 5] (1-based indexing)
- Range string: "1-10" (extracts pages 1 through 10)
- Multiple ranges: "1-5,10-15,20" (commas separate ranges and individual pages)
- Omit for all pages: Don't include the pages field to extract all pages
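To make the page formats concrete, here is a minimal sketch of how a page specification could be expanded into a list of 1-based page numbers. The `parsePages` function is hypothetical and is not taken from the server's source; sorting and de-duplicating the result is an assumption made for this illustration.

```typescript
// Hypothetical helper: expands a page spec into sorted, unique 1-based page numbers.
// Returns undefined when the spec is omitted, meaning "all pages".
function parsePages(spec: number[] | string | undefined): number[] | undefined {
  if (spec === undefined) return undefined;

  const pages: number[] = [];
  const add = (n: number): void => {
    if (!Number.isInteger(n) || n < 1) {
      throw new Error(`Invalid page number: ${n}`);
    }
    if (pages.indexOf(n) === -1) pages.push(n);
  };

  if (Array.isArray(spec)) {
    spec.forEach(add);
  } else {
    for (const part of spec.split(",")) {
      // Matches "7" or "3-9", allowing surrounding whitespace.
      const m = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
      if (!m) throw new Error(`Invalid page range: "${part}"`);
      const start = Number(m[1]);
      const end = m[2] === undefined ? start : Number(m[2]);
      if (end < start) throw new Error(`Range runs backwards: "${part}"`);
      for (let p = start; p <= end; p++) add(p);
    }
  }
  return pages.sort((a, b) => a - b);
}
```

For example, `parsePages("1-3,5")` yields `[1, 2, 3, 5]`, and `parsePages(undefined)` yields `undefined` to signal that every page should be read.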
For large PDF files (>20 MB), extract specific pages instead of the full document:
{
  "sources": [
    {
      "path": "large-document.pdf",
      "pages": "1-10"
    }
  ]
}

This prevents hitting AI model context limits and improves performance.
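One way to apply this advice is to read a large document in fixed-size batches. The sketch below builds the range strings for each batch; `pageBatches` is a hypothetical client-side helper, and it assumes the total page count is already known (for instance from an earlier request with include_page_count set to true).

```typescript
// Hypothetical helper: splits pages 1..totalPages into range strings of at most batchSize pages.
function pageBatches(totalPages: number, batchSize: number): string[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    // A single trailing page is written as "21" rather than "21-21".
    ranges.push(start === end ? String(start) : `${start}-${end}`);
  }
  return ranges;
}

// One read_pdf request body per batch, using the range-string form of "pages".
const batchRequests = pageBatches(25, 10).map((pages) => ({
  sources: [{ path: "large-document.pdf", pages }],
}));
```

With 25 pages and a batch size of 10, this produces three requests covering "1-10", "11-20" and "21-25".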
Extract embedded images from PDF pages as base64-encoded data:
{
  "sources": [{ "path": "document.pdf" }],
  "include_images": true
}

Image data format:
{
  "images": [
    {
      "page": 1,
      "index": 0,
      "width": 800,
      "height": 600,
      "format": "rgb",
      "data": "base64-encoded-image-data..."
    }
  ]
}

Supported formats:
- RGB - Standard color images (most common)
- RGBA - Images with transparency
- Grayscale - Black and white images
- Works with JPEG, PNG, and other embedded formats
Important considerations:
- Image extraction increases response size significantly
- Useful for AI models with vision capabilities
- Set include_images: false (default) to extract text only
- Combine with the pages parameter to limit extraction scope
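To get a feel for how much images add to a response, the decoded size of each base64 payload can be estimated from its string length alone. This is a sketch: the `ExtractedImage` interface mirrors the image data format shown above, while `base64ByteLength` and `totalImageBytes` are hypothetical helpers, not part of the server.

```typescript
// Shape mirrors the "Image data format" example; the interface name is hypothetical.
interface ExtractedImage {
  page: number;
  index: number;
  width: number;
  height: number;
  format: string;
  data: string; // base64-encoded image data
}

// Every 4 base64 characters encode 3 bytes; trailing "=" padding carries no data.
function base64ByteLength(b64: string): number {
  const unpadded = b64.replace(/=+$/, "");
  return Math.floor((unpadded.length * 3) / 4);
}

// Sums the decoded payload size of all images in a response.
function totalImageBytes(images: ExtractedImage[]): number {
  return images.reduce((sum, img) => sum + base64ByteLength(img.data), 0);
}
```

Because base64 inflates data by roughly a third, the text actually sent to the model is larger than the decoded byte count, which is another reason to limit extraction with the pages parameter.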
Text and images are now returned in exact document order!
The server uses Y-coordinates from PDF.js to preserve the natural reading flow of the document. This means AI models receive content parts in the same sequence as they appear on the page.
Example document layout:
Page 1:
[Heading text]
[Image: Chart]
[Description text]
[Image: Photo A]
[Image: Photo B]
[Conclusion text]
Content parts returned:
[
{ type: "text", text: "Heading text" },
{ type: "image", data: "base64..." }, // Chart
{ type: "text", text: "Description text" },
{ type: "image", data: "base64..." }, // Photo A
{ type: "image", data: "base64..." }, // Photo B
{ type: "text", text: "Conclusion text" }
]
Benefits:
- AI understands context between text and images
- Natural reading flow preserved
- Better comprehension for complex documents
- Automatic line grouping for multi-line text blocks
When is ordering applied?
- Automatically enabled when include_images: true
- Works with both specific pages and full document extraction
- Content on each page is independently sorted by Y-position
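The ordering idea can be illustrated with a small sketch. This is not the server's implementation: the `PositionedItem` shape, the `orderParts` function and the 2-unit line tolerance are all assumptions made for the example. It assumes the usual PDF coordinate convention where y grows upward, so a larger y means higher on the page.

```typescript
// Hypothetical positioned items, as they might look after extraction from one page.
interface TextItem { kind: "text"; text: string; x: number; y: number }
interface ImageItem { kind: "image"; data: string; x: number; y: number }
type PositionedItem = TextItem | ImageItem;

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

// Sorts items top to bottom and merges text items whose y values are within lineTolerance.
function orderParts(items: PositionedItem[], lineTolerance = 2): ContentPart[] {
  // Larger y is higher on the page, so sort descending to read top-down.
  const sorted = items.slice().sort((a, b) => b.y - a.y);
  const parts: ContentPart[] = [];
  let line: TextItem[] = [];

  // Emits the current line as one text part, reading left to right.
  const flush = (): void => {
    if (line.length === 0) return;
    const text = line
      .slice()
      .sort((a, b) => a.x - b.x)
      .map((t) => t.text)
      .join(" ");
    parts.push({ type: "text", text });
    line = [];
  };

  for (const item of sorted) {
    if (item.kind === "image") {
      flush();
      parts.push({ type: "image", data: item.data });
      continue;
    }
    const first = line[0];
    if (first !== undefined && Math.abs(first.y - item.y) > lineTolerance) flush();
    line.push(item);
  }
  flush();
  return parts;
}
```

Two text fragments at y = 500 and y = 501 end up in the same text part, ordered by x, while an image at y = 600 is emitted before them as its own part.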
Important: The server only accepts relative paths for security reasons. Absolute paths are blocked to prevent unauthorized file system access.
Good: "path": "documents/report.pdf"
Bad: "path": "/Users/john/documents/report.pdf"
Solution: Configure the cwd (current working directory) in your MCP client settings.
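The relative-path rule can be sketched as a pure string check. This is an illustration only, not the server's validation code: `normalizeRelativePath` is a hypothetical function, and rejecting ".." segments that climb above the root is an assumption based on the statement that file access is confined to the project root. A real implementation would more likely use Node's path module and might also account for symlinks.

```typescript
// Hypothetical check: accepts only paths that stay inside the project root.
// Returns the normalized relative path, or throws if the path is not allowed.
function normalizeRelativePath(input: string): string {
  if (input.length === 0) throw new Error("Path must not be empty");

  // Reject POSIX absolute paths, UNC paths and Windows drive-letter paths.
  if (/^[\\/]/.test(input) || /^[A-Za-z]:/.test(input)) {
    throw new Error("Absolute paths are not allowed");
  }

  const segments: string[] = [];
  for (const part of input.split(/[\\/]+/)) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // Climbing above the first segment would leave the project root.
      if (segments.length === 0) throw new Error("Path escapes the project root");
      segments.pop();
    } else {
      segments.push(part);
    }
  }
  return segments.join("/");
}
```

Under these assumptions, "documents/report.pdf" passes unchanged, while "/Users/john/documents/report.pdf" and "../secret.pdf" are both rejected.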
Solution: Clear npm cache and reinstall:
npm cache clean --force
npx @sylphx/pdf-reader-mcp@latest

Restart your MCP client completely after updating.
Causes:
- Using absolute paths (not allowed for security)
- Incorrect working directory
Solution: Use relative paths and configure cwd in your MCP client:
{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/your/project"
    }
  }
}

Solution: Update to the latest version (all recent compatibility issues have been fixed):
npm update @sylphx/pdf-reader-mcp@latest

Then restart your editor completely.
Benchmarks on a standard PDF file:
| Operation | Ops/sec | Speed |
|---|---|---|
| Handle Non-Existent File | ~12,933 | Fastest |
| Get Full Text | ~5,575 | |
| Get Specific Page | ~5,329 | |
| Get Multiple Pages | ~5,242 | |
| Get Metadata & Page Count | ~4,912 | Slowest |
Performance varies based on PDF complexity and system resources.
See Performance Documentation for details.
- Runtime: Node.js 22+
- PDF Processing: PDF.js (pdfjs-dist)
- Validation: Zod with JSON Schema generation
- Protocol: Model Context Protocol (MCP) SDK
- Build: TypeScript
- Testing: Vitest with 100% coverage goal
- Code Quality: Biome (linting + formatting)
- CI/CD: GitHub Actions
- Security First: Strict path validation and sandboxing
- Simple Interface: Single tool handles all PDF operations
- Structured Output: Predictable JSON format for AI parsing
- Performance: Efficient caching and lazy loading
- Reliability: Comprehensive error handling and validation
See Design Philosophy for more details.
- Node.js >= 22.0.0
- pnpm (recommended) or npm
git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install

pnpm run build        # Build TypeScript to dist/
pnpm run watch        # Build in watch mode
pnpm run test         # Run tests
pnpm run test:watch   # Run tests in watch mode
pnpm run test:cov     # Run tests with coverage
pnpm run check        # Run Biome (lint + format check)
pnpm run check:fix    # Fix Biome issues automatically
pnpm run lint         # Lint with Biome
pnpm run format       # Format with Biome
pnpm run typecheck    # TypeScript type checking
pnpm run benchmark    # Run performance benchmarks
pnpm run validate     # Full validation (check + test)

We maintain high test coverage using Vitest:
pnpm run test         # Run all tests
pnpm run test:cov     # Run with coverage report

All tests must pass before merging. Current: 31/31 tests passing.
The project uses Biome for fast, unified linting and formatting:
pnpm run check        # Check code quality
pnpm run check:fix    # Auto-fix issues

We welcome contributions! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and ensure tests pass
- Run pnpm run check:fix to format code
- Commit using Conventional Commits
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
- Full Documentation - Complete guides and API reference
- Getting Started Guide - Quick start guide
- API Reference - Detailed API documentation
- Design Philosophy - Architecture and design decisions
- Performance - Benchmarks and optimization
- Comparison - How it compares to alternatives
- Image extraction from PDFs: completed (v1.0.0)
- Performance optimizations for parallel processing: completed (v1.0.0)
- Annotation extraction support
- OCR integration for scanned PDFs
- Streaming support for very large files
- Enhanced caching mechanisms
- PDF form field extraction
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Contributing: CONTRIBUTING.md
If you find this project useful, please:
- Star the repository
- Watch for updates
- Report bugs
- Suggest features
- Contribute code
This project is licensed under the MIT License.
Made with ❤️ by Sylphx
