An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses PDF.js (pdfjs-dist).

sylphxltd/pdf-reader-mcp

PDF Reader MCP Server

Empower your AI agents with the ability to securely read and extract information from PDF files using the Model Context Protocol (MCP).

✨ Features

  • 📄 Extract text content from PDF files (full document or specific pages)
  • 🖼️ Extract embedded images from PDF pages as base64-encoded data
  • 📐 Preserve content order - Text and images returned in exact document layout order (NEW v1.2.0)
  • 📊 Get metadata (author, title, creation date, etc.)
  • 🔢 Count pages in PDF documents
  • 🌐 Support for both local files and URLs
  • 🛡️ Secure - Confines file access to project root directory
  • ⚡ Fast - Parallel processing for maximum performance
  • 🔄 Batch processing - Handle multiple PDFs in a single request
  • 📦 Multiple deployment options - npm or Smithery

🆕 Recent Updates (October 2025)

v1.2.0 - Content Ordering (Latest)

  • ✅ Y-Coordinate Based Ordering: Text and images returned in exact document order
  • ✅ Natural Reading Flow: Content parts preserve the layout sequence as it appears in PDF
  • ✅ Intelligent Grouping: Automatically groups text items on the same line
  • ✅ Optimized for AI: Enables AI models to understand content in natural reading order

v1.1.0 - Image Extraction

  • ✅ Image Extraction: Extract embedded images from PDF pages as base64-encoded data
  • ✅ Performance Optimization: Parallel page processing for 5-10x speedup
  • ✅ Deep Refactoring: Modular architecture with 98.9% test coverage (91 tests)

Previous Updates

  • ✅ Fixed critical bugs: Buffer/Uint8Array compatibility for PDF.js v5.x
  • ✅ Fixed schema validation: Resolved exclusiveMinimum issue affecting Windsurf, Mistral API, and other tools
  • ✅ Improved metadata extraction: Robust fallback handling for PDF.js compatibility
  • ✅ Updated dependencies: All packages updated to latest versions
  • ✅ Migrated to Biome: 50x faster linting and formatting with unified tooling

📦 Installation

Option 1: Using Smithery (Easiest)

Install automatically for Claude Desktop:

npx -y @smithery/cli install @sylphxltd/pdf-reader-mcp --client claude

Option 2: Using npm/pnpm (Recommended)

Install the package:

pnpm add @sylphx/pdf-reader-mcp
# or
npm install @sylphx/pdf-reader-mcp

Configure your MCP client (e.g., Claude Desktop, Cursor):

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"]
    }
  }
}

Important: Make sure your MCP client sets the working directory (cwd) to your project root. The server only accepts relative paths, so files are located relative to that directory.

Option 3: Local Development Build

git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install
pnpm run build

Then configure your MCP client to use node dist/index.js.

🚀 Quick Start

Once configured, your AI agent can read PDFs using the read_pdf tool:

Example 1: Extract text from specific pages

{
  "sources": [
    {
      "path": "documents/report.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_metadata": true
}

Example 2: Get metadata and page count only

{
  "sources": [{ "path": "documents/report.pdf" }],
  "include_metadata": true,
  "include_page_count": true,
  "include_full_text": false
}

Example 3: Read from URL

{
  "sources": [
    {
      "url": "https://example.com/document.pdf"
    }
  ],
  "include_full_text": true
}

Example 4: Process multiple PDFs

{
  "sources": [
    { "path": "doc1.pdf", "pages": "1-5" },
    { "path": "doc2.pdf" },
    { "url": "https://example.com/doc3.pdf" }
  ],
  "include_full_text": true
}

Example 5: Extract images from PDF

{
  "sources": [
    {
      "path": "presentation.pdf",
      "pages": [1, 2, 3]
    }
  ],
  "include_images": true,
  "include_full_text": true
}

Response includes:

  • Text content from each page
  • Embedded images as base64-encoded data with metadata (width, height, format)
  • Each image includes page number and index

Note: Image extraction works best with JPEG and PNG images. Large PDFs with many images may produce large responses.
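The request shapes used in the examples above can be summarized as TypeScript types. This is only a sketch inferred from the examples in this README, not the server's published schema: the names `PageSpec`, `PdfSource`, `ReadPdfRequest`, and `checkSource` are hypothetical, and the rule that a source carries exactly one of `path` or `url` is an assumption.

```typescript
// Hypothetical types inferred from the README examples; the server's real
// schema (defined with Zod in the project) may differ.
type PageSpec = number[] | string; // e.g. [1, 2, 3] or "1-5,10-15,20"

interface PdfSource {
  path?: string; // relative path to a local PDF
  url?: string; // URL of a remote PDF
  pages?: PageSpec; // omit to read all pages
}

interface ReadPdfRequest {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Returns a list of problems found in a source; an empty list means it looks OK.
// Assumption: each source names exactly one of `path` or `url`.
function checkSource(source: PdfSource): string[] {
  const problems: string[] = [];
  const hasPath = typeof source.path === "string" && source.path.length > 0;
  const hasUrl = typeof source.url === "string" && source.url.length > 0;
  if (hasPath === hasUrl) {
    problems.push("a source needs exactly one of `path` or `url`");
  }
  return problems;
}

// Same content as Example 1, now checked by the compiler.
const request: ReadPdfRequest = {
  sources: [{ path: "documents/report.pdf", pages: [1, 2, 3] }],
  include_metadata: true,
};
```

Typing the request on the client side catches misspelled option names before the call reaches the server.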

📖 Usage Guide

Page Specification

You can specify pages in multiple ways:

  • Array of page numbers: [1, 3, 5] (1-based indexing)
  • Range string: "1-10" (extracts pages 1 through 10)
  • Multiple ranges: "1-5,10-15,20" (commas separate ranges and individual pages)
  • Omit for all pages: Don't include the pages field to extract all pages
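To make the accepted formats concrete, here is a minimal sketch of how such a page specification could be expanded into a list of page numbers. It is not the server's actual parser: `parsePages` is a hypothetical name, and sorting and de-duplicating the result is an assumption made for the example.

```typescript
// Illustrative parser for the page formats listed above.
// Hypothetical helper; the server's own parsing may behave differently.
function parsePages(spec: number[] | string): number[] {
  const pages = new Set<number>();
  const add = (n: number): void => {
    if (!Number.isInteger(n) || n < 1) {
      throw new Error(`Invalid page number: ${n} (pages are 1-based)`);
    }
    pages.add(n);
  };

  if (Array.isArray(spec)) {
    spec.forEach(add);
  } else {
    // Commas separate entries; each entry is a single page or a "start-end" range.
    for (const part of spec.split(",")) {
      const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
      if (!match) throw new Error(`Invalid page range: "${part}"`);
      const start = Number(match[1]);
      const end = match[2] === undefined ? start : Number(match[2]);
      if (end < start) throw new Error(`Invalid page range: "${part}"`);
      for (let n = start; n <= end; n++) add(n);
    }
  }
  // Assumption: callers want sorted, de-duplicated page numbers.
  return Array.from(pages).sort((a, b) => a - b);
}
```

For example, `parsePages("1-5,10-15,20")` expands to pages 1 through 5, 10 through 15, and 20.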

Working with Large PDFs

For large PDF files (>20 MB), extract specific pages instead of the full document:

{
  "sources": [
    {
      "path": "large-document.pdf",
      "pages": "1-10"
    }
  ]
}

This prevents hitting AI model context limits and improves performance.
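One way to work through a large document is to request it in fixed-size page batches. The sketch below assumes you already know the total page count (for instance from a metadata-only request like Example 2); `pageBatches` is a hypothetical helper and the batch size of 10 is arbitrary.

```typescript
// Hypothetical helper: split a document of `totalPages` pages into range
// strings such as "1-10", "11-20", suitable for the `pages` field.
function pageBatches(totalPages: number, batchSize: number): string[] {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    ranges.push(`${start}-${end}`);
  }
  return ranges;
}

// One request body per batch, in the same shape as the example above.
const batchRequests = pageBatches(25, 10).map((pages) => ({
  sources: [{ path: "large-document.pdf", pages }],
}));
```

Sending the batches one at a time keeps each response small enough to inspect before deciding whether to fetch more pages.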

Image Extraction

Extract embedded images from PDF pages as base64-encoded data:

{
  "sources": [{ "path": "document.pdf" }],
  "include_images": true
}

Image data format:

{
  "images": [
    {
      "page": 1,
      "index": 0,
      "width": 800,
      "height": 600,
      "format": "rgb",
      "data": "base64-encoded-image-data..."
    }
  ]
}

Supported formats:

  • ✅ RGB - Standard color images (most common)
  • ✅ RGBA - Images with transparency
  • ✅ Grayscale - Black and white images
  • ✅ Works with JPEG, PNG, and other embedded formats

Important considerations:

  • 🔸 Image extraction increases response size significantly
  • 🔸 Useful for AI models with vision capabilities
  • 🔸 Set include_images: false (default) to extract text only
  • 🔸 Combine with pages parameter to limit extraction scope
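To get a feel for how much image extraction can add to a response, the arithmetic can be sketched as follows. This assumes `data` holds raw, uncompressed pixels in the stated format, which this README does not confirm; the lowercase format strings other than "rgb" and the function names are likewise assumptions for illustration.

```typescript
// Bytes per pixel for each `format` value. Assumption: `data` holds raw,
// uncompressed pixels; the README does not specify the exact encoding.
const BYTES_PER_PIXEL = { grayscale: 1, rgb: 3, rgba: 4 } as const;
type PixelFormat = keyof typeof BYTES_PER_PIXEL;

// Size in bytes of a raw pixel buffer with the given dimensions.
function rawImageBytes(width: number, height: number, format: PixelFormat): number {
  return width * height * BYTES_PER_PIXEL[format];
}

// Length of the padded base64 string for `byteLength` bytes:
// every 3 input bytes become 4 output characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// The 800x600 "rgb" image from the example above.
const rawBytes = rawImageBytes(800, 600, "rgb"); // 1,440,000 bytes
const encodedChars = base64Length(rawBytes); // 1,920,000 characters
```

Under these assumptions a single 800x600 image already adds close to two million characters, which is why limiting extraction with `pages` matters.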

Content Ordering (NEW in v1.2.0)

Text and images are now returned in exact document order!

The server uses Y-coordinates from PDF.js to preserve the natural reading flow of the document. This means AI models receive content parts in the same sequence as they appear on the page.

Example document layout:

Page 1:
  [Heading text]
  [Image: Chart]
  [Description text]
  [Image: Photo A]
  [Image: Photo B]
  [Conclusion text]

Content parts returned:

[
  { type: "text", text: "Heading text" },
  { type: "image", data: "base64..." },  // Chart
  { type: "text", text: "Description text" },
  { type: "image", data: "base64..." },  // Photo A
  { type: "image", data: "base64..." },  // Photo B
  { type: "text", text: "Conclusion text" }
]

Benefits:

  • ✅ AI understands context between text and images
  • ✅ Natural reading flow preserved
  • ✅ Better comprehension for complex documents
  • ✅ Automatic line grouping for multi-line text blocks

When is ordering applied?

  • Automatically enabled when include_images: true
  • Works with both specific pages and full document extraction
  • Content on each page is independently sorted by Y-position
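The idea of Y-coordinate ordering with line grouping can be sketched roughly as below. This is not the server's implementation: the `Item` shape, the `orderContent` name, and the 2-unit tolerance are hypothetical, and the sketch assumes PDF user-space coordinates, where larger `y` values sit higher on the page.

```typescript
// Hypothetical content item. Assumption: `y` is in PDF user space, where
// larger values are higher on the page (origin at the bottom-left).
interface Item {
  type: "text" | "image";
  x: number;
  y: number;
  text?: string;
  data?: string;
}

type Part = { type: "text"; text: string } | { type: "image"; data: string };

// Sort items top-to-bottom, then merge text items whose y values are within
// `tolerance` of the first item on the current line into one text part.
function orderContent(items: Item[], tolerance = 2): Part[] {
  const sorted = items.slice().sort((a, b) => b.y - a.y);
  const parts: Part[] = [];
  let line: Item[] = [];

  // Emit the pending line as a single text part, ordered left to right.
  const flushLine = (): void => {
    if (line.length === 0) return;
    const text = line
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text ?? "")
      .join(" ");
    parts.push({ type: "text", text });
    line = [];
  };

  for (const item of sorted) {
    if (item.type === "image") {
      flushLine();
      parts.push({ type: "image", data: item.data ?? "" });
    } else {
      if (line.length > 0 && Math.abs(line[0].y - item.y) > tolerance) flushLine();
      line.push(item);
    }
  }
  flushLine();
  return parts;
}
```

A simple top-to-bottom sort like this does not handle multi-column layouts; it only illustrates why content on a single-column page comes back in reading order.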

Security: Relative Paths Only

Important: The server only accepts relative paths for security reasons. Absolute paths are blocked to prevent unauthorized file system access.

✅ Good: "path": "documents/report.pdf"
❌ Bad: "path": "/Users/john/documents/report.pdf"

Solution: Configure the cwd (current working directory) in your MCP client settings.
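A check of this kind can be sketched with Node's built-in `path` module. This is an illustration of the general technique, not the server's actual code: `resolveInsideRoot` is a hypothetical helper, and the sketch does not follow symlinks.

```typescript
import * as path from "node:path";

// Resolve `relativePath` against `root` and refuse anything that is absolute
// or that escapes the root (for example via "../"). Hypothetical helper, not
// the server's actual code; symlinks are not resolved here.
function resolveInsideRoot(root: string, relativePath: string): string {
  if (path.isAbsolute(relativePath)) {
    throw new Error("Absolute paths are not allowed; use a path relative to the project root");
  }
  const resolved = path.resolve(root, relativePath);
  const fromRoot = path.relative(root, resolved);
  const escapes =
    fromRoot === ".." ||
    fromRoot.startsWith(".." + path.sep) ||
    path.isAbsolute(fromRoot);
  if (escapes) {
    throw new Error("Path escapes the project root");
  }
  return resolved;
}
```

Calling `resolveInsideRoot(process.cwd(), "documents/report.pdf")` returns the full path under the working directory, while an absolute path or `"../secret.pdf"` throws. This is why the `cwd` setting of the MCP client decides which files are reachable.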

🔧 Troubleshooting

Issue: "No tools" showing up

Solution: Clear npm cache and reinstall:

npm cache clean --force
npx @sylphx/pdf-reader-mcp@latest

Restart your MCP client completely after updating.

Issue: "File not found" errors

Causes:

  1. Using absolute paths (not allowed for security)
  2. Incorrect working directory

Solution: Use relative paths and configure cwd in your MCP client:

{
  "mcpServers": {
    "pdf-reader-mcp": {
      "command": "npx",
      "args": ["@sylphx/pdf-reader-mcp"],
      "cwd": "/path/to/your/project"
    }
  }
}

Issue: Cursor/Claude Code compatibility

Solution: Update to the latest version (all recent compatibility issues have been fixed):

npm install @sylphx/pdf-reader-mcp@latest

Then restart your editor completely.

⚡ Performance

Benchmarks on a standard PDF file:

| Operation                 | Ops/sec | Speed   |
|---------------------------|---------|---------|
| Handle Non-Existent File  | ~12,933 | Fastest |
| Get Full Text             | ~5,575  |         |
| Get Specific Page         | ~5,329  |         |
| Get Multiple Pages        | ~5,242  |         |
| Get Metadata & Page Count | ~4,912  | Slowest |

Performance varies based on PDF complexity and system resources.
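Throughput figures like these are easier to compare when converted to time per call. The snippet below is plain arithmetic on the numbers in the table; `msPerOp` is a hypothetical helper name.

```typescript
// Convert a throughput figure (operations per second) to the mean time per
// operation in milliseconds.
function msPerOp(opsPerSec: number): number {
  return 1000 / opsPerSec;
}

const fullTextMs = msPerOp(5575); // about 0.18 ms per call
const metadataMs = msPerOp(4912); // about 0.20 ms per call
```

On this benchmark the slowest and fastest PDF operations differ by only a few hundredths of a millisecond per call.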

See Performance Documentation for details.

🏗️ Architecture

Tech Stack

  • Runtime: Node.js 22+
  • PDF Processing: PDF.js (pdfjs-dist)
  • Validation: Zod with JSON Schema generation
  • Protocol: Model Context Protocol (MCP) SDK
  • Build: TypeScript
  • Testing: Vitest with 100% coverage goal
  • Code Quality: Biome (linting + formatting)
  • CI/CD: GitHub Actions

Design Principles

  1. Security First: Strict path validation and sandboxing
  2. Simple Interface: Single tool handles all PDF operations
  3. Structured Output: Predictable JSON format for AI parsing
  4. Performance: Efficient caching and lazy loading
  5. Reliability: Comprehensive error handling and validation

See Design Philosophy for more details.

🧪 Development

Prerequisites

  • Node.js >= 22.0.0
  • pnpm (recommended) or npm

Setup

git clone https://github.com/sylphlab/pdf-reader-mcp.git
cd pdf-reader-mcp
pnpm install

Available Scripts

pnpm run build        # Build TypeScript to dist/
pnpm run watch        # Build in watch mode
pnpm run test         # Run tests
pnpm run test:watch   # Run tests in watch mode
pnpm run test:cov     # Run tests with coverage
pnpm run check        # Run Biome (lint + format check)
pnpm run check:fix    # Fix Biome issues automatically
pnpm run lint         # Lint with Biome
pnpm run format       # Format with Biome
pnpm run typecheck    # TypeScript type checking
pnpm run benchmark    # Run performance benchmarks
pnpm run validate     # Full validation (check + test)

Testing

We maintain high test coverage using Vitest:

pnpm run test         # Run all tests
pnpm run test:cov     # Run with coverage report

All tests must pass before merging. Current: 31/31 tests passing ✅

Code Quality

The project uses Biome for fast, unified linting and formatting:

pnpm run check        # Check code quality
pnpm run check:fix    # Auto-fix issues

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and ensure tests pass
  4. Run pnpm run check:fix to format code
  5. Commit using Conventional Commits
  6. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📚 Documentation

🗺️ Roadmap

  • Image extraction from PDFs ✅ Completed (v1.1.0)
  • Performance optimizations for parallel processing ✅ Completed (v1.1.0)
  • Annotation extraction support
  • OCR integration for scanned PDFs
  • Streaming support for very large files
  • Enhanced caching mechanisms
  • PDF form field extraction

🤝 Support & Community

If you find this project useful, please:

  • ⭐ Star the repository
  • 👀 Watch for updates
  • 🐛 Report bugs
  • 💡 Suggest features
  • 🔀 Contribute code

📄 License

This project is licensed under the MIT License.


Made with ❤️ by Sylphx
