amin-bake/pdf-to-csv

PDF to CSV Converter

License: MIT Python 3.11+ Next.js 15 TypeScript

A modern, scalable web application that converts PDF files to CSV format using microservices architecture. Built with Next.js frontend and Flask microservices, featuring real-time progress tracking, drag-and-drop interface, and cloud-ready deployment.

🏗️ Architecture

This project uses a microservices architecture with the following components:

  • Frontend: Next.js 15 with TypeScript, Tailwind CSS, and shadcn/ui
  • Upload Service: Handles file uploads and validation (Port 5001)
  • Conversion Service: PDF to CSV conversion with pdfplumber/Tabula (Port 5002)
  • Download Service: Manages file downloads and batch operations (Port 5003)
  • Shared Libraries: Common utilities, storage abstraction, and type definitions

Directory Structure

pdf-to-csv/
├── frontend/              # Next.js frontend application
│   ├── app/              # Next.js App Router
│   ├── components/       # React components
│   ├── hooks/            # Custom React hooks
│   ├── lib/              # Utilities and API client
│   └── store/            # Zustand state management
├── services/
│   ├── upload/           # Upload microservice
│   ├── conversion/       # Conversion microservice (Refactored ✨)
│   │   ├── app.py           # Flask API routes only
│   │   ├── worker.py        # Background job orchestration
│   │   ├── extractors.py    # PDF data extraction layer
│   │   ├── analyzers.py     # Table intelligence & structure analysis
│   │   ├── converters.py    # Format-specific output generation
│   │   └── ARCHITECTURE.md  # Detailed architecture documentation
│   └── download/         # Download microservice
├── shared/               # Shared Python libraries
│   ├── storage.py        # Storage backend abstraction
│   ├── types.py          # Common type definitions
│   └── utils.py          # Utility functions
├── infrastructure/       # Docker & Kubernetes configs
├── docs/                 # Comprehensive documentation
│   ├── MIGRATION_PLAN.md
│   ├── ARCHITECTURE.md
│   ├── API_SPECIFICATION.md
│   ├── FRONTEND_COMPONENTS.md
│   ├── DEPLOYMENT_GUIDE.md
│   └── DOCKER_KUBERNETES.md
├── legacy/              # Original Flask monolith (deprecated)
└── docker-compose.yml   # Development environment

## 🎯 Recent Updates

### Conversion Service Refactoring (Dec 2025)

The conversion service has been refactored from a single 1180-line file into five focused modules, each responsible for one layer of the conversion pipeline:

**Key Improvements:**
- ✅ **Modular Architecture**: Separated into 5 focused modules (128 + 178 + 84 + 272 + 350 lines)
- ✅ **Separation of Concerns**: Clear layers for extraction, analysis, conversion, orchestration, and API
- ✅ **Improved Testability**: Each component can be tested independently
- ✅ **Better Maintainability**: Changes isolated to specific modules
- ✅ **Enhanced Scalability**: Easy to add new parsers, formats, or swap implementations
- ✅ **Reduced Coupling**: Clear interfaces between layers with minimal dependencies

**New Module Structure:**
```
conversion/
├── extractors.py    (84 lines)  - Pure PDF data extraction
├── analyzers.py     (272 lines) - Table intelligence & structure detection
├── converters.py    (350 lines) - Format-specific output generation
├── worker.py        (178 lines) - Background job orchestration
└── app.py           (128 lines) - Flask API routes (reduced from 1180+)
```

See [services/conversion/ARCHITECTURE.md](services/conversion/ARCHITECTURE.md) for detailed documentation.
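
The layering described above can be sketched roughly as follows. This is an illustrative outline only, not the code in `services/conversion`: the function names (`extract_tables`, `analyze_table`, `to_csv`, `run_job`) and the data shapes are assumptions, and the extractor is a stub standing in for a real PDF parser.

```python
import csv
import io

Table = list[list[str]]


def extract_tables(pdf_path: str) -> list[Table]:
    """Extraction layer (stub): a real implementation would call a PDF parser."""
    return [[["name", "qty"], ["apple", "3"], ["", ""]]]


def analyze_table(table: Table) -> Table:
    """Analysis layer: drop rows whose cells are all empty."""
    return [row for row in table if any(cell.strip() for cell in row)]


def to_csv(tables: list[Table]) -> str:
    """Conversion layer: serialize all tables into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for table in tables:
        writer.writerows(table)
    return buffer.getvalue()


def run_job(pdf_path: str, extractor=extract_tables, converter=to_csv) -> str:
    """Orchestration layer: wire the layers together; each part is swappable."""
    tables = [analyze_table(table) for table in extractor(pdf_path)]
    return converter(tables)
```

Because the orchestration function receives the extractor and converter as arguments, a test can pass in a fake extractor, and a new output format only needs a new converter function.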

## ✨ Features

- 🏗️ **Microservices Architecture**: Scalable, independently deployable services
- 📤 **Multiple File Upload**: Drag & drop or select multiple PDF files at once
- 📊 **Smart Table Extraction**: Automatically detects and extracts tables from PDFs
- 📑 **Multiple Output Formats**: Convert to CSV, Excel (.xlsx), JSON, or plain text formats
- 🔄 **Automatic Merging**: Combines all tables from each PDF into a single output file
- 📈 **Real-time Progress**: Visual progress bars with status polling
- ⬇️ **Flexible Downloads**: Download files individually or all at once as a ZIP
- 🎯 **Dual Parser Support**: Choose between pdfplumber (default) or Tabula
- 💻 **Modern UI**: Next.js with Tailwind CSS and shadcn/ui components
- 🎨 **Multiple Color Themes**: Switch between "Earthy Forest" (green), "Cherry Blossom Bloom" (red/pink), and "Pastel Rainbow Fantasy" (dreamy pastels)
- 🌓 **Dark Mode Support**: Full light/dark mode for all color themes
- ⚡ **Background Processing**: Conversions run as background jobs while the frontend polls their status with React Query
- 🐳 **Docker Ready**: Complete containerization with docker-compose
- ☁️ **Cloud Native**: Deploy frontend to Vercel, backend to any cloud provider
- 📦 **Storage Abstraction**: Local filesystem or S3-compatible storage

## 🚀 Quick Start

### Prerequisites

- **Node.js** 18+ and npm
- **Python** 3.11+
- **(Optional)** Docker & Docker Compose
- **(Optional)** Java Runtime for Tabula parser

### Option 1: Local Development (Recommended)

1. **Clone the repository**

   ```powershell
   git clone https://github.com/amin-bake/pdf-to-csv.git
   cd pdf-to-csv
   ```

2. **Install root dependencies**

   ```powershell
   npm install
   ```

3. **Set up Frontend**

   ```powershell
   cd frontend
   npm install
   cp .env.local.example .env.local
   cd ..
   ```

4. **Set up Backend Services**

   ```powershell
   # Upload Service
   cd services/upload
   python -m venv .venv
   .\.venv\Scripts\Activate.ps1
   pip install -r requirements.txt
   cd ../..

   # Repeat for conversion and download services
   ```

5. **Run All Services**

   ```powershell
   # From root directory
   npm run dev
   ```

   This starts the frontend and the three backend services (Upload on port 5001, Conversion on port 5002, Download on port 5003).

### Option 2: Docker Compose

# Build and start all services
docker-compose up --build

# Frontend will be available at http://localhost:3000

Running the Application (Legacy)

python app.py

Then open your browser and navigate to http://localhost:5000

📖 Usage

  1. Choose Conversion Type: Select PDF to CSV, PDF to Excel, PDF to JSON, or PDF to Text from the homepage
  2. Upload Files: Drag & drop PDF files or click to browse
  3. Select Parser: Choose between pdfplumber (default) or Tabula
  4. Choose Output Options:
    • Select output format (CSV, Excel, JSON, or Text)
    • Optionally merge all tables into a single file
  5. Convert: Click "Convert" to start processing
  6. Monitor Progress: Watch real-time conversion status
  7. Download: Download individual files or all files as ZIP
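
The "all files as ZIP" download in the last step can be illustrated with Python's standard `zipfile` module. This is a minimal sketch, not the Download Service's actual code; the helper name `bundle_as_zip` and the idea of passing files as a name-to-bytes mapping are assumptions made for the example.

```python
import io
import zipfile


def bundle_as_zip(files: dict[str, bytes]) -> bytes:
    """Pack converted files into a single in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            # writestr stores the bytes under the given archive member name
            archive.writestr(name, data)
    return buffer.getvalue()
```

Building the archive in memory avoids temporary files for small batches; for very large outputs, writing the archive to disk would likely be a safer choice.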

Conversion Formats

  • PDF to CSV: Extract tables to comma-separated values format

    • Great for data analysis and spreadsheet import
    • Lightweight and universally compatible
  • PDF to Excel: Extract tables to Excel spreadsheets (.xlsx)

    • Multiple tables saved as separate sheets when merged
    • Auto-adjusted column widths for better readability
    • Native Excel format with formatting support
  • PDF to JSON: Extract tables to structured JSON format

    • Tabular data: First row used as object keys (headers)
    • Non-tabular documents: Full text extraction for CVs, resumes, reports
    • Tables preserved with metadata (table number, row/column counts, headers)
    • Automatic header cleaning (removes newlines, ensures uniqueness)
    • Empty rows and columns filtered out
    • Intelligent structure detection distinguishes titles, headers, and data
    • Perfect for web APIs and data interchange
    • Human-readable with proper indentation
  • PDF to Text: Extract plain text content from PDFs

    • Tabular documents: Tables formatted with aligned columns
    • Text documents: Full text extraction with page separators
    • Clean, readable output format
    • Preserves document structure and formatting
    • Perfect for text analysis, NLP, and content extraction
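
As an illustration of the PDF to JSON behaviour listed above (first row as keys, header cleaning, empty rows dropped), here is a small sketch. It is not the project's `converters.py`: the function names, the `column_N` fallback for blank headers, and the `_2`, `_3` suffixes for duplicate headers are assumptions. Empty-column filtering is left out to keep the example short.

```python
import json


def clean_headers(raw_headers):
    """Collapse newlines/extra whitespace and make every header unique."""
    used = set()
    cleaned = []
    for index, header in enumerate(raw_headers, start=1):
        base = " ".join((header or "").split()) or f"column_{index}"
        name, suffix = base, 2
        while name in used:  # append _2, _3, ... until the name is unused
            name = f"{base}_{suffix}"
            suffix += 1
        used.add(name)
        cleaned.append(name)
    return cleaned


def table_to_records(table):
    """Turn a table (list of rows) into a list of dicts keyed by the header row."""
    rows = [row for row in table if any((cell or "").strip() for cell in row)]
    if not rows:
        return []
    headers = clean_headers(rows[0])
    return [
        dict(zip(headers, ("" if cell is None else cell for cell in row)))
        for row in rows[1:]
    ]


def to_json(table):
    """Serialize the records with indentation for readability."""
    return json.dumps(table_to_records(table), indent=2, ensure_ascii=False)
```

For example, `to_json([["Item\nName", "Qty"], ["apple", "3"]])` yields a one-element JSON array whose object has the keys `Item Name` and `Qty`.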

Parser Options

  • pdfplumber (default): Works well with most PDFs, no additional dependencies
  • Tabula: Better for complex table layouts, requires Java runtime
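
For orientation, the default parser path might look roughly like the sketch below. It uses pdfplumber's public `open`, `pages`, and `extract_tables` calls, but the helper names and the merge strategy (appending all tables in page order) are assumptions, not the project's implementation.

```python
import csv
import io


def extract_tables_with_pdfplumber(pdf_path):
    """Return every table on every page as a list of rows."""
    import pdfplumber  # imported lazily so the CSV helper works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def merge_tables_to_csv(tables):
    """Write all tables, in order, into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for table in tables:
        for row in table:
            # pdfplumber reports empty cells as None; write them as blanks
            writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

A call such as `merge_tables_to_csv(extract_tables_with_pdfplumber("report.pdf"))` would then produce a single merged CSV, where `report.pdf` is a placeholder path.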

📚 Documentation

Comprehensive documentation is available in the /docs directory, which includes the migration plan, architecture notes, API specification, frontend component guide, and deployment guides.

🔌 API Endpoints

Upload Service (Port 5001)

  • POST /api/v1/upload - Upload PDF file
  • GET /health - Health check

Conversion Service (Port 5002)

  • POST /api/v1/convert - Start conversion job
  • GET /api/v1/status/:id - Check conversion status
  • DELETE /api/v1/convert/:id - Cancel conversion
  • GET /health - Health check

Download Service (Port 5003)

  • GET /api/v1/download/:id - Download converted file
  • POST /api/v1/download/batch - Download multiple files as ZIP
  • GET /api/v1/download/:id/info - Get file metadata
  • GET /health - Health check

See API_SPECIFICATION.md for complete API documentation.
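
A client can follow a conversion by polling the status endpoint listed above. The sketch below shows one way to do that with the standard library. The response shape is an assumption: it presumes a JSON body with a `status` field that eventually becomes `"completed"` or `"failed"`. Check API_SPECIFICATION.md for the real field names and values.

```python
import json
import time
import urllib.request

CONVERSION_URL = "http://localhost:5002"


def fetch_status(job_id, base_url=CONVERSION_URL):
    """GET /api/v1/status/:id and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_status, interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll until the job reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        # Assumed field name and terminal values; adjust to the real API.
        if payload.get("status") in {"completed", "failed"}:
            return payload
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without a running service.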

🧪 Testing

# Frontend tests
cd frontend
npm test

# Backend tests
pytest services/

# E2E tests
pytest tests/test_e2e.py

🐳 Docker Deployment

Development

docker-compose up

Production

docker-compose -f docker-compose.prod.yml up -d

☁️ Cloud Deployment

Frontend (Vercel)

cd frontend
vercel --prod

Backend Services

🗺️ Project Status

Current Version: 0.1.0

Microservices architecture implemented!

The application has been successfully migrated from a monolithic Flask application to a modern microservices architecture with Next.js frontend.

Completed Phases

  • Phase 1: Project restructuring and documentation
  • Phase 2: Frontend development (Next.js 15 + TypeScript + Tailwind v4 + Dark Mode)
  • Phase 3: Backend microservices (Upload, Conversion, Download services)

Next Steps

  • 📋 Phase 4: Storage & Infrastructure improvements (S3, Redis, persistent storage)
  • 📋 Phase 5: Testing & Documentation (E2E tests, API docs, monitoring)
  • 📋 Phase 6: Production deployment (CI/CD, scaling, observability)

Architecture Highlights

  • Frontend: Next.js 15 with modern React patterns and full TypeScript
  • Backend: Three independent Flask microservices with health checks
  • Communication: RESTful APIs with CORS support
  • Storage: Shared file system (ready for S3 migration)
  • Deployment: Docker Compose for local dev, cloud-ready for production

See docs/PHASE_3_COMPLETE.md for detailed implementation notes.

🛠️ Technology Stack

Frontend

  • Framework: Next.js 15 (App Router)
  • Language: TypeScript 5
  • Styling: Tailwind CSS v4
  • Components: shadcn/ui (Radix UI)
  • State: Zustand + React Query
  • Icons: Lucide React

Backend

  • Framework: Flask 3.0
  • Language: Python 3.11
  • PDF Parsing: pdfplumber, Tabula-py
  • Storage: Local filesystem / S3
  • Server: Gunicorn

Infrastructure

  • Containerization: Docker
  • Orchestration: Docker Compose / Kubernetes
  • CI/CD: GitHub Actions
  • Hosting: Vercel (Frontend), Cloud providers (Backend)

📝 Environment Variables

Frontend (.env.local)

NEXT_PUBLIC_UPLOAD_SERVICE_URL=http://localhost:5001
NEXT_PUBLIC_CONVERSION_SERVICE_URL=http://localhost:5002
NEXT_PUBLIC_DOWNLOAD_SERVICE_URL=http://localhost:5003
NEXT_PUBLIC_MAX_FILE_SIZE=52428800
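
The value 52428800 is 50 × 1024 × 1024 bytes, that is, 50 MiB. A server-side check that mirrors this limit could look like the sketch below. The Upload Service's actual validation rules are not documented here, so the function name, the error messages, and the `%PDF-` signature check are assumptions for illustration.

```python
MAX_FILE_SIZE = 50 * 1024 * 1024  # 52428800 bytes, same as NEXT_PUBLIC_MAX_FILE_SIZE
PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def validate_upload(filename: str, data: bytes, max_size: int = MAX_FILE_SIZE) -> list[str]:
    """Return a list of validation errors; an empty list means the file is accepted."""
    errors = []
    if not filename.lower().endswith(".pdf"):
        errors.append("file name must end with .pdf")
    if len(data) > max_size:
        errors.append(f"file exceeds the {max_size}-byte limit")
    if not data.startswith(PDF_MAGIC):
        errors.append("content does not start with the %PDF- signature")
    return errors
```

Returning all errors at once, rather than failing on the first one, lets the UI show every problem with a file in a single message.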

Backend Services

FLASK_ENV=development
STORAGE_BACKEND=local
S3_BUCKET=your-bucket
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
CORS_ORIGINS=http://localhost:3000

🗺️ Roadmap

Completed ✅

  • Initial Flask prototype (v0.0.1)
  • Microservices architecture design
  • Complete documentation suite
  • Next.js 15 frontend with TypeScript
  • Dark mode with theme toggle
  • Upload microservice with validation
  • Conversion microservice with pdfplumber
  • Download microservice with ZIP support
  • Docker Compose development environment
  • Health checks for all services

In Progress 🚧

  • Storage abstraction (S3 support)
  • Redis for job queue management
  • Comprehensive test suite
  • API documentation (OpenAPI/Swagger)

Planned 📋

  • Production deployment (Vercel + AWS/GCP)
  • Real-time WebSocket updates
  • User authentication & authorization
  • File history and management dashboard
  • OCR support for scanned PDFs
  • API rate limiting
  • Monitoring & observability (Datadog/New Relic)
  • CI/CD pipeline automation
  • Horizontal scaling configuration
  • Additional output formats (Excel, JSON)

Made with ❤️ for the open-source community

Download behavior test

python test_download_types.py

Download all test

python test_download_all.py


## 🔧 Configuration

The application uses in-memory storage for uploaded files and temporary directories for converted files. For production use, consider:

- Using a production WSGI server (gunicorn, waitress)
- Implementing file cleanup mechanisms
- Adding authentication/authorization
- Setting up proper logging
- Configuring file size limits

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) before submitting a pull request.

### Development Setup

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and test thoroughly
4. Commit your changes (`git commit -m 'Add amazing feature'`)
5. Push to your branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [pdfplumber](https://github.com/jsvine/pdfplumber) - PDF table extraction
- [Tabula](https://github.com/chezou/tabula-py) - Alternative parser
- [Flask](https://flask.palletsprojects.com/) - Web framework
- [pandas](https://pandas.pydata.org/) - Data manipulation

## 📧 Support

If you encounter any issues or have questions:

- Open an [issue](https://github.com/amin-bake/pdf-to-csv/issues)
- Check existing issues for solutions
- Read the [Contributing Guidelines](CONTRIBUTING.md)

## 🗺️ Roadmap

Potential future enhancements:

- [ ] Support for more output formats (Excel, JSON)
- [ ] Advanced table detection options
- [ ] Batch processing queue for large files
- [ ] Cloud storage integration
- [ ] API authentication
- [ ] Docker containerization
- [ ] Persistent storage option

---

Made with ❤️ by Mohamed Ali
