# PDF to CSV Converter

A modern, scalable web application that converts PDF files to CSV format using a microservices architecture. It is built with a Next.js frontend and Flask microservices, and features real-time progress tracking, a drag-and-drop interface, and cloud-ready deployment.

## 🏗️ Architecture

This project uses a microservices architecture with the following components:

- **Frontend**: Next.js 15 with TypeScript, Tailwind CSS, and shadcn/ui
- **Upload Service**: Handles file uploads and validation (Port 5001)
- **Conversion Service**: PDF to CSV conversion with pdfplumber/Tabula (Port 5002)
- **Download Service**: Manages file downloads and batch operations (Port 5003)
- **Shared Libraries**: Common utilities, storage abstraction, and type definitions

### Directory Structure

```
pdf-to-csv/
├── frontend/              # Next.js frontend application
│   ├── app/              # Next.js App Router
│   ├── components/       # React components
│   ├── hooks/            # Custom React hooks
│   ├── lib/              # Utilities and API client
│   └── store/            # Zustand state management
├── services/
│   ├── upload/           # Upload microservice
│   ├── conversion/       # Conversion microservice (Refactored ✨)
│   │   ├── app.py           # Flask API routes only
│   │   ├── worker.py        # Background job orchestration
│   │   ├── extractors.py    # PDF data extraction layer
│   │   ├── analyzers.py     # Table intelligence & structure analysis
│   │   ├── converters.py    # Format-specific output generation
│   │   └── ARCHITECTURE.md  # Detailed architecture documentation
│   └── download/         # Download microservice
├── shared/               # Shared Python libraries
│   ├── storage.py        # Storage backend abstraction
│   ├── types.py          # Common type definitions
│   └── utils.py          # Utility functions
├── infrastructure/       # Docker & Kubernetes configs
├── docs/                 # Comprehensive documentation
│   ├── MIGRATION_PLAN.md
│   ├── ARCHITECTURE.md
│   ├── API_SPECIFICATION.md
│   ├── FRONTEND_COMPONENTS.md
│   ├── DEPLOYMENT_GUIDE.md
│   └── DOCKER_KUBERNETES.md
├── legacy/              # Original Flask monolith (deprecated)
└── docker-compose.yml   # Development environment
```

## 🎯 Recent Updates

### Conversion Service Refactoring (Dec 2025)

The conversion service has been refactored from a monolithic 1180-line file into a modular architecture with one module per responsibility:

**Key Improvements:**
- ✅ **Modular Architecture**: Separated into 5 focused modules (128 + 178 + 84 + 272 + 350 lines)
- ✅ **Separation of Concerns**: Clear layers for extraction, analysis, conversion, orchestration, and API
- ✅ **Improved Testability**: Each component can be tested independently
- ✅ **Better Maintainability**: Changes isolated to specific modules
- ✅ **Enhanced Scalability**: Easy to add new parsers, formats, or swap implementations
- ✅ **Reduced Coupling**: Clear interfaces between layers with minimal dependencies

**New Module Structure:**
```
conversion/
├── extractors.py    (84 lines)  - Pure PDF data extraction
├── analyzers.py     (272 lines) - Table intelligence & structure detection
├── converters.py    (350 lines) - Format-specific output generation
├── worker.py        (178 lines) - Background job orchestration
└── app.py           (128 lines) - Flask API routes (reduced from 1180+)
```

See [services/conversion/ARCHITECTURE.md](services/conversion/ARCHITECTURE.md) for detailed documentation.

## ✨ Features

- 🏗️ **Microservices Architecture**: Scalable, independently deployable services
- 📤 **Multiple File Upload**: Drag & drop or select multiple PDF files at once
- 📊 **Smart Table Extraction**: Automatically detects and extracts tables from PDFs
- 📑 **Multiple Output Formats**: Convert to CSV, Excel (.xlsx), JSON, or plain text formats
- 🔄 **Automatic Merging**: Combines all tables from each PDF into a single output file
- 📈 **Real-time Progress**: Visual progress bars with status polling
- ⬇️ **Flexible Downloads**: Download files individually or all at once as a ZIP
- 🎯 **Dual Parser Support**: Choose between pdfplumber (default) or Tabula
- 💻 **Modern UI**: Next.js with Tailwind CSS and shadcn/ui components
- 🎨 **Multiple Color Themes**: Switch between "Earthy Forest" (green), "Cherry Blossom Bloom" (red/pink), and "Pastel Rainbow Fantasy" (dreamy pastels)
- 🌓 **Dark Mode Support**: Full light/dark mode for all color themes
- ⚡ **Background Processing**: Async conversion with React Query
- 🐳 **Docker Ready**: Complete containerization with docker-compose
- ☁️ **Cloud Native**: Deploy frontend to Vercel, backend to any cloud provider
- 📦 **Storage Abstraction**: Local filesystem or S3-compatible storage
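
The "download all at once as a ZIP" feature can be illustrated with the Python standard library. This is a minimal sketch, not the download service's actual code: `build_zip` is a hypothetical helper, and building the archive in memory is an assumption made to keep the example short.

```python
import io
import zipfile


def build_zip(files):
    """Pack a mapping of archive name -> bytes into an in-memory ZIP archive.

    Hypothetical helper for illustration; the real service may stream from storage.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    # The archive is finalized when the context manager exits.
    return buffer.getvalue()
```

A Flask route could return these bytes with an `application/zip` content type; for large batches, writing to a temporary file would likely be a better fit than holding everything in memory.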

## 🚀 Quick Start

### Prerequisites

- **Node.js** 18+ and npm
- **Python** 3.11+
- **(Optional)** Docker & Docker Compose
- **(Optional)** Java Runtime for Tabula parser

### Option 1: Local Development (Recommended)

1. **Clone the repository**

   ```powershell
   git clone https://github.com/amin-bake/pdf-to-csv.git
   cd pdf-to-csv
   ```

2. **Install root dependencies**

   ```
   npm install
   ```

3. **Set up Frontend**

   ```powershell
   cd frontend
   npm install
   cp .env.local.example .env.local
   cd ..
   ```

4. **Set up Backend Services**

   ```powershell
   # Upload Service
   cd services/upload
   python -m venv .venv
   .\.venv\Scripts\Activate.ps1
   pip install -r requirements.txt
   cd ../..

   # Repeat for conversion and download services
   ```

5. **Run All Services**

   ```
   # From root directory
   npm run dev
   ```

This starts:
- Frontend: http://localhost:3000
- Upload Service: http://localhost:5001
- Conversion Service: http://localhost:5002
- Download Service: http://localhost:5003

### Option 2: Docker Compose

```
# Build and start all services
docker-compose up --build

# Frontend will be available at http://localhost:3000
```

### Running the Application (Legacy)

```
python app.py
```

Then open your browser and navigate to http://localhost:5000

## 📖 Usage

1. **Choose Conversion Type**: Select PDF to CSV, PDF to Excel, PDF to JSON, or PDF to Text from the homepage
2. **Upload Files**: Drag & drop PDF files or click to browse
3. **Select Parser**: Choose between pdfplumber (default) or Tabula
4. **Choose Output Options**:
   - Select output format (CSV, Excel, JSON, or Text)
   - Optionally merge all tables into a single file
5. **Convert**: Click "Convert" to start processing
6. **Monitor Progress**: Watch real-time conversion status
7. **Download**: Download individual files or all files as ZIP

### Conversion Formats

- **PDF to CSV**: Extract tables to comma-separated values format
  - Great for data analysis and spreadsheet import
  - Lightweight and universally compatible
- **PDF to Excel**: Extract tables to Excel spreadsheets (.xlsx)
  - Multiple tables saved as separate sheets when merged
  - Auto-adjusted column widths for better readability
  - Native Excel format with formatting support
- **PDF to JSON**: Extract tables to structured JSON format
  - Tabular data: First row used as object keys (headers)
  - Non-tabular documents: Full text extraction for CVs, resumes, reports
  - Tables preserved with metadata (table number, row/column counts, headers)
  - Automatic header cleaning (removes newlines, ensures uniqueness)
  - Empty rows and columns filtered out
  - Intelligent structure detection distinguishes titles, headers, and data
  - Perfect for web APIs and data interchange
  - Human-readable with proper indentation
- **PDF to Text**: Extract plain text content from PDFs
  - Tabular documents: Tables formatted with aligned columns
  - Text documents: Full text extraction with page separators
  - Clean, readable output format
  - Preserves document structure and formatting
  - Perfect for text analysis, NLP, and content extraction
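
The header handling listed under PDF to JSON (first row as keys, newlines removed, unique names, empty rows dropped) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `converters.py`: the function names, the `column_N` fallback for blank headers, and the `_2`, `_3` suffix scheme are all hypothetical, and empty-column filtering is left out for brevity.

```python
import json


def clean_headers(raw_headers):
    """Collapse whitespace/newlines in header cells and make every name unique."""
    used = set()
    headers = []
    for index, cell in enumerate(raw_headers):
        text = "" if cell is None else str(cell)
        # " ".join(text.split()) replaces newlines and runs of spaces with one space.
        base = " ".join(text.split()) or f"column_{index + 1}"
        name, suffix = base, 1
        while name in used:  # ASSUMPTION: duplicates get a numeric suffix
            suffix += 1
            name = f"{base}_{suffix}"
        used.add(name)
        headers.append(name)
    return headers


def rows_to_records(table):
    """Turn a table (list of rows) into dicts keyed by the cleaned first row."""
    # Drop rows in which every cell is None or blank.
    rows = [
        row for row in table
        if any(str(cell).strip() for cell in row if cell is not None)
    ]
    if not rows:
        return []
    headers = clean_headers(rows[0])
    return [dict(zip(headers, row)) for row in rows[1:]]


if __name__ == "__main__":
    sample = [["Name\nFull", "Amount", "Amount"], [None, "", None], ["Ada", "10", "12"]]
    print(json.dumps(rows_to_records(sample), indent=2))
```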

### Parser Options

- **pdfplumber** (default): Works well with most PDFs, no additional dependencies
- **Tabula**: Better for complex table layouts, requires Java runtime

## 📚 Documentation

Comprehensive documentation is available in the `/docs` directory:

- **MIGRATION_PLAN.md**: 8-week migration strategy from monolith to microservices
- **ARCHITECTURE.md**: Detailed technical architecture and design decisions
- **API_SPECIFICATION.md**: Complete RESTful API documentation
- **FRONTEND_COMPONENTS.md**: React component specifications
- **DEPLOYMENT_GUIDE.md**: Deployment instructions for all platforms
- **DOCKER_KUBERNETES.md**: Container orchestration configurations

## 🔌 API Endpoints

### Upload Service (Port 5001)

- `POST /api/v1/upload` - Upload PDF file
- `GET /health` - Health check

### Conversion Service (Port 5002)

- `POST /api/v1/convert` - Start conversion job
- `GET /api/v1/status/:id` - Check conversion status
- `DELETE /api/v1/convert/:id` - Cancel conversion
- `GET /health` - Health check

### Download Service (Port 5003)

- `GET /api/v1/download/:id` - Download converted file
- `POST /api/v1/download/batch` - Download multiple files as ZIP
- `GET /api/v1/download/:id/info` - Get file metadata
- `GET /health` - Health check

See API_SPECIFICATION.md for complete API documentation.
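
As a rough illustration of how a client might poll the status endpoint listed above, here is a standard-library sketch. The URL path and port come from this README; everything about the response body is an assumption (a JSON object with a `status` field whose terminal values are `completed` or `failed`), so check API_SPECIFICATION.md for the real schema. The helper names are hypothetical.

```python
import json
import time
import urllib.request

CONVERSION_URL = "http://localhost:5002"  # default local port from this README


def status_url(job_id, base_url=CONVERSION_URL):
    """Build the status endpoint URL for a conversion job."""
    return f"{base_url.rstrip('/')}/api/v1/status/{job_id}"


def fetch_status(job_id):
    """GET the job status and decode the JSON response body."""
    with urllib.request.urlopen(status_url(job_id), timeout=10) as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_status, interval=1.0, max_polls=60,
                 terminal=("completed", "failed")):
    """Poll until the job reports a terminal status or max_polls is reached."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        # ASSUMPTION: the response carries a "status" string field.
        if payload.get("status") in terminal:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the polling loop testable without a running service, for example `wait_for_job("abc", fetch=lambda _id: {"status": "completed"})`.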

## 🧪 Testing

```
# Frontend tests
cd frontend
npm test

# Backend tests
pytest services/

# E2E tests
pytest tests/test_e2e.py
```

## 🐳 Docker Deployment

### Development

```
docker-compose up
```

### Production

```
docker-compose -f docker-compose.prod.yml up -d
```

## ☁️ Cloud Deployment

### Frontend (Vercel)

```
cd frontend
vercel --prod
```

### Backend Services

- **AWS ECS/Fargate**: See DEPLOYMENT_GUIDE.md
- **Google Cloud Run**: See DEPLOYMENT_GUIDE.md
- **Kubernetes**: See DOCKER_KUBERNETES.md

## 🗺️ Project Status

**Current Version**: 0.1.0

✅ Microservices architecture implemented!

The application has been migrated from a monolithic Flask application to a microservices architecture with a Next.js frontend.

### Completed Phases

- ✅ Phase 1: Project restructuring and documentation
- ✅ Phase 2: Frontend development (Next.js 15 + TypeScript + Tailwind v4 + Dark Mode)
- ✅ Phase 3: Backend microservices (Upload, Conversion, Download services)

### Next Steps

- 📋 Phase 4: Storage & Infrastructure improvements (S3, Redis, persistent storage)
- 📋 Phase 5: Testing & Documentation (E2E tests, API docs, monitoring)
- 📋 Phase 6: Production deployment (CI/CD, scaling, observability)

### Architecture Highlights

- **Frontend**: Next.js 15 with modern React patterns and full TypeScript
- **Backend**: Three independent Flask microservices with health checks
- **Communication**: RESTful APIs with CORS support
- **Storage**: Shared file system (ready for S3 migration)
- **Deployment**: Docker Compose for local dev, cloud-ready for production

See docs/PHASE_3_COMPLETE.md for detailed implementation notes.

## 🛠️ Technology Stack

### Frontend

- **Framework**: Next.js 15 (App Router)
- **Language**: TypeScript 5
- **Styling**: Tailwind CSS v4
- **Components**: shadcn/ui (Radix UI)
- **State**: Zustand + React Query
- **Icons**: Lucide React

### Backend

- **Framework**: Flask 3.0
- **Language**: Python 3.11
- **PDF Parsing**: pdfplumber, Tabula-py
- **Storage**: Local filesystem / S3
- **Server**: Gunicorn

### Infrastructure

- **Containerization**: Docker
- **Orchestration**: Docker Compose / Kubernetes
- **CI/CD**: GitHub Actions
- **Hosting**: Vercel (Frontend), Cloud providers (Backend)

## 📝 Environment Variables

### Frontend (.env.local)

```
NEXT_PUBLIC_UPLOAD_SERVICE_URL=http://localhost:5001
NEXT_PUBLIC_CONVERSION_SERVICE_URL=http://localhost:5002
NEXT_PUBLIC_DOWNLOAD_SERVICE_URL=http://localhost:5003
NEXT_PUBLIC_MAX_FILE_SIZE=52428800
```

### Backend Services

```
FLASK_ENV=development
STORAGE_BACKEND=local
S3_BUCKET=your-bucket
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
CORS_ORIGINS=http://localhost:3000
```
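
One way a backend service might read these variables at startup is sketched below. It is an illustration only, not the project's configuration code: `ServiceConfig` and `load_config` are hypothetical names, the defaults are assumptions, `s3` as the non-local `STORAGE_BACKEND` value is an assumption, and so is treating `CORS_ORIGINS` as a comma-separated list.

```python
import os
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    flask_env: str
    storage_backend: str
    s3_bucket: Optional[str]
    cors_origins: Tuple[str, ...]


def load_config(env=None):
    """Build a ServiceConfig from a mapping of environment variables."""
    env = os.environ if env is None else env
    backend = env.get("STORAGE_BACKEND", "local")  # ASSUMPTION: default is local
    bucket = env.get("S3_BUCKET")
    # ASSUMPTION: "s3" selects the S3 backend and requires a bucket name.
    if backend == "s3" and not bucket:
        raise ValueError("S3_BUCKET must be set when STORAGE_BACKEND=s3")
    # ASSUMPTION: multiple origins are separated by commas.
    origins = tuple(
        origin.strip()
        for origin in env.get("CORS_ORIGINS", "").split(",")
        if origin.strip()
    )
    return ServiceConfig(env.get("FLASK_ENV", "production"), backend, bucket, origins)
```

Failing fast on a missing bucket surfaces misconfiguration at startup rather than on the first upload.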

## 🙏 Acknowledgments

- Built with Flask
- PDF parsing by pdfplumber
- Alternative parsing with Tabula
- UI components from shadcn/ui
- Icons by Lucide

## 🗺️ Roadmap

### Completed ✅

### In Progress 🚧

- Storage abstraction (S3 support)
- Redis for job queue management
- Comprehensive test suite
- API documentation (OpenAPI/Swagger)

### Planned 📋

Made with ❤️ for the open-source community

### Download behavior test

```
python test_download_types.py
```

### Download all test

```
python test_download_all.py
```


## 🔧 Configuration

The application uses in-memory storage for uploaded files and temporary directories for converted files. For production use, consider:

- Using a production WSGI server (gunicorn, waitress)
- Implementing file cleanup mechanisms
- Adding authentication/authorization
- Setting up proper logging
- Configuring file size limits
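
As an example of the file cleanup item above, the following sketch deletes converted files once they pass a maximum age. It is a minimal illustration, not part of the project: `remove_stale_files` is a hypothetical helper, and using file modification time as the age signal is an assumption.

```python
import time
from pathlib import Path


def remove_stale_files(directory, max_age_seconds, now=None):
    """Delete regular files older than max_age_seconds; return their names, sorted."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        # Age is measured from the last modification time (an assumption).
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

A scheduled job or background thread could call this periodically, for example `remove_stale_files("/tmp/converted", 3600)` to drop files older than one hour; the directory path here is only a placeholder.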

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) before submitting a pull request.

### Development Setup

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and test thoroughly
4. Commit your changes (`git commit -m 'Add amazing feature'`)
5. Push to your branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [pdfplumber](https://github.com/jsvine/pdfplumber) - PDF table extraction
- [Tabula](https://github.com/tabulapedia/tabula-py) - Alternative parser
- [Flask](https://flask.palletsprojects.com/) - Web framework
- [pandas](https://pandas.pydata.org/) - Data manipulation

## 📧 Support

If you encounter any issues or have questions:

- Open an [issue](https://github.com/amin-bake/pdf-to-csv/issues)
- Check existing issues for solutions
- Read the [Contributing Guidelines](CONTRIBUTING.md)

## 🗺️ Roadmap

Potential future enhancements:

- [ ] Support for more output formats (Excel, JSON)
- [ ] Advanced table detection options
- [ ] Batch processing queue for large files
- [ ] Cloud storage integration
- [ ] API authentication
- [ ] Docker containerization
- [ ] Persistent storage option

---

Made with ❤️ by Mohamed Ali
