A modern, scalable web application that converts PDF files to CSV format using a microservices architecture. It is built with a Next.js frontend and Flask microservices, and features real-time progress tracking, a drag-and-drop interface, and cloud-ready deployment.
## 🏗️ Architecture

This project uses a microservices architecture with the following components:

- **Frontend**: Next.js 15 with TypeScript, Tailwind CSS, and shadcn/ui
- **Upload Service**: Handles file uploads and validation (Port 5001)
- **Conversion Service**: PDF to CSV conversion with pdfplumber/Tabula (Port 5002)
- **Download Service**: Manages file downloads and batch operations (Port 5003)
- **Shared Libraries**: Common utilities, storage abstraction, and type definitions
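
As a rough illustration of the validation step the Upload Service performs, here is a minimal sketch. The function name `validate_pdf_upload`, the specific checks, and the 50 MiB limit (assumed to mirror the `NEXT_PUBLIC_MAX_FILE_SIZE` value used by the frontend) are assumptions for illustration, not the service's actual code.

```python
from typing import Optional

MAX_FILE_SIZE = 52_428_800  # 50 MiB; assumed to mirror NEXT_PUBLIC_MAX_FILE_SIZE
PDF_MAGIC = b"%PDF-"        # well-formed PDF files begin with this header


def validate_pdf_upload(filename: str, data: bytes) -> Optional[str]:
    """Return an error message if the upload is rejected, or None if it passes."""
    if not filename.lower().endswith(".pdf"):
        return "only .pdf files are accepted"
    if len(data) == 0:
        return "file is empty"
    if len(data) > MAX_FILE_SIZE:
        return "file exceeds the size limit"
    if not data.startswith(PDF_MAGIC):
        return "file content is not a PDF"
    return None
```

Checking the file header in addition to the extension catches renamed non-PDF files before they reach the conversion step.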
pdf-to-csv/
├── frontend/ # Next.js frontend application
│ ├── app/ # Next.js App Router
│ ├── components/ # React components
│ ├── hooks/ # Custom React hooks
│ ├── lib/ # Utilities and API client
│ └── store/ # Zustand state management
├── services/
│ ├── upload/ # Upload microservice
│ ├── conversion/ # Conversion microservice (Refactored ✨)
│ │ ├── app.py # Flask API routes only
│ │ ├── worker.py # Background job orchestration
│ │ ├── extractors.py # PDF data extraction layer
│ │ ├── analyzers.py # Table intelligence & structure analysis
│ │ ├── converters.py # Format-specific output generation
│ │ └── ARCHITECTURE.md # Detailed architecture documentation
│ └── download/ # Download microservice
├── shared/ # Shared Python libraries
│ ├── storage.py # Storage backend abstraction
│ ├── types.py # Common type definitions
│ └── utils.py # Utility functions
├── infrastructure/ # Docker & Kubernetes configs
├── docs/ # Comprehensive documentation
│ ├── MIGRATION_PLAN.md
│ ├── ARCHITECTURE.md
│ ├── API_SPECIFICATION.md
│ ├── FRONTEND_COMPONENTS.md
│ ├── DEPLOYMENT_GUIDE.md
│ └── DOCKER_KUBERNETES.md
├── legacy/ # Original Flask monolith (deprecated)
└── docker-compose.yml # Development environment
## 🎯 Recent Updates
### Conversion Service Refactoring (Dec 2025)
The conversion service has been refactored from a single file of 1180+ lines into five focused modules, each with a clearly defined responsibility:
**Key Improvements:**
- ✅ **Modular Architecture**: Separated into 5 focused modules (128 + 178 + 84 + 272 + 350 lines)
- ✅ **Separation of Concerns**: Clear layers for extraction, analysis, conversion, orchestration, and API
- ✅ **Improved Testability**: Each component can be tested independently
- ✅ **Better Maintainability**: Changes isolated to specific modules
- ✅ **Enhanced Scalability**: Easy to add new parsers, formats, or swap implementations
- ✅ **Reduced Coupling**: Clear interfaces between layers with minimal dependencies
**New Module Structure:**
```
conversion/
├── extractors.py (84 lines) - Pure PDF data extraction
├── analyzers.py (272 lines) - Table intelligence & structure detection
├── converters.py (350 lines) - Format-specific output generation
├── worker.py (178 lines) - Background job orchestration
└── app.py (128 lines) - Flask API routes (reduced from 1180+)
```
See [services/conversion/ARCHITECTURE.md](services/conversion/ARCHITECTURE.md) for detailed documentation.
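
To make the layering concrete, here is a minimal, self-contained sketch of how extraction, analysis, and conversion can be chained by an orchestration function. All function names and the in-memory table representation are hypothetical, and the extractor is a stub that does not read a PDF; the real modules are described in the architecture document linked above.

```python
import csv
import io
from typing import List

Table = List[List[str]]


def extract_tables(pages: List[Table]) -> List[Table]:
    """Extraction layer (stub): the real service would read tables from a PDF."""
    return [[list(row) for row in page] for page in pages]


def analyze_table(table: Table) -> Table:
    """Analysis layer: trim cells and drop rows that are entirely empty."""
    cleaned = [[cell.strip() for cell in row] for row in table]
    return [row for row in cleaned if any(row)]


def to_csv(tables: List[Table]) -> str:
    """Conversion layer: merge all tables into one CSV document."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for table in tables:
        writer.writerows(table)
    return buffer.getvalue()


def run_job(pages: List[Table]) -> str:
    """Orchestration layer: wire the three stages together."""
    tables = [analyze_table(t) for t in extract_tables(pages)]
    return to_csv(tables)
```

Because each stage only depends on the plain `Table` type, any one of them can be tested or replaced without touching the others.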
## ✨ Features
- 🏗️ **Microservices Architecture**: Scalable, independently deployable services
- 📤 **Multiple File Upload**: Drag & drop or select multiple PDF files at once
- 📊 **Smart Table Extraction**: Automatically detects and extracts tables from PDFs
- 📑 **Multiple Output Formats**: Convert to CSV, Excel (.xlsx), JSON, or plain text
- 🔄 **Automatic Merging**: Combines all tables from each PDF into a single output file
- 📈 **Real-time Progress**: Visual progress bars with status polling
- ⬇️ **Flexible Downloads**: Download files individually or all at once as a ZIP
- 🎯 **Dual Parser Support**: Choose between pdfplumber (default) or Tabula
- 💻 **Modern UI**: Next.js with Tailwind CSS and shadcn/ui components
- 🎨 **Multiple Color Themes**: Switch between "Earthy Forest" (green), "Cherry Blossom Bloom" (red/pink), and "Pastel Rainbow Fantasy" (dreamy pastels)
- 🌓 **Dark Mode Support**: Full light/dark mode for all color themes
- ⚡ **Background Processing**: Async conversion with React Query
- 🐳 **Docker Ready**: Complete containerization with docker-compose
- ☁️ **Cloud Native**: Deploy frontend to Vercel, backend to any cloud provider
- 📦 **Storage Abstraction**: Local filesystem or S3-compatible storage
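
The storage abstraction can be pictured as a small interface with interchangeable backends. The sketch below is an assumption about its general shape, not the contents of `shared/storage.py`: the `StorageBackend` protocol, its `save`/`load` methods, and `LocalStorage` are illustrative names, and an S3-compatible backend would implement the same two methods.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical interface shared by all storage backends."""

    def save(self, key: str, data: bytes) -> None: ...
    def load(self, key: str) -> bytes: ...


class LocalStorage:
    """Filesystem-backed implementation of the interface above."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Keep only the final path component so keys cannot escape the root.
        return self.root / Path(key).name

    def save(self, key: str, data: bytes) -> None:
        self._path(key).write_bytes(data)

    def load(self, key: str) -> bytes:
        return self._path(key).read_bytes()


def round_trip(storage: StorageBackend, key: str, data: bytes) -> bytes:
    """Code written against the protocol works with any backend."""
    storage.save(key, data)
    return storage.load(key)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(round_trip(LocalStorage(Path(tmp)), "report.csv", b"a,b\n1,2\n"))
```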
## 🚀 Quick Start
### Prerequisites
- **Node.js** 18+ and npm
- **Python** 3.11+
- **(Optional)** Docker & Docker Compose
- **(Optional)** Java Runtime for Tabula parser
### Option 1: Local Development (Recommended)
1. **Clone the repository**

   ```powershell
   git clone https://github.com/amin-bake/pdf-to-csv.git
   cd pdf-to-csv
   ```

2. **Install root dependencies**

   ```powershell
   npm install
   ```

3. **Set up Frontend**

   ```powershell
   cd frontend
   npm install
   cp .env.local.example .env.local
   cd ..
   ```

4. **Set up Backend Services**

   ```powershell
   # Upload Service
   cd services/upload
   python -m venv .venv
   .\.venv\Scripts\Activate.ps1
   pip install -r requirements.txt
   cd ../..

   # Repeat for conversion and download services
   ```

5. **Run All Services**

   ```powershell
   # From root directory
   npm run dev
   ```

   This starts:

   - Frontend: http://localhost:3000
   - Upload Service: http://localhost:5001
   - Conversion Service: http://localhost:5002
   - Download Service: http://localhost:5003

### Option 2: Docker

```powershell
# Build and start all services
docker-compose up --build

# Frontend will be available at http://localhost:3000
```

```powershell
python app.py
```

Then open your browser and navigate to http://localhost:5000
## 📖 Usage

- **Choose Conversion Type**: Select PDF to CSV, PDF to Excel, PDF to JSON, or PDF to Text from the homepage
- **Upload Files**: Drag & drop PDF files or click to browse
- **Select Parser**: Choose between pdfplumber (default) or Tabula
- **Choose Output Options**:
  - Select output format (CSV, Excel, JSON, or Text)
  - Optionally merge all tables into a single file
- **Convert**: Click "Convert" to start processing
- **Monitor Progress**: Watch real-time conversion status
- **Download**: Download individual files or all files as ZIP
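
The "all files as ZIP" download can be sketched with Python's standard `zipfile` module. The function name `build_zip` and the in-memory mapping of file names to bytes are assumptions for illustration; the Download Service may assemble archives differently.

```python
import io
import zipfile
from typing import Dict


def build_zip(files: Dict[str, bytes]) -> bytes:
    """Pack converted files (name -> content) into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Building the archive in memory avoids temporary files, which is reasonable for small batches; large batches would likely need a streamed or on-disk archive instead.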
### Conversion Types

- **PDF to CSV**: Extract tables to comma-separated values format
  - Great for data analysis and spreadsheet import
  - Lightweight and universally compatible
- **PDF to Excel**: Extract tables to Excel spreadsheets (.xlsx)
  - Multiple tables saved as separate sheets when merged
  - Auto-adjusted column widths for better readability
  - Native Excel format with formatting support
- **PDF to JSON**: Extract tables to structured JSON format
  - Tabular data: First row used as object keys (headers)
  - Non-tabular documents: Full text extraction for CVs, resumes, reports
  - Tables preserved with metadata (table number, row/column counts, headers)
  - Automatic header cleaning (removes newlines, ensures uniqueness)
  - Empty rows and columns filtered out
  - Intelligent structure detection distinguishes titles, headers, and data
  - Perfect for web APIs and data interchange
  - Human-readable with proper indentation
- **PDF to Text**: Extract plain text content from PDFs
  - Tabular documents: Tables formatted with aligned columns
  - Text documents: Full text extraction with page separators
  - Clean, readable output format
  - Preserves document structure and formatting
  - Perfect for text analysis, NLP, and content extraction
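
The PDF to JSON rules listed above (first row as keys, header cleaning, empty-row filtering) can be sketched as follows. This is an illustrative approximation: the function names, the `column_N` fallback for blank headers, and the numeric suffix for duplicate headers are assumptions, not the converter's actual behavior.

```python
import json
from typing import Dict, List, Optional

Row = List[Optional[str]]


def clean_headers(raw: Row) -> List[str]:
    """Collapse newlines and extra spaces in header cells and make each header unique."""
    seen: Dict[str, int] = {}
    headers: List[str] = []
    for index, cell in enumerate(raw):
        name = " ".join((cell or "").split()) or f"column_{index + 1}"
        seen[name] = seen.get(name, 0) + 1
        headers.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return headers


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Use the first row as keys and skip rows whose cells are all empty."""
    if not table:
        return []
    headers = clean_headers(table[0])
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            records.append(dict(zip(headers, cells)))
    return records


if __name__ == "__main__":
    demo = [["Item\nName", "Qty", "Qty"], [None, "", ""], ["Widget", "2", "5"]]
    print(json.dumps(table_to_records(demo), indent=2))
```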
### Parser Options

- **pdfplumber** (default): Works well with most PDFs, no additional dependencies
- **Tabula**: Better for complex table layouts, requires Java runtime
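
Because Tabula needs a Java runtime, a service could check for Java before honoring a Tabula request. The sketch below is a hypothetical helper (`resolve_parser` is not part of the project, and the fallback policy is an assumption) that returns pdfplumber when no `java` executable is found on the `PATH`.

```python
import shutil
from typing import Optional

SUPPORTED_PARSERS = ("pdfplumber", "tabula")


def resolve_parser(requested: Optional[str], java_available: Optional[bool] = None) -> str:
    """Pick the parser to use, defaulting to pdfplumber."""
    parser = (requested or "pdfplumber").lower()
    if parser not in SUPPORTED_PARSERS:
        raise ValueError(f"unknown parser: {requested!r}")
    if java_available is None:
        # Tabula depends on a Java runtime, so look for the executable on PATH.
        java_available = shutil.which("java") is not None
    if parser == "tabula" and not java_available:
        return "pdfplumber"
    return parser
```

Passing `java_available` explicitly keeps the helper easy to test; leaving it as `None` performs the real lookup.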
## 📚 Documentation

Comprehensive documentation is available in the /docs directory:
- MIGRATION_PLAN.md: 8-week migration strategy from monolith to microservices
- ARCHITECTURE.md: Detailed technical architecture and design decisions
- API_SPECIFICATION.md: Complete RESTful API documentation
- FRONTEND_COMPONENTS.md: React component specifications
- DEPLOYMENT_GUIDE.md: Deployment instructions for all platforms
- DOCKER_KUBERNETES.md: Container orchestration configurations
## 🔌 API Endpoints

### Upload Service (Port 5001)

- `POST /api/v1/upload` - Upload PDF file
- `GET /health` - Health check

### Conversion Service (Port 5002)

- `POST /api/v1/convert` - Start conversion job
- `GET /api/v1/status/:id` - Check conversion status
- `DELETE /api/v1/convert/:id` - Cancel conversion
- `GET /health` - Health check

### Download Service (Port 5003)

- `GET /api/v1/download/:id` - Download converted file
- `POST /api/v1/download/batch` - Download multiple files as ZIP
- `GET /api/v1/download/:id/info` - Get file metadata
- `GET /health` - Health check

See API_SPECIFICATION.md for complete API documentation.
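
Since progress is reported through status polling, a client typically calls the status endpoint until the job reaches a terminal state. The sketch below shows that loop in isolation. The response shape (a JSON object with a `status` field) and the terminal values `completed` and `failed` are assumptions, so check API_SPECIFICATION.md for the real contract. The HTTP call is injected as `fetch_status` so the loop can be exercised without a running service.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = ("completed", "failed")  # assumed status values


def wait_for_job(fetch_status: Callable[[str], Dict], job_id: str,
                 interval: float = 1.0, max_polls: int = 60) -> Dict:
    """Poll until the job reports a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In a real client, `fetch_status` would issue `GET /api/v1/status/:id` against the Conversion Service and return the decoded JSON body.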
## 🧪 Testing

- Frontend tests: `cd frontend`, then `npm test`
- Backend tests: `pytest services/`
- E2E tests: `pytest tests/test_e2e.py`

## 🚢 Deployment

- Development: `docker-compose up`
- Production: `docker-compose -f docker-compose.prod.yml up -d`
- Frontend on Vercel: `cd frontend`, then `vercel --prod`
- AWS ECS/Fargate: See DEPLOYMENT_GUIDE.md
- Google Cloud Run: See DEPLOYMENT_GUIDE.md
- Kubernetes: See DOCKER_KUBERNETES.md
## 📊 Project Status

✅ Microservices architecture implemented!

The application has been successfully migrated from a monolithic Flask application to a modern microservices architecture with a Next.js frontend.

**Migration Phases:**

- ✅ Phase 1: Project restructuring and documentation
- ✅ Phase 2: Frontend development (Next.js 15 + TypeScript + Tailwind v4 + Dark Mode)
- ✅ Phase 3: Backend microservices (Upload, Conversion, Download services)
- 📋 Phase 4: Storage & Infrastructure improvements (S3, Redis, persistent storage)
- 📋 Phase 5: Testing & Documentation (E2E tests, API docs, monitoring)
- 📋 Phase 6: Production deployment (CI/CD, scaling, observability)

**Current Implementation:**

- Frontend: Next.js 15 with modern React patterns and full TypeScript
- Backend: Three independent Flask microservices with health checks
- Communication: RESTful APIs with CORS support
- Storage: Shared file system (ready for S3 migration)
- Deployment: Docker Compose for local dev, cloud-ready for production

See docs/PHASE_3_COMPLETE.md for detailed implementation notes.
## 🛠️ Tech Stack

**Frontend:**

- Framework: Next.js 15 (App Router)
- Language: TypeScript 5
- Styling: Tailwind CSS v4
- Components: shadcn/ui (Radix UI)
- State: Zustand + React Query
- Icons: Lucide React

**Backend:**

- Framework: Flask 3.0
- Language: Python 3.11
- PDF Parsing: pdfplumber, Tabula-py
- Storage: Local filesystem / S3
- Server: Gunicorn

**Infrastructure:**

- Containerization: Docker
- Orchestration: Docker Compose / Kubernetes
- CI/CD: GitHub Actions
- Hosting: Vercel (Frontend), Cloud providers (Backend)
**Frontend environment** (`.env.local`):

- `NEXT_PUBLIC_UPLOAD_SERVICE_URL=http://localhost:5001`
- `NEXT_PUBLIC_CONVERSION_SERVICE_URL=http://localhost:5002`
- `NEXT_PUBLIC_DOWNLOAD_SERVICE_URL=http://localhost:5003`
- `NEXT_PUBLIC_MAX_FILE_SIZE=52428800` (50 MiB, in bytes)

**Backend environment:**

- `FLASK_ENV=development`
- `STORAGE_BACKEND=local`
- `S3_BUCKET=your-bucket`
- `AWS_ACCESS_KEY_ID=your-key`
- `AWS_SECRET_ACCESS_KEY=your-secret`
- `CORS_ORIGINS=http://localhost:3000`

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Flask
- PDF parsing by pdfplumber
- Alternative parsing with Tabula
- UI components from shadcn/ui
- Icons by Lucide
For questions, issues, or feature requests:
- 🐛 Open an issue
- 💬 Start a discussion
- 📧 Contact: your-email@example.com
- Initial Flask prototype (v0.0.1)
- Microservices architecture design
- Complete documentation suite
- Next.js 15 frontend with TypeScript
- Dark mode with theme toggle
- Upload microservice with validation
- Conversion microservice with pdfplumber
- Download microservice with ZIP support
- Docker Compose development environment
- Health checks for all services
- Storage abstraction (S3 support)
- Redis for job queue management
- Comprehensive test suite
- API documentation (OpenAPI/Swagger)
- Production deployment (Vercel + AWS/GCP)
- Real-time WebSocket updates
- User authentication & authorization
- File history and management dashboard
- OCR support for scanned PDFs
- API rate limiting
- Monitoring & observability (Datadog/New Relic)
- CI/CD pipeline automation
- Horizontal scaling configuration
- Additional output formats (Excel, JSON)
- Comprehensive monitoring
Made with ❤️ for the open-source community
python test_download_types.py
python test_download_all.py
## 🔧 Configuration
The application uses in-memory storage for uploaded files and temporary directories for converted files. For production use, consider:
- Using a production WSGI server (gunicorn, waitress)
- Implementing file cleanup mechanisms
- Adding authentication/authorization
- Setting up proper logging
- Configuring file size limits
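
As one example of a file cleanup mechanism, a periodic task could delete converted files older than a retention window. The sketch below is hypothetical: the function name `remove_stale_files`, the one-hour default, and the flat directory layout are assumptions rather than project behavior.

```python
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(directory: Path, max_age_seconds: float = 3600.0,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files in `directory` last modified before the cutoff.

    Returns the sorted names of the files that were removed.
    """
    cutoff = (time.time() if now is None else now) - max_age_seconds
    removed = []
    for entry in directory.iterdir():
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "fresh.csv").write_text("a,b\n")
        # A freshly written file is newer than the cutoff, so nothing is removed.
        print(remove_stale_files(Path(tmp)))
```

The optional `now` argument exists only to make the cutoff deterministic in tests; a scheduled job would call the function with the defaults.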
## 🤝 Contributing
Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) before submitting a pull request.
### Development Setup
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and test thoroughly
4. Commit your changes (`git commit -m 'Add amazing feature'`)
5. Push to your branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request
## 📝 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [pdfplumber](https://github.com/jsvine/pdfplumber) - PDF table extraction
- [Tabula](https://github.com/chezou/tabula-py) - Alternative parser
- [Flask](https://flask.palletsprojects.com/) - Web framework
- [pandas](https://pandas.pydata.org/) - Data manipulation
## 📧 Support
If you encounter any issues or have questions:
- Open an [issue](https://github.com/YOUR_USERNAME/pdf-to-csv/issues)
- Check existing issues for solutions
- Read the [Contributing Guidelines](CONTRIBUTING.md)
## 🗺️ Roadmap
Potential future enhancements:
- [ ] Support for more output formats (Excel, JSON)
- [ ] Advanced table detection options
- [ ] Batch processing queue for large files
- [ ] Cloud storage integration
- [ ] API authentication
- [ ] Docker containerization
- [ ] Persistent storage option
---
Made with ❤️ by Mohamed Ali
