Meet PDF Linguist – your ultimate tool for extracting, merging, and translating PDF content! 🌟 Whether it’s digital PDFs or scanned ones, this versatile app makes it a breeze to transform and translate documents while keeping the original formatting intact. Perfect for tackling all your document translation and manipulation needs! 📄✨
- Translation for Digital PDFs 🔍
- Translation for Scanned PDFs 🔐
All these features are wrapped up in a single app, giving you complete control over your PDF workflows. 🏠
For digital PDFs:
- 📄 Extract Text and Images: Quickly extract searchable text and embedded images.
- 💬 Translate Content: Convert text into your desired language effortlessly.
- 🔠 Render Complex Scripts: Well suited to Indic scripts such as Devanagari; the HarfBuzz engine handles intricate ligatures and Unicode combinations.
- 🌈 Merge & Preserve Layout: Combines translated content while keeping the original layout intact.
- 🔖 Multiple Output Formats: Save your work as PDF and Word (DOCX).
- 🔄 Multi-page Handling: Process multi-page documents seamlessly.
For scanned PDFs:
- 🔍 OCR-Based Extraction: Extract text from scanned images using Tesseract OCR.
- 🕒 Customizable Cropping: Select specific dimensions to exclude unwanted sections.
- 💬 Translate Content: High-quality translations for regional languages.
- 🎨 Merge & Preserve Layout: Seamlessly combine translated text with the scanned PDF’s layout.
- 🔖 Multiple Output Formats: Export translations as PDF and Word (DOCX).
- Python installed on your system.
- Recommended: Virtual environment setup.
- Clone the repository:
  git clone https://github.com/NitinReddy-A/PDF_Linguist.git
- Navigate to the project directory:
  cd PDF_Linguist
- Set up a virtual environment (optional):
  python -m venv virtual-env
  source virtual-env/bin/activate   # Linux/Mac
  virtual-env\Scripts\activate      # Windows
- Install dependencies:
  pip install -r requirements.txt
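PyTesseract and PDF2Image are Python wrappers around external programs (the Tesseract OCR executable and Poppler's `pdftoppm`), so `pip install` alone may not be enough for the scanned-PDF mode. The following is a minimal sketch of a pre-flight check, assuming the default executable names; `find_missing_tools` and `REQUIRED_TOOLS` are hypothetical names, not part of the repository.

```python
import shutil

# External programs the OCR path is assumed to need (usual default names;
# adjust if your installation uses different ones).
REQUIRED_TOOLS = {
    "tesseract": "Tesseract OCR engine, used by PyTesseract",
    "pdftoppm": "Poppler utility, used by PDF2Image",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        for name in missing:
            print(f"Missing: {name} ({REQUIRED_TOOLS[name]})")
    else:
        print("All external tools found.")
```

Running the script before launching the app gives a quick hint about what still needs to be installed through your system's package manager.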
Place the PDF file you want to process in the documents/ folder.
Launch the app using Streamlit:
streamlit run app.py
- Digital PDF: For PDFs with searchable text.
- Scanned PDF: For image-based PDFs needing OCR.
For a digital PDF:
- Upload your PDF using the file uploader.
- Choose the target language for translation.
- Click the button to extract, translate, and reassemble the PDF.
- Download your translated file in PDF or DOCX format.
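The extraction step for a digital PDF could look roughly like the sketch below. It uses PyMuPDF's `page.get_text("blocks")` to collect each text block together with its bounding box, which is the information needed to place translated text back at the same position. This is an illustration under assumptions, not the app's actual code: `text_blocks` and `extract_text_blocks` are hypothetical helper names.

```python
def text_blocks(raw_blocks):
    """Keep only non-empty text blocks from PyMuPDF "blocks" output.

    Each raw block is a tuple
    (x0, y0, x1, y1, text, block_no, block_type);
    block_type 0 is text, 1 is an image.
    """
    result = []
    for x0, y0, x1, y1, text, _block_no, block_type in raw_blocks:
        if block_type == 0 and text.strip():
            result.append({"bbox": (x0, y0, x1, y1), "text": text.strip()})
    return result


def extract_text_blocks(pdf_path):
    """Return one list of text blocks per page of the PDF."""
    import fitz  # PyMuPDF; imported here so text_blocks() works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append(text_blocks(page.get_text("blocks")))
    return pages
```

Keeping the bounding box next to each string means the translation step only has to swap the `text` value, while the layout information stays untouched.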
For a scanned PDF:
- Upload your scanned PDF.
- Define crop dimensions (optional).
- Select the target translation language.
- Process the file to extract text via OCR, translate, and merge the layout.
- Download your translated file in PDF or DOCX format.
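The crop-then-OCR part of the scanned workflow can be sketched as follows. The sketch assumes the crop dimensions are margins in pixels to cut from each edge of the page image; the app may define them differently. `crop_box` and `ocr_pdf` are hypothetical helper names, while `convert_from_path`, `Image.crop`, and `image_to_string` are the regular PDF2Image, Pillow, and PyTesseract calls.

```python
def crop_box(width, height, left=0, top=0, right=0, bottom=0):
    """Turn margins (pixels to cut from each edge) into a Pillow crop box."""
    box = (left, top, width - right, height - bottom)
    if min(left, top, right, bottom) < 0 or box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("invalid crop margins")
    return box


def ocr_pdf(pdf_path, lang="eng", dpi=300, **margins):
    """OCR each page of a scanned PDF after cropping; one string per page."""
    # Imported lazily so crop_box() can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    texts = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        box = crop_box(image.width, image.height, **margins)
        texts.append(pytesseract.image_to_string(image.crop(box), lang=lang))
    return texts
```

For example, `ocr_pdf("documents/scan.pdf", lang="hin", top=120)` would skip a 120-pixel header on every page, assuming a file with that (made-up) name exists and the Hindi language data for Tesseract is installed.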
- You’ll need API keys for ConvertAPI and LightPDF API.
- Update the script with your API keys and specify the destination language code.
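As an alternative to writing keys directly into the script, they can be read from environment variables so they stay out of version control. The sketch below is one possible approach, not how the project does it; the variable names `CONVERTAPI_SECRET`, `LIGHTPDF_API_KEY`, and `DEST_LANG`, the default language code, and the `load_settings` helper are all hypothetical.

```python
import os


def load_settings(env=os.environ):
    """Read API keys and the destination language from environment variables."""
    names = ("CONVERTAPI_SECRET", "LIGHTPDF_API_KEY")  # hypothetical names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "convertapi_secret": env["CONVERTAPI_SECRET"],
        "lightpdf_api_key": env["LIGHTPDF_API_KEY"],
        "dest_lang": env.get("DEST_LANG", "hi"),  # default is an assumption
    }
```

Failing early with the list of missing variables is usually easier to debug than an authentication error returned by one of the services halfway through a conversion.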
- Streamlit: For the user-friendly web app.
- PyMuPDF (fitz): Handles digital PDFs and converts pages to images.
- Pillow (PIL): For image processing like cropping and resizing.
- PyTesseract: Extracts text from scanned PDFs using OCR.
- Googletrans: For language translation.
- FPDF: Generates PDF files from translations.
- PDF2Image: Converts PDF pages to images for OCR.
- LightPDF API: Enables OCR-based conversion of scanned PDFs.
- ConvertAPI: Converts between PDF and other formats seamlessly.
- Translated PDFs and DOCX files are saved in the documents/ folder.
- Both formats preserve the layout and formatting of the original document.
PDF Linguist simplifies document management with:
- High-accuracy text and image extraction.
- Smooth translations in multiple languages.
- Seamless merging while preserving original layouts.
💡 The dual-mode functionality ensures you’re ready for any type of PDF – digital or scanned. Manage and translate documents like a pro! 🔹✨