A tool for paraphrasing research papers and academic documents while preserving their structure, citations, equations, and technical terminology.
- Supports multiple document formats (DOCX, PDF, TXT)
- Preserves document structure and formatting
- Handles special content:
  - Citations and references
  - Mathematical equations
  - Technical terminology
  - Section-specific formatting
- Parallel processing for better performance
- User-friendly web interface
- Progress tracking
- Multiple language model support
- Clone the repository:
  git clone https://github.com/Mr-vero/ParaphraseDoctor
  cd ParaphraseDoctor
- Install the required dependencies:
  pip install numpy tqdm python-docx pypdf transformers torch gradio
- Run the application:
  python main.py
- Access the web interface:
  - The application will start a local server
  - Open your web browser and go to http://localhost:7860
  - A public URL will also be provided for temporary access
- Using the interface:
  - Upload your document (supported formats: .docx, .pdf, .txt)
  - Configure preservation options:
    - Preserve Citations: Keep citation formatting and references
    - Preserve Equations: Maintain mathematical formulas
  - Click "Submit" to start processing
  - Monitor progress through the progress bar
  - Download the paraphrased document when complete
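
The three supported formats could be dispatched on file extension roughly as follows. This is a minimal sketch, not the project's actual reader: the function names (`read_document`, `read_docx`, `read_pdf`, `read_txt`) are hypothetical, and it extracts plain text only, without the structure-preservation logic the tool describes. The `python-docx` and `pypdf` imports are deferred so that `.txt` input works even if those packages are missing.

```python
from pathlib import Path


def read_docx(path: Path) -> str:
    # Imported lazily: only needed when a .docx file is actually read.
    from docx import Document
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def read_pdf(path: Path) -> str:
    # Imported lazily: only needed when a .pdf file is actually read.
    from pypdf import PdfReader
    # Fall back to "" for pages where no text could be extracted.
    return "\n".join((page.extract_text() or "") for page in PdfReader(str(path)).pages)


def read_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8")


# Hypothetical dispatch table keyed by lowercase file extension.
READERS = {".docx": read_docx, ".pdf": read_pdf, ".txt": read_txt}


def read_document(path: str) -> str:
    """Return the plain text of a supported document, or raise ValueError."""
    p = Path(path)
    reader = READERS.get(p.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {p.suffix or '(none)'}")
    return reader(p)
```

Rejecting unknown extensions early gives a clearer error than letting a parser fail partway through the file.
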
- Document Reading
  - Parses different document formats
  - Extracts text while maintaining structure
  - Identifies special content (citations, equations, etc.)
- Content Analysis
  - Identifies different sections (abstract, methodology, etc.)
  - Detects technical terms and special notation
  - Preserves document hierarchy
- Paraphrasing
  - Uses specialized models for different content types:
    - Main content: IndoT5-base-paraphrase
    - Technical sections: MT5-small
  - Processes content in parallel for better performance
  - Maintains context and coherence
- Structure Preservation
  - Keeps original formatting
  - Maintains citations and references
  - Preserves equations and technical terms
  - Retains document layout
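
One common way to keep citations and equations intact through a paraphrasing step is to swap them for placeholders beforehand and restore them afterwards. The sketch below illustrates that idea only; it is not taken from the project's code. The regular expressions, the `<<n>>` placeholder format, and the `mask`/`unmask` names are all assumptions, and a real model may alter placeholders, so the output would still need checking.

```python
import re

# Hypothetical patterns: numeric citations such as [12] or [3, 4],
# author-year citations such as (Smith et al., 2020), and inline
# LaTeX such as $E = mc^2$.
PROTECTED = re.compile(
    r"\[\d+(?:\s*[,-]\s*\d+)*\]"
    r"|\([A-Z][^()]*?,\s*\d{4}[a-z]?\)"
    r"|\$[^$]+\$"
)


def mask(text):
    """Replace protected spans with placeholders; return (masked_text, spans)."""
    spans = []

    def stash(match):
        spans.append(match.group(0))
        return f"<<{len(spans) - 1}>>"

    return PROTECTED.sub(stash, text), spans


def unmask(text, spans):
    """Put the original spans back in place of their placeholders."""
    return re.sub(r"<<(\d+)>>", lambda m: spans[int(m.group(1))], text)


original = "Energy is $E = mc^2$ as shown in [1, 2] (Smith et al., 2020)."
masked, spans = mask(original)
# A stand-in for the model call: only the unprotected wording changes.
paraphrased = masked.replace("as shown in", "as reported in")
restored = unmask(paraphrased, spans)
```

Here `restored` keeps the equation and both citations verbatim while the surrounding wording changes.
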
The tool can be configured through the following parameters in document_paraphraser.py:
self.models = {
    'main': {
        'name': "Wikidepia/IndoT5-base-paraphrase",
        'max_length': 128,
        'batch_size': 4
    },
    'technical': {
        'name': "google/mt5-small",
        'max_length': 64,
        'batch_size': 8
    }
}
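
To illustrate what `batch_size` controls, the sketch below splits a list of sentences into groups sized by the settings above. It is an illustration under assumptions, not the project's implementation: the module-level `MODELS` dict simply mirrors `self.models`, and the `batches` helper is hypothetical.

```python
# Mirrors the self.models settings shown above as a plain dict for illustration.
MODELS = {
    "main": {"name": "Wikidepia/IndoT5-base-paraphrase", "max_length": 128, "batch_size": 4},
    "technical": {"name": "google/mt5-small", "max_length": 64, "batch_size": 8},
}


def batches(sentences, kind="main"):
    """Yield consecutive slices of `sentences` sized by the model's batch_size."""
    size = MODELS[kind]["batch_size"]
    for start in range(0, len(sentences), size):
        yield sentences[start:start + size]


sentences = [f"Sentence {i}." for i in range(10)]
main_sizes = [len(b) for b in batches(sentences, "main")]            # [4, 4, 2]
technical_sizes = [len(b) for b in batches(sentences, "technical")]  # [8, 2]
```

A smaller `batch_size` means more model calls, each handling fewer sentences at once, which is why lowering it is the usual first step when memory is tight.
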
- Python 3.7 or higher
- Minimum 4GB RAM (8GB recommended)
- CUDA-capable GPU (optional, for better performance)
- Internet connection for initial model download
- Supported operating systems:
  - Windows 10/11
  - macOS 10.15 or later
  - Linux (Ubuntu 18.04 or later)
- Processing Speed
  - Large documents may take significant time
  - Performance depends on hardware capabilities
  - GPU recommended for faster processing
- Content Handling
  - Complex mathematical equations might need manual review
  - Some technical terminology may require verification
  - Document formatting might need minor adjustments
- Model Limitations
  - Initial model loading takes time
  - Internet required for first run
  - Memory usage increases with document size
Common issues and solutions:
- Memory Error
  - Reduce batch size in configuration
  - Close other applications
  - Use smaller document chunks
- Slow Processing
  - Enable GPU if available
  - Reduce document size
  - Adjust batch processing parameters
- Format Issues
  - Ensure document is properly formatted
  - Check for supported file types
  - Verify document encoding
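
For the encoding check on `.txt` input, one option is to try a short list of encodings in order and report which one worked. This is a minimal sketch rather than part of the tool: the `read_text_with_fallback` name and the default encoding list are assumptions. Note that `latin-1` accepts any byte sequence, so it acts as a last resort and can produce wrong characters.

```python
def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding.
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

If the reported encoding is not UTF-8, re-saving the file as UTF-8 before uploading it is the safer route.
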
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License. See LICENSE for details.
- HuggingFace Transformers for NLP models
- Gradio for the web interface
- Python-docx for document processing
- PyPDF for PDF handling
For issues and feature requests:
- Check existing issues on GitHub
- Create a new issue with:
  - Clear description
  - Steps to reproduce
  - System information
  - Error messages if any
Planned features:
- Additional language model support
- Enhanced formatting preservation
- Batch processing capabilities
- Custom model training options
- API integration support