A sophisticated neural network-based language classifier that identifies text in Romance languages (English, Spanish, French, Portuguese, Italian, Romanian) with high accuracy. Built with PyTorch and deployed as an interactive web application.
- High Accuracy: 96.2% test accuracy across 7 language categories
- Advanced Architecture: Attention-based neural network with residual connections
- Real-time Classification: Instant language prediction with confidence scores
- Beautiful UI: Modern, responsive interface with interactive visualizations
- Comprehensive Metrics: Detailed performance analysis and confusion matrices
- Multi-language Support: Handles 6 Romance languages + English + Unknown category
- Input: N-gram features extracted from text
- Architecture: Multi-head attention layers with residual connections
- Training: AdamW optimizer with learning rate scheduling
- Regularization: Dropout, gradient clipping, and early stopping
- Backend: PyTorch, NumPy, Scikit-learn
- Frontend: Streamlit, Plotly
- Deployment: Streamlit Cloud, Hugging Face Hub
- Monitoring: TensorBoard logging
| Metric | Value |
|---|---|
| Test Accuracy | 96.2% |
| Vocabulary Size | ~50,000 n-grams |
| Training Time | ~30 minutes |
| Inference Speed | <100ms |
- English: 98.1% F1-score
- Spanish: 95.8% F1-score
- French: 96.3% F1-score
- Portuguese: 94.7% F1-score
- Italian: 95.2% F1-score
- Romanian: 93.9% F1-score
- Python 3.8+
- Git
# Clone the repository
git clone https://github.com/evan_thoms/romance-classifier.git
cd romance-classifier
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the application
streamlit run src/streamlit.py# Train the model
python src/train.py
# Monitor training with TensorBoard
tensorboard --logdir runs/romance_classifier/
├── src/
│ ├── model.py # Neural network architectures
│ ├── train.py # Training pipeline
│ ├── streamlit.py # Web application
│ ├── data_loader.py # Data processing
│ ├── feature_extraction.py # N-gram feature extraction
│ ├── eval.py # Evaluation utilities
│ └── predict.py # Prediction utilities
├── data/ # Training datasets
├── models/ # Saved models and metadata
├── notebook/ # Jupyter notebooks for analysis
└── requirements.txt # Python dependencies
- Converts text into n-gram features (4-grams)
- Creates vocabulary from training data
- Normalizes feature vectors
- Projects features to high-dimensional space
- Applies multi-head attention mechanisms
- Uses residual connections for better gradient flow
- Final classification layer outputs language probabilities
- Data Sources: Wikipedia articles, TED Talk translations, common phrases
- Optimization: AdamW with weight decay
- Regularization: Dropout, gradient clipping, early stopping
- Monitoring: TensorBoard logging for metrics
- ✅ Attention-based architecture
- ✅ Residual connections
- ✅ Advanced regularization techniques
- ✅ Comprehensive evaluation metrics
- ✅ Professional UI/UX design
- ✅ Real-time confidence visualization
- 🔄 Transformer-based architecture
- 🔄 Multi-language sentence embeddings
- 🔄 API endpoint for integration
- 🔄 Mobile app version
- 🔄 Real-time learning capabilities
- Content Moderation: Automatically categorize multilingual content
- Language Learning: Identify language of unknown text
- Translation Services: Pre-processing for translation pipelines
- Research: Linguistic analysis and language detection studies
- Education: Language learning applications
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Data Sources: Wikipedia, TED Talks parallel corpus
- Libraries: PyTorch, Streamlit, Hugging Face
- Inspiration: Natural language processing research community
- GitHub: @evan_thoms
- Project Link: https://github.com/evan_thoms/romance-classifier
⭐ Star this repository if you found it helpful!