A natural language processing pipeline that combines speech recognition, entity analysis, and text summarization. It was developed as part of a Text Analytics course and focuses on processing and analyzing audio content with pretrained ML models.
This project implements an end-to-end pipeline for processing audio content, with three main components:
- Speech Recognition: Uses Facebook's Wav2Vec2 model to convert speech to text
- Entity Analysis: Employs spaCy for named entity recognition and analysis
- Text Summarization: Leverages FLAN-T5 for generating multi-level summaries
The pipeline was initially developed and tested using the LibriSpeech dataset, achieving high accuracy in transcription and meaningful entity extraction and summarization results.
├── app.py # Streamlit web interface
├── transcribe_audio.py # Speech recognition module
├── entity_analyzer.py # Named entity recognition module
├── text_summarizer.py # Text summarization module
├── text_restorer.py # Text preprocessing module
├── process_transcripts.py # Transcript processing utilities
├── data_processor.py # Data handling utilities
├── requirements.txt # Python dependencies
└── environment.yml # Conda environment specification
- Implements AudioTranscriber class using Wav2Vec2
- Supports multiple audio formats
- Automatic device selection (CUDA/MPS/CPU)
- Includes resampling and audio preprocessing
- Outputs results in CSV format
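The automatic device selection listed above can be sketched as follows. This is a minimal illustration, not the project's actual AudioTranscriber code: the helper names `pick_device` and `detect_device` are hypothetical, and the CUDA, then MPS, then CPU priority order is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name, preferring CUDA, then Apple MPS, then CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available accelerators; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the priority logic in a pure function makes it easy to test without any accelerator present; the returned string can be passed to `model.to(...)` in PyTorch.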
- EntityAnalyzer class powered by spaCy
- Extracts and categorizes named entities
- Provides detailed entity statistics:
  - Entity type distribution
  - Unique entity counts
  - Overall entity statistics
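The entity statistics listed above could be computed along these lines. This is a sketch under stated assumptions, not the EntityAnalyzer implementation: the function name `summarize_entities` is hypothetical, entities are assumed to arrive as `(text, label)` pairs (for example built from spaCy's `ent.text` and `ent.label_`), and "unique" is assumed to mean case-insensitive de-duplication per label.

```python
from collections import Counter


def summarize_entities(entities):
    """Compute simple statistics from a list of (text, label) entity pairs."""
    # How many mentions of each entity type were found.
    type_distribution = Counter(label for _, label in entities)
    # Assumption: two mentions are the same entity if text matches ignoring case.
    unique = {(text.lower(), label) for text, label in entities}
    unique_per_type = Counter(label for _, label in unique)
    return {
        "total_entities": len(entities),
        "unique_entities": len(unique),
        "type_distribution": dict(type_distribution),
        "unique_per_type": dict(unique_per_type),
    }


# Hand-written sample pairs standing in for real spaCy output.
sample = [
    ("London", "GPE"),
    ("london", "GPE"),
    ("Mr. Darcy", "PERSON"),
    ("Tuesday", "DATE"),
]
stats = summarize_entities(sample)
print(stats)
```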
- Uses FLAN-T5 for advanced summarization
- Generates multiple summary levels:
  - Short summaries (30-50 words)
  - Long summaries (50-150 words)
- Adaptive length based on input text
- Batch processing support for DataFrames
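One way to make summary length adapt to the input is to scale the target word range down for short texts. The sketch below only illustrates that idea: the function name `summary_word_limits` and the "at most half the input" rule are assumptions, not the project's documented behavior. Only the 30-50 and 50-150 word ranges come from the list above.

```python
def summary_word_limits(text: str, level: str = "short") -> tuple[int, int]:
    """Return (min_words, max_words) for a summary, scaled down for short inputs."""
    ranges = {"short": (30, 50), "long": (50, 150)}  # ranges quoted in the list above
    if level not in ranges:
        raise ValueError(f"unknown summary level: {level!r}")
    min_words, max_words = ranges[level]
    n_words = len(text.split())
    # Assumed rule: never ask for a summary longer than half the input.
    cap = max(1, n_words // 2)
    max_words = min(max_words, cap)
    # Keep the lower bound consistent with the (possibly reduced) upper bound.
    min_words = min(min_words, max_words)
    return min_words, max_words


long_text = "word " * 1000
short_text = "word " * 200
print(summary_word_limits(long_text, "long"))
print(summary_word_limits(short_text, "long"))
```

Note that these are word counts; generation APIs such as the Transformers `generate` method take token-based length limits, so a conversion step would still be needed.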
- Interactive Streamlit dashboard
- Real-time audio processing
- Visualization of entity distribution
- Downloadable results in multiple formats
- Support for various audio input formats
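Offering results in several download formats mostly comes down to serializing the same records more than once. The sketch below assumes CSV and JSON as the formats and uses hypothetical helper names and field names (`to_csv_bytes`, `to_json_bytes`, `file`, `transcript`, `short_summary`); the formats and schema app.py actually uses are not stated here.

```python
import csv
import io
import json


def to_csv_bytes(rows: list[dict]) -> bytes:
    """Serialize a list of flat dicts to UTF-8 CSV; columns follow the first row's keys."""
    if not rows:
        return b""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def to_json_bytes(rows: list[dict]) -> bytes:
    """Serialize the same records to pretty-printed UTF-8 JSON."""
    return json.dumps(rows, ensure_ascii=False, indent=2).encode("utf-8")


# Hypothetical result record; the real pipeline's fields may differ.
results = [
    {"file": "sample.wav", "transcript": "hello world", "short_summary": "greeting"}
]
csv_payload = to_csv_bytes(results)
json_payload = to_json_bytes(results)
```

In a Streamlit app, byte payloads like these can be handed to `st.download_button` through its `data` argument.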
- Set up the environment (choose one):
Using conda:
conda env create -f environment.yml
conda activate speech-analysis
Using pip:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Launch the web interface:
streamlit run app.py
- Transcribe audio:
python transcribe_audio.py path/to/audio.wav --output results.csv
- Process entities and generate summaries:
from entity_analyzer import EntityAnalyzer
from text_summarizer import TextSummarizer
# Initialize components
analyzer = EntityAnalyzer()
summarizer = TextSummarizer()
# Process text (placeholder string; substitute a real transcript)
text = "Transcribed speech goes here."
entities = analyzer.extract_entities(text)
short_summary, long_summary = summarizer.generate_summaries(text)
- Python 3.10+
- 4GB+ RAM
- GPU Support:
  - CUDA-compatible GPU (optional)
  - Apple M1/M2 chip (MPS support)
- Storage:
  - ~5GB for models and dependencies
  - Additional space for audio processing
- torch==2.7.0
- transformers==4.30.2
- spacy==3.8.7
- streamlit==1.24.0
- torchaudio (for audio processing)
- pandas (for data handling)
- plotly (for visualizations)
- First run downloads required models
- Processing time depends on:
  - Audio file length
  - Selected device (GPU/CPU)
  - Chosen summarization length
- GPU acceleration recommended for batch processing
- Integration with real-time audio streaming
- Support for additional languages
- Custom model fine-tuning options
- Enhanced entity visualization
- Batch processing optimization
Yunze Wei, Lanfeng Zheng, Keyu Shen, Bo Zhao, Kaiyuan Deng