A simple FastAPI server that provides an OpenAI Whisper API-compatible endpoint backed by NVIDIA's Parakeet-TDT model for speech recognition + Pyannote for speaker diarization.
- Complete drop-in replacement for OpenAI's Whisper API
- Uses NVIDIA's Parakeet-TDT 0.6B V2 model for high-quality transcription
- Supports all Whisper API response formats (json, text, srt, vtt, verbose_json)
- Supports word-level and segment-level timestamps
- Optional speaker diarization using Pyannote.audio
- FastAPI-based server with automatic OpenAPI documentation
- NVIDIA GPU with CUDA support (recommended)
- Python 3.8 or higher
- HuggingFace account and access token (required for speaker diarization)
To install and run the server:

- Clone this repository:

  ```bash
  git clone https://github.com/jfgonsalves/parakeet-diarized
  cd parakeet-diarized
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up speaker diarization (optional):

  - Create a free account at HuggingFace
  - Generate an access token at HuggingFace Settings
  - Accept the user agreement for the Pyannote speaker diarization model

- Run the server:

  With speaker diarization:

  ```bash
  ./run.sh --hf-token "your_token_here"
  ```

  Without speaker diarization:

  ```bash
  ./run.sh
  ```

  Other options:

  ```bash
  ./run.sh --help    # See all available options
  ./run.sh --port 8080 --debug --hf-token "your_token_here"
  ```
The API mimics the OpenAI Whisper API interface:
POST /v1/audio/transcriptions
Parameters:
- `file`: The audio file to transcribe (multipart/form-data)
- `model`: Model to use (defaults to "whisper-1", but Parakeet is used regardless)
- `language`: Language of the audio (optional)
- `response_format`: Format of the response (defaults to "json"; options: json, text, srt, vtt, verbose_json)
- `timestamps`: Whether to include timestamps (defaults to false)
- `timestamp_granularities`: Timestamp detail level (accepts "segment")
- `temperature`: Temperature for sampling (defaults to 0.0)
- `vad_filter`: Voice activity detection filter (defaults to false)
- `prompt`: Optional prompt to guide the transcription (accepted for compatibility but ignored)
- `diarize`: Enable speaker diarization (defaults to true; requires a HuggingFace token)
- `include_diarization_in_text`: Include speaker labels in the transcript text (defaults to true)
Example with curl:
```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file=@/path/to/your/audio.wav \
  -F model=whisper-1 \
  -F timestamps=true \
  -F diarize=true
```

GET /health
Returns the health status of the API and the loaded model.
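For example, a quick check from Python (a minimal sketch using the requests package; the exact response schema is whatever the server returns):

```python
import requests

# Minimal liveness check against a locally running server.
resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print(resp.json())  # health status of the API and loaded model
```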
This API is designed to be a drop-in replacement for the OpenAI Whisper API:
- Supports all Whisper API response formats (json, text, srt, vtt, verbose_json)
- Accepts all major Whisper API parameters for compatibility
- Returns responses in the same format as the OpenAI Whisper API
- Provides a `/v1/models` endpoint for application compatibility
Minor differences:
- The `model` parameter is accepted but ignored; Parakeet-TDT is always used
- Some advanced Whisper-specific parameters might have no effect
- Performance characteristics may differ from OpenAI's implementation
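Because the interface matches, existing OpenAI client code can usually be pointed at this server by changing only the base URL. A minimal sketch using the official openai Python package (the api_key value is a placeholder; it is assumed the server does not validate it):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",  # placeholder; assumed to be ignored by this server
)

with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # accepted for compatibility; Parakeet-TDT is used regardless
        file=audio_file,
    )

print(transcript.text)
```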
The API supports multiple response formats:
json:

```json
{
  "text": "Full transcription text goes here"
}
```

verbose_json:

```json
{
  "text": "Full transcription text goes here",
  "task": "transcribe",
  "language": "en",
  "duration": 10.5,
  "model": "parakeet-tdt-0.6b-v2",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Segment text",
      "tokens": [50364, 2425, 286, 257],
      "temperature": 0.0,
      "avg_logprob": -0.5,
      "compression_ratio": 1.0,
      "no_speech_prob": 0.1
    },
    {
      "id": 1,
      "start": 2.5,
      "end": 5.0,
      "text": "Another segment",
      "tokens": [50364, 5816, 2121],
      "temperature": 0.0,
      "avg_logprob": -0.6,
      "compression_ratio": 1.0,
      "no_speech_prob": 0.05
    }
  ]
}
```

text:

```
Full transcription text goes here
```

srt:

```
1
00:00:00,000 --> 00:00:02,500
Segment text

2
00:00:02,500 --> 00:00:05,000
Another segment
```

vtt:

```
WEBVTT

00:00:00.000 --> 00:00:02.500
Segment text

00:00:02.500 --> 00:00:05.000
Another segment
```
The `segments` field is included when the `timestamps` parameter is set to true or when using the `verbose_json` format.
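As an illustration, here is a small Python sketch that requests verbose_json with timestamps and walks the `segments` field shown above (using the requests package; the file name and URL are placeholders):

```python
import requests

# Request a transcription with segment-level timestamps in verbose_json format.
with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"response_format": "verbose_json", "timestamps": "true"},
    )
resp.raise_for_status()

# Print each segment with its start/end times, per the schema above.
for seg in resp.json()["segments"]:
    print(f"[{seg['start']:6.2f}s - {seg['end']:6.2f}s] {seg['text']}")
```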
The API includes speaker diarization capabilities using Pyannote.audio:
For speaker diarization to work, you need:
- HuggingFace Account: Create a free account at huggingface.co
- Access Token: Generate a token at HuggingFace Settings
- Model Agreement: Accept the user agreement for pyannote/speaker-diarization-3.1
- Environment Variable: Set `HUGGINGFACE_ACCESS_TOKEN` to your token

When enabled, diarization provides:
- Automatic speaker detection and labeling
- Integration with transcription segments
- Optional speaker labels in transcript text
- Support for multiple speakers per audio file
Enable diarization by setting `diarize=true` in your API request:

```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file=@/path/to/your/audio.wav \
  -F diarize=true \
  -F include_diarization_in_text=true
```

When `include_diarization_in_text=true`, the transcript will include speaker labels:

```
Speaker 1: Hello, how are you today?
Speaker 2: I'm doing well, thank you for asking.
```
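The same request from Python, for reference (a sketch using the requests package; the file name is a placeholder):

```python
import requests

# Request a diarized transcript with speaker labels embedded in the text.
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"diarize": "true", "include_diarization_in_text": "true"},
    )
resp.raise_for_status()
print(resp.json()["text"])  # e.g. "Speaker 1: Hello, how are you today? ..."
```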
Use the `run.sh` script to configure and start the server:

```bash
./run.sh --help
# Options:
#   --debug            Enable debug mode
#   --port PORT        Set server port (default: 8000)
#   --host HOST        Set server host (default: 0.0.0.0)
#   --skip-deps-check  Skip dependency checking
#   --hf-token TOKEN   Set HuggingFace access token for speaker diarization
#   --help             Show help message
```

Environment variables (for settings not available as command-line arguments):

- `ENABLE_DIARIZATION`: Enable/disable diarization globally (default: true)
- `INCLUDE_DIARIZATION_IN_TEXT`: Include speaker labels in text by default (default: true)
- `MODEL_ID`: Parakeet model to use (default: nvidia/parakeet-tdt-0.6b-v2)
- `TEMPERATURE`: Sampling temperature (default: 0.0)
- `CHUNK_DURATION`: Audio chunk duration in seconds (default: 500)
- `TEMP_DIR`: Temporary directory for audio processing (default: /tmp/parakeet)
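For reference, settings like these are typically read once at startup, roughly as in the sketch below (illustrative only; the project's actual configuration code may differ):

```python
import os

# Illustrative defaults matching the list above; not the project's actual code.
ENABLE_DIARIZATION = os.environ.get("ENABLE_DIARIZATION", "true").lower() == "true"
INCLUDE_DIARIZATION_IN_TEXT = os.environ.get("INCLUDE_DIARIZATION_IN_TEXT", "true").lower() == "true"
MODEL_ID = os.environ.get("MODEL_ID", "nvidia/parakeet-tdt-0.6b-v2")
TEMPERATURE = float(os.environ.get("TEMPERATURE", "0.0"))
CHUNK_DURATION = float(os.environ.get("CHUNK_DURATION", "500"))
TEMP_DIR = os.environ.get("TEMP_DIR", "/tmp/parakeet")
```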
The NVIDIA Parakeet-TDT model offers:
- Fast transcription (top model on the HF Open ASR leaderboard)
- Support for punctuation and capitalization
- High accuracy with word error rates as low as 1.69% on LibriSpeech test-clean
Pyannote.audio speaker diarization adds:
- Automatic speaker identification using state-of-the-art models
- Real-time speaker change detection
- Support for an unlimited number of speakers
This project builds upon excellent work by:
- NVIDIA NeMo Team: For the outstanding Parakeet-TDT model that provides state-of-the-art speech recognition
- Pyannote Team: For the powerful Pyannote.audio speaker diarization toolkit
This project is released under the MIT License. Note, however, that the Parakeet-TDT model is governed by the CC-BY-4.0 license.