Reason
Hive agents can currently process only text input, which limits automation for business workflows that involve audio (sales calls, voicemails, recordings, podcasts). Adding speech capabilities allows agents to:
- Accept audio as input via Speech-to-Text (STT)
- Produce audio as output via Text-to-Speech (TTS)
This extends Hive's automation reach to voice-based business processes without requiring manual transcription.
Why Now?
Use Cases (Hive-Specific)
1. Sales Call Follow-up Agent
| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | speech_to_text | Transcribe sales call recording |
| 2 | hubspot_tool | Update contact record with call notes |
| 3 | email_tool | Send personalized follow-up email |
Value: Automates post-call admin work and keeps the CRM up to date.
2. Voicemail to Ticket Agent
| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | speech_to_text | Transcribe customer voicemail |
| 2 | Agent logic | Create support ticket from transcript |
| 3 | email_tool | Send confirmation to customer |
Value: Eliminates manual voicemail listening and speeds up ticket creation.
3. Content Repurposing Agent
| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | speech_to_text | Transcribe podcast/video recording |
| 2 | Agent logic | Generate blog post, social media posts |
| 3 | web_search_tool | Find related content to reference |
Value: Turns one recording into multiple content pieces, saving hours of work.
4. Audio Report Agent
| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | csv_tool | Query and analyze data |
| 2 | Agent logic | Generate summary report |
| 3 | text_to_speech | Convert report to audio file |
Value: Executives can listen to reports during their commute.
Scope
MVP (This PR)
A minimal, focused implementation using cloud-based backends for simplicity:
Speech-to-Text (STT):
| Tool | Backend | Description |
|------|---------|-------------|
| speech_to_text | OpenAI Whisper API | Transcribe audio file (WAV, MP3, M4A) to text |
Text-to-Speech (TTS):
| Tool | Backend | Description |
|------|---------|-------------|
| text_to_speech | gTTS (Google Text-to-Speech) | Convert text to MP3 audio file |
Why this scope:
- Simple implementation (~150-200 lines, similar to email_tool)
- No local model management required
- Quick to review and merge
- Provides core functionality that covers all use cases above
Future Directions (Follow-up PRs)
After the MVP is merged, additional backends can be added:
| Backend | Type | Benefit | Complexity |
|---------|------|---------|------------|
| OpenAI Whisper (local) | STT | Offline, privacy-sensitive, no API costs | Medium: requires model download management |
| Vosk | STT | Lightweight, fully offline, fast | Medium: requires model download management |
| pyttsx3 | TTS | Offline, cross-platform, uses system voices | Low |
| Google Cloud STT/TTS | Both | Enterprise-grade, extensive language support | Low: just API calls |
| ElevenLabs | TTS | High-quality, natural voices | Low: just API calls |
These can be proposed as separate issues once the MVP is established.
Implementation Details
1. Functions
@mcp.tool()
def speech_to_text(
    audio_path: str,
    language: str = "en",
) -> dict:
    """
    Transcribe audio file to text using OpenAI Whisper API.

    Args:
        audio_path: Path to audio file (WAV, MP3, M4A, WEBM)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with transcribed text or error
    """

@mcp.tool()
def text_to_speech(
    text: str,
    output_path: str = None,
    language: str = "en",
) -> dict:
    """
    Convert text to speech audio file using gTTS.

    Args:
        text: Text to convert to speech
        output_path: Where to save the audio file (optional, generates temp file if not provided)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with path to generated audio file or error
    """
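The signatures above leave the function bodies open. As a rough illustration only, the input checks and the temp-file fallback described in the docstrings could look like the sketch below. It uses the standard library only; `validate_audio_path`, `speech_to_text_sketch`, `resolve_output_path`, and the `transcribe_fn` parameter are hypothetical names introduced here, and the error-dict shape is an assumption, not the PR's actual return format.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Formats listed in the speech_to_text docstring.
SUPPORTED_FORMATS = {".wav", ".mp3", ".m4a", ".webm"}


def validate_audio_path(audio_path: str) -> Optional[str]:
    """Return an error message if the path is unusable, else None."""
    if not audio_path or not audio_path.strip():
        return "audio_path must not be empty"
    path = Path(audio_path)
    if path.suffix.lower() not in SUPPORTED_FORMATS:
        return f"Unsupported audio format: {path.suffix or '(none)'}"
    if not path.is_file():
        return f"Audio file not found: {audio_path}"
    return None


def speech_to_text_sketch(
    audio_path: str,
    language: str = "en",
    transcribe_fn: Optional[Callable[[Path, str], str]] = None,
) -> dict:
    """Validate input, then delegate to an injected backend (hypothetical seam)."""
    error = validate_audio_path(audio_path)
    if error:
        return {"error": error}
    if transcribe_fn is None:
        return {"error": "No transcription backend configured"}
    try:
        text = transcribe_fn(Path(audio_path), language)
    except Exception as exc:  # report backend failures as a dict instead of raising
        return {"error": f"Transcription failed: {exc}"}
    return {"text": text, "language": language}


def resolve_output_path(output_path: Optional[str] = None) -> str:
    """Return output_path, or create a temporary .mp3 path when none is given."""
    if output_path:
        return output_path
    handle = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False)
    handle.close()
    return handle.name
```

Passing the backend in as a callable keeps the sketch runnable without the `openai` package. In a real tool, that callable would presumably wrap the OpenAI SDK's `client.audio.transcriptions.create(model="whisper-1", file=..., language=...)` call, and `resolve_output_path` would feed the path handed to gTTS's `save()`.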
2. Credentials
| Variable | Required | Description |
|----------|----------|-------------|
| OPENAI_API_KEY | Yes (for STT) | OpenAI API key for Whisper ([get key](https://platform.openai.com/api-keys)) |
New file: tools/src/aden_tools/credentials/speech.py
SPEECH_CREDENTIALS = {
    "openai_speech": CredentialSpec(
        env_var="OPENAI_API_KEY",
        tools=["speech_to_text"],
        required=True,
        help_url="https://platform.openai.com/api-keys",
        description="OpenAI API key for Whisper speech-to-text",
    ),
}
Note: text_to_speech (gTTS) requires no API key.
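To make the credential mapping concrete, here is a minimal sketch of how a tool's required credentials might be checked against the environment. The `CredentialSpec` dataclass below is a hypothetical stand-in that only mirrors the fields used in the spec above (the real class in `aden_tools` may differ), and `missing_credentials` is an illustrative helper, not an existing API.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class CredentialSpec:
    """Hypothetical stand-in mirroring the fields used in SPEECH_CREDENTIALS."""

    env_var: str
    tools: list = field(default_factory=list)
    required: bool = True
    help_url: str = ""
    description: str = ""


SPEECH_CREDENTIALS = {
    "openai_speech": CredentialSpec(
        env_var="OPENAI_API_KEY",
        tools=["speech_to_text"],
        required=True,
        help_url="https://platform.openai.com/api-keys",
        description="OpenAI API key for Whisper speech-to-text",
    ),
}


def missing_credentials(
    tool_name: str, env: Optional[Mapping[str, str]] = None
) -> list:
    """Return the required specs for a tool whose env var is unset or empty."""
    env = os.environ if env is None else env
    return [
        spec
        for spec in SPEECH_CREDENTIALS.values()
        if tool_name in spec.tools and spec.required and not env.get(spec.env_var)
    ]
```

Under these assumptions, `missing_credentials("text_to_speech", {})` returns an empty list because no spec names that tool, which matches the note that gTTS needs no API key.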
3. Documentation
New file: tools/src/aden_tools/tools/speech_tool/README.md
Contents:
- Tool descriptions and parameters
- Supported audio formats
- Setup instructions
- Usage examples
- Language codes reference
4. Tests
New file: tools/tests/tools/test_speech_tool.py
Test coverage:
- Input validation (empty path, invalid format, file not found)
- Language parameter handling
- Output format validation
- Credential resolution
- Mock API tests (no actual API calls in tests)
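As one possible shape for the mocked API tests listed above, the sketch below swaps the OpenAI client for a `unittest.mock.MagicMock` so no network call is made. `transcribe_with_client` is a hypothetical function under test written for this example, and the use of `unittest` is an assumption; the project's actual test runner and fixtures may differ.

```python
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock


def transcribe_with_client(client, audio_path: str, language: str = "en") -> dict:
    """Hypothetical function under test: calls an injected OpenAI-style client."""
    path = Path(audio_path)
    if not path.is_file():
        return {"error": f"Audio file not found: {audio_path}"}
    with path.open("rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file, language=language
        )
    return {"text": response.text}


class SpeechToTextMockTest(unittest.TestCase):
    def test_missing_file_skips_api_call(self):
        client = MagicMock()
        result = transcribe_with_client(client, "/nonexistent/call.mp3")
        self.assertIn("error", result)
        # Validation should fail before the backend is ever contacted.
        client.audio.transcriptions.create.assert_not_called()

    def test_transcript_comes_from_mocked_response(self):
        client = MagicMock()
        client.audio.transcriptions.create.return_value.text = "hello world"
        with tempfile.TemporaryDirectory() as tmp_dir:
            audio = Path(tmp_dir) / "call.wav"
            audio.write_bytes(b"RIFF")  # placeholder bytes, never decoded
            result = transcribe_with_client(client, str(audio), language="en")
        self.assertEqual(result, {"text": "hello world"})
        _, kwargs = client.audio.transcriptions.create.call_args
        self.assertEqual(kwargs["model"], "whisper-1")
        self.assertEqual(kwargs["language"], "en")
```

Injecting the client keeps the tests independent of `OPENAI_API_KEY` and of the `openai` package being installed.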
File Structure
tools/src/aden_tools/tools/speech_tool/
├── __init__.py
├── speech_tool.py
└── README.md
tools/src/aden_tools/credentials/
└── speech.py
tools/tests/tools/
└── test_speech_tool.py
Modified files:
- tools/pyproject.toml: Add dependencies: openai, gtts
- tools/src/aden_tools/credentials/__init__.py: Register SPEECH_CREDENTIALS
- tools/src/aden_tools/tools/__init__.py: Register speech tool
Related