[Integration]: Speech Tool (STT & TTS) #3568

@Ttian18

Description

Reason

Hive agents currently can only process text input, which limits automation for business workflows that involve audio (sales calls, voicemails, recordings, podcasts). Adding speech capabilities allows agents to:

  • Accept audio as input via Speech-to-Text (STT)
  • Produce audio as output via Text-to-Speech (TTS)

This extends Hive's automation reach to voice-based business processes without requiring manual transcription.

Why Now?

Use Cases (Hive-Specific)

1. Sales Call Follow-up Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe sales call recording |
| 2 | `hubspot_tool` | Update contact record with call notes |
| 3 | `email_tool` | Send personalized follow-up email |

Value: Automates post-call admin work and keeps the CRM up to date.

2. Voicemail to Ticket Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe customer voicemail |
| 2 | Agent logic | Create support ticket from transcript |
| 3 | `email_tool` | Send confirmation to customer |

Value: Eliminates manual voicemail listening and speeds up ticket creation.

3. Content Repurposing Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe podcast/video recording |
| 2 | Agent logic | Generate blog post and social media posts |
| 3 | `web_search_tool` | Find related content to reference |

Value: One recording → multiple content pieces, saves hours of work.

4. Audio Report Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `csv_tool` | Query and analyze data |
| 2 | Agent logic | Generate summary report |
| 3 | `text_to_speech` | Convert report to audio file |

Value: Executives can listen to reports during commute.
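All four workflows share one shape: audio in, transcript, agent logic, downstream tool. A sketch of that chaining with stubbed tools (every function body below is a stand-in, not the real MCP tool):

```python
# Each use case above follows: transcribe -> reason over transcript -> act.
# All three functions are stubs standing in for the real MCP tools.

def speech_to_text(audio_path: str) -> dict:
    return {"text": f"(transcript of {audio_path})"}  # stub for the STT tool

def draft_followup(transcript: str) -> str:
    return f"Follow-up based on: {transcript}"  # stand-in for agent logic

def email_tool(body: str) -> dict:
    return {"sent": True, "body": body}  # stub for email_tool

transcript = speech_to_text("sales_call.mp3")["text"]
receipt = email_tool(draft_followup(transcript))
```

Swapping the middle step (CRM update, ticket creation, content generation) yields each of the four agents above.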

Scope

MVP (This PR)

A minimal, focused implementation using cloud-based backends for simplicity:

Speech-to-Text (STT):

| Tool | Backend | Description |
|------|---------|-------------|
| `speech_to_text` | OpenAI Whisper API | Transcribe audio files (WAV, MP3, M4A) to text |

Text-to-Speech (TTS):

| Tool | Backend | Description |
|------|---------|-------------|
| `text_to_speech` | gTTS (Google Text-to-Speech) | Convert text to an MP3 audio file |

Why this scope:

  • Simple implementation (~150-200 lines, similar to email_tool)
  • No local model management required
  • Quick to review and merge
  • Provides core functionality that covers all use cases above

Future Directions (Follow-up PRs)

After the MVP is merged, additional backends can be added:

| Backend | Type | Benefit | Complexity |
|---------|------|---------|------------|
| OpenAI Whisper (local) | STT | Offline, suits privacy-sensitive workloads, no API costs | Medium: requires model download management |
| Vosk | STT | Lightweight, fully offline, fast | Medium: requires model download management |
| pyttsx3 | TTS | Offline, cross-platform, uses system voices | Low |
| Google Cloud STT/TTS | Both | Enterprise-grade, extensive language support | Low: just API calls |
| ElevenLabs | TTS | High-quality, natural voices | Low: just API calls |

These can be proposed as separate issues once the MVP is established.

Implementation Details

1. Functions

```python
@mcp.tool()
def speech_to_text(
    audio_path: str,
    language: str = "en",
) -> dict:
    """
    Transcribe an audio file to text using the OpenAI Whisper API.

    Args:
        audio_path: Path to audio file (WAV, MP3, M4A, WEBM)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with transcribed text or error
    """


@mcp.tool()
def text_to_speech(
    text: str,
    output_path: str | None = None,
    language: str = "en",
) -> dict:
    """
    Convert text to a speech audio file using gTTS.

    Args:
        text: Text to convert to speech
        output_path: Where to save the audio file (optional; a temp file is generated if not provided)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with path to generated audio file or error
    """
```

2. Credentials

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes (for STT) | OpenAI API key for Whisper |

New file: tools/src/aden_tools/credentials/speech.py

```python
SPEECH_CREDENTIALS = {
    "openai_speech": CredentialSpec(
        env_var="OPENAI_API_KEY",
        tools=["speech_to_text"],
        required=True,
        help_url="https://platform.openai.com/api-keys",
        description="OpenAI API key for Whisper speech-to-text",
    ),
}
```

Note: text_to_speech (gTTS) requires no API key.
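For local development, the only setup the MVP needs is exporting the key (sketch; the key value itself is a placeholder):

```shell
export OPENAI_API_KEY="sk-..."  # consumed by speech_to_text (Whisper API)
# text_to_speech (gTTS) needs no credential.
```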

3. Documentation

New file: tools/src/aden_tools/tools/speech_tool/README.md

Contents:

  • Tool descriptions and parameters
  • Supported audio formats
  • Setup instructions
  • Usage examples
  • Language codes reference

4. Tests

New file: tools/tests/tools/test_speech_tool.py

Test coverage:

  • Input validation (empty path, invalid format, file not found)
  • Language parameter handling
  • Output format validation
  • Credential resolution
  • Mock API tests (no actual API calls in tests)
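The mock-API bullet could follow the standard `unittest.mock` pattern. A self-contained sketch (the `transcribe` helper here is a stand-in mirroring the tool's core call, not the real `speech_tool` module):

```python
from unittest import mock

def transcribe(client, audio_file, language: str = "en") -> dict:
    # Stand-in for the tool's core Whisper call (hypothetical helper).
    result = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, language=language
    )
    return {"text": result.text}

def test_transcribe_makes_no_network_calls():
    # A MagicMock replaces the OpenAI client, so no API key or network is needed.
    fake_client = mock.MagicMock()
    fake_client.audio.transcriptions.create.return_value = mock.Mock(text="hello world")

    out = transcribe(fake_client, audio_file=None)

    fake_client.audio.transcriptions.create.assert_called_once_with(
        model="whisper-1", file=None, language="en"
    )
    assert out == {"text": "hello world"}
```

In the real test file the mock would be injected by patching the OpenAI client where `speech_tool.py` constructs it.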

File Structure

```
tools/src/aden_tools/tools/speech_tool/
├── __init__.py
├── speech_tool.py
└── README.md

tools/src/aden_tools/credentials/
└── speech.py

tools/tests/tools/
└── test_speech_tool.py
```

Modified files:

  • tools/pyproject.toml — Add dependencies: openai, gtts
  • tools/src/aden_tools/credentials/__init__.py — Register SPEECH_CREDENTIALS
  • tools/src/aden_tools/tools/__init__.py — Register speech tool
