[Integration]: Speech Tool (STT & TTS) #3568

@Ttian18

Description

Reason

Hive agents currently can only process text input, which limits automation for business workflows that involve audio (sales calls, voicemails, recordings, podcasts). Adding speech capabilities allows agents to:

  • Accept audio as input via Speech-to-Text (STT)
  • Produce audio as output via Text-to-Speech (TTS)

This extends Hive's automation reach to voice-based business processes without requiring manual transcription.

Why Now?

Use Cases (Hive-Specific)

1. Sales Call Follow-up Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe sales call recording |
| 2 | `hubspot_tool` | Update contact record with call notes |
| 3 | `email_tool` | Send personalized follow-up email |

Value: Automates post-call admin work and keeps the CRM up to date.

2. Voicemail to Ticket Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe customer voicemail |
| 2 | Agent logic | Create support ticket from transcript |
| 3 | `email_tool` | Send confirmation to customer |

Value: Eliminates manual voicemail listening and speeds up ticket creation.

3. Content Repurposing Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `speech_to_text` | Transcribe podcast/video recording |
| 2 | Agent logic | Generate blog post and social media posts |
| 3 | `web_search_tool` | Find related content to reference |

Value: One recording → multiple content pieces, saves hours of work.

4. Audio Report Agent

| Step | Tool Used | Action |
|------|-----------|--------|
| 1 | `csv_tool` | Query and analyze data |
| 2 | Agent logic | Generate summary report |
| 3 | `text_to_speech` | Convert report to audio file |

Value: Executives can listen to reports during commute.
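All four workflows share one shape: audio in, transcript, agent logic, downstream tool. A sketch of that chaining with stubbed tools (every function body below is a stand-in, not the real MCP tool):

```python
# Each use case above follows: transcribe -> reason over transcript -> act.
# All three functions are stubs standing in for the real MCP tools.

def speech_to_text(audio_path: str) -> dict:
    return {"text": f"(transcript of {audio_path})"}  # stub for the STT tool

def draft_followup(transcript: str) -> str:
    return f"Follow-up based on: {transcript}"  # stand-in for agent logic

def email_tool(body: str) -> dict:
    return {"sent": True, "body": body}  # stub for email_tool

transcript = speech_to_text("sales_call.mp3")["text"]
receipt = email_tool(draft_followup(transcript))
```

Swapping the middle step (CRM update, ticket creation, content generation) yields each of the four agents above.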

Scope

MVP (This PR)

A minimal, focused implementation using cloud-based backends for simplicity:

Speech-to-Text (STT):

| Tool | Backend | Description |
|------|---------|-------------|
| `speech_to_text` | OpenAI Whisper API | Transcribe audio files (WAV, MP3, M4A) to text |

Text-to-Speech (TTS):

| Tool | Backend | Description |
|------|---------|-------------|
| `text_to_speech` | gTTS (Google Text-to-Speech) | Convert text to an MP3 audio file |

Why this scope:

  • Simple implementation (~150-200 lines, similar to email_tool)
  • No local model management required
  • Quick to review and merge
  • Provides core functionality that covers all use cases above

Future Directions (Follow-up PRs)

After the MVP is merged, additional backends can be added:

| Backend | Type | Benefit | Complexity |
|---------|------|---------|------------|
| OpenAI Whisper (local) | STT | Offline, suits privacy-sensitive workloads, no API costs | Medium: requires model download management |
| Vosk | STT | Lightweight, fully offline, fast | Medium: requires model download management |
| pyttsx3 | TTS | Offline, cross-platform, uses system voices | Low |
| Google Cloud STT/TTS | Both | Enterprise-grade, extensive language support | Low: just API calls |
| ElevenLabs | TTS | High-quality, natural voices | Low: just API calls |

These can be proposed as separate issues once the MVP is established.

Implementation Details

1. Functions

```python
@mcp.tool()
def speech_to_text(
    audio_path: str,
    language: str = "en",
) -> dict:
    """
    Transcribe an audio file to text using the OpenAI Whisper API.

    Args:
        audio_path: Path to audio file (WAV, MP3, M4A, WEBM)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with transcribed text or error
    """


@mcp.tool()
def text_to_speech(
    text: str,
    output_path: str | None = None,
    language: str = "en",
) -> dict:
    """
    Convert text to a speech audio file using gTTS.

    Args:
        text: Text to convert to speech
        output_path: Where to save the audio file (optional; a temp file is generated if not provided)
        language: Language code (e.g., "en", "es", "fr")

    Returns:
        Dict with path to generated audio file or error
    """
```

2. Credentials

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes (for STT) | OpenAI API key for Whisper |

New file: tools/src/aden_tools/credentials/speech.py

```python
SPEECH_CREDENTIALS = {
    "openai_speech": CredentialSpec(
        env_var="OPENAI_API_KEY",
        tools=["speech_to_text"],
        required=True,
        help_url="https://platform.openai.com/api-keys",
        description="OpenAI API key for Whisper speech-to-text",
    ),
}
```

Note: text_to_speech (gTTS) requires no API key.
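For local development, the only setup the MVP needs is exporting the key (sketch; the key value itself is a placeholder):

```shell
export OPENAI_API_KEY="sk-..."  # consumed by speech_to_text (Whisper API)
# text_to_speech (gTTS) needs no credential.
```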

3. Documentation

New file: tools/src/aden_tools/tools/speech_tool/README.md

Contents:

  • Tool descriptions and parameters
  • Supported audio formats
  • Setup instructions
  • Usage examples
  • Language codes reference

4. Tests

New file: tools/tests/tools/test_speech_tool.py

Test coverage:

  • Input validation (empty path, invalid format, file not found)
  • Language parameter handling
  • Output format validation
  • Credential resolution
  • Mock API tests (no actual API calls in tests)
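The mock-API bullet could follow the standard `unittest.mock` pattern. A self-contained sketch (the `transcribe` helper here is a stand-in mirroring the tool's core call, not the real `speech_tool` module):

```python
from unittest import mock

def transcribe(client, audio_file, language: str = "en") -> dict:
    # Stand-in for the tool's core Whisper call (hypothetical helper).
    result = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, language=language
    )
    return {"text": result.text}

def test_transcribe_makes_no_network_calls():
    # A MagicMock replaces the OpenAI client, so no API key or network is needed.
    fake_client = mock.MagicMock()
    fake_client.audio.transcriptions.create.return_value = mock.Mock(text="hello world")

    out = transcribe(fake_client, audio_file=None)

    fake_client.audio.transcriptions.create.assert_called_once_with(
        model="whisper-1", file=None, language="en"
    )
    assert out == {"text": "hello world"}
```

In the real test file the mock would be injected by patching the OpenAI client where `speech_tool.py` constructs it.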

File Structure

```
tools/src/aden_tools/tools/speech_tool/
├── __init__.py
├── speech_tool.py
└── README.md

tools/src/aden_tools/credentials/
└── speech.py

tools/tests/tools/
└── test_speech_tool.py
```

Modified files:

  • tools/pyproject.toml — Add dependencies: openai, gtts
  • tools/src/aden_tools/credentials/__init__.py — Register SPEECH_CREDENTIALS
  • tools/src/aden_tools/tools/__init__.py — Register speech tool
