186 changes: 186 additions & 0 deletions docs/concepts/pipeline-wrapper.md
@@ -176,6 +176,192 @@
)
```

## Streaming from Multiple Components

!!! info "Smart Streaming Behavior"
    By default, Hayhooks streams only the **last** streaming-capable component in your pipeline. This is usually what you want: the final output streams to the user.

    For advanced use cases, you can control which components stream using the `streaming_components` parameter.

When your pipeline contains multiple components that support streaming (e.g., multiple LLMs), you can control which ones stream their outputs as the pipeline executes.

### Default Behavior: Stream Only the Last Component

By default, only the last streaming-capable component will stream:

```python
from typing import Generator, List

from haystack import Pipeline
from hayhooks import BasePipelineWrapper, get_last_user_message, streaming_generator


class MultiLLMWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        from haystack.components.builders import ChatPromptBuilder
        from haystack.components.generators.chat import OpenAIChatGenerator
        from haystack.dataclasses import ChatMessage

        self.pipeline = Pipeline()

        # First LLM - initial answer
        self.pipeline.add_component(
            "prompt_1",
            ChatPromptBuilder(
                template=[
                    ChatMessage.from_system("You are a helpful assistant."),
                    ChatMessage.from_user("{{query}}")
                ]
            )
        )
        self.pipeline.add_component("llm_1", OpenAIChatGenerator(model="gpt-4o-mini"))

        # Second LLM - refines the answer using Jinja2 to access ChatMessage attributes
        self.pipeline.add_component(
            "prompt_2",
            ChatPromptBuilder(
                template=[
                    ChatMessage.from_system("You are a helpful assistant that refines responses."),
                    ChatMessage.from_user(
                        "Previous response: {{previous_response[0].text}}\n\nRefine this."
                    )
                ]
            )
        )
        self.pipeline.add_component("llm_2", OpenAIChatGenerator(model="gpt-4o-mini"))

        # Connect components - LLM 1's replies go directly to prompt_2
        self.pipeline.connect("prompt_1.prompt", "llm_1.messages")
        self.pipeline.connect("llm_1.replies", "prompt_2.previous_response")
        self.pipeline.connect("prompt_2.prompt", "llm_2.messages")

    def run_chat_completion(self, model: str, messages: List[dict], body: dict) -> Generator:
        question = get_last_user_message(messages)

        # By default, only llm_2 (the last streaming component) will stream
        return streaming_generator(
            pipeline=self.pipeline,
            pipeline_run_args={"prompt_1": {"query": question}}
        )
```

**What happens:** Only `llm_2` (the last streaming-capable component) streams its responses token by token. The first LLM (`llm_1`) executes normally without streaming, and only the final refined output streams to the user.

### Advanced: Stream Multiple Components with `streaming_components`

For advanced use cases where you want to see outputs from multiple components, use the `streaming_components` parameter:

```python
def run_chat_completion(self, model: str, messages: List[dict], body: dict) -> Generator:
    question = get_last_user_message(messages)

    # Enable streaming for BOTH LLMs
    return streaming_generator(
        pipeline=self.pipeline,
        pipeline_run_args={"prompt_1": {"query": question}},
        streaming_components=["llm_1", "llm_2"]  # Stream both components
    )
```

**What happens:** Both LLMs stream their responses token by token: first the initial answer from `llm_1`, then the refined answer from `llm_2`.

You can also selectively enable streaming for specific components:

```python
# Stream only the first LLM
streaming_components=["llm_1"]

# Stream only the second LLM (same as default)
streaming_components=["llm_2"]

# Stream ALL capable components (shorthand)
streaming_components="all"

# Stream ALL capable components (specific list)
streaming_components=["llm_1", "llm_2"]
```

### Using the "all" Keyword

The `"all"` keyword is a convenient shorthand to enable streaming for all capable components:

```python
return streaming_generator(
    pipeline=self.pipeline,
    pipeline_run_args={...},
    streaming_components="all"  # Enable all streaming components
)
```

This is equivalent to explicitly enabling every streaming-capable component in your pipeline.

### Global Configuration via Environment Variable

You can set a global default using the `HAYHOOKS_STREAMING_COMPONENTS` environment variable. This applies to all pipelines unless overridden:

```bash
# Stream all components by default
export HAYHOOKS_STREAMING_COMPONENTS="all"

# Stream specific components (comma-separated)
export HAYHOOKS_STREAMING_COMPONENTS="llm_1,llm_2"
```

**Priority order:**

1. Explicit `streaming_components` parameter (highest priority)
2. `HAYHOOKS_STREAMING_COMPONENTS` environment variable
3. Default behavior: stream only last component (lowest priority)
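
For example, an explicit parameter always wins over the environment variable. Here is a minimal sketch, reusing the wrapper from the example above and assuming the server was started with `HAYHOOKS_STREAMING_COMPONENTS="all"`:

```python
def run_chat_completion(self, model: str, messages: List[dict], body: dict) -> Generator:
    question = get_last_user_message(messages)

    # HAYHOOKS_STREAMING_COMPONENTS="all" is set globally, but the explicit
    # parameter takes precedence, so only llm_1 streams for this pipeline.
    return streaming_generator(
        pipeline=self.pipeline,
        pipeline_run_args={"prompt_1": {"query": question}},
        streaming_components=["llm_1"]
    )
```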

!!! tip "When to Use Each Approach"
    - **Default (last component only)**: Best for most use cases; users see only the final output
    - **"all" keyword**: Useful for debugging, demos, or transparent multi-step workflows
    - **List of components**: Enable multiple specific components by name
    - **Environment variable**: For deployment-wide defaults without code changes

!!! note "Async Streaming"
    All `streaming_components` options work identically with `async_streaming_generator()` for async pipelines.
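
A minimal async sketch, assuming the same pipeline as above and that `async_streaming_generator` accepts the same `streaming_components` argument as `streaming_generator`:

```python
from typing import AsyncGenerator, List

from hayhooks import async_streaming_generator, get_last_user_message


async def run_chat_completion_async(self, model: str, messages: List[dict], body: dict) -> AsyncGenerator:
    question = get_last_user_message(messages)

    # Same options as the sync version: a list of component names or "all"
    return async_streaming_generator(
        pipeline=self.pipeline,
        pipeline_run_args={"prompt_1": {"query": question}},
        streaming_components="all"
    )
```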

### YAML Pipeline Streaming Configuration

You can also specify streaming configuration in YAML pipeline definitions:

```yaml
components:
  prompt_1:
    type: haystack.components.builders.PromptBuilder
    init_parameters:
      template: "Answer this question: {{query}}"
  llm_1:
    type: haystack.components.generators.OpenAIGenerator
  prompt_2:
    type: haystack.components.builders.PromptBuilder
    init_parameters:
      template: "Refine this response: {{previous_reply}}"
  llm_2:
    type: haystack.components.generators.OpenAIGenerator

connections:
  - sender: prompt_1.prompt
    receiver: llm_1.prompt
  - sender: llm_1.replies
    receiver: prompt_2.previous_reply
  - sender: prompt_2.prompt
    receiver: llm_2.prompt

inputs:
  query: prompt_1.query

outputs:
  replies: llm_2.replies

# Option 1: List specific components
streaming_components:
  - llm_1
  - llm_2

# Option 2: Stream all components
# streaming_components: all
```

YAML configuration follows the same priority rules: YAML setting > environment variable > default.

See the [Multi-LLM Streaming Example](https://github.com/deepset-ai/hayhooks/tree/main/examples/pipeline_wrappers/multi_llm_streaming) for a complete working implementation.

## File Upload Support

Hayhooks can handle file uploads by adding a `files` parameter:
27 changes: 27 additions & 0 deletions docs/reference/environment-variables.md
@@ -44,6 +44,32 @@
- Default: `false`
- Description: Include tracebacks in error messages (server and MCP)

### HAYHOOKS_STREAMING_COMPONENTS

- Default: `""` (empty string)
- Description: Global configuration for which pipeline components should stream
- Options:
- `""` (empty): Stream only the last capable component (default)
- `"all"`: Stream all streaming-capable components
- Comma-separated list: `"llm_1,llm_2"` to enable specific components

!!! note "Priority Order"
    Pipeline-specific settings (via `streaming_components` parameter or YAML) override this global default.

!!! tip "Component-Specific Control"
    To control exactly which components stream for a given pipeline, use the `streaming_components` parameter in your code or YAML configuration instead of the environment variable.

**Examples:**

```bash
# Stream all components globally
export HAYHOOKS_STREAMING_COMPONENTS="all"

# Stream specific components (comma-separated, spaces are trimmed)
export HAYHOOKS_STREAMING_COMPONENTS="llm_1,llm_2"
export HAYHOOKS_STREAMING_COMPONENTS="llm_1, llm_2, llm_3"
```

## MCP

### HAYHOOKS_MCP_HOST
@@ -154,6 +180,7 @@
HAYHOOKS_USE_HTTPS=false
HAYHOOKS_DISABLE_SSL=false
HAYHOOKS_SHOW_TRACEBACKS=false
HAYHOOKS_STREAMING_COMPONENTS=all
HAYHOOKS_CORS_ALLOW_ORIGINS=["*"]
LOG=INFO
```
1 change: 1 addition & 0 deletions examples/README.md
@@ -6,6 +6,7 @@

| Example | Description | Key Features | Use Case |
|---------|-------------|--------------|----------|
| [multi_llm_streaming](./pipeline_wrappers/multi_llm_streaming/) | Multiple LLM components with configurable streaming | • Two sequential LLMs<br/>• Multi-component streaming via `streaming_components`<br/>• Visual separator between LLM outputs<br/>• OpenAI-compatible streaming interface | Demonstrating how to stream outputs from multiple LLM components in a single pipeline |
| [async_question_answer](./pipeline_wrappers/async_question_answer/) | Async question-answering pipeline with streaming support | • Async pipeline execution<br/>• Streaming responses<br/>• OpenAI Chat Generator<br/>• Both API and chat completion interfaces | Building conversational AI systems that need async processing and real-time streaming responses |
| [chat_with_website](./pipeline_wrappers/chat_with_website/) | Answer questions about website content | • Web content fetching<br/>• HTML to document conversion<br/>• Content-based Q&A<br/>• Configurable URLs | Creating AI assistants that can answer questions about specific websites or web-based documentation |
| [chat_with_website_mcp](./pipeline_wrappers/chat_with_website_mcp/) | MCP-compatible website chat pipeline | • MCP (Model Context Protocol) support<br/>• Website content analysis<br/>• API-only interface<br/>• Simplified deployment | Integrating website analysis capabilities into MCP-compatible AI systems and tools |
107 changes: 107 additions & 0 deletions examples/pipeline_wrappers/multi_llm_streaming/README.md
@@ -0,0 +1,107 @@
# Multi-LLM Streaming Example

This example demonstrates hayhooks' configurable multi-component streaming support.

## Overview

The pipeline contains **two LLM components in sequence**:

1. **LLM 1** (`gpt-5-nano` with `reasoning_effort: low`): Provides a short, concise initial answer to the user's question
2. **LLM 2** (`gpt-5-nano` with `reasoning_effort: medium`): Refines and expands the answer into a detailed, professional response

This example uses `streaming_components` to enable streaming for **both** LLMs. By default, only the last component would stream.

![Multi-LLM Streaming Example](./multi_stream.gif)

## How It Works

### Streaming Configuration

By default, hayhooks streams only the **last** streaming-capable component (in this case, LLM 2). However, this example demonstrates using the `streaming_components` parameter to enable streaming for both components:

```python
streaming_generator(
    pipeline=self.pipeline,
    pipeline_run_args={...},
    streaming_components=["llm_1", "llm_2"]  # or streaming_components="all"
)
```

**Available options:**

- **Default behavior** (no `streaming_components` or `None`): Only the last streaming component streams
- **Stream all components**: `streaming_components=["llm_1", "llm_2"]` (same as `streaming_components="all"`)
- **Stream only first**: `streaming_components=["llm_1"]`
- **Stream only last** (same as default): `streaming_components=["llm_2"]`

### Pipeline Architecture

The pipeline connects LLM 1's replies directly to the second prompt builder. Using Jinja2 template syntax, the second prompt builder can access the `ChatMessage` attributes directly: `{{previous_response[0].text}}`. This approach is simple and doesn't require any custom extraction components.

This example also demonstrates injecting a visual separator (`**[LLM 2 - Refining the response]**`) between the two LLM outputs using `StreamingChunk.component_info` to detect component transitions.
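
The sketch below shows roughly how such a separator can be injected. It is illustrative rather than the exact code of this example, and it assumes that `streaming_generator` yields Haystack `StreamingChunk` objects whose `component_info.name` carries the producing component's pipeline name, and that Hayhooks accepts plain strings yielded alongside chunks:

```python
def run_chat_completion(self, model: str, messages: list[dict], body: dict):
    question = get_last_user_message(messages)
    previous_component = None

    for chunk in streaming_generator(
        pipeline=self.pipeline,
        pipeline_run_args={"prompt_builder_1": {"query": question}},
        streaming_components=["llm_1", "llm_2"],
    ):
        component = chunk.component_info.name if chunk.component_info else None
        # Emit a visual separator when the stream switches from llm_1 to llm_2
        if previous_component == "llm_1" and component == "llm_2":
            yield "\n\n**[LLM 2 - Refining the response]**\n\n"
        previous_component = component
        yield chunk
```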

## Usage

### Deploy with Hayhooks

```bash
# Set your OpenAI API key
export OPENAI_API_KEY=your_api_key_here

# Deploy the pipeline
hayhooks deploy examples/pipeline_wrappers/multi_llm_streaming

# Test it via OpenAI-compatible API
curl -X POST http://localhost:1416/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "multi_llm_streaming",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "stream": true
  }'
```

### Use Directly in Code

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from hayhooks import streaming_generator

# Create your pipeline with multiple streaming components
pipeline = Pipeline()
# ... add LLM 1 and prompt_builder_1 ...

# Add second prompt builder that accesses ChatMessage attributes via Jinja2
pipeline.add_component(
    "prompt_builder_2",
    ChatPromptBuilder(
        template=[
            ChatMessage.from_system("You are a helpful assistant."),
            ChatMessage.from_user("Previous: {{previous_response[0].text}}\n\nRefine this.")
        ]
    )
)
# ... add LLM 2 ...

# Connect: LLM 1 replies directly to prompt_builder_2
pipeline.connect("llm_1.replies", "prompt_builder_2.previous_response")

# Enable streaming for both LLMs (by default, only the last would stream)
for chunk in streaming_generator(
    pipeline=pipeline,
    pipeline_run_args={"prompt_builder_1": {"query": "Your question"}},
    streaming_components=["llm_1", "llm_2"]  # Stream both components
):
    print(chunk.content, end="", flush=True)
```

## Integration with OpenWebUI

This pipeline works seamlessly with OpenWebUI:

1. Configure OpenWebUI to connect to hayhooks (see [OpenWebUI Integration docs](https://deepset-ai.github.io/hayhooks/features/openwebui-integration))
2. Deploy this pipeline
3. Select it as a model in OpenWebUI
4. Watch both LLMs stream their responses in real-time