Agentic retrieval is an LLM-guided multi-round retrieval method that significantly improves retrieval quality for complex queries through intelligent judgment and query optimization.
- ✅ Intelligent Judgment: the LLM automatically determines whether retrieval results are sufficient
- ✅ Multi-Round Retrieval: automatically performs a second round of retrieval when results are insufficient
- ✅ Multi-Query Strategy: generates 2-3 complementary queries to improve recall
- ✅ Automatic Fallback: falls back to Lightweight retrieval on failure
- ✅ Complete Metadata: returns detailed information about the retrieval process
Run `chat_with_memory.py` and select the retrieval mode:

```shell
uv run python src/bootstrap.py demo/chat_with_memory.py
```

Then select the 4th option: Agentic Retrieval - LLM-guided multi-round retrieval (experimental).
```python
from agentic_layer.memory_manager import MemoryManager
from memory_layer.llm.llm_provider import LLMProvider
from agentic_layer.agentic_utils import AgenticConfig

# Initialize the LLM provider
llm = LLMProvider(
    provider_type="openai",
    model="gpt-4",
    api_key="your_api_key",
    base_url="https://api.openai.com/v1",
    temperature=0.0,
)

# Initialize the memory manager
memory_manager = MemoryManager()

# Execute agentic retrieval
result = await memory_manager.retrieve_agentic(
    query="What foods does the user like?",
    group_id="food_lovers_group",
    llm_provider=llm,
    top_k=20,
)

# Inspect the results
print(f"Retrieved {result['count']} memories")
print(f"Is sufficient: {result['metadata']['is_sufficient']}")
if result['metadata']['is_multi_round']:
    print(f"Refined queries: {result['metadata']['refined_queries']}")
```
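The sufficiency judgment that populates `is_sufficient`, `reasoning`, `missing_info`, and `refined_queries` can be sketched roughly as below. This is an illustration only: the prompt text and the `parse_judgment` helper are assumptions, not the project's actual implementation (which lives in `agentic_utils.py`).

```python
import json

# Hypothetical prompt asking the LLM for a structured verdict on whether
# the round-1 results answer the query. Wording is illustrative.
JUDGMENT_PROMPT = """Given the query and the retrieved memories, decide whether
they are sufficient to answer. Reply with JSON:
{"is_sufficient": bool, "reasoning": str, "missing_info": [str], "refined_queries": [str]}
"""

def parse_judgment(llm_output: str) -> dict:
    """Parse the LLM's JSON verdict, defaulting to 'sufficient' on bad output."""
    try:
        verdict = json.loads(llm_output)
        return {
            "is_sufficient": bool(verdict.get("is_sufficient", True)),
            "reasoning": verdict.get("reasoning", ""),
            "missing_info": verdict.get("missing_info", []),
            "refined_queries": verdict.get("refined_queries", []),
        }
    except json.JSONDecodeError:
        # Malformed output: treat round 1 as sufficient rather than loop.
        return {"is_sufficient": True, "reasoning": "unparseable verdict",
                "missing_info": [], "refined_queries": []}
```

Defaulting to "sufficient" on a malformed verdict is one way to guarantee the pipeline terminates instead of retrying indefinitely.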
To tune the pipeline, create a custom `AgenticConfig`:

```python
from agentic_layer.agentic_utils import AgenticConfig

# Create a custom configuration
config = AgenticConfig(
    # Round 1 configuration
    round1_emb_top_n=50,        # Embedding candidates
    round1_bm25_top_n=50,       # BM25 candidates
    round1_top_n=20,            # Top N after RRF fusion
    round1_rerank_top_n=5,      # Top N after rerank, passed to LLM judgment
    # LLM configuration
    llm_temperature=0.0,        # Low temperature for stable judgments
    llm_max_tokens=500,
    # Round 2 configuration
    enable_multi_query=True,    # Enable multi-query generation
    num_queries=3,              # Expected number of queries
    round2_per_query_top_n=50,  # Recall per query
    # Fusion configuration
    combined_total=40,          # Total after merging
    final_top_n=20,             # Final top N
    # Rerank configuration
    use_reranker=True,
    reranker_instruction="Rank based on relevance between query and memory",
)

# Use the custom configuration
result = await memory_manager.retrieve_agentic(
    query="What foods does the user like?",
    group_id="food_lovers_group",
    llm_provider=llm,
    agentic_config=config,
)
```

The returned result has this shape:

```python
{
    "memories": [
        {
            "event_id": "...",
            "user_id": "...",
            "group_id": "...",
            "timestamp": "2024-01-15T10:30:00",
            "episode": "User said he loves Sichuan cuisine, especially Mapo Tofu",
            "summary": "User's cuisine preferences",
            "subject": "Eating habits",
            "score": 0.95
        },
        # ... more memories
    ],
    "count": 20,
    "metadata": {
        # Basic information
        "retrieval_mode": "agentic",
        "is_multi_round": True,  # Whether multi-round retrieval was performed
        # Round 1 statistics
        "round1_count": 20,
        "round1_reranked_count": 5,
        "round1_latency_ms": 800,
        # LLM judgment
        "is_sufficient": False,
        "reasoning": "Missing user's specific cuisine preferences and taste information",
        "missing_info": ["Cuisine preferences", "Taste habits", "Dietary restrictions"],
        # Round 2 statistics (only present for multi-round retrieval)
        "refined_queries": [
            "What is the user's favorite cuisine?",
            "What flavors does the user like?",
            "What dietary restrictions does the user have?"
        ],
        "query_strategy": "Break down the original query into multiple specific sub-questions",
        "num_queries": 3,
        "round2_count": 40,
        "round2_latency_ms": 600,
        "multi_query_total_docs": 120,
        # Final statistics
        "final_count": 20,
        "total_latency_ms": 3500
    }
}
```

The overall retrieval flow:

```text
User Query
    ↓
Round 1: Hybrid Search (Embedding + BM25 + RRF)
    ↓
RRF Fusion → Top 20
    ↓
Rerank → Top 5
    ↓
LLM Judges Sufficiency
    ↓
├─ Sufficient → Return Round 1's Top 20 ✅
│
└─ Insufficient → LLM generates multi-queries (2-3)
        ↓
    Round 2: Parallel retrieval for all queries
        ↓
    Multi-query RRF fusion
        ↓
    Deduplicate + merge to 40
        ↓
    Rerank → Top 20 ✅
```
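The RRF fusion step named in the flow above can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion using the conventional `k = 60` constant, not the project's actual implementation:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked doc-id lists into one, scoring each doc 1/(k + rank).

    Docs ranked highly by multiple retrievers (e.g. embedding AND BM25)
    accumulate score from every list and rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it wins the fusion:
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
# → ["b", "c", "a", "d"]
```

The same function covers the multi-query fusion in Round 2: each refined query contributes one ranking, and deduplication falls out of accumulating scores per doc id.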
| Metric | Single Round (Sufficient) | Multi-Round (Insufficient) |
|---|---|---|
| Latency | 2-5s | 5-10s |
| LLM Calls | 1 | 2 |
| Token Usage | ~500 | ~1500 |
| API Cost | ~$0.001 | ~$0.003 |
Estimated values, based on GPT-4 pricing.
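The cost column follows from simple token arithmetic; the blended per-1K-token rate below is an assumption chosen to reproduce the table, not an official price:

```python
# Back-of-envelope check of the cost figures in the table above.
BLENDED_RATE_PER_1K = 0.002  # assumed blended input/output $/1K tokens

def estimate_cost(tokens: int, rate_per_1k: float = BLENDED_RATE_PER_1K) -> float:
    """Approximate API cost in dollars for a given token count."""
    return tokens / 1000 * rate_per_1k

# ~500 tokens  -> ~$0.001 (single round)
# ~1500 tokens -> ~$0.003 (multi-round)
```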
Agentic retrieval is a good fit for:

- Complex Queries: the answer requires information from multiple perspectives
  - ❌ "What does the user like to eat?" (too broad)
  - ✅ "What is the user's favorite Sichuan dish and taste preferences?"
- Scattered Information: related memories are distributed across different time points
- High Quality Requirements: scenarios requiring high recall and precision

It is a poor fit for:

- Simple Queries: questions that can be answered directly
  - "What day is it today?"
  - "What is the user's name?"
- Latency Sensitive: scenarios requiring < 1 second response
- Cost Sensitive: scenarios that cannot afford LLM API costs
Agentic retrieval automatically falls back to Lightweight retrieval in the following cases:
- ❌ LLM API call failure
- ❌ Timeout (default 60 seconds)
- ❌ `llm_provider` not provided
- ❌ Candidate memories are empty
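A minimal sketch of this fallback policy, assuming hypothetical `agentic_fn` and `lightweight_fn` callables standing in for the real retrieval methods:

```python
import asyncio

async def retrieve_with_fallback(query, agentic_fn, lightweight_fn, timeout_s=60):
    """Try agentic retrieval with a timeout; degrade to lightweight on failure."""
    try:
        return await asyncio.wait_for(agentic_fn(query), timeout=timeout_s)
    except Exception as exc:  # also catches asyncio.TimeoutError
        result = await lightweight_fn(query)
        # Mark the fallback in metadata so callers can tell the modes apart.
        result.setdefault("metadata", {})
        result["metadata"]["retrieval_mode"] = "agentic_fallback"
        result["metadata"]["fallback_reason"] = f"{type(exc).__name__}: {exc}"
        return result
```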
Fallback is marked in metadata:
```json
{
  "metadata": {
    "retrieval_mode": "agentic_fallback",
    "fallback_reason": "LLM API timeout"
  }
}
```

To reduce cost and latency, you can:

```python
# Use a cheaper model
llm = LLMProvider(
    provider_type="openai",
    model="gpt-4o-mini",  # Cheaper
    # model="gpt-4",      # More accurate, but more expensive
)
```

```python
# Generate a single refined query instead of several
config = AgenticConfig(
    enable_multi_query=False,  # Only generate 1 query (reduces cost)
)
```

```python
# Skip the rerank step
config = AgenticConfig(
    use_reranker=False,  # Don't use a reranker (reduces latency and cost)
)
```

If the LLM API call fails:

Reasons:
- Incorrect API Key
- Network issues
- API rate limiting
Solutions:
- Check the API key in the `.env` file
- Verify the network connection
- Check the logs for detailed error information
If retrieval times out:

Reasons:
- Slow LLM response
- Too many candidate memories
- Reranker timeout

Solutions:
- Reduce `time_range_days` (fewer candidates)
- Disable the reranker
- Use a faster LLM model
If retrieval quality is poor:

Reasons:
- Inaccurate LLM judgment
- Unreasonable query generation
- Prompt template not well adapted

Solutions:
- Use a stronger LLM model (e.g., GPT-4)
- Adjust the prompt template (in `agentic_utils.py`)
- Increase `round1_rerank_top_n` (give the LLM more samples to judge)
| Feature | Lightweight | Agentic |
|---|---|---|
| Latency | 0.5-2s | 5-10s |
| LLM Calls | ❌ None | ✅ 1-2 |
| Multi-Round | ❌ No | ✅ Yes |
| Recall | Medium | High |
| Precision | Medium | High |
| Cost | Low | Medium |
| Use Cases | Simple queries | Complex queries |
- ✅ Prioritize Lightweight: For simple queries, Lightweight is sufficient
- ✅ Use Agentic for Complex Queries: Only when needed
- ✅ Monitor Costs: Track LLM token consumption
- ✅ Log Analysis: Regularly review if LLM judgments are reasonable
- ✅ A/B Testing: Compare effects of different modes
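The first two recommendations can be wired into a simple router that sends only complex-looking queries to agentic retrieval. The keyword hints and length threshold below are illustrative assumptions to tune (or A/B test) per application:

```python
# Assumed heuristics, not part of the library: tune for your workload.
COMPLEX_HINTS = ("why", "compare", "prefer", "restriction", "history", "and")

def choose_mode(query: str, latency_budget_s: float = 5.0) -> str:
    """Return "agentic" for complex-looking queries, else "lightweight"."""
    if latency_budget_s < 1.0:
        return "lightweight"  # agentic cannot meet sub-second budgets
    looks_complex = (
        len(query.split()) > 8
        or any(hint in query.lower() for hint in COMPLEX_HINTS)
    )
    return "agentic" if looks_complex else "lightweight"
```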
```python
import asyncio

from agentic_layer.memory_manager import MemoryManager
from memory_layer.llm.llm_provider import LLMProvider


async def main():
    # Initialize
    llm = LLMProvider("openai", model="gpt-4", api_key="...")
    memory_manager = MemoryManager()

    # User query
    query = "What foods does the user like? Any dietary restrictions?"

    # Execute retrieval
    result = await memory_manager.retrieve_agentic(
        query=query,
        group_id="food_lovers_group",
        llm_provider=llm,
    )

    # Display results
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}\n")
    print(f"Retrieval mode: {result['metadata']['retrieval_mode']}")
    print(f"Retrieved {result['count']} memories")
    print(f"Total latency: {result['metadata']['total_latency_ms']:.0f}ms\n")

    # LLM judgment
    print(f"LLM judgment: {'✅ Sufficient' if result['metadata']['is_sufficient'] else '❌ Insufficient'}")
    print(f"Reasoning: {result['metadata']['reasoning']}\n")

    # Multi-round information
    if result['metadata']['is_multi_round']:
        print("📝 Entered Round 2")
        print("Generated queries:")
        for i, q in enumerate(result['metadata']['refined_queries'], 1):
            print(f"  {i}. {q}")
        print()

    # Display memories
    print("Top 5 memories:")
    for i, mem in enumerate(result['memories'][:5], 1):
        print(f"\n[{i}] {mem['timestamp'][:10]}")
        print(f"    {mem['episode'][:100]}...")
        print(f"    Score: {mem['score']:.3f}")


if __name__ == "__main__":
    asyncio.run(main())
```

Notes:
- Agentic retrieval is an experimental feature and may change in future versions
- Understand the costs and limitations of the LLM API before using it
- Conduct thorough testing before deploying to production environments