Feature Suggestion: Optional RAG (Retrieval-Augmented Generation) for Large Document Processing

Dear Chorus Development Team,
I'm writing to suggest a feature that could significantly enhance Chorus's ability to handle large documents while reducing API costs and staying within rate limits. This suggestion comes from a real problem I encountered and is inspired by how other AI tools have solved similar challenges.

The Problem I Encountered
Recently, I attempted to upload a trading analysis report (a Quant Analyzer portfolio report) to Chorus using the multi-model interface. The document was approximately 40,000-50,000 tokens (11 pages with 502 individual trade entries, performance metrics, and statistical data).
What happened:

Claude Sonnet 4.5: Failed with "Rate limit exceeded" (30,000 input tokens/minute)
Gemini 2.5 Flash: Failed with "429 Too Many Requests"
GPT-5: Failed with "Context limit reached"

All three models failed simultaneously when Chorus attempted to process the file. However, when I uploaded the same document directly to claude.ai in a fresh conversation, it worked perfectly fine.
The core issue: When routing through APIs with multi-model requests, large documents quickly exceed rate limits, especially on lower API tiers. This is a fundamental constraint that affects many users working with substantial documents, research materials, or data reports.
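To make the arithmetic concrete, here is a minimal sketch of the kind of pre-flight check that would have flagged my upload before any request was sent. The function name `fits_rate_limit` and the 200-token prompt allowance are hypothetical and not part of Chorus; the 30,000 figure is the input tokens/minute limit quoted in the Claude error above, and 42,000 is my document's approximate size.

```python
def fits_rate_limit(doc_tokens: int, prompt_tokens: int, input_tokens_per_minute: int) -> bool:
    """Return True if one request fits inside a single minute's input-token budget."""
    return doc_tokens + prompt_tokens <= input_tokens_per_minute


# Hypothetical allowance for the user's question and system prompt.
PROMPT_TOKENS = 200

# The full ~42,000-token report against a 30,000 input tokens/minute limit.
print(fits_rate_limit(42_000, PROMPT_TOKENS, 30_000))  # False

# A ~6,000-token excerpt of the same report against the same limit.
print(fits_rate_limit(6_000, PROMPT_TOKENS, 30_000))  # True
```

In a multi-model request the same check would need to pass for every selected model, each with its own limit.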

How Other Tools Solve This: RAG Systems
I also use RabbitHoles.ai, which is a non-linear, canvas-based research tool. It allows users to create multiple interconnected chats where context from one chat can be sent to another "receiving chat." For example:
Canvas Structure:
Chat A: Market fundamentals (8k tokens)
    ↓
Chat B: Strategy analysis (12k tokens)
    ↓
Chat C: Risk assessment (6k tokens)
    ↓
Chat D: Trade recommendations (10k tokens)

When I ask a question in Chat D that's connected to A→B→C→D:
Total context = 36k tokens
Without optimization, sending a query to Chat D would require transmitting all 36k tokens to the API, which would:

Exceed many API rate limits
Cost significantly more
Include large amounts of irrelevant context

RabbitHoles' solution: They implemented RAG (Retrieval-Augmented Generation)—a system that extracts only the most relevant information from connected contexts before sending to the API.
How it works:

When I ask a question in Chat D: "What's the optimal risk/reward ratio?"
RAG semantically searches all connected chats (A, B, C, D)
Extracts only relevant chunks: ~5-7k tokens instead of 36k
Sends just those relevant chunks + my question to the API
Result: 5-7x cost reduction, stays within rate limits, faster responses
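The retrieval steps above can be sketched roughly as follows. This is not RabbitHoles' actual code, which I have not seen. To keep the example self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model, `estimate_tokens` is a crude word count, and all function names are my own assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def retrieve(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Return the best-matching chunks, most similar first, within token_budget."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, chunk in scored:
        if score == 0.0:
            break  # no shared terms with the query from here on
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


# Placeholder text standing in for the connected chats.
connected_chats = [
    "Market fundamentals: interest rates and liquidity conditions.",
    "Risk assessment: notes on the risk/reward ratio for each setup.",
    "Strategy analysis: breakout entries in trending markets.",
]
print(retrieve("What's the optimal risk/reward ratio?", connected_chats, token_budget=50))
# ['Risk assessment: notes on the risk/reward ratio for each setup.']
```

Only the selected chunks plus the question would then be sent to the API. A real implementation would swap `embed` for a proper embedding model and use the model's own tokenizer for the budget.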

Architecture note: RabbitHoles implements RAG on-device (local vector database like ChromaDB running on the user's computer), not on cloud servers. This means:

Zero infrastructure costs for the company
Strong user privacy (documents and the index never leave the device; only the retrieved chunks go to the API)
Fast retrieval (no network latency)
Aligns with local-first, privacy-focused architecture


How RAG Could Work in Chorus
While Chorus is a linear chat app (each conversation is separate), RAG could still provide immense value for large document processing. Here's how:
Current Workflow (Without RAG):
User uploads 40k token document
    ↓
Chorus sends entire 40k tokens to API
    ↓
❌ Hits rate limit / exceeds context window
    ↓
Upload fails or becomes very expensive
Proposed Workflow (With Optional RAG):
User uploads 40k token document
    ↓
Chorus detects large file, shows RAG toggle
    ↓
User enables RAG mode
    ↓
Document is chunked and embedded locally (on-device)
    ↓
User asks: "What's the Sharpe Ratio?"
    ↓
RAG searches chunks locally, finds relevant sections (3-5k tokens)
    ↓
Sends only relevant context + question to API
    ↓
✅ Stays within limits, about 90% fewer input tokens (3-5k instead of 40k), faster response

Proposed Implementation: Semi-Automatic RAG Toggle
I suggest implementing a user-controlled, optional RAG system with three modes:
Mode 1: Full Context (Default)

Sends entire document to API
Best for: Comprehensive analysis, financial documents, legal docs, medical records
Cost: Standard (full token usage)
Quality: Maximum comprehensiveness

Mode 2: RAG Mode (User-enabled)

Extracts only relevant chunks based on queries
Best for: Q&A, finding specific facts, research notes, general documents
Cost: 5-10x cheaper (typically 5-7k tokens vs 40-50k)
Quality: Excellent for targeted questions, may miss broader context

Mode 3: Hybrid Mode (Balanced)

RAG finds relevant chunks, then expands with surrounding context
Best for: Most use cases—balance between cost and comprehensiveness
Cost: 3-5x cheaper (typically 10-15k tokens)
Quality: High—captures both specific answers and broader context
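The "expands with surrounding context" step in Hybrid Mode could look something like the sketch below. It assumes the document has already been split into an ordered list of chunks and that retrieval returns the indices of the matching ones. The function name and the `window` parameter are hypothetical.

```python
def expand_with_neighbors(hit_indices: list[int], n_chunks: int, window: int = 1) -> list[int]:
    """Return sorted, de-duplicated indices covering each hit plus `window` chunks on each side."""
    keep: set[int] = set()
    for i in hit_indices:
        first = max(0, i - window)
        last = min(n_chunks - 1, i + window)
        keep.update(range(first, last + 1))
    return sorted(keep)


# Hits at chunks 0, 5 and 6 of a 10-chunk document, one neighbor on each side.
print(expand_with_neighbors([0, 5, 6], n_chunks=10, window=1))  # [0, 1, 4, 5, 6, 7]
```

Returning the indices in document order keeps adjacent chunks together, so the model sees continuous passages instead of isolated fragments.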


Suggested UI/UX Flow
┌────────────────────────────────────────────────┐
│  📎 Document uploaded: QuantAnalyzer.pdf       │
│  Size: 42,000 tokens (~11 pages)               │
│                                                │
│  ⚠️  Large document detected                   │
│                                                │
│  Processing Mode:                              │
│  ○ Full Context - Comprehensive analysis       │
│  ● Hybrid - Balanced (Recommended) ✨          │
│  ○ RAG - Maximum cost savings                  │
│                                                │
│  💰 Cost Estimate (10 queries):                │
│  • Full Context: ~$4.20 (420k tokens)          │
│  • Hybrid: ~$1.30 (130k tokens)                │
│  • RAG: ~$0.60 (60k tokens)                    │
│                                                │
│  💡 RAG extracts relevant information only.    │
│     Recommended for:                           │
│     ✓ Quick questions & fact-finding           │
│     ✓ Specific searches                        │
│     ✓ General research documents               │
│                                                │
│  ⚠️  Use Full Context for:                     │
│     • Financial analysis & reports             │
│     • Legal documents                          │
│     • Medical records                          │
│     • Any document requiring complete review   │
│                                                │
│  [ Continue ] [ Settings ]                     │
└────────────────────────────────────────────────┘

Why This Aligns with Chorus's Architecture
Based on my understanding, Chorus is built with on-device processing as a core principle, with only API-related operations going to the cloud. This aligns perfectly with how RAG should be implemented:
On-Device RAG Processing (Recommended):
User's Computer (Chorus app):
├── Document chunking (local)
├── Vector database (ChromaDB/LanceDB - runs locally)
├── Embedding generation (local or via user's API key)
├── Semantic search (local)
└── Only sends extracted chunks to API (via user's keys)

Benefits:
✅ Zero cloud infrastructure costs for Chorus
✅ User privacy maintained (documents and the index stay local; only retrieved chunks are sent to the API)
✅ Fast processing (no network latency for retrieval)
✅ Aligns with Chorus's local-first philosophy
✅ Scales perfectly (each user's device does the work)
vs. Cloud-Based RAG (Not Recommended):
Chorus Cloud Servers:
├── Store all user documents
├── Run vector database ($70-200/month)
├── Process embeddings ($20-100/month)
└── Added complexity and costs

Drawbacks:
❌ Ongoing infrastructure costs
❌ Privacy concerns (documents on servers)
❌ Network latency
❌ Doesn't align with local-first approach
The on-device approach means:

No additional monthly costs for Chorus
User retains full control and privacy
Implementation complexity similar to RabbitHoles.ai
Sustainable for one-time payment or low-cost subscription model


Technical Implementation Considerations
Suggested Tech Stack (On-Device):

Vector Database: ChromaDB or LanceDB (embedded, no server needed)
Embeddings:

Option A: Local embedding models (fast, free)
Option B: API-based embeddings via user's keys (more accurate)


Chunking: Smart text splitting with overlap (500-1000 tokens per chunk)
Search: Cosine similarity for semantic matching
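As an illustration of splitting with overlap, here is a minimal sketch. It works on an already-tokenized list, so it is independent of any particular tokenizer. The default chunk size of 800 sits inside the 500-1000 range suggested above; the 100-token overlap and the function name are my own assumptions.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for a tokenized 42,000-token document.
document = [f"tok{i}" for i in range(42_000)]
chunks = chunk_tokens(document)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 60 800 700
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.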

Development Phases:
Phase 1: Basic RAG Toggle (2-3 weeks)

Simple keyword-based extraction (no vector DB initially)
Manual toggle for documents >15k tokens
Proof of concept
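A Phase 1 proof of concept along these lines might be as small as the sketch below: a token threshold for showing the toggle, and keyword overlap in place of a vector database. The stopword list, the function names, and the sample paragraphs are hypothetical; only the 15k threshold comes from the bullet above.

```python
import re

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "s", "of", "for", "in", "to", "me", "show"}

# Threshold from the Phase 1 bullet above: documents over 15k tokens.
RAG_TOGGLE_THRESHOLD = 15_000


def should_offer_toggle(doc_tokens: int) -> bool:
    """Show the RAG toggle only for documents above the threshold."""
    return doc_tokens > RAG_TOGGLE_THRESHOLD


def keywords(text: str) -> set[str]:
    """Lowercase words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def keyword_extract(question: str, paragraphs: list[str], top_k: int = 3) -> list[str]:
    """Pick up to top_k paragraphs sharing the most keywords with the question, in document order."""
    q = keywords(question)
    scored = [(len(q & keywords(p)), i) for i, p in enumerate(paragraphs)]
    best = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))[:top_k]
    return [paragraphs[i] for i in sorted(i for _, i in best)]


# Placeholder paragraphs, not taken from the real report.
report = [
    "Summary statistics: profit factor, win rate, drawdown.",
    "Monthly breakdown: October trades listed by date.",
    "Appendix: glossary of terms.",
]
print(should_offer_toggle(42_000))                             # True
print(keyword_extract("What's the profit factor?", report))    # ['Summary statistics: profit factor, win rate, drawdown.']
print(keyword_extract("Show me October trades", report))       # ['Monthly breakdown: October trades listed by date.']
```

Keyword matching misses paraphrases (a question about "returns" would not find "profit"), which is the gap the Phase 2 embeddings are meant to close.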

Phase 2: Smart RAG (1-2 months)

Implement local vector database
Semantic search with embeddings
Auto-suggest RAG for large files
Cost estimation UI

Phase 3: Hybrid Mode (2-3 months)

Context expansion around relevant chunks
Intelligent mode selection
Multi-document RAG support
Advanced settings (chunk size, retrieval depth, etc.)


Real-World Use Cases
Use Case 1: Financial Analysis (My Experience)
Document: 42,000-token trading report with 502 trades
Questions:

"What's the profit factor?" → RAG perfect (extract summary stats)
"Show me October trades" → RAG perfect (find specific section)
"Analyze patterns across all months" → Full Context better (needs complete data)
"Compare risk metrics" → Hybrid ideal (finds relevant sections + context)

Outcome with RAG:

Quick questions: 85% cost savings
Deep analysis: User switches to Full Context
Overall: ~60% cost reduction with maintained quality
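The question-to-mode mapping in this use case could be approximated with a very simple heuristic, sketched below. The cue words are my own guesses chosen to match the example questions in this proposal, and `suggest_mode` is a hypothetical name. A real version would need tuning and should always let the user override the suggestion.

```python
import re

# Hypothetical cue words; questions about the whole document favor Full Context.
FULL_CONTEXT_CUES = {"all", "entire", "whole", "across", "summarize", "every"}
# Questions that relate several sections favor Hybrid.
HYBRID_CUES = {"compare", "versus", "overview", "relationship"}


def suggest_mode(question: str) -> str:
    """Suggest 'full_context', 'hybrid', or 'rag' from cue words in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    if words & FULL_CONTEXT_CUES:
        return "full_context"
    if words & HYBRID_CUES:
        return "hybrid"
    return "rag"


for q in [
    "What's the profit factor?",
    "Show me October trades",
    "Analyze patterns across all months",
    "Compare risk metrics",
]:
    print(f"{q!r} -> {suggest_mode(q)}")
```

With these cue lists the four questions map to rag, rag, full_context, and hybrid respectively, matching the list above.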

Use Case 2: Research Papers
Document: 30,000-token academic paper
Questions:

"What methodology did they use?" → RAG extracts Methods section
"What were the conclusions?" → RAG extracts Discussion/Conclusion
"Summarize the entire paper" → Full Context required

Use Case 3: Meeting Notes / Documentation
Document: 25,000 tokens of project documentation
Questions:

"What's the deadline for feature X?" → RAG finds specific mention
"Who's responsible for Y?" → RAG extracts relevant section
"Give me a complete project overview" → Hybrid mode ideal


Cost-Benefit Analysis
For Users:
Without RAG:

Upload 40k token document
10 queries = 400k tokens processed
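Cost: ~$4.00 (assuming a flat rate of about $10 per million tokens; actual API pricing varies by model and differs between input and output tokens)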

With RAG:

Upload 40k token document (processed locally, ~$0)
10 queries = ~60k tokens processed
Cost: ~$0.60
Savings: $3.40 (85% reduction)

Annual savings for active user:

100 large documents × 10 queries each
Savings: $340/year
Time saved: Faster responses (fewer tokens to process)
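The figures above follow from simple arithmetic, reproduced here so the assumptions are easy to change. The flat $10 per million tokens is the rate assumed in this proposal, not a quoted price, and `query_cost` is a hypothetical helper.

```python
def query_cost(tokens_per_query: int, queries: int, usd_per_million_tokens: float) -> float:
    """Total cost in USD for `queries` requests of `tokens_per_query` tokens each."""
    return tokens_per_query * queries * usd_per_million_tokens / 1_000_000


PRICE = 10.0  # USD per million tokens (assumed flat rate)

full = query_cost(40_000, 10, PRICE)  # whole document sent with every query
rag = query_cost(6_000, 10, PRICE)    # ~6k retrieved tokens per query
savings = full - rag

print(f"Full context: ${full:.2f}")                                  # Full context: $4.00
print(f"RAG:          ${rag:.2f}")                                   # RAG:          $0.60
print(f"Savings:      ${savings:.2f} ({savings / full:.0%})")        # Savings:      $3.40 (85%)
print(f"Per year (100 documents): ${savings * 100:.0f}")             # Per year (100 documents): $340
```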

For Chorus:
Development Investment:

Phase 1 (basic): 2-3 weeks, 1 developer (~$5-8k)
Phase 2 (full): 1-2 months, 1 developer (~$15-25k)
Total (Phases 1 and 2): ~$20-33k one-time investment; Phase 3 is not included in this estimate

Ongoing Costs:

On-device implementation: $0/month (runs on user's computer)
Cloud implementation: roughly $90-300/month for the vector database and embeddings alone, per the estimates above (not recommended)

Revenue Potential:

Increased user satisfaction → Higher retention
Competitive differentiation → New user acquisition
Could enable "Pro" tier with advanced RAG features
Positions Chorus as enterprise-ready for large documents

ROI:

Qualitative: Solves major pain point, handles enterprise use cases
Quantitative: Not yet measured, but the feature retains users who would otherwise abandon the product after failed uploads
Competitive: RabbitHoles.ai charges $89-250 one-time partly because of this capability


Why This Matters

Removes Upload Limitations: Users can work with documents well beyond an API's per-minute token limit, because only the retrieved chunks are sent with each query
Cost Efficiency: Dramatically reduces API costs for users working with large documents (my use case: 85% savings)
Maintains Quality: User control means they choose when comprehensiveness matters vs. when cost-efficiency is preferred
Privacy & Speed: On-device processing keeps data local and retrieval instant
Competitive Advantage: Most chat wrappers (ChatHub, Poe, etc.) don't offer this—Chorus would stand out
Enterprise-Ready: Makes Chorus viable for professional users dealing with financial reports, legal documents, research papers, etc.
Aligns with Philosophy: On-device RAG fits Chorus's local-first, privacy-focused architecture perfectly


Potential Concerns & Responses
Concern 1: "This is complex to implement"
Response: Phase 1 (basic version) is achievable in 2-3 weeks. RabbitHoles.ai proves it's viable for small teams. On-device implementation using existing libraries (ChromaDB) reduces complexity significantly.
Concern 2: "Users might not understand when to use RAG vs Full Context"
Response: Smart defaults handle this. Auto-suggest RAG for large files with clear explanations. Hybrid mode provides good balance for uncertain cases. User education through simple tooltips.
Concern 3: "Will this slow down the app?"
Response: On-device processing is actually faster for retrieval (5-50ms vs 200-500ms for cloud). Initial chunking/embedding happens once per document, then cached locally. Subsequent queries are near-instant.
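The "once per document, then cached locally" behavior could be sketched as below, keyed on a hash of the file contents. The cache layout, the function names, and the JSON format are assumptions for illustration, and `split_paragraphs` is a placeholder for the real chunking and embedding step.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def document_key(data: bytes) -> str:
    """Content hash, so re-uploading the same file reuses the same cache entry."""
    return hashlib.sha256(data).hexdigest()


def load_or_build_index(
    data: bytes, cache_dir: Path, build: Callable[[bytes], list[str]]
) -> tuple[list[str], bool]:
    """Return (index, was_cached). `build` runs only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{document_key(data)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    index = build(data)
    path.write_text(json.dumps(index), encoding="utf-8")
    return index, False


def split_paragraphs(data: bytes) -> list[str]:
    """Placeholder for the real chunking and embedding step."""
    return [p for p in data.decode("utf-8").split("\n\n") if p.strip()]


with tempfile.TemporaryDirectory() as tmp:
    doc = b"Summary section.\n\nTrade list section."
    first, cached_first = load_or_build_index(doc, Path(tmp), split_paragraphs)
    second, cached_second = load_or_build_index(doc, Path(tmp), split_paragraphs)
    print(cached_first, cached_second, first == second)  # False True True
```

Hashing the contents instead of the filename means a renamed copy still hits the cache, while an edited file gets a fresh index.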
Concern 4: "What about device compatibility?"
Response: ChromaDB and similar tools work on all major platforms (Windows, macOS, Linux). Minimal system requirements (works on 5-year-old computers). Fallback to cloud-based processing if device constraints detected.
Concern 5: "Users might extract sensitive info incorrectly"
Response: Clear warnings for financial/legal/medical documents. Default to Full Context for unknown document types. User always has final control.

Competitive Landscape
Tools WITH RAG:

✅ RabbitHoles.ai (non-linear, canvas-based, on-device)
✅ Perplexity.ai (web search + RAG)
✅ Notion AI (workspace RAG)
✅ ChatGPT with Custom GPTs (knowledge base RAG)

Tools WITHOUT RAG:

❌ ChatHub.gg (multi-model, no RAG)
❌ Poe.com (mostly full-context)
❌ Most chat wrappers
❌ Chorus (currently)

Opportunity: Chorus could be the first multi-model linear chat tool with local RAG, combining the best of both worlds:

Multi-model flexibility (Chorus's strength)
Large document handling (RAG's strength)
Local-first privacy (both tools' philosophy)


Suggested Next Steps
If this feature interests the team, I'd suggest:

Community Feedback: Poll users about document size pain points and RAG interest
Prototype: Build Phase 1 (basic RAG toggle) as proof of concept
Beta Testing: Release to small user group for feedback
Iterate: Refine based on real-world usage patterns
Launch: Roll out with clear documentation and examples

I'd be happy to:

Beta test early versions
Provide detailed use cases and feedback
Help document the feature for other users
Share insights from my experience with RabbitHoles.ai's RAG system


Conclusion
RAG represents a fundamental capability for handling large documents in AI tools. As document sizes and context requirements grow, this will become increasingly essential.
By implementing optional, on-device RAG, Chorus can:

✅ Solve real user pain points (like my 40k token upload failure)
✅ Reduce API costs by 60-85% for large document workflows
✅ Stay within API rate limits even for documents far larger than the per-minute token budget
✅ Maintain privacy and speed through local processing
✅ Differentiate from competitors
✅ Enable enterprise and professional use cases
✅ Align perfectly with the local-first architecture

Most importantly, user control through the three-mode system (Full Context / Hybrid / RAG) means Chorus can offer power users advanced capabilities while keeping the simple experience for casual users.
Thank you for building Chorus and for considering this suggestion. I believe this feature could be transformative for users working with substantial documents and research materials.
Best regards,
Ingvar

P.S. If helpful, I'm happy to provide the specific Quant Analyzer PDF that failed to upload as a test case, or discuss technical implementation details further. I have experience with both Chorus and RabbitHoles.ai and would love to see Chorus adopt this capability.

Key Resources for Implementation:

ChromaDB: https://www.trychroma.com/ (local vector DB)
LanceDB: https://lancedb.com/ (alternative local vector DB)
RabbitHoles.ai: https://rabbitholes.ai/ (reference implementation)
Sentence Transformers: https://www.sbert.net/ (local embedding models)
