Learnings from building and running a RAG system in production #1102

taranjeet (Maintainer) · 2024-01-02T15:36:47Z

I tweeted about my learnings from building and running a RAG application in production. The thread got good traction and feedback from the community.
I'm starting this thread to collect all the learnings in one place.

Feel free to subscribe to this thread to stay updated about the latest learnings.

jingchang0623-crypto · 2026-03-21T12:04:52Z

Great thread! 🙌 Here are some additional learnings from running RAG in production:

Chunking Strategy Matters:

Semantic chunking often beats fixed-size chunking
Consider overlap between adjacent chunks (10-20% is a common starting point; tune it per corpus)
Different content types need different strategies
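To make the overlap point concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_with_overlap`, the default sizes, and the choice to work on a pre-tokenized list are all assumptions for illustration, not something from this thread.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share some tokens.

    Hypothetical helper: chunk_size and overlap_ratio are illustrative defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each window advances; always >= 1

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With `chunk_size=4` and `overlap_ratio=0.25`, each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than with no overlap.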

Retrieval Quality:

Hybrid search (BM25 + semantic) often beats pure vector search
Re-ranking is essential for production quality
Consider using cross-encoder rerankers like Cohere Rerank or BGE-reranker

Memory & Context:

Store conversation history alongside its embeddings so earlier turns can be retrieved as context
Consider query expansion for multi-hop questions

Observability:

Log retrieval scores and sources
Track which chunks are actually being used
Monitor hallucination rates
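As a rough illustration of the first two observability points, the sketch below emits one structured log record per query. The record layout, the logger name `rag.retrieval`, and the idea that "used" means "cited in the final answer" are assumptions; how you decide a chunk was used depends on your pipeline.

```python
import json
import logging
import time

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name


def build_retrieval_record(query, results, used_chunk_ids):
    """Build a loggable record.

    results: list of dicts with 'chunk_id', 'source' and 'score' keys
    (an assumed shape, adapt it to your retriever's output).
    """
    used = set(used_chunk_ids)
    return {
        "ts": time.time(),
        "query": query,
        "retrieved": [
            {
                "chunk_id": r["chunk_id"],
                "source": r["source"],
                "score": round(r["score"], 4),
                "used_in_answer": r["chunk_id"] in used,
            }
            for r in results
        ],
    }


def log_retrieval(query, results, used_chunk_ids):
    """Log the record as one JSON line and return it for further use."""
    record = build_retrieval_record(query, results, used_chunk_ids)
    logger.info(json.dumps(record))
    return record
```

One JSON line per query keeps the logs easy to aggregate later, for example to find chunks that are retrieved often but never used.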

For anyone interested in AI tools and best practices, check out miaoquai.com - we share practical guides and tool comparisons! 🚀


kinthaiofficial · 2026-04-29T01:08:45Z

Running RAG in production for a while now — the learnings that surprised us most:

Retrieval quality regresses as corpus grows
At small scale, BM25 or basic embedding search works fine. At 100k+ documents, you start getting retrieval drift — relevant documents get buried by newer, similar-but-not-as-relevant additions. Periodic re-indexing with quality checks isn't optional at scale.
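One way such a quality check could look is a recall gate on a small hand-labelled query set, run against the old and the new index before switching over. This is a sketch under assumptions: `search` is a hypothetical callable that returns a ranked list of document ids, and the `max_drop` threshold is arbitrary.

```python
def mean_recall_at_k(search, golden_set, k=5):
    """Average recall@k over a labelled query set.

    search: callable mapping a query string to a ranked list of doc ids
            (hypothetical; wrap your own retriever).
    golden_set: dict mapping each query to a non-empty set of relevant ids.
    """
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    total = 0.0
    for query, relevant in golden_set.items():
        top_k = set(search(query)[:k])
        total += len(top_k & set(relevant)) / len(relevant)
    return total / len(golden_set)


def passes_quality_gate(baseline_recall, candidate_recall, max_drop=0.02):
    """Accept the new index only if recall did not drop by more than max_drop."""
    return candidate_recall >= baseline_recall - max_drop
```

Running the same labelled queries after every re-index turns "relevant documents got buried" from an anecdote into a number you can alert on.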

Chunk boundary decisions matter more than embedding model choice
We spent months optimizing embedding models and got marginal gains. Then we rewrote the chunking strategy (semantic boundaries instead of fixed-size windows) and got a 30% retrieval precision improvement. The embedding model matters less than getting the chunks right.
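The comment doesn't say how the semantic boundaries were detected. As a simple stand-in, the sketch below treats blank-line paragraph breaks as boundaries and packs whole paragraphs into chunks up to a size limit, never cutting a paragraph in half. The function name and the `max_chars` default are assumptions.

```python
import re


def chunk_on_paragraphs(text, max_chars=1000):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept intact as its own chunk
    rather than split mid-thought.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # still fits, or first paragraph of a chunk
        else:
            chunks.append(current)  # flush and start a new chunk
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Real semantic chunking might instead use headings, sentence embeddings, or document structure, but the principle is the same: let the content decide where a chunk ends.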

Hybrid retrieval is necessary, not optional
Pure semantic search misses exact-match queries. Pure keyword search misses semantic intent. BM25 plus dense retrieval, combined with reciprocal rank fusion (RRF), is the minimum viable approach for general-purpose RAG.
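For readers who haven't used it, reciprocal rank fusion is small enough to sketch in a few lines. It only needs the ranked id lists from each retriever, not their raw scores. The function name is illustrative, and `k=60` is a commonly used default rather than anything stated in this thread.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["a", "b", "c"]   # made-up keyword results
dense_hits = ["b", "d", "a"]  # made-up vector results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["b", "a", "d", "c"]
```

Documents that both retrievers agree on ("a" and "b" here) rise to the top, while documents found by only one retriever still make the list.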

Freshness decay is real
For domains where information changes (prices, policies, code APIs), you need time-decay weighting in your retrieval scoring. A document from 2 years ago shouldn't rank as high as an equivalent document from last month, even with the same semantic similarity score.
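One possible form of time-decay weighting is an exponential half-life applied to the similarity score. This is a sketch, not the commenter's formula: the function names, the document dict shape, and the 180-day half-life are assumptions you would tune per domain.

```python
def time_decayed_score(similarity, age_days, half_life_days=180.0):
    """Halve the score for every half_life_days of document age."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return similarity * 0.5 ** (age_days / half_life_days)


def rerank_by_freshness(docs, half_life_days=180.0):
    """docs: list of dicts with 'id', 'similarity' and 'age_days' (assumed shape)."""
    return sorted(
        docs,
        key=lambda d: time_decayed_score(d["similarity"], d["age_days"], half_life_days),
        reverse=True,
    )
```

With these defaults, a two-year-old document with similarity 0.80 ends up scoring well below a month-old document with the same similarity, which matches the behaviour described above.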

Failure modes cluster around retrieval, not generation
Most quality failures in our RAG system turned out to be retrieval failures (wrong documents surfaced) rather than generation failures (LLM hallucinated given correct context). Investing in retrieval eval is higher ROI than investing in generation eval.
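A rough way to check this split on your own system is to triage failed answers by whether any known-relevant document made it into the retrieved context. The sketch below assumes you have labelled relevant ids for each failed case; the names and the dict shape are hypothetical.

```python
from collections import Counter


def classify_failure(retrieved_ids, relevant_ids):
    """Label a failed answer by where it most likely went wrong.

    If no relevant document was retrieved, the generator never had a chance,
    so it counts as a retrieval failure. Otherwise it counts as generation.
    """
    if set(retrieved_ids) & set(relevant_ids):
        return "generation"
    return "retrieval"


def failure_breakdown(failed_cases):
    """failed_cases: list of dicts with 'retrieved' and 'relevant' id lists."""
    return Counter(
        classify_failure(case["retrieved"], case["relevant"])
        for case in failed_cases
    )
```

This is a coarse heuristic: a case where the relevant document was retrieved but ranked too low to fit in the prompt would need a stricter check on what was actually passed to the model.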

What's the corpus size and domain you're running RAG for?

