
Commit 78721a0

Update A Case Study of Remote KV Cache_ LMCache + Cohere + CoreWeave CAIOS.md
Added a last minute sentence
1 parent e98e3aa commit 78721a0

File tree: 1 file changed (+1, −1 lines)


_posts/A Case Study of Remote KV Cache_ LMCache + Cohere + CoreWeave CAIOS.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ By Walter Beller-Morales (Cohere), Samuel Shen (Tensormesh), Kishor Aher (CoreWe

Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their own data while ensuring that this information remains private. [**Cohere**](https://cohere.com), one of the leading enterprise AI companies, built its North platform to help organizations use their own internal data safely and effectively to power retrieval-augmented generation (RAG). North allows enterprises to ground model outputs in trusted, private knowledge bases, delivering accurate, contextual responses tailored to their business.

**Removed (line 18):** When you use RAG, you prefix each request with the relevant contextual data so that the model can give relevant answers. This introduces a computational hurdle: the added context must be re-processed every time a query is received, because it does not modify the model's weights; it is only stored in the KV cache, a temporary memory that is typically discarded once the query has been processed. The richer the context an LLM is given, the more **tokens** it must process, and behind those tokens lies a growing [**Key and Value tensor**](https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960) **(KV) cache** that stores intermediate model states. This cache is essential for generating coherent responses, but it grows rapidly with input length, consuming vast amounts of GPU or CPU memory.

**Added (line 18):** When you use RAG, you prefix each request with the relevant contextual data so that the model can give relevant answers. This introduces a computational hurdle: the added context must be re-processed every time a query is received, because it does not modify the model's weights; it is only stored in the KV cache, a temporary memory that is typically discarded once the query has been processed. The richer the context an LLM is given, the more **tokens** it must process, and behind those tokens lies a growing [**Key and Value tensor**](https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960) **(KV) cache** that stores intermediate model states. This cache is essential for generating coherent responses, but it grows rapidly with input length, consuming vast amounts of GPU or CPU memory. This is not specific to RAG: any additional prompt content (such as tool-call arguments, code, or long instructions) also increases compute cost because it must be re-encoded on every request, but RAG is the use case covered in this blog.

At scale, this creates a performance and cost bottleneck for inference, even for efficient inference engines like vLLM. Cohere’s engineering team set out to solve this problem by exploring whether **KV caches could be stored remotely**, freeing up local memory without slowing down inference.
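
To make the KV-cache growth described in the changed paragraph concrete, here is a rough back-of-the-envelope sizing sketch. The model dimensions (32 layers, 8 KV heads, head size 128, fp16) are illustrative assumptions for an 8B-class model with grouped-query attention, not figures from the post.

```python
# Back-of-the-envelope KV-cache sizing (assumed model dims, not from the post).
# For every token, each layer stores one K and one V vector per KV head.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,      # assumed: 8B-class model
                   num_kv_heads: int = 8,     # assumed: grouped-query attention
                   head_dim: int = 128,       # assumed
                   bytes_per_elem: int = 2):  # fp16 / bf16
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return num_tokens * per_token

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB per request")
```

Under these assumptions a 32K-token RAG prefix already occupies about 4 GiB of KV cache for a single request, and without reuse that prefix is re-encoded from scratch on every query.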

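For completeness, here is a minimal sketch of how a remote KV-cache tier can be wired into vLLM via LMCache, following the integration pattern in LMCache's public documentation. The model name, cache-server address, and config values are placeholders, the exact class and config-key names may differ across LMCache and vLLM versions, and this is not the Cohere/CoreWeave CAIOS deployment described in the post.

```python
# Minimal sketch: offloading vLLM's KV cache through LMCache to a remote tier.
# Names, addresses, and config values below are illustrative placeholders.
import os

# LMCache reads its settings from the file named by LMCACHE_CONFIG_FILE.
# "lm://cache-server:65432" stands in for a remote cache backend; an
# S3-compatible object store such as CoreWeave CAIOS is configured here too.
with open("lmcache_config.yaml", "w") as f:
    f.write(
        "chunk_size: 256\n"              # tokens per cached KV chunk
        "local_cpu: true\n"              # keep a CPU-RAM tier as well
        "max_local_cpu_size: 5\n"        # GiB of CPU-RAM cache
        'remote_url: "lm://cache-server:65432"\n'
        'remote_serde: "naive"\n'
    )
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route KV blocks through the LMCache connector; kv_role="kv_both" means this
# engine both saves newly computed KV chunks and loads previously cached ones.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    gpu_memory_utilization=0.8,
)

# A long shared RAG prefix: its KV cache is computed once and reused from the
# cache instead of being re-encoded on every request that shares the prefix.
prefix = "Internal knowledge base:\n" + "…document chunks…\n" * 200
outputs = llm.generate([prefix + "\nQ: What is our refund policy?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

With `kv_role="kv_both"` the engine both stores newly computed KV chunks to the configured backend and loads previously cached ones, so a long shared prefix is fetched rather than recomputed on subsequent requests.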