
Commit 78721a0

Update A Case Study of Remote KV Cache_ LMCache + Cohere + CoreWeave CAIOS.md
Added a last minute sentence
1 parent e98e3aa commit 78721a0

File tree: 1 file changed (+1, −1 lines)


_posts/A Case Study of Remote KV Cache_ LMCache + Cohere + CoreWeave CAIOS.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ By Walter Beller-Morales (Cohere), Samuel Shen (Tensormesh), Kishor Aher (CoreWe

Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing so at scale brings challenges in performance, cost, and accuracy. Organizations need models grounded in their own data while ensuring that this information remains private. [**Cohere**](https://cohere.com), one of the leading enterprise AI companies, built its North platform to help organizations use their own internal data safely and effectively to power retrieval-augmented generation (RAG). North allows enterprises to ground model outputs in trusted, private knowledge bases, delivering accurate, contextual responses tailored to their business.

**Removed (line 18):** When you use RAG, you prefix each request with the relevant contextual data so that the model can give relevant answers. This introduces a computational hurdle: the added context must be re-processed every time a query is received, because it does not modify the model's weights; it is only stored in the KV cache, a temporary memory that is typically discarded once the query has been processed. The richer the context an LLM is given, the more **tokens** it must process, and behind those tokens lies a growing [**Key and Value tensor**](https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960) **(KV) cache** that stores intermediate model states. This cache is essential for generating coherent responses, but it grows rapidly with input length, consuming vast amounts of GPU or CPU memory.

**Added (line 18):** When you use RAG, you prefix each request with the relevant contextual data so that the model can give relevant answers. This introduces a computational hurdle: the added context must be re-processed every time a query is received, because it does not modify the model's weights; it is only stored in the KV cache, a temporary memory that is typically discarded once the query has been processed. The richer the context an LLM is given, the more **tokens** it must process, and behind those tokens lies a growing [**Key and Value tensor**](https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960) **(KV) cache** that stores intermediate model states. This cache is essential for generating coherent responses, but it grows rapidly with input length, consuming vast amounts of GPU or CPU memory. This is not specific to RAG: any additional prompt content (such as tool-call arguments, code, or long instructions) also increases compute cost because it must be re-encoded on every request, but RAG is the use case covered in this blog.

At scale, this creates a performance and cost bottleneck for inference, even for efficient inference engines like vLLM. Cohere’s engineering team set out to solve this problem by exploring whether **KV caches could be stored remotely**, freeing up local memory without slowing down inference.
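
To make the KV-cache growth described in the changed paragraph concrete, here is a rough back-of-the-envelope sizing sketch. The model dimensions (32 layers, 8 KV heads, head size 128, fp16) are illustrative assumptions for an 8B-class model with grouped-query attention, not figures from the post.

```python
# Back-of-the-envelope KV-cache sizing (assumed model dims, not from the post).
# For every token, each layer stores one K and one V vector per KV head.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,      # assumed: 8B-class model
                   num_kv_heads: int = 8,     # assumed: grouped-query attention
                   head_dim: int = 128,       # assumed
                   bytes_per_elem: int = 2):  # fp16 / bf16
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return num_tokens * per_token

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB per request")
```

Under these assumptions a 32K-token RAG prefix already occupies about 4 GiB of KV cache for a single request, and without reuse that prefix is re-encoded from scratch on every query.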

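For completeness, here is a minimal sketch of how a remote KV-cache tier can be wired into vLLM via LMCache, following the integration pattern in LMCache's public documentation. The model name, cache-server address, and config values are placeholders, the exact class and config-key names may differ across LMCache and vLLM versions, and this is not the Cohere/CoreWeave CAIOS deployment described in the post.

```python
# Minimal sketch: offloading vLLM's KV cache through LMCache to a remote tier.
# Names, addresses, and config values below are illustrative placeholders.
import os

# LMCache reads its settings from the file named by LMCACHE_CONFIG_FILE.
# "lm://cache-server:65432" stands in for a remote cache backend; an
# S3-compatible object store such as CoreWeave CAIOS is configured here too.
with open("lmcache_config.yaml", "w") as f:
    f.write(
        "chunk_size: 256\n"              # tokens per cached KV chunk
        "local_cpu: true\n"              # keep a CPU-RAM tier as well
        "max_local_cpu_size: 5\n"        # GiB of CPU-RAM cache
        'remote_url: "lm://cache-server:65432"\n'
        'remote_serde: "naive"\n'
    )
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route KV blocks through the LMCache connector; kv_role="kv_both" means this
# engine both saves newly computed KV chunks and loads previously cached ones.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    gpu_memory_utilization=0.8,
)

# A long shared RAG prefix: its KV cache is computed once and reused from the
# cache instead of being re-encoded on every request that shares the prefix.
prefix = "Internal knowledge base:\n" + "…document chunks…\n" * 200
outputs = llm.generate([prefix + "\nQ: What is our refund policy?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

With `kv_role="kv_both"` the engine both stores newly computed KV chunks to the configured backend and loads previously cached ones, so a long shared prefix is fetched rather than recomputed on subsequent requests.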