What's the need for CAG if it just uses the context window of the LLM? #18

faizrazadec opened this issue Jan 30, 2025 · 3 comments

Comments

@faizrazadec

I am trying to understand the need for Cache-Augmented Generation (CAG) if it only uses the context window of the LLM. From my perspective, the key-value pairs of the data are generated and placed in the context window, which could also be achieved through the system prompt: we could load the entire document into the system prompt and it would function much like CAG. Could you clarify the difference between the two approaches?

For my use case, I had been working with embeddings before considering the system prompt. However, after reviewing your paper and examining your code, I switched to key-value pairs, which I now feed into the model. This seems to yield results similar to using the system prompt directly.

Here is the approach I implemented in my code. The main difference, as I see it, is that you use greedy decoding and break the document into tokens before creating the key-value pairs, which you refer to as the "KV cache." But if retrieving from the KV cache is similar to fetching relevant information from embeddings, how do the two approaches differ in terms of efficiency or output?

I could be missing something, as I am still learning, and would appreciate your insight, @hhhuang, to help me better understand this concept.

import google.generativeai as genai

class KVStore:
    def __init__(self):
        self.store = {}

    def add(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key, None)
    
    def items(self):
        return self.store.items()

def load_schema(file_path):
    with open(file_path, 'r') as file:
        schema = file.read()
    return schema

# Instantiate KVStore
kv_store = KVStore()

# Add the schema to the KV store
schema_file_path = 'demographics_Schema.txt'
rawschema = load_schema(schema_file_path)
kv_store.add("bigquery_schema", rawschema)

# To view the entire dictionary directly
print("Complete KV Store dictionary:")
print(kv_store.store)  # This will print the entire dictionary

def create_system_prompt(schema):
    return f"""
    You are a SQL query generator. Given the following BigQuery schema, generate a SQL query based on the user's request.

    Schema:
    {schema}

    User's Request: {{user_request}}
    SQL Query:
    """

def generate_sql_query(user_request, kv_store):
    # Retrieve schema from KV store using a key
    schema = kv_store.get("bigquery_schema")
    print(schema)
    if schema is None:
        return "Schema not found in KV store!"
    
    # Prepare the prompt using the retrieved schema
    system_prompt = create_system_prompt(schema)
    prompt = system_prompt.format(user_request=user_request)

    # Call Gemini API to generate SQL query
    genai.configure(api_key="YOUR_API_KEY")  # key redacted; never hardcode real API keys in source
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text

if __name__ == "__main__":
    user_request = "Show the racial distribution in each state."
    
    # Generate SQL query using the KV store
    sql_query = generate_sql_query(user_request, kv_store)
    print("Generated SQL Query:", sql_query)
@brian030128
Collaborator

Hi @faizrazadec ,

During LLM inference, the prompt has to be converted into a KV cache by the model, which is a computationally intensive step. In fact, there is a benchmark metric, TTFT (Time to First Token), that measures how long this process takes.

Cache-Augmented Generation (CAG) precomputes the prompt into a KV cache ahead of time, so the model does not have to re-encode the long context on every request. That makes it more efficient than the approach you describe, while the output remains the same.
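
For anyone curious, here is a minimal sketch of that idea using the Hugging Face transformers API. This is not the repository's code: the model name is a placeholder and exact cache handling varies a little across transformers versions, but it shows the shape of the trick. The long document is prefilled once, and its key/value tensors are reused for every question, so each request only processes the new tokens:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from the Hub works the same way.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# 1) Pay the prefill cost once: run the long document through the model
#    and keep the resulting key/value tensors (the "KV cache").
doc_ids = tokenizer("<long schema or document text>", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    kv_cache = model(doc_ids, use_cache=True).past_key_values

# 2) For each user question, reuse the cache instead of re-encoding the document.
question_ids = tokenizer("\nShow the racial distribution in each state.\n", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    out = model(question_ids, past_key_values=kv_cache, use_cache=True)

# 3) Greedy-decode the answer one token at a time, feeding only the newest token;
#    the cache carries the document + question context.
past = out.past_key_values
next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
answer_ids = [next_tok.item()]
for _ in range(128):
    with torch.no_grad():
        out = model(next_tok, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    if next_tok.item() == tokenizer.eos_token_id:
        break
    answer_ids.append(next_tok.item())

print(tokenizer.decode(answer_ids, skip_special_tokens=True))

With the document prefilled once, the per-question TTFT only covers the question tokens rather than the whole document.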

@voodoohop

@brian030128 Isn't this just the same as using context caching (which many providers already offer) with a large context? What is new about it?

@JuanChavarriaU

JuanChavarriaU commented Mar 25, 2025

@voodoohop If you mean Google's context caching, I don't think they are the same. CAG aims to eliminate retrieval latency and minimize retrieval errors while keeping the relevant context available to the model. Context caching, on the other hand, is about making the generative model more efficient: you use it to reduce the cost of requests that repeat the same high-token-count content.
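
For comparison, the context-caching flow in the google-generativeai SDK looks roughly like this (a sketch, not tested here; the exact module layout, the required model version string, and the minimum cacheable token count depend on the SDK release and model):

import os
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Cache the large, repeated part of the prompt once (subject to a minimum token count).
schema_text = open("demographics_Schema.txt").read()
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="bigquery-schema",
    system_instruction="You are a SQL query generator for the following BigQuery schema.",
    contents=[schema_text],
    ttl=datetime.timedelta(minutes=30),
)

# Requests against the cached content only pay full price for the new tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Show the racial distribution in each state.")
print(response.text)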

Also, @faizrazadec, you should delete the API key from your code.
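
Something like this keeps the key out of the source file (GEMINI_API_KEY is just a name chosen here for illustration):

import os
import google.generativeai as genai

# Read the key from an environment variable instead of committing it to the repository.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])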
