What's the need for CAG if it just uses the context window of the LLM? #18

faizrazadec opened this issue Jan 30, 2025 · 3 comments

Comments

@faizrazadec

I am trying to understand the need for Cache-Augmented Generation (CAG) if it only uses the context window of the LLM. From my perspective, the key-value pairs of the data are generated and placed in the context window, which could also be achieved through the system prompt: we could load the entire document into the system prompt and it would function much like CAG. Could you clarify the difference between the two approaches?

For my use case, I had been working with embeddings before considering the system prompt. However, after reviewing your paper and examining your code, I switched to key-value pairs, which I now feed into the model. This seems to yield results similar to using the system prompt directly.

Here is the approach I implemented in my code. The main difference, as I see it, is that you use greedy decoding and break the document into tokens before creating the key-value pairs, which you refer to as the "KV cache." But if retrieving from the KV cache is similar to fetching relevant information from embeddings, how do the two approaches differ in terms of efficiency or output?

I could be missing something, as I am still learning, and would appreciate your insight, @hhhuang, to help me better understand this concept.

import google.generativeai as genai

class KVStore:
    def __init__(self):
        self.store = {}

    def add(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key, None)
    
    def items(self):
        return self.store.items()

def load_schema(file_path):
    with open(file_path, 'r') as file:
        schema = file.read()
    return schema

# Instantiate KVStore
kv_store = KVStore()

# Add the schema to the KV store
schema_file_path = 'demographics_Schema.txt'
rawschema = load_schema(schema_file_path)
kv_store.add("bigquery_schema", rawschema)

# To view the entire dictionary directly
print("Complete KV Store dictionary:")
print(kv_store.store)  # This will print the entire dictionary

def create_system_prompt(schema):
    return f"""
    You are a SQL query generator. Given the following BigQuery schema, generate a SQL query based on the user's request.

    Schema:
    {schema}

    User's Request: {{user_request}}
    SQL Query:
    """

def generate_sql_query(user_request, kv_store):
    # Retrieve schema from KV store using a key
    schema = kv_store.get("bigquery_schema")
    print(schema)
    if schema is None:
        return "Schema not found in KV store!"
    
    # Prepare the prompt using the retrieved schema
    system_prompt = create_system_prompt(schema)
    prompt = system_prompt.format(user_request=user_request)

    # Call Gemini API to generate SQL query
    genai.configure(api_key="YOUR_API_KEY")  # key redacted; never hardcode real API keys in source
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text

if __name__ == "__main__":
    user_request = "Show the racial distribution in each state."
    
    # Generate SQL query using the KV store
    sql_query = generate_sql_query(user_request, kv_store)
    print("Generated SQL Query:", sql_query)
@brian030128
Collaborator

Hi @faizrazadec ,

During LLM inference, the prompt has to be converted into a KV cache by the model, which is a computationally intensive step. In fact, there is a benchmark metric, TTFT (Time to First Token), that measures how long this process takes.

Cache-Augmented Generation (CAG) precomputes the prompt into a KV cache ahead of time, so the model does not have to re-encode the long context on every request. That makes it more efficient than the approach you describe, while the output remains the same.
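
For anyone curious, here is a minimal sketch of that idea using the Hugging Face transformers API. This is not the repository's code: the model name is a placeholder and exact cache handling varies a little across transformers versions, but it shows the shape of the trick. The long document is prefilled once, and its key/value tensors are reused for every question, so each request only processes the new tokens:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from the Hub works the same way.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# 1) Pay the prefill cost once: run the long document through the model
#    and keep the resulting key/value tensors (the "KV cache").
doc_ids = tokenizer("<long schema or document text>", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    kv_cache = model(doc_ids, use_cache=True).past_key_values

# 2) For each user question, reuse the cache instead of re-encoding the document.
question_ids = tokenizer("\nShow the racial distribution in each state.\n", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    out = model(question_ids, past_key_values=kv_cache, use_cache=True)

# 3) Greedy-decode the answer one token at a time, feeding only the newest token;
#    the cache carries the document + question context.
past = out.past_key_values
next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
answer_ids = [next_tok.item()]
for _ in range(128):
    with torch.no_grad():
        out = model(next_tok, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    if next_tok.item() == tokenizer.eos_token_id:
        break
    answer_ids.append(next_tok.item())

print(tokenizer.decode(answer_ids, skip_special_tokens=True))

With the document prefilled once, the per-question TTFT only covers the question tokens rather than the whole document.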

@voodoohop

@brian030128 Isn't this just the same as using context caching (which many providers already offer) with a large context? What is new about it?

@JuanChavarriaU

JuanChavarriaU commented Mar 25, 2025

@voodoohop If you mean Google's context caching, I don't think they are the same. CAG aims to eliminate retrieval latency and minimize retrieval errors while keeping the relevant context available to the model. Context caching, on the other hand, is about making the generative model more efficient: you use it to reduce the cost of requests that repeat the same high-token-count content.
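
For comparison, the context-caching flow in the google-generativeai SDK looks roughly like this (a sketch, not tested here; the exact module layout, the required model version string, and the minimum cacheable token count depend on the SDK release and model):

import os
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Cache the large, repeated part of the prompt once (subject to a minimum token count).
schema_text = open("demographics_Schema.txt").read()
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="bigquery-schema",
    system_instruction="You are a SQL query generator for the following BigQuery schema.",
    contents=[schema_text],
    ttl=datetime.timedelta(minutes=30),
)

# Requests against the cached content only pay full price for the new tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Show the racial distribution in each state.")
print(response.text)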

Also, @faizrazadec, you should delete the API key from your code.
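
Something like this keeps the key out of the source file (GEMINI_API_KEY is just a name chosen here for illustration):

import os
import google.generativeai as genai

# Read the key from an environment variable instead of committing it to the repository.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])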
