I am trying to understand the need for Cache-Augmented Generation (CAG) if it only uses the LLM's context window. From my perspective, the key-value pairs are generated from the data and placed in the context window, which could be achieved through the system prompt as well. In fact, we could load the entire document into the system prompt and it would function similarly to CAG. Could you clarify the difference between these two approaches?
For my use case, I had been working with embeddings before I considered the system prompt. However, after reviewing your paper and examining your code, I switched to using key-value pairs, which I now feed into the model. This seems to yield results similar to using the system prompt directly.
Here’s the approach I implemented in my code. The main difference, as I see it, is that you use a greedy encoding method and break the document into tokens before creating key-value pairs, which you refer to as the "KV cache." But if the retrieval process from the KV cache is similar to fetching relevant information from embeddings, how do the two approaches differ in terms of efficiency or output?
I could be missing something, as I am still learning, and would appreciate your insight @hhhuang to help me better understand this concept.
import os

import google.generativeai as genai


# Simple in-memory key-value store used to hold prompt components.
class KVStore:
    def __init__(self):
        self.store = {}

    def add(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key, None)

    def items(self):
        return self.store.items()


def load_schema(file_path):
    with open(file_path, 'r') as file:
        schema = file.read()
    return schema


# Instantiate KVStore
kv_store = KVStore()

# Add the schema to the KV store
schema_file_path = 'demographics_Schema.txt'
raw_schema = load_schema(schema_file_path)
kv_store.add("bigquery_schema", raw_schema)

# To view the entire dictionary directly
print("Complete KV Store dictionary:")
print(kv_store.store)  # prints the entire dictionary


def create_system_prompt(schema):
    return f"""
You are a SQL query generator. Given the following BigQuery schema, generate a SQL query based on the user's request.
Schema:
{schema}
User's Request: {{user_request}}
SQL Query:
"""


def generate_sql_query(user_request, kv_store):
    # Retrieve the schema from the KV store by key
    schema = kv_store.get("bigquery_schema")
    print(schema)  # debug: show the retrieved schema
    if schema is None:
        return "Schema not found in KV store!"

    # Build the full prompt from the retrieved schema and the user's request
    system_prompt = create_system_prompt(schema)
    prompt = system_prompt.format(user_request=user_request)

    # Call the Gemini API to generate the SQL query
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # API key read from the environment, not hard-coded
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)
    return response.text


if __name__ == "__main__":
    user_request = "Show the racial distribution in each state."
    # Generate the SQL query using the KV store
    sql_query = generate_sql_query(user_request, kv_store)
    print("Generated SQL Query:", sql_query)
During LLM inference, the prompt is first converted into a KV cache by the model, which is a computationally intensive step. There is a benchmark metric, TTFT (Time To First Token), that measures how long this prefill takes.
Cache-Augmented Generation (CAG) precomputes this KV cache for the long, static part of the prompt once, so it does not have to be recomputed for every query. That makes it more efficient than the approach you mentioned, which re-sends the document in the prompt each time, while the generated output remains the same.
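To make that concrete, here is a minimal sketch of the idea using Hugging Face transformers rather than the exact code from this repository. The model name, schema path, prompt wording, and decoding budget are placeholders, and recent transformers versions may return a cache object that should be copied or cropped before being reused across queries. The document is prefilled once, its past_key_values are kept, and each question is then decoded greedily against that cached prefix, so the schema is never re-encoded.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: any decoder-only LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# 1) Prefill once: encode the long, static context (the schema/document) and
#    keep the resulting key/value tensors. This is the expensive step that
#    TTFT measures, and CAG pays it only a single time.
context = "You are a SQL query generator.\nSchema:\n" + open("demographics_Schema.txt").read()
ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    prefill = model(ctx_ids, use_cache=True)
kv_cache = prefill.past_key_values  # the precomputed "KV cache"

# 2) Per query: feed only the new question tokens together with the cached
#    prefix, then decode greedily token by token; the document is never
#    re-encoded.
def answer(question, max_new_tokens=128):
    past = kv_cache  # note: copy/crop the cache if you call this repeatedly
    ids = tokenizer("\nUser's Request: " + question + "\nSQL Query:",
                    return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
            if next_tok.item() == tokenizer.eos_token_id:
                break
            generated.append(next_tok.item())
            ids = next_tok  # only the newest token is fed on the next step
    return tokenizer.decode(generated, skip_special_tokens=True)

print(answer("Show the racial distribution in each state."))

By contrast, the script above re-sends the full schema to the Gemini API on every call, so the prefill (and the TTFT cost) is paid again for each query.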