Commit db14c9f

feat: new pinecone API (#285)
1 parent 07d452c commit db14c9f

3 files changed: +63 -11 lines changed

infrastructure/movie-search-app/README.md

+55-3
@@ -96,7 +96,59 @@ Here is the [link to the documentation for AlloyDB](https://cloud.google.com/all
 Create a database with the name movies and the user movies_owner. You can choose your own names for the database and the user. The application takes them from environment variables. Optionally, you can modify the application to use Secret Manager in Google Cloud as a more secure approach.

 ### Migrate data from Pinecone to AlloyDB
-- Move the data from Pinecone to AlloyDB
+Move the data from Pinecone to AlloyDB
+- A Pinecone index record consists of three main parts:
+ID - unique row ID
+VALUES - vector embedding value (text-embedding-004 from Google)
+METADATA - supplemental information about the data in key/value format
+
+- The future AlloyDB/PostgreSQL table, as it is defined in the app, will have the following structure:
+```
+                   Table "public.alloydb_table"
+       Column       |    Type     | Collation | Nullable | Default
+--------------------+-------------+-----------+----------+---------
+ langchain_id       | uuid        |           | not null |
+ content            | text        |           | not null |
+ embedding          | vector(768) |           | not null |
+ langchain_metadata | json        |           |          |
+Indexes:
+    "alloydb_table_pkey" PRIMARY KEY, btree (langchain_id)
+```
+And here are the JSON keys for the langchain_metadata column (from the movie dataset):
+```
+  jsonb_object_keys
+---------------------
+ tags
+ genre
+ image
+ title
+ actors
+ poster
+ writer
+ runtime
+ summary
+ director
+ imdblink
+ boxoffice
+ imdbscore
+ imdbvotes
+ languages
+ viewrating
+ netflixlink
+ releasedate
+ tmdbtrailer
+ trailersite
+ seriesormovie
+ awardsreceived
+ hiddengemscore
+ metacriticscore
+ productionhouse
+ awardsnominatedfor
+ netflixreleasedate
+ countryavailability
+ rottentomatoesscore
+```
+- All the metadata keys are taken from the Pinecone metadata, keeping the same structure.
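The mapping described above (Pinecone ID, VALUES, and METADATA into the four table columns) could be sketched roughly as follows. This is a minimal illustration, not the migration code from this commit: the function name `pinecone_record_to_row`, the choice of the `summary` metadata key as the source for the `content` column, and the UUID fallback for non-UUID Pinecone IDs are all assumptions made for the example.

```python
import json
import uuid

EMBEDDING_DIM = 768  # matches the vector(768) column in the table definition


def pinecone_record_to_row(record: dict, content_key: str = "summary") -> dict:
    """Map one Pinecone record (id, values, metadata) to alloydb_table columns.

    Hypothetical helper: `content_key` is an assumption, since the README does
    not say which field feeds the `content` column.
    """
    values = [float(v) for v in record["values"]]
    if len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(values)}")

    metadata = dict(record.get("metadata") or {})

    try:
        # Reuse the Pinecone ID when it is already a valid UUID.
        row_id = uuid.UUID(str(record["id"]))
    except ValueError:
        # Assumption: otherwise derive a stable UUID from the ID string.
        row_id = uuid.uuid5(uuid.NAMESPACE_URL, str(record["id"]))

    return {
        "langchain_id": str(row_id),
        "content": str(metadata.get(content_key, "")),
        # pgvector accepts a bracketed, comma-separated text literal.
        "embedding": "[" + ",".join(repr(v) for v in values) + "]",
        # Metadata keys are kept unchanged, as the README states.
        "langchain_metadata": json.dumps(metadata),
    }
```

The resulting dictionary can be passed as bind parameters to an `INSERT` statement, casting `embedding` to `vector` and `langchain_metadata` to `json` on the database side.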

 ### Enable virtual environment for Python
 You can use either your laptop or a virtual machine for deployment. Using a VM deployed in the same Google Cloud project simplifies deployment and network configuration. On Debian Linux, you can enable it in the shell using the following command:
@@ -126,9 +178,9 @@ pip install -r requirements.txt
 export PINECONE_INDEX_NAME=netflix-index-01
 export PORT=8080
 export DB_USER=movies_owner
-export DB_PASS=DatabasePassword
+export DB_PASS={DATABASEPASSWORD}
 export DB_NAME=movies
-export INSTANCE_HOST=ALLOYDB_IP
+export INSTANCE_HOST={ALLOYDB_IP}
 export DB_PORT=5432
 ```
 - Here is the command used to start the application

infrastructure/movie-search-app/movie_search.py

+5-5
@@ -209,13 +209,13 @@ def get_movies(db: sqlalchemy.engine.base.Engine, embeddings: str) -> dict:
     stmt = sqlalchemy.text(
         """
         SELECT
-            mj.metadata->'title' as title,
-            mj.metadata->'summary' as summary,
-            mj.metadata->'director' as director,
-            mj.metadata->'actors' as actors,
+            mj.langchain_metadata->'title' as title,
+            mj.langchain_metadata->'summary' as summary,
+            mj.langchain_metadata->'director' as director,
+            mj.langchain_metadata->'actors' as actors,
             (mj.embedding <=> (:embeddings)::vector) as distance
         FROM
-            movies_json mj
+            alloydb_table mj
         ORDER BY
             distance ASC
         LIMIT 5;
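For readers unfamiliar with the `<=>` operator in the query above: in pgvector it computes cosine distance, that is, 1 minus the cosine similarity of the two vectors. The ordering and `LIMIT 5` can be illustrated in plain Python as below. This is only a sketch of the ranking logic, not how the database evaluates the query; `cosine_distance` and `top_k` are hypothetical names, and the two-dimensional vectors in the usage note stand in for the 768-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, the quantity pgvector's <=> computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=5):
    """Rank (title, embedding) rows by distance to the query, closest first."""
    scored = [(cosine_distance(query, embedding), title) for title, embedding in rows]
    return sorted(scored, key=lambda pair: pair[0])[:k]
```

For example, `top_k([1.0, 0.0], [("same", [1.0, 0.0]), ("orthogonal", [0.0, 1.0])], k=1)` ranks `"same"` first, because identical directions have distance 0.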

infrastructure/movie-search-app/pinecone_model.py

+3-3
@@ -14,7 +14,7 @@

 import google.generativeai as genai
 from typing import Iterable
-from pinecone import Pinecone # as Pinecone
+from pinecone.grpc import PineconeGRPC as Pinecone
 import logging
 import os
 from data_model import ChatMessage, State
@@ -58,10 +58,10 @@ def get_movies(embedding: list[float]) -> dict:
         logging.warning("PINECONE_INDEX_NAME not set, using default: %s", PINECONE_INDEX_NAME)
     pc = Pinecone(api_key=state.pinecone_api_key)
     index = pc.Index(name=PINECONE_INDEX_NAME)
-    query_resp = index.query(vector=embedding, namespace="sandpaper", top_k=5)
+    query_resp = index.query(vector=embedding, namespace="sandpaper", top_k=5, include_metadata=True)
     movies_list = []
     for match in query_resp.matches:
-        meta = index.fetch(ids=[match['id']], namespace="sandpaper")["vectors"][match['id']]["metadata"]
+        meta = match["metadata"]
         movies_list.append({"title":meta["title"],"summary":meta["summary"],"director":meta["director"],"genre": meta["genre"],"actors": meta["actors"]})
     return movies_list
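The effect of this hunk is that the metadata now arrives with each match (via `include_metadata=True`) instead of requiring one extra `index.fetch` call per match. The extraction step can be sketched in isolation as below. Plain dictionaries stand in for the Pinecone response objects so the snippet runs without an API key; `MOVIE_FIELDS` and `matches_to_movies` are hypothetical names introduced for illustration and are not part of the commit.

```python
# Fields the application copies from each match's metadata (taken from the diff).
MOVIE_FIELDS = ("title", "summary", "director", "genre", "actors")


def matches_to_movies(matches):
    """Build the movie list from matches that already carry their metadata."""
    movies = []
    for match in matches:
        meta = match["metadata"]  # present because the query asked for metadata
        movies.append({field: meta[field] for field in MOVIE_FIELDS})
    return movies
```

With five results per query, this removes the five follow-up fetch round trips that the previous version made.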
