
Commit 7f8d7b7

Merge pull request #1 from codefarm0/vector-embeding-test-code
Vector embeding final code
2 parents 329e574 + e0a0ee1 commit 7f8d7b7

13 files changed (+4049 −4 lines)

docs/embedding-process.md

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
# Vector Store Embedding Process Documentation

## Overview
This document explains the embedding process used in the VectorStoreConfig class, which converts text documents into vector embeddings for semantic search capabilities.

## Process Flow

```plantuml
@startuml
skinparam backgroundColor white
skinparam handwritten false

actor "Application" as app
participant "VectorStoreConfig" as config
participant "TikaDocumentReader" as tika
participant "TokenTextSplitter" as splitter
participant "EmbeddingModel" as embedder
participant "SimpleVectorStore" as store
database "Vector Store File" as file

app -> config: Initialize VectorStore
activate config

alt Vector Store File Exists
    config -> file: Load existing store
    file --> config: Return loaded store
else Vector Store File Doesn't Exist
    config -> tika: Read document
    activate tika
    tika --> config: Return documents
    deactivate tika

    config -> splitter: Split documents
    activate splitter
    splitter --> config: Return split documents
    deactivate splitter

    loop For each split document
        config -> embedder: Generate embedding
        activate embedder
        embedder --> config: Return vector
        deactivate embedder

        config -> store: Add embedding
        activate store
        store --> config: Confirmation
        deactivate store

        config -> config: Wait 1 second
    end

    config -> file: Save vector store
end

config --> app: Return vector store
deactivate config

@enduml
```

## Detailed Process Explanation

### 1. Initialization
```java
SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();
```
- Creates a new SimpleVectorStore instance
- Configures it with the provided embedding model (HuggingFace in this case)

### 2. Vector Store File Check
```java
File vectorStoreFile = new File(vectorStoreProperties.getVectorStorePath());
if (vectorStoreFile.exists()) {
    store.load(vectorStoreFile);
}
```
- Checks whether a previously created vector store exists on disk
- If it exists, loads the pre-computed embeddings
- This prevents re-computing embeddings for the same documents

### 3. Document Processing (if no existing store)
```java
vectorStoreProperties.getDocumentsToLoad().forEach(document -> {
    TikaDocumentReader documentReader = new TikaDocumentReader(document);
    List<Document> documents = documentReader.get();
```
- Iterates through each document specified in properties
- Uses Apache Tika to read and parse the document
- Tika handles various document formats (PDF, DOC, TXT, etc.)

### 4. Text Splitting
```java
TextSplitter textSplitter = new TokenTextSplitter();
List<Document> splitDocs = textSplitter.apply(documents);
```
- Splits documents into smaller chunks using TokenTextSplitter (tuning options are sketched below)
- This is necessary because:
  - Embedding models have token limits
  - Smaller chunks provide more precise semantic search
  - Smaller chunks help manage memory usage

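The default constructor uses Spring AI's built-in chunking defaults. A minimal tuning sketch, assuming the five-argument `TokenTextSplitter` constructor (parameter availability may differ across Spring AI versions; the values shown are illustrative):

```java
// Hypothetical tuning example; values mirror Spring AI's documented defaults.
TextSplitter tunedSplitter = new TokenTextSplitter(
        800,    // defaultChunkSize: target tokens per chunk
        350,    // minChunkSizeChars: minimum characters per chunk
        5,      // minChunkLengthToEmbed: drop chunks shorter than this
        10000,  // maxNumChunks: safety cap on chunks per document
        true    // keepSeparator: keep line separators inside chunks
);
List<Document> tunedChunks = tunedSplitter.apply(documents);
```
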
### 5. Embedding Generation
```java
store.add(splitDocs);
```
- For each split document:
  1. The text is tokenized
  2. Tokens are converted to an embedding by the configured model
  3. The embedding is stored in the vector store

#### Detailed Embedding Process

The steps below illustrate the flow using OpenAI's embeddings API as a concrete example; this project is configured with a HuggingFace model (see Technical Details), but the mechanics are the same.

1. **Text Tokenization**
```java
// Example of how text is tokenized internally (simplified, word-level view)
String text = "The quick brown fox jumps over the lazy dog";
// Tokenized into: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

2. **OpenAI API Call**
```http
// Internal API call to OpenAI (simplified)
POST https://api.openai.com/v1/embeddings
{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox jumps over the lazy dog",
    "encoding_format": "float"
}

// Response
{
    "data": [{
        "embedding": [0.0023064255, -0.009327292, ...],  // 1536-dimensional vector
        "index": 0
    }],
    "model": "text-embedding-ada-002",
    "usage": {
        "prompt_tokens": 9,
        "total_tokens": 9
    }
}
```

3. **Vector Storage**
```java
// Simplified view of how embeddings are stored, keyed by chunk id
Map<String, float[]> embeddings = new HashMap<>();
embeddings.put("doc1_chunk1", new float[]{0.0023064255f, -0.009327292f /* , ... */});
```

#### Cost and Performance Considerations

1. **API Costs**
- OpenAI charges per token for embeddings
- Example cost calculation (a reusable helper is sketched below):
```
Input text: "The quick brown fox jumps over the lazy dog"
Tokens: 9
Cost per 1K tokens: $0.0001
Total cost: (9 / 1000) * $0.0001 = $0.0000009
```

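For bulk loads the same arithmetic can be wrapped in a small helper. A minimal sketch (the $0.0001 per 1K tokens rate comes from the example above; token counts are assumed inputs, e.g. from a tokenizer):

```java
// Estimate embedding cost in USD for a batch of chunks, given token counts.
static double estimateEmbeddingCostUsd(int[] tokenCounts, double costPer1kTokens) {
    long totalTokens = 0;
    for (int tokens : tokenCounts) {
        totalTokens += tokens;
    }
    return (totalTokens / 1000.0) * costPer1kTokens;
}

// Example: the 9-token sentence above at $0.0001 per 1K tokens
double cost = estimateEmbeddingCostUsd(new int[]{9}, 0.0001); // = $0.0000009
```
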
2. **Rate Limiting**
```java
// Current implementation
Thread.sleep(1000); // 1-second delay between calls

// Alternative implementation with exponential backoff
private float[] processWithBackoff(String text) throws InterruptedException {
    int maxRetries = 3;
    int baseDelayMs = 1000; // 1 second

    for (int i = 0; i < maxRetries; i++) {
        try {
            return generateEmbedding(text);
        } catch (RateLimitException e) {
            // Wait 1s, 2s, 4s, ... before retrying
            long delayMs = baseDelayMs * (long) Math.pow(2, i);
            Thread.sleep(delayMs);
        }
    }
    throw new IllegalStateException("Embedding failed after " + maxRetries + " retries");
}
```

3. **Batch Processing**
```java
// Example of batch processing multiple texts
List<String> texts = Arrays.asList(
        "First document chunk",
        "Second document chunk",
        "Third document chunk"
);

// Size batches against the API's per-request token budget
// (the exact limit depends on the model and API version)
int maxTokensPerRequest = 2048;
int averageTokensPerText = 100;
int optimalBatchSize = maxTokensPerRequest / averageTokensPerText; // = 20 texts per batch
```

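The computed batch size can then be used to partition chunks into request-sized groups. A minimal sketch (`generateEmbeddings` is a hypothetical helper that sends one request per batch, not a specific library call):

```java
// Partition texts into batches of optimalBatchSize and embed each batch.
List<float[]> allVectors = new ArrayList<>();
for (int start = 0; start < texts.size(); start += optimalBatchSize) {
    int end = Math.min(start + optimalBatchSize, texts.size());
    List<String> batch = texts.subList(start, end);
    // Hypothetical helper: one API request carrying the whole batch.
    allVectors.addAll(generateEmbeddings(batch));
}
```
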
#### Example: Complete Embedding Flow

```java
// Illustrative end-to-end flow. The OpenAI client and request/response
// types here are simplified and do not match a specific library API.

// 1. Document splitting
TextSplitter splitter = new TokenTextSplitter();
List<Document> chunks = splitter.apply(documents);

// 2. Embedding generation
for (Document chunk : chunks) {
    // 2.1 Prepare text
    String text = chunk.getContent();

    // 2.2 Generate embedding
    EmbeddingResponse response = openAiClient.createEmbedding(
            EmbeddingRequest.builder()
                    .model("text-embedding-ada-002")
                    .input(text)
                    .build()
    );

    // 2.3 Extract vector
    float[] embedding = response.getData().get(0).getEmbedding();

    // 2.4 Store in vector store
    store.add(new Document(chunk.getContent(), embedding));

    // 2.5 Rate limiting (assumes the enclosing method handles InterruptedException)
    Thread.sleep(1000);
}
```

#### Vector Similarity

The generated embeddings enable semantic search through vector similarity:

```java
// Example of similarity calculation (generateEmbedding and getEmbedding
// are illustrative helpers, not specific library methods)
float[] queryEmbedding = generateEmbedding("What is machine learning?");
float[] documentEmbedding = store.getEmbedding("doc1_chunk1");

// Cosine similarity calculation
float similarity = cosineSimilarity(queryEmbedding, documentEmbedding);
// Returns a value between -1 and 1, where 1 means most similar
```

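The `cosineSimilarity` helper is not defined in the snippet above; a minimal self-contained sketch of the standard formula (dot product divided by the product of the vector magnitudes):

```java
// Cosine similarity: dot(a, b) / (||a|| * ||b||).
// Returns a value in [-1, 1]; higher means more semantically similar.
static float cosineSimilarity(float[] a, float[] b) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("Vectors must have the same dimension");
    }
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB));
}
```
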
The embedding process:
- Converts text into numerical vectors (1536 dimensions for OpenAI's ada-002 model; 384 for the HuggingFace model used in this project)
- Preserves semantic meaning in the vector space
- Enables similarity calculations between texts
- Allows for efficient semantic search and retrieval

### 6. Rate Limiting
```java
Thread.sleep(1000);
```
- Implements a 1-second delay between documents
- Prevents overwhelming the embedding API
- Helps avoid rate-limit errors (see the backoff alternative above)

### 7. Persistence
```java
store.save(vectorStoreFile);
```
- Saves the computed embeddings to disk
- Enables reuse of embeddings in future runs
- Improves performance by avoiding recomputation (the complete flow is sketched below)

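Putting steps 1 through 7 together, a minimal sketch of the configuration bean. It only combines the snippets shown above; names such as `VectorStoreProperties` follow those snippets, and exact signatures may vary by Spring AI version:

```java
import java.io.File;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SimpleVectorStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VectorStoreConfig {

    @Bean
    public SimpleVectorStore vectorStore(EmbeddingModel embeddingModel,
                                         VectorStoreProperties vectorStoreProperties) {
        SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();
        File vectorStoreFile = new File(vectorStoreProperties.getVectorStorePath());

        if (vectorStoreFile.exists()) {
            // Step 2: reuse previously computed embeddings
            store.load(vectorStoreFile);
        } else {
            vectorStoreProperties.getDocumentsToLoad().forEach(document -> {
                // Step 3: read and parse the document with Apache Tika
                List<Document> documents = new TikaDocumentReader(document).get();
                // Step 4: split into embedding-sized chunks
                List<Document> splitDocs = new TokenTextSplitter().apply(documents);
                // Step 5: generate and store embeddings
                store.add(splitDocs);
                // Step 6: simple rate limiting between documents
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Step 7: persist for future runs
            store.save(vectorStoreFile);
        }
        return store;
    }
}
```
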
## Technical Details

### Embedding Model (HuggingFace)
- Uses the "sentence-transformers/all-MiniLM-L6-v2" model (one possible local wiring is sketched below)
- Generates 384-dimensional vectors
- Optimized for semantic similarity tasks
- Supports multiple languages

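One way to run this model locally in a Spring AI application is the ONNX-based `TransformersEmbeddingModel` from the `spring-ai-transformers` module, which defaults to all-MiniLM-L6-v2. This is an assumption about the wiring, not necessarily how this project configures its model, and the `embed` signature varies across Spring AI versions:

```java
import org.springframework.ai.transformers.TransformersEmbeddingModel;

public class EmbeddingDemo {
    public static void main(String[] args) throws Exception {
        // Assumed wiring: defaults to sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
        TransformersEmbeddingModel embeddingModel = new TransformersEmbeddingModel();
        embeddingModel.afterPropertiesSet(); // loads the ONNX model and tokenizer

        float[] vector = embeddingModel.embed("The quick brown fox jumps over the lazy dog");
        System.out.println("Dimensions: " + vector.length); // 384
    }
}
```
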
### Vector Store
- Stores embeddings in memory during processing
- Persists to disk for long-term storage
- Enables efficient similarity search
- Supports incremental updates

### Performance Considerations
- Embedding generation is computationally expensive
- Rate limiting is implemented to prevent API overload
- Caching through file persistence improves performance
- Text splitting optimizes memory usage

## Usage Example
```properties
# Configuration in application.properties
sfg.aiapp.vectorStorePath=./tmp/vectorstore.json
sfg.aiapp.documentsToLoad=classpath:./movies500.csv
```

```java
// The vector store can then be used for semantic search.
// (Top-k request syntax varies by Spring AI version; this uses the
// SearchRequest style.)
List<Document> results = vectorStore.similaritySearch(
        SearchRequest.query("query text").withTopK(5));
```
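
A brief follow-up on consuming the results (`getContent()` matches the accessor used in the snippets above; newer Spring AI versions may name it differently):

```java
// Print each retrieved chunk and its metadata
for (Document doc : results) {
    System.out.println(doc.getContent());
    System.out.println(doc.getMetadata());
}
```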
