[BUG] Single lines exceeding chunk_size are not split, creating oversized chunks #52
Project
vgrep
Description
The chunk_content function only splits content when current_chunk is non-empty AND the size limit is exceeded. For files with a single very long line (e.g., minified JS, long strings, base64 data), the entire line becomes one chunk regardless of size. This can create chunks 10x or larger than the configured chunk_size, causing memory issues and degraded embedding quality.
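To make the failure mode concrete, here is a minimal standalone sketch of the faulty loop (simplified from the real code: `chunk_size` is a plain parameter here rather than a struct field, and the push logic is condensed):

```rust
// Standalone sketch of the faulty chunking logic (simplified).
fn chunk_content(content: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current_chunk = String::new();
    let mut char_count = 0;
    for line in content.lines() {
        let line_len = line.len() + 1; // +1 for the trailing newline
        // BUG: on the first line, current_chunk is empty, so an
        // oversized line is never split off.
        if char_count + line_len > chunk_size && !current_chunk.is_empty() {
            chunks.push(std::mem::take(&mut current_chunk));
            char_count = 0;
        }
        current_chunk.push_str(line); // added regardless of size
        current_chunk.push('\n');
        char_count += line_len;
    }
    if !current_chunk.is_empty() {
        chunks.push(current_chunk);
    }
    chunks
}

fn main() {
    // A single 5000-character line sails straight through the size check.
    let chunks = chunk_content(&"x".repeat(5000), 512);
    assert_eq!(chunks.len(), 1);
    assert_eq!(chunks[0].len(), 5001); // 5000 chars + '\n'
    println!("one chunk of {} bytes despite chunk_size = 512", chunks[0].len());
}
```

Running this confirms the single-chunk outcome described below: one 5001-byte chunk against a 512-byte limit.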
Error Message
# No error - the chunk is silently created with excessive size
# May cause downstream issues:
thread 'main' panicked at 'capacity overflow'
# or
Error: Failed to generate embedding: context length exceeded

Debug Logs
# The problematic code at src/core/indexer.rs:347-352:
for (line_idx, line) in lines.iter().enumerate() {
    let line_len = line.len() + 1;
    // BUG: The second condition prevents splitting on the first line
    if char_count + line_len > self.chunk_size && !current_chunk.is_empty() {
        //                                        ^^^^^^^^^^^^^^^^^^^^^^^^
        //                                        FALSE on first iteration!
        chunks.push(...); // Never reached for single long lines
    }
    // Line is added regardless of size:
    current_chunk.push_str(line); // Could be 100KB!
    current_chunk.push('\n');
    char_count += line_len;
}

System Information
Affects all platforms.
- OS: Any
- vgrep version: current main branch
- File: src/core/indexer.rs:347
- Default chunk_size: 512 characters

Screenshots
No response
Steps to Reproduce
1. Create a test directory:
   mkdir -p /tmp/vgrep-longline-bug && cd /tmp/vgrep-longline-bug
2. Create a file with a single long line (5000 chars, ~10x chunk_size):
   python3 -c "print('x' * 5000)" > longline.rs
3. Fetch a minified JS file (realistic example):
   curl -s https://code.jquery.com/jquery-3.7.1.min.js > jquery.min.js
4. Index the directory:
   vgrep serve &
   sleep 2
   vgrep index .
5. Query the database for chunk sizes:
   sqlite3 ~/.vgrep/projects/*.db \
     "SELECT length(content), substr(path,-20) FROM chunks ORDER BY length(content) DESC LIMIT 5;"
   This shows chunks of 5000+ characters when the limit is 512.
Expected Behavior
Long lines should be split at chunk_size boundaries. A 5000-character line should produce approximately 10 chunks of ~500 characters each, ensuring:
- Consistent chunk sizes for embedding quality
- Bounded memory usage
- Better semantic search granularity
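A sketch of the expected splitting, assuming byte-budgeted chunks that never cut inside a multi-byte UTF-8 character (the helper name `split_long_line` is hypothetical, not from the codebase):

```rust
// Hypothetical helper illustrating the expected behavior: split an
// oversized line into pieces of at most chunk_size bytes.
fn split_long_line(line: &str, chunk_size: usize) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < line.len() {
        let mut end = (start + chunk_size).min(line.len());
        // Back up to the nearest char boundary so the slice is valid UTF-8.
        while !line.is_char_boundary(end) {
            end -= 1;
        }
        pieces.push(&line[start..end]);
        start = end;
    }
    pieces
}

fn main() {
    let line = "x".repeat(5000);
    let pieces = split_long_line(&line, 512);
    assert_eq!(pieces.len(), 10); // ceil(5000 / 512) = 10
    assert!(pieces.iter().all(|p| p.len() <= 512));
    println!("{} pieces, all within the 512-byte limit", pieces.len());
}
```

The sketch assumes chunk_size is at least 4 bytes, so a single UTF-8 character always fits and the boundary back-off cannot loop forever.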
Actual Behavior
- A single line of 5000 chars creates ONE chunk of 5001 bytes
- The chunk is ~10x larger than the configured chunk_size (512)
- Minified files create chunks of 50KB+ (entire file as one chunk)
- Embedding model may truncate or fail on oversized input
- Search results point to the entire file instead of the relevant section
Additional Context
Commonly affected files:
- Minified JavaScript/CSS (*.min.js, *.min.css)
- Base64-encoded data in source files
- Generated code with long lines
- SQL files with large INSERT statements
- JSON files without pretty-printing
Suggested fix:
// Split long lines at chunk_size boundaries
for (line_idx, line) in lines.iter().enumerate() {
    // Handle lines longer than chunk_size
    if line.len() > self.chunk_size {
        let mut start = 0;
        while start < line.len() {
            let mut end = (start + self.chunk_size).min(line.len());
            // Back up so the slice never cuts a multi-byte UTF-8 character.
            // (Byte-wise as_bytes().chunks() with from_utf8().unwrap_or("")
            // would silently drop data on non-ASCII input.)
            while !line.is_char_boundary(end) {
                end -= 1;
            }
            let sub_str = &line[start..end];
            // Process sub_str as a separate chunk
            start = end;
        }
        continue;
    }
    // ... existing logic for normal lines
}