
[BUG] Single lines exceeding chunk_size are not split, creating oversized chunks #52

@Crimsonyx412

Project

vgrep

Description

The chunk_content function only splits content when current_chunk is non-empty AND the size limit is exceeded. For files with a single very long line (e.g., minified JS, long strings, base64 data), the entire line becomes one chunk regardless of size. This can create chunks 10x or larger than the configured chunk_size, causing memory issues and degraded embedding quality.

Error Message

# No error - the chunk is silently created with excessive size
# May cause downstream issues:
thread 'main' panicked at 'capacity overflow'
# or
Error: Failed to generate embedding: context length exceeded

Debug Logs

// The problematic code at src/core/indexer.rs:347-352:

for (line_idx, line) in lines.iter().enumerate() {
    let line_len = line.len() + 1;
    
    // BUG: The second condition prevents splitting on first line
    if char_count + line_len > self.chunk_size && !current_chunk.is_empty() {
    //                                            ^^^^^^^^^^^^^^^^^^^^^^^^
    //                                            FALSE on first iteration!
        chunks.push(...);  // Never reached for single long lines
    }
    
    // Line is added regardless of size:
    current_chunk.push_str(line);  // Could be 100KB!
    current_chunk.push('\n');
    char_count += line_len;
}
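For reference, the loop above can be reduced to a self-contained function that reproduces the problem. The function name and signature are simplified for illustration (`self.chunk_size` becomes a parameter); this is not vgrep's actual API:

```rust
// Simplified standalone reproduction of the buggy loop.
fn chunk_content_buggy(content: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current_chunk = String::new();
    let mut char_count = 0;

    for line in content.lines() {
        let line_len = line.len() + 1;
        // The `!current_chunk.is_empty()` guard is false on the first
        // iteration, so a single long line is never split.
        if char_count + line_len > chunk_size && !current_chunk.is_empty() {
            chunks.push(std::mem::take(&mut current_chunk));
            char_count = 0;
        }
        // The line is appended regardless of its own length.
        current_chunk.push_str(line);
        current_chunk.push('\n');
        char_count += line_len;
    }
    if !current_chunk.is_empty() {
        chunks.push(current_chunk);
    }
    chunks
}

fn main() {
    // One 5000-char line with chunk_size = 512: exactly one oversized chunk.
    let long_line = "x".repeat(5000);
    let chunks = chunk_content_buggy(&long_line, 512);
    assert_eq!(chunks.len(), 1);
    assert_eq!(chunks[0].len(), 5001); // 5000 chars + trailing '\n'
    println!("1 chunk of {} bytes (limit was 512)", chunks[0].len());
}
```

Running this confirms the size guard is never triggered when the very first line already exceeds the limit.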

System Information

Affects all platforms.
- OS: Any
- vgrep version: current main branch
- File: src/core/indexer.rs:347
- Default chunk_size: 512 characters

Screenshots

No response

Steps to Reproduce

1. Create test directory

mkdir -p /tmp/vgrep-longline-bug && cd /tmp/vgrep-longline-bug

2. Create file with single long line (5000 chars, ~10x chunk_size)

python3 -c "print('x' * 5000)" > longline.rs

3. Create minified JS (realistic example)

curl -s https://code.jquery.com/jquery-3.7.1.min.js > jquery.min.js

4. Index and check chunk sizes

vgrep serve &
sleep 2
vgrep index .

5. Query database for chunk sizes

sqlite3 ~/.vgrep/projects/*.db \
  "SELECT length(content), substr(path,-20) FROM chunks ORDER BY length(content) DESC LIMIT 5;"

The query shows chunks of 5000+ characters even though the configured limit is 512.

Expected Behavior

Long lines should be split at chunk_size boundaries. A 5000-character line should produce approximately 10 chunks of ~500 characters each, ensuring:

  1. Consistent chunk sizes for embedding quality
  2. Bounded memory usage
  3. Better semantic search granularity

Actual Behavior

  • Single line of 5000 chars creates ONE chunk of 5001 bytes
  • Chunk is ~10x larger than configured chunk_size (512)
  • Minified files create chunks of 50KB+ (entire file as one chunk)
  • Embedding model may truncate or fail on oversized input
  • Search results point to entire file instead of relevant section

Additional Context

Commonly affected files:

  • Minified JavaScript/CSS (*.min.js, *.min.css)
  • Base64-encoded data in source files
  • Generated code with long lines
  • SQL files with large INSERT statements
  • JSON files without pretty-printing

Suggested fix:

// Split long lines at chunk_size boundaries, respecting UTF-8 char boundaries
for (line_idx, line) in lines.iter().enumerate() {
    // Handle lines longer than chunk_size
    if line.len() > self.chunk_size {
        let mut rest = *line;
        while rest.len() > self.chunk_size {
            // Splitting raw bytes (as_bytes().chunks()) could cut a multi-byte
            // character in half; back off to the nearest char boundary instead
            let mut split = self.chunk_size;
            while !rest.is_char_boundary(split) {
                split -= 1;
            }
            let (sub_str, tail) = rest.split_at(split);
            // Process sub_str as a separate chunk
            rest = tail;
        }
        // Process the final remainder (rest) as a separate chunk
        continue;
    }
    // ... existing logic for normal lines
}
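The splitting logic can be sketched as a standalone helper so the behavior is testable in isolation. `split_long_line` is a hypothetical name, not vgrep's actual API, and it assumes chunk_size is at least 4 bytes (one full UTF-8 character) so the boundary back-off always makes progress:

```rust
// Hypothetical standalone version of the suggested fix (not vgrep's API).
// Assumes chunk_size >= 4 so a split point always makes progress.
fn split_long_line(line: &str, chunk_size: usize) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut rest = line;
    while rest.len() > chunk_size {
        // Back off from chunk_size until we land on a UTF-8 char boundary
        let mut split = chunk_size;
        while !rest.is_char_boundary(split) {
            split -= 1;
        }
        let (head, tail) = rest.split_at(split);
        pieces.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        pieces.push(rest);
    }
    pieces
}

fn main() {
    // 5000 chars with chunk_size = 512 -> ceil(5000 / 512) = 10 pieces
    let line = "x".repeat(5000);
    let pieces = split_long_line(&line, 512);
    assert_eq!(pieces.len(), 10);
    assert!(pieces.iter().all(|p| p.len() <= 512));

    // Multi-byte content is never split mid-character: split_at on &str
    // panics on a non-boundary index, so every piece stays valid UTF-8
    let accented = "é".repeat(3000); // 6000 bytes, 2 bytes per char
    for p in split_long_line(&accented, 512) {
        assert!(p.len() <= 512);
    }
    println!("ok");
}
```

This matches the expected behavior above: a 5000-character line yields 10 chunks, each bounded by chunk_size.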

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), valid (Valid issue)
