
[BUG] Single lines exceeding chunk_size are not split, creating oversized chunks #52

@Crimsonyx412

Project

vgrep

Description

The chunk_content function only splits content when current_chunk is non-empty AND the size limit is exceeded. For files with a single very long line (e.g., minified JS, long strings, base64 data), the entire line becomes one chunk regardless of size. This can create chunks 10x or larger than the configured chunk_size, causing memory issues and degraded embedding quality.

Error Message

# No error - the chunk is silently created with excessive size
# May cause downstream issues:
thread 'main' panicked at 'capacity overflow'
# or
Error: Failed to generate embedding: context length exceeded

Debug Logs

// The problematic code at src/core/indexer.rs:347-352:

for (line_idx, line) in lines.iter().enumerate() {
    let line_len = line.len() + 1;
    
    // BUG: The second condition prevents splitting on first line
    if char_count + line_len > self.chunk_size && !current_chunk.is_empty() {
    //                                            ^^^^^^^^^^^^^^^^^^^^^^^^
    //                                            FALSE on first iteration!
        chunks.push(...);  // Never reached for single long lines
    }
    
    // Line is added regardless of size:
    current_chunk.push_str(line);  // Could be 100KB!
    current_chunk.push('\n');
    char_count += line_len;
}
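For reference, the loop above can be reduced to a self-contained function that reproduces the problem. The function name and signature are simplified for illustration (`self.chunk_size` becomes a parameter); this is not vgrep's actual API:

```rust
// Simplified standalone reproduction of the buggy loop.
fn chunk_content_buggy(content: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current_chunk = String::new();
    let mut char_count = 0;

    for line in content.lines() {
        let line_len = line.len() + 1;
        // The `!current_chunk.is_empty()` guard is false on the first
        // iteration, so a single long line is never split.
        if char_count + line_len > chunk_size && !current_chunk.is_empty() {
            chunks.push(std::mem::take(&mut current_chunk));
            char_count = 0;
        }
        // The line is appended regardless of its own length.
        current_chunk.push_str(line);
        current_chunk.push('\n');
        char_count += line_len;
    }
    if !current_chunk.is_empty() {
        chunks.push(current_chunk);
    }
    chunks
}

fn main() {
    // One 5000-char line with chunk_size = 512: exactly one oversized chunk.
    let long_line = "x".repeat(5000);
    let chunks = chunk_content_buggy(&long_line, 512);
    assert_eq!(chunks.len(), 1);
    assert_eq!(chunks[0].len(), 5001); // 5000 chars + trailing '\n'
    println!("1 chunk of {} bytes (limit was 512)", chunks[0].len());
}
```

Running this confirms the size guard is never triggered when the very first line already exceeds the limit.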

System Information

Affects all platforms.
- OS: Any
- vgrep version: current main branch
- File: src/core/indexer.rs:347
- Default chunk_size: 512 characters

Screenshots

No response

Steps to Reproduce

1. Create test directory

mkdir -p /tmp/vgrep-longline-bug && cd /tmp/vgrep-longline-bug

2. Create file with single long line (5000 chars, ~10x chunk_size)

python3 -c "print('x' * 5000)" > longline.rs

3. Create minified JS (realistic example)

curl -s https://code.jquery.com/jquery-3.7.1.min.js > jquery.min.js

4. Index and check chunk sizes

vgrep serve &
sleep 2
vgrep index .

5. Query database for chunk sizes

sqlite3 ~/.vgrep/projects/*.db \
  "SELECT length(content), substr(path,-20) FROM chunks ORDER BY length(content) DESC LIMIT 5;"

The query shows chunks of 5000+ characters even though the configured limit is 512.

Expected Behavior

Long lines should be split at chunk_size boundaries. A 5000-character line should produce approximately 10 chunks of ~500 characters each, ensuring:

  1. Consistent chunk sizes for embedding quality
  2. Bounded memory usage
  3. Better semantic search granularity

Actual Behavior

  • Single line of 5000 chars creates ONE chunk of 5001 bytes
  • Chunk is ~10x larger than configured chunk_size (512)
  • Minified files create chunks of 50KB+ (entire file as one chunk)
  • Embedding model may truncate or fail on oversized input
  • Search results point to entire file instead of relevant section

Additional Context

Commonly affected files:

  • Minified JavaScript/CSS (*.min.js, *.min.css)
  • Base64-encoded data in source files
  • Generated code with long lines
  • SQL files with large INSERT statements
  • JSON files without pretty-printing

Suggested fix:

// Split long lines at chunk_size boundaries, respecting UTF-8 char boundaries
for (line_idx, line) in lines.iter().enumerate() {
    // Handle lines longer than chunk_size
    if line.len() > self.chunk_size {
        let mut rest = *line;
        while rest.len() > self.chunk_size {
            // Splitting raw bytes (as_bytes().chunks()) could cut a multi-byte
            // character in half; back off to the nearest char boundary instead
            let mut split = self.chunk_size;
            while !rest.is_char_boundary(split) {
                split -= 1;
            }
            let (sub_str, tail) = rest.split_at(split);
            // Process sub_str as a separate chunk
            rest = tail;
        }
        // Process the final remainder (rest) as a separate chunk
        continue;
    }
    // ... existing logic for normal lines
}
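The splitting logic can be sketched as a standalone helper so the behavior is testable in isolation. `split_long_line` is a hypothetical name, not vgrep's actual API, and it assumes chunk_size is at least 4 bytes (one full UTF-8 character) so the boundary back-off always makes progress:

```rust
// Hypothetical standalone version of the suggested fix (not vgrep's API).
// Assumes chunk_size >= 4 so a split point always makes progress.
fn split_long_line(line: &str, chunk_size: usize) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut rest = line;
    while rest.len() > chunk_size {
        // Back off from chunk_size until we land on a UTF-8 char boundary
        let mut split = chunk_size;
        while !rest.is_char_boundary(split) {
            split -= 1;
        }
        let (head, tail) = rest.split_at(split);
        pieces.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        pieces.push(rest);
    }
    pieces
}

fn main() {
    // 5000 chars with chunk_size = 512 -> ceil(5000 / 512) = 10 pieces
    let line = "x".repeat(5000);
    let pieces = split_long_line(&line, 512);
    assert_eq!(pieces.len(), 10);
    assert!(pieces.iter().all(|p| p.len() <= 512));

    // Multi-byte content is never split mid-character: split_at on &str
    // panics on a non-boundary index, so every piece stays valid UTF-8
    let accented = "é".repeat(3000); // 6000 bytes, 2 bytes per char
    for p in split_long_line(&accented, 512) {
        assert!(p.len() <= 512);
    }
    println!("ok");
}
```

This matches the expected behavior above: a 5000-character line yields 10 chunks, each bounded by chunk_size.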

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), valid (Valid issue)
