[BUG] Empty File Extension Pattern Matches Binary Files #87
Closed
Project
vgrep
Description
The should_index() function in src/core/indexer.rs line 310 includes an empty string "" in the list of indexable file extensions. This causes files without extensions (including compiled binaries, executables, and other non-text files) to be indexed, leading to errors or corrupted embeddings.
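The mechanism can be illustrated with a minimal sketch. The report does not show the actual code, so the extension list and the function body below are assumptions; only the presence of "" in the list comes from the description above. A path with no extension yields `None` from `Path::extension()`, and if that is mapped to an empty string, the "" entry matches it.

```rust
use std::path::Path;

// Hypothetical extension list. The real list in src/core/indexer.rs is not
// shown in this report; only the "" entry is taken from the description.
const BUGGY_EXTENSIONS: &[&str] = &["rs", "py", "md", ""];

/// Assumed shape of the buggy check: a missing extension becomes "",
/// which then matches the "" entry in the list.
fn should_index_buggy(path: &Path) -> bool {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("");
    BUGGY_EXTENSIONS.iter().any(|known| *known == ext)
}
```

Under these assumptions, `should_index_buggy(Path::new("my_binary"))` returns `true`, so a compiled executable is treated the same as a source file.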
Error Message
Error: Failed to read file
Caused by: stream did not contain valid UTF-8
Or silently produces garbage embeddings for binary content.
Debug Logs
System Information
- Bounty Version: 0.1.0
- OS: Ubuntu 24.04 LTS
- Rust: 1.75+
Screenshots
No response
Steps to Reproduce
- Create a project with compiled binaries:
mkdir -p /tmp/test_project && cd /tmp/test_project
echo 'fn main() { println!("hello"); }' > main.rs
rustc main.rs -o my_binary
- Run indexer:
vgrep index
- Observe that the indexer attempts to index my_binary (the compiled executable)
Expected Behavior
Files without extensions should NOT be indexed by default, except for specific known filenames (Makefile, Dockerfile, etc.) which are already handled in the filename check.
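One possible shape of the expected behavior is sketched below. This is not the project's code: the two lists are illustrative placeholders, and the filename check is an assumption about how the existing known-filename handling might look. The key point is that the extension list contains no "" entry, so an extensionless file is accepted only if its name is explicitly listed.

```rust
use std::path::Path;

// Illustrative lists only; the real ones live in the vgrep sources.
const INDEXABLE_EXTENSIONS: &[&str] = &["rs", "py", "md"];
const INDEXABLE_FILENAMES: &[&str] = &["Makefile", "Dockerfile"];

/// Sketch of a check that rejects extensionless files unless the
/// filename itself is on the known-filename list.
fn should_index(path: &Path) -> bool {
    // Known extensionless filenames are accepted by name.
    if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
        if INDEXABLE_FILENAMES.iter().any(|known| *known == name) {
            return true;
        }
    }
    // Everything else needs a listed extension; no extension means no match.
    match path.extension().and_then(|e| e.to_str()) {
        Some(ext) => INDEXABLE_EXTENSIONS.iter().any(|known| *known == ext),
        None => false,
    }
}
```

With these placeholder lists, `my_binary` is skipped while `Makefile` and `main.rs` are still indexed.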
Actual Behavior
All files without extensions are considered indexable, causing:
- UTF-8 decode errors for binary files
- Wasted processing time attempting to read binaries
- Potentially corrupted embeddings if binary content is partially UTF-8 valid
- Index bloat from non-code files
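The first and third points can be reproduced in isolation. The sketch below assumes the indexer reads files with `std::fs::read_to_string`, which is consistent with the error text above but is not confirmed by the report; the helper name and the temporary file name are hypothetical.

```rust
use std::fs;
use std::io;

/// Writes `bytes` to a temporary file and reads it back as text, the way
/// an indexer using `fs::read_to_string` would (an assumption here).
fn read_as_text(bytes: &[u8]) -> io::Result<String> {
    let path = std::env::temp_dir()
        .join(format!("vgrep_issue_demo_{}", std::process::id()));
    fs::write(&path, bytes)?;
    let result = fs::read_to_string(&path);
    // Best-effort cleanup; the read result is what matters.
    let _ = fs::remove_file(&path);
    result
}
```

Calling `read_as_text(&[0x7f, b'E', b'L', b'F', 0xff, 0xfe])` fails with `io::ErrorKind::InvalidData`, because 0xff is never valid in UTF-8. Calling it with `&[0x7f, b'E', b'L', b'F', 0, 0, 1]` succeeds, since every byte is valid UTF-8, and that is the case where binary content would be embedded without any error.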
Additional Context
The same bug exists in:
- src/core/indexer.rs:686 (ServerIndexer)
- src/watcher.rs:244 (FileWatcher)
All three locations need to be fixed consistently.
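One way to keep the three locations consistent is to route them through a single shared check, so the lists cannot drift apart again. The sketch below is only a suggestion: `IndexPolicy`, its fields, and the `should_index` methods are hypothetical, and the three unit structs are stand-ins for the real types named above.

```rust
use std::path::Path;

/// Hypothetical shared policy; not part of the vgrep code shown here.
pub struct IndexPolicy {
    extensions: &'static [&'static str],
    filenames: &'static [&'static str],
}

impl IndexPolicy {
    // Placeholder lists; note that there is no "" entry.
    pub const DEFAULT: IndexPolicy = IndexPolicy {
        extensions: &["rs", "py", "md"],
        filenames: &["Makefile", "Dockerfile"],
    };

    /// True if the filename is explicitly listed or the extension is listed.
    pub fn allows(&self, path: &Path) -> bool {
        let name = path.file_name().and_then(|n| n.to_str());
        let ext = path.extension().and_then(|e| e.to_str());
        name.map_or(false, |n| self.filenames.iter().any(|f| *f == n))
            || ext.map_or(false, |e| self.extensions.iter().any(|x| *x == e))
    }
}

// Stand-ins for the three call sites mentioned in the report.
pub struct Indexer;
pub struct ServerIndexer;
pub struct FileWatcher;

impl Indexer {
    pub fn should_index(&self, path: &Path) -> bool {
        IndexPolicy::DEFAULT.allows(path)
    }
}

impl ServerIndexer {
    pub fn should_index(&self, path: &Path) -> bool {
        IndexPolicy::DEFAULT.allows(path)
    }
}

impl FileWatcher {
    pub fn should_index(&self, path: &Path) -> bool {
        IndexPolicy::DEFAULT.allows(path)
    }
}
```

With this layout, a change to the lists is made once and applies to the indexer, the server indexer, and the file watcher alike.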