[BUG] Null/Control Characters Cause Tokenization Failure Without Proper Error Message #218
Project: vgrep

Description
The embedding engine does not sanitize or validate input text for null bytes (\x00) or other control characters. When such characters are present in the query or text to embed, the llama.cpp tokenizer fails with a cryptic "Failed to tokenize" error. This affects both the CLI search and all API endpoints (/search, /embed, /embed_batch).
Error Message
{"error":"Failed to generate embedding: Failed to tokenize"}

Debug Logs
$ curl -s -X POST http://127.0.0.1:7777/embed \
-H 'Content-Type: application/json' \
-d '{"text":"hello\u0000world"}'
{"error":"Embedding failed: Failed to tokenize"}
# Normal text works fine:
$ curl -s -X POST http://127.0.0.1:7777/embed \
-H 'Content-Type: application/json' \
-d '{"text":"hello world"}'
{"embedding":[...],"dimensions":1024}

System Information
Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB

Screenshots
No response
Steps to Reproduce
- Start the vgrep server: vgrep serve
- Send an embed request containing a null character:
  curl -X POST http://127.0.0.1:7777/embed \
    -H 'Content-Type: application/json' \
    -d '{"text":"hello\u0000world"}'
- Observe the generic "Failed to tokenize" error
Expected Behavior
- Control characters should be stripped or sanitized before tokenization
- Error message should specifically mention invalid characters
- Or: gracefully handle null bytes in input
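A minimal sketch of the stripping option described above. This helper is hypothetical (not part of vgrep's code): it removes control characters, including NUL, while keeping ordinary whitespace such as tabs and newlines.

```rust
/// Hypothetical helper: strip control characters (including '\0') from input
/// before tokenization, preserving '\n', '\r', and '\t'.
fn sanitize_for_tokenizer(text: &str) -> String {
    text.chars()
        .filter(|c| !c.is_control() || matches!(c, '\n' | '\r' | '\t'))
        .collect()
}

fn main() {
    // The null byte from the reproduction case is dropped silently.
    let cleaned = sanitize_for_tokenizer("hello\u{0000}world");
    assert_eq!(cleaned, "helloworld");

    // Normal whitespace survives unchanged.
    assert_eq!(sanitize_for_tokenizer("a\tb\nc"), "a\tb\nc");
}
```

Silent stripping keeps binary-containing files searchable; the alternative, rejecting the input with a descriptive error, may be preferable for the API endpoints where callers can correct their request.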
Actual Behavior
- Null bytes pass through to tokenizer unchanged
- Tokenization fails with generic error
- No indication what caused the failure
Additional Context
Location: src/core/embeddings.rs - embed() and embed_batch() functions
The issue occurs in the tokenization step:
let tokens = self.model.str_to_token(text, AddBos::Always)
    .context("Failed to tokenize")?;

No input sanitization is performed before this call.
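One way to address this would be a validation pass before str_to_token, so the error names the offending character instead of the opaque "Failed to tokenize". The sketch below is a standalone illustration of that idea, not vgrep's actual code; the InvalidInput type and validate_input function are assumptions introduced here.

```rust
use std::fmt;

/// Hypothetical error type carrying the rejected character.
#[derive(Debug)]
struct InvalidInput(char);

impl fmt::Display for InvalidInput {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "input contains control character {:?}; refusing to tokenize",
            self.0
        )
    }
}

/// Hypothetical pre-tokenization check: reject input containing NUL or
/// other disallowed control characters, allowing normal whitespace.
fn validate_input(text: &str) -> Result<(), InvalidInput> {
    match text
        .chars()
        .find(|c| c.is_control() && !matches!(c, '\n' | '\r' | '\t'))
    {
        Some(c) => Err(InvalidInput(c)),
        None => Ok(()),
    }
}

fn main() {
    assert!(validate_input("hello world").is_ok());

    let err = validate_input("hello\u{0000}world").unwrap_err();
    // Surfaces the exact character that would have tripped the tokenizer.
    println!("{err}");
}
```

With a check like this placed ahead of the str_to_token call, the /embed and /search endpoints could return a 4xx response that tells the caller exactly which character was rejected.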
Security Note: This could be exploited to cause failures when processing files that legitimately contain binary data or when malicious input is sent to the API.