[BUG] Null/Control Characters Cause Tokenization Failure Without Proper Error Message #218

@EnthusiasticTech

Project

vgrep

Description

The embedding engine does not sanitize or validate input text for null bytes (\x00) or other control characters. When such characters are present in the query or text to embed, the llama.cpp tokenizer fails with a cryptic "Failed to tokenize" error. This affects both the CLI search and all API endpoints (/search, /embed, /embed_batch).

Error Message

{"error":"Failed to generate embedding: Failed to tokenize"}

Debug Logs

$ curl -s -X POST http://127.0.0.1:7777/embed \
  -H 'Content-Type: application/json' \
  -d '{"text":"hello\u0000world"}'
{"error":"Embedding failed: Failed to tokenize"}

# Normal text works fine:
$ curl -s -X POST http://127.0.0.1:7777/embed \
  -H 'Content-Type: application/json' \
  -d '{"text":"hello world"}'
{"embedding":[...],"dimensions":1024}

System Information

Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB

Screenshots

No response

Steps to Reproduce

  1. Start vgrep server: vgrep serve
  2. Send embed request with null character:
    curl -X POST http://127.0.0.1:7777/embed \
      -H 'Content-Type: application/json' \
      -d '{"text":"hello\u0000world"}'
  3. Observe "Failed to tokenize" error

Expected Behavior

  1. Control characters should be stripped or sanitized before tokenization
  2. If the input is rejected, the error message should specifically mention the invalid characters
  3. Alternatively, null bytes in the input should be handled gracefully instead of failing
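For the second point, a validation step could report exactly which character was rejected and where. The following is a minimal sketch, not vgrep's actual code: `InvalidCharError` and `validate_input` are hypothetical names, and the decision to allow tab, newline, and carriage return is an assumption.

```rust
use std::fmt;

/// Hypothetical error type describing the first rejected character.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidCharError {
    pub ch: char,
    pub byte_offset: usize,
}

impl fmt::Display for InvalidCharError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "input contains control character U+{:04X} at byte offset {}",
            self.ch as u32, self.byte_offset
        )
    }
}

impl std::error::Error for InvalidCharError {}

/// Returns Ok(()) if `text` contains no control characters other than
/// tab, newline, and carriage return (an assumed allow-list); otherwise
/// reports the first offending character and its byte offset.
pub fn validate_input(text: &str) -> Result<(), InvalidCharError> {
    for (byte_offset, ch) in text.char_indices() {
        if ch.is_control() && !matches!(ch, '\t' | '\n' | '\r') {
            return Err(InvalidCharError { ch, byte_offset });
        }
    }
    Ok(())
}
```

With something like this in place, the request from the reproduction steps could fail with a message such as "input contains control character U+0000 at byte offset 5" rather than the generic tokenizer error.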

Actual Behavior

  1. Null bytes pass through to tokenizer unchanged
  2. Tokenization fails with generic error
  3. No indication what caused the failure
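One plausible mechanism for the null-byte case, not confirmed against the bindings' source: if the text is converted to a C string before it reaches llama.cpp, Rust's standard `CString::new` rejects any input with an interior NUL. The sketch below only demonstrates that standard-library behavior; `first_interior_nul` is a hypothetical helper, and it does not explain failures from other control characters.

```rust
use std::ffi::CString;

/// Returns the byte position of the first interior NUL that prevents
/// `text` from being converted into a C string, or None if the
/// conversion succeeds.
pub fn first_interior_nul(text: &str) -> Option<usize> {
    match CString::new(text) {
        Ok(_) => None,
        // NulError records where the offending byte was found.
        Err(e) => Some(e.nul_position()),
    }
}
```

If this is indeed the cause, the underlying error already carries the position of the bad byte, and that detail is lost when the failure is wrapped in the generic "Failed to tokenize" context.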

Additional Context

Location: src/core/embeddings.rs - embed() and embed_batch() functions

The issue occurs in the tokenization step:

let tokens = self.model.str_to_token(text, AddBos::Always)
    .context("Failed to tokenize")?;

No input sanitization is performed before this call.
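A sanitization pass placed before the tokenizer call might look like the following. This is a sketch under stated assumptions rather than a proposed patch: `sanitize_for_tokenizer` and `is_disallowed` are hypothetical names, and replacing each control character with a space (instead of deleting it) is one possible policy.

```rust
use std::borrow::Cow;

/// Assumed policy: every control character except tab, newline, and
/// carriage return is considered unsafe for the tokenizer.
fn is_disallowed(c: char) -> bool {
    c.is_control() && !matches!(c, '\t' | '\n' | '\r')
}

/// Hypothetical helper: replaces each disallowed control character with
/// a single space. Returns the input unchanged (borrowed, no allocation)
/// when there is nothing to replace.
pub fn sanitize_for_tokenizer(text: &str) -> Cow<'_, str> {
    if !text.chars().any(is_disallowed) {
        return Cow::Borrowed(text);
    }
    Cow::Owned(
        text.chars()
            .map(|c| if is_disallowed(c) { ' ' } else { c })
            .collect(),
    )
}
```

The result could then be passed to the existing call, for example `self.model.str_to_token(&sanitize_for_tokenizer(text), AddBos::Always)`. Replacing with a space keeps "hello\0world" as two words; deleting the character would merge them into "helloworld" and change the embedding.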

Security Note: Requests fail whenever processed files legitimately contain binary data, and the same failures could be triggered deliberately by sending malicious input to the API.

Labels

bug, valid, vgrep