
【WIP】feat: add local llama-cpp embedding support #1388

Open
Mijamind719 wants to merge 3 commits into volcengine:main from Mijamind719:embedding_local

Conversation

Mijamind719 (Collaborator) commented Apr 12, 2026

Description

This PR adds an initial local dense embedding path for OpenViking based on llama-cpp-python, with bge-small-zh-v1.5-f16 as the default local embedding model.

The goal is to make local CPU embedding available without changing the existing remote providers, while keeping installation risk isolated behind an optional extra dependency.

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add provider: "local" embedding config support and default implicit local embedding behavior when no embedding provider is configured.
  • Add a LocalDenseEmbedder based on llama-cpp-python with:
    • default model bge-small-zh-v1.5-f16
    • optional model_path
    • optional cache_dir
    • automatic model download into the local cache directory
  • Extend embedder interfaces with explicit query/document semantics so retrieval-side and indexing-side embedding can use different roles.
  • Update context collection schema management to persist embedding metadata and detect incompatible existing indexes.
  • Add explicit configuration and rebuild-required errors for local embedding failures and embedding/index mismatches.
  • Update ov doctor to diagnose the local embedding dependency, local model cache state, and invalid local model configuration.
  • Add openviking[local-embed] extra instead of making llama-cpp-python a hard dependency of the main package.
  • Add a design document at docs/design/local-embedding-llama-cpp-design.md to capture the implementation scope and rollout decisions.
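
For reference, a minimal sketch of what the `provider: "local"` configuration might look like in `ov.conf`. Only the `embedding.dense.provider` key is confirmed by this PR (it appears in the error messages below); the remaining key names mirror the options listed above and the cache path is an assumption:

```json
{
  "embedding": {
    "dense": {
      "provider": "local",
      "model": "bge-small-zh-v1.5-f16",
      "model_path": null,
      "cache_dir": "~/.cache/openviking/models"
    }
  }
}
```

Per the description, omitting the embedding provider entirely should also trigger the implicit local default.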

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Targeted validation performed:

  • python3 -m py_compile on the touched Python modules
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_doctor.py --maxfail=1
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/unit/test_local_embedder.py tests/storage/test_collection_schemas.py tests/misc/test_config_validation.py --maxfail=1
  • Real local smoke test on macOS with actual llama-cpp-python + downloaded GGUF model:
    • local model download succeeded
    • embedder initialization succeeded
    • embed_query() and embed_document() returned valid 512-d vectors
    • relevant Chinese query/document similarity was higher than an unrelated sentence
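
The similarity check in the smoke test above can be reproduced with plain cosine similarity over the returned 512-d vectors. A minimal sketch (the `embedder` usage in the comment is illustrative, based on the `embed()`/`EmbedResult` API quoted later in this PR):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical smoke check mirroring the validation above:
# query_vec = embedder.embed("相关查询", is_query=True).dense_vector
# assert cosine_similarity(query_vec, related_doc_vec) > \
#        cosine_similarity(query_vec, unrelated_doc_vec)
```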

Known Limitations

  • In the current real runtime tested for bge-small-zh-v1.5-f16, native multi-sequence batch embedding is not available because the created llama context reports n_seq_max = 1.
  • embed_batch() is therefore safe and functional, but currently falls back to sequential embedding in this runtime instead of delivering native batch throughput.
  • This PR only reports index incompatibility and asks for rebuild. It does not implement a rebuild workflow.
  • This PR does not claim full end-to-end validation of the whole OpenViking product flow yet; validation is currently focused on targeted tests plus a real local embedding smoke test.
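
The sequential fallback described in the first two limitations can be sketched as follows. The helper and its parameters are hypothetical; `create_embedding` stands in for the llama-cpp-python call quoted later in this PR, and the `n_seq_max` probe is assumed to come from the created context:

```python
from typing import Any, Callable, List

def embed_batch_with_fallback(
    create_embedding: Callable[[Any], Any],
    texts: List[str],
    n_seq_max: int,
) -> List[List[float]]:
    """Sketch of the fallback behavior described above (names hypothetical).

    When the llama context supports multiple sequences, submit the whole
    batch in one call; when it reports n_seq_max == 1 (the runtime observed
    for bge-small-zh-v1.5-f16), embed each text sequentially instead.
    """
    if n_seq_max > 1:
        payload = create_embedding(texts)  # native multi-sequence batch path
        return [item["embedding"] for item in payload["data"]]
    # Sequential fallback: one create_embedding call per text.
    vectors: List[List[float]] = []
    for text in texts:
        payload = create_embedding(text)
        vectors.append(payload["data"][0]["embedding"])
    return vectors
```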

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

Follow-up investigation is still needed for true native multi-sequence batch support in llama-cpp-python for this embedding model/runtime combination.

Co-authored-by: GPT-5.4 <noreply@openai.com>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
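One way to address this compatibility concern would be a legacy branch that preserves the old "collection already exists, keep running" behavior when metadata is unavailable. A minimal sketch (the function and its parameters are hypothetical, not the PR's actual code):

```python
from typing import Any, Dict, Optional

def validate_embedding_compatibility(
    existing_meta: Optional[Dict[str, Any]],
    current_identity: str,
    strict: bool = False,
) -> bool:
    """Hypothetical legacy-compatible validation branch.

    Instead of always raising when metadata cannot be read, treat
    legacy collections as compatible unless strict validation is
    explicitly requested.
    """
    if not existing_meta:
        if strict:
            raise RuntimeError(
                "Existing collection metadata is unavailable; "
                "cannot validate embedding compatibility"
            )
        return True  # legacy collection: keep the pre-PR behavior
    return existing_meta.get("model_identity") == current_identity
```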
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance[1-10]: 6


Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low

Co-authored-by: GPT-5.4 <noreply@openai.com>
Legacy issue: investigate true llama-cpp native multi-sequence batch support for local embedding models such as bge-small-zh-v1.5-f16 (current runtime reports n_seq_max=1, so embed_batch uses sequential mode).

Co-authored-by: GPT-5.4 <noreply@openai.com>
jcp0578 (Contributor) commented Apr 13, 2026

Found a few issues:

  1. [P1] Existing collections without metadata will hard-block the upgrade path
    After create_collection() returns False, the old "collection already exists, keep running" behavior no longer applies: the existing collection must now yield readable embedding metadata, otherwise EmbeddingConfigurationError is raised immediately. Collections created by older versions, or storage backends that have not yet implemented get_collection_meta(), will fail to start after upgrading even though they previously worked fine. Consider adding a compatibility branch or an explicit migration path for legacy collections, or at least documenting this change.
    File: openviking/storage/collection_schemas.py
    Lines: 225-232

  2. [P2] The local embedder's async path still blocks the event loop synchronously
    embed_compat() prefers embed_async(), but the base-class default implementation just runs self.embed(...) synchronously. LocalDenseEmbedder does not override the async methods, so the llama-cpp-python inference here, along with the related local initialization path, still executes on the event-loop thread via the async call chain. For server-side concurrent embedding/ingestion workloads this causes significant latency amplification and availability risk. Recommend explicitly overriding the async interface and offloading the blocking calls to a thread pool.
    File: openviking/models/embedder/local_embedders.py
    Lines: 76-89

  3. [P2] Failures in model identity computation silently degrade to name-only comparison
    This branch is meant to encode the local model file's identity into the collection metadata so that embedding-compatibility validation is more accurate; but the current implementation wraps the entire path resolution and hash computation in a bare except Exception:, falling back to model_identity = model on failure with no logging at all. Once identity computation fails, different local model files degrade back to "compare by model name only", weakening the newly added incompatibility detection. Recommend at least logging a warning and catching only the expected exception types.
    File: openviking/storage/collection_schemas.py
    Lines: 146-154
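
The narrower error handling suggested in point 3 could look like the following sketch (the function name, metadata layout, and hash scheme are hypothetical; only the model-name fallback behavior comes from the review above):

```python
import hashlib
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

def compute_model_identity(model: str, model_path: Optional[str]) -> str:
    """Hypothetical identity helper: hash the local model file when possible.

    On the expected I/O failures, log a warning and fall back to the bare
    model name instead of silently swallowing every exception.
    """
    if not model_path:
        return model
    try:
        digest = hashlib.sha256(Path(model_path).read_bytes()).hexdigest()[:16]
        return f"{model}:{digest}"
    except (OSError, ValueError) as exc:  # only the expected failure modes
        logger.warning("Failed to hash local model file %s: %s", model_path, exc)
        return model
```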
