
【WIP】feat: add local llama-cpp embedding support #1388

Open
Mijamind719 wants to merge 3 commits into volcengine:main from Mijamind719:embedding_local

Conversation

Mijamind719 (Collaborator) commented Apr 12, 2026

Description

This PR adds an initial local dense embedding path for OpenViking based on llama-cpp-python, with bge-small-zh-v1.5-f16 as the default local embedding model.

The goal is to make local CPU embedding available without changing the existing remote providers, while keeping installation risk isolated behind an optional extra dependency.

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add provider: "local" embedding config support and default implicit local embedding behavior when no embedding provider is configured.
  • Add a LocalDenseEmbedder based on llama-cpp-python with:
    • default model bge-small-zh-v1.5-f16
    • optional model_path
    • optional cache_dir
    • automatic model download into the local cache directory
  • Extend embedder interfaces with explicit query/document semantics so retrieval-side and indexing-side embedding can use different roles.
  • Update context collection schema management to persist embedding metadata and detect incompatible existing indexes.
  • Add explicit configuration and rebuild-required errors for local embedding failures and embedding/index mismatches.
  • Update ov doctor to diagnose the local embedding dependency, local model cache state, and invalid local model configuration.
  • Add openviking[local-embed] extra instead of making llama-cpp-python a hard dependency of the main package.
  • Add a design document at docs/design/local-embedding-llama-cpp-design.md to capture the implementation scope and rollout decisions.
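
For reference, a minimal sketch of what the `provider: "local"` configuration might look like in `ov.conf`. Only the `embedding.dense.provider` key is confirmed by this PR (it appears in the error messages below); the remaining key names mirror the options listed above and the cache path is an assumption:

```json
{
  "embedding": {
    "dense": {
      "provider": "local",
      "model": "bge-small-zh-v1.5-f16",
      "model_path": null,
      "cache_dir": "~/.cache/openviking/models"
    }
  }
}
```

Per the description, omitting the embedding provider entirely should also trigger the implicit local default.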

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Targeted validation performed:

  • python3 -m py_compile on the touched Python modules
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/cli/test_doctor.py --maxfail=1
  • PYTHONPATH=. ./.venv/bin/python -m pytest -q tests/unit/test_local_embedder.py tests/storage/test_collection_schemas.py tests/misc/test_config_validation.py --maxfail=1
  • Real local smoke test on macOS with actual llama-cpp-python + downloaded GGUF model:
    • local model download succeeded
    • embedder initialization succeeded
    • embed_query() and embed_document() returned valid 512-d vectors
    • relevant Chinese query/document similarity was higher than an unrelated sentence
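
The similarity check in the smoke test above can be reproduced with plain cosine similarity over the returned 512-d vectors. A minimal sketch (the `embedder` usage in the comment is illustrative, based on the `embed()`/`EmbedResult` API quoted later in this PR):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical smoke check mirroring the validation above:
# query_vec = embedder.embed("相关查询", is_query=True).dense_vector
# assert cosine_similarity(query_vec, related_doc_vec) > \
#        cosine_similarity(query_vec, unrelated_doc_vec)
```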

Known Limitations

  • In the current real runtime tested for bge-small-zh-v1.5-f16, native multi-sequence batch embedding is not available because the created llama context reports n_seq_max = 1.
  • embed_batch() is therefore safe and functional, but currently falls back to sequential embedding in this runtime instead of delivering native batch throughput.
  • This PR only reports index incompatibility and asks for rebuild. It does not implement a rebuild workflow.
  • This PR does not claim full end-to-end validation of the whole OpenViking product flow yet; validation is currently focused on targeted tests plus a real local embedding smoke test.
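
The sequential fallback described in the first two limitations can be sketched as follows. The helper and its parameters are hypothetical; `create_embedding` stands in for the llama-cpp-python call quoted later in this PR, and the `n_seq_max` probe is assumed to come from the created context:

```python
from typing import Any, Callable, List

def embed_batch_with_fallback(
    create_embedding: Callable[[Any], Any],
    texts: List[str],
    n_seq_max: int,
) -> List[List[float]]:
    """Sketch of the fallback behavior described above (names hypothetical).

    When the llama context supports multiple sequences, submit the whole
    batch in one call; when it reports n_seq_max == 1 (the runtime observed
    for bge-small-zh-v1.5-f16), embed each text sequentially instead.
    """
    if n_seq_max > 1:
        payload = create_embedding(texts)  # native multi-sequence batch path
        return [item["embedding"] for item in payload["data"]]
    # Sequential fallback: one create_embedding call per text.
    vectors: List[List[float]] = []
    for text in texts:
        payload = create_embedding(text)
        vectors.append(payload["data"][0]["embedding"])
    return vectors
```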

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

Follow-up investigation is still needed for true native multi-sequence batch support in llama-cpp-python for this embedding model/runtime combination.

Co-authored-by: GPT-5.4 <noreply@openai.com>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
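One way to address this compatibility concern would be a legacy branch that preserves the old "collection already exists, keep running" behavior when metadata is unavailable. A minimal sketch (the function and its parameters are hypothetical, not the PR's actual code):

```python
from typing import Any, Dict, Optional

def validate_embedding_compatibility(
    existing_meta: Optional[Dict[str, Any]],
    current_identity: str,
    strict: bool = False,
) -> bool:
    """Hypothetical legacy-compatible validation branch.

    Instead of always raising when metadata cannot be read, treat
    legacy collections as compatible unless strict validation is
    explicitly requested.
    """
    if not existing_meta:
        if strict:
            raise RuntimeError(
                "Existing collection metadata is unavailable; "
                "cannot validate embedding compatibility"
            )
        return True  # legacy collection: keep the pre-PR behavior
    return existing_meta.get("model_identity") == current_identity
```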
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance[1-10]: 6


Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low

Co-authored-by: GPT-5.4 <noreply@openai.com>
Legacy issue: investigate true llama-cpp native multi-sequence batch support for local embedding models such as bge-small-zh-v1.5-f16 (current runtime reports n_seq_max=1, so embed_batch uses sequential mode).

Co-authored-by: GPT-5.4 <noreply@openai.com>
jcp0578 (Contributor) commented Apr 13, 2026

Found a few issues:

  1. [P1] Existing collections without metadata will hard-block the upgrade path
    After create_collection() returns False, the old "collection already exists, keep running" behavior no longer applies: the existing collection must now yield readable embedding metadata, otherwise EmbeddingConfigurationError is raised immediately. Collections created by older versions, or storage backends that have not yet implemented get_collection_meta(), will fail to start after upgrading even though they previously worked fine. Consider adding a compatibility branch or an explicit migration path for legacy collections, or at least documenting this change.
    File: openviking/storage/collection_schemas.py
    Lines: 225-232

  2. [P2] The local embedder's async path still blocks the event loop synchronously
    embed_compat() prefers embed_async(), but the base-class default implementation just runs self.embed(...) synchronously. LocalDenseEmbedder does not override the async methods, so the llama-cpp-python inference here, along with the related local initialization path, still executes on the event-loop thread via the async call chain. For server-side concurrent embedding/ingestion workloads this causes significant latency amplification and availability risk. Recommend explicitly overriding the async interface and offloading the blocking calls to a thread pool.
    File: openviking/models/embedder/local_embedders.py
    Lines: 76-89

  3. [P2] Failures in model identity computation silently degrade to name-only comparison
    This branch is meant to encode the local model file's identity into the collection metadata so that embedding-compatibility validation is more accurate; but the current implementation wraps the entire path resolution and hash computation in a bare except Exception:, falling back to model_identity = model on failure with no logging at all. Once identity computation fails, different local model files degrade back to "compare by model name only", weakening the newly added incompatibility detection. Recommend at least logging a warning and catching only the expected exception types.
    File: openviking/storage/collection_schemas.py
    Lines: 146-154
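
The narrower error handling suggested in point 3 could look like the following sketch (the function name, metadata layout, and hash scheme are hypothetical; only the model-name fallback behavior comes from the review above):

```python
import hashlib
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

def compute_model_identity(model: str, model_path: Optional[str]) -> str:
    """Hypothetical identity helper: hash the local model file when possible.

    On the expected I/O failures, log a warning and fall back to the bare
    model name instead of silently swallowing every exception.
    """
    if not model_path:
        return model
    try:
        digest = hashlib.sha256(Path(model_path).read_bytes()).hexdigest()[:16]
        return f"{model}:{digest}"
    except (OSError, ValueError) as exc:  # only the expected failure modes
        logger.warning("Failed to hash local model file %s: %s", model_path, exc)
        return model
```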
