UPSTREAM PR #18330: full modern bert support#1128

Open
loci-dev wants to merge 10 commits into main from loci/pr-18330-full-modern-bert-support

Conversation


@loci-dev loci-dev commented Feb 1, 2026

Note

Source pull request: ggml-org/llama.cpp#18330

This adds support for converting ModernBERT from HF to GGUF and running it on llama.cpp. It continues from my recent [granite-embd-support](https://github.com/ggml-org/llama.cpp/pull/15641) PR, which added a ModernBERT-based model, with some additional tweaks. I have run cosine similarity tests with the script below.
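For reference, the HF -> GGUF conversion step looks roughly like the following. This is a minimal sketch that assumes the stock convert_hf_to_gguf.py script in the llama.cpp repo and its --outfile/--outtype options; the paths are just the ones from my test setup and are illustrative.

import os
import subprocess

# Illustrative paths -- adjust to your llama.cpp checkout and HF snapshot.
llama_cpp_dir = os.path.expanduser("~/Projects/gits/llama-fix/llama.cpp")
hf_model_dir = os.path.expanduser(
    "~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/"
)
out_gguf = os.path.expanduser("~/models/modern-bert-large.gguf")

# Convert the Hugging Face checkpoint to an f16 GGUF file.
subprocess.run(
    [
        "python",
        os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py"),
        hf_model_dir,
        "--outfile", out_gguf,
        "--outtype", "f16",
    ],
    check=True,
)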

from sentence_transformers import SentenceTransformer
import numpy as np
import subprocess
import shlex
import os

model_path = os.path.expanduser(
    "~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/"
)
lcpp_model = os.path.expanduser("~/models/modern-bert-large.gguf")
lcpp_exe = "/Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding"

model = SentenceTransformer(model_path)

input_queries = [
    "hello world",
    "tell me a story about a developer and their dog",
    "123sfg this is a r@nd0m t35t",
]


def cosine_similarity(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    vector_a = np.asarray(vector_a)
    vector_b = np.asarray(vector_b)
    numerator = np.dot(vector_a, vector_b)
    denominator_a = np.linalg.norm(vector_a)
    denominator_b = np.linalg.norm(vector_b)
    if denominator_a == 0 or denominator_b == 0:
        return 0.0
    cosine_sim = numerator / (denominator_a * denominator_b)
    return cosine_sim


for query in input_queries:
    print("### BASELINE ###")
    embedding = model.encode([query])
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding[:, :8])

    print("### llama.cpp ###")
    cmd = f"{lcpp_exe} -m {lcpp_model} -p \"{query}\" --temp 0 --embd-normalize -1 --pooling mean"
    print(f"llama.cpp command: {cmd}")
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    # llama-embedding prints the embedding values after a final ":"; take that
    # field and parse it into floats.
    vals = out.decode("utf-8").split(":")[-1]
    vals = [float(v) for v in vals.split() if v.strip()]
    lcpp_emb = np.array(vals)
    print("llama.cpp Embedding shape:", lcpp_emb.shape)
    print("llama.cpp Embedding vector:", lcpp_emb[:8])
    print()
    cos_sim = cosine_similarity(embedding, lcpp_emb)
    print(f"COSINE SIMILARITY: {cos_sim}")
    print("--------------------------------")
    print()

which produced the following results:

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.659244 0.39849958 0.2302168 0.6192862 -0.62407815 0.0042014
0.14638135 0.2541136 ]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "hello world" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.679083 0.394328 0.235899 0.62807 -0.630304 -0.004468 0.141795
0.248705]

COSINE SIMILARITY: [0.99971951]

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.23057191 0.12633912 -0.00238159 -0.08394846 0.19630949 0.03715154
0.0040304 0.63173795]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "tell me a story about a developer and their dog" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.230692 0.131872 -0.007958 -0.065828 0.282647 0.056364 -0.025206
0.672672]

COSINE SIMILARITY: [0.9994365]

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.15972608 0.52267325 -0.05636618 0.40699816 0.6401572 0.49469572
-0.4336093 0.3909793 ]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "123sfg this is a r@nd0m t35t" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.177373 0.495097 -0.114586 0.46121 0.635596 0.548017 -0.400412
0.430722]

COSINE SIMILARITY: [0.99780866]
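The huggingface/tokenizers fork warnings in the captured output are unrelated to the embedding values; as the warning itself suggests, they can be silenced by setting TOKENIZERS_PARALLELISM before the tokenizer is used, e.g. at the top of the test script:

import os

# Set before sentence_transformers loads its tokenizer, so forking the
# llama-embedding subprocess does not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"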

Running the same tests on granite-embd-small gives the same results as before.
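Concretely, re-running the harness against the other model just means swapping the two paths at the top of the script; the filenames here are illustrative, not the exact ones from the granite-embd-support PR.

# Illustrative paths for re-running the same harness against granite-embd-small.
model_path = os.path.expanduser("~/models/granite-embd-small-hf/")
lcpp_model = os.path.expanduser("~/models/granite-embd-small.gguf")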
@gabe-l-hart

loci-dev force-pushed the main branch 26 times, most recently from 8db062d to 4d805ce on February 3, 2026 at 07:24
loci-dev force-pushed the main branch 6 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32
loci-dev force-pushed the main branch 9 times, most recently from 073bd79 to 823244c on February 18, 2026 at 02:17
ryan-mangeno and others added 6 commits February 18, 2026 15:58
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>