UPSTREAM PR #18330: full modern bert support#1128

Open
loci-dev wants to merge 10 commits into main from loci/pr-18330-full-modern-bert-support

Conversation


@loci-dev loci-dev commented Feb 1, 2026

Note

Source pull request: ggml-org/llama.cpp#18330

This adds support for converting ModernBERT from HF to GGUF and running it on llama.cpp. It continues from my recent [granite-embd-support](https://github.com/ggml-org/llama.cpp/pull/15641) PR, which added a ModernBERT-based model, with some additional tweaks. I have run cosine similarity tests with the script below.
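For reference, the HF -> GGUF conversion step looks roughly like the following. This is a minimal sketch that assumes the stock convert_hf_to_gguf.py script in the llama.cpp repo and its --outfile/--outtype options; the paths are just the ones from my test setup and are illustrative.

import os
import subprocess

# Illustrative paths -- adjust to your llama.cpp checkout and HF snapshot.
llama_cpp_dir = os.path.expanduser("~/Projects/gits/llama-fix/llama.cpp")
hf_model_dir = os.path.expanduser(
    "~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/"
)
out_gguf = os.path.expanduser("~/models/modern-bert-large.gguf")

# Convert the Hugging Face checkpoint to an f16 GGUF file.
subprocess.run(
    [
        "python",
        os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py"),
        hf_model_dir,
        "--outfile", out_gguf,
        "--outtype", "f16",
    ],
    check=True,
)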

from sentence_transformers import SentenceTransformer
import numpy as np
import subprocess
import shlex
import os

model_path = os.path.expanduser(
    "~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/"
)
lcpp_model = os.path.expanduser("~/models/modern-bert-large.gguf")
lcpp_exe = "/Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding"

model = SentenceTransformer(model_path)

input_queries = [
    "hello world",
    "tell me a story about a developer and their dog",
    "123sfg this is a r@nd0m t35t",
]


def cosine_similarity(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    vector_a = np.asarray(vector_a)
    vector_b = np.asarray(vector_b)
    numerator = np.dot(vector_a, vector_b)
    denominator_a = np.linalg.norm(vector_a)
    denominator_b = np.linalg.norm(vector_b)
    if denominator_a == 0 or denominator_b == 0:
        return 0.0
    cosine_sim = numerator / (denominator_a * denominator_b)
    return cosine_sim


for query in input_queries:
    print("### BASELINE ###")
    embedding = model.encode([query])
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding[:, :8])

    print("### llama.cpp ###")
    cmd = f"{lcpp_exe} -m {lcpp_model} -p \"{query}\" --temp 0 --embd-normalize -1 --pooling mean"
    print(f"llama.cpp command: {cmd}")
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    # llama-embedding prints the embedding values after a final ":"; take that
    # field and parse it into floats.
    vals = out.decode("utf-8").split(":")[-1]
    vals = [float(v) for v in vals.split() if v.strip()]
    lcpp_emb = np.array(vals)
    print("llama.cpp Embedding shape:", lcpp_emb.shape)
    print("llama.cpp Embedding vector:", lcpp_emb[:8])
    print()
    cos_sim = cosine_similarity(embedding, lcpp_emb)
    print(f"COSINE SIMILARITY: {cos_sim}")
    print("--------------------------------")
    print()

which produced the following results:

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.659244 0.39849958 0.2302168 0.6192862 -0.62407815 0.0042014
0.14638135 0.2541136 ]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "hello world" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.679083 0.394328 0.235899 0.62807 -0.630304 -0.004468 0.141795
0.248705]

COSINE SIMILARITY: [0.99971951]

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.23057191 0.12633912 -0.00238159 -0.08394846 0.19630949 0.03715154
0.0040304 0.63173795]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "tell me a story about a developer and their dog" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.230692 0.131872 -0.007958 -0.065828 0.282647 0.056364 -0.025206
0.672672]

COSINE SIMILARITY: [0.9994365]

BASELINE

Embedding shape: (1, 1024)
Embedding vector: [[ 0.15972608 0.52267325 -0.05636618 0.40699816 0.6401572 0.49469572
-0.4336093 0.3909793 ]]

llama.cpp

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "123sfg this is a r@nd0m t35t" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.177373 0.495097 -0.114586 0.46121 0.635596 0.548017 -0.400412
0.430722]

COSINE SIMILARITY: [0.99780866]
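The huggingface/tokenizers fork warnings in the captured output are unrelated to the embedding values; as the warning itself suggests, they can be silenced by setting TOKENIZERS_PARALLELISM before the tokenizer is used, e.g. at the top of the test script:

import os

# Set before sentence_transformers loads its tokenizer, so forking the
# llama-embedding subprocess does not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"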

Running the same tests on granite-embd-small gives the same results as before.
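Concretely, re-running the harness against the other model just means swapping the two paths at the top of the script; the filenames here are illustrative, not the exact ones from the granite-embd-support PR.

# Illustrative paths for re-running the same harness against granite-embd-small.
model_path = os.path.expanduser("~/models/granite-embd-small-hf/")
lcpp_model = os.path.expanduser("~/models/granite-embd-small.gguf")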
@gabe-l-hart

loci-dev force-pushed the main branch 26 times, most recently from 8db062d to 4d805ce on February 3, 2026 at 07:24
loci-dev force-pushed the main branch 6 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32
loci-dev force-pushed the main branch 9 times, most recently from 073bd79 to 823244c on February 18, 2026 at 02:17
ryan-mangeno and others added 6 commits February 18, 2026 15:58
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>