UPSTREAM PR #19488: model: add JAIS-2 architecture support #1168

Open

loci-dev wants to merge 5 commits into main from loci/pr-19488-jais2

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19488

Add support for the JAIS-2 family of Arabic-English bilingual models from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

Architecture characteristics:

  • LayerNorm (not RMSNorm) with biases
  • ReLU² (ReLU squared) activation function
  • Separate Q/K/V projections with biases
  • Simple MLP without gate projection (up -> act -> down)
  • RoPE positional embeddings
  • GPT-2 BPE tokenizer
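For illustration, the feed-forward path implied by the list above can be sketched in NumPy. The function and parameter names here are made up for the sketch and are not the GGUF tensor names used by the implementation:

```python
import numpy as np

def relu_squared(x):
    # ReLU² as described above: the square of the ReLU output
    return np.square(np.maximum(x, 0.0))

def ffn(x, w_up, b_up, w_down, b_down):
    # Simple MLP without a gate projection: up -> act -> down,
    # with biases on both projections
    h = x @ w_up + b_up
    return relu_squared(h) @ w_down + b_down
```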

Supported model sizes:

  • Jais-2-8B (32 layers, 26 heads, 3328 hidden)
  • Jais-2-70B (68 layers, 56 heads, 7168 hidden)
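As a quick sanity check on the sizes listed above (assuming, as is conventional, that the hidden size divides evenly by the head count):

```python
# Both JAIS-2 sizes work out to the common head_dim of 128.
sizes = {"Jais-2-8B": (3328, 26), "Jais-2-70B": (7168, 56)}
head_dims = {name: hidden // heads for name, (hidden, heads) in sizes.items()}
print(head_dims)
```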

Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

GGUF weights on the Hub (for tests): https://huggingface.co/inceptionai/Jais-2-8B-Chat-GGUF

Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.
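A toy NumPy example (not llama.cpp code) of why half-precision accumulators lose mass on long sums: once the running total's representable spacing exceeds the addend, further additions round away entirely, while an F32 accumulator keeps the full sum.

```python
import numpy as np

addends = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for v in addends:
    acc16 = np.float16(acc16 + v)   # accumulate in half precision

acc32 = float(np.sum(addends.astype(np.float32)))  # accumulate in F32

print(acc16, acc32)  # the F16 sum stalls far below the F32 sum
```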
@loci-review

loci-review bot commented Feb 12, 2026

Overview

Analysis of 115,033 functions across JAIS-2 architecture integration reveals minimal performance impact. Modified functions: 36 (0.03%), new functions: 31, removed: 0, unchanged: 114,966 (99.94%).

Power Consumption Changes:

  • build.bin.libllama.so: +0.143% (+366.29 nJ)
  • build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so: 0.000% (no change)
  • build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0.000% (no change)

Function Analysis

All performance changes occur in C++ STL functions during initialization, not inference hot paths:

Regressions (non-critical):

  • std::vector<uint>::begin: +181ns (+289% throughput) - iterator accessor in KV cache initialization
  • std::_Rb_tree_const_iterator::_M_const_cast: +181ns (+284% throughput) - backend buffer type management
  • std::vector<string>::_M_realloc_insert: +57ns (+27% throughput) - tensor name vector reallocation during model loading
  • std::vector<buffer_type*>::_M_realloc_insert: +40ns (+19% throughput) - MoE layer buffer override processing

Improvements:

  • std::unique_ptr::operator=: -75ns (-49% throughput) - graph context assignment
  • LLM_TN_IMPL::str: -121ns (-32% throughput) - tensor name generation
  • llama_grammar_parser::c_rules: -114ns (-41% throughput) - grammar rule accessor
  • std::unordered_map<int,int>::operator=: -58ns (-19% throughput) - KV cache layer ID mapping

All changes are compiler optimization artifacts in initialization code. No source code modifications justify the performance differences. Cumulative impact: -89μs in model loading (0.001% of total), +112ns per inference batch (0.0002% of total).

Additional Findings

No modifications to performance-critical operations: matrix multiplication (70-90% of inference time), attention mechanisms, quantization kernels, or GPU backends remain unchanged. Flash Attention enabled for JAIS-2 as optimization. GGML libraries show 0.000% power change, confirming no alterations to tensor operations. The 0.143% power increase in libllama.so represents static code addition (748+ tensor definitions) without runtime overhead.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 11 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17
@loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 on February 28, 2026 02:16
@loci-dev force-pushed the main branch 3 times, most recently from 8c889a6 to 13648e6 on March 2, 2026 02:17
@loci-dev force-pushed the main branch 7 times, most recently from 8019888 to 17452e3 on March 9, 2026 02:17