UPSTREAM PR #19488: model: add JAIS-2 architecture support #1168

Open

loci-dev wants to merge 5 commits into main from loci/pr-19488-jais2

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19488

Add support for the JAIS-2 family of Arabic-English bilingual models from Inception AI (https://huggingface.co/inceptionai/Jais-2-8B-Chat).

Architecture characteristics:

  • LayerNorm (not RMSNorm) with biases
  • ReLU² (ReLU squared) activation function
  • Separate Q/K/V projections with biases
  • Simple MLP without gate projection (up -> act -> down)
  • RoPE positional embeddings
  • GPT-2 BPE tokenizer
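For illustration, the feed-forward path implied by the list above can be sketched in NumPy. The function and parameter names here are made up for the sketch and are not the GGUF tensor names used by the implementation:

```python
import numpy as np

def relu_squared(x):
    # ReLU² as described above: the square of the ReLU output
    return np.square(np.maximum(x, 0.0))

def ffn(x, w_up, b_up, w_down, b_down):
    # Simple MLP without a gate projection: up -> act -> down,
    # with biases on both projections
    h = x @ w_up + b_up
    return relu_squared(h) @ w_down + b_down
```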

Supported model sizes:

  • Jais-2-8B (32 layers, 26 heads, 3328 hidden)
  • Jais-2-70B (68 layers, 56 heads, 7168 hidden)
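As a quick sanity check on the sizes listed above (assuming, as is conventional, that the hidden size divides evenly by the head count):

```python
# Both JAIS-2 sizes work out to the common head_dim of 128.
sizes = {"Jais-2-8B": (3328, 26), "Jais-2-70B": (7168, 56)}
head_dims = {name: hidden // heads for name, (hidden, heads) in sizes.items()}
print(head_dims)
```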

Tested with quantizations: BF16, Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K_M, Q2_K

GGUF weights on the Hub (for tests): https://huggingface.co/inceptionai/Jais-2-8B-Chat-GGUF

Note: JAIS-2 requires F32 precision accumulators for numerical stability
and uses standard attention (not flash attention) on CUDA backends.
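A toy NumPy example (not llama.cpp code) of why half-precision accumulators lose mass on long sums: once the running total's representable spacing exceeds the addend, further additions round away entirely, while an F32 accumulator keeps the full sum.

```python
import numpy as np

addends = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for v in addends:
    acc16 = np.float16(acc16 + v)   # accumulate in half precision

acc32 = float(np.sum(addends.astype(np.float32)))  # accumulate in F32

print(acc16, acc32)  # the F16 sum stalls far below the F32 sum
```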
@loci-review

loci-review bot commented Feb 12, 2026

Overview

Analysis of 115,033 functions across JAIS-2 architecture integration reveals minimal performance impact. Modified functions: 36 (0.03%), new functions: 31, removed: 0, unchanged: 114,966 (99.94%).

Power Consumption Changes:

  • build.bin.libllama.so: +0.143% (+366.29 nJ)
  • build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so: 0.000% (no change)
  • build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0.000% (no change)

Function Analysis

All performance changes occur in C++ STL functions during initialization, not inference hot paths:

Regressions (non-critical):

  • std::vector<uint>::begin: +181ns (+289% throughput) - iterator accessor in KV cache initialization
  • std::_Rb_tree_const_iterator::_M_const_cast: +181ns (+284% throughput) - backend buffer type management
  • std::vector<string>::_M_realloc_insert: +57ns (+27% throughput) - tensor name vector reallocation during model loading
  • std::vector<buffer_type*>::_M_realloc_insert: +40ns (+19% throughput) - MoE layer buffer override processing

Improvements:

  • std::unique_ptr::operator=: -75ns (-49% throughput) - graph context assignment
  • LLM_TN_IMPL::str: -121ns (-32% throughput) - tensor name generation
  • llama_grammar_parser::c_rules: -114ns (-41% throughput) - grammar rule accessor
  • std::unordered_map<int,int>::operator=: -58ns (-19% throughput) - KV cache layer ID mapping

All changes are compiler optimization artifacts in initialization code. No source code modifications justify the performance differences. Cumulative impact: -89μs in model loading (0.001% of total), +112ns per inference batch (0.0002% of total).

Additional Findings

No modifications to performance-critical operations: matrix multiplication (70-90% of inference time), attention mechanisms, quantization kernels, or GPU backends remain unchanged. Flash Attention enabled for JAIS-2 as optimization. GGML libraries show 0.000% power change, confirming no alterations to tensor operations. The 0.143% power increase in libllama.so represents static code addition (748+ tensor definitions) without runtime overhead.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 11 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17
@loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 on February 28, 2026 02:16
@loci-dev force-pushed the main branch 3 times, most recently from 8c889a6 to 13648e6 on March 2, 2026 02:17
@loci-dev force-pushed the main branch 7 times, most recently from 8019888 to 17452e3 on March 9, 2026 02:17