UPSTREAM PR #16831: Model: Minimax M2 by DajanaV · Pull Request #9 · auroralabs-loci/llama.cpp

DajanaV · 2025-10-29T01:33:12Z

Implementation for Minimax M2 - not doing the chat template yet because not sure how to handle the interleaving thinking blocks.

loci-review-dev · 2025-10-29T02:46:17Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Minimax M2 Model Implementation

Key Findings

Performance Degradations

Minimal Impact Across All Metrics:

Response Time: Worst degradation of +0.066% in _Vector_impl_data@plt (7.373 ns vs 7.368 ns)
Throughput: Identical +0.066% degradation in _Tuple_impl@plt for smart pointer management
Bottleneck: Highest degradation of +0.131% in _Construct function for grammar element vector construction (19.666 ns vs 19.640 ns)

Root Cause Analysis:
All degradations occur in PLT (Procedure Linkage Table) stub functions and standard library constructors, indicating symbol table expansion effects rather than algorithmic performance issues.
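
The percentages above can be sanity-checked against the quoted timings. The following is a minimal sketch, assuming the reported figures were derived from unrounded measurements, so values recomputed from the rounded nanosecond timings will differ slightly. The helper name `relative_change_pct` is hypothetical and not part of any tool mentioned here.

```python
def relative_change_pct(base_ns, new_ns):
    """Percentage change of new_ns relative to base_ns."""
    return (new_ns - base_ns) / base_ns * 100.0


# (base, new) timings in nanoseconds, as quoted in the summary above.
timings = {
    "_Vector_impl_data@plt": (7.368, 7.373),
    "_Construct": (19.640, 19.666),
}

for name, (base, new) in timings.items():
    print(f"{name}: {relative_change_pct(base, new):+.3f}%")
```

Recomputed this way, the two changes come out at roughly +0.068% and +0.132%, which is consistent with the reported +0.066% and +0.131% once rounding of the inputs is taken into account.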

Core Function Impact Assessment

No Direct Impact on Critical Components:

Core inference functions (llama_encode, llama_decode, llama_tokenize) remain unaffected
Performance-critical areas (matrix multiplication kernels, attention mechanisms, KV cache management) show no degradation
Memory management functions (llama_memory_* family) maintain baseline performance
Model loading pipeline experiences expected overhead from new architecture support

Power Consumption Analysis

Negligible Energy Impact:

Overall Change: Negligible decrease across all binaries (reported as -0.0%)
libllama.so: 303,377.74 nJ (base: 303,379.20 nJ), a reduction of 1.46 nJ or about 0.0005%
Other libraries: No measurable change in power consumption
Conclusion: New model architecture adds no significant energy overhead to existing functionality

Flame Graph and CFG Analysis

PLT Stub Behavior Confirmed:

Single-level execution: 100% of degradation concentrated in dynamic linking resolution
No functional changes: Identical assembly code between versions
Linear execution pattern: No branching or recursive overhead introduced
Symbol resolution overhead: the rise from 7.368 ns to 7.373 ns (about 0.005 ns) is attributed purely to PLT resolution cost

GitHub Code Review Insights

Well-Structured Implementation:

467 lines added, 0 deleted: Clean addition without breaking changes
Comprehensive architecture support: Full MoE implementation with proper tensor mappings
Robust error handling: Appropriate validation for expert gating functions
Maintainable code patterns: Follows established llama.cpp architectural conventions

No Critical Issues Identified:

ABI compatibility maintained through proper enum additions
Memory management follows existing patterns with expert weight caching
Build system integration properly configured with test coverage

Overall Assessment

Impact Evaluation

Positive Implementation with Minimal Performance Cost:

The 0.066-0.131% performance degradations are negligible and represent acceptable overhead for significant functionality expansion
All degradations occur in infrastructure code (PLT stubs, standard library) rather than core computational paths
The implementation successfully adds support for a 230B parameter MoE model without affecting existing model performance
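
To put the 230B parameter figure in context, raw weight storage can be estimated from the parameter count and the bits used per weight. This is a back-of-the-envelope sketch only: it assumes uniform precision across all weights and ignores quantization block metadata, the KV cache, and activations. The function name `weight_footprint_gb` is hypothetical.

```python
PARAMS = 230e9  # total parameter count quoted in the review


def weight_footprint_gb(params, bits_per_weight):
    """Raw weight storage in gigabytes (10**9 bytes), ignoring metadata."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(PARAMS, bits):.0f} GB")
```

Under these assumptions the weights alone need about 460 GB at 16 bits, 230 GB at 8 bits, and 115 GB at 4 bits per weight.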

Maintainability Considerations

Strong Foundation for Future Development:

Modular architecture: New model support follows established patterns, facilitating future model additions
Clean separation: MoE-specific code isolated in dedicated classes without affecting existing architectures
Comprehensive testing: Tokenizer validation and vocabulary files ensure robust model support
Documentation: Clear tensor mappings and parameter configurations aid future maintenance

Future Performance Considerations

Scalability Factors:

Symbol table growth: Continued architecture additions may compound PLT overhead (current impact minimal)
Memory scaling: 230B parameter models will stress memory bandwidth and allocation patterns
Expert utilization: MoE performance characteristics depend on runtime expert selection efficiency
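
The point about runtime expert selection can be illustrated with a generic top-k router. This is a sketch of the general MoE routing idea, assuming softmax gating with renormalized weights; it is not the gating function used in this PR, and the name `route_top_k` is hypothetical.

```python
import math


def route_top_k(logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    # Softmax over all expert logits, shifted by the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# Four experts, two active per token (illustrative numbers only).
routing = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(routing)
```

Only the selected experts run for a given token, so the cost per token depends on `k` and on how evenly tokens spread across experts rather than on the total expert count.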

Optimization Opportunities:

Static linking: Could eliminate PLT overhead entirely for performance-critical deployments
Link-time optimization: May reduce template instantiation overhead
Memory-mapped weights: Could improve large model loading performance
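
As a generic illustration of the memory-mapping idea in the last item, the sketch below maps a file read-only and reads a small slice without loading the whole file. It uses a small dummy file in place of real model weights and Python's standard `mmap` module; it is not llama.cpp's loader code, and the file name and offsets are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small stand-in "weights" file; a real model file would be far larger.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4096 bytes of dummy data

with open(path, "rb") as f:
    # Map the file read-only; the OS pages data in lazily on first access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Slicing copies only the requested byte range out of the mapping.
        tensor_bytes = view[1024:1040]

print(len(tensor_bytes), tensor_bytes[0])
```

Because pages are faulted in on demand, startup cost scales with the data actually touched rather than with the file size, which is the property the review is pointing at.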

Conclusion

The Minimax M2 implementation represents a high-quality addition to llama.cpp with minimal performance impact. The measured degradations are within acceptable bounds for the significant functionality gained. The implementation maintains code quality standards while adding comprehensive support for a complex MoE architecture, positioning the codebase well for future large model support.

Recommendation: Proceed with the implementation as the benefits significantly outweigh the minimal performance costs.

* FlashAttention (#13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (inlcluding xielu) working * removed unnecesarry checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine 
<reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * Refactor flashattention to increase parallelism, use direct loads for KV in somce cases * Checkpoint * formatting * Update to 
account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness

pwilkin added 4 commits October 29, 2025 00:13

06ed421 Model: Minimax M2
38a4a05 Cleanup
7ef506f Cleanup pt. 2
415a2fc Cleanup pt. 3

DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a, October 29, 2025 12:13

DajanaV added the dev-stale label (Stale dev environment - dashboard not accessible) Oct 30, 2025

DajanaV deleted the branch main October 30, 2025 15:25

DajanaV closed this Oct 30, 2025

DajanaV deleted the upstream-PR16831-branch_pwilkin-minimax-m2 branch October 30, 2025 15:26

DajanaV mentioned this pull request Nov 18, 2025

UPSTREAM PR #17342: Throughput improvement for small batch sizes #248

Open

loci-dev pushed a commit that referenced this pull request Dec 15, 2025

Merge pull request #9 from sfallah/sf/deepseek-ocr-attn

6687b4e

using common build_attn in sam

DajanaV wants to merge 4 commits into main from upstream-PR16831-branch_pwilkin-minimax-m2
