Add Q4_0 quantization support for all models in TornadoVM path #67
base: main
Conversation
Implement complete Q4_0 quantization support following the same pattern as Q8_0.

Core Q4_0 Infrastructure:
- Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization
- Implement Q4_0LayerPlanner base class for all Q4_0 planners
- Add LogitsQ4_0Layer shared across all models
- Update ModelLoader to handle Q4_0 tensor creation and loading

Model-Specific Q4_0 Implementations:
- Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral)
- Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
- Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
- Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers

Factory and Loader Updates:
- Update QuantizationPlannerFactory to route Q4_0 requests to appropriate planners
- Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0

Q4_0 achieves 4x memory compression vs FP16 and 2x vs Q8_0 while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
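For reference, here is a minimal sketch of decoding one such 18-byte block, assuming the GGML Q4_0 convention (per-block FP16 scale, two 4-bit values per byte, stored with a +8 offset); the class and method names are illustrative, not this PR's actual API:

```java
// Illustrative Q4_0 block decoding (GGML convention assumed); this sketch
// is not taken from the PR's Q4_0TornadoTensor implementation.
public final class Q4_0BlockSketch {
    static final int BLOCK_SIZE = 32; // elements per block
    static final int TYPE_SIZE = 18;  // 2-byte FP16 scale + 16 packed bytes

    /** Dequantizes one 18-byte Q4_0 block into 32 floats. */
    static float[] dequantizeBlock(byte[] block) {
        // Bytes 0-1: little-endian FP16 scale factor.
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits); // Java 20+
        float[] out = new float[BLOCK_SIZE];
        // Bytes 2-17: 16 bytes, each packing two 4-bit values.
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            out[i]      = ((b & 0x0F) - 8) * scale; // low nibble -> element i
            out[i + 16] = ((b >>> 4) - 8) * scale;  // high nibble -> element i + 16
        }
        return out;
    }
}
```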
Pull Request Overview
This PR implements comprehensive Q4_0 quantization support for the TornadoVM inference path, following the established pattern from Q8_0 quantization. Q4_0 provides 4-bit quantization with per-block scaling (32 elements per block), achieving 4x memory compression compared to FP16 and 2x compression compared to Q8_0.
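As a back-of-the-envelope check on those ratios (not taken from the PR): a block of 32 FP16 values occupies 32 × 2 = 64 bytes, a Q8_0 block occupies 2 + 32 = 34 bytes, and a Q4_0 block occupies 2 + 16 = 18 bytes, giving 64/18 ≈ 3.6x vs FP16 and 34/18 ≈ 1.9x vs Q8_0, which the description rounds to 4x and 2x.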
Key changes:
- Introduced Q4_0 tensor representation and base planner infrastructure
- Added model-specific Q4_0 layer implementations for Llama, Qwen2, Qwen3, and Phi3
- Updated factory routing and model loaders to accept Q4_0 quantization (see the routing sketch below)
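A minimal sketch of what such factory routing might look like; only the planner class names come from the PR description, while the enum, stubs, and method signature below are assumptions for illustration:

```java
// Hypothetical routing of Q4_0 requests to model-specific planners; the
// real QuantizationPlannerFactory API likely differs.
interface LayerPlanner {}
class LlamaQ4_0LayerPlanner implements LayerPlanner {} // also covers Mistral
class Qwen2Q4_0LayerPlanner implements LayerPlanner {} // also covers DeepSeek R1 Distill
class Qwen3Q4_0LayerPlanner implements LayerPlanner {}
class Phi3Q4_0LayerPlanner implements LayerPlanner {}

enum ModelKind { LLAMA, MISTRAL, QWEN2, DEEPSEEK_R1_DISTILL, QWEN3, PHI3 }

final class PlannerRoutingSketch {
    static LayerPlanner forQ4_0(ModelKind model) {
        return switch (model) {
            case LLAMA, MISTRAL             -> new LlamaQ4_0LayerPlanner();
            case QWEN2, DEEPSEEK_R1_DISTILL -> new Qwen2Q4_0LayerPlanner();
            case QWEN3                      -> new Qwen3Q4_0LayerPlanner();
            case PHI3                       -> new Phi3Q4_0LayerPlanner();
        };
    }
}
```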
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.
| File | Description |
|---|---|
| Q4_0TornadoTensor.java | Core Q4_0 tensor implementation with packed 4-bit values and FP16 scales |
| Q4_0LayerPlanner.java | Base class for all Q4_0 quantized layer planners |
| LogitsQ4_0Layer.java | Shared logits layer for Q4_0 across all models |
| LlamaQ4_0FFNLayers.java | Llama/Mistral Q4_0 FFN implementation |
| Qwen2Q4_0FFNLayers.java | Qwen2/DeepSeek R1 Distill Q4_0 FFN with bias terms |
| Qwen3Q4_0FFNLayers.java | Qwen3 Q4_0 FFN with GQA and RMS normalization |
| Phi3Q4_0FFNLayers.java | Phi3 Q4_0 FFN with combined QKV and gate/up structure |
| *Q4_0LayerPlanner.java (4 files) | Model-specific planner implementations |
| QuantizationPlannerFactory.java | Factory routing for Q4_0 requests to appropriate planners |
| ModelLoader.java | Added Q4_0TornadoTensor loading support |
| *ModelLoader.java (5 files) | Updated model loaders to accept Q4_0 quantization |
```java
package org.beehive.gpullama3.tornadovm.layers.type.q4_0;

import org.beehive.gpullama3.inference.state.State;
import org.beehive.gpullama3.inference.weights.Weights;
```
Copilot AI · Nov 13, 2025
Unused import Weights. This import is not used anywhere in the class since the weights parameter is typed as TornadoWeights in the constructor.
Suggested change:

```diff
- import org.beehive.gpullama3.inference.weights.Weights;
```
```java
import org.beehive.gpullama3.inference.weights.tornado.Qwen2TornadoWeights;
import org.beehive.gpullama3.model.qwen2.Qwen2Configuration;
import org.beehive.gpullama3.tornadovm.kernels.Qwen2Kernels;
import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;
```
Copilot AI · Nov 13, 2025
Unused import Qwen3Kernels. While this class is imported, it's never used in the file. The only kernel reference is to Qwen2Kernels on line 207.
Suggested change:

```diff
- import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;
```
```java
import uk.ac.manchester.tornado.api.WorkerGrid1D;
import uk.ac.manchester.tornado.api.WorkerGrid2D;
```
Copilot AI · Nov 13, 2025
Unused imports WorkerGrid1D and WorkerGrid2D. These are not used in the file since WorkerGrid instances are created using factory methods or the base WorkerGrid class constructor.
Suggested change:

```diff
- import uk.ac.manchester.tornado.api.WorkerGrid1D;
- import uk.ac.manchester.tornado.api.WorkerGrid2D;
```
```java
unifiedLayer.transferToDevice(DataTransferMode.EVERY_EXECUTION,
        qwen3State.positionHolder, qwen3State.temp, qwen3State.tempFFN);

Qwen3State qwen3State = (Qwen3State) state;
```
Copilot AI · Nov 13, 2025
Redundant variable declaration. The variable qwen3State is already declared and initialized as an instance field on line 40. This local variable declaration on line 296 shadows the instance field unnecessarily.
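To illustrate the shadowing pattern this comment refers to, here is a generic sketch; the class structure is hypothetical, and only the variable name comes from the review:

```java
// Generic illustration of a local variable shadowing an instance field;
// not the PR's actual Qwen3 FFN layer class.
class State {}
class Qwen3State extends State {}

class FFNLayersSketch {
    private final Qwen3State qwen3State; // cast once and stored as a field

    FFNLayersSketch(State state) {
        this.qwen3State = (Qwen3State) state;
    }

    void configure(State state) {
        Qwen3State qwen3State = (Qwen3State) state; // redundant: shadows the field
        // Using the field directly avoids the duplicate cast and the shadowing.
    }
}
```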
| .task("qbias", TransformerComputeKernelsLayered::addInPlace, state.wrapQ, weights.q_biasLayered[layerIndex].asFloatArray(), config.dim()) | ||
| .task("kbias", TransformerComputeKernelsLayered::addInPlace, state.wrapK, weights.k_biasLayered[layerIndex].asFloatArray(), config.kvDim()) | ||
| .task("vbias", TransformerComputeKernelsLayered::addInPlace, state.wrapV, weights.v_biasLayered[layerIndex].asFloatArray(), config.kvDim()) | ||
| .task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), |
Copilot AI · Nov 13, 2025
Incorrect kernel usage. The task is using Qwen3Kernels::ropeRotation but this is a Qwen2 implementation. It should use Qwen2Kernels::ropeRotation or verify that the Qwen3 kernel is intentionally being used for Qwen2 models.
| .task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), | |
| .task("rope", Qwen2Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), |
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```
```java
/**
 * Configure data transfers for first and subsequent layers
 */
```
Copilot AI · Nov 13, 2025
This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.
Suggested change:

```diff
  */
+ @Override
```
```java
/**
 * Configure data transfers for first and subsequent layers
 */
```
Copilot AI · Nov 13, 2025
This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.
Suggested change:

```diff
  */
+ @Override
```
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```