Add Q4_0 quantization support for all models in TornadoVM path #67
base: main
Conversation
Implement complete Q4_0 quantization support following the same pattern as Q8_0.

Core Q4_0 Infrastructure:
- Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization
- Implement Q4_0LayerPlanner base class for all Q4_0 planners
- Add LogitsQ4_0Layer shared across all models
- Update ModelLoader to handle Q4_0 tensor creation and loading

Model-Specific Q4_0 Implementations:
- Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral)
- Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
- Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
- Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers

Factory and Loader Updates:
- Update QuantizationPlannerFactory to route Q4_0 requests to appropriate planners
- Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0

Q4_0 achieves 4x memory compression vs FP16 and 2x vs Q8_0 while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
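For reference, here is a minimal sketch of decoding one such 18-byte block, assuming the GGML Q4_0 convention (per-block FP16 scale, two 4-bit values per byte, stored with a +8 offset); the class and method names are illustrative, not this PR's actual API:

```java
// Illustrative Q4_0 block decoding (GGML convention assumed); this sketch
// is not taken from the PR's Q4_0TornadoTensor implementation.
public final class Q4_0BlockSketch {
    static final int BLOCK_SIZE = 32; // elements per block
    static final int TYPE_SIZE = 18;  // 2-byte FP16 scale + 16 packed bytes

    /** Dequantizes one 18-byte Q4_0 block into 32 floats. */
    static float[] dequantizeBlock(byte[] block) {
        // Bytes 0-1: little-endian FP16 scale factor.
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits); // Java 20+
        float[] out = new float[BLOCK_SIZE];
        // Bytes 2-17: 16 bytes, each packing two 4-bit values.
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            out[i]      = ((b & 0x0F) - 8) * scale; // low nibble -> element i
            out[i + 16] = ((b >>> 4) - 8) * scale;  // high nibble -> element i + 16
        }
        return out;
    }
}
```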
Pull Request Overview
This PR implements comprehensive Q4_0 quantization support for the TornadoVM inference path, following the established pattern from Q8_0 quantization. Q4_0 provides 4-bit quantization with per-block scaling (32 elements per block), achieving 4x memory compression compared to FP16 and 2x compression compared to Q8_0.
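As a back-of-the-envelope check on those ratios (not taken from the PR): a block of 32 FP16 values occupies 32 × 2 = 64 bytes, a Q8_0 block occupies 2 + 32 = 34 bytes, and a Q4_0 block occupies 2 + 16 = 18 bytes, giving 64/18 ≈ 3.6x vs FP16 and 34/18 ≈ 1.9x vs Q8_0, which the description rounds to 4x and 2x.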
Key changes:
- Introduced Q4_0 tensor representation and base planner infrastructure
- Added model-specific Q4_0 layer implementations for Llama, Qwen2, Qwen3, and Phi3
- Updated factory routing and model loaders to accept Q4_0 quantization (see the routing sketch below)
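A minimal sketch of what such factory routing might look like; only the planner class names come from the PR description, while the enum, stubs, and method signature below are assumptions for illustration:

```java
// Hypothetical routing of Q4_0 requests to model-specific planners; the
// real QuantizationPlannerFactory API likely differs.
interface LayerPlanner {}
class LlamaQ4_0LayerPlanner implements LayerPlanner {} // also covers Mistral
class Qwen2Q4_0LayerPlanner implements LayerPlanner {} // also covers DeepSeek R1 Distill
class Qwen3Q4_0LayerPlanner implements LayerPlanner {}
class Phi3Q4_0LayerPlanner implements LayerPlanner {}

enum ModelKind { LLAMA, MISTRAL, QWEN2, DEEPSEEK_R1_DISTILL, QWEN3, PHI3 }

final class PlannerRoutingSketch {
    static LayerPlanner forQ4_0(ModelKind model) {
        return switch (model) {
            case LLAMA, MISTRAL             -> new LlamaQ4_0LayerPlanner();
            case QWEN2, DEEPSEEK_R1_DISTILL -> new Qwen2Q4_0LayerPlanner();
            case QWEN3                      -> new Qwen3Q4_0LayerPlanner();
            case PHI3                       -> new Phi3Q4_0LayerPlanner();
        };
    }
}
```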
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.
| File | Description |
|---|---|
| Q4_0TornadoTensor.java | Core Q4_0 tensor implementation with packed 4-bit values and FP16 scales |
| Q4_0LayerPlanner.java | Base class for all Q4_0 quantized layer planners |
| LogitsQ4_0Layer.java | Shared logits layer for Q4_0 across all models |
| LlamaQ4_0FFNLayers.java | Llama/Mistral Q4_0 FFN implementation |
| Qwen2Q4_0FFNLayers.java | Qwen2/DeepSeek R1 Distill Q4_0 FFN with bias terms |
| Qwen3Q4_0FFNLayers.java | Qwen3 Q4_0 FFN with GQA and RMS normalization |
| Phi3Q4_0FFNLayers.java | Phi3 Q4_0 FFN with combined QKV and gate/up structure |
| *Q4_0LayerPlanner.java (4 files) | Model-specific planner implementations |
| QuantizationPlannerFactory.java | Factory routing for Q4_0 requests to appropriate planners |
| ModelLoader.java | Added Q4_0TornadoTensor loading support |
| *ModelLoader.java (5 files) | Updated model loaders to accept Q4_0 quantization |
```java
package org.beehive.gpullama3.tornadovm.layers.type.q4_0;

import org.beehive.gpullama3.inference.state.State;
import org.beehive.gpullama3.inference.weights.Weights;
```
Copilot AI · Nov 13, 2025
Unused import Weights. This import is not used anywhere in the class since the weights parameter is typed as TornadoWeights in the constructor.
Suggested change:

```diff
- import org.beehive.gpullama3.inference.weights.Weights;
```
```java
import org.beehive.gpullama3.inference.weights.tornado.Qwen2TornadoWeights;
import org.beehive.gpullama3.model.qwen2.Qwen2Configuration;
import org.beehive.gpullama3.tornadovm.kernels.Qwen2Kernels;
import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;
```
Copilot AI · Nov 13, 2025
Unused import Qwen3Kernels. While this class is imported, it's never used in the file. The only kernel reference is to Qwen2Kernels on line 207.
Suggested change:

```diff
- import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;
```
```java
import uk.ac.manchester.tornado.api.WorkerGrid1D;
import uk.ac.manchester.tornado.api.WorkerGrid2D;
```
Copilot AI · Nov 13, 2025
Unused imports WorkerGrid1D and WorkerGrid2D. These are not used in the file since WorkerGrid instances are created using factory methods or the base WorkerGrid class constructor.
Suggested change:

```diff
- import uk.ac.manchester.tornado.api.WorkerGrid1D;
- import uk.ac.manchester.tornado.api.WorkerGrid2D;
```
```java
unifiedLayer.transferToDevice(DataTransferMode.EVERY_EXECUTION,
        qwen3State.positionHolder, qwen3State.temp, qwen3State.tempFFN);

Qwen3State qwen3State = (Qwen3State) state;
```
Copilot AI · Nov 13, 2025
Redundant variable declaration. The variable qwen3State is already declared and initialized as an instance field on line 40. This local variable declaration on line 296 shadows the instance field unnecessarily.
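To illustrate the shadowing pattern this comment refers to, here is a generic sketch; the class structure is hypothetical, and only the variable name comes from the review:

```java
// Generic illustration of a local variable shadowing an instance field;
// not the PR's actual Qwen3 FFN layer class.
class State {}
class Qwen3State extends State {}

class FFNLayersSketch {
    private final Qwen3State qwen3State; // cast once and stored as a field

    FFNLayersSketch(State state) {
        this.qwen3State = (Qwen3State) state;
    }

    void configure(State state) {
        Qwen3State qwen3State = (Qwen3State) state; // redundant: shadows the field
        // Using the field directly avoids the duplicate cast and the shadowing.
    }
}
```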
| .task("qbias", TransformerComputeKernelsLayered::addInPlace, state.wrapQ, weights.q_biasLayered[layerIndex].asFloatArray(), config.dim()) | ||
| .task("kbias", TransformerComputeKernelsLayered::addInPlace, state.wrapK, weights.k_biasLayered[layerIndex].asFloatArray(), config.kvDim()) | ||
| .task("vbias", TransformerComputeKernelsLayered::addInPlace, state.wrapV, weights.v_biasLayered[layerIndex].asFloatArray(), config.kvDim()) | ||
| .task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), |
Copilot AI · Nov 13, 2025
Incorrect kernel usage. The task is using Qwen3Kernels::ropeRotation but this is a Qwen2 implementation. It should use Qwen2Kernels::ropeRotation or verify that the Qwen3 kernel is intentionally being used for Qwen2 models.
| .task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), | |
| .task("rope", Qwen2Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(), |
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```
```java
/**
 * Configure data transfers for first and subsequent layers
 */
```
Copilot AI · Nov 13, 2025
This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.
Suggested change:

```diff
  */
+ @Override
```
```java
/**
 * Configure data transfers for first and subsequent layers
 */
```
Copilot AI · Nov 13, 2025
This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.
Suggested change:

```diff
  */
+ @Override
```
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```
Copilot AI · Nov 13, 2025
This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.
Suggested change:

```diff
+ @Override
```