
Conversation

@mikepapadim
Member

Implement complete Q4_0 quantization support following the same pattern as Q8_0:

Core Q4_0 Infrastructure:

  • Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization
  • Implement Q4_0LayerPlanner base class for all Q4_0 planners
  • Add LogitsQ4_0Layer shared across all models
  • Update ModelLoader to handle Q4_0 tensor creation and loading

Model-Specific Q4_0 Implementations:

  • Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral)
  • Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
  • Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
  • Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers

Factory and Loader Updates:

  • Update QuantizationPlannerFactory to route Q4_0 requests to appropriate planners
  • Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0

Q4_0 achieves roughly 4x memory compression vs FP16 and 2x vs Q8_0 while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
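For reference, the 18-byte block layout described above can be sketched in plain Java. The nibble packing below follows the common GGUF q4_0 convention (low nibbles hold elements 0–15, high nibbles elements 16–31); the class and method names are illustrative only and not part of this PR:

```java
// Illustrative sketch of a Q4_0 block: one FP16 scale (2 bytes)
// plus 32 packed 4-bit values (16 bytes) = 18 bytes per 32 elements.
public class Q4_0Block {
    public static final int BLOCK_SIZE = 32; // elements per block
    public static final int TYPE_SIZE  = 18; // 2-byte FP16 scale + 16 packed bytes

    // Dequantize element i (0..31) of one block. Nibbles are stored with
    // an offset of 8, so the representable range is [-8, 7] * scale.
    public static float dequantize(float scale, byte[] packed, int i) {
        int b = packed[i % 16] & 0xFF;
        int nibble = (i < 16) ? (b & 0x0F) : (b >>> 4);
        return (nibble - 8) * scale;
    }

    public static void main(String[] args) {
        byte[] packed = new byte[16];
        packed[0] = (byte) 0xF9;       // low nibble 9 (element 0), high nibble 15 (element 16)
        System.out.println(dequantize(0.5f, packed, 0));  // (9 - 8) * 0.5 = 0.5
        System.out.println(dequantize(0.5f, packed, 16)); // (15 - 8) * 0.5 = 3.5
        // 32 FP16 values take 64 bytes vs 18 bytes here, i.e. ~3.6x
        // (the "4x" figure above is rounded); Q8_0 blocks take 34 bytes, ~1.9x.
        System.out.println(64.0 / TYPE_SIZE);
    }
}
```

This also shows where the quoted compression ratios come from: they are byte-count ratios per 32-element block, not exact powers of two.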

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot finished reviewing on behalf of mikepapadim November 13, 2025 13:01
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements comprehensive Q4_0 quantization support for the TornadoVM inference path, following the established pattern from Q8_0 quantization. Q4_0 provides 4-bit quantization with per-block scaling (32 elements per block), achieving 4x memory compression compared to FP16 and 2x compression compared to Q8_0.

Key changes:

  • Introduced Q4_0 tensor representation and base planner infrastructure
  • Added model-specific Q4_0 layer implementations for Llama, Qwen2, Qwen3, and Phi3
  • Updated factory routing and model loaders to accept Q4_0 quantization

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 24 comments.

Summary per file:

| File | Description |
| --- | --- |
| Q4_0TornadoTensor.java | Core Q4_0 tensor implementation with packed 4-bit values and FP16 scales |
| Q4_0LayerPlanner.java | Base class for all Q4_0 quantized layer planners |
| LogitsQ4_0Layer.java | Shared logits layer for Q4_0 across all models |
| LlamaQ4_0FFNLayers.java | Llama/Mistral Q4_0 FFN implementation |
| Qwen2Q4_0FFNLayers.java | Qwen2/DeepSeek R1 Distill Q4_0 FFN with bias terms |
| Qwen3Q4_0FFNLayers.java | Qwen3 Q4_0 FFN with GQA and RMS normalization |
| Phi3Q4_0FFNLayers.java | Phi3 Q4_0 FFN with combined QKV and gate/up structure |
| *Q4_0LayerPlanner.java (4 files) | Model-specific planner implementations |
| QuantizationPlannerFactory.java | Factory routing for Q4_0 requests to appropriate planners |
| ModelLoader.java | Added Q4_0TornadoTensor loading support |
| *ModelLoader.java (5 files) | Updated model loaders to accept Q4_0 quantization |


package org.beehive.gpullama3.tornadovm.layers.type.q4_0;

import org.beehive.gpullama3.inference.state.State;
import org.beehive.gpullama3.inference.weights.Weights;

Copilot AI Nov 13, 2025

Unused import Weights. This import is not used anywhere in the class since the weights parameter is typed as TornadoWeights in the constructor.

Suggested change
import org.beehive.gpullama3.inference.weights.Weights;

import org.beehive.gpullama3.inference.weights.tornado.Qwen2TornadoWeights;
import org.beehive.gpullama3.model.qwen2.Qwen2Configuration;
import org.beehive.gpullama3.tornadovm.kernels.Qwen2Kernels;
import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;

Copilot AI Nov 13, 2025

Unused import Qwen3Kernels. While this class is imported, it's never used in the file. The only kernel reference is to Qwen2Kernels on line 207.

Suggested change
import org.beehive.gpullama3.tornadovm.kernels.Qwen3Kernels;

Comment on lines +16 to +17
import uk.ac.manchester.tornado.api.WorkerGrid1D;
import uk.ac.manchester.tornado.api.WorkerGrid2D;

Copilot AI Nov 13, 2025

Unused imports WorkerGrid1D and WorkerGrid2D. These are not used in the file since WorkerGrid instances are created using factory methods or the base WorkerGrid class constructor.

Suggested change
import uk.ac.manchester.tornado.api.WorkerGrid1D;
import uk.ac.manchester.tornado.api.WorkerGrid2D;

unifiedLayer.transferToDevice(DataTransferMode.EVERY_EXECUTION,
qwen3State.positionHolder, qwen3State.temp, qwen3State.tempFFN);

Qwen3State qwen3State = (Qwen3State) state;

Copilot AI Nov 13, 2025

Redundant variable declaration. The variable qwen3State is already declared and initialized as an instance field on line 40. This local variable declaration on line 296 shadows the instance field unnecessarily.

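The shadowing pitfall flagged above is generic Java behavior, independent of this PR. A minimal standalone sketch (class and field names are hypothetical, not taken from the codebase):

```java
// Illustrative sketch of local-variable shadowing of an instance field.
public class ShadowDemo {
    private final String label = "field";

    String readWithShadow() {
        String label = "local"; // shadows the instance field for the rest of this method
        return label;           // returns "local", not the field's value
    }

    String readField() {
        return this.label;      // qualifying with `this` still reaches the field
    }

    public static void main(String[] args) {
        ShadowDemo d = new ShadowDemo();
        System.out.println(d.readWithShadow()); // local
        System.out.println(d.readField());      // field
    }
}
```

Re-declaring the variable compiles cleanly, which is why this kind of redundancy tends to survive until review.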
.task("qbias", TransformerComputeKernelsLayered::addInPlace, state.wrapQ, weights.q_biasLayered[layerIndex].asFloatArray(), config.dim())
.task("kbias", TransformerComputeKernelsLayered::addInPlace, state.wrapK, weights.k_biasLayered[layerIndex].asFloatArray(), config.kvDim())
.task("vbias", TransformerComputeKernelsLayered::addInPlace, state.wrapV, weights.v_biasLayered[layerIndex].asFloatArray(), config.kvDim())
.task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(),

Copilot AI Nov 13, 2025

Incorrect kernel usage. The task is using Qwen3Kernels::ropeRotation but this is a Qwen2 implementation. It should use Qwen2Kernels::ropeRotation or verify that the Qwen3 kernel is intentionally being used for Qwen2 models.

Suggested change
.task("rope", Qwen3Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(),
.task("rope", Qwen2Kernels::ropeRotation,context, state.positionHolder, state.wrapQ, state.wrapK, config.numberOfKeyValueHeads(),

public ImmutableTaskGraph getImmutableTaskGraph() {
return null;
}

Copilot AI Nov 13, 2025

This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.

Suggested change
@Override

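The @Override advice in these comments is standard Java practice: the annotation turns a silently-ignored mismatch into a compile error. A minimal sketch with hypothetical class names (not the actual PR hierarchy):

```java
// Illustrative sketch of why @Override is worth adding on overriding methods.
abstract class Layer {
    abstract String name();
}

class Q4Layer extends Layer {
    @Override
    String name() { return "q4_0"; } // compiler verifies this really overrides Layer.name()

    // Without @Override, a typo such as `String nmae()` would silently
    // declare a brand-new method; with @Override it fails to compile.
}

public class OverrideDemo {
    public static void main(String[] args) {
        Layer l = new Q4Layer();
        System.out.println(l.name()); // q4_0
    }
}
```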

/**
* Configure data transfers for first and subsequent layers
*/

Copilot AI Nov 13, 2025

This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override


/**
* Configure data transfers for first and subsequent layers
*/

Copilot AI Nov 13, 2025

This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override

public ImmutableTaskGraph getImmutableTaskGraph() {
return null;
}

Copilot AI Nov 13, 2025

This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.

Suggested change
@Override

public ImmutableTaskGraph getImmutableTaskGraph() {
return null;
}

Copilot AI Nov 13, 2025

This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.

Suggested change
@Override


3 participants