Conversation

@quic-dhirajku (Contributor):
Added a test script to perform end-to-end finetuning tests for the SFT dataset. Changes for the sequence-completion task still need to be added to the repo. The current run uses the CPU to perform finetuning.

quic-meetkuma and others added 23 commits November 28, 2025 17:09
- Added a logger that logs to both console and file. The code is
similar to the existing QEff finetuning logger code.
- Also added dist_utils, which serves as utility code for
distributed training.
- Added logger test cases for sanity checks.

---------

Signed-off-by: meetkuma <[email protected]>
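A minimal sketch of the kind of console-plus-file logger described above; the logger name and file path are illustrative, not the actual QEff identifiers.

```python
import logging

def build_logger(name: str = "qeff_finetune", log_file: str = "finetune.log") -> logging.Logger:
    """Create a logger that writes to both the console and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        console = logging.StreamHandler()
        console.setFormatter(fmt)
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(fmt)
        logger.addHandler(console)
        logger.addHandler(file_handler)
    return logger

logger = build_logger()
logger.info("Logger initialized for finetuning run.")
```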
…quic#645)

- Added functionality to register dataset, model, optimizer, and trainer
objects in a registry and to fetch the class of a given object based on
the configuration provided.
- Also added simple test cases to verify the functionality.

---------

Signed-off-by: meetkuma <[email protected]>
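A minimal sketch of the registry pattern the commit above describes; the decorator and lookup names are illustrative, not the actual QEff API.

```python
# Hypothetical registry: maps (component_kind, name) -> class.
_REGISTRY: dict[tuple[str, str], type] = {}

def register(kind: str, name: str):
    """Decorator that records a class under a component kind and name."""
    def wrap(cls: type) -> type:
        _REGISTRY[(kind, name)] = cls
        return cls
    return wrap

def get_class(kind: str, name: str) -> type:
    """Fetch the registered class for the kind/name given in the configuration."""
    try:
        return _REGISTRY[(kind, name)]
    except KeyError as exc:
        raise ValueError(f"No {kind} registered under '{name}'") from exc

@register("dataset", "sft")
class SFTDataset:  # placeholder for the real dataset class
    pass

dataset_cls = get_class("dataset", "sft")  # -> SFTDataset
```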
)

Adding a script for registering and retrieving optimizer classes.
The script includes:

- get_optimizer(): returns the optimizer class and its kwargs.

Additionally, there is a test_optimizer.py script that validates the
optimizer registration and retrieval process.

---------

Signed-off-by: Tanisha Chawada <[email protected]>
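A minimal sketch of what a get_optimizer()-style lookup could look like, assuming names map to torch.optim classes; the mapping and default kwargs are illustrative, not the actual QEff defaults.

```python
import torch

# Hypothetical name -> (optimizer class, default kwargs) mapping.
_OPTIMIZERS = {
    "adamw": (torch.optim.AdamW, {"lr": 1e-4, "weight_decay": 0.01}),
    "sgd": (torch.optim.SGD, {"lr": 1e-3, "momentum": 0.9}),
}

def get_optimizer(name: str):
    """Return the optimizer class and its default kwargs for a config name."""
    if name not in _OPTIMIZERS:
        raise ValueError(f"Unknown optimizer: {name}")
    return _OPTIMIZERS[name]

# Usage: instantiate against model parameters.
# opt_cls, opt_kwargs = get_optimizer("adamw")
# optimizer = opt_cls(model.parameters(), **opt_kwargs)
```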
…ong with its test cases. (quic#647)

Edited the SFTDataset class to enable custom dataset loading.
Updated the dataset.py file to support only the SFTDataset type.
Created a test file to check the functionality.

---------

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Adding a script for registering and retrieving callback classes.
It has a create_callback() function which creates an instance of a callback.
Additionally, there is a test_callbacks.py script that validates the
registration and retrieval process.

---------

Signed-off-by: Tanisha Chawada <[email protected]>
)

Added Config_manager to parse the training-, model-, and dataset-related
arguments.

---------

Signed-off-by: Tanisha Chawada <[email protected]>
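A rough sketch of how training-, model-, and dataset-related arguments could be grouped; MasterConfig appears in the tests later in this PR, but the field names and defaults here are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelConfig:
    model_name: str = "meta-llama/Llama-3.2-1B"  # illustrative default
    use_peft: bool = False

@dataclass
class DatasetConfig:
    dataset_name: str = "alpaca"  # illustrative default
    data_path: Optional[str] = None

@dataclass
class TrainingConfig:
    max_train_step: int = 10
    max_eval_step: int = 5
    output_dir: str = "./results"

@dataclass
class MasterConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    dataset: DatasetConfig = field(default_factory=DatasetConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)
```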
Split the test case into functional and loss-assertion parts, and enabled them on CI.
Reference metrics data has been updated to the latest.

---------

Signed-off-by: Ann Kuruvilla <[email protected]>
Signed-off-by: Tanisha <[email protected]>
Co-authored-by: Tanisha <[email protected]>
…LI for CB (quic#646)

InputHandler has changes to create position_ids based on the CB batch size.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
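A rough, generic sketch of building position_ids for a continuous-batching (CB) batch from the attention mask; this is not the actual InputHandler code, and the padding convention is an assumption.

```python
import torch

def make_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    """Build position_ids of shape (batch_size, seq_len): a cumulative count
    of non-padded tokens per sequence, with padded positions set to 0."""
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 0)
    return position_ids

# Example with a CB batch of 3 left-padded sequences (0 = padding).
mask = torch.tensor([[1, 1, 1, 1],
                     [0, 0, 1, 1],
                     [0, 1, 1, 1]])
print(make_position_ids(mask))
# tensor([[0, 1, 2, 3],
#         [0, 0, 0, 1],
#         [0, 0, 1, 2]])
```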
Added step-by-step instructions for adding a custom op in QEff.

---------

Signed-off-by: Rishin Raj <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
Added the torchvision 0.22.0 CPU version to the environment.

Signed-off-by: Rishin Raj <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
This PR updates QEff to support QPC generation on systems without the
Platform SDK by refactoring the module loading behavior. Users can now
compile models and generate QPCs using QEff with only the Apps SDK
installed.

Background: Previously, both the Apps SDK and the Platform SDK were required to
compile and generate QPCs using QEff. The goal is to allow QPC
generation with only the Apps SDK installed, for systems without Ultra
cards.
Changes:
Refactored `__init__.py` and generation/cloud_infer.py to use lazy loading via
importlib for qaicrt and aicapi.
This ensures that Platform SDK-dependent modules are only loaded when
explicitly needed, avoiding import errors during initialization and QPC
generation.

Signed-off-by: Sharvari Medhe <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
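A minimal sketch of the lazy-loading idea, using the module names from the description; the helper function is illustrative, not the actual QEff code.

```python
import importlib

def load_platform_module(name: str):
    """Import a Platform SDK module (e.g. 'qaicrt' or 'aicapi') only when it
    is actually needed, so systems with just the Apps SDK can still compile."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Module '{name}' requires the Platform SDK; "
            "install it to run inference on device."
        ) from exc

# The import error is only raised if/when device execution is requested:
# qaicrt = load_platform_module("qaicrt")
```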
### Memory Optimization

Added periodic memory cleanup to FP16ClipTransform and
SplitTensorsTransform to reduce memory usage during large tensor
processing. Also avoids redundant external data loading when already
present.

### Time Optimized ONNX Transform via Class Merging and Thread Pooling

It merges the FP16 and Split ONNX transform classes into a single
implementation to eliminate redundant tensor loading and iteration.
Additionally, the transform logic has been refactored to use a **thread
pool**, replacing the previous sequential loop to parallelize tensor
operations.

#### Performance Benchmarks

| Model           | Original Duration (s) | Optimized Duration (s) |
|----------------|------------------------|-------------------------|
| LLaMA 3.1 8B    | 88.35                  | 58.55                   |
| LLaMA 3.1 70B   | 1029.82                | 727.37                  |

> **Note:** Thread count is set to `os.cpu_count() * 4` to better handle
I/O-bound workloads. Performance may vary depending on system hardware
and threading capabilities.

---------

Signed-off-by: abhishek-singh591 <[email protected]>
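A simplified sketch of the thread-pool refactor described above, using the same os.cpu_count() * 4 worker heuristic from the note; the per-tensor function is a stand-in for the merged FP16-clip/split logic.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def _transform_tensor(tensor):
    """Placeholder for the merged FP16 clip + split work done per tensor."""
    # ... clip out-of-range FP16 values, record tensors to split, etc.
    return tensor

def transform_all(tensors):
    # I/O-bound workload, hence more threads than cores (as noted in the PR).
    max_workers = (os.cpu_count() or 1) * 4
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_transform_tensor, tensors))
```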
### Objective:

This PR introduces the KV blocking technique for CausalLM models, where
the K/V cache is read and processed block by block in the attention
computation. The number of desired KV blocks is defined at model
initialization in the "from_pretrained" call to export the ONNX with the
required number of KV blocks. As a result, the following changes are
introduced:

### Changes:
1. SoftMax is changed from regular SoftMax to online SoftMax, where a
running maximum and a cumulative denominator are tracked and updated as
each block is processed, retaining mathematical accuracy relative to
regular SoftMax (see the sketch after this commit message).
2. Changes to the CTXGather and CTXGatherCB custom ops to read only one
block's worth of data in each cache gather/read.
3. Changes to the read_only function in QEffDynamicCache to allow reading
the cache block by block rather than the full K/V cache.
4. Generation of the attention mask per block.
5. Changes to the eager_attention_forward implementation in the Llama model
to allow BlockedKV attention and the online SoftMax implementation.
6. Wrapping the num_kv_blocks variable inside qaic_config to keep the
calling style consistent.
7. A new PyTorch transform to pass the num_kv_blocks variable to the
QEffLlamaAttention block.
8. A new constant added for num_kv_blocks.
9. Added tests to switch the BlockedKV feature on and off.

Please review and feel free to suggest changes and tests.

---------

Signed-off-by: Vaibhav Verma <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
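A small numerical sketch of the online SoftMax update described in item 1, independent of the actual attention code: the running maximum, running denominator, and accumulator are rescaled as each K/V block is processed, which reproduces the result of a regular SoftMax over the full context.

```python
import torch

def blocked_attention(q, k, v, block_size):
    """Single-head attention computed block by block with an online softmax.
    q: (1, d), k/v: (ctx, d). Matches regular softmax attention numerically."""
    scale = q.shape[-1] ** -0.5
    m = torch.tensor(float("-inf"))   # running maximum of the scores
    l = torch.tensor(0.0)             # running softmax denominator
    acc = torch.zeros_like(q)         # running weighted sum of V
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ kb.T) * scale                # (1, block)
        m_new = torch.maximum(m, scores.max())
        correction = torch.exp(m - m_new)          # rescale previous stats
        p = torch.exp(scores - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q, k, v = torch.randn(1, 8), torch.randn(32, 8), torch.randn(32, 8)
ref = torch.softmax((q @ k.T) * (8 ** -0.5), dim=-1) @ v
assert torch.allclose(blocked_attention(q, k, v, block_size=8), ref, atol=1e-5)
```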
Adding CB support for VLMs:
1. Llava
2. Llava_Next
3. Gemma3
4. Mistral3
5. InternVL2_5
6. InternVL3_5
7. Molmo

---------

Signed-off-by: Asmita Goswami <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
…ring compilation process (quic#623)

In these changes, instead of passing CCL lists during model loading, a
flag called ccl_enabled specifies whether the CCL feature is enabled, and
passing the CCL lists has been moved to the compilation process.

---------

Signed-off-by: Vahid Janfaza <[email protected]>
Co-authored-by: Hem Agnihotri <[email protected]>
# Support for Diffusers Architecture in Efficient Transformers

## Overview
This pull request introduces **Diffusers architecture support** to the
**Efficient Transformers** framework, enabling seamless integration of
diffusion models.

## Key Highlights
1. **Support for the model
[black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)**
2. **Flexible Configuration**
- Supports JSON-based configuration files for easy compilation and
execution.
3. **Performance Benchmarking**
- Implements a performance matrix for Diffusers models to enable
benchmarking for each module.
4. **Testing Framework**
   - Includes initial test scripts for Diffusers (in progress).
5. **Support for ONNX subfunction graphs via the `use_onnx_function` flag**
6. **Support for parallel compilation of modules via the
`parallel_compile` flag**

---------

Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: tv-karthikeya <[email protected]>
Signed-off-by: vtirumal <[email protected]>
Co-authored-by: tv-karthikeya <[email protected]>
Co-authored-by: Amit Raj <[email protected]>
Co-authored-by: Karthikeya <[email protected]>
# We should use disaggregated serving for the GPT-OSS model for best performance
- GPT-OSS has total_experts/experts_per_tok ratios of 128/4 (120B) and 32/4
- We use a "read all experts exactly once" strategy in the prefill-only
model
- We treat weights as activations, i.e. read only the chosen experts, in
the decode-only model

# Prefill-only model
## Blocking: default behaviour when `prefill_only=True` in the compile API
- NUM_Q_BLOCKS=<int> sets the number of Q blocks in attention
- NUM_FFN_BLOCKS=<int> sets the number of blocks in the FFN
- ENABLE_OPT_SWA=0 or 1 enables/disables optimized SWA. When enabled, only
the valid KVs for a given block are used in attention, reducing MACs
- prefix_caching is not supported in this mode

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the
compile API
- Optimized SWA, i.e. reading only the valid KV as per the diagonal
attention mask, is enabled by default for this version
- This model can be used for prefix_caching by passing
`kv_cache_batch_size=<int>` in the compile API

# Decode-only model
## Retain sliding-window length of KV for sliding-window layers: default
behaviour when `prefill_seq_len=1` in the compile API
- This reduces the amount of DDR used by the model
- CB is enabled for this version: pass `continous_batching=True` in the
`from_pretrained` call and strictly pass `full_batch_size=<int>`, and
optionally `kv_cache_batch_size=<int>` if needed
## Full KV for sliding-window layers: pass `retain_full_kv=True` along
with `prefill_seq_len=1` in the compile API
- This uses more DDR, as we retain ctx_len KV even for sliding-window
layers, but only sliding-window-length KV is read in attention
- CB is enabled for this version: pass `continous_batching=True` in the
`from_pretrained` call and strictly pass `full_batch_size=<int>`, and
optionally `kv_cache_batch_size=<int>` if needed
- This is enabled for the multi-turn chat use case, where we run
prefill -> decode and then use the combined prefill and decode cache to
run prefill again, so we want to retain the full KV for sliding-window
layers


NOTE:
* The decode-only model currently fails compilation with
`use_onnx_subfunctions=True`, so avoid using it there
* The 120B model needs an NPI; there are two versions of the NPI, one with
and one without subfunctions, and both are uploaded here; pass it as
`node_precision_info=<path to file>`
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only
model, otherwise the compilation times are too high. With this, the model
is expected to export and then fail during compile, as it needs the assert
SDK, so the user is expected to run that compilation manually by pasting
the command printed in the error (see the hedged usage sketch below)
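A hedged usage sketch based only on the flags named in the notes above; the class name and model card are assumptions, and the exact parameter spellings should be checked against the actual from_pretrained/compile signatures.

```python
from QEfficient import QEFFAutoModelForCausalLM  # assumed entry point

model_card = "openai/gpt-oss-120b"  # illustrative

# Prefill-only model with chunking (supports prefix caching).
prefill_model = QEFFAutoModelForCausalLM.from_pretrained(model_card)
prefill_model.compile(
    prefill_only=True,
    enable_chunking=True,
    kv_cache_batch_size=4,
    use_onnx_subfunctions=True,      # advised above to keep compile times down
    node_precision_info="<path to NPI file>",
)

# Decode-only model: continuous batching, full KV retained for SWA layers.
decode_model = QEFFAutoModelForCausalLM.from_pretrained(
    model_card,
    continuous_batching=True,        # written as "continous_batching" in the notes above
)
decode_model.compile(
    prefill_seq_len=1,
    retain_full_kv=True,
    full_batch_size=4,
    kv_cache_batch_size=4,
)
```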

---------

Signed-off-by: vbaddi <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Onkar Chougule <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Vinayak Baddi <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Co-authored-by: Mamta Singh <[email protected]>
Added a test script to perform end-to-end finetuning tests for the SFT dataset.
Changes for the sequence-completion task still need to be added to the repo.
The current run uses the CPU to perform finetuning.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
…e directly for loading the LORA adapters instead of manually doing it.

The SFTTrainer class init supports PEFT adapter loading, so that part was removed from the tests.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
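A minimal sketch of letting SFTTrainer handle the LoRA adapter itself via peft_config, matching the argument names that appear in the quoted test code further down in this review; the model and dataset names are illustrative placeholders.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "HuggingFaceTB/SmolLM-135M"  # illustrative small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("trl-lib/Capybara", split="train[:100]")  # illustrative

peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./sft-out", max_steps=5, use_cpu=True),
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,   # SFTTrainer wraps the model with LoRA itself
)
trainer.train()
```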
"""Parametrized tests for different model and dataset configurations."""

@pytest.fixture(autouse=True)
def setup_and_cleanup(self):
Contributor:
Should we use the predefined def setup and def teardown methods? They are executed before and after each test.
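For reference, a quick sketch of those predefined hooks (pytest's setup_method/teardown_method), with a hypothetical per-test temporary output directory:

```python
import shutil
import tempfile

class TestParametrizedConfigurations:
    def setup_method(self, method):
        # Runs before each test method.
        self.output_dir = tempfile.mkdtemp(prefix="ft_test_")

    def teardown_method(self, method):
        # Runs after each test method.
        shutil.rmtree(self.output_dir, ignore_errors=True)
```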

pytest.fail(f"Unknown task type: {task_type}")

# Create configuration
master_config = MasterConfig(
Contributor:
I have a suggestion: the config_manager should have the capability to either
dump the default config to disk at a given location or return a default
config object, which you can use here and in other test cases as well. Let
me know your thoughts.

If we go that way, then we need to extend the config's tests as well. See
whether this can be done in this PR or the next one.

@pytest.mark.parametrize(
"model_name,task_type,max_eval_step,max_train_step,dataset_name,data_path_fixture,use_peft,config_name",
[
pytest.param(
Contributor:
Can we convert these configs and the function's input arguments into dataclass structures, define those dataclasses as constants at the start of the file, and use them here?

"""
from trl import SFTConfig

# # Get data path if fixture is specified
Contributor:
If these are no longer needed, then please remove them.

# data_path = request.getfixturevalue(data_path_fixture)

# Determine auto_class_name based on task type
if task_type == "CAUSAL_LM":
Contributor:
I think there should already be some kind of enum in our code base for this.

Contributor:
By the way, there is already a "type" key in the training section of the config, which is "sft" in most cases. See if we can use the same here.

logger.warning("Trainer instantiated")
# Run Training
logger.warning(f"Starting training for {config_name}...")
train_result = trainer.train()
Contributor:
Add a try/except around this.

logger.warning(f"Training loss: {train_result.training_loss:.4f}")

# Test Inference
if task_type == "CAUSAL_LM":
Contributor:
Instead of having an if/else condition for different task_type values, should we split the tests per task? The code duplication could then be handled by small reusable functions. What do you think?

args=sft_config,
train_dataset=dummy_dataset,
processing_class=tokenizer,
peft_config=peft_config,
Contributor:
Here we expect the SFTTrainer class to convert the model into a PEFT model. So after this, check whether the model has actually been converted into a PEFT model, e.g. trainable parameter count, presence of LoRA weights, etc.
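A possible assertion along these lines, assuming peft's standard "lora_" parameter naming and that trainer is the SFTTrainer instance:

```python
# After SFTTrainer has wrapped the model, verify the PEFT conversion:
trainable = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
total = sum(p.numel() for p in trainer.model.parameters())
assert trainable < total, "Expected only a subset of parameters to be trainable"
assert any("lora_" in name for name, _ in trainer.model.named_parameters()), (
    "Expected LoRA weights to be present after PEFT wrapping"
)
```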

hf_model = HFModel(**model_config)
model = hf_model.load_model()
# Load PEFT Config
peft_config = LoraConfig(peft_model_config)
Contributor:
Try to move this to some utility file.

from QEfficient.utils.logging_utils import logger


class TestParametrizedConfigurations:
Contributor:
These tests are integration tests. In the future we may need to write tests that also check per-step loss values. How would we do that, and how can we reuse this code when we write those comparative tests?

Created a constants.py file for the values as well as the enums mentioned in the comments.
Created certain util functions for modularity; the final changes to utils will be added later on.
TODO: using Registry and ComponentFactory to load every module in the integrated tests is still pending.
Accidentally added test_trainer in the previous commit, so removing it now.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
@quic-meetkuma quic-meetkuma changed the title Ft integrated tests [QEff. Finetune]: Ft integrated tests Jan 6, 2026