[QEff. Finetune]: Ft integrated tests #694
base: ft_experimental
Conversation
- Added a logger that logs to both console and file. This code is similar to the existing QEff finetuning logger code.
- Also added dist_utils, which serves as utility code when dealing with distributed training.
- Added logger test cases for sanity checks.

--------- Signed-off-by: meetkuma <[email protected]>
…quic#645)

- Added functionality to register dataset, model, optimizer, and trainer objects in a registry and to fetch the class of a given object based on the configuration provided.
- Also added simple test cases to verify the functionality.

--------- Signed-off-by: meetkuma <[email protected]>
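For illustration, the registry pattern described in this commit could look roughly like the sketch below; the class and key names here are placeholders, not the actual QEff API.

```python
# Illustrative registry sketch; the actual QEff implementation may differ.
class Registry:
    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register(self, key):
        def decorator(cls):
            self._classes[key] = cls
            return cls
        return decorator

    def get(self, key):
        if key not in self._classes:
            raise KeyError(f"'{key}' is not registered in the {self.name} registry")
        return self._classes[key]


OPTIMIZER_REGISTRY = Registry("optimizer")


@OPTIMIZER_REGISTRY.register("adamw")
class AdamWFactory:
    """Stand-in for whatever class is registered under this key."""


# Fetch the class for the name given in the training config.
optimizer_cls = OPTIMIZER_REGISTRY.get("adamw")
```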
) Adding a script for registering and retrieving optimizer classes. The script includes get_optimizer(), which returns the optimizer class and kwargs. Additionally, there is a test_optimizer.py script that validates the optimizer registration and retrieval process. --------- Signed-off-by: Tanisha Chawada <[email protected]>
…ong with its test cases. (quic#647) Edited the SFTDataset class to enable custom dataset loading. Updated the dataset.py file to only enable support for SFTDataset type. Created test file to check the functionalities. --------- Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Adding a script for registering and retrieving callback classes. It has a create_callback() function which creates an instance of a callback. Additionally, there is a test_callbacks.py script that validates the functionality and retrieval process. --------- Signed-off-by: Tanisha Chawada <[email protected]>
) Added Config_manager to parse the training, model and dataset related arguments. --------- Signed-off-by: Tanisha Chawada <[email protected]>
Signed-off-by: Tanisha Chawada <[email protected]>
Split the test case into functional and loss-assertion parts, and enabled them on CI. Reference metrics data is updated to the latest. --------- Signed-off-by: Ann Kuruvilla <[email protected]> Signed-off-by: Tanisha <[email protected]> Co-authored-by: Tanisha <[email protected]>
…LI for CB (quic#646) InputHandler has changes to create position_ids based on CB batch size. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Added step-by-step instructions for adding a custom op in QEff --------- Signed-off-by: Rishin Raj <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
Added torchvision 0.22.0 cpu version to environment Signed-off-by: Rishin Raj <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
This PR updates QEff to support QPC generation on systems without the Platform SDK by refactoring the module loading behavior. Users can now compile models and generate QPCs using QEff with only the Apps SDK installed.

Background: Previously, both the Apps SDK and the Platform SDK were required to compile and generate QPCs using QEff. The goal is to allow QPC generation with only the Apps SDK installed, for systems without Ultra cards.

Changes: Refactored `__init__.py` and `generation/cloud_infer.py` to use lazy loading via importlib for qaicrt and aicapi. This ensures that Platform SDK-dependent modules are only loaded when explicitly needed, avoiding import errors during initialization and QPC generation.

Signed-off-by: Sharvari Medhe <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
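For illustration, the lazy-loading approach described above can be sketched as follows; this is a simplified importlib pattern, not the exact QEff code, and the error message is illustrative.

```python
import importlib

_qaicrt = None


def _get_qaicrt():
    """Load the Platform SDK module only when it is actually needed,
    so importing the package does not fail on Apps-SDK-only systems."""
    global _qaicrt
    if _qaicrt is None:
        try:
            _qaicrt = importlib.import_module("qaicrt")
        except ImportError as exc:
            raise ImportError(
                "qaicrt (Platform SDK) is required for execution but is not installed."
            ) from exc
    return _qaicrt
```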
### Memory Optimization

Added periodic memory cleanup to FP16ClipTransform and SplitTensorsTransform to reduce memory usage during large tensor processing. Also avoids redundant external data loading when already present.

### Time Optimized ONNX Transform via Class Merging and Thread Pooling

It merges the FP16 and Split ONNX transform classes into a single implementation to eliminate redundant tensor loading and iteration. Additionally, the transform logic has been refactored to use a **thread pool**, replacing the previous sequential loop to parallelize tensor operations.

#### Performance Benchmarks

| Model | Original Duration (s) | Optimized Duration (s) |
|----------------|------------------------|-------------------------|
| LLaMA 3.1 8B | 88.35 | 58.55 |
| LLaMA 3.1 70B | 1029.82 | 727.37 |

> **Note:** Thread count is set to `os.cpu_count() * 4` to better handle I/O-bound workloads. Performance may vary depending on system hardware and threading capabilities.

--------- Signed-off-by: abhishek-singh591 <[email protected]>
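A rough sketch of the thread-pool pattern described in this commit; the `os.cpu_count() * 4` worker count comes from the note above, and `_transform_tensor` is a placeholder, not the actual QEff transform function.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _transform_tensor(tensor):
    # Placeholder for the per-tensor FP16 clip / split logic.
    return tensor


def transform_all(tensors):
    # The work is I/O-bound, so oversubscribe threads relative to CPU count.
    max_workers = os.cpu_count() * 4
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_transform_tensor, tensors))
```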
### Objective

This PR introduces the KV blocking technique for CausalLM models, where the K/V cache is read and processed block by block in the attention computation. The number of desired KV blocks is defined at model initialization in the "from_pretrained" call to export the ONNX with the required number of KV blocks. As a result, the following changes are introduced:

### Changes

1. SoftMax needs to be changed from regular SoftMax to online SoftMax, where the running maximum and cumulative denominators are tracked and updated once each block is processed, to retain mathematical accuracy compared to regular SoftMax.
2. Changes to CTXGather and CTXGatherCB custom ops to read only 1 block worth of data in each cache gather/read.
3. Changes to the read_only function in QEffDynamicCache to allow reading the cache block by block rather than the full K/V cache.
4. Generation of an attention mask per block.
5. Changes to the eager_attention_forward implementation in the llama model to allow BlockedKV attention and the online SoftMax implementation.
6. Wrapping the num_kv_blocks variable inside qaic_config to keep a consistent calling style.
7. A new PyTorch transform to pass the num_kv_blocks variable to the QEffLlamaAttention block.
8. A new constant added for num_kv_blocks.
9. Added tests to switch the BlockedKV feature on and off.

Please review and feel free to suggest changes and tests.

--------- Signed-off-by: Vaibhav Verma <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
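For reference, the online-SoftMax bookkeeping described in point 1 can be illustrated with a generic PyTorch sketch; this is not the QEff implementation, just the standard running-max/denominator update over K/V blocks.

```python
import torch


def blocked_attention(score_blocks, value_blocks):
    """Online-softmax attention over K/V blocks: a running max and running
    normalizer are updated after each block so the result matches a full softmax."""
    q_shape = score_blocks[0].shape[:-1]        # (..., num_queries)
    head_dim = value_blocks[0].shape[-1]
    running_max = torch.full((*q_shape, 1), float("-inf"))
    running_denom = torch.zeros((*q_shape, 1))
    acc = torch.zeros((*q_shape, head_dim))

    for scores, values in zip(score_blocks, value_blocks):
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, block_max)
        correction = torch.exp(running_max - new_max)   # rescale previous partial sums
        probs = torch.exp(scores - new_max)
        running_denom = running_denom * correction + probs.sum(-1, keepdim=True)
        acc = acc * correction + probs @ values
        running_max = new_max

    return acc / running_denom
```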
Adding CB support for VLMs:

1. Llava
2. Llava_Next
3. Gemma3
4. Mistral3
5. InternVL2_5
6. InternVL3_5
7. Molmo

--------- Signed-off-by: Asmita Goswami <[email protected]> Co-authored-by: Mamta Singh <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
Signed-off-by: Abukhoyer Shaik <[email protected]>
…ring compilation process (quic#623) In these changes, instead of passing CCL lists during model loading, a flag called ccl_enabled specifies whether the CCL feature is enabled, and passing the CCL lists has been moved to the compilation process. --------- Signed-off-by: Vahid Janfaza <[email protected]> Co-authored-by: Hem Agnihotri <[email protected]>
# Support for Diffusers Architecture in Efficient Transformers

## Overview

This pull request introduces **Diffusers architecture support** to the **Efficient Transformers** framework, enabling seamless integration of diffusion models.

## Key Highlights

1. **Support for the model [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell)**
2. **Flexible Configuration** - Supports JSON-based configuration files for easy compilation and execution.
3. **Performance Benchmarking** - Implements a performance matrix for Diffusers models to enable benchmarking of each module.
4. **Testing Framework** - Includes initial test scripts for Diffusers (in progress).
5. **Support for ONNX subfunction graphs using the flag use_onnx_function**
6. **Support for parallel compilation of modules using the flag `parallel_compile`**

--------- Signed-off-by: Amit Raj <[email protected]> Signed-off-by: Amit Raj <[email protected]> Signed-off-by: tv-karthikeya <[email protected]> Signed-off-by: vtirumal <[email protected]> Co-authored-by: tv-karthikeya <[email protected]> Co-authored-by: Amit Raj <[email protected]> Co-authored-by: Karthikeya <[email protected]>
Signed-off-by: abhishek-singh591 <[email protected]>
Signed-off-by: Abukhoyer Shaik <[email protected]>
# We should be using disaggregated serving for the GPT-OSS model for best performance

- The GPT-OSS model has a total_experts/experts_per_tok ratio of 128/4 for 120B and 32/4.
- We use a "read all experts only once, always" strategy in the prefill-only model.
- And we treat weights as activations, meaning we read only the chosen experts, for the decode-only model.

# Prefill-only model

## Blocking: default behaviour when `prefill_only=True` in the compile API

- NUM_Q_BLOCKS=<int> sets the number of Q blocks in attention.
- NUM_FFN_BLOCKS=<int> sets the number of blocks in the FFN.
- ENABLE_OPT_SWA=0 or 1 enables/disables optimized SWA. When enabled, we use only the valid KVs for a given block in attention, reducing MACs.
- prefix_caching is not supported with this mode.

## Chunking: pass `enable_chunking=True` and `prefill_only=True` in the compile API

- Optimized SWA, i.e. reading only valid KV as per the diagonal attention mask, is enabled by default for this version.
- This model can be used for prefix_caching by passing `kv_cache_batch_size=<int>` in the compile API.

# Decode-only model

## Retain sliding-window length of KV for sliding-window layers: default behaviour when `prefill_seq_len=1` in the compile API

- This reduces the amount of DDR used by the model.
- CB is enabled for this version: pass `continous_batching=True` in the `from_pretrained` call and strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed.

## Full KV for sliding-window layers: pass `retain_full_kv=True` along with `prefill_seq_len=1` in the compile API

- This uses more DDR, as we retain ctx_len KV even for sliding-window layers, but only sliding-window-length KV is read in attention.
- CB is enabled for this version: pass `continous_batching=True` in the `from_pretrained` call and strictly pass `full_batch_size=<int>`, and optionally `kv_cache_batch_size=<int>` if needed.
- This is enabled for the use case of multi-turn chat, where we run prefill -> decode and then use the combined prefill and decode cache to run prefill again, so we want to retain the full KV for sliding-window layers.

NOTE:
* The decode-only model currently fails compilation with `use_onnx_subfunctions=True`, so avoid using it.
* The 120B model needs an NPI; there are two versions of the NPI, one with and one without subfunctions, both uploaded here. Pass it as `node_precision_info=<path to file>`.
* It is advised to use `use_onnx_subfunctions=True` with the prefill-only model, otherwise the compilation times are too high. With this, the model is expected to export and then fail during compile, as it needs the assert SDK, so the user should run this compilation manually by pasting the command printed in the error.

--------- Signed-off-by: vbaddi <[email protected]> Signed-off-by: Onkar Chougule <[email protected]> Signed-off-by: Mamta Singh <[email protected]> Signed-off-by: Onkar Chougule <[email protected]> Co-authored-by: Vinayak Baddi <[email protected]> Co-authored-by: Vinayak Baddi <[email protected]> Co-authored-by: Mamta Singh <[email protected]> Co-authored-by: Mamta Singh <[email protected]>
Added a test script to perform end-to-end finetuning tests for the SFT dataset. Need to add changes to the repo for the Seq completion task as well. The current run uses CPU to perform finetuning. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
…e directly for loading the LoRA adapters instead of doing it manually. The SFTTrainer class init supports PEFT adapter loading, so that part was removed from the tests. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
| """Parametrized tests for different model and dataset configurations.""" | ||
|
|
||
| @pytest.fixture(autouse=True) | ||
| def setup_and_cleanup(self): |
Should we use the predefined `setup` and `teardown` methods? These are executed before and after each test.
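For illustration, pytest's xunit-style hooks would look roughly like this (a sketch assuming the class-based test layout used in this PR):

```python
class TestParametrizedConfigurations:
    def setup_method(self, method):
        # Runs before every test in this class: create temp dirs, seed RNGs, etc.
        ...

    def teardown_method(self, method):
        # Runs after every test in this class: delete checkpoints/logs written by the test.
        ...
```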
```python
pytest.fail(f"Unknown task type: {task_type}")

# Create configuration
master_config = MasterConfig(
```
I have a suggestion. The config_manager should have the capability to either dump the default config to disk at a given location or return the default config object, which you could then use here and in other test cases as well. Let me know your thoughts.
If we do it that way, then we need to extend the config's tests as well. See if this can be done in this PR or the next one.
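A rough sketch of what that config_manager capability could look like; the function name, signature, and config keys below are hypothetical:

```python
import json
from pathlib import Path
from typing import Optional


def get_default_config(dump_path: Optional[str] = None) -> dict:
    """Return the default config; optionally also dump it to disk at dump_path."""
    default_config = {
        "model": {"name": "dummy-model"},          # illustrative keys/values only
        "training": {"type": "sft", "max_train_step": 10},
        "dataset": {"name": "dummy-dataset"},
    }
    if dump_path is not None:
        Path(dump_path).write_text(json.dumps(default_config, indent=2))
    return default_config
```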
```python
@pytest.mark.parametrize(
    "model_name,task_type,max_eval_step,max_train_step,dataset_name,data_path_fixture,use_peft,config_name",
    [
        pytest.param(
```
Can we convert these configs and the function's input arguments into a dataclass structure, define those dataclasses as constants at the start of the file, and use them here?
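For example, the parametrize tuples could be grouped into a dataclass declared near the top of the file; this is only a sketch, and the field values shown are placeholders:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneTestConfig:
    model_name: str
    task_type: str
    max_eval_step: int
    max_train_step: int
    dataset_name: str
    data_path_fixture: str
    use_peft: bool
    config_name: str


# Defined once as module-level constants, then reused in pytest.param(...).
CAUSAL_LM_SFT_CONFIG = FinetuneTestConfig(
    model_name="dummy-model",          # illustrative values only
    task_type="CAUSAL_LM",
    max_eval_step=2,
    max_train_step=2,
    dataset_name="dummy-dataset",
    data_path_fixture="sft_data_path",
    use_peft=True,
    config_name="causal_lm_sft",
)
```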
| """ | ||
| from trl import SFTConfig | ||
|
|
||
| # # Get data path if fixture is specified |
If these are no longer needed, please remove them.
```python
# data_path = request.getfixturevalue(data_path_fixture)

# Determine auto_class_name based on task type
if task_type == "CAUSAL_LM":
```
I think there has to be some kind of enum in our code base for this.
Btw, there is already a "type" key in the training section of the config, which is "sft" in most cases. See if we can use the same here.
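If no such enum exists yet, a sketch could look like this; the member names and the auto-class mapping are illustrative, not taken from the code base:

```python
from enum import Enum


class TaskType(str, Enum):
    CAUSAL_LM = "CAUSAL_LM"
    SEQ_COMPLETION = "SEQ_COMPLETION"  # hypothetical second task name


# Usage in the test instead of comparing raw strings:
task_type = "CAUSAL_LM"
if task_type == TaskType.CAUSAL_LM:
    auto_class_name = "AutoModelForCausalLM"  # illustrative value
```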
| logger.warning("Trainer instantiated") | ||
| # Run Training | ||
| logger.warning(f"Starting training for {config_name}...") | ||
| train_result = trainer.train() |
Add a try/except around this.
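For example, something along these lines (a sketch that reuses `trainer`, `config_name`, and the pytest import from the surrounding test):

```python
# Wrap the training call so a failure surfaces as a clear test failure message.
try:
    train_result = trainer.train()
except Exception as exc:
    pytest.fail(f"Training failed for {config_name}: {exc}")
```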
| logger.warning(f"Training loss: {train_result.training_loss:.4f}") | ||
|
|
||
| # Test Inference | ||
| if task_type == "CAUSAL_LM": |
Instead of having an if/else condition for different task types, should we split the tests per task? The code duplication could then be handled properly by having small reusable functions. What do you think?
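One possible shape for that split is sketched below; `build_trainer` and the config fixtures are hypothetical names standing in for the existing setup code:

```python
def _run_finetune(master_config, config_name):
    """Shared helper: build the trainer from the config, run training, return the result."""
    trainer = build_trainer(master_config)   # hypothetical helper wrapping the setup steps
    return trainer.train()


def test_causal_lm_finetune(causal_lm_master_config):
    result = _run_finetune(causal_lm_master_config, "causal_lm_sft")
    assert result.training_loss > 0


def test_seq_completion_finetune(seq_completion_master_config):
    result = _run_finetune(seq_completion_master_config, "seq_completion_sft")
    assert result.training_loss > 0
```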
```python
args=sft_config,
train_dataset=dummy_dataset,
processing_class=tokenizer,
peft_config=peft_config,
```
Here we are expecting the SFTTrainer class to convert the model into a PEFT model. So after this, can we check whether the model has actually been converted into a PEFT model, e.g. trainable params, presence of LoRA weights, etc.?
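A sketch of such a check, assuming the peft package is available and `trainer` is the SFTTrainer instance from the surrounding test:

```python
from peft import PeftModel

# After SFTTrainer wraps the base model, confirm the PEFT conversion actually happened.
assert isinstance(trainer.model, PeftModel)
assert any("lora" in name.lower() for name, _ in trainer.model.named_parameters())

# Only the adapter parameters should be trainable.
trainable = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
total = sum(p.numel() for p in trainer.model.parameters())
assert 0 < trainable < total
```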
```python
hf_model = HFModel(**model_config)
model = hf_model.load_model()
# Load PEFT Config
peft_config = LoraConfig(peft_model_config)
```
Try to move this to a utility file.
```python
from QEfficient.utils.logging_utils import logger


class TestParametrizedConfigurations:
```
These tests are integration tests. In the future we may need to write tests which also check the individual steps' loss values. How would we do that? How can we reuse this code when we write those comparative tests?
Created a constants.py file for the values as well as the enums mentioned in the comments. Created certain util functions for modularity; the final changes to utils will be added later on. TODO: Using Registry and ComponentFactory to load every module in the integrated tests is still left. Accidentally added test_trainer in the previous commit, so removing it now. Signed-off-by: Dhiraj Kumar Sah <[email protected]>