Created ReplicateKVHeadTransform to integrate the KV-head replication module within the QEfficient library #625
Conversation
Force-pushed from 1dfdea6 to 3c90390
@ochougul @quic-amitraj please review
Force-pushed from 8c4a1fc to e502542
…dule within the QEfficient library. The Transform enables KV-head replication for both CausalLMs and VLMs. The feature is enabled by passing the n_kv_head_repeat parameter when initializing the QEff wrapper class for the corresponding model; n_kv_head_repeat acts as the multiplier on the original count of KV heads. This operation also updates the config and the hash params of the respective model: num_key_value_heads is set to the new count and a new parameter orig_kv_heads records the original count. This lets us export the same model with different numbers of KV heads without causing a hash conflict. Also added tests for both CausalLMs and VLMs with this functionality to compare outputs of the PyTorch HF model and the AIC model. Two new optional parameters, n_kv_head_repeat and test_kv_replicate, are added for testing purposes. Setting test_kv_replicate to True performs a KV-head replication on every model such that the number of KV heads becomes equal to the number of attention heads. This ensures tests don't fail due to misalignment issues when simply doubling num_key_value_heads would no longer divide num_heads evenly. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
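For context, the core replication step described above can be sketched as follows. This is a minimal illustration, assuming the standard HF layout where k_proj/v_proj weights have shape (num_kv_heads * head_dim, hidden_size); `replicate_kv_proj` is a hypothetical helper name, not the actual QEfficient code:

```python
import torch


def replicate_kv_proj(weight: torch.Tensor, orig_kv_heads: int, repeat: int) -> torch.Tensor:
    # Hypothetical sketch: repeat each KV head's block of projection rows
    # `repeat` times, so orig_kv_heads becomes orig_kv_heads * repeat.
    head_dim = weight.shape[0] // orig_kv_heads
    hidden_size = weight.shape[1]
    expanded = torch.repeat_interleave(
        weight.view(orig_kv_heads, head_dim, hidden_size), repeat, dim=0
    )
    return expanded.reshape(orig_kv_heads * repeat * head_dim, hidden_size)


# Example: 2 KV heads with head_dim 4, n_kv_head_repeat=3 -> 6 KV heads
w = torch.randn(2 * 4, 8)
w_rep = replicate_kv_proj(w, orig_kv_heads=2, repeat=3)
assert w_rep.shape == (6 * 4, 8)
```

Each original head block appears `repeat` times consecutively, which matches how grouped-query attention maps query heads onto shared KV heads.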
… Doing so would prevent any issues during Transforms when we don't wish to apply it. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
…orm. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Force-pushed from 870cc8d to 08032e1
…changes to repeat Bias factor appropriately on quantized layers. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
ochougul left a comment
Write a test that makes sure the ONNX hash is different when different numbers of KV heads are passed.
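A hedged sketch of the kind of test being requested here: the export hash should change whenever a different KV-head count is exported. The real hash comes from QEfficient's hash_params machinery, which is not shown in this thread; a plain SHA-256 over the config dict stands in below just to demonstrate the property being asserted:

```python
import hashlib
import json


def export_hash(config: dict) -> str:
    # Stand-in for the library's export hash: a stable digest of the config.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


# repeat=1 (no replication) vs. repeat=2 on a model with 8 original KV heads
base = {"num_key_value_heads": 8, "orig_kv_heads": 8}
replicated = {"num_key_value_heads": 16, "orig_kv_heads": 8}

assert export_hash(base) != export_hash(replicated)
```

Because num_key_value_heads and orig_kv_heads both land in the hashed config, the same checkpoint exported with different n_kv_head_repeat values cannot collide.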
    # InternVL causes an error if we pass the num_kv_heads_repeat parameter
    num_kv_heads_repeat = kwargs.pop("num_kv_heads_repeat", 1)
    self.model, replicate_kv_transformed = ReplicateKVHeadTransform.apply(self.model, **kwargs)
    if replicate_kv_transformed:
        self.hash_params["config"] = model.config.to_diff_dict()
better add it to _pytorch_transforms if we are always going to call it.
    if replicate_kv_transformed:
        self.lang_model.hash_params["config"] = model.config.to_diff_dict()
        self.vision_model.hash_params["config"] = model.config.to_diff_dict()
don't we already dump config somewhere? in _generate_export_hash?
You can just always add repeat_kv_heads value to self.hash_params which will be 1 if nothing is passed.
        }

    class ReplicateKVHeadTransform:
Make this inherit ModuleMutatorTransform. You may need to implement a mutate method, similar to apply here.
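A rough sketch of the refactor the reviewer is suggesting. The real ModuleMutatorTransform lives in QEfficient's pytorch_transforms and its interface may differ; a minimal stand-in base class is stubbed below purely so the shape of the change is visible, and the match target is illustrative:

```python
import torch.nn as nn


class ModuleMutatorTransform:
    # Minimal stand-in for QEfficient's base class (assumed interface):
    # `apply` walks the model and calls `mutate` on every matching module.
    _match_class: type

    @classmethod
    def apply(cls, model: nn.Module):
        transformed = False
        for module in model.modules():
            if isinstance(module, cls._match_class):
                cls.mutate(module)
                transformed = True
        return model, transformed


class ReplicateKVHeadTransform(ModuleMutatorTransform):
    _match_class = nn.MultiheadAttention  # illustrative; real match is per-model attention

    @classmethod
    def mutate(cls, module):
        # The per-layer KV weight/bias replication currently in `apply`
        # would move here. Placeholder side effect for the sketch:
        module.replicated = True
```

This keeps the traversal logic in the shared base class and leaves only the per-layer mutation in the transform itself.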
@quic-dhirajku Please take up this PR after 595
    layer.bias.data = torch.repeat_interleave(
        layer.bias.data.view(orig_kv_heads, head_dim), repeat, 0
    ).view(new_kv_heads * head_dim)
    if layer.bias is not None:
lines 782-785 are repeated here, please remove
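A deduplicated sketch of the bias handling this comment points at: the repeat should run exactly once, inside the None guard, instead of the block appearing twice. Variable names follow the quoted diff; `repeat_kv_bias` is a hypothetical helper wrapping that logic:

```python
import torch
import torch.nn as nn


def repeat_kv_bias(layer: nn.Linear, orig_kv_heads: int, repeat: int, head_dim: int) -> None:
    # Guard first, repeat once: avoids the duplicated block flagged above.
    if layer.bias is not None:
        new_kv_heads = orig_kv_heads * repeat
        layer.bias.data = torch.repeat_interleave(
            layer.bias.data.view(orig_kv_heads, head_dim), repeat, 0
        ).view(new_kv_heads * head_dim)


# Example: a KV projection with 2 heads of head_dim 4, replicated 2x
layer = nn.Linear(8, 2 * 4, bias=True)
repeat_kv_bias(layer, orig_kv_heads=2, repeat=2, head_dim=4)
assert layer.bias.shape == (16,)
```

Layers without a bias (the common case for many quantized projections) are simply skipped.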