
Multihead arch cutlass int8 qkv average#2035

Draft
almaudoh wants to merge 89 commits into LeelaChessZero:master from almaudoh:multihead-arch-cutlass-int8-qkv-average

Conversation

@almaudoh
Contributor

@almaudoh almaudoh commented Jun 2, 2024

This is a temporary PR to allow testing of a CUDA int8 branch that may not eventually be merged.

ankan-ban and others added 30 commits March 22, 2022 22:28
- skip connection add before layer norm now has a scaling factor (alpha)
- replace conv layer of value and mlh heads with an embedding layer when attention body is used.
- will be removed once it's fixed.
- also fix scratch space calculation.
- factor of sizeof(DataType) was missing.
- to handle bigger/wider networks
1.3% improvement in BT2 on RTX 4090
15.6% improvement in test BT3 network with 64 heads.
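The alpha-scaled skip connection mentioned above can be sketched as follows. This is a minimal illustration, not the PR's actual kernel: the function name and the unscaled (no gamma/beta) layer norm are assumptions for brevity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the skip connection is scaled by alpha before the
// add, and the sum is then layer-normalized.
std::vector<float> skip_add_layernorm(const std::vector<float>& skip,
                                      const std::vector<float>& x,
                                      float alpha, float eps = 1e-5f) {
  std::vector<float> y(x.size());
  // Scaled skip-connection add.
  for (size_t i = 0; i < x.size(); ++i) y[i] = alpha * skip[i] + x[i];
  // Plain layer norm over the vector (learned scale/shift omitted).
  float mean = 0.0f;
  for (float v : y) mean += v;
  mean /= y.size();
  float var = 0.0f;
  for (float v : y) var += (v - mean) * (v - mean);
  var /= y.size();
  const float inv = 1.0f / std::sqrt(var + eps);
  for (float& v : y) v = (v - mean) * inv;
  return y;
}
```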
- only tries doing the KQV dense layers in int8.
- Accuracy seems reasonable.
- Right now quantization isn't fused, and de-quantization is done with bias add.
- Both of the above can possibly be fused with more work.
- Also need to attempt INT8 for other dense layers (MHA dense, FFN1 and FFN2)
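The int8 scheme described in these commit notes can be sketched as below. This is a minimal CPU illustration under assumed conventions (per-tensor symmetric scales, hypothetical function names), not the PR's CUTLASS kernels: activations and weights are quantized to int8, the GEMM accumulates in int32, and de-quantization is done in the epilogue together with the bias add, as the notes describe.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize floats to int8 with a per-tensor scale: q = round(x / scale),
// clamped to the symmetric int8 range.
std::vector<int8_t> quantize(const std::vector<float>& x, float scale) {
  std::vector<int8_t> q(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    float v = std::round(x[i] / scale);
    q[i] = static_cast<int8_t>(std::clamp(v, -127.0f, 127.0f));
  }
  return q;
}

// Naive int8 GEMM, (m x k) * (k x n), accumulating in int32. The
// de-quantization (multiply by the combined scales) is fused with the
// bias add in the epilogue, so no separate de-quant pass is needed.
std::vector<float> gemm_int8_dequant_bias(
    const std::vector<int8_t>& a, const std::vector<int8_t>& b,
    const std::vector<float>& bias, int m, int n, int k,
    float scale_a, float scale_b) {
  std::vector<float> out(m * n);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int32_t acc = 0;
      for (int p = 0; p < k; ++p)
        acc += static_cast<int32_t>(a[i * k + p]) * b[p * n + j];
      // Fused epilogue: de-quantize and add bias in one step.
      out[i * n + j] = acc * scale_a * scale_b + bias[j];
    }
  }
  return out;
}
```

Fusing the quantization step itself (mentioned above as not yet done) would mean producing the int8 inputs inside the preceding kernel's epilogue rather than in a separate pass.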
almaudoh-1 and others added 30 commits March 2, 2024 17:50
…rnels for clipping of inputs for non-int8 inference.


4 participants