
@quic-rishinr (Contributor)
Updated version of Adding Compute-Context-Length (CCL) #576
The Compute-Context-Length (CCL) technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. With the current Ahead-Of-Time (AOT) compilation on Qualcomm devices, the number of tokens that will actually be generated is not known in advance, so attention is computed over the full allocated context length; this causes significant throughput drops during both the prefill and decode phases. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that enables dynamic context-length specialization. By generating tokens with smaller, more manageable compute context lengths, we reduce memory reads and attention computation, thereby improving throughput.

Signed-off-by: Vahid Janfaza <[email protected]>
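The bucketing idea described above can be sketched as follows. This is a minimal illustration, not the actual QEfficient implementation: the bucket sizes, function names, and cost model are all hypothetical, and real attention cost depends on heads, head dimension, and batch size.

```python
def select_ccl(position: int, ccl_buckets: list[int], full_ctx: int) -> int:
    """Pick the smallest CCL bucket that still covers the current
    token position; fall back to the full context length if the
    position exceeds every bucket. (Illustrative helper.)"""
    for bucket in sorted(ccl_buckets):
        if position < bucket:
            return bucket
    return full_ctx


def decode_kv_reads(num_tokens: int, ccl_buckets: list[int],
                    full_ctx: int, use_ccl: bool = True) -> int:
    """Toy cost model: total KV-cache entries read across all decode
    steps. Without CCL, every step attends over the full context
    window; with CCL, each step attends only over its bucket."""
    total = 0
    for pos in range(num_tokens):
        window = select_ccl(pos, ccl_buckets, full_ctx) if use_ccl else full_ctx
        total += window
    return total


# Generating 256 tokens against a 4096-token window: with CCL every
# step fits in the smallest (512) bucket, so far fewer KV entries
# are read than when attending over the full context each step.
with_ccl = decode_kv_reads(256, [512, 1024, 2048], 4096)
without_ccl = decode_kv_reads(256, [512, 1024, 2048], 4096, use_ccl=False)
```

In this toy model the saving comes purely from capping the attention window per step, which mirrors the claim in the description that smaller compute context lengths reduce memory reads and attention calculations.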
@quic-rishinr (Contributor, Author)
Duplicate of #576; closing this PR.
