Wire in KV cache quantization #77

neilmehta24 · 2025-01-16T22:34:29Z

See ml-explore/mlx-examples#1075 for the mlx-lm implementation.

Summary of changes:

Wire kv_bits, kv_group_size, and quantized_kv_start into model_kit, load_model, and create_generator. All three values are optional, so mlx-lm sets the defaults when they are omitted. kv_bits must be provided in order to set the other two parameters.
Disable KV cache quantization for VLMs
Note that max_kv_size is not respected when quantizing the KV cache
Apply a formatting pass
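
For reviewers, the parameter rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_kv_cache_kwargs` and the `is_vlm` flag are hypothetical names, and both raising `ValueError` and silently returning no options for VLMs are assumptions about how the rules might be enforced.

```python
from typing import Optional


def build_kv_cache_kwargs(
    kv_bits: Optional[int] = None,
    kv_group_size: Optional[int] = None,
    quantized_kv_start: Optional[int] = None,
    is_vlm: bool = False,
) -> dict:
    """Hypothetical helper: collect only the KV quantization options that were set."""
    if is_vlm:
        # Assumption: for VLMs the options are dropped rather than rejected.
        return {}
    if kv_bits is None:
        # kv_bits is required before the other two parameters can be used.
        if kv_group_size is not None or quantized_kv_start is not None:
            raise ValueError(
                "kv_bits must be set to use kv_group_size or quantized_kv_start"
            )
        return {}
    # Pass through only what the caller provided, leaving the rest to defaults.
    kwargs = {"kv_bits": kv_bits}
    if kv_group_size is not None:
        kwargs["kv_group_size"] = kv_group_size
    if quantized_kv_start is not None:
        kwargs["quantized_kv_start"] = quantized_kv_start
    return kwargs
```

Omitting unset values from the returned dict, instead of filling in numbers here, keeps the defaults in one place (mlx-lm), which matches the intent described in the summary.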

Closes #31

neilmehta24 added 3 commits January 16, 2025 13:38

first pass

0d78386

formatting

d628450

Update model_kit.py

4bd6893

neilmehta24 marked this pull request as ready for review January 16, 2025 23:06

neilmehta24 requested review from yagil and mattjcly January 16, 2025 23:06

mattjcly approved these changes Jan 17, 2025


neilmehta24 merged commit 6b679ef into lmstudio-ai:main Jan 17, 2025

neilmehta24 deleted the kv-cache-qtn branch January 17, 2025 15:30
