Prerequisites
Feature Description
Hi, can we know, before loading the model weights, how much VRAM a given configuration will need? For example, something like llama-server -dry that reports how much VRAM is required and tells us beforehand whether our configuration fits into VRAM.
Motivation
I have a slow SSD and little VRAM, so I need to change the -ot and -n-cpu-moe params frequently to fit as much of the model as possible into VRAM.
However, the current behavior only throws an error early if the model weights cannot fit into VRAM. If the weights do fit, loading continues to the KV cache allocation and the compute buffer allocation, and only then do I get an error because there is not enough VRAM left for them.
Knowing this beforehand would be better: I would not have to wait 1 min for the model to load only to find out that my params do not fit.
Possible Implementation
Can we borrow -fit from mainline to implement this feature? Maybe we could also add a -dry param that shows how much VRAM the configuration requires.
I'm not very familiar with C, but I can code. Can anyone guide me on how to implement this feature? I'm happy to make a PR.
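To make the idea concrete, here is a rough sketch in Python of the kind of estimate I have in mind. It is not taken from the existing code base: the names `DryRunEstimate`, `kv_cache_bytes` and `dry_run` are hypothetical, the KV cache formula assumes a standard transformer cache (one K and one V tensor per GPU layer, f16 by default), and the compute buffer size is simply passed in as a guess because I don't know how it is actually computed.

```python
from dataclasses import dataclass


@dataclass
class DryRunEstimate:
    """Hypothetical container for the three allocations named in the issue."""
    weights_bytes: int
    kv_cache_bytes: int
    compute_bytes: int

    @property
    def total_bytes(self) -> int:
        return self.weights_bytes + self.kv_cache_bytes + self.compute_bytes

    def fits(self, free_vram_bytes: int) -> bool:
        return self.total_bytes <= free_vram_bytes


def kv_cache_bytes(n_gpu_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # Assumption: each GPU layer stores one K and one V tensor,
    # each holding n_ctx * n_kv_heads * head_dim elements.
    return 2 * n_gpu_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def dry_run(gpu_weights_bytes: int, n_gpu_layers: int, n_ctx: int,
            n_kv_heads: int, head_dim: int, compute_bytes: int,
            bytes_per_elem: int = 2) -> DryRunEstimate:
    # gpu_weights_bytes: size of the tensors that stay on the GPU after
    # the offload params are applied (assumed to be known from metadata).
    # compute_bytes: caller-supplied guess for the compute buffer.
    kv = kv_cache_bytes(n_gpu_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem)
    return DryRunEstimate(gpu_weights_bytes, kv, compute_bytes)


if __name__ == "__main__":
    GiB = 1024 ** 3
    est = dry_run(gpu_weights_bytes=4 * GiB, n_gpu_layers=32, n_ctx=8192,
                  n_kv_heads=8, head_dim=128, compute_bytes=GiB // 2)
    print(f"weights  : {est.weights_bytes / GiB:.2f} GiB")
    print(f"kv cache : {est.kv_cache_bytes / GiB:.2f} GiB")
    print(f"compute  : {est.compute_bytes / GiB:.2f} GiB")
    print(f"total    : {est.total_bytes / GiB:.2f} GiB")
    print("fits in 6 GiB:", est.fits(6 * GiB))
```

The point is only that all three numbers can in principle be derived from the model metadata and the command-line params without reading the weights from disk, so a dry run could report them in seconds instead of after a full load.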