Feature Request: Pre-calculate required VRAM for everything instead of just model weights #1302

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

Hi, can we know, before loading the model weights, how much VRAM is required to run the model with a given configuration? For example, `llama-server -dry` could show how much VRAM is needed and tell us beforehand whether our configuration fits into VRAM.

### Motivation

I have a slow SSD and little VRAM, so I need to change the `-ot` and `-n-cpu-moe` params frequently to fit as much of the model as possible into VRAM.

However, the current behavior only throws an error early if the model weights cannot fit into VRAM. If the weights do fit, loading continues to the KV cache and compute allocations, and I then get an error because there is not enough VRAM left to allocate them.

Knowing this beforehand would be better: I wouldn't have to spend 1 minute waiting for the model to load only to find out that my params don't fit.

### Possible Implementation

Can we borrow `-fit` from mainline to implement this feature? Maybe we could also add a `-dry` param to show how much VRAM the configuration requires.
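To make the idea concrete, here is a rough sketch of what a dry run could report. It is not how llama.cpp or `-fit` actually computes anything. All names (`DryRunConfig`, `kv_cache_bytes`, `dry_run`) are hypothetical, the KV cache formula assumes plain attention with one K and one V tensor per layer, and the compute buffer size is taken as an input because I don't know how to derive it without building the graph.

```python
from dataclasses import dataclass


@dataclass
class DryRunConfig:
    """Hypothetical inputs for a dry-run VRAM estimate (not a real llama.cpp struct)."""
    gpu_weight_bytes: int            # bytes of weights that stay on the GPU after -ot / CPU offload
    n_layer: int                     # layers whose KV cache lives on the GPU
    n_ctx: int                       # context length in tokens
    n_kv_head: int                   # number of KV heads
    head_dim: int                    # dimension per head
    kv_bytes_per_elem: float = 2.0   # assumption: f16 cache; a quantized cache would be smaller
    compute_buffer_bytes: int = 0    # assumption: supplied by the caller, not derived here


def kv_cache_bytes(cfg: DryRunConfig) -> int:
    # One K and one V tensor per layer, each n_ctx * n_kv_head * head_dim elements.
    elems = 2 * cfg.n_layer * cfg.n_ctx * cfg.n_kv_head * cfg.head_dim
    return int(elems * cfg.kv_bytes_per_elem)


def dry_run(cfg: DryRunConfig, free_vram_bytes: int) -> dict:
    # Sum every allocation up front instead of failing after the weights are loaded.
    kv = kv_cache_bytes(cfg)
    total = cfg.gpu_weight_bytes + kv + cfg.compute_buffer_bytes
    return {
        "weights": cfg.gpu_weight_bytes,
        "kv_cache": kv,
        "compute": cfg.compute_buffer_bytes,
        "total": total,
        "fits": total <= free_vram_bytes,
    }


if __name__ == "__main__":
    GiB = 1024 ** 3
    cfg = DryRunConfig(
        gpu_weight_bytes=6 * GiB,
        n_layer=32, n_ctx=4096, n_kv_head=8, head_dim=128,
        compute_buffer_bytes=GiB // 2,
    )
    report = dry_run(cfg, free_vram_bytes=8 * GiB)
    for name, value in report.items():
        print(f"{name}: {value}")
```

The point is only that all three parts (weights, KV cache, compute) get added up and compared against free VRAM before anything is read from disk.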

I'm not very familiar with C, but I am able to code. Can anyone guide me on how I should implement this feature? I'm happy to make a PR.
