Feature Request: Pre-calculate required VRAM for everything instead of just model weight #1302

@chulucninh09

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hi, can we know, before loading the model weights, how much VRAM is required to run the model with a given configuration? For example, something like `llama-server -dry` could report how much VRAM is needed and tell us beforehand whether our configuration fits into VRAM or not.

Motivation

I have a slow SSD and low VRAM, so I have to change the `-ot` and `-n-cpu-moe` params frequently to fit as much of the model as possible into VRAM.

However, the current behavior only throws an error early if the model weights cannot fit into VRAM. If the weights do fit, loading proceeds to the KV cache and compute buffer allocations, where I then get an error because there is not enough VRAM left for those.

Knowing this beforehand would be much better: I wouldn't have to spend a minute waiting for the model to load only to find out that my params don't fit.

Possible Implementation

Can we borrow `-fit` from mainline to implement this feature? And maybe we can add a `-dry` param that reports how much VRAM the configuration requires.

I'm not that familiar with C, but I can code. Can anyone guide me on how to implement this feature? I'd be happy to open a PR.

Metadata

Labels

enhancement (New feature or request)
