
Misc. bug: Q4_0 repacking results in double RAM usage #12149

Open
bartowski1182 opened this issue Mar 2, 2025 · 4 comments
@bartowski1182
Contributor

Name and Version

b4792

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./llama-cli -m microsoft_Phi-4-mini-instruct-Q4_0.gguf

Problem description & steps to reproduce

When loading the model, it uses 4.3 GB of RAM.

When using Q4_K_S (a similar file size), it uses only 2.7 GB of RAM.

First Bad Commit

No response

Relevant log output

@bartowski1182
Contributor Author

Best guess: there's a missing "free" after the weights have been repacked, so the original weights are accidentally kept in memory.

@slaren
Member

slaren commented Mar 2, 2025

Some parts of the model file will remain mapped. The same thing happens when partially offloading a model; in practice it is unlikely to cause issues, because the OS can reclaim that memory if necessary. Disabling mmap with --no-mmap should avoid this.

@bartowski1182
Contributor Author

Ah okay, yes, I see the RAM usage drop when using that option. No performance concerns when using it, I assume? Thanks for the speedy response!
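For reference, this is the suggested fix applied to the command line from the original report (a usage fragment, not output from an actual run):

```shell
# Disable mmap so the repacked weights are the only copy counted in RAM
./llama-cli -m microsoft_Phi-4-mini-instruct-Q4_0.gguf --no-mmap
```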

@slaren
Member

slaren commented Mar 2, 2025

It may affect model loading time, but it will not affect inference performance.
