-
Hi,

I've wrapped llama.cpp's llava example in a web server so that I can send multiple requests without incurring the overhead of starting up the app each time. However, I'm not sure how to reset the model state between requests. I currently free and re-create the llama_context on each inference request, but this is still a fairly heavyweight operation. Surely there is a way to clear out the context without having to reallocate all of the memory, load the Metal shaders again (on macOS), etc.? I'm having trouble following the interactive llama code but will keep digging. In the meantime, any pointers or an explanation of what needs to be done would be greatly appreciated!

My code is here: https://github.com/trzy/llava-cpp-server/blob/main/llava_server.cpp

Note that run_llava_thread() calls perform_inference(), which has to create a new llama_context each time. This is what I'm hoping to streamline.

Thank you,
Bart
Replies: 4 comments 4 replies
-
The first release of LLaVA doesn't seem to support interactive mode... it processes one prompt and finishes. My guess is that llava interactive mode will become functional in later releases, and I'm keenly awaiting that. A llava server is also on the ToDo list.
-
Can't guarantee it will work, but I think you just have to call `llama_kv_cache_tokens_rm(ctx, -1, -1);` before every new input.
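To make the suggestion concrete, here is a sketch (not runnable as-is; it assumes llama.cpp from this era plus a loaded model) of how perform_inference() could be restructured: create the model and context once, then clear the KV cache before each request instead of freeing and re-creating the context. `handle_request` and `evaluate_prompt` are hypothetical stand-ins for the server's own request plumbing.

```
// Sketch: create the expensive objects once, outside the request loop.
llama_model   *model = llama_load_model_from_file(model_path, model_params);
llama_context *ctx   = llama_new_context_with_model(model, ctx_params);

while (server_running) {
    Request req = handle_request();         // hypothetical: receive next request

    // Drop all cached tokens so the context starts from a clean state,
    // without reallocating memory or reloading Metal shaders.
    llama_kv_cache_tokens_rm(ctx, -1, -1);

    evaluate_prompt(ctx, req);              // hypothetical: tokenize + eval as before
}

llama_free(ctx);
llama_free_model(model);
```

The design point is that the KV cache (and any sampling state kept outside the context) is the only per-request state that needs resetting; the weights and backend buffers can live for the lifetime of the server.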
-
#3589 also includes an attempt to support LLaVA inference in
This is versatile and flexible enough to allow all sorts of experiments with LLaVA, e.g., image-only input, text-only input, image + text input, text + image + text input, placing the image anywhere you want, etc., making it possible to converse with LLaVA over several turns.
-
For anyone else who stumbles upon this and finds that …
> Can't guarantee it will work, but I think you just have to call
>
> ```
> llama_kv_cache_tokens_rm(ctx, -1, -1);
> ```
>
> before every new input
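A note for readers on newer llama.cpp revisions (an assumption about later versions, not something stated in this thread): the KV-cache helpers were later reworked, and `llama_kv_cache_tokens_rm` was removed in favor of `llama_kv_cache_clear`, which to my understanding resets the cache the same way:

```
// Sketch, assuming a later llama.cpp revision: wipes all cached tokens,
// equivalent to llama_kv_cache_tokens_rm(ctx, -1, -1) in older versions.
llama_kv_cache_clear(ctx);
```

If neither symbol exists in your checkout, grep llama.h for the current cache-management functions, as this part of the API has changed more than once.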