Upgrade MLX framework, and refactor create_generator to use stream_generate #44
Conversation
```diff
@@ -69,12 +72,15 @@ def process_prompt(
         # disable `prefill_step_size` prompt pre-processing in mlx_lm::generate_step
         generate_args["prefill_step_size"] = float("inf")

-        generate_step_input = self.model.input_ids[0]
+        generate_step_input = self.model.input_ids[None]
```
`[None]` is a more correct implementation: it keeps the whole array (adding a leading batch dimension) instead of taking just the first element.
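The indexing difference can be seen with NumPy, whose indexing semantics mlx arrays follow (the `input_ids` value here is an illustrative stand-in, not data from the PR):

```python
import numpy as np

# Stand-in for self.model.input_ids: a batched token array of shape (1, 4)
input_ids = np.array([[101, 2023, 2003, 102]])

first_row = input_ids[0]      # [0] drops the leading axis -> shape (4,)
with_batch = input_ids[None]  # [None] prepends a new axis -> shape (1, 1, 4)
```

So `[0]` discards the batch dimension, while `[None]` preserves the full array and adds a new leading axis.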
```diff
@@ -210,7 +289,7 @@ def _convert_to_pil(self, images_b64: List[str]):
             PIL.Image.open(BytesIO(base64.b64decode(img))) for img in images_b64 or []
         ]

-    def _custom_resize(self, pil_images, max_size=(1000, 1000)):
+    def _custom_resize(self, pil_images, max_size=(512, 512)):
```
When I was initially testing, I set this to 1000x1000; if I dropped it down to 512, models like qwen2vl would start mis-recognizing text from screenshots of my screen.
Why the change? If it's needed for certain models, can we custom resize just for those models?
I didn't mean to commit this change. I reverted this.
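If per-model resizing were wanted later, one option is a lookup table of overrides with a shared default, as suggested in the review. This is a hypothetical sketch (the helper name, the model id, and the override values are illustrative, not from the PR):

```python
# Default limit used by _custom_resize
DEFAULT_MAX_SIZE = (1000, 1000)

# Per-model overrides; illustrative values only
MODEL_MAX_SIZES = {
    "mlx-community/Molmo-7B-D-0924": (512, 512),
}

def max_size_for(model_id: str) -> tuple:
    """Return the resize limit for a given model, falling back to the default."""
    return MODEL_MAX_SIZES.get(model_id, DEFAULT_MAX_SIZE)
```

The caller would then pass `max_size=max_size_for(model_id)` into `_custom_resize`, so the 1000x1000 default stays intact for models that need full-resolution text.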
MLX package upgrades
Each of these brings new features and improvements to the engine.
MLX LM upgrade
`stream_generate` is a more stable API, and it also includes a wired limit setter. This addresses the issues described in Set wired limit before starting generation #40.

Passing `temp`, `min_p`, and other sampling params into `generate_step` is deprecated; we will use the mlx_lm default sampler method and pass in the user-provided sampling params.

MLX VLM upgrade
This upgrade adds support for two new models, Florence 2 and Molmo.
Florence requires, as an input to the language model, the token that was generated during the previous evaluation. We use the mlx_lm custom sampling capability to store the most recently sampled token as part of `vision_model_wrapper`. Note that this model requires trusting remote code.

Molmo required no special code refactors. The Molmo model can use a lot of memory, so I have been testing with images resized smaller than usual. Note that this model also requires trusting remote code.

generate.py and demo.py updates
`create_generator` is refactored to use `stream_generate`. Defaults are assigned for all of the named arguments, so the caller doesn't have to provide defaults for arguments it may not know about. The `generate_args` option is removed, and all of the configurable parameters are now part of the method signature. The `generate_args` dict for mlx_lm is built up during `create_generator`.
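The shape of that refactor can be sketched as follows. This is a minimal, hypothetical signature (parameter names and defaults are illustrative, not the PR's exact ones): sampling options move into the signature with defaults, and the mlx_lm-style `generate_args` dict is assembled inside the function.

```python
def create_generator(
    prompt_tokens,
    max_tokens=512,
    temp=0.0,
    top_p=1.0,
    min_p=0.0,
    repetition_penalty=None,
):
    # Build the generate_args dict consumed by the mlx_lm generation call.
    generate_args = {
        "max_tokens": max_tokens,
        "temp": temp,
        "top_p": top_p,
        "min_p": min_p,
    }
    if repetition_penalty is not None:
        generate_args["repetition_penalty"] = repetition_penalty
    # In the real code this dict would drive stream_generate; returned here
    # so the sketch is self-contained.
    return generate_args
```

Because every parameter has a default, a caller that only cares about `temp` can pass just that, and the rest of `generate_args` is filled in consistently.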