
CLIP (ViT) #315

Merged (32 commits) Jan 31, 2024

Conversation

@gboduljak (Contributor) commented Jan 13, 2024

This is an implementation of CLIP (ViT). Closes #143.

Implemented:

  • CLIP Model + CLIP Loss
  • CLIP Model tests
  • CLIP Vision Encoder tests
  • CLIP Text Encoder tests
  • Partial implementation of CLIPTokenizer
  • Partial implementation of CLIPImageProcessor
  • Initialization
  • convert.py
  • README.md

This PR is ready for review. My only concern is the correctness of the CLIP initialization; I am not sure how to test it.
I would like to generate the HuggingFace weights and configuration after the review. It is already possible to generate and upload them, but I decided to wait because they may change if the code changes.
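For reference, the CLIP loss is the symmetric contrastive objective over the image/text similarity matrix. A minimal MLX sketch with assumed function and argument names (the actual implementation in this PR's model.py may differ):

import mlx.core as mx
import mlx.nn as nn

def clip_loss(image_embeds: mx.array, text_embeds: mx.array, logit_scale: mx.array) -> mx.array:
    # L2-normalize the embeddings so the logits are scaled cosine similarities
    image_embeds = image_embeds / mx.sqrt(mx.sum(image_embeds * image_embeds, axis=-1, keepdims=True))
    text_embeds = text_embeds / mx.sqrt(mx.sum(text_embeds * text_embeds, axis=-1, keepdims=True))
    logits = logit_scale * (image_embeds @ text_embeds.T)
    # The matching image/text pairs lie on the diagonal
    targets = mx.arange(logits.shape[0])
    loss_images = nn.losses.cross_entropy(logits, targets, reduction="mean")
    loss_texts = nn.losses.cross_entropy(logits.T, targets, reduction="mean")
    return (loss_images + loss_texts) / 2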

@gboduljak gboduljak changed the title [Work in progress] CLIP CLIP (ViT) Jan 14, 2024
@gboduljak gboduljak marked this pull request as ready for review January 14, 2024 22:19
@angeloskath (Member) left a comment

@gboduljak this looks amazing! Thanks for the contribution.

I started the review; I haven't finished it yet, but I am submitting the 1st pass and will get back to it later today to review the conversion, preprocessing and hub download parts.

The current comments are mostly stylistic, plus image attribution. So far the PR looks great, so feel free to address the review comments once the 2nd pass of the review is submitted.

Review comments on clip/model.py (6) and clip/cats.jpeg (all outdated, resolved)
@angeloskath (Member) commented

As you mentioned one could use mlx-data for the preprocessing and tokenization steps but we don't currently have a BPETokenizer in MLX Data. Let me know if you feel like tackling that cause I think it would be very useful.

@gboduljak (Contributor, Author) commented

Thank you for the code review and the constructive feedback. I will address your requested changes after your second pass.

@gboduljak (Contributor, Author) commented

As you mentioned one could use mlx-data for the preprocessing and tokenization steps but we don't currently have a BPETokenizer in MLX Data. Let me know if you feel like tackling that cause I think it would be very useful.

This looks interesting to me. Please take a look at clip/preprocessing/tokenizer.py. It is based on the tokenizer from the Stable Diffusion example. I would like to port this to mlx-data.
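For context, the core of a BPE tokenizer is a small greedy merge loop along these lines (a simplified sketch only; the real CLIP tokenizer additionally handles the byte-level vocabulary, end-of-word markers and caching):

def bpe(token: str, merge_ranks: dict) -> list:
    # Repeatedly merge the adjacent pair with the lowest merge rank
    # until no mergeable pair remains.
    parts = list(token)
    while len(parts) > 1:
        ranked = [
            (merge_ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(parts, parts[1:]))
        ]
        rank, i = min(ranked)
        if rank == float("inf"):
            break
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts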

@gboduljak (Contributor, Author) commented

@angeloskath please see ml-explore/mlx-data#36 :)

@mzbac (Contributor) commented Jan 25, 2024

@gboduljak Great work! I can't wait for this to be merged, it will enable us to port Llava-like models to mlx 🚀

@gboduljak (Contributor, Author) commented

@angeloskath Could you take a second look at this PR? I would like to complete this soon :)

@awni (Member) commented Jan 25, 2024

@gboduljak can you rebase on main and resolve the conflict? I can spend some time on this also as I'd like to get it in in the nearish future.

@gboduljak (Contributor, Author) commented

@gboduljak can you rebase on main and resolve the conflict? I can spend some time on this also as I'd like to get it in in the nearish future.

@awni done :)

Review comment on clip/cats.jpeg (outdated, resolved)
@awni (Member) commented Jan 25, 2024

@gboduljak my high-level comment is that there is a lot of code in this example. Examples are intended to be pretty minimal so they can be (1) instructive, (2) easily hackable/customizable, and (3) easy for us to maintain. If we want to include this in mlx-examples, we should aim to reduce the footprint and simplify as much as possible:

  • If that means removing some config options to make it less flexible that is totally fine!
  • If that means hardcoding some model preprocessing steps, that is totally fine!
  • Reducing the number of small files would also be good.

Another path is to make this a separate package external to mlx-examples, which is also great and maybe makes more sense for your goals with this; I am not sure.

If we push to include it in mlx-examples, though, let's aim to simplify as much as possible. Let me know which route you prefer, and if you want to aim to get it into mlx-examples (which I support), are you up for taking a stab at simplifying it a bit?

PS I know some of our examples are not as simple as they should / could be, but those are more the exceptions 😄 and they may be moved out of the examples repo in the future.

@gboduljak (Contributor, Author) commented

Most of the code is preprocessing, which is not really related to the CLIP implementation itself. My goal was to reduce the dependency on transformers, but some things are missing from the existing MLX ecosystem (e.g. a BPETokenizer), which makes this PR massive.

I suggest we remove the preprocessing completely and rely on transformers until the necessary features are implemented in mlx-data. This makes sense because this is a repository of examples. By relying on transformers for preprocessing (i.e. tokenization + image preprocessing), we are left with just a few files (e.g. the CLIP implementation, the CLIP tests and convert.py).
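For illustration, relying on transformers for preprocessing could look roughly like this (a sketch; the checkpoint name and call arguments are only assumptions, not necessarily what this PR ends up using):

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    padding=True,
    return_tensors="np",
)
# inputs["input_ids"] and inputs["pixel_values"] are numpy arrays that can be
# wrapped in mx.array and fed to the MLX CLIP model.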

To save you time, please take a look at the model implementation (model.py, convert.py) and the tests (test.py).

@awni (Member) commented Jan 25, 2024

I suggest we remove preprocessing completely

I'm in favor of removing it; it's all in NumPy anyway, so there is no need to reimplement it here. I don't love having a dependency on Torch, though, but maybe we take it out once we have the image preprocessing in mlx-data.

  • Regarding image preprocessing, just curious: is the image preprocessing in Transformers in Torch or numpy? Also, do you know what is missing to use mlx-data for the image parts? I thought we had most of those functions.

  • Tokenization using Transformers is definitely fine; let's do that. Once we can do BPE (or whatever is needed) in mlx-data, we can update it.

@gboduljak (Contributor, Author) commented

  • Regarding image preprocessing, just curious: is the image preprocessing in Transformers in Torch or numpy? Also, do you know what is missing to use mlx-data for the image parts? I thought we had most of those functions.

I think it is mostly numpy and Pillow. I believe we are missing image normalization in mlx-data, but that is easy to implement anyway.

Let me take a look at this later today and come up with a simplified implementation (e.g. removing the preprocessing and trying to implement something basic in mlx). I will let you know when that is done.

@awni (Member) commented Jan 25, 2024

Awesome thank you!!!

The normalization is pretty easy to do in mlx-data. Here is an example: https://github.com/ml-explore/mlx-examples/blob/main/cifar/dataset.py

@awni (Member) commented Jan 26, 2024

@gboduljak let me know when this is ready for re-review.

@gboduljak (Contributor, Author) commented

@gboduljak let me know when this is ready for re-review.

@awni here is the status update.

I removed the image preprocessing code and, for now, just wrapped the image preprocessing from transformers. I would like to try mlx-data for image preprocessing instead of this 'patchy' solution.

The implementation of the CLIP model, the weight conversion script and the tests are final and I do not intend to change them; they will not change during the image preprocessing re-implementation. Thus, please review model.py, convert.py and test.py.

@gboduljak (Contributor, Author) commented

Hi @awni, @angeloskath :)

@angeloskath I addressed your first pass comments.

This PR is now ready for another review. I removed the existing preprocessing code and implemented a very simple port of CLIPImageProcessor using Pillow and numpy. This means we can drop the dependency on transformers. I decided to keep the dependency because it is useful for the tests, but transformers is not necessary for inference.
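For context, a minimal sketch of what such a Pillow + numpy port might look like (it uses the commonly published CLIP normalization constants; it is not necessarily identical to the code in this PR):

import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    # Resize so that the shortest side equals `size`, using bicubic resampling
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    # Center crop to (size, size)
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))
    # Rescale to [0, 1] and normalize channel-wise
    x = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
    return (x - mean) / std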

I tried using mlx-data for the image processor. In particular, I tried

from functools import partial

import mlx.core as mx
import mlx.data as dx

def normalize(x: mx.array, mean: mx.array, std: mx.array) -> mx.array:
    # Rescale to [0, 1] and apply channel-wise normalization
    x = x.astype("float32") / 255.0
    return (x - mean) / std

mean = mx.array(conf["image_mean"]).reshape((1, 1, 3))
std = mx.array(conf["image_std"]).reshape((1, 1, 3))
norm = partial(normalize, mean=mean, std=std)

dset = (
    # Make a buffer (finite length container of samples) from the python list
    dx.buffer_from_vector([{'image': b'assets/cat.jpeg'}, {'image': b'assets/dog.jpeg'}])
    # Transform the buffer to a stream
    .to_stream()
    # CLIP image pipeline
    .load_image("image")
    .image_resize_smallest_side("image", 224)
    .image_center_crop("image", 224, 224)
    # Accumulate into batches
    .batch(2)
    # Normalize
    .key_transform("image", norm)
    # Finally, fetch batches in background threads
    .prefetch(prefetch_size=2, num_threads=2)
)
cat, dog = [sample["image"] for sample in dset]

However, I was getting outputs very different from the reference CLIPImageProcessor. I suspect this is because CLIPImageProcessor uses bicubic resampling for resizing. I am not sure whether we have a bicubic sampler in mlx-data; if we do, please suggest how to use it.

@chigkim commented Jan 29, 2024

Sorry for the digression, but thank you so much for working on this!
I've developed VOCR, a tool designed to assist blind users in capturing screenshots and extracting information from images.
VOCR has been utilizing only Apple Vision Kit, but I'm currently working on V2 to incorporate multimodal models with various engines such as GPT-4V, Llama.cpp, and Ollama. I can't wait to try MLX with multimodal vision-language models!
Although still in its beta phase, users are finding VOCR helpful for a wide range of purposes. These include accessing information from inaccessible software, analyzing pictures and videos on social media, and obtaining subtitles, among others.
It would be amazing if it becomes possible in the future to fine-tune multimodal models on Macs with Apple silicon!
As a blind person myself, I just wanted to express my appreciation to folks involved in the open-source projects related to multimodal vision-language models!
Thanks so much!

Review comment on clip/README.md (outdated, resolved)
@angeloskath (Member) left a comment

Looks great.

Hopefully we'll be able to remove a bunch of those dependencies (image preprocessing and BPE tokenizer) using mlx-data in the near future :-)

@awni (Member) left a comment

🚀

@awni awni merged commit 9435821 into ml-explore:main Jan 31, 2024
Blaizzy pushed a commit to Blaizzy/mlx-examples that referenced this pull request Mar 13, 2024
* probably approximately correct CLIPTextEncoder

* implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer

* replaced embedding layer with simple matrix

* implemented ViT

* added ViT tests

* fixed tests

* added pooler_output for text

* implemented complete CLIPModel

* implemented init

* implemented convert.py and from_pretrained

* fixed some minor bugs and added the README.md

* removed tokenizer unused comments

* removed unused deps

* updated ACKNOWLEDGEMENTS.md

* Feat: Image Processor for CLIP (ml-explore#1)

@nkasmanoff:
* clip image processor
* added example usage

* refactored image preprocessing

* deleted unused image_config.py

* removed preprocessing port

* added dependency to mlx-data

* fixed attribution and moved photos to assets

* implemented a simple port of CLIPImageProcessor

* review changes

* PR review changes

* renamed too verbose arg

* updated README.md

* nits in readme / conversion

* simplify some stuff, remove unneeded inits

* remove more init stuff

* more simplify

* make test a unit test

* update main readme

* readme nits

---------

Co-authored-by: Noah Kasmanoff <[email protected]>
Co-authored-by: Awni Hannun <[email protected]>