
CLIP (ViT) #315

Merged (32 commits) Jan 31, 2024

Conversation

@gboduljak (Contributor) commented Jan 13, 2024

This is an implementation of CLIP (ViT). Closes #143.

Implemented:

  • CLIP Model + CLIP Loss
  • CLIP Model tests
  • CLIP Vision Encoder tests
  • CLIP Text Encoder tests
  • Partial implementation of CLIPTokenizer
  • Partial implementation of CLIPImageProcessor
  • Initialization
  • convert.py
  • README.md

This PR is ready for review. My only concern is the correctness of the CLIP initialization; I am not sure how to test it.
I would like to generate the HuggingFace weights and configuration after the review. It is already possible to generate and upload them, but I decided to wait because they may change if the code changes.
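For reference, the CLIP loss is the symmetric contrastive objective over the image/text similarity matrix. A minimal MLX sketch with assumed function and argument names (the actual implementation in this PR's model.py may differ):

import mlx.core as mx
import mlx.nn as nn

def clip_loss(image_embeds: mx.array, text_embeds: mx.array, logit_scale: mx.array) -> mx.array:
    # L2-normalize the embeddings so the logits are scaled cosine similarities
    image_embeds = image_embeds / mx.sqrt(mx.sum(image_embeds * image_embeds, axis=-1, keepdims=True))
    text_embeds = text_embeds / mx.sqrt(mx.sum(text_embeds * text_embeds, axis=-1, keepdims=True))
    logits = logit_scale * (image_embeds @ text_embeds.T)
    # The matching image/text pairs lie on the diagonal
    targets = mx.arange(logits.shape[0])
    loss_images = nn.losses.cross_entropy(logits, targets, reduction="mean")
    loss_texts = nn.losses.cross_entropy(logits.T, targets, reduction="mean")
    return (loss_images + loss_texts) / 2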

@gboduljak gboduljak changed the title [Work in progress] CLIP CLIP (ViT) Jan 14, 2024
@gboduljak gboduljak marked this pull request as ready for review January 14, 2024 22:19
@angeloskath (Member) left a comment

@gboduljak this looks amazing! Thanks for the contribution.

I started the review; I haven't finished it yet, but I am submitting the 1st pass and will get back to it later today to review the conversion, preprocessing and hub download parts.

The current comments are mostly stylistic, plus image attribution. So far the PR looks great, so feel free to address the review comments once the 2nd pass of the review is submitted.

Review comments on clip/model.py (6) and clip/cats.jpeg (all outdated, resolved)
@angeloskath (Member) commented

As you mentioned one could use mlx-data for the preprocessing and tokenization steps but we don't currently have a BPETokenizer in MLX Data. Let me know if you feel like tackling that cause I think it would be very useful.

@gboduljak (Contributor, Author) commented

Thank you for the code review and the constructive feedback. I will address your requested changes after your second pass.

@gboduljak (Contributor, Author) commented

As you mentioned one could use mlx-data for the preprocessing and tokenization steps but we don't currently have a BPETokenizer in MLX Data. Let me know if you feel like tackling that cause I think it would be very useful.

This looks interesting to me. Please take a look at clip/preprocessing/tokenizer.py. It is based on the tokenizer from the Stable Diffusion example. I would like to port this to mlx-data.
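For context, the core of a BPE tokenizer is a small greedy merge loop along these lines (a simplified sketch only; the real CLIP tokenizer additionally handles the byte-level vocabulary, end-of-word markers and caching):

def bpe(token: str, merge_ranks: dict) -> list:
    # Repeatedly merge the adjacent pair with the lowest merge rank
    # until no mergeable pair remains.
    parts = list(token)
    while len(parts) > 1:
        ranked = [
            (merge_ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(parts, parts[1:]))
        ]
        rank, i = min(ranked)
        if rank == float("inf"):
            break
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts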

@gboduljak (Contributor, Author) commented

@angeloskath please see ml-explore/mlx-data#36 :)

@mzbac (Contributor) commented Jan 25, 2024

@gboduljak Great work! I can't wait for this to be merged, it will enable us to port Llava-like models to mlx 🚀

@gboduljak (Contributor, Author) commented

@angeloskath Could you take a second look at this PR? I would like to complete this soon :)

@awni (Member) commented Jan 25, 2024

@gboduljak can you rebase on main and resolve the conflict? I can spend some time on this also as I'd like to get it in in the nearish future.

@gboduljak (Contributor, Author) commented

@gboduljak can you rebase on main and resolve the conflict? I can spend some time on this also as I'd like to get it in in the nearish future.

@awni done :)

Review comment on clip/cats.jpeg (outdated, resolved)
@awni (Member) commented Jan 25, 2024

@gboduljak my high-level comment is that there is a lot of code in this example. Examples are intended to be pretty minimal so they can be (1) instructive, (2) easily hackable/customizable, and (3) easy for us to maintain. If we want to include this in mlx-examples, we should aim to reduce the footprint and simplify as much as possible:

  • If that means removing some config options to make it less flexible that is totally fine!
  • If that means hardcoding some model preprocessing steps, that is totally fine!
  • Reducing the number of small files would also be good.

Another path is to make this a separate package external to mlx-examples, which is also great and maybe makes more sense for your goals with this; I am not sure.

If we push to include it in mlx-examples, though, let's aim to simplify as much as possible. Let me know which route you prefer, and if you want to aim to get it into mlx-examples (which I support), are you up for taking a stab at simplifying it a bit?

PS I know some of our examples are not as simple as they should / could be, but those are more the exceptions 😄 and they may be moved out of the examples repo in the future.

@gboduljak (Contributor, Author) commented

Most of the code is preprocessing, which is not really related to the CLIP implementation itself. My goal was to reduce the dependency on transformers, but some things are missing from the existing MLX ecosystem (e.g. a BPETokenizer), which makes this PR massive.

I suggest we remove the preprocessing completely and rely on transformers until the necessary features are implemented in mlx-data. This makes sense because this is a repository of examples. By relying on transformers for preprocessing (i.e. tokenization + image preprocessing), we are left with just a few files (e.g. the CLIP implementation, the CLIP tests and convert.py).
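For illustration, relying on transformers for preprocessing could look roughly like this (a sketch; the checkpoint name and call arguments are only assumptions, not necessarily what this PR ends up using):

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[Image.open("assets/cat.jpeg"), Image.open("assets/dog.jpeg")],
    padding=True,
    return_tensors="np",
)
# inputs["input_ids"] and inputs["pixel_values"] are numpy arrays that can be
# wrapped in mx.array and fed to the MLX CLIP model.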

To save you time, please take a look at the model implementation (model.py, convert.py) and the tests (test.py).

@awni (Member) commented Jan 25, 2024

I suggest we remove preprocessing completely

I'm in favor of removing it; it's all in NumPy anyway, so there is no need to reimplement it here. I don't love having a dependency on Torch, though, but maybe we take it out once we have the image preprocessing in mlx-data.

  • Regarding image preprocessing, just curious: is the image preprocessing in Transformers in Torch or numpy? Also, do you know what is missing to use mlx-data for the image parts? I thought we had most of those functions.

  • Tokenization using Transformers is definitely fine; let's do that. Once we can do BPE (or whatever is needed) in mlx-data, we can update it.

@gboduljak (Contributor, Author) commented

  • Regarding image preprocessing, just curious: is the image preprocessing in Transformers in Torch or numpy? Also, do you know what is missing to use mlx-data for the image parts? I thought we had most of those functions.

I think it is mostly numpy and Pillow. I believe we are missing image normalization in mlx-data, but that is easy to implement anyway.

Let me take a look at this later today and come up with a simplified implementation (e.g. removing the preprocessing and trying to implement something basic in mlx). I will let you know when that is done.

@awni (Member) commented Jan 25, 2024

Awesome thank you!!!

The normalization is pretty easy to do in mlx-data. Here is an example: https://github.com/ml-explore/mlx-examples/blob/main/cifar/dataset.py

@awni (Member) commented Jan 26, 2024

@gboduljak let me know when this is ready for re-review.

@gboduljak (Contributor, Author) commented

@gboduljak let me know when this is ready for re-review.

@awni here is the status update.

I removed the image preprocessing code and, for now, just wrapped the image preprocessing from transformers. I would like to try mlx-data for image preprocessing instead of this 'patchy' solution.

The implementation of the CLIP model, the weight conversion script and the tests are final and I do not intend to change them; they will not change during the image preprocessing re-implementation. Thus, please review model.py, convert.py and test.py.

@gboduljak (Contributor, Author) commented

Hi @awni, @angeloskath :)

@angeloskath I addressed your first pass comments.

This PR is now ready for another review. I removed the existing preprocessing code and implemented a very simple port of CLIPImageProcessor using Pillow and numpy. This means we can drop the dependency on transformers. I decided to keep the dependency because it is useful for the tests, but transformers is not necessary for inference.
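For context, a minimal sketch of what such a Pillow + numpy port might look like (it uses the commonly published CLIP normalization constants; it is not necessarily identical to the code in this PR):

import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    # Resize so that the shortest side equals `size`, using bicubic resampling
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    # Center crop to (size, size)
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))
    # Rescale to [0, 1] and normalize channel-wise
    x = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
    return (x - mean) / std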

I tried using mlx-data for the image processor. In particular, I tried

from functools import partial

import mlx.core as mx
import mlx.data as dx

def normalize(x: mx.array, mean: mx.array, std: mx.array) -> mx.array:
    # Rescale to [0, 1] and apply channel-wise normalization
    x = x.astype("float32") / 255.0
    return (x - mean) / std

mean = mx.array(conf["image_mean"]).reshape((1, 1, 3))
std = mx.array(conf["image_std"]).reshape((1, 1, 3))
norm = partial(normalize, mean=mean, std=std)

dset = (
    # Make a buffer (finite length container of samples) from the python list
    dx.buffer_from_vector([{'image': b'assets/cat.jpeg'}, {'image': b'assets/dog.jpeg'}])
    # Transform the buffer to a stream
    .to_stream()
    # CLIP image pipeline
    .load_image("image")
    .image_resize_smallest_side("image", 224)
    .image_center_crop("image", 224, 224)
    # Accumulate into batches
    .batch(2)
    # Normalize
    .key_transform("image", norm)
    # Finally, fetch batches in background threads
    .prefetch(prefetch_size=2, num_threads=2)
)
cat, dog = [sample["image"] for sample in dset]

However, I was getting outputs very different from the reference CLIPImageProcessor. I suspect this is because CLIPImageProcessor uses bicubic resampling for resizing. I am not sure whether we have a bicubic sampler in mlx-data; if we do, please suggest how to use it.

@chigkim commented Jan 29, 2024

Sorry for the digression, but thank you so much for working on this!
I've developed VOCR, a tool designed to assist blind users in capturing screenshots and extracting information from images.
VOCR has been utilizing only Apple Vision Kit, but I'm currently working on V2 to incorporate multimodal models with various engines such as GPT-4V, Llama.cpp, and Ollama. I can't wait to try MLX with multimodal vision-language models!
Although still in its beta phase, users are finding VOCR helpful for a wide range of purposes. These include accessing information from inaccessible software, analyzing pictures and videos on social media, and obtaining subtitles, among others.
It would be amazing if it becomes possible in the future to fine-tune multimodal models on Macs with Apple silicon!
As a blind person myself, I just wanted to express my appreciation to folks involved in the open-source projects related to multimodal vision-language models!
Thanks so much!

Review comment on clip/README.md (outdated, resolved)
@angeloskath (Member) left a comment

Looks great.

Hopefully we'll be able to remove a bunch of those dependencies (image preprocessing and BPE tokenizer) using mlx-data in the near future :-)

@awni (Member) left a comment

🚀

@awni awni merged commit 9435821 into ml-explore:main Jan 31, 2024
Blaizzy pushed a commit to Blaizzy/mlx-examples that referenced this pull request Mar 13, 2024
* probably approximately correct CLIPTextEncoder

* implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer

* replaced embedding layer with simple matrix

* implemented ViT

* added ViT tests

* fixed tests

* added pooler_output for text

* implemented complete CLIPModel

* implemented init

* implemented convert.py and from_pretrained

* fixed some minor bugs and added the README.md

* removed tokenizer unused comments

* removed unused deps

* updated ACKNOWLEDGEMENTS.md

* Feat: Image Processor for CLIP (ml-explore#1)

@nkasmanoff:
* clip image processor
* added example usage

* refactored image preprocessing

* deleted unused image_config.py

* removed preprocessing port

* added dependency to mlx-data

* fixed attribution and moved photos to assets

* implemented a simple port of CLIPImageProcessor

* review changes

* PR review changes

* renamed too verbose arg

* updated README.md

* nits in readme / conversion

* simplify some stuff, remove unneeded inits

* remove more init stuff

* more simplify

* make test a unit test

* update main readme

* readme nits

---------

Co-authored-by: Noah Kasmanoff <[email protected]>
Co-authored-by: Awni Hannun <[email protected]>