Support for searching with query image #27

Open
nuschandra opened this issue Sep 27, 2024 · 9 comments

Comments

@nuschandra

nuschandra commented Sep 27, 2024

Hi @bclavie & team,

I currently don't see support for searching an index with a query image instead of a text query. I understand that there is an encode_image option, but it only returns the embeddings of the query image; it does not run a full search over the indexed documents with MaxSim scoring. It would be really nice to have support for querying with an image too.
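For illustration, the missing step described above (taking a query image's embeddings and ranking indexed pages by MaxSim) could look roughly like the sketch below. This is not the library's implementation: `maxsim_score`, `search_by_embedding`, and the dict-based index are hypothetical names, and the toy arrays only stand in for what `encode_image` and the index would provide, assuming unit-normalized multi-vector embeddings of shape (n_vectors, dim).

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between two multi-vector embeddings.

    query_emb has shape (n_query_vectors, dim); doc_emb has shape (n_doc_vectors, dim).
    """
    sim = query_emb @ doc_emb.T          # (n_query_vectors, n_doc_vectors)
    return float(sim.max(axis=1).sum())  # best doc vector per query vector, summed


def search_by_embedding(query_emb, indexed_embs, k=3):
    """Rank indexed pages by MaxSim against a query embedding.

    indexed_embs is a hypothetical dict: page id -> (n_vectors, dim) array.
    Returns the top-k (page_id, score) pairs, highest score first.
    """
    scored = [(pid, maxsim_score(query_emb, emb)) for pid, emb in indexed_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy unit vectors standing in for real image embeddings (assumption).
index = {
    "page_a": np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]),
    "page_b": np.array([[0.0, 0.0, 0.0, 1.0], [0.6, 0.8, 0.0, 0.0]]),
}
query = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
results = search_by_embedding(query, index, k=2)
print(results)
```

In this toy case the query vectors coincide with two vectors of page_a, so page_a ranks first. With real embeddings the same ranking loop would apply, but the quality of the results depends on how well the model handles image-to-image comparison.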

@nuschandra
Author

@bclavie If you think this would be a useful feature, I'd be happy to contribute and raise a PR for it.

@bclavie
Contributor

bclavie commented Oct 3, 2024

Hey! It'd actually be a completely experimental feature since it's not even done in the paper, but I'd be happy to include it under a beta flag if you would like to contribute it!

@nuschandra
Author

@bclavie Thanks for your response! Sure, yes, I understand. The logic stays the same: process_images would return pixel_values and input_ids for the prompt, just as it does during indexing. When searching by image, we make the forward call with both pixel_values and input_ids and get the embeddings, which can then be used for the MaxSim calculations. I will make the code changes later this week and raise a PR.

@sergenerbay

sergenerbay commented Oct 27, 2024

Hi,
I need the same feature. Have you completed the code?

@VictorUceda

I’m also interested in image queries. @nuschandra, were you able to make progress on this modification? Thank you in advance for all your hard work!

@sergenerbay

I worked on image-based RAG. First, I modified the search function in ragmodel.py to compare image similarities and attempted to search images. However, I was not successful because the ColPali model structure is designed to find similarities between a PDF page and an image. The images occupied only a small part of each PDF page, so the searches failed. To address this, I extracted the images from the PDF and created a dedicated RAG database for them. With this approach, I achieved significantly higher accuracy in image searching.

@VictorUceda

Thank you for your response! Would you mind sharing the code modifications you made?
I plan to use this model to compare layout similarities between two document pages. I still have a lot to learn about this model, but I believe it could work effectively for my use case.

@nuschandra
Author

Hi All,

Sorry for the delayed response. I was held up with a few personal commitments over the last month and couldn't get back to this.

I did make the code changes for this, but the performance was not particularly great. That is consistent with what @bclavie mentioned earlier: ColPali was not designed for image-to-image similarity. It might work for certain specific use cases, but in general it is not designed for that. If any of you think it would be useful, I can revisit this.

@VictorUceda

I am particularly interested in finding similarities between document pages (layouts, logos, and text within the document) rather than focusing on the images embedded in the PDF. I believe ColPali could work well for this specific use case, don't you think?

Locally, I’ve modified the code to enable searching with an image (extracted from a PDF page) by calling encode_image instead of encode_query, and I continued using processor.score as usual. However, the similarity calculation doesn't seem fully accurate: it appears to be weighted rather than patch-based. A patch-based approach would be more useful, as it could detect similarities across different parts of the page.

I need to continue exploring alternative scoring methods and gain a deeper understanding of the project. Any comments on your tests or help with code you’ve already experimented with would be greatly appreciated. In any case, thank you for all the work you’ve done so far!
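One way to explore the patch-based idea mentioned above is to keep the per-patch MaxSim terms instead of only their sum. The sketch below is an assumption-laden illustration, not what processor.score is confirmed to do: `patch_matches` is a hypothetical helper, and the small arrays stand in for unit-normalized patch embeddings of a query page and an indexed page.

```python
import numpy as np


def patch_matches(query_emb: np.ndarray, doc_emb: np.ndarray):
    """Per-patch view of MaxSim.

    For each query vector, return its highest similarity against the document
    and the index of the document vector that produced it.
    """
    sim = query_emb @ doc_emb.T        # (n_query_vectors, n_doc_vectors)
    best_doc_idx = sim.argmax(axis=1)  # which doc patch matched each query patch
    best_scores = sim.max(axis=1)      # how strong that match was
    return best_scores, best_doc_idx


# Toy unit vectors standing in for real patch embeddings (assumption).
query = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
doc = np.array([[0.0, 1.0], [1.0, 0.0]])

scores, idx = patch_matches(query, doc)
total = scores.sum()  # the usual summed MaxSim score
mean = scores.mean()  # variant normalized by the number of query patches
print(scores, idx, total, mean)
```

The per-patch scores show which parts of the query page found a strong counterpart, and the mean makes pages with different patch counts easier to compare. Mapping the matched indices back to page regions would depend on the processor's patch grid, which this sketch does not model.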
