Zero-shot image classification predicts a label for an image without any task-specific fine-tuning. A dual-encoder model (CLIP-family, SigLIP) produces an image embedding and a set of per-class text embeddings, then picks the class whose text embedding has the highest cosine similarity with the image embedding. The class vocabulary is defined at inference time rather than at training time, hence zero-shot.
This issue tracks the evaluation work for this kind of model.
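
For reference, the prediction rule described above (pick the class whose text embedding is most cosine-similar to the image embedding) can be sketched as follows. This is only a minimal illustration, not the evaluation code itself: the function names `cosine_similarity` and `zero_shot_classify` are hypothetical, and the three-dimensional vectors are toy values standing in for real encoder outputs.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def zero_shot_classify(
    image_embedding: Sequence[float],
    class_text_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the best-matching label and the per-class similarity scores.

    The label set is whatever the caller passes in, so the vocabulary is
    chosen at inference time.
    """
    if not class_text_embeddings:
        raise ValueError("at least one class is required")
    scores = {
        label: cosine_similarity(image_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumed values, not produced by any real model).
image_embedding = [0.9, 0.1, 0.0]
class_text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

predicted_label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(predicted_label)  # prints "cat"
```

In a real evaluation the embeddings would come from the model's image and text encoders, and the per-class text would typically be a prompt built from the class name; neither step is shown here.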