
Five Minute Summary

CLIP is a pre-trained model that tells you how well a given image and a given text caption fit together. It was introduced in the OpenAI paper “Learning Transferable Visual Models From Natural Language Supervision” (2021) and trained contrastively on 400 million web-scraped image-caption pairs, one of the first models to be trained this way at that scale. It is useful because the pre-trained model can be reused for many downstream tasks: instead of just associating an image with a class label out of a fixed set of class labels, it can associate an image with a text caption containing any words from the English language.

🏋🏻 Training

Input: A batch of image-caption (text) pairs, with every image and every caption encoded as a vector.

Output: The cosine-similarity scores between every image vector and every caption vector in the batch.

Objective function: A contrastive loss that updates the model's weights so that correct image-caption pairs get high similarity scores and incorrect pairs get low similarity scores.

Note: during training, the model is fed a huge batch of image-text pairs at once (e.g., 20,000 pairs). Each such batch contains 20,000*20,000 = 400,000,000 possible image-text combinations, of which only 20,000 are correct pairs. For efficient processing, the similarity scores of all possible combinations are computed at once, yielding a 20,000 by 20,000 matrix whose diagonal holds the scores for the correct image-text pairs. The objective function can then maximize the scores on the diagonal and minimize all scores off the diagonal. See the figure below.

Overview of how CLIP works during training.
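
For concreteness, here is a minimal PyTorch sketch of that training objective. It assumes the image and text encoders already exist and produce the batched embeddings `image_embs` and `text_embs`, and the fixed `temperature` value is an illustrative choice (in the actual model this is a learned parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matched image/text embeddings.

    image_embs, text_embs: (N, d) tensors from the image and text encoders
    (the encoders themselves are assumed and not shown here).
    """
    # L2-normalize so the dot products are cosine similarities
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # (N, N) matrix of similarity scores; row i, column j compares image i to caption j
    logits = image_embs @ text_embs.T / temperature

    # The i-th image matches the i-th caption, so the targets are the diagonal indices
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy over rows (image -> caption) and columns (caption -> image), averaged
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.T, targets)
    return (loss_images + loss_texts) / 2
```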

🤔 Inference

Inputs: the vector for a single image, and the vectors for a bunch of different possible text captions.

Output: the similarity scores of the single image to all the different text captions.

Goal: select the text which has the highest similarity with the image.

Note: the authors used this strategy to run zero-shot classification with CLIP on the ImageNet dataset. They turned the label of each ImageNet class into a sentence and used these sentences as the possible captions for a given image. For example, instead of using the ImageNet label “cat”, they created a sentence like “A photo of a cat”, because full sentences are closer to the kind of text CLIP saw during training. They then compared a given ImageNet image with the set of sentences corresponding to the different classes and picked the sentence that fit the image best.
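
Here is a small, hedged sketch of that prompt-based zero-shot classification, using the pre-trained CLIP checkpoint published on Hugging Face; the label list and the image path are made-up placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pre-trained CLIP weights (one of several available checkpoint sizes)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "airplane"]                      # placeholder class names
prompts = [f"a photo of a {label}" for label in labels]  # turn each label into a sentence
image = Image.open("example.jpg")                        # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the one image to each prompt, shape (1, num_prompts)
probs = outputs.logits_per_image.softmax(dim=-1)
print("Predicted class:", labels[probs.argmax().item()])
```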

The paper reports high zero-shot performance: even though the model was never trained on labeled examples of the ImageNet classes, it still performs well because it has effectively learned what the words in the class names mean and can associate them with images.

Other Notes

Although CLIP itself is not a caption-generation model, the pre-trained CLIP model can calculate similarity scores between images and candidate captions, which makes it useful as a component of caption-generation models, for example to rank candidate captions (as sketched below).
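
As a sketch of that idea, the snippet below uses CLIP to rerank candidate captions that some other caption generator might have produced. It reuses the `model`, `processor`, and `image` from the previous example, and the candidate strings are invented for illustration.

```python
# Candidate captions (invented here; in practice they would come from a caption generator)
candidates = ["a cat sleeping on a couch", "a dog running on the beach"]

text_inputs = processor(text=candidates, return_tensors="pt", padding=True)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embs = model.get_text_features(**text_inputs)
    image_embs = model.get_image_features(**image_inputs)

# Normalize, then take dot products to get cosine similarities between image and captions
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
scores = (image_embs @ text_embs.T).squeeze(0)

best_caption = candidates[scores.argmax().item()]
print("Best-matching caption:", best_caption)
```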
