Zero Shot Image Segmentation using CLIP
Segmenting images with text prompts
Image segmentation is the process of dividing an image into multiple regions, where each region corresponds to a specific object class or shares certain characteristics. This project explores zero-shot image segmentation, using OpenAI’s CLIP to associate free-form text prompts with the image regions they describe.
Problem Statement
Traditional image segmentation models are limited to a fixed set of predefined classes. In this project, we aim to perform zero-shot image segmentation using OpenAI’s CLIP (Contrastive Language-Image Pre-training). CLIP is trained with a contrastive objective that pulls matching text–image pairs together while pushing mismatched pairs apart.
By leveraging CLIP’s capability to produce similar vector representations for related text and images, we can create a model that accepts a text prompt and an image as input. The model generates embeddings for both the text prompt and the input image, which are then used to train a decoder to produce a binary segmentation map.
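As a concrete illustration, the snippet below shows how such paired embeddings can be obtained. It is a minimal sketch assuming the Hugging Face Transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint; the image path and prompt are placeholders, and the exact checkpoint and preprocessing used in the project may differ.

```python
# Minimal sketch: extracting joint text/image embeddings with CLIP.
# Assumes the Hugging Face `transformers` CLIP implementation and the
# "openai/clip-vit-base-patch32" checkpoint; the project's setup may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder image
prompt = "a dog on the grass"                     # placeholder prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

text_emb = outputs.text_embeds    # shape (1, 512) for the ViT-B/32 checkpoint
image_emb = outputs.image_embeds  # shape (1, 512)

# These embeddings (or intermediate visual tokens) are the conditioning
# signal fed to the segmentation decoder described below.
```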
Approach
The project uses a subset of the PhraseCut dataset for training and evaluation. The dataset provides region annotations as polygons, which were converted into binary masks for training.
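For reference, one way to perform this polygon-to-mask conversion is sketched below. It assumes each region is given as a list of (x, y) vertex tuples and rasterizes it with PIL; the actual PhraseCut annotation layout may differ slightly.

```python
# Minimal sketch: rasterizing polygon annotations into binary masks.
# Assumes each polygon is a list of (x, y) vertex tuples; the actual
# PhraseCut JSON layout may differ slightly.
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_mask(polygons, height, width):
    """Convert a list of polygons into a single binary mask (H, W) of 0/1."""
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for poly in polygons:
        # ImageDraw expects a sequence of (x, y) tuples.
        draw.polygon([(float(x), float(y)) for x, y in poly], outline=1, fill=1)
    return np.array(mask, dtype=np.uint8)

# Example: two triangles merged into one 256x256 target mask.
mask = polygons_to_mask([[(10, 10), (100, 20), (50, 120)],
                         [(150, 150), (220, 160), (180, 230)]], 256, 256)
```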
Model Architectures
Two different model architectures were explored:
- Transformer decoder: the pretrained CLIP encoder released by OpenAI generates embeddings for both the text prompt and the image, and a transformer-based decoder built with Hugging Face’s Transformers library is trained on top of these embeddings to produce a binary segmentation map (see the sketch after this list).
- Convolutional decoder: a custom architecture inspired by U-Net, built from stacks of upsampling and deconvolutional layers. The CLIP encoder’s output is injected at several levels of the decoder, which progressively upsamples it to produce an output of the desired shape.
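The listing below gives an illustrative sketch of the transformer-decoder variant. It is not the project’s exact architecture: the hidden size, number of layers, and the way the text embedding conditions the visual tokens are all assumptions made for the example.

```python
# Illustrative sketch of a transformer decoder head on top of frozen CLIP
# features. Hidden sizes, depth, and the conditioning scheme are assumptions.
import torch
import torch.nn as nn

class SegTransformerDecoder(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, hidden=256, n_patches=49, out_size=224):
        super().__init__()
        self.grid = int(n_patches ** 0.5)             # 7x7 patch grid for a 224px ViT-B/32 input
        self.out_size = out_size
        self.vis_proj = nn.Linear(vis_dim, hidden)    # project CLIP visual tokens
        self.txt_proj = nn.Linear(txt_dim, hidden)    # project the pooled text embedding
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.to_logit = nn.Linear(hidden, 1)          # one logit per patch token

    def forward(self, visual_tokens, text_emb):
        # visual_tokens: (B, n_patches, vis_dim) patch features from CLIP's vision tower
        # text_emb:      (B, txt_dim) pooled CLIP text embedding of the prompt
        tgt = self.vis_proj(visual_tokens)             # one query per image patch
        memory = self.txt_proj(text_emb).unsqueeze(1)  # cross-attend to the prompt
        decoded = self.decoder(tgt, memory)            # (B, n_patches, hidden)
        logits = self.to_logit(decoded)                # (B, n_patches, 1)
        logits = logits.transpose(1, 2).reshape(-1, 1, self.grid, self.grid)
        # Upsample the coarse patch-level map to a full-resolution mask of logits.
        return nn.functional.interpolate(
            logits, size=(self.out_size, self.out_size),
            mode="bilinear", align_corners=False)
```

In a setup like this, the visual tokens could for instance be taken from the vision tower of the Hugging Face CLIP model (model.vision_model(...).last_hidden_state), and the predicted map trained against the binary masks with BCE or Dice loss.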
Evaluation Metrics
The performance of the models was evaluated using the following metrics:
- Accuracy
- Dice score
- Intersection over Union (IoU)
By comparing the results obtained from different models with varying complexities and loss functions (BCE and Dice loss), we analyzed their effectiveness in producing accurate segmentation maps.
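For concreteness, these metrics (and the Dice loss mentioned above) can be computed as in the sketch below, assuming pred and target are binary {0, 1} tensors of the same shape. This is an illustrative implementation, not the project’s exact evaluation code.

```python
# Illustrative metric and loss implementations for binary segmentation maps.
import torch

def accuracy(pred, target):
    return (pred == target).float().mean().item()

def dice_score(pred, target, eps=1e-7):
    inter = (pred * target).sum()
    return ((2 * inter + eps) / (pred.sum() + target.sum() + eps)).item()

def iou(pred, target, eps=1e-7):
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return ((inter + eps) / (union + eps)).item()

# Dice loss (1 - soft Dice on predicted probabilities) is a common
# differentiable training objective used alongside or instead of BCE.
def dice_loss(probs, target, eps=1e-7):
    inter = (probs * target).sum()
    return 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
```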
Findings
The project revealed several important insights:
- Binary cross-entropy (BCE) is not the most appropriate loss function for an image segmentation task.
- The U-Net-based architecture showed limitations in performance, likely because the CLIP encoder is designed to work with transformer-based architectures that can incorporate self-attention mechanisms.
- The transformer-based decoder demonstrated better capability in handling the zero-shot segmentation task by effectively utilizing the embeddings generated by the CLIP encoder.
Conclusion
This project demonstrates the potential of using CLIP for zero-shot image segmentation. By combining CLIP’s text-image association capabilities with an appropriate decoder architecture, we can segment arbitrary objects in an image from free-form text prompts.
The findings suggest that transformer-based decoders are more suitable for working with CLIP embeddings compared to convolutional architectures like U-Net. Additionally, the choice of loss function plays a crucial role in training effective segmentation models.
Acknowledgement
Special thanks to SAiDL for providing the problem statement.