Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

IEEE/RSJ International Conference on Intelligent Robots and Systems · IROS 2024

TL;DR: Proto-CLIP freezes CLIP's image and text encoders and learns small image and text memory banks plus an adapter that aligns the two modalities per class. From just a handful of examples per class, both prototypes vote on the label; the cross-modal alignment consistently beats unimodal prototypical networks and zero-shot CLIP on standard few-shot benchmarks (ImageNet, CIFAR, Caltech, Food, Flowers, …) and transfers directly to user-command robot grasping on FewSOL (198 classes) in the real world.

Abstract

Proto-CLIP at a glance

We propose a novel framework for few-shot learning by leveraging large-scale vision–language models such as CLIP. Motivated by unimodal prototypical networks, we introduce Proto-CLIP — which uses both image prototypes and text prototypes for few-shot classification.

Proto-CLIP adapts CLIP's image and text encoders jointly from a handful of few-shot examples and explicitly aligns image and text prototypes of corresponding classes. This cross-modal alignment is where the gain comes from — both modalities contribute to the final decision.
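The core idea can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the embeddings stand in for frozen CLIP features, and `alpha` (modality blend) and `beta` (temperature) are hypothetical parameters.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_prototypes(support_embs, labels, n_classes):
    # Image prototype per class: mean of the (frozen-encoder) support
    # embeddings of that class, re-normalized to the unit sphere.
    protos = np.stack([support_embs[labels == c].mean(axis=0)
                       for c in range(n_classes)])
    return l2_normalize(protos)

def proto_clip_logits(query, img_protos, txt_protos, alpha=0.5, beta=10.0):
    # Both modalities vote: blend the query's similarity to the image
    # prototypes with its similarity to the text prototypes.
    sim_img = query @ img_protos.T
    sim_txt = query @ txt_protos.T
    return beta * (alpha * sim_img + (1 - alpha) * sim_txt)
```

With `alpha = 1` this reduces to a unimodal image-prototype classifier; with `alpha = 0` it recovers text-only matching in the style of zero-shot CLIP.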

We validate the method on standard few-shot benchmarks and in the real world for robot perception.

Model Overview

Proto-CLIP model overview

The CLIP image and text encoders are frozen during training; the image memory, text memory, and adapter network are learned. Given a class name, τi returns the ith of K predefined text prompts.
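For illustration, τi can be thought of as indexing a fixed list of CLIP-style prompt templates. The templates below are placeholders, not necessarily the ones used in the paper:

```python
# Hypothetical CLIP-style templates standing in for the K predefined prompts.
TEMPLATES = [
    "a photo of a {}.",
    "a cropped photo of a {}.",
    "a photo of a small {}.",
]

def tau(i, class_name):
    # tau_i: given a class name, return the i-th of K predefined text prompts.
    return TEMPLATES[i].format(class_name)
```

Each class's text prototype is then derived from the frozen text encoder's embeddings of these K prompts.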

Real-World Demo

User-command-oriented robot grasping using Proto-CLIP predictions in the real world — 4 scenes.

Proto-CLIP Scene 1 demo thumbnail

Joint Object Segmentation & Few-Shot Classification

Eight real-world scenes showing Proto-CLIP predictions in the wild.

CLIP vs Proto-CLIP

Image and text prototypes before and after fine-tuning.


(a) Image and text prototypes from zero-shot CLIP are not aligned.
(b) Aligned image and text prototypes from fine-tuned Proto-CLIP.

t-SNE

Barnes–Hut t-SNE visualization of fine-tuned Proto-CLIP trained on FewSOL (198 classes). Image and text prototypes are aligned, and shape-similar and semantically similar objects cluster together (e.g., vegetables and fruits).

Proto-CLIP FewSOL-198 t-SNE plot after training the learnable image and text memory banks
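A plot like this can be reproduced with scikit-learn's Barnes–Hut t-SNE. The sketch below is illustrative: the prototype arrays are placeholders for the learned image and text prototypes.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_prototypes_2d(img_protos, txt_protos, seed=0):
    # Stack image and text prototypes and project them jointly to 2-D
    # with Barnes-Hut t-SNE (scikit-learn's approximate method).
    protos = np.vstack([img_protos, txt_protos])
    tsne = TSNE(n_components=2, method="barnes_hut",
                perplexity=min(30, len(protos) - 1),
                init="random", random_state=seed)
    return tsne.fit_transform(protos)
```

Aligned modalities show up as each class's image and text prototypes landing near each other in the 2-D embedding.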

FAQs


Why has FewSOL been used for real-world experiments?
For a robot to work in human environments like the kitchen or living room, it needs to interact with a variety of daily objects. FewSOL is well suited for learning good representations of daily objects for manipulation tasks, which is why we chose it for our experiments.
Any specific observations during inference?
Object segmentation and orientation matter. Segmentation is the more critical of the two — in cluttered scenes, bad segmentation propagates into bad classification. Lighting conditions also play a key role, especially for shiny objects.
In the real-world demo, why is ASR invoked after Proto-CLIP inference?
The ASR package is designed as a streamlined function that can be integrated right after Proto-CLIP predictions, or alternatively as a separate ROS node. In the demo we invoke ASR immediately after the predictions, which avoids the Proto-CLIP node waiting on a trigger event.
Can the real-world system handle scene changes during inference?
The current architecture assumes a static scene throughout Proto-CLIP inference. Detecting scene changes and re-running inference would require significant modifications to the pipeline.
What is Proto-CLIP-Toolkit?
A Python package bundling the utilities used for the real-world demo: POS tagging, ASR, t-SNE plots, and OOD testing for ImageNet. Grab it from PyPI and see the sample code.
How many text prompts are used in the experiments?
All datasets except ImageNet use a single text-prompt template; ImageNet uses 7 (following Tip-Adapter for direct comparison). More prompts can be used without any code change.

BibTeX

Please cite Proto-CLIP if it helps your research.

@INPROCEEDINGS{padalunkal2024protoclip,
  title     = {{Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning}},
  author    = {P, Jishnu Jaykumar and Palanisamy, Kamalesh and Chao, Yu-Wei and Du, Xinya and Xiang, Yu},
  booktitle = {2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  doi       = {10.1109/IROS58592.2024.10801660},
  pages     = {2594--2601},
  year      = {2024}
}

Contact

Open an issue, join the discussion forum, or email Jishnu — jishnu.p@utdallas.edu

Acknowledgements

Supported by

DARPA
Perceptually-enabled Task Guidance
HR00112220005