Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

1 The University of Texas at Dallas     2 NVIDIA

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

Abstract

We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by the unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP that utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image encoder and text encoder in CLIP in a joint fashion using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such a proposed alignment is beneficial for few-shot classification due to the contributions from both types of prototypes. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning as well as in the real world for robot perception.

Model Overview

The CLIP image encoder and text encoder are frozen during training. The image memory, the text memory, and the adapter network are learned. Given a class name, τ_i returns the i-th of the K predefined text prompts.
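To make this concrete, below is a minimal PyTorch sketch of the classification and alignment steps described above. The variable names, tensor shapes, the residual adapter, and the alpha/beta fusion of the two similarity scores are illustrative assumptions based on this overview, not the exact implementation.

# Minimal Proto-CLIP-style head (illustrative; shapes, adapter, and fusion are assumptions).
# N classes, K support shots per class, D-dimensional CLIP embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoCLIPHead(nn.Module):
    def __init__(self, support_feats, text_feats, num_classes, shots, alpha=0.5, beta=10.0):
        super().__init__()
        # Learnable memories, initialized from frozen CLIP features of the few-shot support set.
        self.image_memory = nn.Parameter(support_feats.clone())  # (N*K, D) support image features
        self.text_memory = nn.Parameter(text_feats.clone())      # (N, D) text prompt features
        dim = support_feats.shape[-1]
        # Small residual adapter applied to the frozen query feature (assumed architecture).
        self.adapter = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.num_classes, self.shots = num_classes, shots
        self.alpha, self.beta = alpha, beta

    def prototypes(self):
        img_mem = F.normalize(self.image_memory, dim=-1)
        img_proto = img_mem.view(self.num_classes, self.shots, -1).mean(dim=1)  # per-class mean
        return F.normalize(img_proto, dim=-1), F.normalize(self.text_memory, dim=-1)

    def forward(self, query_feats):
        # query_feats: (B, D) frozen CLIP image features of the query images.
        q = F.normalize(query_feats + self.adapter(query_feats), dim=-1)
        img_proto, txt_proto = self.prototypes()
        logits_img = self.beta * q @ img_proto.t()   # similarity to image prototypes
        logits_txt = self.beta * q @ txt_proto.t()   # similarity to text prototypes
        return self.alpha * logits_img + (1.0 - self.alpha) * logits_txt

    def alignment_loss(self):
        # Pull image and text prototypes of the same class together during adaptation.
        img_proto, txt_proto = self.prototypes()
        return (1.0 - (img_proto * txt_proto).sum(dim=-1)).mean()

In this sketch, adaptation would minimize a cross-entropy loss on the fused logits plus the alignment term over the few-shot data, while the CLIP encoders stay frozen.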

Demo

User-command-oriented robot grasping using Proto-CLIP predictions in the real world (4 scenes).

Joint Object Segmentation and Few-Shot Classification in the Real World

CLIP vs Proto-CLIP


(a) Image and text prototypes from zero-shot CLIP, which are not aligned
(b) Aligned image and text prototypes from fine-tuned Proto-CLIP

t-SNE

Barnes-Hut t-SNE visualization of the fine-tuned Proto-CLIP trained on the FewSOL (198 classes) dataset.
Here, image and text prototypes are aligned and lie close to each other. Objects with similar shapes are close together.
Semantics are captured as well, e.g., vegetables and fruits are close to each other.

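A plot like this can be reproduced with off-the-shelf Barnes-Hut t-SNE once the trained prototypes are exported; in the sketch below, the file protos.npz and its array keys are placeholder names, not the released format.

# Sketch: Barnes-Hut t-SNE of the trained image and text prototypes
# (file name and array keys are placeholders).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

data = np.load("protos.npz")                                   # assumed export of the memories
feats = np.concatenate([data["image"], data["text"]], axis=0)  # (2N, D)

emb = TSNE(n_components=2, method="barnes_hut", perplexity=30,
           init="pca", random_state=0).fit_transform(feats)

n = len(data["image"])
plt.scatter(emb[:n, 0], emb[:n, 1], s=8, label="image prototypes")
plt.scatter(emb[n:, 0], emb[n:, 1], s=8, marker="^", label="text prototypes")
plt.legend()
plt.title("Proto-CLIP prototypes (FewSOL-198)")
plt.show()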

FAQs

Why has FewSOL been used for real world experiments?
For a robot to work in human environments such as kitchens and living rooms, it has to interact with a variety of everyday objects. FewSOL is well suited for learning good representations of such everyday objects for manipulation tasks, so we chose it for our real-world experiments.
Any specific observations during inference?
Object segmentation and object orientation matter. Segmentation is the more critical factor: in cluttered scenes, a poor segmentation can degrade classification. Lighting conditions also play a key role, as they affect the classification of shiny objects.
In the real world demo, why do you invoke the ASR after running Proto-CLIP inference?
The ASR component is a self-contained function that can either be called right after the Proto-CLIP predictions or placed in a separate file and wrapped as a ROS node that publishes the ASR output. In our demonstration, we invoke the ASR immediately after obtaining the Proto-CLIP predictions, so the Proto-CLIP node never has to wait on a trigger event. A rough sketch of the ROS-node alternative follows.
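As an illustration of that alternative, a standalone ASR publisher node could look like the sketch below; transcribe() is a placeholder for the actual speech-recognition backend, and the topic name is an assumption.

#!/usr/bin/env python
# Sketch of a standalone ASR publisher node (topic name and transcribe() are placeholders).
import rospy
from std_msgs.msg import String

def transcribe():
    # Placeholder: plug in the actual speech-recognition backend here.
    return "put the mug in the box"

def main():
    rospy.init_node("asr_node")
    pub = rospy.Publisher("/asr/command", String, queue_size=1)
    rate = rospy.Rate(1)  # poll roughly once per second
    while not rospy.is_shutdown():
        text = transcribe()
        if text:
            pub.publish(String(data=text))
        rate.sleep()

if __name__ == "__main__":
    main()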
Can the real world system handle changes to the scene during inference?
The current system assumes the scene remains unchanged during Proto-CLIP inference. Detecting scene changes and re-running inference accordingly would require significant modifications to the existing system.
What is Proto-CLIP-Toolkit?
Proto-CLIP-Toolkit is a Python package that bundles the functionality used in the real-world demo, such as POS tagging and ASR, t-SNE plotting, and the out-of-distribution (OOD) test on the ImageNet dataset. To use it, please check the PyPI page and the sample code.
How many text prompts have been used in the experiments?
All datasets except ImageNet use a single text prompt template; ImageNet uses 7. For a fair comparison, this setting is borrowed from Tip-Adapter. More text prompts can be used.
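For reference, prompt ensembling of this kind is usually done by encoding every template for each class and averaging the normalized text embeddings, following the standard CLIP zero-shot recipe; the templates and class names in the sketch below are examples only, not the exact set used in the experiments.

# Sketch: per-class text features from multiple prompt templates
# (standard CLIP-style ensembling; templates and class names are examples only).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a cropped photo of a {}.", "a photo of the small {}."]
classnames = ["apple", "banana", "mug"]

with torch.no_grad():
    text_protos = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        proto = feats.mean(dim=0)
        text_protos.append(proto / proto.norm())
    text_protos = torch.stack(text_protos)  # (num_classes, D), one averaged prototype per class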

BibTeX

Please cite Proto-CLIP if it helps your research:
@article{padalunkal2023protoclip,
  title={Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning}, 
  author={Jishnu Jaykumar P and Kamalesh Palanisamy and Yu-Wei Chao and Xinya Du and Yu Xiang},
  archivePrefix={arXiv},
  eprint={2307.03073},
  year={2023}
}

Contact

Acknowledgements

This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005.