The CLIP image encoder and text encoder are frozen during training. The image memory, the text memory, and the adapter network are learned. Given a class name, τ_i returns the i-th of the K predefined text prompts.
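A minimal PyTorch-style sketch of this setup is shown below. The class, shapes, and adapter layout here are illustrative assumptions, not the authors' implementation: the CLIP encoders are frozen, while the image memory, text memory, and adapter remain trainable.

```python
# Sketch (assumed names/shapes): frozen CLIP encoders, learnable image/text
# memories and an adapter network, as described in the caption above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoCLIPHead(nn.Module):
    def __init__(self, clip_model, n_classes, n_shots, k_prompts, dim):
        super().__init__()
        self.clip = clip_model
        # Freeze the CLIP image and text encoders.
        for p in self.clip.parameters():
            p.requires_grad = False
        # Learnable memories (illustratively initialized at random; in practice
        # they would be initialized from encoded support images / text prompts).
        self.image_memory = nn.Parameter(torch.randn(n_classes, n_shots, dim))
        self.text_memory = nn.Parameter(torch.randn(n_classes, k_prompts, dim))
        # Adapter network applied to query image features (hypothetical layout).
        self.adapter = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, query_images):
        with torch.no_grad():
            q = self.clip.encode_image(query_images).float()  # frozen encoder
        q = F.normalize(self.adapter(q), dim=-1)
        # Class prototypes = per-class mean of memory entries, L2-normalized.
        img_proto = F.normalize(self.image_memory.mean(dim=1), dim=-1)
        txt_proto = F.normalize(self.text_memory.mean(dim=1), dim=-1)
        # Combine similarities to image and text prototypes for classification.
        return q @ img_proto.t() + q @ txt_proto.t()
```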
User-command-driven robot grasping using Proto-CLIP predictions in the real world. [4 scenes]
(a) Image and text prototypes from zero-shot CLIP, which are not aligned. (b) Aligned image and text prototypes from fine-tuned Proto-CLIP.
Barnes-Hut t-SNE visualization using fine-tuned Proto-CLIP trained on the FewSOL [198 classes] dataset. Here, image and text prototypes are aligned closer to each other. Objects with similar shapes are closer, and semantics are captured as well, e.g., vegetables/fruits are close to each other. Zoom in to take a closer look.
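A plot of this kind can be reproduced along the lines of the sketch below, using scikit-learn's Barnes-Hut t-SNE on the learned prototypes. The input files and array names are hypothetical placeholders for the image and text prototypes extracted from a fine-tuned model.

```python
# Sketch: Barnes-Hut t-SNE of image and text prototypes (assumed inputs).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# (n_classes, dim) arrays of learned prototypes; file names are hypothetical.
img_protos = np.load("image_prototypes.npy")
txt_protos = np.load("text_prototypes.npy")

feats = np.concatenate([img_protos, txt_protos], axis=0)
# method="barnes_hut" selects the Barnes-Hut approximation used for the figure.
emb = TSNE(n_components=2, method="barnes_hut", perplexity=30,
           init="pca", random_state=0).fit_transform(feats)

n = len(img_protos)
plt.scatter(emb[:n, 0], emb[:n, 1], marker="o", label="image prototypes")
plt.scatter(emb[n:, 0], emb[n:, 1], marker="^", label="text prototypes")
plt.legend()
plt.show()
```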
@INPROCEEDINGS{padalunkal2024protoclip,
author={P, Jishnu Jaykumar and Palanisamy, Kamalesh and Chao, Yu-Wei and Du, Xinya and Xiang, Yu},
booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={{Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning}},
doi={10.1109/IROS58592.2024.10801660},
pages={2594--2601},
year={2024}
}
This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005.