Segmenting unseen objects from images is a critical perception skill that a robot needs to acquire. In robot manipulation, it can enable a robot to grasp and manipulate unseen objects. Mean shift clustering is a widely used method for object segmentation tasks. However, the traditional mean shift clustering algorithm is not easily integrated into an end-to-end neural network training pipeline, which separates representation learning from the clustering stage. In this work, we propose the Mean Shift Mask Transformer (MSMFormer), a new transformer architecture that simulates the von Mises-Fisher (vMF) mean shift clustering algorithm, allowing joint training and inference of both the feature extractor and the clustering. Its central component is a hypersphere attention mechanism, which updates object queries on a hypersphere. To illustrate the effectiveness of our method, we apply MSMFormer to unseen object instance segmentation. Our experiments show that MSMFormer improves over the mean shift clustering baseline that uses deep feature representations, and achieves competitive performance compared to the state-of-the-art methods on unseen object instance segmentation.
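For intuition, the vMF mean shift procedure that MSMFormer simulates can be sketched as follows. This is a minimal NumPy illustration of generic vMF mean shift on unit-norm feature vectors, not the paper's implementation; the function name, the concentration parameter `kappa`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def vmf_mean_shift(features, kappa=20.0, num_iters=10):
    """Illustrative von Mises-Fisher mean shift (not the paper's code).

    features: (N, D) array whose rows are L2-normalized (on the hypersphere).
    kappa: vMF concentration parameter (illustrative default).
    Returns one shifted mean per input point; points in the same cluster
    converge to the same mode on the hypersphere.
    """
    means = features.copy()
    for _ in range(num_iters):
        # Cosine similarity between current means and all data points.
        sim = means @ features.T                 # (N, N)
        weights = np.exp(kappa * sim)            # vMF kernel weights
        means = weights @ features               # weighted sum of points
        # Project the updated means back onto the unit hypersphere.
        means /= np.linalg.norm(means, axis=1, keepdims=True)
    return means

# Toy example: two tight clusters on the unit circle.
rng = np.random.default_rng(0)
a = rng.normal([1.0, 0.0], 0.05, size=(20, 2))
b = rng.normal([-1.0, 0.0], 0.05, size=(20, 2))
x = np.concatenate([a, b])
x /= np.linalg.norm(x, axis=1, keepdims=True)
m = vmf_mean_shift(x)
# Means within each cluster collapse toward a shared mode.
```

In MSMFormer, this iterative update is unrolled into transformer layers: the hypersphere attention mechanism plays the role of the weighted update above, with object queries taking the place of the cluster means.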
The Appendix of Mean Shift Mask Transformer.
The code for Mean Shift Mask Transformer.
@article{lu2022mean,
title={Mean Shift Mask Transformer for Unseen Object Instance Segmentation},
author={Lu, Yangxiao and Chen, Yuqiao and Ruozzi, Nicholas and Xiang, Yu},
journal={arXiv preprint arXiv:2211.11679},
year={2022}
}
Send any comments or questions to Yangxiao Lu: yangxiao.lu@utdallas.edu
This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005.