Abstract

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
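The core local-to-global idea, dense patch-level matching between templates and the query, can be sketched in a few lines of NumPy. The sketch below is illustrative only: it assumes patch features (e.g. from a frozen self-supervised backbone such as DINOv2) have already been extracted upstream, and the function name and similarity threshold are placeholders, not the paper's actual implementation.

```python
import numpy as np

def match_patches(template_feats, query_feats, query_hw, sim_thresh=0.6):
    """Dense patch matching sketch: return (row, col) candidate points on the
    query patch grid whose best cosine similarity to any template patch
    exceeds sim_thresh.

    template_feats: (Nt, D) patch features from the template images.
    query_feats:    (Hq*Wq, D) patch features from the query image.
    query_hw:       (Hq, Wq) shape of the query patch grid.
    """
    # L2-normalize so dot products become cosine similarities.
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ t.T                        # (Hq*Wq, Nt) similarity matrix
    best = sim.max(axis=1)               # best template match per query patch
    idx = np.flatnonzero(best > sim_thresh)
    _, Wq = query_hw
    rows, cols = np.divmod(idx, Wq)      # flat index -> grid coordinates
    return np.stack([rows, cols], axis=1), best[idx]
```

A query patch that closely matches any template patch becomes a candidate point; these points are what the later stages filter and hand to SAM as prompts.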

L2G-Det

Conceptual comparison between object proposal-based instance detection methods and our local-to-global instance detection.



Overview of our L2G-Det framework for novel instance detection. It consists of a candidate selection module and an augmented SAM module. Only the adapters and object tokens are learnable, while all other components are frozen.
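For intuition about the candidate selection step, a simple score-based point non-maximum suppression gives the flavor of suppressing redundant or spurious candidate points before prompting SAM: keep the highest-scoring point, drop nearby candidates, repeat. Note this is not the paper's learned selection module, just a hand-written approximation; the function name and radius are illustrative. The surviving points would then serve as point prompts to SAM (e.g. `SamPredictor.predict(point_coords=..., point_labels=...)` in the official segment-anything package).

```python
import numpy as np

def select_candidates(points, scores, radius=2.0):
    """Greedy point NMS sketch: keep the highest-scoring candidate point,
    suppress all candidates within `radius` of it, then repeat on the rest.

    points: (N, 2) array of (row, col) candidate points.
    scores: (N,) matching scores for the candidates.
    """
    order = np.argsort(scores)[::-1]            # highest score first
    suppressed = np.zeros(len(points), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        # Suppress everything within `radius` of the kept point (itself too).
        d = np.linalg.norm(points - points[i], axis=1)
        suppressed |= d < radius
    return points[keep], scores[keep]
```

In the actual framework this filtering is learned rather than rule-based, but the effect is the same: a sparse, reliable set of points for mask prompting.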


Detection Examples


Experiment Results

L2G-Det consistently outperforms existing detectors on both the High-Resolution dataset [14] and RoboTools dataset [4], achieving significant improvements in Precision.

Real-World Robotic Experiments

A Fetch robot equipped with L2G-Det autonomously navigates cluttered indoor environments and stops upon detecting novel target objects in real time, demonstrating robust performance across 8 objects. Demo videos are shown below:

References

  1. Anton Osokin, Denis Sumin, and Vasily Lomakin. OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features. ECCV, 2020.
  2. Jean-Philippe Mercier, Mathieu Garon, Philippe Giguere, and Jean-Francois Lalonde. Deep Template-Based Object Instance Detection. WACV, 2021.
  3. Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning Open-World Object Proposals without Learning to Classify. IEEE RA-L, 2022.
  4. Bowen Li, Jiashun Wang, Yaoyu Hu, Chen Wang, and Sebastian Scherer. VoxDet: Voxel Learning for Novel Instance Detection. NeurIPS, 2023.
  5. Qianqian Shen et al. Solving Instance Detection from an Open-World Perspective. CVPR, 2025.
  6. Yangxiao Lu, Jishnu Jaykumar P, Yunhui Guo, Nicholas Ruozzi, and Yu Xiang. Adapting Pre-trained Vision Models for Novel Instance Detection and Segmentation. 2024.
  7. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS, 2015.
  8. Tsung-Yi Lin et al. Focal Loss for Dense Object Detection. ICCV, 2017.
  9. Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.
  10. Zhi Tian et al. FCOS: Fully Convolutional One-Stage Object Detection. ICCV, 2019.
  11. Feng Li et al. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. ICLR, 2023.
  12. Alexander Kirillov et al. Segment Anything. ICCV, 2023.
  13. Maxime Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. 2023.
  14. Qianqian Shen et al. A High-Resolution Dataset for Instance Detection with Multi-View Instance Capture. NeurIPS Datasets and Benchmarks, 2023.

Citation

Please cite L2G-Det if it helps your research:
@misc{zhang2026l2gdet,
  title        = {From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes},
  author       = {Qifan Zhang and Sai Haneesh Allu and Jikai Wang and Yangxiao Lu and Yu Xiang},
  year         = {2026},
  eprint       = {2603.03577},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2603.03577}
}

Contact

Send any comments or questions to Qifan Zhang: Qifan.Zhang@utdallas.edu

Acknowledgements

This work was supported in part by the National Science Foundation (NSF) under Grant Nos. 2346528 and 2520553, and the NVIDIA Academic Grant Program Award.