TL;DR When a pretrained perception model fails in the wild, a co-located human performs a short HumanPlay interaction to clean the scene, then labels just the final frame hands-free with eye-gaze + voice. SAM2 converts the gaze points into bounding-box labels and propagates masks backwards across the short RGB-D clip, turning one annotation into dense supervision. These failure-driven samples iteratively fine-tune MSMFormer, lifting the UOIS combined score by +54.6 and improving real-robot grasping and pick-and-place success on the SceneReplica benchmark by +3 and +7, respectively.
iTeach overview video thumbnail

Abstract

Robotic perception models often fail in the real world due to clutter, occlusion, and novel objects. Existing approaches rely on offline data collection and retraining, which is slow and blind to deployment-time failures. We propose iTeach, a failure-driven interactive teaching framework that adapts robot perception in the wild. A co-located human observes live predictions, triggers a short HumanPlay interaction on a failed object, and records an RGB-D sequence. Our Few-Shot Semi-Supervised (FS3) labeling annotates only the final frame using hands-free eye-gaze and voice; SAM2 propagates the mask across the sequence for dense supervision. Iterative fine-tuning on these samples progressively improves an MSMFormer UOIS model, translating into higher grasping and pick-and-place success on the SceneReplica benchmark and real-robot experiments.

Overview

iTeach overview

A pretrained perception model fails in the wild under clutter, occlusion, and novel objects. A co-located human performs a short HumanPlay interaction, annotates a single frame with eye-gaze + voice, and we propagate the label across the short RGB-D sequence. Failure-driven samples feed an iterative fine-tuning loop; the best checkpoint is redeployed.
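The outer loop can be pictured as below. This is a minimal sketch, assuming hypothetical helper callables (collect_failure_clips, fs3_label, finetune, evaluate) that stand in for the HumanPlay recording, FS3 labeling, MSMFormer fine-tuning, and held-out evaluation described on this page; it is not the released implementation.

# Hypothetical sketch of the iTeach loop: deploy, collect failure clips,
# label them with FS3, fine-tune, and keep the best checkpoint.
def iteach_loop(model, collect_failure_clips, fs3_label, finetune, evaluate, rounds=5):
    best_model, best_score = model, evaluate(model)        # score on held-out scenes
    dataset = []                                           # failure-driven training samples
    for _ in range(rounds):
        for clip in collect_failure_clips(best_model):     # short HumanPlay RGB-D sequences
            dataset.extend(fs3_label(clip))                # one annotated frame -> dense masks
        candidate = finetune(best_model, dataset)          # fine-tune MSMFormer on all samples so far
        score = evaluate(candidate)
        if score > best_score:                             # redeploy only the best checkpoint
            best_model, best_score = candidate, score
    return best_model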

System Architecture

iTeach system setup

Hardware. Fetch mobile manipulator (RGB-D) · Microsoft HoloLens 2 (gaze + voice annotation, live prediction overlay) · Lenovo Legion Pro 7 laptop with RTX 4090 (inference, SAM2 propagation, fine-tuning). All compute runs locally on the laptop; the human drives the robot with a PS4 controller for scene exploration. RGB-D streams to the laptop over wired Ethernet; the HoloLens connects over a laptop-hosted Wi-Fi hotspot via the ROS-TCP connector.
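A minimal rospy sketch of the laptop-side wiring under the setup above. All topic names are assumptions for illustration, not the released interface; the HoloLens side would publish gaze and voice messages through a ROS-TCP endpoint hosted on the laptop's hotspot.

#!/usr/bin/env python
# Hypothetical laptop-side bridge node: RGB-D arrives from the Fetch over Ethernet,
# gaze points and voice triggers arrive from the HoloLens via the ROS-TCP endpoint.
# Every topic name below is an illustrative assumption.
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import PointStamped
from std_msgs.msg import String

def on_rgb(msg):
    rospy.loginfo_throttle(5.0, "RGB frame %dx%d" % (msg.width, msg.height))

def on_depth(msg):
    rospy.loginfo_throttle(5.0, "Depth frame received")

def on_gaze(msg):
    rospy.loginfo("Gaze point prompt at (%.1f, %.1f)" % (msg.point.x, msg.point.y))

def on_voice(msg):
    rospy.loginfo("Voice command: %s" % msg.data)

if __name__ == "__main__":
    rospy.init_node("iteach_laptop_bridge")
    rospy.Subscriber("/head_camera/rgb/image_raw", Image, on_rgb)                 # assumed topic
    rospy.Subscriber("/head_camera/depth_registered/image_raw", Image, on_depth)  # assumed topic
    rospy.Subscriber("/hololens/gaze_points", PointStamped, on_gaze)              # assumed topic
    rospy.Subscriber("/hololens/voice_trigger", String, on_voice)                 # assumed topic
    rospy.spin()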

FS3 Labeling

01 HumanPlay
HumanPlay interaction
The human rearranges objects to reduce occlusion and produce a clean final frame while a short 5–10 s RGB-D sequence is recorded.
02 Hands-free annotation
FS3 annotation via eye-gaze and voice
Eye-gaze places point prompts on the final frame; a voice command triggers SAM2 to convert them into bounding-box object labels.
03 Label propagation
SAM2 label propagation
SAM2 (video mode) propagates masks backwards from the final annotated frame to all earlier frames, producing dense per-frame supervision (see the sketch below).
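A minimal sketch of steps 02–03 with the public SAM2 video predictor, assuming the clip has been dumped as a directory of JPEG frames and that gaze fixations arrive as (x, y) pixel coordinates on the final frame. The checkpoint/config names and the box-extraction note are illustrative, not the exact paths used by iTeach.

# Sketch: prompt only the final (clean) frame with gaze points, then let SAM2
# propagate the resulting masks backwards through the short clip.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")  # assumed paths
state = predictor.init_state(video_path="clip_frames/")    # directory of JPEG frames
last = state["num_frames"] - 1                              # index of the final, annotated frame

gaze_points = {1: (312, 240), 2: (455, 198)}                # obj_id -> (x, y) gaze fixation (example values)
with torch.inference_mode():
    for obj_id, (x, y) in gaze_points.items():
        predictor.add_new_points_or_box(
            state, frame_idx=last, obj_id=obj_id,
            points=np.array([[x, y]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),           # 1 = positive click
        )
    video_masks = {}                                        # frame_idx -> {obj_id: binary mask}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(
        state, start_frame_idx=last, reverse=True           # backwards: final frame -> frame 0
    ):
        video_masks[frame_idx] = {
            oid: (mask_logits[i] > 0.0).cpu().numpy() for i, oid in enumerate(obj_ids)
        }

Bounding boxes for the final-frame labels can then be read off each mask, e.g. from the min/max of its nonzero pixel coordinates.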

Results

Perception improves · Manipulation follows.

Perception adaptation
UOIS combined score: 26.1 (pretrained MSMFormer) → 80.7 after iTeach fine-tuning, a +54.6 absolute gain (3.1×).
UOIS qualitative comparison across iTeach fine-tuning stages
Qualitative UOIS. Left → right: ground truth, pretrained MSMFormer, and iTeach fine-tuning rounds FT1, FT3, FT5. iTeach recovers missed instances and cleans up over-segmentation across tabletop, shelves, sofas, stairs, and floor-level scenes.
iTeach-UOIS on SceneReplica & real robots
  1. RGB-D input
  2. iTeach-UOIS segmentation
  3. Contact-GraspNet grasp proposals
  4. GTO motion planning
  5. Grasp execution

Only stage 02 differs across comparisons — everything else is held fixed.
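A hypothetical wiring of the five stages above, shown to make the held-fixed comparison concrete. Every callable here (segment, propose_grasps, plan_motion, robot.execute) is an illustrative placeholder standing in for iTeach-UOIS/MSMFormer, Contact-GraspNet, and GTO, not their actual interfaces.

# Hypothetical pipeline sketch: only the `segment` callable differs between the
# MSMFormer baseline and iTeach-UOIS runs; proposal, planning, and execution stay fixed.
def run_pick(rgb, depth, segment, propose_grasps, plan_motion, robot):
    masks = segment(rgb, depth)               # stage 02: UOIS instance masks
    grasps = propose_grasps(depth, masks)     # stage 03: Contact-GraspNet proposals
    for grasp in grasps:                      # try proposals in ranked order
        trajectory = plan_motion(grasp)       # stage 04: GTO motion planning
        if trajectory is not None:
            return robot.execute(trajectory)  # stage 05: grasp execution
    return False                              # no feasible grasp found

# Swapping perception models is then a one-argument change:
#   run_pick(rgb, depth, segment=msmformer_segment, ...)
#   run_pick(rgb, depth, segment=iteach_uois_segment, ...)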

Grasping success (out of 100): prior best, MSMFormer 71 → iTeach-UOIS 74 (+3)
Pick & place success (out of 100): prior best, MSMFormer 65 → iTeach-UOIS 72 (+7)
Real-world pick-and-place with GTO

With iTeach-UOIS swapped in, the same real-robot pipeline handles the clutter and unseen objects that the pretrained baseline fails on, turning perception gains into reliable picks and places in the wild.

iTeach-HumanPlay Dataset
48 scenes (45 train · 3 test)
~13 K training samples with dense masks via FS3
902 test samples from held-out scenes
5–10 s per sequence, RGB-D at 640 × 480
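A minimal PyTorch-style loader sketch for samples of this shape. The directory layout and file naming (scene folders holding *_color.png / *_depth.png / *_mask.png triples) are assumptions for illustration; the released dataset may be organized differently.

# Hypothetical loader for HumanPlay-style samples (640 x 480 RGB-D + dense instance masks).
# The file layout below is an assumption, not the released format.
from pathlib import Path
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class HumanPlayDataset(Dataset):
    def __init__(self, root, scenes):
        # e.g. scenes = ["scene_01", ..., "scene_45"] for the training split
        self.color_paths = sorted(
            p for scene in scenes for p in (Path(root) / scene).glob("*_color.png")
        )

    def __len__(self):
        return len(self.color_paths)

    def __getitem__(self, idx):
        color_path = self.color_paths[idx]
        rgb = np.asarray(Image.open(color_path))                                   # 480 x 640 x 3
        depth = np.asarray(Image.open(str(color_path).replace("color", "depth")))  # raw depth image
        mask = np.asarray(Image.open(str(color_path).replace("color", "mask")))    # per-pixel instance ids
        return rgb, depth, mask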

Code & Data

BibTeX

@misc{p2026iteachwildinteractiveteaching,
  title         = {iTeach: In the Wild Interactive Teaching for Failure-Driven Adaptation of Robot Perception},
  author        = {Jishnu Jaykumar P and Cole Salvato and Vinaya Bomnale and Jikai Wang and Yu Xiang},
  year          = {2026},
  eprint        = {2410.09072},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2410.09072}
}

Contact

Send any comments or questions to Jishnu: jishnu.p@utdallas.edu

Acknowledgements

Supported by the DARPA Perceptually-enabled Task Guidance program (award HR00112220005), the Sony Research Award Program, and the National Science Foundation (Grant No. 2346528).

Thanks to Sai Haneesh Allu for assistance with the real-robot experiments.