TL;DR When a pretrained perception model fails in the wild, a co-located human performs a short HumanPlay interaction to clean the scene, then labels just the final frame hands-free with eye-gaze + voice. SAM2 converts the gaze points into bounding-box labels and propagates masks backwards across the short RGB-D clip, turning one annotation into dense supervision. These failure-driven samples iteratively fine-tune MSMFormer, lifting the UOIS combined score by +54.6 and improving real-robot grasping and pick-and-place success on the SceneReplica benchmark by +3 and +7, respectively.
iTeach overview video thumbnail

Abstract

Robotic perception models often fail in the real world due to clutter, occlusion, and novel objects. Existing approaches rely on offline data collection and retraining, which is slow and blind to deployment-time failures. We propose iTeach, a failure-driven interactive teaching framework that adapts robot perception in the wild. A co-located human observes live predictions, triggers a short HumanPlay interaction on a failed object, and records an RGB-D sequence. Our Few-Shot Semi-Supervised (FS3) labeling annotates only the final frame using hands-free eye-gaze and voice; SAM2 propagates the mask across the sequence for dense supervision. Iterative fine-tuning on these samples progressively improves an MSMFormer UOIS model, translating into higher grasping and pick-and-place success on the SceneReplica benchmark and real-robot experiments.

Overview

iTeach overview

A pretrained perception model fails in the wild under clutter, occlusion, and novel objects. A co-located human performs a short HumanPlay interaction, annotates a single frame with eye-gaze + voice, and we propagate the label across the short RGB-D sequence. Failure-driven samples feed an iterative fine-tuning loop; the best checkpoint is redeployed.
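The outer loop can be pictured as below. This is a minimal sketch, assuming hypothetical helper callables (collect_failure_clips, fs3_label, finetune, evaluate) that stand in for the HumanPlay recording, FS3 labeling, MSMFormer fine-tuning, and held-out evaluation described on this page; it is not the released implementation.

# Hypothetical sketch of the iTeach loop: deploy, collect failure clips,
# label them with FS3, fine-tune, and keep the best checkpoint.
def iteach_loop(model, collect_failure_clips, fs3_label, finetune, evaluate, rounds=5):
    best_model, best_score = model, evaluate(model)        # score on held-out scenes
    dataset = []                                           # failure-driven training samples
    for _ in range(rounds):
        for clip in collect_failure_clips(best_model):     # short HumanPlay RGB-D sequences
            dataset.extend(fs3_label(clip))                # one annotated frame -> dense masks
        candidate = finetune(best_model, dataset)          # fine-tune MSMFormer on all samples so far
        score = evaluate(candidate)
        if score > best_score:                             # redeploy only the best checkpoint
            best_model, best_score = candidate, score
    return best_model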

System Architecture

iTeach system setup

Hardware. Fetch mobile manipulator (RGB-D) · Microsoft HoloLens 2 (gaze + voice annotation, live prediction overlay) · Lenovo Legion Pro 7 laptop with RTX 4090 (inference, SAM2 propagation, fine-tuning). All compute runs locally on the laptop; the human drives the robot with a PS4 controller for scene exploration. RGB-D streams to the laptop over wired Ethernet; the HoloLens connects over a laptop-hosted Wi-Fi hotspot via the ROS-TCP connector.
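A minimal rospy sketch of the laptop-side wiring under the setup above. All topic names are assumptions for illustration, not the released interface; the HoloLens side would publish gaze and voice messages through a ROS-TCP endpoint hosted on the laptop's hotspot.

#!/usr/bin/env python
# Hypothetical laptop-side bridge node: RGB-D arrives from the Fetch over Ethernet,
# gaze points and voice triggers arrive from the HoloLens via the ROS-TCP endpoint.
# Every topic name below is an illustrative assumption.
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import PointStamped
from std_msgs.msg import String

def on_rgb(msg):
    rospy.loginfo_throttle(5.0, "RGB frame %dx%d" % (msg.width, msg.height))

def on_depth(msg):
    rospy.loginfo_throttle(5.0, "Depth frame received")

def on_gaze(msg):
    rospy.loginfo("Gaze point prompt at (%.1f, %.1f)" % (msg.point.x, msg.point.y))

def on_voice(msg):
    rospy.loginfo("Voice command: %s" % msg.data)

if __name__ == "__main__":
    rospy.init_node("iteach_laptop_bridge")
    rospy.Subscriber("/head_camera/rgb/image_raw", Image, on_rgb)                 # assumed topic
    rospy.Subscriber("/head_camera/depth_registered/image_raw", Image, on_depth)  # assumed topic
    rospy.Subscriber("/hololens/gaze_points", PointStamped, on_gaze)              # assumed topic
    rospy.Subscriber("/hololens/voice_trigger", String, on_voice)                 # assumed topic
    rospy.spin()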

FS3 Labeling

01 HumanPlay
HumanPlay interaction
The human rearranges objects to reduce occlusion and produce a clean final frame while a short 5–10 s RGB-D sequence is recorded.
02 Hands-free annotation
FS3 annotation via eye-gaze and voice
Eye-gaze places point prompts on the final frame; a voice command triggers SAM2 to convert them into bounding-box object labels.
03 Label propagation
SAM2 label propagation
SAM2 (video mode) propagates masks backwards from the final annotated frame to all earlier frames, producing dense per-frame supervision (see the sketch below).
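A minimal sketch of steps 02–03 with the public SAM2 video predictor, assuming the clip has been dumped as a directory of JPEG frames and that gaze fixations arrive as (x, y) pixel coordinates on the final frame. The checkpoint/config names and the box-extraction note are illustrative, not the exact paths used by iTeach.

# Sketch: prompt only the final (clean) frame with gaze points, then let SAM2
# propagate the resulting masks backwards through the short clip.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")  # assumed paths
state = predictor.init_state(video_path="clip_frames/")    # directory of JPEG frames
last = state["num_frames"] - 1                              # index of the final, annotated frame

gaze_points = {1: (312, 240), 2: (455, 198)}                # obj_id -> (x, y) gaze fixation (example values)
with torch.inference_mode():
    for obj_id, (x, y) in gaze_points.items():
        predictor.add_new_points_or_box(
            state, frame_idx=last, obj_id=obj_id,
            points=np.array([[x, y]], dtype=np.float32),
            labels=np.array([1], dtype=np.int32),           # 1 = positive click
        )
    video_masks = {}                                        # frame_idx -> {obj_id: binary mask}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(
        state, start_frame_idx=last, reverse=True           # backwards: final frame -> frame 0
    ):
        video_masks[frame_idx] = {
            oid: (mask_logits[i] > 0.0).cpu().numpy() for i, oid in enumerate(obj_ids)
        }

Bounding boxes for the final-frame labels can then be read off each mask, e.g. from the min/max of its nonzero pixel coordinates.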

Results

Perception improves · Manipulation follows.

Perception adaptation
UOIS combined score: 26.1 (pretrained MSMFormer) → 80.7 after iTeach fine-tuning, a +54.6 absolute gain (3.1×).
UOIS qualitative comparison across iTeach fine-tuning stages
Qualitative UOIS. Left → right: ground truth, pretrained MSMFormer, and iTeach fine-tuning rounds FT1, FT3, FT5. iTeach recovers missed instances and cleans up over-segmentation across tabletop, shelves, sofas, stairs, and floor-level scenes.
iTeach-UOIS on SceneReplica & real robots
  1. RGB-D input
  2. iTeach-UOIS segmentation
  3. Contact-GraspNet grasp proposals
  4. GTO motion planning
  5. Grasp execution

Only stage 02 differs across comparisons — everything else is held fixed.
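A hypothetical wiring of the five stages above, shown to make the held-fixed comparison concrete. Every callable here (segment, propose_grasps, plan_motion, robot.execute) is an illustrative placeholder standing in for iTeach-UOIS/MSMFormer, Contact-GraspNet, and GTO, not their actual interfaces.

# Hypothetical pipeline sketch: only the `segment` callable differs between the
# MSMFormer baseline and iTeach-UOIS runs; proposal, planning, and execution stay fixed.
def run_pick(rgb, depth, segment, propose_grasps, plan_motion, robot):
    masks = segment(rgb, depth)               # stage 02: UOIS instance masks
    grasps = propose_grasps(depth, masks)     # stage 03: Contact-GraspNet proposals
    for grasp in grasps:                      # try proposals in ranked order
        trajectory = plan_motion(grasp)       # stage 04: GTO motion planning
        if trajectory is not None:
            return robot.execute(trajectory)  # stage 05: grasp execution
    return False                              # no feasible grasp found

# Swapping perception models is then a one-argument change:
#   run_pick(rgb, depth, segment=msmformer_segment, ...)
#   run_pick(rgb, depth, segment=iteach_uois_segment, ...)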

Grasping success (out of 100): prior best, MSMFormer 71 → iTeach-UOIS 74 (+3)
Pick & place success (out of 100): prior best, MSMFormer 65 → iTeach-UOIS 72 (+7)
Real-world pick-and-place with GTO

With iTeach-UOIS swapped in, the same real-robot pipeline handles the clutter and unseen objects that the pretrained baseline fails on, turning perception gains into reliable picks and places in the wild.

iTeach-HumanPlay Dataset
48 scenes (45 train · 3 test)
~13 K training samples with dense masks via FS3
902 test samples from held-out scenes
5–10 s per sequence, RGB-D at 640 × 480
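A minimal PyTorch-style loader sketch for samples of this shape. The directory layout and file naming (scene folders holding *_color.png / *_depth.png / *_mask.png triples) are assumptions for illustration; the released dataset may be organized differently.

# Hypothetical loader for HumanPlay-style samples (640 x 480 RGB-D + dense instance masks).
# The file layout below is an assumption, not the released format.
from pathlib import Path
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class HumanPlayDataset(Dataset):
    def __init__(self, root, scenes):
        # e.g. scenes = ["scene_01", ..., "scene_45"] for the training split
        self.color_paths = sorted(
            p for scene in scenes for p in (Path(root) / scene).glob("*_color.png")
        )

    def __len__(self):
        return len(self.color_paths)

    def __getitem__(self, idx):
        color_path = self.color_paths[idx]
        rgb = np.asarray(Image.open(color_path))                                   # 480 x 640 x 3
        depth = np.asarray(Image.open(str(color_path).replace("color", "depth")))  # raw depth image
        mask = np.asarray(Image.open(str(color_path).replace("color", "mask")))    # per-pixel instance ids
        return rgb, depth, mask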

Code & Data

BibTeX

@misc{p2026iteachwildinteractiveteaching,
  title         = {iTeach: In the Wild Interactive Teaching for Failure-Driven Adaptation of Robot Perception},
  author        = {Jishnu Jaykumar P and Cole Salvato and Vinaya Bomnale and Jikai Wang and Yu Xiang},
  year          = {2026},
  eprint        = {2410.09072},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2410.09072}
}

Contact

Send any comments or questions to Jishnu: jishnu.p@utdallas.edu

Acknowledgements

Supported by the DARPA Perceptually-enabled Task Guidance program (award HR00112220005), the Sony Research Award Program, and the National Science Foundation (Grant No. 2346528).

Thanks to Sai Haneesh Allu for assistance with the real-robot experiments.