HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang¹, Qifan Zhang¹, Yu-Wei Chao², Bowen Wen², Xiaohu Guo¹, Yu Xiang¹

¹University of Texas at Dallas ²NVIDIA
NeurIPS 2025, Datasets and Benchmarks Track


Abstract

We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling. With this system, we captured a video dataset of humans interacting with objects to perform various tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordance, which can serve as human demonstrations for research in embodied AI and robot manipulation. Our data capture setup and annotation framework will be available for the community to use in reconstructing 3D shapes of objects and human hands and tracking their poses in videos.

Video

Task 1: Pick-and-Place

Task 2: Handover

Task 3: Affordance Usage

Task 4: IsaacSim Replay

Visualizations of all sequences can be found in Sequence_Renderings.

Overview

Data Capture System

8 RealSense Cameras + HoloLens, No Mo-cap.

Object Shape Reconstruction

Recorded with a single Azure Kinect camera.

All Object Shapes in HO-Cap Dataset.

Annotation Pipeline (Semi-Automatic)

The only human annotation required is to (1) manually prompt two points for each object in the initial frame, which SAM2 uses to generate an initial segmentation mask of the object, and (2) label the name of the object to associate it with an object in our database.
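The two manual inputs per object can be captured in a small record. The sketch below is illustrative only: `ObjectPrompt` and `to_point_prompt` are hypothetical names, not part of HOCap-Annotation. The array layout (an N×2 array of (x, y) pixel coordinates plus N labels, with 1 marking foreground) follows the point-prompt convention commonly used by SAM-style predictors; whether our pipeline passes prompts in exactly this form is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass(frozen=True)
class ObjectPrompt:
    """Hypothetical record of the manual input for one object in the first frame."""

    name: str  # object name used to associate the mask with a database object
    points: Tuple[Tuple[float, float], Tuple[float, float]]  # two (x, y) pixel clicks


def to_point_prompt(prompt: ObjectPrompt) -> Tuple[np.ndarray, np.ndarray]:
    """Convert the two clicks into (coords, labels) arrays.

    coords has shape (2, 2) in (x, y) pixel order. labels has shape (2,)
    and is all ones because both clicks are assumed to lie on the object.
    """
    coords = np.asarray(prompt.points, dtype=np.float32)
    if coords.shape != (2, 2):
        raise ValueError("expected exactly two (x, y) points")
    labels = np.ones(len(coords), dtype=np.int32)
    return coords, labels
```

One such record per object in the initial frame would cover both annotation steps: the points seed the segmentation and the name links the mask to the object database.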

Code

HOCap-Toolkit

A Python package that provides evaluation and visualization tools for the HO-Cap dataset.

HOCap-Annotation

A Python package that provides hand-object pose annotations for the HO-Cap dataset.

Dataset

License

The HO-Cap dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Object Information

Object descriptions and purchase links: Objects_Info.
Recordings for object reconstruction: Objects_Collection.

Dataset Download

Option One: download the data with the script provided by HOCap-Toolkit.
Option Two: download the individual zipped data from Box manually:

Once you have downloaded the zip files, extract them to the "./datasets/HO-Cap" folder. The directory structure should look like the following:

datasets/HO-Cap
  ├── calibration
  ├── models
  ├── subject_1
  │   ├── 20231025_165502
  │   │   ├── 037522251142
  │   │   │   ├── color_000000.jpg
  │   │   │   ├── depth_000000.png
  │   │   │   ├── label_000000.npz
  │   │   │   └── ...
  │   │   ├── 043422252387
  │   │   ├── ...
  │   │   ├── hololens_kv5h72
  │   │   ├── meta.yaml
  │   │   ├── poses_m.npy
  │   │   ├── poses_o.npy
  │   │   └── poses_pv.npy
  │   ├── 20231025_165502
  │   └── ...
  ├── ...
  └── subject_9

For instructions on using the dataset, please see HOCap-Toolkit.
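As a quick sanity check on the extracted layout, the tree above can be walked with the standard library alone. This is a minimal sketch, not the HOCap-Toolkit API: `list_sequences` and `frame_paths` are hypothetical helper names. The only assumptions are the folder and file naming shown in the tree (`subject_*` folders, one subfolder per sequence, and `color_XXXXXX.jpg`, `depth_XXXXXX.png`, and `label_XXXXXX.npz` inside each camera folder).

```python
from pathlib import Path
from typing import Dict, List


def list_sequences(root: Path) -> Dict[str, List[Path]]:
    """Map each subject folder name to its sorted sequence folders."""
    sequences: Dict[str, List[Path]] = {}
    for subject in sorted(root.glob("subject_*")):
        if subject.is_dir():
            # Every subdirectory of a subject folder is treated as one sequence.
            sequences[subject.name] = sorted(
                p for p in subject.iterdir() if p.is_dir()
            )
    return sequences


def frame_paths(sequence: Path, camera: str, frame: int) -> Dict[str, Path]:
    """Build the color/depth/label paths for one frame of one camera folder."""
    cam_dir = sequence / camera
    return {
        "color": cam_dir / f"color_{frame:06d}.jpg",
        "depth": cam_dir / f"depth_{frame:06d}.png",
        "label": cam_dir / f"label_{frame:06d}.npz",
    }
```

Calling `list_sequences(Path("./datasets/HO-Cap"))` should return one entry per subject folder, and an empty dictionary suggests the archives were extracted to a different location. The helpers only construct paths. They do not read the images, the label archives, or the `poses_*.npy` files, whose contents are documented in HOCap-Toolkit.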

BibTeX

@inproceedings{wang2025hocap,
title={{HO}-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction},
author={Jikai Wang and Qifan Zhang and Yu-Wei Chao and Bowen Wen and Xiaohu Guo and Yu Xiang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
}

Acknowledgement

This work was supported in part by the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005, the Sony Research Award Program, and the National Science Foundation (NSF) under Grant No. 2346528.