RPX Benchmark Toolkit

Bring your model. We bring the dataset, the splits, the metrics, and the tables.

RPX enables you to choose and rank perception models for robot learning — on real-world RGB-D data, under embodied deployment conditions, with ESD-stratified difficulty splits and deployment-readiness scoring.

rpx-benchmark is the reference toolkit for RPX (Robot Perception X), a unified real-world RGB-D benchmark for evaluating the perception models actually deployed inside robot learning stacks, as opposed to generic perception leaderboards. It is built so that a researcher can run an off-the-shelf HuggingFace model on an RPX difficulty split in one command and compare results across the slate of robot-learning backbones, and so that a team can add a whole new task or metric in a single file.

pip install 'rpx-benchmark[depth]'

rpx bench monocular_depth \
    --hf-checkpoint depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf \
    --split hard

That command downloads only the RGB and depth files referenced by the Hard split (via HuggingFace), loads the model, runs inference with a live progress bar, prints an ESD-weighted phase-score table, measures FLOPs and median latency, and writes result.json and summary.md.

No Python code required.
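If you do want to post-process results programmatically, result.json is plain JSON. The snippet below is a minimal sketch of reading it; note that the "scores" key and per-phase layout are assumptions for illustration, not the documented result.json schema.

```python
import json
from pathlib import Path

def load_scores(path: str) -> dict:
    """Parse an `rpx bench` result file and return its per-phase scores.

    NOTE: the "scores" key is an illustrative assumption about the
    result.json schema; check a real result.json for the actual fields.
    """
    data = json.loads(Path(path).read_text())
    return data.get("scores", {})

# Usage (after a benchmark run has written result.json):
# for phase, score in load_scores("result.json").items():
#     print(f"{phase}: {score:.3f}")
```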


Five-minute tour

  • Quickstart


    Install, run your first benchmark, and read the output tables.

  • Bring Your Own Model


    Three paths: zero-code HF checkpoint, numpy callable, or custom adapter stack.

  • Architecture


    How the adapter framework, metric registry, and task registry fit together.

  • Extension Guides


    Add a new task, metric, or model adapter in one file.
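The "numpy callable" path mentioned above can be sketched as follows. This is a toy stand-in, not a real model, and the signature (HxWx3 uint8 RGB in, HxW float32 depth out) is an assumed convention; the Bring Your Own Model guide documents the actual interface and registration step.

```python
import numpy as np

def my_depth_model(rgb: np.ndarray) -> np.ndarray:
    """Toy monocular-depth callable: HxWx3 uint8 RGB -> HxW float32 depth.

    NOTE: this signature is an assumption for illustration; see the
    Bring Your Own Model guide for the real numpy-callable contract.
    """
    # Placeholder "inference": use brightness as a stand-in for inverse depth.
    gray = rgb.astype(np.float32).mean(axis=-1) / 255.0
    return (1.0 / (gray + 1e-3)).astype(np.float32)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
depth = my_depth_model(frame)
assert depth.shape == (480, 640) and depth.dtype == np.float32
```

A callable like this would then be handed to the benchmark in place of an HF checkpoint, keeping everything else (splits, metrics, reports) fixed.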


What RPX is

  • 75,000 frames across 100 indoor scenes, captured with an Intel RealSense D435 (RGB-D) + T265 (6-DoF VIO) rig.
  • Three-phase capture protocol: each scene is recorded as Clutter → Interaction (a human grasps and moves objects) → Clean. This isolates scene reconfiguration from scene identity, so performance deltas mean something.
  • Effort-Stratified Difficulty (ESD) splits per (scene, phase) — Easy / Medium / Hard derived from real annotation effort.
  • Ten benchmark tasks on identical scenes: monocular absolute depth, object segmentation, object tracking, object detection, open-vocab detection, visual grounding, sparse depth, relative camera pose, novel view synthesis, keypoint matching.
  • Scoped first-class around models used as backbones in robot learning, not generic perception SOTA.

Design principles

  1. The only variable should be the model. Datasets, splits, metrics, reports, and deployment-readiness scoring are fixed.
  2. Adding a new task or metric should touch one file. Plugin registries for models, metrics, and tasks make this a hard invariant.
  3. Errors should tell the user what to fix. Every raised exception is a subclass of rpx.exceptions.RPXError and carries a hint line.
  4. Documentation is the docstrings. This site is built from them automatically via mkdocstrings. No separate rewrite exists or will exist.
  5. CPU-first, CUDA-aware. Every pipeline auto-falls-back to CPU with a clear warning when CUDA isn't available.
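Principle 3 ("errors should tell the user what to fix") can be sketched with the pattern below. The real hierarchy lives in rpx.exceptions; this re-implementation, and the SplitNotFoundError subclass name, are illustrative assumptions only.

```python
# Minimal sketch of the "every exception carries a hint" pattern.
# NOTE: illustrative re-implementation, not the actual rpx.exceptions code.
class RPXError(Exception):
    def __init__(self, message: str, hint: str = ""):
        self.hint = hint
        super().__init__(f"{message}\nhint: {hint}" if hint else message)

class SplitNotFoundError(RPXError):  # hypothetical subclass name
    pass

try:
    raise SplitNotFoundError(
        "unknown split 'extreme'",
        hint="valid splits are easy, medium, hard",
    )
except RPXError as err:
    print(err.hint)
```

Because every toolkit error derives from one base class, callers can catch RPXError alone and still surface an actionable hint line.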

Repository

https://github.com/IRVLUTD/RPX

License

  • Benchmark toolkit: MIT
  • RPX dataset: CC BY 4.0