Abstract

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

Overview

Overview of our VLA-Replica benchmark

Overview of the VLA-Replica benchmark. (a)1. Hardware components. (a)2. Our assembled platform with the SO-101 follower arm, the light box, the cameras, and the manipulation workspace. (b) 10 manipulation tasks in the benchmark.

Building the Platform

A user with no prior knowledge of the benchmark was able to build the setup within one hour.
Follow the Setup Guide to build yours!


Manipulation Tasks

10 tasks in our VLA-Replica benchmark

Task definitions and examples for training/ID (in-distribution) and OOD (out-of-distribution) evaluation

Expert Demonstrations

Examples of expert demonstrations collected in our dataset. We provide 50 demonstrations for each task, which can be used for training or fine-tuning. The dataset can be downloaded from Hugging Face.

Expert demonstrations in our VLA-Replica benchmark

Test Scene Reference Images

We provide 90 test scene reference images, shown below. For each task, the first row shows ID (in-distribution) test scenes and the second row shows OOD (out-of-distribution) test scenes (except Tasks 5 and 6). Details of these tasks can be found here. These reference images are already included in our GitHub repository for evaluation.

Task 1 reference images

Task 1: Put bread on plate

Task 2 reference images

Task 2: Put bowl on coaster

Task 3 reference images

Task 3: Stack blocks

Task 4 reference images

Task 4: Fold towel

Task 5 reference images

Task 5: Open oven

Task 6 reference images

Task 6: Clean whiteboard

Task 7 reference images

Task 7: Pour pepper

Task 8 reference images

Task 8: Lift bowl

Task 9 reference images

Task 9: Press button

Task 10 reference images

Task 10: Collect blocks

Leaderboard

Policy evaluation success rates on the VLA-Replica benchmark are shown below for in-distribution and out-of-distribution evaluation.
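As a rough illustration of how success rates like these can be tallied, the sketch below aggregates binary trial outcomes per task and per split. It is only a sketch under stated assumptions: the record layout `(task, split, succeeded)`, the function names `success_rates` and `split_average`, and the unweighted averaging across tasks are all hypothetical and not taken from the benchmark code, and the example outcomes are made up.

```python
from collections import defaultdict


def success_rates(trials):
    """Aggregate binary trial outcomes into per-(task, split) success rates.

    `trials` is an iterable of (task, split, succeeded) tuples, where split is
    "ID" or "OOD". This record layout is a hypothetical one for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (task, split) -> [successes, total]
    for task, split, succeeded in trials:
        entry = counts[(task, split)]
        entry[0] += int(bool(succeeded))
        entry[1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def split_average(rates, split):
    """Unweighted mean of the per-task success rates for one split.

    Averaging tasks with equal weight is an assumption of this sketch.
    """
    values = [rate for (_, s), rate in rates.items() if s == split]
    return sum(values) / len(values) if values else float("nan")


# Made-up outcomes for illustration only; these are not benchmark results.
example_trials = [
    ("Stack blocks", "ID", True),
    ("Stack blocks", "ID", False),
    ("Stack blocks", "OOD", False),
    ("Open oven", "ID", True),
]
rates = success_rates(example_trials)
```

Keeping ID and OOD outcomes under separate keys means the two evaluation settings are never mixed when a per-split average is reported.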

There are two ways to submit results to the leaderboard: (1) Run the VLA-Replica-ID and VLA-Replica-OOD benchmark scenes and share your evaluation videos with the authors. (2) Submit a model checkpoint through Hugging Face, and the authors will evaluate your checkpoint. Contact the authors if you want to add your method.


Training and fine-tuning details for the evaluated policies

References

    Code

    To run evaluation with the benchmark, please check the code below.

    BibTeX

    Please cite VLA-Replica if it helps your research:
    
          @misc{huang2026vlareplicalowcostreproduciblebenchmark,
            title={VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models}, 
            author={Alex S. Huang and Jiahui Zhang and Shiqing Tang and Yu Xiang},
            year={2026},
            eprint={2605.20774},
            archivePrefix={arXiv},
            primaryClass={cs.RO},
            url={https://arxiv.org/abs/2605.20774} 
          }
    

    Contact

    Send any comments or questions to Alex Huang | Jiahui Zhang:
    alex.huang@utdallas.edu | jiahui.zhang@utdallas.edu

    Acknowledgements

    This work was supported in part by the National Science Foundation (NSF) under Grant Nos. 2346528 and 2520553, the NVIDIA Academic Grant Program Award, and gift funding from XPeng.