Hub (rpx_benchmark.hub)
Task-aware HuggingFace dataset downloader. It fetches only the modalities a given task needs; because the HF cache is content-addressed, switching tasks on the same scenes downloads only the files that are not already cached.
HuggingFace Hub integration for the RPX benchmark.
Task-aware downloads: fetches only the modalities a given task needs, reusing HF's content-addressed cache so switching tasks on the same scenes only pulls the new label files.
Repo layout (on HF)::

    rpx-benchmark/
    ├── metadata/
    │   ├── scenes.parquet
    │   └── esd_scores.parquet
    ├── manifests/
    │   └── <task>/<difficulty>.json    # logical views, not duplicates
    └── scenes/scene_000/{0,1,2}/
        ├── rgb/*.png
        ├── depth/*.png                 # 16-bit mm
        ├── mask/*.png                  # integer instance IDs
        ├── pose/*.npz
        ├── tracklets.json
        ├── questionnaires.json
        ├── spatial_qa.json
        └── general_qa.json
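As a rough illustration of how a task-aware download can limit itself to a few modalities, the sketch below builds glob patterns from the layout above and filters a file listing with them. The `TASK_MODALITIES` mapping, the helper names, and the frame file names are hypothetical and chosen only for illustration; the library's actual mapping may differ.

```python
from fnmatch import fnmatch

# Hypothetical task -> modality mapping; the library's real mapping may differ.
TASK_MODALITIES = {
    "monocular_depth": ["rgb/*.png", "depth/*.png"],
    "visual_grounding": ["rgb/*.png", "depth/*.png", "spatial_qa.json"],
}


def patterns_for(task, scene_phases):
    """Build one glob per (scene, phase, modality) for the given task."""
    return [
        f"scenes/{scene}/{phase}/{modality}"
        for scene, phase in scene_phases
        for modality in TASK_MODALITIES[task]
    ]


def select_files(repo_files, patterns):
    """Keep only the repo files matched by at least one pattern."""
    return [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]


# Illustrative listing; real frame names are not specified in the docs.
repo_files = [
    "scenes/scene_000/0/rgb/000000.png",
    "scenes/scene_000/0/depth/000000.png",
    "scenes/scene_000/0/mask/000000.png",
    "scenes/scene_000/0/spatial_qa.json",
    "scenes/scene_000/1/rgb/000000.png",
]

patterns = patterns_for("monocular_depth", [("scene_000", 0)])
selected = select_files(repo_files, patterns)
```

Here only the `rgb` and `depth` files of phase `0` are selected; masks, QA files, and other phases are skipped.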
fetch_manifest(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None) -> Dict[str, Any]
Download and parse the task-level manifest for (task, split).
Manifests are small (hundreds of KB) and are fetched eagerly so the caller can discover which (scene, phase) dirs the split references before kicking off a bulk download.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task | TaskType or str | | required |
| split | Difficulty or str | | required |
| repo_id | str | HuggingFace dataset repo id. Defaults to :data:`DEFAULT_REPO_ID`. | DEFAULT_REPO_ID |
| cache_dir | str or Path | | None |
| revision | str | | None |
Returns:
| Type | Description |
|---|---|
| dict | Parsed manifest JSON. |
Raises:
| Type | Description |
|---|---|
| DownloadError | If the download fails (network, auth, bad repo id) or the manifest file does not exist on the hub. |
| ManifestError | If the downloaded file is not valid JSON. |
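The parsing half of this error contract can be sketched with the standard library alone. The `ManifestError` class below is a local stand-in for the library's exception, and `parse_manifest` is a hypothetical helper; the check that the root is a JSON object is an assumption based on the documented `dict` return type.

```python
import json


class ManifestError(Exception):
    """Local stand-in for the library's ManifestError."""


def parse_manifest(text):
    """Parse manifest text, converting JSON errors into ManifestError."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ManifestError(f"manifest is not valid JSON: {exc}") from exc
    # Assumption: a manifest is a JSON object, matching the dict return type.
    if not isinstance(manifest, dict):
        raise ManifestError("manifest root must be a JSON object")
    return manifest
```

Chaining with `from exc` keeps the original decoder error available in the traceback, which helps when a truncated download is the real cause.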
Source code in rpx_benchmark/hub.py
download_split(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None, extra_modalities: Sequence[str] | None = None, max_workers: int = 8) -> Path
Download only the files that (task, split) needs and return the path to the resolved manifest.
The resolved manifest is a JSON file whose ``root`` field points to the local HF snapshot directory, so it can be fed directly to :meth:`RPXDataset.from_manifest`.
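A minimal sketch of what "resolving" a manifest could look like, assuming it amounts to writing a copy whose `root` field holds the absolute local snapshot path. The helper name and the `samples` field are hypothetical; only `root` is described above.

```python
import json
import tempfile
from pathlib import Path


def write_resolved_manifest(manifest, snapshot_dir, out_path):
    """Copy the manifest, set its root to the local snapshot dir, and save it."""
    resolved = dict(manifest)  # shallow copy so the input is not mutated
    resolved["root"] = str(Path(snapshot_dir).resolve())
    out_path = Path(out_path)
    out_path.write_text(json.dumps(resolved, indent=2))
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # "samples" is a hypothetical field; only "root" is described in the docs.
    manifest = {"samples": ["scenes/scene_000/0"]}
    path = write_resolved_manifest(manifest, tmp, Path(tmp) / "resolved.json")
    loaded = json.loads(path.read_text())
    root_is_dir = Path(loaded["root"]).is_dir()
```

Storing an absolute path means the resolved file can be opened from any working directory, at the cost of not being portable across machines.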
load(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None, batch_size: int = 1) -> RPXDataset
Download (task, split) and return an iterable :class:`RPXDataset`.
Incremental re-use::

    # First run: fetches rgb + depth for 'hard' scenes.
    depth_ds = rpx.load("monocular_depth", "hard")

    # Second run: rgb/depth already cached, only spatial_qa.json fetched.
    qa_ds = rpx.load("visual_grounding", "hard")
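The incremental behaviour can be modelled as a set difference between the files a task needs and those already cached. This is a simplified sketch: the per-task file lists are hypothetical, and the real HF cache is keyed by content hash, not by path as here.

```python
def files_to_fetch(needed, cached):
    """Return the needed files that are not yet in the cache, sorted."""
    return sorted(set(needed) - set(cached))


# Hypothetical per-task file lists for one (scene, phase) directory.
depth_files = ["rgb/000000.png", "depth/000000.png"]
qa_files = ["rgb/000000.png", "depth/000000.png", "spatial_qa.json"]

cache = set()

# First task: nothing is cached yet, so every needed file is fetched.
first = files_to_fetch(depth_files, cache)
cache.update(first)

# Second task: rgb and depth are cached, only the label file is new.
second = files_to_fetch(qa_files, cache)
cache.update(second)
```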
mount(repo_id: str = DEFAULT_REPO_ID)
Return an HfFileSystem rooted at the RPX repo for lazy browsing.
Each read goes over the network; prefer :func:`load` for real workloads.