Top-level package¶
The rpx_benchmark top-level package re-exports the most commonly
used symbols so user code can write import rpx_benchmark as rpx
and reach everything through one namespace.
rpx_benchmark
¶
RPX — choose and rank perception models for robot learning.
rpx_benchmark is the reference toolkit for the RPX benchmark: a
unified real-world RGB-D evaluation suite for the models actually
deployed inside robot learning stacks. Bring your model (any
HuggingFace checkpoint, numpy callable, or custom torch stack) and
the toolkit handles dataset download, splits, metrics, reports, and
ESD-weighted deployment-readiness scoring.
See the top-level README and the online documentation for getting started, the adapter framework, and the extension guides.
Difficulty
¶
Bases: str, Enum
Effort-Stratified Difficulty (ESD) split label.
ESD splits are derived per (scene, phase) from the
annotation-effort signal described in paper §4. See
:mod:rpx_benchmark.deployment for the scoring details.
Members

- EASY: Few annotation iterations, low occlusion, stable visibility.
- MEDIUM
- HARD: Many annotation iterations, dense occlusion, high depth-invalid fraction, high jerk.
Phase
¶
Bases: str, Enum
Capture phases of the three-phase RPX reconfiguration protocol.
Every scene is recorded in three phases so the benchmark can attribute performance changes to scene state rather than to lighting / viewpoint / camera identity.
Members

- CLUTTER: Initial dense object arrangement; significant inter-object occlusion.
- INTERACTION: Human operator grasps and moves objects. Introduces hand-object contact and transient occlusion.
- CLEAN: Same objects re-organised sparsely. Serves as a within-scene control for the other two phases.
TaskType
¶
Bases: str, Enum
Enumeration of every task the benchmark toolkit recognises.
Members are plain strings so they serialise cleanly to JSON and can be used as dict keys for logging / table rows.
Members

- MONOCULAR_DEPTH: Dense metric depth from a single RGB frame.
- OBJECT_DETECTION: Closed-vocabulary detection with category labels.
- OBJECT_SEGMENTATION: Instance segmentation masks with per-pixel instance IDs.
- OBJECT_TRACKING: Multi-object tracking with persistent track IDs.
- RELATIVE_CAMERA_POSE: 6-DoF pose of frame B relative to frame A.
- OPEN_VOCAB_DETECTION: Detection conditioned on a free-text vocabulary.
- VISUAL_GROUNDING: Referring expression → bounding box on the image.
- SPARSE_DEPTH: Depth values at a sparse set of image locations only.
- NOVEL_VIEW_SYNTHESIS: RGB synthesis from a held-out target pose.
- KEYPOINT_MATCHING: Dense/sparse correspondences between two images.
Examples:
>>> from rpx_benchmark.api import TaskType
>>> TaskType.MONOCULAR_DEPTH.value
'monocular_depth'
>>> TaskType("monocular_depth") is TaskType.MONOCULAR_DEPTH
True
Sample(id: str, rgb: np.ndarray, ground_truth: Any, metadata: Dict[str, Any] | None = None, phase: Phase | None = None, difficulty: Difficulty | None = None, camera_pose: np.ndarray | None = None)
dataclass
¶
One input unit handed by :class:RPXDataset to a model.
Samples are produced by the loader and consumed by
BenchmarkModel.predict. Every field is deliberately simple
(numpy arrays, enums, plain dicts) so models and adapters don't
need to know anything about the on-disk dataset format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| id | str | Unique identifier of the form `{scene}_{phase}_{frame}`, e.g. `scene_001_clutter_00000`. | required |
| rgb | ndarray | H × W × 3 uint8 RGB image in row-major order. | required |
| ground_truth | Any | Task-specific GroundTruth dataclass. | required |
| metadata | dict | Free-form metadata the loader can attach; conventionally holds fisheye images, secondary RGB frames for pair tasks, and any label paths that do not fit into the ground-truth dataclass. Consumers should treat unknown keys as opaque. | None |
| phase | Phase | Capture phase the frame belongs to. Required for ESD-weighted phase scoring. | None |
| difficulty | Difficulty | ESD difficulty label of the sample. | None |
| camera_pose | ndarray | 4 × 4 float64 SE(3) matrix (camera → world) sourced from the T265 tracker. Used for the temporal-stability metric. | None |
BenchmarkModel
¶
Bases: ABC
Abstract base class every RPX-compatible model must implement.
In practice, most users should not subclass this directly —
instead compose a :class:rpx_benchmark.adapters.BenchmarkableModel
from an input adapter, a model callable, and an output adapter.
BenchmarkableModel already implements :meth:predict and
:meth:setup correctly for you.
Subclass only when you need complete control over how samples are routed to your model (e.g. true minibatching across GPU devices).
Attributes:

| Name | Type | Description |
|---|---|---|
| task | TaskType | The task this model solves. Must be set by subclasses, either at class level or per instance. |
Examples:
Minimal subclass::
class MyDepth(BenchmarkModel):
task = TaskType.MONOCULAR_DEPTH
def setup(self):
self.net = load_my_checkpoint()
def predict(self, batch):
return [
DepthPrediction(depth_map=self.net(s.rgb))
for s in batch
]
Composed via :class:BenchmarkableModel::
bm = rpx.BenchmarkableModel(
task=TaskType.MONOCULAR_DEPTH,
input_adapter=MyInputAdapter(),
model=my_nn_module,
output_adapter=MyOutputAdapter(),
name="my_model",
)
setup() -> None
abstractmethod
¶
Load checkpoints, warm CUDA, and do any other one-time init.
The runner calls this exactly once before iterating the
dataset, unless BenchmarkRunner(call_setup=False) was
passed — in which case the caller is responsible.
Source code in rpx_benchmark/api.py
predict(batch: Sequence[Sample]) -> Sequence[Any]
abstractmethod
¶
Run inference on a batch of samples.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | sequence of Sample | One or more samples. Length equals the dataset's batch_size. | required |

Returns:

| Type | Description |
|---|---|
| sequence | One task-specific Prediction dataclass per input sample, in the same order. The prediction dataclass must match what the task's metric suite expects. |

Raises:

| Type | Description |
|---|---|
| ModelError | (By convention) when a sample cannot be processed. The runner surfaces it as a clean error rather than a stack trace. |
Source code in rpx_benchmark/api.py
RPXDataset(samples: List[Dict[str, Any]], task: TaskType, root: Path, batch_size: int = 1)
dataclass
¶
Iterates over RPX samples for a specific task.
Manifest format (JSON)::
{
"task": "object_segmentation",
"root": "/path/to/data",
"samples": [
{
"id": "scene_001_clutter_00000",
"scene": "scene_001",
"phase": "clutter",
"difficulty": "hard",
"rgb": "scene_001/0/rgb/00000.png",
"depth": "scene_001/0/depth/00000.png",
"mask": "scene_001/0/mask/00000.png",
"pose": "scene_001/0/pose/00000.npz",
...
}
]
}
All paths are relative to root.
depth files are 16-bit PNG in millimetres (as saved by save_device_data.py).
pose files are NPZ with keys position ([x,y,z] metres) and
orientation ([x,y,z,w] quaternion) from the T265 tracker.
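The pose NPZ layout above can be assembled into the 4 × 4 camera-to-world matrix that `Sample.camera_pose` carries. A minimal sketch using the standard quaternion-to-rotation conversion (the helper name is ours, not part of the toolkit):

```python
import numpy as np

def pose_npz_to_se3(position: np.ndarray, orientation: np.ndarray) -> np.ndarray:
    """Build a 4x4 camera-to-world SE(3) matrix from a pose NPZ's
    `position` ([x, y, z] metres) and `orientation` ([x, y, z, w] quaternion).
    Assumes a unit quaternion, as produced by the T265 tracker."""
    x, y, z, w = orientation
    # Standard unit-quaternion -> rotation-matrix conversion.
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = position
    return T

# Identity orientation at the origin yields the identity transform.
T = pose_npz_to_se3(np.zeros(3), np.array([0.0, 0.0, 0.0, 1.0]))
```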
from_manifest(manifest_path: str | Path, batch_size: int = 1) -> 'RPXDataset'
classmethod
¶
Load a manifest JSON file from disk and return a dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| manifest_path | str or Path | Path to the manifest JSON (e.g. as produced by :func:`rpx_benchmark.hub.download_split`). | required |
| batch_size | int | Number of samples per iteration. Default 1. | 1 |

Returns:

| Type | Description |
|---|---|
| RPXDataset | |

Raises:

| Type | Description |
|---|---|
| ManifestError | If the manifest file is missing, not valid JSON, or is missing required top-level fields. |
Source code in rpx_benchmark/loader.py
from_dict(manifest: Dict[str, Any], batch_size: int = 1, default_root: str | Path | None = None) -> 'RPXDataset'
classmethod
¶
Build a dataset from an already-parsed manifest dict.
Raises:

| Type | Description |
|---|---|
| ManifestError | If the manifest dict is missing required top-level fields. |
Source code in rpx_benchmark/loader.py
MetricSuite(task: TaskType)
¶
Thin wrapper around the metric registry used by the runner.
Kept as a class rather than a function because the historical API
expects MetricSuite.for_task(...).evaluate(pred, gt). New code
can call :func:compute_metrics directly.
Source code in rpx_benchmark/metrics/registry.py
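The registry pattern the suite wraps can be sketched in a few lines. This is only an illustration of the task → calculators mapping described above, not the toolkit's actual registry API:

```python
# Illustrative sketch: a task-keyed registry of metric calculators whose
# per-calculator results are merged into one dict, as MetricSuite.evaluate does.
from typing import Any, Callable, Dict, List

Calculator = Callable[[Any, Any], Dict[str, float]]
_REGISTRY: Dict[str, List[Calculator]] = {}

def register(task: str, calc: Calculator) -> None:
    _REGISTRY.setdefault(task, []).append(calc)

def evaluate(task: str, pred: Any, gt: Any) -> Dict[str, float]:
    if task not in _REGISTRY:
        raise KeyError(f"no calculators registered for {task!r}")
    merged: Dict[str, float] = {}
    for calc in _REGISTRY[task]:
        merged.update(calc(pred, gt))  # later calculators win on key clashes
    return merged

# A toy absolute-relative-error calculator for scalar depths.
register("monocular_depth", lambda p, g: {"absrel": abs(p - g) / g})
result = evaluate("monocular_depth", 2.0, 4.0)
```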
for_task(task: TaskType) -> 'MetricSuite'
classmethod
¶
Create a suite for the given task.
Raises:

| Type | Description |
|---|---|
| MetricError | If no calculators are registered for the given task. |
Source code in rpx_benchmark/metrics/registry.py
evaluate(prediction: Any, ground_truth: Any) -> Dict[str, float]
¶
Run every registered calculator and return merged results.
Raises:

| Type | Description |
|---|---|
| MetricError | Propagated from individual calculators when inputs are shape-mismatched or wrong-typed. |
Source code in rpx_benchmark/metrics/registry.py
aggregate(per_sample: List[Dict[str, Any]]) -> Dict[str, float]
¶
Mean over numeric metric keys; non-numeric metadata is skipped.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| per_sample | list of dict | Per-sample rows. May contain metric floats and metadata strings/enums in the same dict. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | One float per numeric key. Empty dict if per_sample is empty. |
Source code in rpx_benchmark/metrics/registry.py
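The aggregation rule (mean over numeric keys, non-numeric metadata skipped) can be sketched as follows. This mirrors the documented behaviour, not the toolkit's actual implementation:

```python
# Sketch of the documented aggregation rule: average every numeric key
# across rows, silently skipping metadata values (strings, enums, etc.).
from typing import Any, Dict, List

def aggregate(per_sample: List[Dict[str, Any]]) -> Dict[str, float]:
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for row in per_sample:
        for key, value in row.items():
            # bool is an int subclass; exclude it from "numeric" metrics.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                sums[key] = sums.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

rows = [
    {"absrel": 0.10, "phase": "clutter"},
    {"absrel": 0.30, "phase": "clean"},
]
agg = aggregate(rows)  # only "absrel" survives, averaged over both rows
```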
build_result(per_sample: List[Dict[str, Any]]) -> 'BenchmarkResult'
¶
Convenience: wrap per_sample and its aggregate in a
:class:BenchmarkResult.
Source code in rpx_benchmark/metrics/registry.py
BenchmarkResult(task: TaskType, per_sample: List[Dict[str, Any]], aggregated: Dict[str, float], num_samples: int)
dataclass
¶
Outcome of running a :class:BenchmarkRunner against a dataset.
Attributes:

| Name | Type | Description |
|---|---|---|
| task | TaskType | Which task was evaluated. |
| per_sample | list of dict | One dict per sample. Each dict mixes metric keys (numeric) and metadata keys (strings / enums). |
| aggregated | dict[str, float] | Mean over the numeric metric keys in per_sample. |
| num_samples | int | Number of samples evaluated. |
BenchmarkRunner(model: BenchmarkModel, dataset: RPXDataset, metric_suite: MetricSuite | None = None, call_setup: bool = True)
¶
Runs a benchmark end-to-end for a given model.
Basic usage::
runner = BenchmarkRunner(model=model, dataset=dataset)
result = runner.run()
print(result.aggregated)
Phase-stratified usage (requires manifest with phase / difficulty fields)::
runner = BenchmarkRunner(model=model, dataset=dataset)
result, dr_report = runner.run_with_deployment_readiness(
primary_metric="absrel",
model_name="MyDepthModel",
)
Source code in rpx_benchmark/runner.py
run(progress: Optional[ProgressCallback] = None) -> BenchmarkResult
¶
Run benchmark and return flat per-sample + aggregated metrics.
Source code in rpx_benchmark/runner.py
run_with_deployment_readiness(primary_metric: str, model_name: str = 'model', efficiency: EfficiencyMetadata | None = None, compute_ts: bool = True, compute_sgc_flag: bool = True, progress: Optional[ProgressCallback] = None) -> tuple[BenchmarkResult, DeploymentReadinessReport]
¶
Run benchmark and compute all deployment-readiness metrics.
Args:
    primary_metric: metric key used for ESD/STR scoring (e.g. "absrel", "miou").
    model_name: display name for the report.
    efficiency: pre-computed EfficiencyMetadata (params, FLOPs).
    compute_ts: whether to compute Temporal Stability (needs sequential frames).
    compute_sgc_flag: whether to compute SGC (needs both seg + depth predictions).

Returns:
    (BenchmarkResult, DeploymentReadinessReport)
Source code in rpx_benchmark/runner.py
DeploymentReadinessReport(task: str, model_name: str, weighted_phase_score: WeightedPhaseScore | None = None, temporal_stability: TemporalStabilityResult | None = None, state_transition: StateTransitionRobustnessResult | None = None, geometric_coherence: StackGeometricCoherenceResult | None = None, params_m: float | None = None, flops_g: float | None = None, actmem_gb_fp16: float | None = None, latency_ms_per_sample: float | None = None)
dataclass
¶
Aggregated deployment-readiness report for a model on a task.
ESDResult(easy: float | None, medium: float | None, hard: float | None, metric_key: str)
dataclass
¶
Per-difficulty metric breakdown (Effort-Stratified Difficulty).
weighted_score() -> float
¶
S_p = 0.25·Easy + 0.35·Medium + 0.40·Hard.
Source code in rpx_benchmark/deployment.py
StackGeometricCoherenceResult(sgc_score: float, precision: float, recall: float, num_samples: int)
dataclass
¶
SGC measures mask–depth boundary alignment.
SGC = F-score(boundary(mask), boundary(depth_gradient > τ))

Boundary pixels are extracted via Sobel gradient magnitude thresholding.
StateTransitionRobustnessResult(str_c_to_i: float, str_i_to_l: float, metric_clutter: float, metric_interaction: float, metric_clean: float)
dataclass
¶
STR captures performance change across phase boundaries.
STR_{C→I} = M(interaction) − M(clutter) ← interaction drop (negative = worse)
STR_{I→L} = M(clean) − M(interaction) ← recovery (positive = better)
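A worked example of the two deltas, with made-up scores for a higher-is-better metric (illustrative values, not benchmark results):

```python
# Illustration of the STR deltas: performance change across phase boundaries.
def str_deltas(m_clutter: float, m_interaction: float, m_clean: float):
    str_c_to_i = m_interaction - m_clutter  # interaction drop (negative = worse)
    str_i_to_l = m_clean - m_interaction    # recovery (positive = better)
    return str_c_to_i, str_i_to_l

drop, recovery = str_deltas(0.70, 0.55, 0.72)
# drop is negative: performance falls while the operator manipulates objects.
# recovery is positive: performance recovers once the scene is cleaned up.
```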
TemporalStabilityResult(ts_score: float, num_pairs: int, per_pair: List[float] = list())
dataclass
¶
TS score per task type.
TS_seg = E[IoU(P_t, warp(P_{t+1}, ΔT))]                (segmentation)
TS_depth = E[||D_t − warp(D_{t+1}, ΔT)||_1 / N_valid]  (depth)
Warping uses the relative SE(3) pose from the T265 ground-truth track. When exact warping is not feasible, a consistency proxy (unchanged-pixel fraction) is used as a lower bound.
WeightedPhaseScore(clutter: ESDResult, interaction: ESDResult, clean: ESDResult)
dataclass
¶
Full deployment-readiness scoring table.
Per phase: S_p = 0.25·M(p,Easy) + 0.35·M(p,Medium) + 0.40·M(p,Hard)
Overall: S_overall = (S_C + S_I + S_L) / 3
Delta int: Δ_int = S_I − S_C
Delta rec: Δ_rec = S_L − S_I
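A worked example of the scoring table with made-up per-difficulty scores for a higher-is-better metric:

```python
# Illustration of the weighted phase scoring above (numbers are invented).
def phase_score(easy: float, medium: float, hard: float) -> float:
    # S_p = 0.25·Easy + 0.35·Medium + 0.40·Hard
    return 0.25 * easy + 0.35 * medium + 0.40 * hard

s_c = phase_score(0.80, 0.70, 0.50)  # clutter
s_i = phase_score(0.75, 0.60, 0.40)  # interaction
s_l = phase_score(0.85, 0.75, 0.55)  # clean

s_overall = (s_c + s_i + s_l) / 3
delta_int = s_i - s_c  # negative: interaction hurts
delta_rec = s_l - s_i  # positive: clean phase recovers
```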
EfficiencyMetadata(params_m: float | None = None, flops_g: float | None = None, actmem_gb_fp16: float | None = None, latency_ms_per_sample: float | None = None, model_type: str = 'local', notes: str = '')
dataclass
¶
Hardware-agnostic efficiency metadata for a model.
to_table_row() -> dict
¶
Produce result-table-ready dict (None → 'N/A (API)' for API models).
Source code in rpx_benchmark/profiler.py
AdapterError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: ModelError
Input or output adapter produced an invalid payload.
Examples:
- `InputAdapter.prepare` raised during preprocessing.
- `OutputAdapter.finalize` returned a non-`DepthPrediction` for the monocular depth task.
- HF processor's `post_process_depth_estimation` signature does not accept the kwargs the adapter wants to pass.
Source code in rpx_benchmark/exceptions.py
ConfigError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: RPXError
Raised when a user-supplied config is invalid.
Examples:
MonocularDepthRunConfigbuilt with bothmodelandhf_checkpointset.- CLI given
--device cudaon a CPU-only host with--strict-deviceenabled. - Unknown difficulty split.
Source code in rpx_benchmark/exceptions.py
DatasetError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: RPXError
Base class for dataset load / manifest / download failures.
Source code in rpx_benchmark/exceptions.py
DownloadError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: DatasetError
HuggingFace download or cache lookup failed.
Raised by :mod:rpx_benchmark.hub when snapshot_download or
hf_hub_download fails (network issue, bad repo id, missing
revision, permission denied).
Source code in rpx_benchmark/exceptions.py
ManifestError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: DatasetError
Manifest JSON is missing, malformed, or references missing files.
Raised by :class:rpx_benchmark.loader.RPXDataset when the loader
cannot resolve a sample from the manifest it was handed.
Examples:
- Manifest missing the `task` field.
- Sample lists an `rgb` path that does not exist on disk.
- Task value is not in :class:`rpx_benchmark.api.TaskType`.
Source code in rpx_benchmark/exceptions.py
MetricError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: RPXError
Raised when a metric calculator cannot compute a score.
Examples:
- Prediction dataclass is the wrong type for the task.
- Ground-truth shape does not match prediction shape.
- Unknown metric name requested from a registry.
Source code in rpx_benchmark/exceptions.py
ModelError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: RPXError
Raised by model factories or the runner when a model misbehaves.
Examples:
- Model returned a different number of predictions than samples.
- Model's `task` attribute does not match the dataset task.
- Prediction dataclass has a wrong shape.
Source code in rpx_benchmark/exceptions.py
RPXError(message: str, *, hint: Optional[str] = None, details: Optional[dict[str, Any]] = None)
¶
Bases: Exception
Base exception for every error raised by the RPX benchmark toolkit.
All library code raises a subclass of this so user code can write a
single except RPXError to catch all benchmark failures without
accidentally swallowing unrelated exceptions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| message | str | Human-readable description of what failed. Should include a hint about what the user can do next. | required |
| hint | str | Additional remediation advice rendered after the main message. | None |
| details | dict | Structured context (e.g. offending field, expected value) that higher-level code may inspect. | None |
Examples:
>>> from rpx_benchmark.exceptions import RPXError
>>> raise RPXError("something went wrong", hint="check the cache dir")
...
Source code in rpx_benchmark/exceptions.py
BenchmarkableModel(task: TaskType, input_adapter: InputAdapter, model: Any, output_adapter: OutputAdapter, invoker: ModelInvoker = default_invoker, name: str = 'benchmarkable_model', setup_hook: Optional[Callable[[], None]] = None)
dataclass
¶
Bases: BenchmarkModel
Compose an input adapter, a model, and an output adapter.
This is the canonical way to plug a model into the RPX benchmark
harness. The BenchmarkRunner only ever sees the
:class:BenchmarkModel contract (setup, predict); all the
model-family-specific logic lives in the adapters so the harness
stays task-agnostic.
Example — wrap a HuggingFace depth model::
from rpx_benchmark.adapters.depth_hf import make_hf_depth_model
bm = make_hf_depth_model(
"depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
device="cuda",
)
# Hand `bm` to the benchmark runner or the task entrypoint.
Example — wrap a plain numpy callable::
def my_depth(rgb: np.ndarray) -> np.ndarray:
return some_depth_in_metres
bm = make_numpy_depth_model(my_depth)
InputAdapter
¶
Bases: Protocol
Sample → model-ready payload.
OutputAdapter
¶
PreparedInput(payload: Any, context: Dict[str, Any] = dict())
dataclass
¶
Everything a model needs for one sample, plus context for post-processing.
payload is whatever the model's forward call accepts. If it is a
dict, the default invoker calls model(**payload); otherwise
model(payload).
context is a free-form dict the output adapter receives back. Use
it to stash things like target image size, original intrinsics, or
any preprocessing metadata the postprocessing step needs.
MonocularDepthRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_monocular_depth.
Inherits every standard field from :class:TaskRunConfig. Adds
no task-specific fields — monocular depth is the "base case" a
new user encounters.
Examples:
>>> from rpx_benchmark import MonocularDepthRunConfig
>>> cfg = MonocularDepthRunConfig(
... hf_checkpoint="depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf",
... split="hard",
... device="cpu",
... )
SegmentationRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_segmentation.
See :class:TaskRunConfig for the full field reference.
Raises:

| Type | Description |
|---|---|
| ConfigError | If zero or more than one model selector is set, or if the split is not a valid ESD difficulty. |
ObjectDetectionRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_object_detection.
See :class:TaskRunConfig for the shared field reference.
Detection currently has no registered model factories, so the
only supported selector is model= (a pre-built
:class:BenchmarkableModel, typically from
:func:rpx_benchmark.make_numpy_detection_model).
VisualGroundingRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_visual_grounding.
Grounding models take (rgb, text) and return the referred
bounding box. The referring expression is plucked from
sample.ground_truth.text by the adapter so the model never
sees the GT boxes.
RelativePoseRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_relative_pose.
Pose models receive two RGB frames (rgb_a and rgb_b
plucked from sample.metadata by the adapter) and return the
predicted rotation + translation from frame A to frame B.
KeypointMatchingRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_keypoint_matching.
Matching models receive two RGB frames (rgb_a + rgb_b)
and return corresponding points in each image's pixel grid.
SparseDepthRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_sparse_depth.
Sparse-depth models receive (rgb, coordinates) where
coordinates is the (N, 2) float array of pixel locations
where depth is queried, and return an (N,) array of depth
values in metres at those exact coordinates.
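The contract above can be illustrated with a toy "model" that simply samples a dense depth map at the queried locations. We assume `coordinates` rows are ordered (x, y); check the task's adapter for the actual convention:

```python
# Sketch of the sparse-depth contract: (N, 2) pixel locations in, (N,)
# depth values in metres out. A real model predicts these; we look them up.
import numpy as np

def sample_depth(dense_depth: np.ndarray, coordinates: np.ndarray) -> np.ndarray:
    cols = coordinates[:, 0].round().astype(int)  # x -> column (assumed)
    rows = coordinates[:, 1].round().astype(int)  # y -> row (assumed)
    return dense_depth[rows, cols]

depth = np.arange(12, dtype=np.float32).reshape(3, 4)  # fake 3x4 depth map
coords = np.array([[0.0, 0.0], [3.0, 2.0]])            # (x, y) pixel queries
vals = sample_depth(depth, coords)
```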
NovelViewSynthesisRunConfig(model: Optional[BenchmarkableModel] = None, model_name: Optional[str] = None, hf_checkpoint: Optional[str] = None, split: Difficulty | str = Difficulty.HARD, repo_id: Optional[str] = None, cache_dir: Optional[str] = None, revision: Optional[str] = None, batch_size: int = 1, device: str = 'cuda', output_dir: Optional[str] = None, model_kwargs: Dict[str, Any] = dict(), progress: Optional[ProgressCallback] = None)
dataclass
¶
Bases: TaskRunConfig
Runtime configuration for :func:run_novel_view_synthesis.
NVS models receive (rgb_source, target_pose) where the
target pose is a 4×4 SE(3) camera-to-world matrix plucked from
sample.ground_truth.camera_pose by the adapter. They return
a synthesised RGB image at the target viewpoint.
compute_esd(per_sample_metrics: List[Dict[str, float]], per_sample_difficulties: List[Difficulty | None], metric_key: str) -> ESDResult
¶
Compute per-difficulty metric averages from per-sample results.
Args:
    per_sample_metrics: list of metric dicts, one per sample.
    per_sample_difficulties: difficulty label per sample (may be None).
    metric_key: which metric key to stratify (e.g. "absrel", "miou").

Returns:
    ESDResult with easy/medium/hard averages.
Source code in rpx_benchmark/deployment.py
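The stratification can be sketched in plain Python; this illustrates the documented behaviour (per-difficulty averages, unlabelled samples ignored), not the toolkit's code:

```python
# Sketch of ESD stratification: average the chosen metric key separately
# per difficulty bucket, ignoring samples with no difficulty label.
from typing import Dict, List, Optional

def esd_breakdown(metrics: List[Dict[str, float]],
                  difficulties: List[Optional[str]],
                  metric_key: str) -> Dict[str, Optional[float]]:
    buckets: Dict[str, List[float]] = {"easy": [], "medium": [], "hard": []}
    for row, diff in zip(metrics, difficulties):
        if diff in buckets and metric_key in row:
            buckets[diff].append(row[metric_key])
    # Empty buckets yield None, mirroring ESDResult's optional fields.
    return {d: (sum(v) / len(v) if v else None) for d, v in buckets.items()}

out = esd_breakdown(
    [{"absrel": 0.1}, {"absrel": 0.3}, {"absrel": 0.5}],
    ["easy", "easy", "hard"],
    "absrel",
)
```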
compute_sgc(pred_masks: Sequence[np.ndarray], pred_depths: Sequence[np.ndarray], depth_gradient_threshold: float = 0.1, boundary_dilation: int = 2) -> StackGeometricCoherenceResult
¶
Stack-Level Geometric Coherence: boundary F-score between mask and depth edges.
SGC = F-score(boundary(mask), boundary(depth_gradient > τ))
A high SGC means segmentation boundaries are geometrically consistent with the depth discontinuities — indicating the model perceives coherent surfaces.
Args:
    pred_masks: sequence of predicted segmentation masks (H×W int).
    pred_depths: sequence of predicted depth maps (H×W float32, metres).
    depth_gradient_threshold: τ for depth gradient thresholding.
    boundary_dilation: pixel tolerance for boundary matching.
Source code in rpx_benchmark/deployment.py
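A minimal boundary F-score in the spirit of SGC can be sketched with numpy alone; here `np.gradient` stands in for the Sobel operator and the dilation tolerance is omitted, so this is an illustration rather than the toolkit's implementation:

```python
# Sketch of SGC-style scoring: extract mask boundaries and depth
# discontinuities, then score their overlap with an F-score.
import numpy as np

def boundaries(img: np.ndarray, tau: float) -> np.ndarray:
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy) > tau  # gradient magnitude thresholding

def boundary_f_score(mask: np.ndarray, depth: np.ndarray, tau: float = 0.1) -> float:
    b_mask = boundaries(mask, 0.25)   # mask edges: |grad| = 0.5 at a label step
    b_depth = boundaries(depth, tau)  # depth edges: gradient > tau metres/px
    tp = np.logical_and(b_mask, b_depth).sum()
    precision = tp / max(b_mask.sum(), 1)
    recall = tp / max(b_depth.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A step edge at the same column in both mask and depth: perfect agreement.
mask = np.zeros((8, 8), int)
mask[:, 4:] = 1
depth = np.where(mask == 1, 2.0, 1.0)
score = boundary_f_score(mask, depth)
```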
compute_str(phase_scores: Dict[Phase, float]) -> StateTransitionRobustnessResult
¶
Compute STR from per-phase aggregated scores.
Args:
    phase_scores: dict mapping Phase → scalar metric value.
Source code in rpx_benchmark/deployment.py
compute_temporal_stability_depth(pred_depths: Sequence[np.ndarray], camera_poses: Sequence[np.ndarray | None]) -> TemporalStabilityResult
¶
TS_depth = E[||D_t − warp(D_{t+1}, ΔT)||_1 / N_valid].
Normalised to [0,1] by dividing by the max depth range to give a higher-is-better stability score (TS = 1 − normalised_L1).
Source code in rpx_benchmark/deployment.py
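The normalisation can be sketched as follows; warping by the relative pose is omitted and the `max_depth_range` value is a placeholder, so treat this as an illustration of the TS = 1 − normalised_L1 rule only:

```python
# Sketch of TS_depth: mean per-pixel L1 between consecutive depth maps,
# scaled by the depth range so that 1.0 means perfectly stable.
import numpy as np

def ts_depth(depths: list, max_depth_range: float = 10.0) -> float:
    scores = []
    for d_t, d_next in zip(depths, depths[1:]):
        valid = np.isfinite(d_t) & np.isfinite(d_next)  # skip invalid pixels
        l1 = np.abs(d_t[valid] - d_next[valid]).mean()
        scores.append(1.0 - min(l1 / max_depth_range, 1.0))
    return float(np.mean(scores))

a = np.full((4, 4), 2.0)
perfect = ts_depth([a, a.copy()])  # identical frames: fully stable
noisy = ts_depth([a, a + 1.0])     # 1 m mean shift: stability drops
```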
compute_temporal_stability_seg(pred_masks: Sequence[np.ndarray], camera_poses: Sequence[np.ndarray | None]) -> TemporalStabilityResult
¶
TS_seg = E[IoU(P_t, warp(P_{t+1}, ΔT))].
When T265 pose data is available, we use the relative rotation to compensate for camera motion before computing IoU between adjacent frames. Without pixel-accurate warping (which requires depth for backprojection), we apply a simplified affine proxy using the in-plane rotation component only.
This gives a conservative lower-bound TS_seg that is still a meaningful stability signal when scenes have modest depth variation.
Args:
    pred_masks: sequence of predicted segmentation masks (H×W int).
    camera_poses: per-frame 4×4 SE(3) matrices (camera-to-world), or None.

Returns:
    TemporalStabilityResult.
Source code in rpx_benchmark/deployment.py
compute_weighted_phase_score(per_sample_metrics: List[Dict[str, float]], per_sample_phases: List[Phase | None], per_sample_difficulties: List[Difficulty | None], metric_key: str) -> WeightedPhaseScore
¶
Compute the full weighted phase scoring table.
Groups samples by (phase, difficulty) and computes: S_p = 0.25·M(p,Easy) + 0.35·M(p,Medium) + 0.40·M(p,Hard) for each phase, then overall score and transition deltas.
Source code in rpx_benchmark/deployment.py
profile_model(model: Any, input_shape: Tuple[int, ...] = (3, 480, 640), device: str = 'cpu', model_type: str = 'local', notes: str = '') -> EfficiencyMetadata
¶
Auto-profile a model and return EfficiencyMetadata.
Args:
    model: a model object (PyTorch nn.Module recommended).
    input_shape: (C, H, W) for FLOPs counting. Default 640×480 RGB.
    device: device for dummy input tensor.
    model_type: "local" or "api".
    notes: free-text notes (e.g. "ViT-L/14, FP16 inference").

Returns:
    EfficiencyMetadata with params_m and flops_g filled where possible.
Source code in rpx_benchmark/profiler.py
count_parameters(model: Any) -> float
¶
Count trainable parameters in millions.
Supports PyTorch nn.Module and any object with a parameters() method.
Returns None if the model type is not supported.
Source code in rpx_benchmark/profiler.py
download_split(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None, extra_modalities: Sequence[str] | None = None, max_workers: int = 8) -> Path
¶
Download only the files (task, split) needs, return resolved manifest path.
The resolved manifest is a JSON file whose root field points to
the local HF snapshot directory, so it can be fed directly to
:meth:RPXDataset.from_manifest.
Source code in rpx_benchmark/hub.py
fetch_manifest(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None) -> Dict[str, Any]
¶
Download and parse the task-level manifest for (task, split).
Manifests are small (hundreds of KB) and are fetched eagerly so the caller can discover which (scene, phase) dirs the split references before kicking off a bulk download.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| task | TaskType or str | | required |
| split | Difficulty or str | | required |
| repo_id | str | HuggingFace dataset repo id. Defaults to :data:`DEFAULT_REPO_ID`. | DEFAULT_REPO_ID |
| cache_dir | str or Path | | None |
| revision | str | | None |

Returns:

| Type | Description |
|---|---|
| dict | Parsed manifest JSON. |

Raises:

| Type | Description |
|---|---|
| DownloadError | If the download fails (network, auth, bad repo id) or the manifest file does not exist on the hub. |
| ManifestError | If the downloaded file is not valid JSON. |
Source code in rpx_benchmark/hub.py
load(task: TaskType | str, split: Difficulty | str, repo_id: str = DEFAULT_REPO_ID, cache_dir: str | Path | None = None, revision: str | None = None, batch_size: int = 1) -> RPXDataset
¶
Download (task, split) and return an iterable :class:RPXDataset.
Incremental re-use::
# First run: fetches rgb + depth for 'hard' scenes.
depth_ds = rpx.load("monocular_depth", "hard")
# Second run: rgb/depth already cached, only spatial_qa.json fetched.
qa_ds = rpx.load("visual_grounding", "hard")
Source code in rpx_benchmark/hub.py
mount(repo_id: str = DEFAULT_REPO_ID)
¶
Return an HfFileSystem rooted at the RPX repo for lazy browsing.
Each read goes over the network; prefer :func:load for real workloads.
Source code in rpx_benchmark/hub.py
show_banner(*, context: Optional[str] = None, subtitle: Optional[str] = None, enabled: Optional[bool] = None, file: Optional[TextIO] = None) -> None
¶
Print the RPX startup banner to a stream (default sys.stderr).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | str | Short status line appended under the links (e.g. …). | None |
| subtitle | str | Secondary line directly under the RPX tagline. Use this for the current script name or a per-operation label. | None |
| enabled | bool | Force banner on or off. When … | None |
| file | file-like | Output stream. Defaults to :data:`sys.stderr`. | None |
Source code in rpx_benchmark/banner.py
configure_logging(level: str | int = 'INFO', *, force: bool = False, use_rich: Optional[bool] = None) -> logging.Logger
¶
Install a handler on the root rpx_benchmark logger.
Safe to call multiple times: subsequent calls are no-ops unless
force=True. The CLI calls this once at the start of main;
library users can call it from their own entrypoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| level | str or int | Logging level name (…). | 'INFO' |
| force | bool | If True, re-install the handler even if one is already present. Use with care: multiple handlers cause duplicate output. | False |
| use_rich | bool | Force rich or plain handler selection. When omitted, auto-detects: rich if the … | None |

Returns:

| Type | Description |
|---|---|
| Logger | The configured root logger, for chaining. |
Examples:
>>> from rpx_benchmark.logging_utils import configure_logging
>>> log = configure_logging("DEBUG")
>>> log.info("benchmark starting")
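The call-once guard ("no-op unless force=True") can be sketched with the standard logging module. This is a minimal stand-in for the documented behaviour, not the library's code; the handler format is an assumption:

```python
import logging


def configure_logging(level="INFO", *, force=False):
    """Sketch of the documented no-op-unless-force contract."""
    root = logging.getLogger("rpx_benchmark")
    if root.handlers and not force:
        return root  # already configured: do nothing
    root.handlers.clear()  # with force=True, replace the old handler
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    root.addHandler(handler)
    root.setLevel(level)
    return root


log = configure_logging("DEBUG")
log = configure_logging("INFO")  # no-op: still one handler, level stays DEBUG
```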
Source code in rpx_benchmark/logging_utils.py
get_logger(name: str) -> logging.Logger
¶
Return a module-scoped logger nested under the rpx_benchmark root.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Usually … | required |

Returns:

| Type | Description |
|---|---|
| Logger | A logger ready to use. Call … |
Examples:
>>> from rpx_benchmark.logging_utils import get_logger
>>> log = get_logger("rpx_benchmark.hub")
>>> log.name
'rpx_benchmark.hub'
Source code in rpx_benchmark/logging_utils.py
make_numpy_depth_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_depth_model') -> BenchmarkableModel
¶
Wrap a plain numpy depth callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and return a
(H', W') float metric depth map (in metres). If (H', W') !=
(H, W), the output is bilinearly resized to match the ground truth.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fn | callable | The depth function. Signature: … | required |
| name | str | Display name used in logs and reports. | 'numpy_depth_model' |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_depth(rgb):
... return np.full(rgb.shape[:2], 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_depth_model(my_depth, name="mine")
>>> bm.task is rpx.TaskType.MONOCULAR_DEPTH
True
Source code in rpx_benchmark/adapters/base.py
make_numpy_detection_model(fn: Callable[[np.ndarray], Any], *, name: str = 'numpy_detection_model', task: TaskType = TaskType.OBJECT_DETECTION) -> BenchmarkableModel
¶
Wrap a plain numpy detection callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and
return either a dict with keys "boxes" / "scores" /
"labels" or a (boxes, scores, labels) tuple in that
order. Boxes are pixel coordinates in (x1, y1, x2, y2)
format, scores are floats in [0, 1], labels are strings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fn | callable |  | required |
| name | str | Display name for reports. | 'numpy_detection_model' |
| task | TaskType | Use :attr:`…` | OBJECT_DETECTION |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_det(rgb):
... return {
... "boxes": np.array([[10, 10, 30, 30]], dtype=np.float32),
... "scores": np.array([0.9], dtype=np.float32),
... "labels": ["cup"],
... }
>>> bm = rpx.make_numpy_detection_model(my_det)
>>> bm.task is rpx.TaskType.OBJECT_DETECTION
True
Source code in rpx_benchmark/adapters/base.py
make_numpy_grounding_model(fn: Callable[[np.ndarray, str], Any], *, name: str = 'numpy_grounding_model') -> BenchmarkableModel
¶
Wrap a visual-grounding callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, text) and returns either a
dict with keys "boxes" / "scores" or a tuple
(boxes, scores). Boxes are (x1, y1, x2, y2) pixel
coordinates; scores are floats. The referring expression
text is plucked from sample.ground_truth.text by the
adapter so the callable never sees the GT boxes.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_ground(rgb, text):
... return (
... np.array([[10, 10, 30, 30]], dtype=np.float32),
... np.array([0.8], dtype=np.float32),
... )
>>> bm = rpx.make_numpy_grounding_model(my_ground)
Source code in rpx_benchmark/adapters/base.py
make_numpy_keypoint_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_keypoint_model') -> BenchmarkableModel
¶
Wrap a keypoint-matching callable as a :class:BenchmarkableModel.
The callable takes (rgb_a, rgb_b) and returns either a dict
with keys "points0", "points1" and optional "scores"
or a 2/3-tuple in the same order. Points are (N, 2) pixel
coordinates.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_matcher(rgb_a, rgb_b):
... pts = np.array([[10, 10], [20, 20]], dtype=np.float32)
... return pts, pts
>>> bm = rpx.make_numpy_keypoint_model(my_matcher)
Source code in rpx_benchmark/adapters/base.py
make_numpy_mask_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_mask_model') -> BenchmarkableModel
¶
Wrap a plain numpy instance-mask callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and return
a (H', W') int instance mask where pixel values are instance
IDs (0 is conventionally background). If (H', W') != (H, W)
the output is nearest-neighbour resized to match the GT mask so
integer IDs are preserved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fn | callable |  | required |
| name | str |  | 'numpy_mask_model' |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_seg(rgb):
... mask = np.zeros(rgb.shape[:2], dtype=np.int32)
... mask[rgb.sum(-1) > 384] = 1 # trivial brightness threshold
... return mask
>>> bm = rpx.make_numpy_mask_model(my_seg, name="mine")
>>> bm.task is rpx.TaskType.OBJECT_SEGMENTATION
True
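The ID-preserving nearest-neighbour resize described above can be sketched in plain numpy. This is a sketch of the behaviour, not the adapter's actual code:

```python
import numpy as np


def resize_mask_nearest(mask: np.ndarray, out_hw: tuple) -> np.ndarray:
    """Nearest-neighbour resize for integer instance masks.

    Bilinear resizing would blend instance IDs into meaningless
    in-between values; nearest-neighbour keeps every output pixel
    an exact ID from the input.
    """
    h, w = mask.shape
    H, W = out_hw
    # Map each output pixel to the nearest source pixel.
    rows = (np.arange(H) * h / H).astype(np.int64)
    cols = (np.arange(W) * w / W).astype(np.int64)
    return mask[rows[:, None], cols[None, :]]


small = np.array([[0, 1], [2, 3]], dtype=np.int32)
big = resize_mask_nearest(small, (4, 4))  # each ID becomes a 2x2 block
```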
Source code in rpx_benchmark/adapters/base.py
make_numpy_nvs_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_nvs_model') -> BenchmarkableModel
¶
Wrap a novel-view-synthesis callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, target_pose) where the target
pose is a 4×4 SE(3) camera-to-world matrix (float64). It
returns an RGB image for the target viewpoint. Non-uint8 output
is clipped and cast; shape mismatches are bilinearly resized.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_nvs(rgb, target_pose):
... return rgb # identity baseline
>>> bm = rpx.make_numpy_nvs_model(my_nvs)
Source code in rpx_benchmark/adapters/base.py
make_numpy_pose_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_pose_model') -> BenchmarkableModel
¶
Wrap a relative-camera-pose callable as a :class:BenchmarkableModel.
The callable takes (rgb_a, rgb_b) and returns either a dict
with keys "rotation" (3×3 rotation matrix or 4-element
quaternion) and "translation" (3-vector, metres) or a
(rotation, translation) tuple.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_pose(rgb_a, rgb_b):
... return {"rotation": np.eye(3), "translation": np.zeros(3)}
>>> bm = rpx.make_numpy_pose_model(my_pose)
Source code in rpx_benchmark/adapters/base.py
make_numpy_sparse_depth_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_sparse_depth_model') -> BenchmarkableModel
¶
Wrap a sparse-depth callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, coords) where coords is a
(N, 2) float32 array of pixel coordinates and returns an
(N,) float32 array of depths in metres at those exact
coordinates.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_sd(rgb, coords):
... return np.full(len(coords), 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_sparse_depth_model(my_sd)
Source code in rpx_benchmark/adapters/base.py
make_hf_depth_model(checkpoint: str, *, device: str = 'cuda', dtype: str | None = None, name: str | None = None) -> BenchmarkableModel
¶
One-line factory for any HuggingFace depth-estimation checkpoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| checkpoint | str | HuggingFace Hub path, e.g. … | required |
| device | str | Device string passed to … | 'cuda' |
| dtype | str | One of … | None |
| name | str | Display name; defaults to the checkpoint id. | None |
Source code in rpx_benchmark/adapters/depth_hf.py
make_hf_instance_seg_model(checkpoint: str, *, device: str = 'cuda', threshold: float = 0.5, name: Optional[str] = None, model_class_hint: Optional[str] = None) -> BenchmarkableModel
¶
One-line factory for a HuggingFace segmentation checkpoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| checkpoint | str | HuggingFace Hub id (e.g. …). | required |
| device | str |  | 'cuda' |
| threshold | float | Score threshold for instance acceptance (passed to the processor's post-process if it accepts the kwarg). | 0.5 |
| name | str | Display name. Defaults to … | None |
| model_class_hint | str | One of … | None |

Raises:

| Type | Description |
|---|---|
| AdapterError | If the processor exposes no post-process method we can use. |
| ImportError | If … |
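The "passed … if it accepts the kwarg" behaviour for threshold can be sketched with inspect.signature. The helper and the two dummy post-process functions below are hypothetical stand-ins, not the adapter's real names:

```python
import inspect


def call_with_optional_kwargs(fn, *args, **optional):
    """Pass each optional kwarg only if fn's signature accepts it."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {k: v for k, v in optional.items() if accepts_var_kw or k in params}
    return fn(*args, **kwargs)


def post_process_a(outputs, threshold=0.5):  # accepts the kwarg
    return ("a", threshold)


def post_process_b(outputs):  # does not accept it
    return ("b",)


r1 = call_with_optional_kwargs(post_process_a, None, threshold=0.7)
r2 = call_with_optional_kwargs(post_process_b, None, threshold=0.7)
```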
Source code in rpx_benchmark/adapters/seg_hf.py
available_models(include_deferred: bool = False) -> List[str]
¶
Return sorted registered model names.
By default excludes deferred stubs so the CLI's --model choice
list is runnable-only. Pass include_deferred=True to list the
full intended slate.
Source code in rpx_benchmark/models/registry.py
get_factory(name: str) -> Callable[..., BenchmarkableModel]
¶
Return the factory function registered under name (lazy import).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Registered model name. Use :func:`available_models` to list valid names. | required |

Returns:

| Type | Description |
|---|---|
| Callable | The factory function. Instantiate the model by calling it with the appropriate device / kwargs. |

Raises:

| Type | Description |
|---|---|
| ConfigError | If … |
Source code in rpx_benchmark/models/registry.py
resolve(name: str, *, device: str = 'cuda', **kwargs) -> BenchmarkableModel
¶
Look up name and call the factory with device + extra kwargs.
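The registry contract behind get_factory and resolve can be sketched as a name-keyed factory table. Everything here is a stand-in for illustration (the real registry does lazy imports; the ConfigError base class and `register` decorator are assumptions):

```python
from typing import Callable, Dict

_REGISTRY: Dict[str, Callable] = {}


class ConfigError(KeyError):
    """Stand-in for the toolkit's unknown-name error."""


def register(name: str):
    def deco(factory: Callable) -> Callable:
        _REGISTRY[name] = factory
        return factory
    return deco


def get_factory(name: str) -> Callable:
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ConfigError(
            f"unknown model {name!r}; known: {sorted(_REGISTRY)}"
        ) from None


def resolve(name: str, *, device: str = "cuda", **kwargs):
    # Look up the factory, then instantiate with device + extra kwargs.
    return get_factory(name)(device=device, **kwargs)


@register("toy")
def make_toy(device="cuda", scale=1.0):
    return {"device": device, "scale": scale}


model = resolve("toy", device="cpu", scale=2.0)
```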
run_monocular_depth(cfg: MonocularDepthRunConfig) -> PipelineResult
¶
Run the monocular absolute depth benchmark end-to-end.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cfg | MonocularDepthRunConfig |  | required |

Returns:

| Type | Description |
|---|---|
| (BenchmarkResult, DeploymentReadinessReport, dict) |  |

Raises:

| Type | Description |
|---|---|
| (ConfigError, DownloadError, ManifestError, ModelError, MetricError) | Propagated from the respective subsystem with actionable hints. |
Examples:
>>> import rpx_benchmark as rpx
>>> import numpy as np
>>> def my_depth(rgb): return np.ones(rgb.shape[:2], dtype=np.float32) * 2.0
>>> bm = rpx.make_numpy_depth_model(my_depth, name="unit")
>>> cfg = rpx.MonocularDepthRunConfig(model=bm, split="hard", device="cpu")
>>> result, report, paths = rpx.run_monocular_depth(cfg)
Source code in rpx_benchmark/tasks/monocular_depth.py
run_segmentation(cfg: SegmentationRunConfig) -> PipelineResult
¶
Run the object-segmentation benchmark end-to-end.
Returns the same (BenchmarkResult, DeploymentReadinessReport,
{json, markdown}) tuple shape as :func:run_monocular_depth.
The task sets primary_metric="miou" with higher_is_better=True, so the
deployment-readiness report interprets metric deltas accordingly.
Source code in rpx_benchmark/tasks/segmentation.py
run_object_detection(cfg: ObjectDetectionRunConfig) -> PipelineResult
¶
Run the object-detection benchmark end-to-end.
The returned BenchmarkResult.aggregated has precision,
recall, and f1 keys produced by
:class:rpx_benchmark.metrics.detection.DetectionMetrics. The
deployment report uses f1 as the primary metric and treats
higher as better.
Source code in rpx_benchmark/tasks/detection.py
run_open_vocab_detection(cfg: ObjectDetectionRunConfig) -> PipelineResult
¶
Run the open-vocabulary detection benchmark end-to-end.
Uses the same metric suite as :func:run_object_detection but
evaluates on :attr:TaskType.OPEN_VOCAB_DETECTION manifests.
Source code in rpx_benchmark/tasks/detection.py
run_visual_grounding(cfg: VisualGroundingRunConfig) -> PipelineResult
¶
Run the visual-grounding benchmark end-to-end.
Source code in rpx_benchmark/tasks/visual_grounding.py
run_relative_pose(cfg: RelativePoseRunConfig) -> PipelineResult
¶
Run the relative-camera-pose benchmark end-to-end.
Source code in rpx_benchmark/tasks/relative_pose.py
run_keypoint_matching(cfg: KeypointMatchingRunConfig) -> PipelineResult
¶
Run the keypoint-matching benchmark end-to-end.
Source code in rpx_benchmark/tasks/keypoint_matching.py
run_sparse_depth(cfg: SparseDepthRunConfig) -> PipelineResult
¶
Run the sparse-depth benchmark end-to-end.
Source code in rpx_benchmark/tasks/sparse_depth.py
run_novel_view_synthesis(cfg: NovelViewSynthesisRunConfig) -> PipelineResult
¶
Run the novel-view-synthesis benchmark end-to-end.
Source code in rpx_benchmark/tasks/novel_view_synthesis.py
format_markdown_summary(*, task: str, model_name: str, split: str, repo_id: str, result: BenchmarkResult, dr_report: DeploymentReadinessReport | None = None) -> str
¶
Render a benchmark result as a human-readable markdown report.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| task | str |  | required |
| model_name | str |  | required |
| split | str |  | required |
| repo_id | str |  | required |
| result | BenchmarkResult |  | required |
| dr_report | DeploymentReadinessReport | When provided, the output includes the Weighted Phase Score table, State-Transition Robustness deltas, Temporal Stability score, and an Efficiency table (params, FLOPs, latency). | None |

Returns:

| Type | Description |
|---|---|
| str | A Markdown-formatted report. Matches the terminal UI tables the CLI prints so on-disk reports and terminal output stay in sync. |
Examples:
>>> from rpx_benchmark.reports import format_markdown_summary
>>> md = format_markdown_summary(
... task="monocular_depth", model_name="depth_pro",
... split="hard", repo_id="IRVLUTD/rpx-benchmark",
... result=result, dr_report=report,
... )
Source code in rpx_benchmark/reports.py
write_json(path: str | Path, *, task: str, model_name: str, split: str, repo_id: str, result: BenchmarkResult, dr_report: DeploymentReadinessReport | None = None, extra: Dict[str, Any] | None = None) -> Path
¶
Serialise a benchmark result + deployment report to JSON.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str or Path | Output file path. Parent directories are created if missing. | required |
| task | str | Task name string (e.g. …). | required |
| model_name | str | Display name of the model under test. | required |
| split | str | ESD difficulty split ("easy", "medium", or "hard"). | required |
| repo_id | str | HuggingFace dataset repo id the samples came from. | required |
| result | BenchmarkResult | Per-sample + aggregated metric container returned by :class:`…`. | required |
| dr_report | DeploymentReadinessReport | Weighted Phase Score, STR, TS, efficiency metadata. Omitted when … | None |
| extra | dict | Arbitrary free-form extra fields to embed under the … | None |

Returns:

| Type | Description |
|---|---|
| Path | The resolved output path, for chaining. |
Notes
Dataclasses are converted via :func:dataclasses.asdict, enums
are lowered to their .value, and unknown objects pass through
unchanged. The output is pretty-printed with indent=2 for
diff-friendliness.
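The serialisation policy in the Notes (dataclasses via asdict, enums lowered to .value, indent=2) can be sketched as follows. The `Summary` dataclass and `to_jsonable` helper are hypothetical illustrations, not the library's code; the Difficulty member mirrors the enum documented above:

```python
import dataclasses
import json
from dataclasses import dataclass
from enum import Enum


class Difficulty(str, Enum):
    HARD = "hard"


@dataclass
class Summary:
    model_name: str
    split: Difficulty
    rmse: float


def to_jsonable(obj):
    # Dataclasses are converted via dataclasses.asdict, enums are
    # lowered to their .value, everything else passes through unchanged.
    if dataclasses.is_dataclass(obj):
        return {k: to_jsonable(v) for k, v in dataclasses.asdict(obj).items()}
    if isinstance(obj, Enum):
        return obj.value
    return obj


summary = Summary("depth_pro", Difficulty.HARD, 0.42)
payload = json.dumps(to_jsonable(summary), indent=2)  # pretty-printed JSON
```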