Metrics (rpx_benchmark.metrics)¶
Pluggable per-task metric calculators and the registry that binds them to task types.
Registry core¶
registry
¶
Core of the metric plugin system.
Contains the :class:`MetricCalculator` base class, the task→calculators
registry, the `@register_metric` decorator, and a thin
:class:`MetricSuite` facade that the runner uses.
Most library users should not touch this module directly; instead
import from :mod:`rpx_benchmark.metrics`.
MetricCalculator
¶
Bases: ABC
Abstract base for a family of metrics bound to a single task.
Subclasses are lightweight, stateless objects. The `compute`
method accepts one prediction and its ground truth and returns a
dict mapping metric name → scalar float.
Subclasses must set `name` (the human-readable identifier used for
registration) and implement :meth:`compute`. Subclasses that need
hyperparameters should take them via `__init__`.
Examples:
>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> from rpx_benchmark.api import TaskType
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MeanPred(MetricCalculator):
... name = "mean_pred"
... def compute(self, prediction, ground_truth):
... return {"mean_pred": float(prediction.depth_map.mean())}
compute(prediction: Any, ground_truth: Any) -> Dict[str, float]
abstractmethod
¶
Return a dict of metric name → scalar for one sample.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prediction` | `Any` | Task-specific `Prediction` dataclass for the task. | required |
| `ground_truth` | `Any` | Task-specific `GroundTruth` dataclass for the task. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Metric values. Keys should be short snake_case names; values must be numeric so the runner can average them. |

Raises:

| Type | Description |
|---|---|
| `MetricError` | When inputs have the wrong shape, type, or content. |
Source code in rpx_benchmark/metrics/registry.py
BenchmarkResult(task: TaskType, per_sample: List[Dict[str, Any]], aggregated: Dict[str, float], num_samples: int)
dataclass
¶
Outcome of running a :class:`BenchmarkRunner` against a dataset.
Attributes:

| Name | Type | Description |
|---|---|---|
| `task` | `TaskType` | Which task was evaluated. |
| `per_sample` | `list of dict` | One dict per sample. Each dict mixes metric keys (numeric) and metadata keys. |
| `aggregated` | `dict[str, float]` | Mean over the numeric metric keys in :attr:`per_sample`. |
| `num_samples` | `int` | |
MetricSuite(task: TaskType)
¶
Thin wrapper around the metric registry used by the runner.
Kept as a class rather than a function because the historical API
expects `MetricSuite.for_task(...).evaluate(pred, gt)`. New code
can call :func:`compute_metrics` directly.
Source code in rpx_benchmark/metrics/registry.py
for_task(task: TaskType) -> 'MetricSuite'
classmethod
¶
Create a suite for the given task.
Raises:

| Type | Description |
|---|---|
| `MetricError` | If no calculators are registered for `task`. |
Source code in rpx_benchmark/metrics/registry.py
evaluate(prediction: Any, ground_truth: Any) -> Dict[str, float]
¶
Run every registered calculator and return merged results.
Raises:

| Type | Description |
|---|---|
| `MetricError` | Propagated from individual calculators when inputs are mis-shaped or of the wrong type. |
Source code in rpx_benchmark/metrics/registry.py
aggregate(per_sample: List[Dict[str, Any]]) -> Dict[str, float]
¶
Mean over numeric metric keys; non-numeric metadata is skipped.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `per_sample` | `list of dict` | Per-sample rows. May contain metric floats and metadata strings/enums in the same dict. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | One float per numeric key. Empty dict if `per_sample` is empty. |
Source code in rpx_benchmark/metrics/registry.py
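The aggregation rule ("mean over numeric keys, skip metadata") can be sketched in a few lines. This is an illustrative re-implementation, not the library's source; the standalone helper name `aggregate` and the sample rows are made up for the example.

```python
from numbers import Number

def aggregate(per_sample):
    """Mean over numeric metric keys; non-numeric metadata is skipped."""
    if not per_sample:
        return {}
    sums, counts = {}, {}
    for row in per_sample:
        for key, value in row.items():
            # bool is a Number subclass in Python; treat it as metadata,
            # not a metric, so flags don't silently average as 0/1.
            if isinstance(value, Number) and not isinstance(value, bool):
                sums[key] = sums.get(key, 0.0) + float(value)
                counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

rows = [
    {"absrel": 0.25, "rmse": 0.5, "scene": "kitchen"},
    {"absrel": 0.75, "rmse": 1.5, "scene": "hallway"},
]
print(aggregate(rows))  # {'absrel': 0.5, 'rmse': 1.0}
```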
build_result(per_sample: List[Dict[str, Any]]) -> 'BenchmarkResult'
¶
Convenience: wrap `per_sample` and its aggregate in a
:class:`BenchmarkResult`.
Source code in rpx_benchmark/metrics/registry.py
register_metric(task: TaskType) -> Callable[[Type[MetricCalculator]], Type[MetricCalculator]]
¶
Class decorator registering a calculator under a task.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `TaskType` | Task the calculator belongs to. Multiple calculators can be registered against the same task and are composed at evaluation time. | required |

Returns:

| Type | Description |
|---|---|
| `Callable` | The decorator; when applied to a class it instantiates it once with no arguments and appends the instance to the registry. |

Raises:

| Type | Description |
|---|---|
| `MetricError` | |
Examples:
>>> from rpx_benchmark.api import TaskType
>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MyMetric(MetricCalculator):
... name = "my_metric"
... def compute(self, prediction, ground_truth):
... return {"my_metric": 0.0}
Source code in rpx_benchmark/metrics/registry.py
unregister_metric(task: TaskType, name: str) -> bool
¶
Remove a previously registered calculator by its name field.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `TaskType` | | required |
| `name` | `str` | `name` attribute of the calculator to remove. | required |

Returns:

| Type | Description |
|---|---|
| `bool` | `True` if a calculator was removed; `False` if no match was found. |
Source code in rpx_benchmark/metrics/registry.py
clear_registry() -> None
¶
Remove every registered calculator for every task.
get_calculators(task: TaskType) -> List[MetricCalculator]
¶
Return the calculators registered for `task` (empty list if none).
available_metrics() -> Dict[TaskType, List[str]]
¶
List the registered metric names grouped by task.
Returns:

| Type | Description |
|---|---|
| `dict` | Registered metric names keyed by `TaskType`. |
Source code in rpx_benchmark/metrics/registry.py
compute_metrics(task: TaskType, prediction: Any, ground_truth: Any) -> Dict[str, float]
¶
Run every registered calculator for task and merge outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `TaskType` | | required |
| `prediction` | `Any` | | required |
| `ground_truth` | `Any` | | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Union of all calculators' output dicts. Later calculators may overwrite earlier ones if they emit the same key. |

Raises:

| Type | Description |
|---|---|
| `MetricError` | If no calculators are registered for `task`. |
Source code in rpx_benchmark/metrics/registry.py
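To make the registration and merge semantics concrete, here is a miniature, self-contained version of the pattern this module implements. Everything specific here is illustrative (plain-string task keys, `ValueError`, the demo calculators); the real module keys on `TaskType` and raises `MetricError`.

```python
from abc import ABC, abstractmethod
from collections import defaultdict

_REGISTRY = defaultdict(list)  # task -> list of calculator instances

class MetricCalculator(ABC):
    name = ""
    @abstractmethod
    def compute(self, prediction, ground_truth):
        ...

def register_metric(task):
    def decorator(cls):
        _REGISTRY[task].append(cls())  # instantiated once, with no arguments
        return cls
    return decorator

def compute_metrics(task, prediction, ground_truth):
    calculators = _REGISTRY[task]
    if not calculators:
        raise ValueError(f"no calculators registered for {task!r}")
    merged = {}
    for calc in calculators:
        # later calculators overwrite earlier ones on key clashes
        merged.update(calc.compute(prediction, ground_truth))
    return merged

@register_metric("demo_task")
class Bias(MetricCalculator):
    name = "bias"
    def compute(self, prediction, ground_truth):
        return {"bias": prediction - ground_truth}

@register_metric("demo_task")
class AbsErr(MetricCalculator):
    name = "abs_err"
    def compute(self, prediction, ground_truth):
        return {"abs_err": abs(prediction - ground_truth)}

print(compute_metrics("demo_task", 2.0, 3.5))  # {'bias': -1.5, 'abs_err': 1.5}
```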
Built-in calculators¶
Monocular depth¶
depth
¶
Monocular absolute depth metric calculators.
Registers :class:`DepthErrorMetrics`, which computes AbsRel, RMSE, and
three thresholded accuracy ratios (δ<1.25, δ<1.25², δ<1.25³) against
the subset of pixels where the ground-truth depth map is valid
(`gt > 0`). This matches the paper's eval protocol: we mask out
D435-invalid pixels so models are not penalised where the sensor
itself produced no measurement.
DepthErrorMetrics
¶
Bases: MetricCalculator
AbsRel / RMSE / δ-accuracy for monocular metric depth.
Notes
All metrics are computed in metres. Predictions are expected to be metric depth maps; no scale alignment of any kind is applied. We deliberately measure the model's absolute metric accuracy, which is the property that matters for robot policies that consume depth directly.
For empty validity masks (fully invalid GT frames) we return the
safe no-op sentinel `{"absrel": 0, "rmse": 0, "delta1": 1, ...}`
so a single bad frame does not break aggregate means. In practice
the dataset pipeline should filter these out upstream.
Metric definitions

.. math::

   \text{AbsRel} &= \frac{1}{N} \sum_i \frac{|\hat d_i - d_i|}{d_i} \\
   \text{RMSE} &= \sqrt{\frac{1}{N}\sum_i(\hat d_i - d_i)^2} \\
   \delta_k &= \frac{1}{N}\,\#\Big\{i : \max\!\big(\tfrac{\hat d_i}{d_i},
   \tfrac{d_i}{\hat d_i}\big) < 1.25^{\,k}\Big\}
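The masking, formulas, and sentinel behaviour above can be reproduced in a short standalone function. This is a sketch that follows the stated definitions, not the library's `DepthErrorMetrics` source; the sentinel keys are taken from the note above.

```python
import numpy as np

def depth_error_metrics(pred, gt):
    """AbsRel / RMSE / delta accuracies over valid pixels (gt > 0)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = gt > 0  # sensor-invalid pixels carry gt == 0 and are skipped
    if not mask.any():
        # safe no-op sentinel for fully invalid frames
        return {"absrel": 0.0, "rmse": 0.0,
                "delta1": 1.0, "delta2": 1.0, "delta3": 1.0}
    p, g = pred[mask], gt[mask]
    out = {
        "absrel": float(np.mean(np.abs(p - g) / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
    }
    ratio = np.maximum(p / g, g / p)
    for k in (1, 2, 3):
        out[f"delta{k}"] = float(np.mean(ratio < 1.25 ** k))
    return out
```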
Object detection¶
detection
¶
Object detection + open-vocab detection metric calculators.
DetectionMetrics(iou_threshold: float = 0.5)
¶
Bases: MetricCalculator
Precision / recall / F1 at a single IoU threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `iou_threshold` | `float` | Minimum IoU (0.0–1.0) for a prediction to count as a true positive. Defaults to 0.5, the COCO baseline. | `0.5` |
Notes
Greedy matching by descending prediction score. A GT box is consumed after the first predicted box matches it, so subsequent predictions for the same object register as false positives (standard VOC/COCO evaluation rule).
Source code in rpx_benchmark/metrics/detection.py
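The greedy matching rule can be sketched as below. This is an illustrative standalone version, not the library's source; boxes are assumed to be `(x1, y1, x2, y2)` tuples, and I treat a tie at exactly the threshold as a hit, which the library's convention may or may not share.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_prf(pred_boxes, scores, gt_boxes, iou_threshold=0.5):
    """Greedy matching by descending score; each GT box is consumed once."""
    order = sorted(range(len(pred_boxes)), key=lambda i: -scores[i])
    matched, tp = set(), 0
    for i in order:
        best_j, best_iou = None, iou_threshold
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue  # already consumed: duplicates become false positives
            v = iou(pred_boxes[i], gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp, fn = len(pred_boxes) - tp, len(gt_boxes) - tp
    precision = tp / (tp + fp) if pred_boxes else 0.0
    recall = tp / (tp + fn) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```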
Object segmentation¶
segmentation
¶
Instance/semantic segmentation metric calculators.
SegmentationMIoU
¶
Bases: MetricCalculator
Mean Intersection-over-Union (mIoU) across GT classes.
Notes
Averaged over the set of class ids present in the ground-truth
mask (background class id -1 is ignored). Classes in the
prediction that do not appear in GT do not contribute — they only
reduce the IoU of other classes they overlap with.
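A minimal standalone version of this averaging rule, assuming integer-id masks; the function name and mask layout are illustrative, not the library's API.

```python
import numpy as np

def segmentation_miou(pred_mask, gt_mask, ignore_id=-1):
    """Mean IoU over the class ids present in the GT mask."""
    pred_mask = np.asarray(pred_mask)
    gt_mask = np.asarray(gt_mask)
    ious = []
    for cls in np.unique(gt_mask):
        if cls == ignore_id:
            continue  # background id is ignored
        p, g = pred_mask == cls, gt_mask == cls
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 0.0)
    return {"miou": float(np.mean(ious)) if ious else 0.0}
```

Note that a class predicted but absent from GT never adds its own IoU term; it only inflates the union (and thus lowers the IoU) of whatever GT classes it overlaps, exactly as the note above describes.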
Relative camera pose¶
pose
¶
Relative camera pose metric calculators.
RelativePoseError
¶
Bases: MetricCalculator
Rotation geodesic (degrees) + translation L2 (metres).
Notes
Rotations are compared in SO(3) via the geodesic distance
`arccos((trace(R_pred^T R_gt) - 1) / 2)`. Quaternion inputs
(4-vectors) are accepted and converted to rotation matrices.
Translations are compared in metres with a straight L2 norm.
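The geodesic formula and the quaternion fallback can be sketched as follows. This is a standalone sketch, not the library's source; in particular the `(w, x, y, z)` quaternion ordering and the output key names are assumptions for the example.

```python
import numpy as np

def quat_to_matrix(q):
    """Unit quaternion, assumed (w, x, y, z), -> 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def relative_pose_error(R_pred, t_pred, R_gt, t_gt):
    R_pred, R_gt = np.asarray(R_pred, dtype=np.float64), np.asarray(R_gt, dtype=np.float64)
    if R_pred.shape == (4,):  # accept quaternion inputs
        R_pred = quat_to_matrix(R_pred)
    if R_gt.shape == (4,):
        R_gt = quat_to_matrix(R_gt)
    # clip guards against arccos domain errors from floating-point noise
    cos = np.clip((np.trace(R_pred.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return {
        "rot_err_deg": float(np.degrees(np.arccos(cos))),
        "trans_err_m": float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))),
    }
```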
Visual grounding¶
grounding
¶
Visual grounding metric calculators.
GroundingIoU(iou_threshold: float = 0.5)
¶
Bases: MetricCalculator
Top-1 IoU and accuracy at a threshold for a single referred object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `iou_threshold` | `float` | IoU threshold for the accuracy metric. | `0.5` |
Source code in rpx_benchmark/metrics/grounding.py
Sparse depth¶
sparse_depth
¶
Sparse depth metric calculators (RGB + sparse GT depth points).
SparseDepthError(radius: float = 2.0)
¶
Bases: MetricCalculator
AbsRel + RMSE computed only at the provided sparse GT locations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `radius` | `float` | Maximum pixel distance between a prediction point and the matched GT point. Predictions outside this radius contribute a full-magnitude error, which penalises models that fail to localise the sparse points. | `2.0` |
Source code in rpx_benchmark/metrics/sparse_depth.py
Novel view synthesis¶
nvs
¶
Novel view synthesis metric calculators.
NVSQuality
¶
Bases: MetricCalculator
PSNR and simplified global SSIM for an RGB novel-view prediction.
Notes
The SSIM implementation is a deliberately simple global-statistics
variant (no sliding window) so it runs without OpenCV / skimage.
For publication we recommend re-computing SSIM offline with a full
sliding-window implementation against the per-sample arrays
captured in result.per_sample.
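A standalone sketch of the two quantities described above. The global-statistics SSIM here follows the description (single mean/variance/covariance over the whole image, no sliding window); the standard SSIM stabilisers `k1=0.01`, `k2=0.03` and the `data_range` default are assumptions for the example, not confirmed library values.

```python
import numpy as np

def nvs_quality(pred, gt, data_range=255.0):
    """PSNR plus a global-statistics SSIM (no sliding window)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mse = np.mean((pred - gt) ** 2)
    psnr = float("inf") if mse == 0 else float(10.0 * np.log10(data_range ** 2 / mse))
    # single global mean/variance/covariance instead of an 11x11 window
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = np.mean((pred - mu_p) * (gt - mu_g))
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)
    )
    return {"psnr": psnr, "ssim": float(ssim)}
```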
Keypoint matching¶
keypoints
¶
Keypoint correspondence metric calculators.
KeypointAccuracy(px_threshold: float = 3.0)
¶
Bases: MetricCalculator
% of predicted matches within a pixel threshold + mean error.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `px_threshold` | `float` | Pixel radius in frame B at which a predicted correspondence is counted as correct. Default 3 px matches the standard ScanNet/InteriorNet evaluation protocol. | `3.0` |
Source code in rpx_benchmark/metrics/keypoints.py
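The two quantities can be sketched in a few lines, assuming `pred_pts[i]` corresponds to `gt_pts[i]` as `(N, 2)` pixel coordinates in frame B; whether a distance exactly at the threshold counts as correct (`<=` here) is an assumption, not a confirmed library convention.

```python
import numpy as np

def keypoint_accuracy(pred_pts, gt_pts, px_threshold=3.0):
    """Fraction of matches within px_threshold plus mean pixel error."""
    pred_pts = np.asarray(pred_pts, dtype=np.float64)
    gt_pts = np.asarray(gt_pts, dtype=np.float64)
    if len(pred_pts) == 0:
        return {"accuracy": 0.0, "mean_px_error": 0.0}
    dists = np.linalg.norm(pred_pts - gt_pts, axis=1)  # per-match L2 in pixels
    return {
        "accuracy": float(np.mean(dists <= px_threshold)),
        "mean_px_error": float(dists.mean()),
    }
```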
Object tracking¶
tracking
¶
Multi-object tracking metric calculators.
TrackletMetrics(iou_threshold: float = 0.5)
¶
Bases: MetricCalculator
Simplified MOTA + IDF1 over a single tracklet sample.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `iou_threshold` | `float` | IoU required to count a predicted detection as a hit for a given GT box. Default 0.5. | `0.5` |
Notes
This is the per-sample version called by the runner. A full MOTA
computation (with cross-scene identity switches) lives in
:mod:`rpx_benchmark.deployment` and runs on the whole dataset.