
Metrics (rpx_benchmark.metrics)

Pluggable per-task metric calculators and the registry that binds them to task types.

Registry core

registry

Core of the metric plugin system.

Contains the :class:`MetricCalculator` base class, the task→calculators registry, the ``@register_metric`` decorator, and a thin :class:`MetricSuite` facade that the runner uses.

Most library users should not touch this module directly — instead import from :mod:`rpx_benchmark.metrics`.

MetricCalculator

Bases: ABC

Abstract base for a family of metrics bound to a single task.

Subclasses are lightweight, stateless objects. The compute method accepts one prediction and its ground truth and returns a dict mapping metric name → scalar float.

Subclasses must set ``name`` (human-readable identifier used for registration) and implement :meth:`compute`. Subclasses should take their own hyperparameters via ``__init__`` when needed.

Examples:

>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> from rpx_benchmark.api import TaskType
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MeanPred(MetricCalculator):
...     name = "mean_pred"
...     def compute(self, prediction, ground_truth):
...         return {"mean_pred": float(prediction.depth_map.mean())}

compute(prediction: Any, ground_truth: Any) -> Dict[str, float] abstractmethod

Return a dict of metric name → scalar for one sample.

Parameters:

Name Type Description Default
prediction Any

Task-specific Prediction dataclass (e.g. :class:`DepthPrediction`). The concrete type is whatever the calculator is designed for.

required
ground_truth Any

Task-specific GroundTruth dataclass (e.g. :class:`DepthGroundTruth`).

required

Returns:

Type Description
dict[str, float]

Metric values. Keys should be short snake_case names; values must be numeric so the runner can average them.

Raises:

Type Description
MetricError

When inputs have the wrong shape/type/content.

Source code in rpx_benchmark/metrics/registry.py
@abstractmethod
def compute(self, prediction: Any, ground_truth: Any) -> Dict[str, float]:
    """Return a dict of metric name → scalar for one sample.

    Parameters
    ----------
    prediction : Any
        Task-specific Prediction dataclass
        (e.g. :class:`DepthPrediction`). The concrete type is
        whatever the calculator is designed for.
    ground_truth : Any
        Task-specific GroundTruth dataclass
        (e.g. :class:`DepthGroundTruth`).

    Returns
    -------
    dict[str, float]
        Metric values. Keys should be short snake_case names;
        values must be numeric so the runner can average them.

    Raises
    ------
    MetricError
        When inputs have the wrong shape/type/content.
    """

BenchmarkResult(task: TaskType, per_sample: List[Dict[str, Any]], aggregated: Dict[str, float], num_samples: int) dataclass

Outcome of running a :class:`BenchmarkRunner` against a dataset.

Attributes:

Name Type Description
task TaskType

Which task was evaluated.

per_sample list of dict

One dict per sample. Each dict mixes metric keys (numeric) and metadata keys (id, phase, difficulty, scene) that :meth:`MetricSuite.aggregate` silently skips when computing means.

aggregated dict[str, float]

Mean over the numeric metric keys in :attr:`per_sample`.

num_samples int

Number of samples evaluated (the length of :attr:`per_sample`).

MetricSuite(task: TaskType)

Thin wrapper around the metric registry used by the runner.

Kept as a class rather than a function because the historical API expects ``MetricSuite.for_task(...).evaluate(pred, gt)``. New code can call :func:`compute_metrics` directly.

Source code in rpx_benchmark/metrics/registry.py
def __init__(self, task: TaskType) -> None:
    self.task = task

for_task(task: TaskType) -> 'MetricSuite' classmethod

Create a suite for the given task.

Raises:

Type Description
MetricError

If no calculators are registered for task.

Source code in rpx_benchmark/metrics/registry.py
@classmethod
def for_task(cls, task: TaskType) -> "MetricSuite":
    """Create a suite for the given task.

    Raises
    ------
    MetricError
        If no calculators are registered for ``task``.
    """
    if not get_calculators(task):
        raise MetricError(
            f"No metric calculator registered for task {task.value!r}.",
            hint="Import rpx_benchmark.metrics to trigger built-ins.",
        )
    return cls(task=task)

evaluate(prediction: Any, ground_truth: Any) -> Dict[str, float]

Run every registered calculator and return merged results.

Raises:

Type Description
MetricError

Propagated from individual calculators when inputs are shape-mismatched or wrong-typed.

Source code in rpx_benchmark/metrics/registry.py
def evaluate(self, prediction: Any, ground_truth: Any) -> Dict[str, float]:
    """Run every registered calculator and return merged results.

    Raises
    ------
    MetricError
        Propagated from individual calculators when inputs are
        shape-mismatched or wrong-typed.
    """
    return compute_metrics(self.task, prediction, ground_truth)

aggregate(per_sample: List[Dict[str, Any]]) -> Dict[str, float]

Mean over numeric metric keys; non-numeric metadata is skipped.

Parameters:

Name Type Description Default
per_sample list of dict

Per-sample rows. May contain metric floats and metadata strings/enums in the same dict.

required

Returns:

Type Description
dict[str, float]

One float per numeric key. Empty dict if per_sample is empty or contains no numeric values.

Source code in rpx_benchmark/metrics/registry.py
def aggregate(self, per_sample: List[Dict[str, Any]]) -> Dict[str, float]:
    """Mean over numeric metric keys; non-numeric metadata is skipped.

    Parameters
    ----------
    per_sample : list of dict
        Per-sample rows. May contain metric floats and metadata
        strings/enums in the same dict.

    Returns
    -------
    dict[str, float]
        One float per numeric key. Empty dict if ``per_sample`` is
        empty or contains no numeric values.
    """
    if not per_sample:
        return {}
    numeric_keys = [
        k for k, v in per_sample[0].items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    out: Dict[str, float] = {}
    for k in numeric_keys:
        vals = [
            m[k] for m in per_sample
            if isinstance(m.get(k), (int, float))
            and not isinstance(m.get(k), bool)
        ]
        if vals:
            out[k] = float(np.mean(vals))
    return out
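The skip-metadata behaviour can be exercised standalone. The sketch below mirrors the aggregation rule shown in the source (numeric keys taken from the first row, ``bool`` excluded because it subclasses ``int``); the rows are hypothetical per-sample output and ``statistics.mean`` stands in for ``np.mean``:

```python
from statistics import mean

# Hypothetical per-sample rows: numeric metrics plus metadata,
# mirroring MetricSuite.aggregate's skipping rule.
rows = [
    {"absrel": 0.10, "rmse": 0.50, "scene": "kitchen", "ok": True},
    {"absrel": 0.20, "rmse": 0.70, "scene": "hallway", "ok": False},
]

def aggregate(per_sample):
    if not per_sample:
        return {}
    # Numeric keys come from the first row; bool is excluded explicitly
    # because isinstance(True, int) is True in Python.
    numeric = [k for k, v in per_sample[0].items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {k: mean(m[k] for m in per_sample
                    if isinstance(m.get(k), (int, float))
                    and not isinstance(m.get(k), bool))
            for k in numeric}

print(aggregate(rows))  # "scene" and "ok" are skipped
```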

build_result(per_sample: List[Dict[str, Any]]) -> 'BenchmarkResult'

Convenience: wrap ``per_sample`` and its aggregate in a :class:`BenchmarkResult`.

Source code in rpx_benchmark/metrics/registry.py
def build_result(self, per_sample: List[Dict[str, Any]]) -> "BenchmarkResult":
    """Convenience: wrap ``per_sample`` and its aggregate in a
    :class:`BenchmarkResult`.
    """
    return BenchmarkResult(
        task=self.task,
        per_sample=per_sample,
        aggregated=self.aggregate(per_sample),
        num_samples=len(per_sample),
    )

register_metric(task: TaskType) -> Callable[[Type[MetricCalculator]], Type[MetricCalculator]]

Class decorator registering a calculator under a task.

Parameters:

Name Type Description Default
task TaskType

Task the calculator belongs to. Multiple calculators can be registered against the same task and will be composed at evaluation time.

required

Returns:

Type Description
Callable

The decorator; when applied to a class it instantiates it once with no arguments and appends to the registry.

Raises:

Type Description
MetricError

If ``task`` is not a :class:`TaskType`, or if the decorated object is not a :class:`MetricCalculator` subclass.

Examples:

>>> from rpx_benchmark.api import TaskType
>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MyMetric(MetricCalculator):
...     name = "my_metric"
...     def compute(self, prediction, ground_truth):
...         return {"my_metric": 0.0}
Source code in rpx_benchmark/metrics/registry.py
def register_metric(
    task: TaskType,
) -> Callable[[Type[MetricCalculator]], Type[MetricCalculator]]:
    """Class decorator registering a calculator under a task.

    Parameters
    ----------
    task : TaskType
        Task the calculator belongs to. Multiple calculators can be
        registered against the same task and will be composed at
        evaluation time.

    Returns
    -------
    Callable
        The decorator; when applied to a class it instantiates it
        once with no arguments and appends to the registry.

    Raises
    ------
    MetricError
        If ``task`` is not a :class:`TaskType`, or if the decorated
        object is not a :class:`MetricCalculator` subclass.

    Examples
    --------
    >>> from rpx_benchmark.api import TaskType
    >>> from rpx_benchmark.metrics import MetricCalculator, register_metric
    >>> @register_metric(TaskType.MONOCULAR_DEPTH)
    ... class MyMetric(MetricCalculator):
    ...     name = "my_metric"
    ...     def compute(self, prediction, ground_truth):
    ...         return {"my_metric": 0.0}
    """
    if not isinstance(task, TaskType):
        raise MetricError(
            f"register_metric expected a TaskType, got {type(task).__name__}",
            hint="Use one of the members of rpx_benchmark.api.TaskType.",
        )

    def decorator(cls: Type[MetricCalculator]) -> Type[MetricCalculator]:
        if not (isinstance(cls, type) and issubclass(cls, MetricCalculator)):
            raise MetricError(
                f"@register_metric can only decorate MetricCalculator "
                f"subclasses; got {cls!r}",
            )
        instance = cls()
        _CALCULATORS[task].append(instance)
        log.debug("registered metric %s for task %s",
                  instance.name or cls.__name__, task.value)
        return cls

    return decorator

unregister_metric(task: TaskType, name: str) -> bool

Remove a previously registered calculator by its name field.

Parameters:

Name Type Description Default
task TaskType
required
name str

Name attribute of the calculator to remove.

required

Returns:

Type Description
bool

True if a calculator was removed; False if no match was found.

Source code in rpx_benchmark/metrics/registry.py
def unregister_metric(task: TaskType, name: str) -> bool:
    """Remove a previously registered calculator by its ``name`` field.

    Parameters
    ----------
    task : TaskType
    name : str
        Name attribute of the calculator to remove.

    Returns
    -------
    bool
        True if a calculator was removed; False if no match was found.
    """
    before = len(_CALCULATORS.get(task, []))
    _CALCULATORS[task] = [c for c in _CALCULATORS.get(task, []) if c.name != name]
    after = len(_CALCULATORS[task])
    return after < before

clear_registry() -> None

Remove every registered calculator. Primarily for tests.

Source code in rpx_benchmark/metrics/registry.py
def clear_registry() -> None:
    """Remove every registered calculator. Primarily for tests."""
    _CALCULATORS.clear()

get_calculators(task: TaskType) -> List[MetricCalculator]

Return the calculators registered for task (empty list if none).

Source code in rpx_benchmark/metrics/registry.py
def get_calculators(task: TaskType) -> List[MetricCalculator]:
    """Return the calculators registered for ``task`` (empty list if none)."""
    return list(_CALCULATORS.get(task, []))

available_metrics() -> Dict[TaskType, List[str]]

List the registered metric names grouped by task.

Returns:

Type Description
dict

{TaskType: [calculator_name, ...]}. Only tasks with at least one registered calculator appear.

Source code in rpx_benchmark/metrics/registry.py
def available_metrics() -> Dict[TaskType, List[str]]:
    """List the registered metric names grouped by task.

    Returns
    -------
    dict
        ``{TaskType: [calculator_name, ...]}``. Only tasks with at
        least one registered calculator appear.
    """
    return {task: [c.name for c in calcs]
            for task, calcs in _CALCULATORS.items()
            if calcs}

compute_metrics(task: TaskType, prediction: Any, ground_truth: Any) -> Dict[str, float]

Run every registered calculator for task and merge outputs.

Parameters:

Name Type Description Default
task TaskType
required
prediction Any
required
ground_truth Any
required

Returns:

Type Description
dict[str, float]

Union of all calculators' output dicts. Later calculators may overwrite earlier ones if they emit the same key.

Raises:

Type Description
MetricError

If no calculators are registered for task.

Source code in rpx_benchmark/metrics/registry.py
def compute_metrics(
    task: TaskType,
    prediction: Any,
    ground_truth: Any,
) -> Dict[str, float]:
    """Run every registered calculator for ``task`` and merge outputs.

    Parameters
    ----------
    task : TaskType
    prediction : Any
    ground_truth : Any

    Returns
    -------
    dict[str, float]
        Union of all calculators' output dicts. Later calculators may
        overwrite earlier ones if they emit the same key.

    Raises
    ------
    MetricError
        If no calculators are registered for ``task``.
    """
    calcs = _CALCULATORS.get(task)
    if not calcs:
        raise MetricError(
            f"No metric calculator registered for task {task.value!r}.",
            hint=(
                "Either import the built-in calculators (e.g. "
                "`import rpx_benchmark.metrics`) or register a custom "
                "one via @register_metric."
            ),
        )
    merged: Dict[str, float] = {}
    for calc in calcs:
        merged.update(calc.compute(prediction, ground_truth))
    return merged
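The merge semantics above are registration-order dict updates. A minimal standalone sketch of the flow, with plain functions standing in for ``MetricCalculator`` classes and ``LookupError`` standing in for ``MetricError``:

```python
from collections import defaultdict

# Minimal sketch of the registry flow (standalone, not the library itself).
_calcs = defaultdict(list)

def register(task, fn):
    _calcs[task].append(fn)

def compute(task, pred, gt):
    if not _calcs[task]:
        raise LookupError(f"no calculators registered for {task!r}")
    merged = {}
    for fn in _calcs[task]:          # registration order
        merged.update(fn(pred, gt))  # later calculators win on key collisions
    return merged

register("depth", lambda p, g: {"absrel": abs(p - g) / g})
register("depth", lambda p, g: {"rmse": abs(p - g)})
print(compute("depth", 2.0, 4.0))  # {'absrel': 0.5, 'rmse': 2.0}
```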

Built-in calculators

Monocular depth

depth

Monocular absolute depth metric calculators.

Registers :class:`DepthErrorMetrics`, which computes AbsRel, RMSE, and three thresholded accuracy ratios (δ<1.25, δ<1.25², δ<1.25³) against the subset of pixels where the ground-truth depth map is valid (gt > 0). This matches the paper's eval protocol: we mask out D435-invalid pixels so models are not penalised where the sensor itself produced no measurement.

DepthErrorMetrics

Bases: MetricCalculator

AbsRel / RMSE / δ-accuracy for monocular metric depth.

Notes

All metrics are computed in metres. Neither raw logits nor any kind of scale alignment is applied — we deliberately measure the model's absolute metric accuracy, which is the property that matters for robot policies that consume depth directly.

For empty validity masks (fully invalid GT frames) we return the safe no-op sentinel {"absrel": 0, "rmse": 0, "delta1": 1, ...} so a single bad frame does not break aggregate means. In practice the dataset pipeline should filter these out upstream.

Metric definitions

.. math::

    \text{AbsRel} &= \frac{1}{N} \sum_i \frac{|\hat d_i - d_i|}{d_i} \\
    \text{RMSE}   &= \sqrt{\frac{1}{N}\sum_i(\hat d_i - d_i)^2} \\
    \delta_k      &= \frac{1}{N}\,\#\Big\{i : \max\!\big(\tfrac{\hat d_i}{d_i},
                      \tfrac{d_i}{\hat d_i}\big) < 1.25^{\,k}\Big\}
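These definitions translate directly to NumPy. Below is a hedged sketch of the documented protocol (validity mask gt > 0, safe sentinel for fully invalid frames); the real calculator lives in rpx_benchmark/metrics/depth.py and may differ in detail:

```python
import numpy as np

def depth_errors(pred, gt):
    """AbsRel, RMSE and delta accuracies over valid (gt > 0) pixels.

    Standalone sketch of the documented definitions, not the library code.
    """
    mask = gt > 0
    if not mask.any():
        # Safe no-op sentinel for fully invalid frames, as documented.
        return {"absrel": 0.0, "rmse": 0.0,
                "delta1": 1.0, "delta2": 1.0, "delta3": 1.0}
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)  # max(d_hat/d, d/d_hat) per pixel
    return {
        "absrel": float(np.mean(np.abs(p - g) / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```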

Object detection

detection

Object detection + open-vocab detection metric calculators.

DetectionMetrics(iou_threshold: float = 0.5)

Bases: MetricCalculator

Precision / recall / F1 at a single IoU threshold.

Parameters:

Name Type Description Default
iou_threshold float

Minimum IoU (0.0–1.0) for a prediction to count as a true positive. Defaults to 0.5, the COCO baseline.

0.5
Notes

Greedy matching by descending prediction score. A GT box is consumed after the first predicted box matches it, so subsequent predictions for the same object register as false positives (standard VOC/COCO evaluation rule).

Source code in rpx_benchmark/metrics/detection.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold
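The greedy rule described in the notes can be sketched as follows. The ``(x1, y1, x2, y2)`` box layout and ``(box, score)`` prediction pairs are assumptions for illustration; the library's dataclasses may differ:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def prf1(preds, gts, iou_thr=0.5):
    """Precision / recall / F1 with greedy matching by descending score.

    Each GT box is consumed after its first match, so duplicate
    predictions for the same object count as false positives.
    preds: list of (box, score); gts: list of boxes. Sketch only.
    """
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = None, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = box_iou(box, gt)
            if iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched.add(best)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    prec = tp / (tp + fp) if preds else 0.0
    rec = tp / (tp + fn) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec, "f1": f1}
```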

Object segmentation

segmentation

Instance/semantic segmentation metric calculators.

SegmentationMIoU

Bases: MetricCalculator

Mean Intersection-over-Union (mIoU) across GT classes.

Notes

Averaged over the set of class ids present in the ground-truth mask (background class id -1 is ignored). Classes in the prediction that do not appear in GT do not contribute — they only reduce the IoU of other classes they overlap with.
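A minimal NumPy sketch of that averaging rule (class ids drawn from the GT mask, background id -1 ignored); the library implementation may differ in detail:

```python
import numpy as np

def miou(pred, gt, ignore=-1):
    """Mean IoU over class ids present in gt, skipping `ignore`. Sketch."""
    ious = []
    for c in np.unique(gt):
        if c == ignore:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```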

Relative camera pose

pose

Relative camera pose metric calculators.

RelativePoseError

Bases: MetricCalculator

Rotation geodesic (degrees) + translation L2 (metres).

Notes

Rotations are compared in SO(3) via the geodesic distance ``arccos((trace(R_pred^T R_gt) - 1) / 2)``. Quaternion inputs (4-vectors) are accepted and converted to rotation matrices. Translations are compared in metres with a straight L2 norm.
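A sketch of both comparisons in NumPy. The (w, x, y, z) quaternion ordering is an assumption, as is clipping the cosine to guard against floating-point drift; the library's conventions may differ:

```python
import numpy as np

def rotation_geodesic_deg(R_pred, R_gt):
    """Geodesic distance in SO(3), in degrees. Sketch."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip guards arccos against |cos| drifting slightly past 1.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def quat_to_mat(q):
    """Unit quaternion (w, x, y, z) to rotation matrix. Ordering assumed."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def translation_l2(t_pred, t_gt):
    """Straight L2 distance in metres."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))
```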

Visual grounding

grounding

Visual grounding metric calculators.

GroundingIoU(iou_threshold: float = 0.5)

Bases: MetricCalculator

Top-1 IoU and accuracy at a threshold for a single referred object.

Parameters:

Name Type Description Default
iou_threshold float

IoU threshold for the grounding_acc indicator. Default 0.5, matching standard referring expression comprehension protocols.

0.5
Source code in rpx_benchmark/metrics/grounding.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold

Sparse depth

sparse_depth

Sparse depth metric calculators (RGB + sparse GT depth points).

SparseDepthError(radius: float = 2.0)

Bases: MetricCalculator

AbsRel + RMSE computed only at the provided sparse GT locations.

Parameters:

Name Type Description Default
radius float

Maximum pixel distance between a prediction point and the matched GT point. Predictions outside this radius contribute a full-magnitude error, which penalises models that fail to localise the sparse points.

2.0
Source code in rpx_benchmark/metrics/sparse_depth.py
def __init__(self, radius: float = 2.0) -> None:
    self.radius = radius

Novel view synthesis

nvs

Novel view synthesis metric calculators.

NVSQuality

Bases: MetricCalculator

PSNR and simplified global SSIM for an RGB novel-view prediction.

Notes

The SSIM implementation is a deliberately simple global-statistics variant (no sliding window) so it runs without OpenCV / skimage. For publication we recommend re-computing SSIM offline with a full sliding-window implementation against the per-sample arrays captured in result.per_sample.
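A sketch of both measures under the global-statistics simplification described above. The ``max_val`` default and the standard SSIM stabilising constants ``c1``, ``c2`` are assumptions; the library's values may differ:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB. Sketch."""
    mse = float(np.mean((pred - gt) ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def global_ssim(pred, gt, max_val=1.0):
    """Global-statistics SSIM: one mean/variance/covariance per image,
    no sliding window. Sketch of the simplification described above."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    return float((2 * mu_p * mu_g + c1) * (2 * cov + c2)
                 / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)))
```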

Keypoint matching

keypoints

Keypoint correspondence metric calculators.

KeypointAccuracy(px_threshold: float = 3.0)

Bases: MetricCalculator

% of predicted matches within a pixel threshold + mean error.

Parameters:

Name Type Description Default
px_threshold float

Pixel radius in frame B at which a predicted correspondence is counted as correct. Default 3 px matches the standard ScanNet/InteriorNet evaluation protocol.

3.0
Source code in rpx_benchmark/metrics/keypoints.py
def __init__(self, px_threshold: float = 3.0) -> None:
    self.px_threshold = px_threshold
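A standalone sketch of the measure, assuming already-paired (x, y) correspondences in frame B; the library's prediction format may differ:

```python
import math

def keypoint_accuracy(pred_pts, gt_pts, px_threshold=3.0):
    """Fraction of correspondences within px_threshold + mean pixel error.

    pred_pts / gt_pts: paired (x, y) locations in frame B. Sketch only.
    """
    errs = [math.dist(p, g) for p, g in zip(pred_pts, gt_pts)]
    if not errs:
        return {"kp_acc": 0.0, "kp_mean_err": 0.0}
    return {
        "kp_acc": sum(e <= px_threshold for e in errs) / len(errs),
        "kp_mean_err": sum(errs) / len(errs),
    }
```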

Object tracking

tracking

Multi-object tracking metric calculators.

TrackletMetrics(iou_threshold: float = 0.5)

Bases: MetricCalculator

Simplified MOTA + IDF1 over a single tracklet sample.

Parameters:

Name Type Description Default
iou_threshold float

IoU required to count a predicted detection as a hit for a given GT box. Default 0.5.

0.5
Notes

This is the per-sample version called by the runner. A full MOTA computation (with cross-scene identity switches) lives in :mod:`rpx_benchmark.deployment` and runs on the whole dataset.

Source code in rpx_benchmark/metrics/tracking.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold