
Metrics (rpx_benchmark.metrics)

Pluggable per-task metric calculators and the registry that binds them to task types.

Registry core

registry

Core of the metric plugin system.

Contains the :class:`MetricCalculator` base class, the task→calculators registry, the ``@register_metric`` decorator, and a thin :class:`MetricSuite` facade that the runner uses.

Most library users should not touch this module directly — instead import from :mod:`rpx_benchmark.metrics`.

MetricCalculator

Bases: ABC

Abstract base for a family of metrics bound to a single task.

Subclasses are lightweight, stateless objects. The compute method accepts one prediction and its ground truth and returns a dict mapping metric name → scalar float.

Subclasses must set ``name`` (human-readable identifier used for registration) and implement :meth:`compute`. Subclasses should take their own hyperparameters via ``__init__`` when needed.

Examples:

>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> from rpx_benchmark.api import TaskType
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MeanPred(MetricCalculator):
...     name = "mean_pred"
...     def compute(self, prediction, ground_truth):
...         return {"mean_pred": float(prediction.depth_map.mean())}

compute(prediction: Any, ground_truth: Any) -> Dict[str, float] abstractmethod

Return a dict of metric name → scalar for one sample.

Parameters:

Name Type Description Default
prediction Any

Task-specific Prediction dataclass (e.g. :class:`DepthPrediction`). The concrete type is whatever the calculator is designed for.

required
ground_truth Any

Task-specific GroundTruth dataclass (e.g. :class:`DepthGroundTruth`).

required

Returns:

Type Description
dict[str, float]

Metric values. Keys should be short snake_case names; values must be numeric so the runner can average them.

Raises:

Type Description
MetricError

When inputs have the wrong shape/type/content.

Source code in rpx_benchmark/metrics/registry.py
@abstractmethod
def compute(self, prediction: Any, ground_truth: Any) -> Dict[str, float]:
    """Return a dict of metric name → scalar for one sample.

    Parameters
    ----------
    prediction : Any
        Task-specific Prediction dataclass
        (e.g. :class:`DepthPrediction`). The concrete type is
        whatever the calculator is designed for.
    ground_truth : Any
        Task-specific GroundTruth dataclass
        (e.g. :class:`DepthGroundTruth`).

    Returns
    -------
    dict[str, float]
        Metric values. Keys should be short snake_case names;
        values must be numeric so the runner can average them.

    Raises
    ------
    MetricError
        When inputs have the wrong shape/type/content.
    """

BenchmarkResult(task: TaskType, per_sample: List[Dict[str, Any]], aggregated: Dict[str, float], num_samples: int) dataclass

Outcome of running a :class:`BenchmarkRunner` against a dataset.

Attributes:

Name Type Description
task TaskType

Which task was evaluated.

per_sample list of dict

One dict per sample. Each dict mixes metric keys (numeric) and metadata keys (id, phase, difficulty, scene) that :meth:`MetricSuite.aggregate` silently skips when computing means.

aggregated dict[str, float]

Mean over the numeric metric keys in :attr:`per_sample`.

num_samples int

Number of samples evaluated (the length of :attr:`per_sample`).

MetricSuite(task: TaskType)

Thin wrapper around the metric registry used by the runner.

Kept as a class rather than a function because the historical API expects ``MetricSuite.for_task(...).evaluate(pred, gt)``. New code can call :func:`compute_metrics` directly.

Source code in rpx_benchmark/metrics/registry.py
def __init__(self, task: TaskType) -> None:
    self.task = task

for_task(task: TaskType) -> 'MetricSuite' classmethod

Create a suite for the given task.

Raises:

Type Description
MetricError

If no calculators are registered for task.

Source code in rpx_benchmark/metrics/registry.py
@classmethod
def for_task(cls, task: TaskType) -> "MetricSuite":
    """Create a suite for the given task.

    Raises
    ------
    MetricError
        If no calculators are registered for ``task``.
    """
    if not get_calculators(task):
        raise MetricError(
            f"No metric calculator registered for task {task.value!r}.",
            hint="Import rpx_benchmark.metrics to trigger built-ins.",
        )
    return cls(task=task)

evaluate(prediction: Any, ground_truth: Any) -> Dict[str, float]

Run every registered calculator and return merged results.

Raises:

Type Description
MetricError

Propagated from individual calculators when inputs are shape-mismatched or wrong-typed.

Source code in rpx_benchmark/metrics/registry.py
def evaluate(self, prediction: Any, ground_truth: Any) -> Dict[str, float]:
    """Run every registered calculator and return merged results.

    Raises
    ------
    MetricError
        Propagated from individual calculators when inputs are
        shape-mismatched or wrong-typed.
    """
    return compute_metrics(self.task, prediction, ground_truth)

aggregate(per_sample: List[Dict[str, Any]]) -> Dict[str, float]

Mean over numeric metric keys; non-numeric metadata is skipped.

Parameters:

Name Type Description Default
per_sample list of dict

Per-sample rows. May contain metric floats and metadata strings/enums in the same dict.

required

Returns:

Type Description
dict[str, float]

One float per numeric key. Empty dict if per_sample is empty or contains no numeric values.

Source code in rpx_benchmark/metrics/registry.py
def aggregate(self, per_sample: List[Dict[str, Any]]) -> Dict[str, float]:
    """Mean over numeric metric keys; non-numeric metadata is skipped.

    Parameters
    ----------
    per_sample : list of dict
        Per-sample rows. May contain metric floats and metadata
        strings/enums in the same dict.

    Returns
    -------
    dict[str, float]
        One float per numeric key. Empty dict if ``per_sample`` is
        empty or contains no numeric values.
    """
    if not per_sample:
        return {}
    numeric_keys = [
        k for k, v in per_sample[0].items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    out: Dict[str, float] = {}
    for k in numeric_keys:
        vals = [
            m[k] for m in per_sample
            if isinstance(m.get(k), (int, float))
            and not isinstance(m.get(k), bool)
        ]
        if vals:
            out[k] = float(np.mean(vals))
    return out
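The skip-metadata behaviour can be exercised standalone. The sketch below mirrors the aggregation rule shown in the source (numeric keys taken from the first row, ``bool`` excluded because it subclasses ``int``); the rows are hypothetical per-sample output and ``statistics.mean`` stands in for ``np.mean``:

```python
from statistics import mean

# Hypothetical per-sample rows: numeric metrics plus metadata,
# mirroring MetricSuite.aggregate's skipping rule.
rows = [
    {"absrel": 0.10, "rmse": 0.50, "scene": "kitchen", "ok": True},
    {"absrel": 0.20, "rmse": 0.70, "scene": "hallway", "ok": False},
]

def aggregate(per_sample):
    if not per_sample:
        return {}
    # Numeric keys come from the first row; bool is excluded explicitly
    # because isinstance(True, int) is True in Python.
    numeric = [k for k, v in per_sample[0].items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {k: mean(m[k] for m in per_sample
                    if isinstance(m.get(k), (int, float))
                    and not isinstance(m.get(k), bool))
            for k in numeric}

print(aggregate(rows))  # "scene" and "ok" are skipped
```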

build_result(per_sample: List[Dict[str, Any]]) -> 'BenchmarkResult'

Convenience: wrap ``per_sample`` and its aggregate in a :class:`BenchmarkResult`.

Source code in rpx_benchmark/metrics/registry.py
def build_result(self, per_sample: List[Dict[str, Any]]) -> "BenchmarkResult":
    """Convenience: wrap ``per_sample`` and its aggregate in a
    :class:`BenchmarkResult`.
    """
    return BenchmarkResult(
        task=self.task,
        per_sample=per_sample,
        aggregated=self.aggregate(per_sample),
        num_samples=len(per_sample),
    )

register_metric(task: TaskType) -> Callable[[Type[MetricCalculator]], Type[MetricCalculator]]

Class decorator registering a calculator under a task.

Parameters:

Name Type Description Default
task TaskType

Task the calculator belongs to. Multiple calculators can be registered against the same task and will be composed at evaluation time.

required

Returns:

Type Description
Callable

The decorator; when applied to a class it instantiates it once with no arguments and appends to the registry.

Raises:

Type Description
MetricError

If ``task`` is not a :class:`TaskType`, or if the decorated object is not a :class:`MetricCalculator` subclass.

Examples:

>>> from rpx_benchmark.api import TaskType
>>> from rpx_benchmark.metrics import MetricCalculator, register_metric
>>> @register_metric(TaskType.MONOCULAR_DEPTH)
... class MyMetric(MetricCalculator):
...     name = "my_metric"
...     def compute(self, prediction, ground_truth):
...         return {"my_metric": 0.0}
Source code in rpx_benchmark/metrics/registry.py
def register_metric(
    task: TaskType,
) -> Callable[[Type[MetricCalculator]], Type[MetricCalculator]]:
    """Class decorator registering a calculator under a task.

    Parameters
    ----------
    task : TaskType
        Task the calculator belongs to. Multiple calculators can be
        registered against the same task and will be composed at
        evaluation time.

    Returns
    -------
    Callable
        The decorator; when applied to a class it instantiates it
        once with no arguments and appends to the registry.

    Raises
    ------
    MetricError
        If ``task`` is not a :class:`TaskType`, or if the decorated
        object is not a :class:`MetricCalculator` subclass.

    Examples
    --------
    >>> from rpx_benchmark.api import TaskType
    >>> from rpx_benchmark.metrics import MetricCalculator, register_metric
    >>> @register_metric(TaskType.MONOCULAR_DEPTH)
    ... class MyMetric(MetricCalculator):
    ...     name = "my_metric"
    ...     def compute(self, prediction, ground_truth):
    ...         return {"my_metric": 0.0}
    """
    if not isinstance(task, TaskType):
        raise MetricError(
            f"register_metric expected a TaskType, got {type(task).__name__}",
            hint="Use one of the members of rpx_benchmark.api.TaskType.",
        )

    def decorator(cls: Type[MetricCalculator]) -> Type[MetricCalculator]:
        if not (isinstance(cls, type) and issubclass(cls, MetricCalculator)):
            raise MetricError(
                f"@register_metric can only decorate MetricCalculator "
                f"subclasses; got {cls!r}",
            )
        instance = cls()
        _CALCULATORS[task].append(instance)
        log.debug("registered metric %s for task %s",
                  instance.name or cls.__name__, task.value)
        return cls

    return decorator

unregister_metric(task: TaskType, name: str) -> bool

Remove a previously registered calculator by its name field.

Parameters:

Name Type Description Default
task TaskType
required
name str

Name attribute of the calculator to remove.

required

Returns:

Type Description
bool

True if a calculator was removed; False if no match was found.

Source code in rpx_benchmark/metrics/registry.py
def unregister_metric(task: TaskType, name: str) -> bool:
    """Remove a previously registered calculator by its ``name`` field.

    Parameters
    ----------
    task : TaskType
    name : str
        Name attribute of the calculator to remove.

    Returns
    -------
    bool
        True if a calculator was removed; False if no match was found.
    """
    before = len(_CALCULATORS.get(task, []))
    _CALCULATORS[task] = [c for c in _CALCULATORS.get(task, []) if c.name != name]
    after = len(_CALCULATORS[task])
    return after < before

clear_registry() -> None

Remove every registered calculator. Primarily for tests.

Source code in rpx_benchmark/metrics/registry.py
def clear_registry() -> None:
    """Remove every registered calculator. Primarily for tests."""
    _CALCULATORS.clear()

get_calculators(task: TaskType) -> List[MetricCalculator]

Return the calculators registered for task (empty list if none).

Source code in rpx_benchmark/metrics/registry.py
def get_calculators(task: TaskType) -> List[MetricCalculator]:
    """Return the calculators registered for ``task`` (empty list if none)."""
    return list(_CALCULATORS.get(task, []))

available_metrics() -> Dict[TaskType, List[str]]

List the registered metric names grouped by task.

Returns:

Type Description
dict

{TaskType: [calculator_name, ...]}. Only tasks with at least one registered calculator appear.

Source code in rpx_benchmark/metrics/registry.py
def available_metrics() -> Dict[TaskType, List[str]]:
    """List the registered metric names grouped by task.

    Returns
    -------
    dict
        ``{TaskType: [calculator_name, ...]}``. Only tasks with at
        least one registered calculator appear.
    """
    return {task: [c.name for c in calcs]
            for task, calcs in _CALCULATORS.items()
            if calcs}

compute_metrics(task: TaskType, prediction: Any, ground_truth: Any) -> Dict[str, float]

Run every registered calculator for task and merge outputs.

Parameters:

Name Type Description Default
task TaskType
required
prediction Any
required
ground_truth Any
required

Returns:

Type Description
dict[str, float]

Union of all calculators' output dicts. Later calculators may overwrite earlier ones if they emit the same key.

Raises:

Type Description
MetricError

If no calculators are registered for task.

Source code in rpx_benchmark/metrics/registry.py
def compute_metrics(
    task: TaskType,
    prediction: Any,
    ground_truth: Any,
) -> Dict[str, float]:
    """Run every registered calculator for ``task`` and merge outputs.

    Parameters
    ----------
    task : TaskType
    prediction : Any
    ground_truth : Any

    Returns
    -------
    dict[str, float]
        Union of all calculators' output dicts. Later calculators may
        overwrite earlier ones if they emit the same key.

    Raises
    ------
    MetricError
        If no calculators are registered for ``task``.
    """
    calcs = _CALCULATORS.get(task)
    if not calcs:
        raise MetricError(
            f"No metric calculator registered for task {task.value!r}.",
            hint=(
                "Either import the built-in calculators (e.g. "
                "`import rpx_benchmark.metrics`) or register a custom "
                "one via @register_metric."
            ),
        )
    merged: Dict[str, float] = {}
    for calc in calcs:
        merged.update(calc.compute(prediction, ground_truth))
    return merged
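The merge semantics above are registration-order dict updates. A minimal standalone sketch of the flow, with plain functions standing in for ``MetricCalculator`` classes and ``LookupError`` standing in for ``MetricError``:

```python
from collections import defaultdict

# Minimal sketch of the registry flow (standalone, not the library itself).
_calcs = defaultdict(list)

def register(task, fn):
    _calcs[task].append(fn)

def compute(task, pred, gt):
    if not _calcs[task]:
        raise LookupError(f"no calculators registered for {task!r}")
    merged = {}
    for fn in _calcs[task]:          # registration order
        merged.update(fn(pred, gt))  # later calculators win on key collisions
    return merged

register("depth", lambda p, g: {"absrel": abs(p - g) / g})
register("depth", lambda p, g: {"rmse": abs(p - g)})
print(compute("depth", 2.0, 4.0))  # {'absrel': 0.5, 'rmse': 2.0}
```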

Built-in calculators

Monocular depth

depth

Monocular absolute depth metric calculators.

Registers :class:`DepthErrorMetrics`, which computes AbsRel, RMSE, and three thresholded accuracy ratios (δ<1.25, δ<1.25², δ<1.25³) against the subset of pixels where the ground-truth depth map is valid (gt > 0). This matches the paper's eval protocol: we mask out D435-invalid pixels so models are not penalised where the sensor itself produced no measurement.

DepthErrorMetrics

Bases: MetricCalculator

AbsRel / RMSE / δ-accuracy for monocular metric depth.

Notes

All metrics are computed in metres. Neither raw logits nor any kind of scale alignment is applied — we deliberately measure the model's absolute metric accuracy, which is the property that matters for robot policies that consume depth directly.

For empty validity masks (fully invalid GT frames) we return the safe no-op sentinel {"absrel": 0, "rmse": 0, "delta1": 1, ...} so a single bad frame does not break aggregate means. In practice the dataset pipeline should filter these out upstream.

Metric definitions

.. math::

    \text{AbsRel} &= \frac{1}{N} \sum_i \frac{|\hat d_i - d_i|}{d_i} \\
    \text{RMSE}   &= \sqrt{\frac{1}{N}\sum_i(\hat d_i - d_i)^2} \\
    \delta_k      &= \frac{1}{N}\,\#\Big\{i : \max\!\big(\tfrac{\hat d_i}{d_i},
                      \tfrac{d_i}{\hat d_i}\big) < 1.25^{\,k}\Big\}
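These definitions translate directly to NumPy. Below is a hedged sketch of the documented protocol (validity mask gt > 0, safe sentinel for fully invalid frames); the real calculator lives in rpx_benchmark/metrics/depth.py and may differ in detail:

```python
import numpy as np

def depth_errors(pred, gt):
    """AbsRel, RMSE and delta accuracies over valid (gt > 0) pixels.

    Standalone sketch of the documented definitions, not the library code.
    """
    mask = gt > 0
    if not mask.any():
        # Safe no-op sentinel for fully invalid frames, as documented.
        return {"absrel": 0.0, "rmse": 0.0,
                "delta1": 1.0, "delta2": 1.0, "delta3": 1.0}
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)  # max(d_hat/d, d/d_hat) per pixel
    return {
        "absrel": float(np.mean(np.abs(p - g) / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```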

Object detection

detection

Object detection + open-vocab detection metric calculators.

DetectionMetrics(iou_threshold: float = 0.5)

Bases: MetricCalculator

Precision / recall / F1 at a single IoU threshold.

Parameters:

Name Type Description Default
iou_threshold float

Minimum IoU (0.0–1.0) for a prediction to count as a true positive. Defaults to 0.5, the COCO baseline.

0.5
Notes

Greedy matching by descending prediction score. A GT box is consumed after the first predicted box matches it, so subsequent predictions for the same object register as false positives (standard VOC/COCO evaluation rule).

Source code in rpx_benchmark/metrics/detection.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold
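The greedy rule described in the notes can be sketched as follows. The ``(x1, y1, x2, y2)`` box layout and ``(box, score)`` prediction pairs are assumptions for illustration; the library's dataclasses may differ:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def prf1(preds, gts, iou_thr=0.5):
    """Precision / recall / F1 with greedy matching by descending score.

    Each GT box is consumed after its first match, so duplicate
    predictions for the same object count as false positives.
    preds: list of (box, score); gts: list of boxes. Sketch only.
    """
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = None, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = box_iou(box, gt)
            if iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched.add(best)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    prec = tp / (tp + fp) if preds else 0.0
    rec = tp / (tp + fn) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec, "f1": f1}
```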

Object segmentation

segmentation

Instance/semantic segmentation metric calculators.

SegmentationMIoU

Bases: MetricCalculator

Mean Intersection-over-Union (mIoU) across GT classes.

Notes

Averaged over the set of class ids present in the ground-truth mask (background class id -1 is ignored). Classes in the prediction that do not appear in GT do not contribute — they only reduce the IoU of other classes they overlap with.
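A minimal NumPy sketch of that averaging rule (class ids drawn from the GT mask, background id -1 ignored); the library implementation may differ in detail:

```python
import numpy as np

def miou(pred, gt, ignore=-1):
    """Mean IoU over class ids present in gt, skipping `ignore`. Sketch."""
    ious = []
    for c in np.unique(gt):
        if c == ignore:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```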

Relative camera pose

pose

Relative camera pose metric calculators.

RelativePoseError

Bases: MetricCalculator

Rotation geodesic (degrees) + translation L2 (metres).

Notes

Rotations are compared in SO(3) via the geodesic distance ``arccos((trace(R_pred^T R_gt) - 1) / 2)``. Quaternion inputs (4-vectors) are accepted and converted to rotation matrices. Translations are compared in metres with a straight L2 norm.
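A sketch of both comparisons in NumPy. The (w, x, y, z) quaternion ordering is an assumption, as is clipping the cosine to guard against floating-point drift; the library's conventions may differ:

```python
import numpy as np

def rotation_geodesic_deg(R_pred, R_gt):
    """Geodesic distance in SO(3), in degrees. Sketch."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip guards arccos against |cos| drifting slightly past 1.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def quat_to_mat(q):
    """Unit quaternion (w, x, y, z) to rotation matrix. Ordering assumed."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def translation_l2(t_pred, t_gt):
    """Straight L2 distance in metres."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))
```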

Visual grounding

grounding

Visual grounding metric calculators.

GroundingIoU(iou_threshold: float = 0.5)

Bases: MetricCalculator

Top-1 IoU and accuracy at a threshold for a single referred object.

Parameters:

Name Type Description Default
iou_threshold float

IoU threshold for the grounding_acc indicator. Default 0.5, matching standard referring expression comprehension protocols.

0.5
Source code in rpx_benchmark/metrics/grounding.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold

Sparse depth

sparse_depth

Sparse depth metric calculators (RGB + sparse GT depth points).

SparseDepthError(radius: float = 2.0)

Bases: MetricCalculator

AbsRel + RMSE computed only at the provided sparse GT locations.

Parameters:

Name Type Description Default
radius float

Maximum pixel distance between a prediction point and the matched GT point. Predictions outside this radius contribute a full-magnitude error, which penalises models that fail to localise the sparse points.

2.0
Source code in rpx_benchmark/metrics/sparse_depth.py
def __init__(self, radius: float = 2.0) -> None:
    self.radius = radius

Novel view synthesis

nvs

Novel view synthesis metric calculators.

NVSQuality

Bases: MetricCalculator

PSNR and simplified global SSIM for an RGB novel-view prediction.

Notes

The SSIM implementation is a deliberately simple global-statistics variant (no sliding window) so it runs without OpenCV / skimage. For publication we recommend re-computing SSIM offline with a full sliding-window implementation against the per-sample arrays captured in result.per_sample.
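A sketch of both measures under the global-statistics simplification described above. The ``max_val`` default and the standard SSIM stabilising constants ``c1``, ``c2`` are assumptions; the library's values may differ:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB. Sketch."""
    mse = float(np.mean((pred - gt) ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def global_ssim(pred, gt, max_val=1.0):
    """Global-statistics SSIM: one mean/variance/covariance per image,
    no sliding window. Sketch of the simplification described above."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    return float((2 * mu_p * mu_g + c1) * (2 * cov + c2)
                 / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)))
```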

Keypoint matching

keypoints

Keypoint correspondence metric calculators.

KeypointAccuracy(px_threshold: float = 3.0)

Bases: MetricCalculator

% of predicted matches within a pixel threshold + mean error.

Parameters:

Name Type Description Default
px_threshold float

Pixel radius in frame B at which a predicted correspondence is counted as correct. Default 3 px matches the standard ScanNet/InteriorNet evaluation protocol.

3.0
Source code in rpx_benchmark/metrics/keypoints.py
def __init__(self, px_threshold: float = 3.0) -> None:
    self.px_threshold = px_threshold
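A standalone sketch of the measure, assuming already-paired (x, y) correspondences in frame B; the library's prediction format may differ:

```python
import math

def keypoint_accuracy(pred_pts, gt_pts, px_threshold=3.0):
    """Fraction of correspondences within px_threshold + mean pixel error.

    pred_pts / gt_pts: paired (x, y) locations in frame B. Sketch only.
    """
    errs = [math.dist(p, g) for p, g in zip(pred_pts, gt_pts)]
    if not errs:
        return {"kp_acc": 0.0, "kp_mean_err": 0.0}
    return {
        "kp_acc": sum(e <= px_threshold for e in errs) / len(errs),
        "kp_mean_err": sum(errs) / len(errs),
    }
```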

Object tracking

tracking

Multi-object tracking metric calculators.

TrackletMetrics(iou_threshold: float = 0.5)

Bases: MetricCalculator

Simplified MOTA + IDF1 over a single tracklet sample.

Parameters:

Name Type Description Default
iou_threshold float

IoU required to count a predicted detection as a hit for a given GT box. Default 0.5.

0.5
Notes

This is the per-sample version called by the runner. A full MOTA computation (with cross-scene identity switches) lives in :mod:`rpx_benchmark.deployment` and runs on the whole dataset.

Source code in rpx_benchmark/metrics/tracking.py
def __init__(self, iou_threshold: float = 0.5) -> None:
    self.iou_threshold = iou_threshold