Adapters (rpx_benchmark.adapters)

The formal adapter framework: InputAdapter / OutputAdapter protocols, PreparedInput, BenchmarkableModel, and the shipped numpy / HuggingFace / UniDepth / Metric3D / segmentation adapters.

Base framework

base

Core types for the RPX adapter framework.

::

Sample ─► InputAdapter.prepare ─► PreparedInput(payload, context)
                                       │
                                       ▼
                                   model(payload)
                                       │
                                       ▼
Sample, context, model_output ─► OutputAdapter.finalize ─► Prediction

Users extending RPX for their own model only need to supply the model and pick a matching pair of adapters. Adapters for common model families (HuggingFace transformers, UniDepth, Metric3D, raw numpy callables) ship with the library.

ModelInvoker = Callable[[Any, Any], Any] module-attribute

Given (model, payload), return the raw model output.

The default implementation handles the two common cases: dict payloads become model(**payload), anything else becomes model(payload). Override when the model needs non-standard invocation (e.g., a method other than __call__).
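The dispatch rule above can be sketched as a few lines of Python. This mirrors the described behaviour rather than quoting the library's private helper, so treat the `dispatch` name as illustrative:

```python
from typing import Any, Mapping

def dispatch(model: Any, payload: Any) -> Any:
    # Mapping payloads are splatted as keyword arguments; anything
    # else is passed positionally, exactly as described above.
    if isinstance(payload, Mapping):
        return model(**payload)
    return model(payload)

def echo(**kwargs):
    return kwargs

print(dispatch(echo, {"a": 1}))   # dict payload -> model(**payload)
print(dispatch(len, [1, 2, 3]))   # other payload -> model(payload)
```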

PreparedInput(payload: Any, context: Dict[str, Any] = dict()) dataclass

Everything a model needs for one sample, plus context for post-processing.

payload is whatever the model's forward call accepts. If it is a dict, the default invoker calls model(**payload); otherwise model(payload).

context is a free-form dict the output adapter receives back. Use it to stash things like target image size, original intrinsics, or any preprocessing metadata the postprocessing step needs.
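A short sketch of how an input adapter might populate both fields. The dataclass is re-declared locally so the example runs without the library installed; the context keys are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Local stand-in for rpx_benchmark.adapters.base.PreparedInput.
@dataclass
class PreparedInput:
    payload: Any
    context: Dict[str, Any] = field(default_factory=dict)

# payload feeds the model; context rides along to the output adapter.
prep = PreparedInput(
    payload={"pixel_values": "tensor-goes-here"},
    context={"target_size": (480, 640), "fx": 605.0},
)
print(prep.context["target_size"])  # (480, 640)
```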

InputAdapter

Bases: Protocol

Sample → model-ready payload.

setup() -> None

Optional one-time setup (e.g., build a processor on first use).

Source code in rpx_benchmark/adapters/base.py
def setup(self) -> None:  # pragma: no cover - optional hook
    """Optional one-time setup (e.g., build a processor on first use)."""

OutputAdapter

Bases: Protocol

Model output → RPX prediction object.

setup() -> None

Optional one-time setup.

Source code in rpx_benchmark/adapters/base.py
def setup(self) -> None:  # pragma: no cover - optional hook
    """Optional one-time setup."""

BenchmarkableModel(task: TaskType, input_adapter: InputAdapter, model: Any, output_adapter: OutputAdapter, invoker: ModelInvoker = default_invoker, name: str = 'benchmarkable_model', setup_hook: Optional[Callable[[], None]] = None) dataclass

Bases: BenchmarkModel

Compose an input adapter, a model, and an output adapter.

This is the canonical way to plug a model into the RPX benchmark harness. The BenchmarkRunner only ever sees the :class:BenchmarkModel contract (setup, predict); all the model-family-specific logic lives in the adapters so the harness stays task-agnostic.

Example — wrap a HuggingFace depth model::

from rpx_benchmark.adapters.depth_hf import make_hf_depth_model
bm = make_hf_depth_model(
    "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
    device="cuda",
)
# Hand `bm` to the benchmark runner or the task entrypoint.

Example — wrap a plain numpy callable::

def my_depth(rgb: np.ndarray) -> np.ndarray:
    return some_depth_in_metres

bm = make_numpy_depth_model(my_depth)

default_invoker(model: Any, payload: Any) -> Any

Call model with payload. Uses torch no_grad if available.

Source code in rpx_benchmark/adapters/base.py
def default_invoker(model: Any, payload: Any) -> Any:
    """Call ``model`` with ``payload``. Uses torch no_grad if available."""
    try:
        import torch
    except ImportError:
        return _dispatch(model, payload)

    with torch.no_grad():
        return _dispatch(model, payload)

make_numpy_depth_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_depth_model') -> BenchmarkableModel

Wrap a plain numpy depth callable as a :class:BenchmarkableModel.

The callable must accept a (H, W, 3) uint8 RGB image and return a (H', W') float metric depth map (in metres). If (H', W') != (H, W), the output is bilinearly resized to match the ground truth.

Parameters:

Name Type Description Default
fn callable

The depth function. Signature: fn(rgb_uint8) -> depth_float.

required
name str

Display name used in logs and reports.

'numpy_depth_model'

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_depth(rgb):
...     return np.full(rgb.shape[:2], 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_depth_model(my_depth, name="mine")
>>> bm.task is rpx.TaskType.MONOCULAR_DEPTH
True
Source code in rpx_benchmark/adapters/base.py
def make_numpy_depth_model(
    fn: Callable[[np.ndarray], np.ndarray],
    *,
    name: str = "numpy_depth_model",
) -> BenchmarkableModel:
    """Wrap a plain numpy depth callable as a :class:`BenchmarkableModel`.

    The callable must accept a ``(H, W, 3) uint8`` RGB image and return a
    ``(H', W') float`` metric depth map (in metres). If ``(H', W') !=
    (H, W)``, the output is bilinearly resized to match the ground truth.

    Parameters
    ----------
    fn : callable
        The depth function. Signature: ``fn(rgb_uint8) -> depth_float``.
    name : str
        Display name used in logs and reports.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_depth(rgb):
    ...     return np.full(rgb.shape[:2], 2.0, dtype=np.float32)
    >>> bm = rpx.make_numpy_depth_model(my_depth, name="mine")
    >>> bm.task is rpx.TaskType.MONOCULAR_DEPTH
    True
    """
    return BenchmarkableModel(
        task=TaskType.MONOCULAR_DEPTH,
        input_adapter=_NumpyDepthInput(),
        model=fn,
        output_adapter=_NumpyDepthOutput(),
        invoker=lambda model, payload: model(payload),
        name=name,
    )

make_numpy_mask_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_mask_model') -> BenchmarkableModel

Wrap a plain numpy instance-mask callable as a :class:BenchmarkableModel.

The callable must accept a (H, W, 3) uint8 RGB image and return a (H', W') int instance mask where pixel values are instance IDs (0 is conventionally background). If (H', W') != (H, W) the output is nearest-neighbour resized to match the GT mask so integer IDs are preserved.

Parameters:

Name Type Description Default
fn callable

fn(rgb_uint8) -> mask_int.

required
name str

Display name used in logs and reports.

'numpy_mask_model'

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_seg(rgb):
...     mask = np.zeros(rgb.shape[:2], dtype=np.int32)
...     mask[rgb.sum(-1) > 384] = 1  # trivial brightness threshold
...     return mask
>>> bm = rpx.make_numpy_mask_model(my_seg, name="mine")
>>> bm.task is rpx.TaskType.OBJECT_SEGMENTATION
True
Source code in rpx_benchmark/adapters/base.py
def make_numpy_mask_model(
    fn: Callable[[np.ndarray], np.ndarray],
    *,
    name: str = "numpy_mask_model",
) -> BenchmarkableModel:
    """Wrap a plain numpy instance-mask callable as a :class:`BenchmarkableModel`.

    The callable must accept a ``(H, W, 3) uint8`` RGB image and return
    a ``(H', W') int`` instance mask where pixel values are instance
    IDs (``0`` is conventionally background). If ``(H', W') != (H, W)``
    the output is nearest-neighbour resized to match the GT mask so
    integer IDs are preserved.

    Parameters
    ----------
    fn : callable
        ``fn(rgb_uint8) -> mask_int``.
    name : str
        Display name used in logs and reports.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_seg(rgb):
    ...     mask = np.zeros(rgb.shape[:2], dtype=np.int32)
    ...     mask[rgb.sum(-1) > 384] = 1  # trivial brightness threshold
    ...     return mask
    >>> bm = rpx.make_numpy_mask_model(my_seg, name="mine")
    >>> bm.task is rpx.TaskType.OBJECT_SEGMENTATION
    True
    """
    return BenchmarkableModel(
        task=TaskType.OBJECT_SEGMENTATION,
        input_adapter=_NumpyMaskInput(),
        model=fn,
        output_adapter=_NumpyMaskOutput(),
        invoker=lambda model, payload: model(payload),
        name=name,
    )

make_numpy_detection_model(fn: Callable[[np.ndarray], Any], *, name: str = 'numpy_detection_model', task: TaskType = TaskType.OBJECT_DETECTION) -> BenchmarkableModel

Wrap a plain numpy detection callable as a :class:BenchmarkableModel.

The callable must accept a (H, W, 3) uint8 RGB image and return either a dict with keys "boxes" / "scores" / "labels" or a (boxes, scores, labels) tuple in that order. Boxes are pixel coordinates in (x1, y1, x2, y2) format, scores are floats in [0, 1], labels are strings.

Parameters:

Name Type Description Default
fn callable

fn(rgb_uint8) -> dict | tuple.

required
name str

Display name for reports.

'numpy_detection_model'
task TaskType

Use :attr:TaskType.OBJECT_DETECTION for closed-vocabulary detection or :attr:TaskType.OPEN_VOCAB_DETECTION for open-vocab (the Prediction contract is the same).

OBJECT_DETECTION

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_det(rgb):
...     return {
...         "boxes": np.array([[10, 10, 30, 30]], dtype=np.float32),
...         "scores": np.array([0.9], dtype=np.float32),
...         "labels": ["cup"],
...     }
>>> bm = rpx.make_numpy_detection_model(my_det)
>>> bm.task is rpx.TaskType.OBJECT_DETECTION
True
Source code in rpx_benchmark/adapters/base.py
def make_numpy_detection_model(
    fn: Callable[[np.ndarray], Any],
    *,
    name: str = "numpy_detection_model",
    task: TaskType = TaskType.OBJECT_DETECTION,
) -> BenchmarkableModel:
    """Wrap a plain numpy detection callable as a :class:`BenchmarkableModel`.

    The callable must accept a ``(H, W, 3) uint8`` RGB image and
    return either a dict with keys ``"boxes"`` / ``"scores"`` /
    ``"labels"`` or a ``(boxes, scores, labels)`` tuple in that
    order. Boxes are pixel coordinates in ``(x1, y1, x2, y2)``
    format, scores are floats in ``[0, 1]``, labels are strings.

    Parameters
    ----------
    fn : callable
        ``fn(rgb_uint8) -> dict | tuple``.
    name : str
        Display name for reports.
    task : TaskType
        Use :attr:`TaskType.OBJECT_DETECTION` for closed-vocabulary
        detection or :attr:`TaskType.OPEN_VOCAB_DETECTION` for
        open-vocab (the Prediction contract is the same).

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_det(rgb):
    ...     return {
    ...         "boxes": np.array([[10, 10, 30, 30]], dtype=np.float32),
    ...         "scores": np.array([0.9], dtype=np.float32),
    ...         "labels": ["cup"],
    ...     }
    >>> bm = rpx.make_numpy_detection_model(my_det)
    >>> bm.task is rpx.TaskType.OBJECT_DETECTION
    True
    """
    return BenchmarkableModel(
        task=task,
        input_adapter=_NumpyRgbInput(),
        model=fn,
        output_adapter=_NumpyDetectionOutput(),
        invoker=_passthrough_invoker,
        name=name,
    )

make_numpy_grounding_model(fn: Callable[[np.ndarray, str], Any], *, name: str = 'numpy_grounding_model') -> BenchmarkableModel

Wrap a visual-grounding callable as a :class:BenchmarkableModel.

The callable takes (rgb_uint8, text) and returns either a dict with keys "boxes" / "scores" or a tuple (boxes, scores). Boxes are (x1, y1, x2, y2) pixel coordinates; scores are floats. The referring expression text is plucked from sample.ground_truth.text by the adapter so the callable never sees the GT boxes.

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_ground(rgb, text):
...     return (
...         np.array([[10, 10, 30, 30]], dtype=np.float32),
...         np.array([0.8], dtype=np.float32),
...     )
>>> bm = rpx.make_numpy_grounding_model(my_ground)
Source code in rpx_benchmark/adapters/base.py
def make_numpy_grounding_model(
    fn: Callable[[np.ndarray, str], Any],
    *,
    name: str = "numpy_grounding_model",
) -> BenchmarkableModel:
    """Wrap a visual-grounding callable as a :class:`BenchmarkableModel`.

    The callable takes ``(rgb_uint8, text)`` and returns either a
    dict with keys ``"boxes"`` / ``"scores"`` or a tuple
    ``(boxes, scores)``. Boxes are ``(x1, y1, x2, y2)`` pixel
    coordinates; scores are floats. The referring expression
    ``text`` is plucked from ``sample.ground_truth.text`` by the
    adapter so the callable never sees the GT boxes.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_ground(rgb, text):
    ...     return (
    ...         np.array([[10, 10, 30, 30]], dtype=np.float32),
    ...         np.array([0.8], dtype=np.float32),
    ...     )
    >>> bm = rpx.make_numpy_grounding_model(my_ground)
    """
    return BenchmarkableModel(
        task=TaskType.VISUAL_GROUNDING,
        input_adapter=_NumpyGroundingInput(),
        model=fn,
        output_adapter=_NumpyGroundingOutput(),
        invoker=_grounding_invoker,
        name=name,
    )

make_numpy_pose_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_pose_model') -> BenchmarkableModel

Wrap a relative-camera-pose callable as a :class:BenchmarkableModel.

The callable takes (rgb_a, rgb_b) and returns either a dict with keys "rotation" (3×3 rotation matrix or 4-element quaternion) and "translation" (3-vector, metres) or a (rotation, translation) tuple.

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_pose(rgb_a, rgb_b):
...     return {"rotation": np.eye(3), "translation": np.zeros(3)}
>>> bm = rpx.make_numpy_pose_model(my_pose)
Source code in rpx_benchmark/adapters/base.py
def make_numpy_pose_model(
    fn: Callable[[np.ndarray, np.ndarray], Any],
    *,
    name: str = "numpy_pose_model",
) -> BenchmarkableModel:
    """Wrap a relative-camera-pose callable as a :class:`BenchmarkableModel`.

    The callable takes ``(rgb_a, rgb_b)`` and returns either a dict
    with keys ``"rotation"`` (3×3 rotation matrix or 4-element
    quaternion) and ``"translation"`` (3-vector, metres) or a
    ``(rotation, translation)`` tuple.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_pose(rgb_a, rgb_b):
    ...     return {"rotation": np.eye(3), "translation": np.zeros(3)}
    >>> bm = rpx.make_numpy_pose_model(my_pose)
    """
    return BenchmarkableModel(
        task=TaskType.RELATIVE_CAMERA_POSE,
        input_adapter=_NumpyPoseInput(),
        model=fn,
        output_adapter=_NumpyPoseOutput(),
        invoker=_pose_invoker,
        name=name,
    )

make_numpy_sparse_depth_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_sparse_depth_model') -> BenchmarkableModel

Wrap a sparse-depth callable as a :class:BenchmarkableModel.

The callable takes (rgb_uint8, coords) where coords is a (N, 2) float32 array of pixel coordinates and returns an (N,) float32 array of depths in metres at those exact coordinates.

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_sd(rgb, coords):
...     return np.full(len(coords), 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_sparse_depth_model(my_sd)
Source code in rpx_benchmark/adapters/base.py
def make_numpy_sparse_depth_model(
    fn: Callable[[np.ndarray, np.ndarray], np.ndarray],
    *,
    name: str = "numpy_sparse_depth_model",
) -> BenchmarkableModel:
    """Wrap a sparse-depth callable as a :class:`BenchmarkableModel`.

    The callable takes ``(rgb_uint8, coords)`` where ``coords`` is a
    ``(N, 2)`` float32 array of pixel coordinates and returns an
    ``(N,)`` float32 array of depths in metres at those exact
    coordinates.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_sd(rgb, coords):
    ...     return np.full(len(coords), 2.0, dtype=np.float32)
    >>> bm = rpx.make_numpy_sparse_depth_model(my_sd)
    """
    return BenchmarkableModel(
        task=TaskType.SPARSE_DEPTH,
        input_adapter=_NumpySparseDepthInput(),
        model=fn,
        output_adapter=_NumpySparseDepthOutput(),
        invoker=_sparse_depth_invoker,
        name=name,
    )

make_numpy_nvs_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_nvs_model') -> BenchmarkableModel

Wrap a novel-view-synthesis callable as a :class:BenchmarkableModel.

The callable takes (rgb_uint8, target_pose) where the target pose is a 4×4 SE(3) camera-to-world matrix (float64). It returns an RGB image for the target viewpoint. Non-uint8 output is clipped and cast; shape mismatches are bilinearly resized.

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_nvs(rgb, target_pose):
...     return rgb  # identity baseline
>>> bm = rpx.make_numpy_nvs_model(my_nvs)
Source code in rpx_benchmark/adapters/base.py
def make_numpy_nvs_model(
    fn: Callable[[np.ndarray, np.ndarray], np.ndarray],
    *,
    name: str = "numpy_nvs_model",
) -> BenchmarkableModel:
    """Wrap a novel-view-synthesis callable as a :class:`BenchmarkableModel`.

    The callable takes ``(rgb_uint8, target_pose)`` where the target
    pose is a 4×4 SE(3) camera-to-world matrix (float64). It
    returns an RGB image for the target viewpoint. Non-uint8 output
    is clipped and cast; shape mismatches are bilinearly resized.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_nvs(rgb, target_pose):
    ...     return rgb  # identity baseline
    >>> bm = rpx.make_numpy_nvs_model(my_nvs)
    """
    return BenchmarkableModel(
        task=TaskType.NOVEL_VIEW_SYNTHESIS,
        input_adapter=_NumpyNVSInput(),
        model=fn,
        output_adapter=_NumpyNVSOutput(),
        invoker=_nvs_invoker,
        name=name,
    )

make_numpy_keypoint_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_keypoint_model') -> BenchmarkableModel

Wrap a keypoint-matching callable as a :class:BenchmarkableModel.

The callable takes (rgb_a, rgb_b) and returns either a dict with keys "points0", "points1" and optional "scores" or a 2/3-tuple in the same order. Points are (N, 2) pixel coordinates.

Examples:

>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_matcher(rgb_a, rgb_b):
...     pts = np.array([[10, 10], [20, 20]], dtype=np.float32)
...     return pts, pts
>>> bm = rpx.make_numpy_keypoint_model(my_matcher)
Source code in rpx_benchmark/adapters/base.py
def make_numpy_keypoint_model(
    fn: Callable[[np.ndarray, np.ndarray], Any],
    *,
    name: str = "numpy_keypoint_model",
) -> BenchmarkableModel:
    """Wrap a keypoint-matching callable as a :class:`BenchmarkableModel`.

    The callable takes ``(rgb_a, rgb_b)`` and returns either a dict
    with keys ``"points0"``, ``"points1"`` and optional ``"scores"``
    or a 2/3-tuple in the same order. Points are ``(N, 2)`` pixel
    coordinates.

    Examples
    --------
    >>> import numpy as np
    >>> import rpx_benchmark as rpx
    >>> def my_matcher(rgb_a, rgb_b):
    ...     pts = np.array([[10, 10], [20, 20]], dtype=np.float32)
    ...     return pts, pts
    >>> bm = rpx.make_numpy_keypoint_model(my_matcher)
    """
    return BenchmarkableModel(
        task=TaskType.KEYPOINT_MATCHING,
        input_adapter=_NumpyKeypointInput(),
        model=fn,
        output_adapter=_NumpyKeypointOutput(),
        invoker=_keypoint_invoker,
        name=name,
    )

HuggingFace depth adapter

depth_hf

HuggingFace Transformers adapters for monocular metric depth.

Works with any AutoModelForDepthEstimation checkpoint that exposes post_process_depth_estimation on its image processor — including Depth Anything V2 (metric), Depth Pro, ZoeDepth, Video Depth Anything (per-frame), and PromptDA.

Three entry points:

* :class:`HFDepthInputAdapter` — PIL -> processor -> tensors on device
* :class:`HFDepthOutputAdapter` — model output -> resized metric depth
* :func:`make_hf_depth_model` — single-call factory returning a
  ready-to-run :class:`BenchmarkableModel`

make_hf_depth_model("my-org/my-depth-ckpt") is the "bring your own HuggingFace depth model" fast path — no subclassing required.

HFDepthInputAdapter(processor: Any, device: str = 'cuda') dataclass

Bases: InputAdapter

Converts an RPX Sample into a HuggingFace model input batch of 1.

HFDepthOutputAdapter(processor: Any) dataclass

Bases: OutputAdapter

Runs post_process_depth_estimation and wraps in DepthPrediction.

Different transformers depth processors take different kwargs:

  • DA-V2 / Depth Pro: target_sizes only.
  • ZoeDepth: needs source_sizes (and optionally do_remove_padding) to unpad its internal left-right mirror augmentation.
  • PromptDA: takes both plus an optional outputs key.

We introspect the post-process signature once and forward whatever kwargs it accepts. Both target_sizes and source_sizes are set to the caller's original (H, W) since the adapter contract promises we always return depth at the input RGB's resolution.
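The introspect-and-forward step can be sketched with `inspect.signature`. The `fake_zoe_postprocess` stand-in and the helper name are illustrative, not the adapter's actual code:

```python
import inspect

def call_with_supported_kwargs(fn, **candidate_kwargs):
    # Inspect the post-process signature once and forward only the
    # kwargs it actually accepts, as described above.
    params = inspect.signature(fn).parameters
    accepted = {k: v for k, v in candidate_kwargs.items() if k in params}
    return fn(**accepted)

def fake_zoe_postprocess(outputs=None, source_sizes=None):
    return {"outputs": outputs, "source_sizes": source_sizes}

result = call_with_supported_kwargs(
    fake_zoe_postprocess,
    outputs="model-out",
    target_sizes=[(480, 640)],   # silently dropped: not in the signature
    source_sizes=[(480, 640)],
)
print(result["source_sizes"])  # [(480, 640)]
```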

make_hf_depth_model(checkpoint: str, *, device: str = 'cuda', dtype: str | None = None, name: str | None = None) -> BenchmarkableModel

One-line factory for any HuggingFace depth-estimation checkpoint.

Parameters:

Name Type Description Default
checkpoint str

HuggingFace Hub path, e.g. "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf".

required
device str

Device string passed to .to(device).

'cuda'
dtype str

One of "float16" / "bfloat16" / "float32". If provided, the model is cast to the matching torch dtype.

None
name str

Display name; defaults to the checkpoint id.

None
Source code in rpx_benchmark/adapters/depth_hf.py
def make_hf_depth_model(
    checkpoint: str,
    *,
    device: str = "cuda",
    dtype: str | None = None,
    name: str | None = None,
) -> BenchmarkableModel:
    """One-line factory for any HuggingFace depth-estimation checkpoint.

    Parameters
    ----------
    checkpoint : str
        HuggingFace Hub path, e.g.
        ``"depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf"``.
    device : str
        Device string passed to ``.to(device)``.
    dtype : str, optional
        One of ``"float16"`` / ``"bfloat16"`` / ``"float32"``. If provided,
        the model is cast to the matching ``torch`` dtype.
    name : str, optional
        Display name; defaults to the checkpoint id.
    """
    try:
        import torch
        from transformers import AutoImageProcessor, AutoModelForDepthEstimation
    except ImportError as e:  # pragma: no cover - guarded at install time
        raise ImportError(
            "make_hf_depth_model needs torch + transformers. "
            "Install with: pip install 'rpx-benchmark[depth-hf]'"
        ) from e

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

    if dtype is not None:
        torch_dtype = {
            "float16": torch.float16,
            "bfloat16": torch.bfloat16,
            "float32": torch.float32,
        }[dtype]
        model = model.to(dtype=torch_dtype)

    model = model.to(device).eval()

    return BenchmarkableModel(
        task=TaskType.MONOCULAR_DEPTH,
        input_adapter=HFDepthInputAdapter(processor=processor, device=device),
        model=model,
        output_adapter=HFDepthOutputAdapter(processor=processor),
        name=name or checkpoint,
    )

UniDepth V2 adapter

depth_unidepth

UniDepth V2 input/output adapters.

UniDepth bypasses the HuggingFace AutoModelForDepthEstimation API and ships its own UniDepthV2 class. The forward API is::

model.infer(rgb: Tensor, camera: Tensor | Camera | None = None,
            normalize: bool = True) -> dict

The result dict has keys depth, confidence, intrinsics, radius, points, rays, depth_features. We take depth of shape (B, 1, H, W) in metres and return the (H, W) slice. UniDepth's output is always at the input RGB resolution, so no resize is normally needed.
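The depth-extraction step described above amounts to slicing the (B, 1, H, W) tensor down to (H, W). A numpy array stands in for the torch tensor in this sketch:

```python
import numpy as np

# Synthetic stand-in for UniDepth's result dict: depth of shape
# (B, 1, H, W) in metres, batch size 1.
result = {"depth": np.full((1, 1, 4, 6), 2.5, dtype=np.float32)}

depth = result["depth"][0, 0]  # the (H, W) metric depth slice
print(depth.shape)             # (4, 6)
```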

make_unidepth_v2_model(checkpoint: str = 'lpiccinelli/unidepth-v2-vitl14', *, device: str = 'cuda', camera_k: Optional[np.ndarray] = None, name: Optional[str] = None) -> BenchmarkableModel

Factory for any UniDepth V2 checkpoint.

Parameters:

Name Type Description Default
checkpoint str

HuggingFace Hub id for the UniDepth weights (e.g. lpiccinelli/unidepth-v2-vitb14, lpiccinelli/unidepth-v2-vitl14).

'lpiccinelli/unidepth-v2-vitl14'
device str

Device string passed to .to(device).

'cuda'
camera_k ndarray

A 3x3 intrinsics matrix. If omitted, UniDepth self-prompts intrinsics from the image.

None
Source code in rpx_benchmark/adapters/depth_unidepth.py
def make_unidepth_v2_model(
    checkpoint: str = "lpiccinelli/unidepth-v2-vitl14",
    *,
    device: str = "cuda",
    camera_k: Optional[np.ndarray] = None,
    name: Optional[str] = None,
) -> BenchmarkableModel:
    """Factory for any UniDepth V2 checkpoint.

    Parameters
    ----------
    checkpoint : str
        HuggingFace Hub id for the UniDepth weights
        (e.g. ``lpiccinelli/unidepth-v2-vitb14``,
        ``lpiccinelli/unidepth-v2-vitl14``).
    device : str
        Device string passed to ``.to(device)``.
    camera_k : np.ndarray, optional
        A 3x3 intrinsics matrix. If omitted, UniDepth self-prompts
        intrinsics from the image.
    """
    try:
        from unidepth.models import UniDepthV2
    except ImportError as e:  # pragma: no cover
        raise ImportError(
            "UniDepth adapter needs the `unidepth` package. Install with:\n"
            "  pip install 'unidepth @ git+https://github.com/lpiccinelli-eth/UniDepth.git'"
        ) from e

    model = UniDepthV2.from_pretrained(checkpoint).to(device).eval()
    return BenchmarkableModel(
        task=TaskType.MONOCULAR_DEPTH,
        input_adapter=UniDepthInputAdapter(device=device, camera_k=camera_k),
        model=model,
        output_adapter=UniDepthOutputAdapter(),
        invoker=_unidepth_invoker,
        name=name or f"unidepth::{checkpoint}",
    )

Metric3D V2 adapter

depth_metric3d

Metric3D V2 input/output adapters with canonical-focal letterboxing.

Metric3D is trained at a canonical focal length; recovering real metric depth needs the (fx_real / fx_canonical) rescale. The letterbox preprocessing also has to be undone on the output side, so the :class:Metric3DInputAdapter stashes the resize scale and letterbox crop in PreparedInput.context for the output adapter to reverse.
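The focal rescale described above is a single multiplication. The canonical focal value below is illustrative only (the model's actual constant is defined upstream), so read this as a sketch of the recovery step, not Metric3D's exact numbers:

```python
import numpy as np

fx_real = 605.0        # real camera focal length in pixels
fx_canonical = 1000.0  # illustrative canonical focal, NOT the model's constant

# Depth predicted at the canonical focal length...
canonical_depth = np.full((4, 6), 2.0, dtype=np.float32)

# ...is rescaled by (fx_real / fx_canonical) to recover metric depth.
metric_depth = canonical_depth * (fx_real / fx_canonical)
print(float(metric_depth[0, 0]))  # 1.21
```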

make_metric3d_v2_model(*, device: str = 'cuda', fx_real: float = 605.0, hub_repo: str = 'yvanyin/metric3d', entry: str = 'metric3d_vit_large', name: Optional[str] = None) -> BenchmarkableModel

Metric3D requires CUDA.

The upstream mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py hardcodes device="cuda" in one of its torch.linspace calls (see upstream issue tracker), so even purely-CPU inference hits an assertion inside the decoder. We hard-fail early with a helpful error rather than waiting for a stack trace buried six frames deep.

Source code in rpx_benchmark/adapters/depth_metric3d.py
def make_metric3d_v2_model(
    *,
    device: str = "cuda",
    fx_real: float = 605.0,
    hub_repo: str = "yvanyin/metric3d",
    entry: str = "metric3d_vit_large",
    name: Optional[str] = None,
) -> BenchmarkableModel:
    """Metric3D requires CUDA.

    The upstream ``mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py``
    hardcodes ``device="cuda"`` in one of its ``torch.linspace`` calls
    (see upstream issue tracker), so even purely-CPU inference hits an
    assertion inside the decoder. We hard-fail early with a helpful
    error rather than waiting for a stack trace buried six frames deep.
    """
    import torch

    from ..exceptions import AdapterError

    if device != "cuda":
        raise AdapterError(
            "Metric3D V2 requires device='cuda'.",
            hint=(
                "Upstream hardcodes torch.linspace(..., device='cuda') in "
                "mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py, so "
                "CPU inference is not supported until the upstream repo is "
                "patched. For pure-CPU evaluation use a different adapter "
                "from the slate (e.g. depth_anything_v2_metric_indoor_small)."
            ),
        )
    if not torch.cuda.is_available():
        raise AdapterError(
            "Metric3D V2 requires CUDA but torch.cuda.is_available() is False.",
            hint="Run on a CUDA-capable host, or use another metric-depth adapter.",
        )

    # Metric3D's transitive deps: timm, mmcv (or mmcv-lite), mmengine.
    try:
        import timm  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "Metric3D V2 requires `timm`. Install with: pip install timm"
        ) from e
    try:
        import mmcv  # noqa: F401
    except ImportError:
        try:
            import mmengine  # noqa: F401
        except ImportError as e:
            raise ImportError(
                "Metric3D V2 requires `mmcv` or `mmengine`. Install with:\n"
                "  pip install mmcv-lite mmengine"
            ) from e

    model = torch.hub.load(
        hub_repo, entry, pretrain=True, trust_repo=True
    ).to(device).eval()
    return BenchmarkableModel(
        task=TaskType.MONOCULAR_DEPTH,
        input_adapter=Metric3DInputAdapter(device=device),
        model=model,
        output_adapter=Metric3DOutputAdapter(fx_real=fx_real),
        invoker=_metric3d_invoker,
        name=name or f"metric3d::{entry}",
    )

HuggingFace segmentation adapter

seg_hf

HuggingFace Transformers adapters for instance / panoptic segmentation.

Works with any checkpoint whose image processor exposes one of post_process_instance_segmentation or post_process_panoptic_segmentation — Mask2Former, OneFormer, MaskFormer, DETR-for-panoptic.

The output contract is a single (H, W) int32 mask whose pixel values are consistent instance IDs. For Mask2Former-family models that means we flatten the per-segment outputs returned by the processor and paint each segment's pixels with a unique integer.

The processor-signature detection follows the same pattern as :mod:rpx_benchmark.adapters.depth_hf: we introspect at setup time and pick the right postprocess method.

HFInstanceSegInputAdapter(processor: Any, device: str = 'cuda') dataclass

Bases: InputAdapter

Turn an RPX :class:Sample into a HuggingFace batch of 1 for segmentation.

Parameters:

Name Type Description Default
processor Any

The AutoImageProcessor loaded via AutoImageProcessor.from_pretrained(checkpoint).

required
device str

PyTorch device string. Pixel values are moved to this device on prepare.

'cuda'

HFInstanceSegOutputAdapter(processor: Any, threshold: float = 0.5) dataclass

Bases: OutputAdapter

Map HF segmentation outputs to an integer instance mask.

The processor's post_process_instance_segmentation or post_process_panoptic_segmentation method is used; which one is picked depends on which one the processor exposes (detected at :meth:setup).

The result is a single (H, W) int32 mask where each instance gets a unique integer (0 is background). Instance IDs are assigned in the order the processor returns them.
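The flatten-and-paint step can be sketched with synthetic per-segment boolean masks standing in for what the processor would return:

```python
import numpy as np

# Two synthetic per-segment boolean masks for a 4x6 image.
h, w = 4, 6
seg_a = np.zeros((h, w), dtype=bool); seg_a[:2, :3] = True
seg_b = np.zeros((h, w), dtype=bool); seg_b[2:, 3:] = True

# Paint each segment with a unique integer; 0 stays background.
instance_mask = np.zeros((h, w), dtype=np.int32)
for instance_id, seg in enumerate([seg_a, seg_b], start=1):
    instance_mask[seg] = instance_id

print(sorted(np.unique(instance_mask).tolist()))  # [0, 1, 2]
```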

Parameters:

Name Type Description Default
processor Any

The image processor that produced the model inputs. Its post_process_*_segmentation method must be called on the model outputs.

required
threshold float

Score threshold below which segments are dropped. Defaults to 0.5, matching Mask2Former's default eval config.

0.5

make_hf_instance_seg_model(checkpoint: str, *, device: str = 'cuda', threshold: float = 0.5, name: Optional[str] = None, model_class_hint: Optional[str] = None) -> BenchmarkableModel

One-line factory for a HuggingFace segmentation checkpoint.

Parameters:

Name Type Description Default
checkpoint str

HuggingFace Hub id (e.g. "facebook/mask2former-swin-tiny-coco-instance").

required
device str

Device string passed to .to(device).

'cuda'
threshold float

Score threshold for instance acceptance (passed to the processor's post-process if it accepts the kwarg).

0.5
name str

Display name. Defaults to checkpoint.

None
model_class_hint str

One of "instance", "universal", "semantic". Most users should leave this as None and rely on AutoModelForUniversalSegmentation (the super-class used by Mask2Former / OneFormer). Only set this if the auto class does not dispatch correctly for your checkpoint.

None

Raises:

Type Description
AdapterError

If the processor exposes no post-process method we can use.

ImportError

If torch or transformers are not installed.

Source code in rpx_benchmark/adapters/seg_hf.py
def make_hf_instance_seg_model(
    checkpoint: str,
    *,
    device: str = "cuda",
    threshold: float = 0.5,
    name: Optional[str] = None,
    model_class_hint: Optional[str] = None,
) -> BenchmarkableModel:
    """One-line factory for a HuggingFace segmentation checkpoint.

    Parameters
    ----------
    checkpoint : str
        HuggingFace Hub id
        (e.g. ``"facebook/mask2former-swin-tiny-coco-instance"``).
    device : str
        Device string passed to ``.to(device)``.
    threshold : float
        Score threshold for instance acceptance (passed to the
        processor's post-process if it accepts the kwarg).
    name : str, optional
        Display name. Defaults to ``checkpoint``.
    model_class_hint : str, optional
        One of ``"instance"``, ``"universal"``, ``"semantic"``. Most
        users should leave this as ``None`` and rely on
        ``AutoModelForUniversalSegmentation`` (the super-class used
        by Mask2Former / OneFormer). Only set this if the auto class
        does not dispatch correctly for your checkpoint.

    Raises
    ------
    AdapterError
        If the processor exposes no post-process method we can use.
    ImportError
        If ``torch`` or ``transformers`` are not installed.
    """
    try:
        import torch  # noqa: F401
        from transformers import AutoImageProcessor
    except ImportError as e:  # pragma: no cover
        raise ImportError(
            "make_hf_instance_seg_model needs torch + transformers. "
            "Install with: pip install 'rpx-benchmark[depth-hf]'"
        ) from e

    processor = AutoImageProcessor.from_pretrained(checkpoint)

    model_cls_name = {
        None: "AutoModelForUniversalSegmentation",
        "universal": "AutoModelForUniversalSegmentation",
        "instance": "AutoModelForInstanceSegmentation",
        "semantic": "AutoModelForSemanticSegmentation",
    }[model_class_hint]
    import transformers
    try:
        model_cls = getattr(transformers, model_cls_name)
    except AttributeError as e:
        raise AdapterError(
            f"transformers has no class named {model_cls_name!r}.",
        ) from e
    model = model_cls.from_pretrained(checkpoint).to(device).eval()

    return BenchmarkableModel(
        task=TaskType.OBJECT_SEGMENTATION,
        input_adapter=HFInstanceSegInputAdapter(processor=processor, device=device),
        model=model,
        output_adapter=HFInstanceSegOutputAdapter(
            processor=processor, threshold=threshold,
        ),
        name=name or checkpoint,
    )