Adapters (rpx_benchmark.adapters)¶
The formal adapter framework: InputAdapter / OutputAdapter
protocols, PreparedInput, BenchmarkableModel, and the shipped
numpy / HuggingFace / UniDepth / Metric3D / segmentation adapters.
Base framework¶
base
¶
Core types for the RPX adapter framework.
::

    Sample ─► InputAdapter.prepare ─► PreparedInput(payload, context)
                                                │
                                                ▼
                                          model(payload)
                                                │
                                                ▼
    Sample, context, model_output ─► OutputAdapter.finalize ─► Prediction
Users extending RPX for their own model only need to supply the model and pick a matching pair of adapters. The adapters ship with the library for common families (HuggingFace transformers, UniDepth, Metric3D, raw numpy callables).
ModelInvoker = Callable[[Any, Any], Any]
module-attribute
¶
Given (model, payload) return the raw model output.
The default implementation handles the two common cases: dict payloads
become model(**payload), anything else becomes model(payload).
Override when the model needs non-standard invocation (e.g., a method
other than __call__).
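The dispatch rule can be restated in a few lines of Python (a sketch of the behaviour described above; the shipped `default_invoker` additionally wraps the call in `torch.no_grad` when torch is present):

```python
from typing import Any

def invoke(model: Any, payload: Any) -> Any:
    # Dict payloads are splatted as keyword arguments; anything else is
    # passed as a single positional argument. (Simplified: the real
    # default_invoker also wraps the call in torch.no_grad when available.)
    if isinstance(payload, dict):
        return model(**payload)
    return model(payload)

def toy_model(rgb=None, intrinsics=None):
    return ("kwargs", rgb, intrinsics)

# A dict payload becomes model(**payload) ...
kw_result = invoke(toy_model, {"rgb": "img", "intrinsics": "K"})
# ... and anything else becomes model(payload).
pos_result = invoke(len, [1, 2, 3])
```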
PreparedInput(payload: Any, context: Dict[str, Any] = dict())
dataclass
¶
Everything a model needs for one sample, plus context for post-processing.
payload is whatever the model's forward call accepts. If it is a
dict, the default invoker calls model(**payload); otherwise
model(payload).
context is a free-form dict the output adapter receives back. Use
it to stash things like target image size, original intrinsics, or
any preprocessing metadata the postprocessing step needs.
InputAdapter
¶
Bases: Protocol
Sample → model-ready payload.
OutputAdapter
¶
Bases: Protocol
Sample + context + raw model output → Prediction.
BenchmarkableModel(task: TaskType, input_adapter: InputAdapter, model: Any, output_adapter: OutputAdapter, invoker: ModelInvoker = default_invoker, name: str = 'benchmarkable_model', setup_hook: Optional[Callable[[], None]] = None)
dataclass
¶
Bases: BenchmarkModel
Compose an input adapter, a model, and an output adapter.
This is the canonical way to plug a model into the RPX benchmark
harness. The BenchmarkRunner only ever sees the
:class:BenchmarkModel contract (setup, predict); all the
model-family-specific logic lives in the adapters so the harness
stays task-agnostic.
Example — wrap a HuggingFace depth model::
from rpx_benchmark.adapters.depth_hf import make_hf_depth_model
bm = make_hf_depth_model(
    "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
    device="cuda",
)
# Hand `bm` to the benchmark runner or the task entrypoint.
Example — wrap a plain numpy callable::
def my_depth(rgb: np.ndarray) -> np.ndarray:
    return some_depth_in_metres
bm = make_numpy_depth_model(my_depth)
default_invoker(model: Any, payload: Any) -> Any
¶
Call model with payload. Uses torch no_grad if available.
Source code in rpx_benchmark/adapters/base.py
make_numpy_depth_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_depth_model') -> BenchmarkableModel
¶
Wrap a plain numpy depth callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and return a
(H', W') float metric depth map (in metres). If (H', W') !=
(H, W), the output is bilinearly resized to match the ground truth.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `callable` | The depth function. Signature: `(H, W, 3)` uint8 RGB → `(H', W')` float metric depth in metres. | *required* |
| `name` | `str` | Display name used in logs and reports. | `'numpy_depth_model'` |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_depth(rgb):
...     return np.full(rgb.shape[:2], 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_depth_model(my_depth, name="mine")
>>> bm.task is rpx.TaskType.MONOCULAR_DEPTH
True
Source code in rpx_benchmark/adapters/base.py
make_numpy_mask_model(fn: Callable[[np.ndarray], np.ndarray], *, name: str = 'numpy_mask_model') -> BenchmarkableModel
¶
Wrap a plain numpy instance-mask callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and return
a (H', W') int instance mask where pixel values are instance
IDs (0 is conventionally background). If (H', W') != (H, W)
the output is nearest-neighbour resized to match the GT mask so
integer IDs are preserved.
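The ID-preserving resize mentioned above boils down to pure index arithmetic — no interpolation that could blend two instance IDs into a third. A minimal sketch (illustrative; the shipped adapter may delegate to a library routine):

```python
import numpy as np

def resize_mask_nearest(mask: np.ndarray, out_hw: tuple) -> np.ndarray:
    # Nearest-neighbour resize for integer instance masks: each output
    # pixel copies exactly one input pixel, so only IDs already present
    # in the input can appear in the output.
    h, w = mask.shape
    oh, ow = out_hw
    rows = np.arange(oh) * h // oh
    cols = np.arange(ow) * w // ow
    return mask[rows[:, None], cols[None, :]]

mask = np.array([[0, 1],
                 [2, 3]], dtype=np.int32)
big = resize_mask_nearest(mask, (4, 4))  # 2x upsample, IDs intact
```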
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `callable` | The mask function. Signature: `(H, W, 3)` uint8 RGB → `(H', W')` int instance mask. | *required* |
| `name` | `str` | Display name used in logs and reports. | `'numpy_mask_model'` |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_seg(rgb):
...     mask = np.zeros(rgb.shape[:2], dtype=np.int32)
...     mask[rgb.sum(-1) > 384] = 1  # trivial brightness threshold
...     return mask
>>> bm = rpx.make_numpy_mask_model(my_seg, name="mine")
>>> bm.task is rpx.TaskType.OBJECT_SEGMENTATION
True
Source code in rpx_benchmark/adapters/base.py
make_numpy_detection_model(fn: Callable[[np.ndarray], Any], *, name: str = 'numpy_detection_model', task: TaskType = TaskType.OBJECT_DETECTION) -> BenchmarkableModel
¶
Wrap a plain numpy detection callable as a :class:BenchmarkableModel.
The callable must accept a (H, W, 3) uint8 RGB image and
return either a dict with keys "boxes" / "scores" /
"labels" or a (boxes, scores, labels) tuple in that
order. Boxes are pixel coordinates in (x1, y1, x2, y2)
format, scores are floats in [0, 1], labels are strings.
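Normalising the two accepted return shapes into one canonical dict is what the output adapter does first; a sketch (names here are illustrative, not the shipped helper):

```python
import numpy as np

def normalize_detections(raw):
    # Accept either the dict convention or the (boxes, scores, labels)
    # tuple convention and return one canonical dict.
    if isinstance(raw, dict):
        boxes, scores, labels = raw["boxes"], raw["scores"], raw["labels"]
    else:
        boxes, scores, labels = raw  # tuple order is fixed by the contract
    return {
        "boxes": np.asarray(boxes, dtype=np.float32).reshape(-1, 4),
        "scores": np.asarray(scores, dtype=np.float32),
        "labels": list(labels),
    }

as_dict = normalize_detections(
    {"boxes": [[10, 10, 30, 30]], "scores": [0.9], "labels": ["cup"]}
)
as_tuple = normalize_detections(
    ([[10, 10, 30, 30]], [0.9], ["cup"])
)
```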
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `callable` | The detection function. Signature: `(H, W, 3)` uint8 RGB → dict with `"boxes"` / `"scores"` / `"labels"`, or a `(boxes, scores, labels)` tuple. | *required* |
| `name` | `str` | Display name for reports. | `'numpy_detection_model'` |
| `task` | `TaskType` | Benchmark task; defaults to :attr:`TaskType.OBJECT_DETECTION`. | `OBJECT_DETECTION` |
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_det(rgb):
...     return {
...         "boxes": np.array([[10, 10, 30, 30]], dtype=np.float32),
...         "scores": np.array([0.9], dtype=np.float32),
...         "labels": ["cup"],
...     }
>>> bm = rpx.make_numpy_detection_model(my_det)
>>> bm.task is rpx.TaskType.OBJECT_DETECTION
True
Source code in rpx_benchmark/adapters/base.py
make_numpy_grounding_model(fn: Callable[[np.ndarray, str], Any], *, name: str = 'numpy_grounding_model') -> BenchmarkableModel
¶
Wrap a visual-grounding callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, text) and returns either a
dict with keys "boxes" / "scores" or a tuple
(boxes, scores). Boxes are (x1, y1, x2, y2) pixel
coordinates; scores are floats. The referring expression
text is plucked from sample.ground_truth.text by the
adapter so the callable never sees the GT boxes.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_ground(rgb, text):
...     return (
...         np.array([[10, 10, 30, 30]], dtype=np.float32),
...         np.array([0.8], dtype=np.float32),
...     )
>>> bm = rpx.make_numpy_grounding_model(my_ground)
Source code in rpx_benchmark/adapters/base.py
make_numpy_pose_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_pose_model') -> BenchmarkableModel
¶
Wrap a relative-camera-pose callable as a :class:BenchmarkableModel.
The callable takes (rgb_a, rgb_b) and returns either a dict
with keys "rotation" (3×3 rotation matrix or 4-element
quaternion) and "translation" (3-vector, metres) or a
(rotation, translation) tuple.
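Because `"rotation"` may arrive as either a 3×3 matrix or a 4-element quaternion, the adapter must canonicalize. A sketch, assuming a `(w, x, y, z)` quaternion convention (the shipped adapter's actual ordering should be checked):

```python
import numpy as np

def as_rotation_matrix(rotation) -> np.ndarray:
    # Pass 3x3 matrices through unchanged; convert 4-vectors assuming
    # (w, x, y, z) order, normalising defensively first.
    r = np.asarray(rotation, dtype=np.float64)
    if r.shape == (3, 3):
        return r
    w, x, y, z = r / np.linalg.norm(r)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

identity_from_quat = as_rotation_matrix([1.0, 0.0, 0.0, 0.0])
```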
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_pose(rgb_a, rgb_b):
...     return {"rotation": np.eye(3), "translation": np.zeros(3)}
>>> bm = rpx.make_numpy_pose_model(my_pose)
Source code in rpx_benchmark/adapters/base.py
make_numpy_sparse_depth_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_sparse_depth_model') -> BenchmarkableModel
¶
Wrap a sparse-depth callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, coords) where coords is a
(N, 2) float32 array of pixel coordinates and returns an
(N,) float32 array of depths in metres at those exact
coordinates.
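A dense-depth model can be adapted to this sparse contract by sampling its output at the query pixels. A nearest-pixel sketch, assuming `coords` is in `(x, y)` pixel order (check the dataset's convention; sub-pixel interpolation is a possible refinement):

```python
import numpy as np

def sample_dense_depth(depth: np.ndarray, coords: np.ndarray) -> np.ndarray:
    # Round each (x, y) query to the nearest pixel and read the dense
    # (H, W) map, clipping to the image bounds.
    ij = np.rint(np.asarray(coords)).astype(int)
    x = np.clip(ij[:, 0], 0, depth.shape[1] - 1)
    y = np.clip(ij[:, 1], 0, depth.shape[0] - 1)
    return depth[y, x].astype(np.float32)

dense = np.arange(6, dtype=np.float32).reshape(2, 3)  # [[0,1,2],[3,4,5]]
sparse = sample_dense_depth(dense, np.array([[2.0, 1.0], [0.0, 0.0]]))
```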
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_sd(rgb, coords):
...     return np.full(len(coords), 2.0, dtype=np.float32)
>>> bm = rpx.make_numpy_sparse_depth_model(my_sd)
Source code in rpx_benchmark/adapters/base.py
make_numpy_nvs_model(fn: Callable[[np.ndarray, np.ndarray], np.ndarray], *, name: str = 'numpy_nvs_model') -> BenchmarkableModel
¶
Wrap a novel-view-synthesis callable as a :class:BenchmarkableModel.
The callable takes (rgb_uint8, target_pose) where the target
pose is a 4×4 SE(3) camera-to-world matrix (float64). It
returns an RGB image for the target viewpoint. Non-uint8 output
is clipped and cast; shape mismatches are bilinearly resized.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_nvs(rgb, target_pose):
...     return rgb  # identity baseline
>>> bm = rpx.make_numpy_nvs_model(my_nvs)
Source code in rpx_benchmark/adapters/base.py
make_numpy_keypoint_model(fn: Callable[[np.ndarray, np.ndarray], Any], *, name: str = 'numpy_keypoint_model') -> BenchmarkableModel
¶
Wrap a keypoint-matching callable as a :class:BenchmarkableModel.
The callable takes (rgb_a, rgb_b) and returns either a dict
with keys "points0", "points1" and optional "scores"
or a 2/3-tuple in the same order. Points are (N, 2) pixel
coordinates.
Examples:
>>> import numpy as np
>>> import rpx_benchmark as rpx
>>> def my_matcher(rgb_a, rgb_b):
...     pts = np.array([[10, 10], [20, 20]], dtype=np.float32)
...     return pts, pts
>>> bm = rpx.make_numpy_keypoint_model(my_matcher)
Source code in rpx_benchmark/adapters/base.py
HuggingFace depth adapter¶
depth_hf
¶
HuggingFace Transformers adapters for monocular metric depth.
Works with any AutoModelForDepthEstimation checkpoint that exposes
post_process_depth_estimation on its image processor — including
Depth Anything V2 (metric), Depth Pro, ZoeDepth, Video Depth Anything
(per-frame), and PromptDA.
Three entry points:
* :class:`HFDepthInputAdapter` — PIL -> processor -> tensors on device
* :class:`HFDepthOutputAdapter` — model output -> resized metric depth
* :func:`make_hf_depth_model` — single-call factory returning a
ready-to-run :class:`BenchmarkableModel`
make_hf_depth_model("my-org/my-depth-ckpt") is the "bring your own
HuggingFace depth model" fast path — no subclassing required.
HFDepthInputAdapter(processor: Any, device: str = 'cuda')
dataclass
¶
HFDepthOutputAdapter(processor: Any)
dataclass
¶
Bases: OutputAdapter
Runs post_process_depth_estimation and wraps in DepthPrediction.
Different transformers depth processors take different kwargs:

- DA-V2 / Depth Pro: `target_sizes` only.
- ZoeDepth: needs `source_sizes` (and optionally `do_remove_padding`) to unpad its internal left-right mirror augmentation.
- PromptDA: takes both plus an optional `outputs` key.
We introspect the post-process signature once and forward whatever
kwargs it accepts. Both target_sizes and source_sizes are set
to the caller's original (H, W) since the adapter contract
promises we always return depth at the input RGB's resolution.
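The introspect-and-forward pattern can be sketched with `inspect.signature` (illustrative; the shipped adapter caches the accepted names once at setup rather than re-inspecting per call, and `fake_postprocess` below is a stand-in, not a real transformers method):

```python
import inspect

def accepted_kwargs(fn, candidates: dict) -> dict:
    # Keep only the candidate kwargs the callable's signature accepts;
    # a **kwargs parameter accepts everything.
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {k: v for k, v in candidates.items() if k in params}

def fake_postprocess(outputs, target_sizes=None):  # DA-V2-style: no source_sizes
    return target_sizes

hw = [(480, 640)]
filtered = accepted_kwargs(fake_postprocess,
                           {"target_sizes": hw, "source_sizes": hw})
```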
make_hf_depth_model(checkpoint: str, *, device: str = 'cuda', dtype: str | None = None, name: str | None = None) -> BenchmarkableModel
¶
One-line factory for any HuggingFace depth-estimation checkpoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | `str` | HuggingFace Hub path, e.g. `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`. | *required* |
| `device` | `str` | Torch device string the model and inputs are moved to. | `'cuda'` |
| `dtype` | `str` | Optional dtype string; `None` keeps the checkpoint's default. | `None` |
| `name` | `str` | Display name; defaults to the checkpoint id. | `None` |
Source code in rpx_benchmark/adapters/depth_hf.py
UniDepth V2 adapter¶
depth_unidepth
¶
UniDepth V2 input/output adapters.
UniDepth bypasses the HuggingFace AutoModelForDepthEstimation API and
ships its own UniDepthV2 class. The forward API is::
model.infer(rgb: Tensor, camera: Tensor | Camera | None = None,
            normalize: bool = True) -> dict
The result dict has keys depth, confidence, intrinsics,
radius, points, rays, depth_features. We take
depth of shape (B, 1, H, W) in metres and return the (H, W)
slice. UniDepth's output is always at the input RGB resolution, so no
resize is normally needed.
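The shape bookkeeping amounts to one slice (sketch; `result` stands in for the dict returned by `model.infer`):

```python
import numpy as np

def extract_depth(result: dict) -> np.ndarray:
    # UniDepth's "depth" entry is (B, 1, H, W) in metres; the benchmark
    # wants a single (H, W) map, so take the first batch/channel slice.
    depth = np.asarray(result["depth"])
    return depth[0, 0]

hw_map = extract_depth({"depth": np.ones((1, 1, 480, 640), dtype=np.float32)})
```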
make_unidepth_v2_model(checkpoint: str = 'lpiccinelli/unidepth-v2-vitl14', *, device: str = 'cuda', camera_k: Optional[np.ndarray] = None, name: Optional[str] = None) -> BenchmarkableModel
¶
Factory for any UniDepth V2 checkpoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | `str` | HuggingFace Hub id for the UniDepth weights (e.g. `lpiccinelli/unidepth-v2-vitl14`). | `'lpiccinelli/unidepth-v2-vitl14'` |
| `device` | `str` | Torch device string. | `'cuda'` |
| `camera_k` | `ndarray` | A 3×3 intrinsics matrix. If omitted, UniDepth self-prompts intrinsics from the image. | `None` |
Source code in rpx_benchmark/adapters/depth_unidepth.py
Metric3D V2 adapter¶
depth_metric3d
¶
Metric3D V2 input/output adapters with canonical-focal letterboxing.
Metric3D is trained at a canonical focal length; recovering real
metric depth needs the (fx_real / fx_canonical) rescale. The letterbox
preprocessing also has to be undone on the output side, so the
:class:Metric3DInputAdapter stashes the resize scale and letterbox
crop in PreparedInput.context for the output adapter to reverse.
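The metric recovery itself is one multiplication. A sketch of the rescale (the canonical focal value is read from the checkpoint's training config by the shipped adapter; any concrete number here would be an assumption):

```python
import numpy as np

def rescale_canonical_depth(depth_canonical: np.ndarray,
                            fx_real: float,
                            fx_canonical: float) -> np.ndarray:
    # Depth predicted at the canonical focal length scales linearly with
    # the ratio of the real to the canonical focal length.
    return depth_canonical * (fx_real / fx_canonical)

# E.g. a camera with half the canonical focal length halves the depth.
halved = rescale_canonical_depth(np.array([2.0, 4.0]),
                                 fx_real=500.0, fx_canonical=1000.0)
```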
make_metric3d_v2_model(*, device: str = 'cuda', fx_real: float = 605.0, hub_repo: str = 'yvanyin/metric3d', entry: str = 'metric3d_vit_large', name: Optional[str] = None) -> BenchmarkableModel
¶
Metric3D requires CUDA.
The upstream mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py
hardcodes device="cuda" in one of its torch.linspace calls
(see upstream issue tracker), so even purely-CPU inference hits an
assertion inside the decoder. We hard-fail early with a helpful
error rather than waiting for a stack trace buried six frames deep.
Source code in rpx_benchmark/adapters/depth_metric3d.py
HuggingFace segmentation adapter¶
seg_hf
¶
HuggingFace Transformers adapters for instance / panoptic segmentation.
Works with any checkpoint whose image processor exposes one of
post_process_instance_segmentation or
post_process_panoptic_segmentation — Mask2Former, OneFormer,
MaskFormer, DETR-for-panoptic.
The output contract is a single (H, W) int32 mask whose pixel
values are consistent instance IDs. For Mask2Former-family models
that means we flatten the per-segment outputs returned by the
processor and paint each segment's pixels with a unique integer.
The processor-signature detection follows the same pattern as
:mod:rpx_benchmark.adapters.depth_hf: we introspect at setup time
and pick the right postprocess method.
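The flattening step — painting each segment's pixels with a unique integer — can be sketched as follows (illustrative; the overlap rule, later segments win, is an assumed tie-break, and 0 remains background):

```python
import numpy as np

def flatten_segments(segment_masks) -> np.ndarray:
    # Paint per-segment boolean masks into one int32 instance mask,
    # assigning IDs 1..N in processor order; unpainted pixels stay 0.
    out = np.zeros(segment_masks[0].shape, dtype=np.int32)
    for instance_id, seg in enumerate(segment_masks, start=1):
        out[seg.astype(bool)] = instance_id
    return out

a = np.array([[1, 1], [0, 0]])
b = np.array([[0, 0], [1, 0]])
mask = flatten_segments([a, b])  # two instances plus background
```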
HFInstanceSegInputAdapter(processor: Any, device: str = 'cuda')
dataclass
¶
Bases: InputAdapter
Turn an RPX :class:Sample into a HuggingFace batch of 1 for segmentation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | `Any` | The HuggingFace image processor for the checkpoint. | *required* |
| `device` | `str` | PyTorch device string. Pixel values are moved to this device during `prepare`. | `'cuda'` |
HFInstanceSegOutputAdapter(processor: Any, threshold: float = 0.5)
dataclass
¶
Bases: OutputAdapter
Map HF segmentation outputs to an integer instance mask.
The processor's post_process_instance_segmentation or
post_process_panoptic_segmentation method is used; which one
is picked depends on which one the processor exposes (detected at
:meth:setup).
The result is a single (H, W) int32 mask where each instance
gets a unique integer (0 is background). Instance IDs are
assigned in the order the processor returns them.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `processor` | `Any` | The image processor that produced the model inputs; its post-process method performs the conversion. | *required* |
| `threshold` | `float` | Score threshold below which segments are dropped. Defaults to 0.5, matching Mask2Former's default eval config. | `0.5` |
make_hf_instance_seg_model(checkpoint: str, *, device: str = 'cuda', threshold: float = 0.5, name: Optional[str] = None, model_class_hint: Optional[str] = None) -> BenchmarkableModel
¶
One-line factory for a HuggingFace segmentation checkpoint.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | `str` | HuggingFace Hub id (e.g. a Mask2Former or OneFormer checkpoint). | *required* |
| `device` | `str` | Torch device string. | `'cuda'` |
| `threshold` | `float` | Score threshold for instance acceptance (passed to the processor's post-process if it accepts the kwarg). | `0.5` |
| `name` | `str` | Display name; defaults to the checkpoint id. | `None` |
| `model_class_hint` | `str` | Optional model-family hint; `None` auto-detects from the checkpoint config. | `None` |
Raises:

| Type | Description |
|---|---|
| `AdapterError` | If the processor exposes no post-process method we can use. |
| `ImportError` | If `transformers` is not installed. |