Deployment metrics (rpx_benchmark.deployment)
This module provides ESD-weighted phase scoring, State-Transition
Robustness, Temporal Stability, and Stack-Level Geometric Coherence.
These are the second-tier metrics that go into the
DeploymentReadinessReport, in addition to the raw per-sample task metrics.
deployment
Deployment-readiness metrics for RPX: TS, STR, SGC, ESD, weighted scoring.
These are the core novel contributions of the RPX benchmark paper.
Four metrics characterise how well a model generalises under deployment conditions:
TS (Temporal Stability): pose-compensated frame-to-frame consistency
STR (State-Transition Robustness): performance drop / recovery across scene phases
SGC (Stack-Level Geometric Coherence): mask–depth boundary alignment
ESD (Effort-Stratified Difficulty): per-difficulty breakdown (Easy/Medium/Hard)
Plus the weighted phase scoring scheme:
S_p = 0.25·M(p,Easy) + 0.35·M(p,Medium) + 0.40·M(p,Hard)
S_overall = (S_clutter + S_interaction + S_clean) / 3
Δ_int = S_interaction − S_clutter (interaction drop)
Δ_rec = S_clean − S_interaction (recovery)
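The scoring scheme above can be worked through by hand. The following is a minimal sketch, not the module's implementation: the function names, the string keys for phases and difficulties, and the input numbers are all illustrative assumptions. Only the weights and formulas come from the text.

```python
from typing import Dict

# Difficulty weights from the S_p formula above.
WEIGHTS = {"easy": 0.25, "medium": 0.35, "hard": 0.40}


def phase_score(per_difficulty: Dict[str, float]) -> float:
    """S_p = 0.25*M(p,Easy) + 0.35*M(p,Medium) + 0.40*M(p,Hard)."""
    return sum(WEIGHTS[d] * per_difficulty[d] for d in WEIGHTS)


def summarise(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Per-phase scores, overall mean, and the two transition deltas."""
    s = {phase: phase_score(vals) for phase, vals in table.items()}
    return {
        "S_clutter": s["clutter"],
        "S_interaction": s["interaction"],
        "S_clean": s["clean"],
        "S_overall": (s["clutter"] + s["interaction"] + s["clean"]) / 3,
        "delta_int": s["interaction"] - s["clutter"],  # interaction drop
        "delta_rec": s["clean"] - s["interaction"],    # recovery
    }


# Made-up per-(phase, difficulty) values of a higher-is-better metric.
table = {
    "clutter": {"easy": 0.80, "medium": 0.70, "hard": 0.60},
    "interaction": {"easy": 0.60, "medium": 0.50, "hard": 0.40},
    "clean": {"easy": 0.90, "medium": 0.80, "hard": 0.70},
}
result = summarise(table)
```

With these inputs the per-phase scores come out at about 0.685, 0.485 and 0.785, so the interaction drop is about −0.2 and the recovery about +0.3.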
TemporalStabilityResult(ts_score: float, num_pairs: int, per_pair: List[float] = list())
dataclass
TS score per task type.
TS_seg = E[IoU(P_t, warp(P_{t+1}, ΔT))] (segmentation)
TS_depth = E[||D_t − warp(D_{t+1}, ΔT)||_1 / N_valid] (depth)
Warping uses the relative SE(3) pose from the T265 ground-truth track. When exact warping is not feasible, a consistency proxy (unchanged-pixel fraction) is used as a lower bound.
StateTransitionRobustnessResult(str_c_to_i: float, str_i_to_l: float, metric_clutter: float, metric_interaction: float, metric_clean: float)
dataclass
STR captures performance change across phase boundaries.
STR_{C→I} = M(interaction) − M(clutter) (interaction drop; negative = worse)
STR_{I→L} = M(clean) − M(interaction) (recovery; positive = better)
StackGeometricCoherenceResult(sgc_score: float, precision: float, recall: float, num_samples: int)
dataclass
SGC measures mask–depth boundary alignment.
SGC = F-score(boundary(mask), boundary(depth_gradient > τ))
Boundary pixels are extracted via Sobel gradient magnitude thresholding.
ESDResult(easy: float | None, medium: float | None, hard: float | None, metric_key: str)
dataclass
Per-difficulty metric breakdown (Effort-Stratified Difficulty).
weighted_score() -> float
S_p = 0.25·Easy + 0.35·Medium + 0.40·Hard.
Source code in rpx_benchmark/deployment.py
WeightedPhaseScore(clutter: ESDResult, interaction: ESDResult, clean: ESDResult)
dataclass
Full deployment-readiness scoring table.
Per phase: S_p = 0.25·M(p,Easy) + 0.35·M(p,Medium) + 0.40·M(p,Hard)
Overall: S_overall = (S_C + S_I + S_L) / 3
Interaction drop: Δ_int = S_I − S_C
Recovery: Δ_rec = S_L − S_I
DeploymentReadinessReport(task: str, model_name: str, weighted_phase_score: WeightedPhaseScore | None = None, temporal_stability: TemporalStabilityResult | None = None, state_transition: StateTransitionRobustnessResult | None = None, geometric_coherence: StackGeometricCoherenceResult | None = None, params_m: float | None = None, flops_g: float | None = None, actmem_gb_fp16: float | None = None, latency_ms_per_sample: float | None = None)
dataclass
Aggregated deployment-readiness report for a model on a task.
compute_esd(per_sample_metrics: List[Dict[str, float]], per_sample_difficulties: List[Difficulty | None], metric_key: str) -> ESDResult
Compute per-difficulty metric averages from per-sample results.
Args:
per_sample_metrics: list of metric dicts, one per sample.
per_sample_difficulties: difficulty label per sample (may be None).
metric_key: which metric key to stratify (e.g. "absrel", "miou").
Returns:
ESDResult with easy/medium/hard averages.
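The stratification described above amounts to bucketing samples by difficulty and averaging each bucket. The sketch below illustrates that idea under stated assumptions: difficulty labels are plain strings instead of the module's Difficulty type, samples with a None label or a missing key are skipped, and an empty bucket yields None (matching the optional fields of ESDResult). How the actual compute_esd treats these cases is not stated in this page.

```python
from typing import Dict, List, Optional


def esd_averages(
    per_sample_metrics: List[Dict[str, float]],
    per_sample_difficulties: List[Optional[str]],
    metric_key: str,
) -> Dict[str, Optional[float]]:
    """Average metric_key within each difficulty bucket (illustrative helper)."""
    buckets: Dict[str, List[float]] = {"easy": [], "medium": [], "hard": []}
    for metrics, difficulty in zip(per_sample_metrics, per_sample_difficulties):
        # Assumption: unlabelled samples and samples lacking the key are ignored.
        if difficulty is None or metric_key not in metrics:
            continue
        buckets[difficulty].append(metrics[metric_key])
    # An empty bucket has no defined average, so report None for it.
    return {d: (sum(v) / len(v) if v else None) for d, v in buckets.items()}


metrics = [{"miou": 0.8}, {"miou": 0.6}, {"miou": 0.4}, {"miou": 0.9}]
difficulties = ["easy", "easy", "hard", None]
esd = esd_averages(metrics, difficulties, "miou")
```

Here the fourth sample carries no difficulty label and so contributes to no bucket, and the medium bucket stays empty.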
compute_weighted_phase_score(per_sample_metrics: List[Dict[str, float]], per_sample_phases: List[Phase | None], per_sample_difficulties: List[Difficulty | None], metric_key: str) -> WeightedPhaseScore
Compute the full weighted phase scoring table.
Groups samples by (phase, difficulty) and computes, for each phase,
S_p = 0.25·M(p,Easy) + 0.35·M(p,Medium) + 0.40·M(p,Hard)
then the overall score and the transition deltas.
compute_str(phase_scores: Dict[Phase, float]) -> StateTransitionRobustnessResult
Compute STR from per-phase aggregated scores.
Args: phase_scores: dict mapping Phase → scalar metric value.
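As a rough illustration of the mapping from per-phase scores to the two STR values, the sketch below uses stand-in types. The Phase enum members and the STRResult dataclass here are hypothetical, declared only so the example runs on its own. The field names mirror StateTransitionRobustnessResult as documented above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Phase(Enum):
    """Hypothetical stand-in for the module's Phase type."""
    CLUTTER = "clutter"
    INTERACTION = "interaction"
    CLEAN = "clean"


@dataclass
class STRResult:
    """Stand-in with the same fields as StateTransitionRobustnessResult."""
    str_c_to_i: float
    str_i_to_l: float
    metric_clutter: float
    metric_interaction: float
    metric_clean: float


def str_from_phase_scores(phase_scores: Dict[Phase, float]) -> STRResult:
    c = phase_scores[Phase.CLUTTER]
    i = phase_scores[Phase.INTERACTION]
    l = phase_scores[Phase.CLEAN]
    # STR_{C->I} = M(interaction) - M(clutter); STR_{I->L} = M(clean) - M(interaction)
    return STRResult(i - c, l - i, c, i, l)


str_result = str_from_phase_scores(
    {Phase.CLUTTER: 0.70, Phase.INTERACTION: 0.50, Phase.CLEAN: 0.75}
)
```

The sign convention (negative drop = worse, positive recovery = better) reads naturally for higher-is-better metrics such as mIoU. For an error metric such as absrel the signs would presumably read the other way. This page does not say how that case is handled.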
compute_temporal_stability_seg(pred_masks: Sequence[np.ndarray], camera_poses: Sequence[np.ndarray | None]) -> TemporalStabilityResult
TS_seg = E[IoU(P_t, warp(P_{t+1}, ΔT))].
When T265 pose data is available, we use the relative rotation to compensate for camera motion before computing IoU between adjacent frames. Without pixel-accurate warping (which requires depth for backprojection), we apply a simplified affine proxy using the in-plane rotation component only.
This gives a conservative lower-bound TS_seg that is still a meaningful stability signal when scenes have modest depth variation.
Args:
pred_masks: sequence of predicted segmentation masks (H×W int).
camera_poses: per-frame 4×4 SE(3) matrices (camera-to-world), or None.
Returns:
TemporalStabilityResult.
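To make the expectation over adjacent frame pairs concrete, here is a minimal sketch of the no-pose case only: it skips warping entirely and compares consecutive masks directly. It also collapses each mask to foreground versus background, which is an assumption made for brevity. The function names are illustrative and the pose-compensation step described above is deliberately left out.

```python
import numpy as np


def binary_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks; two empty masks count as identical."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def ts_seg_no_warp(pred_masks):
    """Mean IoU over adjacent frames, without any camera-motion compensation."""
    per_pair = [
        binary_iou(pred_masks[t] > 0, pred_masks[t + 1] > 0)
        for t in range(len(pred_masks) - 1)
    ]
    score = float(np.mean(per_pair)) if per_pair else 0.0
    return score, per_pair


m0 = np.zeros((4, 4), dtype=int)
m0[0:2, 0:2] = 1          # 4 foreground pixels
m1 = np.zeros((4, 4), dtype=int)
m1[0:2, 0:3] = 1          # 6 foreground pixels, containing m0's region
ts_score, ts_pairs = ts_seg_no_warp([m0, m1, m1.copy()])
```

The first pair overlaps on 4 of 6 pixels and the second pair is identical, so the two per-pair values are 2/3 and 1.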
compute_temporal_stability_depth(pred_depths: Sequence[np.ndarray], camera_poses: Sequence[np.ndarray | None]) -> TemporalStabilityResult
TS_depth = E[||D_t − warp(D_{t+1}, ΔT)||_1 / N_valid].
The mean L1 error is normalised to [0,1] by dividing by the max depth range and then inverted to give a higher-is-better stability score (TS = 1 − normalised_L1).
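The normalisation can be sketched as follows. This is an illustration under assumptions, not the module's code: warping is omitted, a pixel counts as valid when both depths are finite and positive, and the depth range is passed in as a hypothetical depth_range parameter because this page does not say how the maximum range is obtained.

```python
import numpy as np


def ts_depth_no_warp(pred_depths, depth_range: float):
    """TS = 1 - normalised mean L1 between adjacent depth maps (no warping)."""
    per_pair = []
    for d_t, d_next in zip(pred_depths[:-1], pred_depths[1:]):
        # Assumption: valid pixels have finite, positive depth in both frames.
        valid = (d_t > 0) & (d_next > 0) & np.isfinite(d_t) & np.isfinite(d_next)
        if not valid.any():
            continue
        l1 = np.abs(d_t[valid] - d_next[valid]).mean()
        normalised = min(l1 / depth_range, 1.0)  # clamp into [0, 1]
        per_pair.append(float(1.0 - normalised))
    score = float(np.mean(per_pair)) if per_pair else 0.0
    return score, per_pair


d0 = np.full((2, 2), 2.0, dtype=np.float32)
d0[0, 0] = 0.0            # one invalid pixel, excluded from the average
d1 = np.full((2, 2), 2.5, dtype=np.float32)
depth_score, depth_pairs = ts_depth_no_warp([d0, d1], depth_range=5.0)
```

The three valid pixels each differ by 0.5 m, which is 10% of the assumed 5 m range, giving a stability score of 0.9.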
compute_sgc(pred_masks: Sequence[np.ndarray], pred_depths: Sequence[np.ndarray], depth_gradient_threshold: float = 0.1, boundary_dilation: int = 2) -> StackGeometricCoherenceResult
Stack-Level Geometric Coherence: boundary F-score between mask and depth edges.
SGC = F-score(boundary(mask), boundary(depth_gradient > τ))
A high SGC means segmentation boundaries are geometrically consistent with the depth discontinuities — indicating the model perceives coherent surfaces.
Args:
pred_masks: sequence of predicted segmentation masks (H×W int).
pred_depths: sequence of predicted depth maps (H×W float32, metres).
depth_gradient_threshold: τ for depth gradient thresholding.
boundary_dilation: pixel tolerance for boundary matching.
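A boundary F-score with a pixel tolerance can be sketched in plain NumPy as below. Several choices here are assumptions rather than the module's behaviour: depth edges use np.gradient finite differences in place of the Sobel operator named above, mask boundaries are label changes between neighbouring pixels, precision is measured over mask-boundary pixels and recall over depth-edge pixels, and a sample with no boundaries on either side scores 0. All function names are illustrative.

```python
import numpy as np


def mask_boundary(mask: np.ndarray) -> np.ndarray:
    """Pixels whose right or lower neighbour carries a different label."""
    b = np.zeros(mask.shape, dtype=bool)
    b[:, :-1] |= mask[:, :-1] != mask[:, 1:]
    b[:-1, :] |= mask[:-1, :] != mask[1:, :]
    return b


def depth_boundary(depth: np.ndarray, tau: float) -> np.ndarray:
    """Pixels whose depth gradient magnitude exceeds tau (finite differences)."""
    gy, gx = np.gradient(depth.astype(np.float64))
    return np.hypot(gx, gy) > tau


def dilate(b: np.ndarray, r: int) -> np.ndarray:
    """Square dilation by r pixels, built from shifted copies."""
    h, w = b.shape
    padded = np.pad(b, r, mode="constant", constant_values=False)
    out = np.zeros_like(b)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def boundary_f_score(mask, depth, tau=0.1, tol=2):
    mb, db = mask_boundary(mask), depth_boundary(depth, tau)
    if not mb.any() or not db.any():
        return 0.0, 0.0, 0.0
    precision = float((mb & dilate(db, tol)).sum() / mb.sum())
    recall = float((db & dilate(mb, tol)).sum() / db.sum())
    if precision + recall == 0:
        return 0.0, precision, recall
    return 2 * precision * recall / (precision + recall), precision, recall


mask = np.zeros((8, 8), dtype=int)
mask[:, 4:] = 1                                  # object fills the right half
depth = np.where(mask == 1, 2.0, 1.0)            # 1 m depth step at the same edge
aligned = boundary_f_score(mask, depth)
flat = boundary_f_score(mask, np.ones((8, 8)))   # no depth edges at all
```

In the aligned case the mask edge and the depth step coincide, so every boundary pixel on each side has a match within the tolerance. In the flat case there is nothing to match against and the sketch falls back to 0.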