Architecture Overview¶
The toolkit separates what stays fixed (the benchmark itself) from what varies (the model under test). This page is the map of the moving parts.
Layers¶
┌─────────────────────────────────────────────────────────────┐
│ CLI (rpx_benchmark.cli)                                     │
│ ├── auto-discovers tasks from the task registry             │
│ └── maps RPXError → exit codes                              │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│ Task pipelines (rpx_benchmark.tasks.*)                      │
│ ├── monocular_depth.py                                      │
│ ├── segmentation.py                                         │
│ └── <your new task here>                                    │
│ each registers a TaskSpec with the task registry            │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│ Runner (rpx_benchmark.runner.BenchmarkRunner)               │
│ ├── iterates the dataset                                    │
│ ├── wraps first batch in FlopCounterMode for FLOPs          │
│ ├── records per-sample latency (median, skip warmup)        │
│ ├── attaches per-sample metadata (id/phase/difficulty)      │
│ └── builds DeploymentReadinessReport                        │
└─────────────────────────────────────────────────────────────┘
           │                     │               │
┌──────────┴──────┐ ┌────────────┴────────┐ ┌────┴─────────────┐
│ Adapters        │ │ Metric registry     │ │ Loader / Hub     │
│ (Input/Output   │ │ (per-task plugin    │ │ (manifest parse  │
│  framework)     │ │  calculators)       │ │  + HF download)  │
└─────────────────┘ └─────────────────────┘ └──────────────────┘
Plugin registries¶
All extensibility flows through three registries:
| Registry | Module | Adds | Touchpoints on existing code |
|---|---|---|---|
| Models | rpx_benchmark.models.registry | Named factory → BenchmarkableModel | 0 |
| Metrics | rpx_benchmark.metrics.registry | MetricCalculator subclass per task | 0 |
| Tasks | rpx_benchmark.tasks.registry | TaskSpec(task, primary_metric, run, ...) | 0 |
Adding a new task, metric, or model to the slate is always a one-file change. The CLI auto-discovers new tasks from the task registry at parser-build time.
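To make the "one-file change" concrete, here is a minimal sketch of what a registry-driven task registration could look like. Only the TaskSpec field names (task, primary_metric, run) come from the table above; the register_task helper, the surface_normals task, and its metric name are hypothetical stand-ins for the real API in rpx_benchmark.tasks.registry.

```python
# Hypothetical sketch of a one-file task registration.
# TaskSpec fields mirror the table above; everything else is illustrative.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class TaskSpec:
    task: str              # CLI-visible task name
    primary_metric: str    # headline metric for the report
    run: Callable          # pipeline entry point

TASK_REGISTRY: Dict[str, TaskSpec] = {}

def register_task(spec: TaskSpec) -> TaskSpec:
    """Record the spec so the CLI can auto-discover it at parser-build time."""
    TASK_REGISTRY[spec.task] = spec
    return spec

def run_surface_normals(cfg) -> dict:
    ...  # dataset iteration, model calls, metric computation

register_task(TaskSpec(task="surface_normals",
                       primary_metric="mean_angular_error",
                       run=run_surface_normals))
```

Because the CLI walks the registry when it builds its argument parser, a registration like this is all a new task file needs to ship.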
Data flow for one rpx bench <task> call¶
1. CLI parses flags → task's _build_config → TypedConfig
2. Pipeline resolves the device (CUDA fallback)
3. hub.download_split(task, split)
   ↓ writes the resolved manifest to ~/.cache/rpx_benchmark/
4. RPXDataset.from_manifest(path)  ← raises ManifestError
5. Resolve model: cfg.model → cfg.model_name → cfg.hf_checkpoint (priority order)
   ↓
   BenchmarkableModel instance
6. BenchmarkRunner.run_with_deployment_readiness(...)
   ├── first batch → FlopCounterMode → flops_g
   ├── per batch  → time.perf_counter → latency_ms (median)
   ├── per sample → metric calc → result.per_sample with metadata
   └── after loop → compute WPS / STR / TS
7. Reports: write_json + format_markdown_summary
8. Return (BenchmarkResult, DeploymentReadinessReport, paths)
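The latency bookkeeping in step 6 can be sketched in plain Python. The helper name, its signature, and the warmup count below are illustrative, not the runner's actual API; only the use of time.perf_counter and a warmup-skipping median follows the flow above.

```python
import statistics
import time

def median_latency_ms(fn, batches, warmup=2):
    """Time fn per batch, skip the first `warmup` batches, return the
    median latency in milliseconds (illustrative helper, not the
    toolkit's real runner)."""
    timings = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        fn(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # warmup batches absorb JIT/cache effects
            timings.append(elapsed_ms)
    return statistics.median(timings)
```

Skipping warmup batches before taking the median keeps one-off compilation or cache-fill costs out of the reported latency.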
Exception hierarchy¶
RPXError # base — `except rpx.RPXError` catches everything
├── ConfigError # invalid user config
├── DatasetError
│ ├── ManifestError # malformed / missing manifest
│ └── DownloadError # HuggingFace / network failure
├── ModelError
│ └── AdapterError # input / output adapter failure
└── MetricError # evaluator failure
Every exception carries a hint string that tells the user what to
fix, plus an optional details dict for structured context.
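In user code the base class is the intended catch-all. A minimal sketch of how the hierarchy might be consumed: the class names follow the tree above, but the re-created classes and the hint/details constructor keywords are assumptions for illustration.

```python
# Illustrative re-creation of part of the hierarchy; the real classes
# ship in the package. hint/details as constructor keywords are assumed.
class RPXError(Exception):
    def __init__(self, msg, hint="", details=None):
        super().__init__(msg)
        self.hint = hint            # what the user should fix
        self.details = details or {}  # structured context

class DatasetError(RPXError): pass
class ManifestError(DatasetError): pass

try:
    raise ManifestError(
        "manifest is missing the 'split' field",
        hint="Re-download the split to refresh the cached manifest.",
        details={"field": "split"},
    )
except RPXError as err:  # base class catches every toolkit error
    print(f"{err} (hint: {err.hint})")
```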
Logging¶
Every module creates its logger with
log = get_logger(__name__). The CLI calls
configure_logging
once at startup with a level driven by --verbose / --quiet /
RPX_LOG_LEVEL. When rich is installed the logs render through
RichHandler; otherwise a plain stream handler is used.
The logger hierarchy mirrors the package structure, so turning a single module's log level up or down is one call.