Architecture Overview¶
The toolkit separates what stays fixed (the benchmark itself) from what varies (the model under test). This page is the map of the moving parts.
Layers¶
┌─────────────────────────────────────────────────────────────┐
│ CLI (rpx_benchmark.cli)                                     │
│ ├── auto-discovers tasks from the task registry             │
│ └── maps RPXError → exit codes                              │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│ Task pipelines (rpx_benchmark.tasks.*)                      │
│ ├── monocular_depth.py                                      │
│ ├── segmentation.py                                         │
│ └── <your new task here>                                    │
│ each registers a TaskSpec with the task registry            │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│ Runner (rpx_benchmark.runner.BenchmarkRunner)               │
│ ├── iterates the dataset                                    │
│ ├── wraps first batch in FlopCounterMode for FLOPs          │
│ ├── records per-sample latency (median, skip warmup)        │
│ ├── attaches per-sample metadata (id/phase/difficulty)      │
│ └── builds DeploymentReadinessReport                        │
└─────────────────────────────────────────────────────────────┘
           │                     │               │
┌──────────┴──────┐ ┌────────────┴────────┐ ┌────┴─────────────┐
│ Adapters        │ │ Metric registry     │ │ Loader / Hub     │
│ (Input/Output   │ │ (per-task plugin    │ │ (manifest parse  │
│  framework)     │ │  calculators)       │ │  + HF download)  │
└─────────────────┘ └─────────────────────┘ └──────────────────┘
Plugin registries¶
All extensibility flows through three registries:
| Registry | Module | Adds | Touchpoints on existing code |
|---|---|---|---|
| Models | rpx_benchmark.models.registry | Named factory → BenchmarkableModel | 0 |
| Metrics | rpx_benchmark.metrics.registry | MetricCalculator subclass per task | 0 |
| Tasks | rpx_benchmark.tasks.registry | TaskSpec(task, primary_metric, run, ...) | 0 |
Adding a new task, metric, or model to the slate is always a one-file change. The CLI auto-discovers new tasks from the task registry at parser-build time.
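To make the "one-file change" concrete, here is a minimal sketch of what a registry-driven task registration could look like. Only the TaskSpec field names (task, primary_metric, run) come from the table above; the register_task helper, the surface_normals task, and its metric name are hypothetical stand-ins for the real API in rpx_benchmark.tasks.registry.

```python
# Hypothetical sketch of a one-file task registration.
# TaskSpec fields mirror the table above; everything else is illustrative.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class TaskSpec:
    task: str              # CLI-visible task name
    primary_metric: str    # headline metric for the report
    run: Callable          # pipeline entry point

TASK_REGISTRY: Dict[str, TaskSpec] = {}

def register_task(spec: TaskSpec) -> TaskSpec:
    """Record the spec so the CLI can auto-discover it at parser-build time."""
    TASK_REGISTRY[spec.task] = spec
    return spec

def run_surface_normals(cfg) -> dict:
    ...  # dataset iteration, model calls, metric computation

register_task(TaskSpec(task="surface_normals",
                       primary_metric="mean_angular_error",
                       run=run_surface_normals))
```

Because the CLI walks the registry when it builds its argument parser, a registration like this is all a new task file needs to ship.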
Data flow for one rpx bench <task> call¶
1. CLI parses flags → task's _build_config → TypedConfig
2. Pipeline resolves the device (CUDA fallback)
3. hub.download_split(task, split)
   ↓ writes the resolved manifest to ~/.cache/rpx_benchmark/
4. RPXDataset.from_manifest(path)  ← raises ManifestError
5. Resolve model: cfg.model → cfg.model_name → cfg.hf_checkpoint (priority order)
   ↓
   BenchmarkableModel instance
6. BenchmarkRunner.run_with_deployment_readiness(...)
   ├── first batch → FlopCounterMode → flops_g
   ├── per batch  → time.perf_counter → latency_ms (median)
   ├── per sample → metric calc → result.per_sample with metadata
   └── after loop → compute WPS / STR / TS
7. Reports: write_json + format_markdown_summary
8. Return (BenchmarkResult, DeploymentReadinessReport, paths)
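The latency bookkeeping in step 6 can be sketched in plain Python. The helper name, its signature, and the warmup count below are illustrative, not the runner's actual API; only the use of time.perf_counter and a warmup-skipping median follows the flow above.

```python
import statistics
import time

def median_latency_ms(fn, batches, warmup=2):
    """Time fn per batch, skip the first `warmup` batches, return the
    median latency in milliseconds (illustrative helper, not the
    toolkit's real runner)."""
    timings = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        fn(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # warmup batches absorb JIT/cache effects
            timings.append(elapsed_ms)
    return statistics.median(timings)
```

Skipping warmup batches before taking the median keeps one-off compilation or cache-fill costs out of the reported latency.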
Exception hierarchy¶
RPXError # base — `except rpx.RPXError` catches everything
├── ConfigError # invalid user config
├── DatasetError
│ ├── ManifestError # malformed / missing manifest
│ └── DownloadError # HuggingFace / network failure
├── ModelError
│ └── AdapterError # input / output adapter failure
└── MetricError # evaluator failure
Every exception carries a hint string that tells the user what to
fix, plus an optional details dict for structured context.
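In user code the base class is the intended catch-all. A minimal sketch of how the hierarchy might be consumed: the class names follow the tree above, but the re-created classes and the hint/details constructor keywords are assumptions for illustration.

```python
# Illustrative re-creation of part of the hierarchy; the real classes
# ship in the package. hint/details as constructor keywords are assumed.
class RPXError(Exception):
    def __init__(self, msg, hint="", details=None):
        super().__init__(msg)
        self.hint = hint            # what the user should fix
        self.details = details or {}  # structured context

class DatasetError(RPXError): pass
class ManifestError(DatasetError): pass

try:
    raise ManifestError(
        "manifest is missing the 'split' field",
        hint="Re-download the split to refresh the cached manifest.",
        details={"field": "split"},
    )
except RPXError as err:  # base class catches every toolkit error
    print(f"{err} (hint: {err.hint})")
```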
Logging¶
Every module creates its logger with
log = get_logger(__name__). The CLI calls
configure_logging
once at startup with a level driven by --verbose / --quiet /
RPX_LOG_LEVEL. When rich is installed the logs render through
RichHandler; otherwise a plain stream handler is used.
The logger hierarchy mirrors the package structure, so turning a single module's log level up or down is one call.