Measure inference the way enterprise operators actually feel it.

The benchmark surface starts methodology-first: cost per token, watts per token, p50/p95/p99 latency, utilization, cache behavior, model quality boundaries, placement, and governance overhead.

What to measure

  • $/token by model, runtime, tenant, and workload class.
  • Watts/token and power-envelope behavior under load.
  • Latency SLOs across prefill, decode, cache hit, and cold-start paths.
  • Utilization, batching, cache hit rate, and queueing behavior.

What not to overclaim

Benchmark results should state hardware, runtime, model, prompt mix, traffic shape, quantization, cache assumptions, and measurement window. Early pages should be clear when they describe methodology rather than published production results.