9. Benchmarking · Docs · AncientHoldings

Every node runs a benchmark before it can earn. The benchmark produces a single ServerScore and a per-category breakdown (CPU, Disk, Network, RAM, Commitment). ServerScore drives the per-second Stoicism accrual rate — see §6 The Stoic System. This chapter is the authoritative reference for how those numbers are produced and what upgrades do to them.

15.1 · When a benchmark runs

Benchmarks are operator-initiated. The hub never runs one autonomously. Triggers:

First benchmark after registration. The node has no ServerScore until the operator clicks Run benchmark on the Scoring card.
Re-benchmark. Any time. Stamps the latest successful run; older runs kept as a diagnostic tail (last 3 rows in benchmark_runs).
Restamp. No SSH and no benchmark: the hub just re-derives the stamp from the stored breakdown under today’s scoring formula. Cheap; used after a formula change.

The auto-stop + auto-restart flow installs dependencies while chainweb is still up, then stops chainweb, runs the benchmark, and restarts it. Expected operator-facing downtime: ~3–5 min. A finally block guarantees the restart even if the run crashes midway; if auto-restart itself fails, the job log surfaces it and the operator can restart manually from the Control tab.
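
The restart guarantee can be sketched as a try/finally wrapper. This is a minimal sketch under stated assumptions, not the hub's actual code: NodeControl, BenchOutcome, and runBenchmarkWithRestart are hypothetical names, and the steps are synchronous only to keep the example short.

```typescript
// Hypothetical control surface; the real hub drives these steps over SSH.
interface NodeControl {
  installDeps(): void;
  stopChainweb(): void;
  runBenchmark(): number; // returns a score in this sketch
  startChainweb(): void;
}

interface BenchOutcome {
  score: number | null;
  restartFailed: boolean;
  log: string[];
}

function runBenchmarkWithRestart(node: NodeControl): BenchOutcome {
  const outcome: BenchOutcome = { score: null, restartFailed: false, log: [] };

  // Deps install lands while chainweb is still up, so it costs no downtime.
  node.installDeps();

  try {
    node.stopChainweb();
    outcome.score = node.runBenchmark();
  } catch (err) {
    outcome.log.push("benchmark failed: " + String(err));
  } finally {
    // Always attempt the restart, even if the benchmark crashed mid-run.
    try {
      node.startChainweb();
    } catch (err) {
      // Surface the failure so the operator can restart manually.
      outcome.restartFailed = true;
      outcome.log.push("auto-restart failed: " + String(err));
    }
  }
  return outcome;
}
```

The inner try/catch matters: without it, a failed restart would throw out of the finally block and the job log entry would be lost.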

15.2 · What is measured

Each benchmark run executes eight phases on the target box, in order. Phase percentages mirror the progress checklist in the Scoring card exactly.

Phase	Tool	What it produces	Time budget
Deps install	apt	sysbench, fio, perf, stress-ng, curl	0–90 s
System info	systemd-detect-virt, dmidecode, lscpu, /proc	CPU model, cores × threads, RAM, hw_type classification	~2 s
CPU single-thread ×5	sysbench cpu --threads=1	events/sec (reference only, not in formula)	~60 s
CPU multi-thread ×5	sysbench cpu --threads=N	events/sec — drives the 22 % CPU slice	~60 s
perf + stress-ng	perf stat, stress-ng	oversubscription signal (CPU steal mean, penalty factor)	~30 s
fio disk probe	fio (direct on committed provision path)	4K rand read IOPS / 4K rand write IOPS / 1M seq read MB/s	~60 s
Net probe ×8 regions	curl against 8 Linode regions	per-region download Mbps + ping ms	~120 s
Parse + persist	hub-side	breakdown JSON + score stamp	~5 s

A phase can fail without invalidating the whole run. If fio fails but CPU + network succeed, the run’s disk contribution is 0 but the stamp still lands; the card renders a red “✗” on the Disk tile so the operator sees which dimension didn’t measure. Only fully-failed runs (every dimension zero) are discarded.
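
The partial-failure rule can be illustrated as follows. This is a sketch, assuming the measured dimensions are the four stamped ones (the Commitment slot is live, so it is left out here); the type and function names are hypothetical.

```typescript
// Hypothetical shape: each field is that dimension's stamped contribution.
interface MeasuredContributions {
  cpu: number;
  disk: number;
  net: number;
  ram: number;
}

type Dimension = keyof MeasuredContributions;

const DIMENSIONS: Dimension[] = ["cpu", "disk", "net", "ram"];

// Dimensions that did not measure; the card would mark these tiles with a red cross.
function failedDimensions(c: MeasuredContributions): Dimension[] {
  return DIMENSIONS.filter((d) => c[d] === 0);
}

// A run is discarded only when every dimension is zero.
function shouldDiscardRun(c: MeasuredContributions): boolean {
  return failedDimensions(c).length === DIMENSIONS.length;
}
```

For a run where only fio failed, failedDimensions returns just the disk entry and the stamp still lands.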

15.3 · Why 8 regions and two file sizes

Network score is the average of 8 per-region composites. Regions: Frankfurt, London, Newark, Dallas, Fremont, Singapore, Tokyo, Sydney. Each is worth 2.5 % of the total score (together, 20 %).

Per-region composite:

loc = 0.5 × (down / 500 Mbps) + 0.5 × (30 ms / ping)
netComposite = Σ(loc) ÷ 8 expected regions
netContribution = netComposite × 0.20

Probe file sizes differ by region: 100 MB full download for EU/US regions, 25 MiB byte-range download for Pacific regions. The TCP window × RTT ceiling means 100 MB would time out against Tokyo or Sydney from most European operators within the 30 s per-probe budget. The smaller file under-reports throughput (slow-start is a bigger fraction of the transfer), but the bias is conservative, not generous.
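
The window × RTT ceiling can be made concrete with a small calculation. This is an illustrative sketch: the 512 KiB effective window and the 250 ms round trip are assumed values chosen for the example, not measurements from the benchmark, and the function names are hypothetical.

```typescript
// Upper bound on single-stream TCP throughput: one window per round trip.
function tcpCeilingBitsPerSec(windowBytes: number, rttMs: number): number {
  return (windowBytes * 8) / (rttMs / 1000);
}

// Best-case transfer time at the ceiling. Slow-start is ignored,
// so a real transfer takes longer than this.
function bestCaseSeconds(fileBytes: number, windowBytes: number, rttMs: number): number {
  return (fileBytes * 8) / tcpCeilingBitsPerSec(windowBytes, rttMs);
}

const PROBE_BUDGET_SECONDS = 30; // per-probe budget from the text above

// Illustrative assumptions, not measured values.
const ASSUMED_WINDOW_BYTES = 512 * 1024; // 512 KiB effective window
const ASSUMED_RTT_MS = 250; // long-haul Europe to Pacific round trip

const fullFileSeconds = bestCaseSeconds(100 * 1000 * 1000, ASSUMED_WINDOW_BYTES, ASSUMED_RTT_MS);
const rangeFileSeconds = bestCaseSeconds(25 * 1024 * 1024, ASSUMED_WINDOW_BYTES, ASSUMED_RTT_MS);

const fullFileFits = fullFileSeconds <= PROBE_BUDGET_SECONDS;
const rangeFileFits = rangeFileSeconds <= PROBE_BUDGET_SECONDS;
```

Under these assumptions the ceiling is about 16.8 Mbps, so the 100 MB file needs roughly 47.7 s and misses the budget, while the 25 MiB range needs 12.5 s and fits.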

Unreachable regions score 0 for that 2.5 % slice; the other regions still contribute. The next benchmark retries them, so a temporary route outage doesn’t permanently penalise the node.
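
The per-region composite and the unreachable-region rule above can be sketched as below. The RegionProbe shape and function names are hypothetical, and the sketch assumes loc is not capped at 1.0, since the text does not say whether a cap applies.

```typescript
// Hypothetical probe record; null means the region was unreachable.
interface RegionProbe {
  region: string;
  downMbps: number | null;
  pingMs: number | null;
}

const EXPECTED_REGIONS = 8;
const NET_WEIGHT = 0.20;

// loc = 0.5 × (down / 500 Mbps) + 0.5 × (30 ms / ping)
function regionComposite(p: RegionProbe): number {
  if (p.downMbps === null || p.pingMs === null || p.pingMs <= 0) {
    return 0; // unreachable regions score 0 for their slice
  }
  return 0.5 * (p.downMbps / 500) + 0.5 * (30 / p.pingMs);
}

// Always divide by the 8 expected regions, so a missing region costs its slice.
function netContribution(probes: RegionProbe[]): number {
  const sum = probes.reduce((acc, p) => acc + regionComposite(p), 0);
  return (sum / EXPECTED_REGIONS) * NET_WEIGHT;
}
```

Dividing by the expected region count rather than by the number of successful probes is what makes an outage cost exactly one 2.5 % slice instead of inflating the average.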

15.3a · Live substep list (v.G.1.4a Iris)

While a benchmark is running, the Scoring card renders a live per-substep list grouped by category. The Iris release expanded the list from CPU-only (9 substeps) to a full Install · CPU · Disk · Network · RAM · Persist layout, with thin section headers between non-empty groups.

Each row transitions pending → in-progress → complete (or skipped when the host can’t run that workload). Once a row reaches complete, the measured value renders inline alongside the row title:

CPU rows — sysbench events/sec, IPC, and the six chainweb workloads (Blake2s, secp256k1, Ed25519, random memory, Pact-proxy, hash-and-verify) per the v.G.1.3 Antikythera spec.
Disk row (fio) — 4K rand read X IOPS · 1M seq Y MB/s, summarising the two highest-signal disk dimensions on a single line.
Network row (librespeed) — Frankfurt 287 Mbps · Dallas 142 Mbps, picking the top-2 regions by download throughput. The full per-region table is still on the Network subscore tile.
RAM row (slice bench only) — memmove X.X GB/s from the in-cgroup stress-ng vm-method memmove run.
Install + Persist rows — no inline measurement; success surfaces as a green check, failures surface as a red banner above the list.

When the librespeed parser fails to read a region’s output (DNS hiccup, partial pipe, server-side hiccup), the run now emits a ===PHASE:librespeed:fail=== sentinel and persists the first 2 KB of raw output to the breakdown blob’s diagnostic block. Pre-Iris this path was silent: the Network subcard tinted red without surfacing why. Operators who hit this can now read the raw output back from the bench history.
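
A minimal sketch of how a consumer might detect the failure sentinel and keep the diagnostic head. The function and type names are hypothetical, and the sketch truncates by characters, which equals bytes only for ASCII output.

```typescript
const LIBRESPEED_FAIL_SENTINEL = "===PHASE:librespeed:fail===";
const DIAGNOSTIC_LIMIT = 2048; // first 2 KB, counted in characters in this sketch

interface NetDiagnostic {
  failed: boolean;
  rawHead: string | null; // truncated raw output, kept only on failure
}

function inspectLibrespeedOutput(raw: string): NetDiagnostic {
  // The sentinel is matched as a whole line, ignoring surrounding whitespace.
  const failed = raw
    .split(/\r?\n/)
    .some((line) => line.trim() === LIBRESPEED_FAIL_SENTINEL);
  return {
    failed,
    rawHead: failed ? raw.slice(0, DIAGNOSTIC_LIMIT) : null,
  };
}
```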

15.4 · Hardware classification

Every benchmark stamps an hw_type that determines the formula’s outer multiplier. Classification runs during the System info phase.

hw_type	Matches	Multiplier	Eligible?
barebone	systemd-detect-virt = none	×1.15	yes
vps	kvm, qemu, vmware, xen, microsoft, bhyve, oracle	×1.00	yes
container / lxc / docker	systemd-detect-virt = docker / lxc / container	—	no (rejected at stamp)
unknown	detection failed or returned an unknown type	—	no (admin review required)

Container rejection (v0.7.8z20+). Containers were historically allowed at a ×0.9 multiplier. That stopped with z20 because a container isn’t carrying its own OS, kernel, or independent failure domain: it’s a guest of a larger box that’s already running somebody’s node, and the scoring formula has no way to tell a dedicated container apart from sixteen containers sharing the same hardware. Benchmarks on container-classified hosts now refuse to stamp.
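
The classification table maps onto a small lookup. This sketch covers only the systemd-detect-virt value; the multi-signal cross-check is not modelled, and the type and function names are hypothetical.

```typescript
type HwType = "barebone" | "vps" | "container" | "unknown";

interface HwClassification {
  hwType: HwType;
  multiplier: number | null; // null when the host is not eligible to stamp
  eligible: boolean;
}

const VPS_VALUES = ["kvm", "qemu", "vmware", "xen", "microsoft", "bhyve", "oracle"];
const CONTAINER_VALUES = ["docker", "lxc", "container"];

// detectVirt is the raw stdout of `systemd-detect-virt`.
function classifyHardware(detectVirt: string): HwClassification {
  const v = detectVirt.trim().toLowerCase();
  if (v === "none") {
    return { hwType: "barebone", multiplier: 1.15, eligible: true };
  }
  if (VPS_VALUES.indexOf(v) !== -1) {
    return { hwType: "vps", multiplier: 1.0, eligible: true };
  }
  if (CONTAINER_VALUES.indexOf(v) !== -1) {
    // Rejected at stamp since v0.7.8z20.
    return { hwType: "container", multiplier: null, eligible: false };
  }
  // Detection failed or returned an unrecognised value: admin review required.
  return { hwType: "unknown", multiplier: null, eligible: false };
}
```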

Spoofability is honest. systemd-detect-virt is trivially defeated by an operator with root (which they have, because the benchmark runs over SSH as root). The multi-signal check that cross-references dmidecode, /proc/cpuinfo hypervisor flags, and cgroup markers refuses the stamp when signals disagree. If you suspect a spoofed hw_type on a fleet node, an Ancient Admin can force a re-benchmark from the node detail page; there is no “force to barebone” lever by design.

15.5 · The ServerScore formula

ServerScore = hwTypeMultiplier × (
    cpuContribution           (22 %)
  + diskContribution          (21 %, split 3 × 7 %)
  + netContribution           (20 %, split 8 × 2.5 %)
  + ramContribution           (19 %, linear, uncapped)
  + provisionContribution     (18 %, log curve, live)
)

CPU/Disk/Net/RAM contributions are stamped at bench time — they don’t move until a new benchmark runs. The Commitment slot is live: every scoring tick recomputes it using the current fleet-wide minimum and the operator’s current committed GB. As the chain grows, everyone’s commitment contribution drops slowly; raising committed_gb lifts it back.
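
The split between stamped and live inputs can be sketched like this. The StampedRun shape and function name are hypothetical; only the formula's structure is taken from the text.

```typescript
// Values frozen at bench time. They change only when a new benchmark runs.
interface StampedRun {
  hwTypeMultiplier: number;
  cpuContribution: number;  // up to 22 %
  diskContribution: number; // up to 21 %
  netContribution: number;  // up to 20 %
  ramContribution: number;  // 19 % weight, uncapped
}

// liveProvisionContribution is recomputed on every scoring tick from the
// current fleet-wide minimum and the operator's current committed GB.
function serverScore(run: StampedRun, liveProvisionContribution: number): number {
  return (
    run.hwTypeMultiplier *
    (run.cpuContribution +
      run.diskContribution +
      run.netContribution +
      run.ramContribution +
      liveProvisionContribution)
  );
}
```

A barebone host that sits exactly on the reference line in every slot (0.22 + 0.21 + 0.20 + 0.19 + 0.18 = 1.00) would score 1.15 under this sketch.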

15.5.1 · Why the commitment slot is a log curve

At ratio ≤ 1 (below baseline), the commitment contribution scales linearly: 0.18 × ratio. At ratio > 1 (over-committed, buying headroom for chain growth), it follows 0.18 × (1 + log₂(ratio)), so each doubling of your commitment adds one full slot (0.18) while the marginal value per committed GB tapers. This stops over-committed whales from dominating the scoreboard while still rewarding a prudent buffer.
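
A minimal sketch of the curve, assuming ratio is the operator's committed GB divided by the fleet-wide minimum. The function and parameter names are hypothetical.

```typescript
const COMMITMENT_WEIGHT = 0.18;

function commitmentContribution(committedGb: number, fleetMinimumGb: number): number {
  if (committedGb <= 0 || fleetMinimumGb <= 0) {
    return 0;
  }
  const ratio = committedGb / fleetMinimumGb;
  if (ratio <= 1) {
    // Below baseline: linear.
    return COMMITMENT_WEIGHT * ratio;
  }
  // Above baseline: one extra slot per doubling. log2 is written as
  // ln(x) / ln(2) so the sketch also compiles against an ES5 library target.
  return COMMITMENT_WEIGHT * (1 + Math.log(ratio) / Math.LN2);
}
```

With a fleet minimum of 100 GB, committing 50 GB yields 0.09, 100 GB yields 0.18, 200 GB yields 0.36, and 400 GB yields 0.54: the second 200 GB buys the same 0.18 that the first extra 100 GB did.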

15.5.2 · Live vs stamped display

The /admin/nodes/[id] scoring card renders the headline ServerScore using the live commitment slot, so the number matches what the scoring worker is actually using right now. Historical-run inspection falls back to the stamped value (that run’s provisionContribution at bench time), which is the honest representation when looking back in time.

15.6 · Six chainweb-correlated workloads

Sysbench’s prime-sieve loop is a fine general-purpose CPU exerciser, but chainweb-node doesn’t spend its time sieving primes. It hashes blocks, verifies signatures, chases pointers through RocksDB, and runs Pact contracts on a single thread per block. Six new measurements, introduced in v.G.1.3, target each of those hot paths directly so the CPU score reflects how the box will perform under real consensus load — not how it performs at an unrelated arithmetic loop.

Each workload is single-threaded by design (chainweb’s per-block path is single-threaded), uses tools already installed by the bench’s deps phase (no new apt packages), and is tolerant of missing tooling — if an operator’s OpenSSL build is missing a cipher, the workload records null and the run continues.

Workload	What it measures	Chainweb hot path it mirrors	Unit
Blake2s throughput	b2sum / openssl speed blake2s256	consensus block hashing + Merkle tree construction	MB/s
secp256k1 ECDSA verify	openssl speed ecdsap256	transaction signature verification (the dominant signature scheme)	verify/s
Ed25519 verify	openssl speed ed25519	alternate-signature path used by some Pact key types	verify/s
Random memory access	sysbench memory --memory-access-mode=rnd	RocksDB lookup / cache-miss patterns the UTXO layer hits	MB/s
Pact-proxy single-thread	sysbench cpu --threads=1	smart-contract evaluation pipeline (interpreter integer work)	events/s
Hash-and-verify pipeline	shell loop: Blake2s hash + Ed25519 verify	end-to-end transaction-validation throughput	ops/s

On a reference 6c/12t modern desktop (Ryzen 5 5600 / Core i5-12400 class) the order-of-magnitude figures we calibrate against are ~600 MB/s Blake2s, ~30k verify/s for both ECDSA and Ed25519, ~6 GB/s random-memory throughput, ~1800 events/s on the Pact proxy, and ~12k ops/s on the hash-and-verify pipeline. Older or constrained CPUs land below; newer enthusiast chips land above. These figures are the “1.0” line that the bucket label is calibrated against (see §15.7).

The six workloads add roughly thirty to thirty-five seconds to the total bench runtime (about fifteen seconds for the three crypto workloads, five seconds for memory, ten seconds for the Pact proxy, five seconds for the hash-verify pipeline). That is a small price for a CPU score that’s actually about chainweb instead of about prime numbers.

15.7 · Chainweb bucket label

Alongside the numeric chainweb score, every benchmark stamps a one-word bucket label describing how chainweb-friendly the CPU is. The label is a quick glance for operators who don’t want to read raw per-workload numbers; the score itself remains the source of truth for the formula.

Label	Score range	What it means	Visual cue
strong	score ≥ 1.0	at or above the reference 6c/12t modern desktop	★
good	0.75 ≤ score < 1.0	solid chainweb performance, slightly below the reference	✓
adequate	0.5 ≤ score < 0.75	usable but not optimal under sustained load	~
weak	score < 0.5	significantly below reference; chainweb-node may struggle under load	!
—	null (no bucket)	record predates v.G.1.3 — re-bench to populate	·

The label means “how chainweb-friendly is this CPU” and nothing more. It is NOT a translation of the score into Geekbench 6, PassMark, Cinebench, or any other external benchmark’s units. The thresholds are calibrated by running this very suite on a reference 6c/12t modern desktop, not by regression-fitting against another vendor’s numbers. Don’t expect a “good” score here to map onto a particular Geekbench tier; the two suites measure different things.
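
The threshold table translates directly into a lookup. The type and function names in this sketch are hypothetical.

```typescript
type ChainwebBucket = "strong" | "good" | "adequate" | "weak";

// Returns null for records that predate v.G.1.3 and carry no chainweb score.
function chainwebBucket(score: number | null): ChainwebBucket | null {
  if (score === null) {
    return null;
  }
  if (score >= 1.0) {
    return "strong"; // at or above the reference desktop
  }
  if (score >= 0.75) {
    return "good";
  }
  if (score >= 0.5) {
    return "adequate";
  }
  return "weak";
}
```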

Records benchmarked before v.G.1.3 don’t carry a bucket: the chainweb workloads weren’t run, so there’s nothing to label. The pill is simply omitted in that case, and the CPU details panel shows a quiet hint prompting a re-bench when the operator wants the new signal. Existing nodes are not forced to re-bench; the legacy score formula continues to drive the displayed ServerScore for those records (see §15.10 for the broader upgrade procedure).

15.8 · Instructions-per-cycle in the score

Modern CPUs differ wildly in how many useful instructions they retire per clock cycle. Two chips at the same GHz can be a factor of two apart in real throughput because one executes more ops per cycle than the other. Chainweb-node, with its tight inner loops in the consensus path, benefits directly from high IPC. v.G.1.3 promotes the IPC value the bench has always captured to a scored signal: it contributes roughly 10% of the chainweb-aggregate score, calibrated against a reference value of BASELINE_CPU_IPC_REF = 1.5 instructions per cycle (the figure produced by the same 6c/12t desktop the other workloads are calibrated against).

IPC is captured by the kernel’s perf stat tool. Ubuntu 22.04 and later default the kernel’s perf_event_paranoid setting to a level that blocks unprivileged perf invocation, which means the IPC capture had been silently failing on a meaningful slice of operator boxes; nobody noticed because IPC wasn’t scored. v.G.1.3 fixes this by adding the canonical perf binary path (and the distribution-variant kernel-tools path Ubuntu uses) to the hub’s sudoers list, so sudo -n perf stat now succeeds regardless of paranoia level. The existing sudoers refresh job applies the new entry on the next benchmark, so no operator-side action is required.

One-time step-change after upgrade. Operators upgrading from a previous version may see their IPC value change on the first re-bench because the previous configuration silently captured no IPC at all. This is expected and the score formula handles it gracefully: records without IPC fall back to the legacy formula shape without IPC, and records with IPC blend it in at the 10% weight. No action required either way.

When perf-stat is unavailable on a particular box (older kernel, locked-down distribution, missing tooling), the IPC value is recorded as null for that run. The score formula’s null-fallback path takes over and the chainweb-aggregate is produced from the remaining signals; no run is aborted just because IPC didn’t capture.
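
One way to picture the 10 % blend and the null fallback is the sketch below. The exact blending rule is an assumption: the text gives only the approximate weight and the reference value, so the linear mix, the workloadScore input, and the function name are all hypothetical.

```typescript
const BASELINE_CPU_IPC_REF = 1.5; // instructions per cycle on the reference desktop
const IPC_WEIGHT = 0.10;          // "roughly 10%" per the text

// workloadScore stands in for the aggregate of the remaining signals.
function chainwebAggregate(workloadScore: number, ipc: number | null): number {
  if (ipc === null) {
    // Null fallback: perf stat was unavailable, so IPC is simply left out.
    return workloadScore;
  }
  const ipcScore = ipc / BASELINE_CPU_IPC_REF; // 1.0 at the reference value
  // Assumed linear blend of the two signals.
  return (1 - IPC_WEIGHT) * workloadScore + IPC_WEIGHT * ipcScore;
}
```

Under this assumed blend, a box on the reference line for both signals scores the same with or without IPC, which is one way the one-time step-change after upgrade could stay small.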

15.9 · Reduced contention penalty for single-thread workloads

The hub’s contention-penalty machinery — the multiplier between 0.3 and 1.0 derived from steal-time sampling, run-to-run coefficient of variation, and multi-thread scaling efficiency — exists to catch oversubscribed VPSes whose nominal CPU count doesn’t match what the box can actually deliver under load. It works on the multi-thread sysbench number because all three signals genuinely affect multi-thread performance: steal-time eats wall-clock; high run-to-run variance shows the scheduler is fighting; poor scaling efficiency shows the cores aren’t independent.

For the six chainweb workloads (and IPC), only two of those three signals make sense. Each chainweb workload is single-threaded by design — there are no multiple threads to scale across — so scaling efficiency simply does not apply. The penalty for these workloads is therefore reduced to a steal-time + run-variance only shape; the scaling-efficiency component is excluded. Both kept components still bite when they should: a noisy neighbour stealing CPU still penalises the score; a wildly variable single-thread run still penalises the score; but an honest single-thread workload doesn’t get docked for failing a multi-thread test it never ran.

Score component	Penalty signals applied	Penalty shape
sysbench multi-thread	steal + CV + scaling efficiency	full (preserved verbatim)
six chainweb workloads	steal + CV only	reduced (no scaling component)
IPC (perf-stat)	steal + CV only	reduced (no scaling component)

The same reduced-penalty rule applies inside the segregated-slice bench: a slice running its chainweb workloads inside a systemd-run --scope cgroup only has steal and CV to penalise against, since scaling is meaningless under a per-slice CPU quota. See §15.13 for the broader segregated-bench shape.
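
The two penalty shapes in the table can be sketched as follows. How the hub actually combines the signals is not stated here, so the per-signal factors and their multiplication are assumptions; only the 0.3 to 1.0 range and the inclusion or exclusion of the scaling component come from the text. All names are hypothetical.

```typescript
// Each factor is assumed to lie in (0, 1], where 1.0 means "no penalty".
interface PenaltySignals {
  stealFactor: number;   // derived from steal-time sampling
  cvFactor: number;      // derived from run-to-run coefficient of variation
  scalingFactor: number; // derived from multi-thread scaling efficiency
}

type PenaltyShape = "full" | "reduced";

const PENALTY_FLOOR = 0.3;
const PENALTY_CEILING = 1.0;

function contentionPenalty(signals: PenaltySignals, shape: PenaltyShape): number {
  // Steal and CV apply to every component.
  let raw = signals.stealFactor * signals.cvFactor;
  if (shape === "full") {
    // Scaling efficiency only applies to the multi-thread sysbench number.
    raw *= signals.scalingFactor;
  }
  return Math.min(PENALTY_CEILING, Math.max(PENALTY_FLOOR, raw));
}
```

With a clean steal and CV reading but poor scaling (scalingFactor 0.5), the multi-thread number takes a 0.5 multiplier while the single-thread workloads and IPC keep 1.0.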

15.10 · Upgrade procedure (when the formula changes)

Formula changes happen — new dimension added, weight rebalanced, a penalty retired. To propagate a formula change safely:

Ship the code. Update lib/stoic-power.ts; constants carry clear comments naming the version that introduced them.
Fleet Recompute (no SSH). Ancient Admin hits Recompute ServerScores (fleet) on /admin/fleet-maintenance. The hub walks every node, re-parses each stored breakdown, re-scores under the new formula, and stamps the latest successful run. Sub-second for thousands of nodes. No re-benchmark required for dimensions that only depend on the existing raw measurements.
Re-benchmark only when raw measurements change. New dimension (e.g. a specific new disk test) → fleet Recompute can’t fabricate the missing raw number. Operators re-benchmark at their own pace; until they do, the new dimension contributes 0.

Under the v0.7.8t single-score model, every node’s server_score is the latest successful run. benchmark_runs is trimmed to the last 3 rows as a diagnostic log. There is no historical run selection: pick the recompute path for formula changes and the re-benchmark path for fresh measurements.
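
The difference between the recompute path and the re-benchmark path can be sketched with a toy formula. Everything here is illustrative: the rule shape, the linear value-over-reference scoring, and the names are assumptions, not the contents of lib/stoic-power.ts.

```typescript
// Raw measurements stored in a run's breakdown. Absent keys were never measured.
type RawBreakdown = { [measurement: string]: number | undefined };

// Hypothetical rule: which raw number feeds a dimension, its reference, and its weight.
interface DimensionRule {
  rawKey: string;
  reference: number;
  weight: number;
}

// Recompute path: re-score the stored raw numbers under the current rules.
function restamp(raw: RawBreakdown, formula: DimensionRule[]): number {
  let total = 0;
  for (const rule of formula) {
    const value = raw[rule.rawKey];
    if (value === undefined) {
      // Recompute cannot fabricate a missing raw number: contributes 0
      // until the operator re-benchmarks.
      continue;
    }
    total += rule.weight * (value / rule.reference);
  }
  return total;
}
```

A weight rebalance only needs restamp with the new rule list. A rule that points at a raw key older runs never recorded stays at 0 for those nodes until a fresh benchmark supplies the measurement.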

15.11 · Restamp vs re-benchmark

Restamp: re-derive the score from the stored breakdown under today’s formula. No SSH, no benchmark, no downtime. Use when the formula changed but the raw measurements are still valid. Available on every node and fleet-wide at /admin/fleet-maintenance.
Re-benchmark: full benchmark run. Use when hardware changed (disk upgrade, RAM bump), when the oversubscription pattern shifted, or when a new dimension landed that needs fresh raw measurements. ~3–5 min auto-stop+restart downtime.
Clear + null stamp (danger zone): destroys every benchmark row and nulls the stamped ServerScore. Accrual pauses until a fresh benchmark runs. Only use when the old data is actively misleading (e.g. years-old benchmarks from before a major refactor).

15.12 · Dimension sources at a glance

Want to know exactly what number went into each slot? Each tile on the Scoring card is clickable — expand it for a per-component detail panel showing CPU model + clock + sysbench single/multi, disk per-sub-measure pass/fail, network per-region speed + ping + unreachable flag, RAM calculation, and commitment ratio.

The scoring formula is not a black box. Every number you see on the card corresponds directly to a measurement the operator can reproduce locally — same sysbench flags, same fio job file, same curl endpoints.

15.13 · Segregated containers (Container Score)

A host running in segregated mode hosts one or more chainweb children inside their own Docker containers with cgroup limits (CPU shares + memory cap). The host itself no longer runs chainweb — the children do. Each child publishes a Container Score using the same five-tile layout as a full-host ServerScore, with the following sources:

CPU + RAM: measured by running sysbench and stress-ng directly on the host under a transient cgroup scope (systemd-run --scope with CPUQuota + MemoryMax set to the slice’s allotment). The score reflects how the slice will actually perform under its own constraints, not the parent host’s aggregate. sysbench cpu (single + multi-thread events/sec) + stress-ng vm-method memmove (RAM bandwidth) are written into segregated_slice_benchmarks. Pre-l5w the bench ran inside an alpine ghost container that needed apk add sysbench on every run; on hosts where docker’s outbound networking couldn’t reach alpine.org packages, the install failed silently and the slave persisted CPU 0.0. l5w switches to host-direct execution and fails loudly if sysbench can’t be parsed.
Disk: taken from the host-level drive benchmark (host_drive_benchmarks) divided by the number of children sharing that drive. Adding a second child on the same drive halves the disk subscore for both at the next read; removing it doubles the subscore back. No re-benchmark is needed when the sharer count changes.
Network: pulled from the parent host’s preserved breakdown (8-region speed test). Network is shared by every container on the host, so the full host measurement applies to each child.
Commitment: live, per-child commit ÷ fleet-wide minimum. Same logarithmic curve as full-host.

Re-running a benchmark from a chainweb child’s row (Run benchmark on the Scoring card) enqueues a fresh slice CPU/RAM bench (host-direct sysbench + stress-ng under a systemd-run cgroup scope) plus a fresh drive bench at host level; network and commitment require no measurement. The parent (segregated host) row itself has no benchmark to run; its display score is the sum of its children’s Container Scores.
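
The disk-sharing rule and the parent roll-up can be sketched as below. The function names are hypothetical, and the sketch assumes the host-level drive score is divided evenly across the children on that drive, as described above.

```typescript
// Disk subscore for one child: the host-level drive score split across sharers.
function childDiskSubscore(hostDriveScore: number, childrenOnDrive: number): number {
  if (childrenOnDrive < 1) {
    throw new Error("a child must be counted among the sharers of its own drive");
  }
  return hostDriveScore / childrenOnDrive;
}

// The segregated host has no benchmark of its own; it displays the sum of its children.
function parentDisplayScore(childContainerScores: number[]): number {
  return childContainerScores.reduce((acc, s) => acc + s, 0);
}
```

Because the division happens at read time, moving from one child to two on a drive changes both subscores without any new fio run.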