
Chapter 8

Scaling plan

7-tier trigger-gated plan to 500k operators. Current tier state.

The hub is designed to scale from today’s ≤ 50 active nodes to six-figure operator counts; a half-million-participant ceiling is the explicit north star. Seven tiers get us there, each triggered by real utilisation, not speculation. Premature scaling is how projects accumulate complexity without customers; this plan refuses to pay for complexity before the customer count demands it.

Why a half-million is realistic. StoaChain is a proof-of-work network descended from the Kadena chainweb architecture (developed by the former Kadena LLC team). Chainweb parallelises a PoW chain across 10 independent chains that cross-reference each other through Merkle proofs — so every chain carries an independent tip-poisoning floor, every chain accepts its own transactions, and throughput scales multiplicatively with the chain count. Where a single-chain PoW network caps at a few hundred TPS, chainweb absorbs the load of hundreds of thousands of miners and operators because the work is sharded at the protocol layer, not rebuilt at the hub layer. The scaling ladder below leans hard on that architecture: by T7 the daily Stoicism mint fans out to 10 chains of ~2 M gas each and finishes in under five minutes, because the blockchain itself is the parallelism.

16.1 · Where we are today (honest audit)

We ship T2-capable from day one. The audit below enumerates every T2 ceiling-raising requirement against the current codebase. All gaps are closed as of v0.7.8z14; headroom to T2’s ~350-node ceiling is intact.

| T2 requirement | Status | Evidence |
| --- | --- | --- |
| SQLite WAL mode | ✓ done | db/index.ts:22, db.pragma('journal_mode = WAL') |
| Concurrent job execution | ✓ done | worker/index.ts MAX_PARALLEL_JOBS=32 with per-node lock + per-kind caps (v0.7.8q) |
| Probe result cache | ✓ done | node_chainweb_tip table populated every 30s by the tip-poller (v0.7.8z9) |
| Bulk-probe scheduler with rate-limit | ✓ done | lib/chainweb-tip-poller.ts: 8-way concurrency cap, 30s cadence, re-entrancy guard |
| Earnings snapshot pagination | ✓ done | pages/admin/earnings.tsx NODE_PAGE_SIZE=20; single-active scope simplifies cross-account view |
| SSH connection pool (multiplexed, persistent) | ✓ done | lib/ssh.ts in-module pool (v0.7.8z14). Idle TTL 5 min, max age 1 h, reaper on 60s interval. |
| Rich List materialised hourly | ✓ done | lib/rich-list-mv.ts + migration 035 (v0.7.8z14). Worker refreshes rich_list_mv every hour. |

Conclusion. The hub is full T2 as of v0.7.8z14. The per-tier deep-dive below is the roadmap for every step from here to half a million.

16.2 · The ladder at a glance

| Tier | Target scale | Headline change |
| --- | --- | --- |
| T1 | ≤ 50 nodes | Single hub process, SQLite, direct SSH per op. Baseline. |
| T2 (current) | ≤ 350 nodes | WAL + parallel workers + probe cache + bulk-probe scheduler + SSH pool + rich-list MV. WE ARE HERE (fully). |
| T3 | ≤ 1 500 nodes | Job queue + job logs on disk + composite indexes on hot paths. Earnings streams move to SSE. |
| T4 | ≤ 7 000 nodes | Postgres replaces SQLite. Redis cache + BullMQ queue. OpenTelemetry. Multiple worker hosts per hub. |
| T5 | ≤ 30 000 nodes | Federated hubs per region; cross-region reconciliation; regulatory data-residency support. |
| T6 | ≤ 200 000 nodes | Agent-pull protocol: nodes pull their own work from the hub instead of hub-initiated SSH. |
| T7 | 500 000+ nodes | Global coordinator + horizontal hub fleet. Daily Stoicism mint sharded across 10 StoaChain chains (~5 min sweep). |
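
The ladder can also be read as data. The sketch below is illustrative only: the ceilings are copied from the table above, while the names `LADDER` and `tierFor` are hypothetical and not taken from the hub codebase.

```typescript
// Tier ceilings copied from the ladder table; T7 is treated as open-ended.
interface Tier {
  name: string;
  maxNodes: number;
}

const LADDER: Tier[] = [
  { name: "T1", maxNodes: 50 },
  { name: "T2", maxNodes: 350 },
  { name: "T3", maxNodes: 1500 },
  { name: "T4", maxNodes: 7000 },
  { name: "T5", maxNodes: 30000 },
  { name: "T6", maxNodes: 200000 },
  { name: "T7", maxNodes: Infinity },
];

// Returns the lowest tier whose ceiling covers the given active-node count.
function tierFor(activeNodes: number): string {
  for (const tier of LADDER) {
    if (activeNodes <= tier.maxNodes) return tier.name;
  }
  return "T7"; // not reached while the last ceiling is Infinity
}
```

Under this reading, `tierFor(51)` returns "T2" and anything above 200 000 nodes falls into T7. The function only says which tier a node count needs; moving between tiers is still gated by the triggers described in each section.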

16.3 · T1 — the baseline (≤ 50 nodes)

Trigger: day one.

What breaks past ~50: single SQLite writer serialises every job + probe on the same connection. SSH handshakes dominate probe time. The landing page starts lagging on first paint.

What T1 ships: one Node.js process, SQLite single-file DB, direct SSH per job, no cache layer, no queue. Simplest thing that works.

Status: retired. We jumped directly to T2 during v0.7.8 work because the gap cost was small and the T2 ceiling is 7× larger.

16.4 · T2 — parallelism + caching (≤ 350 nodes) · CURRENT

Trigger: T1 saturation. Hit at ~15 nodes during v0.7.6–v0.7.8 work because concurrent benchmark + probe pressure serialised the writer.

What breaks past ~350: the in-process job queue grows unbounded under sustained multi-node activity; the SSH pool’s warm-hit ratio drops once the gap between operations on a node exceeds the pool’s idle TTL; earnings-page pagination hits the DB twice per page.

What T2 shipped:

  • SQLite WAL. Readers don’t block writers, writers don’t block readers. Single-line pragma.
  • Concurrent workers. MAX_PARALLEL_JOBS = 32, per-node lock prevents two jobs touching the same target, per-kind caps stop one kind saturating every slot.
  • Chainweb tip cache. node_chainweb_tip table refreshed every 30 s by the tip-poller. Page reads never SSH.
  • Bulk-probe scheduler with rate-limit. 8-way concurrency cap, re-entrancy guard so an overrun tick can’t stomp the previous one.
  • SSH connection pool. Persistent ssh2 connections keyed by user@host:port; idle TTL 5 min, max age 1 h, 60 s reaper. Handshake cost amortises to zero.
  • Rich-list materialised view. rich_list_mv refreshed hourly; aggregate page read becomes a single-row lookup.
  • Single-active-scope session model. One email is "active" at a time; cross-account views simplify + queries stay bounded.
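
The three limits in the concurrent-workers bullet compose into a single admission check. The following is a minimal sketch, not the hub's worker code: `MAX_PARALLEL_JOBS = 32` comes from the text, while the `Admission` class, the job shape, and the per-kind cap values are assumptions made for illustration.

```typescript
type Job = { id: string; nodeId: string; kind: string };

const MAX_PARALLEL_JOBS = 32; // from the text

// Hypothetical per-kind caps; the real values are not given in this chapter.
const KIND_CAPS: Record<string, number> = { benchmark: 4, probe: 16 };

class Admission {
  private running: Job[] = [];

  // True if the job may start now under all three limits.
  canStart(job: Job): boolean {
    // Global slot limit.
    if (this.running.length >= MAX_PARALLEL_JOBS) return false;
    // Per-node lock: at most one running job per target node.
    if (this.running.some((r) => r.nodeId === job.nodeId)) return false;
    // Per-kind cap: kinds without an entry are treated as uncapped.
    const cap = KIND_CAPS[job.kind];
    const sameKind = this.running.filter((r) => r.kind === job.kind).length;
    return cap === undefined || sameKind < cap;
  }

  // Starts the job if admitted; returns whether it was started.
  start(job: Job): boolean {
    if (!this.canStart(job)) return false;
    this.running.push(job);
    return true;
  }

  // Releases the slot, the node lock and the kind count held by the job.
  finish(jobId: string): void {
    this.running = this.running.filter((r) => r.id !== jobId);
  }
}
```

A rejected job is simply left in the queue and retried when a slot frees up, which keeps the check side-effect free.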

Status: done as of v0.7.8z14. Headroom to 350 active nodes.
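
The SSH pool's eviction rule (idle TTL 5 min, max age 1 h, 60 s reaper) can be sketched without any SSH library. This is an assumption-laden illustration, not the contents of lib/ssh.ts: the `PooledConn` shape, `poolKey`, and `reap` are hypothetical names, and the clock is passed in so the logic stays testable.

```typescript
const IDLE_TTL_MS = 5 * 60 * 1000; // idle TTL 5 min (from the text)
const MAX_AGE_MS = 60 * 60 * 1000; // max age 1 h (from the text)

// Hypothetical shape of a pooled connection; the real pool wraps ssh2 clients.
interface PooledConn {
  createdAt: number;
  lastUsedAt: number;
  close(): void;
}

// Pool keyed by "user@host:port", as described in the T2 list.
const pool: Record<string, PooledConn> = {};

function poolKey(user: string, host: string, port: number): string {
  return `${user}@${host}:${port}`;
}

// Closes and evicts entries that have been idle too long or are too old.
// Returns the evicted keys. Intended to run once per reaper tick.
function reap(now: number): string[] {
  const evicted: string[] = [];
  for (const key of Object.keys(pool)) {
    const conn = pool[key];
    const idleTooLong = now - conn.lastUsedAt > IDLE_TTL_MS;
    const tooOld = now - conn.createdAt > MAX_AGE_MS;
    if (idleTooLong || tooOld) {
      conn.close();
      delete pool[key];
      evicted.push(key);
    }
  }
  return evicted;
}
```

In a running hub this would be driven by a 60 s timer calling `reap(Date.now())`. The max-age check evicts even busy connections, which bounds how long any single session lives.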

16.5 · T3 — queue + observability seams (≤ 1 500 nodes)

Trigger: 350-node ceiling crossed for 7 sustained days. No panic-scaling: crossing the ceiling for a day doesn’t move us.
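
The "7 sustained days" rule is easy to state precisely. A minimal sketch, assuming the hub keeps one active-node count per day (the `dailyActiveNodes` series and the function name are hypothetical):

```typescript
// True when the most recent `days` daily counts all exceed the ceiling.
// A single day back under the ceiling resets the streak.
function ceilingSustained(
  dailyActiveNodes: number[],
  ceiling: number,
  days: number = 7
): boolean {
  if (dailyActiveNodes.length < days) return false;
  const recent = dailyActiveNodes.slice(-days);
  return recent.every((count) => count > ceiling);
}
```

With a ceiling of 350, seven consecutive days above 350 trip the trigger, while six days above followed by one day at 340 do not.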

What breaks past ~1 500: job logs fight structured state for WAL write budget — the two share one SQLite file; slow-query log on stoicism_events + jobs shows unindexed paths under load; live earnings polling costs a full page refresh per operator.

What T3 lands:

  • Job logs off the DB. Each job gets a flat file under data/jobs/<id>.log; the DB keeps only the structured state row. WAL write budget frees up for hot paths.
  • Composite indexes. Driven by slow-query logging — concrete candidates: (owner_email, ran_at) on stoicism_events; (kind, status, scheduled_at) on jobs; (node_id, kind, created_at) on backups.
  • JobQueue seam. Today’s in-process queue extracted into a named module with a pluggable back-end. T4’s BullMQ swap becomes a one-module change.
  • DbAdapter seam. The getDb() singleton grows behind an interface — methods are the same, implementation is SQLite today and Postgres-pg next tier.
  • Live earnings via SSE. Hub pushes Stoicism delta events as they happen instead of operators re-polling.
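
The JobQueue seam described in the list can be pictured as an interface with a swappable back-end. This is a sketch under assumptions: the interface shape, `InMemoryJobQueue`, and `drain` are hypothetical and do not reproduce the lib/jobs helpers; methods are async so that a networked back-end could implement the same contract.

```typescript
interface QueuedJob {
  id: string;
  kind: string;
  payload: unknown;
}

// The seam: call sites depend on this interface only.
interface JobQueue {
  enqueue(job: QueuedJob): Promise<void>;
  dequeue(): Promise<QueuedJob | undefined>;
  size(): Promise<number>;
}

// In-process FIFO back-end, standing in for today's queue.
class InMemoryJobQueue implements JobQueue {
  private items: QueuedJob[] = [];

  async enqueue(job: QueuedJob): Promise<void> {
    this.items.push(job);
  }

  async dequeue(): Promise<QueuedJob | undefined> {
    return this.items.shift();
  }

  async size(): Promise<number> {
    return this.items.length;
  }
}

// Example call site: it takes the interface, so swapping the back-end
// later does not touch this function. Returns the number of jobs handled.
async function drain(
  queue: JobQueue,
  handle: (job: QueuedJob) => void
): Promise<number> {
  let handled = 0;
  for (let job = await queue.dequeue(); job !== undefined; job = await queue.dequeue()) {
    handle(job);
    handled++;
  }
  return handled;
}
```

The design choice is that only the module constructing the queue knows which back-end is in use; everything else sees `JobQueue`.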

Status: planned. Seams partially in place (job queue already abstracted behind lib/jobs).

16.6 · T4 — Postgres + multi-worker (≤ 7 000 nodes)

Trigger: 1 500-node ceiling crossed for 7 sustained days; at this scale the hub is a real business, not a tool.

What breaks past ~7 000: SQLite’s single-writer model hits a hard wall — no amount of WAL tuning changes that only one process at a time mutates state; the single-leader worker lease becomes the whole-hub throughput ceiling; full-text and multi-dimensional query patterns (aggregate earnings by region, historical timeseries) are already painful on SQLite by this point.

What T4 lands:

  • Postgres replaces SQLite. The DbAdapter seam introduced at T3 makes this a connection-string change, not a port.
  • Redis. Fronts SSH pool cache + SSE fanout; retires the in-process single-leader pattern.
  • BullMQ job queue. Multiple worker processes, multiple hosts; the lease pattern generalises to a Redis primitive.
  • Structured observability. OpenTelemetry traces on every SSH + DB call; structured logs to Loki (or equivalent); production incident triage target: under 15 minutes.
  • Read replicas. Admin dashboard queries go to a replica; write path stays on the primary.

Status: designed-for, not built. Every abstraction introduced before T4 is sized so T4 is incremental, not a rewrite.

16.7 · T5 — federated hub fleet (≤ 30 000 nodes)

Trigger: 7 000-node ceiling + regulatory pressure (GDPR data residency, regional tax attestation, regional latency SLAs).

What breaks past ~30 000: a single hub hosted in one country is a geopolitical single point of failure and a regulatory headache; cross-continent SSH latency becomes meaningful for probe budgets; compliance work can’t be batched — different jurisdictions want different behaviour.

What T5 lands:

  • One hub per region. EU, NA, APAC, LATAM, AF. Nodes register against the regionally-closest hub.
  • Cross-region reconciliation. A thin federation coordinator component replicates authoritative tables (accounts, Stoicism ledger) between hubs; writes stay local, reads are globally consistent within a lag budget.
  • Data-residency policy per region. A GDPR-pinned operator’s data never leaves the EU hub’s database.
  • Region-aware benchmark baselines. EU nodes benchmarked against EU reference loads; bandwidth-expectations adjusted for regional infra norms.
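
One way to combine "regionally-closest hub" with data residency is to let a residency pin override proximity. The sketch below is purely illustrative, since T5 is not built: measuring closeness by round-trip latency, the `NodeRegistration` shape, and `assignHub` are all assumptions.

```typescript
type Region = "EU" | "NA" | "APAC" | "LATAM" | "AF";

const REGIONS: Region[] = ["EU", "NA", "APAC", "LATAM", "AF"];

interface NodeRegistration {
  // Hypothetical measured round-trip time from the node to each regional hub.
  latencyMs: Record<Region, number>;
  // Set for operators whose data must stay in one region.
  pinnedRegion?: Region;
}

// Picks the hub a node registers against.
function assignHub(reg: NodeRegistration): Region {
  // A residency pin wins over latency.
  if (reg.pinnedRegion !== undefined) return reg.pinnedRegion;
  let best: Region = REGIONS[0];
  for (const region of REGIONS) {
    if (reg.latencyMs[region] < reg.latencyMs[best]) best = region;
  }
  return best;
}
```

A node with the lowest latency to NA but an EU pin registers with the EU hub, trading some probe latency for residency.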

Status: architectural sketch. Not built; named so the T4 abstractions don’t paint us into a corner.

16.8 · T6 — agent-pull protocol (≤ 200 000 nodes)

Trigger: hub-initiated SSH fanout becomes a scaling ceiling. Even at 8-way concurrency with a persistent pool, cycling through every node for a routine probe takes minutes, not seconds.

What breaks past ~200 000: outbound SSH fanout doesn’t scale linearly — residential routers rate-limit outbound connections, cloud egress costs grow, and nodes behind NAT/CGNAT become unreachable from the hub without ugly workarounds.

What T6 lands:

  • Agent on every node. Lightweight Node.js or Rust binary; maintains a long-poll or WebSocket to its regional hub.
  • Inverted probe pattern. The agent pushes its heartbeat + metrics + argv; the hub never SSHes for routine ops. SSH remains for privileged administrative actions (bootstrap, key-seat, recovery).
  • NAT traversal for free. Connections originate from the node, so they pass NAT and most firewalls; the hub exposes a single inbound port (the agent control channel).
  • Agent self-update. Hub signs new agent binaries; agent verifies, downloads, exec-replaces. OTA fleet updates.
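
On the hub side, the inverted probe pattern reduces to recording pushed heartbeats and asking which nodes have gone quiet. Since T6 is at research stage, the following is only a sketch: the `Heartbeat` shape, `HeartbeatRegistry`, and the staleness threshold are hypothetical.

```typescript
interface Heartbeat {
  nodeId: string;
  sentAt: number; // agent-side timestamp in milliseconds
  metrics: Record<string, number>;
}

class HeartbeatRegistry {
  private lastSeen: Record<string, Heartbeat> = {};

  // Called when an agent pushes a heartbeat over its control channel.
  // Out-of-order (older) heartbeats are ignored.
  record(hb: Heartbeat): void {
    const prev = this.lastSeen[hb.nodeId];
    if (prev === undefined || hb.sentAt >= prev.sentAt) {
      this.lastSeen[hb.nodeId] = hb;
    }
  }

  // Latest heartbeat for a node, if any has been received.
  latest(nodeId: string): Heartbeat | undefined {
    return this.lastSeen[nodeId];
  }

  // Node ids whose latest heartbeat is older than maxAgeMs, sorted by id.
  // No SSH is involved: silence itself is the signal.
  staleNodes(now: number, maxAgeMs: number): string[] {
    return Object.keys(this.lastSeen)
      .filter((id) => now - this.lastSeen[id].sentAt > maxAgeMs)
      .sort();
  }
}
```

The hub's cost per probe cycle then depends on how many heartbeats arrive, not on how many outbound connections it can open.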

Status: research. Not designed in detail; T5 abstractions are shaped so an agent is a drop-in for the SSH path where it makes sense.

16.9 · T7 — global coordinator (500 000+ nodes)

Trigger: half-a-million participants. Individual regional hubs saturate; one global view becomes necessary for anti-sybil and aggregate economics, but building it at regional level means cross-region round-trips on every read.

What breaks past ~500 000: conflict resolution between regional hubs becomes the bottleneck — two hubs updating the same global account concurrently needs a tie-breaker beyond wall-clock timestamps; daily Stoicism mint coordination across regions needs a single authority to sequence the batched txs.

What T7 lands:

  • Global coordinator process. Arbitrates global state; thin — purely the tie-breaker + mint-sequencer, not on the critical path for operator-facing reads.
  • Horizontal hub fleet within each region. Several hub instances per region share a Postgres + BullMQ cluster; behind a load balancer for operator login.
  • Daily mint sharded across 10 StoaChain chains. The flagship tier. Hub computes per-account Stoicism deltas, shards accounts by the chain that owns their register, and emits one batched update-registers tx per chain in parallel. 500 000 accounts ÷ 10 chains = ~50 000 accounts per chain; the ~80 transactions/day global figure implies that each chain’s batch is in practice split into about 8 txs of ~6 250 account deltas each. At ~2 M gas / tx, every chain completes its sweep in under 5 minutes wall-clock, so the global sweep also takes ~5 minutes. Chainweb absorbs ~80 transactions/day without effort.
  • Validator-network backed reads. By T7 the Wave-2 validator fleet is operational; explorer reads, IPFS pins, UI hosting all come off the validator network, not the hub. The hub shrinks back to its core: operator CRM, scoring engine, mint orchestrator.
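
The mint sharding arithmetic can be checked with a small planner. This is a sketch under stated assumptions: the chain count of 10 is from the text, but the `Delta` shape, `planMint`, and the batch size of 6 250 are illustrative. The batch size is derived from the figures above (500 000 accounts, 10 chains, ~80 transactions/day), not stated in the source.

```typescript
const CHAIN_COUNT = 10; // from the text

interface Delta {
  account: string;
  chainId: number; // chain that owns the account's register, assumed 0..9
  amount: number;
}

// Groups deltas by owning chain, then splits each chain's list into
// batches of at most batchSize deltas (one transaction per batch).
// Result is indexed as plan[chainId][batchIndex].
function planMint(deltas: Delta[], batchSize: number): Delta[][][] {
  const perChain: Delta[][] = [];
  for (let c = 0; c < CHAIN_COUNT; c++) perChain.push([]);
  for (const d of deltas) perChain[d.chainId].push(d);

  return perChain.map((list) => {
    const batches: Delta[][] = [];
    for (let i = 0; i < list.length; i += batchSize) {
      batches.push(list.slice(i, i + batchSize));
    }
    return batches;
  });
}

// Total number of transactions across all chains.
function txCount(plan: Delta[][][]): number {
  return plan.reduce((sum, batches) => sum + batches.length, 0);
}
```

With 500 000 accounts spread evenly over the 10 chains and a batch size of 6 250, the plan holds 8 transactions per chain and 80 in total, which matches the daily figure quoted above. Because each chain's batches are independent of the others, the per-chain sweeps can run in parallel.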

Status: architecturally named; designed for but intentionally unbuilt. Every earlier-tier abstraction is sized so T7 is a set of additions, not a rewrite.

16.10 · Why the ladder tops out at half a million

StoaChain’s 10 chains at ~2 M gas per tx absorb the daily Stoicism mint directly — via register aggregation, not ZK. One mint-and-register-update transaction per chain batches thousands of account deltas into a single on-chain write: the hub computes the daily delta for every account, sorts by the chain that owns each account’s register, and emits one batched update-registers call per chain. At 500 k accounts sharded across the chains, a daily sweep finishes in under 5 minutes wall-clock with ~80 transactions/day globally.

No SNARKs, no Merkle-root client-redeem dance, no off-chain attestation. Register aggregation is the dumb-scalable path: plain Pact math inside a single tx, bounded by gas not cryptography. StoaChain’s gas ceiling is what makes the plan honest beyond T4; without it, a T6-era Merkle-attestation pivot would be necessary at much lower scale. We don’t need it.

Proof of work + chainweb + register aggregation = the half-million ceiling. This is what the former Kadena LLC team built; the hub inherits its scalability for free by leaning on chainweb at the protocol layer instead of reinventing sharding at the application layer.

16.11 · Design discipline from day one

Even at T1 the codebase is careful about module boundaries so tier jumps are cheap. The abstractions introduced before we need them:

  • SshPool: ships as a single-connection passthrough at T1; grows into a real pool at T2 (now). Call sites don’t change.
  • ProbeCache: already a real thing for chainweb tips (node_chainweb_tip); extends naturally to other fields (flags, capacity) when needed.
  • JobQueue: the lib/jobs helpers form the seam; swapping to BullMQ at T4 is a one-module change, not a rewrite.
  • DbAdapter: formalises at T3; the getDb() singleton is the current surface and abstracts cleanly.
  • AgentChannel: named for T6; drops in where the SSH path is routine rather than privileged.

Full internal planning doc lives in plans/v0.8-hub-scalability.md. This public chapter is authoritative for external consumption; the internal doc carries speculative reasoning + benchmarks that don’t need to be in the public surface.