20 — Reliability, throughput & SLOs.

The substrate is queue-driven, backpressured, and circuit-broken. SLOs are declared per surface; capacity planning is per (tenant, lane, adapter) tuple. The platform never silently drops work.

20.1 Queue topology

Fig. 20.1 — Queue topology. Work, retry and DLQ queues exist independently for the reasoning plane and the action plane; they are never merged.

20.2 Backpressure model

Each queue has a soft and a hard threshold. Crossing the soft threshold slows ingestion (HTTP 429 with Retry-After); crossing the hard threshold pauses ingestion entirely.
Per-tenant quotas prevent one tenant from starving another in shared deployments.
Per-agent quotas prevent one agent from starving others within a tenant.
Capacity is published as metrics; tenants can configure alerting thresholds.
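The soft/hard threshold rule can be sketched as a single admission decision per queue. This is a minimal illustration, not the platform's implementation: the names `QueueLimits`, `Admission` and `admit` are hypothetical, and treating a depth equal to the threshold as "crossed" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Admission(Enum):
    ACCEPT = "accept"      # below the soft threshold: ingest normally
    THROTTLE = "throttle"  # soft threshold crossed: answer 429 with Retry-After
    PAUSE = "pause"        # hard threshold crossed: ingestion is paused


@dataclass(frozen=True)
class QueueLimits:
    soft: int  # queue depth at which ingestion is slowed
    hard: int  # queue depth at which ingestion is paused


def admit(depth: int, limits: QueueLimits) -> Admission:
    """Map the current queue depth to an admission decision.

    Assumption: thresholds are inclusive (depth == soft already throttles).
    """
    if depth >= limits.hard:
        return Admission.PAUSE
    if depth >= limits.soft:
        return Admission.THROTTLE
    return Admission.ACCEPT
```

Per-tenant and per-agent quotas would be checked in the same place, before the depth check, so that a single noisy caller is throttled before the shared queue reaches its soft threshold.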

20.3 Retry classifier & backoff

The retry classifier (§11.6) applies the following defaults; tenants can tighten them per surface.

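transient:
  base:     200ms           # delay before the first retry
  factor:   2.0             # growth factor between successive retries
  jitter:   ±50%            # random spread applied to each delay
  cap:      30s             # upper bound on any single delay
  attempts: 5
  preserve: idempotencyKey  # retries reuse the original key
permanent:
  attempts: 0               # never retried
  emit:     Exception
policy:
  attempts: 0               # never retried
  emit:     Exception
  abortOn:  mandatory_step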
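The transient defaults describe exponential backoff with jitter. The sketch below computes the delay before each retry under three assumptions that the configuration does not state: `attempts` counts retries (not total tries), jitter is drawn uniformly, and the cap is applied after jitter. The function names are illustrative.

```python
import random

BASE_S = 0.2      # base: 200ms
FACTOR = 2.0      # factor: 2.0
JITTER = 0.5      # jitter: ±50%
CAP_S = 30.0      # cap: 30s
MAX_RETRIES = 5   # attempts: 5 (assumed to mean retries)


def nominal_schedule() -> list:
    """Delays before jitter: base * factor^(n-1) for retry n = 1..MAX_RETRIES."""
    return [BASE_S * FACTOR ** i for i in range(MAX_RETRIES)]


def backoff_delay(retry: int, rng: random.Random) -> float:
    """Delay in seconds before retry number `retry` (1-based)."""
    if not 1 <= retry <= MAX_RETRIES:
        raise ValueError("retry must be between 1 and MAX_RETRIES")
    nominal = BASE_S * FACTOR ** (retry - 1)
    # Assumption: uniform jitter in [nominal * 0.5, nominal * 1.5].
    jittered = nominal * rng.uniform(1.0 - JITTER, 1.0 + JITTER)
    # Assumption: the cap is applied after jitter, so no delay exceeds it.
    return min(jittered, CAP_S)
```

With these defaults the nominal delays are 200ms, 400ms, 800ms, 1.6s and 3.2s, so the 30s cap is only reached if the base, factor or attempt count is raised.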

20.4 Circuit breakers

Each external dependency (model provider, SoR adapter, retrieval source) is fronted by a circuit breaker:

State	Behaviour	Transition
closed	normal traffic	error rate > threshold → open
open	fail fast; route to fallback if available	after cooldown → half-open
half-open	limited probe traffic	success → closed; fail → open
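The three states in the table can be sketched as a small state machine. This is an illustrative sketch only: the class name, the rolling window of recent calls, the `min_calls` guard, the parameter defaults and the choice of exactly one probe in the half-open state are assumptions, not values from this specification.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    def __init__(self, threshold: float = 0.5, window: int = 20,
                 min_calls: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold    # error rate above which the breaker opens
        self.min_calls = min_calls    # samples required before the rate is trusted
        self.cooldown_s = cooldown_s  # time spent open before probing
        self.clock = clock
        self.state = "closed"
        self._failures = deque(maxlen=window)  # True marks a failed call
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; False means fail fast."""
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() - self._opened_at >= self.cooldown_s:
            self.state = "half-open"  # admit exactly one probe call
            return True
        return False  # open, or half-open with the probe still in flight

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that allow() admitted."""
        if self.state == "half-open":
            if ok:
                self.state = "closed"
                self._failures.clear()
            else:
                self._open()
        elif self.state == "closed":
            self._failures.append(not ok)
            if len(self._failures) >= self.min_calls:
                rate = sum(self._failures) / len(self._failures)
                if rate > self.threshold:
                    self._open()
        # Results arriving while open belong to earlier calls and are ignored.

    def _open(self) -> None:
        self.state = "open"
        self._opened_at = self.clock()
```

A caller checks `allow()` before each request, routes to the fallback when it returns False, and reports the outcome with `record()` otherwise.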

20.5 Capacity planning

runs/sec per tenant per agent class — bounded by sandbox pool size.
tokens/sec per tenant per lane — bounded by provider quota.
cost/sec per tenant — bounded by tenant budget.
actions committed/sec per adapter — bounded by SoR rate limits.
retrieval queries/sec per corpus — bounded by index sharding.
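One way to work with these dimensions is to compare each observed rate against its bound and flag the ones running close to the limit. The sketch below is hypothetical throughout: the `Dimension` type, the 80% warning level and the sample figures in the usage note are illustrative and do not come from this specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Dimension:
    name: str               # throughput dimension, e.g. "tokens/sec"
    key: Tuple[str, ...]    # planning key, e.g. (tenant, lane)
    observed: float         # measured rate, units per second
    bound: float            # limit imposed by the bounding resource

    @property
    def utilisation(self) -> float:
        """Fraction of the bound currently in use."""
        return self.observed / self.bound


def near_limit(dims: List[Dimension], warn_at: float = 0.8) -> List[Dimension]:
    """Return the dimensions whose utilisation is at or above `warn_at`."""
    return [d for d in dims if d.utilisation >= warn_at]
```

For example, a lane consuming 9,000 tokens/sec against a provider quota of 10,000 would be flagged at 90% utilisation, while an adapter committing 20 actions/sec against a limit of 100 would not.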

20.6 SLO surfaces

Surface	Default SLO	Window
Ingest acceptance (push)	p99 ≤ 1s · success ratio ≥ 99.9%	30d
Run start latency (queued → running)	p99 ≤ 5s	30d
Tool dispatch (PDP roundtrip)	p99 ≤ 50ms	30d
Reasoning lane (reasoning.fast)	p99 ≤ 2s	30d
Action commit (transactional)	p99 ≤ 5s · success ratio ≥ 99.5%	30d
Action commit (flat-file)	p99 ≤ 1 batch interval · success ratio ≥ 99.9%	30d
Audit append	p99 ≤ 200ms · loss = 0	any
Replay reconstruction	p99 ≤ 30s for runs ≤ 30d old	any

20.7 Error budgets

Each SLO has an associated error budget. Burn-rate monitors gate releases (§21): a fast-burn breach pauses promotion of new agent / tool / model versions until the budget recovers.
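Burn rate is conventionally the observed error ratio divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO window. The sketch below shows that arithmetic and a promotion gate. The fast-burn threshold of 14.4 is a commonly used value (2% of a 30-day budget spent in one hour) and is an assumption here, as are the function names; the specification does not state which threshold the platform uses.

```python
FAST_BURN_THRESHOLD = 14.4  # assumption: 2% of a 30d budget burned in 1h


def error_budget(slo: float) -> float:
    """Allowed failure ratio for an SLO target such as 0.999."""
    return 1.0 - slo


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error ratio relative to the error budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad / total) / error_budget(slo)


def promotion_allowed(bad: int, total: int, slo: float,
                      threshold: float = FAST_BURN_THRESHOLD) -> bool:
    """Gate a release: promotion pauses while the fast-burn threshold is breached."""
    return burn_rate(bad, total, slo) < threshold
```

For the ingest acceptance SLO (success ratio ≥ 99.9%), 20 failures in 1,000 requests is a burn rate of about 20, which would pause promotion under this threshold; 5 failures in 1,000 is a burn rate of about 5, which would not.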

20.8 Loss prevention guarantees

Ingest is durable before acknowledgement; no acknowledged signal is lost.
Action staging is durable before the staged response is returned.
Audit append is durable before the operation that triggered it is acknowledged.
Crashes during commit are recovered via idempotency-key re-presentation; commit is exactly-once-effective.
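The last guarantee relies on the committing side re-presenting the same idempotency key after a crash, and on the receiving side treating a repeated key as a no-op. The sketch below simulates that with an in-memory stand-in; `SystemOfRecord`, `commit` and the balance example are hypothetical, and a real adapter would need the key and the effect to be persisted atomically.

```python
from typing import Dict


class SystemOfRecord:
    """Hypothetical SoR that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: Dict[str, int] = {}  # idempotency key -> recorded result

    def apply(self, key: str, amount: int) -> int:
        # A repeated key returns the recorded result without a second effect.
        if key in self._results:
            return self._results[key]
        self.balance += amount
        self._results[key] = self.balance
        return self.balance


def commit(sor: SystemOfRecord, key: str, amount: int,
           crash_before_ack: bool = False) -> int:
    """Apply an action; optionally simulate a crash after the effect landed."""
    result = sor.apply(key, amount)
    if crash_before_ack:
        raise RuntimeError("simulated crash after apply, before acknowledgement")
    return result


sor = SystemOfRecord()
try:
    commit(sor, "key-1", 100, crash_before_ack=True)  # effect applied, ack lost
except RuntimeError:
    pass
recovered = commit(sor, "key-1", 100)  # recovery re-presents the same key
```

After recovery the effect has been applied once: the second call returns the result recorded by the first, and the balance is unchanged by the replay.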
