20 — Reliability, throughput & SLOs.

The substrate is queue-driven, backpressured, and circuit-broken. SLOs are declared per surface; capacity planning is per (tenant, lane, adapter) tuple. The platform never silently drops work.

20.1 Queue topology

Fig. 20.1 — Queue topology. Work, retry and DLQ queues exist independently for the reasoning plane and the action plane; they are never merged.

20.2 Backpressure model

  • Each queue has a soft and hard threshold; soft slows ingestion (429 / Retry-After), hard pauses ingestion entirely.
  • Per-tenant quotas prevent one tenant from starving another in shared deployments.
  • Per-agent quotas prevent one agent from starving others within a tenant.
  • Capacity is published as metrics; tenants can configure alerting thresholds.
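The soft/hard threshold rule above can be sketched as a small admission function. This is a minimal illustration, not the platform's implementation: `QueueThresholds`, `Admission`, and `admit` are hypothetical names, and measuring queue fill by depth is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Admission(Enum):
    ACCEPT = "accept"      # below the soft threshold
    THROTTLE = "throttle"  # soft threshold reached: respond 429 with Retry-After
    PAUSE = "pause"        # hard threshold reached: ingestion is paused


@dataclass(frozen=True)
class QueueThresholds:
    soft: int  # hypothetical depth at which ingestion is slowed
    hard: int  # hypothetical depth at which ingestion is paused entirely


def admit(depth: int, thresholds: QueueThresholds) -> Admission:
    """Map the current queue depth to an admission decision."""
    if depth >= thresholds.hard:
        return Admission.PAUSE
    if depth >= thresholds.soft:
        return Admission.THROTTLE
    return Admission.ACCEPT


if __name__ == "__main__":
    limits = QueueThresholds(soft=800, hard=1000)
    for depth in (100, 800, 1000):
        print(depth, admit(depth, limits).value)
```

Checking the hard threshold first keeps the decision unambiguous when a queue jumps past both limits between two observations.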

20.3 Retry classifier & backoff

The retry classifier (§11.6) applies the following defaults; tenants can tighten them per surface.
transient:
  base:     200ms
  factor:   2.0
  jitter:   ±50%
  cap:      30s
  attempts: 5
  preserve: idempotencyKey
permanent:
  attempts: 0
  emit:     Exception
policy:
  attempts: 0
  emit:     Exception
  abortOn:  mandatory_step
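To make the transient defaults concrete, the sketch below computes the delay before each retry. It assumes that attempts are numbered from 1, that jitter is a uniform multiplier in the range 0.5 to 1.5, and that the cap is applied after jitter; none of these details are stated above, and `backoff_delay` is a hypothetical helper.

```python
import random
from typing import Optional

BASE_S = 0.2      # base: 200ms
FACTOR = 2.0      # factor: 2.0
JITTER = 0.5      # jitter: ±50%
CAP_S = 30.0      # cap: 30s
MAX_ATTEMPTS = 5  # attempts: 5


def nominal_delay(attempt: int) -> float:
    """Un-jittered exponential delay in seconds for a 1-indexed attempt."""
    return BASE_S * FACTOR ** (attempt - 1)


def backoff_delay(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Jittered, capped delay in seconds before the given retry attempt."""
    if not 1 <= attempt <= MAX_ATTEMPTS:
        raise ValueError("attempt is outside the retry budget")
    source = rng if rng is not None else random
    jittered = nominal_delay(attempt) * source.uniform(1 - JITTER, 1 + JITTER)
    # Assumption: the cap bounds the final delay, jitter included.
    return min(CAP_S, jittered)


if __name__ == "__main__":
    for n in range(1, MAX_ATTEMPTS + 1):
        print(n, round(nominal_delay(n), 3), round(backoff_delay(n), 3))
```

Under these assumptions the nominal delays are 0.2s, 0.4s, 0.8s, 1.6s and 3.2s, so the 30s cap is never reached within five attempts at the default base and factor.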

20.4 Circuit breakers

Each external dependency (model provider, SoR adapter, retrieval source) is fronted by a circuit breaker:
| State | Behaviour | Transition |
|---|---|---|
| closed | normal traffic | error rate > threshold → open |
| open | fail fast; route to fallback if available | after cooldown → half-open |
| half-open | limited probe traffic | success → closed; fail → open |
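The three states and their transitions can be sketched as a small state machine. This is an illustrative sketch under stated assumptions: the error rate is measured over a fixed-size window of recent calls, "limited probe traffic" is modelled as a single in-flight probe, and the class name, parameters and default values are all hypothetical. Fallback routing is left to the caller.

```python
import time
from collections import deque
from enum import Enum
from typing import Callable, Deque


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.5, window: int = 20,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.state = State.CLOSED
        self._clock = clock
        self._results: Deque[bool] = deque(maxlen=window)
        self._opened_at = 0.0
        self._probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a call may proceed; False means fail fast."""
        if self.state is State.OPEN and self._clock() - self._opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN  # after cooldown -> half-open
            self._probe_in_flight = False
        if self.state is State.OPEN:
            return False
        if self.state is State.HALF_OPEN:
            if self._probe_in_flight:
                return False  # only one probe at a time
            self._probe_in_flight = True
        return True

    def record(self, success: bool) -> None:
        """Report the outcome of a call that allow() let through."""
        if self.state is State.HALF_OPEN:
            self._probe_in_flight = False
            if success:
                self._close()  # success -> closed
            else:
                self._open()   # fail -> open
            return
        if self.state is State.OPEN:
            return  # ignore late results from calls admitted earlier
        self._results.append(success)
        if len(self._results) == self._results.maxlen:
            error_rate = self._results.count(False) / len(self._results)
            if error_rate > self.error_threshold:
                self._open()   # error rate > threshold -> open

    def _open(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._results.clear()

    def _close(self) -> None:
        self.state = State.CLOSED
        self._results.clear()
```

A caller would check `allow()` before each request to the dependency, route to a fallback (if one exists) when it returns False, and call `record()` with the outcome otherwise. The injectable `clock` exists only to make the cooldown testable.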

20.5 Capacity planning

  • runs/sec per tenant per agent class — bounded by sandbox pool size.
  • token/sec per tenant per lane — bounded by provider quota.
  • cost/sec per tenant — bounded by tenant budget.
  • actions committed/sec per adapter — bounded by SoR rate limits.
  • retrieval queries/sec per corpus — bounded by index sharding.

20.6 SLO surfaces

| Surface | Default SLO | Window |
|---|---|---|
| Ingest acceptance (push) | p99 ≤ 1s · success ratio ≥ 99.9% | 30d |
| Run start latency (queued → running) | p99 ≤ 5s | 30d |
| Tool dispatch (PDP roundtrip) | p99 ≤ 50ms | 30d |
| Reasoning lane (reasoning.fast) | p99 ≤ 2s | 30d |
| Action commit (transactional) | p99 ≤ 5s · success ratio ≥ 99.5% | 30d |
| Action commit (flat-file) | p99 ≤ 1 batch interval · success ratio ≥ 99.9% | 30d |
| Audit append | p99 ≤ 200ms · loss = 0 | any |
| Replay reconstruction | p99 ≤ 30s for runs ≤ 30d old | any |
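As an illustration of how a latency-plus-success-ratio target might be evaluated over a window of samples, the sketch below uses the nearest-rank definition of p99. The percentile method and the function names are assumptions; the platform's actual measurement method is not specified here.

```python
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def meets_slo(latencies: Sequence[float], p99_target: float,
              successes: int, total: int, min_success_ratio: float) -> bool:
    """True if both the p99 latency and the success ratio are within target."""
    if total == 0:
        return True  # assumption: an empty window does not breach the SLO
    return p99(latencies) <= p99_target and successes / total >= min_success_ratio


if __name__ == "__main__":
    # Hypothetical window: 1000 ingest requests, latencies in seconds.
    latencies = [0.2] * 995 + [1.5] * 5
    print(p99(latencies), meets_slo(latencies, 1.0, 999, 1000, 0.999))
```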

20.7 Error budgets

Each SLO has an associated error budget. Burn-rate monitors gate releases (§21): a fast-burn breach pauses promotion of new agent / tool / model versions until the budget recovers.
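A short worked example may help here. For a success-ratio SLO, the error budget is the allowed failure fraction (1 − SLO), and the burn rate is the observed failure ratio divided by that budget, so a burn rate of 1 spends the budget exactly over the window. The sketch below assumes this common definition; the fast-burn threshold of 14.4 (2% of a 30-day budget consumed in one hour) is a widely used convention and not a value taken from this document.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction for a success-ratio SLO, e.g. 0.999 -> 0.001."""
    return 1.0 - slo


def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full outage over the window."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed failure ratio divided by the budgeted failure ratio."""
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo)


def promotion_paused(bad: int, total: int, slo: float,
                     fast_burn_threshold: float = 14.4) -> bool:
    """Hypothetical release gate: pause promotion on a fast-burn breach."""
    return burn_rate(bad, total, slo) >= fast_burn_threshold


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of budget.
    print(round(budget_minutes(0.999), 1))
    # 20 failures in 1000 requests burns the budget about 20 times too fast.
    print(round(burn_rate(20, 1000, 0.999), 1), promotion_paused(20, 1000, 0.999))
```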

20.8 Loss prevention guarantees

  • Ingest is durable before acknowledgement; no acknowledged signal is lost.
  • Action staging is durable before the staged response is returned.
  • Audit append is durable before the operation that triggered it is acknowledged.
  • Crashes during commit are recovered via idempotency-key re-presentation; commit is exactly-once-effective.
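The last guarantee can be illustrated with a minimal sketch of idempotency-key re-presentation: a commit that is retried with the same key returns the recorded result instead of applying the effect a second time. The class and method names are hypothetical, and the in-memory dictionary stands in for a durable store.

```python
from typing import Callable, Dict


class IdempotentCommitter:
    def __init__(self) -> None:
        # Stand-in for a durable key -> result store (assumption).
        self._results: Dict[str, str] = {}

    def commit(self, idempotency_key: str, apply: Callable[[], str]) -> str:
        """Apply the effect at most once per key; replay the stored result otherwise."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = apply()
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    applied = []

    def write_record() -> str:
        applied.append("write")
        return "receipt-1"

    committer = IdempotentCommitter()
    first = committer.commit("run-42/step-3", write_record)
    # Re-presentation after a simulated crash and retry.
    second = committer.commit("run-42/step-3", write_record)
    print(first, second, len(applied))
```

This sketch does not cover a crash between applying the effect and recording the result. Closing that gap requires either recording the key atomically with the effect or relying on the target system to deduplicate by the same key.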