20 — Reliability, throughput & SLOs.
The substrate is queue-driven, backpressured, and circuit-broken. SLOs are declared per surface; capacity planning is per (tenant, lane, adapter) tuple. The platform never silently drops work.
20.1 Queue topology
Fig. 20.1 — Queue topology. Work, retry and DLQ queues exist independently for the reasoning plane and the action plane; they are never merged.
20.2 Backpressure model
- Each queue has a soft and a hard threshold: crossing the soft threshold slows ingestion (HTTP 429 with Retry-After); crossing the hard threshold pauses ingestion entirely.
- Per-tenant quotas prevent one tenant from starving another in shared deployments.
- Per-agent quotas prevent one agent from starving others within a tenant.
- Capacity is published as metrics; tenants can configure alerting thresholds.
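
The soft/hard threshold behaviour above can be sketched as a small admission check. This is a minimal illustration, not the platform's implementation: `QueueThresholds`, `Admission`, and `admit` are hypothetical names, and measuring the thresholds in queue depth is an assumption (they could equally be expressed in bytes or queue age).

```python
from dataclasses import dataclass
from enum import Enum


class Admission(Enum):
    ACCEPT = "accept"      # below the soft threshold: ingest normally
    THROTTLE = "throttle"  # at or above soft: slow the producer (429 / Retry-After)
    PAUSE = "pause"        # at or above hard: ingestion is paused entirely


@dataclass(frozen=True)
class QueueThresholds:
    # Hypothetical units: number of queued items.
    soft: int
    hard: int

    def __post_init__(self) -> None:
        if not 0 <= self.soft <= self.hard:
            raise ValueError("expected 0 <= soft <= hard")


def admit(depth: int, limits: QueueThresholds) -> Admission:
    """Classify an ingest attempt by the current queue depth."""
    if depth >= limits.hard:
        return Admission.PAUSE
    if depth >= limits.soft:
        return Admission.THROTTLE
    return Admission.ACCEPT


limits = QueueThresholds(soft=1_000, hard=5_000)
decisions = [admit(d, limits) for d in (10, 1_000, 5_000)]
```

Checking the hard threshold first keeps the two states mutually exclusive, so a queue that has crossed both thresholds is always reported as paused rather than throttled.
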
20.3 Retry classifier & backoff
The retry classifier (§11.6) applies the following defaults; tenants can tighten them per surface.
20.4 Circuit breakers
Each external dependency (model provider, SoR adapter, retrieval source) is fronted by a circuit breaker:

| State | Behaviour | Transition |
|---|---|---|
| closed | normal traffic | error rate > threshold → open |
| open | fail fast; route to fallback if available | after cooldown → half-open |
| half-open | limited probe traffic | success → closed; fail → open |
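
The state table above maps onto a small state machine. The sketch below is illustrative only: the class name, the parameters (`error_threshold`, `min_calls`, `cooldown_s`), and their defaults are assumptions, the error rate is computed over all calls since the breaker last closed rather than over a sliding window, and "limited probe traffic" is simplified to a single probe call.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Closed -> open -> half-open breaker (hypothetical parameters)."""

    def __init__(self, error_threshold: float = 0.5, min_calls: int = 10,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.min_calls = min_calls  # avoid tripping on a tiny sample
        self.cooldown_s = cooldown_s
        self._clock = clock
        self.state = "closed"
        self._calls = 0
        self._errors = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; False means fail fast."""
        if self.state == "closed":
            return True
        cooled = self._clock() - self._opened_at >= self.cooldown_s
        if self.state == "open" and cooled:
            self.state = "half-open"  # let exactly one probe through
            return True
        return False  # open and still cooling, or a probe is already out

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that allow() admitted."""
        if self.state == "half-open":
            if ok:
                self._close()
            else:
                self._open()
            return
        if self.state == "open":
            return
        self._calls += 1
        if not ok:
            self._errors += 1
        rate = self._errors / self._calls
        if self._calls >= self.min_calls and rate > self.error_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._calls = self._errors = 0

    def _close(self) -> None:
        self.state = "closed"
        self._calls = self._errors = 0
```

A caller would check `allow()` before each dependency call, route to the fallback when it returns `False`, and report the outcome with `record()`. Injecting the clock keeps the cooldown testable without sleeping.
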
20.5 Capacity planning
- runs/sec per tenant per agent class — bounded by sandbox pool size.
- token/sec per tenant per lane — bounded by provider quota.
- cost/sec per tenant — bounded by tenant budget.
- actions committed/sec per adapter — bounded by SoR rate limits.
- retrieval queries/sec per corpus — bounded by index sharding.
20.6 SLO surfaces
| Surface | Default SLO | Window |
|---|---|---|
| Ingest acceptance (push) | p99 ≤ 1s · success ratio ≥ 99.9% | 30d |
| Run start latency (queued → running) | p99 ≤ 5s | 30d |
| Tool dispatch (PDP roundtrip) | p99 ≤ 50ms | 30d |
| Reasoning lane (reasoning.fast) | p99 ≤ 2s | 30d |
| Action commit (transactional) | p99 ≤ 5s · success ratio ≥ 99.5% | 30d |
| Action commit (flat-file) | p99 ≤ 1 batch interval · success ratio ≥ 99.9% | 30d |
| Audit append | p99 ≤ 200ms · loss = 0 | any |
| Replay reconstruction | p99 ≤ 30s for runs ≤ 30d old | any |
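
Each success-ratio target in the table implies a failure allowance over its window. The sketch below works through that arithmetic, assuming the budget is counted in requests and using the common definition of burn rate (observed failure ratio divided by allowed failure ratio). The function names are hypothetical, and the platform's own burn-rate monitors may be defined differently.

```python
import math


def allowed_failures(slo_success_ratio: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates for this much traffic in the window."""
    return (1.0 - slo_success_ratio) * total_requests


def burn_rate(slo_success_ratio: float, total_requests: int,
              failed_requests: int) -> float:
    """Observed failure ratio relative to the allowed failure ratio.

    1.0 means the budget is being consumed exactly at the sustainable pace;
    values above 1.0 exhaust the budget before the window ends.
    """
    if total_requests == 0:
        return 0.0
    allowed_ratio = 1.0 - slo_success_ratio
    observed_ratio = failed_requests / total_requests
    if allowed_ratio == 0.0:
        # A zero-loss target (such as audit append) has no budget to burn.
        return math.inf if failed_requests > 0 else 0.0
    return observed_ratio / allowed_ratio


# Ingest acceptance: success ratio >= 99.9% over 30 days.
budget = allowed_failures(0.999, 1_000_000)  # roughly 1,000 failed requests
rate = burn_rate(0.999, 1_000_000, 2_000)    # roughly 2x the sustainable pace
```

The zero-budget branch matters for the audit-append row: with `loss = 0`, any single failure is an immediate breach rather than a gradual burn.
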
20.7 Error budgets
Each SLO has an associated error budget. Burn-rate monitors gate releases (§21): a fast-burn breach pauses promotion of new agent / tool / model versions until the budget recovers.
20.8 Loss prevention guarantees
- Ingest is durable before acknowledgement; no acknowledged signal is lost.
- Action staging is durable before the staged response is returned.
- Audit append is durable before the operation that triggered it is acknowledged.
- Crashes during commit are recovered via idempotency-key re-presentation; commit is exactly-once-effective.
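
The last guarantee can be illustrated with a toy recovery loop. This is a sketch under stated assumptions, not the platform's commit path: `InMemoryAdapter` stands in for a system of record that deduplicates by idempotency key, and `recover`, `staged`, and `confirmed` are hypothetical names for the durable staging state described above.

```python
class InMemoryAdapter:
    """Hypothetical system of record that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.effects: list[str] = []  # side effects actually applied

    def commit(self, key: str, action: str) -> str:
        if key in self._results:
            # Key seen before: return the stored result, apply nothing.
            return self._results[key]
        self.effects.append(action)
        result = f"committed:{action}"
        self._results[key] = result
        return result


def recover(staged: dict[str, str], confirmed: set[str],
            adapter: InMemoryAdapter) -> None:
    """Re-present every staged action that has no recorded confirmation."""
    for key, action in staged.items():
        if key not in confirmed:
            adapter.commit(key, action)  # safe to repeat: same key, same action
            confirmed.add(key)


adapter = InMemoryAdapter()
staged = {"key-1": "debit-42"}  # durably staged before the commit attempt
confirmed: set[str] = set()

# Simulated crash: the adapter applied the commit, but the process died
# before the confirmation was recorded.
adapter.commit("key-1", "debit-42")

recover(staged, confirmed, adapter)
```

After recovery the action has been presented twice but applied once, which is what "exactly-once-effective" means here: delivery is at-least-once, and deduplication on the key keeps the effect single.
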

