13 — Evaluation, benchmarks & drift.
Promotion of a model, a prompt, a tool, an agent, or an ontology change is gated by the evaluation harness. Evaluation is continuous, not a release activity. Drift is a first-class signal that demotes assets without operator action.
13.1 Eval suite types
Golden
Frozen labelled set — Hand-curated inputs with known correct outputs. Owned by the tenant. Versioned. Tested every promotion.
Replay
Recent runs — A sample of recent production runs replayed against the candidate. Shows real-world behavioural delta.
Adversarial
Probes — Prompt-injection, evidence forgery, jailbreaks, ontology poisoning. Block on regression.
Calibration
Probability quality — Reliability diagrams, ECE / MCE, isotonic-fit deviation. Bounds confidence drift.
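As a rough illustration of the calibration metrics named above, the sketch below computes ECE and MCE with equal-width confidence bins. The function name `calibration_errors`, the 10-bin default, and the input format are assumptions made for this example, not the harness's API.

```python
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> Tuple[float, float]:
    """Return (ECE, MCE) using equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    # Per-bin accumulators: [sum of confidence, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1
    total = len(confidences)
    ece, mce = 0.0, 0.0
    for conf_sum, hits, count in bins:
        if count == 0:
            continue
        # Gap between observed accuracy and mean stated confidence in this bin.
        gap = abs(hits / count - conf_sum / count)
        ece += (count / total) * gap  # ECE: count-weighted average gap
        mce = max(mce, gap)           # MCE: worst single-bin gap
    return ece, mce
```

ECE weights each bin by how many predictions fall in it, so it tracks typical miscalibration; MCE reports the worst bin and is more sensitive to sparsely populated bins.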
13.2 Bench score math
The bench score is a weighted aggregate over an eval pack. Each suite has a weight; each suite has its own metric.
13.3 Eval gate algorithm
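One way to read the gate in Fig. 13.1 is as an ordered series of checks, with the score check using the weighted aggregate from 13.2. The sketch below assumes the bench score is a weight-normalised mean of per-suite metrics scaled to [0, 1], that "minimum probes" means a minimum number of executed probes, and that "regression" means the candidate must not score below the baseline. The source does not spell out any of these, and `SuiteResult`, `bench_score`, `eval_gate`, `min_probes`, and `min_score` are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SuiteResult:
    name: str      # e.g. "golden", "replay", "adversarial", "calibration"
    metric: float  # suite-specific metric, assumed scaled to [0, 1]
    weight: float  # suite weight within the eval pack
    probes: int    # number of probes (cases) actually executed


def bench_score(results: List[SuiteResult]) -> float:
    """Weight-normalised mean of suite metrics (assumed form of the aggregate)."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("eval pack needs a positive total weight")
    return sum(r.weight * r.metric for r in results) / total_weight


def eval_gate(
    candidate: List[SuiteResult],
    baseline: List[SuiteResult],
    min_probes: int = 100,    # hypothetical threshold
    min_score: float = 0.9,   # hypothetical threshold
) -> Optional[str]:
    """Return None if the candidate may enter shadow, else the failed check."""
    if sum(r.probes for r in candidate) < min_probes:
        return "minimum_probes"
    if bench_score(candidate) < bench_score(baseline):
        return "regression"
    if bench_score(candidate) < min_score:
        return "score"
    # Adversarial suites block on any regression against the baseline.
    base_adv = {r.name: r.metric for r in baseline if r.name == "adversarial"}
    for r in candidate:
        if r.name == "adversarial" and r.metric < base_adv.get(r.name, 0.0):
            return "adversarial"
    return None
```

Returning the name of the first failed check, rather than a bare boolean, keeps the reason for a blocked promotion available to whatever records the eval result.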
Fig. 13.1: Eval gate algorithm. A candidate must pass minimum probes, regression, score, and adversarial probes before entering shadow; shadow and canary precede full enable.
13.4 Drift sigma
Drift is measured continuously on the live signal. The default detector is a sigma probe over a trailing baseline window.
13.5 Watched metrics
| Metric | Asset class | Window | Default warn / breach |
|---|---|---|---|
| Accuracy on golden set | model · prompt · tool · agent | 1d / 14d | 2σ / 3σ |
| Verifier-block ratio | agent | 1d / 14d | 2σ / 3σ |
| Confidence histogram KL | extractor / classifier | 1d / 14d | 2σ / 3σ |
| Rejection ratio | verifier rule pack | 1d / 14d | 2σ / 3σ |
| Calibration ECE | extractor / classifier | 1d / 14d | 2σ / 3σ |
| Tool error class distribution | tool | 1d / 14d | χ² > threshold |
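The σ thresholds in the table can be read as a z-score of the current window against a trailing baseline. A minimal sketch, assuming one aggregated value per day, a two-sided test, and the sample standard deviation of the baseline; `sigma_probe` and its return labels are hypothetical, and the χ² detector used for tool error classes is not covered here.

```python
from statistics import mean, stdev
from typing import Sequence


def sigma_probe(
    baseline: Sequence[float],
    current: float,
    warn: float = 2.0,
    breach: float = 3.0,
) -> str:
    """Classify `current` by its distance from the baseline mean, in baseline sigmas."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points to estimate sigma")
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # Flat baseline: any movement at all is treated as a breach (assumption).
        return "ok" if current == mu else "breach"
    z = abs(current - mu) / sigma
    if z >= breach:
        return "breach"
    if z >= warn:
        return "warn"
    return "ok"
```

With the 1d / 14d windows from the table, `baseline` would hold the trailing 14 daily values and `current` the latest day's value.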
13.6 Reviewer-pairwise eval
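For the pairwise comparison this section describes, a win rate and its interval might be computed as in the sketch below. It assumes ties are excluded before counting, uses a Wilson score interval, and passes the candidate only when the lower bound exceeds 0.5. The source does not name the interval or the test, and `wilson_interval` and `pairwise_gate` are hypothetical names.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(wins: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for the candidate's win rate over n comparisons."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def pairwise_gate(wins: int, n: int, confidence: float = 0.95) -> bool:
    """Pass only if the whole interval lies above the 50% no-preference line."""
    low, _ = wilson_interval(wins, n, confidence)
    return low > 0.5
```

Requiring the lower bound to clear 0.5 means a candidate with a nominal majority of wins can still fail when the number of comparisons is too small to separate it from the baseline.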
For workflows where ground truth is operator judgement (e.g. nuanced narrative interpretation), the harness uses pairwise human comparison. Reviewers see (A, B) draws, candidate versus baseline, without knowing which is which. Win rate is reported with a confidence interval; a candidate must beat baseline at the configured significance level to pass.
13.7 Eval result lineage
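An eval result of the kind this section describes could be modelled as a frozen record whose id is derived from its own content. The sketch below is illustrative only: the `EvalResult` class, its field names, and the choice of SHA-256 over canonical JSON are assumptions, not the harness's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalResult:
    pack_version: str                  # frozen eval pack version
    candidate_id: str
    baseline_id: str
    dataset_snapshot_id: str
    runtime_versions: Dict[str, str]   # e.g. {"harness": "...", "python": "..."}
    metrics: Dict[str, float]

    def content_hash(self) -> str:
        """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the hash covers every field, two runs that differ in any input or metric get different ids, which is what lets a promotion or rollback cite one specific result.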
Every eval run is itself an immutable artefact: it has a content hash, a frozen pack version, a candidate id, a baseline id, the dataset snapshot id, the runtime versions used, and the resulting metrics. Promotion decisions cite the eval result by id; rollbacks cite the original eval result.
13.8 Continuous eval cadence
- Full pack — on every change to model / prompt / tool / agent / ontology.
- Regression suite — nightly per asset class.
- Drift probes — every current_window (typically 1 day) per asset.
- Calibration refit — weekly for extractors / classifiers.
- Adversarial sweep — weekly + on threat-intel update.
13.9 Failure attribution
When a Decision is later judged wrong by a reviewer, the harness back-attributes:
- which step in the plan caused the deviation,
- which model lineage was active at that step,
- which prompt revision and retrieval snapshot were used,
- whether the verifier should have caught it (and if so, which rule pack and rule).
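The attribution fields listed above could be carried in a single record. This is a sketch only: `FailureAttribution`, its field names, the `attribute` helper, and the shape of the per-step trace dictionaries are all hypothetical and do not describe the harness's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FailureAttribution:
    decision_id: str
    step_id: str               # plan step that caused the deviation
    model_lineage: str         # model lineage active at that step
    prompt_revision: str
    retrieval_snapshot: str
    verifier_missed: bool      # True if the verifier should have caught it
    rule_pack: Optional[str] = None
    rule: Optional[str] = None


def attribute(
    decision_id: str,
    steps: List[Dict[str, str]],
    deviating_step: str,
    missed_rule: Optional[Dict[str, str]] = None,
) -> FailureAttribution:
    """Build an attribution record from a run trace and a reviewer's finding."""
    step = next((s for s in steps if s["step_id"] == deviating_step), None)
    if step is None:
        raise KeyError(f"step {deviating_step!r} not found in trace")
    return FailureAttribution(
        decision_id=decision_id,
        step_id=step["step_id"],
        model_lineage=step["model_lineage"],
        prompt_revision=step["prompt_revision"],
        retrieval_snapshot=step["retrieval_snapshot"],
        verifier_missed=missed_rule is not None,
        rule_pack=missed_rule["rule_pack"] if missed_rule else None,
        rule=missed_rule["rule"] if missed_rule else None,
    )
```

The rule pack and rule stay empty unless the reviewer judges that the verifier should have blocked the Decision, mirroring the conditional in the last bullet.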

