13 Evaluation, benchmarks & drift

Promotion of a model, a prompt, a tool, an agent, or an ontology change is gated by the evaluation harness. Evaluation is continuous, not a release activity. Drift is a first-class signal that demotes assets without operator action.

13.1 Eval suite types

Golden

Frozen labelled set — Hand-curated inputs with known correct outputs. Owned by the tenant. Versioned. Tested every promotion.

Replay

Recent runs — A sample of recent production runs replayed against the candidate. Shows real-world behavioural delta.

Adversarial

Probes — Prompt-injection, evidence forgery, jailbreaks, ontology poisoning. Block on regression.

Calibration

Probability quality — Reliability diagrams, ECE / MCE, isotonic-fit deviation. Bounds confidence drift.
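As an illustration of the calibration metrics named above, here is a minimal sketch of ECE and MCE using equal-width confidence bins. The binning scheme, the function name `calibration_errors`, and the default of 10 bins are assumptions made for the example; the harness may bin differently.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) over equal-width confidence bins.

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    Equal-width binning is an assumption of this sketch.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # p == 1.0 would index one past the end, so clamp to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(confidences)
    ece = 0.0
    mce = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += (len(members) / n) * gap   # size-weighted average gap
        mce = max(mce, gap)               # worst single-bin gap
    return ece, mce
```

ECE averages the gap between stated confidence and observed accuracy across bins, weighted by bin size; MCE reports the worst bin, which is useful when one confidence range is badly miscalibrated but sparsely populated.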

13.2 Bench score math

The bench score is a weighted aggregate over an eval pack. Each suite has its own metric and a weight. Regression is the worst per-suite drop against the baseline, so a gain in one suite cannot mask a loss in another.
bench(pack, candidate) = Σ_s w_s · metric_s(candidate)
metric_s ∈ [0, 1]
Σ_s w_s = 1

regression(pack, candidate, baseline) = max_s ( metric_s(baseline) − metric_s(candidate) )

passing(pack, candidate, baseline) iff
  bench(pack, candidate) ≥ pack.score_min
  ∧  regression(pack, candidate, baseline) ≤ pack.regression_max
  ∧  ∀ s ∈ pack.required_suites:  metric_s(candidate) ≥ pack.suite_min_s
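The formulas above translate fairly directly into code. The sketch below is one possible reading: the `EvalPack` class and its field layout are hypothetical, and per-suite metrics are assumed to arrive as plain dictionaries keyed by suite name.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalPack:
    # Hypothetical container; field names mirror the formulas above.
    weights: dict            # suite name -> w_s, must sum to 1
    score_min: float
    regression_max: float
    suite_min: dict = field(default_factory=dict)  # required suites -> minimum metric


def bench(pack, metrics):
    """Weighted aggregate of per-suite metrics, each in [0, 1]."""
    if abs(sum(pack.weights.values()) - 1.0) > 1e-9:
        raise ValueError("suite weights must sum to 1")
    for suite in pack.weights:
        if not 0.0 <= metrics[suite] <= 1.0:
            raise ValueError(f"metric for {suite!r} is outside [0, 1]")
    return sum(w * metrics[suite] for suite, w in pack.weights.items())


def regression(pack, candidate, baseline):
    """Largest per-suite drop from baseline to candidate (negative if all improved)."""
    return max(baseline[s] - candidate[s] for s in pack.weights)


def passing(pack, candidate, baseline):
    return (
        bench(pack, candidate) >= pack.score_min
        and regression(pack, candidate, baseline) <= pack.regression_max
        and all(candidate[s] >= floor for s, floor in pack.suite_min.items())
    )


pack = EvalPack(
    weights={"golden": 0.4, "replay": 0.3, "adversarial": 0.2, "calibration": 0.1},
    score_min=0.85,
    regression_max=0.03,
    suite_min={"adversarial": 1.0},
)
candidate = {"golden": 0.90, "replay": 0.80, "adversarial": 1.0, "calibration": 0.7}
baseline = {"golden": 0.88, "replay": 0.85, "adversarial": 1.0, "calibration": 0.7}
```

With these illustrative numbers the candidate scores about 0.87, which clears `score_min`, but it drops about 0.05 on the replay suite, which exceeds `regression_max`, so `passing(pack, candidate, baseline)` is false.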

13.3 Eval gate algorithm

Fig. 13.1: Eval gate algorithm. A candidate must pass suite minimums, regression, score, and adversarial probes before entering shadow; shadow and canary precede full enable.

13.4 Drift sigma

Drift is measured continuously on the live signal. The default detector is a sigma probe over a trailing baseline window.
baseline_window:  trailing T_b   (e.g. 14 days, asset-dependent)
current_window:   trailing T_c   (e.g. 1 day)
metric_t:  per-window value of a watched metric
            (accuracy on golden set; calibration ECE; rejection ratio;
             confidence histogram KL; verifier-block ratio)

mu  = mean(baseline_window of metric_t)
sd  = stdev(baseline_window of metric_t)
sigma_t = (mean(current_window of metric_t) − mu) / sd

WARN  if  |sigma_t| ≥ 2
BREACH if  |sigma_t| ≥ 3 sustained for ≥ N current_windows

WARN     → demote model to shadow; route at sample rate; alert owners.
BREACH   → move model to the demoted state; remove from primary routing; require re-eval.
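The detector above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, `statistics.stdev` is the sample standard deviation (the text does not say whether sample or population is intended), and a zero-variance baseline is treated as an error rather than handled.

```python
from statistics import mean, stdev


def sigma(baseline_values, current_values):
    """sigma_t: distance of the current-window mean from the baseline mean, in baseline sd."""
    mu = mean(baseline_values)
    sd = stdev(baseline_values)  # sample stdev; an assumption of this sketch
    if sd == 0:
        raise ValueError("baseline has zero variance; sigma is undefined")
    return (mean(current_values) - mu) / sd


def classify(sigmas, n_sustain, warn=2.0, breach=3.0):
    """Classify the latest window given sigma_t for successive windows, oldest first.

    BREACH requires |sigma_t| >= breach in each of the last n_sustain windows;
    WARN requires |sigma_t| >= warn in the latest window only.
    """
    if not sigmas or n_sustain < 1:
        raise ValueError("need at least one sigma value and n_sustain >= 1")
    recent = sigmas[-n_sustain:]
    if len(recent) == n_sustain and all(abs(s) >= breach for s in recent):
        return "BREACH"
    if abs(sigmas[-1]) >= warn:
        return "WARN"
    return "OK"
```

A single window above 3 sigma therefore yields WARN until it has persisted for `n_sustain` windows, which keeps one noisy day from removing an asset from primary routing.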

13.5 Watched metrics

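Metric                        | Asset class                   | Window (current / baseline) | Default warn / breach
------------------------------+-------------------------------+-----------------------------+----------------------
Accuracy on golden set        | model · prompt · tool · agent | 1d / 14d                    | 2σ / 3σ
Verifier-block ratio          | agent                         | 1d / 14d                    | 2σ / 3σ
Confidence histogram KL       | extractor / classifier        | 1d / 14d                    | 2σ / 3σ
Rejection ratio               | verifier rule pack            | 1d / 14d                    | 2σ / 3σ
Calibration ECE               | extractor / classifier        | 1d / 14d                    | 2σ / 3σ
Tool error class distribution | tool                          | 1d / 14d                    | χ² > threshold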
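The tool error class distribution is the one watched metric that uses a χ² test rather than a sigma probe. The sketch below computes a Pearson χ² statistic of current-window error counts against the proportions seen in the baseline window. The function name, the dictionary inputs, and the additive smoothing for error classes absent from the baseline are assumptions of this example, not documented behaviour.

```python
def chi_square_drift(baseline_counts, current_counts, smoothing=0.5):
    """Pearson chi-square of current error-class counts vs baseline proportions.

    smoothing is a pseudo-count added to every baseline class so that an
    error class never seen in the baseline still has a non-zero expectation.
    """
    classes = sorted(set(baseline_counts) | set(current_counts))
    b_total = sum(baseline_counts.values()) + smoothing * len(classes)
    c_total = sum(current_counts.values())
    if b_total == 0 or c_total == 0:
        raise ValueError("both windows need at least one observation")

    stat = 0.0
    for cls in classes:
        expected = c_total * (baseline_counts.get(cls, 0) + smoothing) / b_total
        observed = current_counts.get(cls, 0)
        if expected == 0:
            if observed == 0:
                continue
            raise ValueError(f"class {cls!r} is new; use smoothing > 0")
        stat += (observed - expected) ** 2 / expected
    return stat
```

The statistic is then compared with the configured threshold. If the threshold is taken from a χ² table, the degrees of freedom are the number of error classes minus one.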

13.6 Reviewer-pairwise eval

For workflows where ground truth is operator judgement (e.g. nuanced narrative interpretation), the harness uses pairwise human comparison. Reviewers see (A, B) draws — candidate vs baseline — without knowing which is which. Win rate is reported with a confidence interval; a candidate must beat baseline at the configured significance level to pass.
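One way to turn pairwise votes into a pass/fail decision is a Wilson score interval on the candidate's win rate, passing only when the lower bound is above 0.5. This is a sketch of that idea rather than the harness's actual test: the function names are hypothetical, and ties are assumed to have been dropped before counting.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(wins, n, alpha=0.05):
    """Two-sided Wilson score interval for a win rate of wins / n."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def beats_baseline(wins, n, alpha=0.05):
    """True if the whole interval lies above a 50% win rate."""
    lower, _ = wilson_interval(wins, n, alpha)
    return lower > 0.5
```

With 100 comparisons, 70 wins passes at `alpha=0.05` while 55 wins does not, because the interval around 0.55 still includes 0.5. Requiring the lower bound of a two-sided interval to clear 0.5 corresponds to a one-sided test at `alpha / 2`.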

13.7 Eval result lineage

Every eval run is itself an immutable artefact: it has a content hash, a frozen pack version, a candidate id, a baseline id, the dataset snapshot id, the runtime versions used, and the resulting metrics. Promotion decisions cite the eval result by id; rollbacks cite the original eval result.
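A content-addressed, immutable record of this kind could look like the sketch below. The `EvalResult` class, its field names, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the fields simply mirror the list in the paragraph above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalResult:
    # Hypothetical record; fields follow the lineage list above.
    pack_version: str
    candidate_id: str
    baseline_id: str
    dataset_snapshot_id: str
    runtime_versions: dict   # component name -> version string
    metrics: dict            # suite name -> metric value

    def content_hash(self):
        """SHA-256 over a canonical JSON form, so equal content gives an equal id."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys before hashing makes the id independent of dictionary insertion order, so a promotion decision and a later rollback that cite the same hash are guaranteed to refer to the same pack version, dataset snapshot, and metrics. Note that `frozen=True` blocks attribute reassignment only; the nested dictionaries would still need to be treated as read-only by convention.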

13.8 Continuous eval cadence

  • Full pack — on every change to model / prompt / tool / agent / ontology.
  • Regression suite — nightly per asset class.
  • Drift probes — every current_window (typically 1 day) per asset.
  • Calibration refit — weekly for extractors / classifiers.
  • Adversarial sweep — weekly + on threat-intel update.

13.9 Failure attribution

When a Decision is later judged wrong by a reviewer, the harness back-attributes:
  1. which step in the plan caused the deviation,
  2. which model lineage was active at that step,
  3. which prompt revision and retrieval snapshot were used,
  4. whether the verifier should have caught it (and if so, which rule pack and rule).
Attributions feed the eval suites, the verifier rule packs, and the routing policy.