13 — Evaluation, benchmarks & drift.

Promotion of a model, a prompt, a tool, an agent, or an ontology change is gated by the evaluation harness. Evaluation is continuous, not a release activity. Drift is a first-class signal that demotes assets without operator action.

13.1 Eval suite types

Golden

Frozen labelled set — Hand-curated inputs with known correct outputs. Owned by the tenant. Versioned. Tested every promotion.

Replay

Recent runs — A sample of recent production runs replayed against the candidate. Shows real-world behavioural delta.

Adversarial

Probes — Prompt-injection, evidence forgery, jailbreaks, ontology poisoning. Block on regression.

Calibration

Probability quality — Reliability diagrams, ECE / MCE, isotonic-fit deviation. Bounds confidence drift.
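To make the calibration metrics concrete, here is a minimal sketch of ECE and MCE over equal-width confidence bins. The function name, the bin count of 10 and the equal-width binning are assumptions for illustration; the source does not say how the harness bins predictions.

```python
def calibration_errors(confidences: list[float], correct: list[bool],
                       n_bins: int = 10) -> tuple[float, float]:
    """Return (ECE, MCE) using equal-width bins over [0, 1].

    ECE is the size-weighted mean of |accuracy - mean confidence| per bin;
    MCE is the largest such gap over the non-empty bins.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty inputs")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    mce = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += (len(members) / total) * gap
        mce = max(mce, gap)
    return ece, mce
```

Four predictions at confidence 0.9 with three of them correct land in one bin with accuracy 0.75, so both ECE and MCE come out at roughly 0.15.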

13.2 Bench score math

The bench score is a weighted aggregate over an eval pack. Each suite has a weight; each suite has its own metric.

bench(pack, candidate) = Σ_s w_s · metric_s(candidate)
metric_s ∈ [0, 1]
Σ_s w_s = 1

regression(pack, candidate, baseline) = max_s ( metric_s(baseline) − metric_s(candidate) )

passing(pack, candidate) iff
  bench(pack, candidate) ≥ pack.score_min
  ∧  regression(pack, candidate, baseline) ≤ pack.regression_max
  ∧  ∀ s ∈ pack.required_suites:  metric_s(candidate) ≥ pack.suite_min_s
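The bench, regression and passing formulas above can be sketched in a few lines of Python. The `EvalPack` class and its field layout are hypothetical, chosen to mirror the symbols in the formulas; metrics are passed as plain suite-name to score mappings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalPack:
    """Hypothetical pack config; field names mirror the formulas above."""
    weights: dict[str, float]          # w_s per suite, must sum to 1
    score_min: float
    regression_max: float
    suite_min: dict[str, float] = field(default_factory=dict)  # required suites only


def bench(pack: EvalPack, metrics: dict[str, float]) -> float:
    """Weighted aggregate: sum over suites of w_s * metric_s."""
    if abs(sum(pack.weights.values()) - 1.0) > 1e-9:
        raise ValueError("suite weights must sum to 1")
    for suite in pack.weights:
        if not 0.0 <= metrics[suite] <= 1.0:
            raise ValueError(f"metric for {suite!r} must lie in [0, 1]")
    return sum(w * metrics[suite] for suite, w in pack.weights.items())


def regression(pack: EvalPack, candidate: dict[str, float],
               baseline: dict[str, float]) -> float:
    """Worst per-suite drop from baseline to candidate (negative if all improved)."""
    return max(baseline[s] - candidate[s] for s in pack.weights)


def passing(pack: EvalPack, candidate: dict[str, float],
            baseline: dict[str, float]) -> bool:
    """All three gate conditions must hold."""
    return (
        bench(pack, candidate) >= pack.score_min
        and regression(pack, candidate, baseline) <= pack.regression_max
        and all(candidate[s] >= floor for s, floor in pack.suite_min.items())
    )
```

Because regression takes the maximum over suites, a single suite that drops by more than `regression_max` fails the candidate even when the weighted score improves overall.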

13.3 Eval gate algorithm

Fig. 13.1 — Eval gate algorithm. A candidate must clear the per-suite minimums, the regression bound, the score minimum, and the adversarial probes before entering shadow; shadow and canary stages precede full enable.

13.4 Drift sigma

Drift is measured continuously on the live signal. The default detector is a sigma probe over a trailing baseline window.

baseline_window:  trailing T_b   (e.g. 14 days, asset-dependent)
current_window:   trailing T_c   (e.g. 1 day)
metric_t:  per-window value of a watched metric
            (accuracy on golden set; calibration ECE; rejection ratio;
             confidence histogram KL; verifier-block ratio)

mu  = mean(baseline_window of metric_t)
sd  = stdev(baseline_window of metric_t)
sigma_t = (current_window mean − mu) / sd

WARN  if  |sigma_t| ≥ 2
BREACH if  |sigma_t| ≥ 3 sustained for ≥ N current_windows

WARN     → demote model to shadow; route at sample rate; alert owners.
BREACH   → move model to the demoted state; remove from primary routing; require re-eval.
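A minimal sketch of the sigma probe described above, using only the standard library. The function names and the `n_sustained` parameter (standing in for N) are hypothetical, and `statistics.stdev` is the sample standard deviation, which is an assumption since the source does not say which estimator is used.

```python
from statistics import mean, stdev


def sigma(baseline: list[float], current: list[float]) -> float:
    """sigma_t = (mean of current window - baseline mean) / baseline stdev."""
    mu = mean(baseline)
    sd = stdev(baseline)  # sample stdev; needs at least two baseline points
    if sd == 0:
        raise ValueError("baseline has zero variance; sigma is undefined")
    return (mean(current) - mu) / sd


def classify(sigmas: list[float], n_sustained: int,
             warn: float = 2.0, breach: float = 3.0) -> str:
    """Classify the latest window given sigma_t per current_window, oldest first.

    BREACH needs |sigma_t| >= breach for the last n_sustained windows;
    WARN needs |sigma_t| >= warn on the latest window only.
    """
    if n_sustained < 1:
        raise ValueError("n_sustained must be at least 1")
    recent = sigmas[-n_sustained:]
    if len(recent) == n_sustained and all(abs(s) >= breach for s in recent):
        return "BREACH"
    if sigmas and abs(sigmas[-1]) >= warn:
        return "WARN"
    return "OK"
```

With `n_sustained=2`, a single window beyond 3σ only produces a WARN; the BREACH fires once a second consecutive window stays beyond 3σ.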

13.5 Watched metrics

Metric	Asset class	Window (current / baseline)	Default warn / breach
Accuracy on golden set	model · prompt · tool · agent	1d / 14d	2σ / 3σ
Verifier-block ratio	agent	1d / 14d	2σ / 3σ
Confidence histogram KL	extractor / classifier	1d / 14d	2σ / 3σ
Rejection ratio	verifier rule pack	1d / 14d	2σ / 3σ
Calibration ECE	extractor / classifier	1d / 14d	2σ / 3σ
Tool error class distribution	tool	1d / 14d	χ² > threshold
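For the tool error class distribution row, the χ² check can be sketched as a Pearson statistic comparing current-window error counts with the proportions seen in the baseline window. The function name is hypothetical, and treating an error class that never appeared in the baseline as an infinite statistic is an assumption made here for illustration; the threshold itself is left to configuration, as in the table.

```python
def chi_square(baseline_counts: dict[str, int],
               current_counts: dict[str, int]) -> float:
    """Pearson chi-square of current error-class counts against baseline proportions."""
    base_total = sum(baseline_counts.values())
    cur_total = sum(current_counts.values())
    if base_total == 0 or cur_total == 0:
        raise ValueError("both windows need at least one observed error")

    stat = 0.0
    for cls in set(baseline_counts) | set(current_counts):
        observed = current_counts.get(cls, 0)
        # Expected count if the current window followed the baseline mix.
        expected = baseline_counts.get(cls, 0) / base_total * cur_total
        if expected == 0:
            if observed > 0:
                # Assumption: a class unseen in the baseline always exceeds the threshold.
                return float("inf")
            continue
        stat += (observed - expected) ** 2 / expected
    return stat
```

A current window with the same class mix as the baseline scores close to zero, while a window dominated by a previously rare class scores high.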

13.6 Reviewer-pairwise eval

For workflows where ground truth is operator judgement (e.g. nuanced narrative interpretation), the harness uses pairwise human comparison. Reviewers see (A, B) draws — candidate vs baseline — without knowing which is which. Win rate is reported with a confidence interval; a candidate must beat baseline at the configured significance level to pass.
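One way to report the win rate with a confidence interval is the Wilson score interval, sketched below. The choice of Wilson, the exclusion of ties from the count, and the pass rule (lower bound strictly above 0.5) are assumptions; the source only states that the candidate must beat baseline at the configured significance level.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(wins: int, n: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `n` comparisons."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def beats_baseline(wins: int, n: int, confidence: float = 0.95) -> bool:
    """Hypothetical pass rule: the whole interval lies above a 50% win rate."""
    lower, _ = wilson_interval(wins, n, confidence)
    return lower > 0.5
```

Under these assumptions, 70 wins out of 100 comparisons passes, while 55 out of 100 does not, because its interval still includes 0.5.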

13.7 Eval result lineage

Every eval run is itself an immutable artefact: it has a content hash, a frozen pack version, a candidate id, a baseline id, the dataset snapshot id, the runtime versions used, and the resulting metrics. Promotion decisions cite the eval result by id; rollbacks cite the original eval result.
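The shape of such an artefact can be sketched as a frozen record whose id is derived from its content. The class name, field names and the canonical-JSON plus SHA-256 scheme are assumptions for illustration; the source lists the fields but not the encoding or the hash function.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical immutable eval record; fields follow the list above."""
    pack_version: str
    candidate_id: str
    baseline_id: str
    dataset_snapshot_id: str
    runtime_versions: tuple[tuple[str, str], ...]  # (component, version) pairs
    metrics: tuple[tuple[str, float], ...]         # (suite, score) pairs

    def content_hash(self) -> str:
        """SHA-256 over a canonical JSON encoding of every field."""
        canonical = json.dumps(asdict(self), sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two records with identical fields hash to the same id, and changing any field, such as a single metric, yields a different id, which is what lets promotion and rollback decisions cite a result unambiguously.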

13.8 Continuous eval cadence

Full pack — on every change to model / prompt / tool / agent / ontology.
Regression suite — nightly per asset class.
Drift probes — every current_window (typically 1 day) per asset.
Calibration refit — weekly for extractors / classifiers.
Adversarial sweep — weekly + on threat-intel update.

13.9 Failure attribution

When a Decision is later judged wrong by a reviewer, the harness back-attributes:

which step in the plan caused the deviation,
which model lineage was active at that step,
which prompt revision and retrieval snapshot were used,
whether the verifier should have caught it (and if so, which rule pack and rule).

Attributions feed the eval suites, the verifier rule packs, and the routing policy.
