
11 — Confidence engine — multi-signal scoring, threshold-governed escalation & audit-grounded reproducibility.

Every output the Layerup AI Agent produces carries a confidence score. That score is not cosmetic. It is the operative signal that determines whether a case is auto-resolved, surfaced for underwriter review, or mandatorily escalated, and it is the primary control that prevents the agent from making a consequential recommendation it is not entitled to make.
The confidence engine is designed around a single governing principle: an artificially high confidence score is more dangerous than an artificially low one. A suppressed score triggers a human review. An inflated score allows a case through that should not have passed. If confidence scores cannot be trusted to reflect genuine evidential quality, the agent’s recommendations cannot be trusted, and the product’s value proposition collapses with them.
This section is addressed to your Underwriting Governance, Information Security, and Internal Audit teams. It describes how the confidence score is computed, what suppresses it, how thresholds govern agent behaviour, and how the architecture addresses the legitimate enterprise concern about score reproducibility.

11.1 How the confidence score is computed

The confidence score is not a single number returned by a foundation model. It is a weighted composite assembled from independently evaluated signal streams — one for each evidence domain the agent is configured to assess. Each stream produces its own sub-score. Those sub-scores are aggregated, normalised, and evaluated against your configured threshold tiers before any output is assembled. The evidence domains evaluated in every case are:
| Domain | What the engine assesses |
| --- | --- |
| AOP Coverage | Whether every material fact in this case has an applicable, unambiguous rule in the Agent Operating Procedure. |
| Document Extraction Quality | The OCR and text-extraction confidence across all submitted documents, weighted by how critical each document is to the recommendation. |
| Cross-Evidence Consistency | Whether the extracted facts across all documents, application form, and third-party data sources are internally coherent, with no conflicting values for the same material fact. |
| Inference Quality | The foundation model’s own uncertainty signal across reasoning steps, plus any guardrail interventions that introduced gaps in the reasoning context. |
| External Data Alignment | Whether third-party data sources (MIB, Rx history, motor vehicle records) are reconcilable with the submitted document set. |
Each domain score is independently computed and independently logged. The composite confidence score is the weighted aggregate of these domain scores, with detractor penalties applied for each identified signal of ambiguity, conflict, or unresolvable uncertainty.
The confidence score is never a direct LLM opinion about how confident it feels. It is the output of a structured detractor model: a defined set of signals, each with a logged value, an evidence citation, and a weighted contribution to the composite. Every suppression is attributable. Every attribution is auditable. The engine is designed so that the reasoning behind a score of 68 is as legible as the recommendation itself.
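As a rough illustration of the weighted aggregation described above, the following Python sketch computes a composite from per-domain scores. It is not the engine's actual formula. The class names, the rule that a domain score is 100 minus its suppressions, and the weights for three of the five domains are assumptions made for the example; only the 0.30 and 0.20 weights appear in the sample payload later in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Detractor:
    type: str
    suppression: float  # points removed from the domain's 100-point maximum


@dataclass
class DomainSignal:
    domain: str
    weight: float  # assumed: weights across all domains sum to 1.0
    detractors: list[Detractor] = field(default_factory=list)

    def score(self) -> float:
        # Hypothetical rule: start at 100, subtract each suppression, floor at 0.
        return max(0.0, 100.0 - sum(d.suppression for d in self.detractors))


def composite_score(signals: list[DomainSignal]) -> int:
    total_weight = sum(s.weight for s in signals)
    if total_weight <= 0:
        raise ValueError("at least one domain must carry a positive weight")
    # Normalise by the total weight so the result stays on a 0-100 scale.
    weighted = sum(s.score() * s.weight for s in signals) / total_weight
    return round(weighted)


# Illustrative values only; these are not taken from a real case.
signals = [
    DomainSignal("aop_coverage", 0.20),
    DomainSignal("document_extraction_quality", 0.20,
                 [Detractor("ocr_below_extraction_floor", 10)]),
    DomainSignal("cross_evidence_consistency", 0.30,
                 [Detractor("cross_document_data_discrepancy", 20)]),
    DomainSignal("inference_quality", 0.15),
    DomainSignal("external_data_alignment", 0.15),
]
print(composite_score(signals))  # 92
```

Keeping each domain's detractor list next to its score is what makes a suppression attributable: the composite can always be traced back to the individual penalties that produced it.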

11.2 Confidence detractors — exhaustive taxonomy

The confidence engine operates on a detractor model. The theoretical maximum score for any case is 100. From that maximum, the engine applies suppression penalties for every identified signal of ambiguity, conflict, or unresolvable uncertainty. The composite score is what remains after all applicable detractors are applied. Any factor that prevents the agent from reaching a high-evidence, internally consistent, AOP-grounded conclusion about a material fact is a detractor. The detractors are grouped into four families below.

AOP & Configuration Detractors

These detractors fire when the agent’s operating configuration itself is the source of uncertainty — either because the AOP does not cover the scenario presented, or because the AOP contains internal contradictions that prevent a deterministic evaluation.

AOP Gap — Undefined Scenario

The case presents a material fact or combination of facts for which no applicable rule exists in the AOP. Rather than fabricating an evaluation, the agent explicitly surfaces the gap and flags it as unresolved. The absence of a rule is treated as a confidence suppressor, not a licence to extrapolate. This is by design: an AOP gap is a signal that human underwriting judgment is required.

Intra-AOP Rule Conflict

Two or more clauses in the AOP produce contradictory evaluations of the same material fact. For example, the occupation section classifies a role as Class 3, while the financial documentation standards section applies thresholds appropriate for Class 2. The engine cannot produce a coherent recommendation when the configuration itself is internally inconsistent — both the conflict and the affected rules are logged in full.

Out-of-Scope Edge Case

The case involves a pattern for which the AOP defines no coverage at all: not a gap within a covered dimension, but an entire scenario type outside the AOP’s scope. This is distinct from an AOP gap, where the AOP covers the scenario but lacks a complete rule for it; here, the AOP has not been configured to evaluate the scenario in the first place. The agent does not extrapolate beyond its configured scope under any circumstances.

Missing Escalation Routing

The AOP specifies that certain conditions should trigger escalation to a named role, team, or queue that is not found in the configured routing table. The agent cannot complete output assembly without a valid escalation target, and the configuration omission itself suppresses confidence on the affected case dimensions.

Document Extraction Detractors

These detractors fire when the quality of the raw document set — or the agent’s ability to extract structured data from it — is insufficient to support a high-confidence evaluation of the material facts contained within.

OCR Confidence Below Extraction Floor

The OCR pipeline cannot extract readable text from a document page at or above the configured confidence floor. Common causes include poor handwriting, degraded scan quality, fax artefacts, dot-matrix printing, and over-compressed image files. The agent does not attempt to reason over content it cannot extract with sufficient confidence — the affected field is marked as unresolved and contributes a suppression proportional to that field’s evidentiary weight.

Unsupported or Corrupted Document Format

The submitted document is in a format the extraction pipeline cannot parse, or the file itself is structurally corrupted. This includes proprietary medical record formats, certain legacy EDI document types, and files where the binary structure does not match the declared MIME type. Corrupted documents are flagged in the unprocessed_attachments list, and any material facts expected from those documents are treated as unresolved.

Password-Protected or Rights-Managed Document

The document is encrypted, password-protected, or subject to DRM controls that prevent the extraction pipeline from reading its content. The agent logs the document identifier, the protection type, and the expected material facts it cannot access. These cases require the submitting party to resubmit in an unprotected format.

Stale or Superseded Document Version

The document’s issue date predates a known SOP revision, regulatory update, or product line change that the AOP was built to reflect. Content from a superseded document may apply outdated standards, schedules, or classification tables — using it without flagging this discrepancy would introduce silent error. The agent logs the document date, the relevant revision date, and the specific fields potentially affected.

Translation or Language Extraction Uncertainty

The submitted document is in a non-primary language and requires machine translation before extraction. Translation confidence and OCR confidence compound: a document with 90% OCR accuracy and 90% translation accuracy produces data points with an effective extraction confidence of approximately 81% before any further uncertainty is applied. Material facts extracted from translated documents carry a compounded confidence penalty proportional to both error rates.
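The compounding arithmetic in the example above can be written out directly. This minimal sketch assumes the two stages fail independently, so their confidences multiply; the function name is hypothetical.

```python
def effective_extraction_confidence(ocr_confidence: float,
                                    translation_confidence: float) -> float:
    """Combine two stage confidences, assuming the stages err independently."""
    for value in (ocr_confidence, translation_confidence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("confidences must lie in the range [0, 1]")
    return ocr_confidence * translation_confidence


# The 90% OCR / 90% translation example from the text.
print(round(effective_extraction_confidence(0.90, 0.90), 2))  # 0.81
```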

Incomplete Page Set

The submitted document references pages, exhibits, or attachments that are not present in the intake packet. The agent can identify this in documents that contain explicit page counts, continuation references, or exhibits lists. The missing content is treated as an unresolved evidence gap and contributes a suppression proportional to the expected materiality of the missing pages.

Cross-Evidence Consistency Detractors

These detractors fire when the agent successfully extracts data from multiple sources but finds that those sources cannot be reconciled into a coherent, internally consistent picture of the material facts. Consistency is evaluated across all document pairs, not just against the application form.

Cross-Document Data Discrepancy

Two or more source documents state different values for the same material fact. Examples include a date of birth that differs between the application form and the APS, an annual income figure that differs between the employer letter and the tax return, or a diagnosis date that differs between two treating physician reports. The agent logs the specific field, the conflicting values, and the source documents for each — it does not resolve the conflict by choosing the value it deems more credible.
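A minimal sketch of this behaviour, assuming extracted facts arrive as flat records with hypothetical `field`, `document`, `page`, and `extracted_value` keys. It compares values as exact strings; a real pipeline would presumably normalise formats first. The point illustrated is that every conflicting source is logged and none is picked as the winner.

```python
from collections import defaultdict


def find_discrepancies(extractions: list[dict]) -> list[dict]:
    """Group extracted values by field and report fields with conflicting values."""
    by_field: dict[str, list[dict]] = defaultdict(list)
    for item in extractions:
        by_field[item["field"]].append(item)

    findings = []
    for field_name, items in by_field.items():
        distinct_values = {item["extracted_value"] for item in items}
        if len(distinct_values) > 1:
            # Log every conflicting source; do not choose a "more credible" value.
            findings.append({
                "type": "cross_document_data_discrepancy",
                "affected_field": field_name,
                "sources": [
                    {"document": i["document"], "page": i["page"],
                     "extracted_value": i["extracted_value"]}
                    for i in items
                ],
            })
    return findings


extractions = [
    {"field": "annual_income", "document": "tax_return_2024.pdf", "page": 1,
     "extracted_value": "142000"},
    {"field": "annual_income", "document": "employer_letter.pdf", "page": 1,
     "extracted_value": "118500"},
    {"field": "date_of_birth", "document": "application.pdf", "page": 1,
     "extracted_value": "1980-04-02"},
    {"field": "date_of_birth", "document": "aps_dr_chen.pdf", "page": 1,
     "extracted_value": "1980-04-02"},
]
findings = find_discrepancies(extractions)
print(findings[0]["affected_field"])  # annual_income
```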

Applicant-Stated vs. Document-Evidenced Mismatch

The facts stated on the application form cannot be corroborated by — or are directly contradicted by — the supporting document set. This is one of the highest-weight detractors in the engine, because an irreconcilable mismatch between stated and evidenced facts on a material underwriting dimension is a standalone underwriting concern independent of any other finding.

Conflicting Occupation Representations

The applicant’s occupation is described differently across sources: the job title on the application, the narrative in the Attending Physician Statement, the employer letter, and the duty questionnaire cannot be aligned to a single resolved occupation class. Occupation class is a primary underwriting driver — irreconcilable occupation representations prevent the agent from applying the correct risk thresholds and duty mix requirements.

Inconsistent Income Timeline

The YTD income figures, annualised projections, prior-year tax documents, and employer letters cannot be arithmetically reconciled within the tolerance bands defined in the AOP. This includes cases where the implied YTD run rate is implausible relative to the stated annual income, and cases where income figures across different document types diverge beyond the AOP’s defined reconciliation tolerance.

Medical History Timeline Inconsistency

Treatment dates, prescription fill dates, procedure dates, and the narrative timeline in the Attending Physician Statement cannot be aligned into a coherent chronological sequence. Timeline inconsistencies in medical documentation are material underwriting findings in their own right — the agent logs each inconsistency with the specific dates, the documents each date was sourced from, and the nature of the conflict.

Third-Party Data Conflict

A third-party data source — MIB record, prescription drug history report, motor vehicle report, or equivalent — contains information that is inconsistent with the applicant’s stated history or the submitted supporting documentation. The agent logs the specific conflicting data points, their source (third-party vs. submitted), and the severity of the discrepancy under the AOP’s defined conflict-severity rubric.

Implausible or Anomalous Value

An extracted value — income, age at diagnosis, treatment duration, claim amount — falls outside the range of statistically plausible values for the applicant’s stated occupation class, age, or condition profile as defined in the AOP’s reference tables. The agent does not reject the value — it flags it as requiring human scrutiny and applies a suppression proportional to the degree of departure from the expected range.

Reinsurance Treaty Boundary Case

The case characteristics place it on or near a boundary defined in the applicable reinsurance treaty — for example, a benefit amount or occupation class that sits at the edge of a treaty’s automatic acceptance limits. The treaty boundary is explicitly configured in the AOP; cases within the defined margin of that boundary receive a confidence suppression because the correct treaty disposition cannot be determined without direct reinsurer consultation.

Inference & Reasoning Detractors

These detractors fire when the reasoning process itself introduces uncertainty — either from the foundation model’s own internal signal quality, from guardrail interventions that created gaps in reasoning context, or from structural ambiguities in the case that prevent the agent from reaching a deterministic conclusion under its configured rules.

Model Uncertainty Signal

The foundation model’s own confidence signal for a given reasoning step, inferred from response hedging, low-confidence phrasing, or explicit qualification in the reasoning chain, falls below the configured floor. This is not the composite confidence score: it is the model’s internal signal about its own output quality on a specific sub-question. When the model is uncertain about a conclusion, the engine treats that uncertainty as a first-class signal rather than discarding it.

Guardrail Partial Intervention

An Amazon Bedrock Guardrail or Azure AI Content Safety policy redacted or modified part of an inference input or output during a reasoning step. Even where the guardrail did not block the step entirely, a partial intervention creates a gap in the reasoning context: the model completed its reasoning on a modified version of the input, and the agent cannot determine with certainty what was altered. The intervention is logged in full; the affected reasoning step carries a confidence suppression.

Incomplete Duty Mix Resolution

The available evidence is insufficient to break down the applicant’s occupational duty mix — the percentage split between manual, supervisory, and clerical functions — to the granularity the AOP requires for the applicable occupation class. Duty mix is a primary driver of disability risk classification; an unresolved duty mix prevents the agent from applying the correct thresholds and results in a confidence suppression on the occupation analysis dimension.

Unresolvable Requirement

The agent determines that a specific piece of documentation is required under the AOP to reach a recommendation on a material dimension, but that document cannot be obtained through any available channel — for example, an Attending Physician Statement from a treating physician whose practice has closed. The requirement is surfaced explicitly; the agent does not substitute alternative evidence for a document the AOP designates as mandatory.

Multi-Jurisdiction Complexity

The case involves material facts — licensing, occupation classification, regulatory limits, benefit amounts — that span multiple states or countries where applicable rules differ and the correct jurisdiction cannot be determined from available documentation. The agent logs each jurisdictional ambiguity and the specific rules that conflict; it does not apply a single jurisdiction’s rules to a case where that selection is itself unresolved.

Compounded Uncertainty Cascade

Multiple individually moderate detractors, when combined, produce a cascading suppression that is materially larger than the sum of their individual penalties. The engine detects when detractors are correlated — for example, an OCR confidence issue on the primary income document combined with a cross-document income discrepancy — and applies a cascade multiplier that reflects the compounded evidentiary weakness rather than treating each detractor as independent.
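One way to picture the cascade is a multiplier applied only to detractors that are correlated. The sketch below is an assumption-heavy illustration: it treats two detractors as correlated when they share an affected field, and the 1.25 multiplier is an arbitrary placeholder. The engine's actual correlation test and multiplier are not specified here.

```python
from collections import Counter


def total_suppression(detractors: list[dict], multiplier: float = 1.25) -> float:
    """Sum suppressions, scaling up those that share an affected field."""
    field_counts = Counter(d["affected_field"] for d in detractors)
    total = 0.0
    for d in detractors:
        # Hypothetical correlation rule: more than one detractor on the same field.
        correlated = field_counts[d["affected_field"]] > 1
        total += d["suppression"] * (multiplier if correlated else 1.0)
    return total


detractors = [
    {"type": "ocr_below_extraction_floor", "affected_field": "annual_income",
     "suppression": 10},
    {"type": "cross_document_data_discrepancy", "affected_field": "annual_income",
     "suppression": 20},
    {"type": "implausible_value", "affected_field": "treatment_duration",
     "suppression": 4},
]
print(total_suppression(detractors))  # 41.5, versus 34 if treated as independent
```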

11.3 Confidence engine architecture

Fig. A12.1 — Confidence engine architecture. Five independently evaluated signal streams feed a weighted aggregator. Detractor penalties are applied by the normaliser before the composite score is evaluated against your configured threshold tiers. Each stream’s contribution and each detractor’s penalty are individually logged against the case.

11.4 Threshold configuration & escalation tiers

The confidence engine evaluates the composite score against two configurable thresholds defined in the AOP: a high threshold and a low threshold. These two values divide the score range into three tiers, each with a distinct agent behaviour.
| Tier | Score range | Agent behaviour |
| --- | --- | --- |
| Auto-Resolve | ≥ high threshold (e.g., ≥ 90) | Full or partial recommendation issued with evidence citations. No mandatory human review. Flags are included for any dimension with a non-zero detractor, but the recommendation is considered actionable by the underwriter at their discretion. |
| Soft Review | Between low and high thresholds (e.g., 75–89) | Partial recommendation with explicit flags issued. The case is surfaced in the underwriter’s review queue for optional but encouraged human review. All detractor signals and their evidence are visible in the case record. |
| Hard Escalation | < low threshold (e.g., < 75) | No recommendation is issued. The case is mandatorily routed to a senior underwriter with the full confidence signal breakdown, escalation reasons, open questions, and all available evidence citations. The agent does not produce a recommendation under these conditions. |
Threshold values are set in the AOP, not in the container image. This means different product lines, distribution channels, or regulatory jurisdictions can carry different threshold configurations under separate AOP versions — all managed through your standard AOP governance workflow (see 9). The escalation logic that enforces these thresholds is hard-coded in the agent’s output assembly layer and cannot be bypassed by AOP configuration.
A critical data point failure, where a material underwriting dimension cannot be evaluated at all due to document absence, total OCR failure, or irreconcilable inconsistency, triggers hard escalation regardless of the composite confidence score. The score thresholds govern routing only when every material dimension can be evaluated; a critical extraction failure overrides them unconditionally. See 8 for the full escalation gate protocol.
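The tier logic, including the critical-failure override, can be sketched as a small routing function. The names are hypothetical, and the default thresholds of 75 and 90 are just the example values from the table, not product defaults.

```python
from enum import Enum


class Tier(Enum):
    AUTO_RESOLVE = "auto_resolve"
    SOFT_REVIEW = "soft_review"
    HARD_ESCALATION = "hard_escalation"


def classify(score: int, low: int = 75, high: int = 90,
             critical_failure: bool = False) -> Tier:
    """Map a composite score to a tier; a critical failure always escalates."""
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    if critical_failure:
        # Checked first, so no score can bypass the override.
        return Tier.HARD_ESCALATION
    if score >= high:
        return Tier.AUTO_RESOLVE
    if score >= low:
        return Tier.SOFT_REVIEW
    return Tier.HARD_ESCALATION


print(classify(92).value)                         # auto_resolve
print(classify(80).value)                         # soft_review
print(classify(92, critical_failure=True).value)  # hard_escalation
```

Checking the critical-failure flag before any threshold comparison mirrors the rule that the override applies regardless of the composite score.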

11.5 Score reproducibility — two structural features

Enterprise customers consistently raise a legitimate question: if the confidence score is produced by a reasoning workload that incorporates a foundation model, how can it be trusted to be reproducible and auditable? This concern deserves a precise answer. In the context of confidence scoring, the label “nondeterminism” conflates two distinct phenomena that require different responses:
  • Score variance from changing inputs — the AOP is updated, a new document is added, the model version changes, or the case facts are genuinely different. Score variance in this case is correct and expected. A system that produces the same score regardless of input changes is the more dangerous product.
  • Score variance from identical inputs — the same case, the same AOP version, the same model version, the same document set produces materially different scores across runs. This is the legitimate concern.
The confidence engine addresses the second phenomenon through two structural features.

11.5.1 Confidence signal lineage report

Every case output includes a confidence_signal_breakdown object alongside the composite score. This object decomposes the composite into its constituent signal contributions — showing not just the final number, but every suppression that produced it.
{
  "confidence_signal_breakdown": {
    "composite_score": 72,
    "signals": [
      {
        "domain": "cross_evidence_consistency",
        "domain_score": 61,
        "weight": 0.30,
        "detractors": [
          {
            "type": "cross_document_data_discrepancy",
            "severity": "high",
            "affected_field": "annual_income",
            "suppression_applied": 22,
            "sources": [
              { "document": "tax_return_2024.pdf", "page": 1, "extracted_value": "142000" },
              { "document": "employer_letter.pdf", "page": 1, "extracted_value": "118500" }
            ]
          }
        ]
      },
      {
        "domain": "document_extraction_quality",
        "domain_score": 84,
        "weight": 0.20,
        "detractors": [
          {
            "type": "ocr_below_extraction_floor",
            "severity": "moderate",
            "affected_field": "attending_physician_signature",
            "suppression_applied": 11,
            "sources": [
              { "document": "aps_dr_chen.pdf", "page": 4, "ocr_confidence": 0.61 }
            ]
          }
        ]
      }
    ]
  }
}
Every detractor in the breakdown is evidence-grounded: it names the source document, the page, the extracted value (or the OCR confidence), and the suppression it contributed. A score of 72 is not a black box — it is the precise, traceable sum of every ambiguity the engine identified. This means:
  • If two runs of the same case produce different scores, the lineage report shows exactly which signal changed, which detractor fired differently, and which source document drove the difference.
  • Your internal audit team can reconstruct the full scoring rationale for any historical case — at the signal level, not just the recommendation level — without re-running the agent.
  • Regulatory examiners reviewing a historical decision have access to the complete chain of evidence from raw OCR confidence to composite score to recommendation.
The signal lineage report is written to both the output payload and your CloudWatch / Azure Monitor audit log simultaneously. It is not a separate query or a post-hoc report — it is a first-class output artefact logged at the same time as the recommendation (see 7).
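To show how the lineage report supports run-to-run comparison, here is a minimal sketch that diffs two `confidence_signal_breakdown` objects shaped like the sample above. The helper names are hypothetical, and keying detractors by domain, type, and affected field is an assumption about what identifies a detractor uniquely.

```python
def index_detractors(breakdown: dict) -> dict:
    """Map (domain, detractor type, affected field) to the suppression applied."""
    index = {}
    for signal in breakdown["confidence_signal_breakdown"]["signals"]:
        for d in signal["detractors"]:
            key = (signal["domain"], d["type"], d["affected_field"])
            index[key] = d["suppression_applied"]
    return index


def diff_runs(run_a: dict, run_b: dict) -> list[dict]:
    """List detractors whose suppression differs between two runs."""
    a, b = index_detractors(run_a), index_detractors(run_b)
    changes = []
    for key in sorted(a.keys() | b.keys()):
        if a.get(key) != b.get(key):
            domain, dtype, field_name = key
            changes.append({"domain": domain, "type": dtype,
                            "affected_field": field_name,
                            "run_a": a.get(key), "run_b": b.get(key)})
    return changes


income = {"type": "cross_document_data_discrepancy",
          "affected_field": "annual_income", "suppression_applied": 22}
ocr = {"type": "ocr_below_extraction_floor",
       "affected_field": "attending_physician_signature", "suppression_applied": 11}
run_a = {"confidence_signal_breakdown": {"signals": [
    {"domain": "cross_evidence_consistency", "detractors": [income]},
    {"domain": "document_extraction_quality", "detractors": []},
]}}
run_b = {"confidence_signal_breakdown": {"signals": [
    {"domain": "cross_evidence_consistency", "detractors": [income]},
    {"domain": "document_extraction_quality", "detractors": [ocr]},
]}}
changes = diff_runs(run_a, run_b)
print(changes[0]["type"])  # ocr_below_extraction_floor
```

A `None` on one side of a change means the detractor fired in only one of the two runs.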

11.5.2 AOP-anchored reproducibility validation

The CI/CD test harness (see 10) runs the agent against a versioned set of historical validation cases before any AOP promotion. For confidence scoring specifically, the harness does not only check whether the binary recommendation matches the historical outcome — it captures the full score distribution across the validation case set and compares it against the certified baseline distribution established at AOP certification. Before any AOP version is promoted to production, the harness must confirm that:
  1. The composite score distribution across the validation set falls within the defined tolerance band relative to the certified baseline (configurable; default: ± 3 points on mean, ± 5 points on p10/p90).
  2. No individual validation case produces a score that deviates by more than the defined per-case tolerance from its baseline (configurable; default: ± 8 points).
  3. The proportion of cases in each tier — auto-resolve, soft review, hard escalation — does not shift by more than the defined tier-distribution tolerance (configurable; default: ± 5 percentage points per tier).
A failure on any of these checks blocks the AOP promotion and generates a structured review report itemising each out-of-tolerance case, the detractors that changed, and the AOP delta responsible for the change. The certified baseline distribution is stored as a version-controlled artefact in your source control system alongside the AOP itself. It is updated only when a new AOP version is explicitly certified by your underwriting governance lead — it is never auto-updated by the harness.
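The three promotion checks can be sketched as a single gate function. This is an illustration under stated assumptions, not the harness itself: the function and parameter names are hypothetical, scores are keyed by a case identifier, p10 and p90 are taken from `statistics.quantiles`, and the tier thresholds of 75 and 90 are the example values from 11.4. The tolerances default to the values listed above.

```python
from collections import Counter
from statistics import mean, quantiles

TIERS = ("auto_resolve", "soft_review", "hard_escalation")


def tier(score: float, low: float = 75, high: float = 90) -> str:
    if score >= high:
        return "auto_resolve"
    return "soft_review" if score >= low else "hard_escalation"


def tier_shares(scores: list[float]) -> dict[str, float]:
    counts = Counter(tier(s) for s in scores)
    return {t: 100.0 * counts[t] / len(scores) for t in TIERS}


def check_promotion(baseline: dict[str, float], candidate: dict[str, float],
                    mean_tol: float = 3.0, tail_tol: float = 5.0,
                    case_tol: float = 8.0, tier_tol: float = 5.0) -> list[str]:
    """Return failure messages; an empty list means all checks passed."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same validation cases")
    case_ids = sorted(baseline)
    base = [baseline[c] for c in case_ids]
    cand = [candidate[c] for c in case_ids]
    failures = []
    # Check 1: distribution shift on the mean and on the p10/p90 tails.
    if abs(mean(cand) - mean(base)) > mean_tol:
        failures.append("mean shift exceeds tolerance")
    base_q, cand_q = quantiles(base, n=10), quantiles(cand, n=10)
    for label, i in (("p10", 0), ("p90", 8)):
        if abs(cand_q[i] - base_q[i]) > tail_tol:
            failures.append(f"{label} shift exceeds tolerance")
    # Check 2: per-case deviation from the certified baseline.
    for c in case_ids:
        if abs(candidate[c] - baseline[c]) > case_tol:
            failures.append(f"case {c} deviates beyond per-case tolerance")
    # Check 3: shift in the share of cases landing in each tier.
    base_s, cand_s = tier_shares(base), tier_shares(cand)
    for t in TIERS:
        if abs(cand_s[t] - base_s[t]) > tier_tol:
            failures.append(f"tier {t} share shifts beyond tolerance")
    return failures


baseline = {f"c{i}": s for i, s in
            enumerate([95, 92, 91, 88, 85, 80, 78, 76, 70, 60])}
candidate = dict(baseline, c3=78)  # one case drops by 10 points
print(check_promotion(baseline, candidate))
```

Returning every failure rather than stopping at the first one matches the idea of a structured review report that itemises each out-of-tolerance case.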
A score distribution that is consistently concentrated in the high tier is not evidence of a well-calibrated AOP — it may be evidence of a permissive configuration that is failing to suppress scores for genuinely ambiguous cases. Layerup’s implementation team reviews the certified baseline distribution as part of every white-glove AOP update and will flag any distribution that shows implausible concentration above the high threshold.

11.6 Confidence score in the output payload

The confidence score and its signal lineage are surfaced in three fields across two objects of the output payload, ai_recommendation and escalation_flag (see 6 for the full output schema).
| Field | Location in payload | Description |
| --- | --- | --- |
| confidence_score | ai_recommendation.confidence_score | 0–100 integer. The composite score after all detractor penalties are applied. Cases below the configured low threshold receive Escalate to Senior Underwriter as the decision value. |
| confidence_signal_breakdown | ai_recommendation.confidence_signal_breakdown | Decomposed signal breakdown object. One entry per evaluated domain, each with domain score, weight, and detractor array with source citations. |
| escalation_reasons | escalation_flag.escalation_reasons | Structured list of the specific detractors that contributed to escalation, cross-referenced to the signals in confidence_signal_breakdown. Where the escalation was triggered by a critical data point failure rather than a threshold breach, the specific failure is enumerated here. |
The relationship between these three fields forms the complete audit chain for any individual case decision: the composite score is explained by the signal breakdown, and the escalation reasons reference the specific signals that crossed the escalation threshold — giving your underwriting team, your compliance officers, and your internal auditors a single coherent artefact that traces from raw document evidence to final recommendation disposition.
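An auditor's consistency check over this chain might look like the sketch below. The shape of each `escalation_reasons` entry is not defined in this section, so the `domain`, `type`, and `critical_failure` keys are assumptions made for the example, as is the function name.

```python
def unreferenced_reasons(payload: dict) -> list[dict]:
    """Return escalation reasons that match no detractor in the signal breakdown."""
    breakdown = payload["ai_recommendation"]["confidence_signal_breakdown"]
    known = {
        (signal["domain"], d["type"])
        for signal in breakdown["signals"]
        for d in signal["detractors"]
    }
    reasons = payload["escalation_flag"]["escalation_reasons"]
    return [
        r for r in reasons
        # Assumed flag: critical data point failures need no matching detractor.
        if not r.get("critical_failure", False)
        and (r["domain"], r["type"]) not in known
    ]


payload = {
    "ai_recommendation": {
        "confidence_score": 72,
        "confidence_signal_breakdown": {"signals": [
            {"domain": "cross_evidence_consistency",
             "detractors": [{"type": "cross_document_data_discrepancy"}]},
        ]},
    },
    "escalation_flag": {"escalation_reasons": [
        {"domain": "cross_evidence_consistency",
         "type": "cross_document_data_discrepancy"},
        {"domain": "document_extraction_quality",
         "type": "ocr_below_extraction_floor"},
    ]},
}
orphans = unreferenced_reasons(payload)
print([r["type"] for r in orphans])  # ['ocr_below_extraction_floor']
```

An empty result indicates that every threshold-driven escalation reason traces back to a logged detractor.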