8 — Error handling & resilience — deterministic fallback, escalation protocols & failure-mode taxonomy.

The Layerup agent is designed to never silently fail. This is not an operational aspiration — it is a structural property enforced by the agent’s output assembly architecture. Every error condition results in a deterministic fallback action that preserves case integrity, writes a structured error record to your audit log, and ensures a human underwriter is notified.

8.1 Design principle — fail deterministically, never silently

The following invariant governs all error handling logic in the Layerup agent:
Any error condition that prevents the agent from producing a high-confidence, fully evidence-grounded output for a case results in an explicit, structured escalation to a human underwriter. The agent will never produce a confident-looking output for a case it cannot process correctly. Silent failures, suppressed exceptions, and optimistic outputs under uncertainty are architecturally excluded.
This means the agent’s output always falls into one of three categories:
  1. Full recommendation — All required documents processed, confidence above threshold, evidence grounded, requirements complete.
  2. Partial recommendation with flags — Some data points uncertain, some documents degraded, but overall confidence sufficient. Output includes explicit flags directing the underwriter to the uncertain items.
  3. Escalation — Confidence below threshold, critical extraction failure, or edge case outside AOP coverage. Case is routed to a senior underwriter with full context.
There is no fourth category.
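
The three categories can be modelled as a closed type so that no other output shape is representable. The following is a minimal sketch under stated assumptions, not the agent's actual schema: `OutputCategory`, `CaseOutput`, and every field name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class OutputCategory(Enum):
    """The only three output categories described above (names are illustrative)."""
    FULL_RECOMMENDATION = "full_recommendation"
    PARTIAL_WITH_FLAGS = "partial_recommendation_with_flags"
    ESCALATION = "escalation"


@dataclass
class CaseOutput:
    """Hypothetical output envelope; not the real payload schema."""
    case_id: str
    category: OutputCategory
    flags: List[str] = field(default_factory=list)
    escalation_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # A partial recommendation must tell the underwriter what is uncertain.
        if self.category is OutputCategory.PARTIAL_WITH_FLAGS and not self.flags:
            raise ValueError("partial recommendation requires at least one flag")
        # An escalation must carry its reason so the reviewer has context.
        if self.category is OutputCategory.ESCALATION and not self.escalation_reason:
            raise ValueError("escalation requires a reason")
        # A full recommendation has no open flags by definition.
        if self.category is OutputCategory.FULL_RECOMMENDATION and self.flags:
            raise ValueError("full recommendation must not carry flags")
```

Validating at construction time means an inconsistent output (for example, a partial recommendation with no flags) fails loudly instead of reaching the underwriter.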

8.2 Document processing failures

OCR failure

If the OCR pipeline cannot extract readable text from a document with confidence above the configured threshold:
  1. The agent logs a structured warning citing the specific document filename and page number.
  2. The agent continues processing all other documents in the case — partial failure does not abort the case.
  3. The failed document is included in the requirements list as requiring manual review by the underwriter, with explicit notation that OCR confidence was insufficient.
  4. The agent does not attempt to reason over unreadable content. Uncertain fields from failed OCR extractions are explicitly marked as unresolved in the output payload.
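
The OCR steps above can be sketched as a triage pass over page-level OCR results. This is an illustrative sketch only: the 0.80 threshold, the `PageResult` and `OcrTriage` types, and the entry field names are assumptions, and the sketch triages page by page rather than per document.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger("agent.ocr")

# Assumed value for illustration; the real threshold is whatever you configure.
OCR_CONFIDENCE_THRESHOLD = 0.80


@dataclass
class PageResult:
    filename: str
    page: int
    text: str
    confidence: float  # OCR confidence in [0, 1]


@dataclass
class OcrTriage:
    readable: Dict[str, List[str]] = field(default_factory=dict)
    manual_review: List[dict] = field(default_factory=list)


def triage_ocr(pages: List[PageResult],
               threshold: float = OCR_CONFIDENCE_THRESHOLD) -> OcrTriage:
    """Keep readable pages; list unreadable ones for manual review."""
    triage = OcrTriage()
    for page in pages:
        if page.confidence < threshold:
            # Structured warning citing the document filename and page number.
            logger.warning(
                "OCR confidence below threshold: file=%s page=%d confidence=%.2f",
                page.filename, page.page, page.confidence,
            )
            triage.manual_review.append({
                "filename": page.filename,
                "page": page.page,
                "reason": "ocr_confidence_insufficient",
                "status": "unresolved",
            })
            continue  # do not reason over unreadable content; keep going
        triage.readable.setdefault(page.filename, []).append(page.text)
    return triage
```

A failed page never raises, so one unreadable document cannot abort the rest of the case.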

Unsupported document format

If the agent encounters a document format outside its supported types (e.g., a proprietary medical record format, a password-protected PDF, or a corrupted file):
  1. The error is logged with the specific document identifier, format, and error type.
  2. The document is included in an unprocessed_attachments list in the output payload.
  3. The underwriter is directed to review the document manually, with the specific reason for non-processing stated.
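
As an illustration of how the unprocessed_attachments list could be assembled, the sketch below checks each attachment and records the reason it was skipped. The supported-extension set, the `encrypted` and `corrupted` inputs, and the entry fields are assumptions made for this example; only the list name comes from the text above.

```python
from pathlib import Path
from typing import List, Optional

# Assumed set of supported types, for illustration only.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def check_attachment(document_id: str, filename: str,
                     encrypted: bool = False,
                     corrupted: bool = False) -> Optional[dict]:
    """Return an unprocessed_attachments entry, or None if the file is processable."""
    ext = Path(filename).suffix.lower()
    if corrupted:
        error_type = "corrupted_file"
    elif encrypted:
        error_type = "password_protected"
    elif ext not in SUPPORTED_EXTENSIONS:
        error_type = "unsupported_format"
    else:
        return None
    return {
        "document_id": document_id,
        "format": ext or "unknown",
        "error_type": error_type,   # the stated reason for non-processing
        "action": "manual_review",  # directs the underwriter to the document
    }


def build_unprocessed_attachments(attachments: List[dict]) -> List[dict]:
    """Collect entries for every attachment that cannot be processed."""
    entries = (check_attachment(**a) for a in attachments)
    return [e for e in entries if e is not None]
```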

8.3 LLM inference failures

Model timeout

If an Amazon Bedrock or Azure OpenAI inference call exceeds the configured timeout (default: 120 seconds):
  1. The agent implements an exponential backoff retry policy with a maximum of 3 retries.
  2. Retry intervals: 5s → 15s → 45s (with jitter), chosen so that the SQS visibility timeout / Service Bus message lock duration is not breached during the retry window.
  3. If all 3 retries fail, the case is marked as Failed and routed to the human escalation queue with full error context.
  4. The failure event is written to CloudWatch Logs with millisecond-precision timestamps for each retry attempt.

| Retry attempt | Wait before retry | Action if exceeded |
|---|---|---|
| Attempt 1 | Immediate | Log failure · retry |
| Attempt 2 | 5 seconds (+ jitter) | Log failure · retry |
| Attempt 3 | 15 seconds (+ jitter) | Log failure · retry |
| Final failure | 45 seconds (+ jitter) | Mark Failed · route to escalation queue |
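
The retry schedule above can be sketched as follows. This is a minimal illustration, not the agent's implementation: `call_with_retries`, `CaseFailed`, and the 1-second jitter ceiling are assumptions, and `invoke` stands in for whatever function wraps the inference call and raises `TimeoutError` when the configured timeout is exceeded.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

RETRY_DELAYS_S = (5.0, 15.0, 45.0)  # waits before retries 1, 2 and 3
MAX_JITTER_S = 1.0                  # assumed jitter ceiling, for illustration


class CaseFailed(Exception):
    """Raised when the initial call and all retries time out."""

    def __init__(self, attempts: List[str]):
        super().__init__(f"inference failed after {len(attempts)} attempts")
        self.attempts = attempts  # per-attempt error context for escalation


def call_with_retries(
    invoke: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = lambda: random.uniform(0.0, MAX_JITTER_S),
) -> T:
    """Make one initial call plus up to three retries at 5s, 15s, 45s (+ jitter)."""
    errors: List[str] = []
    for attempt in range(len(RETRY_DELAYS_S) + 1):
        if attempt > 0:
            sleep(RETRY_DELAYS_S[attempt - 1] + jitter())
        try:
            return invoke()
        except TimeoutError as exc:
            errors.append(f"attempt {attempt + 1}: {exc}")
    raise CaseFailed(errors)  # caller marks the case Failed and escalates


def worst_case_window_s(call_timeout_s: float = 120.0) -> float:
    """Upper bound on time spent if every attempt runs to the timeout."""
    attempts = len(RETRY_DELAYS_S) + 1
    return (attempts * call_timeout_s
            + sum(RETRY_DELAYS_S)
            + len(RETRY_DELAYS_S) * MAX_JITTER_S)
```

Under these assumptions the worst case is 4 × 120 s of calls plus 65 s of waits plus up to 3 s of jitter, or 548 s in total. The queue's visibility timeout (or lock duration) would need to exceed that figure for the message to stay invisible throughout the retry window.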

Guardrail intervention

If Amazon Bedrock Guardrails or Azure AI Content Safety blocks an inference request:
  1. The specific guardrail policy that triggered the block is logged in full detail to the audit log, including the blocked input and the policy type.
  2. The agent aborts that specific reasoning step and does not attempt to rephrase or retry past the guardrail block. There is no code path that circumvents the guardrail.
  3. The case is flagged for human review, with the guardrail intervention noted in the escalation flag and requirements list.
The agent is explicitly prohibited from retrying past a guardrail block with a modified prompt. A guardrail block is treated as a terminal error for that reasoning step. This ensures guardrail policies cannot be circumvented through iterative prompt modification.
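
The terminal-error behaviour can be illustrated with a sketch in which the handler for a guardrail block contains no retry path at all. `GuardrailBlocked`, `run_reasoning_step`, and the audit-record fields are hypothetical names for this example; they are not the Bedrock or Azure SDK's own types.

```python
from datetime import datetime, timezone
from typing import Callable, List


class GuardrailBlocked(Exception):
    """Hypothetical error raised by an inference wrapper when a guardrail blocks a request."""

    def __init__(self, policy_type: str, blocked_input: str):
        super().__init__(f"guardrail block: {policy_type}")
        self.policy_type = policy_type
        self.blocked_input = blocked_input


def run_reasoning_step(invoke: Callable[[str], str], prompt: str,
                       audit_log: List[dict]) -> dict:
    """Run one reasoning step; a guardrail block aborts it with no retry."""
    try:
        return {"status": "ok", "output": invoke(prompt), "escalate": False}
    except GuardrailBlocked as block:
        # Log the policy type and the blocked input to the audit log.
        audit_log.append({
            "event": "guardrail_intervention",
            "policy_type": block.policy_type,
            "blocked_input": block.blocked_input,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        # Terminal for this step: no rephrasing and no second call.
        return {
            "status": "aborted",
            "output": None,
            "escalate": True,
            "escalation_flag": "guardrail_intervention",
        }
```

Because the `except` branch only logs and returns, the prompt is sent exactly once per step regardless of how the caller is written.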

Model unavailability

If the underlying foundation model is unavailable (e.g., during an Amazon Bedrock or Azure OpenAI service incident):
  1. The agent’s SQS queue / Service Bus retains unprocessed messages for the duration of the outage. Messages are not dropped.
  2. Once the service recovers, the agent processes the backlog in arrival order.
  3. Your CloudWatch Alarm / Azure Monitor Alert notifies your operations team of the queue depth during an outage, with configurable thresholds for breach notification.
  4. SQS Dead-Letter Queue (DLQ) / Service Bus Dead-Letter Queue captures any messages that exceed the maximum receive count — these are reviewed manually by your operations team.
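
On the AWS side, retention and dead-lettering are governed by queue attributes. The sketch below builds an attribute map in the shape accepted by the `Attributes` parameter of the SQS `SetQueueAttributes` call; it makes no AWS call itself. The helper name, the ARN, the receive count of 5, and the 600-second visibility timeout are placeholder assumptions, not recommended or shipped values.

```python
import json
from typing import Dict


def build_queue_attributes(dlq_arn: str,
                           max_receive_count: int = 5,
                           visibility_timeout_s: int = 600,
                           retention_s: int = 1209600) -> Dict[str, str]:
    """Build SQS queue attributes (all values are strings, as SQS expects)."""
    return {
        # Must exceed the worst-case inference retry window.
        "VisibilityTimeout": str(visibility_timeout_s),
        # 1209600 s = 14 days, so messages survive a prolonged outage.
        "MessageRetentionPeriod": str(retention_s),
        # After max_receive_count receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


# Placeholder ARN for illustration only.
attributes = build_queue_attributes("arn:aws:sqs:us-east-1:123456789012:example-dlq")
```

On Azure Service Bus the corresponding queue setting is the maximum delivery count, after which a message is moved to the dead-letter queue.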

8.4 Flag-and-escalate protocol — the confidence floor

The Layerup agent is configured to escalate to a human underwriter in any situation where its confidence is insufficient to support a definitive recommendation. The escalation logic operates as a mandatory, pre-output gate; it cannot be bypassed by prompt engineering or configuration.

Fig. A8.1: Escalation gate. Every case passes three mandatory checks before a recommendation is assembled. Escalation is not an error state; it is the designed outcome for cases where human judgment is required.

The mandatory escalation triggers are:

Confidence Below Threshold

Any case where the overall confidence score falls below your configured threshold (default: 75%) receives an automatic Escalate to Senior Underwriter recommendation, regardless of the agent’s other findings. The threshold is configurable in the AOP; the escalation logic is not.

Critical Data Point Failure

Any case where a critical data point (e.g., primary income source, primary occupation) cannot be extracted from available documents — due to document absence, OCR failure, or irreconcilable inconsistency — results in automatic escalation, regardless of confidence on other dimensions.

Edge Case Outside AOP

Any case where the agent detects a pattern it has not been configured to evaluate — an edge case outside the AOP’s defined coverage — results in a flag with an explicit note that the specific scenario requires human judgment. The agent does not extrapolate beyond its AOP scope.

Guardrail-Triggered Escalation

Any case where a Bedrock Guardrail or Azure AI Content Safety block aborts a reasoning step is automatically escalated, with the full guardrail intervention record included in the escalation payload for the reviewing underwriter. Consistent with section 8.3, the agent does not retry past the block.
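
The escalation triggers in this section can be sketched as a single pre-output gate that returns every reason a case must be escalated. This is an illustration under stated assumptions: `CaseAssessment`, its fields, and the reason strings are hypothetical, while the 0.75 default mirrors the 75% threshold stated above.

```python
from dataclasses import dataclass
from typing import List, Tuple

DEFAULT_CONFIDENCE_THRESHOLD = 0.75  # the documented default of 75%


@dataclass(frozen=True)
class CaseAssessment:
    """Hypothetical summary of a case just before output assembly."""
    confidence: float
    missing_critical_fields: Tuple[str, ...] = ()
    outside_aop: bool = False
    guardrail_blocked: bool = False


def escalation_reasons(case: CaseAssessment,
                       threshold: float = DEFAULT_CONFIDENCE_THRESHOLD) -> List[str]:
    """Return all triggered escalation reasons; an empty list means the gate passes."""
    reasons: List[str] = []
    if case.confidence < threshold:
        reasons.append("confidence_below_threshold")
    if case.missing_critical_fields:
        # Escalate regardless of confidence on other dimensions.
        reasons.append("critical_data_point_failure: "
                       + ", ".join(case.missing_critical_fields))
    if case.outside_aop:
        reasons.append("edge_case_outside_aop")
    if case.guardrail_blocked:
        reasons.append("guardrail_intervention")
    return reasons
```

Each check runs independently, so a case with high overall confidence but a missing primary income source still escalates, and the threshold is the only tunable input.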

8.5 Failure-mode summary

| Failure mode | Agent response | Written to audit log | Human escalation |
|---|---|---|---|
| OCR below confidence threshold | Flag document · continue other docs | Yes | Via requirements list |
| Unsupported document format | Log · include in unprocessed list | Yes | Via requirements list |
| Model timeout (all retries exhausted) | Mark Failed · route to escalation queue | Yes | Yes (immediate) |
| Guardrail intervention | Abort reasoning step · flag case | Yes (full guardrail detail) | Yes |
| Model service unavailability | Retain in queue · process on recovery | Yes (queue depth alarms) | Via operations team alert |
| Confidence below threshold | Escalate to Senior Underwriter | Yes | Yes |
| Critical extraction failure | Escalate to Senior Underwriter | Yes | Yes |
| Edge case outside AOP | Flag with explicit note | Yes | Yes |