07 — Data plane — ingestion & source surfaces.

The Data Plane is the platform’s port to the outside world. It accepts heterogeneous signal from many channels, classifies it, hashes it, normalises it, and presents it to the Ontology and the Reasoning plane as a typed, content-addressed stream of objects.

7.0 Reference — multimodal ingestion layer

The platform ingests anything an insurance carrier already touches: human-language channels (email, chat, phone, SMS), structured digital channels (forms, partner APIs, webhooks), document channels (PDF, scans, images, handwriting), voice channels (calls, voicemail, IVR transcripts), and machine channels (telematics, sensor, CDC). Everything funnels through a Channel Router, into channel-appropriate extractors (OCR / VLM, voice transcription, structured parsers), and out into a Unified Intake Queue typed against the Ontology.

Fig. 7.0a: Multimodal ingestion layer. Eight channel families → Channel Router → channel-specific extractors → Unified Intake Queue → Ontology.

7.1 Channel matrix

The substrate accepts signal across three transport modes — push, pull, and stream — and three content modes — structured, semi-structured, and unstructured. The same ingest gateway (§7.3) terminates all of them; differences are confined to adapters.
| Mode | Examples | Latency | Idempotency | Default backpressure |
| --- | --- | --- | --- | --- |
| Push · structured | Webhooks · partner APIs · broker portals | real-time | caller-supplied key + body hash | HTTP 429 + Retry-After |
| Push · unstructured | Inbound email · SFTP drop · upload portal | seconds–minutes | (message-id, attachment-hash) tuple | queue depth gating |
| Pull · structured | Policy admin · claims · billing · GL | scheduled or change-feed | (source, primary key, version) | chunked cursor |
| Pull · unstructured | Document mgmt · regulator portals | polled | (source, externalId, contentHash) | chunked cursor |
| Stream | Telematics · sensor · clickstream · partner CDC | milliseconds–seconds | per-stream watermark | partition lag · DLQ |

7.2 Content modes

Typed payloads

Structured — JSON / XML / Avro / Protobuf with a known schema. Validated at the gateway, mapped directly to Ontology objects.

Form-shaped

Semi-structured — Standard forms (ACORD, regulator filings, broker spreadsheets). Parsed into a tabular intermediate, then mapped.

Free-form content

Unstructured — PDFs, images, scans, handwritten forms, emails, voice transcripts. Routed through extraction tools (§9 · Extraction) under audit.

7.3 Ingest gateway responsibilities

The ingest gateway is a single hardened component that every channel terminates into. It performs:
  1. Authentication — mTLS / OAuth2 client credentials / signed webhook / SFTP key. Anonymous ingest is never permitted.
  2. Authorisation — the source principal’s scope must include data.ingest.<channel>. Cross-tenant routing is impossible by construction.
  3. Anti-malware & content scan — every byte stream is scanned at the boundary. Quarantine on detection; tenant security plane notified.
  4. Classification — initial markings (tenant, region, default sensitivity) are applied immediately.
  5. Hashing — SHA-256 of the canonical payload bytes; stored as the document content address.
  6. Persistence — write to immutable object storage with versioned key and retention policy.
  7. Dedupe — see §7.5.
  8. Acknowledgement — typed receipt with (documentId, contentHash, ingestId); acknowledged after durable persistence and audit emission.
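
The hashing, persistence, and acknowledgement steps above can be sketched as follows. This is a minimal illustration, not the gateway's implementation: `ImmutableStore`, `ingest`, the `doc-` identifier scheme, and the `data.ingest.accepted` event name are all hypothetical, and the payload is assumed to be already canonicalised.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    document_id: str
    content_hash: str
    ingest_id: str


class ImmutableStore:
    """Hypothetical stand-in for immutable object storage: keys are write-once."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put_once(self, key: str, data: bytes) -> None:
        # Never overwrite an existing object; a repeated write is a no-op.
        self._blobs.setdefault(key, data)

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def ingest(payload: bytes, store: ImmutableStore, audit_log: list) -> Receipt:
    # Step 5: the SHA-256 of the (already canonical) payload bytes is the content address.
    content_hash = hashlib.sha256(payload).hexdigest()
    # Step 6: persist under a key derived from the content address.
    store.put_once(f"sha256/{content_hash}", payload)
    receipt = Receipt(
        document_id=f"doc-{content_hash[:16]}",  # assumed ID scheme, for illustration only
        content_hash=content_hash,
        ingest_id=str(uuid.uuid4()),
    )
    # Step 8: emit the audit record first, then hand back the receipt.
    audit_log.append({"event": "data.ingest.accepted", "ingestId": receipt.ingest_id})
    return receipt
```

Ordering is the point of the sketch: the receipt is only constructed and returned after the write and the audit append have both completed.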

7.4 Topology

Fig. 7.1 — Ingest topology. Every channel terminates into the same gateway; every step emits an AuditEvent.

7.5 Dedupe key construction

Dedupe keys are deterministic per channel and form the basis of intake idempotency. The same logical signal received twice never produces two upstream objects.
| Channel | Dedupe key |
| --- | --- |
| Webhook | sha256(deliveryId · bodyHash) |
| Email | sha256(messageId · normalisedFrom · attachmentHashes) |
| Pull (CDC / change-feed) | (sourceId, recordKey, sourceVersion) |
| SFTP / batch | sha256(filePath · contentHash · ingestEpoch) |
| Stream | (partition, offset) |

7.6 Intake idempotency contract

  • Replays of an identical payload return the original ingestId with the receipt unchanged.
  • Different payloads under the same dedupe key are typed as a data.ingest.collision Exception and quarantined.
  • Replays of an identical stream offset are silently dropped.
  • The gateway never silently overwrites a previously persisted document.
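
The first, second, and fourth rules of the contract can be illustrated with a small in-memory ledger. This is a sketch under assumptions: `IntakeLedger` and `IngestCollision` are hypothetical names, and a real gateway would back the ledger with durable storage.

```python
import uuid
from dataclasses import dataclass


class IngestCollision(Exception):
    """Stands in for the data.ingest.collision Exception."""


@dataclass(frozen=True)
class Receipt:
    ingest_id: str
    content_hash: str


class IntakeLedger:
    def __init__(self) -> None:
        self._by_key: dict[str, Receipt] = {}
        self.quarantine: list = []

    def accept(self, dedupe_key: str, content_hash: str) -> Receipt:
        existing = self._by_key.get(dedupe_key)
        if existing is None:
            # First sighting: mint a receipt and remember it.
            receipt = Receipt(ingest_id=str(uuid.uuid4()), content_hash=content_hash)
            self._by_key[dedupe_key] = receipt
            return receipt
        if existing.content_hash == content_hash:
            # Identical replay: return the original receipt unchanged.
            return existing
        # Same key, different payload: quarantine it and never overwrite.
        self.quarantine.append((dedupe_key, content_hash))
        raise IngestCollision(dedupe_key)
```

A caller that retries a webhook delivery therefore sees the same `ingest_id` on every attempt, while a conflicting body under the same key surfaces as an exception and leaves the stored receipt untouched.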

7.7 Rate-shaping & backpressure

Each channel has a per-tenant rate budget configured at provisioning. Sustained breach results in 429 / Retry-After to push channels and pause-of-cursor to pull adapters. Streams apply per-partition lag thresholds; sustained breach moves traffic to a slow lane and raises a data.ingest.lag Exception. The gateway never silently drops signal.
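
One common way to realise a per-tenant rate budget on push channels is a token bucket; the source does not name the algorithm, so treat the following as one possible shape. The `TokenBucket` class, the 202 status on acceptance, and the explicit `now` argument (used here to keep the sketch deterministic) are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_per_s: float  # sustained budget
    capacity: float    # burst allowance
    tokens: float
    last: float        # timestamp of the previous call, in seconds

    def try_acquire(self, now: float) -> tuple:
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 202, {}
        # Budget exhausted: tell the caller when a token will be available.
        wait_s = math.ceil((1.0 - self.tokens) / self.rate_per_s)
        return 429, {"Retry-After": str(wait_s)}
```

The burst capacity is what separates a short spike from a sustained breach: a caller only sees 429 once the burst allowance has been used up. The request is rejected explicitly and never dropped silently.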

7.8 Boundary failure modes

| Failure | Detection | Containment | Audit signature |
| --- | --- | --- | --- |
| Auth replay | nonce / timestamp window | reject; lock principal after threshold | data.ingest.auth_replay |
| Schema drift (push) | schema validation fail | route to schema-quarantine queue | data.ingest.schema_drift |
| Malware | scanner verdict | quarantine; security notify | data.ingest.malware |
| Source unavailable (pull) | error rate window | backoff with circuit breaker | data.ingest.source_down |
| Stream lag | partition lag > SLO | slow lane + Exception | data.ingest.lag |
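
To make the auth-replay row concrete, here is a minimal sketch of nonce and timestamp-window detection with lock-after-threshold containment. The `ReplayGuard` class, its default window and threshold, and the choice to count stale timestamps as strikes are assumptions, not documented behaviour.

```python
class ReplayGuard:
    def __init__(self, window_s: float = 300.0, lock_threshold: int = 3) -> None:
        self.window_s = window_s
        self.lock_threshold = lock_threshold
        self._seen: dict = {}     # (principal, nonce) -> sent_at
        self._strikes: dict = {}  # principal -> rejected attempts

    def check(self, principal: str, nonce: str, sent_at: float, now: float) -> str:
        if self._strikes.get(principal, 0) >= self.lock_threshold:
            return "locked"
        # Forget nonces older than the window; a replay of those is stale anyway.
        self._seen = {k: t for k, t in self._seen.items() if now - t <= self.window_s}
        stale = abs(now - sent_at) > self.window_s
        replayed = (principal, nonce) in self._seen
        if stale or replayed:
            # Each rejection would be audited as data.ingest.auth_replay.
            self._strikes[principal] = self._strikes.get(principal, 0) + 1
            return "rejected"
        self._seen[(principal, nonce)] = sent_at
        return "accepted"
```

Bounding the nonce memory by the timestamp window keeps the guard's state small: anything old enough to have been evicted is rejected by the timestamp check instead.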

7.9 Channel Router

The Channel Router is the first hop after the gateway accepts a payload. Its job is to decide which extractor lane the payload belongs to, which tenant and region it lives in, and which rate budget and markings apply, all before any extractor touches the bytes. It is intentionally thin and stateless. Every routing decision is a typed AuditEvent.
Inputs
  • Wire-level metadata (transport, source IP, signed sender).
  • Authenticated principal and its tenant scope.
  • Content type, MIME, magic-bytes, file extension.
  • Channel hint declared by the gateway (e.g. email.inbound, partner.webhook).
Outputs
  • Typed ChannelRoute: {tenant, region, channel, lane, markings, rateClass}.
  • One or more lane handles (a single payload can fan out: an email body is parsed in the email-parser lane while its attachments are routed through the OCR / VLM lane).
  • Audit emission: data.route.assigned with route handle and decision reasons.
Routing taxonomy
| Channel class | Default lane | Fan-out lanes |
| --- | --- | --- |
| Email · inbound | email-parser | OCR / VLM (attachments) · structured-parser (forms) |
| Chat / SMS | chat-parser | |
| Phone / voicemail / IVR | voice-transcription | chat-parser (post-transcript) |
| Web / mobile form | structured-parser | OCR / VLM (uploads) |
| Partner API / webhook | structured-parser | |
| SFTP / batch | structured-parser or OCR / VLM by content type | |
| Telematics / sensor | stream-consumer | |
The router is also where region pinning is established. A payload’s region is decided here, persists through every downstream plane, and is never re-decided. Cross-region transit is therefore impossible by construction unless an explicit cross-region replication policy applies. (See §19 and §23.2.)
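
A stateless routing decision with fan-out and region pinning might look like the sketch below. Only a subset of the taxonomy is shown. The `email.inbound` and `partner.webhook` hints come from the text above; the `web.form` hint, the `ocr-vlm` lane identifier, the marking strings, and the `standard` rate class are hypothetical.

```python
from dataclasses import dataclass

# channel hint -> (default lane, lane used for attachments or uploads, if any)
LANES = {
    "email.inbound": ("email-parser", "ocr-vlm"),
    "web.form": ("structured-parser", "ocr-vlm"),
    "partner.webhook": ("structured-parser", None),
}


@dataclass(frozen=True)  # frozen: region and tenant cannot be re-decided downstream
class ChannelRoute:
    tenant: str
    region: str
    channel: str
    lane: str
    markings: tuple
    rate_class: str


def route(tenant: str, region: str, channel: str, attachment_mimes: tuple = ()) -> list:
    if channel not in LANES:
        raise ValueError(f"unknown channel hint: {channel}")
    default_lane, attachment_lane = LANES[channel]
    lanes = [default_lane]
    # Fan out only when there is something for the second lane to process.
    if attachment_lane is not None and attachment_mimes:
        lanes.append(attachment_lane)
    markings = (f"tenant:{tenant}", f"region:{region}", "sensitivity:default")
    return [ChannelRoute(tenant, region, channel, lane, markings, "standard") for lane in lanes]
```

For example, `route("acme", "eu", "email.inbound", ("application/pdf",))` yields two routes, one per lane, both carrying the same tenant, region, and markings. Because the function takes everything it needs as arguments and keeps no state, any router instance gives the same answer for the same inputs.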

7.10 Agentic OCR

The OCR / VLM lane is not a single OCR engine; it is an agentic extraction pipeline that selects and combines extraction tools per document. The lane supports printed text, handwriting, scanned forms, photographs, diagrams, tables, and mixed-content multi-page documents.
Pipeline stages
  1. Pre-processing — orientation, deskew, despeckle, page split, segmentation.
  2. Layout analysis — block / line / table / figure regions; reading order.
  3. Primary extraction — printed-text OCR, handwriting OCR, table parser, signature detector, stamp detector. Each is a versioned tool (§9 · Extraction).
  4. VLM fallback — for low-confidence regions or non-textual content (images, diagrams), a vision-language model emits structured descriptions.
  5. Cross-extractor reconciliation — outputs are reconciled by an agent that selects the most-supported value per field; the reasoning trail (§10.3) records why.
  6. Provenance emission — every emitted property carries an EvidenceSpan (page, bbox, optionally token range) plus extractor identity and confidence.
Why “agentic”
For mixed and degraded documents, no single extractor wins. The lane is run by an agent that picks tools, requests fallbacks, and verifies cross-extractor consistency. This agentic loop is what lets the same lane handle a structured ACORD form, a smartphone photograph of a handwritten loss notice, and a scanned binder, all at quality and without per-document hand-tuning.
Calibration & audit
  • Confidence per field is calibrated per extractor and per document class (§8.3).
  • Every reconciliation decision is an AuditEvent (data.ocr.reconcile).
  • Originals are retained at content-address; an extraction can be re-run against a newer extractor version without losing the original lineage.
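
The reconciliation stage can be illustrated with a simple voting rule. The source only says the agent selects the "most-supported" value per field; summing calibrated confidences across extractors that agree is one plausible reading, assumed here. `Candidate`, `reconcile`, and the field layout are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceSpan:
    page: int
    bbox: tuple  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Candidate:
    field_name: str
    value: str
    extractor: str     # versioned tool identity, e.g. "printed-ocr@2"
    confidence: float  # assumed to be calibrated already
    span: EvidenceSpan


def reconcile(candidates: list) -> dict:
    # Group candidates by field, then by proposed value.
    by_field = defaultdict(lambda: defaultdict(list))
    for c in candidates:
        by_field[c.field_name][c.value].append(c)
    chosen = {}
    for field_name, by_value in by_field.items():
        # Support for a value = summed confidence of the extractors proposing it.
        best = max(by_value, key=lambda v: sum(c.confidence for c in by_value[v]))
        # Keep the strongest supporting candidate so provenance stays attached.
        chosen[field_name] = max(by_value[best], key=lambda c: c.confidence)
    return chosen
```

Under this rule two moderately confident extractors that agree can outvote one highly confident extractor that disagrees. The winning `Candidate` keeps its `EvidenceSpan` and extractor identity, so the emitted property stays traceable to a page region.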

7.11 Voice Transcription

Voice channels (recorded calls, voicemail, IVR, agent-customer dialog, claim FNOL phone intake, broker phone submissions) are routed into the voice-transcription lane. The lane produces a typed Transcript with timing, speaker labels, language tag, redactions, and line-by-line confidence.
Pipeline stages
  1. Pre-processing — channel split, silence trim, format normalisation.
  2. ASR — speech-to-text in the language detected; configurable per region and per LOB.
  3. Diarisation — speaker turns (agent / customer / third-party).
  4. Translation (optional) — into the operator-display language; original transcript retained.
  5. Redaction — PII / PHI tokens are redacted at the lane boundary per markings policy (§15.4); raw audio retention follows the configured per-tenant retention floor.
  6. Hand-off — the transcript is published to the Unified Intake Queue (§7.12) typed as VoiceTranscript, with a backref to the original recording.
Downstream agents treat a transcript like any other unstructured document — they read from the Ontology, cite EvidenceSpans (line ranges in the transcript), and emit Decisions and Actions.
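
A typed transcript with lane-boundary redaction and line-range citation could be modelled roughly as follows. The field names are hypothetical, and the digit-run pattern is only a placeholder for a real PII / PHI policy, which the markings policy would define.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TranscriptLine:
    speaker: str  # e.g. "agent", "customer", "third-party"
    start_s: float
    end_s: float
    text: str
    confidence: float


@dataclass(frozen=True)
class VoiceTranscript:
    recording_ref: str  # backref to the original recording
    language: str
    lines: tuple


# Placeholder rule: runs of 7 or more digits, optionally separated by spaces or dashes.
_DIGIT_RUN = re.compile(r"\d(?:[ -]?\d){6,}")


def redact(transcript: VoiceTranscript) -> VoiceTranscript:
    # Return a new transcript; the input object is left unmodified.
    lines = tuple(
        replace(line, text=_DIGIT_RUN.sub("[REDACTED]", line.text))
        for line in transcript.lines
    )
    return replace(transcript, lines=lines)


def cite(transcript: VoiceTranscript, first_line: int, last_line: int) -> str:
    # An EvidenceSpan over a transcript is an inclusive range of line indices.
    return " ".join(line.text for line in transcript.lines[first_line:last_line + 1])
```

Redaction here returns a copy, so the recording reference and timings are preserved and downstream agents only ever cite the redacted text.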

7.12 Unified Intake Queue

The Unified Intake Queue is the single, ordered, typed surface that all extractor lanes publish into and that downstream planes (Ontology, Reasoning) consume from. It is what makes the substrate channel-agnostic above the queue: an underwriting agent or a claims agent does not know whether the originating signal arrived as an email attachment, an IVR call, a partner webhook, or a sensor event; it sees only typed ontology objects with provenance.
Properties
  • Typed: every queued event is an Ontology-typed payload with provenance.
  • Per-tenant: queues are tenant-isolated; cross-tenant fan-out is impossible.
  • Per-region: queues are region-bound; cross-region transit requires explicit policy.
  • Ordered with idempotency: the dedupe key (§7.5) is honoured at queue write; re-publishes are no-ops.
  • Backpressure-aware: lag SLOs apply (§7.7); slow lanes are first-class.
  • Auditable: every enqueue and dequeue is an AuditEvent (data.intake.publish, data.intake.consume).
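
The tenant, region, idempotency, and audit properties above can be sketched with an in-memory queue. This is illustrative only: `UnifiedIntakeQueue` and `IntakeEvent` are hypothetical names, and a production queue would be durable and partitioned rather than a Python `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntakeEvent:
    tenant: str
    region: str
    dedupe_key: str
    object_type: str  # Ontology type name, e.g. "VoiceTranscript"
    payload: dict


class UnifiedIntakeQueue:
    def __init__(self) -> None:
        self._queues: dict = {}  # (tenant, region) -> deque of events
        self._seen: set = set()  # (tenant, region, dedupe_key)
        self.audit: list = []

    def publish(self, event: IntakeEvent) -> bool:
        seen_key = (event.tenant, event.region, event.dedupe_key)
        if seen_key in self._seen:
            return False  # re-publish is a no-op
        self._seen.add(seen_key)
        # One FIFO per (tenant, region): no cross-tenant or cross-region fan-out.
        self._queues.setdefault((event.tenant, event.region), deque()).append(event)
        self.audit.append(("data.intake.publish", event.dedupe_key))
        return True

    def consume(self, tenant: str, region: str) -> Optional[IntakeEvent]:
        queue = self._queues.get((tenant, region))
        if not queue:
            return None
        event = queue.popleft()
        self.audit.append(("data.intake.consume", event.dedupe_key))
        return event
```

A consumer asking for `("acme", "eu")` can never receive an event published under another tenant or region, because each pair has its own queue.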
Why a unified queue
Two alternatives are inferior. (a) “Channel-by-channel queues then merge in the workflow” forces every workflow to handle every channel; the same workflow logic is rewritten in every product. (b) “No queue, direct call to runtime” couples ingest and reasoning, so a backed-up extractor blocks live runs. The unified queue keeps every plane simple: extractors publish typed events, the runtime consumes typed events.
The Channel Router establishes region at the gateway, every extractor lane runs region-pinned, and the Unified Intake Queue is region-bound. A payload that arrives in an EU mailbox is parsed by EU extractors, queued in the EU intake queue, and consumed by EU agents using EU-pinned models — with no cross-region transit. The same pattern holds for any region: APAC, LATAM, sovereign clouds. Region is therefore not a deployment afterthought but a property carried by every event from byte zero. (See §19, §23.2.)