08 — Data abstraction, mapping, lineage & replay.

Once signal is in the gateway, the Data Plane projects it onto the Ontology. This is the Data Abstraction Layer: the place where heterogeneous, messy, multi-format input becomes typed, audited, replayable Ontology objects. Every typed property carries a provenance record. Every property is part of a lineage graph that links back to its source bytes. Every run is replayable against its original ontology version. And every object is searchable by an agent through a governed RAG Knowledge Base.

8.0 Reference — data abstraction layer

The Data Abstraction Layer turns the unified intake stream (§7.12) into typed Ontology objects with full lineage, then exposes those objects to the Reasoning Plane through a governed retrieval surface and a versioned RAG Knowledge Base.

Fig. 8.0a — Data Abstraction Layer pipeline. Five stages (Entity Extraction → Ontology Instantiation → Schema Mapping → Cross-Validation → Semantic Search / RAG) sit between the Unified Intake Queue and the Agent Runtime.

8.1 Source-to-canonical mapping model

A mapping is a versioned, declarative artefact that projects a source schema onto an Ontology object. Mappings are stored as part of the configuration domain (§21) and are subject to the same release governance as agents and tools.

id: mapping.policy.source-a
ontology: layerup://ontology/v1
ontologyPin: 2026.05
target: Policy
source:
  kind: pull
  schema: source-a.policy.v7
fields:
  policyId:        $.policyId
  policyNumber:    $.polNum
  insuredRef:      ref(Insured, $.insuredKey)
  productCode:     $.product
  lineOfBusiness:  $.lob
  currency:        $.ccy
  effective:       date($.effDate)
  expiry:          date($.expDate)
  status:          enum($.status, table=mapping.policy.status)
  coverageRefs:    map($.coverages, mapping.coverage.source-a)
required: [policyId, policyNumber, insuredRef, productCode, lineOfBusiness, currency, effective, expiry, status, coverageRefs]
provenance:
  source: source-a
  channel: pull
  recordKey: $.policyId
audit:
  on:
    - mapping.applied
    - mapping.rejected

Mappings can be deterministic (pure projections) or call extraction tools (§9 · Extraction). Either way, every emitted property carries a provenance record (§8.2). Mappings are functions; they do not maintain hidden state.
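To make the "mappings are functions" point concrete, here is a minimal Python sketch of a deterministic projection. It is not the platform's mapping engine: the `MAPPING` table, `STATUS_TABLE`, and `apply_mapping` are hypothetical names, only a subset of the fields above is covered, and plain dictionary keys stand in for the JSONPath expressions.

```python
from datetime import date

# Hypothetical stand-in for the mapping.policy.status lookup table.
STATUS_TABLE = {"A": "active", "C": "cancelled"}

# Hypothetical, simplified mapping: target property -> (source key, converter).
MAPPING = {
    "policyId": ("policyId", str),
    "policyNumber": ("polNum", str),
    "currency": ("ccy", str),
    "effective": ("effDate", date.fromisoformat),
    "status": ("status", STATUS_TABLE.__getitem__),
}


def apply_mapping(record: dict) -> dict:
    """Pure projection: the same record always yields the same result."""
    # Every target in this sketch is required, so any absent key rejects the record.
    missing = [target for target, (key, _) in MAPPING.items() if key not in record]
    if missing:
        return {"event": "mapping.rejected", "missing": missing}
    provenance = {"source": "source-a", "channel": "pull",
                  "recordKey": record["policyId"], "confidence": 1.0}
    properties = {
        target: {"value": convert(record[key]), "provenance": dict(provenance)}
        for target, (key, convert) in MAPPING.items()
    }
    return {"event": "mapping.applied", "properties": properties}
```

Because the function reads nothing but its argument and module-level constants, calling it twice on the same record gives equal results, which is the property replay relies on.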

8.2 Provenance record

Every property the platform writes is accompanied by a provenance record. The record is the substrate’s evidentiary contract.

{
  "source":           "source-a | email | webhook | stream:sensor.v3 | extractor:tool.extract.contact.v2",
  "sourceId":         "doc_…  |  msg_…  |  evt_…",
  "byteRange":        { "kind": "byteRange", "start": 1024, "end": 1208 },
  "extractor":        { "tool": "tool.extract.contact", "version": "2.4.1" },
  "extractorRun":     "run_01HF…",
  "modelLineage":     { "model": "layerup://gw/lane/extract.text/v3", "promptRev": "p_2c8f", "retrievalSnap": "rs_19a4" },
  "confidence":       0.94,
  "observedAt":       "2026-05-24T13:21:09Z",
  "ontologyVersion":  "2026.05",
  "verifierVerdict":  "pass"
}
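
One way to hold such a record in code is an immutable value type that rejects malformed instances at construction. The sketch below covers only a subset of the fields shown above; the class name `Provenance`, the snake_case attribute names, and the specific checks are assumptions, not the platform's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Provenance:
    """Hypothetical in-memory form of a provenance record (subset of fields)."""
    source: str
    source_id: str
    byte_start: int
    byte_end: int
    confidence: float
    observed_at: str
    ontology_version: str
    verifier_verdict: str

    def __post_init__(self):
        if not 0 <= self.byte_start <= self.byte_end:
            raise ValueError("byteRange must satisfy 0 <= start <= end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Raises ValueError when the timestamp is not ISO 8601.
        datetime.fromisoformat(self.observed_at.replace("Z", "+00:00"))
```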

8.3 Confidence model

Confidence is a normalised scalar in [0,1] with a defined source. Deterministic mappings emit 1.0. Extraction tools emit a model-derived score subject to calibration. Aggregations of multiple evidence spans use a fixed combination rule:

combined(c1, c2, …, cn) = 1 − Π_i (1 − c_i)        // independent supports
narrowed(c, ruleVerdict) = c · w(ruleVerdict)        // verifier dampens or boosts
calibrated(c) = isotonic_regression(c, calibrator_v)  // per-tool calibration

Calibrators are versioned and re-fit on a fixed schedule against labelled samples; a calibrator change is itself a data.calibrator.update AuditEvent.
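
The three rules can be sketched in Python as follows. As a worked case, two independent supports of 0.9 and 0.8 combine to 1 − (0.1 × 0.2) = 0.98. Two points are assumptions rather than platform behaviour: `narrowed` clamps its result to [0,1] so that a boosting weight cannot push confidence past 1, and `calibrated` applies an already-fitted calibrator supplied as sorted (raw, calibrated) knots instead of fitting one.

```python
import math
from bisect import bisect_right


def combined(*confidences: float) -> float:
    """Independent supports: 1 - product of (1 - c_i)."""
    return 1.0 - math.prod(1.0 - c for c in confidences)


def narrowed(c: float, weight: float) -> float:
    """Scale by the verifier weight w(ruleVerdict); the clamp is an assumption."""
    return min(1.0, max(0.0, c * weight))


def calibrated(c: float, knots: list) -> float:
    """Apply a fitted monotone calibrator by linear interpolation between knots."""
    xs = [x for x, _ in knots]
    i = bisect_right(xs, c)
    if i == 0:
        return knots[0][1]
    if i == len(knots):
        return knots[-1][1]
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (y1 - y0) * (c - x0) / (x1 - x0)
```

Fitting the knots themselves is out of scope here; a library isotonic-regression routine run against the labelled samples would be the usual source.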

8.4 Lineage graph

Fig. 8.1 — Lineage graph. Every Property is reachable from its source bytes via at least one EvidenceSpan; every Decision cites the spans it relied on; every Action is reachable from the Decision that authored it.
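
The reachability claims in the caption amount to a graph walk. The following Python sketch uses a hypothetical in-memory edge list (`LINEAGE`) and made-up node ids to show how the source byte ranges behind any Property, Decision, or Action could be collected; the real lineage store and its identifiers are not specified here.

```python
# Hypothetical edge list: node -> the nodes it was derived from or cites.
LINEAGE = {
    "action:a1": ["decision:d1"],
    "decision:d1": ["span:s1", "span:s2"],
    "property:p1": ["span:s1"],
    "span:s1": ["doc:abc#1024-1208"],
    "span:s2": ["doc:def#0-64"],
}


def source_bytes(node: str) -> set:
    """Return every source byte range reachable from `node` (depth-first walk)."""
    seen, stack, leaves = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = LINEAGE.get(current, [])
        if not parents and current.startswith("doc:"):
            leaves.add(current)  # a document byte range has no further parents
        stack.extend(parents)
    return leaves
```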

8.5 Time-travel queries

Every Object supports asOf(timestamp) reads. The substrate retains version history per property; given a timestamp, the resolver returns the version-set in force at that instant. Lineage queries (e.g. “show me the EvidenceSpans cited by Decision X”) are stable under any subsequent ontology evolution because Decisions pin to ontology versions (§6.4).
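
A minimal Python sketch of an asOf read over one property's version history is shown below. The `PropertyHistory` class is hypothetical, and it assumes timestamps are ISO 8601 UTC strings in a single fixed format, so that string order matches time order.

```python
from bisect import bisect_right


class PropertyHistory:
    """Append-only version history for a single property."""

    def __init__(self):
        self._times = []
        self._values = []

    def write(self, at: str, value) -> None:
        if self._times and at <= self._times[-1]:
            raise ValueError("versions must be appended in time order")
        self._times.append(at)
        self._values.append(value)

    def as_of(self, at: str):
        """Value in force at `at`, or None if the property did not exist yet."""
        i = bisect_right(self._times, at)
        return self._values[i - 1] if i else None
```

Resolving a whole object as of a timestamp would then be one such lookup per property, yielding the version-set in force at that instant.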

8.6 Replay semantics

Replay reconstructs an AgentRun from its persisted inputs and lineage. The substrate distinguishes two replay modes:

Deterministic steps

Bit-exact — All tool calls (§9) are idempotent and side-effect-bounded; replaying them on the same inputs reproduces the same outputs bit-exactly. Validation, lookup, conversion, classification with discrete outputs, and rule packs are bit-exact.

Non-deterministic steps

Seed-pinned — Model calls capture the model id, prompt revision, retrieval snapshot, parameter set, and seed. Replays use the exact pinned set; outputs are reproducible to the bounds the underlying model supports. Where a model has been retired, replay routes through the registered successor and an Exception of kind replay.successor is emitted.
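
The difference between the two modes can be illustrated with stand-ins. In the Python sketch below, `convert_currency` plays the part of a deterministic tool and `fake_model` plays the part of a seeded model call; both are hypothetical, and a seeded pseudo-random generator only mimics the idea of seed pinning, since real models may not be reproducible to the same degree.

```python
import hashlib
import json
import random


def digest(value) -> str:
    """Stable SHA-256 over a canonical JSON encoding, used to compare outputs."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


def convert_currency(amount_minor: int, rate_ppm: int) -> int:
    """Stand-in deterministic tool: integer arithmetic only, so replays are bit-exact."""
    return amount_minor * rate_ppm // 1_000_000


def fake_model(pin: dict, prompt: str) -> str:
    """Stand-in model call: output depends only on the pinned seed and the prompt."""
    rng = random.Random(f"{pin['seed']}:{prompt}")
    return rng.choice(["approve", "refer", "decline"])


# The pinned set captured with the original run (the seed value is illustrative).
pin = {"model": "layerup://gw/lane/extract.text/v3", "promptRev": "p_2c8f",
       "retrievalSnap": "rs_19a4", "seed": 7}

recorded = {"tool": digest(convert_currency(12_345, 920_000)),
            "model": fake_model(pin, "triage claim")}
replayed = {"tool": digest(convert_currency(12_345, 920_000)),
            "model": fake_model(pin, "triage claim")}
replay_matches = recorded == replayed
```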

8.7 Replay bundle format

A replay bundle is a self-contained, signed export of everything required to re-execute a run: the input objects pinned to their ontology version, the prompt revisions, the retrieval snapshots used, the tool versions, and the model lineage. Bundles are exportable in the .lrb archive format and are themselves content-addressed.

Path inside bundle	Contents
`/manifest.json`	Run identity, integrity hashes, signing identity
`/ontology/`	Frozen ontology snapshot at run pin
`/objects/`	Ontology objects referenced by the run
`/documents/`	Source documents (bytes-by-content-address)
`/spans/`	EvidenceSpans cited
`/prompts/`	Prompt revisions
`/retrieval/`	Retrieval-corpus snapshot manifests
`/models/`	Model lineage and capability lane mapping
`/audit/`	The slice of the audit chain covering the run
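
As an illustration of what the integrity hashes in the manifest could support, the Python sketch below builds a manifest over bundle paths and reports any path whose bytes no longer match. SHA-256, the `sha256:` prefix, and the `runId` and `hashes` field names are assumptions; the actual `.lrb` manifest schema and the signature check are not covered.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Content address of a byte string (SHA-256 is assumed here)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(run_id: str, files: dict) -> dict:
    """`files` maps bundle paths to bytes; the manifest records one hash per path."""
    return {
        "runId": run_id,
        "hashes": {path: content_address(data) for path, data in sorted(files.items())},
    }


def verify(manifest: dict, files: dict) -> list:
    """Return the paths that are missing or whose bytes no longer match."""
    return [
        path
        for path, expected in manifest["hashes"].items()
        if path not in files or content_address(files[path]) != expected
    ]
```

An empty list from `verify` means every listed path is present and unmodified; it says nothing about who signed the bundle.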

8.8 Retention

Documents, EvidenceSpans, Decisions, Actions, and AuditEvents are retained per the tenant’s policy with a per-class minimum. The substrate enforces minimum retention regardless of any tenant deletion request; deletion below the minimum requires a typed data.retention.exception Decision, signed off by the tenant’s data protection officer.

Specific retention durations are tenant policy and are not part of the platform’s architectural contract. The substrate guarantees the controls; tenants set the values.
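
The control itself is simple to state in code. In this Python sketch the per-class minimums in `MINIMUM_DAYS` are placeholder numbers chosen purely for illustration, and `may_delete` is a hypothetical helper: a deletion is allowed once the class minimum has elapsed, or earlier only when a signed data.retention.exception Decision exists.

```python
from datetime import date, timedelta

# Placeholder per-class minimums in days; real values are tenant policy.
MINIMUM_DAYS = {"Document": 365, "AuditEvent": 2555}


def may_delete(object_class: str, created: date, today: date,
               exception_signed: bool = False) -> bool:
    """True when the minimum has elapsed or a signed exception Decision exists."""
    earliest = created + timedelta(days=MINIMUM_DAYS[object_class])
    return today >= earliest or exception_signed
```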

8.9 Entity Extraction

Entity Extraction is the stage at which the substrate identifies typed entities inside unstructured payloads — named parties, identifiers, monetary amounts, dates, addresses, vehicles, vessels, properties, providers, codes — and proposes them as candidates for Ontology objects (§8.10) and Property values (§8.1).

Inputs / outputs

Inputs: a content-addressed Document plus optional region/LOB hints from the Channel Router (§7.9).
Outputs: a typed EntityCandidate set, each with type, value, EvidenceSpan (page / bbox / token range / transcript line), extractor identity, model lineage, and calibrated confidence (§8.3).

Pattern

Extractors are tools (§9 · Extraction). The lane runs an agent that selects extractors per content class, layers VLM fallback on low-confidence regions, and cross-checks outputs across extractors. Multiple extractors can propose candidates for the same span; the agent picks the most-supported value and records the others as alternates.
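
How "most-supported" is scored is not defined in this section. The Python sketch below takes one plausible reading: group the candidates for a span by value, score each value with the independent-supports rule from §8.3, and keep the rest as alternates. The function name and the candidate fields are hypothetical.

```python
from collections import defaultdict


def support(confidences: list) -> float:
    """Independent-supports combination: 1 - product of (1 - c_i)."""
    miss = 1.0
    for c in confidences:
        miss *= 1.0 - c
    return 1.0 - miss


def select_value(candidates: list) -> dict:
    """Each candidate is a dict with 'value', 'extractor', and 'confidence'."""
    by_value = defaultdict(list)
    for candidate in candidates:
        by_value[candidate["value"]].append(candidate["confidence"])
    # Highest support first; ties are broken by value so the result is deterministic.
    ranked = sorted(by_value, key=lambda v: (-support(by_value[v]), v))
    return {
        "value": ranked[0],
        "confidence": support(by_value[ranked[0]]),
        "alternates": ranked[1:],
    }
```

Under this reading, two extractors agreeing at 0.7 and 0.6 (support 0.88) outrank a single extractor at 0.8.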

8.10 Ontology Instantiation

Once entity candidates exist, the substrate decides which existing Ontology objects they belong to and which new objects to instantiate. This is the stage that resolves identity and links.

Algorithm

Candidate normalisation — canonicalise identifiers (case-fold, strip punctuation, normalise tax-ids, addresses, account numbers).
Entity resolution — match candidates to existing Ontology objects via deterministic keys first, then probabilistic match using approved embeddings against an entity index. Per-LOB resolvers can be pinned (e.g. provider registry for Health, vessel registry for Marine).
Decision — one of match (link to existing), create (instantiate new), or defer (raise an Exception of kind data.entity.ambiguous for human review).
Link writes — relationships are emitted with provenance, so every link has an audit trail back to the EvidenceSpan that supports it.

Ontology Instantiation never silently merges identities. A merge requires a typed data.entity.merge Decision, which is itself replayable.
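
The normalise, resolve, and decide steps can be sketched in Python as below. Several parts are assumptions: string similarity from `difflib` stands in for the embedding-based probabilistic match, the `match_at` and `defer_at` thresholds are invented, and the entity index is a plain dictionary keyed by normalised tax id.

```python
import re
from difflib import SequenceMatcher


def normalise(identifier: str) -> str:
    """Case-fold and strip punctuation, underscores, and whitespace."""
    return re.sub(r"[\W_]", "", identifier.casefold())


def name_similarity(a: str, b: str) -> float:
    """Stand-in for an embedding match; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def resolve(candidate: dict, index: dict, similarity=name_similarity,
            match_at: float = 0.92, defer_at: float = 0.75) -> dict:
    key = normalise(candidate["taxId"])
    if key in index:  # deterministic key first
        return {"decision": "match", "objectId": index[key]["id"], "via": "key"}
    scored = sorted(
        ((similarity(candidate["name"], obj["name"]), obj["id"]) for obj in index.values()),
        reverse=True,
    )
    best_score, best_id = scored[0] if scored else (0.0, None)
    if best_score >= match_at:
        return {"decision": "match", "objectId": best_id, "via": "similarity"}
    if best_score >= defer_at:  # plausible but not certain: hand to a human
        return {"decision": "defer", "exception": "data.entity.ambiguous", "objectId": best_id}
    return {"decision": "create"}
```

Note that no branch merges two existing identities; the sketch only links, creates, or defers.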

8.11 Cross-Validation

Cross-Validation is the stage at which proposed property values are validated across sources before becoming authoritative on the Ontology. It is what makes the Data Abstraction Layer trustworthy under conflicting evidence.

Validation classes

Within-document — consistency between fields in the same document (e.g. policy number and policyholder name match).
Cross-document — agreement among multiple documents covering the same Ontology object (e.g. loss notice + adjuster report + photographs).
Against systems of record — reconciliation with the authoritative system (e.g. policy admin lookup, provider registry, tax authority).
Against rule packs — policy-table validation (e.g. coverage applies on date of loss; deductible ≤ limit; sum of allocations equals 100%).
Against historical lineage — check that a proposed property update is consistent with the history (e.g. policy effective date does not change after binding).

Conflicts are resolved by deterministic rule packs first, agent reasoning second, and human review third. Every cross-validation decision is a typed AuditEvent (data.crossvalidate.<verdict>) and is part of the property’s provenance record.
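
Two pieces of this stage lend themselves to a short Python sketch: a rule pack built from the example checks above, and the resolution order of rule packs, then agent, then human review. The field names, the check names, and the `prefer_system_of_record` rule are hypothetical; dates are assumed to be ISO `YYYY-MM-DD` strings and allocations whole percentages.

```python
def rule_pack(claim: dict) -> list:
    """Return the names of failed checks; an empty list means every rule passed."""
    failures = []
    if not claim["effective"] <= claim["lossDate"] <= claim["expiry"]:
        failures.append("coverage_applies_on_loss_date")
    if claim["deductible"] > claim["limit"]:
        failures.append("deductible_le_limit")
    if sum(claim["allocations"]) != 100:
        failures.append("allocations_sum_to_100")
    return failures


def prefer_system_of_record(proposals: list):
    """Example deterministic rule: the system of record wins when it has a value."""
    return next((p["value"] for p in proposals if p["source"] == "system_of_record"), None)


def resolve_conflict(proposals: list, tiers: list) -> dict:
    """`tiers` is an ordered list of (name, resolver); a resolver returns a value or None."""
    for name, resolver in tiers:
        value = resolver(proposals)
        if value is not None:
            return {"value": value, "resolvedBy": name}
    return {"value": None, "resolvedBy": "human_review"}
```

Passing `[("rule_pack", prefer_system_of_record), ("agent", some_agent)]` as `tiers` reproduces the stated order, with human review as the fallthrough when no tier decides.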

8.12 Semantic Search & Code Lookup

Once Ontology objects exist with provenance, the substrate exposes them to the Reasoning Plane through two retrieval interfaces:

Semantic search — a typed retrieval interface that combines dense embeddings (over EvidenceSpans, Documents, transcripts, and Property text) with structured filters (tenant, region, LOB, marking, time window). All retrievals are permission-checked (§16) at query time, never at index-build time.
Code lookup — deterministic lookups against governed code-lists (ICD-10, CPT, NAICS, ISO, vehicle / vessel / property registries, peril codes, occupational codes, currency, jurisdictional rules). Each list is versioned and pinned by the agent at run time.

Why both

Pure embedding search hallucinates and is hard to govern; pure code lookup misses anything not in a registry. The substrate uses both surfaces side-by-side: agents retrieve semantically when intent is fuzzy, and look up deterministically when the answer must be exact. Code-lookup tools are deterministic by construction (§9) and therefore replay bit-exactly.
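
The contrast between the two interfaces can be sketched in Python. Everything here is illustrative: the code list holds two real ISO 4217 currency codes under a made-up version label, markings are modelled as integer levels, and tiny hand-written vectors stand in for embeddings. The point of the sketch is that the permission filter runs inside the query, against the caller, and not when the index is built.

```python
import math

# Two ISO 4217 codes as a sample governed list; the version label is hypothetical.
CURRENCY_CODES = {"version": "example-1", "codes": {"USD": "US dollar", "EUR": "Euro"}}


def code_lookup(code: str):
    """Exact, deterministic lookup: returns the entry or None, never a near match."""
    return CURRENCY_CODES["codes"].get(code)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def semantic_search(query_vec, chunks, caller, top_k=2):
    """Filter by tenant and marking at query time, then rank by similarity."""
    visible = [
        c for c in chunks
        if c["tenant"] == caller["tenant"] and c["marking"] <= caller["clearance"]
    ]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]
```

The same index therefore returns different result sets for callers with different clearance, while `code_lookup` returns the same answer for every caller and every replay.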

8.13 RAG Knowledge Base

The RAG Knowledge Base is the substrate’s governed retrieval-augmented surface for agents that need to read beyond a single object’s lineage. It is a first-class component: indexed, versioned, multi-tenant, region-pinned, marking-aware, and replayable.

Composition

Vector Store — embedding index over EvidenceSpans, Documents, transcripts, and selected Property text. Embeddings are produced by approved embedding models in the Model Gateway (§12) and re-embedded on model upgrade with a deterministic re-embedding job that produces a new retrieval snapshot.
Indexed Knowledge — structured indexes over Ontology objects (typed properties, relationships, code-lists, calibration tables, policy tables, rule packs, prior decision summaries). These are not embedded; they are deterministic.
Retrieval Snapshot — an immutable handle that pins which embedding model, which index version, and which inclusion / marking filters were in effect at retrieval time. Every Decision and tool call records its retrieval snapshot id (§8.7), so every retrieval is replayable.

Governance

Marking-aware retrieval — retrievals enforce the caller’s clearance (§15.4); a chunk a caller cannot see does not appear in the result set, and its absence is itself audited.
Tenant isolation — vector indexes are tenant-isolated by physical partition; a query cannot cross tenants by construction.
Region pinning — indexes live in their tenant’s region; a cross-region query is impossible without an explicit replication policy.
Provenance preservation — every retrieved chunk retains its source EvidenceSpan, so any Decision that uses a retrieval can cite the underlying bytes (§17 · Decision lineage).
No customer-data training — retrievals are not training data. The Model Gateway’s no-train policy (§12) binds at the retrieval boundary too.

From the Reasoning Plane’s perspective, the RAG Knowledge Base is just another tool pattern (§9 · Search / Retrieval): typed query, typed result set, idempotent on the same retrieval snapshot, audited per call. The agent does not know whether the answer came from semantic similarity or a code lookup — only that it cited the EvidenceSpan it used.
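
A retrieval snapshot, as described under Composition, is an immutable handle over three pinned things. The Python sketch below models it as a frozen dataclass; deriving the id from a hash of the pinned contents is an assumption made for illustration (the section only shows ids of the form `rs_…`), and the field names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSnapshot:
    """Immutable handle pinning what was in effect at retrieval time."""
    embedding_model: str
    index_version: str
    filters: tuple  # sorted (key, value) pairs, e.g. (("marking", "internal"),)

    @property
    def snapshot_id(self) -> str:
        # Assumed scheme: equal pins always produce an equal id.
        payload = json.dumps([self.embedding_model, self.index_version, list(self.filters)])
        return "rs_" + hashlib.sha256(payload.encode()).hexdigest()[:8]
```

With a handle like this, a replay can detect a changed index or filter set by comparing ids before it issues the query.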
