Enterprise data does not arrive as clean rows waiting for a schema. It arrives as email threads, meeting transcripts, ticket updates, attachments, and informal notes—often overlapping, partially redundant, and written for humans rather than databases.

Traditional software responds with forms and mandatory fields. That works for transactions. It fails for the bulk of organizational knowledge that never gets entered.

Ingestion is not “connect the API”

Connectors matter eventually, but the harder problem is interpretation: deciding what in a source is evidence, what is context, what is a fact, what is an event, and what is a model-derived signal rather than something the business asserted.

I call this layer semantic ingestion because it is more than copying bytes into storage. It is a pipeline that:

  1. Accepts a source with metadata (type, project, timestamps, participants when known).
  2. Preserves the raw text immutably for traceability.
  3. Runs distillation to separate noise from useful passages.
  4. Extracts structured objects—decisions, commitments, risks, issues—where the text supports them.
  5. Emits derived signals separately, so interpretation is never mistaken for fact.
  6. Writes governed records to durable stores and distilled memory to retrieval indexes.
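A minimal sketch of those six stages, assuming hypothetical `store`, `index`, and `llm` objects that stand in for PostgreSQL, the retrieval index, and whatever model calls perform distillation and extraction:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Source:
    """An incoming document plus whatever metadata is known at ingest time."""
    source_type: str               # "email", "transcript", "note", ...
    project: str
    received_at: datetime
    participants: list[str]
    raw_text: str

@dataclass
class IngestResult:
    raw_id: str                    # immutable raw record, kept for audit
    distilled_passages: list[str]  # noise removed, signal preserved
    records: list[dict]            # decisions, commitments, risks, issues
    signals: list[dict]            # model-derived interpretation, kept apart

def ingest(source: Source, store, index, llm) -> IngestResult:
    # 1-2. Accept the source and preserve the raw text immutably.
    raw_id = store.save_raw(source)

    # 3. Distill: drop noise, keep the passages that carry information.
    passages = llm.distill(source.raw_text)

    # 4. Extract structured objects only where the text supports them.
    records = llm.extract_records(passages)

    # 5. Derived signals are produced and stored separately from facts.
    signals = llm.derive_signals(passages)

    # 6. Governed records to the durable store; distilled memory to retrieval.
    store.save_records(raw_id, records, signals)
    index.add(project=source.project, raw_id=raw_id, passages=passages)

    return IngestResult(raw_id, passages, records, signals)
```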

Distillation is not summarization

A common mistake is to summarize each document into a short paragraph and discard the rest. Summaries are lossy by design. For business memory, the goal is different: remove noise while preserving signal.

That usually means multiple outputs from one source: the preserved raw text, distilled passages for retrieval, structured records where extraction is supported, and derived signals labeled as interpretation.

The raw layer remains available for audit, reprocessing when models improve, and human review when something looks wrong.
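Keeping the raw layer also makes distillation repeatable. A sketch of reprocessing after a model upgrade, reusing the hypothetical `store`, `index`, and `llm` objects from the pipeline sketch above:

```python
def reprocess(raw_id: str, store, index, llm) -> None:
    """Re-run distillation over preserved raw text, e.g. after a model upgrade."""
    source = store.load_raw(raw_id)           # the immutable original, untouched
    passages = llm.distill(source.raw_text)   # new model, same evidence
    index.replace(raw_id=raw_id, passages=passages)  # replace derived memory only
```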

Facts, events, and signals should not share a namespace

In practice, models are eager to label things. A tense email thread might produce “project health is poor” alongside “deadline moved to June 15.” The first is interpretation; the second might be a factual event if the text supports it.

Mixing these in one bucket teaches downstream agents to treat sentiment as ground truth. The ingestion pipeline should keep source-backed facts and events separate from derived signals, and mark the latter explicitly in retrieval and answers.
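One way to keep that boundary visible is to make the record kind explicit and carry it through to whatever the agent sees. A sketch, with names chosen for illustration:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class MemoryRecord:
    kind: Literal["fact", "event", "signal"]
    statement: str
    source_ref: str                     # pointer back to the raw record behind it
    confidence: Optional[float] = None  # only meaningful for derived signals

def present(record: MemoryRecord) -> str:
    """Derived signals are never phrased as if the business asserted them."""
    if record.kind == "signal":
        return f"[model interpretation, not source-backed] {record.statement}"
    return f"{record.statement} (source: {record.source_ref})"

# The email-thread example above, kept in separate buckets:
event = MemoryRecord("event", "Deadline moved to June 15.", "email:thread-42#msg-7")
signal = MemoryRecord("signal", "Project health appears poor.", "email:thread-42#msg-7", 0.6)
```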

Graph memory receives distilled records, not raw dumps

Graph databases and tools like Graphiti are useful for relationships and temporal memory. They are poor places to dump entire email archives. The pattern I am exploring sends distilled facts, events, and labeled signals into graph memory—scoped by project—while PostgreSQL remains the system of record.

Each graph episode links back to stored memory records and source references so retrieval can be verified before an agent sees anything.
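In code, the graph write under this pattern might look like the sketch below. The `graph` client and its `add_episode` method are placeholders for whatever graph-memory interface is in use (Graphiti's episode-based ingestion is the intended target, but the call shown here is not its exact signature). The point is the payload: distilled content only, scoped by project, with references back to the governed records.

```python
def write_graph_episode(graph, project: str, record: MemoryRecord, raw_id: str) -> None:
    """Send one distilled record to graph memory; never the raw archive."""
    graph.add_episode(
        group_id=project,                  # scope the episode to one project
        body=record.statement,             # distilled content, not the raw email
        metadata={
            "kind": record.kind,           # fact / event / signal stays labeled
            "memory_record": record.source_ref,
            "raw_source": raw_id,          # link back so retrieval can be verified
        },
    )
```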

Why this layer is easy to underestimate

Semantic ingestion is unglamorous compared to chat UX. It is also where cost, quality, and trust are decided. Bad ingestion produces confident wrong answers with citations that do not support the claim. Good ingestion makes agents boring in the right way: they repeat what the organization actually said, show where it came from, and stay quiet when evidence is thin.

Current scope

The proof-of-concept uses manual file upload only—exported emails, transcripts, markdown notes. That is a deliberate constraint. It forces the pipeline to work on realistic mess without pretending live connectors are solved. Connectors come later; interpretation has to work first.
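Under that constraint the entry point can stay small: walk a folder of exported files, wrap each in a Source, and hand it to the pipeline sketched earlier. The suffix-to-type mapping here is illustrative only:

```python
from pathlib import Path
from datetime import datetime, timezone

def ingest_folder(folder: str, project: str, store, index, llm) -> None:
    """Manual-upload ingestion: exported emails, transcripts, markdown notes."""
    type_by_suffix = {".eml": "email", ".txt": "transcript", ".md": "note"}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix not in type_by_suffix:
            continue  # skip anything the pipeline is not ready to interpret
        source = Source(
            source_type=type_by_suffix[path.suffix],
            project=project,
            received_at=datetime.now(timezone.utc),
            participants=[],                      # unknown for manual uploads
            raw_text=path.read_text(errors="replace"),
        )
        ingest(source, store, index, llm)
```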