Published: 2026-05-20
The noise problem in AI agent memory — and how we solved it
When we audited 10,134 agent memory entries, 9,791 were noise. That’s 97% of stored data that served no purpose—duplicates, hallucinations, and malformed extractions that made retrieval worse, not better. Here’s why this happens, why vector databases make it worse, and how we built a deterministic system that fixes it.
Why agent memory rots
Every AI agent has a memory problem. An agent answers a question, remembers something, and stores it in a vector database. Next conversation, it does it again. Without deduplication or consistency checks, the same fact gets embedded in thirty slightly different ways—each one pulling retrieval in a different direction.
The vector database doesn’t know these are duplicates. It sees “client uses React 18.2” and “the frontend is React 18.2” as different embeddings. Both get returned. Both eat context window. Both dilute the signal your agent actually needs.
The result isn’t just wasted tokens. It’s wrong answers. When half the context window is stale near-duplicates, the LLM starts confabulating from confusion rather than reasoning from facts.
The 97% audit
We ran a systematic audit across production memory stores. Each noise entry was assigned to one of three categories:
- Duplicate: same fact, different phrasing, no version tracking (4,112 entries)
- Hallucinated: LLM invented details not present in source material (3,089 entries)
- Malformed: extraction failed JSON schema validation or lost critical fields (2,590 entries)
Full methodology, the aggregate dataset, and a reproducibility script are published in our noise audit post. The short version: standard agent memory pipelines operating without deterministic validation converge to noise-dominated retrieval in approximately 200–400 conversation turns.
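The headline figure can be checked directly from the category counts. The snippet below is only an arithmetic sanity check on the numbers quoted in this post; the variable names are ours for illustration and are not taken from the reproducibility script.

```python
# Counts copied from the audit figures quoted in this post.
total_entries = 10_134
noise_by_category = {
    "duplicate": 4_112,
    "hallucinated": 3_089,
    "malformed": 2_590,
}

noise_total = sum(noise_by_category.values())
noise_share = noise_total / total_entries
useful_entries = total_entries - noise_total

print(f"noise entries:  {noise_total}")       # 9791
print(f"noise share:    {noise_share:.1%}")   # 96.6%, i.e. roughly 97%
print(f"useful entries: {useful_entries}")    # 343

# Share of each category within the noise set.
for name, count in noise_by_category.items():
    print(f"{name:>12}: {count / noise_total:.1%} of noise")
```

The three categories sum exactly to the 9,791 noise entries, which leaves 343 entries that carried usable signal.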
Why vector databases alone can’t fix this
Vector databases are great at similarity search. They’re terrible at identity. A vector DB sees “API key is expired” and “the key expired” as two separate facts. It has no mechanism for deduplication, no concept of canonical truth, and no way to say “this supersedes that.”
Adding more sophisticated embeddings or hybrid search doesn’t solve the root problem. It just makes similarity scoring more precise without addressing the fact that you’re scoring against a poisoned corpus.
The deterministic solution
Our approach splits the system into two layers: advisory and deterministic. The LLM proposes what might be true. Deterministic code decides what becomes durable memory.
- Content-addressable deduplication. Every ingested fact gets a deterministic hash based on normalized content. Before writing, the system checks whether a record with the same hash, and therefore the same normalized content, already exists. If so, it links the new reference to the existing record rather than storing a duplicate.
- Confidence gating. Every LLM extraction carries a confidence score. High-confidence facts auto-promote to durable memory. Medium-confidence facts route to a review queue. Low-confidence facts are retained as non-authoritative context but never treated as fact.
- Provenance tracking. Every fact links back to its source document, chunk, and extraction run. You can audit why something is in memory and trace it back to the raw input that produced it.
- Idempotent writes. Replaying the same input produces the same facts. No duplicates from retries. No corruption from concurrent writes. The idempotency contract is enforced at the database level.
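The first three mechanisms in the list above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the production code: the thresholds `PROMOTE_AT` and `REVIEW_AT`, the `MemoryStore` class, the tier names, and the source identifiers are all hypothetical, and `normalize` is purely textual here, whereas a real normalizer would presumably canonicalize entities and relations as well.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical thresholds; the post does not state the real cut-offs.
PROMOTE_AT = 0.9
REVIEW_AT = 0.6


def normalize(text: str) -> str:
    """Canonicalize text so trivial variants (case, spacing, punctuation) match."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s.]", " ", text)  # keep dots so "18.2" survives
    return " ".join(text.split()).strip(". ")


def content_key(text: str) -> str:
    """Deterministic content address: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


@dataclass
class Record:
    text: str
    tier: str            # "durable", "review", or "context"
    sources: list[str]   # provenance: where this fact was seen


class MemoryStore:
    def __init__(self) -> None:
        self.records: dict[str, Record] = {}

    def propose(self, text: str, confidence: float, source: str) -> str:
        """Apply dedup first, then the confidence gate. Returns the outcome."""
        key = content_key(text)
        existing = self.records.get(key)
        if existing is not None:
            # Same normalized content: link the new reference, store nothing new.
            if source not in existing.sources:
                existing.sources.append(source)
            return "linked"
        if confidence >= PROMOTE_AT:
            tier = "durable"
        elif confidence >= REVIEW_AT:
            tier = "review"
        else:
            tier = "context"  # kept as non-authoritative context only
        self.records[key] = Record(text=text, tier=tier, sources=[source])
        return tier


store = MemoryStore()
first = store.propose("Client uses React 18.2.", 0.95, "slack:thread-1#chunk-0")
second = store.propose("client uses  react 18.2", 0.95, "slack:thread-7#chunk-3")
third = store.propose("The frontend is React 18.2", 0.70, "slack:thread-7#chunk-4")
print(first, second, third)  # durable linked review
```

Note what the sketch does not do: the third proposal is a paraphrase, so a text-level hash treats it as a new record and the gate sends it to review. Collapsing paraphrases into one canonical fact depends on how much the normalization step canonicalizes before hashing, which this example leaves out.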
What this looks like in practice
An agent ingests a Slack thread. The ingestion worker chunks it, hashes each chunk, and checks for existing records. The advisory worker proposes entities, relations, and facts with confidence scores. The deterministic core normalizes, deduplicates, and persists only what clears the gates.
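The chunk, hash, and check steps of that flow, together with a database-enforced idempotency guarantee, might look like the following. It is a sketch using Python's built-in `sqlite3` module as a stand-in for the real store; the `chunks` table, the `chunk_text` and `ingest` helpers, the chunk size, and the sample thread are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def chunk_text(text: str, max_chars: int = 40) -> list[str]:
    """Naive chunker: packs non-empty lines into chunks of at most max_chars."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    chunks, current = [], ""
    for line in lines:
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def ingest(conn: sqlite3.Connection, source_id: str, text: str) -> int:
    """Insert only unseen chunks; returns the number of new rows written."""
    written = 0
    with conn:  # one transaction per ingest call
        for index, chunk in enumerate(chunk_text(text)):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            cursor = conn.execute(
                "INSERT OR IGNORE INTO chunks "
                "(chunk_hash, source_id, chunk_index, body) VALUES (?, ?, ?, ?)",
                (digest, source_id, index, chunk),
            )
            written += cursor.rowcount  # 0 when the hash already exists
    return written


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chunks (
           chunk_hash  TEXT PRIMARY KEY,  -- uniqueness enforced by the database
           source_id   TEXT NOT NULL,
           chunk_index INTEGER NOT NULL,
           body        TEXT NOT NULL
       )"""
)

thread = "alice: the API key is expired\nbob: rotating it now\nalice: thanks"
first_run = ingest(conn, "slack:thread-1", thread)
replay = ingest(conn, "slack:thread-1", thread)
total_rows = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(first_run, replay, total_rows)
```

Because the primary key is the content hash, a retry or replay of the same thread writes zero new rows, and the guarantee holds even if application code forgets to check first.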
When the agent later calls brain.recall_memory, it gets back the canonical fact—not thirty variations of it. Context windows shrink. Retrieval quality improves. And crucially, the agent doesn’t hallucinate from its own memory.
In our production runs, the deduplication layer alone eliminates ~60% of candidate writes before they ever hit the graph. The confidence gating catches another ~25% as low-confidence proposals. The result: durable memory stays clean at scale.
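How much of the candidate stream ends up in durable memory depends on how "another ~25%" is read. The calculation below shows both readings using only the approximate rates quoted above; which reading matches the production figures is not stated here, so treat the outputs as bounds rather than measurements.

```python
dedup_rate = 0.60  # share of candidate writes removed by deduplication (approx.)
gate_rate = 0.25   # share caught by confidence gating (approx.)

# Reading A: both percentages are fractions of all candidate writes.
persisted_a = 1.0 - dedup_rate - gate_rate

# Reading B: gating catches 25% of whatever survives deduplication.
persisted_b = (1.0 - dedup_rate) * (1.0 - gate_rate)

print(f"Reading A: {persisted_a:.0%} of candidates persist")  # 15%
print(f"Reading B: {persisted_b:.0%} of candidates persist")  # 30%
```

Under either reading, well under a third of proposed writes become durable records.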
What this means for your agents
You don’t have to live with a noise problem. A deterministic ingestion pipeline, content-addressable deduplication, and confidence-gated persistence are straightforward engineering choices, not research problems. The architectural split between advisory LLMs and deterministic code is the mechanism that makes it work.