Memory Architecture for Production AI Systems

The three-tier memory model every production AI system needs, plus the session-boundary failures that most teams build in without realising it.


The context window is not memory. It is a workspace. What happens inside it is processing, not retention. This distinction is fundamental to building AI systems that perform reliably over time, across sessions, and at scale.

Memory in production AI systems is an architectural decision, not a feature you add after the fact. Systems that conflate the context window with memory run well in demos and break down in production.

The Three-Tier Memory Model

Production AI systems need three types of memory, each with different properties:

Working memory is what the model is actively processing in the current context window. It is volatile and token-bounded: nothing in it persists beyond the current call, and everything is lost when the context is cleared unless it has been written to another tier.

Episodic memory covers the current session or task. It holds the history of the current conversation, the steps taken in the current workflow, and the intermediate results produced so far. It needs to persist for the life of a session, including across interruptions and resumptions of the same task, but does not need to survive once the session ends.

Persistent memory holds facts, preferences, and knowledge that should survive across sessions. Examples include a user's stated preferences, a company's internal knowledge base, and the outcomes of past decisions. This is the layer that turns a stateless LLM into a system that accumulates knowledge over time.
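As a rough illustration of how the three tiers differ in lifetime, the sketch below models each as a small Python data structure. All class and method names here are hypothetical, and a real system would back the episodic and persistent tiers with actual storage rather than in-process objects.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Contents of one model call; rebuilt for every call, never stored."""
    max_tokens: int
    items: list[str] = field(default_factory=list)


@dataclass
class EpisodicMemory:
    """Events for a single session or task."""
    session_id: str
    events: list[str] = field(default_factory=list)


@dataclass
class PersistentMemory:
    """Facts that outlive any one session, keyed here by (entity, attribute)."""
    facts: dict[tuple[str, str], str] = field(default_factory=dict)


@dataclass
class MemorySystem:
    persistent: PersistentMemory = field(default_factory=PersistentMemory)
    sessions: dict[str, EpisodicMemory] = field(default_factory=dict)

    def start_session(self, session_id: str) -> EpisodicMemory:
        self.sessions[session_id] = EpisodicMemory(session_id)
        return self.sessions[session_id]

    def end_session(self, session_id: str) -> None:
        # Episodic memory is dropped at the session boundary;
        # persistent memory is left untouched.
        self.sessions.pop(session_id, None)
```

The point of separating the types is that each tier gets its own lifetime rule: working memory is never stored, episodic memory is scoped to a session ID, and persistent memory has no session scope at all.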

Working Memory: Managing What You Have

Working memory management is context engineering. The key decisions: what to include in the current context, in what order, and how to handle the transition when the context fills.

For long tasks, the working memory budget must be allocated deliberately. A coding agent that puts the entire codebase in context will run out of space before it can produce a meaningful output. The correct approach is to load only the files relevant to the current subtask, using persistent memory to know which files those are.
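One way to allocate the budget deliberately is a greedy packer that admits the most relevant items first and stops admitting anything that does not fit. The sketch below is a minimal version under stated assumptions: the `relevance` score is hypothetical (it would come from a retrieval or planning step), and `estimate_tokens` is a crude word count standing in for the model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text: str
    relevance: float  # hypothetical score, e.g. from a retrieval step


def estimate_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words.
    # Use the model's actual tokenizer in practice.
    return len(text.split())


def pack_context(candidates: list[Candidate], budget: int,
                 reserve_for_output: int) -> list[Candidate]:
    """Greedily select the most relevant candidates that fit the input budget."""
    remaining = budget - reserve_for_output
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if cost <= remaining:
            chosen.append(cand)
            remaining -= cost
    return chosen
```

Reserving part of the budget for output is the piece most often forgotten: a context that is filled entirely with input leaves the model no room to produce its answer.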

Episodic Memory: Session Persistence

The simplest episodic memory implementation is a session store. After each exchange or action, write a structured summary to a session record. Before each exchange, read the most relevant portions of that record back into the context.

The critical engineering decision is what to summarise and how. Verbatim transcripts are expensive and often redundant. Structured event logs are cheap and queryable. A good episodic memory system writes events like: "User confirmed the billing address at step 3. Agent wrote to orders table at step 5. Constraint: must use USD pricing." This is far more useful than a transcript.
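A structured event log of this kind can be sketched in a few lines. The event kinds and field names below are illustrative assumptions, not a fixed schema, and the store is an in-memory list where a real system would use a database or key-value store keyed by session ID.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    kind: str                   # e.g. "user_confirmation", "db_write", "constraint"
    detail: str
    step: Optional[int] = None  # workflow step, when one applies


@dataclass
class SessionStore:
    events: list[Event] = field(default_factory=list)

    def record(self, kind: str, detail: str, step: Optional[int] = None) -> None:
        """Append one structured event after an exchange or action."""
        self.events.append(Event(kind, detail, step))

    def recall(self, kinds: set[str], limit: int = 5) -> list[str]:
        """Return up to `limit` of the most recent matching events, oldest first."""
        matching = [e for e in self.events if e.kind in kinds]
        lines = []
        for e in matching[-limit:]:
            suffix = f" (step {e.step})" if e.step is not None else ""
            lines.append(f"{e.kind}: {e.detail}{suffix}")
        return lines
```

Because each event carries a `kind`, the read side can pull back only what the next exchange needs, for example just the active constraints, instead of replaying the whole history.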

Persistent Memory: The Knowledge Layer

Persistent memory is where the interesting architecture decisions live. Three patterns:

Retrieval-augmented memory. Store facts, documents, and historical context in a vector database. At the start of each session or task, retrieve the most relevant items based on the current query. This scales to large knowledge bases and handles the recall problem that direct context injection cannot: finding the few relevant items in a corpus far too large to fit in the context window.
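The retrieval step can be illustrated with a toy in-memory store. This is only a sketch: `embed` is a bag-of-words count vector standing in for a real embedding model, and the linear scan in `search` stands in for the approximate nearest-neighbour index a vector database would provide.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class VectorMemory:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]
```

Only the top `k` results enter the context window, which is what lets the knowledge base grow without the prompt growing with it.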

Structured fact stores. For facts with known schemas (user preferences, entity attributes, configuration), a relational or document store is more appropriate than a vector database. Query by entity ID rather than semantic similarity.
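For facts with a known schema, a plain table keyed by entity ID is often enough. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table layout and the `set_fact` / `get_facts` helpers are illustrative assumptions rather than a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        entity_id TEXT NOT NULL,
        attribute TEXT NOT NULL,
        value     TEXT NOT NULL,
        PRIMARY KEY (entity_id, attribute)
    )
    """
)


def set_fact(entity_id: str, attribute: str, value: str) -> None:
    # A newer value for the same (entity, attribute) replaces the old one.
    conn.execute(
        "INSERT OR REPLACE INTO facts (entity_id, attribute, value) "
        "VALUES (?, ?, ?)",
        (entity_id, attribute, value),
    )
    conn.commit()


def get_facts(entity_id: str) -> dict[str, str]:
    """Exact lookup by entity ID; no similarity search involved."""
    rows = conn.execute(
        "SELECT attribute, value FROM facts WHERE entity_id = ?",
        (entity_id,),
    )
    return dict(rows.fetchall())
```

The primary key on `(entity_id, attribute)` means an updated preference overwrites the stale one, which is harder to guarantee when the same fact is stored as free text in a vector index.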

Hybrid retrieval. Most production systems benefit from both. Use semantic search for unstructured knowledge and document lookup, and structured queries for known entities. The orchestration layer decides which retrieval mechanism to use based on the query type.
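A minimal routing rule might look like the sketch below. It assumes, purely for illustration, that entity IDs follow a `user-123` or `order-456` pattern and that the two retrieval backends are passed in as plain callables; a production orchestration layer would likely use a richer query classifier.

```python
import re
from typing import Callable

# Hypothetical ID format used only for this example.
ENTITY_ID = re.compile(r"\b(?:user|order)-\d+\b")


def route(query: str,
          structured_lookup: Callable[[str], list[str]],
          semantic_search: Callable[[str], list[str]]) -> list[str]:
    """Use structured lookup when the query names a known entity,
    otherwise fall back to semantic search."""
    match = ENTITY_ID.search(query)
    if match:
        results = structured_lookup(match.group(0))
        if results:
            return results
    return semantic_search(query)
```

Falling back to semantic search when the structured lookup returns nothing keeps a mistyped or unknown ID from producing an empty context.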

Memory as a System Boundary

The boundary between working memory and episodic memory is where most production failures occur. An agent that carries everything forward in context will hit the limit. An agent that discards everything at each step will re-discover the same facts repeatedly.

The solution is explicit write and read operations at memory tier boundaries:

On task completion or interruption: write a summary from working memory to episodic memory.
At session start: read the most relevant episodic memory back into working context.
When episodic memory reaches a threshold: compress and promote key facts to persistent memory.

These are not automatic. They are engineering decisions that need to be designed, tested, and maintained.
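The three boundary operations can be made explicit as methods, so that each one is something a test can call directly. The sketch below is one possible shape under simplifying assumptions: the `episodic_limit` threshold is hypothetical, recency stands in for relevance when reading back, and "compression" is reduced to keeping only the key facts of the oldest entries.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    episodic_limit: int = 3  # hypothetical promotion threshold
    episodic: list[dict] = field(default_factory=list)
    persistent: dict[str, str] = field(default_factory=dict)

    def checkpoint(self, summary: str, facts: dict[str, str]) -> None:
        """Working -> episodic: call on task completion or interruption."""
        self.episodic.append({"summary": summary, "facts": facts})
        if len(self.episodic) > self.episodic_limit:
            self.promote()

    def resume(self, max_items: int = 2) -> list[str]:
        """Episodic -> working: call at session start.
        Recency stands in for relevance in this sketch."""
        return [e["summary"] for e in self.episodic[-max_items:]]

    def promote(self) -> None:
        """Episodic -> persistent: keep the key facts of the oldest
        entries, then drop those entries from the episodic tier."""
        overflow = len(self.episodic) - self.episodic_limit
        if overflow <= 0:
            return
        for entry in self.episodic[:overflow]:
            self.persistent.update(entry["facts"])
        del self.episodic[:overflow]
```

Naming the operations this way makes the design decisions visible: what goes into a checkpoint, how much is read back on resume, and which facts are worth promoting are all choices the code forces someone to make.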

Testing Memory Systems

Memory bugs are the hardest to catch because they manifest across sessions, not within them. A production memory system needs a test harness that:

Runs multi-session scenarios, not just single-turn tests.
Validates that facts written in session N are correctly retrieved in session N+K.
Tests the boundary behaviour when memory tiers are full.
Checks for memory contamination between different users or entities.
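Two of these checks, cross-session recall and cross-user contamination, can be sketched as ordinary test functions. `UserMemory` below is a hypothetical stand-in for the system under test, and `run_session` simulates a session in which nothing except the memory object carries over.

```python
from typing import Optional


class UserMemory:
    """Minimal persistent store, namespaced by user ID."""

    def __init__(self) -> None:
        self._facts: dict[str, dict[str, str]] = {}

    def write(self, user_id: str, key: str, value: str) -> None:
        self._facts.setdefault(user_id, {})[key] = value

    def read(self, user_id: str, key: str) -> Optional[str]:
        return self._facts.get(user_id, {}).get(key)


def run_session(memory: UserMemory, user_id: str,
                writes: dict[str, str]) -> None:
    # One simulated session: only `memory` survives into the next call.
    for key, value in writes.items():
        memory.write(user_id, key, value)


def test_fact_survives_later_sessions() -> None:
    memory = UserMemory()
    run_session(memory, "alice", {"currency": "USD"})  # session N
    for _ in range(3):                                 # sessions N+1 .. N+3
        run_session(memory, "alice", {})
    assert memory.read("alice", "currency") == "USD"


def test_no_cross_user_contamination() -> None:
    memory = UserMemory()
    run_session(memory, "alice", {"currency": "USD"})
    run_session(memory, "bob", {"language": "de"})
    assert memory.read("bob", "currency") is None
    assert memory.read("alice", "language") is None
```

Functions named `test_*` like these are picked up by common Python test runners such as pytest; the same structure extends to tier-overflow cases by writing past a tier's limit before the read-back assertion.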

The systems that fail most visibly in production are the ones whose memory was tested only at the component level, not across session boundaries.