Your AI Agent can't remember what you said yesterday.
This isn't a bug; it's a structural deficiency. LLMs are inherently stateless. The context window is their entire memory, and closing the window wipes the slate clean.
In 2025, the answer was RAG: stuff conversation history into a Vector DB, retrieve when needed. But RAG has a fundamental problem: how do you know what to retrieve?
In 2026, three radically different answers emerged at once.
Why "Memory" Suddenly Became the Battleground
If you've used an AI coding agent, you've lived this scenario:
Monday, you tell the AI: "Our project uses Kotlin, follows Clean Architecture, and Dispatchers must be injected via DI." The AI performs perfectly.
Tuesday, you start a new session. It writes Dispatchers.IO again. You explain again.
Wednesday, same thing.
Your AI is that coworker who has to re-introduce themselves every morning.
In 2025, this was treated as a "tolerable inconvenience." By 2026, as Agents took on more complex tasks (cross-session debugging, long-term project maintenance, multi-user collaboration), "can't remember" went from annoying to fatal.
Three teams came up with three very different answers:
| School | Representatives | Core Philosophy | One-Liner |
|---|---|---|---|
| Graph-based | Mem0, Zep | Structure memory as a knowledge graph | "Remember the relationships between facts" |
| OS-inspired | Letta (formerly MemGPT) | LLM as operating system, memory as virtual memory | "Let the Agent manage its own memory" |
| Observational | Mastra | Background Agent compresses conversations into observation notes | "Compress everything, stuff it into context" |
This article breaks down each school's design philosophy, technical architecture, and trade-offs: not just "what it is," but more importantly "why it's designed this way" and "when to use which."
School 1: Graph-based - Memory as a Knowledge Graph
Design Philosophy
The core belief of Graph-based memory: memory isn't a pile of isolated facts; it's a structured knowledge network.
"User likes Kotlin" is a fact. But "user likes Kotlin, uses Clean Architecture in a project, and requires Dispatchers to be DI-injected": the relationships between these three facts are the truly valuable memory.
Vector DBs can store facts, but not relationships. Searching for "Kotlin" might find the first one but won't automatically surface the other two. Graph DBs can, because they store entities as nodes and relations as edges, and a single traversal pulls out the entire context.
Representative 1: Mem0 - Two-Stage Extraction Pipeline
Mem0 is the pioneer of this school, raising a $24M Series A in 2025 and becoming AWS Agent SDK's exclusive memory provider.
Architecture:
Conversation → [LLM #1: Extraction] → Extract entities + relations
                        ↓
               [LLM #2: Update] → Decide add/update/delete
                        ↓
               Vector DB + Graph DB (Neo4j)
After each conversation, Mem0 processes memory with two LLM calls: the first extracts information, the second decides how to update. The benefit of this pipeline is high structural fidelity: you end up with a clean knowledge graph, not a blob of compressed text.
Mem0ᵍ (Graph Memory) is its advanced version, supporting Neo4j, Memgraph, and other graph backends. It stores entities as nodes and relations as edges, paired with vector embeddings for hybrid search.
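To make the two-stage shape concrete, here is a minimal Python sketch. `extract_facts` and `decide_update` are hypothetical stubs standing in for Mem0's two LLM calls, and a plain dict stands in for the vector + graph stores:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    relation: str
    obj: str

def extract_facts(message: str) -> list[Fact]:
    """Stage 1 (stub): in Mem0 this is an LLM call that pulls
    entities and relations out of the conversation turn."""
    # Hard-coded result, for illustration only.
    return [Fact("user", "prefers", "Kotlin")]

def decide_update(store: dict[tuple, Fact], fact: Fact) -> str:
    """Stage 2 (stub): in Mem0 a second LLM call compares the new
    fact against existing memories and picks ADD/UPDATE/NOOP."""
    key = (fact.subject, fact.relation)
    if key not in store:
        return "ADD"
    return "UPDATE" if store[key].obj != fact.obj else "NOOP"

def ingest(store: dict[tuple, Fact], message: str) -> None:
    for fact in extract_facts(message):
        op = decide_update(store, fact)
        if op in ("ADD", "UPDATE"):
            store[(fact.subject, fact.relation)] = fact

store: dict[tuple, Fact] = {}
ingest(store, "I mostly write Kotlin these days")
print(store[("user", "prefers")].obj)  # Kotlin
```

The point of the second stage is idempotence: re-stating a known preference becomes a NOOP instead of a duplicate memory.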
Data (from arXiv:2504.19413):
- 26% accuracy improvement over OpenAI Memory (LOCOMO benchmark)
- P95 latency reduced by 91%
- Token cost savings of 90%
Representative 2: Zep - Temporal Knowledge Graph
Zep uses a more sophisticated design: a bi-temporal knowledge graph.
Ordinary knowledge graphs are static: "user likes Python" is recorded as an eternal fact. But real-world facts change. Three months ago the user liked Python, last month they switched to Rust, and this week they're considering Zig.
Zepâs Graphiti engine records two timelines for each edge:
| Timeline | What It Records | Use Case |
|---|---|---|
| Event Time (T) | When the fact was true in the real world | "User liked Python from 2025-06 to 2025-12" |
| Ingestion Time (T′) | When the fact was ingested by the system | Audit trail, conflict resolution |
When new facts contradict old ones ("user now likes Rust"), Zep doesn't delete the old edge; it invalidates it by setting a t_invalid marker. This means you can do point-in-time queries: "What were the user's preferences three months ago?"
The retrieval design is also worth examining:
Query ─→ [Cosine semantic search] ──┐
      ─→ [BM25 keyword search]    ──┼─→ Hybrid ranking → LLM Reranking → Context
      ─→ [BFS graph traversal]    ──┘
According to Zep's official description, Graphiti supports hybrid queries combining "time, semantic, full-text, and graph algorithm." The three search strategies run in parallel, with LLM reranking assembling the final context.
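Zep's exact fusion method isn't detailed here, but Reciprocal Rank Fusion (RRF) is a common way to merge ranked lists from heterogeneous retrievers before an LLM reranking step. A sketch under that assumption, with made-up document IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1)
    per document; documents found by several retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["fact_42", "fact_7", "fact_13"]  # cosine-similarity order
keyword  = ["fact_7", "fact_99"]             # BM25 order
graph    = ["fact_13", "fact_7"]             # BFS traversal order

print(rrf_fuse([semantic, keyword, graph]))
# ['fact_7', 'fact_13', 'fact_42', 'fact_99']
```

`fact_7` wins because all three retrievers found it, even though none ranked it first; that agreement signal is exactly what hybrid ranking is after.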
Data (from arXiv:2501.13956):
- LongMemEval accuracy improved by 18.5% (GPT-4o)
- Median latency 2.58s vs. full-text baseline 28.9s (90% reduction)
- Context uses only 1.4% of baseline tokens (1.6k vs. 115k)
Graph-based Trade-offs
Strengths:
- Highest structural fidelity: knowledge has explicit entity/relation structure
- Supports complex queries (multi-hop reasoning, temporal reasoning)
- Suitable for multi-user shared memory
Costs:
- Ingestion requires LLM calls: every conversation runs an extraction pipeline, adding cost and latency
- Graph maintenance is complex: entity deduplication and edge invalidation all need careful design
- Risk of over-structuring: not all memory fits neatly into entity-relation format
School 2: OS-inspired - Let the Agent Manage Its Own Memory
Design Philosophy
Letta's (formerly MemGPT) core belief: don't manage memory for the Agent; let it manage its own.
This idea comes directly from operating system design. Your OS doesn't need you to manually decide what goes in RAM vs. disk; it has virtual memory to handle that automatically. Letta applies the same thinking to LLMs:
| OS Concept | Letta Equivalent |
|---|---|
| RAM (main memory) | Core Memory (always in context) |
| Hard disk | Archival Memory (semantic search) |
| Virtual Memory Paging | Agent autonomously decides when to access |
| Page Fault | Agent discovers needed info isn't in context |
Three-Tier Memory Hierarchy
Tier 1: Core Memory (always in context)
Core memory is a set of structured text blocks embedded directly in the system prompt, visible to the Agent on every inference. It has two default blocks:
- persona: the Agent's identity and behavior patterns
- human: the user's preferences, history, and context
The key design: the Agent can modify these blocks itself. Letta provides core_memory_append and core_memory_replace tools; the Agent decides when to update and what to update.
# Agent autonomously decides to remember this preference
core_memory_replace(
    block="human",
    old="Uses Python for most projects",
    new="Recently switched from Python to Rust for systems work",
)
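To show the semantics behind these self-edit tools, here is a minimal sketch of a core-memory block store. `CoreMemory` is an illustrative stand-in, not Letta's actual implementation; in Letta the append/replace operations are exposed to the Agent as tools:

```python
class CoreMemory:
    """Minimal sketch of self-editable memory blocks that get
    rendered into the system prompt on every inference."""

    def __init__(self) -> None:
        self.blocks = {"persona": "", "human": ""}

    def append(self, block: str, text: str) -> None:
        self.blocks[block] = (self.blocks[block] + "\n" + text).strip()

    def replace(self, block: str, old: str, new: str) -> None:
        # Refuse to guess: the Agent must quote the existing text exactly.
        if old not in self.blocks[block]:
            raise ValueError("old content not found in block")
        self.blocks[block] = self.blocks[block].replace(old, new)

    def render(self) -> str:
        """Blocks are injected verbatim into the system prompt."""
        return "\n".join(f"<{k}>\n{v}\n</{k}>" for k, v in self.blocks.items())

mem = CoreMemory()
mem.append("human", "Uses Python for most projects")
mem.replace("human", "Uses Python for most projects",
            "Recently switched from Python to Rust for systems work")
print("Rust" in mem.render())  # True
```

Requiring the exact `old` text in `replace` is a cheap guard against the self-editing risk discussed below: the Agent can't silently overwrite a memory it misremembers.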
Tier 2: Archival Memory (semantic search)
Long-term knowledge that won't fit in core memory: past conversation summaries, learned patterns, project history. The Agent proactively searches it via archival_memory_search.
Tier 3: Recall Memory (conversation history)
Complete event log including all messages, tool calls, and reasoning traces. When capacity is exceeded, automatic recursive summarization kicks in.
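The recursive-summarization idea can be sketched in a few lines. The word-count tokenizer and the `summarize` stub (which would be an LLM call in a real system) are simplifications:

```python
def maybe_summarize(messages: list[str], budget: int, summarize) -> list[str]:
    """When the event log exceeds its token budget, fold the oldest
    half into one summary message. Recursive: on later passes, earlier
    summaries are themselves summarized again. Token counting is
    crudely approximated by word count here."""
    def tokens(msgs: list[str]) -> int:
        return sum(len(m.split()) for m in msgs)

    while tokens(messages) > budget and len(messages) > 1:
        half = max(1, len(messages) // 2)
        messages = [summarize(messages[:half])] + messages[half:]
    return messages

# Stub standing in for an LLM summarization call.
stub = lambda msgs: f"[summary of {len(msgs)} messages]"

log = ["msg one two three"] * 10          # ~40 "tokens"
log = maybe_summarize(log, budget=25, summarize=stub)
print(len(log), log[0])  # 6 [summary of 5 messages]
```

The oldest messages lose fidelity first, which matches the intuition that recent conversation matters most; the cost is exactly the lossiness discussed for the Observational school later.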
Letta Code: A Memory-Driven Coding Agent
In February 2026, Letta launched Letta Code, a coding agent built on its memory architecture that achieved #1 among model-agnostic open-source frameworks on Terminal-Bench.
Its secret weapon is the Skill Library: the Agent stores learned patterns as .md files (API migration steps, dashboard creation workflows, common bug fixes) and automatically invokes them when facing similar problems.
Even more notable are Context Repositories: using Git for memory version control. The Agent's memory can be committed, branched, and even rolled back.
OS-inspired Trade-offs
Strengths:
- Highest autonomy: the Agent controls its own memory lifecycle
- Good explainability: you can directly read the Agent's core memory to understand "what it remembers"
- Continuous learning: the Agent accumulates experience over time, improving with use
Costs:
- High fixed context window cost: system prompt + tool definitions + core memory eat ~2,000 tokens before the conversation even starts
- Self-editing is risky: the Agent might accidentally delete important memories or write incorrect information
- Page faults are expensive: just like in a real OS, when the Agent discovers needed memory isn't in Core Memory, it must search Archival Memory (one RAG call + one LLM inference). The latency of this "page fault" can be fatal in real-time response scenarios
- An interesting self-contradiction: Letta's own research found that on the LoCoMo benchmark, an Agent using basic filesystem operations (74.0%) beat an Agent using Mem0's graph memory (68.5%)
That last point is worth sitting with. Letta's team concluded: "Current agents are very proficient at using tools, particularly those commonly encountered in training data (such as filesystem operations)." This hints at a possibility: maybe the best memory system doesn't need a special architecture, just the tools the Agent is most familiar with.
School 3: Observational - Compress Everything, Stuff It into Context
Design Philosophy
Mastra's Observational Memory is the most radical of the three: no retrieval, no external databases. All memory is compressed into plain text and stuffed into the context window.
This sounds absurd. Context windows are finite; how can you fit everything in?
The answer: compression is more effective than you'd think.
Dual-Agent Compression Architecture
Mastra divides the context window into two zones:
Context Window
├─ Observations (compressed memory)      ← Stable, cacheable
│    Structured text notes with dates + priority
│    Exceeds 40k tokens → Reflector cleans up
│
└─ Raw Messages (original conversation)  ← Append-only
     Uncompressed recent conversation
     Exceeds 30k tokens → Observer compresses
Two background Agents handle memory management:
Observer: When raw conversation exceeds 30k tokens, the Observer compresses it into structured observation notes with date and priority markers:
🔴 2026-03-02 User's Android project uses Clean Architecture
  - Dispatchers must be DI-injected (lint rule enforced)
  - Don't use Dispatchers.IO, use @Dispatcher annotation instead
🟡 2026-03-01 User is writing an AI Agent Memory article series
  - Target audience: L5+ engineers
  - Requires technical depth + real-world trade-offs
Reflector: When observation notes exceed 40k tokens, the Reflector performs "garbage collection": merging duplicates, removing outdated information, and retaining high-priority memories.
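The two-threshold loop can be sketched as follows. `observe`, `reflect`, and `count_tokens` are stand-ins for the background LLM agents and the tokenizer; the thresholds are the 30k/40k figures above:

```python
OBSERVER_THRESHOLD = 30_000   # raw-message budget
REFLECTOR_THRESHOLD = 40_000  # observation-notes budget

def manage_memory(observations: str, raw_messages: list[str],
                  observe, reflect, count_tokens):
    """Sketch of the dual-agent loop: the Observer compresses raw
    messages into notes; the Reflector garbage-collects the notes
    themselves (merge duplicates, drop stale entries)."""
    if sum(count_tokens(m) for m in raw_messages) > OBSERVER_THRESHOLD:
        observations = (observations + "\n" + observe(raw_messages)).strip()
        raw_messages = []  # raw turns are compressed away
    if count_tokens(observations) > REFLECTOR_THRESHOLD:
        observations = reflect(observations)
    return observations, raw_messages

obs, raw = manage_memory(
    "", ["fact"] * 30_001,
    observe=lambda msgs: "compressed note",
    reflect=lambda notes: notes,
    count_tokens=lambda s: len(s.split()),
)
print(len(raw), "note" in obs)  # 0 True
```

Note that the Observations string only changes when a threshold trips, which is what makes it cache-friendly in the next section.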
Why This Beats RAG
Traditional RAG:
Query → Embedding → Vector Search → Reranking → Context Assembly → LLM
Every step loses information: embeddings are imprecise, search results are incomplete, and reranking might miss critical context.
Observational Memory:
Messages → Observer compression → Directly in Context → LLM
No retrieval gap. All relevant memory is always in context; the "what to search for" problem doesn't exist.
And it's cheaper. Because the Observation block is stable (it only changes during compression), it benefits perfectly from prompt caching. According to Mastra, most model providers offer 4-10x token cost discounts for cache hits.
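A back-of-envelope model of why the stable block matters. The price and the 10x cache discount are illustrative, not any specific provider's:

```python
def turn_cost(observation_tokens: int, new_tokens: int,
              price_per_mtok: float, cache_discount: float) -> float:
    """Input cost (USD) of one turn when the stable observation block
    is served from the provider's prompt cache at a discounted rate."""
    cached = observation_tokens * price_per_mtok * cache_discount
    fresh = new_tokens * price_per_mtok
    return (cached + fresh) / 1_000_000

# 35k-token observation block, 2k tokens of new conversation,
# $1.00 per Mtok of input, cache hits billed at 0.1x.
with_cache = turn_cost(35_000, 2_000, 1.00, 0.1)
no_cache   = turn_cost(35_000, 2_000, 1.00, 1.0)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0055 vs $0.0370
```

The large, unchanging block dominates the prompt, so almost the entire input bill rides the cache discount; retrieval-based approaches rebuild the context each turn and forfeit most of that benefit.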
Compression Efficiency
| Conversation Type | Compression Ratio | Why |
|---|---|---|
| Pure text | 3-6x | Human language has massive redundancy |
| With tool calls | 5-40x | JSON output is extremely redundant (10 API calls compressed into one observation line) |
To give you a feel for the numbers:
Original conversation (~5,000 tokens):
User: "What's the weather in SF?"
Tool: {"location": "San Francisco", "temp": 65, "conditions": "sunny" ...1k JSON}
Agent: "SF is currently 65°F, sunny"
User: "How about tomorrow?"
Tool: {"location": "San Francisco", "temp": 68, ...1k JSON}
Agent: "Tomorrow: 68°F, partly cloudy"
Compressed observation (~150 tokens):
🟡 2026-03-02 User checked SF weather
  - Today: 65°F sunny
  - Tomorrow: 68°F partly cloudy
Compression ratio: 33x. The more tool calls, the greater the compression benefit.
Data
LongMemEval 94.87% (GPT-5-mini), the current SOTA. For comparison, on the same benchmark, GPT-4o with Mastra scored 84.23%, while traditional RAG with GPT-4o scored only 80.05%.
Even more notable is the scaling characteristic: from GPT-4o (84.23%) to GPT-5-mini (94.87%), Mastra's score jumped over 10 percentage points. The stronger the model, the greater the benefit of compressed memory.
Observational Trade-offs
Strengths:
- Zero infrastructure: no Vector DB, no Graph DB, no external systems needed
- Deterministic: the same observation notes produce the same results every time (unlike retrieval, which has randomness)
- Debug-friendly: just read the Observation block to see what the Agent "remembers"
- Lowest cost: prompt caching (most providers offer 4-10x discounts) + zero retrieval overhead
Costs:
- Synchronous blocking: the conversation pauses when the Observer triggers, until compression completes (async mode is in development)
- Compression is lossy: the Observer might compress ten failed API calls into "attempted API calls, failed." Fine for daily conversation, but if your Coding Agent is debugging, those ten error traces are exactly the clues you need; compressing them away means genuine "amnesia"
- Not suitable for large knowledge bases: it can only compress conversation history, and can't replace document-corpus RAG
- Single-conversation only: no cross-user or cross-Agent shared-memory mechanism
- Multi-session ceiling: LongMemEval's multi-session sub-score is only 87.2%, a clear weakness
Head-to-Head Comparison
Benchmark Comparison
| Benchmark | Mem0 (Graph) | Zep (Temporal) | Letta (OS) | Mastra (Observational) |
|---|---|---|---|---|
| LOCOMO | 68.5% | 75.1%† | 74.0%‡ | n/a |
| LongMemEval | n/a | 71.2% | n/a | 94.87% |
† Zep re-ran LOCOMO and found configuration issues in Mem0's original evaluation (wrong user model, improper timestamp handling)
‡ Letta's result using basic filesystem operations, not a Mem0 integration
Important caveat:
These benchmarks can't be directly compared across the board: each team tested different task types, models, and configurations. And the benchmarks themselves are controversial:
- LOCOMO's problem: conversations average only 16k-26k tokens, well within modern LLMs' context windows. Zep found that stuffing the full conversation directly into context (without any memory system) scored ~73%, higher than Mem0's 68.5%. This raises the question of whether LOCOMO actually tests "memory" at all.
- LongMemEval is currently the more rigorous benchmark (averaging 115k tokens), but only Zep and Mastra have run it.
There's an insight worth pausing on here: in 2026, with context windows reaching 1M or even 2M tokens, brute-force context coverage is often more effective than elegant retrieval. This also explains why Mastra's Observational approach of compressing everything and stuffing it into context achieves SOTA on LongMemEval.
The trend is still clear: Observational leads in long-conversation memory, while Graph-based has the advantage in structured knowledge and cross-session scenarios.
Architectural Philosophy Comparison
| Dimension | Graph-based | OS-inspired | Observational |
|---|---|---|---|
| Memory Storage | External DB (Graph + Vector) | Tiered (Core + Archival + Recall) | Inside context window |
| Memory Manager | External pipeline | The Agent itself | Background Agent |
| Retrieval | Hybrid search | Agent-initiated search | Not needed (everything in context) |
| Infrastructure | High (Graph DB + Vector DB) | Medium (Agent Runtime) | Low (just LLM) |
| Cross-session | Native support | Native support | Weak (compression is per-conversation) |
| Cross-user | Native support | Shared blocks | Not supported |
| Temporal reasoning | Zep's strength | Limited | Limited (relies on date markers) |
| Scalability | High (DB scales) | Medium | Bounded by context window |
| Debugging | Inspect graph | Read core memory | Read observation notes |
Cost Comparison (estimated for 50-turn conversation)
| Item | Graph-based (Mem0) | OS-inspired (Letta) | Observational (Mastra) |
|---|---|---|---|
| Ingestion LLM calls | 50 × 2 calls | 1 call per memory op | ~2 calls (Observer + Reflector) |
| Retrieval | 50 × hybrid search | Agent searches on demand | 0 (everything in context) |
| External DB | Graph DB + Vector DB | Vector DB (archival) | None |
| Prompt caching benefit | Low (context differs each time) | Medium (core memory is stable) | High (observation block is stable) |
Decision Framework
After all this architecture talk, let's get to the practical question: which one should you use?
Scenario 1: Long-term Personal Assistant (remember user preferences, history)
Recommendation: Mem0 or Letta
This needs cross-session memory and user-profile management. Mem0's structured knowledge graph excels at "remembering facts," while Letta's self-editing excels at "remembering behavior patterns."
Scenario 2: Coding Agent (remember project context, coding style)
Recommendation: Letta Code or Observational
Letta Code's Skill Library was designed for exactly this. If you don't need cross-session memory, Mastra's zero-infrastructure approach is simpler.
Scenario 3: Customer Service Bot (many users, audit requirements)
Recommendation: Zep
The temporal knowledge graph's bi-temporal model natively supports audit trails. It can answer compliance questions like "what did the user say three months ago?"
Scenario 4: High Throughput, Cost Sensitive
Recommendation: Observational
Zero infrastructure + prompt caching = lowest cost. But accept the multi-session limitations.
Scenario 5: Enterprise Knowledge Management
Recommendation: Mem0 (managed) or Zep (managed)
Need SOC 2, HIPAA compliance, multi-tenant support. Both offer managed services.
Quick Decision Tree
Does your Agent need cross-session memory?
├─ No  → Observational (simplest, cheapest)
└─ Yes
   ├─ Need temporal reasoning (facts change over time)?
   │    └─ Yes → Zep (temporal KG)
   └─ No
      ├─ Does the Agent need to learn autonomously?
      │    └─ Yes → Letta (self-editing memory)
      └─ No
         └─ Mem0 (structured extraction, fastest to ship)
Conclusion: No Silver Bullet, but No Longer Unsolvable
After researching all three schools, I'll be honest: none of them made me think "this is the one":
- Graph-based has beautiful structure, but running an extraction pipeline on every conversation adds up in cost
- OS-inspired has a cool vision, but Agent self-management isn't quite there yet; even Letta found that filesystem operations beat graph memory
- Observational has the best numbers, but falls short the moment you need cross-session memory
This will probably converge toward some kind of hybrid: Observational compression within sessions, Graph persistence across sessions. But that's a future problem.
The practical move right now is simple: look at your scenario, pick one from the decision tree above, ship it, then iterate. The perfect solution doesnât exist, but waiting just means your Agent keeps forgetting.
If you remember only one thing:
RAG is no longer the only answer. In 2026, AI Agent memory has three serious solutions, each beating pure RAG in its sweet spot. The only question is whether you're willing to try.
The next article will dive deep into Mem0's architecture: how the extraction pipeline works, Graph Memory implementation details, and the real methodology behind that 26% accuracy boost from the paper.
References
- Mem0: Memory Layer for AI Agents (arXiv:2504.19413) - Mem0 core paper, source of the 26% accuracy-boost data
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory (arXiv:2501.13956) - Zep/Graphiti paper, bi-temporal KG architecture details
- MemGPT: Towards LLMs as Operating Systems (arXiv:2310.08560) - the original paper on Letta's predecessor, MemGPT
- Mastra: Observational Memory - official Observational Memory blog post
- Mastra Research: Observational Memory - LongMemEval 94.87% benchmark details
- Letta Code Announcement - Context Repositories + Terminal-Bench ranking
- Letta: Benchmarking AI Agent Memory - the "filesystem beats graph memory" research
- Mem0 Series A ($24M) - Mem0 commercialization + AWS partnership
- Zep: State of the Art Agent Memory - Zep official benchmark analysis
- Zep: Is Mem0 Really SOTA in Agent Memory? - Zep's critique of Mem0 benchmark methodology
This is part of the "AI Agent Architecture in Practice" series. Previous: Cursor's $29B Secret: The Deleted Shadow Workspace, Reverse-Engineered. Next: Mem0 Deep Dive - From arXiv Paper to Production