Chapter 9: Memory and Persistence
You’ve built something good. Users like your AI assistant. It answers questions clearly, follows instructions well, maybe even has a bit of personality. Then you watch someone use it for the tenth time, and you cringe.
“Hi! I’m your coding assistant. I can help you understand your codebase, answer questions about your code, and assist with development tasks.”
The same introduction. The same cheerful ignorance. No memory of the nine previous conversations where they explained their architecture, debated naming conventions, and made decisions together. Every session starts from absolute zero.
Think about working with a human colleague who forgot every conversation overnight. You’d waste half your time re-establishing context. You’d never build on previous decisions. You’d never develop the shorthand and shared understanding that makes collaboration efficient. That’s what stateless AI feels like to your users—helpful in the moment, exhausting over time.
Memory transforms AI from a tool you use into a partner you work with. Users who feel remembered become engaged users. Applications with memory can learn from interactions and improve. But memory done poorly—bloated with irrelevant details, retrieving the wrong context, violating user privacy—creates worse experiences than no memory at all. This chapter teaches you to build memory systems that actually work.
From Session to System: The Memory Problem
In Chapter 5, we tackled conversation history within a single session: sliding windows to manage growth, summarization to compress old messages into key facts, and tiered memory to preserve the most important information within the conversation. But when the session ends, everything vanishes. The user closes their browser, and all that carefully managed context evaporates.
This chapter extends Chapter 5’s principles across sessions. Where Chapter 5 asks “how do we keep this conversation coherent?”, Chapter 9 asks “how do we carry what matters into the next session?” The key facts Chapter 5 extracted—decisions, corrections, user preferences—become the seed memories for Chapter 9’s persistent store. The tiered compression approach (active messages → summarized → archived facts) becomes the filtering layer for what deserves cross-session storage. And the token budget discipline becomes memory count discipline.
The MemGPT paper (Packer et al., 2023) introduced a useful analogy for this: treat the LLM’s context window like RAM and external storage like disk. Just as an operating system manages which data lives in fast RAM versus slower disk, a memory system manages which memories occupy precious context window space versus sitting in a database waiting to be retrieved. The conversation history techniques from Chapter 5 manage RAM. This chapter builds the disk layer—and the retrieval logic that decides what to page in.
The answer to that question isn’t “store everything in a database.” That part is easy. The hard part is deciding what to store and—most critically—how to retrieve the right memories at the right time. Research on this problem is accelerating: the Mem0 framework (Chhikara et al., 2025) showed that selective memory retention reduces token consumption by 90% compared to full-context approaches while improving accuracy by 26% over OpenAI’s built-in memory feature. The engineering challenge isn’t storage—it’s the retrieval layer.
Researchers and practitioners have converged on three distinct memory types, each serving different purposes.
The Three Types of Memory
Episodic memory captures timestamped events and interactions. “On January 15th, the user asked about authentication patterns.” “Last week, we refactored the login module together.” “Yesterday, the user mentioned they’re preparing for a code review.” Episodic memories provide continuity—they let your AI reference shared history and demonstrate that it remembers working with this specific user. The risk is that episodic memories grow unbounded and develop recency bias, where recent trivia crowds out older but more important interactions.
Semantic memory stores facts, preferences, and knowledge extracted from interactions. “This user prefers TypeScript over JavaScript.” “Their codebase uses PostgreSQL with Prisma as the ORM.” “They work on a team of five developers.” Semantic memories enable personalization—your AI can adapt its responses based on what it knows about the user and their context without asking the same questions repeatedly. The risk is staleness: facts change, and outdated semantic memories lead to wrong assumptions.
Procedural memory captures learned patterns and workflows. “When this user asks for code review, they want security issues checked first.” “They prefer detailed explanations with concrete examples.” “They like to see the test cases before the implementation.” Procedural memories allow behavioral adaptation—your AI can adjust its style and approach based on what has worked well before. The risk is overfitting, where the AI becomes too rigid in following patterns that may not apply to every situation.
Deciding which type to use for a given piece of information matters because it affects storage, retrieval, and decay. An episodic memory (“We discussed authentication options on March 5th”) provides reference but becomes less relevant over time as the project evolves. A semantic memory (“The team chose JWT for authentication”) persists until explicitly superseded. A procedural memory (“When reviewing auth code, check token expiration first”) stays relevant as long as the project uses that pattern. Misclassifying a memory—storing a temporary decision as a permanent fact, or treating a lasting preference as a one-time event—leads to retrieval problems downstream. The classification isn’t just organizational; it drives how the system scores and surfaces information.
The diagram above shows how these memory types fit into a complete architecture. It looks like a database diagram, but the interesting engineering happens in the retrieval layer. Storing memories is easy—you can throw everything into a vector database and call it done. Retrieving the right memories at the right time for the current context is where systems succeed or fail.
This three-type framework isn’t just academic taxonomy. Production systems use it directly. ChatGPT’s memory feature stores semantic memories (user facts and preferences) that persist across conversations. Gemini’s “Personal Context” extracts semantic and procedural memories from work patterns, integrating with email and calendar data. The Zep framework (Preston-Werner et al., 2025) implements all three types using a temporal knowledge graph, achieving 94.8% accuracy on the Deep Memory Retrieval benchmark—a 1.4-point improvement over MemGPT’s earlier approach. The memory types aren’t theoretical; they’re the building blocks of every production memory system shipping today.
What to Store (And What to Skip)
Not every message deserves to be remembered. Most conversation turns are transient—useful in the moment, irrelevant an hour later. The engineering habit for this chapter: storage is cheap; attention is expensive. Be selective.
You can afford to store every message, every preference, every interaction. Storage costs are negligible. But at retrieval time, you face the attention budget we discussed in Chapter 2. You might have 50,000 memories in storage, but you can inject maybe 500 tokens of memory context before crowding out the actual task. That’s a 100:1 compression ratio. If you store indiscriminately, you make retrieval harder—more candidates to score, more irrelevant results to filter, more chances to surface the wrong context.
For episodic memories, store interactions that:
- Represent decisions or agreements (“We decided to use JWT for authentication”)
- Contain explicit user corrections (“Actually, that endpoint should return 404, not 400”)
- Mark significant milestones (“Successfully deployed v2.0 to production”)
- Include strong positive or negative feedback (“This explanation was really helpful” or “That’s not what I asked for at all”)
Skip routine exchanges, clarifying questions that got resolved immediately, and transient context that only mattered for that specific request.
For semantic memories, extract and store:
- Explicit statements about preferences (“I prefer functional style over classes”)
- Technical environment details (“We’re using React 18 with Next.js 14”)
- Project structure and architecture (“The API lives in /api, frontend in /web”)
- Team and organizational context (“I’m the tech lead on a team of four”)
Skip inferred preferences that might be wrong, one-time context (“I’m debugging this specific error”), and information that’s likely to change frequently.
For procedural memories, capture:
- Repeated patterns in requests (“User often asks for test cases alongside implementations”)
- Explicit style guidance (“Please keep explanations concise”)
- Correction patterns that reveal preferences (“User frequently shortens my verbose responses”)
Skip one-time workflow requests and patterns that might reflect temporary needs rather than lasting preferences.
The key question before storing any memory: “Will this improve a future interaction, or am I just hoarding data?” If you can’t articulate how a memory might be useful later, don’t store it.
The Mem0 framework formalizes this into four operations that run every time a new memory candidate is identified. The system compares each candidate against the most similar existing memories and selects one action: create (new information, nothing similar exists), update (refines an existing memory with additional detail), merge (combines two related memories into one richer entry), or delete (supersedes an outdated memory). This four-operation model prevents the two most common storage mistakes: storing redundant memories that bloat the retrieval set, and storing contradictions that confuse the model at query time. Whether or not you use Mem0 specifically, thinking about memory writes as one of these four operations is a useful discipline.
Memory Operation Decision Matrix
Here’s how to decide which operation to use for a new piece of information:
| Operation | Condition | Example | Store? | Importance |
|---|---|---|---|---|
| CREATE | No similar existing memory | “User learned Rust last year” (first mention) | Yes | 0.6-0.8 |
| CREATE | Entirely new topic/entity | User describes new project “WebAssembly compiler” (no prior record) | Yes | 0.7 |
| CREATE | Zero semantic overlap | User shifts from discussing Python to discussing Kubernetes (unrelated) | Yes | 0.5-0.7 |
| UPDATE | New information supersedes existing memory | User: “I prefer TypeScript now” (previously: JavaScript preference) | Replace old, store new | 0.85+ |
| UPDATE | Preference reversal or major change | “We switched to PostgreSQL” (previously: MySQL) | Increment version, boost importance | 0.9 |
| UPDATE | Explicit contradiction/correction | User: “Actually, that’s wrong. The codebase uses…” | Delete old, store correction | 0.95 |
| UPDATE | New constraint or requirement | “Budget cut means we can’t use expensive tools anymore” | Supersede open-ended resources | 0.8 |
| MERGE | Multiple memories describe same concept | “User uses React” + “User prefers React for frontends” → “User’s frontend framework is React (preferred)” | Consolidate into one | Combined importance |
| MERGE | Complementary details on same topic | “Team has 5 engineers” + “Team includes 2 senior devs” → “Team: 5 engineers, 2 senior” | Combine into richer record | 0.7+ |
| MERGE | Reducing redundancy with added detail | Same information stored twice slightly differently | Keep more detailed version only | Higher of the two |
| DELETE | Information explicitly retracted | “Forget my previous preference, I don’t like React” | Remove old memory | N/A |
| DELETE | Time-bound information expired | “I’m interviewing at Company X” (deadline passed, interview happened) | Remove or archive | N/A |
| DELETE | Privacy request | “Don’t remember my personal health conditions” | Delete immediately | N/A |
| DELETE | Contradicted by newer, higher-confidence information | New explicit statement overrides stale inference | Replace, don’t accumulate | N/A |
Decision Algorithm in Code
from typing import Optional

def determine_memory_operation(
    new_content: str,
    existing_memories: list[Memory]
) -> tuple[str, Optional[str]]:
    """
    Decide: CREATE, UPDATE, MERGE, or DELETE.
    Returns: (operation, target_memory_id or None).
    Helpers like find_similar_memories and contradicts are assumed
    to be implemented elsewhere in the system.
    """
    # Find semantically similar memories
    similar = find_similar_memories(new_content, existing_memories, threshold=0.7)
    if not similar:
        return ("CREATE", None)
    # Check for direct contradictions: same topic, opposite assertion.
    # Whether it's an explicit correction or a preference change,
    # the new information replaces the old memory.
    for mem in similar:
        if contradicts(new_content, mem.content):
            return ("UPDATE", mem.id)
    # Check for redundancy
    if is_duplicate_content(new_content, similar):
        if len(similar) > 1:
            return ("MERGE", similar[0].id)  # Combine all similar
        return ("UPDATE", similar[0].id)  # Refine existing
    # Check for complementary information
    if adds_meaningful_detail_to(new_content, similar[0]):
        return ("UPDATE", similar[0].id)  # Add detail to existing
    # Information is time-bound and now irrelevant
    if is_time_bound_and_expired(new_content):
        return ("DELETE", similar[0].id)
    # Explicit conflict with a recent decision
    if conflicts_with_recent_decision(new_content):
        return ("UPDATE", find_conflicting_memory(new_content).id)
    # Fallback: if several similar memories exist but none of the checks
    # above applied, merge them rather than adding another near-duplicate
    if len(similar) > 1:
        return ("MERGE", similar[0].id)
    # Default: create if nothing matches well
    return ("CREATE", None)
Practical Examples
Example 1: Preference Evolution
- Day 1: User says “I prefer JavaScript”
- Operation: CREATE (“User prefers JavaScript”)
- Importance: 0.7
- Day 30: User says “I switched to TypeScript, it’s much better”
- Existing: “User prefers JavaScript”
- Operation: UPDATE (contradiction detected, new preference replaces old)
- Result: Delete JavaScript preference, store “User prefers TypeScript” with importance 0.9
Example 2: Accumulating Detail
- Day 1: “We’re building a web app”
- Operation: CREATE (“Project: web application”)
- Importance: 0.5
- Day 5: “The web app uses React and TypeScript”
- Existing: “Project: web application”
- Operation: UPDATE (complementary detail)
- Result: “Project: web application, tech stack: React + TypeScript”
- Importance: 0.7 (upgraded from new information)
Example 3: Redundancy After Extraction
- Extraction produces: “User uses PostgreSQL” + “User’s database is PostgreSQL”
- Similar memories found: both semantically ~0.95 similar
- Operation: MERGE
- Result: Single memory “User’s database is PostgreSQL” with combined importance
Example 4: Expired Information
- User: “I’m preparing for my code review next Friday”
- Operation: CREATE with expiration_date = Friday
- Importance: 0.6
- Next Monday: Code review is past
- Operation: DELETE (explicitly expired)
- Result: Memory removed from retrieval
This discipline prevents two critical problems: (1) contradictory memories that confuse the model, and (2) bloat from storing subtle variations of the same fact. Your memory system should have fewer, higher-quality memories than a naive system that stores everything.
The Retrieval Problem
Now a user asks a question, and you need to decide which memories—out of thousands—deserve precious context window space. This is the retrieval problem, and it’s harder than it sounds.
A benchmark study of over 3,000 LLM agent memory operations found that agents fail to retrieve stored information 6 out of 10 times—a valid recall rate of just 39.6%. Nearly half of those failures (46.2%) occurred because the agent ran out of memory space and evicted important information to make room for new entries. The fundamental issue isn’t that memories aren’t stored; it’s that retrieval surfaces the wrong ones.
Naive approaches fail predictably. Pure recency surfaces recent but irrelevant memories. Pure semantic similarity finds topically related but unhelpful ones. And here’s the uncomfortable truth about full-context approaches: production benchmarks show memory systems that naively include everything cost 14-77x more while being 31-33% less accurate than well-designed selective retrieval. More context doesn’t mean better answers—it means more noise.
Production systems use hybrid scoring—combining multiple signals:
Recency scoring favors recent memories with exponential decay:
def recency_score(memory: Memory, decay_rate: float = 0.05) -> float:
"""Recent memories score higher (decay_rate 0.05: 1 week = 0.70, 1 month = 0.22)."""
days_old = (datetime.now() - memory.timestamp).days
return math.exp(-decay_rate * days_old)
Works well for ongoing projects, fails when old decisions are relevant to current questions.
Relevance scoring uses embedding similarity:
def relevance_score(memory: Memory, query_embedding: list[float]) -> float:
"""Memories semantically similar to query score higher."""
return cosine_similarity(memory.embedding, query_embedding)
Finds topically related memories, but misses important but semantically distant information.
Importance scoring weights by significance:
def importance_score(memory: Memory) -> float:
"""Importance assigned at storage time; boost decisions and corrections."""
base = memory.importance
if "decision" in memory.metadata.get("tags", []):
base = min(base * 1.3, 1.0)
if "correction" in memory.metadata.get("tags", []):
base = min(base * 1.5, 1.0)
return base
Ensures critical memories surface even when old. Requires good importance assignment—garbage in, garbage out.
Hybrid scoring combines all three signals with tunable weights:
def hybrid_score(
memory: Memory,
query_embedding: list[float],
weights: ScoringWeights
) -> float:
"""
Combine recency, relevance, and importance with configurable weights.
Example weight configurations:
- Ongoing project: recency=0.4, relevance=0.4, importance=0.2
- Research task: recency=0.1, relevance=0.6, importance=0.3
- Returning user: recency=0.2, relevance=0.3, importance=0.5
"""
return (
weights.recency * recency_score(memory) +
weights.relevance * relevance_score(memory, query_embedding) +
weights.importance * importance_score(memory)
)
The weights are your engineering knobs. Different tasks, different users, and different memory sizes call for different weight configurations. Start with balanced weights (0.33 each), then tune based on observed retrieval quality.
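The code above assumes a ScoringWeights container that isn’t defined elsewhere in this section. Here is a minimal sketch—a plain dataclass whose field names and defaults are taken from how the weights are used in the chapter:
from dataclasses import dataclass

@dataclass
class ScoringWeights:
    """Tunable weights for hybrid memory scoring."""
    recency: float = 0.33
    relevance: float = 0.33
    importance: float = 0.33

# Example: configuration for a returning user, favoring stored importance
returning_user_weights = ScoringWeights(recency=0.2, relevance=0.3, importance=0.5)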
One critical insight from retrieval research: LLMs predominantly use the top 1-5 retrieved passages, which means precision in ranking matters far more than recall. Retrieving 1,000 potentially relevant memories is useless if the right one isn’t in the top 5. This is why hybrid scoring matters—it lets you combine signals that no single metric captures. Pure vector similarity might rank a tangentially related memory above a critical decision that happens to use different vocabulary. Adding importance scoring fixes that.
The SimpleMem framework (AIMING Lab, 2026) demonstrated this principle at scale: by combining semantic compression with intent-aware retrieval planning, it achieved an F1 score of 43.24 on the LoCoMo benchmark while using just 531 tokens per query—compared to 16,910 tokens for full-context approaches. That’s a 30x reduction in token usage with significantly better accuracy. The lesson: selective retrieval isn’t just cheaper, it’s better.
Scaling Retrieval: From Hundreds to Hundreds of Thousands
What happens when a power user accumulates not 1,000 memories, but 100,000? Or when your multi-tenant system spans a million user memories across databases? The retrieval layer, which works fine at small scale, suddenly becomes your bottleneck.
Practical Thresholds and Strategies
Different memory scales call for different retrieval approaches:
Below 10,000 memories: Brute force works fine. Load all memories, score each one, return top-K. With 10K memories and modern hardware, even pure sequential scanning finishes in <100ms. The overhead of sophisticated indexing isn’t worth the implementation complexity.
10,000 - 100,000 memories: Add approximate nearest neighbor (ANN) indexing. Pure vector search now dominates latency. Options:
- HNSW (Hierarchical Navigable Small World): Graph-based index with logarithmic search complexity. Search time: ~10-50ms for 100K memories. Memory overhead: ~2x the raw data. Works great for single-user or sharded systems. Used by Pinecone, Weaviate, Qdrant.
- IVF (Inverted File Index): Partition space into cells, search relevant cells only. Search time: ~20-100ms depending on partition count. Better memory efficiency than HNSW (~1.2x data size). Trickier to tune—the number of partitions affects both speed and accuracy.
- Trade-off: HNSW trades memory for speed and ease of tuning. IVF trades query complexity for memory efficiency. For most applications, HNSW is worth the space.
Tradeoff between recall and speed: With ANN indexing, you don’t get exact nearest neighbors—you get approximate ones. This matters.
# HNSW parameters
hnsw_index = HNSWIndex(
ef_construction=200, # Higher = more accurate but slower indexing
M=16, # Connections per layer (higher = more memory, faster search)
)
# Search-time parameter
results = hnsw_index.search(query_embedding, ef=50) # ef: higher = more accurate but slower
# Typical recall-speed tradeoff
# ef=10: ~85% recall, 5ms latency
# ef=50: ~95% recall, 15ms latency
# ef=100: ~99% recall, 30ms latency
The key insight: retrieval doesn’t need to be exact. If you’re ranking memories by hybrid score (recency + relevance + importance), getting the top 5 approximate neighbors versus exact neighbors rarely changes the final ranking. You’re looking for good matches, not perfect ones.
100,000+ memories: Consider sharding. A single-user system with 100K memories still fits in memory on modern hardware, but a multi-tenant system with 1M+ memories spanning dozens of users requires distribution.
Sharding approaches:
- By user: Each user gets a separate index. Simple, provides isolation, allows per-user tuning. Downside: doesn’t help if a single power user has 500K memories.
- By time: Partition older memories separately. Recent memories (< 3 months) live in a fast index; older ones live in a slower archive. Most queries care about recency anyway, so this works well. The recency_score function we defined already implements temporal preference—just make it architectural by physically separating old data.
- By topic: Cluster similar memories into buckets. A query classifier determines relevant buckets. Much harder to implement; only do this if you have clear topic boundaries (e.g., separate memory stores for each project in CodebaseAI).
Concrete Latency Numbers
Here’s what production systems typically achieve:
| Scale | Approach | Latency | Cost | Notes |
|---|---|---|---|---|
| 1K memories | Brute force | 5ms | Negligible | Sequential scan on CPU |
| 10K memories | Brute force | 15ms | Negligible | Still fast; no index needed |
| 50K memories | HNSW (ef=50) | 20ms | 100MB memory | Good balance of speed/quality |
| 100K memories | HNSW (ef=50) | 25ms | 200MB memory | Scales linearly to ~1M |
| 1M memories | HNSW + sharding | 30ms | ~2GB distributed | Split across shards, search in parallel |
| 10M memories | IVF + time sharding | 50-100ms | Depends on partitioning | Archive old data separately |
The rule of thumb: retrieval should take 20-50ms. Anything faster and you’re over-optimizing. Anything slower and you should shard or adjust your indexing strategy.
Implementation Sketch: Scaled Retrieval
class ScaledMemoryStore:
"""Memory store that scales from 1K to 1M+ memories."""
def __init__(self, scale_category: str):
self.scale = scale_category # "small", "medium", "large"
if scale_category == "small": # <10K
self.retriever = BruteForceRetriever()
elif scale_category == "medium": # 10K-100K
self.retriever = HNSWRetriever(
ef_construction=200,
M=16,
ef_search=50
)
elif scale_category == "large": # >100K
# Shard by time: recent in fast index, old in archive
self.recent_retriever = HNSWRetriever(
max_age_days=90,
ef_search=50
)
self.archive_retriever = BruteForceRetriever() # Slower, but rarely queried
def retrieve(self, query: str, limit: int = 5):
if self.scale in ["small", "medium"]:
return self.retriever.search(query, limit)
else: # large
# Search recent memories first (higher recall + speed)
recent = self.recent_retriever.search(query, limit)
if len(recent) >= limit:
return recent
# Fall back to archive if needed
archive = self.archive_retriever.search(query, limit - len(recent))
return recent + archive
When Memory Hurts
Memory systems create new failure modes that don’t exist in stateless systems. You’ve built perfect storage and retrieval, but memories themselves decay, contradict, bloat, leak, and sometimes hallucinate. Research on long-running agents (arXiv:2505.16067) found that naive memory strategies cause a 10% performance loss compared to optimized approaches, with approximately 50% of long-running agents experiencing behavioral degradation—leading to a projected 42% reduction in task success rates and 3.2x increase in human intervention requirements.
This section covers the five pathologies that emerge at scale. For each one, we’ll cover the symptoms you’ll see, the root cause, and the specific fix. These aren’t theoretical—they’re the bugs you’ll file tickets for in production.
Stale Memories
Facts change. Technologies evolve. User preferences shift. A memory that was correct in 2020 becomes harmful in 2024. The user mentioned they prefer Python 2—true then, disastrous now. They said they were learning TypeScript—they’ve been proficient for two years. The team structure changed twice. Old architectural decisions have been superseded.
Stale memory is the most common memory pathology because it’s baked into the fundamental design. At creation time, a memory is accurate and useful. Importance is assigned based on that initial accuracy. But the world changes, and the memory doesn’t. Research on RAG systems (which face an identical problem with document retrieval) shows that stale information is particularly dangerous because the model presents it with the same confidence as fresh information—it’s real data from a real interaction, just an outdated one. Users can’t distinguish “the system confidently knows this about me” from “the system is confidently using information that’s no longer true.”
The problem is that importance doesn’t decay naturally. A memory tagged as important at creation time stays important forever, even when it becomes obsolete. Meanwhile, newer contradictory information might be tagged with lower importance because it seems incremental. The retrieval system surfaces the stale memory preferentially.
Memory decay scoring addresses this by degrading importance over time for certain memory types:
def decayed_importance_score(memory: Memory) -> float:
"""
Importance decays faster for semantic memories (facts that change),
slower for episodic memories (historical events).
"""
base_importance = memory.importance
days_old = (datetime.now() - memory.timestamp).days
if memory.memory_type == "semantic":
# Semantic memories decay quickly: half-life of 180 days
decay_factor = math.exp(-0.693 * days_old / 180)
elif memory.memory_type == "procedural":
# Procedural memories decay slowly: half-life of 365 days
decay_factor = math.exp(-0.693 * days_old / 365)
else: # episodic
# Episodic memories don't decay: historical facts
decay_factor = 1.0
return base_importance * decay_factor
Apply decay scoring when retrieving, not at storage time. That way, memories don’t disappear—they just lose priority as they age. A two-year-old preference might still be relevant if nothing newer contradicts it, but a recent explicit statement will win.
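One way to wire this in is to substitute the decayed score for the raw importance inside the hybrid scorer at retrieval time. A minimal sketch, reusing the scoring functions defined earlier in the chapter:
def hybrid_score_with_decay(
    memory: Memory,
    query_embedding: list[float],
    weights: ScoringWeights
) -> float:
    """Hybrid scoring that lets stale semantic facts lose priority over time."""
    return (
        weights.recency * recency_score(memory) +
        weights.relevance * relevance_score(memory, query_embedding) +
        weights.importance * decayed_importance_score(memory)  # decay applied here
    )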
Explicit expiration works for time-bound information:
@dataclass
class Memory:
# ... existing fields ...
expiration_date: Optional[datetime] = None
def is_expired(self) -> bool:
"""Check if memory has explicit expiration date."""
if self.expiration_date is None:
return False
return datetime.now() > self.expiration_date
When a user says “I’m preparing for a code review next week,” that’s time-bound context. Mark it with an expiration. After the week passes, stop retrieving it. For permanent information (“I prefer TypeScript”), leave expiration_date as None.
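The retrieval-side counterpart is a simple filter: drop expired memories before scoring candidates. A sketch, building on the Memory variant just shown:
def filter_expired(memories: list[Memory]) -> list[Memory]:
    """Exclude memories whose explicit expiration date has passed."""
    return [m for m in memories if not m.is_expired()]

# Inside retrieve(), apply the filter before scoring candidates, e.g.:
# candidates = filter_expired(all_memories)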
The real fix is encouraging users to update memories explicitly:
def update_semantic_memory(old_id: str, new_content: str, memory_store: MemoryStore):
"""
Replace an old semantic memory with a new one.
Marks the old memory as superseded.
"""
memory_store.store(
content=new_content,
memory_type="semantic",
importance=0.9, # Higher importance for explicit updates
metadata={"supersedes": old_id}
)
memory_store.forget(old_id, reason="explicit_user_update")
Users rarely do this unprompted. But when they say “I’ve switched to TypeScript” or “We migrated to PostgreSQL,” recognize the pattern and offer: “Should I update my memory that you prefer JavaScript to say TypeScript instead?” Making updates explicit keeps memories fresh.
Conflicting Memories
Chapter 5 introduced DecisionTracker for catching contradictions within a single session. But contradictions across sessions are harder to detect and more dangerous, because the conflicting memories may have been stored days or weeks apart with no direct connection.
Older, more specific memories create a conflict with newer, general ones. “User prefers verbose explanations with lots of examples” vs. “User said ‘just give me the code.’” Both are in the store. The retrieval system doesn’t know one supersedes the other—it might fetch both and confuse the model.
Contradictions are more common than you’d expect, and resolving them is harder than it looks. Benchmarks on memory conflict resolution found that even GPT-4o achieves only about 60% accuracy on single-hop conflict resolution (where one memory directly contradicts another). For multi-hop conflicts—where the contradiction requires chaining together multiple memories to detect—accuracy drops to 6% or below across all tested memory paradigms. This means you can’t rely on the LLM to sort out conflicting memories at query time; you need to catch them before they enter the store.
Contradictions fall into two categories:
Direct contradictions: The exact same claim with opposite truth values. “Prefers JavaScript” vs. “Prefers TypeScript.” These should be detected at storage time:
def detect_contradictions(new_memory: Memory, existing_memories: list[Memory]) -> list[str]:
"""
Find existing memories that directly contradict the new one.
Returns IDs of contradictory memories.
"""
contradictions = []
if new_memory.memory_type != "semantic":
return [] # Only check semantic memories
for existing in existing_memories:
if existing.memory_type != "semantic":
continue
# Simple heuristic: same core entities but opposite polarity
# Real systems would use stronger semantic analysis
if _extract_topic(existing.content) == _extract_topic(new_memory.content):
# Same topic, check sentiment/polarity
if _opposite_sentiment(existing.content, new_memory.content):
contradictions.append(existing.id)
return contradictions
def store_with_contradiction_resolution(
new_memory: Memory,
memory_store: MemoryStore
) -> Memory:
"""Store new memory, superseding contradictions."""
existing = memory_store.get_all()
contradictions = detect_contradictions(new_memory, existing)
# Remove contradictory memories, mark as superseded
for old_id in contradictions:
memory_store.forget(old_id, reason="superseded_by_new_memory")
# Store the new memory with high importance
return memory_store.store(
content=new_memory.content,
memory_type=new_memory.memory_type,
importance=max(new_memory.importance, 0.85), # Boost explicit updates
metadata={"supersedes": contradictions}
)
Qualified contradictions: Not direct opposites, but different contexts. “Prefers concise explanations” (general) vs. “Prefers detailed walkthrough of authentication logic” (specific). Both are true—context matters. Rather than delete, add resolution logic:
def resolve_contradictions_by_context(
memories: list[Memory],
query: str
) -> list[Memory]:
"""
Given potentially conflicting memories, resolve based on query context.
More specific memories win over general ones for their domain.
"""
# Group by specificity
general = [m for m in memories if _is_general(m)]
specific = [m for m in memories if _is_specific(m)]
# If query matches specific memory's topic, prefer it
resolved = []
for spec_mem in specific:
if _matches_query_domain(spec_mem, query):
resolved.append(spec_mem)
# Remove general memories in this domain
general = [g for g in general if not _same_domain(g, spec_mem)]
return resolved + general
For critical preferences, ask for user confirmation rather than guessing:
def retrieve_with_conflict_checking(
query: str,
memory_store: MemoryStore
) -> tuple[list[Memory], list[Memory]]:
"""
Retrieve memories and identify potential conflicts.
Returns (retrieved_memories, conflicting_memories) so the caller can surface conflicts to the user.
"""
retrieved = memory_store.retrieve(query, limit=10)
conflicts = detect_contradictions_in_list(retrieved)
if conflicts:
# User should resolve these, not the system
return (retrieved, conflicts)
return (retrieved, [])
False Memories
This is the most insidious failure mode: the system confidently uses information that was never true, or that combines fragments from different contexts into a plausible but wrong composite.
False memories emerge from three sources. First, extraction errors: the LLM-based memory extractor misinterprets a conversation and stores an incorrect fact. The user said “I’m considering switching to PostgreSQL” and the extractor stores “User uses PostgreSQL.” Second, cross-contamination: in multi-tenant systems, information from one user’s session bleeds into another’s memory store (we’ll cover this in the Privacy section). Third, inference confabulation: the retrieval system returns several memories, and the LLM synthesizes them into a conclusion that none of them actually support.
The danger is that false memories look exactly like real ones. They have the same metadata, the same importance scores, the same embedding vectors. The system presents them with the same confidence it presents accurate memories.
Detection requires validation at both storage and retrieval time:
class ValidatedMemoryExtractor:
"""Extract memories with confidence scoring and validation."""
def extract_with_validation(self, conversation: str) -> list[dict]:
"""
Extract memories and validate them against the source conversation.
Returns only memories with high confidence of accuracy.
"""
# First pass: extract candidate memories
candidates = self._extract_candidates(conversation)
# Second pass: validate each candidate against source
validated = []
for candidate in candidates:
confidence = self._validate_against_source(
candidate["content"],
conversation
)
if confidence >= 0.8: # Only store high-confidence extractions
candidate["metadata"] = candidate.get("metadata", {})
candidate["metadata"]["extraction_confidence"] = confidence
validated.append(candidate)
else:
# Log for human review rather than storing wrong info
self._log_low_confidence(candidate, confidence)
return validated
def _validate_against_source(self, memory: str, source: str) -> float:
"""
Ask: does the source conversation actually support this memory?
Uses a separate LLM call to cross-check extraction accuracy.
Cost: ~100 tokens per validation. Worth it for data quality.
"""
validation_prompt = f"""Does this conversation support the following claim?
CONVERSATION: {source[:2000]}
CLAIM: {memory}
Rate confidence 0.0-1.0 that the claim is accurately supported.
Output only the number."""
response = self.llm.complete(validation_prompt, temperature=0.0)
try:
return float(response.strip())
except ValueError:
return 0.0 # Can't validate = don't store
The validation step adds cost—roughly doubling extraction time. But the alternative is storing wrong information that corrupts future interactions. In production, false memories are harder to debug than missing memories because users don’t know the system is using wrong information until the recommendations go visibly off the rails.
Concrete token costs of validation: Each validation call costs approximately 100-150 tokens (source conversation check + confidence scoring). A typical system might extract 3-5 memory candidates per session, requiring 300-750 tokens of validation overhead per session. At $3 per million tokens, that’s $0.0009-0.00225 per session in validation costs.
For a service writing 1,000 memories daily (across all users) with an average 4 validation checks per memory:
- Daily validation tokens: 1,000 memories × 4 checks × 125 tokens = 500,000 tokens
- Monthly cost: 500,000 × 30 × ($3/1M) = $45/month
- Typical failure rate caught: 15-30% of candidate memories get filtered as duplicates, contradictions, or confidence-score rejections
This cost is justified because false memories that slip through validation often trigger cascading failures: downstream queries retrieve and rely on wrong information, the LLM builds questionable reasoning on corrupted data, and users receive confusing or incorrect recommendations. The cost of one false memory entering the system (reduced accuracy on downstream queries, user confusion, potential eroded trust) typically exceeds months of validation overhead. Moreover, production data shows that systems with systematic validation maintain 26-31% higher accuracy compared to systems that skip extraction validation.
When is validation worth the investment? If your memory system is serving users daily, or if answer accuracy is critical (medical, financial, legal contexts), validation is essential. If you have a toy project or non-critical domain, you might skip it initially and add it once you see false memory problems. But the moment your system starts getting real usage, build validation in—it’s cheaper to validate upfront than to recover from false memories in production.
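If you want to sanity-check these numbers against your own traffic, the arithmetic is simple enough to script. A sketch using the figures above; plug in your own token counts and pricing:
def monthly_validation_cost(
    memories_per_day: int,
    checks_per_memory: int = 4,
    tokens_per_check: int = 125,
    price_per_million_tokens: float = 3.0,
) -> float:
    """Estimate the monthly cost of extraction validation, in dollars."""
    daily_tokens = memories_per_day * checks_per_memory * tokens_per_check
    monthly_tokens = daily_tokens * 30
    return monthly_tokens / 1_000_000 * price_per_million_tokens

print(monthly_validation_cost(1_000))  # 45.0, matching the estimate above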
The habit: when the cost of a wrong memory is higher than the cost of a missing memory (which it almost always is), validate before storing.
Memory Bloat
A power user has been using CodebaseAI for six months. Their memory store now has 50,000 entries: every message, every preference, every fragment of extracted context. Retrieval slows. Storage grows. Each query retrieves from a larger candidate set. Precision drops—the signal is buried in noise.
Memory bloat happens because insertion is cheap and deletion is hard. You store everything because you might need it. But “might need” is almost never true. Most memories never get retrieved after their first week.
The benchmark data is sobering: 46.2% of all memory operation failures in the 3,000+ operation study occurred because agents ran out of space and evicted important information to make room. Doubling capacity only reduced eviction failures by about 15%—the problem isn’t how much you can store, it’s how much you’re storing that has no value. An agent that remembers everything eventually remembers nothing useful, because the signal is buried in noise and retrieval precision drops as the candidate set grows.
Aggressive pruning with importance scoring:
def prune_low_value_memories(memory_store: MemoryStore, keep_percentage: float = 0.8):
"""
Remove the least valuable memories, keeping only top performers.
"""
all_memories = memory_store.get_all()
if len(all_memories) < 1000:
return # No need to prune yet
# Score each memory's value: importance × recency
scored = []
for m in all_memories:
value = m.importance * recency_score(m)
scored.append((value, m.id))
scored.sort(reverse=True)
keep_count = int(len(scored) * keep_percentage)
# Delete bottom performers
for _, memory_id in scored[keep_count:]:
memory_store.forget(memory_id, reason="pruning_low_value")
Consolidation merges similar memories:
def consolidate_similar_memories(memory_store: MemoryStore, similarity_threshold: float = 0.90):
"""
Find similar memories and merge them into one.
Keeps the version with highest importance and recency.
"""
all_memories = memory_store.get_all()
semantic = [m for m in all_memories if m.memory_type == "semantic"]
# Cluster by similarity
clusters = cluster_by_embedding_similarity(semantic, similarity_threshold)
for cluster in clusters:
if len(cluster) <= 1:
continue
# Keep the best, delete others
best = max(cluster, key=lambda m: m.importance * recency_score(m))
for memory in cluster:
if memory.id != best.id:
memory_store.forget(memory.id, reason="consolidated_duplicate")
Hard limits on total memory size:
def enforce_memory_budget(memory_store: MemoryStore, max_memories: int = 10000):
"""
Hard cap: if over budget, delete lowest-value memories.
"""
all_memories = memory_store.get_all()
if len(all_memories) <= max_memories:
return
# Score and delete from bottom
scored = [(m.importance * recency_score(m), m.id) for m in all_memories]
scored.sort()
for _, memory_id in scored[:len(all_memories) - max_memories]:
memory_store.forget(memory_id, reason="budget_exceeded")
Run these three operations on a schedule: daily for high-volume users, weekly for most. The result is a memory store that stays performant—containing only information worth retrieving.
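A minimal maintenance pass just chains the three functions above; the scheduling itself (cron, a job queue, whatever you already run) is left out of this sketch:
def run_memory_maintenance(memory_store: MemoryStore, max_memories: int = 10000):
    """Scheduled cleanup: prune low-value entries, consolidate duplicates, enforce the cap."""
    prune_low_value_memories(memory_store, keep_percentage=0.8)
    consolidate_similar_memories(memory_store, similarity_threshold=0.90)
    enforce_memory_budget(memory_store, max_memories=max_memories)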
A useful mental model: think of memory maintenance like garbage collection in programming languages. You don’t manually free every allocation—you set up rules (reference counting, generational collection) and let the system clean up automatically. Memory pruning works the same way. Define your importance thresholds, your staleness criteria, and your budget limits, then let the maintenance pipeline run on its own. Just as a program with a memory leak eventually crashes, a memory system without pruning eventually becomes useless—not from a hard failure, but from gradual degradation as noise overwhelms signal.
Privacy Incidents
Here’s the scenario that should terrify every memory system designer: Alice has been using CodebaseAI to discuss her codebase. Her preferred database is PostgreSQL. Her team uses TypeScript. Over months, memories accumulate about Alice’s specific architecture.
Then Bob starts using the same CodebaseAI instance (shared multi-tenant system, different user_id in the memory database—but a bug bypasses user isolation). CodebaseAI starts suggesting Alice’s preferences to Bob. “It looks like you prefer PostgreSQL” (no, that’s Alice’s preference). “I see your team uses TypeScript” (Bob’s team uses Python).
This isn’t hypothetical. Security researchers at Tenable identified seven vulnerabilities in ChatGPT’s memory system that could enable exfiltration of private information from user memories and chat history. The research community has documented that persistent memory intensifies privacy threats because past inputs combined with stored memories create compound leakage risk—information that was safe in a single ephemeral session becomes dangerous when it persists and accumulates.
The failure is subtle and dangerous. The access control exists at the retrieval layer—memories should be filtered by user_id before returning. But a bug in the filtering code, or worse, a shared embedding index that doesn’t include user_id, causes cross-user leakage. As one security analysis put it: “When context, cache, or memory state bleeds between user sessions, your authentication and authorization controls become irrelevant.”
Defense in depth means isolation at multiple layers:
class MemoryStore:
def __init__(self, db_path: str, user_id: str):
self.db_path = db_path
self.user_id = user_id
# CRITICAL: user_id is immutable per store instance
self._init_database()
def store(self, content: str, memory_type: str, **kwargs) -> Memory:
"""
Store a memory with mandatory user isolation.
"""
memory = Memory(
id=self._generate_id(),
content=content,
memory_type=memory_type,
user_id=self.user_id, # ALWAYS included
timestamp=datetime.now(),
**kwargs
)
# Database constraint: user_id cannot be null
# Storage layer will reject if missing
self._save_to_db(memory)
return memory
def retrieve(self, query: str, limit: int = 5) -> list[Memory]:
"""
Retrieve only this user's memories.
Query isolation is enforced at the database layer.
"""
# CRITICAL: Filter by user_id in the query itself
# Not after retrieval—during retrieval
memories = self._load_from_db(
f"SELECT * FROM memories WHERE user_id = ? ...",
parameters=(self.user_id,) # Parameterized to prevent injection
)
return memories[:limit]
Stronger isolation:
class TenantIsolatedMemoryStore:
"""
Memory store that guarantees cross-tenant isolation
through schema-level separation.
"""
def __init__(self, db_path: str, user_id: str):
self.user_id = user_id
# Each user gets a separate table
# This is belt-and-suspenders isolation
self.table_name = f"memories_{user_id}"
self._init_schema()
    def retrieve(self, query: str, limit: int = 5) -> list[Memory]:
        """
        Retrieve from this user's isolated table.
        Even a bug in query filtering can't leak other users' data,
        because other users' rows live in different tables.
        """
        # The table name is fixed at construction from this user's id;
        # the limit is passed as a bound parameter rather than interpolated.
        sql = f"SELECT * FROM {self.table_name} LIMIT ?"
        return self._execute_query(sql, parameters=(limit,))
Audit logging makes incidents detectable:
def audit_retrieve(
user_id: str,
query: str,
returned_memories: list[Memory]
):
"""
Log every retrieval attempt and what was returned.
"""
log_entry = {
"timestamp": datetime.now().isoformat(),
"user_id": user_id,
"query": query,
"returned_count": len(returned_memories),
"returned_ids": [m.id for m in returned_memories]
}
# Write to audit log (append-only, separate from main storage)
audit_file = f"audit_{user_id}.log"
with open(audit_file, "a") as f:
f.write(json.dumps(log_entry) + "\n")
def detect_cross_user_leakage(user_id: str, owned_memory_ids: set[str]) -> list[str]:
    """
    Scan all audit logs to detect whether this user's memories
    were returned in retrievals performed for other users.
    """
    leakages = []
    # read_all_audit_logs() is an assumed helper that yields
    # (requesting_user_id, entries) for every audit log file.
    for other_user_id, entries in read_all_audit_logs():
        if other_user_id == user_id:
            continue
        for entry in entries:
            for memory_id in entry["returned_ids"]:
                if memory_id in owned_memory_ids:
                    leakages.append(f"Memory {memory_id} retrieved by {other_user_id}")
    return leakages
The habit: Assume isolation can fail. Build multiple layers. Test cross-tenant scenarios explicitly. When memory systems scale to multiple users, isolation bugs become data breaches. Prevent them at the source.
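Testing cross-tenant scenarios explicitly is cheap. A sketch of such a test with pytest, assuming the MemoryStore interface shown above; the exact assertion will depend on how your retrieval layer reports empty results:
def test_memories_do_not_leak_across_users(tmp_path):
    """Bob must never see memories that Alice stored."""
    alice = MemoryStore(db_path=str(tmp_path / "memories.db"), user_id="alice")
    bob = MemoryStore(db_path=str(tmp_path / "memories.db"), user_id="bob")

    alice.store(content="Prefers PostgreSQL with Prisma", memory_type="semantic",
                importance=0.8)

    # Same database file, different user: isolation must hold inside the query itself
    results = bob.retrieve("What database does the user prefer?", limit=5)
    assert results == []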
When to Skip Memory Entirely
Not every AI application needs memory. Adding memory introduces complexity, failure modes, privacy obligations, and maintenance burden. Before building a memory system, ask these questions:
Do your users return? If your application handles one-shot queries (a code formatter, a translation tool, a single-question Q&A system), there’s no one to remember. Memory only helps when the same user interacts multiple times.
Does context actually improve responses? Test this empirically. Take a sample of user queries and manually inject relevant context from their history. Does response quality improve? If not, memory is overhead without benefit. Some tasks are inherently context-independent—the best answer to “how do I reverse a linked list?” doesn’t change based on who’s asking.
Can you afford the privacy obligations? Memory means storing user data. That means GDPR compliance, data deletion capabilities, audit trails, and security reviews. For some applications, the legal and operational cost exceeds the user experience benefit.
Is the interaction frequency high enough? A user who visits monthly won’t benefit from memory the way a daily user will. Monthly users’ contexts change so much between visits that stored memories are likely stale. Design memory for your actual usage pattern, not for the power user you wish you had.
If you answered “no” to any of these, consider simpler alternatives: let users maintain their own context file (like a .cursorrules or CLAUDE.md file), or let them explicitly re-state preferences at the start of each session. These approaches give users control without the overhead of automated memory management.
CodebaseAI v0.8.0: Adding Memory
Time to give CodebaseAI a memory. In Chapter 8, we built version 0.7.0—an agentic assistant that could search the codebase, read files, and run tests. Capable, but amnesiac. Every session started fresh, with no knowledge of previous interactions.
Version 0.8.0 adds three new capabilities:
- User preferences: CodebaseAI remembers coding style preferences, communication preferences, and technical environment details
- Codebase context: Architectural decisions, file purposes, and patterns discovered during exploration persist across sessions
- Interaction history: Past questions and answers provide continuity, allowing CodebaseAI to reference previous discussions
Let’s start with the memory data structures:
"""
CodebaseAI v0.8.0 - Memory and Persistence
Changelog from v0.7.0:
- Added Memory and MemoryStore classes for persistent storage
- Added MemoryExtractor for automatic memory extraction from conversations
- Integrated memory retrieval into query pipeline
- Added privacy controls (export, delete, audit)
"""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import json
import logging

logger = logging.getLogger(__name__)

# hybrid_score and ScoringWeights are defined earlier in this chapter
# and are assumed to be importable here.
@dataclass
class Memory:
"""
A single unit of long-term memory.
Attributes:
id: Unique identifier for this memory
content: The actual information being stored
memory_type: One of "episodic", "semantic", "procedural"
timestamp: When this memory was created
importance: 0.0 to 1.0, affects retrieval priority
embedding: Vector representation for semantic search
metadata: Additional context (source, tags, etc.)
"""
id: str
content: str
memory_type: str
timestamp: datetime
importance: float = 0.5
embedding: Optional[list[float]] = None
metadata: dict = field(default_factory=dict)
def to_context_string(self) -> str:
"""Format this memory for injection into the conversation context."""
type_labels = {
"episodic": "Previous interaction",
"semantic": "Known fact",
"procedural": "Learned preference"
}
label = type_labels.get(self.memory_type, "Memory")
return f"[{label}] {self.content}"
The Memory class is straightforward—it’s a data container with enough metadata to support retrieval scoring. The to_context_string method formats memories for injection into prompts, with type labels that help the model understand what kind of information it’s receiving.
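For instance, a semantic memory renders like this before injection (illustrative values):
memory = Memory(
    id="mem-001",
    content="User prefers TypeScript over JavaScript",
    memory_type="semantic",
    timestamp=datetime.now(),
    importance=0.7,
)
print(memory.to_context_string())
# [Known fact] User prefers TypeScript over JavaScript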
Now the memory store:
class MemoryStore:
"""
Manages persistent memory storage and retrieval.
Design decisions:
- SQLite for local storage (swap to PostgreSQL for production scale)
- Embeddings computed at write time for fast retrieval
- Hybrid scoring for retrieval with tunable weights
- Audit logging for all operations (privacy compliance)
"""
def __init__(self, db_path: str, user_id: str, embedding_model):
self.db_path = db_path
self.user_id = user_id
self.embedding_model = embedding_model
self._init_database()
def store(
self,
content: str,
memory_type: str,
importance: float = 0.5,
metadata: Optional[dict] = None
) -> Memory:
"""
Store a new memory.
Args:
content: What to remember
memory_type: "episodic", "semantic", or "procedural"
importance: 0.0-1.0, affects retrieval priority
metadata: Optional tags, source info, etc.
Returns:
The created Memory object
"""
memory = Memory(
id=self._generate_id(content),
content=content,
memory_type=memory_type,
timestamp=datetime.now(),
importance=importance,
embedding=self.embedding_model.embed(content),
metadata=metadata or {}
)
self._save_to_db(memory)
self._log_operation("store", memory.id, content[:100])
return memory
def retrieve(
self,
query: str,
limit: int = 5,
weights: Optional[ScoringWeights] = None
) -> list[Memory]:
"""
Retrieve relevant memories using hybrid scoring.
Args:
query: The current user query to match against
limit: Maximum number of memories to return
weights: Scoring weights (defaults to balanced)
Returns:
List of memories, sorted by relevance score
"""
if weights is None:
weights = ScoringWeights(recency=0.3, relevance=0.5, importance=0.2)
query_embedding = self.embedding_model.embed(query)
all_memories = self._load_all_memories()
scored_memories = []
for memory in all_memories:
score = hybrid_score(memory, query_embedding, weights)
scored_memories.append((score, memory))
scored_memories.sort(reverse=True, key=lambda x: x[0])
return [memory for _, memory in scored_memories[:limit]]
def forget(self, memory_id: str, reason: str = "user_request") -> bool:
"""
Delete a memory with audit trail.
Privacy feature: Users can request memory deletion.
We log the deletion event but don't retain the content.
"""
success = self._delete_from_db(memory_id)
if success:
self._log_operation("forget", memory_id, f"reason: {reason}")
return success
def get_context_injection(
self,
query: str,
max_tokens: int = 500
) -> str:
"""
Get formatted memory context ready for prompt injection.
This is the primary integration point. Call this when building
context for a new query, then include the result in your
system prompt or conversation history.
Args:
query: The user's current query
max_tokens: Token budget for memory context
Returns:
Formatted string of relevant memories
"""
memories = self.retrieve(query, limit=10)
# Format memories and respect token budget
context_lines = []
estimated_tokens = 0
for memory in memories:
line = memory.to_context_string()
line_tokens = len(line) // 4 # Rough estimate: 4 chars per token
if estimated_tokens + line_tokens > max_tokens:
break
context_lines.append(line)
estimated_tokens += line_tokens
if not context_lines:
return ""
header = "## Relevant memories from previous sessions:\n"
return header + "\n".join(context_lines)
The MemoryStore handles storage, retrieval, and deletion. Notice the get_context_injection method—it’s a convenience function that handles retrieval, formatting, and token budgeting in one call. This is the integration point you’ll use when building context for queries.
Next, we need to extract memories from conversations. This is where an LLM helps identify what’s worth remembering:
class MemoryExtractor:
"""
Extracts memories from conversations using an LLM.
Runs after significant interactions to identify information
worth storing for future sessions.
"""
EXTRACTION_PROMPT = '''Analyze this conversation and extract information worth remembering for future sessions.
CONVERSATION:
{conversation}
Extract memories in these categories:
SEMANTIC (facts about user, project, or preferences):
- Technical environment (languages, frameworks, databases)
- Stated preferences ("I prefer...", "I like...", "I don't want...")
- Project structure and architecture
- Team context
EPISODIC (significant events worth referencing later):
- Decisions made together
- Problems solved
- Milestones reached
- Strong feedback (positive or negative)
PROCEDURAL (patterns in how the user wants things done):
- Communication style preferences
- Workflow preferences
- Recurring request patterns
Rate each memory's importance (0.0-1.0):
- 0.1-0.3: Nice to have, minor detail
- 0.4-0.6: Useful context, moderate relevance
- 0.7-0.8: Important preference or decision
- 0.9-1.0: Critical (explicit corrections, strong feedback)
SPECIAL CASES:
- If user CHANGES a preference (e.g., "I switched to TypeScript"), rate 0.9 (supersedes old info)
- If user explicitly says "remember this", rate 1.0
- If user explicitly says "don't remember this" or discusses sensitive info, DO NOT extract
Output valid JSON:
{
"memories": [
{"type": "semantic|episodic|procedural", "content": "...", "importance": 0.X}
]
}
If nothing worth remembering, output: {"memories": []}'''
def __init__(self, llm_client):
self.llm = llm_client
def extract(self, conversation: str) -> list[dict]:
"""
Extract memories from a conversation turn.
Args:
conversation: Recent conversation history to analyze
Returns:
List of memory dicts with type, content, and importance
"""
        # The prompt template contains literal JSON braces, so substitute with
        # str.replace rather than str.format (which would choke on the braces).
        prompt = self.EXTRACTION_PROMPT.replace("{conversation}", conversation)
response = self.llm.complete(prompt, temperature=0.1)
try:
result = json.loads(response)
return result.get("memories", [])
except json.JSONDecodeError:
# Log the parsing failure for debugging
logger.warning(f"Failed to parse memory extraction: {response[:200]}")
return []
The extraction prompt is detailed because quality extraction is crucial. Notice the special case handling: preference changes get high importance (they supersede old memories), explicit “remember this” requests get maximum importance, and sensitive information is explicitly excluded.
Now let’s integrate memory into CodebaseAI’s main flow:
class CodebaseAI:
"""
CodebaseAI v0.8.0: An AI assistant for understanding codebases.
New in v0.8.0:
- Persistent memory across sessions
- Automatic preference learning
- Context-aware memory retrieval
- Privacy controls
"""
def __init__(
self,
codebase_path: str,
user_id: str,
llm_client,
embedding_model
):
self.codebase_path = codebase_path
self.user_id = user_id
self.llm = llm_client
# Initialize memory system
self.memory_store = MemoryStore(
db_path=f"codebase_ai_memory_{user_id}.db",
user_id=user_id,
embedding_model=embedding_model
)
self.memory_extractor = MemoryExtractor(llm_client)
# ... existing initialization (file index, tools, etc.)
def query(self, question: str) -> str:
"""
Answer a question about the codebase.
Enhanced in v0.8.0 to include memory context.
"""
# Step 1: Retrieve relevant memories
memory_context = self.memory_store.get_context_injection(
query=question,
max_tokens=400 # Reserve most context for codebase
)
# Step 2: Build full context (now includes memories)
context = self._build_context(
question=question,
memory_context=memory_context
)
# Step 3: Generate response
response = self.llm.complete(context)
# Step 4: Extract and store memories from this interaction
self._process_new_memories(question, response)
return response
def _build_context(
self,
question: str,
memory_context: str
) -> str:
"""Build the full context for the LLM, including memories."""
system_prompt = f"""You are CodebaseAI, an expert assistant for understanding and working with codebases.
{memory_context}
## Current codebase: {self.codebase_path}
Use your knowledge of this user and codebase to provide personalized, contextual assistance."""
# ... rest of context building (file search, tool setup, etc.)
return system_prompt
def _process_new_memories(self, question: str, response: str):
"""Extract and store memories from the current interaction."""
conversation = f"User: {question}\n\nAssistant: {response}"
new_memories = self.memory_extractor.extract(conversation)
for mem in new_memories:
self.memory_store.store(
content=mem["content"],
memory_type=mem["type"],
importance=mem["importance"],
metadata={"source": "conversation_extraction"}
)
The integration is clean: retrieve memories before generating, include them in context, extract new memories afterward.
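A brief usage sketch ties the pieces together. The llm_client and embedding_model are placeholders for whatever clients your application already uses; the method names match the listings above.
# Sketch: one query cycle with memory. llm_client and embedding_model are
# placeholders for your existing clients.
assistant = CodebaseAI(
    codebase_path="/path/to/project",
    user_id="alex",
    llm_client=llm_client,
    embedding_model=embedding_model,
)

answer = assistant.query("Can you help me set up API routes?")
# Before generation, relevant memories were injected into the context;
# afterward, anything worth keeping from this exchange was stored.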
Here’s what the user experience looks like in practice. First session: “Hi, I’m using CodebaseAI for the first time. My project uses React 18 with Next.js 14 and TypeScript.” CodebaseAI responds helpfully and the MemoryExtractor stores three semantic memories: the React version, the Next.js version, and the TypeScript preference—each rated 0.7+ importance.
Second session, a week later: “Can you help me set up API routes?” CodebaseAI retrieves the stored semantic memories, includes them in context, and responds with Next.js 14-specific App Router API route patterns using TypeScript—without the user needing to re-explain their stack. The user feels remembered. The system demonstrates value.
Third session, a month later: “We migrated to SvelteKit last week.” The MemoryExtractor detects a preference change (high importance: 0.9), and the contradiction detection system marks the React and Next.js memories as superseded. Future sessions reference SvelteKit instead. The system stays current.
This progression—from stateless tool to contextual collaborator—is what memory enables when it works well. The engineering challenge is making sure it keeps working well as memories accumulate.
Diagnostic Walkthrough: When Memory Goes Wrong
Memory systems fail in predictable ways. Here’s a diagnostic framework you can follow when users report memory-related problems. Each scenario starts with a user complaint, walks through the investigation, and ends with a specific fix.
Scenario 1: “My AI Remembers the Wrong Things”
A user reports that CodebaseAI keeps suggesting JavaScript patterns even though they switched to TypeScript months ago.
Step 1: Inspect what’s being retrieved. The first tool you need is a retrieval debugger:
def debug_retrieval(memory_store: MemoryStore, query: str):
    """Debug tool: see what memories are retrieved and why."""
    memories = memory_store.retrieve(query, limit=10)
    print(f"Query: {query}")
    print(f"Retrieved {len(memories)} memories:\n")
    for i, memory in enumerate(memories, 1):
        age = (datetime.now() - memory.timestamp).days
        rec = recency_score(memory)      # from the scoring section
        imp = importance_score(memory)   # from the scoring section
        print(f"{i}. [{memory.memory_type}] {memory.content}")
        print(f"   Age: {age} days | Importance: {imp:.2f} | Recency: {rec:.3f}")
        print()
Step 2: Identify the root cause. Common patterns:
Stale high-importance memories: Old preferences score higher than recent changes because importance was set high at creation time and never decayed. Fix: Apply decayed_importance_score from the scoring section above, and update extraction to rate preference changes at 0.9 importance.
No contradiction detection: Contradictory memories coexist because nothing checks for conflicts at storage time. Fix: Add contradiction detection (covered in the “Conflicting Memories” section) to your storage pipeline.
Embedding blind spots: “TypeScript” and “JavaScript” have similar embeddings because they’re closely related languages. The retrieval system can’t distinguish a preference for JavaScript from a switch away from JavaScript. Fix: Include preference polarity in memory content—store “User SWITCHED FROM JavaScript TO TypeScript” rather than just “User TypeScript preference.”
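To make that last fix concrete, here is a small sketch of phrasing preference changes with explicit polarity before storage. The helper function and the metadata key are hypothetical, and memory_store is assumed to be a MemoryStore instance; the point is the content format, not the function itself.
# Sketch: store preference changes with explicit polarity so retrieval can
# distinguish "uses X" from "moved away from X". The helper and metadata key
# are hypothetical.
def preference_change_content(old: str, new: str, topic: str) -> str:
    return (f"User SWITCHED FROM {old} TO {new} for {topic} "
            f"(supersedes earlier {old} preferences)")

memory_store.store(
    content=preference_change_content("JavaScript", "TypeScript", "frontend code"),
    memory_type="semantic",
    importance=0.9,
    metadata={"supersedes_topic": "frontend language"},  # hypothetical key
)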
Scenario 2: “Responses Are Getting Slower and Worse”
After six months, retrieval latency has tripled and answer quality has declined.
Step 1: Check memory store size.
def diagnose_memory_health(memory_store: MemoryStore) -> dict:
"""Health check for memory system performance."""
all_memories = memory_store.get_all()
now = datetime.now()
# Size metrics
total = len(all_memories)
by_type = {}
for m in all_memories:
by_type[m.memory_type] = by_type.get(m.memory_type, 0) + 1
# Staleness metrics
stale_count = sum(1 for m in all_memories
if (now - m.timestamp).days > 180 and m.importance < 0.7)
stale_pct = (stale_count / total * 100) if total > 0 else 0
# Duplicate detection (rough)
contents = [m.content.lower().strip() for m in all_memories]
unique_ratio = len(set(contents)) / len(contents) if contents else 1.0
report = {
"total_memories": total,
"by_type": by_type,
"stale_memories": stale_count,
"stale_percentage": f"{stale_pct:.1f}%",
"uniqueness_ratio": f"{unique_ratio:.2f}",
"recommendation": "PRUNE" if total > 5000 or stale_pct > 30 else "OK"
}
print("=== Memory Health Report ===")
for key, value in report.items():
print(f" {key}: {value}")
return report
Step 2: Apply the fix. If the health check shows bloat (>5,000 memories, >30% stale, uniqueness ratio below 0.8), run the pruning pipeline from the Memory Bloat section. Start with consolidation (merge duplicates), then prune stale entries, then enforce the hard budget.
The key insight: memory health checks should run on a schedule, not just when users complain. By the time a user notices degraded quality, the problem has been building for weeks.
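One minimal way to put the check on a schedule, sketched with only the standard library; in production you would more likely reach for cron, Celery, or whatever scheduler you already run. The logger is the one configured earlier in the chapter.
# Sketch: run the health check on a schedule instead of waiting for complaints.
# Standard-library timer only; swap in your real scheduler in production.
import threading

def schedule_memory_health_checks(memory_store: MemoryStore, interval_hours: int = 24):
    def run():
        report = diagnose_memory_health(memory_store)
        if report["recommendation"] == "PRUNE":
            logger.warning("Memory store needs pruning: %s", report)
        timer = threading.Timer(interval_hours * 3600, run)  # re-arm for next run
        timer.daemon = True  # don't keep the process alive just for this
        timer.start()
    run()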
Privacy by Design
Memory creates liability. Every preference stored is data that can leak. Privacy isn’t a feature—it’s a constraint on every design decision. Essential controls:
class PrivacyControls:
"""Required privacy features for memory systems."""
def __init__(self, memory_store: MemoryStore):
self.store = memory_store
def export_user_data(self, user_id: str) -> dict:
"""GDPR Article 20: Export all user data in portable format."""
memories = self.store.get_all_for_user(user_id)
return {
"user_id": user_id,
"export_timestamp": datetime.now().isoformat(),
"memories": [{"content": m.content, "type": m.memory_type,
"created": m.timestamp.isoformat()} for m in memories]
}
def delete_all_user_data(self, user_id: str) -> int:
"""GDPR Article 17: Complete erasure of all user data."""
memories = self.store.get_all_for_user(user_id)
for memory in memories:
self.store.forget(memory.id, reason="user_deletion_request")
return len(memories)
def get_audit_log(self, user_id: str) -> list[dict]:
"""Show user what was stored and when (transparency)."""
return self.store.get_audit_log_for_user(user_id)
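Exercising these controls is a few lines of glue; the user ID here is illustrative, and how you expose them (API endpoint, settings page, support tooling) is up to your application.
# Sketch: handling a single user's data requests with the controls above.
privacy = PrivacyControls(memory_store)

export = privacy.export_user_data("alex")         # return to the user as JSON
deleted = privacy.delete_all_user_data("alex")    # right-to-erasure request
print(f"Deleted {deleted} memories for user alex")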
Beyond GDPR compliance, establish clear rules for what never gets stored:
# Add to extraction prompt:
"""
NEVER extract or store:
- Passwords, API keys, tokens, or credentials
- Personal health information
- Financial account details (account numbers, balances)
- Information the user explicitly asks not to remember
- Sensitive personal details (SSN, government IDs)
If the user mentions any of the above, acknowledge you heard it
but explicitly state you will not remember it.
"""
When in doubt, don’t store it. The cost of missing a useful memory is low—the user can re-state their preference. The cost of leaking sensitive information is catastrophic—a data breach, a regulatory fine, a destroyed reputation.
This asymmetry should guide every design decision in your memory system. Store less, not more. Retrieve selectively, not exhaustively. Delete proactively, not reluctantly. The best memory systems feel magical to users not because they remember everything, but because they remember the right things and forget the rest.
The Complete Memory Pipeline
In production, Chapter 5 and Chapter 9 work together as a pipeline. Understanding how they connect helps you build a complete memory architecture:
 Within-Session (Ch 5)          Session Boundary          Cross-Session (Ch 9)
┌─────────────────────┐     ┌──────────────────────┐     ┌─────────────────────┐
│ Recent messages     │     │ MemoryExtractor runs │     │ Persistent store    │
│ (full verbatim)     │────→│ on full conversation │────→│ (episodic/semantic/ │
│                     │     │                      │     │  procedural)        │
│ Older messages      │     │ Identifies:          │     │                     │
│ (summarized, Ch 5)  │────→│ - Decisions (0.8+)   │     │ Hybrid retrieval    │
│                     │     │ - Preferences (0.7+) │     │ scores & injects    │
│ Key facts           │     │ - Corrections (0.9+) │     │ into next session   │
│ (archived, Ch 5)    │────→│ - Patterns (0.6+)    │     │                     │
└─────────────────────┘     └──────────────────────┘     └─────────────────────┘
Step 1: Within a session, Chapter 5’s techniques manage the conversation. Recent messages stay verbatim. After 10+ messages, older ones get summarized. After 20+, extract key facts.
Step 2: At session end (or after significant interactions), the MemoryExtractor runs on the full conversation—summaries, key facts, and recent messages together. It identifies what deserves cross-session persistence.
Step 3: Extracted memories are stored in the persistent store with type labels and importance scores. Contradiction detection runs at storage time.
Step 4: Next session, the retrieval system scores memories using hybrid weights (recency + relevance + importance), injects the top results into the system prompt, and the new conversation begins with context from past sessions.
The bottleneck is always Step 4—the retrieval layer. Invest in retrieval quality: test that memories actually help, validate that contradictions are detected, and confirm that stale memories decay appropriately.
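One concrete way to “test that memories actually help” is a small retrieval regression suite: queries paired with phrases that already-stored memories should contain. This is a sketch; the cases are illustrative and should ultimately come from your users’ real queries.
# Sketch: a tiny retrieval regression check. Each case pairs a query with a
# phrase an already-stored memory should contain; the cases are illustrative.
RETRIEVAL_CASES = [
    ("what language should new frontend code use", "TypeScript"),
    ("help me set up API routes", "Next.js"),
]

def check_retrieval_quality(memory_store: MemoryStore, limit: int = 5) -> float:
    hits = 0
    for query, expected_phrase in RETRIEVAL_CASES:
        retrieved = memory_store.retrieve(query, limit=limit)
        if any(expected_phrase.lower() in m.content.lower() for m in retrieved):
            hits += 1
    return hits / len(RETRIEVAL_CASES)  # 1.0 means every case surfaced its memory

# Run it on a schedule or in CI and alert when the score drops.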
Worked Example: The Evolving Preference
Alex has used CodebaseAI for three months and recently switched from JavaScript to TypeScript. But CodebaseAI keeps suggesting JavaScript patterns: “Why do you keep suggesting vanilla JavaScript? I’ve been using TypeScript for two months now!”
The Investigation:
debug_retrieval(alex_memory_store, "language preference TypeScript JavaScript")
# Output:
# 1. [semantic] User prefers JavaScript for frontend development
# Importance: 0.72 | Age: 95 days
# 2. [semantic] User completed TypeScript migration for frontend
# Importance: 0.48 | Age: 45 days
# 3. [semantic] User prefers TypeScript for all new frontend code
# Importance: 0.52 | Age: 30 days
Root cause: The old JavaScript preference has importance 0.72, outscoring the newer TypeScript preferences (0.48 and 0.52). The extraction didn’t recognize “completed migration” as a high-importance preference change.
The Fix: Three changes prevent this in the future:
# 1. Manual correction: delete outdated, boost correct memories
alex_memory_store.forget("mem_js_preference_001", reason="superseded")
alex_memory_store.update_importance("mem_ts_preference", 0.90)
# 2. Update extraction prompt to recognize preference changes
"""
CRITICAL: When user indicates CHANGED preference ("I switched to...",
"We migrated to...", "I now use..."), rate importance 0.85-0.95.
These supersede previous preferences on the same topic.
"""
# 3. Add contradiction detection at storage time
def store_with_supersession(content: str, memory_type: str, importance: float):
if memory_type == "semantic":
for existing in memory_store.retrieve(content, limit=5):
if existing.memory_type == "semantic":
if is_contradictory(existing.content, content):
memory_store.forget(existing.id, reason="superseded")
return memory_store.store(content, memory_type, importance)
The Lesson: Memory isn’t just storage—it’s maintenance. A system that only stores and retrieves, without mechanisms to update and supersede, accumulates contradictions until it becomes useless. This is a software engineering principle that applies far beyond AI: any data system without update and deletion logic eventually drowns in stale information. Databases need migration scripts. Caches need invalidation. Logs need rotation. Memory systems need contradiction detection and supersession.
The Engineering Habit
Storage is cheap; attention is expensive. Be selective.
You can store every message, every preference, every interaction. Storage costs pennies. But at retrieval time, you face the attention budget. Memory context competes with the actual task—if you retrieve fifteen memories and twelve are irrelevant, you’ve wasted attention on noise.
This principle applies across engineering: you can log everything, but be selective about what triggers alerts. You can document everything, but curate what goes in the README. You can store every database relationship, but design your schema for how data will be read, not just written.
The engineer’s job is building the selection layer—the logic that decides what deserves attention right now. Master it, and your systems scale gracefully. Ignore it, and you’ll drown in your own data.
The broader principle holds across information systems: the value of stored information is determined not by how much you have, but by how well you can find what you need when you need it. In memory systems, in databases, in documentation, in your own note-taking—the discipline of selective storage and efficient retrieval is what separates useful systems from data graveyards.
Context Engineering Beyond AI Apps
Memory design shows up everywhere in software engineering, not just in AI systems. Every application that maintains user state faces the same questions this chapter addresses: what to remember, what to forget, and how to retrieve the right context at the right time.
Caching systems are memory systems with different vocabulary. Redis, Memcached, and CDN caches all face the staleness problem (cache invalidation is famously one of the two hard problems in computer science). They all face the bloat problem (cache eviction policies like LRU and LFU are just different importance scoring functions). And they all face the retrieval problem (cache key design determines what gets found quickly). If you’ve ever debugged a stale cache serving outdated data to users, you’ve experienced the same pathology as stale AI memories—different domain, identical root cause.
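The parallel is concrete enough to write down: an eviction policy is just a scoring function over cache entries, the same way memory pruning scores memories. A small sketch, with an illustrative CacheEntry shape rather than any particular cache library’s API:
# Sketch: LRU and LFU expressed as scoring functions, mirroring how memory
# pruning scores memories. CacheEntry is illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CacheEntry:
    key: str
    last_access: datetime
    hit_count: int

def lru_score(entry: CacheEntry) -> float:
    # More recently used -> higher score, kept longer
    return -(datetime.now() - entry.last_access).total_seconds()

def lfu_score(entry: CacheEntry) -> float:
    # More frequently used -> higher score
    return float(entry.hit_count)

def pick_eviction_victim(entries: list[CacheEntry], score=lru_score) -> CacheEntry:
    return min(entries, key=score)  # lowest score gets evicted first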
AI development tools are building their own memory systems too—and the trade-offs mirror everything in this chapter. Cursor stores conversation history and learned preferences in project-specific files. Claude Code’s CLAUDE.md persists project context across sessions. IDE plugins remember your coding patterns and preferences. The three memory types from this chapter map directly: episodic memory is your conversation history with the tool, semantic memory is project knowledge in configuration files, and procedural memory is the learned patterns for how you like tests structured or which patterns you prefer.
Database schema design is a form of memory architecture. When you design a database, you’re deciding what to store (columns), how to index it (retrieval optimization), and when to archive or delete (data lifecycle). The principles from this chapter—be selective about what you store, optimize for how data will be read rather than how it’s written, and build maintenance into the system from day one—apply to every persistence layer you’ll ever design.
The privacy considerations transfer universally. What your AI development tools remember about your project could include proprietary code and security-sensitive information. What your caching system remembers could include user sessions and authentication tokens. The “privacy by design” principles from this chapter—minimizing what’s stored, controlling access, allowing deletion—aren’t AI-specific. They’re engineering fundamentals for any system that persists user data.
Summary
Memory transforms AI from a stateless tool into a persistent collaborator. But memory done wrong—bloated, stale, privacy-violating—creates worse experiences than no memory at all. The research is clear: agents fail to retrieve the right information 60% of the time, memory systems that store everything cost 14-77x more while being less accurate, and half of long-running agents experience behavioral degradation from poor memory management. Getting memory right matters.
The three memory types serve distinct purposes: episodic memory captures timestamped events and interactions for continuity, semantic memory stores facts and preferences for personalization, and procedural memory records learned patterns for behavioral adaptation. Production systems from ChatGPT to Gemini to open-source frameworks like Mem0 and Zep use this taxonomy directly.
Storage is the easy part. The engineering challenge is retrieval: surfacing the right memories for the current context. Hybrid scoring—combining recency, relevance, and importance—gives you tunable knobs to balance different signals. SimpleMem demonstrated that selective retrieval achieves better accuracy at 30x lower token cost than full-context approaches.
Memory has five failure modes: stale memories (outdated information presented confidently), conflicting memories (contradictions the LLM can’t resolve—accuracy drops to 6% for multi-hop conflicts), false memories (extraction errors and inference confabulation), memory bloat (retrieval degrades as the candidate set grows), and privacy incidents (cross-user leakage that turns personalization into a data breach).
Memory requires maintenance. Build pruning, consolidation, decay scoring, and contradiction detection from the start—not as afterthoughts. A system that only stores and retrieves, without mechanisms to update and supersede, accumulates contradictions until it becomes useless.
Privacy is a constraint, not a feature. Every memory stored is data that can leak. Implement export, deletion, and audit capabilities. Establish clear rules for what never gets stored. When in doubt, don’t store it.
CodebaseAI v0.8.0 adds persistent memory: extracting information from conversations, storing typed memories, retrieving relevant context with hybrid scoring, and providing privacy controls for GDPR compliance.
New Concepts Introduced
- Episodic, semantic, and procedural memory types
- The MemGPT OS analogy: context window as RAM, persistent storage as disk
- Hybrid retrieval scoring (recency, relevance, importance)
- Memory extraction with confidence validation
- Five failure modes: stale, conflicting, false, bloat, privacy
- Contradiction detection and supersession
- Memory pruning, consolidation, and budget enforcement
- Privacy by design (export, delete, audit, isolation)
- The complete Ch 5 → Ch 9 memory pipeline
CodebaseAI Evolution
Version 0.8.0 capabilities:
- Persistent memory across sessions
- Automatic extraction of user preferences, facts, and patterns
- Hybrid retrieval with configurable weights for context injection
- Memory health diagnostics and maintenance tools
- Privacy controls for data management and GDPR compliance
The Engineering Habit
Storage is cheap; attention is expensive. Be selective.
Try it yourself: Complete, runnable versions of this chapter’s code examples are available in the companion repository.
Chapter 5 taught us to manage context within a session. This chapter extended that to context across sessions. But so far, we’ve been building single AI systems—one assistant, one memory store, one retrieval pipeline. What happens when a task is too complex for a single AI? Chapter 10 explores multi-agent systems: when to split work across specialized agents, how they communicate, and how to orchestrate their collaboration.