Chapter 5: Managing Conversation History

Your chatbot has amnesia.

Not the dramatic kind where it forgets everything. The subtle kind where it starts strong, remembers the first few exchanges perfectly, then gradually loses the thread. By message twenty, it’s asking questions you already answered. By message forty, it contradicts advice it gave earlier. By message sixty, it’s forgotten what it’s supposed to be helping you with.

This isn’t a bug. It’s a fundamental constraint: conversation history grows linearly, but context windows are finite. Every message you add pushes older messages closer to irrelevance—or out of the window entirely.

The vibe coder’s solution is to dump the entire conversation into the context and hope for the best. This works until it doesn’t. Costs spike. Latency increases. Quality degrades. And then, suddenly, the context window overflows and the whole thing breaks.

This chapter teaches you to manage conversation history deliberately. The core practice: state is the enemy; manage it deliberately or it will manage you. Every technique that follows—sliding windows, summarization, hybrid approaches—serves this principle.

By the end, you’ll have strategies for keeping your AI coherent across long sessions without exhausting your context budget or your wallet. (This chapter focuses on managing conversation history within a single session. When the session ends and the user comes back tomorrow, you’ll need persistent memory—that’s Chapter 9’s territory.)


The Conversation History Problem

Let’s quantify the problem. A typical customer support conversation runs 15-20 exchanges. Each exchange (user message + assistant response) averages 200-400 tokens. That’s 3,000-8,000 tokens for a modest conversation.

Now consider a complex debugging session. A developer is working through a tricky problem with an AI assistant. They share code snippets, error messages, stack traces. The AI responds with explanations and suggestions. Twenty exchanges in, each message is getting longer as they dive deeper. You’re easily at 30,000 tokens—and climbing.

At some point, you hit a wall.

Three Ways Conversations Break

Token overflow: You exceed the context window. The API returns an error, or silently truncates your input. Either way, the conversation breaks.

Context rot: Even before overflow, quality degrades. As Chapter 2 explained, model attention dilutes as context grows. Information in the middle gets lost. The model starts ignoring things you told it earlier—not because they’re gone, but because they’re drowned out.

Cost explosion: Tokens cost money. A 100K-token context costs roughly 10x a 10K-token context. For high-volume applications, the difference between managed and unmanaged history is the difference between viable and bankrupt.
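A back-of-the-envelope calculation makes the scaling concrete. The price below is purely illustrative; plug in your provider's actual rate:

# Hypothetical rate for illustration only: $3 per million input tokens
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def turn_cost(context_tokens: int) -> float:
    """Input cost of one turn that resends the entire history."""
    return context_tokens * PRICE_PER_INPUT_TOKEN

print(f"10K-token context:  ${turn_cost(10_000):.4f} per turn")   # $0.0300
print(f"100K-token context: ${turn_cost(100_000):.4f} per turn")  # $0.3000, roughly 10x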

The Naive Approach Fails

The simplest approach is to concatenate everything:

def naive_history(messages):
    """Don't do this in production."""
    return "\n".join([
        f"{m['role']}: {m['content']}"
        for m in messages
    ])

This works for demos. It fails in production because:

  1. Growth is unbounded. Every message makes the context larger. Eventually, you hit the wall.

  2. Old information crowds out new. By the time you’re at message 50, the system prompt and early context compete with 49 messages for attention.

  3. Costs compound. Every turn resends the entire history, so per-turn cost grows linearly with conversation length, and the total cost of a conversation grows roughly quadratically.

  4. Latency increases. More tokens means slower responses. Users notice.

You need strategies that preserve what matters while discarding what doesn’t.


Strategy 1: Sliding Windows

The simplest real strategy is a sliding window: keep the last N messages, discard the rest.

class SlidingWindowMemory:
    """Keep only recent messages in context."""

    def __init__(self, max_messages: int = 10, max_tokens: int = 4000):
        self.max_messages = max_messages
        self.max_tokens = max_tokens
        self.messages = []

    def add(self, role: str, content: str):
        """Add a message and enforce limits."""
        self.messages.append({"role": role, "content": content})
        self._enforce_limits()

    def _enforce_limits(self):
        """Keep conversation within bounds."""
        # First: message count limit
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

        # Second: token limit (approximate)
        while self._estimate_tokens() > self.max_tokens and len(self.messages) > 2:
            self.messages.pop(0)

    def _estimate_tokens(self) -> int:
        """Rough token estimate (4 chars ≈ 1 token)."""
        return sum(len(m["content"]) // 4 for m in self.messages)

    def get_messages(self) -> list:
        """Return messages for API call."""
        return self.messages.copy()

    # Example usage output:
    # After 15 messages with max_messages=10:
    # - Messages 1-5: discarded
    # - Messages 6-15: retained
    # Token count: ~3,200 (within 4,000 limit)
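A quick usage sketch of the class above (the message contents are invented for illustration):

memory = SlidingWindowMemory(max_messages=10, max_tokens=4000)

# Simulate a 15-message conversation
for i in range(1, 16):
    role = "user" if i % 2 else "assistant"
    memory.add(role, f"Message {i}: some detail about the ongoing topic")

print(len(memory.get_messages()))  # 10 -- messages 1-5 have been discarded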

Here’s what the sliding window looks like visually:

[Figure: Sliding Window. System prompt and current query stay fixed while older messages are discarded.]

When Sliding Windows Work

Sliding windows work well when:

  • Recent context is sufficient. The last few exchanges contain everything needed to continue.
  • Conversations are short. Most interactions complete within the window size.
  • Topics don’t reference old context. Users don’t say “remember what you said earlier about X.”

Typical applications: simple chatbots, Q&A systems, single-topic support conversations.

When Sliding Windows Fail

Sliding windows fail when:

  • Users reference old context. “What was that command you suggested earlier?” If it’s been discarded, the model can’t answer.
  • Decisions build on earlier discussions. In a debugging session, the error message from turn 3 might be critical in turn 30.
  • The conversation has phases. Setup → exploration → resolution. The sliding window might discard the setup just when you need it for resolution.

For these cases, you need something smarter.


Strategy 2: Summarization

Instead of discarding old messages, compress them. A 10-message exchange becomes a 2-sentence summary. You preserve the essence while reclaiming tokens.

from datetime import datetime

class SummarizingMemory:
    """Compress old messages into summaries."""

    def __init__(self, llm_client, active_window: int = 6, summary_threshold: int = 10):
        self.llm = llm_client
        self.active_window = active_window  # Keep this many recent messages
        self.summary_threshold = summary_threshold  # Summarize when exceeding this
        self.messages = []
        self.summaries = []

    def add(self, role: str, content: str):
        """Add message, compress if needed."""
        self.messages.append({"role": role, "content": content})

        if len(self.messages) >= self.summary_threshold:
            self._compress_old_messages()

    def _compress_old_messages(self):
        """Summarize older messages to reclaim tokens."""
        # Keep recent messages active
        to_summarize = self.messages[:-self.active_window]
        self.messages = self.messages[-self.active_window:]

        if not to_summarize:
            return

        # Format for summarization
        conversation = "\n".join([
            f"{m['role']}: {m['content'][:200]}..."
            if len(m['content']) > 200 else f"{m['role']}: {m['content']}"
            for m in to_summarize
        ])

        # Generate summary
        summary = self.llm.complete(
            f"Summarize this conversation segment concisely. "
            f"Preserve key facts, decisions, and any unresolved questions:\n\n"
            f"{conversation}\n\nSummary:"
        )

        self.summaries.append({
            "text": summary,
            "message_count": len(to_summarize),
            "timestamp": datetime.utcnow().isoformat()
        })

    def get_context(self) -> str:
        """Build context from summaries + active messages."""
        parts = []

        # Add summaries of older conversation
        if self.summaries:
            parts.append("=== Previous Discussion ===")
            # Keep last 3 summaries to bound growth
            for summary in self.summaries[-3:]:
                parts.append(summary["text"])
            parts.append("")

        # Add active messages
        if self.messages:
            parts.append("=== Recent Messages ===")
            for m in self.messages:
                parts.append(f"{m['role']}: {m['content']}")

        return "\n".join(parts)

    # Example context output:
    # === Previous Discussion ===
    # User asked about authentication options. Discussed JWT vs sessions.
    # Decided on JWT with refresh tokens. User concerned about security.
    #
    # === Recent Messages ===
    # user: How do I handle token expiration?
    # assistant: For JWT expiration, you have two main strategies...

The Summarization Trade-off

Summarization preserves meaning but loses detail. The model knows you “discussed authentication options” but not the specific code snippet you shared.

This trade-off is often acceptable. In a customer support conversation, knowing “user is frustrated about shipping delay” is more useful than retaining every word of their complaint. The summary captures what matters for continuing the conversation.

But for technical conversations—debugging, code review, architecture discussions—details matter. Losing the exact error message or the specific line of code can derail the conversation.

Summarization Quality

The quality of your summaries determines the quality of your long conversations. Bad summaries lose critical information. Good summaries capture:

  • Key facts: Names, numbers, decisions made
  • Unresolved questions: What’s still being worked on
  • User state: Emotional tone, expertise level, preferences expressed
  • Commitments: What the assistant promised or suggested

Test your summarization. Take real conversations, summarize them, then see if you can continue the conversation correctly from just the summary. If critical context is lost, improve your summarization prompt.

What to Preserve vs. Drop in Summaries

To write better summarization prompts, you need to be explicit about what information is critical and what’s noise.

Must preserve:

  • Decisions made: “User chose PostgreSQL over MySQL because of JSON support”
  • Factual assertions: “The error only occurs in production, not locally”
  • User preferences: “Prefers brief explanations with code examples”
  • Action items: “Need to refactor authentication before deploying to staging”
  • Key constraints: “API calls limited to 1,000/day”, “Must support Python 3.8+”

Safe to drop:

  • Greetings and meta-discussion: “Hi, can you help me with…”, “Thanks for your help”
  • Filler and reformulations: Repeated explanations of the same thing, “uh”, “let me think”
  • Irrelevant personal details: Weather, what they had for lunch (unless relevant to the problem)
  • Duplicate information: If something was said three times, mention it once

Example of bad vs. good summarization:

Bad summary (loses critical information):

The user discussed pagination with the assistant. They talked about different approaches.
The user seems interested in performance.

Good summary (preserves decisions and constraints):

User implemented cursor-based pagination (not offset-based) because API handles 10K+ records.
Decided on keyset pagination using (created_at, id) compound key for sort stability.
Constraint: Must maintain compatibility with existing client code. Unresolved: How to handle deleted records in paginated results.

Summarization Prompt Template

Here’s a prompt template that extracts the right information:

def create_summarization_prompt(conversation_segment: str) -> str:
    """Create a prompt that extracts critical information from a conversation."""
    return f"""Summarize this conversation segment. Extract and preserve:

1. KEY DECISIONS: What did the user decide to do? Be specific.
   Format: "Decided to [action] because [reason]"

2. FACTUAL ASSERTIONS: What facts or constraints were established?
   Format: "System constraint: [constraint]" or "Problem: [fact]"

3. USER PREFERENCES: How does the user prefer responses or approaches?
   Format: "User prefers [preference] (reason: [why if stated])"

4. ACTION ITEMS: What work is pending or what was promised?
   Format: "TODO: [specific action]"

5. UNRESOLVED QUESTIONS: What's still being discussed or undecided?
   Format: "Open question: [question]"

Drop: greetings, meta-discussion, filler, off-topic details, repeated points.

Conversation segment:
---
{conversation_segment}
---

Provide the summary as a concise paragraph using the formats above. Be specific—use actual names, numbers, and technical details."""

# Usage in your summarization code
def improved_summarize(llm_client, conversation_segment: str) -> str:
    """Generate summary with structured prompt."""
    prompt = create_summarization_prompt(conversation_segment)
    summary = llm_client.complete(prompt)
    return summary

Testing Your Summarization

The ultimate test: can the conversation continue correctly from just the summary?

def test_summarization_quality(llm_client, original_conversation: list, question: str):
    """Test if a summary preserves enough context to answer follow-up questions.

    Args:
        original_conversation: Full conversation history
        question: A follow-up question that depends on earlier context
    """
    # Summarize the conversation
    segment = "\n".join([f"{m['role']}: {m['content']}" for m in original_conversation])
    summary = improved_summarize(llm_client, segment)

    # Try to answer a follow-up question using only the summary
    summary_based_answer = llm_client.complete(
        f"Summary of earlier conversation:\n{summary}\n\n"
        f"Follow-up question: {question}\n\n"
        f"Answer based only on the summary above:"
    )

    # Try to answer using the full conversation
    full_context_answer = llm_client.complete(
        f"Full conversation:\n{segment}\n\n"
        f"Follow-up question: {question}\n\n"
        f"Answer based on the full conversation:"
    )

    # Compare: were critical details preserved?
    print(f"Question: {question}")
    print(f"\nFrom summary: {summary_based_answer[:200]}...")
    print(f"From full context: {full_context_answer[:200]}...")
    print(f"\nSummary length: {len(summary)} chars (vs {len(segment)} full)")

    # If answers diverge significantly, your summarization is losing critical info.
    # similarity() is a placeholder for your preferred text-similarity metric;
    # one embedding-based sketch appears just below.
    if similarity(summary_based_answer, full_context_answer) < 0.7:
        print("⚠ WARNING: Summary loses critical information for follow-up questions")
        return False

    return True

This test catches the most important failure mode: when a summary is so compressed that following conversations can’t build on it correctly.
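The similarity() call above is left undefined; here is one minimal sketch using embedding cosine similarity (assuming the sentence-transformers and scikit-learn packages, the same tools the topic-shift detector later in this chapter relies on):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the embedding model once, not per comparison
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(text_a: str, text_b: str) -> float:
    """Embedding cosine similarity between two texts (higher means more similar)."""
    vec_a = _embedder.encode(text_a)
    vec_b = _embedder.encode(text_b)
    return float(cosine_similarity([vec_a], [vec_b])[0][0])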


Strategy 3: Hybrid Approaches

Production systems rarely use a single strategy. They combine approaches based on message age and importance.

Tiered Memory

The most effective pattern is tiered memory: recent messages stay verbatim, older messages get summarized, ancient messages get archived or discarded.

class TieredMemory:
    """Three-tier memory: active, summarized, archived."""

    def __init__(self, llm_client, active_limit: int = 10,
                 max_summaries: int = 5, max_tokens: int = 4000):
        self.llm = llm_client
        self.max_tokens = max_tokens  # Default budget for get_context()

        # Tier 1: Active (full messages, ~10 most recent)
        self.active_messages = []
        self.active_limit = active_limit

        # Tier 2: Summarized (compressed batches)
        self.summaries = []
        self.max_summaries = max_summaries

        # Tier 3: Key facts (extracted important information)
        self.key_facts = []

    def add(self, role: str, content: str):
        """Add message, manage tiers."""
        self.active_messages.append({"role": role, "content": content})

        # Promote to Tier 2 when Tier 1 overflows
        if len(self.active_messages) > self.active_limit:
            self._promote_to_summary()

        # Archive Tier 2 when it overflows
        if len(self.summaries) > self.max_summaries:
            self._archive_oldest_summary()

    def _promote_to_summary(self):
        """Move oldest active messages to summary tier."""
        # Take oldest half of active messages
        to_summarize = self.active_messages[:self.active_limit // 2]
        self.active_messages = self.active_messages[self.active_limit // 2:]

        # Summarize
        summary = self._summarize(to_summarize)
        self.summaries.append(summary)

    def _archive_oldest_summary(self):
        """Extract key facts from oldest summary, then discard it."""
        oldest = self.summaries.pop(0)

        # Extract any facts worth preserving permanently
        facts = self._extract_key_facts(oldest)
        self.key_facts.extend(facts)

        # Deduplicate and limit key facts
        self.key_facts = self._deduplicate_facts(self.key_facts)[-20:]

    def get_context(self, max_tokens: int = None) -> str:
        """Assemble context within token budget."""
        max_tokens = max_tokens or self.max_tokens
        parts = []
        tokens_used = 0

        # Always include key facts (highest information density)
        if self.key_facts:
            facts_text = "Key facts: " + "; ".join(self.key_facts)
            parts.append(facts_text)
            tokens_used += len(facts_text) // 4

        # Add summaries if room: reserve ~30% of the remaining budget for them
        summary_budget = tokens_used + (max_tokens - tokens_used) * 0.3
        for summary in reversed(self.summaries):  # Most recent first
            if tokens_used < summary_budget:
                parts.append(f"[Earlier]: {summary}")
                tokens_used += len(summary) // 4

        # Add active messages (always include most recent)
        parts.append("--- Recent ---")
        for m in self.active_messages:
            parts.append(f"{m['role']}: {m['content']}")

        return "\n".join(parts)

    def estimate_tokens(self) -> int:
        """Rough token estimate across all tiers."""
        total = sum(len(m["content"]) // 4 for m in self.active_messages)
        total += sum(len(s) // 4 for s in self.summaries)
        total += sum(len(f) // 4 for f in self.key_facts)
        return total

    def get_stats(self) -> dict:
        """Return memory statistics for monitoring."""
        return {
            "active_messages": len(self.active_messages),
            "summaries": len(self.summaries),
            "key_facts": len(self.key_facts),
            "estimated_tokens": self.estimate_tokens(),
        }

    def get_key_facts(self) -> list:
        """Return current key facts."""
        return self.key_facts.copy()

    def set_key_facts(self, facts: list):
        """Restore key facts after reset."""
        self.key_facts = facts

    def reset(self):
        """Clear all tiers."""
        self.active_messages = []
        self.summaries = []
        self.key_facts = []

    # Example stats output:
    # {"active_messages": 8, "summaries": 3, "key_facts": 12,
    #  "estimated_tokens": 2847}

Budget Allocation

A common pattern is 40/30/30 allocation:

  • 40% for recent messages: The immediate context needs full fidelity
  • 30% for summaries: Compressed but meaningful history
  • 30% for retrieved/key facts: The most important information regardless of age

This ensures recent context gets priority while preserving access to older information.
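As a sketch, you might compute the split like this (the helper name and the 8,000-token example budget are illustrative):

def allocate_budget(max_tokens: int) -> dict:
    """Split a history budget using the 40/30/30 pattern."""
    return {
        "recent_messages": int(max_tokens * 0.4),  # full-fidelity recent turns
        "summaries": int(max_tokens * 0.3),        # compressed older history
        "key_facts": int(max_tokens * 0.3),        # important info regardless of age
    }

# Example: an 8,000-token history budget
# -> {'recent_messages': 3200, 'summaries': 2400, 'key_facts': 2400}
print(allocate_budget(8000))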


When to Reset vs. Preserve

Not every conversation should be preserved. Sometimes the right answer is to start fresh. The decision depends on understanding what information is actually useful for the next part of the conversation.

Decision Framework: When to Reset Context

Use this checklist to determine whether to reset conversation history or preserve it:

1. Topic Change: Did the Subject Fundamentally Shift?

IF user explicitly requests topic change ("let's talk about X instead")
   OR query subject is completely unrelated to previous exchanges
   THEN: Strong signal to reset

IF user is exploring multiple related subtopics
   (e.g., "first, let's discuss indexing, then caching, then monitoring")
   THEN: Preserve context (all related to performance optimization)

Example: User spent 30 messages debugging a race condition in authentication. Then asks “how should I structure my logging?” This is a topic change—reset is appropriate.

Counterexample: User spent messages on “optimize database queries.” Now asks “what indexes would help here?” This is the same topic deepening, not changing—preserve context.

2. User Identity Change: Is This Still the Same User?

IF session token changes
   OR user authentication changes
   OR you have explicit evidence user switched
   THEN: Always reset for security

IF same user continues in same context
   THEN: Preserve

This is non-negotiable: never leak one user’s conversation into another user’s context, even if they’re discussing similar topics.

3. Error Recovery: Did the Model Give a Bad Response?

IF user says "that's wrong" or "try again"
   AND the previous response was based on misunderstanding
   THEN: Can reset + restate the problem more clearly
   (this often works better than trying to correct in-place)

IF user says "you contradicted yourself"
   THEN: Strong signal to reset (context has become confused)

IF user wants to continue from a different assumption
   THEN: Reset, then explicitly state the new constraint

Example: Model suggested using Redis for a use case. User says “no, we can’t add another infrastructure dependency.” Rather than trying to correct mid-conversation, reset and ask: “Given the constraint that we can’t add external infrastructure, what are our options?”

4. Session Timeout: Has Time Passed?

IF user returns after > 1 hour of inactivity
   THEN: Mild signal to reset (context may be stale)

IF user returns after > 4 hours
   THEN: Strong signal to reset (context is likely stale)

IF user was offline > 24 hours
   THEN: Always reset (context is definitely stale)

   NOTE: But preserve key facts (decisions made, constraints discovered)

The reasoning: conversation context makes sense while it’s fresh. After a break, the user’s mental model of the conversation has likely reset anyway, and restarting fresh is less disorienting.

5. Context Budget Exceeded: Are You Approaching the Limit?

IF memory.estimate_tokens() > ABSOLUTE_MAX_TOKENS * 0.7  # 70% threshold
   THEN: Proactive compression required

   IF compression would result in too much information loss
      THEN: Reset + preserve key facts
      AND: Inform user "conversation was getting long, I've captured the key decisions"

   ELSE:
      THEN: Compress (summarize old messages)
      AND: Continue conversation with compressed history

The 70% threshold is critical. Compress before you hit the wall, not after.

Decision Tree in Practice

Here’s the logic as a flowchart:

┌─ START: New message arrives
│
├─ Did user explicitly request reset?
│  YES → RESET
│  NO ↓
│
├─ Did user identity change?
│  YES → RESET (security)
│  NO ↓
│
├─ Is conversation context > 70% of token budget?
│  YES → Can we compress safely?
│       YES → COMPRESS & CONTINUE
│       NO → RESET + PRESERVE FACTS
│  NO ↓
│
├─ Did the topic fundamentally change?
│  YES → RESET (topic shift)
│  NO ↓
│
├─ Is this an error recovery ("that's wrong", "try again")?
│  YES → RESET + RESTATE PROBLEM
│  NO ↓
│
├─ Has > 4 hours passed since last message?
│  YES → RESET (but preserve key facts)
│  NO ↓
│
└─ PRESERVE & CONTINUE

Notice the order: explicit requests and security come first, then budget constraints, then topic/coherence issues, then time-based signals.

Implementing Smart Reset Logic

from datetime import datetime, timedelta

class SmartContextManager:
    """Decide whether to reset or preserve conversation context."""

    def __init__(self, memory, logger=None):
        self.memory = memory
        self.logger = logger
        self.last_message_time = None
        self.session_start = datetime.utcnow()

    def should_reset(self, new_message: str) -> tuple[bool, str]:
        """Determine if conversation should reset.

        Returns:
            (should_reset: bool, reason: str)
        """

        # 1. Explicit user request
        reset_phrases = ["start over", "forget that", "new topic", "let's reset",
                        "clear context", "begin fresh"]
        for phrase in reset_phrases:
            if phrase.lower() in new_message.lower():
                return True, f"User requested reset: '{phrase}'"

        # 2. Security: User identity change (you'd implement this based on auth)
        # (skipped here, but would check session tokens)

        # 3. Token budget exceeded (assumes the memory object exposes
        #    estimate_tokens() and ABSOLUTE_MAX_TOKENS, as BoundedMemory does
        #    later in this chapter)
        current_tokens = self.memory.estimate_tokens()
        max_tokens = self.memory.ABSOLUTE_MAX_TOKENS
        if current_tokens > max_tokens * 0.7:
            can_compress = current_tokens < max_tokens * 0.85
            if can_compress:
                return False, "approaching_limit_but_can_compress"
            else:
                return True, "context_budget_exceeded_compression_insufficient"

        # 4. Major topic shift
        if self._detect_topic_shift(new_message):
            return True, "topic_shift_detected"

        # 5. Error recovery
        if self._detect_error_recovery(new_message):
            return True, "user_requesting_retry_after_error"

        # 6. Session timeout
        time_since_last = self._time_since_last_message()
        if time_since_last > timedelta(hours=4):
            return True, "session_timeout_4hours"
        elif time_since_last > timedelta(hours=1):
            return False, "soft_timeout_but_preserve"  # Compress instead

        return False, "no_reset_needed"

    def _detect_topic_shift(self, new_message: str) -> bool:
        """Detect if user is shifting to a fundamentally different topic.

        Simplified embedding-based check; in production you'd load the model
        once (e.g., in __init__) and tune the similarity threshold on real data.
        """
        if not self.memory.messages:
            return False

        # Get the most recent user messages
        recent_topics = " ".join([
            m["content"] for m in self.memory.messages[-6:]
            if m["role"] == "user"
        ])

        # Check semantic distance (simplified version)
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer('all-MiniLM-L6-v2')

        recent_vec = model.encode(recent_topics)
        current_vec = model.encode(new_message)

        # Cosine similarity
        from sklearn.metrics.pairwise import cosine_similarity
        similarity = cosine_similarity([recent_vec], [current_vec])[0][0]

        # If similarity is very low, it's a topic shift
        return similarity < 0.4

    def _detect_error_recovery(self, new_message: str) -> bool:
        """Detect if user is asking to retry/correct."""
        error_phrases = ["that's wrong", "try again", "no that's not right",
                        "you contradicted", "that doesn't make sense",
                        "let me rephrase", "actually no"]
        return any(phrase.lower() in new_message.lower() for phrase in error_phrases)

    def _time_since_last_message(self) -> timedelta:
        """Return time elapsed since last message in conversation."""
        if not self.last_message_time:
            return timedelta(0)
        return datetime.utcnow() - self.last_message_time

    def handle_message(self, role: str, content: str):
        """Process message and apply reset logic if needed."""
        should_reset, reason = self.should_reset(content)

        if should_reset:
            key_facts = self.memory.get_key_facts()
            self.memory.reset()
            if key_facts:
                self.memory.set_key_facts(key_facts)

            if self.logger:
                self.logger.info(f"Reset conversation. Reason: {reason}")

        self.memory.add(role, content)
        self.last_message_time = datetime.utcnow()

What to Preserve When You Reset

When you reset, you’re not discarding everything—you’re moving important information to the key facts tier:

def reset_with_preservation(memory, reason: str):
    """Reset conversation but preserve key facts."""
    # Illustrative facts; in practice these come from your decision/fact tracker
    facts_to_preserve = {
        "decisions": [
            "Chose PostgreSQL for the primary database",
            "Decided against caching layer due to budget constraints"
        ],
        "constraints": [
            "Must support Python 3.8+",
            "API rate limit: 1000 calls/day",
            "Database schema cannot be modified"
        ],
        "user_preferences": [
            "User prefers concise explanations with code examples",
            "User wants to understand the 'why' behind recommendations"
        ]
    }

    # Reset first, then restore the facts: reset() clears every tier,
    # including key facts, so the order matters
    memory.reset()

    for fact in facts_to_preserve["decisions"]:
        memory.add_key_fact(fact)
    for constraint in facts_to_preserve["constraints"]:
        memory.add_key_fact(constraint)
    for pref in facts_to_preserve["user_preferences"]:
        memory.add_key_fact(pref)

    total_facts = sum(len(v) for v in facts_to_preserve.values())
    print(f"Reset: {reason}. Preserved {total_facts} key facts.")

Notice that all of these preservation signals are about the current session—keeping context active while the conversation is live. This is different from extraction, where you identify key facts worth storing permanently for future sessions. Within-session preservation asks “should I keep this in the active window?” Cross-session extraction asks “is this worth remembering forever?” The criteria overlap but aren’t identical: you might preserve an entire debugging thread for the current session but only extract the final resolution for long-term memory.

Chapter 9 tackles the extraction problem: how to carry what matters into the next session. The tiered memory architecture and key fact extraction you’re learning here become the foundation for that persistent memory layer.

Implementing Reset Logic

class ConversationManager:
    """Manage conversation lifecycle including resets."""

    def __init__(self, memory):
        self.memory = memory
        # TopicTracker (not shown) is assumed to expose is_major_shift(),
        # a coherence_score attribute, and update()
        self.topic_tracker = TopicTracker()

    def should_reset(self, new_message: str) -> bool:
        """Determine if conversation should reset."""

        # Explicit reset request
        reset_phrases = ["start over", "forget that", "new topic", "let's reset"]
        if any(phrase in new_message.lower() for phrase in reset_phrases):
            return True

        # Major topic shift
        if self.topic_tracker.is_major_shift(new_message):
            return True

        # Conversation too long with low coherence
        if (len(self.memory.messages) > 50 and
            self.topic_tracker.coherence_score < 0.3):
            return True

        return False

    def handle_message(self, role: str, content: str):
        """Process message with potential reset."""
        if self.should_reset(content):
            # Preserve key facts before reset
            preserved = self.memory.get_key_facts()
            self.memory.reset()
            self.memory.set_key_facts(preserved)

        self.memory.add(role, content)
        self.topic_tracker.update(content)

Streaming and Conversation History

Everything in this chapter assumes batch processing—you wait for the full response before updating conversation history. In production, most systems use streaming to deliver tokens as they’re generated. This creates specific challenges for conversation history management.

The Streaming History Problem

When streaming, you don’t have the complete assistant response when the user might interrupt or the connection might drop. This means:

  • Partial responses in history: If the user disconnects mid-stream, do you save the partial response? A half-finished code example might be worse than no response at all.
  • Summary timing: When do you trigger summarization? After each complete response? After a batch of exchanges? You can’t summarize a response that’s still generating.
  • Memory extraction timing: Should you extract memories from partial responses? Generally no—wait for the complete response to avoid extracting incomplete or incorrect information.

Practical Patterns

Buffer-then-commit: Stream tokens to the user in real time, but buffer the full response before adding it to conversation history. If the stream is interrupted, discard the partial response from history (but optionally log it for debugging).

import logging

log = logging.getLogger(__name__)

class StreamingHistoryManager:
    """Stream tokens to the user, but commit to history only when complete."""

    def __init__(self, history: "ConversationHistory"):
        self.history = history
        self.buffer = ""

    async def handle_stream(self, stream):
        self.buffer = ""
        try:
            async for chunk in stream:
                self.buffer += chunk.text
                yield chunk  # Forward to user
            # Stream complete — commit to history
            self.history.add_assistant_message(self.buffer)
        except ConnectionError:
            # Stream interrupted — don't commit partial response
            log.warning(f"Partial response discarded: {len(self.buffer)} chars")
            self.buffer = ""

Checkpoint summarization: For long-running sessions, summarize at natural breakpoints (topic changes, explicit “let’s move on” signals) rather than on a fixed token count. This avoids summarizing mid-thought.
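As a sketch, a checkpoint trigger might key off explicit topic-change signals rather than token counts (this assumes the TieredMemory interface from earlier; the phrase list and minimum-message threshold are illustrative):

TOPIC_BREAK_PHRASES = ["let's move on", "next topic", "switching gears", "different question"]

def maybe_checkpoint(memory, latest_user_message: str, min_messages: int = 8):
    """Summarize at a natural breakpoint instead of on a fixed token count."""
    topic_break = any(p in latest_user_message.lower() for p in TOPIC_BREAK_PHRASES)
    if topic_break and len(memory.active_messages) >= min_messages:
        memory._promote_to_summary()  # reuse the tiered-memory compression step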

Incremental memory extraction: Extract memories only from committed (complete) responses. Run extraction asynchronously after the response is fully committed to avoid blocking the next user interaction.

Streaming doesn’t change the fundamental principles of conversation history management—it just adds timing considerations around when to commit, summarize, and extract.


CodebaseAI Evolution: Adding Conversation Memory

Previous versions of CodebaseAI were stateless—each question was independent. Now we add the ability to have multi-turn conversations about code.

import anthropic
import json
import uuid
import logging
from datetime import datetime
from dataclasses import dataclass

@dataclass
class Response:
    content: str
    request_id: str
    prompt_version: str
    memory_stats: dict

class ConversationalCodebaseAI:
    """CodebaseAI with conversation memory.

    Extends the v0.3.1 system prompt architecture from Chapter 4
    with tiered conversation history management.
    """

    VERSION = "0.4.0"
    PROMPT_VERSION = "v2.1.0"

    # System prompt from Chapter 4 (abbreviated for clarity)
    SYSTEM_PROMPT = """You are a senior software engineer and code educator.
    [Full four-component prompt from Chapter 4, v2.0.0]"""

    def __init__(self, config=None):
        self.config = config or self._default_config()
        self.client = anthropic.Anthropic(api_key=self.config.get("api_key"))
        self.logger = logging.getLogger("codebase_ai")

        # Conversation memory with tiered approach
        self.memory = TieredMemory(
            llm_client=self.client,
            active_limit=10,
            max_summaries=5,
            max_tokens=int(self.config.get("max_context_tokens", 16000) * 0.4)  # 40% for history
        )

        # Track code files discussed
        self.code_context = {}  # filename -> content

    @staticmethod
    def _default_config():
        return {"api_key": "your-key", "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 4096, "max_context_tokens": 16000}

    def ask(self, question: str, code: str = None) -> Response:
        """Ask a question in the context of ongoing conversation."""

        request_id = str(uuid.uuid4())[:8]

        # Update code context if new code provided
        if code:
            # _extract_filename (helper not shown) pulls a filename from the question
            filename = self._extract_filename(question) or "current_file"
            self.code_context[filename] = code

        # Add user message to memory
        self.memory.add("user", question)

        # Build context: system prompt + history + code
        conversation_context = self.memory.get_context()
        code_context = self._format_code_context()

        # Log for debugging
        self.logger.info(json.dumps({
            "event": "request",
            "request_id": request_id,
            "memory_stats": {
                "active_messages": len(self.memory.active_messages),
                "summaries": len(self.memory.summaries),
                "key_facts": len(self.memory.key_facts),
            },
            "code_files": list(self.code_context.keys()),
        }))

        # Build messages for API
        messages = [{
            "role": "user",
            "content": f"{conversation_context}\n\n{code_context}\n\nCurrent question: {question}"
        }]

        response = self.client.messages.create(
            model=self.config.get("model", "claude-sonnet-4-5-20250929"),
            max_tokens=self.config.get("max_tokens", 4096),
            system=self.SYSTEM_PROMPT,
            messages=messages
        )

        # Add assistant response to memory
        assistant_response = response.content[0].text
        self.memory.add("assistant", assistant_response)

        return Response(
            content=assistant_response,
            request_id=request_id,
            prompt_version=self.PROMPT_VERSION,
            memory_stats=self.memory.get_stats()
        )

    def _format_code_context(self) -> str:
        """Format tracked code files for context."""
        if not self.code_context:
            return ""

        parts = ["=== Code Files ==="]
        for filename, content in self.code_context.items():
            # Truncate very long files
            if len(content) > 2000:
                content = content[:2000] + "\n... [truncated]"
            parts.append(f"--- {filename} ---\n{content}")

        return "\n".join(parts)

    def reset_conversation(self, preserve_code: bool = True):
        """Reset conversation memory."""
        key_facts = self.memory.get_key_facts()
        self.memory.reset()

        # Carry key facts across the reset
        if key_facts:
            self.memory.set_key_facts(key_facts)

        # Optionally clear code context
        if not preserve_code:
            self.code_context = {}

        self.logger.info(json.dumps({
            "event": "conversation_reset",
            "preserved_facts": len(key_facts),
            "preserved_code": preserve_code,
        }))

    def get_conversation_stats(self) -> dict:
        """Return memory statistics for monitoring."""
        return {
            "active_messages": len(self.memory.active_messages),
            "summaries": len(self.memory.summaries),
            "key_facts": len(self.memory.key_facts),
            "code_files_tracked": len(self.code_context),
            "estimated_tokens": self.memory.estimate_tokens(),
        }

What Changed

Before: Each ask() call was independent. No memory of previous questions.

After: Conversations persist across calls. The system remembers what you discussed, what code you shared, and what conclusions you reached.

Memory management: Tiered approach keeps recent messages verbatim, summarizes older ones, and extracts key facts from ancient history.

Code tracking: Files discussed are tracked separately from conversation history. They persist even when conversation history is compressed.

Observability: Memory statistics are logged with each request, enabling debugging of memory-related issues.


Debugging: “My Chatbot Contradicts Itself”

The most common conversation history bug: the model says one thing, then later says the opposite. Here’s how to diagnose and fix it.

Step 1: Check What’s Actually in Context

The first question: does the model have access to what it said before?

def debug_contradiction(memory, contradicting_response):
    """Diagnose why model contradicted itself."""

    # Get the context that was sent
    context = memory.get_context()

    # Search for the original statement
    # (You need to know roughly what it said)
    original_claim = "use PostgreSQL"  # Example

    if original_claim not in context:
        return "DIAGNOSIS: Original statement was truncated or summarized away"

    # Check position in context
    position = context.find(original_claim)
    context_length = len(context)
    relative_position = position / context_length

    if 0.3 < relative_position < 0.7:
        return "DIAGNOSIS: Original statement is in the 'lost middle' zone"

    return "DIAGNOSIS: Statement is in context but model still contradicted it"

Step 2: Identify the Cause

Cause A: Truncation The original statement was in messages that got discarded by the sliding window.

Fix: Extend the window, add summarization, or extract key decisions as facts.

Cause B: Lost in the Middle The statement is technically in context but buried in the middle where attention is weak.

Fix: Move important decisions to key facts (beginning of context) or repeat them periodically.

Cause C: Ambiguous Summarization The statement was summarized in a way that lost its definitiveness. “Discussed database options” doesn’t capture “decided on PostgreSQL.”

Fix: Improve summarization prompt to preserve decisions, not just topics.

Cause D: Conflicting Information Later in the conversation, something contradicted the original statement. The model sided with the newer information.

Fix: Make decisions explicit and final. “DECISION: We will use PostgreSQL. This is final unless explicitly revisited.”

Step 3: Implement Prevention

from datetime import datetime

class DecisionTracker:
    """Track and reinforce key decisions to prevent contradictions."""

    def __init__(self):
        self.decisions = []  # List of firm decisions

    def record_decision(self, topic: str, decision: str):
        """Record a firm decision."""
        self.decisions.append({
            "topic": topic,
            "decision": decision,
            "timestamp": datetime.utcnow().isoformat(),
            "final": True
        })

    def get_decisions_context(self) -> str:
        """Format decisions for injection into context."""
        if not self.decisions:
            return ""

        lines = ["=== Established Decisions (Do Not Contradict) ==="]
        for d in self.decisions:
            lines.append(f"- {d['topic']}: {d['decision']}")
        return "\n".join(lines)

    def check_for_contradiction(self, response: str) -> list:
        """Check if response contradicts recorded decisions."""
        contradictions = []
        for decision in self.decisions:
            # Simple check: does the response suggest the opposite?
            # _might_contradict (not shown) could be keyword- or LLM-based;
            # production would use more sophisticated detection
            if self._might_contradict(response, decision):
                contradictions.append(decision)
        return contradictions

The key insight: contradictions happen when important information competes with other content for attention. Elevate decisions to first-class tracked entities, not just conversation messages.
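Wiring the tracker into context assembly might look like this (a sketch; memory stands in for whichever memory class you use, and build_prompt is a hypothetical helper):

tracker = DecisionTracker()
tracker.record_decision("database", "Use PostgreSQL as the primary datastore")
tracker.record_decision("auth", "JWT with refresh tokens; plain sessions were rejected")

def build_prompt(memory, tracker, question: str) -> str:
    """Put firm decisions at the top of the context, where attention is strongest."""
    return "\n\n".join(part for part in [
        tracker.get_decisions_context(),  # decisions first, hardest to ignore
        memory.get_context(),
        f"Current question: {question}",
    ] if part)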


The Memory Leak Problem

Without proper management, conversation memory is a memory leak. It grows without bound, eventually causing failures.

Symptoms of Memory Leak

  • Gradual slowdown: Each response takes longer as context grows
  • Cost creep: Monthly API bills increase even with stable traffic
  • Sudden failures: Context overflow errors after long conversations
  • Quality degradation: Responses get worse over time within a conversation

Prevention

Set hard limits and enforce them:

import logging

class BoundedMemory:
    """Memory with hard limits to prevent leaks.

    estimate_tokens(), _promote_to_summary(), and _summarize_all() follow the
    tiered-memory implementations shown earlier and are omitted here.
    """

    ABSOLUTE_MAX_TOKENS = 50000  # Never exceed this
    WARNING_THRESHOLD = 0.7  # Warn at 70%

    def __init__(self, logger=None):
        self.messages = []
        self.summaries = []
        self.logger = logger or logging.getLogger("bounded_memory")

    def add(self, role: str, content: str):
        """Add with limit enforcement."""
        self.messages.append({"role": role, "content": content})

        current = self.estimate_tokens()

        if current > self.ABSOLUTE_MAX_TOKENS:
            self._emergency_compress()
            self.logger.warning(f"Emergency compression triggered at {current} tokens")

        elif current > self.ABSOLUTE_MAX_TOKENS * self.WARNING_THRESHOLD:
            self._proactive_compress()
            self.logger.info(f"Proactive compression at {current} tokens")

    def _emergency_compress(self):
        """Aggressive compression when limits exceeded."""
        # Keep only essential: key facts + last 5 messages
        self.summaries = [self._summarize_all(self.summaries)]
        self.messages = self.messages[-5:]

    def _proactive_compress(self):
        """Gentle compression before limits hit."""
        # Standard tiered compression
        self._promote_to_summary()

The 70-80% threshold is critical. Compress before you hit the wall, not after. Proactive compression is controlled; emergency compression is lossy.


Context Engineering Beyond AI Apps

The conversation history strategies from this chapter apply directly to how you work with AI coding tools — and one practitioner has formalized this into a methodology.

Geoffrey Huntley’s Ralph Loop is context engineering applied to AI-assisted development. The core insight: instead of letting your conversation with an AI coding tool accumulate context until it degrades, start each significant iteration with a fresh context. Persist state through the filesystem — code, tests, documentation, specs — not through the conversation. At the start of each loop, the AI reads the current state of the project from disk, works within a clean context window, and writes its outputs back. The conversation is disposable. The artifacts are permanent.

This is the same principle as the sliding window and summarization strategies from this chapter, applied to a different domain. Just as you’d summarize old conversation history to free up context for new information in a chatbot, the Ralph Loop resets the conversation and lets the filesystem serve as long-term memory. The “when to reset” decision framework from this chapter applies directly: reset when the conversation has drifted, when the context is saturated, or when you’re starting a new phase of work.

If you’ve ever noticed your AI coding tool giving worse suggestions after a long session — repeating patterns you’ve already rejected, or losing track of decisions you made earlier — you’ve experienced the same context degradation this chapter teaches you to manage.

The practical application: structure your AI-assisted development around explicit checkpoints. When implementing a multi-file feature, write a brief progress note after completing each component — what’s done, what decisions were made, what’s next. If the session gets long and quality drops, start a fresh conversation with that progress note as the seed context. You’re essentially implementing the tiered memory pattern from this chapter: the progress note is your “key facts” tier, the current code is your “active messages” tier, and everything else can be safely discarded. Teams that adopt this pattern report more consistent code generation across long development sessions, particularly for complex refactoring tasks that span many files.


Summary

Key Takeaways

  • Conversation history grows linearly; context windows don’t. Without management, every conversation eventually breaks.
  • Sliding windows are simple but lose old context entirely. Use for short, single-topic conversations.
  • Summarization preserves meaning while reclaiming tokens. Quality depends on your summarization prompt.
  • Hybrid approaches combine strategies: recent messages verbatim, older ones summarized, key facts preserved indefinitely.
  • Contradictions usually stem from truncation, lost-in-the-middle effects, or poor summarization. Track decisions explicitly.
  • Set hard limits with proactive compression. Memory leaks are easier to prevent than to fix.

Concepts Introduced

  • Sliding window memory
  • Summarization-based compression
  • Tiered memory (active → summarized → archived)
  • Token budget allocation (40/30/30 pattern)
  • Decision tracking for contradiction prevention
  • Memory leak prevention with proactive compression

CodebaseAI Status

Added multi-turn conversation capability with tiered memory management. Tracks active messages, generates summaries for older exchanges, and preserves key facts. Code files are tracked separately and persist across compression cycles. Memory statistics are logged for debugging.

Engineering Habit

State is the enemy; manage it deliberately or it will manage you.

Try it yourself: Complete, runnable versions of this chapter’s code examples are available in the companion repository.


In Chapter 6, we’ll tackle retrieval-augmented generation (RAG)—how to bring external knowledge into your AI’s context when conversation history alone isn’t enough. And in Chapter 9, we’ll extend the within-session techniques from this chapter into persistent memory that survives across sessions.