Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Appendix C: Debugging Cheat Sheet

Appendix C, v2.1 — Early 2026

This is the appendix you open when something’s broken and you need answers fast. Find your symptom, check the likely causes in order, try the quick fixes.

For explanations, see the referenced chapters. For reusable solutions, see Appendix B (Pattern Library).


Quick Reference

Token Estimates

ContentTokens
1 character~0.25 tokens
1 word~1.3 tokens
1 page (500 words)~650 tokens
1 code function~100-500 tokens

Effective Context Limits

Model LimitTarget MaxDanger Zone
8K5.6K6.4K+
32K22K26K+
128K90K102K+
200K140K160K+

Quality degrades around 32K tokens regardless of model limit.

Latency Benchmarks

OperationTypicalSlow
Embedding10-50ms100ms+
Vector search20-100ms200ms+
Reranking100-250ms500ms+
LLM first token200-500ms1000ms+
LLM full response1-5s8s+

1. Context & Memory Issues

Symptom: AI ignores information that’s clearly in the context

Likely Causes:

  1. Information is in the “lost middle” (40-60% position)
  2. Context too long—attention diluted
  3. More recent/prominent information contradicts it
  4. Information format doesn’t match query pattern

Quick Fixes:

  • Move critical information to first or last 20% of context
  • Reduce total context size
  • Repeat key information near the end
  • Rephrase information to match likely query patterns

Chapter: 2 (Context Window)


Symptom: AI forgot what was said earlier in conversation

Likely Causes:

  1. Message was truncated by sliding window
  2. Message was summarized and detail was lost
  3. Token limit reached, oldest messages dropped
  4. Summarization prompt lost key details

Quick Fixes:

  • Check current token count vs. limit
  • Review what’s actually in the conversation history
  • Look at summarization output for missing details
  • Increase history budget or reduce other components

Chapter: 5 (Conversation History)


Symptom: AI contradicts its earlier statements

Likely Causes:

  1. Original statement no longer in context (truncated)
  2. Original statement in lost middle
  3. Summarization didn’t preserve the decision
  4. Later message implicitly contradicted it

Quick Fixes:

  • Check if original statement still exists in context
  • Check position of original statement
  • Add explicit decision tracking (Pattern B.3.4)
  • Include “established decisions” section in context

Chapter: 5 (Conversation History)


Symptom: Memory grows unbounded until failure

Likely Causes:

  1. No pruning strategy implemented
  2. Pruning thresholds too high
  3. Memory extraction creating duplicates
  4. Contradiction detection not superseding old memories

Quick Fixes:

  • Implement hard memory limits
  • Add tiered pruning (Pattern B.6.4)
  • Deduplicate on storage
  • Check supersession logic

Chapter: 9 (Memory and Persistence)


Symptom: Old preferences override new ones

Likely Causes:

  1. No contradiction detection
  2. Old memory has higher importance score
  3. Old memory retrieved because more similar to query
  4. New preference not extracted as memory

Quick Fixes:

  • Implement contradiction detection (Pattern B.6.5)
  • Check importance scoring logic
  • Verify new preferences are being extracted
  • Add recency boost to retrieval scoring

Chapter: 9 (Memory and Persistence)


2. RAG & Retrieval Issues

Symptom: Retrieval returns completely unrelated content

Likely Causes:

  1. Embedding model mismatch (different models for index vs. query)
  2. Chunking destroyed semantic units
  3. Query vocabulary doesn’t match document vocabulary
  4. Index corrupted or wrong collection queried

Quick Fixes:

  • Verify same embedding model for indexing and query
  • Check chunks contain coherent content (not mid-sentence)
  • Try hybrid search to catch keyword matches
  • Verify querying correct index/collection

Chapter: 6 (RAG Fundamentals)


Symptom: Correct content exists but isn’t retrieved

Likely Causes:

  1. Content was chunked poorly (split across chunks)
  2. Top-K too small
  3. Embedding doesn’t capture the semantic relationship
  4. Metadata filter excluding it

Quick Fixes:

  • Search chunks directly for expected content
  • Increase top-K (try 20-50)
  • Try different query phrasings
  • Check metadata filters aren’t over-restrictive

Chapter: 6 (RAG Fundamentals)


Symptom: Good retrieval but answer is wrong/hallucinated

Likely Causes:

  1. Too much context (lost in middle problem)
  2. Conflicting information in retrieved docs
  3. Prompt doesn’t instruct grounding
  4. Model confident in training knowledge over context

Quick Fixes:

  • Reduce number of retrieved documents
  • Add explicit “only use provided context” instruction
  • Check for contradictions in retrieved content
  • Add “if not in context, say so” instruction

Chapter: 6 (RAG Fundamentals)


Symptom: Answer ignores retrieved context entirely

Likely Causes:

  1. Context not clearly delimited
  2. System prompt doesn’t emphasize using context
  3. Query answerable from model’s training (bypasses retrieval)
  4. Retrieved content formatted poorly

Quick Fixes:

  • Add clear delimiters around retrieved content
  • Strengthen grounding instructions in system prompt
  • Add “base your answer on the following context” framing
  • Format retrieved content with clear source labels

Chapter: 6 (RAG Fundamentals)


Symptom: Reranking made results worse

Likely Causes:

  1. Reranker trained on different domain
  2. Reranking all results instead of just close scores
  3. Cross-encoder input too long (truncated)
  4. Original ranking was already good

Quick Fixes:

  • Test with and without reranking, measure both
  • Only rerank when top scores are close (within 0.15)
  • Ensure chunks fit reranker’s max length
  • Try different reranker model

Chapter: 7 (Advanced Retrieval)


Symptom: Query expansion added noise, not coverage

Likely Causes:

  1. Too many variants generated
  2. Variants drifted from original meaning
  3. Merge strategy weights variants too highly
  4. Original query was already specific

Quick Fixes:

  • Reduce to 2-3 variants
  • Add “keep the same meaning” to expansion prompt
  • Weight original query higher in merge
  • Skip expansion for specific/technical queries

Chapter: 7 (Advanced Retrieval)


3. System Prompt Issues

Symptom: AI ignores parts of system prompt

Likely Causes:

  1. Conflicting instructions (model picks one)
  2. Instruction buried in middle of long prompt
  3. Too many instructions (attention diluted)
  4. Instruction is ambiguous

Quick Fixes:

  • Audit for conflicting instructions
  • Move critical instructions to beginning or end
  • Reduce total prompt length (<2000 tokens ideal)
  • Make instructions specific and unambiguous

Chapter: 4 (System Prompts)


Symptom: Output format not followed

Likely Causes:

  1. No example provided
  2. Format specification conflicts with content needs
  3. Schema too complex
  4. Format instruction buried in prompt

Quick Fixes:

  • Add concrete example of desired output
  • Simplify schema (flatten nested structures)
  • Put format specification at end of prompt
  • Use structured output mode if available

Chapter: 4 (System Prompts)


Symptom: AI does things explicitly forbidden

Likely Causes:

  1. Constraint not prominent enough
  2. User input overrides constraint
  3. Constraint conflicts with other instructions
  4. Constraint phrasing is ambiguous

Quick Fixes:

  • Move constraints to end of prompt (high attention)
  • Phrase as explicit “NEVER do X” rather than “avoid X”
  • Add constraint reminder after user input section
  • Check for instructions that might override constraint

Chapter: 4 (System Prompts)


Symptom: Behavior inconsistent across similar queries

Likely Causes:

  1. Instructions have edge cases not covered
  2. Temperature too high
  3. Ambiguous phrasing interpreted differently
  4. Context differences between queries

Quick Fixes:

  • Reduce temperature (try 0.3 or lower)
  • Add explicit handling for edge cases
  • Rephrase ambiguous instructions
  • Log full context for inconsistent cases, compare

Chapter: 4 (System Prompts)


4. Tool Use Issues

Symptom: Model calls wrong tool

Likely Causes:

  1. Tool descriptions overlap or are ambiguous
  2. Tool names unfamiliar (not matching training patterns)
  3. Too many tools (decision fatigue)
  4. Description doesn’t include “when NOT to use”

Quick Fixes:

  • Add “Use for:” and “Do NOT use for:” to descriptions
  • Use familiar names (read_file not fetch_document)
  • Reduce tool count or group by task
  • Add examples to descriptions

Chapter: 8 (Tool Use)


Symptom: Model passes invalid parameters

Likely Causes:

  1. Parameter types not specified in schema
  2. Constraints not documented
  3. No examples in description
  4. Parameter names ambiguous

Quick Fixes:

  • Add explicit types to all parameters
  • Document constraints (min, max, allowed values)
  • Add example calls to tool description
  • Use clear, unambiguous parameter names

Chapter: 8 (Tool Use)


Symptom: Tool errors, model keeps retrying same call

Likely Causes:

  1. Error message doesn’t explain what went wrong
  2. No alternative suggested in error
  3. Model doesn’t understand the error
  4. No retry limit implemented

Quick Fixes:

  • Return actionable error messages
  • Include suggestions in errors (“Try X instead”)
  • Implement retry limit (3 max)
  • Add different error types for different failures

Chapter: 8 (Tool Use)


Symptom: Tool succeeds but model ignores result

Likely Causes:

  1. Output format unclear/unparseable
  2. No delimiters marking output boundaries
  3. Output too long (truncated without indicator)
  4. Output doesn’t answer what model was looking for

Quick Fixes:

  • Add clear delimiters (=== START === / === END ===)
  • Truncate with indicator (“…truncated, 5000 more chars”)
  • Structure output with clear sections
  • Include summary at top of long outputs

Chapter: 8 (Tool Use)


Symptom: Destructive action executed without authorization

Likely Causes:

  1. No action gating implemented
  2. Risk levels not properly classified
  3. Confirmation flow bypassed
  4. Action not recognized as destructive

Quick Fixes:

  • Implement action gate (Pattern B.10.4)
  • Classify all write/delete/execute as HIGH or CRITICAL
  • Require explicit confirmation for destructive actions
  • Log all destructive actions for audit

Chapter: 8 (Tool Use), 14 (Security)


5. Multi-Agent Issues

Symptom: Agents contradict each other

Likely Causes:

  1. Agents have different context/information
  2. No handoff validation
  3. No shared ground truth
  4. Orchestrator didn’t synthesize properly

Quick Fixes:

  • Log what context each agent received
  • Validate outputs at handoff boundaries
  • Include source attribution in agent outputs
  • Check orchestrator synthesis logic

Chapter: 10 (Multi-Agent Systems)


Symptom: System hangs (never completes)

Likely Causes:

  1. Circular dependency in task graph
  2. Agent stuck waiting for response
  3. No timeout implemented
  4. Infinite tool loop

Quick Fixes:

  • Check dependency graph for cycles
  • Add timeout per agent (30s default)
  • Implement circuit breaker (Pattern B.7.5)
  • Add max iterations to agent loops

Chapter: 10 (Multi-Agent Systems)


Symptom: Wrong agent selected for task

Likely Causes:

  1. Task classification ambiguous
  2. Classifier prompt unclear
  3. Overlapping agent capabilities
  4. Always defaulting to one agent

Quick Fixes:

  • Review classification examples
  • Add clearer criteria to classifier prompt
  • Sharpen agent role definitions
  • Log classification decisions for analysis

Chapter: 10 (Multi-Agent Systems)


Symptom: Context lost between agent handoffs

Likely Causes:

  1. Handoff not including necessary information
  2. Output schema missing fields
  3. Receiving agent expects different format
  4. Summarization losing details

Quick Fixes:

  • Define typed output schemas (Pattern B.7.3)
  • Validate outputs match schema before handoff
  • Log full handoff data for debugging
  • Include “context for next agent” in output

Chapter: 10 (Multi-Agent Systems)


6. Production Issues

Symptom: Works in development, fails in production

Likely Causes:

  1. Production inputs more varied/messy
  2. Concurrent load not tested
  3. Context accumulates in long sessions
  4. External dependencies behave differently

Quick Fixes:

  • Compare dev inputs vs. prod inputs (log samples)
  • Load test before deploying
  • Monitor context size in prod sessions
  • Mock external dependencies consistently

Chapter: 11 (Production)


Symptom: Costs much higher than projected

Likely Causes:

  1. Memory/history bloating token usage
  2. Retrieving too many documents
  3. Retry storms on failures
  4. Output verbosity not controlled

Quick Fixes:

  • Audit token usage by component
  • Check for retry loops
  • Reduce retrieval count
  • Add max_tokens to all calls

Chapter: 11 (Production)


Symptom: Latency spikes under load

Likely Causes:

  1. No rate limiting (overloading API)
  2. Synchronous calls that should be parallel
  3. Large context causing slow inference
  4. Database queries not optimized

Quick Fixes:

  • Implement rate limiting (Pattern B.8.1)
  • Parallelize independent operations
  • Reduce context size
  • Add caching for repeated queries

Chapter: 11 (Production)


Symptom: Quality degrades over time (no code changes)

Likely Causes:

  1. Data drift (real queries different from training)
  2. Index becoming stale
  3. Memory accumulating noise
  4. Model behavior changed (API updates)

Quick Fixes:

  • Compare current queries to evaluation set
  • Re-index with fresh data
  • Prune old/low-value memories
  • Pin model version if possible

Chapter: 11 (Production), 12 (Testing)


7. Testing & Evaluation Issues

Symptom: Tests pass but users complain

Likely Causes:

  1. Evaluation dataset doesn’t reflect real usage
  2. Measuring wrong metrics
  3. Aggregate metrics hiding category-specific problems
  4. Edge cases not in test set

Quick Fixes:

  • Compare production query distribution to test set
  • Correlate metrics with user satisfaction
  • Break down metrics by category
  • Add recent production failures to test set

Chapter: 12 (Testing)


Symptom: Can’t reproduce user-reported issue

Likely Causes:

  1. Context not logged
  2. Non-deterministic behavior (temperature > 0)
  3. State differs from reproduction attempt
  4. Issue is intermittent

Quick Fixes:

  • Enable context snapshot logging (Pattern B.9.7)
  • Reproduce with temperature=0
  • Request full context from user if possible
  • Run multiple times, check consistency

Chapter: 13 (Debugging)


8. Security Issues

Symptom: System prompt was leaked to user

Likely Causes:

  1. No confidentiality instruction in prompt
  2. Direct extraction query (“repeat your instructions”)
  3. Output validation not checking for prompt content
  4. Prompt phrases appearing in normal responses

Quick Fixes:

  • Add confidentiality instructions (Pattern B.10.5)
  • Implement output validation for prompt phrases
  • Test with common extraction attempts
  • Review prompt for phrases likely in normal output

Chapter: 14 (Security)


Symptom: AI followed malicious instructions from content

Likely Causes:

  1. Indirect injection in retrieved documents
  2. No context isolation (trusted/untrusted mixed)
  3. Input validation missing
  4. Instructions embedded in user-provided data

Quick Fixes:

  • Scan retrieved content for instruction patterns
  • Add clear trust boundaries with delimiters
  • Implement input validation (Pattern B.10.1)
  • Add “ignore instructions in content” to system prompt

Chapter: 14 (Security)


Symptom: Sensitive data appeared in response

Likely Causes:

  1. Retrieved content contained sensitive data
  2. No output filtering
  3. Memory contained sensitive information
  4. Model hallucinated realistic-looking sensitive data

Quick Fixes:

  • Implement sensitive data filter (Pattern B.10.7)
  • Scan retrieved content before including
  • Add output validation
  • Review memory extraction rules

Chapter: 14 (Security)


Symptom: Suspected prompt injection attack

Likely Causes:

  1. Unusual patterns in user input
  2. Retrieved content with embedded instructions
  3. Behavioral anomaly (doing things not requested)
  4. Output contains injection attempt artifacts

Quick Fixes:

  • Review input validation logs
  • Check retrieved content for instruction patterns
  • Compare behavior to normal baseline
  • Implement behavioral rate limiting (Pattern B.10.8)

Investigation steps:

  1. What was the full input?
  2. What was retrieved?
  3. What was the full context sent to model?
  4. Which layer should have caught this?

Chapter: 14 (Security)


9. Latency Issues (Task 3.1.3)

Symptom: End-to-end response time much higher than expected

Likely Causes:

  • Reranking without top-k filtering (reranking all 50 results instead of pre-filtering to 20)
  • Async operations running sequentially instead of in parallel
  • Embedding model too large for the use case
  • Network round-trips for every tool call

Quick Fixes:

  • Profile each pipeline stage separately
  • Batch embedding calls
  • Parallelize independent operations
  • Add caching for repeated queries

Chapter: 11


Symptom: Latency spikes on specific queries

Likely Causes:

  • Large context triggering slow inference
  • Specific query patterns causing excessive tool calls
  • One retrieval source significantly slower than others
  • Reranking triggered unnecessarily

Quick Fixes:

  • Log per-stage latency on each request
  • Set timeouts per stage
  • Add conditional reranking (only when scores are close)
  • Implement query-level caching

Chapter: 11, 13


Symptom: Latency increases over time within a session

Likely Causes:

  • Conversation history growing unbounded
  • Memory retrieval scanning more entries each turn
  • No compression triggers firing
  • Context approaching model limit

Quick Fixes:

  • Check context token count trend
  • Verify compression thresholds are working
  • Add hard token limits per component
  • Implement sliding window

Chapter: 5, 11


10. Memory System Issues (Task 3.1.4)

Symptom: Memories contradict each other

Likely Causes:

  • No contradiction detection on storage
  • Old memories with higher importance not superseded
  • Multiple extraction passes creating duplicates with different timestamps
  • Semantic similarity threshold too low

Quick Fixes:

  • Implement contradiction check (Pattern B.6.5)
  • Add dedup on content hash
  • Log all memory writes for audit
  • Lower similarity threshold for contradiction matching

Chapter: 9


Symptom: Retrieval returns irrelevant memories

Likely Causes:

  • Embedding model doesn’t capture your domain well
  • Recency scoring dominating relevance
  • Importance scores all similar (no differentiation)
  • Too many memories in store (noise overwhelms signal)

Quick Fixes:

  • Test embedding similarity manually
  • Tune hybrid scoring weights
  • Prune low-value memories
  • Try domain-specific embedding model

Chapter: 9


Symptom: Important memories not retrieved despite existing in store

Likely Causes:

  • Query phrasing doesn’t match memory embedding
  • Importance score too low
  • Memory was pruned
  • Memory type filter excluding it

Quick Fixes:

  • Search memories directly to confirm existence
  • Check retrieval scoring
  • Verify pruning isn’t removing valuable memories
  • Widen type filter

Chapter: 9


11. Debugging Without Observability (Task 3.1.5)

Many teams don’t have structured logging, metrics, or tracing infrastructure when they start building context-aware systems. This is completely normal. This section provides a practical path forward without waiting for a full observability platform.

Starting from Zero

You can begin debugging immediately with minimal tooling:

  • Add print/log statements at every pipeline stage boundary. When a request enters retrieval, log it. When retrieval completes, log the results. Same for embedding, reranking, generation—every major stage.
  • Save the full context (messages array) to a JSON file on every request. Include the exact state passed to the model, with timestamps. This becomes your audit trail.
  • Compare working vs failing requests by diffing saved contexts. When something breaks, find a successful request nearby and compare what’s different in the messages array, token counts, or memory state.
  • Use timestamp differences to profile latency. Measure wall-clock time between stage boundaries. This tells you where time is being spent without instrumenting every function call.

The Minimal Observability Kit

Start by adding five things to every request:

  1. Log the full messages array sent to the model (redact sensitive data like PII)
  2. Log token counts per component (system prompt, user message, RAG context, memory, conversation history)
  3. Log the model response and finish_reason (did it complete or hit length limits?)
  4. Log wall-clock time per stage (embedding, retrieval, reranking, generation in milliseconds)
  5. Save failures to a file for later analysis (include input, full context, error, and timestamp)

Here’s a minimal Python logging class to get started:

import json
import time
from datetime import datetime
from typing import Any, Dict, List

class SimpleRequestLogger:
    def __init__(self, log_file: str = "requests.jsonl"):
        self.log_file = log_file

    def log_request(
        self,
        request_id: str,
        query: str,
        messages: List[Dict[str, str]],
        stage: str = "start"
    ):
        """Log request at a pipeline stage."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "query": query,
            "message_count": len(messages),
            "token_estimate": sum(len(m.get("content", "").split()) for m in messages),
        }
        self._write(entry)

    def log_stage_latency(
        self,
        request_id: str,
        stage: str,
        latency_ms: float,
        metadata: Dict[str, Any] = None
    ):
        """Log latency for a specific stage."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "latency_ms": latency_ms,
            "metadata": metadata or {},
        }
        self._write(entry)

    def log_failure(
        self,
        request_id: str,
        query: str,
        messages: List[Dict[str, str]],
        error: str,
        stage: str
    ):
        """Log a failed request with full context."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "query": query,
            "messages": messages,
            "error": str(error),
        }
        self._write(entry)

    def _write(self, entry: Dict[str, Any]):
        """Write entry as JSON line."""
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

Usage in your pipeline:

logger = SimpleRequestLogger("debug_requests.jsonl")
request_id = str(uuid.uuid4())

logger.log_request(request_id, user_query, initial_messages, stage="start")

start = time.time()
retrieval_results = retrieve(user_query)
logger.log_stage_latency(request_id, "retrieval", (time.time() - start) * 1000)

try:
    response = model.generate(messages)
    logger.log_request(request_id, user_query, messages, stage="generation_complete")
except Exception as e:
    logger.log_failure(request_id, user_query, messages, str(e), stage="generation")

Upgrading Incrementally

As your system matures, upgrade your observability in stages:

  1. Phase 1: Structured logging - Replace print statements with a logging module (Python’s logging or similar). Add structured fields: request_id, stage, latency_ms, token_count. Write to files or a log aggregator. You’re already here if you’ve implemented SimpleRequestLogger.

  2. Phase 2: Metrics and dashboards - Count events per stage, measure p50/p95/p99 latency, track error rates. Tools like Prometheus + Grafana, DataDog, or CloudWatch make this easy. Focus on the five metrics above.

  3. Phase 3: Distributed tracing - Use OpenTelemetry to connect your logs, metrics, and traces. Trace latency through asynchronous operations, across service boundaries, and into external APIs (LLM calls, retrieval services). Chapter 13 covers observability in depth.

Don’t wait for Phase 3 to start debugging. Phases 1 and 2 will solve most issues you encounter. Once you understand your system’s behavior, structured traces in Phase 3 become a precision tool rather than a necessity.


General Debugging Process

When nothing above matches, follow this process:

Step 1: Reproduce

  • Can you reproduce the issue?
  • What’s the minimal input that triggers it?
  • Is it deterministic or intermittent?

Step 2: Isolate

  • Which component is failing? (retrieval, generation, tools, etc.)
  • Test each component independently
  • What’s different between working and failing cases?

Step 3: Observe

  • What’s actually in the context? (log it)
  • What’s the model actually outputting? (full response)
  • What do the metrics show?

Step 4: Hypothesize

  • What’s the most likely cause?
  • What evidence would confirm or refute it?

Step 5: Test

  • Change one variable at a time
  • Measure before and after
  • Did the change help?

Step 6: Fix

  • Implement minimal fix
  • Add test case for this failure
  • Monitor for recurrence

Emergency Response

System is down

  1. Check API status (provider outage?)
  2. Check rate limits (quota exceeded?)
  3. Check error logs (what’s failing?)
  4. Implement fallback if available

Costs spiking

  1. Implement emergency rate limit
  2. Check for retry storms
  3. Review recent deployments
  4. Reduce context/retrieval temporarily

Quality collapsed

  1. Check for recent changes (rollback?)
  2. Compare to baseline metrics
  3. Sample recent queries (what’s different?)
  4. Check external dependencies (API changes?)

Security incident

  1. Disable affected endpoint
  2. Preserve logs for investigation
  3. Identify attack vector
  4. Patch and monitor

Real-World Debugging Stories

These mini case studies illustrate how debugging principles apply in practice.

Case Study: The Friday Afternoon RAG Failure

Situation: A product Q&A system started returning wrong answers every Friday afternoon. Quality metrics showed a 40% accuracy drop between 2-5 PM on Fridays.

Investigation: The team checked model changes (none), prompt changes (none), and infrastructure (stable). Then they looked at the data pipeline: marketing published weekly blog posts every Friday at 1 PM, triggering a re-indexing job that temporarily corrupted the vector index during the 2-3 hour rebuild.

Root Cause: The ingestion pipeline didn’t use atomic index swaps—it updated the live index in place, meaning queries during re-indexing hit a partially-built index with incomplete embeddings.

Fix: Implemented blue-green indexing: build the new index alongside the old one, swap atomically when complete. Added a retrieval quality check that compared scores against a baseline before and after indexing.

Lesson: When problems correlate with time, look at scheduled jobs. Always index atomically.

Patterns used: B.4.8 RAG Stage Isolation, B.9.3 Regression Detection


Case Study: The Helpful but Wrong Memory

Situation: A coding assistant kept suggesting deprecated API patterns to a user, even though the user had corrected it multiple times. The user would say “don’t use the old API,” the system would acknowledge it, but the next session it reverted.

Investigation: Memory extraction was working—the correction was stored. Memory retrieval was working—the correction was retrieved. But so were 15 older memories about the same API, all referencing the old pattern. The hybrid scoring (0.5 relevance + 0.3 recency + 0.2 importance) gave the older memories a collective advantage because they were more numerous and highly relevant to API questions.

Root Cause: Contradiction detection only compared pairs of memories, not clusters. The single “don’t use old API” memory was superseding one old memory, but 14 others remained with high relevance scores.

Fix: Implemented cluster-based contradiction detection: when a new memory contradicts one memory in a cluster, check all semantically similar memories and mark the entire cluster as superseded. Also boosted importance scores for explicit user corrections.

Lesson: Memory systems need cluster-aware contradiction handling, not just pairwise comparison.

Patterns used: B.6.5 Contradiction Detection, B.6.4 Memory Pruning


AI System On-Call Runbook

This runbook template is designed for AI systems built with context engineering. Adapt it to your specific architecture. Print it. Keep it where on-call engineers can find it at 3 AM.

Quick Reference: Incident Classification

CategorySymptomsFirst Response
Model-sideProvider outage, model update, rate limiting, unexpected responsesCheck provider status page, try backup model
Context-sideBad retrieval, assembly failure, wrong contextCheck retrieval metrics, review recent context/data changes
Data-sideCorrupted embeddings, stale knowledge base, bad chunkingCheck data freshness, verify embedding integrity, review recent pipeline runs
InfrastructureNetwork, database, cache failuresCheck service health dashboards, verify connectivity
SecurityPrompt injection, data exfiltration, unusual patternsCheck for suspicious input patterns, enable enhanced logging
QualityGradual degradation, user complaints, low scoresCheck quality metrics trends, compare to baseline, review recent changes

Step-by-Step: When an Alert Fires

Step 1: Acknowledge and Assess (5 minutes)

□ Acknowledge the alert in your incident management system
□ Open the dashboard linked in the alert
□ Answer these questions:
  - How many users are affected? (check error rate + quality metrics)
  - Is it getting worse, stable, or recovering?
  - Is there a pattern? (specific query types, user segments, time of day)
  - When did it start? (check metric timeline)

Step 2: Quick Checks (10 minutes)

□ Provider status pages (OpenAI, Anthropic, etc.)
□ Recent deployments (anything in the last 12 hours?)
□ Recent data pipeline runs (knowledge base refreshes, embedding updates)
□ Infrastructure health (database, cache, vector DB, network)
□ Recent configuration changes (model versions, temperature, prompts)

Step 3: Classify the Incident

Based on your quick checks, classify using the table above. This determines your investigation path.

Incident Response Decision Tree

Step 4: Mitigate (Before Root Cause)

For model-side issues:

# Switch to backup model
config.model = config.backup_model
# Or: disable complex features, use simpler mode
config.use_rag = False
config.use_multi_agent = False

For context-side issues:

# Enable cached responses for repeated queries
config.cache_mode = "aggressive"
# Or: reduce context size to avoid assembly issues
config.max_context_tokens = config.safe_minimum

For data-side issues:

# Rollback to last known good version
knowledge_base.rollback(category, version=last_good_version)

For rate limiting / cost spikes:

# Reduce traffic
rate_limiter.set_limit(config.emergency_limit)
# Queue non-urgent requests
config.queue_mode = True

Step 5: Investigate

Pull sample requests:

# Get failing requests
failing = query_logs(
    "quality_score < 0.5 OR error = true",
    time_range="last_1_hour",
    limit=20
)

# Compare to successful requests from same period
passing = query_logs(
    "quality_score > 0.7",
    time_range="last_1_hour",
    limit=20
)

# Look for differences
compare_request_characteristics(failing, passing)

Check retrieval health:

# Compare retrieval scores before and after incident start
before = get_retrieval_scores(time_range="2_hours_before_incident")
during = get_retrieval_scores(time_range="since_incident_start")
print(f"Before: avg={mean(before):.2f}, During: avg={mean(during):.2f}")

Check for data changes:

# List recent data pipeline events
events = query_system_logs(
    "service IN ('knowledge_base', 'embeddings', 'data_pipeline')",
    time_range="last_6_hours"
)

Step 6: Fix and Verify

□ Implement fix (in staging first if possible)
□ Run evaluation suite against fix
□ Check for regressions in unaffected areas
□ Gradual rollout with monitoring
□ Confirm metrics return to baseline
□ Declare incident resolved

Step 7: Post-Incident

□ Gather data within 24-48 hours
□ Write post-mortem (use template below)
□ Schedule team review
□ Track action items to completion
□ Update THIS RUNBOOK with anything you learned

Post-Mortem Template

# Post-Mortem: [Descriptive Title]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes (start - end UTC)
- **Impact**: What users experienced and how many were affected
- **Detection**: How was it detected? (automated alert / user report / team noticed)
- **Severity**: Critical / High / Medium / Low

## Timeline
- HH:MM - Event that preceded the incident
- HH:MM - Incident started (or first detection)
- HH:MM - Alert fired / team notified
- HH:MM - Investigation began
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full resolution confirmed

## Root Cause
[Detailed technical explanation of what went wrong and why]

## What Went Well
- [Detection speed, response process, mitigation effectiveness]

## What Went Poorly
- [Detection gaps, investigation bottlenecks, missing tooling]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, actionable item] | [Name] | [Date] | Open |

## Lessons Learned
1. [Key insight that applies beyond this specific incident]

Common Failure Patterns Quick Reference

PatternKey DiagnosticQuick Fix
Context RotCheck context length, info positionMove critical info to start/end
Retrieval MissCheck retrieval scores, top-k resultsIncrease top-k, add hybrid search
HallucinationSearch context for model’s claimsStrengthen grounding instructions
Tool Call FailureCheck tool definitions, selection logsClarify tool descriptions
Cascade FailureTrace error to originating agentAdd validation at handoff points
Prompt InjectionCheck inputs for instruction-like contentInput sanitization, clear delimiters

Useful Queries

Find requests with low quality in a time range:

quality_score < 0.5 AND timestamp > "2026-01-15T02:00:00Z"

Find requests that used a specific prompt version:

prompt_version = "v2.3.1" AND status = "error"

Find requests where retrieval was slow:

retrieval_latency_ms > 5000

Find requests where context was near limit:

context_tokens > 0.9 * context_limit

Find requests where model response was truncated:

finish_reason = "length"


Appendix Cross-References

This SectionRelated AppendixConnection
Quick Reference (tokens)Appendix D: D.1 Token EstimationDetailed token math
RAG & Retrieval IssuesAppendix A: A.2-A.3 Databases & EmbeddingsTool-specific debugging
Production Issues (costs)Appendix D: D.6 Worked ExamplesCost calculations
Security IssuesAppendix B: B.10 Security patternsSolutions to apply
On-Call RunbookAppendix B: B.8 Production patternsMitigation patterns

When in doubt: log everything, change one thing at a time, measure before and after. Review this runbook after every post-mortem—stale runbooks are worse than no runbooks.