Appendix C: Debugging Cheat Sheet

Appendix C, v2.1 — Early 2026

This is the appendix you open when something’s broken and you need answers fast. Find your symptom, check the likely causes in order, try the quick fixes.

For explanations, see the referenced chapters. For reusable solutions, see Appendix B (Pattern Library).

Quick Reference

Token Estimates

Content	Tokens
1 character	~0.25 tokens
1 word	~1.3 tokens
1 page (500 words)	~650 tokens
1 code function	~100-500 tokens

Effective Context Limits

Model Limit	Target Max	Danger Zone
8K	5.6K	6.4K+
32K	22K	26K+
128K	90K	102K+
200K	140K	160K+

Quality degrades around 32K tokens regardless of model limit.

Latency Benchmarks

Operation	Typical	Slow
Embedding	10-50ms	100ms+
Vector search	20-100ms	200ms+
Reranking	100-250ms	500ms+
LLM first token	200-500ms	1000ms+
LLM full response	1-5s	8s+

1. Context & Memory Issues

Symptom: AI ignores information that’s clearly in the context

Likely Causes:

Information is in the “lost middle” (40-60% position)
Context too long—attention diluted
More recent/prominent information contradicts it
Information format doesn’t match query pattern

Quick Fixes:

Move critical information to first or last 20% of context
Reduce total context size
Repeat key information near the end
Rephrase information to match likely query patterns

Chapter: 2 (Context Window)

Symptom: AI forgot what was said earlier in conversation

Likely Causes:

Message was truncated by sliding window
Message was summarized and detail was lost
Token limit reached, oldest messages dropped
Summarization prompt lost key details

Quick Fixes:

Check current token count vs. limit
Review what’s actually in the conversation history
Look at summarization output for missing details
Increase history budget or reduce other components

Chapter: 5 (Conversation History)

Symptom: AI contradicts its earlier statements

Likely Causes:

Original statement no longer in context (truncated)
Original statement in lost middle
Summarization didn’t preserve the decision
Later message implicitly contradicted it

Quick Fixes:

Check if original statement still exists in context
Check position of original statement
Add explicit decision tracking (Pattern B.3.4)
Include “established decisions” section in context

Chapter: 5 (Conversation History)

Symptom: Memory grows unbounded until failure

Likely Causes:

No pruning strategy implemented
Pruning thresholds too high
Memory extraction creating duplicates
Contradiction detection not superseding old memories

Quick Fixes:

Implement hard memory limits
Add tiered pruning (Pattern B.6.4)
Deduplicate on storage
Check supersession logic

Chapter: 9 (Memory and Persistence)

Symptom: Old preferences override new ones

Likely Causes:

No contradiction detection
Old memory has higher importance score
Old memory retrieved because more similar to query
New preference not extracted as memory

Quick Fixes:

Implement contradiction detection (Pattern B.6.5)
Check importance scoring logic
Verify new preferences are being extracted
Add recency boost to retrieval scoring

Chapter: 9 (Memory and Persistence)

2. RAG & Retrieval Issues

Symptom: Retrieval returns completely unrelated content

Likely Causes:

Embedding model mismatch (different models for index vs. query)
Chunking destroyed semantic units
Query vocabulary doesn’t match document vocabulary
Index corrupted or wrong collection queried

Quick Fixes:

Verify same embedding model for indexing and query
Check chunks contain coherent content (not mid-sentence)
Try hybrid search to catch keyword matches
Verify querying correct index/collection

Chapter: 6 (RAG Fundamentals)

Symptom: Correct content exists but isn’t retrieved

Likely Causes:

Content was chunked poorly (split across chunks)
Top-K too small
Embedding doesn’t capture the semantic relationship
Metadata filter excluding it

Quick Fixes:

Search chunks directly for expected content
Increase top-K (try 20-50)
Try different query phrasings
Check metadata filters aren’t over-restrictive

Chapter: 6 (RAG Fundamentals)

Symptom: Good retrieval but answer is wrong/hallucinated

Likely Causes:

Too much context (lost in middle problem)
Conflicting information in retrieved docs
Prompt doesn’t instruct grounding
Model confident in training knowledge over context

Quick Fixes:

Reduce number of retrieved documents
Add explicit “only use provided context” instruction
Check for contradictions in retrieved content
Add “if not in context, say so” instruction

Chapter: 6 (RAG Fundamentals)

Symptom: Answer ignores retrieved context entirely

Likely Causes:

Context not clearly delimited
System prompt doesn’t emphasize using context
Query answerable from model’s training (bypasses retrieval)
Retrieved content formatted poorly

Quick Fixes:

Add clear delimiters around retrieved content
Strengthen grounding instructions in system prompt
Add “base your answer on the following context” framing
Format retrieved content with clear source labels

Chapter: 6 (RAG Fundamentals)

Symptom: Reranking made results worse

Likely Causes:

Reranker trained on different domain
Reranking all results instead of just close scores
Cross-encoder input too long (truncated)
Original ranking was already good

Quick Fixes:

Test with and without reranking, measure both
Only rerank when top scores are close (within 0.15)
Ensure chunks fit reranker’s max length
Try different reranker model

Chapter: 7 (Advanced Retrieval)

Symptom: Query expansion added noise, not coverage

Likely Causes:

Too many variants generated
Variants drifted from original meaning
Merge strategy weights variants too highly
Original query was already specific

Quick Fixes:

Reduce to 2-3 variants
Add “keep the same meaning” to expansion prompt
Weight original query higher in merge
Skip expansion for specific/technical queries

Chapter: 7 (Advanced Retrieval)

3. System Prompt Issues

Symptom: AI ignores parts of system prompt

Likely Causes:

Conflicting instructions (model picks one)
Instruction buried in middle of long prompt
Too many instructions (attention diluted)
Instruction is ambiguous

Quick Fixes:

Audit for conflicting instructions
Move critical instructions to beginning or end
Reduce total prompt length (<2000 tokens ideal)
Make instructions specific and unambiguous

Chapter: 4 (System Prompts)

Symptom: Output format not followed

Likely Causes:

No example provided
Format specification conflicts with content needs
Schema too complex
Format instruction buried in prompt

Quick Fixes:

Add concrete example of desired output
Simplify schema (flatten nested structures)
Put format specification at end of prompt
Use structured output mode if available

Chapter: 4 (System Prompts)

Symptom: AI does things explicitly forbidden

Likely Causes:

Constraint not prominent enough
User input overrides constraint
Constraint conflicts with other instructions
Constraint phrasing is ambiguous

Quick Fixes:

Move constraints to end of prompt (high attention)
Phrase as explicit “NEVER do X” rather than “avoid X”
Add constraint reminder after user input section
Check for instructions that might override constraint

Chapter: 4 (System Prompts)

Symptom: Behavior inconsistent across similar queries

Likely Causes:

Instructions have edge cases not covered
Temperature too high
Ambiguous phrasing interpreted differently
Context differences between queries

Quick Fixes:

Reduce temperature (try 0.3 or lower)
Add explicit handling for edge cases
Rephrase ambiguous instructions
Log full context for inconsistent cases, compare

Chapter: 4 (System Prompts)

4. Tool Use Issues

Symptom: Model calls wrong tool

Likely Causes:

Tool descriptions overlap or are ambiguous
Tool names unfamiliar (not matching training patterns)
Too many tools (decision fatigue)
Description doesn’t include “when NOT to use”

Quick Fixes:

Add “Use for:” and “Do NOT use for:” to descriptions
Use familiar names (read_file not fetch_document)
Reduce tool count or group by task
Add examples to descriptions

Chapter: 8 (Tool Use)

Symptom: Model passes invalid parameters

Likely Causes:

Parameter types not specified in schema
Constraints not documented
No examples in description
Parameter names ambiguous

Quick Fixes:

Add explicit types to all parameters
Document constraints (min, max, allowed values)
Add example calls to tool description
Use clear, unambiguous parameter names

Chapter: 8 (Tool Use)

Symptom: Tool errors, model keeps retrying same call

Likely Causes:

Error message doesn’t explain what went wrong
No alternative suggested in error
Model doesn’t understand the error
No retry limit implemented

Quick Fixes:

Return actionable error messages
Include suggestions in errors (“Try X instead”)
Implement retry limit (3 max)
Add different error types for different failures

Chapter: 8 (Tool Use)

Symptom: Tool succeeds but model ignores result

Likely Causes:

Output format unclear/unparseable
No delimiters marking output boundaries
Output too long (truncated without indicator)
Output doesn’t answer what model was looking for

Quick Fixes:

Add clear delimiters (=== START === / === END ===)
Truncate with indicator (“…truncated, 5000 more chars”)
Structure output with clear sections
Include summary at top of long outputs

Chapter: 8 (Tool Use)

Symptom: Destructive action executed without authorization

Likely Causes:

No action gating implemented
Risk levels not properly classified
Confirmation flow bypassed
Action not recognized as destructive

Quick Fixes:

Implement action gate (Pattern B.10.4)
Classify all write/delete/execute as HIGH or CRITICAL
Require explicit confirmation for destructive actions
Log all destructive actions for audit

Chapter: 8 (Tool Use), 14 (Security)

5. Multi-Agent Issues

Symptom: Agents contradict each other

Likely Causes:

Agents have different context/information
No handoff validation
No shared ground truth
Orchestrator didn’t synthesize properly

Quick Fixes:

Log what context each agent received
Validate outputs at handoff boundaries
Include source attribution in agent outputs
Check orchestrator synthesis logic

Chapter: 10 (Multi-Agent Systems)

Symptom: System hangs (never completes)

Likely Causes:

Circular dependency in task graph
Agent stuck waiting for response
No timeout implemented
Infinite tool loop

Quick Fixes:

Check dependency graph for cycles
Add timeout per agent (30s default)
Implement circuit breaker (Pattern B.7.5)
Add max iterations to agent loops

Chapter: 10 (Multi-Agent Systems)

Symptom: Wrong agent selected for task

Likely Causes:

Task classification ambiguous
Classifier prompt unclear
Overlapping agent capabilities
Always defaulting to one agent

Quick Fixes:

Review classification examples
Add clearer criteria to classifier prompt
Sharpen agent role definitions
Log classification decisions for analysis

Chapter: 10 (Multi-Agent Systems)

Symptom: Context lost between agent handoffs

Likely Causes:

Handoff not including necessary information
Output schema missing fields
Receiving agent expects different format
Summarization losing details

Quick Fixes:

Define typed output schemas (Pattern B.7.3)
Validate outputs match schema before handoff
Log full handoff data for debugging
Include “context for next agent” in output

Chapter: 10 (Multi-Agent Systems)

6. Production Issues

Symptom: Works in development, fails in production

Likely Causes:

Production inputs more varied/messy
Concurrent load not tested
Context accumulates in long sessions
External dependencies behave differently

Quick Fixes:

Compare dev inputs vs. prod inputs (log samples)
Load test before deploying
Monitor context size in prod sessions
Mock external dependencies consistently

Chapter: 11 (Production)

Symptom: Costs much higher than projected

Likely Causes:

Memory/history bloating token usage
Retrieving too many documents
Retry storms on failures
Output verbosity not controlled

Quick Fixes:

Audit token usage by component
Check for retry loops
Reduce retrieval count
Add max_tokens to all calls

Chapter: 11 (Production)

Symptom: Latency spikes under load

Likely Causes:

No rate limiting (overloading API)
Synchronous calls that should be parallel
Large context causing slow inference
Database queries not optimized

Quick Fixes:

Implement rate limiting (Pattern B.8.1)
Parallelize independent operations
Reduce context size
Add caching for repeated queries

Chapter: 11 (Production)

Symptom: Quality degrades over time (no code changes)

Likely Causes:

Data drift (real queries different from training)
Index becoming stale
Memory accumulating noise
Model behavior changed (API updates)

Quick Fixes:

Compare current queries to evaluation set
Re-index with fresh data
Prune old/low-value memories
Pin model version if possible

Chapter: 11 (Production), 12 (Testing)

7. Testing & Evaluation Issues

Symptom: Tests pass but users complain

Likely Causes:

Evaluation dataset doesn’t reflect real usage
Measuring wrong metrics
Aggregate metrics hiding category-specific problems
Edge cases not in test set

Quick Fixes:

Compare production query distribution to test set
Correlate metrics with user satisfaction
Break down metrics by category
Add recent production failures to test set

Chapter: 12 (Testing)

Symptom: Can’t reproduce user-reported issue

Likely Causes:

Context not logged
Non-deterministic behavior (temperature > 0)
State differs from reproduction attempt
Issue is intermittent

Quick Fixes:

Enable context snapshot logging (Pattern B.9.7)
Reproduce with temperature=0
Request full context from user if possible
Run multiple times, check consistency

Chapter: 13 (Debugging)

8. Security Issues

Symptom: System prompt was leaked to user

Likely Causes:

No confidentiality instruction in prompt
Direct extraction query (“repeat your instructions”)
Output validation not checking for prompt content
Prompt phrases appearing in normal responses

Quick Fixes:

Add confidentiality instructions (Pattern B.10.5)
Implement output validation for prompt phrases
Test with common extraction attempts
Review prompt for phrases likely in normal output

Chapter: 14 (Security)

Symptom: AI followed malicious instructions from content

Likely Causes:

Indirect injection in retrieved documents
No context isolation (trusted/untrusted mixed)
Input validation missing
Instructions embedded in user-provided data

Quick Fixes:

Scan retrieved content for instruction patterns
Add clear trust boundaries with delimiters
Implement input validation (Pattern B.10.1)
Add “ignore instructions in content” to system prompt

Chapter: 14 (Security)

Symptom: Sensitive data appeared in response

Likely Causes:

Retrieved content contained sensitive data
No output filtering
Memory contained sensitive information
Model hallucinated realistic-looking sensitive data

Quick Fixes:

Implement sensitive data filter (Pattern B.10.7)
Scan retrieved content before including
Add output validation
Review memory extraction rules

Chapter: 14 (Security)

Symptom: Suspected prompt injection attack

Likely Causes:

Unusual patterns in user input
Retrieved content with embedded instructions
Behavioral anomaly (doing things not requested)
Output contains injection attempt artifacts

Quick Fixes:

Review input validation logs
Check retrieved content for instruction patterns
Compare behavior to normal baseline
Implement behavioral rate limiting (Pattern B.10.8)

Investigation steps:

What was the full input?
What was retrieved?
What was the full context sent to model?
Which layer should have caught this?

Chapter: 14 (Security)

9. Latency Issues (Task 3.1.3)

Symptom: End-to-end response time much higher than expected

Likely Causes:

Reranking without top-k filtering (reranking all 50 results instead of pre-filtering to 20)
Async operations running sequentially instead of in parallel
Embedding model too large for the use case
Network round-trips for every tool call

Quick Fixes:

Profile each pipeline stage separately
Batch embedding calls
Parallelize independent operations
Add caching for repeated queries

Chapter: 11

Symptom: Latency spikes on specific queries

Likely Causes:

Large context triggering slow inference
Specific query patterns causing excessive tool calls
One retrieval source significantly slower than others
Reranking triggered unnecessarily

Quick Fixes:

Log per-stage latency on each request
Set timeouts per stage
Add conditional reranking (only when scores are close)
Implement query-level caching

Chapter: 11, 13

Symptom: Latency increases over time within a session

Likely Causes:

Conversation history growing unbounded
Memory retrieval scanning more entries each turn
No compression triggers firing
Context approaching model limit

Quick Fixes:

Check context token count trend
Verify compression thresholds are working
Add hard token limits per component
Implement sliding window

Chapter: 5, 11

10. Memory System Issues (Task 3.1.4)

Symptom: Memories contradict each other

Likely Causes:

No contradiction detection on storage
Old memories with higher importance not superseded
Multiple extraction passes creating duplicates with different timestamps
Semantic similarity threshold too low

Quick Fixes:

Implement contradiction check (Pattern B.6.5)
Add dedup on content hash
Log all memory writes for audit
Lower similarity threshold for contradiction matching

Chapter: 9

Symptom: Retrieval returns irrelevant memories

Likely Causes:

Embedding model doesn’t capture your domain well
Recency scoring dominating relevance
Importance scores all similar (no differentiation)
Too many memories in store (noise overwhelms signal)

Quick Fixes:

Test embedding similarity manually
Tune hybrid scoring weights
Prune low-value memories
Try domain-specific embedding model

Chapter: 9

Symptom: Important memories not retrieved despite existing in store

Likely Causes:

Query phrasing doesn’t match memory embedding
Importance score too low
Memory was pruned
Memory type filter excluding it

Quick Fixes:

Search memories directly to confirm existence
Check retrieval scoring
Verify pruning isn’t removing valuable memories
Widen type filter

Chapter: 9

11. Debugging Without Observability (Task 3.1.5)

Many teams don’t have structured logging, metrics, or tracing infrastructure when they start building context-aware systems. This is completely normal. This section provides a practical path forward without waiting for a full observability platform.

Starting from Zero

You can begin debugging immediately with minimal tooling:

Add print/log statements at every pipeline stage boundary. When a request enters retrieval, log it. When retrieval completes, log the results. Same for embedding, reranking, generation—every major stage.
Save the full context (messages array) to a JSON file on every request. Include the exact state passed to the model, with timestamps. This becomes your audit trail.
Compare working vs failing requests by diffing saved contexts. When something breaks, find a successful request nearby and compare what’s different in the messages array, token counts, or memory state.
Use timestamp differences to profile latency. Measure wall-clock time between stage boundaries. This tells you where time is being spent without instrumenting every function call.

The Minimal Observability Kit

Start by adding five things to every request:

Log the full messages array sent to the model (redact sensitive data like PII)
Log token counts per component (system prompt, user message, RAG context, memory, conversation history)
Log the model response and finish_reason (did it complete or hit length limits?)
Log wall-clock time per stage (embedding, retrieval, reranking, generation in milliseconds)
Save failures to a file for later analysis (include input, full context, error, and timestamp)

Here’s a minimal Python logging class to get started:

import json
import time
from datetime import datetime
from typing import Any, Dict, List

class SimpleRequestLogger:
    def __init__(self, log_file: str = "requests.jsonl"):
        self.log_file = log_file

    def log_request(
        self,
        request_id: str,
        query: str,
        messages: List[Dict[str, str]],
        stage: str = "start"
    ):
        """Log request at a pipeline stage."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "query": query,
            "message_count": len(messages),
            "token_estimate": sum(len(m.get("content", "").split()) for m in messages),
        }
        self._write(entry)

    def log_stage_latency(
        self,
        request_id: str,
        stage: str,
        latency_ms: float,
        metadata: Dict[str, Any] = None
    ):
        """Log latency for a specific stage."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "latency_ms": latency_ms,
            "metadata": metadata or {},
        }
        self._write(entry)

    def log_failure(
        self,
        request_id: str,
        query: str,
        messages: List[Dict[str, str]],
        error: str,
        stage: str
    ):
        """Log a failed request with full context."""
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request_id": request_id,
            "stage": stage,
            "query": query,
            "messages": messages,
            "error": str(error),
        }
        self._write(entry)

    def _write(self, entry: Dict[str, Any]):
        """Write entry as JSON line."""
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

Usage in your pipeline:

logger = SimpleRequestLogger("debug_requests.jsonl")
request_id = str(uuid.uuid4())

logger.log_request(request_id, user_query, initial_messages, stage="start")

start = time.time()
retrieval_results = retrieve(user_query)
logger.log_stage_latency(request_id, "retrieval", (time.time() - start) * 1000)

try:
    response = model.generate(messages)
    logger.log_request(request_id, user_query, messages, stage="generation_complete")
except Exception as e:
    logger.log_failure(request_id, user_query, messages, str(e), stage="generation")

Upgrading Incrementally

As your system matures, upgrade your observability in stages:

Phase 1: Structured logging - Replace print statements with a logging module (Python’s logging or similar). Add structured fields: request_id, stage, latency_ms, token_count. Write to files or a log aggregator. You’re already here if you’ve implemented SimpleRequestLogger.
Phase 2: Metrics and dashboards - Count events per stage, measure p50/p95/p99 latency, track error rates. Tools like Prometheus + Grafana, DataDog, or CloudWatch make this easy. Focus on the five metrics above.
Phase 3: Distributed tracing - Use OpenTelemetry to connect your logs, metrics, and traces. Trace latency through asynchronous operations, across service boundaries, and into external APIs (LLM calls, retrieval services). Chapter 13 covers observability in depth.

Don’t wait for Phase 3 to start debugging. Phases 1 and 2 will solve most issues you encounter. Once you understand your system’s behavior, structured traces in Phase 3 become a precision tool rather than a necessity.

General Debugging Process

When nothing above matches, follow this process:

Step 1: Reproduce

Can you reproduce the issue?
What’s the minimal input that triggers it?
Is it deterministic or intermittent?

Step 2: Isolate

Which component is failing? (retrieval, generation, tools, etc.)
Test each component independently
What’s different between working and failing cases?

Step 3: Observe

What’s actually in the context? (log it)
What’s the model actually outputting? (full response)
What do the metrics show?

Step 4: Hypothesize

What’s the most likely cause?
What evidence would confirm or refute it?

Step 5: Test

Change one variable at a time
Measure before and after
Did the change help?

Step 6: Fix

Implement minimal fix
Add test case for this failure
Monitor for recurrence

Emergency Response

System is down

Check API status (provider outage?)
Check rate limits (quota exceeded?)
Check error logs (what’s failing?)
Implement fallback if available

Costs spiking

Implement emergency rate limit
Check for retry storms
Review recent deployments
Reduce context/retrieval temporarily

Quality collapsed

Check for recent changes (rollback?)
Compare to baseline metrics
Sample recent queries (what’s different?)
Check external dependencies (API changes?)

Security incident

Disable affected endpoint
Preserve logs for investigation
Identify attack vector
Patch and monitor

Real-World Debugging Stories

These mini case studies illustrate how debugging principles apply in practice.

Case Study: The Friday Afternoon RAG Failure

Situation: A product Q&A system started returning wrong answers every Friday afternoon. Quality metrics showed a 40% accuracy drop between 2-5 PM on Fridays.

Investigation: The team checked model changes (none), prompt changes (none), and infrastructure (stable). Then they looked at the data pipeline: marketing published weekly blog posts every Friday at 1 PM, triggering a re-indexing job that temporarily corrupted the vector index during the 2-3 hour rebuild.

Root Cause: The ingestion pipeline didn’t use atomic index swaps—it updated the live index in place, meaning queries during re-indexing hit a partially-built index with incomplete embeddings.

Fix: Implemented blue-green indexing: build the new index alongside the old one, swap atomically when complete. Added a retrieval quality check that compared scores against a baseline before and after indexing.

Lesson: When problems correlate with time, look at scheduled jobs. Always index atomically.

Patterns used: B.4.8 RAG Stage Isolation, B.9.3 Regression Detection

Case Study: The Helpful but Wrong Memory

Situation: A coding assistant kept suggesting deprecated API patterns to a user, even though the user had corrected it multiple times. The user would say “don’t use the old API,” the system would acknowledge it, but the next session it reverted.

Investigation: Memory extraction was working—the correction was stored. Memory retrieval was working—the correction was retrieved. But so were 15 older memories about the same API, all referencing the old pattern. The hybrid scoring (0.5 relevance + 0.3 recency + 0.2 importance) gave the older memories a collective advantage because they were more numerous and highly relevant to API questions.

Root Cause: Contradiction detection only compared pairs of memories, not clusters. The single “don’t use old API” memory was superseding one old memory, but 14 others remained with high relevance scores.

Fix: Implemented cluster-based contradiction detection: when a new memory contradicts one memory in a cluster, check all semantically similar memories and mark the entire cluster as superseded. Also boosted importance scores for explicit user corrections.

Lesson: Memory systems need cluster-aware contradiction handling, not just pairwise comparison.

Patterns used: B.6.5 Contradiction Detection, B.6.4 Memory Pruning

AI System On-Call Runbook

This runbook template is designed for AI systems built with context engineering. Adapt it to your specific architecture. Print it. Keep it where on-call engineers can find it at 3 AM.

Quick Reference: Incident Classification

Category	Symptoms	First Response
Model-side	Provider outage, model update, rate limiting, unexpected responses	Check provider status page, try backup model
Context-side	Bad retrieval, assembly failure, wrong context	Check retrieval metrics, review recent context/data changes
Data-side	Corrupted embeddings, stale knowledge base, bad chunking	Check data freshness, verify embedding integrity, review recent pipeline runs
Infrastructure	Network, database, cache failures	Check service health dashboards, verify connectivity
Security	Prompt injection, data exfiltration, unusual patterns	Check for suspicious input patterns, enable enhanced logging
Quality	Gradual degradation, user complaints, low scores	Check quality metrics trends, compare to baseline, review recent changes

Step-by-Step: When an Alert Fires

Step 1: Acknowledge and Assess (5 minutes)

□ Acknowledge the alert in your incident management system
□ Open the dashboard linked in the alert
□ Answer these questions:
  - How many users are affected? (check error rate + quality metrics)
  - Is it getting worse, stable, or recovering?
  - Is there a pattern? (specific query types, user segments, time of day)
  - When did it start? (check metric timeline)

Step 2: Quick Checks (10 minutes)

□ Provider status pages (OpenAI, Anthropic, etc.)
□ Recent deployments (anything in the last 12 hours?)
□ Recent data pipeline runs (knowledge base refreshes, embedding updates)
□ Infrastructure health (database, cache, vector DB, network)
□ Recent configuration changes (model versions, temperature, prompts)

Step 3: Classify the Incident

Based on your quick checks, classify using the table above. This determines your investigation path.

Incident Response Decision Tree

Step 4: Mitigate (Before Root Cause)

For model-side issues:

# Switch to backup model
config.model = config.backup_model
# Or: disable complex features, use simpler mode
config.use_rag = False
config.use_multi_agent = False

For context-side issues:

# Enable cached responses for repeated queries
config.cache_mode = "aggressive"
# Or: reduce context size to avoid assembly issues
config.max_context_tokens = config.safe_minimum

For data-side issues:

# Rollback to last known good version
knowledge_base.rollback(category, version=last_good_version)

For rate limiting / cost spikes:

# Reduce traffic
rate_limiter.set_limit(config.emergency_limit)
# Queue non-urgent requests
config.queue_mode = True

Step 5: Investigate

Pull sample requests:

# Get failing requests
failing = query_logs(
    "quality_score < 0.5 OR error = true",
    time_range="last_1_hour",
    limit=20
)

# Compare to successful requests from same period
passing = query_logs(
    "quality_score > 0.7",
    time_range="last_1_hour",
    limit=20
)

# Look for differences
compare_request_characteristics(failing, passing)

Check retrieval health:

# Compare retrieval scores before and after incident start
before = get_retrieval_scores(time_range="2_hours_before_incident")
during = get_retrieval_scores(time_range="since_incident_start")
print(f"Before: avg={mean(before):.2f}, During: avg={mean(during):.2f}")

Check for data changes:

# List recent data pipeline events
events = query_system_logs(
    "service IN ('knowledge_base', 'embeddings', 'data_pipeline')",
    time_range="last_6_hours"
)

Step 6: Fix and Verify

□ Implement fix (in staging first if possible)
□ Run evaluation suite against fix
□ Check for regressions in unaffected areas
□ Gradual rollout with monitoring
□ Confirm metrics return to baseline
□ Declare incident resolved

Step 7: Post-Incident

□ Gather data within 24-48 hours
□ Write post-mortem (use template below)
□ Schedule team review
□ Track action items to completion
□ Update THIS RUNBOOK with anything you learned

Post-Mortem Template

# Post-Mortem: [Descriptive Title]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes (start - end UTC)
- **Impact**: What users experienced and how many were affected
- **Detection**: How was it detected? (automated alert / user report / team noticed)
- **Severity**: Critical / High / Medium / Low

## Timeline
- HH:MM - Event that preceded the incident
- HH:MM - Incident started (or first detection)
- HH:MM - Alert fired / team notified
- HH:MM - Investigation began
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full resolution confirmed

## Root Cause
[Detailed technical explanation of what went wrong and why]

## What Went Well
- [Detection speed, response process, mitigation effectiveness]

## What Went Poorly
- [Detection gaps, investigation bottlenecks, missing tooling]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, actionable item] | [Name] | [Date] | Open |

## Lessons Learned
1. [Key insight that applies beyond this specific incident]

Common Failure Patterns Quick Reference

Pattern	Key Diagnostic	Quick Fix
Context Rot	Check context length, info position	Move critical info to start/end
Retrieval Miss	Check retrieval scores, top-k results	Increase top-k, add hybrid search
Hallucination	Search context for model’s claims	Strengthen grounding instructions
Tool Call Failure	Check tool definitions, selection logs	Clarify tool descriptions
Cascade Failure	Trace error to originating agent	Add validation at handoff points
Prompt Injection	Check inputs for instruction-like content	Input sanitization, clear delimiters

Useful Queries

Find requests with low quality in a time range:

quality_score < 0.5 AND timestamp > "2026-01-15T02:00:00Z"

Find requests that used a specific prompt version:

prompt_version = "v2.3.1" AND status = "error"

Find requests where retrieval was slow:

retrieval_latency_ms > 5000

Find requests where context was near limit:

context_tokens > 0.9 * context_limit

Find requests where model response was truncated:

finish_reason = "length"

Appendix Cross-References

This Section	Related Appendix	Connection
Quick Reference (tokens)	Appendix D: D.1 Token Estimation	Detailed token math
RAG & Retrieval Issues	Appendix A: A.2-A.3 Databases & Embeddings	Tool-specific debugging
Production Issues (costs)	Appendix D: D.6 Worked Examples	Cost calculations
Security Issues	Appendix B: B.10 Security patterns	Solutions to apply
On-Call Runbook	Appendix B: B.8 Production patterns	Mitigation patterns

When in doubt: log everything, change one thing at a time, measure before and after. Review this runbook after every post-mortem—stale runbooks are worse than no runbooks.

Keyboard shortcuts

Context Engineering: From Vibe Coder to Software Engineer