Appendix E: Glossary
Appendix E, v2.1 — Early 2026
Quick-reference definitions for key terms used throughout this book. Each entry includes the chapter where the term is introduced or primarily discussed.
A
A/B Testing — Comparing two system configurations (control vs. treatment) using statistical significance testing to determine which performs better. (Ch 12)
Action Gating — Verifying tool calls before execution and requiring user confirmation for sensitive or irreversible operations. (Ch 8, 14)
Agentic Coding — A development pattern where AI agents autonomously plan, execute, and iterate on tasks using tools, rather than requiring step-by-step human direction. (Ch 8, 10)
See also: Agentic Engineering, Agentic Loop, Vibe Coding
Agentic Engineering — The emerging professional practice of designing AI systems that autonomously plan, execute, and iterate on tasks—combining model orchestration, tool integration, context management, and reliability engineering into a production discipline. Context engineering is a core competency: these systems are only as reliable as the information they work with. (Ch 8, 10, 15)
See also: Agentic Coding, Multi-Agent Systems, Agent Orchestration
Agentic Loop — The pattern where a model receives a task, decides which tools to use, executes them, evaluates results, and continues iterating until the task is complete. The foundation of agentic coding. (Ch 8)
See also: Tool Use, Agentic Coding
Agent Orchestration — Managing multiple AI agents working together—defining permissions, boundaries, goals, and coordination patterns. Includes routing requests, sharing state, handling failures, and ensuring agents don’t contradict each other. (Ch 10)
See also: Multi-Agent Systems, Agentic Engineering
AGENTS.md — An open standard providing AI coding agents with project-specific context: repository structure, conventions, and constraints. Adopted by over 40,000 open-source projects as of early 2026. (Ch 4, 15)
See also: .cursorrules / .claude files, System Prompt
AI-Native Development — Building systems designed from the ground up for AI, rather than retrofitting AI onto existing processes. Involves rethinking workflows, data structures, and interfaces around AI capabilities. Named a Gartner top strategic technology trend for 2026. (Ch 15)
Answer Relevancy — A RAG evaluation metric measuring whether the generated response actually addresses the question that was asked. (Ch 7)
Attention Budget — The limited cognitive focus a model can allocate across its context window; more tokens means less attention per token. (Ch 1)
B
Baseline Metrics — Reference quality scores from a known-good system state, used to detect regressions when changes are made. (Ch 12)
Behavioral Rate Limiting — Rate limiting that detects attack patterns (repeated injection attempts, enumeration, resource exhaustion) rather than just counting requests. (Ch 14)
Bi-Encoder — An embedding model that encodes query and document separately, enabling fast retrieval but with less precision than cross-encoders. (Ch 7)
See also: Cross-Encoder
Binary Search Debugging — Systematically halving the problem space to isolate the cause of a bug, rather than checking components randomly. (Ch 3)
Budget Alert Threshold — A cost level that triggers warnings to operators, allowing intervention before hitting hard limits. (Ch 11)
Budget Hard Limit — The cost ceiling where a system stops accepting requests to prevent runaway spending. (Ch 11)
C
Calibrated Evaluation — Running multiple LLM-as-judge evaluations and aggregating results (typically using the median) to reduce individual judgment bias. (Ch 12)
Cascade Failure — When one bad decision early in a pipeline causes everything downstream to fail, making root cause analysis difficult. (Ch 13)
Category-Level Analysis — Measuring quality metrics separately by query type to catch segment-specific regressions that aggregate metrics would hide. (Ch 12)
Chunking — Splitting documents into meaningful pieces for embedding and retrieval, balancing semantic completeness with size constraints. (Ch 6)
Command Injection — An attack where shell commands are embedded in parameters, potentially allowing unauthorized system access. (Ch 14)
Completeness — An evaluation metric measuring whether a response covers everything needed to adequately answer the question. (Ch 12)
Complexity Classifier — A lightweight model that routes simple queries to fast paths and complex queries to more capable (and expensive) handlers. (Ch 10)
Confirmation Flow — Requiring explicit user approval before executing destructive or irreversible tool actions. (Ch 8)
Constraints — The part of a system prompt specifying what the model must always do, must never do, and required output formats. (Ch 4)
Context — Everything the model sees when generating a response: system prompt, conversation history, retrieved documents, tool definitions, and metadata. (Ch 1)
Context Budget — A token allocation strategy that assigns explicit limits to each context component (system prompt, RAG, memory, etc.). (Ch 11)
See also: Token Budget
Context Compression — Extracting only the relevant parts from retrieved chunks before adding them to context, reducing token usage while preserving information. (Ch 7)
Context Engineering — The discipline of designing what information reaches an AI model, in what format, at what time—to reliably achieve a desired outcome. The evolution of prompt engineering, expanded from optimizing individual requests to designing the entire information environment. A core competency within agentic engineering. (Ch 1)
See also: Prompt Engineering, Agentic Engineering
Context Isolation — Clear separation of trusted content (system instructions) from untrusted content (user input, retrieved documents) using XML delimiters or other markers. (Ch 14)
Context Precision — A RAG evaluation metric measuring what percentage of retrieved chunks are actually relevant to the query. (Ch 7)
Context Recall — A RAG evaluation metric measuring what percentage of relevant chunks were successfully retrieved. (Ch 7)
Context Reduction — Intelligently shrinking context when resources are constrained, typically by truncating conversation history first, then RAG results, then memory. (Ch 11)
Context Reproducer — A debugging tool that replays a request using saved context snapshots to reproduce the exact failure conditions. (Ch 13)
Context Rot — Performance degradation as context fills up; adding more information can interfere with the model’s ability to use relevant information effectively. (Ch 1, 2)
See also: Lost in the Middle
Context Snapshot — A preserved copy of the exact inputs sent to a model, enabling reproduction and debugging of specific requests. (Ch 13)
Context Window — The fixed-size input a language model can process in a single request, measured in tokens (e.g., 200K tokens for Claude 3.5). (Ch 1, 2)
Contradiction Detection — Identifying conflicting memories and resolving them by superseding old information with new, especially when user preferences change. (Ch 9)
Coordination Tax — The overhead cost of multi-agent systems: more LLM calls, increased latency, additional tokens, and more potential failure points. (Ch 10)
Correctness — An evaluation metric measuring the factual accuracy of response content. (Ch 12)
Cosine Similarity — A metric measuring the angle between two vectors, used to calculate semantic similarity between text embeddings; values range from 0 to 1, with 1 being most similar. (Ch 6, 7)
Cost Tracking — Monitoring LLM costs across dimensions: per user, per model, per component, and globally. (Ch 11)
Cross-Encoder — A model that processes query and document together, providing more accurate relevance scores than bi-encoders but at higher computational cost. (Ch 7)
See also: Bi-Encoder, Reranking
.cursorrules / .claude files — Project-level configuration files providing persistent context to AI coding tools like Cursor and Claude Code. Functionally equivalent to system prompts for development environments. (Ch 4, 15)
See also: AGENTS.md, System Prompt
D
Dataset Drift — When usage patterns change over time, making a test dataset unrepresentative of current production traffic. (Ch 12)
Decision Tracking — Explicitly preserving key decisions in conversation history to prevent the model from later contradicting itself. (Ch 5)
Defense in Depth — A security architecture with multiple protection layers, each designed to catch what others miss. (Ch 14)
Dense Search — Vector-based semantic similarity search that finds content by meaning rather than exact keyword matching. (Ch 6)
See also: Sparse Search, Hybrid Search
Diff Analysis — Comparing current model behavior to historical baselines to detect drift or unexpected changes. (Ch 13)
Direct Prompt Injection — An attack where user input contains explicit instructions attempting to override system behavior. (Ch 14)
See also: Indirect Prompt Injection
Distributed Tracing — Connecting events across multiple pipeline stages to understand the complete request journey, including timing and dependencies. (Ch 13)
Document-Aware Chunking — Splitting documents while respecting their structure—keeping code functions whole, preserving paragraph boundaries, maintaining list integrity. (Ch 6)
Domain-Specific Metrics — Custom quality measurements that reflect what “good” means for a specific application, beyond generic metrics. (Ch 12)
Dynamic Components — Prompt elements that change per-request: specific task details, selected examples, current user context. (Ch 4)
See also: Static Components
E
Effective Capacity — The practical maximum context size before quality degrades noticeably; typically 60-70% of the theoretical context window limit. (Ch 2)
Embedding — A vector representation of text that captures semantic meaning, enabling similarity comparisons between different pieces of content. (Ch 6)
Emergency Limit — A hard ceiling on requests activated during traffic spikes to prevent cascading failures. (Ch 11)
Episodic Memory — Timestamped records of specific events and interactions, providing continuity and demonstrating user history. (Ch 9)
See also: Semantic Memory, Procedural Memory
Error Handling Hierarchy — A layered approach to errors: validation (catch before execution), execution (handle during), recovery (graceful fallback after). (Ch 8)
Evaluation Dataset — A representative, labeled test set combining production samples, expert-created examples, and synthetic edge cases. (Ch 12)
See also: Golden Dataset
Excessive Agency — System capabilities beyond what’s necessary for its purpose, creating a larger attack surface. (Ch 14)
Extractive Compression — Using an LLM to extract only the relevant sentences from retrieved chunks, discarding irrelevant content. (Ch 7)
F
Faithfulness — A RAG evaluation metric measuring whether the response is grounded in retrieved context rather than hallucinated. (Ch 7)
See also: Groundedness, Hallucination
First Token Latency (TTFT) — The time from sending an API request to receiving the first token of the response. A key production metric distinct from total response time, as users perceive responsiveness from the first token. (Ch 11, 13)
See also: Latency Budget
5 Whys Pattern — A root cause analysis technique that asks “why” iteratively (typically five times) to move from symptoms to underlying causes. (Ch 13)
G
Golden Dataset — A carefully curated evaluation dataset representing actual usage patterns, maintained over time as the definitive quality benchmark. (Ch 12)
Graceful Degradation — Returning reduced-quality but still functional responses when resources are constrained, rather than failing completely. (Ch 8, 11)
GraphRAG — A RAG approach that uses entity relationships and knowledge graphs to enable multi-document reasoning and complex queries. (Ch 7)
See also: LazyGraphRAG
Groundedness — Whether a response is based on provided context rather than invented by the model. (Ch 12)
See also: Faithfulness, Hallucination
H
Hallucination — When a model invents information rather than using what’s provided in context, often presenting fabricated content with false confidence. (Ch 6)
Handoff Problem — The challenge of passing appropriate output between agents—too much detail overwhelms, too little loses critical information. (Ch 10)
Hybrid Scoring — Combining multiple signals (recency, relevance, importance) with tunable weights to rank memories for retrieval. (Ch 9)
Hybrid Search — Combining dense (vector) and sparse (keyword) search to get both semantic understanding and exact matching. (Ch 6)
I
Importance Scoring — Weighting memories by their significance, typically boosting decisions, corrections, and explicitly stated preferences. (Ch 9)
Indirect Prompt Injection — An attack where malicious instructions are hidden in retrieved documents, tool outputs, or other external content. (Ch 14)
See also: Direct Prompt Injection
Inflection Point — The context size where measurable performance degradation begins, typically around 32K tokens for many models. (Ch 2)
Ingestion Pipeline — The offline process that prepares documents for retrieval: loading → chunking → embedding → storage. (Ch 6)
Injection Patterns — Regex patterns designed to detect known prompt injection attempts like “ignore previous instructions” or “new system prompt.” (Ch 14)
Input Guardrails — High-level policies that block inappropriate requests before they reach the model. (Ch 14)
Input Validation — Pattern matching and analysis to detect obvious injection attempts and malformed inputs. (Ch 14)
Instructions — The part of a system prompt specifying what the model should do, in what order, and with what decision logic. (Ch 4)
K
Key Facts Extraction — Identifying the most important information to preserve when compressing conversation history. (Ch 5)
L
Large Language Model (LLM) — A neural network trained on vast text data that generates human-like text responses based on input context. (Ch 1)
Latency Budget — An allocated time limit for each pipeline stage, ensuring the total request stays within acceptable response time. (Ch 7, 11)
See also: First Token Latency
LazyGraphRAG — A lightweight GraphRAG variant that defers graph construction to query time, avoiding upfront indexing costs. (Ch 7)
Long Context Windows — Context windows exceeding 100K tokens (e.g., Claude’s 200K, Gemini’s 1M) that enable processing entire codebases or document collections in a single request. Larger windows don’t eliminate the need for context engineering—the “Lost in the Middle” effect and attention dilution mean that careful curation remains essential even with abundant capacity. (Ch 2)
See also: Context Window, Context Rot, Effective Capacity
LLM-as-Judge — Using one LLM to evaluate another LLM’s response quality, enabling scalable assessment of subjective dimensions. (Ch 12)
Logs — Records of discrete events happening in a system, typically structured as JSON with timestamps and correlation IDs. (Ch 13)
Lost in the Middle — The phenomenon where information positioned in the middle of context (40-60% position) receives less attention than content at the beginning or end. (Ch 2)
See also: Primacy Effect, Recency Effect
M
Match Rate Evaluation — Comparing LLM responses to a human baseline using embedding similarity to assess quality at scale. (Ch 12)
Memory Leak — When conversation memory grows without bound, eventually consuming the entire context window or causing failures. (Ch 5)
Memory Pruning — Intelligent removal of stale or low-value memories to prevent unbounded growth while preserving important information. (Ch 9)
Memory Retrieval Layer — The scoring and selection mechanism that decides which stored memories get injected into current context. (Ch 9)
Metric Mismatch — When automated evaluation metrics don’t correlate with actual user satisfaction or real-world performance. (Ch 12)
Metrics — Aggregate measurements over time: latency percentiles, error rates, quality scores, costs. (Ch 13)
Model Context Protocol (MCP) — The open standard for connecting LLMs to external data sources and tools, standardizing how context gets assembled from external systems. Introduced by Anthropic (November 2024), donated to the Linux Foundation’s Agentic AI Foundation (December 2025). By early 2026: 97 million monthly SDK downloads, 10,000+ active servers. (Ch 8)
See also: Tool Definition, Context Engineering
Model Drift — Behavior changes over time due to model updates from the provider, occurring without any code changes on your side. (Ch 13)
Model Fallback Chain — Attempting requests with a preferred model and automatically falling back to cheaper or faster alternatives on failure. (Ch 11)
Multi-Agent Systems — Architectures where multiple specialized AI agents collaborate on complex tasks, coordinated by an orchestrator or through structured handoffs. Adds capability at the cost of latency, tokens, and debugging complexity (the “coordination tax”). (Ch 10)
See also: Agent Orchestration, Orchestrator, Coordination Tax
Multi-Tenant Isolation — Ensuring users can only access data they’re authorized for, preventing cross-tenant data leakage. (Ch 14)
N
Non-Deterministic Behavior — When the same input produces different outputs due to temperature settings, model updates, or subtle context variations. (Ch 13)
O
Observability — The ability to see what a system is doing: what went in, what came out, how long it took, and what it cost. (Ch 3, 13)
Orchestrator — A central coordinator in multi-agent systems that decomposes tasks, delegates to specialist agents, and synthesizes their results. (Ch 10)
Output Format Specification — An explicit schema in the system prompt defining the required response structure (JSON, markdown, specific fields). (Ch 4)
Output Guardrails — Final content filtering applied before returning responses to users. (Ch 14)
Output Validation — Checking model outputs for system prompt leakage, sensitive data exposure, or dangerous recommendations. (Ch 14)
P
Parallel with Aggregation — A multi-agent pattern where multiple agents work simultaneously on independent subtasks, with results combined afterward. (Ch 10)
Path Traversal — An attack using “../” patterns to access files outside intended directories. (Ch 14)
Path Validation — Ensuring tools can’t access files outside their intended directories by checking and normalizing file paths. (Ch 8)
Personally Identifiable Information (PII) — Data that can identify individuals, requiring special handling and protection. (Ch 14)
Pipeline Pattern — A multi-agent pattern where each agent’s output becomes the next agent’s input in a sequential transformation. (Ch 10)
Post-Mortem — A blameless learning document created after an incident, focusing on systemic improvements rather than individual fault. (Ch 13)
Practical Significance — Whether a statistically significant improvement is actually meaningful enough to justify the change in a real-world context. (Ch 12)
Primacy Effect — The phenomenon where information at the beginning of context receives elevated attention from the model. (Ch 2)
See also: Recency Effect, Lost in the Middle
Principle of Least Privilege — Giving tools and agents only the permissions they need to accomplish their task, nothing more. (Ch 8, 14)
Procedural Memory — Learned patterns and workflows that enable behavioral adaptation, like knowing a user prefers certain code review approaches. (Ch 9)
See also: Episodic Memory, Semantic Memory
Prompt — The complete input sent to a language model, including system prompt, conversation history, retrieved context, and the current user message. (Ch 1, 2)
Prompt Engineering — The practice of crafting effective inputs for language models—using clarity, structure, examples, and role definitions to get better results. The foundation that context engineering evolved from; its core insights remain essential within context engineering. Not obsolete, but absorbed into the broader discipline. (Ch 1)
See also: Context Engineering
Prompt Caching — A cost optimization where providers cache the system prompt and other static context prefix, charging reduced rates for subsequent requests that share the same prefix. Typically reduces input costs by 80-90% for repeated prefixes. (Ch 11)
See also: Context Budget, Cost Tracking
Prompt Injection — An attack where crafted input attempts to override or modify a system’s intended behavior by exploiting the model’s instruction-following nature. (Ch 14)
Q
Query Expansion — Generating alternative phrasings of a query to improve retrieval coverage and catch relevant documents that use different terminology. (Ch 7)
R
RAGAS — An evaluation framework for RAG systems that measures context precision, context recall, faithfulness, and answer relevancy. (Ch 7)
Ralph Loop — A development methodology by Geoffrey Huntley treating context management as central to AI-assisted development. Key principles: reset context each iteration, persist state through the filesystem rather than conversation, allocate ~40% planning, ~20% implementation, ~40% review. (Ch 5, 15)
See also: Conversation History, Context Window
Rate Limiting (Token-Based) — Limiting usage by tokens consumed rather than just request count, providing fairer allocation for varying query sizes. (Ch 11)
Recency Effect — The phenomenon where information at the end of context receives elevated attention from the model. (Ch 2)
See also: Primacy Effect, Lost in the Middle
Recency Scoring — Favoring recent memories in retrieval using exponential decay, so newer information is more likely to be included. (Ch 9)
Reciprocal Rank Fusion (RRF) — An algorithm that combines results from multiple search methods by aggregating their rank positions. (Ch 6)
Regression Detection — Identifying when changes degrade quality metrics beyond acceptable thresholds compared to baseline. (Ch 12)
Regression Testing — Tests that verify new changes don’t break existing functionality that was previously working. (Ch 3, 12)
Relevance — An evaluation metric measuring whether a response addresses the actual question that was asked. (Ch 12)
Relevance Scoring — Using embedding similarity to find memories that are semantically related to the current context. (Ch 9)
Reranking — A second-pass reordering of retrieval results using a more accurate (but slower) model to improve relevance. (Ch 7)
See also: Cross-Encoder
Reproducibility — The ability to get the same outputs given the same inputs, essential for debugging non-deterministic AI systems. (Ch 3)
Request Observer — An observability context manager that tracks a single request through all pipeline stages, collecting timing and metadata. (Ch 13)
Response Mode Degradation — Simplifying what the model produces (full → standard → concise) based on resource constraints or latency requirements. (Ch 11)
Retrieval-Augmented Generation (RAG) — A technique that finds relevant information from a knowledge base and injects it into context before generation. (Ch 6)
Retrieval Miss — When a relevant answer exists in the knowledge base but wasn’t retrieved, often due to vocabulary mismatch or insufficient top-k. (Ch 13)
Retrieval Pipeline — The online process that handles queries: embedding the query → searching the index → returning top-k results. (Ch 6)
Role — The part of a system prompt defining who the model is, what expertise it has, and what perspective it brings. (Ch 4)
Root Cause Analysis — Finding the underlying cause of a problem rather than just addressing the proximate symptom. (Ch 13)
Routing Pattern — A multi-agent pattern that dispatches requests to specialized handlers based on request classification. (Ch 10)
S
Sandboxing — Running commands in isolated environments with restricted permissions to limit potential damage from malicious or buggy operations. (Ch 8)
Semantic Memory — Facts, preferences, and knowledge extracted from interactions, enabling personalization without storing every conversation detail. (Ch 9)
See also: Episodic Memory, Procedural Memory
Semantic Search — Finding content by meaning rather than exact keyword matching, using embeddings to measure similarity. (Ch 6)
Sensitive Data Filter — Pattern-based detection and redaction of credentials, API keys, secrets, and other sensitive information. (Ch 14)
Signal-to-Noise Ratio — The proportion of useful information versus filler tokens in context; higher ratios generally produce better results. (Ch 1)
Sliding Window — A conversation management strategy that keeps only the last N messages, discarding older ones as new messages arrive. (Ch 5)
Span — An individual operation within a distributed trace hierarchy, representing one step in a request’s journey. (Ch 13)
Sparse Search — Keyword-based search methods like BM25 that match exact terms rather than semantic meaning. (Ch 6)
See also: Dense Search, Hybrid Search
Specialist Agent — A focused agent with narrow responsibilities and access to only the tools needed for its specific task. (Ch 10)
Static Components — Prompt elements that rarely change across requests: role definitions, core behaviors, fundamental constraints. (Ch 4)
See also: Dynamic Components
Statistical Debugging — Running the same request multiple times to understand the distribution of failures and identify patterns. (Ch 13)
Statistical Significance — A p-value below 0.05 indicating that observed differences are unlikely to be due to random chance. (Ch 12)
Stratified Sampling — Balanced sampling across query categories to ensure evaluation datasets represent all important use cases. (Ch 12)
Structured Handoffs — Using schemas to define and validate agent-to-agent communication, ensuring consistent data transfer. (Ch 10)
Structured Logging — Recording events as queryable data (typically JSON) with correlation IDs, timestamps, and key-value metadata. (Ch 3, 13)
Structured Output — Predictable, parseable responses following a defined format like JSON or specific markdown structures. (Ch 4)
Summarization — Compressing old conversation messages to preserve their essence while reclaiming tokens for new content. (Ch 5)
Systematic Debugging — Finding root causes through a repeatable process rather than trial-and-error guessing. (Ch 3)
System Prompt — The persistent instructions that define an AI’s role, behavior, and constraints; set by the developer, not the user. (Ch 4)
System Prompt Leakage — Unintended exposure of internal system instructions through model outputs, often via clever user queries. (Ch 14)
T
Temperature — A model parameter (typically 0.0–1.0) controlling output randomness. Lower values produce more deterministic responses; higher values increase creativity but reduce reproducibility. Critical for debugging: always log the temperature used for each request. (Ch 3, 13)
Testability — The ability to verify system correctness through automated tests and detect when behavior changes. (Ch 3)
The 70% Rule — A guideline to trigger compression or context management when approaching 70-80% of the context window limit. (Ch 2)
Tiered Evaluation — A cost-effective strategy using cheap automated checks on every commit, LLM-as-judge weekly, and human review monthly. (Ch 12)
Tiered Limits — Differentiated rate limits by user tier (free, pro, enterprise), allowing paying users more resources. (Ch 11)
Tiered Memory — A three-tier conversation management approach: active (full recent messages), summarized (compressed older content), archived (stored but not loaded). (Ch 5)
Token — A chunk of text (approximately 4 characters in English) that language models process as a single unit. Tokens are the fundamental unit of context measurement. (Ch 2)
Token Budget — A deliberate allocation of tokens to each context component, ensuring no single component crowds out others. (Ch 2, 11)
See also: Context Budget
Top-K — The number of results returned from a retrieval query. Higher values improve recall but may dilute relevance and consume more context tokens. Finding the right top-k is a key tuning parameter for RAG systems. (Ch 6, 7)
Tool — A function that a model can invoke to take actions in the world: reading files, searching databases, calling APIs, executing code. (Ch 8)
Tool Definition — A specification telling the model what a tool does, what parameters it accepts, and when to use it. (Ch 8)
Tool Isolation — Restricting agents to only the tools they need for their specific task, preventing confused tool selection. (Ch 10)
Topological Sort — Ordering agent execution by dependency graph to ensure agents run in the correct sequence. (Ch 10)
Traces — Connected records of events across a request’s journey through multiple system components. (Ch 13)
Truncation — Removing content from context to fit within token limits, typically applied to conversation history (oldest first), RAG results (lowest relevance first), or memory (lowest priority first). Distinguished from compression, which preserves information in fewer tokens. (Ch 5, 11)
See also: Context Reduction, Summarization
Trust Boundaries — Explicit markers (typically XML tags) separating trusted content (system instructions) from untrusted content (user input, external data). (Ch 14)
U
User Message — The human’s input in a conversation; distinguished from system prompt (developer-defined) and assistant response (model-generated). (Ch 1)
V
Validator Pattern — A multi-agent pattern where a dedicated agent checks another agent’s work, improving accuracy by catching errors. (Ch 10)
Vector Database — A database optimized for storing embeddings and performing fast similarity searches across large collections. (Ch 6)
Version Control for Prompts — Treating prompts like code: tracking versions, reviewing changes, testing before deployment. (Ch 3)
Vibe Coding — A development methodology where builders collaborate with AI through natural language and iterative feedback, without necessarily reviewing generated code. Coined by Andrej Karpathy (February 2025); Collins Dictionary Word of the Year 2025. Effective for prototyping; context engineering adds the discipline needed for production. (Ch 1, 15)
See also: Agentic Coding, Context Engineering
Concepts by Problem
Can’t find what you need alphabetically? Search by the problem you’re trying to solve.
| Problem | Key Concepts | Where to Start |
|---|---|---|
| My context is too large | Context Rot, The 70% Rule, Context Compression, Token Budget | Ch 2, Appendix B.1 |
| Model ignores my instructions | Lost in the Middle, Positional Priority, Conflict Detection | Ch 2, Ch 4 |
| RAG returns wrong results | Chunking, Hybrid Search, Reranking, RAG Stage Isolation | Ch 6, Ch 7 |
| AI hallucinates despite having context | Faithfulness, Groundedness, Context Isolation | Ch 6, Ch 14 |
| Costs are too high | Cost Tracking, Prompt Caching, Graceful Degradation, Model Fallback | Ch 11, Appendix D |
| System is unreliable in production | Circuit Breaker, Rate Limiting, Model Fallback Chain | Ch 11, Appendix B.8 |
| Can’t debug failures | Distributed Tracing, Context Snapshot, Reproducibility | Ch 13, Appendix C |
| Security concerns | Defense in Depth, Prompt Injection, Context Isolation, Action Gating | Ch 14, Appendix B.10 |
| Memory grows unbounded | Memory Pruning, Contradiction Detection, Tiered Memory | Ch 9, Appendix B.6 |
| Agents contradict each other | Structured Handoffs, Tool Isolation, Orchestrator | Ch 10, Appendix B.7 |
| Tests pass but users complain | Domain-Specific Metrics, Stratified Sampling, Dataset Drift | Ch 12, Appendix B.9 |
| Need to choose between approaches | Context Engineering (decision frameworks throughout) | Ch 1, Ch 15 |
Appendix Cross-References
| Term Category | Related Appendix | Connection |
|---|---|---|
| RAG terms (Chunking, Embedding, etc.) | Appendix A: A.2-A.3 | Tool options |
| Pattern terms (70% Rule, etc.) | Appendix B | Full pattern details |
| Debugging terms (Traces, etc.) | Appendix C | Diagnostic procedures |
| Cost terms (Token Budget, etc.) | Appendix D | Calculations and examples |
Glossary complete. For detailed explanations and code examples, see the referenced chapters.