Appendix E: Glossary

Appendix E, v2.1 — Early 2026

Quick-reference definitions for key terms used throughout this book. Each entry includes the chapter where the term is introduced or primarily discussed.

A

A/B Testing — Comparing two system configurations (control vs. treatment) using statistical significance testing to determine which performs better. (Ch 12)

Action Gating — Verifying tool calls before execution and requiring user confirmation for sensitive or irreversible operations. (Ch 8, 14)

Agentic Coding — A development pattern where AI agents autonomously plan, execute, and iterate on tasks using tools, rather than requiring step-by-step human direction. (Ch 8, 10)

See also: Agentic Engineering, Agentic Loop, Vibe Coding

Agentic Engineering — The emerging professional practice of designing AI systems that autonomously plan, execute, and iterate on tasks—combining model orchestration, tool integration, context management, and reliability engineering into a production discipline. Context engineering is a core competency: these systems are only as reliable as the information they work with. (Ch 8, 10, 15)

See also: Agentic Coding, Multi-Agent Systems, Agent Orchestration

Agentic Loop — The pattern where a model receives a task, decides which tools to use, executes them, evaluates results, and continues iterating until the task is complete. The foundation of agentic coding. (Ch 8)

B

Baseline Metrics — Reference quality scores from a known-good system state, used to detect regressions when changes are made. (Ch 12)

Behavioral Rate Limiting — Rate limiting that detects attack patterns (repeated injection attempts, enumeration, resource exhaustion) rather than just counting requests. (Ch 14)

Bi-Encoder — An embedding model that encodes query and document separately, enabling fast retrieval but with less precision than cross-encoders. (Ch 7)

C

Calibrated Evaluation — Running multiple LLM-as-judge evaluations and aggregating results (typically using the median) to reduce individual judgment bias. (Ch 12)

Cascade Failure — When one bad decision early in a pipeline causes everything downstream to fail, making root cause analysis difficult. (Ch 13)

Category-Level Analysis — Measuring quality metrics separately by query type to catch segment-specific regressions that aggregate metrics would hide. (Ch 12)

Chunking — Splitting documents into meaningful pieces for embedding and retrieval, balancing semantic completeness with size constraints. (Ch 6)

Command Injection — An attack where shell commands are embedded in parameters, potentially allowing unauthorized system access. (Ch 14)

Completeness — An evaluation metric measuring whether a response covers everything needed to adequately answer the question. (Ch 12)

Complexity Classifier — A lightweight model that routes simple queries to fast paths and complex queries to more capable (and expensive) handlers. (Ch 10)

Confirmation Flow — Requiring explicit user approval before executing destructive or irreversible tool actions. (Ch 8)

Constraints — The part of a system prompt specifying what the model must always do, must never do, and required output formats. (Ch 4)

Context — Everything the model sees when generating a response: system prompt, conversation history, retrieved documents, tool definitions, and metadata. (Ch 1)

Context Budget — A token allocation strategy that assigns explicit limits to each context component (system prompt, RAG, memory, etc.). (Ch 11)

D

Dataset Drift — When usage patterns change over time, making a test dataset unrepresentative of current production traffic. (Ch 12)

Decision Tracking — Explicitly preserving key decisions in conversation history to prevent the model from later contradicting itself. (Ch 5)

Defense in Depth — A security architecture with multiple protection layers, each designed to catch what others miss. (Ch 14)

Dense Search — Vector-based semantic similarity search that finds content by meaning rather than exact keyword matching. (Ch 6)

E

Effective Capacity — The practical maximum context size before quality degrades noticeably; typically 60-70% of the theoretical context window limit. (Ch 2)

Embedding — A vector representation of text that captures semantic meaning, enabling similarity comparisons between different pieces of content. (Ch 6)

Emergency Limit — A hard ceiling on requests activated during traffic spikes to prevent cascading failures. (Ch 11)

Episodic Memory — Timestamped records of specific events and interactions, providing continuity and demonstrating user history. (Ch 9)

See also: Semantic Memory, Procedural Memory

Error Handling Hierarchy — A layered approach to errors: validation (catch before execution), execution (handle during), recovery (graceful fallback after). (Ch 8)

Evaluation Dataset — A representative, labeled test set combining production samples, expert-created examples, and synthetic edge cases. (Ch 12)

F

Faithfulness — A RAG evaluation metric measuring whether the response is grounded in retrieved context rather than hallucinated. (Ch 7)

G

Golden Dataset — A carefully curated evaluation dataset representing actual usage patterns, maintained over time as the definitive quality benchmark. (Ch 12)

Graceful Degradation — Returning reduced-quality but still functional responses when resources are constrained, rather than failing completely. (Ch 8, 11)

GraphRAG — A RAG approach that uses entity relationships and knowledge graphs to enable multi-document reasoning and complex queries. (Ch 7)

H

Hallucination — When a model invents information rather than using what’s provided in context, often presenting fabricated content with false confidence. (Ch 6)

Handoff Problem — The challenge of passing appropriate output between agents—too much detail overwhelms, too little loses critical information. (Ch 10)

Hybrid Scoring — Combining multiple signals (recency, relevance, importance) with tunable weights to rank memories for retrieval. (Ch 9)

Hybrid Search — Combining dense (vector) and sparse (keyword) search to get both semantic understanding and exact matching. (Ch 6)

I

Importance Scoring — Weighting memories by their significance, typically boosting decisions, corrections, and explicitly stated preferences. (Ch 9)

Indirect Prompt Injection — An attack where malicious instructions are hidden in retrieved documents, tool outputs, or other external content. (Ch 14)

K

Key Facts Extraction — Identifying the most important information to preserve when compressing conversation history. (Ch 5)

L

Large Language Model (LLM) — A neural network trained on vast text data that generates human-like text responses based on input context. (Ch 1)

Latency Budget — An allocated time limit for each pipeline stage, ensuring the total request stays within acceptable response time. (Ch 7, 11)

M

Match Rate Evaluation — Comparing LLM responses to a human baseline using embedding similarity to assess quality at scale. (Ch 12)

Memory Leak — When conversation memory grows without bound, eventually consuming the entire context window or causing failures. (Ch 5)

Memory Pruning — Intelligent removal of stale or low-value memories to prevent unbounded growth while preserving important information. (Ch 9)

Memory Retrieval Layer — The scoring and selection mechanism that decides which stored memories get injected into current context. (Ch 9)

Metric Mismatch — When automated evaluation metrics don’t correlate with actual user satisfaction or real-world performance. (Ch 12)

Metrics — Aggregate measurements over time: latency percentiles, error rates, quality scores, costs. (Ch 13)

Model Context Protocol (MCP) — The open standard for connecting LLMs to external data sources and tools, standardizing how context gets assembled from external systems. Introduced by Anthropic (November 2024), donated to the Linux Foundation’s Agentic AI Foundation (December 2025). By early 2026: 97 million monthly SDK downloads, 10,000+ active servers. (Ch 8)

See also: Tool Definition, Context Engineering

Model Drift — Behavior changes over time due to model updates from the provider, occurring without any code changes on your side. (Ch 13)

Model Fallback Chain — Attempting requests with a preferred model and automatically falling back to cheaper or faster alternatives on failure. (Ch 11)

Multi-Agent Systems — Architectures where multiple specialized AI agents collaborate on complex tasks, coordinated by an orchestrator or through structured handoffs. Adds capability at the cost of latency, tokens, and debugging complexity (the “coordination tax”). (Ch 10)

See also: Agent Orchestration, Orchestrator, Coordination Tax

Multi-Tenant Isolation — Ensuring users can only access data they’re authorized for, preventing cross-tenant data leakage. (Ch 14)

N

Non-Deterministic Behavior — When the same input produces different outputs due to temperature settings, model updates, or subtle context variations. (Ch 13)

O

Observability — The ability to see what a system is doing: what went in, what came out, how long it took, and what it cost. (Ch 3, 13)

Orchestrator — A central coordinator in multi-agent systems that decomposes tasks, delegates to specialist agents, and synthesizes their results. (Ch 10)

Output Format Specification — An explicit schema in the system prompt defining the required response structure (JSON, markdown, specific fields). (Ch 4)

Output Guardrails — Final content filtering applied before returning responses to users. (Ch 14)

Output Validation — Checking model outputs for system prompt leakage, sensitive data exposure, or dangerous recommendations. (Ch 14)

P

Parallel with Aggregation — A multi-agent pattern where multiple agents work simultaneously on independent subtasks, with results combined afterward. (Ch 10)

Path Traversal — An attack using “../” patterns to access files outside intended directories. (Ch 14)

Path Validation — Ensuring tools can’t access files outside their intended directories by checking and normalizing file paths. (Ch 8)

Personally Identifiable Information (PII) — Data that can identify individuals, requiring special handling and protection. (Ch 14)

Pipeline Pattern — A multi-agent pattern where each agent’s output becomes the next agent’s input in a sequential transformation. (Ch 10)

Post-Mortem — A blameless learning document created after an incident, focusing on systemic improvements rather than individual fault. (Ch 13)

Practical Significance — Whether a statistically significant improvement is actually meaningful enough to justify the change in a real-world context. (Ch 12)

Primacy Effect — The phenomenon where information at the beginning of context receives elevated attention from the model. (Ch 2)

See also: Recency Effect, Lost in the Middle

Principle of Least Privilege — Giving tools and agents only the permissions they need to accomplish their task, nothing more. (Ch 8, 14)

Procedural Memory — Learned patterns and workflows that enable behavioral adaptation, like knowing a user prefers certain code review approaches. (Ch 9)

See also: Episodic Memory, Semantic Memory

Prompt — The complete input sent to a language model, including system prompt, conversation history, retrieved context, and the current user message. (Ch 1, 2)

Prompt Engineering — The practice of crafting effective inputs for language models—using clarity, structure, examples, and role definitions to get better results. The foundation that context engineering evolved from; its core insights remain essential within context engineering. Not obsolete, but absorbed into the broader discipline. (Ch 1)

Q

Query Expansion — Generating alternative phrasings of a query to improve retrieval coverage and catch relevant documents that use different terminology. (Ch 7)

R

RAGAS — An evaluation framework for RAG systems that measures context precision, context recall, faithfulness, and answer relevancy. (Ch 7)

Ralph Loop — A development methodology by Geoffrey Huntley treating context management as central to AI-assisted development. Key principles: reset context each iteration, persist state through the filesystem rather than conversation, allocate ~40% planning, ~20% implementation, ~40% review. (Ch 5, 15)

See also: Conversation History, Context Window

Rate Limiting (Token-Based) — Limiting usage by tokens consumed rather than just request count, providing fairer allocation for varying query sizes. (Ch 11)

Recency Effect — The phenomenon where information at the end of context receives elevated attention from the model. (Ch 2)

See also: Primacy Effect, Lost in the Middle

Recency Scoring — Favoring recent memories in retrieval using exponential decay, so newer information is more likely to be included. (Ch 9)

Reciprocal Rank Fusion (RRF) — An algorithm that combines results from multiple search methods by aggregating their rank positions. (Ch 6)

Regression Detection — Identifying when changes degrade quality metrics beyond acceptable thresholds compared to baseline. (Ch 12)

Regression Testing — Tests that verify new changes don’t break existing functionality that was previously working. (Ch 3, 12)

Relevance — An evaluation metric measuring whether a response addresses the actual question that was asked. (Ch 12)

Relevance Scoring — Using embedding similarity to find memories that are semantically related to the current context. (Ch 9)

Reranking — A second-pass reordering of retrieval results using a more accurate (but slower) model to improve relevance. (Ch 7)

S

Sandboxing — Running commands in isolated environments with restricted permissions to limit potential damage from malicious or buggy operations. (Ch 8)

Semantic Memory — Facts, preferences, and knowledge extracted from interactions, enabling personalization without storing every conversation detail. (Ch 9)

See also: Episodic Memory, Procedural Memory

Semantic Search — Finding content by meaning rather than exact keyword matching, using embeddings to measure similarity. (Ch 6)

Sensitive Data Filter — Pattern-based detection and redaction of credentials, API keys, secrets, and other sensitive information. (Ch 14)

Signal-to-Noise Ratio — The proportion of useful information versus filler tokens in context; higher ratios generally produce better results. (Ch 1)

Sliding Window — A conversation management strategy that keeps only the last N messages, discarding older ones as new messages arrive. (Ch 5)

Span — An individual operation within a distributed trace hierarchy, representing one step in a request’s journey. (Ch 13)

Sparse Search — Keyword-based search methods like BM25 that match exact terms rather than semantic meaning. (Ch 6)

T

Temperature — A model parameter (typically 0.0–1.0) controlling output randomness. Lower values produce more deterministic responses; higher values increase creativity but reduce reproducibility. Critical for debugging: always log the temperature used for each request. (Ch 3, 13)

Testability — The ability to verify system correctness through automated tests and detect when behavior changes. (Ch 3)

The 70% Rule — A guideline to trigger compression or context management when approaching 70-80% of the context window limit. (Ch 2)

Tiered Evaluation — A cost-effective strategy using cheap automated checks on every commit, LLM-as-judge weekly, and human review monthly. (Ch 12)

Tiered Limits — Differentiated rate limits by user tier (free, pro, enterprise), allowing paying users more resources. (Ch 11)

Tiered Memory — A three-tier conversation management approach: active (full recent messages), summarized (compressed older content), archived (stored but not loaded). (Ch 5)

Token — A chunk of text (approximately 4 characters in English) that language models process as a single unit. Tokens are the fundamental unit of context measurement. (Ch 2)

Token Budget — A deliberate allocation of tokens to each context component, ensuring no single component crowds out others. (Ch 2, 11)

U

User Message — The human’s input in a conversation; distinguished from system prompt (developer-defined) and assistant response (model-generated). (Ch 1)

V

Validator Pattern — A multi-agent pattern where a dedicated agent checks another agent’s work, improving accuracy by catching errors. (Ch 10)

Vector Database — A database optimized for storing embeddings and performing fast similarity searches across large collections. (Ch 6)

Version Control for Prompts — Treating prompts like code: tracking versions, reviewing changes, testing before deployment. (Ch 3)

Vibe Coding — A development methodology where builders collaborate with AI through natural language and iterative feedback, without necessarily reviewing generated code. Coined by Andrej Karpathy (February 2025); Collins Dictionary Word of the Year 2025. Effective for prototyping; context engineering adds the discipline needed for production. (Ch 1, 15)

See also: Agentic Coding, Context Engineering

Concepts by Problem

Can’t find what you need alphabetically? Search by the problem you’re trying to solve.

Problem	Key Concepts	Where to Start
My context is too large	Context Rot, The 70% Rule, Context Compression, Token Budget	Ch 2, Appendix B.1
Model ignores my instructions	Lost in the Middle, Positional Priority, Conflict Detection	Ch 2, Ch 4
RAG returns wrong results	Chunking, Hybrid Search, Reranking, RAG Stage Isolation	Ch 6, Ch 7
AI hallucinates despite having context	Faithfulness, Groundedness, Context Isolation	Ch 6, Ch 14
Costs are too high	Cost Tracking, Prompt Caching, Graceful Degradation, Model Fallback	Ch 11, Appendix D
System is unreliable in production	Circuit Breaker, Rate Limiting, Model Fallback Chain	Ch 11, Appendix B.8
Can’t debug failures	Distributed Tracing, Context Snapshot, Reproducibility	Ch 13, Appendix C
Security concerns	Defense in Depth, Prompt Injection, Context Isolation, Action Gating	Ch 14, Appendix B.10
Memory grows unbounded	Memory Pruning, Contradiction Detection, Tiered Memory	Ch 9, Appendix B.6
Agents contradict each other	Structured Handoffs, Tool Isolation, Orchestrator	Ch 10, Appendix B.7
Tests pass but users complain	Domain-Specific Metrics, Stratified Sampling, Dataset Drift	Ch 12, Appendix B.9
Need to choose between approaches	Context Engineering (decision frameworks throughout)	Ch 1, Ch 15

Appendix Cross-References

Term Category	Related Appendix	Connection
RAG terms (Chunking, Embedding, etc.)	Appendix A: A.2-A.3	Tool options
Pattern terms (70% Rule, etc.)	Appendix B	Full pattern details
Debugging terms (Traces, etc.)	Appendix C	Diagnostic procedures
Cost terms (Token Budget, etc.)	Appendix D	Calculations and examples

Glossary complete. For detailed explanations and code examples, see the referenced chapters.

Keyboard shortcuts

Context Engineering: From Vibe Coder to Software Engineer