
Appendix A: Tool and Framework Reference

Appendix A, v2.1 — Early 2026

This appendix reflects the tool and pricing landscape as of early 2026. Specific versions, pricing, and feature sets will change. The decision frameworks and evaluation criteria remain relevant.

This appendix is your field guide for choosing tools. Throughout the book, we’ve focused on principles and patterns that transfer regardless of which tools you use. But eventually, you need to pick something—and the landscape is overwhelming.

The LLM tooling ecosystem changes faster than any book can track. New frameworks appear monthly. Vector databases add features quarterly. Pricing models shift. What I can give you is something more durable: decision frameworks for evaluating tools, honest assessments of trade-offs, and practical guidance based on production experience.

This appendix covers the major categories you’ll need to decide on:

  • Orchestration frameworks: LangChain, LlamaIndex, Semantic Kernel—or nothing at all
  • Vector databases: Where your embeddings live and how to choose
  • Embedding models: Converting text to vectors
  • Model Context Protocol (MCP): The industry standard for tool integration
  • Evaluation frameworks: Measuring RAG quality

What this appendix does not cover: LLM providers (the field moves too fast, and the choice is often made for you by your organization), cloud infrastructure (too variable), or fine-tuning frameworks (outside our scope).

One principle before we begin: start simple. Chapter 10’s engineering habit applies here—“Simplicity wins. Only add complexity when simple fails.” Many production systems use far less tooling than tutorials suggest. A direct API call to an LLM, a basic vector database, and well-designed prompts can take you surprisingly far.


A.1 Orchestration Frameworks

Orchestration frameworks promise to simplify building LLM applications. They provide abstractions for common patterns: chains of prompts, retrieval pipelines, agent loops, memory management. They can genuinely help—but they can also add complexity you don’t need.

The question isn’t “which framework is best?” It’s “do I need a framework at all?”

When to Use a Framework

Use a framework when:

  • You need multiple retrieval sources with different strategies
  • You’re building complex multi-step workflows with branching logic
  • You want built-in tracing and debugging tools
  • Your team lacks LLM-specific experience and needs guardrails
  • You need rapid prototyping before committing to production architecture

Skip the framework when:

  • Your use case is straightforward (single retrieval source, single model call)
  • You need maximum control over every step of the pipeline
  • You’re optimizing for latency (frameworks add overhead, typically 10-50ms)
  • You have strong opinions about implementation details
  • You’re building something the framework wasn’t designed for

The 80/20 observation: Many production systems use frameworks for prototyping, then partially or fully migrate to custom implementations for performance-critical paths. The framework helps you learn what you need; then you build exactly that.

LangChain

LangChain is the most popular LLM orchestration framework, with the largest ecosystem and community. It provides modular components for building chains, agents, retrieval systems, and memory—plus integrations with nearly every LLM provider, vector database, and tool you might want.

Strengths:

  • Largest ecosystem: 100+ vector store integrations, 50+ LLM providers
  • LangSmith provides excellent tracing and debugging for development
  • Comprehensive documentation and tutorials
  • Active development and responsive maintainers
  • LCEL (LangChain Expression Language) enables composable pipelines

Weaknesses:

  • Abstraction overhead can obscure what’s actually happening
  • Breaking changes between versions require migration effort
  • Can encourage over-engineering simple problems
  • Debugging complex chains requires understanding LangChain internals
  • The “LangChain way” may not match your preferred architecture

Best for: Rapid prototyping, teams new to LLM development, projects needing many integrations, and situations where LangSmith tracing provides value.

Basic RAG example:

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="documents",
    embedding_function=embeddings,
    persist_directory="./chroma_data"
)
llm = ChatOpenAI(model="gpt-4", temperature=0)

# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Stuffs all retrieved docs into context
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is context engineering?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

When to migrate away: When you need sub-100ms latency and the framework overhead matters. When debugging becomes harder than building from scratch. When LangChain’s abstractions fight your architecture rather than supporting it.

LlamaIndex

LlamaIndex focuses specifically on connecting LLMs to data. Where LangChain is a general-purpose framework, LlamaIndex excels at document processing, indexing strategies, and retrieval—the core of RAG systems.

Strengths:

  • Best-in-class document processing (PDF, HTML, code, structured data)
  • Sophisticated indexing strategies (vector, keyword, tree, knowledge graph)
  • Query engines that intelligently combine multiple retrieval strategies
  • Strong TypeScript support alongside Python
  • Cleaner abstractions for data-focused work than LangChain

Weaknesses:

  • Smaller ecosystem than LangChain
  • Less flexible for non-retrieval use cases (agents, complex workflows)
  • Can be heavyweight for simple applications
  • Documentation assumes some familiarity with RAG concepts

Best for: Document-heavy applications, complex retrieval strategies, knowledge bases, and teams that have some LLM experience and want specialized tools.

Multi-index query example:

from llama_index.core import VectorStoreIndex, KeywordTableIndex
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import RetrieverTool

# Create specialized indexes (assumes `documents` was loaded earlier,
# e.g. with SimpleDirectoryReader)
vector_index = VectorStoreIndex.from_documents(documents)
keyword_index = KeywordTableIndex.from_documents(documents)

# Router selects best index per query
retriever = RouterRetriever(
    selector=LLMSingleSelector.from_defaults(),
    retriever_tools=[
        RetrieverTool.from_defaults(
            retriever=vector_index.as_retriever(),
            description="Best for semantic similarity and conceptual queries"
        ),
        RetrieverTool.from_defaults(
            retriever=keyword_index.as_retriever(),
            description="Best for specific keyword and terminology lookups"
        ),
    ]
)

# Query - router picks appropriate index
nodes = retriever.retrieve("authentication flow diagram")

When to choose over LangChain: When your primary challenge is getting the right documents into context, especially with complex document structures or multiple data sources.

Semantic Kernel

Semantic Kernel is Microsoft’s SDK for integrating LLMs into applications. It’s enterprise-focused, with first-class support for C#, Python, and Java—and deep Azure integration.

Strengths:

  • First-class Azure OpenAI integration
  • Strong typing and enterprise design patterns
  • Excellent for C#/.NET development teams
  • Plugins architecture for extending capabilities
  • Good fit for existing Microsoft ecosystem

Weaknesses:

  • Smaller community than LangChain or LlamaIndex
  • Python version less mature than C#
  • Examples tend toward Azure (though it works with any LLM)
  • Enterprise patterns may be overkill for small projects

Best for: .NET/C# teams, Azure-first organizations, enterprise environments, and teams that prefer strongly-typed approaches.

Basic example:

import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

# Initialize kernel
kernel = sk.Kernel()

# Add LLM service
kernel.add_service(
    OpenAIChatCompletion(
        service_id="chat",
        ai_model_id="gpt-4",
        api_key=api_key
    )
)

# Create semantic function
summarize = kernel.create_function_from_prompt(
    prompt="Summarize this text concisely: {{$input}}",
    function_name="summarize",
    plugin_name="text"
)

# Use
result = await kernel.invoke(summarize, input="Long text here...")
print(result)

The “No Framework” Option

For many production systems, the right answer is: build it yourself.

This sounds like more work, but consider what a “framework-free” RAG system actually requires:

from openai import OpenAI
from your_vector_db import VectorDB  # Whatever you chose

class SimpleRAG:
    def __init__(self, collection: str):
        self.db = VectorDB(collection)
        self.client = OpenAI()

    def query(self, question: str, top_k: int = 3) -> dict:
        # Retrieve
        docs = self.db.search(question, limit=top_k)
        context = "\n\n".join([
            f"Source: {d.metadata['source']}\n{d.text}"
            for d in docs
        ])

        # Generate
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n\n{context}"
                },
                {"role": "user", "content": question}
            ]
        )

        return {
            "answer": response.choices[0].message.content,
            "sources": [d.metadata for d in docs],
            "tokens_used": response.usage.total_tokens
        }

That’s a working RAG system in under 30 lines. You have complete visibility into what’s happening. Debugging is straightforward. Latency is minimal. You can add exactly the features you need.

Advantages of no framework: Full control, minimal dependencies, easier debugging, lower latency, no breaking changes from upstream, exact behavior you want.

Disadvantages: More code to maintain, reinventing common patterns, no built-in tracing, slower initial development.

When this makes sense: You have specific performance requirements. Your use case is well-defined. You want maximum observability. Your team understands LLM fundamentals (which, having read this book, you do).

Framework Comparison

| Aspect | LangChain | LlamaIndex | Semantic Kernel | No Framework |
|---|---|---|---|---|
| Primary strength | Ecosystem | Document processing | Enterprise/.NET | Control |
| Learning curve | Medium | Medium | Medium-Low | Low (if you know LLMs) |
| Abstraction level | High | High | Medium | None |
| Community size | Largest | Large | Medium | N/A |
| Best language | Python | Python/TS | C# | Any |
| Debugging tools | LangSmith | LlamaTrace | Azure Monitor | Your own |
| Latency overhead | Medium-High | Medium | Low-Medium | None |
| When to use | Prototyping, many integrations | Complex retrieval | .NET shops | Production optimization |

A.2 Vector Databases

Vector databases store embeddings and enable similarity search. They’re the backbone of RAG systems—when you retrieve documents based on semantic similarity, a vector database is doing the work.

The choice matters for latency, cost, and operational complexity. But here’s the honest truth: for most applications, most choices will work. The differences matter at scale or for specific requirements.

Decision Framework

Before evaluating databases, answer these questions:

1. Scale: How many vectors?

  • Under 1 million: Most options work fine
  • 1-100 million: Need to consider performance and sharding
  • Over 100 million: Requires distributed architecture

2. Deployment: Where will it run?

  • Managed cloud: Zero ops, higher cost
  • Self-hosted: More control, operational burden
  • Embedded: Simplest, limited scale

3. Hybrid search: Do you need keyword + semantic?

  • If yes: Weaviate, Qdrant, or Elasticsearch
  • If no: Any option works

4. Latency: What’s your p99 target?

  • Under 50ms: Pinecone or Qdrant
  • Under 150ms: Any option
  • Flexible: Optimize for cost instead

5. Budget: What can you afford?

  • Significant budget: Pinecone (managed, fast)
  • Moderate budget: Qdrant Cloud, Weaviate Cloud
  • Minimal budget: Self-hosted options

Pinecone

Pinecone is a fully managed vector database—you don’t run any infrastructure. It focuses on performance and simplicity.

Strengths:

  • Lowest latency at scale (p99 around 47ms)
  • Zero operational overhead
  • Automatic scaling and replication
  • Excellent documentation
  • Generous free tier for development

Weaknesses:

  • No native hybrid search (dense vectors only)
  • Higher cost at scale versus self-hosted
  • Vendor lock-in (no self-hosted option)
  • Limited filtering compared to some alternatives

Pricing (2026):

  • Serverless: ~$0.10 per million vectors/month + query costs
  • Pods: Starting around $70/month for dedicated capacity

Best for: Teams that want zero ops, need best-in-class latency, and don’t require hybrid search.

from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
index = pc.Index("documents")

# Upsert vectors
index.upsert(vectors=[
    {
        "id": "doc-1",
        "values": embedding,  # Your 768-dim vector
        "metadata": {"source": "docs/intro.md", "category": "overview"}
    }
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    filter={"category": {"$eq": "overview"}}
)

for match in results.matches:
    print(f"{match.id}: {match.score:.3f}")

Weaviate

Weaviate is an open-source vector database with native hybrid search—combining BM25 keyword search with vector similarity in a single query.

Strengths:

  • Native hybrid search (BM25 + vector)
  • GraphQL API for flexible querying
  • Multi-tenancy support
  • Self-hosted or Weaviate Cloud
  • Active community and good documentation

Weaknesses:

  • Higher latency than Pinecone (p99 around 123ms)
  • More operational complexity if self-hosted
  • GraphQL has a learning curve
  • Resource-intensive for large deployments

Pricing:

  • Self-hosted: Free (pay for infrastructure)
  • Weaviate Cloud: Starting around $25/month

Best for: Teams needing hybrid search, comfortable with self-hosting, or wanting flexible query capabilities.

import weaviate

client = weaviate.Client("http://localhost:8080")

# Hybrid search combines keyword and vector
result = (
    client.query
    .get("Document", ["text", "source", "category"])
    .with_hybrid(
        query="authentication best practices",
        alpha=0.5  # 0 = pure keyword, 1 = pure vector
    )
    .with_limit(10)
    .do()
)

for doc in result["data"]["Get"]["Document"]:
    print(f"{doc['source']}: {doc['text'][:100]}...")

Chroma

Chroma is an open-source, embedded-first vector database. It’s designed to be the simplest way to get started—pip install chromadb and you’re running.

Strengths:

  • Zero setup for development
  • Embedded mode runs in-process
  • Simple, intuitive Python API
  • Automatic embedding (pass text, get vectors)
  • Good for prototyping and small datasets

Weaknesses:

  • Not designed for production scale (struggles above 1M vectors)
  • Limited cloud offering
  • Fewer advanced features
  • Performance degrades at scale

Pricing: Free for self-hosted; Chroma Cloud pricing varies.

Best for: Prototyping, local development, tutorials, and small production deployments (under 100K vectors).

import chromadb

# In-memory for development
client = chromadb.Client()

# Or persistent
client = chromadb.PersistentClient(path="./chroma_data")

collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Add documents (auto-embeds if you configure an embedding function)
collection.add(
    documents=["First document text", "Second document text"],
    ids=["doc-1", "doc-2"],
    metadatas=[{"source": "a.md"}, {"source": "b.md"}]
)

# Query
results = collection.query(
    query_texts=["search query"],
    n_results=5,
    where={"source": "a.md"}  # Optional filter
)

Qdrant

Qdrant is an open-source vector database written in Rust, focusing on performance and advanced filtering. It offers a good balance between features and speed.

Strengths:

  • Fast performance (p99 around 60ms)
  • Advanced filtering capabilities
  • Hybrid search support
  • Self-hosted or Qdrant Cloud
  • Efficient Rust implementation
  • Good balance of features and performance

Weaknesses:

  • Smaller community than Weaviate
  • Documentation less comprehensive
  • Fewer third-party integrations

Pricing:

  • Self-hosted: Free
  • Qdrant Cloud: Starting around $25/month

Best for: Teams wanting a balanced option—good performance, hybrid search capability, reasonable cost.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, Filter,
    FieldCondition, MatchValue
)

client = QdrantClient("localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)

# Search with filter
results = client.search(
    collection_name="documents",
    query_vector=embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="code")
            )
        ]
    ),
    limit=10
)

pgvector (PostgreSQL Extension)

pgvector adds vector similarity search to PostgreSQL. If you’re already running PostgreSQL, you can add vector search without new infrastructure.

Strengths:

  • Uses existing PostgreSQL infrastructure
  • Full SQL capabilities alongside vectors
  • Transactional consistency with your other data
  • Familiar tooling (backups, monitoring, replication)
  • No new database to learn or operate

Weaknesses:

  • Slower than purpose-built vector databases
  • Limited to PostgreSQL
  • Scaling requires PostgreSQL scaling expertise
  • Less sophisticated vector operations

Best for: Teams already using PostgreSQL who want vector search without adding infrastructure complexity.

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    source VARCHAR(255),
    embedding vector(768)
);

-- Create index for fast search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Search (using cosine distance)
SELECT content, source, 1 - (embedding <=> $1) as similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
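
If you query pgvector from application code, a thin helper is usually all you need. Here is a minimal sketch using psycopg 3 and the pgvector helper package (pip install psycopg pgvector); the connection string is illustrative:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def search_documents(query_embedding: list[float], limit: int = 10):
    with psycopg.connect("dbname=app user=app") as conn:
        register_vector(conn)  # registers the vector type adapter
        rows = conn.execute(
            """
            SELECT content, source, 1 - (embedding <=> %s) AS similarity
            FROM documents
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (np.array(query_embedding), np.array(query_embedding), limit),
        ).fetchall()
    return rows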

Vector Database Comparison

| Database | Latency (p99) | Hybrid Search | Deployment | Cost (1M vectors) | Best For |
|---|---|---|---|---|---|
| Pinecone | ~47ms | No | Managed | ~$100/mo | Zero ops, max speed |
| Weaviate | ~123ms | Yes (native) | Both | ~$50-100/mo | Hybrid search |
| Chroma | ~200ms | No | Embedded/Cloud | Free-$50/mo | Prototyping |
| Qdrant | ~60ms | Yes | Both | ~$25-75/mo | Balanced choice |
| pgvector | ~150ms | Via SQL | Self-hosted | Infra only | PostgreSQL shops |
| Elasticsearch | ~150ms | Yes (robust) | Both | Varies | Full-text + vector |
| Milvus | ~80ms | No | Self-hosted | Infra only | Large scale |


Practical Recommendations

Just starting out? Use Chroma locally, then migrate to Qdrant or Pinecone for production.

Need hybrid search? Weaviate or Qdrant. Both handle BM25 + vector well.

Have PostgreSQL already? pgvector avoids adding infrastructure. Performance is good enough for many use cases.

Need maximum performance? Pinecone, but be prepared for the cost at scale.

Cost-sensitive at scale? Self-hosted Qdrant or Milvus with your own infrastructure.


A.3 Embedding Models

Embedding models convert text into vectors—the dense numerical representations that enable semantic search. Your choice affects retrieval quality, latency, and cost.

The good news: most modern embedding models are good enough. The difference between a good model and a great model is often smaller than the difference between good and bad chunking strategies (Chapter 6).

The bad news: switching embedding models later requires re-embedding all your documents. Choose thoughtfully, but don’t overthink it.

Decision Framework

Key trade-offs:

  • Quality vs. latency: Larger models produce better embeddings but run slower
  • Cost structure: API models charge per token; self-hosted has infrastructure costs
  • Language support: Some models are English-only; others handle 100+ languages
  • Dimensions: Higher dimensions capture more nuance but use more storage

Quick guidance:

| Situation | Recommendation |
|---|---|
| Starting out | OpenAI text-embedding-3-small |
| Cost-sensitive | all-MiniLM-L6-v2 (free, fast) |
| Highest quality | OpenAI text-embedding-3-large or E5-Mistral-7B |
| Multilingual | BGE-M3 or OpenAI models |
| Latency-critical | all-MiniLM-L6-v2 (~10ms) |

OpenAI Embedding Models

OpenAI’s embedding models are the most commonly used in production. They’re managed (no infrastructure), high-quality, and reasonably priced.

text-embedding-3-small (Recommended starting point)

  • Dimensions: 512-1536 (configurable)
  • Cost: $0.02 per million tokens
  • Latency: ~20ms
  • Quality: Excellent for most use cases
  • Languages: Multilingual

text-embedding-3-large

  • Dimensions: 256-3072 (configurable)
  • Cost: $0.13 per million tokens
  • Latency: ~50ms
  • Quality: Best available from OpenAI
  • Use when: Quality is critical and cost isn’t a constraint
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(
        model=model,
        input=text,
        dimensions=768  # Optional: reduce for efficiency
    )
    return response.data[0].embedding

# Single text
embedding = get_embedding("What is context engineering?")

# Batch for efficiency
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

Open-Source Models

Open-source models run on your infrastructure—free per-token, but you pay for compute.

all-MiniLM-L6-v2 (sentence-transformers)

The workhorse of open-source embeddings. Fast, free, good quality.

  • Dimensions: 384
  • Cost: Free (self-hosted)
  • Latency: ~10ms on CPU
  • Quality: Good for most applications
  • Languages: Primarily English
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single embedding
embedding = model.encode("What is context engineering?")

# Batch (more efficient)
embeddings = model.encode([
    "First document",
    "Second document",
    "Third document"
])

E5-Mistral-7B

A larger model that approaches OpenAI quality while remaining self-hostable.

  • Dimensions: 768
  • Cost: Free (requires GPU)
  • Latency: ~40ms on GPU
  • Quality: Excellent, competitive with OpenAI
  • Languages: Multilingual

BGE-M3

Excellent multilingual model that also supports hybrid retrieval (dense + sparse vectors).

  • Dimensions: 1024
  • Cost: Free
  • Latency: ~35ms
  • Quality: Excellent for multilingual
  • Unique feature: Outputs both dense and sparse representations
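
A sketch of getting both representations, assuming the FlagEmbedding package (pip install FlagEmbedding); the output keys follow that library's conventions:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["What is context engineering?", "Was ist Context Engineering?"],
    return_dense=True,   # 1024-dim dense vectors for semantic similarity
    return_sparse=True   # per-token lexical weights for keyword-style matching
)

dense_vectors = output["dense_vecs"]        # one 1024-dim vector per input
sparse_weights = output["lexical_weights"]  # one {token_id: weight} dict per input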

Embedding Model Comparison

| Model | Dimensions | Cost/1M tokens | Latency | Quality | Languages |
|---|---|---|---|---|---|
| text-embedding-3-small | 512-1536 | $0.02 | 20ms | Excellent | Multi |
| text-embedding-3-large | 256-3072 | $0.13 | 50ms | Best | Multi |
| all-MiniLM-L6-v2 | 384 | Free | 10ms | Good | EN |
| E5-Mistral-7B | 768 | Free | 40ms | Excellent | Multi |
| BGE-M3 | 1024 | Free | 35ms | Excellent | 100+ |

Practical Guidance

Don’t overthink embedding choice. For most applications, text-embedding-3-small or all-MiniLM-L6-v2 is sufficient. The chunking strategy (Chapter 6) and reranking (Chapter 7) matter more.

When to invest in better embeddings:

  • Your domain has highly specialized terminology
  • Retrieval quality is demonstrably the bottleneck (measure first!)
  • You’ve already optimized chunking and added reranking

Dimension trade-offs (a storage sketch follows this list):

  • 384 dimensions: Fast, low storage, slightly lower quality
  • 768 dimensions: Good balance (most common)
  • 1024+ dimensions: Higher quality, more storage and compute
  • 3072 dimensions: Maximum quality, 3x the cost
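
For the storage side of this trade-off, a quick back-of-the-envelope check (float32 vectors take 4 bytes per dimension; HNSW or IVF indexes add overhead on top):

def raw_vector_storage_gb(num_vectors: int, dimensions: int) -> float:
    bytes_per_float = 4  # float32
    return num_vectors * dimensions * bytes_per_float / 1e9

print(raw_vector_storage_gb(1_000_000, 384))   # ~1.5 GB
print(raw_vector_storage_gb(1_000_000, 1536))  # ~6.1 GB
print(raw_vector_storage_gb(1_000_000, 3072))  # ~12.3 GB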

Migration warning: Changing embedding models means re-embedding all documents. For a million documents at $0.02/1M tokens with 500 tokens each, that’s about $10—not terrible. But the pipeline work and testing take time. Plan for this if you start simple.


A.4 Model Context Protocol (MCP)

MCP standardizes how LLMs connect to tools and data sources. Rather than each application implementing custom function-calling logic, MCP provides a protocol that tools can implement once and any MCP-compatible client can use.

Chapter 8 covers tool use in depth. This section provides practical resources for implementing MCP.

What MCP Provides

Core capabilities:

  • Standardized tool definitions with JSON Schema
  • Server-client architecture for hosting tools
  • Transport options (stdio for local, HTTP for remote)
  • Type-safe request/response handling
  • Built-in error handling patterns

Why it matters:

  • Write a tool once, use it with any MCP client
  • Growing ecosystem of pre-built servers
  • Standardized patterns reduce bugs
  • Easier to share tools across projects and teams

Official Resources

Specification and documentation:

  • Spec: https://modelcontextprotocol.io
  • Concepts: https://modelcontextprotocol.io/docs/concepts

SDKs:

  • Python: pip install mcp
  • TypeScript: npm install @modelcontextprotocol/sdk

Pre-built servers (official and community):

  • File system operations
  • Database queries (PostgreSQL, SQLite)
  • Git operations
  • GitHub, Slack, Notion integrations
  • Web browsing and search
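
Most MCP-capable clients launch these servers from a JSON configuration file. The file location and exact schema depend on the client, but a typical entry looks roughly like this (the path is illustrative):

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/project"]
    }
  }
}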

Building a Custom MCP Server

Here’s a minimal MCP server that provides codebase search:

from mcp.server import Server
from mcp.types import Tool, TextContent
import asyncio

server = Server("codebase-search")

@server.tool()
async def search_code(
    query: str,
    file_pattern: str = "*.py",
    max_results: int = 10
) -> list[TextContent]:
    """
    Search the codebase for code matching a query.

    Args:
        query: Search term or pattern
        file_pattern: Glob pattern for files to search
        max_results: Maximum results to return
    """
    # Your search implementation
    results = await do_search(query, file_pattern, max_results)

    return [
        TextContent(
            type="text",
            text=f"File: {r.file}\nLine {r.line}:\n{r.content}"
        )
        for r in results
    ]

@server.tool()
async def read_file(path: str) -> list[TextContent]:
    """
    Read a file from the codebase.

    Args:
        path: Relative path to the file
    """
    # Validate path is within allowed directory
    if not is_safe_path(path):
        raise ValueError(f"Access denied: {path}")

    content = await read_file_content(path)
    return [TextContent(type="text", text=content)]

if __name__ == "__main__":
    # Run with stdio transport (for local use)
    asyncio.run(server.run_stdio())

Best Practices for MCP Tools

Keep tools focused. One clear responsibility per tool. “search_and_summarize” should be two tools.

Write clear descriptions. The LLM reads these to decide when to use the tool:

@server.tool()
async def search_code(query: str) -> list[TextContent]:
    """
    Search for code in the repository using semantic search.

    Use this when the user asks about:
    - Finding where something is implemented
    - Locating specific functions or classes
    - Understanding how features work

    Do NOT use for:
    - Reading specific files (use read_file instead)
    - Running code (use execute_code instead)
    """

Handle errors gracefully. Return helpful error messages, not stack traces:

@server.tool()
async def read_file(path: str) -> list[TextContent]:
    try:
        content = await read_file_content(path)
        return [TextContent(type="text", text=content)]
    except FileNotFoundError:
        return [TextContent(
            type="text",
            text=f"File not found: {path}. Use search_code to find the correct path."
        )]
    except PermissionError:
        return [TextContent(
            type="text",
            text=f"Access denied: {path} is outside the allowed directory."
        )]

Include examples in descriptions when the usage isn’t obvious.

MCP vs. Direct Function Calling

Use MCP when:

  • You want reusable tools across projects
  • You need to share tools with your team
  • You want standardized error handling
  • You’re building for multiple LLM providers

Use direct function calling when:

  • Maximum simplicity is the goal
  • Tools are project-specific and won’t be reused
  • You’re optimizing for minimal latency
  • The MCP overhead isn’t worth it for your use case

For CodebaseAI, we used direct function calling because the tools are tightly integrated with the application. For a general-purpose coding assistant, MCP would make more sense.
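
For comparison, here is a minimal sketch of the direct approach using the OpenAI chat completions tools API; the tool schema and the search_code_impl helper are illustrative, not part of CodebaseAI:

from openai import OpenAI
import json

client = OpenAI()

# Tool schema passed straight to the model; no MCP server in between.
tools = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search the codebase for code matching a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search term or pattern"}
            },
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Where is authentication implemented?"}],
    tools=tools
)

# Dispatch any tool calls yourself, then send the results back in a follow-up message.
for tool_call in response.choices[0].message.tool_calls or []:
    args = json.loads(tool_call.function.arguments)
    results = search_code_impl(**args)  # your own implementation (hypothetical)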

Streaming Response Handling

Long-running MCP tools should stream intermediate results back to the client rather than blocking until completion. This is especially important for LLM interactions where users expect progressive feedback.

MCP supports streaming through the TextContent type with incremental updates. Here’s how to implement a tool that streams results:

from mcp.server import Server
from mcp.types import TextContent
import asyncio

server = Server("streaming-tools")

@server.tool()
async def analyze_large_dataset(file_path: str):
    """
    Analyze a large dataset, streaming results as they're computed.

    Args:
        file_path: Path to the dataset file
    """

    # Simulate processing chunks of a dataset
    async def stream_analysis():
        for i in range(10):
            # Process chunk i
            chunk_result = await process_chunk(file_path, i)

            # Stream intermediate result
            yield TextContent(
                type="text",
                text=f"Chunk {i}: {chunk_result}\n"
            )

            # Small delay to simulate work
            await asyncio.sleep(0.1)

    # Collect and return streamed results
    all_results = []
    async for result in stream_analysis():
        all_results.append(result)

    return all_results

@server.tool()
async def search_and_analyze(query: str, file_pattern: str = "*.txt"):
    """Search files and stream analysis results as they arrive."""

    async def search_stream():
        files = await find_files(file_pattern)

        for file_path in files:
            # Search file
            matches = await search_file(file_path, query)

            if matches:
                # Stream result for this file
                yield TextContent(
                    type="text",
                    text=f"File: {file_path}\n"
                           f"Found {len(matches)} matches\n"
                           f"Preview: {matches[0][:100]}...\n\n"
                )

    results = []
    async for chunk in search_stream():
        results.append(chunk)

    return results

Key patterns:

  • Use async def and yield to stream results incrementally
  • Each yielded TextContent represents an update sent to the client
  • The LLM sees results in real-time, enabling progressive reasoning
  • Large operations become more responsive and user-friendly

When to use streaming:

  • Operations taking more than 1 second
  • Data analysis or processing pipelines
  • Search operations across multiple sources
  • Any task where intermediate results are useful to the LLM

Error Recovery Patterns

Real MCP servers encounter failures: network issues, tool bugs, downstream service outages. Production systems need recovery strategies.

Retry with exponential backoff:

from mcp.server import Server
from mcp.types import TextContent
import asyncio
import random

server = Server("resilient-tools")

async def call_with_retry(
    func,
    *args,
    max_retries: int = 3,
    base_delay: float = 1.0,
    **kwargs
):
    """
    Call a function with exponential backoff retry.

    Args:
        func: Async function to call
        max_retries: Maximum retry attempts
        base_delay: Initial delay in seconds (doubles on each retry)
    """
    last_error = None

    for attempt in range(max_retries):
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            last_error = e

            if attempt < max_retries - 1:
                # Exponential backoff with jitter to prevent thundering herd
                delay = base_delay * (2 ** attempt)
                jitter = random.uniform(0, delay * 0.1)
                await asyncio.sleep(delay + jitter)

    raise last_error

@server.tool()
async def query_database(query: str, table: str):
    """Query a database with automatic retry on transient failures."""

    async def do_query():
        # Your database query logic
        return await db.execute(f"SELECT * FROM {table} WHERE {query}")

    try:
        result = await call_with_retry(do_query, max_retries=3)
        return [TextContent(type="text", text=f"Result: {result}")]
    except Exception as e:
        return [TextContent(
            type="text",
            text=f"Query failed after retries: {str(e)}"
        )]

Circuit breaker for unreliable services:

from enum import Enum
from datetime import datetime, timedelta

import httpx

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    """Prevents cascading failures by stopping calls to failing services."""

    def __init__(self, failure_threshold: int = 5, timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = None

    async def call(self, func, *args, **kwargs):
        """Execute func, managing circuit state."""

        if self.state == CircuitState.OPEN:
            # Check if timeout has passed
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN - service unavailable")

        try:
            result = await func(*args, **kwargs)
            # Success - reset on recovery
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = datetime.now()

            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

            raise e

# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

@server.tool()
async def call_external_api(endpoint: str, params: dict):
    """Call external API with circuit breaker protection."""

    async def do_call():
        async with httpx.AsyncClient() as http:
            return await http.get(endpoint, params=params)

    try:
        result = await breaker.call(do_call)
        return [TextContent(type="text", text=f"Response: {result.json()}")]
    except Exception as e:
        return [TextContent(
            type="text",
            text=f"External service unavailable: {str(e)}"
        )]

Graceful degradation when a tool fails:

@server.tool()
async def get_user_info(user_id: str):
    """
    Get user information from the primary service.
    Falls back to cache if primary is unavailable.
    """

    try:
        # Try primary service
        user = await primary_user_service.get(user_id)
        return [TextContent(type="text", text=f"User info: {user}")]
    except Exception as e:
        # Fall back to cache
        cached = await cache.get(f"user:{user_id}")
        if cached:
            return [TextContent(
                type="text",
                text=f"User info (from cache, may be stale): {cached}\n"
                     f"Note: Primary service unavailable, using cached data."
            )]
        else:
            return [TextContent(
                type="text",
                text=f"User info unavailable: {str(e)}\n"
                     f"The primary service is down and no cached data exists."
            )]

@server.tool()
async def search_documents(query: str, use_semantic: bool = True):
    """
    Search documents, degrading gracefully if semantic search fails.
    """

    if use_semantic:
        try:
            # Try semantic search first
            return await semantic_search(query)
        except Exception:
            # Fall through to keyword search below
            pass

    # Keyword search: requested directly, or fallback after a semantic failure
    results = await keyword_search(query)
    return [TextContent(
        type="text",
        text=f"Results (keyword search, not semantic): {results}\n"
             f"Note: Semantic search unavailable or not requested."
    )]

Error recovery best practices:

  • Retry transient errors (timeouts, connection resets), not permanent errors (bad auth, invalid input)
  • Use exponential backoff with jitter to avoid overwhelming recovering services
  • Implement circuit breakers to prevent cascading failures
  • Provide graceful degradation when critical services fail
  • Always include error context so the LLM understands what went wrong

Testing MCP Servers Without a Full Client

You can test MCP tool functions directly without standing up a full LLM client. This is crucial for rapid iteration and validating error handling.

Minimal test harness:

import pytest
import asyncio
from mcp.server import Server
from mcp.types import TextContent

# Your MCP server
server = Server("testable-tools")

@server.tool()
async def process_text(text: str, transform: str = "upper") -> list[TextContent]:
    """
    Process text with specified transformation.

    Args:
        text: Input text
        transform: 'upper', 'lower', or 'reverse'
    """
    if not text:
        raise ValueError("text cannot be empty")

    if transform == "upper":
        result = text.upper()
    elif transform == "lower":
        result = text.lower()
    elif transform == "reverse":
        result = text[::-1]
    else:
        raise ValueError(f"Unknown transform: {transform}")

    return [TextContent(type="text", text=result)]

# Tests
class TestProcessText:

    @pytest.mark.asyncio
    async def test_uppercase_transformation(self):
        """Test uppercase transformation."""
        result = await process_text("hello world", transform="upper")
        assert result[0].text == "HELLO WORLD"

    @pytest.mark.asyncio
    async def test_lowercase_transformation(self):
        """Test lowercase transformation."""
        result = await process_text("HELLO WORLD", transform="lower")
        assert result[0].text == "hello world"

    @pytest.mark.asyncio
    async def test_reverse_transformation(self):
        """Test reverse transformation."""
        result = await process_text("hello", transform="reverse")
        assert result[0].text == "olleh"

    @pytest.mark.asyncio
    async def test_empty_input_raises_error(self):
        """Test that empty input raises ValueError."""
        with pytest.raises(ValueError, match="text cannot be empty"):
            await process_text("", transform="upper")

    @pytest.mark.asyncio
    async def test_invalid_transform_raises_error(self):
        """Test that invalid transform raises ValueError."""
        with pytest.raises(ValueError, match="Unknown transform"):
            await process_text("hello", transform="unknown")

    @pytest.mark.asyncio
    async def test_returns_text_content_type(self):
        """Test that result is proper TextContent."""
        result = await process_text("test")
        assert len(result) == 1
        assert isinstance(result[0], TextContent)
        assert result[0].type == "text"

Schema validation:

import json
from jsonschema import validate, ValidationError

def validate_tool_schema(tool_func, test_cases: list[dict]):
    """
    Validate that a tool's inputs match its defined schema.
    """
    # Get tool schema (implementation-specific)
    schema = get_tool_schema(tool_func)

    for case in test_cases:
        try:
            validate(instance=case, schema=schema)
            print(f"✓ Valid: {case}")
        except ValidationError as e:
            print(f"✗ Invalid: {case} - {e.message}")

# Usage
test_inputs = [
    {"text": "hello", "transform": "upper"},
    {"text": "world"},  # transform is optional
    {"text": ""},  # Empty but valid schema-wise
    {"transform": "upper"},  # Missing required field
]

validate_tool_schema(process_text, test_inputs)

Integration testing with mock client:

class MockMCPClient:
    """
    Minimal MCP client for testing servers.
    Calls tools directly without network.
    """

    def __init__(self, server: Server):
        self.server = server
        self.tools = {}

    async def get_tools(self):
        """Get available tools from server."""
        # This depends on your server implementation
        return self.server.tools

    async def call_tool(self, tool_name: str, **kwargs):
        """Call a tool by name with arguments."""
        tool_func = getattr(self.server, tool_name, None)
        if not tool_func:
            raise ValueError(f"Tool not found: {tool_name}")

        return await tool_func(**kwargs)

@pytest.mark.asyncio
async def test_with_mock_client():
    """Test server tools using mock client."""
    client = MockMCPClient(server)

    # Get tools
    tools = await client.get_tools()
    assert "process_text" in tools

    # Call tool
    result = await client.call_tool("process_text", text="hello", transform="upper")
    assert result[0].text == "HELLO"

    # Test error handling
    with pytest.raises(ValueError):
        await client.call_tool("process_text", text="", transform="upper")

Performance testing:

import time
import statistics

@pytest.mark.asyncio
async def test_tool_performance():
    """Verify tool meets latency requirements."""
    latencies = []
    iterations = 100

    for _ in range(iterations):
        start = time.time()
        result = await process_text("test" * 100)
        latencies.append((time.time() - start) * 1000)  # Convert to ms

    avg_latency = statistics.mean(latencies)
    p99_latency = sorted(latencies)[int(len(latencies) * 0.99)]

    print(f"Average latency: {avg_latency:.2f}ms")
    print(f"P99 latency: {p99_latency:.2f}ms")

    # Assert performance targets
    assert avg_latency < 50, f"Average latency too high: {avg_latency:.2f}ms"
    assert p99_latency < 100, f"P99 latency too high: {p99_latency:.2f}ms"

Testing strategy:

  • Unit test each tool function in isolation
  • Validate inputs against schema
  • Test both happy paths and error cases
  • Use mock clients for integration testing
  • Measure performance against targets
  • Keep tests fast (no external services)

A.5 Evaluation Frameworks

You can’t improve what you don’t measure. Chapter 12 covers testing AI systems in depth. This section provides practical guidance on evaluation tools.

RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) is the industry standard for evaluating RAG systems. It provides metrics that measure both retrieval quality and generation quality.

Core metrics:

| Metric | What it measures | Target |
|---|---|---|
| Context Precision | Are retrieved docs ranked correctly? | > 0.7 |
| Context Recall | Did we retrieve all needed info? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.8 |
| Answer Relevancy | Does the answer address the question? | > 0.8 |

Installation: pip install ragas

Basic usage:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# Prepare evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How does chunking work?"],
    "answer": ["RAG is...", "Chunking splits..."],
    "contexts": [["Retrieved doc 1", "Retrieved doc 2"], ["Doc A", "Doc B"]],
    "ground_truth": ["RAG retrieves...", "Chunking divides..."]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
result = evaluate(
    dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ]
)

print(result)
# {'context_precision': 0.82, 'context_recall': 0.75, ...}

Interpreting results:

  • 0.9+: Production ready
  • 0.7-0.9: Good, worth optimizing
  • 0.5-0.7: Significant issues to investigate
  • Below 0.5: Fundamental problems

Custom Evaluation

For domain-specific needs, build evaluators tailored to your requirements:

import json

from openai import OpenAI

class CodeAnswerEvaluator:
    """Evaluates answers about code for technical accuracy."""

    def __init__(self):
        self.client = OpenAI()

    def evaluate(
        self,
        question: str,
        answer: str,
        context: str,
        reference_code: str = None
    ) -> dict:
        prompt = f"""Evaluate this answer about code.

Question: {question}
Answer: {answer}
Context provided: {context}
{f"Reference code: {reference_code}" if reference_code else ""}

Rate each dimension 0-1:
1. Technical accuracy: Are code references correct?
2. Completeness: Does it fully answer the question?
3. Groundedness: Is everything supported by context?
4. Clarity: Is the explanation clear?

Return JSON: {{"accuracy": X, "completeness": X, "groundedness": X, "clarity": X}}
"""

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)

Evaluation Strategy

Tiered evaluation (from Chapter 12):

| Tier | Frequency | Metrics | Dataset size |
|---|---|---|---|
| 1 | Every commit | Latency, error rate, basic quality | 50 examples |
| 2 | Weekly | Full RAGAS, category breakdown | 200 examples |
| 3 | Monthly | Human evaluation, edge cases | 500+ examples |

Building evaluation datasets:

  1. Start with 50-100 examples covering core use cases
  2. Add production queries that revealed problems
  3. Include edge cases and adversarial examples
  4. Expand to 500+ for statistical significance
  5. Maintain category balance (don’t over-index on easy cases); see the sketch after this list
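
One lightweight way to keep such a dataset is a JSONL file with category tags, so balance stays measurable as the set grows. A minimal sketch (file name and fields are illustrative):

import json
from collections import Counter

def load_eval_set(path: str = "eval_set.jsonl") -> list[dict]:
    # Each line: {"question": ..., "ground_truth": ..., "category": "edge_case"}
    with open(path) as f:
        return [json.loads(line) for line in f]

examples = load_eval_set()
print(Counter(e["category"] for e in examples))  # spot over- or under-represented categories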

A.6 Quick Reference

“I Need to Choose…” Decision Guide

| If you need… | Choose… | Why |
|---|---|---|
| Fastest prototyping | LangChain + Chroma | Largest ecosystem, simplest setup |
| Best RAG quality | LlamaIndex + Pinecone | Document-focused + fastest retrieval |
| Hybrid search | Weaviate or Qdrant | Native BM25 + vector |
| Zero operations | Pinecone | Fully managed |
| Lowest cost | Self-hosted Qdrant + MiniLM | No API costs |
| .NET environment | Semantic Kernel | First-class C# support |
| Existing PostgreSQL | pgvector | No new infrastructure |
| Maximum control | No framework | Build exactly what you need |

Cost Estimation

Monthly costs for 1M documents, 1000 queries/day:

| Setup | Embedding | Storage | Generation | Total |
|---|---|---|---|---|
| OpenAI + Pinecone | ~$20 | ~$100 | ~$150 | ~$270/mo |
| OpenAI + Qdrant Cloud | ~$20 | ~$50 | ~$150 | ~$220/mo |
| OpenAI + self-hosted Qdrant | ~$20 | ~$30 (infra) | ~$150 | ~$200/mo |
| Self-hosted everything | $0 | ~$50 | ~$50 | ~$100/mo |

Generation costs assume GPT-4 with ~2K tokens per query. Adjust for your model and usage.
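
If your numbers differ, the arithmetic behind the generation column is easy to redo; the per-million price below is illustrative, so plug in your model's current blended input/output rate:

def monthly_generation_cost(queries_per_day: int, tokens_per_query: int,
                            price_per_million_tokens: float) -> float:
    monthly_tokens = queries_per_day * 30 * tokens_per_query
    return monthly_tokens / 1_000_000 * price_per_million_tokens

# 1000 queries/day at ~2K tokens each is ~60M tokens per month
print(monthly_generation_cost(1000, 2000, 2.50))  # -> 150.0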

Chapter Cross-References

| This Appendix Section | Related Chapter | Key Concepts |
|---|---|---|
| A.1 Orchestration Frameworks | Ch 8: Tool Use | When tools need coordination |
| A.2 Vector Databases | Ch 6: RAG Fundamentals | Retrieval infrastructure |
| A.3 Embedding Models | Ch 6: RAG Fundamentals | Semantic representation |
| A.3 Embedding Models | Ch 7: Advanced Retrieval | Quality vs. cost trade-offs |
| A.4 MCP | Ch 8: Tool Use | Tool architecture patterns |
| A.5 Evaluation | Ch 12: Testing AI Systems | Metrics and methodology |

Version Note

Tool versions and pricing in this appendix reflect the state as of early 2026. The principles and decision frameworks are designed to remain useful even as specific tools evolve. When in doubt, check official documentation for current details.



Appendix Cross-References

| This Section | Related Appendix | Connection |
|---|---|---|
| A.2 Vector Databases | Appendix E: pgvector, Vector Database | Glossary definitions |
| A.3 Embedding Models | Appendix D: D.3 Embedding Cost Calculator | Cost implications |
| A.4 MCP | Appendix B: B.5 Tool Use patterns | Implementation patterns |
| A.5 Evaluation Frameworks | Appendix B: B.9 Testing & Debugging | Evaluation patterns |
| A.6 Quick Reference | Appendix D: D.7 Quick Reference | Cost estimates |

Remember: the best tool is the one you understand deeply enough to debug at 3 AM. Fancy abstractions that obscure behavior aren’t helping you—they’re hiding problems until the worst possible moment. Start simple, add complexity only when you’ve hit real limits, and always maintain the ability to see what’s actually happening.