Appendix A: Tool and Framework Reference
Appendix A, v2.1 — Early 2026
This appendix reflects the tool and pricing landscape as of early 2026. Specific versions, pricing, and feature sets will change. The decision frameworks and evaluation criteria remain relevant.
This appendix is your field guide for choosing tools. Throughout the book, we’ve focused on principles and patterns that transfer regardless of which tools you use. But eventually, you need to pick something—and the landscape is overwhelming.
The LLM tooling ecosystem changes faster than any book can track. New frameworks appear monthly. Vector databases add features quarterly. Pricing models shift. What I can give you is something more durable: decision frameworks for evaluating tools, honest assessments of trade-offs, and practical guidance based on production experience.
This appendix covers the major categories you’ll need to decide on:
- Orchestration frameworks: LangChain, LlamaIndex, Semantic Kernel—or nothing at all
- Vector databases: Where your embeddings live and how to choose
- Embedding models: Converting text to vectors
- Model Context Protocol (MCP): The industry standard for tool integration
- Evaluation frameworks: Measuring RAG quality
What this appendix does not cover: LLM providers (the field moves too fast, and the choice is often made for you by your organization), cloud infrastructure (too variable), or fine-tuning frameworks (outside our scope).
One principle before we begin: start simple. Chapter 10’s engineering habit applies here—“Simplicity wins. Only add complexity when simple fails.” Many production systems use far less tooling than tutorials suggest. A direct API call to an LLM, a basic vector database, and well-designed prompts can take you surprisingly far.
A.1 Orchestration Frameworks
Orchestration frameworks promise to simplify building LLM applications. They provide abstractions for common patterns: chains of prompts, retrieval pipelines, agent loops, memory management. They can genuinely help—but they can also add complexity you don’t need.
The question isn’t “which framework is best?” It’s “do I need a framework at all?”
When to Use a Framework
Use a framework when:
- You need multiple retrieval sources with different strategies
- You’re building complex multi-step workflows with branching logic
- You want built-in tracing and debugging tools
- Your team lacks LLM-specific experience and needs guardrails
- You need rapid prototyping before committing to production architecture
Skip the framework when:
- Your use case is straightforward (single retrieval source, single model call)
- You need maximum control over every step of the pipeline
- You’re optimizing for latency (frameworks add overhead, typically 10-50ms)
- You have strong opinions about implementation details
- You’re building something the framework wasn’t designed for
The 80/20 observation: Many production systems use frameworks for prototyping, then partially or fully migrate to custom implementations for performance-critical paths. The framework helps you learn what you need; then you build exactly that.
LangChain
LangChain is the most popular LLM orchestration framework, with the largest ecosystem and community. It provides modular components for building chains, agents, retrieval systems, and memory—plus integrations with nearly every LLM provider, vector database, and tool you might want.
Strengths:
- Largest ecosystem: 100+ vector store integrations, 50+ LLM providers
- LangSmith provides excellent tracing and debugging for development
- Comprehensive documentation and tutorials
- Active development and responsive maintainers
- LCEL (LangChain Expression Language) enables composable pipelines
Weaknesses:
- Abstraction overhead can obscure what’s actually happening
- Breaking changes between versions require migration effort
- Can encourage over-engineering simple problems
- Debugging complex chains requires understanding LangChain internals
- The “LangChain way” may not match your preferred architecture
Best for: Rapid prototyping, teams new to LLM development, projects needing many integrations, and situations where LangSmith tracing provides value.
Basic RAG example:
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
# Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
collection_name="documents",
embedding_function=embeddings,
persist_directory="./chroma_data"
)
llm = ChatOpenAI(model="gpt-4", temperature=0)
# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Stuffs all retrieved docs into context
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True
)
# Query
result = qa_chain.invoke({"query": "What is context engineering?"})
print(result["result"])
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")
When to migrate away: When you need sub-100ms latency and the framework overhead matters. When debugging becomes harder than building from scratch. When LangChain’s abstractions fight your architecture rather than supporting it.
LlamaIndex
LlamaIndex focuses specifically on connecting LLMs to data. Where LangChain is a general-purpose framework, LlamaIndex excels at document processing, indexing strategies, and retrieval—the core of RAG systems.
Strengths:
- Best-in-class document processing (PDF, HTML, code, structured data)
- Sophisticated indexing strategies (vector, keyword, tree, knowledge graph)
- Query engines that intelligently combine multiple retrieval strategies
- Strong TypeScript support alongside Python
- Cleaner abstractions for data-focused work than LangChain
Weaknesses:
- Smaller ecosystem than LangChain
- Less flexible for non-retrieval use cases (agents, complex workflows)
- Can be heavyweight for simple applications
- Documentation assumes some familiarity with RAG concepts
Best for: Document-heavy applications, complex retrieval strategies, knowledge bases, and teams that have some LLM experience and want specialized tools.
Multi-index query example:
from llama_index.core import VectorStoreIndex, KeywordTableIndex
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import RetrieverTool
# Create specialized indexes ("documents" is a list loaded earlier, e.g. via SimpleDirectoryReader)
vector_index = VectorStoreIndex.from_documents(documents)
keyword_index = KeywordTableIndex.from_documents(documents)
# Router selects best index per query
retriever = RouterRetriever(
selector=LLMSingleSelector.from_defaults(),
retriever_tools=[
RetrieverTool.from_defaults(
retriever=vector_index.as_retriever(),
description="Best for semantic similarity and conceptual queries"
),
RetrieverTool.from_defaults(
retriever=keyword_index.as_retriever(),
description="Best for specific keyword and terminology lookups"
),
]
)
# Query - router picks appropriate index
nodes = retriever.retrieve("authentication flow diagram")
When to choose over LangChain: When your primary challenge is getting the right documents into context, especially with complex document structures or multiple data sources.
Semantic Kernel
Semantic Kernel is Microsoft’s SDK for integrating LLMs into applications. It’s enterprise-focused, with first-class support for C#, Python, and Java—and deep Azure integration.
Strengths:
- First-class Azure OpenAI integration
- Strong typing and enterprise design patterns
- Excellent for C#/.NET development teams
- Plugins architecture for extending capabilities
- Good fit for existing Microsoft ecosystem
Weaknesses:
- Smaller community than LangChain or LlamaIndex
- Python version less mature than C#
- Examples tend toward Azure (though it works with any LLM)
- Enterprise patterns may be overkill for small projects
Best for: .NET/C# teams, Azure-first organizations, enterprise environments, and teams that prefer strongly-typed approaches.
Basic example:
import asyncio
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
# Initialize kernel
kernel = sk.Kernel()
# Add LLM service
kernel.add_service(
OpenAIChatCompletion(
service_id="chat",
ai_model_id="gpt-4",
api_key=api_key
)
)
# Create semantic function
summarize = kernel.create_function_from_prompt(
prompt="Summarize this text concisely: {{$input}}",
function_name="summarize",
plugin_name="text"
)
# Use (kernel.invoke is async, so run it inside an event loop)
async def main():
    result = await kernel.invoke(summarize, input="Long text here...")
    print(result)

asyncio.run(main())
The “No Framework” Option
For many production systems, the right answer is: build it yourself.
This sounds like more work, but consider what a “framework-free” RAG system actually requires:
from openai import OpenAI
from your_vector_db import VectorDB # Whatever you chose
class SimpleRAG:
def __init__(self, collection: str):
self.db = VectorDB(collection)
self.client = OpenAI()
def query(self, question: str, top_k: int = 3) -> dict:
# Retrieve
docs = self.db.search(question, limit=top_k)
context = "\n\n".join([
f"Source: {d.metadata['source']}\n{d.text}"
for d in docs
])
# Generate
response = self.client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": f"Answer based on this context:\n\n{context}"
},
{"role": "user", "content": question}
]
)
return {
"answer": response.choices[0].message.content,
"sources": [d.metadata for d in docs],
"tokens_used": response.usage.total_tokens
}
That’s a working RAG system in under 30 lines. You have complete visibility into what’s happening. Debugging is straightforward. Latency is minimal. You can add exactly the features you need.
Advantages of no framework: Full control, minimal dependencies, easier debugging, lower latency, no breaking changes from upstream, exact behavior you want.
Disadvantages: More code to maintain, reinventing common patterns, no built-in tracing, slower initial development.
When this makes sense: You have specific performance requirements. Your use case is well-defined. You want maximum observability. Your team understands LLM fundamentals (which, having read this book, you do).
Framework Comparison
| Aspect | LangChain | LlamaIndex | Semantic Kernel | No Framework |
|---|---|---|---|---|
| Primary strength | Ecosystem | Document processing | Enterprise/.NET | Control |
| Learning curve | Medium | Medium | Medium-Low | Low (if you know LLMs) |
| Abstraction level | High | High | Medium | None |
| Community size | Largest | Large | Medium | N/A |
| Best language | Python | Python/TS | C# | Any |
| Debugging tools | LangSmith | LlamaTrace | Azure Monitor | Your own |
| Latency overhead | Medium-High | Medium | Low-Medium | None |
| When to use | Prototyping, many integrations | Complex retrieval | .NET shops | Production optimization |
A.2 Vector Databases
Vector databases store embeddings and enable similarity search. They’re the backbone of RAG systems—when you retrieve documents based on semantic similarity, a vector database is doing the work.
The choice matters for latency, cost, and operational complexity. But here’s the honest truth: for most applications, most choices will work. The differences matter at scale or for specific requirements.
Decision Framework
Before evaluating databases, answer these questions:
1. Scale: How many vectors?
- Under 1 million: Most options work fine
- 1-100 million: Need to consider performance and sharding
- Over 100 million: Requires distributed architecture
2. Deployment: Where will it run?
- Managed cloud: Zero ops, higher cost
- Self-hosted: More control, operational burden
- Embedded: Simplest, limited scale
3. Hybrid search: Do you need keyword + semantic?
- If yes: Weaviate, Qdrant, or Elasticsearch
- If no: Any option works
4. Latency: What’s your p99 target?
- Under 50ms: Pinecone or Qdrant
- Under 150ms: Any option
- Flexible: Optimize for cost instead
5. Budget: What can you afford?
- Significant budget: Pinecone (managed, fast)
- Moderate budget: Qdrant Cloud, Weaviate Cloud
- Minimal budget: Self-hosted options
Pinecone
Pinecone is a fully managed vector database—you don’t run any infrastructure. It focuses on performance and simplicity.
Strengths:
- Lowest latency at scale (p99 around 47ms)
- Zero operational overhead
- Automatic scaling and replication
- Excellent documentation
- Generous free tier for development
Weaknesses:
- No native hybrid search (dense vectors only)
- Higher cost at scale versus self-hosted
- Vendor lock-in (no self-hosted option)
- Limited filtering compared to some alternatives
Pricing (2026):
- Serverless: ~$0.10 per million vectors/month + query costs
- Pods: Starting around $70/month for dedicated capacity
Best for: Teams that want zero ops, need best-in-class latency, and don’t require hybrid search.
from pinecone import Pinecone
pc = Pinecone(api_key="your-key")
index = pc.Index("documents")
# Upsert vectors
index.upsert(vectors=[
{
"id": "doc-1",
"values": embedding, # Your 768-dim vector
"metadata": {"source": "docs/intro.md", "category": "overview"}
}
])
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=10,
include_metadata=True,
filter={"category": {"$eq": "overview"}}
)
for match in results.matches:
print(f"{match.id}: {match.score:.3f}")
Weaviate
Weaviate is an open-source vector database with native hybrid search—combining BM25 keyword search with vector similarity in a single query.
Strengths:
- Native hybrid search (BM25 + vector)
- GraphQL API for flexible querying
- Multi-tenancy support
- Self-hosted or Weaviate Cloud
- Active community and good documentation
Weaknesses:
- Higher latency than Pinecone (p99 around 123ms)
- More operational complexity if self-hosted
- GraphQL has a learning curve
- Resource-intensive for large deployments
Pricing:
- Self-hosted: Free (pay for infrastructure)
- Weaviate Cloud: Starting around $25/month
Best for: Teams needing hybrid search, comfortable with self-hosting, or wanting flexible query capabilities.
import weaviate
client = weaviate.Client("http://localhost:8080")
# Hybrid search combines keyword and vector
result = (
client.query
.get("Document", ["text", "source", "category"])
.with_hybrid(
query="authentication best practices",
alpha=0.5 # 0 = pure keyword, 1 = pure vector
)
.with_limit(10)
.do()
)
for doc in result["data"]["Get"]["Document"]:
print(f"{doc['source']}: {doc['text'][:100]}...")
Chroma
Chroma is an open-source, embedded-first vector database. It’s designed to be the simplest way to get started—pip install chromadb and you’re running.
Strengths:
- Zero setup for development
- Embedded mode runs in-process
- Simple, intuitive Python API
- Automatic embedding (pass text, get vectors)
- Good for prototyping and small datasets
Weaknesses:
- Not designed for production scale (struggles above 1M vectors)
- Limited cloud offering
- Fewer advanced features
- Performance degrades at scale
Pricing: Free for self-hosted; Chroma Cloud pricing varies.
Best for: Prototyping, local development, tutorials, and small production deployments (under 100K vectors).
import chromadb
# In-memory for development
client = chromadb.Client()
# Or persistent
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection(
name="documents",
metadata={"hnsw:space": "cosine"}
)
# Add documents (auto-embeds if you configure an embedding function)
collection.add(
documents=["First document text", "Second document text"],
ids=["doc-1", "doc-2"],
metadatas=[{"source": "a.md"}, {"source": "b.md"}]
)
# Query
results = collection.query(
query_texts=["search query"],
n_results=5,
where={"source": "a.md"} # Optional filter
)
Qdrant
Qdrant is an open-source vector database written in Rust, focusing on performance and advanced filtering. It offers a good balance between features and speed.
Strengths:
- Fast performance (p99 around 60ms)
- Advanced filtering capabilities
- Hybrid search support
- Self-hosted or Qdrant Cloud
- Efficient Rust implementation
- Good balance of features and performance
Weaknesses:
- Smaller community than Weaviate
- Documentation less comprehensive
- Fewer third-party integrations
Pricing:
- Self-hosted: Free
- Qdrant Cloud: Starting around $25/month
Best for: Teams wanting a balanced option—good performance, hybrid search capability, reasonable cost.
from qdrant_client import QdrantClient
from qdrant_client.models import (
VectorParams, Distance, Filter,
FieldCondition, MatchValue
)
client = QdrantClient("localhost", port=6333)
# Create collection
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
# Search with filter
results = client.search(
collection_name="documents",
query_vector=embedding,
query_filter=Filter(
must=[
FieldCondition(
key="category",
match=MatchValue(value="code")
)
]
),
limit=10
)
pgvector (PostgreSQL Extension)
pgvector adds vector similarity search to PostgreSQL. If you’re already running PostgreSQL, you can add vector search without new infrastructure.
Strengths:
- Uses existing PostgreSQL infrastructure
- Full SQL capabilities alongside vectors
- Transactional consistency with your other data
- Familiar tooling (backups, monitoring, replication)
- No new database to learn or operate
Weaknesses:
- Slower than purpose-built vector databases
- Limited to PostgreSQL
- Scaling requires PostgreSQL scaling expertise
- Less sophisticated vector operations
Best for: Teams already using PostgreSQL who want vector search without adding infrastructure complexity.
-- Enable extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
source VARCHAR(255),
embedding vector(768)
);
-- Create index for fast search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Search (using cosine distance)
SELECT content, source, 1 - (embedding <=> $1) as similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
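From application code, the same queries run over a normal PostgreSQL connection. A minimal sketch, assuming psycopg 3 plus the pgvector Python helper package (which registers the vector type so NumPy arrays can be passed as query parameters):
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app user=app")
register_vector(conn)  # register the vector type with this connection

def search(query_embedding: np.ndarray, limit: int = 10):
    # Cosine distance (<=>): smaller is closer, so similarity = 1 - distance
    return conn.execute(
        """
        SELECT content, source, 1 - (embedding <=> %s) AS similarity
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_embedding, query_embedding, limit),
    ).fetchall()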
Vector Database Comparison
| Database | Latency (p99) | Hybrid Search | Deployment | Cost (1M vectors) | Best For |
|---|---|---|---|---|---|
| Pinecone | ~47ms | No | Managed | ~$100/mo | Zero ops, max speed |
| Weaviate | ~123ms | Yes (native) | Both | ~$50-100/mo | Hybrid search |
| Chroma | ~200ms | No | Embedded/Cloud | Free-$50/mo | Prototyping |
| Qdrant | ~60ms | Yes | Both | ~$25-75/mo | Balanced choice |
| pgvector | ~150ms | Via SQL | Self-hosted | Infra only | PostgreSQL shops |
| Elasticsearch | ~150ms | Yes (robust) | Both | Varies | Full-text + vector |
| Milvus | ~80ms | No | Self-hosted | Infra only | Large scale |
Practical Recommendations
Just starting out? Use Chroma locally, then migrate to Qdrant or Pinecone for production.
Need hybrid search? Weaviate or Qdrant. Both handle BM25 + vector well.
Have PostgreSQL already? pgvector avoids adding infrastructure. Performance is good enough for many use cases.
Need maximum performance? Pinecone, but be prepared for the cost at scale.
Cost-sensitive at scale? Self-hosted Qdrant or Milvus with your own infrastructure.
A.3 Embedding Models
Embedding models convert text into vectors—the dense numerical representations that enable semantic search. Your choice affects retrieval quality, latency, and cost.
The good news: most modern embedding models are good enough. The difference between a good model and a great model is often smaller than the difference between good and bad chunking strategies (Chapter 6).
The bad news: switching embedding models later requires re-embedding all your documents. Choose thoughtfully, but don’t overthink it.
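To make "semantic search" concrete before weighing options, here is a minimal sketch (using the sentence-transformers package and the all-MiniLM-L6-v2 model covered below) that embeds a query and a document, then compares them with cosine similarity—the operation your vector database runs at scale:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed a query and a candidate document into 384-dimensional vectors
query_vec = model.encode("How do I reset my password?")
doc_vec = model.encode("Account recovery: resetting a forgotten password")

# Cosine similarity ranges from -1 to 1; higher means more semantically similar
score = util.cos_sim(query_vec, doc_vec)
print(f"similarity: {score.item():.3f}")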
Decision Framework
Key trade-offs:
- Quality vs. latency: Larger models produce better embeddings but run slower
- Cost structure: API models charge per token; self-hosted has infrastructure costs
- Language support: Some models are English-only; others handle 100+ languages
- Dimensions: Higher dimensions capture more nuance but use more storage
Quick guidance:
| Situation | Recommendation |
|---|---|
| Starting out | OpenAI text-embedding-3-small |
| Cost-sensitive | all-MiniLM-L6-v2 (free, fast) |
| Highest quality | OpenAI text-embedding-3-large or E5-Mistral-7B |
| Multilingual | BGE-M3 or OpenAI models |
| Latency-critical | all-MiniLM-L6-v2 (10ms) |
OpenAI Embedding Models
OpenAI’s embedding models are the most commonly used in production. They’re managed (no infrastructure), high-quality, and reasonably priced.
text-embedding-3-small (Recommended starting point)
- Dimensions: 512-1536 (configurable)
- Cost: $0.02 per million tokens
- Latency: ~20ms
- Quality: Excellent for most use cases
- Languages: Multilingual
text-embedding-3-large
- Dimensions: 256-3072 (configurable)
- Cost: $0.13 per million tokens
- Latency: ~50ms
- Quality: Best available from OpenAI
- Use when: Quality is critical and cost isn’t a constraint
from openai import OpenAI
client = OpenAI()
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
response = client.embeddings.create(
model=model,
input=text,
dimensions=768 # Optional: reduce for efficiency
)
return response.data[0].embedding
# Single text
embedding = get_embedding("What is context engineering?")
# Batch for efficiency
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=texts
)
return [item.embedding for item in response.data]
Open-Source Models
Open-source models run on your infrastructure—free per-token, but you pay for compute.
all-MiniLM-L6-v2 (sentence-transformers)
The workhorse of open-source embeddings. Fast, free, good quality.
- Dimensions: 384
- Cost: Free (self-hosted)
- Latency: ~10ms on CPU
- Quality: Good for most applications
- Languages: Primarily English
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Single embedding
embedding = model.encode("What is context engineering?")
# Batch (more efficient)
embeddings = model.encode([
"First document",
"Second document",
"Third document"
])
E5-Mistral-7B
A larger model that approaches OpenAI quality while remaining self-hostable.
- Dimensions: 768
- Cost: Free (requires GPU)
- Latency: ~40ms on GPU
- Quality: Excellent, competitive with OpenAI
- Languages: Multilingual
BGE-M3
Excellent multilingual model that also supports hybrid retrieval (dense + sparse vectors).
- Dimensions: 1024
- Cost: Free
- Latency: ~35ms
- Quality: Excellent for multilingual
- Unique feature: Outputs both dense and sparse representations
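The dense + sparse output is what sets BGE-M3 apart. A sketch of what using it looks like, assuming the FlagEmbedding package and its BGEM3FlagModel interface (check the model card for the current API):
from FlagEmbedding import BGEM3FlagModel  # assumption: the FlagEmbedding package provides this interface

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["How do I configure authentication?"],
    return_dense=True,   # 1024-dim vectors for semantic similarity
    return_sparse=True,  # per-token lexical weights, usable like BM25
)

dense_vec = output["dense_vecs"][0]            # feed this to your vector database
sparse_weights = output["lexical_weights"][0]  # feed this to a sparse/keyword index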
Embedding Model Comparison
| Model | Dimensions | Cost/1M tokens | Latency | Quality | Languages |
|---|---|---|---|---|---|
| text-embedding-3-small | 512-1536 | $0.02 | 20ms | Excellent | Multi |
| text-embedding-3-large | 256-3072 | $0.13 | 50ms | Best | Multi |
| all-MiniLM-L6-v2 | 384 | Free | 10ms | Good | EN |
| E5-Mistral-7B | 768 | Free | 40ms | Excellent | Multi |
| BGE-M3 | 1024 | Free | 35ms | Excellent | 100+ |
Practical Guidance
Don’t overthink embedding choice. For most applications, text-embedding-3-small or all-MiniLM-L6-v2 is sufficient. The chunking strategy (Chapter 6) and reranking (Chapter 7) matter more.
When to invest in better embeddings:
- Your domain has highly specialized terminology
- Retrieval quality is demonstrably the bottleneck (measure first!)
- You’ve already optimized chunking and added reranking
Dimension trade-offs:
- 384 dimensions: Fast, low storage, slightly lower quality
- 768 dimensions: Good balance (most common)
- 1024+ dimensions: Higher quality, more storage and compute
- 3072 dimensions: Maximum quality, 3x the cost
Migration warning: Changing embedding models means re-embedding all documents. For a million documents at $0.02/1M tokens with 500 tokens each, that’s about $10—not terrible. But the pipeline work and testing take time. Plan for this if you start simple.
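The arithmetic behind both of these points is easy to check. A quick sketch using the figures above (raw float32 storage, ignoring index overhead):
# Storage: 1M vectors of float32 at various dimensions
for dims in (384, 768, 1024, 3072):
    gb = 1_000_000 * dims * 4 / 1e9  # 4 bytes per float32
    print(f"{dims:>5} dims -> ~{gb:.1f} GB raw vector storage per million vectors")

# Re-embedding cost: 1M documents x 500 tokens at $0.02 per 1M tokens
total_tokens = 1_000_000 * 500
cost = total_tokens / 1_000_000 * 0.02
print(f"Re-embedding cost: ~${cost:.0f}")  # about $10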
A.4 Model Context Protocol (MCP)
MCP standardizes how LLMs connect to tools and data sources. Rather than each application implementing custom function-calling logic, MCP provides a protocol that tools can implement once and any MCP-compatible client can use.
Chapter 8 covers tool use in depth. This section provides practical resources for implementing MCP.
What MCP Provides
Core capabilities:
- Standardized tool definitions with JSON Schema
- Server-client architecture for hosting tools
- Transport options (stdio for local, HTTP for remote)
- Type-safe request/response handling
- Built-in error handling patterns
Why it matters:
- Write a tool once, use it with any MCP client
- Growing ecosystem of pre-built servers
- Standardized patterns reduce bugs
- Easier to share tools across projects and teams
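To make the first capability concrete—standardized tool definitions—here is roughly the shape of a tool as an MCP client sees it, written as a Python dict. Treat the exact field names as a sketch of the spec; check modelcontextprotocol.io for the authoritative schema.
# Roughly what an MCP tool definition looks like when listed by a server.
# The input schema is standard JSON Schema, so any client can validate arguments.
search_code_tool = {
    "name": "search_code",
    "description": "Search the codebase for code matching a query.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search term or pattern"},
            "file_pattern": {"type": "string", "default": "*.py"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}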
Official Resources
Specification and documentation:
- Spec: https://modelcontextprotocol.io
- Concepts: https://modelcontextprotocol.io/docs/concepts
SDKs:
- Python: pip install mcp
- TypeScript: npm install @modelcontextprotocol/sdk
Pre-built servers (official and community):
- File system operations
- Database queries (PostgreSQL, SQLite)
- Git operations
- GitHub, Slack, Notion integrations
- Web browsing and search
Building a Custom MCP Server
Here’s a minimal MCP server that provides codebase search:
from mcp.server import Server
from mcp.types import Tool, TextContent
import asyncio
server = Server("codebase-search")
@server.tool()
async def search_code(
query: str,
file_pattern: str = "*.py",
max_results: int = 10
) -> list[TextContent]:
"""
Search the codebase for code matching a query.
Args:
query: Search term or pattern
file_pattern: Glob pattern for files to search
max_results: Maximum results to return
"""
# Your search implementation
results = await do_search(query, file_pattern, max_results)
return [
TextContent(
type="text",
text=f"File: {r.file}\nLine {r.line}:\n{r.content}"
)
for r in results
]
@server.tool()
async def read_file(path: str) -> list[TextContent]:
"""
Read a file from the codebase.
Args:
path: Relative path to the file
"""
# Validate path is within allowed directory
if not is_safe_path(path):
raise ValueError(f"Access denied: {path}")
content = await read_file_content(path)
return [TextContent(type="text", text=content)]
if __name__ == "__main__":
# Run with stdio transport (for local use)
asyncio.run(server.run_stdio())
Best Practices for MCP Tools
Keep tools focused. One clear responsibility per tool. “search_and_summarize” should be two tools.
Write clear descriptions. The LLM reads these to decide when to use the tool:
@server.tool()
async def search_code(query: str) -> list[TextContent]:
"""
Search for code in the repository using semantic search.
Use this when the user asks about:
- Finding where something is implemented
- Locating specific functions or classes
- Understanding how features work
Do NOT use for:
- Reading specific files (use read_file instead)
- Running code (use execute_code instead)
"""
Handle errors gracefully. Return helpful error messages, not stack traces:
@server.tool()
async def read_file(path: str) -> list[TextContent]:
try:
content = await read_file_content(path)
return [TextContent(type="text", text=content)]
except FileNotFoundError:
return [TextContent(
type="text",
text=f"File not found: {path}. Use search_code to find the correct path."
)]
except PermissionError:
return [TextContent(
type="text",
text=f"Access denied: {path} is outside the allowed directory."
)]
Include examples in descriptions when the usage isn’t obvious.
MCP vs. Direct Function Calling
Use MCP when:
- You want reusable tools across projects
- You need to share tools with your team
- You want standardized error handling
- You’re building for multiple LLM providers
Use direct function calling when:
- Maximum simplicity is the goal
- Tools are project-specific and won’t be reused
- You’re optimizing for minimal latency
- The MCP overhead isn’t worth it for your use case
For CodebaseAI, we used direct function calling because the tools are tightly integrated with the application. For a general-purpose coding assistant, MCP would make more sense.
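For comparison, here is what the direct approach looks like with the OpenAI chat completions API and a single hypothetical search_code tool—no protocol layer, just a tools parameter:
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_code",
        "description": "Search the codebase for code matching a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Where is authentication implemented?"}],
    tools=tools,
)

# If the model decided to call the tool, execute it and send the result back
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)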
Streaming Response Handling
Long-running MCP tools should stream intermediate results back to the client rather than blocking until completion. This is especially important for LLM interactions where users expect progressive feedback.
MCP supports streaming through the TextContent type with incremental updates. Here’s how to implement a tool that streams results:
from mcp.server import Server
from mcp.types import TextContent
import asyncio
server = Server("streaming-tools")
@server.tool()
async def analyze_large_dataset(file_path: str):
"""
Analyze a large dataset, streaming results as they're computed.
Args:
file_path: Path to the dataset file
"""
# Simulate processing chunks of a dataset
async def stream_analysis():
for i in range(10):
# Process chunk i
chunk_result = await process_chunk(file_path, i)
# Stream intermediate result
yield TextContent(
type="text",
text=f"Chunk {i}: {chunk_result}\n"
)
# Small delay to simulate work
await asyncio.sleep(0.1)
# Collect and return streamed results
all_results = []
async for result in stream_analysis():
all_results.append(result)
return all_results
@server.tool()
async def search_and_analyze(query: str, file_pattern: str = "*.txt"):
"""Search files and stream analysis results as they arrive."""
async def search_stream():
files = await find_files(file_pattern)
for file_path in files:
# Search file
matches = await search_file(file_path, query)
if matches:
# Stream result for this file
yield TextContent(
type="text",
text=f"File: {file_path}\n"
f"Found {len(matches)} matches\n"
f"Preview: {matches[0][:100]}...\n\n"
)
results = []
async for chunk in search_stream():
results.append(chunk)
return results
Key patterns:
- Use async def and yield to stream results incrementally
- Each yielded TextContent represents an update sent to the client
When to use streaming:
- Operations taking more than 1 second
- Data analysis or processing pipelines
- Search operations across multiple sources
- Any task where intermediate results are useful to the LLM
Error Recovery Patterns
Real MCP servers encounter failures: network issues, tool bugs, downstream service outages. Production systems need recovery strategies.
Retry with exponential backoff:
from mcp.server import Server
from mcp.types import TextContent
import asyncio
import random
server = Server("resilient-tools")
async def call_with_retry(
func,
*args,
max_retries: int = 3,
base_delay: float = 1.0,
**kwargs
):
"""
Call a function with exponential backoff retry.
Args:
func: Async function to call
max_retries: Maximum retry attempts
base_delay: Initial delay in seconds (doubles on each retry)
"""
last_error = None
for attempt in range(max_retries):
try:
return await func(*args, **kwargs)
except Exception as e:
last_error = e
if attempt < max_retries - 1:
# Exponential backoff with jitter to prevent thundering herd
delay = base_delay * (2 ** attempt)
jitter = random.uniform(0, delay * 0.1)
await asyncio.sleep(delay + jitter)
raise last_error
@server.tool()
async def query_database(query: str, table: str):
"""Query a database with automatic retry on transient failures."""
async def do_query():
# Your database query logic
return await db.execute(f"SELECT * FROM {table} WHERE {query}")
try:
result = await call_with_retry(do_query, max_retries=3)
return [TextContent(type="text", text=f"Result: {result}")]
except Exception as e:
return [TextContent(
type="text",
text=f"Query failed after retries: {str(e)}"
)]
Circuit breaker for unreliable services:
from enum import Enum
from datetime import datetime, timedelta
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject requests
HALF_OPEN = "half_open" # Testing if recovered
class CircuitBreaker:
"""Prevents cascading failures by stopping calls to failing services."""
def __init__(self, failure_threshold: int = 5, timeout: int = 60):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.failure_count = 0
self.state = CircuitState.CLOSED
self.last_failure_time = None
async def call(self, func, *args, **kwargs):
"""Execute func, managing circuit state."""
if self.state == CircuitState.OPEN:
# Check if timeout has passed
if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
self.state = CircuitState.HALF_OPEN
else:
raise Exception("Circuit breaker is OPEN - service unavailable")
try:
result = await func(*args, **kwargs)
# Success - reset on recovery
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.CLOSED
self.failure_count = 0
return result
except Exception as e:
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
raise e
# Usage
breaker = CircuitBreaker(failure_threshold=3, timeout=30)
@server.tool()
async def call_external_api(endpoint: str, params: dict):
"""Call external API with circuit breaker protection."""
    async def do_call():
        # httpx.get is synchronous; use the async client so the call can be awaited
        # (assumes "import httpx" at the top of the module)
        async with httpx.AsyncClient() as http:
            return await http.get(endpoint, params=params)
try:
result = await breaker.call(do_call)
return [TextContent(type="text", text=f"Response: {result.json()}")]
except Exception as e:
return [TextContent(
type="text",
text=f"External service unavailable: {str(e)}"
)]
Graceful degradation when a tool fails:
@server.tool()
async def get_user_info(user_id: str):
"""
Get user information from the primary service.
Falls back to cache if primary is unavailable.
"""
    try:
        # Try the primary service and wrap the result in TextContent like the other tools
        user = await primary_user_service.get(user_id)
        return [TextContent(type="text", text=f"User info: {user}")]
except Exception as e:
# Fall back to cache
cached = await cache.get(f"user:{user_id}")
if cached:
return [TextContent(
type="text",
text=f"User info (from cache, may be stale): {cached}\n"
f"Note: Primary service unavailable, using cached data."
)]
else:
return [TextContent(
type="text",
text=f"User info unavailable: {str(e)}\n"
f"The primary service is down and no cached data exists."
)]
@server.tool()
async def search_documents(query: str, use_semantic: bool = True):
"""
Search documents, degrading gracefully if semantic search fails.
"""
    if not use_semantic:
        # Caller explicitly asked for keyword search
        results = await keyword_search(query)
        return [TextContent(type="text", text=f"Results (keyword search): {results}")]
    try:
        # Try semantic search first
        return await semantic_search(query)
except Exception as e:
# Fall back to keyword search
results = await keyword_search(query)
return [TextContent(
type="text",
text=f"Results (keyword search, not semantic): {results}\n"
f"Note: Semantic search unavailable."
)]
Error recovery best practices:
- Retry transient errors (timeouts, connection resets), not permanent errors (bad auth, invalid input)
- Use exponential backoff with jitter to avoid overwhelming recovering services
- Implement circuit breakers to prevent cascading failures
- Provide graceful degradation when critical services fail
- Always include error context so the LLM understands what went wrong
Testing MCP Servers Without a Full Client
You can test MCP tool functions directly without standing up a full LLM client. This is crucial for rapid iteration and validating error handling.
Minimal test harness:
import pytest
import asyncio
from mcp.server import Server
from mcp.types import TextContent
# Your MCP server
server = Server("testable-tools")
@server.tool()
async def process_text(text: str, transform: str = "upper") -> list[TextContent]:
"""
Process text with specified transformation.
Args:
text: Input text
transform: 'upper', 'lower', or 'reverse'
"""
if not text:
raise ValueError("text cannot be empty")
if transform == "upper":
result = text.upper()
elif transform == "lower":
result = text.lower()
elif transform == "reverse":
result = text[::-1]
else:
raise ValueError(f"Unknown transform: {transform}")
return [TextContent(type="text", text=result)]
# Tests
class TestProcessText:
@pytest.mark.asyncio
async def test_uppercase_transformation(self):
"""Test uppercase transformation."""
result = await process_text("hello world", transform="upper")
assert result[0].text == "HELLO WORLD"
@pytest.mark.asyncio
async def test_lowercase_transformation(self):
"""Test lowercase transformation."""
result = await process_text("HELLO WORLD", transform="lower")
assert result[0].text == "hello world"
@pytest.mark.asyncio
async def test_reverse_transformation(self):
"""Test reverse transformation."""
result = await process_text("hello", transform="reverse")
assert result[0].text == "olleh"
@pytest.mark.asyncio
async def test_empty_input_raises_error(self):
"""Test that empty input raises ValueError."""
with pytest.raises(ValueError, match="text cannot be empty"):
await process_text("", transform="upper")
@pytest.mark.asyncio
async def test_invalid_transform_raises_error(self):
"""Test that invalid transform raises ValueError."""
with pytest.raises(ValueError, match="Unknown transform"):
await process_text("hello", transform="unknown")
@pytest.mark.asyncio
async def test_returns_text_content_type(self):
"""Test that result is proper TextContent."""
result = await process_text("test")
assert len(result) == 1
assert isinstance(result[0], TextContent)
assert result[0].type == "text"
Schema validation:
import json
from jsonschema import validate, ValidationError
def validate_tool_schema(tool_func, test_cases: list[dict]):
"""
Validate that a tool's inputs match its defined schema.
"""
# Get tool schema (implementation-specific)
schema = get_tool_schema(tool_func)
for case in test_cases:
try:
validate(instance=case, schema=schema)
print(f"✓ Valid: {case}")
except ValidationError as e:
print(f"✗ Invalid: {case} - {e.message}")
# Usage
test_inputs = [
{"text": "hello", "transform": "upper"},
{"text": "world"}, # transform is optional
{"text": ""}, # Empty but valid schema-wise
{"transform": "upper"}, # Missing required field
]
validate_tool_schema(process_text, test_inputs)
Integration testing with mock client:
class MockMCPClient:
"""
Minimal MCP client for testing servers.
Calls tools directly without network.
"""
def __init__(self, server: Server):
self.server = server
self.tools = {}
async def get_tools(self):
"""Get available tools from server."""
# This depends on your server implementation
return self.server.tools
async def call_tool(self, tool_name: str, **kwargs):
"""Call a tool by name with arguments."""
tool_func = getattr(self.server, tool_name, None)
if not tool_func:
raise ValueError(f"Tool not found: {tool_name}")
return await tool_func(**kwargs)
@pytest.mark.asyncio
async def test_with_mock_client():
"""Test server tools using mock client."""
client = MockMCPClient(server)
# Get tools
tools = await client.get_tools()
assert "process_text" in tools
# Call tool
result = await client.call_tool("process_text", text="hello", transform="upper")
assert result[0].text == "HELLO"
# Test error handling
with pytest.raises(ValueError):
await client.call_tool("process_text", text="", transform="upper")
Performance testing:
import time
import statistics
@pytest.mark.asyncio
async def test_tool_performance():
"""Verify tool meets latency requirements."""
latencies = []
iterations = 100
for _ in range(iterations):
start = time.time()
result = await process_text("test" * 100)
latencies.append((time.time() - start) * 1000) # Convert to ms
avg_latency = statistics.mean(latencies)
p99_latency = sorted(latencies)[int(len(latencies) * 0.99)]
print(f"Average latency: {avg_latency:.2f}ms")
print(f"P99 latency: {p99_latency:.2f}ms")
# Assert performance targets
assert avg_latency < 50, f"Average latency too high: {avg_latency:.2f}ms"
assert p99_latency < 100, f"P99 latency too high: {p99_latency:.2f}ms"
Testing strategy:
- Unit test each tool function in isolation
- Validate inputs against schema
- Test both happy paths and error cases
- Use mock clients for integration testing
- Measure performance against targets
- Keep tests fast (no external services)
A.5 Evaluation Frameworks
You can’t improve what you don’t measure. Chapter 12 covers testing AI systems in depth. This section provides practical guidance on evaluation tools.
RAGAS
RAGAS (Retrieval-Augmented Generation Assessment) is the industry standard for evaluating RAG systems. It provides metrics that measure both retrieval quality and generation quality.
Core metrics:
| Metric | What it measures | Target |
|---|---|---|
| Context Precision | Are retrieved docs ranked correctly? | > 0.7 |
| Context Recall | Did we retrieve all needed info? | > 0.7 |
| Faithfulness | Is the answer grounded in context? | > 0.8 |
| Answer Relevancy | Does the answer address the question? | > 0.8 |
Installation: pip install ragas
Basic usage:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
context_precision,
context_recall,
faithfulness,
answer_relevancy
)
# Prepare evaluation dataset
eval_data = {
"question": ["What is RAG?", "How does chunking work?"],
"answer": ["RAG is...", "Chunking splits..."],
"contexts": [["Retrieved doc 1", "Retrieved doc 2"], ["Doc A", "Doc B"]],
"ground_truth": ["RAG retrieves...", "Chunking divides..."]
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
result = evaluate(
dataset,
metrics=[
context_precision,
context_recall,
faithfulness,
answer_relevancy
]
)
print(result)
# {'context_precision': 0.82, 'context_recall': 0.75, ...}
Interpreting results:
- 0.9+: Production ready
- 0.7-0.9: Good, worth optimizing
- 0.5-0.7: Significant issues to investigate
- Below 0.5: Fundamental problems
Custom Evaluation
For domain-specific needs, build evaluators tailored to your requirements:
import json

from openai import OpenAI
class CodeAnswerEvaluator:
"""Evaluates answers about code for technical accuracy."""
def __init__(self):
self.client = OpenAI()
def evaluate(
self,
question: str,
answer: str,
context: str,
reference_code: str = None
) -> dict:
prompt = f"""Evaluate this answer about code.
Question: {question}
Answer: {answer}
Context provided: {context}
{f"Reference code: {reference_code}" if reference_code else ""}
Rate each dimension 0-1:
1. Technical accuracy: Are code references correct?
2. Completeness: Does it fully answer the question?
3. Groundedness: Is everything supported by context?
4. Clarity: Is the explanation clear?
Return JSON: {{"accuracy": X, "completeness": X, "groundedness": X, "clarity": X}}
"""
response = self.client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)
Evaluation Strategy
Tiered evaluation (from Chapter 12):
| Tier | Frequency | Metrics | Dataset size |
|---|---|---|---|
| 1 | Every commit | Latency, error rate, basic quality | 50 examples |
| 2 | Weekly | Full RAGAS, category breakdown | 200 examples |
| 3 | Monthly | Human evaluation, edge cases | 500+ examples |
Building evaluation datasets:
- Start with 50-100 examples covering core use cases
- Add production queries that revealed problems
- Include edge cases and adversarial examples
- Expand to 500+ for statistical significance
- Maintain category balance (don’t over-index on easy cases)
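As a concrete example of the Tier 1 check from the table above, here is a minimal sketch of a per-commit smoke test. The rag.query helper, the JSONL path, and the expected_keywords field are hypothetical placeholders for your own system:
import json
import time

def tier1_smoke_check(rag, path="eval/tier1.jsonl", max_p95_ms=2000, max_error_rate=0.02):
    """Fast per-commit check: latency, error rate, and a crude quality signal."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]

    latencies, errors, keyword_hits = [], 0, 0
    for ex in examples:
        start = time.time()
        try:
            answer = rag.query(ex["question"])["answer"]  # hypothetical RAG interface
        except Exception:
            errors += 1
            continue
        latencies.append((time.time() - start) * 1000)
        # Crude quality proxy: does the answer mention any expected keyword?
        if any(kw.lower() in answer.lower() for kw in ex.get("expected_keywords", [])):
            keyword_hits += 1

    p95 = sorted(latencies)[int(len(latencies) * 0.95)] if latencies else float("inf")
    error_rate = errors / len(examples)
    assert p95 <= max_p95_ms, f"p95 latency too high: {p95:.0f}ms"
    assert error_rate <= max_error_rate, f"error rate too high: {error_rate:.1%}"
    return {"p95_ms": p95, "error_rate": error_rate, "keyword_hits": keyword_hits}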
A.6 Quick Reference
“I Need to Choose…” Decision Guide
| If you need… | Choose… | Why |
|---|---|---|
| Fastest prototyping | LangChain + Chroma | Largest ecosystem, simplest setup |
| Best RAG quality | LlamaIndex + Pinecone | Document-focused + fastest retrieval |
| Hybrid search | Weaviate or Qdrant | Native BM25 + vector |
| Zero operations | Pinecone | Fully managed |
| Lowest cost | Self-hosted Qdrant + MiniLM | No API costs |
| .NET environment | Semantic Kernel | First-class C# support |
| Existing PostgreSQL | pgvector | No new infrastructure |
| Maximum control | No framework | Build exactly what you need |
Cost Estimation
Monthly costs for 1M documents, 1000 queries/day:
| Setup | Embedding | Storage | Generation | Total |
|---|---|---|---|---|
| OpenAI + Pinecone | ~$20 | ~$100 | ~$150 | ~$270/mo |
| OpenAI + Qdrant Cloud | ~$20 | ~$50 | ~$150 | ~$220/mo |
| OpenAI + self-hosted Qdrant | ~$20 | ~$30 (infra) | ~$150 | ~$200/mo |
| Self-hosted everything | $0 | ~$50 | ~$50 | ~$100/mo |
Generation costs assume GPT-4 with ~2K tokens per query. Adjust for your model and usage.
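These totals are easy to re-derive for your own traffic. A minimal sketch for the generation line; the per-token rate here is simply backed out of the table's ~$150 figure, so substitute your provider's actual prices:
def monthly_generation_cost(
    queries_per_day: int = 1000,
    tokens_per_query: int = 2000,
    price_per_1m_tokens: float = 2.50,  # illustrative blended rate; use your model's real pricing
) -> float:
    tokens_per_month = queries_per_day * 30 * tokens_per_query
    return tokens_per_month / 1_000_000 * price_per_1m_tokens

print(f"Generation: ~${monthly_generation_cost():.0f}/month")  # ~$150 with the defaults above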
Chapter Cross-References
| This Appendix Section | Related Chapter | Key Concepts |
|---|---|---|
| A.1 Orchestration Frameworks | Ch 8: Tool Use | When tools need coordination |
| A.2 Vector Databases | Ch 6: RAG Fundamentals | Retrieval infrastructure |
| A.3 Embedding Models | Ch 6: RAG Fundamentals | Semantic representation |
| A.3 Embedding Models | Ch 7: Advanced Retrieval | Quality vs. cost trade-offs |
| A.4 MCP | Ch 8: Tool Use | Tool architecture patterns |
| A.5 Evaluation | Ch 12: Testing AI Systems | Metrics and methodology |
Version Note
Tool versions and pricing in this appendix reflect the state as of early 2026. The principles and decision frameworks are designed to remain useful even as specific tools evolve. When in doubt, check official documentation for current details.
Appendix Cross-References
| This Section | Related Appendix | Connection |
|---|---|---|
| A.2 Vector Databases | Appendix E: pgvector, Vector Database | Glossary definitions |
| A.3 Embedding Models | Appendix D: D.3 Embedding Cost Calculator | Cost implications |
| A.4 MCP | Appendix B: B.5 Tool Use patterns | Implementation patterns |
| A.5 Evaluation Frameworks | Appendix B: B.9 Testing & Debugging | Evaluation patterns |
| A.6 Quick Reference | Appendix D: D.7 Quick Reference | Cost estimates |
Remember: the best tool is the one you understand deeply enough to debug at 3 AM. Fancy abstractions that obscure behavior aren’t helping you—they’re hiding problems until the worst possible moment. Start simple, add complexity only when you’ve hit real limits, and always maintain the ability to see what’s actually happening.