Appendix D: Cost Reference

Appendix D, v2.1 — Early 2026

Pricing in this appendix reflects early 2026 rates. Models, pricing tiers, and cost structures change frequently. Use the methodologies here with current pricing from provider documentation.

This appendix provides the numbers you need to estimate costs before they surprise you. No theory—Chapter 11 covers why costs matter. Here you’ll find formulas, calculators, pricing tables, and worked examples.

Important: LLM pricing changes frequently. The numbers here reflect early 2026 rates. Always check provider pricing pages before making commitments. The formulas and patterns, however, remain useful regardless of specific prices.


D.1 Token Estimation

Tokens are the currency of LLM costs. Every API call charges by tokens consumed.

Quick Estimation Rules

For English text:

| Content | Tokens |
|---|---|
| 1 character | ~0.25 tokens |
| 1 word | ~1.3 tokens |
| 4 characters | ~1 token |
| 100 words | ~130 tokens |
| 1 page (500 words) | ~650 tokens |
| 1,000 words | ~1,300 tokens |

For code:

| Content | Tokens |
|---|---|
| 1 line of code | ~15-20 tokens |
| 1 function (typical) | ~100-500 tokens |
| 1 file (500 lines) | ~8,000-10,000 tokens |
| JSON (per KB) | ~400 tokens |

Code is less token-efficient than prose. Punctuation, indentation, and special characters all consume tokens. JSON and structured data are particularly token-hungry.

Token Estimation Code

For quick estimates during development:

def estimate_tokens(text: str) -> int:
    """Quick token estimate: 1 token ≈ 4 characters for English."""
    return len(text) // 4

def estimate_tokens_words(word_count: int) -> int:
    """Estimate from word count: 1 token ≈ 0.75 words."""
    return int(word_count * 1.33)

For accurate counts, use the tokenizer libraries:

# OpenAI models
import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4") -> int:
    """Exact token count for OpenAI models."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Anthropic models
from anthropic import Anthropic

def count_tokens_anthropic(text: str, model: str = "claude-3-5-sonnet-20241022") -> int:
    """Exact token count for Claude models via the token-counting endpoint."""
    # Note: method names vary across anthropic SDK versions; older releases
    # exposed client.count_tokens(text). Check the current SDK docs.
    client = Anthropic()
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens

Model-Specific Differences

Different models tokenize differently. The same text may have different token counts across providers:

| Text Sample | GPT-4 | Claude | Llama |
|---|---|---|---|
| “Hello, world!” | 4 | 4 | 5 |
| def foo(): return 42 | 9 | 8 | 11 |
| 1KB JSON | ~380 | ~400 | ~420 |

For budgeting purposes, use the 4-character rule for estimates, then verify with the actual tokenizer before production deployment.


D.2 Model Pricing

Generation Models

Prices per 1 million tokens (early 2026):

| Model | Input | Output | Notes |
|---|---|---|---|
| Premium Tier | | | |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best quality/cost balance |
| GPT-4o | $2.50 | $10.00 | Multimodal capable |
| Claude 3 Opus | $15.00 | $75.00 | Maximum capability |
| GPT-4 Turbo | $10.00 | $30.00 | Large context window |
| Budget Tier | | | |
| GPT-4o-mini | $0.15 | $0.60 | 20x cheaper than GPT-4o |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, efficient |
| Claude 3.5 Haiku | $0.80 | $4.00 | Improved Haiku |
| Open Source (API) | | | |
| Llama 3 70B (via API) | $0.50-1.00 | $0.50-1.00 | Provider dependent |
| Mixtral 8x7B | $0.25-0.50 | $0.25-0.50 | Provider dependent |

Cost Per Query

What a single 10,000-token context query costs:

| Model | Input Cost | Output (500 tok) | Total |
|---|---|---|---|
| Claude 3.5 Sonnet | $0.030 | $0.0075 | ~$0.038 |
| GPT-4o | $0.025 | $0.005 | ~$0.030 |
| GPT-4o-mini | $0.0015 | $0.0003 | ~$0.002 |
| Claude 3 Haiku | $0.0025 | $0.0006 | ~$0.003 |

Embedding Models

Prices per 1 million tokens:

| Model | Price | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | $0.02 | 1536 | Best value |
| text-embedding-3-large | $0.13 | 3072 | Higher quality |
| text-embedding-ada-002 | $0.10 | 1536 | Legacy |
| Cohere embed-v3 | $0.10 | 1024 | Good multilingual |
| Voyage-2 | $0.10 | 1024 | Code-optimized available |

Cached vs. Uncached

Some providers offer prompt caching at reduced rates:

| Provider | Cached Input | Savings |
|---|---|---|
| Anthropic | 10% of base | 90% off |
| OpenAI | Varies | Up to 50% off |

Cache hits require exact prefix matches. Design system prompts to maximize cache reuse.


D.3 Cost Calculators

Basic Cost Formula

input_cost = (input_tokens / 1,000,000) × input_price_per_million
output_cost = (output_tokens / 1,000,000) × output_price_per_million
total_cost = input_cost + output_cost
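
For example, 4,000 input tokens and 500 output tokens on Claude 3.5 Sonnet cost (4,000 / 1,000,000) × $3.00 + (500 / 1,000,000) × $15.00 = $0.012 + $0.0075 ≈ $0.02.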

Query Cost Calculator

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelPricing:
    """Pricing for a specific model."""
    name: str
    input_per_million: float
    output_per_million: float

# Common model pricing (early 2026)
MODELS = {
    "claude-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
    "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
    "gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
    "claude-haiku": ModelPricing("Claude 3 Haiku", 0.25, 1.25),
}

class CostCalculator:
    """Calculate LLM costs for context engineering systems."""

    def __init__(self, model: str):
        self.pricing = MODELS[model]

    def query_cost(
        self,
        system_prompt_tokens: int,
        user_query_tokens: int,
        rag_tokens: int = 0,
        memory_tokens: int = 0,
        conversation_tokens: int = 0,
        expected_output_tokens: int = 500
    ) -> dict:
        """Calculate cost for a single query."""
        total_input = (
            system_prompt_tokens +
            user_query_tokens +
            rag_tokens +
            memory_tokens +
            conversation_tokens
        )

        input_cost = (total_input / 1_000_000) * self.pricing.input_per_million
        output_cost = (expected_output_tokens / 1_000_000) * self.pricing.output_per_million

        return {
            "model": self.pricing.name,
            "input_tokens": total_input,
            "output_tokens": expected_output_tokens,
            "input_cost": round(input_cost, 6),
            "output_cost": round(output_cost, 6),
            "total_cost": round(input_cost + output_cost, 6),
        }

    def daily_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
        """Project daily costs."""
        return queries_per_day * avg_cost_per_query

    def monthly_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
        """Project monthly costs (30 days)."""
        return queries_per_day * 30 * avg_cost_per_query


# Example usage
calc = CostCalculator("claude-sonnet")

# Typical RAG query
result = calc.query_cost(
    system_prompt_tokens=500,
    user_query_tokens=100,
    rag_tokens=2000,
    memory_tokens=400,
    conversation_tokens=1000,
    expected_output_tokens=500
)
# Result: ~$0.02 per query

# Monthly projection
monthly = calc.monthly_cost(
    queries_per_day=1000,
    avg_cost_per_query=0.02
)
# Result: ~$600/month

Multi-Model Cost Calculator

For systems using multiple models (routing, specialist agents):

class MultiModelCalculator:
    """Calculate costs for multi-agent systems."""

    def __init__(self):
        self.calculators = {
            name: CostCalculator(name) for name in MODELS
        }

    def multi_agent_query(
        self,
        router_model: str,
        router_tokens: int,
        specialist_model: str,
        specialist_calls: int,
        specialist_input_tokens: int,
        specialist_output_tokens: int
    ) -> dict:
        """Calculate cost for a multi-agent query."""
        # Router cost (typically small, budget model)
        router_calc = self.calculators[router_model]
        router_cost = router_calc.query_cost(
            system_prompt_tokens=200,
            user_query_tokens=router_tokens,
            expected_output_tokens=50
        )["total_cost"]

        # Specialist costs
        specialist_calc = self.calculators[specialist_model]
        specialist_cost = specialist_calc.query_cost(
            system_prompt_tokens=500,
            user_query_tokens=specialist_input_tokens,
            expected_output_tokens=specialist_output_tokens
        )["total_cost"] * specialist_calls

        return {
            "router_cost": router_cost,
            "specialist_cost": specialist_cost,
            "total_cost": router_cost + specialist_cost,
            "calls": 1 + specialist_calls
        }


# Example: Router + 2 specialist calls
multi = MultiModelCalculator()
result = multi.multi_agent_query(
    router_model="claude-haiku",
    router_tokens=200,
    specialist_model="claude-sonnet",
    specialist_calls=2,
    specialist_input_tokens=3000,
    specialist_output_tokens=800
)
# Result: ~$0.05 per user query

Embedding Cost Calculator

def embedding_cost(
    num_documents: int,
    avg_tokens_per_doc: int,
    price_per_million: float = 0.02  # text-embedding-3-small
) -> dict:
    """Calculate one-time embedding costs."""
    total_tokens = num_documents * avg_tokens_per_doc
    cost = (total_tokens / 1_000_000) * price_per_million

    return {
        "documents": num_documents,
        "total_tokens": total_tokens,
        "cost": round(cost, 4)
    }

# Example: Embed 10,000 documents
result = embedding_cost(
    num_documents=10_000,
    avg_tokens_per_doc=500,
    price_per_million=0.02
)
# Result: 5M tokens, $0.10

D.4 Context Budget Allocation

Reference Budget Template

A production-ready token budget for a 16,000-token context:

Total Context Budget: 16,000 tokens
├── System Prompt:       500 tokens  (3%)   [fixed]
├── User Query:        1,000 tokens  (6%)   [truncate if longer]
├── Memory Context:      400 tokens  (3%)   [most relevant only]
├── RAG Results:       2,000 tokens (13%)   [top-k with limit]
├── Conversation:      2,000 tokens (13%)   [sliding window]
└── Response Headroom: 10,100 tokens (62%)  [for model output]

Allocation by Use Case

| Component | Chatbot | RAG System | Code Assistant | Agent |
|---|---|---|---|---|
| System Prompt | 3-5% | 5-8% | 8-10% | 10-15% |
| User Query | 5-10% | 5-10% | 10-15% | 5-10% |
| Memory | 5-10% | 2-5% | 5-10% | 10-15% |
| RAG/Context | 0% | 15-25% | 20-30% | 10-20% |
| Conversation | 20-30% | 10-15% | 10-15% | 5-10% |
| Response | 50-60% | 50-60% | 40-50% | 40-50% |

Budget Enforcement Code

from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class ContextBudget:
    """Define and enforce token budgets."""
    system_prompt: int = 500
    user_query: int = 1000
    memory: int = 400
    rag: int = 2000
    conversation: int = 2000
    total_limit: int = 16000

    def allocate(self, components: Dict[str, str]) -> Dict[str, str]:
        """Truncate components to fit budgets."""
        allocated = {}

        for name, content in components.items():
            limit = getattr(self, name, 1000)
            tokens = len(content) // 4  # Quick estimate

            if tokens <= limit:
                allocated[name] = content
            else:
                # Truncate to fit budget
                char_limit = limit * 4
                allocated[name] = content[:char_limit]

        return allocated

    def remaining_for_response(self, used_tokens: int) -> int:
        """Calculate remaining tokens for response."""
        return self.total_limit - used_tokens


# Example usage
budget = ContextBudget(
    system_prompt=500,
    user_query=1000,
    memory=400,
    rag=2000,
    conversation=2000,
    total_limit=16000
)

components = {
    "system_prompt": system_prompt_text,
    "user_query": user_input,
    "memory": retrieved_memories,
    "rag": retrieved_chunks,
    "conversation": conversation_history
}

allocated = budget.allocate(components)

Scaling Budgets

For different context windows:

| Total Budget | System | Query | Memory | RAG | Conv | Response |
|---|---|---|---|---|---|---|
| 4K tokens | 200 | 400 | 200 | 500 | 500 | 2,200 |
| 16K tokens | 500 | 1,000 | 400 | 2,000 | 2,000 | 10,100 |
| 32K tokens | 1,000 | 2,000 | 800 | 5,000 | 4,000 | 19,200 |
| 128K tokens | 2,000 | 4,000 | 2,000 | 20,000 | 10,000 | 90,000 |
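
These rows are hand-tuned rather than strictly proportional, but if you want a programmatic starting point, a minimal sketch that scales the 16K reference allocation (reusing the ContextBudget class above) looks like this:

def scale_budget(total_limit: int) -> ContextBudget:
    """Scale the 16K reference allocation to another context window.

    Sketch only: proportions come from the 16K row above. The hand-tuned
    rows in the table (e.g., extra RAG at 32K) will differ slightly.
    """
    ratio = total_limit / 16_000
    return ContextBudget(
        system_prompt=int(500 * ratio),
        user_query=int(1_000 * ratio),
        memory=int(400 * ratio),
        rag=int(2_000 * ratio),
        conversation=int(2_000 * ratio),
        total_limit=total_limit,
    )

# Example: derive a 32K budget from the 16K proportions
budget_32k = scale_budget(32_000)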

D.5 Performance Benchmarks

Latency by Operation

Typical latency ranges (p50 values):

| Operation | Latency | Notes |
|---|---|---|
| Embedding | | |
| Single text | 10-50ms | API overhead dominates |
| Batch (100 texts) | 100-300ms | More efficient per-text |
| Vector Search | | |
| In-memory (10K vectors) | 1-5ms | Fastest option |
| In-memory (1M vectors) | 20-50ms | Still fast |
| Cloud (Pinecone, etc.) | 50-150ms | Network latency added |
| Reranking | | |
| Cross-encoder (10 docs) | 100-250ms | Per batch |
| Cohere Rerank | 150-300ms | API call |
| LLM Generation | | |
| First token (short context) | 200-500ms | Time to first token |
| First token (long context) | 500-2000ms | Scales with input |
| Full response (500 tokens) | 2-5s | Depends on output length |
| Full response (2000 tokens) | 8-15s | Streaming recommended |
| Full RAG Pipeline | | |
| Simple (embed + search + generate) | 1-3s | Typical |
| Complex (rerank + multi-step) | 3-8s | More processing |

Latency by Model

Time to first token with 10K token context:

| Model | First Token | Notes |
|---|---|---|
| GPT-4o-mini | 150-300ms | Fastest |
| Claude 3 Haiku | 200-400ms | Fast |
| GPT-4o | 300-600ms | Moderate |
| Claude 3.5 Sonnet | 400-800ms | Moderate |
| Claude 3 Opus | 800-1500ms | Slowest |

Context Size Impact

Latency scaling with context size (approximate):

| Context Size | Relative Latency | Example |
|---|---|---|
| 1K tokens | 1.0x | Baseline |
| 4K tokens | 1.2x | +20% |
| 16K tokens | 1.8x | +80% |
| 32K tokens | 2.5x | +150% |
| 128K tokens | 5-10x | +400-900% |

Throughput Guidelines

Sustainable request rates before hitting limits:

| Provider | Tier | Requests/min | Tokens/min |
|---|---|---|---|
| OpenAI | Free | 3 | 40,000 |
| OpenAI | Tier 1 | 500 | 200,000 |
| OpenAI | Tier 5 | 10,000 | 10,000,000 |
| Anthropic | Free | 5 | 40,000 |
| Anthropic | Build | 1,000 | 400,000 |
| Anthropic | Scale | Custom | Custom |
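
Published limits and latencies drift, so measure your own stack. A minimal sketch for estimating p50 latency of any call (call_llm is a placeholder for whatever request your system makes; it is not part of any SDK):

import statistics
import time

def measure_p50_latency(call_llm, num_runs: int = 20) -> float:
    """Rough p50 latency in milliseconds for a zero-argument callable.

    call_llm is a hypothetical function supplied by the caller that makes
    one request (embedding, search, or generation).
    """
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        call_llm()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)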

D.6 Worked Examples

Example 1: Simple RAG Chatbot

Setup: Customer support chatbot with document retrieval

Parameters:

  • 500 queries/day
  • Model: GPT-4o-mini
  • System prompt: 300 tokens
  • User query: 100 tokens average
  • RAG chunks: 1,500 tokens (3 chunks × 500)
  • Output: 300 tokens average

Calculation:

Input tokens per query: 300 + 100 + 1500 = 1,900
Input cost: 1,900 / 1,000,000 × $0.15 = $0.000285
Output cost: 300 / 1,000,000 × $0.60 = $0.00018
Total per query: $0.000465

Daily: 500 × $0.000465 = $0.23
Monthly: $0.23 × 30 = $7

Plus embedding costs (one-time):

  • 5,000 support documents × 400 tokens = 2M tokens
  • Cost: 2M / 1M × $0.02 = $0.04

Total monthly: ~$7 (embedding is negligible)
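
As a sanity check, the CostCalculator from D.3 reproduces these figures:

calc = CostCalculator("gpt-4o-mini")

per_query = calc.query_cost(
    system_prompt_tokens=300,
    user_query_tokens=100,
    rag_tokens=1500,
    expected_output_tokens=300
)
# per_query["total_cost"] ≈ $0.000465

monthly = calc.monthly_cost(
    queries_per_day=500,
    avg_cost_per_query=per_query["total_cost"]
)
# ≈ $7/month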


Example 2: Production Code Assistant

Setup: Internal developer tool for codebase Q&A

Parameters:

  • 2,000 queries/day
  • Model: Claude 3.5 Sonnet
  • System prompt: 800 tokens (detailed instructions)
  • User query: 200 tokens
  • RAG code chunks: 4,000 tokens (8 chunks × 500)
  • Conversation history: 1,000 tokens
  • Output: 800 tokens average (explanations + code)

Calculation:

Input tokens per query: 800 + 200 + 4000 + 1000 = 6,000
Input cost: 6,000 / 1,000,000 × $3.00 = $0.018
Output cost: 800 / 1,000,000 × $15.00 = $0.012
Total per query: $0.03

Daily: 2,000 × $0.03 = $60
Monthly: $60 × 30 = $1,800

Plus embedding costs (one-time):

  • 50,000 code files × 600 tokens = 30M tokens
  • Cost: 30M / 1M × $0.13 = $3.90 (using large model for code)

Total monthly: ~$1,800


Example 3: Multi-Agent System

Setup: Complex research assistant with routing and specialists

Parameters:

  • 1,000 user queries/day
  • Router: Claude 3 Haiku (fast, cheap)
  • Specialists: Claude 3.5 Sonnet
  • Average 2.5 specialist calls per user query

Router call:

Input: 500 tokens (prompt + query)
Output: 50 tokens (routing decision)
Cost: (500/1M × $0.25) + (50/1M × $1.25) = $0.000188

Specialist call (average):

Input: 4,000 tokens (prompt + context)
Output: 600 tokens
Cost: (4000/1M × $3.00) + (600/1M × $15.00) = $0.021

Per user query:

Router: $0.000188
Specialists (2.5 calls): 2.5 × $0.021 = $0.0525
Total: $0.053

Daily: 1,000 × $0.053 = $53
Monthly: $53 × 30 = $1,590
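
The MultiModelCalculator from D.3 reproduces this figure once its built-in 200-token router prompt and 500-token specialist prompt are accounted for:

multi = MultiModelCalculator()

per_call = multi.multi_agent_query(
    router_model="claude-haiku",
    router_tokens=300,              # 200-token router prompt + 300 = 500 input tokens
    specialist_model="claude-sonnet",
    specialist_calls=1,
    specialist_input_tokens=3500,   # 500-token specialist prompt + 3,500 = 4,000 input tokens
    specialist_output_tokens=600
)

# Average of 2.5 specialist calls per user query
per_user_query = per_call["router_cost"] + 2.5 * per_call["specialist_cost"]
# ≈ $0.053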

Example 4: High-Volume Consumer App

Setup: AI writing assistant with free tier

Parameters:

  • 50,000 queries/day (free users)
  • 10,000 queries/day (premium users)
  • Free: GPT-4o-mini, 2K context, 200 output
  • Premium: GPT-4o, 8K context, 500 output

Free tier:

Input cost: 2,000 / 1,000,000 × $0.15 = $0.0003
Output cost: 200 / 1,000,000 × $0.60 = $0.00012
Per query: $0.00042

Daily: 50,000 × $0.00042 = $21
Monthly: $630

Premium tier:

Input cost: 8,000 / 1,000,000 × $2.50 = $0.02
Output cost: 500 / 1,000,000 × $10.00 = $0.005
Per query: $0.025

Daily: 10,000 × $0.025 = $250
Monthly: $7,500

Total monthly: ~$8,130


D.7 Quick Reference

Cost Rules of Thumb

  • Budget models cost ~20x less than premium
  • Output tokens cost ~3-5x more than input tokens
  • RAG adds 1,000-5,000 tokens per query
  • Multi-agent multiplies base cost by number of calls
  • Embedding is cheap—don’t optimize prematurely

Token Rules of Thumb

  • 4 characters ≈ 1 token (English)
  • 1 page ≈ 650 tokens
  • 1 code file ≈ 8,000-10,000 tokens
  • JSON/XML is 20-30% more tokens than equivalent plain text

Latency Rules of Thumb

  • Embedding: 10-50ms (batch for efficiency)
  • Vector search: 20-100ms (depends on scale)
  • LLM first token: 200-800ms (depends on model + context)
  • Full RAG: 1-3 seconds (acceptable for most UX)

Model Selection Quick Guide

| Priority | Choose |
|---|---|
| Lowest cost | GPT-4o-mini |
| Best quality | Claude 3.5 Sonnet or GPT-4o |
| Fastest | Claude 3 Haiku or GPT-4o-mini |
| Longest context | Claude (200K) or GPT-4 Turbo (128K) |
| Routing/classification | Any budget model |

Monthly Cost Quick Estimates

| Scenario | Queries/day | Model | Monthly |
|---|---|---|---|
| Prototype | 100 | Budget | $3-5 |
| Small app | 1,000 | Budget | $30-50 |
| Small app | 1,000 | Premium | $600-1,000 |
| Production | 10,000 | Budget | $300-500 |
| Production | 10,000 | Premium | $6,000-10,000 |
| High volume | 100,000 | Budget | $3,000-5,000 |

D.8 Cost Monitoring Code

Track actual costs in production:

from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict
import json

@dataclass
class CostTracker:
    """Track LLM costs in production."""
    pricing: Dict[str, ModelPricing]
    daily_costs: Dict[str, float] = field(default_factory=dict)

    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        user_id: str = None
    ) -> float:
        """Record a request and return its cost."""
        pricing = self.pricing[model]
        cost = (
            (input_tokens / 1_000_000) * pricing.input_per_million +
            (output_tokens / 1_000_000) * pricing.output_per_million
        )

        # Track by day
        today = date.today().isoformat()
        self.daily_costs[today] = self.daily_costs.get(today, 0) + cost

        return cost

    def get_daily_total(self, day: str = None) -> float:
        """Get total cost for a day."""
        if day is None:
            day = date.today().isoformat()
        return self.daily_costs.get(day, 0)

    def check_budget(self, daily_limit: float) -> bool:
        """Check if under daily budget."""
        return self.get_daily_total() < daily_limit


# Usage
tracker = CostTracker(pricing=MODELS)

# After each LLM call
cost = tracker.record(
    model="claude-sonnet",
    input_tokens=4000,
    output_tokens=500,
    user_id="user_123"
)

# Check budget before expensive operations
if tracker.check_budget(daily_limit=100.0):
    # Proceed with request
    pass
else:
    # Degrade gracefully or alert
    pass

D.9 Cost-Latency Tradeoff Analysis

Every context engineering technique involves a fundamental tradeoff between cost and latency. Understanding this frontier helps you make informed decisions about which optimizations to apply.

The Cost-Latency Frontier

Common operations and their cost-latency impacts:

| Technique | Cost Impact | Latency Impact | When Worth It |
|---|---|---|---|
| Direct LLM call | Baseline | Baseline | Always your starting point |
| RAG retrieval | +$0.001-0.005 | +200-500ms | When you need external context |
| Reranking | +$0.002-0.010 | +100-300ms | When top-k retrieval is unreliable |
| Query expansion | +0.1x base cost | +200-400ms | When recall matters more than latency |
| Multi-agent routing | +$0.0002-0.001 | +50-200ms | When specialization improves quality |
| Prompt caching | -90% input cost | +0ms | When you have repeated prefixes |
| Context compression | +$0.001 | -10-20% latency | When context is large and redundant |

Key insight: Prompt caching is the only pure win—it reduces cost with no latency penalty. All other optimizations require justification.

Model Selection Decision Tree

Use this structured guide to choose models for your latency constraints:

If latency requirement < 500ms:
  └─ Use budget model (GPT-4o-mini, Claude 3 Haiku)
  └─ Expect: Sub-500ms first token, ~$0.001 per query

If latency requirement < 2 seconds:
  └─ If quality is paramount:
     └─ Use premium model (Claude 3.5 Sonnet, GPT-4o)
     └─ Expect: 400-800ms first token, ~$0.03 per query
  └─ If acceptable quality from budget model:
     └─ Use budget model + more context
     └─ Expect: ~$0.002 per query

If latency requirement < 5 seconds:
  └─ Use best-quality model for the task
  └─ Optimize context, not model
  └─ Expect: Quality-dependent costs

If simple classification/routing:
  └─ ALWAYS use budget model
  └─ Output is small, quality is predictable
  └─ Save ~20x on model cost

If context > 50,000 tokens:
  └─ Check if smaller context + better retrieval is cheaper
  └─ Compare: Large context premium model vs. small context budget model + reranking
  └─ Usually: Better retrieval + budget model wins
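
The tree collapses into a small routing helper. This is a sketch using the model keys from the MODELS table in D.3; the thresholds come from the tree above, not from any provider guidance:

def choose_model(
    latency_budget_ms: float,
    is_classification: bool = False,
    context_tokens: int = 0,
    quality_paramount: bool = True,
) -> str:
    """Pick a model tier from the decision tree above (sketch, not a policy engine)."""
    if is_classification:
        return "gpt-4o-mini"       # simple routing/classification: always a budget model
    if context_tokens > 50_000:
        return "gpt-4o-mini"       # prefer better retrieval + budget model over huge contexts
    if latency_budget_ms < 500:
        return "gpt-4o-mini"       # or "claude-haiku": sub-500ms first token
    if latency_budget_ms < 2_000:
        return "claude-sonnet" if quality_paramount else "gpt-4o-mini"
    return "claude-sonnet"         # looser budgets: use the best model, optimize context instead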

Quality Per Dollar Analysis

Calculate quality-per-dollar (QpD) for your system to find the best value:

from dataclasses import dataclass
from typing import Dict, List
import statistics

@dataclass
class QualityScore:
    """Evaluation score for a model response."""
    model: str
    latency_ms: float
    accuracy: float  # 0-1
    cost: float

    @property
    def quality_per_dollar(self) -> float:
        """Score per dollar spent."""
        if self.cost == 0:
            return float('inf')
        return self.accuracy / self.cost

    @property
    def quality_per_second(self) -> float:
        """Score per second of latency."""
        if self.latency_ms == 0:
            return float('inf')
        seconds = self.latency_ms / 1000
        return self.accuracy / seconds


class QualityPerDollarAnalysis:
    """Analyze quality-per-dollar across models."""

    def __init__(self):
        self.scores: List[QualityScore] = []

    def add_evaluation(self, scores: List[QualityScore]):
        """Add evaluation results."""
        self.scores.extend(scores)

    def best_model_for_latency_target(self, max_latency_ms: float) -> QualityScore:
        """Find the model with best QpD within latency budget."""
        candidates = [
            s for s in self.scores
            if s.latency_ms <= max_latency_ms
        ]

        if not candidates:
            return None

        return max(candidates, key=lambda s: s.quality_per_dollar)

    def best_model_for_budget(self, max_cost: float) -> QualityScore:
        """Find the model with best accuracy within cost budget."""
        candidates = [
            s for s in self.scores
            if s.cost <= max_cost
        ]

        if not candidates:
            return None

        return max(candidates, key=lambda s: s.accuracy)

    def report(self):
        """Print quality-per-dollar analysis."""
        print("Quality Per Dollar Analysis")
        print("=" * 70)
        print(f"{'Model':<20} {'Accuracy':<12} {'Cost':<12} {'QpD':<12}")
        print("-" * 70)

        for score in sorted(
            self.scores,
            key=lambda s: s.quality_per_dollar,
            reverse=True
        ):
            print(
                f"{score.model:<20} {score.accuracy:<12.2%} "
                f"${score.cost:<11.4f} {score.quality_per_dollar:<12.2f}"
            )


# Example: Evaluate multiple models on your task
analysis = QualityPerDollarAnalysis()

analysis.add_evaluation([
    QualityScore("gpt-4o-mini", latency_ms=250, accuracy=0.72, cost=0.0015),
    QualityScore("claude-haiku", latency_ms=300, accuracy=0.75, cost=0.0025),
    QualityScore("gpt-4o", latency_ms=450, accuracy=0.88, cost=0.030),
    QualityScore("claude-sonnet", latency_ms=500, accuracy=0.91, cost=0.038),
])

# For 500ms latency budget, what gives best quality/$?
best_500ms = analysis.best_model_for_latency_target(500)
print(f"Best for 500ms: {best_500ms.model} "
      f"({best_500ms.accuracy:.1%} accuracy, ${best_500ms.cost:.4f})")

# For $0.01 budget, what's the best accuracy?
best_budget = analysis.best_model_for_budget(0.01)
print(f"Best for $0.01: {best_budget.model} "
      f"({best_budget.accuracy:.1%} accuracy, {best_budget.latency_ms:.0f}ms)")

analysis.report()

D.10 Prompt Caching ROI Calculator

Prompt caching is one of the most underutilized cost optimizations. If your system has repeated prefixes (system prompts, conversation histories, reference materials), caching can save 90% on input costs.

When Caching Pays Off

Caching is worth implementing when:

  1. Repeated identical prefixes - Your system prompt, instructions, or static reference material appears in multiple queries
  2. High request volume - More queries mean more cache hits
  3. Large system prompts - The bigger the cached prefix, the bigger the savings

The formula is straightforward:

Monthly savings = (daily_queries × cache_hit_rate ×
                  cached_input_tokens × price_per_token ×
                  cache_discount_factor) × 30

At Anthropic pricing (90% discount on cached input):

  • Every 1,000 cached tokens saves ~$0.0027 per cache hit (roughly $2.40/day at 1,000 queries/day with a 90% hit rate)

Caching ROI Calculator

from dataclasses import dataclass

@dataclass
class CachingROICalculator:
    """Calculate prompt caching savings."""

    # Anthropic rates (early 2026)
    base_input_price_per_million = 3.00  # Claude 3.5 Sonnet
    cached_input_price_per_million = 0.30  # 90% discount

    def monthly_savings(
        self,
        system_prompt_tokens: int,
        queries_per_day: int,
        cache_hit_rate: float,
        base_price: float = None,
        cached_price: float = None
    ) -> float:
        """
        Calculate monthly savings from prompt caching.

        Args:
            system_prompt_tokens: Size of cached system prompt
            queries_per_day: Daily query volume
            cache_hit_rate: Fraction of queries that hit cache (0-1)
            base_price: Base input price per million tokens
            cached_price: Cached input price per million tokens

        Returns:
            Monthly savings in dollars
        """
        if base_price is None:
            base_price = self.base_input_price_per_million
        if cached_price is None:
            cached_price = self.cached_input_price_per_million

        # Cost per query without caching
        cost_uncached = (system_prompt_tokens / 1_000_000) * base_price

        # Cost per query with caching
        cached_queries = queries_per_day * cache_hit_rate
        uncached_queries = queries_per_day * (1 - cache_hit_rate)

        daily_cost_cached = (
            cached_queries * ((system_prompt_tokens / 1_000_000) * cached_price) +
            uncached_queries * ((system_prompt_tokens / 1_000_000) * base_price)
        )

        # Savings
        daily_cost_uncached = queries_per_day * cost_uncached
        daily_savings = daily_cost_uncached - daily_cost_cached
        monthly_savings = daily_savings * 30

        return monthly_savings

    def breakeven_queries(
        self,
        system_prompt_tokens: int,
        cache_hit_rate: float = 0.9,
        base_price: float = None,
        cached_price: float = None
    ) -> float:
        """
        How many queries per day to break even on caching overhead?

        Note: Caching has no implementation overhead, so breakeven is immediate.
        This returns the daily volume at which caching becomes worthwhile ($1+/day).

        Args:
            system_prompt_tokens: Size of system prompt
            cache_hit_rate: Expected cache hit rate
            base_price: Base input price per million
            cached_price: Cached input price per million

        Returns:
            Daily queries needed for $1/month savings
        """
        if base_price is None:
            base_price = self.base_input_price_per_million
        if cached_price is None:
            cached_price = self.cached_input_price_per_million

        # Savings per cached query
        savings_per_cached = (
            (system_prompt_tokens / 1_000_000) *
            (base_price - cached_price)
        )

        # Queries needed for $1/month
        target_monthly = 1.0
        target_daily = target_monthly / 30

        if savings_per_cached == 0:
            return float('inf')

        daily_queries = target_daily / (savings_per_cached * cache_hit_rate)
        return daily_queries

    def report(
        self,
        system_prompt_tokens: int,
        queries_per_day: int,
        cache_hit_rate: float = 0.9
    ):
        """Print caching ROI analysis."""
        monthly = self.monthly_savings(
            system_prompt_tokens,
            queries_per_day,
            cache_hit_rate
        )

        print(f"Prompt Caching ROI Analysis")
        print(f"=" * 60)
        print(f"System prompt: {system_prompt_tokens:,} tokens")
        print(f"Daily volume: {queries_per_day:,} queries")
        print(f"Cache hit rate: {cache_hit_rate:.0%}")
        print(f"-" * 60)
        print(f"Monthly savings: ${monthly:.2f}")
        print(f"Annual savings: ${monthly * 12:.2f}")

        if monthly > 0:
            print(f"\nCaching is worthwhile for this volume.")
        else:
            breakeven = self.breakeven_queries(system_prompt_tokens, cache_hit_rate)
            print(f"\nNeed {breakeven:.0f} queries/day for $1/month savings.")


# Example: 500-token system prompt, 90% cache hit at Anthropic rates
calc = CachingROICalculator()

print("Scenario: 500-token system prompt, 90% cache hit rate\n")

for volume in [100, 1000, 10000]:
    calc.report(
        system_prompt_tokens=500,
        queries_per_day=volume,
        cache_hit_rate=0.90
    )
    print()

Worked Example

Assume:

  • System prompt: 500 tokens
  • Cache hit rate: 90%
  • Claude 3.5 Sonnet pricing: $3.00 per million input tokens (base), $0.30 per million (cached)
  • Savings per cached input token: $0.0000027 per token

Results:

| Daily Volume | Monthly Cost (Uncached) | Monthly Cost (Cached) | Monthly Savings |
|---|---|---|---|
| 100 queries | $4.50 | $0.90 | $3.60 |
| 1,000 queries | $45 | $9 | $36 |
| 10,000 queries | $450 | $90 | $360 |
| 100,000 queries | $4,500 | $900 | $3,600 |

Caching becomes worthwhile at very modest volumes. Even 100 daily queries saves ~$3.60/month. At production scale (10,000+ daily), you’re looking at $300-3,600/month in savings from a trivial implementation.

Maximizing Cache Hit Rate

To get the most from caching:

  1. Keep static content as prefix - Put your entire system prompt, instructions, and reference material at the beginning of the message (before user input)
  2. Put dynamic content after cached prefix - User queries, conversation history, and dynamic context come after the static system prompt
  3. Avoid randomizing components - Don’t shuffle or randomize parts of your system prompt—consistency enables cache hits
  4. Batch similar requests - Process similar user queries together to maximize the chance they share cached prefixes
  5. Version your prompts - When you update system prompts, do it carefully. A one-character change invalidates all caches

Design your prompt structure like this:

[CACHED PREFIX - never changes]
System prompt (static instructions)
Reference material (company guidelines, examples)
Tool definitions
[END CACHED PREFIX]

[DYNAMIC - changes per query]
Current date/time
Conversation history
User query
Context for this specific request
[END DYNAMIC]

This structure ensures every query benefits from the cached prefix.
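
With Anthropic's prompt caching, this structure maps to marking the static system blocks with cache_control. A minimal sketch, assuming STATIC_SYSTEM_PROMPT and user_query are defined by your application; verify parameter names against the current provider documentation:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,            # static instructions + reference material
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[
        {"role": "user", "content": user_query}      # dynamic content comes after the cached prefix
    ],
)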


D.11 Enhanced Cost Monitoring

Upgrade the CostTracker from D.8 with production-ready monitoring capabilities:

from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict, Callable, Optional
import json

@dataclass
class CostTracker:
    """Track LLM costs in production with budget alerts and model degradation."""
    pricing: Dict[str, ModelPricing]
    daily_costs: Dict[str, float] = field(default_factory=dict)
    daily_budget: float = 100.0
    alert_thresholds: Dict[float, bool] = field(default_factory=lambda: {0.5: False, 0.8: False})
    on_threshold_reached: Optional[Callable[[float, float], None]] = None

    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        user_id: str = None
    ) -> float:
        """Record a request and return its cost."""
        pricing = self.pricing[model]
        cost = (
            (input_tokens / 1_000_000) * pricing.input_per_million +
            (output_tokens / 1_000_000) * pricing.output_per_million
        )

        # Track by day
        today = date.today().isoformat()
        self.daily_costs[today] = self.daily_costs.get(today, 0) + cost

        # Check thresholds
        self._check_thresholds()

        return cost

    def get_daily_total(self, day: str = None) -> float:
        """Get total cost for a day."""
        if day is None:
            day = date.today().isoformat()
        return self.daily_costs.get(day, 0)

    def check_budget(self, daily_limit: float = None) -> bool:
        """Check if under daily budget."""
        if daily_limit is None:
            daily_limit = self.daily_budget
        return self.get_daily_total() < daily_limit

    def _check_thresholds(self):
        """Check if cost has reached alert thresholds."""
        current_cost = self.get_daily_total()

        for threshold_pct, already_alerted in self.alert_thresholds.items():
            threshold_amount = self.daily_budget * threshold_pct

            if current_cost >= threshold_amount and not already_alerted:
                self.alert_thresholds[threshold_pct] = True

                if self.on_threshold_reached:
                    self.on_threshold_reached(threshold_pct, current_cost)

    def degrade_model(self, current_model: str) -> str:
        """
        Return a cheaper model when approaching budget.

        Degradation strategy:
        - If using premium, switch to budget version of same provider
        - If already on budget, return current model (can't go lower)

        Args:
            current_model: Current model name (e.g., "claude-sonnet")

        Returns:
            Cheaper model name or current if already cheapest
        """
        degradation_map = {
            "claude-sonnet": "claude-haiku",
            "claude-opus": "claude-sonnet",
            "gpt-4o": "gpt-4o-mini",
            "gpt-4o-mini": "gpt-4o-mini",  # Already cheapest
            "claude-haiku": "claude-haiku",  # Already cheapest
        }

        return degradation_map.get(current_model, current_model)

    def should_degrade_model(self, threshold: float = 0.8) -> bool:
        """
        Check if we should degrade to a cheaper model.

        Args:
            threshold: Percentage of daily budget (0-1) to trigger degradation

        Returns:
            True if current spend exceeds threshold
        """
        current_cost = self.get_daily_total()
        threshold_amount = self.daily_budget * threshold
        return current_cost >= threshold_amount

    def get_cost_status(self) -> Dict:
        """Get comprehensive cost status."""
        current_cost = self.get_daily_total()
        remaining = self.daily_budget - current_cost
        pct_used = (current_cost / self.daily_budget) * 100 if self.daily_budget > 0 else 0

        return {
            "daily_budget": self.daily_budget,
            "current_cost": round(current_cost, 4),
            "remaining": round(remaining, 4),
            "percent_used": round(pct_used, 1),
            "under_budget": current_cost < self.daily_budget,
            "at_50_percent": current_cost >= (self.daily_budget * 0.5),
            "at_80_percent": current_cost >= (self.daily_budget * 0.8),
        }


# Usage example with alerting
def alert_callback(threshold_pct: float, current_cost: float):
    """Callback when cost thresholds are reached."""
    print(f"ALERT: Daily cost at {threshold_pct:.0%} of budget: ${current_cost:.2f}")
    # In production: send to monitoring system, PagerDuty, etc.


tracker = CostTracker(
    pricing=MODELS,
    daily_budget=100.0,
    on_threshold_reached=alert_callback
)

# Record a query
cost = tracker.record(
    model="claude-sonnet",
    input_tokens=4000,
    output_tokens=500
)

# Check if we should degrade to save money
if tracker.should_degrade_model(threshold=0.8):
    cheaper_model = tracker.degrade_model("claude-sonnet")
    print(f"Degrading from claude-sonnet to {cheaper_model}")

# Get status anytime
status = tracker.get_cost_status()
print(f"Cost status: {status['percent_used']:.1f}% of daily budget used")

Key additions:

  1. Alert thresholds - Automatically trigger callbacks at 50% and 80% of daily budget
  2. Model degradation - Intelligently downgrade to cheaper models when approaching budget (e.g., Sonnet → Haiku, GPT-4o → GPT-4o-mini)
  3. Callback mechanism - on_threshold_reached allows integration with monitoring, alerting, and operational systems
  4. Comprehensive status - get_cost_status() provides real-time visibility into budget utilization

In production, wire the callback to:

  • Send alerts to Slack/PagerDuty
  • Log to monitoring systems (Datadog, New Relic)
  • Trigger automatic cost reduction (reduce concurrency, use cheaper models)
  • Update user-facing dashboards


Appendix Cross-References

| This Section | Related Appendix | Connection |
|---|---|---|
| D.1 Token Estimation | Appendix C: Quick Reference tables | Quick lookup |
| D.2 Model Pricing | Appendix A: A.1 Framework overhead | Total cost picture |
| D.4 Context Budget | Appendix B: B.1.3 Token Budget Allocation | Implementation pattern |
| D.5 Performance Benchmarks | Appendix C: Latency Benchmarks | Debugging slow systems |
| D.8 Cost Monitoring | Appendix B: B.8.5 Cost Tracking pattern | Pattern implementation |

Appendix D complete. For production cost strategies, see Chapter 11. For debugging cost-related issues, see Appendix C.