Appendix D: Cost Reference

Appendix D, v2.1 — Early 2026

Pricing in this appendix reflects early 2026 rates. Models, pricing tiers, and cost structures change frequently. Use the methodologies here with current pricing from provider documentation.

This appendix provides the numbers you need to estimate costs before they surprise you. No theory—Chapter 11 covers why costs matter. Here you’ll find formulas, calculators, pricing tables, and worked examples.

Important: LLM pricing changes frequently. The numbers here reflect early 2026 rates. Always check provider pricing pages before making commitments. The formulas and patterns, however, remain useful regardless of specific prices.


D.1 Token Estimation

Tokens are the currency of LLM costs. Every API call charges by tokens consumed.

Quick Estimation Rules

For English text:

| Content | Tokens |
|---|---|
| 1 character | ~0.25 tokens |
| 1 word | ~1.3 tokens |
| 4 characters | ~1 token |
| 100 words | ~130 tokens |
| 1 page (500 words) | ~650 tokens |
| 1,000 words | ~1,300 tokens |

For code:

| Content | Tokens |
|---|---|
| 1 line of code | ~15-20 tokens |
| 1 function (typical) | ~100-500 tokens |
| 1 file (500 lines) | ~8,000-10,000 tokens |
| JSON (per KB) | ~400 tokens |

Code is less token-efficient than prose. Punctuation, indentation, and special characters all consume tokens. JSON and structured data are particularly token-hungry.

Token Estimation Code

For quick estimates during development:

def estimate_tokens(text: str) -> int:
    """Quick token estimate: 1 token ≈ 4 characters for English."""
    return len(text) // 4

def estimate_tokens_words(word_count: int) -> int:
    """Estimate from word count: 1 token ≈ 0.75 words."""
    return int(word_count * 1.33)

For accurate counts, use the tokenizer libraries:

# OpenAI models
import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4") -> int:
    """Exact token count for OpenAI models."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Anthropic models
from anthropic import Anthropic

def count_tokens_anthropic(text: str, model: str = "claude-3-5-sonnet-20241022") -> int:
    """Exact token count for Claude models via the token-counting endpoint."""
    # Note: method names vary across anthropic SDK versions; older releases
    # exposed client.count_tokens(text). Check the current SDK docs.
    client = Anthropic()
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens

Model-Specific Differences

Different models tokenize differently. The same text may have different token counts across providers:

| Text Sample | GPT-4 | Claude | Llama |
|---|---|---|---|
| “Hello, world!” | 4 | 4 | 5 |
| def foo(): return 42 | 9 | 8 | 11 |
| 1KB JSON | ~380 | ~400 | ~420 |

For budgeting purposes, use the 4-character rule for estimates, then verify with the actual tokenizer before production deployment.


D.2 Model Pricing

Generation Models

Prices per 1 million tokens (early 2026):

| Model | Input | Output | Notes |
|---|---|---|---|
| Premium Tier | | | |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best quality/cost balance |
| GPT-4o | $2.50 | $10.00 | Multimodal capable |
| Claude 3 Opus | $15.00 | $75.00 | Maximum capability |
| GPT-4 Turbo | $10.00 | $30.00 | Large context window |
| Budget Tier | | | |
| GPT-4o-mini | $0.15 | $0.60 | 20x cheaper than GPT-4o |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, efficient |
| Claude 3.5 Haiku | $0.80 | $4.00 | Improved Haiku |
| Open Source (API) | | | |
| Llama 3 70B (via API) | $0.50-1.00 | $0.50-1.00 | Provider dependent |
| Mixtral 8x7B | $0.25-0.50 | $0.25-0.50 | Provider dependent |

Cost Per Query

What a single 10,000-token context query costs:

| Model | Input Cost | Output (500 tok) | Total |
|---|---|---|---|
| Claude 3.5 Sonnet | $0.030 | $0.0075 | ~$0.038 |
| GPT-4o | $0.025 | $0.005 | ~$0.030 |
| GPT-4o-mini | $0.0015 | $0.0003 | ~$0.002 |
| Claude 3 Haiku | $0.0025 | $0.0006 | ~$0.003 |

Embedding Models

Prices per 1 million tokens:

| Model | Price | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | $0.02 | 1536 | Best value |
| text-embedding-3-large | $0.13 | 3072 | Higher quality |
| text-embedding-ada-002 | $0.10 | 1536 | Legacy |
| Cohere embed-v3 | $0.10 | 1024 | Good multilingual |
| Voyage-2 | $0.10 | 1024 | Code-optimized available |

Cached vs. Uncached

Some providers offer prompt caching at reduced rates:

| Provider | Cached Input | Savings |
|---|---|---|
| Anthropic | 10% of base | 90% off |
| OpenAI | Varies | Up to 50% off |

Cache hits require exact prefix matches. Design system prompts to maximize cache reuse.


D.3 Cost Calculators

Basic Cost Formula

input_cost = (input_tokens / 1,000,000) × input_price_per_million
output_cost = (output_tokens / 1,000,000) × output_price_per_million
total_cost = input_cost + output_cost
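
For example, 4,000 input tokens and 500 output tokens on Claude 3.5 Sonnet cost (4,000 / 1,000,000) × $3.00 + (500 / 1,000,000) × $15.00 = $0.012 + $0.0075 ≈ $0.02.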

Query Cost Calculator

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelPricing:
    """Pricing for a specific model."""
    name: str
    input_per_million: float
    output_per_million: float

# Common model pricing (early 2026)
MODELS = {
    "claude-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
    "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
    "gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
    "claude-haiku": ModelPricing("Claude 3 Haiku", 0.25, 1.25),
}

class CostCalculator:
    """Calculate LLM costs for context engineering systems."""

    def __init__(self, model: str):
        self.pricing = MODELS[model]

    def query_cost(
        self,
        system_prompt_tokens: int,
        user_query_tokens: int,
        rag_tokens: int = 0,
        memory_tokens: int = 0,
        conversation_tokens: int = 0,
        expected_output_tokens: int = 500
    ) -> dict:
        """Calculate cost for a single query."""
        total_input = (
            system_prompt_tokens +
            user_query_tokens +
            rag_tokens +
            memory_tokens +
            conversation_tokens
        )

        input_cost = (total_input / 1_000_000) * self.pricing.input_per_million
        output_cost = (expected_output_tokens / 1_000_000) * self.pricing.output_per_million

        return {
            "model": self.pricing.name,
            "input_tokens": total_input,
            "output_tokens": expected_output_tokens,
            "input_cost": round(input_cost, 6),
            "output_cost": round(output_cost, 6),
            "total_cost": round(input_cost + output_cost, 6),
        }

    def daily_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
        """Project daily costs."""
        return queries_per_day * avg_cost_per_query

    def monthly_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
        """Project monthly costs (30 days)."""
        return queries_per_day * 30 * avg_cost_per_query


# Example usage
calc = CostCalculator("claude-sonnet")

# Typical RAG query
result = calc.query_cost(
    system_prompt_tokens=500,
    user_query_tokens=100,
    rag_tokens=2000,
    memory_tokens=400,
    conversation_tokens=1000,
    expected_output_tokens=500
)
# Result: ~$0.02 per query

# Monthly projection
monthly = calc.monthly_cost(
    queries_per_day=1000,
    avg_cost_per_query=0.02
)
# Result: ~$600/month

Multi-Model Cost Calculator

For systems using multiple models (routing, specialist agents):

class MultiModelCalculator:
    """Calculate costs for multi-agent systems."""

    def __init__(self):
        self.calculators = {
            name: CostCalculator(name) for name in MODELS
        }

    def multi_agent_query(
        self,
        router_model: str,
        router_tokens: int,
        specialist_model: str,
        specialist_calls: int,
        specialist_input_tokens: int,
        specialist_output_tokens: int
    ) -> dict:
        """Calculate cost for a multi-agent query."""
        # Router cost (typically small, budget model)
        router_calc = self.calculators[router_model]
        router_cost = router_calc.query_cost(
            system_prompt_tokens=200,
            user_query_tokens=router_tokens,
            expected_output_tokens=50
        )["total_cost"]

        # Specialist costs
        specialist_calc = self.calculators[specialist_model]
        specialist_cost = specialist_calc.query_cost(
            system_prompt_tokens=500,
            user_query_tokens=specialist_input_tokens,
            expected_output_tokens=specialist_output_tokens
        )["total_cost"] * specialist_calls

        return {
            "router_cost": router_cost,
            "specialist_cost": specialist_cost,
            "total_cost": router_cost + specialist_cost,
            "calls": 1 + specialist_calls
        }


# Example: Router + 2 specialist calls
multi = MultiModelCalculator()
result = multi.multi_agent_query(
    router_model="claude-haiku",
    router_tokens=200,
    specialist_model="claude-sonnet",
    specialist_calls=2,
    specialist_input_tokens=3000,
    specialist_output_tokens=800
)
# Result: ~$0.05 per user query

Embedding Cost Calculator

def embedding_cost(
    num_documents: int,
    avg_tokens_per_doc: int,
    price_per_million: float = 0.02  # text-embedding-3-small
) -> dict:
    """Calculate one-time embedding costs."""
    total_tokens = num_documents * avg_tokens_per_doc
    cost = (total_tokens / 1_000_000) * price_per_million

    return {
        "documents": num_documents,
        "total_tokens": total_tokens,
        "cost": round(cost, 4)
    }

# Example: Embed 10,000 documents
result = embedding_cost(
    num_documents=10_000,
    avg_tokens_per_doc=500,
    price_per_million=0.02
)
# Result: 5M tokens, $0.10

D.4 Context Budget Allocation

Reference Budget Template

A production-ready token budget for a 16,000-token context:

Total Context Budget: 16,000 tokens
├── System Prompt:       500 tokens  (3%)   [fixed]
├── User Query:        1,000 tokens  (6%)   [truncate if longer]
├── Memory Context:      400 tokens  (3%)   [most relevant only]
├── RAG Results:       2,000 tokens (13%)   [top-k with limit]
├── Conversation:      2,000 tokens (13%)   [sliding window]
└── Response Headroom: 10,100 tokens (62%)  [for model output]

Allocation by Use Case

| Component | Chatbot | RAG System | Code Assistant | Agent |
|---|---|---|---|---|
| System Prompt | 3-5% | 5-8% | 8-10% | 10-15% |
| User Query | 5-10% | 5-10% | 10-15% | 5-10% |
| Memory | 5-10% | 2-5% | 5-10% | 10-15% |
| RAG/Context | 0% | 15-25% | 20-30% | 10-20% |
| Conversation | 20-30% | 10-15% | 10-15% | 5-10% |
| Response | 50-60% | 50-60% | 40-50% | 40-50% |

Budget Enforcement Code

from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class ContextBudget:
    """Define and enforce token budgets."""
    system_prompt: int = 500
    user_query: int = 1000
    memory: int = 400
    rag: int = 2000
    conversation: int = 2000
    total_limit: int = 16000

    def allocate(self, components: Dict[str, str]) -> Dict[str, str]:
        """Truncate components to fit budgets."""
        allocated = {}

        for name, content in components.items():
            limit = getattr(self, name, 1000)
            tokens = len(content) // 4  # Quick estimate

            if tokens <= limit:
                allocated[name] = content
            else:
                # Truncate to fit budget
                char_limit = limit * 4
                allocated[name] = content[:char_limit]

        return allocated

    def remaining_for_response(self, used_tokens: int) -> int:
        """Calculate remaining tokens for response."""
        return self.total_limit - used_tokens


# Example usage
budget = ContextBudget(
    system_prompt=500,
    user_query=1000,
    memory=400,
    rag=2000,
    conversation=2000,
    total_limit=16000
)

components = {
    "system_prompt": system_prompt_text,
    "user_query": user_input,
    "memory": retrieved_memories,
    "rag": retrieved_chunks,
    "conversation": conversation_history
}

allocated = budget.allocate(components)

Scaling Budgets

For different context windows:

| Total Budget | System | Query | Memory | RAG | Conv | Response |
|---|---|---|---|---|---|---|
| 4K tokens | 200 | 400 | 200 | 500 | 500 | 2,200 |
| 16K tokens | 500 | 1,000 | 400 | 2,000 | 2,000 | 10,100 |
| 32K tokens | 1,000 | 2,000 | 800 | 5,000 | 4,000 | 19,200 |
| 128K tokens | 2,000 | 4,000 | 2,000 | 20,000 | 10,000 | 90,000 |
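
These rows are hand-tuned rather than strictly proportional, but if you want a programmatic starting point, a minimal sketch that scales the 16K reference allocation (reusing the ContextBudget class above) looks like this:

def scale_budget(total_limit: int) -> ContextBudget:
    """Scale the 16K reference allocation to another context window.

    Sketch only: proportions come from the 16K row above. The hand-tuned
    rows in the table (e.g., extra RAG at 32K) will differ slightly.
    """
    ratio = total_limit / 16_000
    return ContextBudget(
        system_prompt=int(500 * ratio),
        user_query=int(1_000 * ratio),
        memory=int(400 * ratio),
        rag=int(2_000 * ratio),
        conversation=int(2_000 * ratio),
        total_limit=total_limit,
    )

# Example: derive a 32K budget from the 16K proportions
budget_32k = scale_budget(32_000)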

D.5 Performance Benchmarks

Latency by Operation

Typical latency ranges (p50 values):

| Operation | Latency | Notes |
|---|---|---|
| Embedding | | |
| Single text | 10-50ms | API overhead dominates |
| Batch (100 texts) | 100-300ms | More efficient per-text |
| Vector Search | | |
| In-memory (10K vectors) | 1-5ms | Fastest option |
| In-memory (1M vectors) | 20-50ms | Still fast |
| Cloud (Pinecone, etc.) | 50-150ms | Network latency added |
| Reranking | | |
| Cross-encoder (10 docs) | 100-250ms | Per batch |
| Cohere Rerank | 150-300ms | API call |
| LLM Generation | | |
| First token (short context) | 200-500ms | Time to first token |
| First token (long context) | 500-2000ms | Scales with input |
| Full response (500 tokens) | 2-5s | Depends on output length |
| Full response (2000 tokens) | 8-15s | Streaming recommended |
| Full RAG Pipeline | | |
| Simple (embed + search + generate) | 1-3s | Typical |
| Complex (rerank + multi-step) | 3-8s | More processing |

Latency by Model

Time to first token with 10K token context:

| Model | First Token | Notes |
|---|---|---|
| GPT-4o-mini | 150-300ms | Fastest |
| Claude 3 Haiku | 200-400ms | Fast |
| GPT-4o | 300-600ms | Moderate |
| Claude 3.5 Sonnet | 400-800ms | Moderate |
| Claude 3 Opus | 800-1500ms | Slowest |

Context Size Impact

Latency scaling with context size (approximate):

| Context Size | Relative Latency | Example |
|---|---|---|
| 1K tokens | 1.0x | Baseline |
| 4K tokens | 1.2x | +20% |
| 16K tokens | 1.8x | +80% |
| 32K tokens | 2.5x | +150% |
| 128K tokens | 5-10x | +400-900% |

Throughput Guidelines

Sustainable request rates before hitting limits:

| Provider | Tier | Requests/min | Tokens/min |
|---|---|---|---|
| OpenAI | Free | 3 | 40,000 |
| OpenAI | Tier 1 | 500 | 200,000 |
| OpenAI | Tier 5 | 10,000 | 10,000,000 |
| Anthropic | Free | 5 | 40,000 |
| Anthropic | Build | 1,000 | 400,000 |
| Anthropic | Scale | Custom | Custom |
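
Published limits and latencies drift, so measure your own stack. A minimal sketch for estimating p50 latency of any call (call_llm is a placeholder for whatever request your system makes; it is not part of any SDK):

import statistics
import time

def measure_p50_latency(call_llm, num_runs: int = 20) -> float:
    """Rough p50 latency in milliseconds for a zero-argument callable.

    call_llm is a hypothetical function supplied by the caller that makes
    one request (embedding, search, or generation).
    """
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        call_llm()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)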

D.6 Worked Examples

Example 1: Simple RAG Chatbot

Setup: Customer support chatbot with document retrieval

Parameters:

  • 500 queries/day
  • Model: GPT-4o-mini
  • System prompt: 300 tokens
  • User query: 100 tokens average
  • RAG chunks: 1,500 tokens (3 chunks × 500)
  • Output: 300 tokens average

Calculation:

Input tokens per query: 300 + 100 + 1500 = 1,900
Input cost: 1,900 / 1,000,000 × $0.15 = $0.000285
Output cost: 300 / 1,000,000 × $0.60 = $0.00018
Total per query: $0.000465

Daily: 500 × $0.000465 = $0.23
Monthly: $0.23 × 30 = $7

Plus embedding costs (one-time):

  • 5,000 support documents × 400 tokens = 2M tokens
  • Cost: 2M / 1M × $0.02 = $0.04

Total monthly: ~$7 (embedding is negligible)
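
As a sanity check, the CostCalculator from D.3 reproduces these figures:

calc = CostCalculator("gpt-4o-mini")

per_query = calc.query_cost(
    system_prompt_tokens=300,
    user_query_tokens=100,
    rag_tokens=1500,
    expected_output_tokens=300
)
# per_query["total_cost"] ≈ $0.000465

monthly = calc.monthly_cost(
    queries_per_day=500,
    avg_cost_per_query=per_query["total_cost"]
)
# ≈ $7/month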


Example 2: Production Code Assistant

Setup: Internal developer tool for codebase Q&A

Parameters:

  • 2,000 queries/day
  • Model: Claude 3.5 Sonnet
  • System prompt: 800 tokens (detailed instructions)
  • User query: 200 tokens
  • RAG code chunks: 4,000 tokens (8 chunks × 500)
  • Conversation history: 1,000 tokens
  • Output: 800 tokens average (explanations + code)

Calculation:

Input tokens per query: 800 + 200 + 4000 + 1000 = 6,000
Input cost: 6,000 / 1,000,000 × $3.00 = $0.018
Output cost: 800 / 1,000,000 × $15.00 = $0.012
Total per query: $0.03

Daily: 2,000 × $0.03 = $60
Monthly: $60 × 30 = $1,800

Plus embedding costs (one-time):

  • 50,000 code files × 600 tokens = 30M tokens
  • Cost: 30M / 1M × $0.13 = $3.90 (using large model for code)

Total monthly: ~$1,800


Example 3: Multi-Agent System

Setup: Complex research assistant with routing and specialists

Parameters:

  • 1,000 user queries/day
  • Router: Claude 3 Haiku (fast, cheap)
  • Specialists: Claude 3.5 Sonnet
  • Average 2.5 specialist calls per user query

Router call:

Input: 500 tokens (prompt + query)
Output: 50 tokens (routing decision)
Cost: (500/1M × $0.25) + (50/1M × $1.25) = $0.000188

Specialist call (average):

Input: 4,000 tokens (prompt + context)
Output: 600 tokens
Cost: (4000/1M × $3.00) + (600/1M × $15.00) = $0.021

Per user query:

Router: $0.000188
Specialists (2.5 calls): 2.5 × $0.021 = $0.0525
Total: $0.053

Daily: 1,000 × $0.053 = $53
Monthly: $53 × 30 = $1,590
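
The MultiModelCalculator from D.3 reproduces this figure once its built-in 200-token router prompt and 500-token specialist prompt are accounted for:

multi = MultiModelCalculator()

per_call = multi.multi_agent_query(
    router_model="claude-haiku",
    router_tokens=300,              # 200-token router prompt + 300 = 500 input tokens
    specialist_model="claude-sonnet",
    specialist_calls=1,
    specialist_input_tokens=3500,   # 500-token specialist prompt + 3,500 = 4,000 input tokens
    specialist_output_tokens=600
)

# Average of 2.5 specialist calls per user query
per_user_query = per_call["router_cost"] + 2.5 * per_call["specialist_cost"]
# ≈ $0.053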

Example 4: High-Volume Consumer App

Setup: AI writing assistant with free tier

Parameters:

  • 50,000 queries/day (free users)
  • 10,000 queries/day (premium users)
  • Free: GPT-4o-mini, 2K context, 200 output
  • Premium: GPT-4o, 8K context, 500 output

Free tier:

Input cost: 2,000 / 1,000,000 × $0.15 = $0.0003
Output cost: 200 / 1,000,000 × $0.60 = $0.00012
Per query: $0.00042

Daily: 50,000 × $0.00042 = $21
Monthly: $630

Premium tier:

Input cost: 8,000 / 1,000,000 × $2.50 = $0.02
Output cost: 500 / 1,000,000 × $10.00 = $0.005
Per query: $0.025

Daily: 10,000 × $0.025 = $250
Monthly: $7,500

Total monthly: ~$8,130


D.7 Quick Reference

Cost Rules of Thumb

  • Budget models cost ~20x less than premium
  • Output tokens cost ~3-5x more than input tokens
  • RAG adds 1,000-5,000 tokens per query
  • Multi-agent multiplies base cost by number of calls
  • Embedding is cheap—don’t optimize prematurely

Token Rules of Thumb

  • 4 characters ≈ 1 token (English)
  • 1 page ≈ 650 tokens
  • 1 code file ≈ 8,000-10,000 tokens
  • JSON/XML is 20-30% more tokens than equivalent plain text

Latency Rules of Thumb

  • Embedding: 10-50ms (batch for efficiency)
  • Vector search: 20-100ms (depends on scale)
  • LLM first token: 200-800ms (depends on model + context)
  • Full RAG: 1-3 seconds (acceptable for most UX)

Model Selection Quick Guide

| Priority | Choose |
|---|---|
| Lowest cost | GPT-4o-mini |
| Best quality | Claude 3.5 Sonnet or GPT-4o |
| Fastest | Claude 3 Haiku or GPT-4o-mini |
| Longest context | Claude (200K) or GPT-4 Turbo (128K) |
| Routing/classification | Any budget model |

Monthly Cost Quick Estimates

| Scenario | Queries/day | Model | Monthly |
|---|---|---|---|
| Prototype | 100 | Budget | $3-5 |
| Small app | 1,000 | Budget | $30-50 |
| Small app | 1,000 | Premium | $600-1,000 |
| Production | 10,000 | Budget | $300-500 |
| Production | 10,000 | Premium | $6,000-10,000 |
| High volume | 100,000 | Budget | $3,000-5,000 |

D.8 Cost Monitoring Code

Track actual costs in production:

from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict
import json

@dataclass
class CostTracker:
    """Track LLM costs in production."""
    pricing: Dict[str, ModelPricing]
    daily_costs: Dict[str, float] = field(default_factory=dict)

    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        user_id: str = None
    ) -> float:
        """Record a request and return its cost."""
        pricing = self.pricing[model]
        cost = (
            (input_tokens / 1_000_000) * pricing.input_per_million +
            (output_tokens / 1_000_000) * pricing.output_per_million
        )

        # Track by day
        today = date.today().isoformat()
        self.daily_costs[today] = self.daily_costs.get(today, 0) + cost

        return cost

    def get_daily_total(self, day: str = None) -> float:
        """Get total cost for a day."""
        if day is None:
            day = date.today().isoformat()
        return self.daily_costs.get(day, 0)

    def check_budget(self, daily_limit: float) -> bool:
        """Check if under daily budget."""
        return self.get_daily_total() < daily_limit


# Usage
tracker = CostTracker(pricing=MODELS)

# After each LLM call
cost = tracker.record(
    model="claude-sonnet",
    input_tokens=4000,
    output_tokens=500,
    user_id="user_123"
)

# Check budget before expensive operations
if tracker.check_budget(daily_limit=100.0):
    # Proceed with request
    pass
else:
    # Degrade gracefully or alert
    pass

D.9 Cost-Latency Tradeoff Analysis

Every context engineering technique involves a fundamental tradeoff between cost and latency. Understanding this frontier helps you make informed decisions about which optimizations to apply.

The Cost-Latency Frontier

Common operations and their cost-latency impacts:

| Technique | Cost Impact | Latency Impact | When Worth It |
|---|---|---|---|
| Direct LLM call | Baseline | Baseline | Always your starting point |
| RAG retrieval | +$0.001-0.005 | +200-500ms | When you need external context |
| Reranking | +$0.002-0.010 | +100-300ms | When top-k retrieval is unreliable |
| Query expansion | +0.1x base cost | +200-400ms | When recall matters more than latency |
| Multi-agent routing | +$0.0002-0.001 | +50-200ms | When specialization improves quality |
| Prompt caching | -90% input cost | +0ms | When you have repeated prefixes |
| Context compression | +$0.001 | -10-20% latency | When context is large and redundant |

Key insight: Prompt caching is the only pure win—it reduces cost with no latency penalty. All other optimizations require justification.

Model Selection Decision Tree

Use this structured guide to choose models for your latency constraints:

If latency requirement < 500ms:
  └─ Use budget model (GPT-4o-mini, Claude 3 Haiku)
  └─ Expect: Sub-500ms first token, ~$0.001 per query

If latency requirement < 2 seconds:
  └─ If quality is paramount:
     └─ Use premium model (Claude 3.5 Sonnet, GPT-4o)
     └─ Expect: 400-800ms first token, ~$0.03 per query
  └─ If acceptable quality from budget model:
     └─ Use budget model + more context
     └─ Expect: ~$0.002 per query

If latency requirement < 5 seconds:
  └─ Use best-quality model for the task
  └─ Optimize context, not model
  └─ Expect: Quality-dependent costs

If simple classification/routing:
  └─ ALWAYS use budget model
  └─ Output is small, quality is predictable
  └─ Save ~20x on model cost

If context > 50,000 tokens:
  └─ Check if smaller context + better retrieval is cheaper
  └─ Compare: Large context premium model vs. small context budget model + reranking
  └─ Usually: Better retrieval + budget model wins
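
The tree collapses into a small routing helper. This is a sketch using the model keys from the MODELS table in D.3; the thresholds come from the tree above, not from any provider guidance:

def choose_model(
    latency_budget_ms: float,
    is_classification: bool = False,
    context_tokens: int = 0,
    quality_paramount: bool = True,
) -> str:
    """Pick a model tier from the decision tree above (sketch, not a policy engine)."""
    if is_classification:
        return "gpt-4o-mini"       # simple routing/classification: always a budget model
    if context_tokens > 50_000:
        return "gpt-4o-mini"       # prefer better retrieval + budget model over huge contexts
    if latency_budget_ms < 500:
        return "gpt-4o-mini"       # or "claude-haiku": sub-500ms first token
    if latency_budget_ms < 2_000:
        return "claude-sonnet" if quality_paramount else "gpt-4o-mini"
    return "claude-sonnet"         # looser budgets: use the best model, optimize context instead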

Quality Per Dollar Analysis

Calculate quality-per-dollar (QpD) for your system to find the best value:

from dataclasses import dataclass
from typing import Dict, List
import statistics

@dataclass
class QualityScore:
    """Evaluation score for a model response."""
    model: str
    latency_ms: float
    accuracy: float  # 0-1
    cost: float

    @property
    def quality_per_dollar(self) -> float:
        """Score per dollar spent."""
        if self.cost == 0:
            return float('inf')
        return self.accuracy / self.cost

    @property
    def quality_per_second(self) -> float:
        """Score per second of latency."""
        if self.latency_ms == 0:
            return float('inf')
        seconds = self.latency_ms / 1000
        return self.accuracy / seconds


class QualityPerDollarAnalysis:
    """Analyze quality-per-dollar across models."""

    def __init__(self):
        self.scores: List[QualityScore] = []

    def add_evaluation(self, scores: List[QualityScore]):
        """Add evaluation results."""
        self.scores.extend(scores)

    def best_model_for_latency_target(self, max_latency_ms: float) -> QualityScore:
        """Find the model with best QpD within latency budget."""
        candidates = [
            s for s in self.scores
            if s.latency_ms <= max_latency_ms
        ]

        if not candidates:
            return None

        return max(candidates, key=lambda s: s.quality_per_dollar)

    def best_model_for_budget(self, max_cost: float) -> QualityScore:
        """Find the model with best accuracy within cost budget."""
        candidates = [
            s for s in self.scores
            if s.cost <= max_cost
        ]

        if not candidates:
            return None

        return max(candidates, key=lambda s: s.accuracy)

    def report(self):
        """Print quality-per-dollar analysis."""
        print("Quality Per Dollar Analysis")
        print("=" * 70)
        print(f"{'Model':<20} {'Accuracy':<12} {'Cost':<12} {'QpD':<12}")
        print("-" * 70)

        for score in sorted(
            self.scores,
            key=lambda s: s.quality_per_dollar,
            reverse=True
        ):
            print(
                f"{score.model:<20} {score.accuracy:<12.2%} "
                f"${score.cost:<11.4f} {score.quality_per_dollar:<12.2f}"
            )


# Example: Evaluate multiple models on your task
analysis = QualityPerDollarAnalysis()

analysis.add_evaluation([
    QualityScore("gpt-4o-mini", latency_ms=250, accuracy=0.72, cost=0.0015),
    QualityScore("claude-haiku", latency_ms=300, accuracy=0.75, cost=0.0025),
    QualityScore("gpt-4o", latency_ms=450, accuracy=0.88, cost=0.030),
    QualityScore("claude-sonnet", latency_ms=500, accuracy=0.91, cost=0.038),
])

# For 500ms latency budget, what gives best quality/$?
best_500ms = analysis.best_model_for_latency_target(500)
print(f"Best for 500ms: {best_500ms.model} "
      f"({best_500ms.accuracy:.1%} accuracy, ${best_500ms.cost:.4f})")

# For $0.01 budget, what's the best accuracy?
best_budget = analysis.best_model_for_budget(0.01)
print(f"Best for $0.01: {best_budget.model} "
      f"({best_budget.accuracy:.1%} accuracy, {best_budget.latency_ms:.0f}ms)")

analysis.report()

D.10 Prompt Caching ROI Calculator

Prompt caching is one of the most underutilized cost optimizations. If your system has repeated prefixes (system prompts, conversation histories, reference materials), caching can save 90% on input costs.

When Caching Pays Off

Caching is worth implementing when:

  1. Repeated identical prefixes - Your system prompt, instructions, or static reference material appears in multiple queries
  2. High request volume - More queries mean more cache hits
  3. Large system prompts - The bigger the cached prefix, the bigger the savings

The formula is straightforward:

Monthly savings = (daily_queries × cache_hit_rate ×
                  cached_input_tokens × price_per_token ×
                  cache_discount_factor) × 30

At Anthropic pricing (90% discount on cached input):

  • Every 1,000 cached tokens saves ~$0.0027 per cache hit (roughly $2.40/day at 1,000 queries/day with a 90% hit rate)

Caching ROI Calculator

from dataclasses import dataclass

@dataclass
class CachingROICalculator:
    """Calculate prompt caching savings."""

    # Anthropic rates (early 2026)
    base_input_price_per_million = 3.00  # Claude 3.5 Sonnet
    cached_input_price_per_million = 0.30  # 90% discount

    def monthly_savings(
        self,
        system_prompt_tokens: int,
        queries_per_day: int,
        cache_hit_rate: float,
        base_price: float = None,
        cached_price: float = None
    ) -> float:
        """
        Calculate monthly savings from prompt caching.

        Args:
            system_prompt_tokens: Size of cached system prompt
            queries_per_day: Daily query volume
            cache_hit_rate: Fraction of queries that hit cache (0-1)
            base_price: Base input price per million tokens
            cached_price: Cached input price per million tokens

        Returns:
            Monthly savings in dollars
        """
        if base_price is None:
            base_price = self.base_input_price_per_million
        if cached_price is None:
            cached_price = self.cached_input_price_per_million

        # Cost per query without caching
        cost_uncached = (system_prompt_tokens / 1_000_000) * base_price

        # Cost per query with caching
        cached_queries = queries_per_day * cache_hit_rate
        uncached_queries = queries_per_day * (1 - cache_hit_rate)

        daily_cost_cached = (
            cached_queries * ((system_prompt_tokens / 1_000_000) * cached_price) +
            uncached_queries * ((system_prompt_tokens / 1_000_000) * base_price)
        )

        # Savings
        daily_cost_uncached = queries_per_day * cost_uncached
        daily_savings = daily_cost_uncached - daily_cost_cached
        monthly_savings = daily_savings * 30

        return monthly_savings

    def breakeven_queries(
        self,
        system_prompt_tokens: int,
        cache_hit_rate: float = 0.9,
        base_price: float = None,
        cached_price: float = None
    ) -> float:
        """
        How many queries per day to break even on caching overhead?

        Note: Caching has no implementation overhead, so breakeven is immediate.
        This returns the daily volume at which caching becomes worthwhile ($1+/day).

        Args:
            system_prompt_tokens: Size of system prompt
            cache_hit_rate: Expected cache hit rate
            base_price: Base input price per million
            cached_price: Cached input price per million

        Returns:
            Daily queries needed for $1/month savings
        """
        if base_price is None:
            base_price = self.base_input_price_per_million
        if cached_price is None:
            cached_price = self.cached_input_price_per_million

        # Savings per cached query
        savings_per_cached = (
            (system_prompt_tokens / 1_000_000) *
            (base_price - cached_price)
        )

        # Queries needed for $1/month
        target_monthly = 1.0
        target_daily = target_monthly / 30

        if savings_per_cached == 0:
            return float('inf')

        daily_queries = target_daily / (savings_per_cached * cache_hit_rate)
        return daily_queries

    def report(
        self,
        system_prompt_tokens: int,
        queries_per_day: int,
        cache_hit_rate: float = 0.9
    ):
        """Print caching ROI analysis."""
        monthly = self.monthly_savings(
            system_prompt_tokens,
            queries_per_day,
            cache_hit_rate
        )

        print(f"Prompt Caching ROI Analysis")
        print(f"=" * 60)
        print(f"System prompt: {system_prompt_tokens:,} tokens")
        print(f"Daily volume: {queries_per_day:,} queries")
        print(f"Cache hit rate: {cache_hit_rate:.0%}")
        print(f"-" * 60)
        print(f"Monthly savings: ${monthly:.2f}")
        print(f"Annual savings: ${monthly * 12:.2f}")

        if monthly > 0:
            print(f"\nCaching is worthwhile for this volume.")
        else:
            breakeven = self.breakeven_queries(system_prompt_tokens, cache_hit_rate)
            print(f"\nNeed {breakeven:.0f} queries/day for $1/month savings.")


# Example: 500-token system prompt, 90% cache hit at Anthropic rates
calc = CachingROICalculator()

print("Scenario: 500-token system prompt, 90% cache hit rate\n")

for volume in [100, 1000, 10000]:
    calc.report(
        system_prompt_tokens=500,
        queries_per_day=volume,
        cache_hit_rate=0.90
    )
    print()

Worked Example

Assume:

  • System prompt: 500 tokens
  • Cache hit rate: 90%
  • Claude 3.5 Sonnet pricing: $3.00 per million input tokens (base), $0.30 per million (cached)
  • Savings per cached input token: $0.0000027 per token

Results:

| Daily Volume | Monthly Cost (Uncached) | Monthly Cost (Cached) | Monthly Savings |
|---|---|---|---|
| 100 queries | $4.50 | $0.90 | $3.60 |
| 1,000 queries | $45 | $9 | $36 |
| 10,000 queries | $450 | $90 | $360 |
| 100,000 queries | $4,500 | $900 | $3,600 |

Caching becomes worthwhile at very modest volumes. Even 100 daily queries saves ~$3.60/month. At production scale (10,000+ daily), you’re looking at $300-3,600/month in savings from a trivial implementation.

Maximizing Cache Hit Rate

To get the most from caching:

  1. Keep static content as prefix - Put your entire system prompt, instructions, and reference material at the beginning of the message (before user input)
  2. Put dynamic content after cached prefix - User queries, conversation history, and dynamic context come after the static system prompt
  3. Avoid randomizing components - Don’t shuffle or randomize parts of your system prompt—consistency enables cache hits
  4. Batch similar requests - Process similar user queries together to maximize the chance they share cached prefixes
  5. Version your prompts - When you update system prompts, do it carefully. A one-character change invalidates all caches

Design your prompt structure like this:

[CACHED PREFIX - never changes]
System prompt (static instructions)
Reference material (company guidelines, examples)
Tool definitions
[END CACHED PREFIX]

[DYNAMIC - changes per query]
Current date/time
Conversation history
User query
Context for this specific request
[END DYNAMIC]

This structure ensures every query benefits from the cached prefix.
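
With Anthropic's prompt caching, this structure maps to marking the static system blocks with cache_control. A minimal sketch, assuming STATIC_SYSTEM_PROMPT and user_query are defined by your application; verify parameter names against the current provider documentation:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,            # static instructions + reference material
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[
        {"role": "user", "content": user_query}      # dynamic content comes after the cached prefix
    ],
)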


D.11 Enhanced Cost Monitoring

Upgrade the CostTracker from D.8 with production-ready monitoring capabilities:

from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict, Callable, Optional
import json

@dataclass
class CostTracker:
    """Track LLM costs in production with budget alerts and model degradation."""
    pricing: Dict[str, ModelPricing]
    daily_costs: Dict[str, float] = field(default_factory=dict)
    daily_budget: float = 100.0
    alert_thresholds: Dict[float, bool] = field(default_factory=lambda: {0.5: False, 0.8: False})
    on_threshold_reached: Optional[Callable[[float, float], None]] = None

    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        user_id: str = None
    ) -> float:
        """Record a request and return its cost."""
        pricing = self.pricing[model]
        cost = (
            (input_tokens / 1_000_000) * pricing.input_per_million +
            (output_tokens / 1_000_000) * pricing.output_per_million
        )

        # Track by day
        today = date.today().isoformat()
        self.daily_costs[today] = self.daily_costs.get(today, 0) + cost

        # Check thresholds
        self._check_thresholds()

        return cost

    def get_daily_total(self, day: str = None) -> float:
        """Get total cost for a day."""
        if day is None:
            day = date.today().isoformat()
        return self.daily_costs.get(day, 0)

    def check_budget(self, daily_limit: float = None) -> bool:
        """Check if under daily budget."""
        if daily_limit is None:
            daily_limit = self.daily_budget
        return self.get_daily_total() < daily_limit

    def _check_thresholds(self):
        """Check if cost has reached alert thresholds."""
        current_cost = self.get_daily_total()

        for threshold_pct, already_alerted in self.alert_thresholds.items():
            threshold_amount = self.daily_budget * threshold_pct

            if current_cost >= threshold_amount and not already_alerted:
                self.alert_thresholds[threshold_pct] = True

                if self.on_threshold_reached:
                    self.on_threshold_reached(threshold_pct, current_cost)

    def degrade_model(self, current_model: str) -> str:
        """
        Return a cheaper model when approaching budget.

        Degradation strategy:
        - If using premium, switch to budget version of same provider
        - If already on budget, return current model (can't go lower)

        Args:
            current_model: Current model name (e.g., "claude-sonnet")

        Returns:
            Cheaper model name or current if already cheapest
        """
        degradation_map = {
            "claude-sonnet": "claude-haiku",
            "claude-opus": "claude-sonnet",
            "gpt-4o": "gpt-4o-mini",
            "gpt-4o-mini": "gpt-4o-mini",  # Already cheapest
            "claude-haiku": "claude-haiku",  # Already cheapest
        }

        return degradation_map.get(current_model, current_model)

    def should_degrade_model(self, threshold: float = 0.8) -> bool:
        """
        Check if we should degrade to a cheaper model.

        Args:
            threshold: Percentage of daily budget (0-1) to trigger degradation

        Returns:
            True if current spend exceeds threshold
        """
        current_cost = self.get_daily_total()
        threshold_amount = self.daily_budget * threshold
        return current_cost >= threshold_amount

    def get_cost_status(self) -> Dict:
        """Get comprehensive cost status."""
        current_cost = self.get_daily_total()
        remaining = self.daily_budget - current_cost
        pct_used = (current_cost / self.daily_budget) * 100 if self.daily_budget > 0 else 0

        return {
            "daily_budget": self.daily_budget,
            "current_cost": round(current_cost, 4),
            "remaining": round(remaining, 4),
            "percent_used": round(pct_used, 1),
            "under_budget": current_cost < self.daily_budget,
            "at_50_percent": current_cost >= (self.daily_budget * 0.5),
            "at_80_percent": current_cost >= (self.daily_budget * 0.8),
        }


# Usage example with alerting
def alert_callback(threshold_pct: float, current_cost: float):
    """Callback when cost thresholds are reached."""
    print(f"ALERT: Daily cost at {threshold_pct:.0%} of budget: ${current_cost:.2f}")
    # In production: send to monitoring system, PagerDuty, etc.


tracker = CostTracker(
    pricing=MODELS,
    daily_budget=100.0,
    on_threshold_reached=alert_callback
)

# Record a query
cost = tracker.record(
    model="claude-sonnet",
    input_tokens=4000,
    output_tokens=500
)

# Check if we should degrade to save money
if tracker.should_degrade_model(threshold=0.8):
    cheaper_model = tracker.degrade_model("claude-sonnet")
    print(f"Degrading from claude-sonnet to {cheaper_model}")

# Get status anytime
status = tracker.get_cost_status()
print(f"Cost status: {status['percent_used']:.1f}% of daily budget used")

Key additions:

  1. Alert thresholds - Automatically trigger callbacks at 50% and 80% of daily budget
  2. Model degradation - Intelligently downgrade to cheaper models when approaching budget (e.g., Sonnet → Haiku, GPT-4o → GPT-4o-mini)
  3. Callback mechanism - on_threshold_reached allows integration with monitoring, alerting, and operational systems
  4. Comprehensive status - get_cost_status() provides real-time visibility into budget utilization

In production, wire the callback to:

  • Send alerts to Slack/PagerDuty
  • Log to monitoring systems (Datadog, New Relic)
  • Trigger automatic cost reduction (reduce concurrency, use cheaper models)
  • Update user-facing dashboards


Appendix Cross-References

| This Section | Related Appendix | Connection |
|---|---|---|
| D.1 Token Estimation | Appendix C: Quick Reference tables | Quick lookup |
| D.2 Model Pricing | Appendix A: A.1 Framework overhead | Total cost picture |
| D.4 Context Budget | Appendix B: B.1.3 Token Budget Allocation | Implementation pattern |
| D.5 Performance Benchmarks | Appendix C: Latency Benchmarks | Debugging slow systems |
| D.8 Cost Monitoring | Appendix B: B.8.5 Cost Tracking pattern | Pattern implementation |

Appendix D complete. For production cost strategies, see Chapter 11. For debugging cost-related issues, see Appendix C.