Appendix D: Cost Reference
Appendix D, v2.1 — Early 2026
This appendix provides the numbers you need to estimate costs before they surprise you. No theory—Chapter 11 covers why costs matter. Here you'll find formulas, calculators, pricing tables, and worked examples.
Important: LLM pricing changes frequently. The figures here reflect early 2026 rates; models, pricing tiers, and cost structures shift often. Always check provider pricing pages before making commitments. The formulas and methodologies, however, remain useful regardless of specific prices.
D.1 Token Estimation
Tokens are the currency of LLM costs. Every API call is billed by the tokens it consumes.
Quick Estimation Rules
For English text:
| Content | Tokens |
|---|---|
| 1 character | ~0.25 tokens |
| 1 word | ~1.3 tokens |
| 4 characters | ~1 token |
| 100 words | ~130 tokens |
| 1 page (500 words) | ~650 tokens |
| 1,000 words | ~1,300 tokens |
For code:
| Content | Tokens |
|---|---|
| 1 line of code | ~15-20 tokens |
| 1 function (typical) | ~100-500 tokens |
| 1 file (500 lines) | ~8,000-10,000 tokens |
| JSON (per KB) | ~400 tokens |
Code is less token-efficient than prose. Punctuation, indentation, and special characters all consume tokens. JSON and structured data are particularly token-hungry.
Token Estimation Code
For quick estimates during development:
def estimate_tokens(text: str) -> int:
"""Quick token estimate: 1 token ≈ 4 characters for English."""
return len(text) // 4
def estimate_tokens_words(word_count: int) -> int:
"""Estimate from word count: 1 token ≈ 0.75 words."""
return int(word_count * 1.33)
For accurate counts, use the provider's tokenizer library or token-counting API:
# OpenAI models
import tiktoken
def count_tokens_openai(text: str, model: str = "gpt-4") -> int:
"""Exact token count for OpenAI models."""
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
# Anthropic models
from anthropic import Anthropic
def count_tokens_anthropic(text: str, model: str = "claude-3-5-sonnet-latest") -> int:
    """Exact token count for Claude models (newer SDKs expose
    messages.count_tokens(); older versions used client.count_tokens(text))."""
    client = Anthropic()
    result = client.messages.count_tokens(
        model=model,  # illustrative model id; use the one you deploy
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens
Model-Specific Differences
Different models tokenize differently. The same text may have different token counts across providers:
| Text Sample | GPT-4 | Claude | Llama |
|---|---|---|---|
| “Hello, world!” | 4 | 4 | 5 |
| def foo(): return 42 | 9 | 8 | 11 |
| 1KB JSON | ~380 | ~400 | ~420 |
For budgeting purposes, use the 4-character rule for estimates, then verify with the actual tokenizer before production deployment.
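As a sanity check, you can compare the quick estimate against an exact count during development. A minimal sketch, assuming tiktoken is installed (the sample text is illustrative):
# Sketch: compare the 4-character estimate with tiktoken's exact count
import tiktoken
def estimate_vs_exact(text: str, model: str = "gpt-4") -> dict:
    """Report the quick estimate, the exact count, and the drift between them."""
    estimated = len(text) // 4
    exact = len(tiktoken.encoding_for_model(model).encode(text))
    drift_pct = round(100 * (estimated - exact) / exact, 1) if exact else 0.0
    return {"estimated": estimated, "exact": exact, "drift_pct": drift_pct}
# Example: check drift on a representative prompt before finalizing budgets
print(estimate_vs_exact("Summarize the attached incident report and list action items."))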
D.2 Model Pricing
Generation Models
Prices per 1 million tokens (early 2026):
| Model | Input | Output | Notes |
|---|---|---|---|
| Premium Tier | |||
| Claude 3.5 Sonnet | $3.00 | $15.00 | Best quality/cost balance |
| GPT-4o | $2.50 | $10.00 | Multimodal capable |
| Claude 3 Opus | $15.00 | $75.00 | Maximum capability |
| GPT-4 Turbo | $10.00 | $30.00 | Large context window |
| Budget Tier | |||
| GPT-4o-mini | $0.15 | $0.60 | 20x cheaper than GPT-4o |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, efficient |
| Claude 3.5 Haiku | $0.80 | $4.00 | Improved Haiku |
| Open Source (API) | |||
| Llama 3 70B (via API) | $0.50-1.00 | $0.50-1.00 | Provider dependent |
| Mixtral 8x7B | $0.25-0.50 | $0.25-0.50 | Provider dependent |
Cost Per Query
What a single 10,000-token context query costs:
| Model | Input Cost | Output (500 tok) | Total |
|---|---|---|---|
| Claude 3.5 Sonnet | $0.030 | $0.0075 | ~$0.038 |
| GPT-4o | $0.025 | $0.005 | ~$0.030 |
| GPT-4o-mini | $0.0015 | $0.0003 | ~$0.002 |
| Claude 3 Haiku | $0.0025 | $0.0006 | ~$0.003 |
Embedding Models
Prices per 1 million tokens:
| Model | Price | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | $0.02 | 1536 | Best value |
| text-embedding-3-large | $0.13 | 3072 | Higher quality |
| text-embedding-ada-002 | $0.10 | 1536 | Legacy |
| Cohere embed-v3 | $0.10 | 1024 | Good multilingual |
| Voyage-2 | $0.10 | 1024 | Code-optimized available |
Cached vs. Uncached
Some providers offer prompt caching at reduced rates:
| Provider | Cached Input | Savings |
|---|---|---|
| Anthropic | 10% of base | 90% off |
| OpenAI | Varies | Up to 50% off |
Cache hits require exact prefix matches. Design system prompts to maximize cache reuse.
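To see what this means per query, here is a minimal sketch of the input cost when a prefix is served from cache, assuming Claude 3.5 Sonnet input pricing ($3.00 per million) with Anthropic's 90% cache discount; adjust to current rates:
# Sketch: per-query input cost with a cached prefix vs. fully uncached
def cached_input_cost(prefix_tokens: int, dynamic_tokens: int,
                      base_per_million: float = 3.00,
                      cache_discount: float = 0.90) -> float:
    """Input cost per query when the static prefix hits the cache."""
    cached_rate = base_per_million * (1 - cache_discount)
    return ((prefix_tokens / 1_000_000) * cached_rate +
            (dynamic_tokens / 1_000_000) * base_per_million)
# 2,000-token cached system prompt + 500 dynamic tokens:
# ~$0.0021 with caching vs. ~$0.0075 fully uncached
print(round(cached_input_cost(2000, 500), 4))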
D.3 Cost Calculators
Basic Cost Formula
input_cost = (input_tokens / 1,000,000) × input_price_per_million
output_cost = (output_tokens / 1,000,000) × output_price_per_million
total_cost = input_cost + output_cost
Query Cost Calculator
from dataclasses import dataclass
from typing import Optional
@dataclass
class ModelPricing:
"""Pricing for a specific model."""
name: str
input_per_million: float
output_per_million: float
# Common model pricing (early 2026)
MODELS = {
"claude-sonnet": ModelPricing("Claude 3.5 Sonnet", 3.00, 15.00),
"gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00),
"gpt-4o-mini": ModelPricing("GPT-4o-mini", 0.15, 0.60),
"claude-haiku": ModelPricing("Claude 3 Haiku", 0.25, 1.25),
}
class CostCalculator:
"""Calculate LLM costs for context engineering systems."""
def __init__(self, model: str):
self.pricing = MODELS[model]
def query_cost(
self,
system_prompt_tokens: int,
user_query_tokens: int,
rag_tokens: int = 0,
memory_tokens: int = 0,
conversation_tokens: int = 0,
expected_output_tokens: int = 500
) -> dict:
"""Calculate cost for a single query."""
total_input = (
system_prompt_tokens +
user_query_tokens +
rag_tokens +
memory_tokens +
conversation_tokens
)
input_cost = (total_input / 1_000_000) * self.pricing.input_per_million
output_cost = (expected_output_tokens / 1_000_000) * self.pricing.output_per_million
return {
"model": self.pricing.name,
"input_tokens": total_input,
"output_tokens": expected_output_tokens,
"input_cost": round(input_cost, 6),
"output_cost": round(output_cost, 6),
"total_cost": round(input_cost + output_cost, 6),
}
def daily_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
"""Project daily costs."""
return queries_per_day * avg_cost_per_query
def monthly_cost(self, queries_per_day: int, avg_cost_per_query: float) -> float:
"""Project monthly costs (30 days)."""
return queries_per_day * 30 * avg_cost_per_query
# Example usage
calc = CostCalculator("claude-sonnet")
# Typical RAG query
result = calc.query_cost(
system_prompt_tokens=500,
user_query_tokens=100,
rag_tokens=2000,
memory_tokens=400,
conversation_tokens=1000,
expected_output_tokens=500
)
# Result: ~$0.02 per query
# Monthly projection
monthly = calc.monthly_cost(
queries_per_day=1000,
avg_cost_per_query=0.02
)
# Result: ~$600/month
Multi-Model Cost Calculator
For systems using multiple models (routing, specialist agents):
class MultiModelCalculator:
"""Calculate costs for multi-agent systems."""
def __init__(self):
self.calculators = {
name: CostCalculator(name) for name in MODELS
}
def multi_agent_query(
self,
router_model: str,
router_tokens: int,
specialist_model: str,
specialist_calls: int,
specialist_input_tokens: int,
specialist_output_tokens: int
) -> dict:
"""Calculate cost for a multi-agent query."""
# Router cost (typically small, budget model)
router_calc = self.calculators[router_model]
router_cost = router_calc.query_cost(
system_prompt_tokens=200,
user_query_tokens=router_tokens,
expected_output_tokens=50
)["total_cost"]
# Specialist costs
specialist_calc = self.calculators[specialist_model]
specialist_cost = specialist_calc.query_cost(
system_prompt_tokens=500,
user_query_tokens=specialist_input_tokens,
expected_output_tokens=specialist_output_tokens
)["total_cost"] * specialist_calls
return {
"router_cost": router_cost,
"specialist_cost": specialist_cost,
"total_cost": router_cost + specialist_cost,
"calls": 1 + specialist_calls
}
# Example: Router + 2 specialist calls
multi = MultiModelCalculator()
result = multi.multi_agent_query(
router_model="claude-haiku",
router_tokens=200,
specialist_model="claude-sonnet",
specialist_calls=2,
specialist_input_tokens=3000,
specialist_output_tokens=800
)
# Result: ~$0.05 per user query
Embedding Cost Calculator
def embedding_cost(
num_documents: int,
avg_tokens_per_doc: int,
price_per_million: float = 0.02 # text-embedding-3-small
) -> dict:
"""Calculate one-time embedding costs."""
total_tokens = num_documents * avg_tokens_per_doc
cost = (total_tokens / 1_000_000) * price_per_million
return {
"documents": num_documents,
"total_tokens": total_tokens,
"cost": round(cost, 4)
}
# Example: Embed 10,000 documents
result = embedding_cost(
num_documents=10_000,
avg_tokens_per_doc=500,
price_per_million=0.02
)
# Result: 5M tokens, $0.10
D.4 Context Budget Allocation
Reference Budget Template
A production-ready token budget for a 16,000-token context:
Total Context Budget: 16,000 tokens
├── System Prompt: 500 tokens (3%) [fixed]
├── User Query: 1,000 tokens (6%) [truncate if longer]
├── Memory Context: 400 tokens (3%) [most relevant only]
├── RAG Results: 2,000 tokens (13%) [top-k with limit]
├── Conversation: 2,000 tokens (13%) [sliding window]
└── Response Headroom: 10,100 tokens (62%) [for model output]
Allocation by Use Case
| Component | Chatbot | RAG System | Code Assistant | Agent |
|---|---|---|---|---|
| System Prompt | 3-5% | 5-8% | 8-10% | 10-15% |
| User Query | 5-10% | 5-10% | 10-15% | 5-10% |
| Memory | 5-10% | 2-5% | 5-10% | 10-15% |
| RAG/Context | 0% | 15-25% | 20-30% | 10-20% |
| Conversation | 20-30% | 10-15% | 10-15% | 5-10% |
| Response | 50-60% | 50-60% | 40-50% | 40-50% |
Budget Enforcement Code
from dataclasses import dataclass
from typing import Dict, Any
@dataclass
class ContextBudget:
"""Define and enforce token budgets."""
system_prompt: int = 500
user_query: int = 1000
memory: int = 400
rag: int = 2000
conversation: int = 2000
total_limit: int = 16000
def allocate(self, components: Dict[str, str]) -> Dict[str, str]:
"""Truncate components to fit budgets."""
allocated = {}
for name, content in components.items():
limit = getattr(self, name, 1000)
tokens = len(content) // 4 # Quick estimate
if tokens <= limit:
allocated[name] = content
else:
# Truncate to fit budget
char_limit = limit * 4
allocated[name] = content[:char_limit]
return allocated
def remaining_for_response(self, used_tokens: int) -> int:
"""Calculate remaining tokens for response."""
return self.total_limit - used_tokens
# Example usage
budget = ContextBudget(
system_prompt=500,
user_query=1000,
memory=400,
rag=2000,
conversation=2000,
total_limit=16000
)
components = {
"system_prompt": system_prompt_text,
"user_query": user_input,
"memory": retrieved_memories,
"rag": retrieved_chunks,
"conversation": conversation_history
}
allocated = budget.allocate(components)
Scaling Budgets
For different context windows:
| Total Budget | System | Query | Memory | RAG | Conv | Response |
|---|---|---|---|---|---|---|
| 4K tokens | 200 | 400 | 200 | 500 | 500 | 2,200 |
| 16K tokens | 500 | 1,000 | 400 | 2,000 | 2,000 | 10,100 |
| 32K tokens | 1,000 | 2,000 | 800 | 5,000 | 4,000 | 19,200 |
| 128K tokens | 2,000 | 4,000 | 2,000 | 20,000 | 10,000 | 90,000 |
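One way to derive budgets for other window sizes is to scale the 16K reference proportionally. A sketch; the field names match the ContextBudget dataclass above, and note that the table rounds some rows differently (e.g., RAG gets proportionally more room at 32K), so treat this as a starting point:
# Sketch: scale the 16K reference budget to another context window size
REFERENCE_16K = {
    "system_prompt": 500, "user_query": 1000, "memory": 400,
    "rag": 2000, "conversation": 2000,
}
def scaled_budget(total_limit: int) -> ContextBudget:
    """Build a ContextBudget by scaling the 16K reference template."""
    factor = total_limit / 16_000
    return ContextBudget(
        total_limit=total_limit,
        **{name: int(tokens * factor) for name, tokens in REFERENCE_16K.items()},
    )
# Example: a 32K window roughly doubles each component's allocation
print(scaled_budget(32_000))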
D.5 Performance Benchmarks
Latency by Operation
Typical latency ranges (p50 values):
| Operation | Latency | Notes |
|---|---|---|
| Embedding | ||
| Single text | 10-50ms | API overhead dominates |
| Batch (100 texts) | 100-300ms | More efficient per-text |
| Vector Search | ||
| In-memory (10K vectors) | 1-5ms | Fastest option |
| In-memory (1M vectors) | 20-50ms | Still fast |
| Cloud (Pinecone, etc.) | 50-150ms | Network latency added |
| Reranking | ||
| Cross-encoder (10 docs) | 100-250ms | Per batch |
| Cohere Rerank | 150-300ms | API call |
| LLM Generation | ||
| First token (short context) | 200-500ms | Time to first token |
| First token (long context) | 500-2000ms | Scales with input |
| Full response (500 tokens) | 2-5s | Depends on output length |
| Full response (2000 tokens) | 8-15s | Streaming recommended |
| Full RAG Pipeline | ||
| Simple (embed + search + generate) | 1-3s | Typical |
| Complex (rerank + multi-step) | 3-8s | More processing |
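To sanity-check an end-to-end latency budget, sum the stages you actually run. A rough sketch using midpoints of the ranges above (planning numbers, not measurements; streaming and overlapping stages can bring the real figure down):
# Sketch: estimate end-to-end latency from per-stage p50 midpoints (milliseconds)
from typing import List
STAGE_P50_MS = {
    "embed_query": 30,
    "vector_search_cloud": 100,
    "rerank_10_docs": 175,
    "generate_500_tokens": 3500,
}
def pipeline_latency_ms(stages: List[str]) -> int:
    """Sum the estimated p50 latency of the selected stages."""
    return sum(STAGE_P50_MS[s] for s in stages)
# Simple pipeline (embed + search + generate) vs. adding a reranking step
print(pipeline_latency_ms(["embed_query", "vector_search_cloud", "generate_500_tokens"]))
print(pipeline_latency_ms(["embed_query", "vector_search_cloud", "rerank_10_docs", "generate_500_tokens"]))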
Latency by Model
Time to first token with 10K token context:
| Model | First Token | Notes |
|---|---|---|
| GPT-4o-mini | 150-300ms | Fastest |
| Claude 3 Haiku | 200-400ms | Fast |
| GPT-4o | 300-600ms | Moderate |
| Claude 3.5 Sonnet | 400-800ms | Moderate |
| Claude 3 Opus | 800-1500ms | Slowest |
Context Size Impact
Latency scaling with context size (approximate):
| Context Size | Relative Latency | Example |
|---|---|---|
| 1K tokens | 1.0x | Baseline |
| 4K tokens | 1.2x | +20% |
| 16K tokens | 1.8x | +80% |
| 32K tokens | 2.5x | +150% |
| 128K tokens | 5-10x | +400-900% |
Throughput Guidelines
Sustainable request rates before hitting limits:
| Provider | Tier | Requests/min | Tokens/min |
|---|---|---|---|
| OpenAI | Free | 3 | 40,000 |
| OpenAI | Tier 1 | 500 | 200,000 |
| OpenAI | Tier 5 | 10,000 | 10,000,000 |
| Anthropic | Free | 5 | 40,000 |
| Anthropic | Build | 1,000 | 400,000 |
| Anthropic | Scale | Custom | Custom |
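In practice the tokens-per-minute cap usually binds before the requests-per-minute cap. A quick sketch of which limit constrains you, using the tier figures above and a measured average query size:
# Sketch: which rate limit binds first, requests/min or tokens/min?
def sustainable_qpm(requests_per_min: int, tokens_per_min: int,
                    tokens_per_query: int) -> int:
    """Queries per minute sustainable under both limits."""
    return min(requests_per_min, tokens_per_min // tokens_per_query)
# OpenAI Tier 1 (500 req/min, 200K tokens/min) with ~6,500-token queries:
# the token cap limits you to ~30 queries/min, well below the request cap.
print(sustainable_qpm(500, 200_000, 6_500))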
D.6 Worked Examples
Example 1: Simple RAG Chatbot
Setup: Customer support chatbot with document retrieval
Parameters:
- 500 queries/day
- Model: GPT-4o-mini
- System prompt: 300 tokens
- User query: 100 tokens average
- RAG chunks: 1,500 tokens (3 chunks × 500)
- Output: 300 tokens average
Calculation:
Input tokens per query: 300 + 100 + 1500 = 1,900
Input cost: 1,900 / 1,000,000 × $0.15 = $0.000285
Output cost: 300 / 1,000,000 × $0.60 = $0.00018
Total per query: $0.000465
Daily: 500 × $0.000465 = $0.23
Monthly: $0.23 × 30 = $7
Plus embedding costs (one-time):
- 5,000 support documents × 400 tokens = 2M tokens
- Cost: 2M / 1M × $0.02 = $0.04
Total monthly: ~$7 (embedding is negligible)
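The same numbers fall out of the CostCalculator from D.3, which is a useful cross-check when parameters change (a sketch; the model key is the one defined in the MODELS dict above):
# Sketch: reproducing Example 1 with the D.3 CostCalculator
calc = CostCalculator("gpt-4o-mini")
per_query = calc.query_cost(
    system_prompt_tokens=300,
    user_query_tokens=100,
    rag_tokens=1500,
    expected_output_tokens=300,
)
# per_query["total_cost"] ≈ $0.000465, matching the hand calculation
monthly = calc.monthly_cost(queries_per_day=500,
                            avg_cost_per_query=per_query["total_cost"])
print(per_query["total_cost"], round(monthly, 2))  # ≈ $7/month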
Example 2: Production Code Assistant
Setup: Internal developer tool for codebase Q&A
Parameters:
- 2,000 queries/day
- Model: Claude 3.5 Sonnet
- System prompt: 800 tokens (detailed instructions)
- User query: 200 tokens
- RAG code chunks: 4,000 tokens (8 chunks × 500)
- Conversation history: 1,000 tokens
- Output: 800 tokens average (explanations + code)
Calculation:
Input tokens per query: 800 + 200 + 4000 + 1000 = 6,000
Input cost: 6,000 / 1,000,000 × $3.00 = $0.018
Output cost: 800 / 1,000,000 × $15.00 = $0.012
Total per query: $0.03
Daily: 2,000 × $0.03 = $60
Monthly: $60 × 30 = $1,800
Plus embedding costs (one-time):
- 50,000 code files × 600 tokens = 30M tokens
- Cost: 30M / 1M × $0.13 = $3.90 (using large model for code)
Total monthly: ~$1,800
Example 3: Multi-Agent System
Setup: Complex research assistant with routing and specialists
Parameters:
- 1,000 user queries/day
- Router: Claude 3 Haiku (fast, cheap)
- Specialists: Claude 3.5 Sonnet
- Average 2.5 specialist calls per user query
Router call:
Input: 500 tokens (prompt + query)
Output: 50 tokens (routing decision)
Cost: (500/1M × $0.25) + (50/1M × $1.25) = $0.000188
Specialist call (average):
Input: 4,000 tokens (prompt + context)
Output: 600 tokens
Cost: (4000/1M × $3.00) + (600/1M × $15.00) = $0.021
Per user query:
Router: $0.000188
Specialists (2.5 calls): 2.5 × $0.021 = $0.0525
Total: $0.053
Daily: 1,000 × $0.053 = $53
Monthly: $53 × 30 = $1,590
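This example can also be reproduced with the MultiModelCalculator from D.3. A sketch, with token counts split so prompt plus input matches the 500- and 4,000-token totals above (the fractional specialist_calls value represents the 2.5-call average):
# Sketch: reproducing Example 3 with the D.3 MultiModelCalculator
multi = MultiModelCalculator()
per_query = multi.multi_agent_query(
    router_model="claude-haiku",
    router_tokens=300,              # + 200-token router prompt = 500 input tokens
    specialist_model="claude-sonnet",
    specialist_calls=2.5,           # average calls per user query
    specialist_input_tokens=3500,   # + 500-token specialist prompt = 4,000 input tokens
    specialist_output_tokens=600,
)
# per_query["total_cost"] ≈ $0.053; at 1,000 queries/day that is roughly $1,580-1,590/month
print(round(per_query["total_cost"] * 1000 * 30))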
Example 4: High-Volume Consumer App
Setup: AI writing assistant with free tier
Parameters:
- 50,000 queries/day (free users)
- 10,000 queries/day (premium users)
- Free: GPT-4o-mini, 2K context, 200 output
- Premium: GPT-4o, 8K context, 500 output
Free tier:
Input cost: 2,000 / 1,000,000 × $0.15 = $0.0003
Output cost: 200 / 1,000,000 × $0.60 = $0.00012
Per query: $0.00042
Daily: 50,000 × $0.00042 = $21
Monthly: $630
Premium tier:
Input cost: 8,000 / 1,000,000 × $2.50 = $0.02
Output cost: 500 / 1,000,000 × $10.00 = $0.005
Per query: $0.025
Daily: 10,000 × $0.025 = $250
Monthly: $7,500
Total monthly: ~$8,130
D.7 Quick Reference
Cost Rules of Thumb
- Budget models cost ~20x less than premium
- Output tokens cost ~3-5x more than input tokens
- RAG adds 1,000-5,000 tokens per query
- Multi-agent multiplies base cost by number of calls
- Embedding is cheap—don’t optimize prematurely
Token Rules of Thumb
- 4 characters ≈ 1 token (English)
- 1 page ≈ 650 tokens
- 1 code file ≈ 8,000-10,000 tokens
- JSON/XML uses 20-30% more tokens than equivalent plain text
Latency Rules of Thumb
- Embedding: 10-50ms (batch for efficiency)
- Vector search: 20-100ms (depends on scale)
- LLM first token: 200-800ms (depends on model + context)
- Full RAG: 1-3 seconds (acceptable for most UX)
Model Selection Quick Guide
| Priority | Choose |
|---|---|
| Lowest cost | GPT-4o-mini |
| Best quality | Claude 3.5 Sonnet or GPT-4o |
| Fastest | Claude 3 Haiku or GPT-4o-mini |
| Longest context | Claude (200K) or GPT-4 Turbo (128K) |
| Routing/classification | Any budget model |
Monthly Cost Quick Estimates
| Scenario | Queries/day | Model | Monthly |
|---|---|---|---|
| Prototype | 100 | Budget | $3-5 |
| Small app | 1,000 | Budget | $30-50 |
| Small app | 1,000 | Premium | $600-1,000 |
| Production | 10,000 | Budget | $300-500 |
| Production | 10,000 | Premium | $6,000-10,000 |
| High volume | 100,000 | Budget | $3,000-5,000 |
D.8 Cost Monitoring Code
Track actual costs in production:
from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict
import json
@dataclass
class CostTracker:
"""Track LLM costs in production."""
pricing: Dict[str, ModelPricing]
daily_costs: Dict[str, float] = field(default_factory=dict)
def record(
self,
model: str,
input_tokens: int,
output_tokens: int,
user_id: str = None
) -> float:
"""Record a request and return its cost."""
pricing = self.pricing[model]
cost = (
(input_tokens / 1_000_000) * pricing.input_per_million +
(output_tokens / 1_000_000) * pricing.output_per_million
)
# Track by day
today = date.today().isoformat()
self.daily_costs[today] = self.daily_costs.get(today, 0) + cost
return cost
def get_daily_total(self, day: str = None) -> float:
"""Get total cost for a day."""
if day is None:
day = date.today().isoformat()
return self.daily_costs.get(day, 0)
def check_budget(self, daily_limit: float) -> bool:
"""Check if under daily budget."""
return self.get_daily_total() < daily_limit
# Usage
tracker = CostTracker(pricing=MODELS)
# After each LLM call
cost = tracker.record(
model="claude-sonnet",
input_tokens=4000,
output_tokens=500,
user_id="user_123"
)
# Check budget before expensive operations
if tracker.check_budget(daily_limit=100.0):
# Proceed with request
pass
else:
# Degrade gracefully or alert
pass
D.9 Cost-Latency Tradeoff Analysis
Every context engineering technique involves a fundamental tradeoff between cost and latency. Understanding this frontier helps you make informed decisions about which optimizations to apply.
The Cost-Latency Frontier
Common operations and their cost-latency impacts:
| Technique | Cost Impact | Latency Impact | When Worth It |
|---|---|---|---|
| Direct LLM call | Baseline | Baseline | Always your starting point |
| RAG retrieval | +$0.001-0.005 | +200-500ms | When you need external context |
| Reranking | +$0.002-0.010 | +100-300ms | When top-k retrieval is unreliable |
| Query expansion | +~10% of base cost | +200-400ms | When recall matters more than latency |
| Multi-agent routing | +$0.0002-0.001 | +50-200ms | When specialization improves quality |
| Prompt caching | -90% input cost | +0ms | When you have repeated prefixes |
| Context compression | +$0.001 | -10-20% latency | When context is large and redundant |
Key insight: Prompt caching is the only pure win—it reduces cost with no latency penalty. All other optimizations require justification.
Model Selection Decision Tree
Use this structured guide to choose models for your latency constraints:
If latency requirement < 500ms:
└─ Use budget model (GPT-4o-mini, Claude 3 Haiku)
└─ Expect: Sub-500ms first token, ~$0.001 per query
If latency requirement < 2 seconds:
└─ If quality is paramount:
└─ Use premium model (Claude 3.5 Sonnet, GPT-4o)
└─ Expect: 400-800ms first token, ~$0.03 per query
└─ If acceptable quality from budget model:
└─ Use budget model + more context
└─ Expect: ~$0.002 per query
If latency requirement < 5 seconds:
└─ Use best-quality model for the task
└─ Optimize context, not model
└─ Expect: Quality-dependent costs
If simple classification/routing:
└─ ALWAYS use budget model
└─ Output is small, quality is predictable
└─ Save ~20x on model cost
If context > 50,000 tokens:
└─ Check if smaller context + better retrieval is cheaper
└─ Compare: Large context premium model vs. small context budget model + reranking
└─ Usually: Better retrieval + budget model wins
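The tree above can also be expressed as a small helper so routing code and humans apply the same policy. A sketch; the thresholds and model keys mirror the text and should be tuned to your own latency and quality data:
# Sketch: the model-selection decision tree as a function
def select_model(latency_budget_ms: int,
                 quality_critical: bool = False,
                 is_classification: bool = False,
                 context_tokens: int = 0) -> str:
    """Pick a model key from latency, quality, and task-shape constraints."""
    if is_classification:
        return "gpt-4o-mini"        # routing/classification: always a budget model
    if context_tokens > 50_000:
        return "gpt-4o-mini"        # prefer better retrieval over a huge context
    if latency_budget_ms < 500:
        return "gpt-4o-mini"        # or "claude-haiku"
    if latency_budget_ms < 2_000:
        return "claude-sonnet" if quality_critical else "gpt-4o-mini"
    return "claude-sonnet"          # latency is loose: optimize context, not model
# Example: a 1.5-second budget with quality-critical output selects the premium tier
print(select_model(latency_budget_ms=1500, quality_critical=True))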
Quality Per Dollar Analysis
Calculate quality-per-dollar (QpD) for your system to find the best value:
from dataclasses import dataclass
from typing import Dict, List
import statistics
@dataclass
class QualityScore:
"""Evaluation score for a model response."""
model: str
latency_ms: float
accuracy: float # 0-1
cost: float
@property
def quality_per_dollar(self) -> float:
"""Score per dollar spent."""
if self.cost == 0:
return float('inf')
return self.accuracy / self.cost
@property
def quality_per_second(self) -> float:
"""Score per second of latency."""
if self.latency_ms == 0:
return float('inf')
seconds = self.latency_ms / 1000
return self.accuracy / seconds
class QualityPerDollarAnalysis:
"""Analyze quality-per-dollar across models."""
def __init__(self):
self.scores: List[QualityScore] = []
def add_evaluation(self, scores: List[QualityScore]):
"""Add evaluation results."""
self.scores.extend(scores)
def best_model_for_latency_target(self, max_latency_ms: float) -> QualityScore:
"""Find the model with best QpD within latency budget."""
candidates = [
s for s in self.scores
if s.latency_ms <= max_latency_ms
]
if not candidates:
return None
return max(candidates, key=lambda s: s.quality_per_dollar)
def best_model_for_budget(self, max_cost: float) -> QualityScore:
"""Find the model with best accuracy within cost budget."""
candidates = [
s for s in self.scores
if s.cost <= max_cost
]
if not candidates:
return None
return max(candidates, key=lambda s: s.accuracy)
def report(self):
"""Print quality-per-dollar analysis."""
print("Quality Per Dollar Analysis")
print("=" * 70)
print(f"{'Model':<20} {'Accuracy':<12} {'Cost':<12} {'QpD':<12}")
print("-" * 70)
for score in sorted(
self.scores,
key=lambda s: s.quality_per_dollar,
reverse=True
):
print(
f"{score.model:<20} {score.accuracy:<12.2%} "
f"${score.cost:<11.4f} {score.quality_per_dollar:<12.2f}"
)
# Example: Evaluate multiple models on your task
analysis = QualityPerDollarAnalysis()
analysis.add_evaluation([
QualityScore("gpt-4o-mini", latency_ms=250, accuracy=0.72, cost=0.0015),
QualityScore("claude-haiku", latency_ms=300, accuracy=0.75, cost=0.0025),
QualityScore("gpt-4o", latency_ms=450, accuracy=0.88, cost=0.030),
QualityScore("claude-sonnet", latency_ms=500, accuracy=0.91, cost=0.038),
])
# For 500ms latency budget, what gives best quality/$?
best_500ms = analysis.best_model_for_latency_target(500)
print(f"Best for 500ms: {best_500ms.model} "
f"({best_500ms.accuracy:.1%} accuracy, ${best_500ms.cost:.4f})")
# For $0.01 budget, what's the best accuracy?
best_budget = analysis.best_model_for_budget(0.01)
print(f"Best for $0.01: {best_budget.model} "
f"({best_budget.accuracy:.1%} accuracy, {best_budget.latency_ms:.0f}ms)")
analysis.report()
D.10 Prompt Caching ROI Calculator
Prompt caching is one of the most underutilized cost optimizations. If your system has repeated prefixes (system prompts, conversation histories, reference materials), caching can cut the cost of those repeated input tokens by up to 90%.
When Caching Pays Off
Caching is worth implementing when:
- Repeated identical prefixes - Your system prompt, instructions, or static reference material appears in multiple queries
- High request volume - More queries mean more cache hits
- Large system prompts - The bigger the cached prefix, the bigger the savings
The formula is straightforward:
Monthly savings = (daily_queries × cache_hit_rate ×
cached_input_tokens × price_per_token ×
cache_discount_factor) × 30
At Anthropic pricing (90% discount on cached input):
- Every 1,000 cached input tokens save roughly $0.0027 per cache hit ($2.70 per million tokens served from cache)
Caching ROI Calculator
from dataclasses import dataclass
@dataclass
class CachingROICalculator:
"""Calculate prompt caching savings."""
# Anthropic rates (early 2026)
base_input_price_per_million = 3.00 # Claude 3.5 Sonnet
cached_input_price_per_million = 0.30 # 90% discount
def monthly_savings(
self,
system_prompt_tokens: int,
queries_per_day: int,
cache_hit_rate: float,
base_price: float = None,
cached_price: float = None
) -> float:
"""
Calculate monthly savings from prompt caching.
Args:
system_prompt_tokens: Size of cached system prompt
queries_per_day: Daily query volume
cache_hit_rate: Fraction of queries that hit cache (0-1)
base_price: Base input price per million tokens
cached_price: Cached input price per million tokens
Returns:
Monthly savings in dollars
"""
if base_price is None:
base_price = self.base_input_price_per_million
if cached_price is None:
cached_price = self.cached_input_price_per_million
# Cost per query without caching
cost_uncached = (system_prompt_tokens / 1_000_000) * base_price
# Cost per query with caching
cached_queries = queries_per_day * cache_hit_rate
uncached_queries = queries_per_day * (1 - cache_hit_rate)
daily_cost_cached = (
cached_queries * ((system_prompt_tokens / 1_000_000) * cached_price) +
uncached_queries * ((system_prompt_tokens / 1_000_000) * base_price)
)
# Savings
daily_cost_uncached = queries_per_day * cost_uncached
daily_savings = daily_cost_uncached - daily_cost_cached
monthly_savings = daily_savings * 30
return monthly_savings
def breakeven_queries(
self,
system_prompt_tokens: int,
cache_hit_rate: float = 0.9,
base_price: float = None,
cached_price: float = None
) -> float:
"""
Daily query volume at which caching produces meaningful savings.
Note: Caching itself has no implementation overhead, so breakeven is effectively immediate.
This returns the daily volume needed to clear $1/month in savings.
Args:
system_prompt_tokens: Size of system prompt
cache_hit_rate: Expected cache hit rate
base_price: Base input price per million
cached_price: Cached input price per million
Returns:
Daily queries needed for $1/month savings
"""
if base_price is None:
base_price = self.base_input_price_per_million
if cached_price is None:
cached_price = self.cached_input_price_per_million
# Savings per cached query
savings_per_cached = (
(system_prompt_tokens / 1_000_000) *
(base_price - cached_price)
)
# Queries needed for $1/month
target_monthly = 1.0
target_daily = target_monthly / 30
if savings_per_cached == 0:
return float('inf')
daily_queries = target_daily / (savings_per_cached * cache_hit_rate)
return daily_queries
def report(
self,
system_prompt_tokens: int,
queries_per_day: int,
cache_hit_rate: float = 0.9
):
"""Print caching ROI analysis."""
monthly = self.monthly_savings(
system_prompt_tokens,
queries_per_day,
cache_hit_rate
)
print(f"Prompt Caching ROI Analysis")
print(f"=" * 60)
print(f"System prompt: {system_prompt_tokens:,} tokens")
print(f"Daily volume: {queries_per_day:,} queries")
print(f"Cache hit rate: {cache_hit_rate:.0%}")
print(f"-" * 60)
print(f"Monthly savings: ${monthly:.2f}")
print(f"Annual savings: ${monthly * 12:.2f}")
if monthly > 0:
print(f"\nCaching is worthwhile for this volume.")
else:
breakeven = self.breakeven_queries(system_prompt_tokens, cache_hit_rate)
print(f"\nNeed {breakeven:.0f} queries/day for $1/month savings.")
# Example: 500-token system prompt, 90% cache hit at Anthropic rates
calc = CachingROICalculator()
print("Scenario: 500-token system prompt, 90% cache hit rate\n")
for volume in [100, 1000, 10000]:
calc.report(
system_prompt_tokens=500,
queries_per_day=volume,
cache_hit_rate=0.90
)
print()
Worked Example
Assume:
- System prompt: 500 tokens
- Cache hit rate: 90%
- Claude 3.5 Sonnet pricing: $3.00 per million input tokens (base), $0.30 per million (cached)
- Savings per cached input token: $0.0000027 ($2.70 per million cached tokens)
Results:
| Daily Volume | Monthly Cost (Uncached) | Monthly Cost (Cached) | Monthly Savings |
|---|---|---|---|
| 100 queries | $4.50 | $0.90 | $3.60 |
| 1,000 queries | $45 | $9 | $36 |
| 10,000 queries | $450 | $90 | $360 |
| 100,000 queries | $4,500 | $900 | $3,600 |
Caching becomes worthwhile at very modest volumes. Even 100 daily queries saves ~$3.60/month. At production scale (10,000+ daily), you’re looking at $300-3,600/month in savings from a trivial implementation.
Maximizing Cache Hit Rate
To get the most from caching:
- Keep static content as prefix - Put your entire system prompt, instructions, and reference material at the beginning of the message (before user input)
- Put dynamic content after cached prefix - User queries, conversation history, and dynamic context come after the static system prompt
- Avoid randomizing components - Don’t shuffle or randomize parts of your system prompt—consistency enables cache hits
- Batch similar requests - Process similar user queries together to maximize the chance they share cached prefixes
- Version your prompts - When you update system prompts, do it carefully. A one-character change invalidates all caches
Design your prompt structure like this:
[CACHED PREFIX - never changes]
System prompt (static instructions)
Reference material (company guidelines, examples)
Current date/time
Tool definitions
[END CACHED PREFIX]
[DYNAMIC - changes per query]
Conversation history
User query
Context for this specific request
[END DYNAMIC]
This structure ensures every query benefits from the cached prefix.
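With Anthropic's API, the static prefix is marked explicitly with a cache_control block. A minimal sketch; the block syntax and the model id are assumptions based on the SDK at the time of writing, so verify both against current Anthropic documentation:
# Sketch: marking the static prefix for Anthropic prompt caching
from anthropic import Anthropic
client = Anthropic()
STATIC_PREFIX = (
    "You are a support assistant for the product...\n"  # static instructions
    "Reference material: ...\n"                          # guidelines, examples, tools
)
def ask(user_query: str) -> str:
    """Send a query that reuses the cached system-prompt prefix."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model id
        max_tokens=1024,
        system=[
            # The marked block is cached; it must be byte-identical across
            # requests for the cache to hit.
            {"type": "text", "text": STATIC_PREFIX,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": user_query}],  # dynamic part
    )
    return response.content[0].text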
D.11 Enhanced Cost Monitoring
Upgrade the CostTracker from D.8 with production-ready monitoring capabilities:
from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Dict, Callable, Optional
import json
@dataclass
class CostTracker:
"""Track LLM costs in production with budget alerts and model degradation."""
pricing: Dict[str, ModelPricing]
daily_costs: Dict[str, float] = field(default_factory=dict)
daily_budget: float = 100.0
alert_thresholds: Dict[float, bool] = field(default_factory=lambda: {0.5: False, 0.8: False})
on_threshold_reached: Optional[Callable[[float, float], None]] = None
def record(
self,
model: str,
input_tokens: int,
output_tokens: int,
user_id: str = None
) -> float:
"""Record a request and return its cost."""
pricing = self.pricing[model]
cost = (
(input_tokens / 1_000_000) * pricing.input_per_million +
(output_tokens / 1_000_000) * pricing.output_per_million
)
# Track by day
today = date.today().isoformat()
self.daily_costs[today] = self.daily_costs.get(today, 0) + cost
# Check thresholds
self._check_thresholds()
return cost
def get_daily_total(self, day: str = None) -> float:
"""Get total cost for a day."""
if day is None:
day = date.today().isoformat()
return self.daily_costs.get(day, 0)
def check_budget(self, daily_limit: float = None) -> bool:
"""Check if under daily budget."""
if daily_limit is None:
daily_limit = self.daily_budget
return self.get_daily_total() < daily_limit
def _check_thresholds(self):
"""Check if cost has reached alert thresholds."""
current_cost = self.get_daily_total()
for threshold_pct, already_alerted in self.alert_thresholds.items():
threshold_amount = self.daily_budget * threshold_pct
if current_cost >= threshold_amount and not already_alerted:
self.alert_thresholds[threshold_pct] = True
if self.on_threshold_reached:
self.on_threshold_reached(threshold_pct, current_cost)
def degrade_model(self, current_model: str) -> str:
"""
Return a cheaper model when approaching budget.
Degradation strategy:
- If using premium, switch to budget version of same provider
- If already on budget, return current model (can't go lower)
Args:
current_model: Current model name (e.g., "claude-sonnet")
Returns:
Cheaper model name or current if already cheapest
"""
degradation_map = {
"claude-sonnet": "claude-haiku",
"claude-opus": "claude-sonnet",
"gpt-4o": "gpt-4o-mini",
"gpt-4o-mini": "gpt-4o-mini", # Already cheapest
"claude-haiku": "claude-haiku", # Already cheapest
}
return degradation_map.get(current_model, current_model)
def should_degrade_model(self, threshold: float = 0.8) -> bool:
"""
Check if we should degrade to a cheaper model.
Args:
threshold: Percentage of daily budget (0-1) to trigger degradation
Returns:
True if current spend exceeds threshold
"""
current_cost = self.get_daily_total()
threshold_amount = self.daily_budget * threshold
return current_cost >= threshold_amount
def get_cost_status(self) -> Dict:
"""Get comprehensive cost status."""
current_cost = self.get_daily_total()
remaining = self.daily_budget - current_cost
pct_used = (current_cost / self.daily_budget) * 100 if self.daily_budget > 0 else 0
return {
"daily_budget": self.daily_budget,
"current_cost": round(current_cost, 4),
"remaining": round(remaining, 4),
"percent_used": round(pct_used, 1),
"under_budget": current_cost < self.daily_budget,
"at_50_percent": current_cost >= (self.daily_budget * 0.5),
"at_80_percent": current_cost >= (self.daily_budget * 0.8),
}
# Usage example with alerting
def alert_callback(threshold_pct: float, current_cost: float):
"""Callback when cost thresholds are reached."""
print(f"ALERT: Daily cost at {threshold_pct:.0%} of budget: ${current_cost:.2f}")
# In production: send to monitoring system, PagerDuty, etc.
tracker = CostTracker(
pricing=MODELS,
daily_budget=100.0,
on_threshold_reached=alert_callback
)
# Record a query
cost = tracker.record(
model="claude-sonnet",
input_tokens=4000,
output_tokens=500
)
# Check if we should degrade to save money
if tracker.should_degrade_model(threshold=0.8):
cheaper_model = tracker.degrade_model("claude-sonnet")
print(f"Degrading from claude-sonnet to {cheaper_model}")
# Get status anytime
status = tracker.get_cost_status()
print(f"Cost status: {status['percent_used']:.1f}% of daily budget used")
Key additions:
- Alert thresholds - Automatically trigger callbacks at 50% and 80% of daily budget
- Model degradation - Intelligently downgrade to cheaper models when approaching budget (e.g., Sonnet → Haiku, GPT-4o → GPT-4o-mini)
- Callback mechanism - on_threshold_reached allows integration with monitoring, alerting, and operational systems
- Comprehensive status - get_cost_status() provides real-time visibility into budget utilization
In production, wire the callback to:
- Send alerts to Slack/PagerDuty
- Log to monitoring systems (Datadog, New Relic)
- Trigger automatic cost reduction (reduce concurrency, use cheaper models)
- Update user-facing dashboards
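For example, a minimal callback that posts to a Slack incoming webhook (the webhook URL is a placeholder; add your own error handling and secrets management):
# Sketch: a threshold callback that posts budget alerts to Slack
import requests
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
def slack_alert(threshold_pct: float, current_cost: float) -> None:
    """Post an alert when a daily-budget threshold is crossed."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"LLM spend at {threshold_pct:.0%} of daily budget (${current_cost:.2f})"},
        timeout=5,
    )
tracker = CostTracker(pricing=MODELS, daily_budget=100.0,
                      on_threshold_reached=slack_alert)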
Appendix Cross-References
| This Section | Related Appendix | Connection |
|---|---|---|
| D.1 Token Estimation | Appendix C: Quick Reference tables | Quick lookup |
| D.2 Model Pricing | Appendix A: A.1 Framework overhead | Total cost picture |
| D.4 Context Budget | Appendix B: B.1.3 Token Budget Allocation | Implementation pattern |
| D.5 Performance Benchmarks | Appendix C: Latency Benchmarks | Debugging slow systems |
| D.8 Cost Monitoring | Appendix B: B.8.5 Cost Tracking pattern | Pattern implementation |
Appendix D complete. For production cost strategies, see Chapter 11. For debugging cost-related issues, see Appendix C.