Most API pricing distinguishes between input tokens (what you send) and output tokens (what the model generates). Prompt caching adds a third category: cached input tokens, which are substantially cheaper, typically 50–90% less than uncached input tokens.
How KV cache reuse works
When an LLM processes your prompt, it computes a "key-value" (KV) representation of every token. This computation is the expensive part. Prompt caching saves this computed KV state server-side. When you send the same prefix on a subsequent request, the model skips recomputing the cached portion and jumps straight to the new content, dramatically reducing both latency and cost.
The critical constraint: the cached prefix must be byte-identical and must appear at the beginning of your prompt. Even a single character difference invalidates the cache.
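The prefix rule is easy to violate by accident. A minimal sketch (prompt strings are illustrative) showing how a timestamp placed *before* the static content changes the leading bytes on every request, while placing it *after* preserves an identical, cacheable prefix:

```python
STATIC_SYSTEM = "You are a billing assistant. Follow the policy below.\n<policy>...</policy>"

def bad_prompt(now: str) -> str:
    # Timestamp inside the prefix: every request begins with different bytes,
    # so the cached prefix never matches and the cache is always missed.
    return f"Current time: {now}\n{STATIC_SYSTEM}"

def good_prompt(now: str) -> str:
    # Static content first, dynamic content after: the leading bytes are
    # identical across requests, so the cached prefix can be reused.
    return f"{STATIC_SYSTEM}\nCurrent time: {now}"

# Two requests five minutes apart still share the same cacheable prefix:
a, b = good_prompt("10:00"), good_prompt("10:05")
print(a[:len(STATIC_SYSTEM)] == b[:len(STATIC_SYSTEM)])  # prints "True"
```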
Cost comparison: with vs without caching
| Scenario | Without caching | With caching | Savings |
|---|---|---|---|
| 10k system prompt, 1k user message, 100 requests/day | $33/day | $4.40/day | 87% |
| 50k knowledge base, 500 user queries | $125 | $18.75 | 85% |
| Tool definitions (30 tools × 200 tokens each) | $3/1k requests | $0.60/1k requests | 80% |
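The table's savings figures can be approximated with a back-of-envelope estimator. A rough sketch, assuming cached input is billed at some fraction of the base input price (the function name and the 10% default ratio are assumptions for illustration, not a specific provider's rate):

```python
def caching_savings(cached_tokens: int, uncached_tokens: int,
                    cache_price_ratio: float = 0.10) -> float:
    """Fraction saved on input-token cost when `cached_tokens` hit the cache.

    Assumes cached input is billed at `cache_price_ratio` times the base
    input price (the 50-90% discount range maps to ratios of 0.5 down to 0.1).
    """
    cost_without = cached_tokens + uncached_tokens                 # all at base price
    cost_with = cached_tokens * cache_price_ratio + uncached_tokens
    return 1 - cost_with / cost_without

# e.g. a 10k cached system prompt plus a 1k dynamic user message per request:
print(f"{caching_savings(10_000, 1_000):.0%}")  # prints "82%"
```

Note that the realized savings depend on the ratio of cached to uncached tokens per request, which is why the table rows with larger static prefixes show bigger discounts.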
Where to apply caching
- Long system prompts: Personas, instructions, and behavioral guidelines that don't change per user. If your system prompt is over 1,000 tokens, caching is almost always worth it.
- Knowledge base / RAG context: When you inject a static document or knowledge chunk that's the same across many requests, cache it.
- Tool definitions: Tool schemas can run 200–500 tokens each. With 20+ tools, that's 4,000–10,000 tokens of cacheable content.
- Few-shot examples: Your demonstration examples are typically identical across requests, making them perfect cache candidates.
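All four of these share one structural rule: static content forms the prefix, and only the live query varies. A minimal sketch of a cache-friendly message layout (the names and message shape below are illustrative, not a specific SDK's API):

```python
# Static content: identical bytes on every request, so it can be cached.
SYSTEM_PROMPT = "You are a precise research assistant."      # static persona
TOOL_SCHEMAS = [{"name": "search", "description": "..."}]    # static tool defs
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]                                                            # static examples

def build_messages(user_query: str) -> list:
    # Static few-shot examples first, dynamic query last:
    # every request shares an identical prefix up to the final message.
    return FEW_SHOT + [{"role": "user", "content": user_query}]

print(build_messages("a")[:2] == build_messages("b")[:2])  # prints "True"
```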
Implementation with Anthropic Claude
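A minimal sketch using the Anthropic Messages API's `cache_control` field, which marks everything up to and including the flagged block as the cacheable prefix. The model name and prompt text are placeholders; this builds the request payload only, with the actual SDK call shown commented out:

```python
LONG_SYSTEM_PROMPT = "You are a support agent. <policy>...</policy>"  # imagine 1,000+ tokens

def build_request(user_message: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        # The cache_control marker tells the API to cache the prefix ending
        # at this block; subsequent identical prefixes hit the cache.
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic, per-request content goes after the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

# With the `anthropic` package and ANTHROPIC_API_KEY set:
# client = anthropic.Anthropic()
# response = client.messages.create(**build_request("How do I reset my password?"))
# response.usage then reports cache_creation_input_tokens / cache_read_input_tokens.
```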
Caching gotchas
- Cache TTL: Anthropic's cache TTL is 5 minutes by default. For low-traffic agents, the cache may expire between requests. Factor in cache warm-up cost when estimating savings.
- Byte-identical requirement: Dynamic content (timestamps, user IDs) must come after the cached prefix, never inside it.
- Output tokens aren't cached: Caching only applies to input. High-output tasks see smaller relative savings.
- Minimum cache size: Most providers require a minimum of 1,024 tokens for the cache to apply.
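For low-traffic agents, the TTL gotcha can be handled with a keep-alive check. A rough sketch (the function, constant, and 30-second margin are assumptions for illustration; the 5-minute TTL is from the default noted above):

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # default ephemeral cache TTL

def needs_warmup(last_request_ts: float, now: float, margin: float = 30.0) -> bool:
    """True if the cache is about to expire, so a cheap keep-alive request
    (reusing the same prefix) is worth sending before the next real request."""
    return (now - last_request_ts) >= (CACHE_TTL_SECONDS - margin)

# Usage in a polling loop:
# if needs_warmup(last_request_ts, time.time()):
#     send_keepalive()          # hypothetical: any request with the same prefix
#     last_request_ts = time.time()
```

Whether warming pays off depends on traffic: the keep-alive itself costs cached-input tokens, so it only makes sense when real requests arrive often enough to benefit.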
Automatic prompt caching on MoltBot
MoltBot handles cache management, per-request cost tracking, and automatic cache warming. 14-day free trial.
Start Free Trial →