Most API pricing distinguishes between input tokens (what you send) and output tokens (what the model generates). Prompt caching adds a third category: cached input tokens, which are substantially cheaper, typically 50–90% less than uncached input tokens.
How KV cache reuse works
When an LLM processes your prompt, it computes a "key-value" (KV) representation of every token. This computation is the expensive part. Prompt caching saves this computed KV state server-side. When you send the same prefix on a subsequent request, the model skips recomputing the cached portion and jumps straight to the new content, dramatically reducing both latency and cost.
The critical constraint: the cached prefix must be byte-identical and must appear at the beginning of your prompt. Even a single character difference invalidates the cache.
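The prefix rule is easy to violate by accident. A minimal sketch (prompt strings are illustrative) showing how a timestamp placed *before* the static content changes the leading bytes on every request, while placing it *after* preserves an identical, cacheable prefix:

```python
STATIC_SYSTEM = "You are a billing assistant. Follow the policy below.\n<policy>...</policy>"

def bad_prompt(now: str) -> str:
    # Timestamp inside the prefix: every request begins with different bytes,
    # so the cached prefix never matches and the cache is always missed.
    return f"Current time: {now}\n{STATIC_SYSTEM}"

def good_prompt(now: str) -> str:
    # Static content first, dynamic content after: the leading bytes are
    # identical across requests, so the cached prefix can be reused.
    return f"{STATIC_SYSTEM}\nCurrent time: {now}"

# Two requests five minutes apart still share the same cacheable prefix:
a, b = good_prompt("10:00"), good_prompt("10:05")
print(a[:len(STATIC_SYSTEM)] == b[:len(STATIC_SYSTEM)])  # prints "True"
```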
Cost comparison: with vs without caching
| Scenario | Without caching | With caching | Savings |
|---|---|---|---|
| 10k system prompt, 1k user message, 100 requests/day | $33/day | $4.40/day | 87% |
| 50k knowledge base, 500 user queries | $125 | $18.75 | 85% |
| Tool definitions (30 tools × 200 tokens each) | $3/1k requests | $0.60/1k requests | 80% |
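The table's savings figures can be approximated with a back-of-envelope estimator. A rough sketch, assuming cached input is billed at some fraction of the base input price (the function name and the 10% default ratio are assumptions for illustration, not a specific provider's rate):

```python
def caching_savings(cached_tokens: int, uncached_tokens: int,
                    cache_price_ratio: float = 0.10) -> float:
    """Fraction saved on input-token cost when `cached_tokens` hit the cache.

    Assumes cached input is billed at `cache_price_ratio` times the base
    input price (the 50-90% discount range maps to ratios of 0.5 down to 0.1).
    """
    cost_without = cached_tokens + uncached_tokens                 # all at base price
    cost_with = cached_tokens * cache_price_ratio + uncached_tokens
    return 1 - cost_with / cost_without

# e.g. a 10k cached system prompt plus a 1k dynamic user message per request:
print(f"{caching_savings(10_000, 1_000):.0%}")  # prints "82%"
```

Note that the realized savings depend on the ratio of cached to uncached tokens per request, which is why the table rows with larger static prefixes show bigger discounts.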
Where to apply caching
- Long system prompts: Personas, instructions, and behavioral guidelines that don't change per user. If your system prompt is over 1,000 tokens, caching is almost always worth it.
- Knowledge base / RAG context: When you inject a static document or knowledge chunk that's the same across many requests, cache it.
- Tool definitions: Tool schemas can run 200–500 tokens each. With 20+ tools, that's 4,000–10,000 tokens of cacheable content.
- Few-shot examples: Your demonstration examples are typically identical across requests, making them perfect cache candidates.
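All four of these share one structural rule: static content forms the prefix, and only the live query varies. A minimal sketch of a cache-friendly message layout (the names and message shape below are illustrative, not a specific SDK's API):

```python
# Static content: identical bytes on every request, so it can be cached.
SYSTEM_PROMPT = "You are a precise research assistant."      # static persona
TOOL_SCHEMAS = [{"name": "search", "description": "..."}]    # static tool defs
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]                                                            # static examples

def build_messages(user_query: str) -> list:
    # Static few-shot examples first, dynamic query last:
    # every request shares an identical prefix up to the final message.
    return FEW_SHOT + [{"role": "user", "content": user_query}]

print(build_messages("a")[:2] == build_messages("b")[:2])  # prints "True"
```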
Implementation with Anthropic Claude
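A minimal sketch using the Anthropic Messages API's `cache_control` field, which marks everything up to and including the flagged block as the cacheable prefix. The model name and prompt text are placeholders; this builds the request payload only, with the actual SDK call shown commented out:

```python
LONG_SYSTEM_PROMPT = "You are a support agent. <policy>...</policy>"  # imagine 1,000+ tokens

def build_request(user_message: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        # The cache_control marker tells the API to cache the prefix ending
        # at this block; subsequent identical prefixes hit the cache.
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic, per-request content goes after the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

# With the `anthropic` package and ANTHROPIC_API_KEY set:
# client = anthropic.Anthropic()
# response = client.messages.create(**build_request("How do I reset my password?"))
# response.usage then reports cache_creation_input_tokens / cache_read_input_tokens.
```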
Caching gotchas
- Cache TTL: Anthropic's cache TTL is 5 minutes by default. For low-traffic agents, the cache may expire between requests. Factor in cache warm-up cost when estimating savings.
- Byte-identical requirement: Dynamic content (timestamps, user IDs) must come after the cached prefix, never inside it.
- Output tokens aren't cached: Caching only applies to input. High-output tasks see smaller relative savings.
- Minimum cache size: Most providers require a minimum of 1,024 tokens for the cache to apply.
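For low-traffic agents, the TTL gotcha can be handled with a keep-alive check. A rough sketch (the function, constant, and 30-second margin are assumptions for illustration; the 5-minute TTL is from the default noted above):

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # default ephemeral cache TTL

def needs_warmup(last_request_ts: float, now: float, margin: float = 30.0) -> bool:
    """True if the cache is about to expire, so a cheap keep-alive request
    (reusing the same prefix) is worth sending before the next real request."""
    return (now - last_request_ts) >= (CACHE_TTL_SECONDS - margin)

# Usage in a polling loop:
# if needs_warmup(last_request_ts, time.time()):
#     send_keepalive()          # hypothetical: any request with the same prefix
#     last_request_ts = time.time()
```

Whether warming pays off depends on traffic: the keep-alive itself costs cached-input tokens, so it only makes sense when real requests arrive often enough to benefit.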
Automatic prompt caching on MoltBot
MoltBot handles cache management, per-request cost tracking, and automatic cache warming. 14-day free trial.
Start Free Trial →