How to Reduce LLM API Costs with Semantic Caching
Every time a user asks your LLM a question it has seen before, you are paying for the same answer twice. Semantic caching solves this by recognizing similar queries and returning cached responses instantly.
What is Semantic Caching?
Traditional caching only matches exact strings. If a user asks "What is machine learning?" and another asks "Explain machine learning", traditional caching treats these as completely different queries.
Semantic caching understands meaning. It converts queries into embeddings (numerical representations of meaning) and finds similar past queries, regardless of exact wording.
Traditional Cache
- "What is ML?" → Cache miss
- "What is machine learning?" → Cache miss
- "Explain ML to me" → Cache miss
3 API calls = 3x cost
Semantic Cache
- "What is ML?" → API call (cached)
- "What is machine learning?" → Cache hit
- "Explain ML to me" → Cache hit
1 API call = 1x cost
How Semantic Caching Works
- Query arrives → A user sends a question to your LLM application.
- Generate embedding → The query is converted into a vector embedding that captures its semantic meaning.
- Search cache → The embedding is compared against previously cached query embeddings using similarity search.
- Return or call API → If the similarity exceeds the threshold (e.g., 85%), return the cached response. Otherwise, call the LLM and cache the new response.
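The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, so it only catches near-identical wording, and all class and function names are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: lowercase word counts. A real system would call an
    # embedding model here; this stand-in only illustrates the cache flow.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, response)

    def lookup(self, query):
        # Return the cached response for the most similar past query,
        # or None if nothing clears the similarity threshold.
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit
        return None  # cache miss: caller should invoke the LLM and store()

    def store(self, query, response):
        self.entries.append((embed(query), response))
```

On a miss, the caller invokes the LLM and calls `store()` so the next similar query becomes a hit. In production the linear scan over `entries` would be replaced by a vector index.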
When to Use Semantic Caching
Semantic caching is most effective when your application has repetitive queries:
Best Use Cases:
- FAQ bots → Users ask similar questions repeatedly
- Documentation Q&A → Common questions about your product
- Customer support → Repetitive inquiries about policies, features
- Search assistants → Similar search intents with different wording
Less Effective For:
- Creative generation → Unique outputs needed each time
- Personalized responses → Context-dependent answers
- Real-time data → Queries about current events or live data
Implementation Considerations
Similarity Threshold
The similarity threshold determines how "close" a query must be to an earlier one before the cached response is returned. Higher thresholds (90%+) are more conservative but may miss valid matches; lower thresholds (around 80%) catch more matches but risk returning inappropriate responses. Most applications start at 85% and adjust based on feedback.
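One way to pick a threshold is to label a sample of (similarity score, was the cached answer actually appropriate?) pairs and sweep candidate thresholds over them. The data below is purely hypothetical, made up to show the trade-off:

```python
# Hypothetical labelled pairs: (similarity score, whether the cached
# answer was actually appropriate for the new query).
pairs = [(0.95, True), (0.91, True), (0.87, True),
         (0.84, True), (0.82, False), (0.78, False)]

def evaluate(threshold):
    # Count how many queries would be served from cache at this threshold,
    # and how many of those would have returned an inappropriate answer.
    hits = [ok for sim, ok in pairs if sim >= threshold]
    return len(hits), sum(1 for ok in hits if not ok)

for t in (0.80, 0.85, 0.90):
    hits, bad = evaluate(t)
    print(f"threshold {t:.2f}: {hits} cache hits, {bad} inappropriate")
```

In this made-up sample, lowering the threshold from 0.85 to 0.80 gains two extra hits but one of them serves a wrong answer, which is the trade-off described above.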
Cache Invalidation
Cached responses can become stale. Consider:
- TTL (Time to Live) → Automatically expire cache entries after a set period
- Manual invalidation → Clear the cache when source data changes
- Version tagging → Associate cache entries with content versions
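TTL and version tagging can be combined in one validity check. A minimal sketch, with all names invented for illustration:

```python
import time

class CacheEntry:
    def __init__(self, response, version, ttl_seconds):
        self.response = response
        self.version = version  # e.g., a content version tag like "docs-v2"
        self.expires_at = time.time() + ttl_seconds

    def is_valid(self, current_version):
        # An entry is usable only if it has not expired (TTL) and was
        # generated from the current version of the source content.
        return time.time() < self.expires_at and self.version == current_version
```

Bumping the version tag when source content changes invalidates every stale entry at once, while the TTL bounds staleness even if the version is never bumped.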
Key Takeaways
- Semantic caching understands meaning, not just exact string matches
- Best for repetitive use cases like FAQ, support, and documentation
- Start with 85% similarity threshold and adjust based on results
- Plan for cache invalidation to avoid stale responses
Costbase includes built-in semantic caching with configurable similarity thresholds.
Try it free