In 2024, you called one model. In 2026, you orchestrate twelve. The difference between a $0.50 answer and a $0.002 answer to the same prompt often comes down not to quality but to routing. Getting orchestration right is how teams cut LLM costs by 60-90% while maintaining or improving output quality.
What is LLM Orchestration?
LLM orchestration is the coordination layer between your application and one or more language model endpoints. A well-designed orchestration layer handles:
- Model selection: routing each request to the best model for that task
- Fallbacks: switching to a backup model when the primary is down or rate-limited
- Caching: returning cached responses for identical or near-identical prompts
- Context management: chunking and compressing long contexts to fit model limits
- Cost tracking: monitoring token usage across every model and provider
- Load balancing: distributing requests across multiple API keys or providers
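The responsibilities above can be sketched as one thin coordination class. This is an illustrative toy, not a production design: `call_model` is assumed to be an injected provider client returning `(text, cost)`, and the route table maps task types to a `(primary, fallback)` pair.

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    """Toy coordination layer: cache -> select -> fallback -> cost tracking."""
    routes: dict                          # task_type -> (primary, fallback)
    cache: dict = field(default_factory=dict)
    spend: dict = field(default_factory=dict)

    def handle(self, task_type: str, prompt: str, call_model) -> str:
        key = (task_type, prompt)
        if key in self.cache:                        # caching
            return self.cache[key]
        primary, fallback = self.routes[task_type]   # model selection
        for model in (primary, fallback):            # fallback chain
            try:
                text, cost = call_model(model, prompt)
            except RuntimeError:
                continue                             # primary down: try backup
            self.spend[model] = self.spend.get(model, 0.0) + cost  # cost tracking
            self.cache[key] = text
            return text
        raise RuntimeError("all models failed")
```

Each section below expands one of these responsibilities into a concrete pattern.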
The Model Landscape in 2026
The key insight: different models have wildly different cost/quality tradeoffs. Routing to the cheapest viable model per task is the single highest-ROI optimization.
ℹ️ Real-world cost spread
Claude Opus 4 costs ~$0.015/1K output tokens. Qwen 3.5 B (local) costs effectively $0. A well-routed system pays Opus prices only for tasks that need it, often less than 15% of requests.
# MoltBot Omnisphere routing table (simplified)
ROUTING_TABLE = {
    "code_generation": {"primary": "claude-opus", "fallback": "gpt-5-mini"},
    "code_review": {"primary": "claude-sonnet", "fallback": "gemini-pro"},
    "summarization": {"primary": "gemini-flash", "fallback": "qwen-local"},
    "classification": {"primary": "qwen-local", "fallback": "deepseek-v3"},
    "research": {"primary": "gpt-5", "fallback": "kimi-k1.5"},
    "simple_qa": {"primary": "qwen-local", "fallback": "deepseek-v3"},
    "translation": {"primary": "deepseek-v3", "fallback": "gemini-flash"},
}
Pattern 1: Rule-Based Routing
The simplest orchestration pattern: classify the task upfront, then route to a predetermined model. Works well when your task types are well-defined and stable.
def route_request(prompt: str, task_type: str) -> str:
    route = ROUTING_TABLE.get(task_type, {})
    model = route.get("primary", "gpt-5-mini")
    response = call_model(model, prompt)
    if response.error:
        # Fall back to the secondary model; default matches the primary default
        # so an unknown task_type can't raise a KeyError here
        fallback = route.get("fallback", "gpt-5-mini")
        response = call_model(fallback, prompt)
    return response.text
Pattern 2: Semantic Routing
Classify the prompt itself using an embedding model, then route based on semantic similarity to known task categories. More flexible than rule-based โ handles novel requests gracefully.
from chromadb import Client

db = Client()
routing_collection = db.get_collection("routing_examples")

def semantic_route(prompt: str) -> str:
    # Embed the prompt
    embedding = embed(prompt)
    # Find closest known task type
    results = routing_collection.query(
        query_embeddings=[embedding], n_results=1
    )
    task_type = results["metadatas"][0][0]["task_type"]
    return ROUTING_TABLE[task_type]["primary"]
Pattern 3: Cost-Aware Routing (Arbitrage)
This is what MoltBot's Omnisphere engine uses. Route to the cheapest model that is predicted to meet a quality threshold for the given request. Quality is estimated using a fast classifier trained on past outputs.
✅ Result
In production across 50+ customers, arbitrage routing reduces average cost-per-request by 64% with zero measurable quality regression on customer-defined evals.
def arbitrage_route(prompt: str, quality_threshold: float = 0.85) -> str:
    """Route to cheapest model predicted to meet quality threshold."""
    candidates = get_models_sorted_by_cost()
    for model in candidates:
        predicted_quality = quality_classifier.predict(prompt, model)
        if predicted_quality >= quality_threshold:
            return model
    # Fall back to best model if no candidate meets threshold
    return "claude-opus"
Pattern 4: Cascading Fallbacks
Try models in order from cheapest to most expensive. If output quality (measured via a fast judge model or rule-based check) meets threshold, return it. Otherwise escalate.
CASCADE = ["qwen-local", "deepseek-v3", "gpt-5-mini", "claude-sonnet", "claude-opus"]

def cascade_route(prompt: str, validator) -> str:
    for model in CASCADE:
        response = call_model(model, prompt)
        if validator(response.text):
            return response.text
        log_escalation(model, response)
    return response.text  # Return best effort from final model
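The `validator` can be a fast judge model or a plain rule-based check. A minimal sketch of the rule-based option, assuming the task expects a JSON object back (the function name `json_validator` is illustrative):

```python
import json

def json_validator(text: str) -> bool:
    """Accept output only if it parses as a JSON object (rule-based quality gate)."""
    try:
        return isinstance(json.loads(text), dict)
    except (ValueError, TypeError):
        return False
```

Rule-based validators cost nothing per call, which matters: a cascade that starts at a local model only pays for escalation when the cheap tiers actually fail.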
Context Window Management
Models have context limits (8K to 200K tokens). Long-running agent sessions easily exceed them. Effective orchestration handles this transparently:
- Sliding window: Keep only the last N turns in context
- Hierarchical summarization: Compress older turns into summaries
- Memory offload: Push important facts to ChromaDB, retrieve via RAG
- Token budgeting: Pre-count tokens before sending, trim proactively
def prepare_context(messages: list, model_limit: int = 8192) -> list:
    """Trim/compress context to fit model limit."""
    tokens = count_tokens(messages)
    if tokens <= model_limit * 0.8:
        return messages
    # Summarize older messages, keep the 10 most recent verbatim
    old_messages = messages[:-10]
    summary = call_model("gemini-flash",
        f"Summarize this conversation concisely:\n{old_messages}")
    return [{"role": "system", "content": f"[Context summary]: {summary}"}] + messages[-10:]
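The sketch above covers hierarchical summarization. The sliding-window and token-budgeting strategies combine into an even cheaper option that needs no extra model call. This version uses a rough 4-characters-per-token heuristic for counting, which is an assumption; a real deployment would swap in the model's actual tokenizer.

```python
def sliding_window(messages: list, model_limit: int = 8192,
                   chars_per_token: int = 4) -> list:
    """Keep the most recent messages whose estimated token count fits the budget.

    Token counts use a ~4 chars/token heuristic; replace with a real
    tokenizer for accurate budgeting.
    """
    budget = int(model_limit * 0.8)  # leave headroom for the response
    kept, used = [], 0
    for msg in reversed(messages):   # newest first
        cost = len(msg["content"]) // chars_per_token + 1
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order
```

In practice the two compose: slide the window first, and summarize only what falls out of it.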
Caching Strategy
Two levels of caching cut costs dramatically:
- Exact-match cache: Hash (model + prompt) → cached response. Simple, effective for FAQs and repeated tool calls.
- Semantic cache: If a new prompt is >95% cosine-similar to a cached prompt, return the cached response. Handles rephrased duplicates.
from hashlib import sha256

class LLMCache:
    def __init__(self, ttl_seconds=600):
        self.exact = {}
        self.semantic_db = ChromaClient().create_collection("cache")
        self.ttl = ttl_seconds

    def get(self, prompt, model):
        # Exact match
        key = sha256(f"{model}:{prompt}".encode()).hexdigest()
        if key in self.exact:
            return self.exact[key]
        # Semantic match
        embedding = embed(prompt)
        results = self.semantic_db.query(query_embeddings=[embedding], n_results=1)
        if results and results["distances"][0][0] < 0.05:
            return results["documents"][0][0]
        return None
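The class above only reads; a write path would populate both levels and enforce the TTL. A self-contained sketch of the exact-match half with expiry (the `ExactCache` name and the evict-on-read design are illustrative, not the MoltBot implementation):

```python
import time
from hashlib import sha256

class ExactCache:
    """Exact-match cache level: SHA-256 key over model + prompt, with TTL."""

    def __init__(self, ttl_seconds=600):
        self.store = {}  # key -> (expires_at, response)
        self.ttl = ttl_seconds

    def _key(self, prompt, model):
        return sha256(f"{model}:{prompt}".encode()).hexdigest()

    def set(self, prompt, model, response):
        self.store[self._key(prompt, model)] = (time.time() + self.ttl, response)

    def get(self, prompt, model):
        hit = self.store.get(self._key(prompt, model))
        if hit is None:
            return None
        expires_at, response = hit
        if time.time() > expires_at:  # stale entry: evict lazily on read
            del self.store[self._key(prompt, model)]
            return None
        return response
```

Evicting lazily on read keeps writes cheap; a background sweep is only needed if memory pressure from expired entries becomes a problem.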
Observability
You can't optimize what you can't measure. Log every orchestration decision:
- Prompt hash, model selected, tokens in/out, latency, cost
- Cache hit/miss ratio per model
- Fallback trigger rate
- Quality scores from judge model
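A structured event per decision makes those metrics trivial to aggregate downstream. A minimal sketch, assuming one JSON line per request (the `RouteEvent` schema and field names here are hypothetical, not MoltBot's wire format):

```python
import json
from dataclasses import dataclass, asdict
from hashlib import sha256

@dataclass
class RouteEvent:
    """One orchestration decision, serialized as a JSON log line."""
    prompt_hash: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool
    fallback_used: bool

def log_decision(prompt, model, tokens_in, tokens_out, latency_ms,
                 cost_usd, cache_hit=False, fallback_used=False):
    event = RouteEvent(
        prompt_hash=sha256(prompt.encode()).hexdigest()[:16],  # never log raw prompts
        model=model, tokens_in=tokens_in, tokens_out=tokens_out,
        latency_ms=latency_ms, cost_usd=cost_usd,
        cache_hit=cache_hit, fallback_used=fallback_used,
    )
    return json.dumps(asdict(event))  # one line per decision
```

Hashing the prompt instead of logging it keeps the pipeline free of user content while still letting you group repeated requests for cache analysis.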
MoltBot ships all of this out of the box
Omnisphere handles routing, fallbacks, caching, context management, and cost tracking automatically, with no configuration needed.