A context window is the model's working memory: the maximum number of tokens it can process in a single call, including your system prompt, conversation history, tools, and the user's message. Exceed it and you get a hard error. Approach it and quality degrades as the model struggles to attend to early content.
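Before a call, you can sanity-check whether a request fits using the rough rule of thumb of ~4 characters per token for English text. This is only a sketch: a real tokenizer (or your provider's token-counting endpoint) is far more accurate, and `estimate_tokens`/`fits_in_window` below are illustrative names, not a library API.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 chars/token rule of thumb."""
    return max(1, len(text) // 4)

def fits_in_window(system: str, history: list[str], user: str,
                   window: int = 200_000, reserve: int = 4_096) -> bool:
    """Check whether a request fits, reserving room for the model's reply."""
    used = sum(estimate_tokens(t) for t in [system, user, *history])
    return used + reserve <= window
```

Reserving output tokens matters: a prompt that exactly fills the window leaves the model no room to answer.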
## 2026 context window landscape
| Model | Context window | Approx. pages | Best for |
|---|---|---|---|
| Claude Haiku 4 | 200K tokens | ~550 pages | Fast tasks, classification, extraction |
| Claude Sonnet 4 | 200K tokens | ~550 pages | Most production use cases |
| Claude Opus 4 | 200K tokens | ~550 pages | Complex reasoning, long documents |
| Gemini 2.0 Ultra | 1M tokens | ~2,700 pages | Codebase analysis, full-book processing |
| GPT-5 | 128K tokens | ~350 pages | General purpose, tool use |
## When long context isn't enough: chunking strategies
- Fixed-size chunking: Split documents into equal-sized chunks (e.g., 512 tokens each) with 10-20% overlap to preserve context at boundaries. Simple and fast. Works well for homogeneous text.
- Semantic chunking: Split at natural boundaries (paragraphs, sections, sentences) rather than fixed token counts. Better preserves context and improves retrieval quality.
- Hierarchical chunking: Create parent chunks (e.g., full sections) and child chunks (paragraphs within sections). Retrieve child chunks but include the parent for context. Best quality, more complex to implement.
- Late chunking: Embed the full document first to get cross-chunk attention, then split into chunks. Only feasible for documents within context window size.
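The fixed-size strategy is simple enough to sketch directly. This illustrative version operates on an already-tokenized list and keeps `overlap` tokens shared between neighboring chunks, so content near a boundary appears in both chunks:

```python
def chunk_fixed(tokens: list[str], chunk_size: int = 512,
                overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks
```

With `chunk_size=512` and `overlap=64` (12.5%, inside the 10-20% range above), each chunk shares its last 64 tokens with the start of the next one.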
## Chunking implementation
```python
from moltbot.chunking import SemanticChunker

chunker = SemanticChunker(
    chunk_size=512,                      # tokens per chunk
    chunk_overlap=50,                    # overlap between chunks
    split_on=["paragraph", "sentence"],  # semantic boundaries
)

chunks = chunker.chunk(document_text)
# Returns a list of chunks with metadata:
# chunk.text, chunk.token_count, chunk.start_char, chunk.end_char
```
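For the hierarchical strategy described above, the core idea is a two-level structure: retrieval matches against small child chunks, but the enclosing parent chunk is what gets passed to the model. A minimal sketch (illustrative only; `Chunk`, `ParentChunk`, and `hierarchical_chunks` are hypothetical names, not part of any library):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    start_char: int
    end_char: int

@dataclass
class ParentChunk(Chunk):
    children: list[Chunk] = field(default_factory=list)

def hierarchical_chunks(sections: list[str]) -> list[ParentChunk]:
    """Build parent chunks (full sections) containing child chunks
    (paragraphs). Match queries against children; send the parent."""
    parents, offset = [], 0
    for section in sections:
        parent = ParentChunk(section, offset, offset + len(section))
        pos = offset
        for para in section.split("\n\n"):
            parent.children.append(Chunk(para, pos, pos + len(para)))
            pos += len(para) + 2  # skip the blank-line separator
        parents.append(parent)
        offset += len(section) + 2
    return parents
```

At query time you embed and search the children for precision, then substitute each hit's parent for context before generation.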
## Long context vs RAG: how to choose
- Use long context when: the full document is under the context limit, you need cross-document reasoning, or latency is not critical.
- Use RAG when: you have more data than any context window, costs need to be minimized, or you need sub-second retrieval across a large corpus.
- Use both: retrieve the most relevant chunks with RAG, then stuff them into a long-context model for final synthesis. This is the production sweet spot for most enterprise use cases.
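The retrieve-then-synthesize pattern in the last bullet can be sketched in a few lines. Here `retrieve` and `generate` are hypothetical stand-ins for your vector store and long-context model client; the function only shows how the two stages compose:

```python
def hybrid_answer(query: str, retrieve, generate, top_k: int = 20) -> str:
    """RAG stage: pull the top-k relevant chunks for the query.
    Long-context stage: stuff them all into one synthesis prompt."""
    chunks = retrieve(query, top_k=top_k)
    context = "\n\n---\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using only the excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Because retrieval narrows the corpus first, `top_k` can be generous (dozens of chunks) while the final prompt still fits comfortably in a 200K-token window.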
## Automatic chunking + long-context routing on MoltBot
MoltBot handles semantic chunking, hybrid RAG + long-context pipelines, and automatic model selection for you. 14-day free trial.
Start Free Trial →