A context window is the model's working memory: the maximum number of tokens it can process in a single call, including your system prompt, conversation history, tools, and the user's message. Exceed it and you get a hard error. Approach it and quality degrades as the model struggles to attend to early content.
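Before a call, you can sanity-check whether a request fits using the rough rule of thumb of ~4 characters per token for English text. This is only a sketch: a real tokenizer (or your provider's token-counting endpoint) is far more accurate, and `estimate_tokens`/`fits_in_window` below are illustrative names, not a library API.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 chars/token rule of thumb."""
    return max(1, len(text) // 4)

def fits_in_window(system: str, history: list[str], user: str,
                   window: int = 200_000, reserve: int = 4_096) -> bool:
    """Check whether a request fits, reserving room for the model's reply."""
    used = sum(estimate_tokens(t) for t in [system, user, *history])
    return used + reserve <= window
```

Reserving output tokens matters: a prompt that exactly fills the window leaves the model no room to answer.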
## 2026 context window landscape
| Model | Context window | Approx. pages | Best for |
|---|---|---|---|
| Claude Haiku 4 | 200K tokens | ~550 pages | Fast tasks, classification, extraction |
| Claude Sonnet 4 | 200K tokens | ~550 pages | Most production use cases |
| Claude Opus 4 | 200K tokens | ~550 pages | Complex reasoning, long documents |
| Gemini 2.0 Ultra | 1M tokens | ~2,700 pages | Codebase analysis, full-book processing |
| GPT-5 | 128K tokens | ~350 pages | General purpose, tool use |
## When long context isn't enough: chunking strategies
- Fixed-size chunking: Split documents into equal-sized chunks (e.g., 512 tokens each) with 10-20% overlap to preserve context at boundaries. Simple and fast. Works well for homogeneous text.
- Semantic chunking: Split at natural boundaries (paragraphs, sections, sentences) rather than fixed token counts. Better preserves context and improves retrieval quality.
- Hierarchical chunking: Create parent chunks (e.g., full sections) and child chunks (paragraphs within sections). Retrieve child chunks but include the parent for context. Best quality, more complex to implement.
- Late chunking: Embed the full document first to get cross-chunk attention, then split into chunks. Only feasible for documents within context window size.
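The fixed-size strategy is simple enough to sketch directly. This illustrative version operates on an already-tokenized list and keeps `overlap` tokens shared between neighboring chunks, so content near a boundary appears in both chunks:

```python
def chunk_fixed(tokens: list[str], chunk_size: int = 512,
                overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks
```

With `chunk_size=512` and `overlap=64` (12.5%, inside the 10-20% range above), each chunk shares its last 64 tokens with the start of the next one.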
## Chunking implementation
```python
from moltbot.chunking import SemanticChunker

chunker = SemanticChunker(
    chunk_size=512,                      # tokens per chunk
    chunk_overlap=50,                    # overlap between chunks
    split_on=["paragraph", "sentence"],  # semantic boundaries
)

chunks = chunker.chunk(document_text)
# Returns a list of chunks with metadata:
# chunk.text, chunk.token_count, chunk.start_char, chunk.end_char
```
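For the hierarchical strategy described above, the core idea is a two-level structure: retrieval matches against small child chunks, but the enclosing parent chunk is what gets passed to the model. A minimal sketch (illustrative only; `Chunk`, `ParentChunk`, and `hierarchical_chunks` are hypothetical names, not part of any library):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    start_char: int
    end_char: int

@dataclass
class ParentChunk(Chunk):
    children: list[Chunk] = field(default_factory=list)

def hierarchical_chunks(sections: list[str]) -> list[ParentChunk]:
    """Build parent chunks (full sections) containing child chunks
    (paragraphs). Match queries against children; send the parent."""
    parents, offset = [], 0
    for section in sections:
        parent = ParentChunk(section, offset, offset + len(section))
        pos = offset
        for para in section.split("\n\n"):
            parent.children.append(Chunk(para, pos, pos + len(para)))
            pos += len(para) + 2  # skip the blank-line separator
        parents.append(parent)
        offset += len(section) + 2
    return parents
```

At query time you embed and search the children for precision, then substitute each hit's parent for context before generation.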
## Long context vs RAG: how to choose
- Use long context when: the full document is under the context limit, you need cross-document reasoning, or latency is not critical.
- Use RAG when: you have more data than any context window, costs need to be minimized, or you need sub-second retrieval across a large corpus.
- Use both: retrieve the most relevant chunks with RAG, then stuff them into a long-context model for final synthesis. This is the production sweet spot for most enterprise use cases.
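The retrieve-then-synthesize pattern in the last bullet can be sketched in a few lines. Here `retrieve` and `generate` are hypothetical stand-ins for your vector store and long-context model client; the function only shows how the two stages compose:

```python
def hybrid_answer(query: str, retrieve, generate, top_k: int = 20) -> str:
    """RAG stage: pull the top-k relevant chunks for the query.
    Long-context stage: stuff them all into one synthesis prompt."""
    chunks = retrieve(query, top_k=top_k)
    context = "\n\n---\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using only the excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Because retrieval narrows the corpus first, `top_k` can be generous (dozens of chunks) while the final prompt still fits comfortably in a 200K-token window.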
## Automatic chunking + long-context routing on MoltBot
MoltBot handles semantic chunking, hybrid RAG + long-context pipelines, and automatic model selection for you. 14-day free trial.
Start Free Trial →