~$ man fenetre-de-contexte
What is an LLM's context window?
definition
The context window is the maximum number of tokens an LLM can process in a single request, counting both the input and the generated output.
It sets how much conversation history, documents, or instructions the model can use when creating its next reply.
Bigger windows let models work with longer texts but need more memory and compute power.
Think of it like a notebook page where you can only write so many lines before the top lines get erased to make room for new ones.
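The notebook analogy can be sketched in a few lines. This is only an illustration, assuming each whitespace-separated word counts as one token (real tokenizers split text differently) and that the application drops the oldest tokens itself; `write_to_page` and the window of 8 are made-up names and values.

```python
from collections import deque


def write_to_page(tokens: list[str], window: int) -> list[str]:
    """Append tokens to a fixed-size page; the oldest fall off once it is full."""
    page = deque(maxlen=window)  # a full deque discards items from the left
    for token in tokens:
        page.append(token)
    return list(page)


# Assumption: one word = one token, purely for illustration.
words = "the quick brown fox jumps over the lazy dog again".split()
print(write_to_page(words, window=8))
# ['brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'again']
```

Ten words went in, the page holds eight, so the first two are gone by the time the last one is written.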
key takeaways
- Context window size is counted in tokens, where one token is roughly four characters or three-quarters of a word of English text.
- It caps the total length of the prompt and all prior messages the model can consider.
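- What happens when you go over the limit depends on the system: many APIs reject the request with an error, while chat applications often drop the oldest tokens to make room.
- Window sizes range from about 4k tokens in older models to 128k or even 1M tokens in newer ones.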
- Methods such as retrieval-augmented generation help work around small windows by fetching only needed facts.
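A rough sketch of how the takeaways above fit together: estimate token counts with the four-characters-per-token rule of thumb, then pull in only the chunks that look relevant and fit a budget. Everything here is an assumption for illustration. The function names, the sample documents, and the word-overlap scoring are hypothetical, and real retrieval systems typically use embeddings and a real tokenizer instead.

```python
def estimate_tokens(text: str) -> int:
    """Rule-of-thumb estimate: about four characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def select_chunks(question: str, chunks: list[str], budget: int) -> list[str]:
    """Pick the chunks sharing the most words with the question, within a token budget."""
    query_words = set(question.lower().split())

    def overlap(chunk: str) -> int:
        # Naive relevance score: number of words shared with the question.
        return len(query_words & set(chunk.lower().split()))

    selected: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=overlap, reverse=True):
        cost = estimate_tokens(chunk)
        if overlap(chunk) == 0 or used + cost > budget:
            continue  # skip irrelevant chunks and ones that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Refund requests need the original receipt.",
]
for chunk in select_chunks("how do I get a refund", docs, budget=100):
    print(chunk)
```

Only the two refund-related chunks reach the prompt; the unrelated one never takes up space in the window.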
the 2026 job market
By 2026, teams building LLM apps need engineers who can design prompts and systems around context limits. That creates steady demand for roles in AI application development, prompt optimization, and efficient inference pipelines across the US, Canadian, and UK tech markets.
frequently asked questions
How does context window size change model behavior?
Larger windows let the model keep more history and details, leading to more coherent long conversations. Smaller windows force earlier truncation, which can break continuity. Developers often test different sizes to balance cost and quality.
What token limits do popular LLMs have today?
GPT-4o supports a 128k-token window, Claude 3 models offer 200k, and some open models advertise up to 1M. Limits are set by the model architecture and training. Always check the provider docs, because limits can change between model versions.
Can you increase an LLM context window after training?
Techniques such as position interpolation, usually paired with a short fine-tune on longer sequences, allow modest extensions. Larger increases usually require retraining or switching to a different base model. In production, teams more often stay within the existing window by compressing prompts.
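To make position interpolation a little more concrete, here is a minimal sketch of the core idea: squeeze the position indices of a longer sequence back into the range the model saw during training. It assumes a model that uses rotary position embeddings; the function names and the tiny lengths are hypothetical, and this is not how any particular library exposes it.

```python
def interpolate_positions(seq_len: int, trained_len: int) -> list[float]:
    """Scale positions 0..seq_len-1 so they stay inside the trained range."""
    scale = min(1.0, trained_len / seq_len)  # never stretch shorter sequences
    return [i * scale for i in range(seq_len)]


def rope_angles(position: float, dim: int, base: float = 10000.0) -> list[float]:
    """Rotary embedding angles for one (possibly fractional) position."""
    return [position / base ** (2 * k / dim) for k in range(dim // 2)]


# Hypothetical model trained on 4 positions, asked to read 8.
positions = interpolate_positions(seq_len=8, trained_len=4)
print(positions)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
print(rope_angles(positions[-1], dim=4))
```

The model now sees fractional positions it was never trained on, which is why this trick generally goes together with some fine-tuning.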
Why do longer contexts cost more to run?
Standard self-attention compares every token with every other token, so its compute grows quadratically with sequence length, while the key-value cache held in memory grows linearly. Providers charge per token, and longer inputs use more of them. Efficient chunking and caching reduce these costs in practice.
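A quick back-of-the-envelope calculation shows how this plays out. The sketch below counts query-key scores for full self-attention and estimates key-value cache size; the model shape (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value) is a made-up configuration chosen for round numbers, not the spec of any real model.

```python
def attention_score_entries(seq_len: int) -> int:
    """Query-key scores per head in full self-attention: one per token pair."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values (the leading 2 covers both tensors)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, for illustration only.
for n in (4096, 8192, 16384):
    scores = attention_score_entries(n)
    cache_gib = kv_cache_bytes(n, layers=32, kv_heads=8, head_dim=128) / 2**30
    print(f"{n:>6} tokens: {scores:>12,} scores per head, {cache_gib:.1f} GiB KV cache")
```

Doubling the sequence length quadruples the score count but only doubles the cache, which is why compute tends to become the bottleneck first as contexts grow.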
