Context Window

The maximum number of tokens an LLM can process in a single request, counting both the input prompt (including prior conversation turns) and the generated output. Context windows range from 4K tokens in older models to 128K+ in modern ones like Llama 3.1 and GPT-4 Turbo. Larger context windows require significantly more VRAM because the KV cache grows linearly with sequence length. If you need to process long documents locally, plan for extra VRAM beyond what the model weights alone require.
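The extra VRAM can be estimated from the KV cache size: each token stores one key and one value vector per layer, per KV head. A minimal sketch of that arithmetic, assuming Llama-3-8B-style parameters (32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 weights):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache memory: 2 (K and V) x layers x KV heads
    x head_dim x tokens x element size x batch."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)

# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=128 * 1024)
print(f"{per_token / 1024:.0f} KiB per token")      # → 128 KiB per token
print(f"{full_context / 2**30:.1f} GiB at 128K")    # → 16.0 GiB at 128K
```

So under these assumptions, filling a 128K context costs roughly 16 GiB on top of the ~16 GiB the fp16 weights themselves occupy, which is why long-context local inference often needs quantized KV caches or shorter effective windows.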
