Flash Attention
An optimized, exact attention algorithm that reduces memory usage and increases the speed of transformer models by restructuring how attention is computed in GPU memory: queries, keys, and values are processed in tiles that fit in fast on-chip SRAM, so the full attention matrix is never materialized in VRAM. This cuts attention's memory footprint from quadratic to linear in sequence length. Flash Attention is now built into most inference engines and lets you run longer context windows on the same hardware. NVIDIA GPUs benefit most; support on AMD and Apple Silicon is improving.
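The core trick is an online (streaming) softmax: attention is computed one block of keys/values at a time, with a running row-max and running denominator, so no full score matrix is ever stored. Below is a minimal NumPy sketch of that idea for a single head, compared against naive attention; the function names, block size, and 2-D shapes are illustrative assumptions, not the actual CUDA kernel, which also fuses these steps and recomputes blocks during the backward pass.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix -- quadratic memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention_sketch(Q, K, V, block=4):
    # Processes K/V in tiles of `block` rows; only (n, block) score
    # tiles ever exist, so memory is linear in sequence length n.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))          # running (unnormalized) output
    m = np.full(n, -np.inf)       # running row-wise max of scores
    l = np.zeros(n)               # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                    # (n, block) score tile
        m_new = np.maximum(m, S.max(axis=-1))     # updated row max
        P = np.exp(S - m_new[:, None])            # tile's softmax numerators
        alpha = np.exp(m - m_new)                 # rescale old accumulators
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]                         # normalize once at the end
```

The rescaling factor `alpha` is what makes the streaming softmax exact: whenever a new tile raises the running max, previously accumulated sums are corrected, so the result matches the naive computation to floating-point precision.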