Flash Attention
An optimized, exact attention algorithm that reduces memory usage and increases the speed of transformer models by restructuring how attention is computed in GPU memory: queries, keys, and values are processed in tiles that fit in fast on-chip SRAM, so the full attention matrix is never materialized in VRAM. This cuts attention's memory footprint from quadratic to linear in sequence length. Flash Attention is now built into most inference engines and lets you run longer context windows on the same hardware. NVIDIA GPUs benefit most; support on AMD and Apple Silicon is improving.
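The core trick is an online (streaming) softmax: attention is computed one block of keys/values at a time, with a running row-max and running denominator, so no full score matrix is ever stored. Below is a minimal NumPy sketch of that idea for a single head, compared against naive attention; the function names, block size, and 2-D shapes are illustrative assumptions, not the actual CUDA kernel, which also fuses these steps and recomputes blocks during the backward pass.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix -- quadratic memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention_sketch(Q, K, V, block=4):
    # Processes K/V in tiles of `block` rows; only (n, block) score
    # tiles ever exist, so memory is linear in sequence length n.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))          # running (unnormalized) output
    m = np.full(n, -np.inf)       # running row-wise max of scores
    l = np.zeros(n)               # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                    # (n, block) score tile
        m_new = np.maximum(m, S.max(axis=-1))     # updated row max
        P = np.exp(S - m_new[:, None])            # tile's softmax numerators
        alpha = np.exp(m - m_new)                 # rescale old accumulators
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]                         # normalize once at the end
```

The rescaling factor `alpha` is what makes the streaming softmax exact: whenever a new tile raises the running max, previously accumulated sums are corrected, so the result matches the naive computation to floating-point precision.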