Gradient Checkpointing
A memory optimization technique for training that trades extra compute time for lower peak VRAM. Instead of storing every intermediate activation during the forward pass, gradient checkpointing keeps only a subset (the "checkpoints") and recomputes the rest on demand during the backward pass. This can reduce training memory usage by 60–80% at the cost of roughly 20–30% slower training, which makes it essential for fine-tuning larger models on consumer GPUs with limited VRAM.
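To make the trade-off concrete, here is a minimal pure-Python sketch (not a framework API; all names are illustrative) using a toy scalar chain of layers. The standard backward pass stores every activation, while the checkpointed version stores only every k-th activation and recomputes each segment's inner activations just before backpropagating through it. Both produce the same gradient; the checkpointed version holds fewer values in memory at the cost of redoing forward work.

```python
# Toy scalar "network": each layer is (f, df) -- the function and its
# derivative. Structure is illustrative, not a real framework API.
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),        # f1(x) = 2x
    (lambda x: x + 1.0, lambda x: 1.0),        # f2(x) = x + 1
    (lambda x: x * x,   lambda x: 2.0 * x),    # f3(x) = x^2
]

def backward_full(x, layers, grad_out=1.0):
    """Standard backprop: store every intermediate activation."""
    acts = [x]
    for f, _ in layers:
        acts.append(f(acts[-1]))          # all n+1 activations live at once
    g = grad_out
    for (_, df), a in zip(reversed(layers), reversed(acts[:-1])):
        g *= df(a)                        # chain rule over stored activations
    return g, len(acts)                   # gradient, peak stored-value count

def backward_checkpointed(x, layers, k, grad_out=1.0):
    """Checkpointed backprop: store every k-th activation, recompute the rest."""
    ckpts = {0: x}
    a = x
    for i, (f, _) in enumerate(layers):   # forward keeps only segment boundaries
        a = f(a)
        if (i + 1) % k == 0:
            ckpts[i + 1] = a
    g = grad_out
    n = len(layers)
    # Walk segments from last to first, recomputing inner activations
    # from the segment's boundary checkpoint before backpropagating.
    for start in range(((n - 1) // k) * k, -1, -k):
        end = min(start + k, n)
        seg = [ckpts[start]]
        for f, _ in layers[start:end]:
            seg.append(f(seg[-1]))        # the extra forward work
        for (_, df), a in zip(reversed(layers[start:end]), reversed(seg[:-1])):
            g *= df(a)
    return g, len(ckpts)                  # same gradient, fewer stored values

grad_full, stored_full = backward_full(3.0, LAYERS)
grad_ckpt, stored_ckpt = backward_checkpointed(3.0, LAYERS, k=2)
print(grad_full, grad_ckpt)               # identical gradients: 28.0 28.0
print(stored_full, stored_ckpt)           # 4 stored activations vs 2 checkpoints
```

With segment length k near the square root of the layer count, roughly 2√n values stay live instead of n, which is where the large memory savings come from. In PyTorch the same idea is exposed through `torch.utils.checkpoint.checkpoint`, which wraps a function or module so its activations are recomputed during the backward pass.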