Pruning
A model compression technique that removes weights contributing little to a model's outputs, yielding a smaller and often faster model. Structured pruning removes entire neurons, channels, or attention heads, so the remaining dense model runs faster on standard hardware; unstructured pruning zeroes out individual weights, which reduces size and compute only when the model is stored or executed in a sparse format. With careful calibration, and typically some fine-tuning to recover accuracy, pruning can remove 50–90% of a model's weights. Combined with quantization, pruning enables very large models to run on consumer GPUs that couldn't otherwise fit them.
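As a minimal sketch, unstructured pruning is often done by magnitude: zero out the fraction of weights with the smallest absolute values. The function below is illustrative, not any particular library's API; the name `magnitude_prune` and the global-threshold rule are assumptions for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute values (unstructured magnitude pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.5)  # half the 16 weights set to zero
```

In practice, libraries such as PyTorch apply the mask during training or fine-tuning so the surviving weights can adapt, rather than pruning once after the fact.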