Pruning
A model compression technique that removes weights contributing little to a model's outputs, yielding a smaller and often faster model. Structured pruning removes entire neurons, channels, or attention heads, so the remaining dense model runs faster on standard hardware; unstructured pruning zeroes out individual weights, which reduces size and compute only when the model is stored or executed in a sparse format. With careful calibration, and typically some fine-tuning to recover accuracy, pruning can remove 50–90% of a model's weights. Combined with quantization, pruning enables very large models to run on consumer GPUs that couldn't otherwise fit them.
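As a minimal sketch, unstructured pruning is often done by magnitude: zero out the fraction of weights with the smallest absolute values. The function below is illustrative, not any particular library's API; the name `magnitude_prune` and the global-threshold rule are assumptions for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute values (unstructured magnitude pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.5)  # half the 16 weights set to zero
```

In practice, libraries such as PyTorch apply the mask during training or fine-tuning so the surviving weights can adapt, rather than pruning once after the fact.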