INT4 / INT8 (Quantization)
Quantization reduces model weights from floating-point precision (FP16/FP32) to lower-bit integers (INT8 or INT4), shrinking the memory footprint so the model fits in less VRAM. A 70B-parameter model that needs about 140 GB in FP16 fits in roughly 35–40 GB at INT4. The trade-off is a small accuracy loss, but for most local AI use cases the difference is negligible. Quantization is why consumer GPUs with 24 GB of VRAM can run surprisingly large models.
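The memory figures above follow directly from bits per weight. A rough sketch of the arithmetic (weights only; it ignores activations, the KV cache, and the small overhead quantization formats add for scales and zero-points):

```python
def model_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameters x bits per weight, in GB.

    Ignores activation memory, KV cache, and quantization metadata
    (per-group scales/zero-points), which add a few percent in practice.
    """
    return num_params * bits_per_weight / 8 / 1e9

params = 70e9  # a 70B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_memory_gb(params, bits):.0f} GB")
# FP16 gives 140 GB; INT4 gives 35 GB, matching the figures above
# (real INT4 files land nearer 35-40 GB once metadata is included).
```

The same formula explains the 24 GB consumer-GPU case: at INT4, roughly a 40B-parameter model's weights fit, versus only about 12B at FP16.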