INT4 / INT8 (Quantization)

Quantization reduces model precision from floating-point (FP16/FP32) to lower-bit integers (INT8 or INT4), shrinking the model so it fits in less VRAM. A 70B-parameter model that needs 140 GB in FP16 can fit in roughly 35–40 GB at INT4; the extra few gigabytes above the raw 35 GB come from quantization metadata such as per-group scale factors. The trade-off is a small accuracy loss, but for most local AI use cases the difference is negligible. Quantization is why consumer GPUs with 24 GB of VRAM can run surprisingly large models.
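The memory figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. A minimal sketch (this estimates raw weight storage only; it ignores activations, KV cache, and the scale/zero-point metadata that pushes real INT4 files a bit above the raw number):

```python
def model_size_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8 bytes each."""
    return params_billions * bits / 8

# A 70B model at common precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(70, bits):.0f} GB")
# FP16 gives 140 GB, INT4 gives 35 GB -- matching the figures above.
```

The same formula explains the 24 GB consumer-GPU case: at INT4, models up to roughly 40–45B parameters fit in weights alone, with some headroom still needed for activations and the KV cache.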
