GPTQ
A post-training quantization method that compresses LLMs to 4-bit or 3-bit precision with minimal accuracy loss. GPTQ quantizes weights layer by layer, using approximate second-order (Hessian) information from a small calibration set to compensate for the error each quantized weight introduces. GPTQ models are optimized for GPU inference and typically run faster than GGUF on NVIDIA cards with enough VRAM. If you have a dedicated NVIDIA GPU and want the fastest quantized inference, GPTQ is often the best format.
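The core idea can be sketched in a few lines: quantize the weight columns one at a time, and fold each column's rounding error into the not-yet-quantized columns, weighted by the inverse Hessian built from calibration inputs. This is a minimal, unoptimized sketch (real GPTQ uses per-group scales, a Cholesky reformulation, and lazy batched updates); the function name and shapes here are illustrative, not from any library.

```python
import numpy as np

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Simplified GPTQ-style quantization of one weight matrix.

    W: (rows, cols) weights; X: (cols, n_samples) calibration inputs.
    Columns are quantized left to right; each column's quantization
    error is propagated into the remaining columns via the inverse
    Hessian H = X X^T (the second-order information GPTQ relies on).
    """
    W = W.astype(np.float64).copy()
    rows, cols = W.shape

    # Hessian of the layer-wise squared error, with damping for stability.
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.inv(H)

    # Single symmetric scale for the whole matrix (real GPTQ uses
    # per-group scales for better accuracy).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax

    Q = np.zeros_like(W)
    for j in range(cols):
        # Round column j to the nearest grid point.
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        # Distribute the rounding error over the remaining columns.
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```

The error-compensation step is what separates GPTQ from plain round-to-nearest: later columns absorb earlier columns' rounding error, which keeps the layer's output (`W @ X`) close to the original even at 3-4 bits.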