FP16 / FP32 / FP8

Floating-point precision formats that determine how many bits are used to store each number in a model's weights and activations. FP32 (32-bit) is full precision; FP16 (16-bit) halves memory use with minimal accuracy loss; FP8 (8-bit) halves it again and is used mainly for inference. Lower precision lets you fit larger models in less VRAM and run them faster, which is why modern GPUs like the RTX 5090 include dedicated FP8 tensor cores.
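A minimal sketch of the memory trade-off, using Python's standard `struct` module (which supports 32-bit and 16-bit floats; FP8 has no stdlib format code):

```python
import struct

# FP32 stores each value in 4 bytes; FP16 in 2.
# struct format codes: 'f' = 32-bit float, 'e' = 16-bit half precision.
value = 3.14159
fp32_bytes = struct.pack('f', value)   # 4 bytes
fp16_bytes = struct.pack('e', value)   # 2 bytes

print(len(fp32_bytes), len(fp16_bytes))  # 4 2

# Round-tripping through FP16 loses precision: only 10 mantissa bits,
# so the recovered value is close to the original but not exact.
fp16_value = struct.unpack('e', fp16_bytes)[0]
print(fp16_value)
```

Scaled up to a model with billions of parameters, that 4-bytes-vs-2-bytes difference is exactly why an FP16 checkpoint is half the size of its FP32 counterpart.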
