🚀 Quantization Formats & CUDA Support

Complete reference guide for LLM quantization methods and hardware requirements

📊 Quantization Formats

| Format | Bits | Min Compute Capability | GPU Examples | Notes |
|--------|------|------------------------|--------------|-------|
| FP16 | 16 | 5.3+ | GTX 1000, RTX 2000+ | Native half precision |
| BF16 | 16 | 8.0+ | A100, RTX 3090, RTX 4090 | Wider dynamic range than FP16 |
| FP8 (E4M3/E5M2) | 8 | 8.9+ | H100, H200, L40S | Transformer Engine support |
| MXFP8 | 8 | 8.9+ | H100, H200, Blackwell | Block size 32, E8M0 scale |
| FP6 | 6 | 10.0+ | GB200, B100, B200 | Blackwell native support |
| MXFP6 | 6 | 8.9+ | H100+, Blackwell | E2M3/E3M2, block size 32 |
| INT8 | 8 | 6.1+ | GTX 1080+, Tesla P4/P40 | Wide compatibility |
| INT4 | 4 | 7.5+ | RTX 2080+, T4 | CUTLASS kernels |
| MXFP4 | 4 | 9.0+ | H100, H200, GB200 | E2M1, block size 32, OpenAI GPT-OSS |
| NVFP4 | 4 | 10.0+ | GB200, B100, B200 | E2M1, block size 16, two-level (FP8 + FP32) scaling |
| GPTQ | 2-8 | 7.0+ | RTX 2000+, V100+ | Group-wise quantization |
| AWQ | 4 | 7.5+ | RTX 3000+, A100+ | Activation-aware quantization |
| QuIP | 2-4 | 7.0+ | RTX 2000+, V100+ | Incoherence processing |
| QuIP# | 2-4 | 8.0+ | RTX 3090+, A100+ | E8P lattice codebook |
| GGUF/GGML | 2-8 | 6.1+ | GTX 1060+, most GPUs | CPU fallback available |
| EXL2 | 2-8 | 7.5+ | RTX 2000+, V100+ | Variable bit-width |
| NF4 | 4 | 7.0+ | RTX 2000+, V100+ | QLoRA, normal float |
| GGUF-IQ | 1-8 | 6.1+ | GTX 1060+ | Importance-matrix quantization |
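
The MX* rows above share the same microscaling layout: each block of 32 elements stores one 8-bit power-of-two (E8M0) scale, and the elements themselves use a tiny floating-point format such as E2M1 for MXFP4. The sketch below is a simplified NumPy illustration of that idea, not the OCP MX reference algorithm; the scale-selection rule and rounding are deliberately naive.

```python
import numpy as np

# Non-negative magnitudes representable by an FP4 E2M1 element
# (sign handled separately): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x: np.ndarray):
    """Quantize one block of 32 values to the E2M1 grid with a shared
    power-of-two block scale (a toy stand-in for the E8M0 scale)."""
    assert x.size == 32, "MXFP4 uses blocks of 32 elements"
    amax = np.abs(x).max()
    # Pick a power-of-two scale so the largest magnitude fits in the
    # E2M1 range [0, 6]; the real spec derives this from exponents.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    scaled = x / scale
    # Round each element to the nearest representable E2M1 magnitude.
    nearest = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[nearest]
    return q, scale  # dequantize with q * scale

# Example: quantize one random block and check the reconstruction error.
rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
q, scale = quantize_mxfp4_block(block)
print("max abs error:", np.abs(q * scale - block).max())
```

NVFP4 differs mainly in using blocks of 16 with an FP8 (E4M3) block scale plus a per-tensor FP32 scale, which is what "two-level scaling" in the table refers to.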

🎯 CUDA Compute Capabilities

- 6.1 - Pascal: GTX 1000 series, Tesla P4/P40 (Tesla P100 is CC 6.0)
- 7.0 - Volta: Tesla V100, Titan V
- 7.5 - Turing: RTX 2000 series, T4, Quadro RTX 6000
- 8.0 - Ampere (data center): A100
- 8.6 - Ampere (consumer): RTX 3000 series, including RTX 3090
- 8.9 - Ada Lovelace: RTX 4000 series, L40S
- 9.0 - Hopper: H100, H200
- 10.0 - Blackwell: GB200, B100, B200
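
To see where a given machine lands in this list, query the compute capability at runtime. A minimal PyTorch sketch (the ARCH mapping and the capability cutoffs below come straight from the tables in this guide):

```python
import torch

# Architecture names keyed by (major, minor) compute capability,
# mirroring the list above.
ARCH = {
    (6, 1): "Pascal", (7, 0): "Volta", (7, 5): "Turing",
    (8, 0): "Ampere (data center)", (8, 6): "Ampere (consumer)",
    (8, 9): "Ada Lovelace", (9, 0): "Hopper", (10, 0): "Blackwell",
}

if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability(0)   # e.g. (8, 9)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: CC {cc[0]}.{cc[1]} ({ARCH.get(cc, 'other')})")
    # Example gates based on the format table above.
    print("FP8-capable (>= 8.9):", cc >= (8, 9))
    print("NVFP4-capable (>= 10.0):", cc >= (10, 0))
else:
    print("No CUDA device visible")
```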

⚡ Performance Notes

- FP8/MXFP8: Transformer Engine support, up to ~2x faster than BF16 on H100+
- NVFP4: Native on Blackwell, up to ~2x faster than FP8, ~3.5x memory reduction vs FP16
- MXFP4: Requires H100+ (CC 9.0), uses Triton kernels, OpenAI GPT-OSS format
- MXFP6: Training and inference on H100+, better accuracy than MXFP4
- QuIP#: 2-4 bit with E8P lattice codebook, ~50% of peak memory bandwidth on an RTX 4090
- INT4/GPTQ/AWQ: ~3-4x memory reduction, 1.5-2x faster inference
- GGUF: Best CPU/GPU hybrid performance; llama.cpp can offload only part of the model to VRAM (see the sketch after this list)
- EXL2: Highest quality at low bit-widths, slower than GPTQ
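
A minimal llama-cpp-python sketch of that partial GGUF offload; the model path and layer count are placeholders, and `llama-cpp-python` must be built with CUDA support for the GPU layers to take effect:

```python
from llama_cpp import Llama

# Hybrid CPU/GPU inference: put part of the network in VRAM and keep the
# rest on the CPU. n_gpu_layers=-1 would offload every layer instead.
llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path to any GGUF file
    n_gpu_layers=20,                 # number of transformer layers to offload
    n_ctx=4096,                      # context window
)

out = llm("Explain FP8 quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```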