📊 Quantization Formats
| Format | Bits | Min CC | GPU Examples | Notes |
|---|---|---|---|---|
| FP16 | 16 | 5.3+ | GTX 1000, RTX 2000+ | Native half precision |
| BF16 | 16 | 8.0+ | A100, RTX 3090, 4090 | Better range than FP16 |
| FP8 (E4M3/E5M2) | 8 | 8.9+ | RTX 4000, L40S, H100, H200 | Transformer Engine support |
| MXFP8 | 8 | 8.9+ | H100, H200, Blackwell | Block-size 32, E8M0 scale |
| FP6 | 6 | 10.0+ | GB200, B100, B200 | Blackwell native support |
| MXFP6 | 6 | 8.9+ | H100+, Blackwell | E2M3/E3M2, block-size 32 |
| INT8 | 8 | 6.1+ | GTX 1080+, Tesla P4/P40 | DP4A; wide compatibility |
| INT4 | 4 | 7.5+ | RTX 2000+, T4 | CUTLASS kernels |
| MXFP4 | 4 | 9.0+ | H100, H200, GB200 | E2M1, block-size 32, OpenAI GPT-OSS |
| NVFP4 | 4 | 10.0+ | GB200, B100, B200 | E2M1, block-size 16, dual-scale |
| GPTQ | 2-8 | 7.0+ | RTX 2000+, V100+ | Group-wise quantization |
| AWQ | 4 | 7.5+ | RTX 2000+, T4, A100+ | Activation-aware weight quantization |
| QuIP | 2-4 | 7.0+ | RTX 2000+, V100+ | Incoherence processing |
| QuIP# | 2-4 | 8.0+ | RTX 3090+, A100+ | E8P lattice codebook |
| GGUF/GGML | 2-8 | 6.1+ | GTX 1060+, most GPUs | CPU fallback available |
| EXL2 | 2-8 | 7.5+ | RTX 2000+, V100+ | Variable bit-width |
| NF4 | 4 | 7.0+ | RTX 2000+, V100+ | QLoRA, normal float |
| GGUF-IQ | 1-8 | 6.1+ | GTX 1060+ | Importance matrix |
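
The MX rows above (MXFP8/MXFP6/MXFP4) share one microscaling idea: a block of 32 elements shares a single power-of-two (E8M0) scale. Below is a minimal NumPy sketch of MXFP4-style quantize-dequantize, assuming the E2M1 value grid and the OCP shared-exponent rule; it is illustrative only, not bit-exact with any production kernel:

```python
import numpy as np

# E2M1 representable magnitudes (1 sign, 2 exponent, 1 mantissa bits)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x, block=32):
    """Quantize-dequantize a 1-D array (length divisible by `block`):
    each block shares a power-of-two (E8M0-style) scale, and scaled
    values snap to the nearest E2M1 grid point (clipping at 6.0)."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared scale: power of two mapping amax toward E2M1's range (2 = emax of E2M1)
    scale = 2.0 ** (np.floor(np.log2(np.maximum(amax, 1e-30))) - 2)
    scaled = x / scale
    # Round each scaled value to the nearest representable E2M1 magnitude
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(-1)

x = np.random.randn(64).astype(np.float32)
xq = mxfp4_quant_dequant(x)
print("max abs error:", np.abs(x - xq).max())
```

NVFP4 follows the same pattern with a finer block (16 elements) and a dual-level scale (a higher-precision block scale plus a tensor-level scale), which is where its accuracy edge over MXFP4 comes from.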
🎯 CUDA Compute Capabilities
6.0 / 6.1 - Pascal: Tesla P100 (6.0), GTX 1000 series (6.1)
7.0 - Volta: Tesla V100, Titan V
7.5 - Turing: RTX 2000 series, T4, Quadro RTX 6000
8.0 - Ampere: A100, A30
8.6 - Ampere: RTX 3000 series consumer (including RTX 3090)
8.9 - Ada Lovelace: RTX 4000 series, L40S
9.0 - Hopper: H100, H200
10.0 - Blackwell: GB200, B100, B200
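
To check which formats a given GPU can use natively, query its compute capability and compare it against the Min CC column above. A minimal sketch using PyTorch's `torch.cuda.get_device_capability`; the `MIN_CC` dict is transcribed from the table and is not exhaustive:

```python
import torch

# Minimum compute capability per format, from the table above
MIN_CC = {
    "FP16": (5, 3), "INT8": (6, 1), "BF16": (8, 0),
    "FP8": (8, 9), "MXFP4": (9, 0), "NVFP4": (10, 0),
}

cc = torch.cuda.get_device_capability(0)  # e.g. (8, 9) on an L40S
supported = [fmt for fmt, min_cc in MIN_CC.items() if cc >= min_cc]
print(f"CC {cc[0]}.{cc[1]} supports: {', '.join(supported)}")
```

On recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` reports the same value without needing PyTorch.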
⚡ Performance Notes
FP8/MXFP8: Transformer Engine support; up to ~2x faster than BF16 on H100 and newer
NVFP4: Native on Blackwell; up to ~2x faster than FP8, ~3.5x memory reduction vs FP16
MXFP4: Requires H100+ (CC 9.0+), uses Triton kernels; the format behind OpenAI's GPT-OSS weights
MXFP6: Training and inference on H100+; better accuracy than MXFP4
QuIP#: 2-4 bit with E8P lattice codebook; sustains ~50% of peak memory bandwidth on RTX 4090
INT4/GPTQ/AWQ: ~3-4x memory reduction and ~1.5-2x faster inference vs FP16
GGUF: Best CPU/GPU hybrid performance
EXL2: Highest quality at low bit-widths, but slower than GPTQ
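
As a concrete example of the 4-bit memory savings above, this is how a model is loaded in NF4 (QLoRA-style, per the NF4 row in the table) via transformers + bitsandbytes; the model id is only an example, and any causal LM on the Hub works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 (normal float), as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 (needs CC 8.0+)
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # example model id
    quantization_config=bnb_config,
    device_map="auto",                      # requires the accelerate package
)
```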