🚀 Quantization Formats & CUDA Support

Complete reference guide for LLM quantization methods and hardware requirements

📊 Quantization Formats

| Format | Bits | Min Compute Capability | GPU Examples | Notes |
|--------|------|------------------------|--------------|-------|
| FP16 | 16 | 5.3+ | GTX 1000, RTX 2000+ | Native half precision |
| BF16 | 16 | 8.0+ | A100, RTX 3090, RTX 4090 | Wider dynamic range than FP16 |
| FP8 (E4M3/E5M2) | 8 | 8.9+ | H100, H200, L40S | Transformer Engine support |
| MXFP8 | 8 | 8.9+ | H100, H200, Blackwell | Block size 32, E8M0 scale |
| FP6 | 6 | 10.0+ | GB200, B100, B200 | Blackwell native support |
| MXFP6 | 6 | 8.9+ | H100+, Blackwell | E2M3/E3M2, block size 32 |
| INT8 | 8 | 6.1+ | GTX 1080+, Tesla P4/P40 | Wide compatibility |
| INT4 | 4 | 7.5+ | RTX 2080+, T4 | CUTLASS kernels |
| MXFP4 | 4 | 9.0+ | H100, H200, GB200 | E2M1, block size 32, OpenAI GPT-OSS |
| NVFP4 | 4 | 10.0+ | GB200, B100, B200 | E2M1, block size 16, two-level (FP8 + FP32) scaling |
| GPTQ | 2-8 | 7.0+ | RTX 2000+, V100+ | Group-wise quantization |
| AWQ | 4 | 7.5+ | RTX 3000+, A100+ | Activation-aware quantization |
| QuIP | 2-4 | 7.0+ | RTX 2000+, V100+ | Incoherence processing |
| QuIP# | 2-4 | 8.0+ | RTX 3090+, A100+ | E8P lattice codebook |
| GGUF/GGML | 2-8 | 6.1+ | GTX 1060+, most GPUs | CPU fallback available |
| EXL2 | 2-8 | 7.5+ | RTX 2000+, V100+ | Variable bit-width |
| NF4 | 4 | 7.0+ | RTX 2000+, V100+ | QLoRA, normal float |
| GGUF-IQ | 1-8 | 6.1+ | GTX 1060+ | Importance-matrix quantization |
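
The MX* rows above share the same microscaling layout: each block of 32 elements stores one 8-bit power-of-two (E8M0) scale, and the elements themselves use a tiny floating-point format such as E2M1 for MXFP4. The sketch below is a simplified NumPy illustration of that idea, not the OCP MX reference algorithm; the scale-selection rule and rounding are deliberately naive.

```python
import numpy as np

# Non-negative magnitudes representable by an FP4 E2M1 element
# (sign handled separately): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x: np.ndarray):
    """Quantize one block of 32 values to the E2M1 grid with a shared
    power-of-two block scale (a toy stand-in for the E8M0 scale)."""
    assert x.size == 32, "MXFP4 uses blocks of 32 elements"
    amax = np.abs(x).max()
    # Pick a power-of-two scale so the largest magnitude fits in the
    # E2M1 range [0, 6]; the real spec derives this from exponents.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    scaled = x / scale
    # Round each element to the nearest representable E2M1 magnitude.
    nearest = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[nearest]
    return q, scale  # dequantize with q * scale

# Example: quantize one random block and check the reconstruction error.
rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
q, scale = quantize_mxfp4_block(block)
print("max abs error:", np.abs(q * scale - block).max())
```

NVFP4 differs mainly in using blocks of 16 with an FP8 (E4M3) block scale plus a per-tensor FP32 scale, which is what "two-level scaling" in the table refers to.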

🎯 CUDA Compute Capabilities

- 6.1 - Pascal: GTX 1000 series, Tesla P4/P40 (Tesla P100 is CC 6.0)
- 7.0 - Volta: Tesla V100, Titan V
- 7.5 - Turing: RTX 2000 series, T4, Quadro RTX 6000
- 8.0 - Ampere (data center): A100
- 8.6 - Ampere (consumer): RTX 3000 series, including RTX 3090
- 8.9 - Ada Lovelace: RTX 4000 series, L40S
- 9.0 - Hopper: H100, H200
- 10.0 - Blackwell: GB200, B100, B200
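
To see where a given machine lands in this list, query the compute capability at runtime. A minimal PyTorch sketch (the ARCH mapping and the capability cutoffs below come straight from the tables in this guide):

```python
import torch

# Architecture names keyed by (major, minor) compute capability,
# mirroring the list above.
ARCH = {
    (6, 1): "Pascal", (7, 0): "Volta", (7, 5): "Turing",
    (8, 0): "Ampere (data center)", (8, 6): "Ampere (consumer)",
    (8, 9): "Ada Lovelace", (9, 0): "Hopper", (10, 0): "Blackwell",
}

if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability(0)   # e.g. (8, 9)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: CC {cc[0]}.{cc[1]} ({ARCH.get(cc, 'other')})")
    # Example gates based on the format table above.
    print("FP8-capable (>= 8.9):", cc >= (8, 9))
    print("NVFP4-capable (>= 10.0):", cc >= (10, 0))
else:
    print("No CUDA device visible")
```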

⚡ Performance Notes

- FP8/MXFP8: Transformer Engine support, up to ~2x faster than BF16 on H100+
- NVFP4: Native on Blackwell, up to ~2x faster than FP8, ~3.5x memory reduction vs FP16
- MXFP4: Requires H100+ (CC 9.0), uses Triton kernels, OpenAI GPT-OSS format
- MXFP6: Training and inference on H100+, better accuracy than MXFP4
- QuIP#: 2-4 bit with E8P lattice codebook, ~50% of peak memory bandwidth on an RTX 4090
- INT4/GPTQ/AWQ: ~3-4x memory reduction, 1.5-2x faster inference
- GGUF: Best CPU/GPU hybrid performance; llama.cpp can offload only part of the model to VRAM (see the sketch after this list)
- EXL2: Highest quality at low bit-widths, slower than GPTQ
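
A minimal llama-cpp-python sketch of that partial GGUF offload; the model path and layer count are placeholders, and `llama-cpp-python` must be built with CUDA support for the GPU layers to take effect:

```python
from llama_cpp import Llama

# Hybrid CPU/GPU inference: put part of the network in VRAM and keep the
# rest on the CPU. n_gpu_layers=-1 would offload every layer instead.
llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path to any GGUF file
    n_gpu_layers=20,                 # number of transformer layers to offload
    n_ctx=4096,                      # context window
)

out = llm("Explain FP8 quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```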