
Quantization

Using lower precision (e.g. INT8) for weights and activations.

Definition

Quantization is the process of representing neural network weights — and optionally activations — in lower numerical precision than the original training format (typically FP32 or BF16). By mapping floating-point values to a discrete integer range (INT8, INT4, INT2), quantization reduces model memory by 2–8x and enables faster inference on hardware with integer compute units such as GPU tensor cores, NPUs, and dedicated inference accelerators.
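
Concretely, the mapping is just a scale factor (and, for asymmetric schemes, a zero-point) applied per tensor, per channel, or per group. Below is a minimal sketch of symmetric per-tensor INT8 quantization in plain PyTorch, with arbitrary tensor shapes chosen purely for illustration:

# Minimal sketch: symmetric per-tensor INT8 quantization of a weight tensor.
# Illustrative only; real schemes use per-channel or per-group scales and zero-points.
import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0          # map the largest magnitude onto the INT8 limit
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale     # recover approximate FP32 values

w = torch.randn(4096, 4096)                # a stand-in FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"max round-trip error: {(w - w_hat).abs().max().item():.5f}")
print(f"memory: {w.numel() * 4 / 1e6:.1f} MB (FP32) -> {q.numel() / 1e6:.1f} MB (INT8)")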

In practice, quantization is the most commonly applied model compression technique for LLMs because it requires no architecture changes, works post-training, and delivers memory reductions large enough to shift a model from server-grade hardware to consumer hardware. A 70B parameter model in FP16 requires approximately 140 GB of VRAM; the same model quantized to INT4 fits in around 35 GB, making it runnable on a dual-GPU workstation. The accuracy cost is typically small (1–3% on downstream benchmarks) for INT8, and manageable for INT4 with calibration-aware methods.
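
The arithmetic behind those figures is simply bytes per parameter times parameter count (weights only, ignoring activations and the KV cache):

# Back-of-the-envelope weight memory for a 70B-parameter model (weights only).
params = 70e9
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB, matching the figures above.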

Quantization exists on a spectrum of approaches: post-training quantization (PTQ) applies the conversion after training using a small calibration dataset, while quantization-aware training (QAT) fine-tunes the model with simulated quantization so weights learn to be robust to the precision reduction. Modern LLM quantization schemes like GPTQ, AWQ, and GGUF integrate calibration and packing strategies that go beyond naive weight rounding, preserving accuracy even at INT4 precision.

How it works

Post-training quantization (PTQ)
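
Post-training quantization converts an already-trained model with no further gradient updates: weights are rescaled and rounded to the target integer grid, and static schemes additionally run a small calibration set through the model to fix activation ranges. The simplest variant is dynamic quantization, where only the weights are stored in INT8 and activations are quantized on the fly. A minimal sketch using PyTorch's built-in quantize_dynamic API (MyModel is a stand-in for any trained float model, as in the full static example under Code examples below):

# Dynamic INT8 PTQ: weights stored as INT8, activations quantized per batch at runtime.
# Needs no calibration data; works best for Linear/LSTM-heavy models on CPU.
import torch

model = MyModel()   # placeholder for any trained float model
model.eval()
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Static PTQ adds the offline calibration pass; the full static INT8 workflow is shown step by step in the Code examples section below.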

Quantization-aware training (QAT)
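
Quantization-aware training instead fine-tunes the model with quantization simulated in the forward pass: a "fake quantize" op rounds weights (and optionally activations) to the integer grid, while the backward pass treats the rounding as the identity (the straight-through estimator), so the weights learn to tolerate the precision loss. A minimal sketch of the fake-quantize op follows; it is illustrative only, and PyTorch's prepare_qat/convert workflow automates this for real models:

# Fake quantization with a straight-through estimator (STE):
# the forward pass rounds to the INT8 grid, the backward pass passes gradients through unchanged.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through: treat rounding as the identity

x = torch.randn(8, requires_grad=True)
scale = torch.tensor(0.05)
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()                 # x.grad is all ones, as if no rounding had happened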

Common quantization schemes

Scheme           | Precision | Method                               | Best for
-----------------|-----------|--------------------------------------|----------------------------------------
Dynamic INT8     | INT8      | Quantize activations at runtime      | CPU inference, NLP
Static INT8      | INT8      | Calibrate activations offline        | Low-latency GPU serving
GPTQ             | INT4      | Second-order weight quantization     | LLM serving on consumer GPUs
AWQ              | INT4      | Activation-aware weight quantization | LLM serving, low accuracy loss
GGUF (llama.cpp) | INT2–INT8 | Mixed-precision per tensor           | Local inference on CPU / Apple Silicon
QAT              | INT8      | Train with simulated quantization    | Highest accuracy at INT8
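
What the INT4 schemes above have in common is per-group scaling: weights are quantized in small blocks, each with its own scale, which is what keeps accuracy usable at 4 bits. A toy sketch of grouped INT4 weight quantization follows; it is illustrative only, since GPTQ additionally applies second-order error compensation and AWQ rescales channels by activation importance:

# Toy per-group INT4 weight quantization: each group of 128 weights gets its own scale.
# Real GPTQ/AWQ/GGUF kernels also pack two 4-bit values per byte for storage.
import torch

def quantize_int4_grouped(w, group_size=128):
    groups = w.reshape(-1, group_size)
    scales = groups.abs().max(dim=1, keepdim=True).values / 7.0   # INT4 range: -8..7
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_grouped(q, scales, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(4096, 4096)
q, scales = quantize_int4_grouped(w)
w_hat = dequantize_grouped(q, scales, w.shape)
print(f"mean abs error: {(w - w_hat).abs().mean().item():.4f}")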

When to use / When NOT to use

Scenario                                              | Use quantization                                     | Do NOT use quantization
------------------------------------------------------|------------------------------------------------------|---------------------------------------------------------------
Running a large LLM on a consumer GPU                 | Yes — INT4 cuts memory 4–8x                          |
Reducing inference latency in production              | Yes — INT8 accelerates throughput on modern hardware |
Deploying models on mobile or edge hardware           | Yes — TFLite and ONNX support INT8 natively          |
Maximum accuracy on a well-resourced server           |                                                      | Serve FP16 or BF16 if memory and cost allow
Very small models where accuracy loss is significant  |                                                      | Distillation or pruning may be more appropriate
Models with unusual activation distributions          |                                                      | Standard PTQ may fail; QAT or activation-aware methods needed

Pros and cons

Pros                                                      | Cons
----------------------------------------------------------|--------------------------------------------------------------------
Large memory reduction (2–8x) with minimal accuracy loss  | Accuracy degradation increases at aggressive precision (INT2/INT3)
PTQ requires no retraining — fast to apply                | Calibration quality affects accuracy; needs representative data
Widely supported by runtimes (TFLite, ONNX, vLLM)         | Requires hardware support for integer ops to see speedups
Enables LLM deployment on consumer and edge hardware      | Activation quantization harder than weight-only quantization

Code examples

# Static INT8 post-training quantization with PyTorch
import torch
import torch.quantization

model = MyModel()  # placeholder: a trained float model with submodules named "conv", "bn", "relu"
model.load_state_dict(torch.load("model.pt"))
model.eval()  # set to inference mode

# Fuse BatchNorm and Conv for quantization efficiency
model_fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Set quantization config (fbgemm for x86, qnnpack for ARM/mobile)
model_fused.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model_fused, inplace=True)

# Calibration pass — run representative data to collect activation statistics
# (calibration_loader: any DataLoader yielding sample inputs)
with torch.no_grad():
    for x_batch, _ in calibration_loader:
        model_fused(x_batch)

# Convert weights and activations to INT8
quantized_model = torch.quantization.convert(model_fused, inplace=True)

# Verify that only precision changed, not parameter count
# (note: after convert(), quantized modules store packed INT8 weights,
# so quantized_model.parameters() no longer reflects weight storage)
original_params = sum(p.numel() for p in model.parameters())
print(f"Parameter count: {original_params:,} (same; precision changed, not count)")
print("INT8 model ready — memory footprint reduced ~4x vs FP32")

# Save quantized model
torch.save(quantized_model.state_dict(), "model_int8.pt")
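
To confirm the reduction on disk, compare the serialized checkpoints saved above (the exact ratio depends on how much of the model was quantized):

# Compare on-disk checkpoint sizes for the files saved in the example above.
import os

fp32_mb = os.path.getsize("model.pt") / 1e6
int8_mb = os.path.getsize("model_int8.pt") / 1e6
print(f"FP32 checkpoint: {fp32_mb:.1f} MB -> INT8 checkpoint: {int8_mb:.1f} MB")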
