Model compression

Reducing model size and compute for deployment.

Definition

Model compression is the collective term for a family of techniques that reduce the size, memory footprint, inference latency, or energy consumption of trained neural networks without substantially degrading their accuracy. The primary methods are pruning (removing redundant parameters), quantization (reducing numerical precision), and knowledge distillation (training a smaller model to imitate a larger one). These techniques are often combined — for instance, a distilled model that is then quantized and pruned ends up significantly smaller than any single method alone would make it.

The motivation for model compression has intensified with the growth of LLMs: a frontier model in FP16 may require 80–320 GB of GPU memory, making deployment on anything other than a high-end server impractical. Compression enables the same or similar knowledge to be expressed in a form that fits within a consumer GPU (16–48 GB), a mobile device (4–12 GB RAM), or even a microcontroller (hundreds of KB). The challenge is managing the accuracy-compression trade-off across diverse downstream tasks.
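
To make the memory arithmetic concrete, the short Python sketch below estimates the weight-storage footprint of a hypothetical 70-billion-parameter model at several numerical precisions (the parameter count is an illustrative assumption, not a specific model); activations, KV cache, and runtime overhead are ignored.

# Rough weight-memory estimate for a hypothetical 70B-parameter model
# (weights only; activations, KV cache, and framework overhead are excluded)
PARAMS = 70e9  # illustrative parameter count, not a specific model

bytes_per_param = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB of weight storage")

# FP16 comes to roughly 130 GiB (server-class GPUs), while INT4 comes to
# roughly 33 GiB, which fits on a 48 GB consumer GPU.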

Compression is applied at different stages: post-training (after training is complete, with no access to training data required), training-aware (compression is simulated during training so the model adapts to it), and structured search (neural architecture search or iterative pruning with fine-tuning). The choice of method depends on the target hardware, the acceptable accuracy budget, and whether retraining is feasible.
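
As a rough illustration of the training-aware stage, the sketch below hand-rolls "fake quantization": weights are rounded to an INT8 grid and dequantized inside the forward pass so the loss reflects quantization error, while a straight-through estimator lets gradients update the full-precision weights. This is a conceptual sketch under simplified assumptions (symmetric per-tensor scaling, arbitrary tensor shapes), not PyTorch's built-in QAT workflow.

import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Simulate symmetric per-tensor quantization during training.
    # The forward pass uses the rounded weights; the detach trick
    # (straight-through estimator) passes gradients to the full-precision
    # weights unchanged.
    qmax = 2 ** (bits - 1) - 1                   # 127 for INT8
    scale = w.abs().max() / qmax                 # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()

# Usage inside a training step: quantize weights on the fly before the matmul
weight = torch.randn(256, 128, requires_grad=True)
x = torch.randn(32, 128)
out = x @ fake_quantize(weight).t()
out.sum().backward()                             # gradients reach `weight`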

How it works

Compression pipeline

A typical pipeline applies the methods in sequence: start from the trained model, optionally distill it into a smaller architecture, prune redundant structure, and then quantize the remaining weights, validating accuracy on a held-out set after each step.

Method comparison

Method | How it reduces size | Training required | Speedup type
Unstructured pruning | Zeros out individual weights | Fine-tune after | Memory (sparse storage)
Structured pruning | Removes channels, heads, or layers | Fine-tune after | Wall-clock (dense ops)
Quantization (PTQ) | Lower precision (INT8, INT4) | No (calibration only) | Memory + compute
Quantization (QAT) | Lower precision with training adaptation | Yes | Memory + compute
Knowledge distillation | Train smaller model end-to-end | Yes (full training) | All dimensions
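
Complementing the quantization example in the Code examples section, here is a sketch of unstructured magnitude pruning with torch.nn.utils.prune, which zeros out the smallest-magnitude weights in each Linear layer; the toy network and the 50% pruning ratio are illustrative assumptions, and a fine-tuning pass would normally follow to recover accuracy.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy network standing in for a trained model (illustrative only)
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 50% smallest-magnitude weights in every Linear layer
linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
for module in linear_layers:
    prune.l1_unstructured(module, name="weight", amount=0.5)

# Report the resulting sparsity
total = sum(m.weight.nelement() for m in linear_layers)
zeros = sum(int((m.weight == 0).sum()) for m in linear_layers)
print(f"Global sparsity: {zeros / total:.1%}")

# Make the pruning permanent (removes the mask and re-parameterization);
# typically done after fine-tuning
for module in linear_layers:
    prune.remove(module, "weight")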

When to use / When NOT to use

Scenario | Use model compression | Do NOT use model compression
Deploying LLMs on consumer GPUs or edge devices | Yes — quantization makes it feasible |
Reducing inference latency in production | Yes — INT8 or structured pruning reduce latency |
Sharing a distilled model for downstream fine-tuning | Yes — distillation transfers knowledge efficiently (loss sketch below) |
Accuracy is the primary constraint (no hardware limit) | | Serve the full model; compression introduces accuracy risk
Model will be retrained frequently on new data | | Retraining overhead may outweigh compression gains
Hardware natively supports FP16 efficiently | | Quantization may offer minimal benefit on FP16 hardware
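
The distillation rows above refer to training a small student model to match a larger teacher. Below is a minimal sketch of the standard soft-target distillation loss (temperature-softened KL divergence blended with hard-label cross-entropy); the temperature, mixing weight, and toy tensors are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened
    # distributions. The temperature**2 factor keeps its gradient scale
    # comparable to the hard-label cross-entropy term.
    soft_targets = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_preds, soft_targets, reduction="batchmean",
                  log_target=True) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: batch of 8 examples, 100 classes
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()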

Pros and cons

Pros | Cons
Enables deployment on constrained hardware | Accuracy degradation — especially at aggressive compression ratios
Reduces inference cost and energy consumption | Calibration and fine-tuning require effort and expertise
Multiple methods can be combined for maximum compression | Structured pruning often requires architecture-specific tuning
PTQ requires no retraining (fast to apply) | QAT and distillation require access to training data and compute

Code examples

# Post-training dynamic quantization with PyTorch (INT8 weights for nn.Linear)
import io
import torch
import torch.quantization

# Load a trained model (MyModel stands in for your own architecture)
model = MyModel()
model.load_state_dict(torch.load("model.pt"))
model.eval()

# Apply dynamic quantization to Linear layers (no calibration data needed)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},
    dtype=torch.qint8,
)

# Compare serialized sizes. Counting parameters() would undercount the
# quantized model, because quantized Linear weights live in packed buffers
# rather than ordinary parameters, so measure the saved state_dict instead.
def state_dict_bytes(m):
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes

original_size = state_dict_bytes(model)
quantized_size = state_dict_bytes(quantized_model)
print(f"Size reduction: {original_size / quantized_size:.1f}x")

# Save the compressed model; to load it later, rebuild the model, apply
# quantize_dynamic again, then load this state_dict
torch.save(quantized_model.state_dict(), "quantized_model.pt")

Tips for effective use

  • Run a sensitivity analysis before compressing: not all layers tolerate the same compression ratio — early and final layers are usually more sensitive (see the sketch after this list).
  • Combine methods in sequence: distill first (new architecture), then prune (remove redundant structure), then quantize (reduce precision) for maximum compression.
  • Always validate on a held-out dev set after each compression step — accuracy can degrade non-monotonically.
  • Use INT8 quantization as the default first step; it is the easiest to apply and recovers most of the memory benefit with minimal accuracy loss.
  • For LLMs, GPTQ or AWQ INT4 quantization often provides a better accuracy-compression trade-off than magnitude pruning.
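
The per-layer sensitivity check in the first tip can be scripted. The sketch below simulates weight-only INT8 quantization on one Linear layer at a time and records the accuracy drop against the uncompressed baseline; evaluate() is a placeholder for your own validation routine, and the INT8 round-trip is a simplification of real post-training quantization.

import copy
import torch
import torch.nn as nn

@torch.no_grad()
def int8_roundtrip(w: torch.Tensor) -> torch.Tensor:
    # Round weights to a symmetric INT8 grid and back (weight-only simulation)
    scale = w.abs().max() / 127
    return torch.clamp(torch.round(w / scale), -127, 127) * scale

@torch.no_grad()
def layer_sensitivity(model: nn.Module, evaluate) -> dict:
    # Quantize one Linear layer at a time and record the accuracy drop.
    # `evaluate(model) -> float` is a placeholder for your validation routine.
    baseline = evaluate(model)
    drops = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            trial = copy.deepcopy(model)
            trial_layer = dict(trial.named_modules())[name]
            trial_layer.weight.copy_(int8_roundtrip(trial_layer.weight))
            drops[name] = baseline - evaluate(trial)
    return drops

# Layers with the largest drops should stay at higher precision or be excluded:
# sensitivity = layer_sensitivity(model, evaluate)
# for name, drop in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
#     print(f"{name}: accuracy drop {drop:.3f}")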
