
Local inference

Running AI models on-device or on-premises instead of cloud APIs.

Definition

Local inference means running LLMs, vision models, or other AI models entirely on your own hardware — a developer laptop, a workstation, an on-premises server, or an edge device — without sending data to a cloud API provider. Every token generated stays within your own environment, which directly supports data privacy, reduced latency, predictable cost, and offline operation.

The practical feasibility of local inference depends on model compression: full-precision (FP16/BF16) frontier models typically require 80–320 GB of GPU memory, putting them out of reach for most local hardware. Quantization (INT8, INT4, GPTQ, AWQ) reduces memory by 2–8x, making 7B–70B parameter models runnable on consumer or prosumer GPUs (16–48 GB VRAM) and even on CPU-only hardware via GGUF format. Runtimes like Ollama, LM Studio, llama.cpp, vLLM, and TensorFlow Lite handle model loading, memory management, and inference execution with minimal configuration.
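The arithmetic behind these memory figures is simple to sketch: parameter count times bits per weight, plus overhead for the KV cache and activations. A rough estimator follows; the 1.2x overhead factor is an illustrative assumption, and real usage varies with context length and runtime:

```python
# Rough VRAM estimate: parameters x bits-per-weight / 8 bytes,
# scaled by an overhead factor for KV cache and activations.
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model needs roughly 16.8 GB at FP16 but only ~4.2 GB at 4-bit,
# which is why quantization moves it into consumer-GPU territory.
print(estimate_vram_gb(7, 16))  # FP16
print(estimate_vram_gb(7, 4))   # INT4
```

The same formula explains the 2–8x range quoted above: FP16 to INT8 halves the weights; FP16 to INT4 quarters them, and aggressive sub-4-bit schemes go further.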

Local inference is not a single technology but a stack: model weights (GGUF, SafeTensors, ONNX) + runtime (llama.cpp, Ollama, vLLM, TFLite) + optional serving layer (OpenAI-compatible REST API). This stack can be assembled to serve a single developer interactively or scale to an on-premises cluster serving hundreds of concurrent users, all without a cloud dependency.

How it works

Inference stack

[Diagram: model weights (GGUF / SafeTensors / ONNX) → runtime (llama.cpp, Ollama, vLLM, TFLite) → optional serving layer (OpenAI-compatible REST API)]

Runtime comparison

Runtime         | Best for                              | Format                     | GPU required
llama.cpp       | Low-resource CPU/GPU inference        | GGUF                       | No (CPU-capable)
Ollama          | Developer-friendly local LLM serving  | GGUF / Modelfile           | No (CPU-capable)
vLLM            | High-throughput on-prem serving       | HuggingFace / safetensors  | Yes (CUDA)
TensorFlow Lite | Mobile and microcontroller inference  | .tflite                    | No
LM Studio       | GUI for local LLM exploration         | GGUF                       | No (CPU-capable)

When to use / When NOT to use

Scenario                                                      | Local inference? | Rationale
Data must not leave the network (healthcare, legal, finance)  | Yes              | Data never leaves local hardware
Low-latency assistant or IDE integration                      | Yes              | No network round-trip
Development and testing without API keys or usage limits      | Yes              | Free and offline
Air-gapped or restricted network environments                 | Yes              | No external connectivity needed
Frontier-model quality required (GPT-4o, Claude 3.7)          | No               | Cloud APIs provide larger, more capable models
Unpredictable or bursty load patterns                         | No               | Cloud auto-scaling is more cost-effective
No GPU available and low latency is critical                  | No               | Cloud inference outpaces underpowered local hardware

Pros and cons

Pros                                                        | Cons
Data stays on your infrastructure — strong privacy guarantee | Smaller or quantized models may have lower quality
No per-token API cost at inference time                     | You own hardware, operations, and model updates
Works offline and in restricted networks                    | Throughput and context length limited by hardware
Full control over model version and behavior                | Larger models need quantization and compression
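The per-token cost trade-off can be made concrete with a rough break-even calculation. All figures below — hardware price, amortization period, power cost, and cloud price per million tokens — are illustrative assumptions, not quotes:

```python
# Break-even: at what monthly token volume does owned hardware beat a cloud API?
# Assumed numbers (illustrative): a $2,400 GPU amortized over 24 months plus
# $30/month for power, versus a cloud price of $3 per million tokens.
def breakeven_tokens_per_month(hw_cost: float, months: int,
                               power_per_month: float,
                               cloud_price_per_mtok: float) -> float:
    monthly_hw = hw_cost / months + power_per_month
    return monthly_hw / cloud_price_per_mtok * 1e6

tokens = breakeven_tokens_per_month(2400, 24, 30, 3.0)
print(f"break-even at {tokens / 1e6:.1f}M tokens/month")
```

Below the break-even volume, cloud APIs win on cost; above it, owned hardware wins — which is why steady, high-volume workloads favor local inference while bursty ones favor the cloud.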

Code examples

# Install Ollama and run a local LLM
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model interactively
ollama run llama3.2

# Serve an OpenAI-compatible REST API (runs on localhost:11434 by default)
ollama serve &

# Call the API from Python using the OpenAI client
python3 - <<'EOF'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
EOF

Tips for effective use

  • Start with GGUF Q4_K_M quantization for a good accuracy-speed balance; step down to Q3 or Q2 only if memory is critically constrained.
  • Use Ollama for developer machines and vLLM for on-premises servers serving multiple users concurrently.
  • Pin model versions in your Modelfile or configuration to prevent silent quality changes on updates.
  • Monitor token throughput and first-token latency — these reveal whether your hardware is the bottleneck or the model is over-quantized.
  • For Apple Silicon (M1/M2/M3/M4), llama.cpp and Ollama use the Metal GPU backend automatically, delivering throughput approaching that of discrete GPUs.
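The first-token latency and token throughput mentioned in the tips above can be measured on any token stream. A minimal sketch follows, with a simulated stream standing in for a real streaming response; the helper names are illustrative, not part of any runtime's API:

```python
import time
from typing import Iterable, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (time_to_first_token_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for chunk in chunks:
        if chunk:
            if first is None:
                first = time.perf_counter() - start  # time to first token
            count += 1
    elapsed = time.perf_counter() - start
    tps = count / elapsed if elapsed > 0 else 0.0
    return (first if first is not None else float("inf")), tps


# Simulated stream standing in for e.g. an OpenAI-client streaming response.
def fake_stream(n_tokens: int, delay_s: float):
    for _ in range(n_tokens):
        time.sleep(delay_s)
        yield "tok"


ttft, tps = measure_stream(fake_stream(20, 0.01))
print(f"first token after {ttft * 1000:.0f} ms, {tps:.0f} tokens/s")
```

A high time-to-first-token with healthy steady-state throughput usually points at prompt processing or model loading; low throughput throughout points at the hardware or an over-quantized model.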

Practical resources

  • Ollama — Run LLMs locally with a simple CLI and OpenAI-compatible API
  • llama.cpp — C++ inference engine for LLaMA and compatible models, GGUF format
  • vLLM — High-throughput LLM serving with continuous batching and PagedAttention
  • LM Studio — GUI for discovering, downloading, and running local LLMs
  • TensorFlow Lite — On-device inference for mobile and edge

See also