
Local inference

Running AI models on-device or on-premises instead of cloud APIs.

Definition

Local inference means running LLMs, vision models, or other AI models entirely on your own hardware — a developer laptop, a workstation, an on-premises server, or an edge device — without sending data to a cloud API provider. Every token generated stays within your own environment, which directly supports data privacy, reduced latency, predictable cost, and offline operation.

The practical feasibility of local inference depends on model compression: full-precision (FP16/BF16) frontier models typically require 80–320 GB of GPU memory, putting them out of reach for most local hardware. Quantization (INT8, INT4, GPTQ, AWQ) reduces memory by 2–8x, making 7B–70B parameter models runnable on consumer or prosumer GPUs (16–48 GB VRAM) and even on CPU-only hardware via GGUF format. Runtimes like Ollama, LM Studio, llama.cpp, vLLM, and TensorFlow Lite handle model loading, memory management, and inference execution with minimal configuration.
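The arithmetic behind these memory figures is simple to sketch: parameter count times bits per weight, plus overhead for the KV cache and activations. A rough estimator follows; the 1.2x overhead factor is an illustrative assumption, and real usage varies with context length and runtime:

```python
# Rough VRAM estimate: parameters x bits-per-weight / 8 bytes,
# scaled by an overhead factor for KV cache and activations.
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model needs roughly 16.8 GB at FP16 but only ~4.2 GB at 4-bit,
# which is why quantization moves it into consumer-GPU territory.
print(estimate_vram_gb(7, 16))  # FP16
print(estimate_vram_gb(7, 4))   # INT4
```

The same formula explains the 2–8x range quoted above: FP16 to INT8 halves the weights; FP16 to INT4 quarters them, and aggressive sub-4-bit schemes go further.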

Local inference is not a single technology but a stack: model weights (GGUF, SafeTensors, ONNX) + runtime (llama.cpp, Ollama, vLLM, TFLite) + optional serving layer (OpenAI-compatible REST API). This stack can be assembled to serve a single developer interactively or scale to an on-premises cluster serving hundreds of concurrent users, all without a cloud dependency.

How it works

Inference stack

[Diagram: model weights (GGUF / SafeTensors / ONNX) → runtime (llama.cpp, Ollama, vLLM, TFLite) → optional serving layer (OpenAI-compatible REST API)]

Runtime comparison

Runtime         | Best for                              | Format                     | GPU required
llama.cpp       | Low-resource CPU/GPU inference        | GGUF                       | No (CPU-capable)
Ollama          | Developer-friendly local LLM serving  | GGUF / Modelfile           | No (CPU-capable)
vLLM            | High-throughput on-prem serving       | HuggingFace / safetensors  | Yes (CUDA)
TensorFlow Lite | Mobile and microcontroller inference  | .tflite                    | No
LM Studio       | GUI for local LLM exploration         | GGUF                       | No (CPU-capable)

When to use / When NOT to use

Scenario                                                      | Local inference? | Rationale
Data must not leave the network (healthcare, legal, finance)  | Yes              | Data never leaves local hardware
Low-latency assistant or IDE integration                      | Yes              | No network round-trip
Development and testing without API keys or usage limits      | Yes              | Free and offline
Air-gapped or restricted network environments                 | Yes              | No external connectivity needed
Frontier-model quality required (GPT-4o, Claude 3.7)          | No               | Cloud APIs provide larger, more capable models
Unpredictable or bursty load patterns                         | No               | Cloud auto-scaling is more cost-effective
No GPU available and low latency is critical                  | No               | Cloud inference outpaces underpowered local hardware

Pros and cons

Pros                                                        | Cons
Data stays on your infrastructure — strong privacy guarantee | Smaller or quantized models may have lower quality
No per-token API cost at inference time                     | You own hardware, operations, and model updates
Works offline and in restricted networks                    | Throughput and context length limited by hardware
Full control over model version and behavior                | Larger models need quantization and compression
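The per-token cost trade-off can be made concrete with a rough break-even calculation. All figures below — hardware price, amortization period, power cost, and cloud price per million tokens — are illustrative assumptions, not quotes:

```python
# Break-even: at what monthly token volume does owned hardware beat a cloud API?
# Assumed numbers (illustrative): a $2,400 GPU amortized over 24 months plus
# $30/month for power, versus a cloud price of $3 per million tokens.
def breakeven_tokens_per_month(hw_cost: float, months: int,
                               power_per_month: float,
                               cloud_price_per_mtok: float) -> float:
    monthly_hw = hw_cost / months + power_per_month
    return monthly_hw / cloud_price_per_mtok * 1e6

tokens = breakeven_tokens_per_month(2400, 24, 30, 3.0)
print(f"break-even at {tokens / 1e6:.1f}M tokens/month")
```

Below the break-even volume, cloud APIs win on cost; above it, owned hardware wins — which is why steady, high-volume workloads favor local inference while bursty ones favor the cloud.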

Code examples

# Install Ollama and run a local LLM
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model interactively
ollama run llama3.2

# Serve an OpenAI-compatible REST API (runs on localhost:11434 by default)
ollama serve &

# Call the API from Python using the OpenAI client
python3 - <<'EOF'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
EOF

Tips for effective use

  • Start with GGUF Q4_K_M quantization for a good accuracy-speed balance; step down to Q3 or Q2 only if memory is critically constrained.
  • Use Ollama for developer machines and vLLM for on-premises servers serving multiple users concurrently.
  • Pin model versions in your Modelfile or configuration to prevent silent quality changes on updates.
  • Monitor token throughput and first-token latency — these reveal whether your hardware is the bottleneck or the model is over-quantized.
  • For Apple Silicon (M1/M2/M3/M4), llama.cpp and Ollama use the Metal GPU backend automatically, delivering throughput approaching that of discrete GPUs.
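The first-token latency and token throughput mentioned in the tips above can be measured on any token stream. A minimal sketch follows, with a simulated stream standing in for a real streaming response; the helper names are illustrative, not part of any runtime's API:

```python
import time
from typing import Iterable, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (time_to_first_token_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for chunk in chunks:
        if chunk:
            if first is None:
                first = time.perf_counter() - start  # time to first token
            count += 1
    elapsed = time.perf_counter() - start
    tps = count / elapsed if elapsed > 0 else 0.0
    return (first if first is not None else float("inf")), tps


# Simulated stream standing in for e.g. an OpenAI-client streaming response.
def fake_stream(n_tokens: int, delay_s: float):
    for _ in range(n_tokens):
        time.sleep(delay_s)
        yield "tok"


ttft, tps = measure_stream(fake_stream(20, 0.01))
print(f"first token after {ttft * 1000:.0f} ms, {tps:.0f} tokens/s")
```

A high time-to-first-token with healthy steady-state throughput usually points at prompt processing or model loading; low throughput throughout points at the hardware or an over-quantized model.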

Practical resources

  • Ollama — Run LLMs locally with a simple CLI and OpenAI-compatible API
  • llama.cpp — C++ inference engine for LLaMA and compatible models, GGUF format
  • vLLM — High-throughput LLM serving with continuous batching and PagedAttention
  • LM Studio — GUI for discovering, downloading, and running local LLMs
  • TensorFlow Lite — On-device inference for mobile and edge

See also