Zero-shot learning

Performing tasks without task-specific training examples.

Definition

Zero-shot learning (ZSL) is the ability of a model to perform a task for which it has received no labeled training examples at inference time. The model generalizes purely from knowledge acquired during pretraining, guided only by a task description — a natural language prompt, a set of class attribute vectors, or a shared embedding space between modalities. There are no gradient updates on the target task; the model must bridge the gap between what it learned during pretraining and the new task specification.

Two major paradigms exist. In the attribute-based approach — the original formulation from computer vision (Lampert et al., 2009) — unseen classes are described by semantic attributes (e.g., "has stripes", "lives in water"), and the model classifies inputs by matching predicted attributes to class descriptions. In the large model approach — now dominant — pretrained LLMs or vision-language models generalize via prompting. For text tasks, the model is given an instruction describing the task and format; for image tasks, CLIP embeds both images and class-name text in a shared space and classifies by cosine similarity.
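
As a toy illustration of the attribute-based formulation (the class names, attribute columns, and score values below are invented for this sketch), each unseen class is described by an attribute vector, and an input is assigned to the class whose attributes best match the attributes predicted for it:

import numpy as np

# Attribute signatures for *unseen* classes, never observed during training.
# Columns: [has_stripes, lives_in_water, has_fur] -- illustrative only.
class_names = ["zebra", "dolphin", "polar bear"]
class_attributes = np.array([
    [1, 0, 1],   # zebra
    [0, 1, 0],   # dolphin
    [0, 0, 1],   # polar bear
], dtype=float)

# Attribute probabilities predicted for one input by an attribute classifier
# trained only on seen classes (values invented for illustration).
predicted_attributes = np.array([0.9, 0.1, 0.8])

# Classify by matching predicted attributes against each class signature.
scores = class_attributes @ predicted_attributes
print(class_names[int(scores.argmax())])   # -> "zebra"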

The quality of zero-shot predictions depends largely on how well pretraining covered the target task or semantically similar ones. LLMs excel at zero-shot for NLP tasks (classification, summarization, translation, question answering) because web-scale pretraining implicitly covers most text tasks. CLIP-style vision-language models generalize zero-shot to object recognition across all 1,000 ImageNet classes without using any of its labeled images. When zero-shot quality is insufficient, few-shot learning (adding examples to the prompt) or fine-tuning are natural next steps.

How it works

Prompt-based zero-shot (LLMs)

The task is fully specified in the prompt: no examples, only instructions and format. The LLM conditions on the prompt and generates or completes the answer. Instruction-tuned models (e.g., GPT-4, Claude, Llama-3-Instruct) are specifically trained to follow zero-shot instructions reliably.
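
A minimal sketch of what such a prompt looks like; the prompt wording is arbitrary, and generate_text is a hypothetical placeholder for whatever LLM client or API is used, not a real library call:

# Zero-shot prompt: instructions and output format only, no worked examples.
prompt = (
    "Classify the sentiment of the following review as positive, negative, "
    "or neutral. Reply with a single word.\n\n"
    "Review: The battery died after two days and support never answered.\n"
    "Sentiment:"
)

# `generate_text` stands in for any instruction-tuned LLM call
# (hosted API or local model); it is not a specific library function.
answer = generate_text(prompt)
print(answer)   # expected: "negative"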

Vision-language zero-shot (CLIP)

CLIP trains an image encoder and a text encoder jointly so that matching image-text pairs have high cosine similarity in a shared embedding space. At inference, class names (e.g., "a photo of a cat") are embedded as text; an input image is embedded and classified by nearest-neighbor to class text embeddings — no labeled images required.
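
The classification step itself reduces to cosine similarity between L2-normalized embeddings. A schematic sketch with random stand-in vectors (the full Hugging Face CLIP pipeline appears under Code examples below):

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs; in CLIP these come from the image encoder
# and the text encoder applied to prompts like "a photo of a cat".
image_embedding = rng.normal(size=512)
text_embeddings = rng.normal(size=(4, 512))   # one row per class prompt

# L2-normalize so that the dot product equals cosine similarity.
image_embedding /= np.linalg.norm(image_embedding)
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Nearest class text embedding wins -- no labeled images involved.
similarities = text_embeddings @ image_embedding
predicted_class_index = int(similarities.argmax())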

Zero-shot chain-of-thought (CoT)

Adding "Let's think step by step" to a zero-shot prompt elicits multi-step reasoning from LLMs, substantially improving accuracy on arithmetic, logic, and commonsense tasks without providing any worked examples.

Generalized zero-shot learning (GZSL)

In GZSL, the model must classify inputs from both seen (training) classes and unseen (zero-shot) classes simultaneously. This is harder than standard ZSL because the model tends to be biased toward seen classes. Calibration techniques and generative models (synthesizing features for unseen classes) help.
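
One common calibration trick, often called calibrated stacking, subtracts a fixed penalty from seen-class scores before taking the argmax so unseen classes can compete. A minimal sketch with invented scores:

import numpy as np

# Combined label space: the first 3 classes were seen in training, the last 2 were not.
class_names = ["cat", "dog", "horse", "zebra", "okapi"]
seen_mask = np.array([True, True, True, False, False])

# Raw compatibility scores for one input (invented); note the seen-class bias.
scores = np.array([2.1, 1.9, 1.8, 2.0, 0.7])

# Penalize seen classes by a constant gamma, typically chosen on validation data.
gamma = 0.4
calibrated = scores - gamma * seen_mask
print(class_names[int(calibrated.argmax())])   # "zebra" instead of "cat"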

When to use / When NOT to use

| Scenario | Use zero-shot | Avoid zero-shot |
|---|---|---|
| Task is well-described in natural language | Yes: instruction-tuned LLMs handle this reliably | When the task requires specialized domain knowledge not covered in pretraining |
| No labeled data available at all | Yes: zero-shot is the only option | If you can collect even a few examples, use few-shot instead |
| Rapid prototyping across many tasks | Yes: no training overhead | For production systems with strict quality requirements |
| New image classes described by text | Yes: CLIP-style models generalize from class names | When visual similarity to the training classes is low |
| Arithmetic or reasoning tasks requiring high accuracy | Partial: use with chain-of-thought prompting | Prefer few-shot or fine-tuned models for critical applications |

Comparisons

| Approach | Examples needed | Adaptation | Accuracy potential | Speed to deploy |
|---|---|---|---|---|
| Zero-shot | 0 | Prompt only | Moderate | Instant |
| Few-shot (in-context) | 1–10 | In-context examples | Higher | Very fast |
| Fine-tuning | 100–10K+ | Gradient updates | Highest | Slower |
| Zero-shot + CoT | 0 | Prompt with reasoning | Higher than zero-shot | Instant |

Pros and cons

| Pros | Cons |
|---|---|
| No labeled data or training required | Quality depends heavily on pretraining coverage |
| Instant deployment: just write a prompt | Inconsistent for niche or highly specialized tasks |
| Flexible: one model handles many tasks | No guarantee of structured output format |
| CLIP extends zero-shot to vision without image labels | Generalized ZSL is biased toward seen classes |

Code examples

Zero-shot text classification using Hugging Face's NLI-based pipeline:

from transformers import pipeline

# Zero-shot classifier using NLI (no fine-tuning needed)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The central bank raised interest rates by 50 basis points to combat inflation."
candidate_labels = ["finance", "sports", "technology", "politics", "science"]

result = classifier(text, candidate_labels=candidate_labels)
print("Top label:", result["labels"][0])      # finance
print("Confidence:", f"{result['scores'][0]:.2%}")

Zero-shot image classification with CLIP:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Class descriptions as text (no labeled images needed)
class_texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a car",
]

image = Image.open("test_image.jpg")  # Any image

inputs = processor(
    text=class_texts,
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)

predicted_class = class_texts[probs.argmax().item()]
print(f"Predicted: {predicted_class} ({probs.max().item():.2%})")
