GPT

Generative Pre-trained Transformer and the decoder-only model family.

Definition

GPT refers to decoder-only transformer models trained autoregressively to predict the next token. Scaling these models has produced today's large language models (LLMs), which can solve tasks few-shot or zero-shot from a prompt alone.

The decoder-only design is well suited to generation: at each step the model conditions on the previous tokens and predicts the next one. LLMs built on this idea are then instruction-tuned and aligned (e.g., with RLHF) for chat and tool use. For understanding-only tasks, BERT-style encoders can be more parameter-efficient.

The GPT line of models (GPT-1, GPT-2, GPT-3, GPT-4) demonstrated that scaling a simple next-token prediction objective on ever-larger corpora produces models with emergent capabilities: reasoning, code generation, multi-step arithmetic, and few-shot task solving without any task-specific training. The instruction-tuning and RLHF stages that follow base pretraining transform a raw next-token predictor into an assistant that reliably follows natural language instructions, maintains conversation context, and refuses harmful requests. Modern GPT-family deployments are accessed through APIs and support features like function calling, vision inputs, and streaming.

How it works

Causal masking

Tokens are embedded and fed into causal decoder layers: each position can attend only to itself and earlier positions. This is enforced with masked self-attention, where the strictly upper-triangular part of the attention-score matrix is set to −∞ before the softmax, so the model never "sees" future tokens during either training or inference.
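
A minimal sketch of building and applying the causal mask for a single attention head, written in PyTorch. The shapes and variable names here are illustrative, not taken from any particular GPT implementation:

# Causal (upper-triangular) attention mask -- illustrative sketch
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 64
q = torch.randn(seq_len, d_k)   # queries
k = torch.randn(seq_len, d_k)   # keys
v = torch.randn(seq_len, d_k)   # values

scores = q @ k.T / d_k ** 0.5   # (seq_len, seq_len) attention scores
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf"))  # block attention to future positions
weights = F.softmax(scores, dim=-1)               # row i sums to 1 over positions <= i
out = weights @ v                                 # each position mixes only itself and the past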

Language modeling head

At each position, the next token is predicted from that position's representation via a linear layer over the vocabulary (the language modeling head), followed by a softmax. Training maximizes the log-likelihood of each next token given all preceding tokens (teacher forcing). The loss is averaged over all positions, so every token in the sequence contributes a gradient signal.
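
The shift-by-one objective is easy to get wrong, so here is a small sketch of the loss computation, assuming random stand-in tensors in place of real decoder outputs:

# Next-token prediction loss with teacher forcing -- illustrative sketch
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 100, 8, 32
hidden = torch.randn(1, seq_len, d_model)         # decoder output, one vector per position
lm_head = torch.nn.Linear(d_model, vocab_size)    # linear layer over the vocabulary

logits = lm_head(hidden)                          # (1, seq_len, vocab_size)
tokens = torch.randint(vocab_size, (1, seq_len))  # the input sequence itself is the target

# Position t predicts token t+1: drop the last prediction, shift targets right.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),       # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),                    # targets are positions 1..T-1
)  # cross-entropy = negative log-likelihood, averaged over all positions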

Inference and sampling

Inference proceeds autoregressively: sample or greedily pick the next token, append it to the context, and repeat until a stop condition is met (an EOS token or the maximum length). Sampling parameters (temperature, top-k, top-p) trade diversity against determinism. Prompt engineering and fine-tuning shape task behavior on top of this mechanism.
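
A sketch of this decoding loop with temperature and top-k sampling. Here `model` is a hypothetical stand-in for any causal LM that returns logits of shape (1, seq_len, vocab_size); the parameter defaults are illustrative:

# Autoregressive decoding with temperature and top-k -- illustrative sketch
import torch

def generate(model, tokens, max_new_tokens=50, temperature=0.8, top_k=40, eos_id=None):
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]          # logits for the next token only
        logits = logits / temperature             # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits[logits < kth] = float("-inf")  # keep only the k most likely tokens
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and repeat
        if eos_id is not None and next_token.item() == eos_id:
            break                                 # stop condition: EOS token
    return tokens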

When to use / When NOT to use

| Scenario | Use GPT-style? | Notes |
|---|---|---|
| Text generation, summarization, dialogue | Yes | The natural fit for autoregressive generation |
| Few-shot classification via prompting | Yes | GPT handles this well with few examples (see the sketch after this table) |
| Semantic search / dense retrieval | With caution | Bi-encoders (BERT-style) are more efficient |
| NER or token-level classification | With caution | Encoder models are more parameter-efficient |
| Long-context reasoning (>8K tokens) | Yes | Modern GPT models support very long contexts |
| Strict budget / edge deployment | No | GPT models are large; use distilled alternatives |
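
To make the few-shot classification row concrete, here is a minimal sketch using the same OpenAI client as the code example further below; the reviews and labels are made-up illustrations:

# Few-shot sentiment classification via prompting -- illustrative sketch
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

few_shot_prompt = """Classify the sentiment as positive or negative.

Review: The battery dies within an hour. -> negative
Review: Crisp screen and great speakers. -> positive
Review: Shipping took three weeks and the box was crushed. ->"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0.0,  # keep classification output as deterministic as possible
    max_tokens=3,
)
print(response.choices[0].message.content.strip())  # expected: "negative"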

Comparisons

| Aspect | GPT (decoder-only) | BERT (encoder-only) |
|---|---|---|
| Context direction | Unidirectional (causal) | Bidirectional |
| Primary strength | Generation | Understanding / classification |
| Pretraining objective | Next-token prediction | Masked LM + NSP |
| Zero-shot capability | High | Low |
| Embedding quality (retrieval) | Moderate without fine-tuning | Excellent (bi-encoder) |
| API access | OpenAI, Anthropic, Mistral, etc. | Hugging Face Hub |
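
A quick way to feel the difference between the two objectives, assuming the transformers library and the small public checkpoints gpt2 and bert-base-uncased; this is an illustrative contrast, not a benchmark:

# Decoder-only vs. encoder-only objectives -- illustrative sketch
from transformers import pipeline

# Decoder-only: continues text left-to-right.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_new_tokens=20)[0]["generated_text"])

# Encoder-only: fills in a masked token using context on both sides.
filler = pipeline("fill-mask", model="bert-base-uncased")
print(filler("The transformer [MASK] was introduced in 2017.")[0]["token_str"])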

Pros and cons

| Pros | Cons |
|---|---|
| Strong zero-shot and few-shot generation | Expensive to run (large parameter count) |
| Unified model for diverse tasks | Prone to hallucination |
| Instruction-following via prompts | No explicit bidirectional context |
| Easily extended with tools and RAG | Output must be validated / grounded |

Code examples

# Chat completion with OpenAI API + streaming
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user",   "content": "Explain the difference between GPT and BERT in two sentences."},
    ],
    temperature=0.3,
    max_tokens=200,
    stream=True,
)

print("Response: ", end="", flush=True)
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end
