GPT
Generative Pre-trained Transformer: the decoder-only family of transformer language models.
Definition
GPT refers to decoder-only transformer models trained to predict the next token (autoregressive language modeling). Scaling these models has produced today's large language models (LLMs), which can solve tasks few-shot or zero-shot from prompts alone.
Decoder-only design is well-suited for generation: at each step the model conditions on previous tokens and predicts the next. LLMs built on this idea are then instruction-tuned and aligned (e.g. RLHF) for chat and tool use. For understanding-only tasks, BERT-style encoders can be more parameter-efficient.
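To make the generation loop concrete, the sketch below runs a small open decoder-only model (GPT-2) with the Hugging Face transformers library; the model name, prompt, and sampling settings are illustrative choices, not a recommendation.

```python
# Minimal autoregressive generation with GPT-2 (illustrative settings).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The model conditions on the prompt and emits one token at a time.
inputs = tokenizer("The transformer architecture", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,       # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```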
The GPT line of models (GPT-1, GPT-2, GPT-3, GPT-4) demonstrated that scaling a simple next-token prediction objective on ever-larger corpora produces models with emergent capabilities: reasoning, code generation, multi-step arithmetic, and few-shot task solving without any task-specific training. The instruction-tuning and RLHF stages that follow base pretraining transform a raw next-token predictor into an assistant that reliably follows natural language instructions, maintains conversation context, and refuses harmful requests. Modern GPT-family deployments are accessed through APIs and support features like function calling, vision inputs, and streaming.
How it works
Causal masking
Tokens are embedded and fed through a stack of causal decoder layers: each position can attend only to itself and earlier positions. Masked self-attention enforces this by setting the strictly upper-triangular entries of the attention-score matrix to −∞, which prevents the model from "seeing" the future during both training and inference.
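A minimal sketch of the mask itself, in PyTorch with toy shapes (all dimensions here are illustrative): the strictly upper-triangular score entries are set to −∞ before the softmax, so each row's attention weights cover only its own and earlier positions.

```python
# Causal (masked) self-attention for a single head, toy dimensions.
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
q = torch.randn(seq_len, d_model)
k = torch.randn(seq_len, d_model)
v = torch.randn(seq_len, d_model)

scores = q @ k.T / d_model**0.5  # (seq_len, seq_len) attention logits

# True above the diagonal = future positions; -inf zeroes their softmax weight.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)  # row i attends only to positions <= i
out = weights @ v                    # causally attended values
print(weights)                       # lower-triangular pattern of weights
```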
Language modeling head
Each position's representation passes through a linear layer over the vocabulary, followed by softmax, to produce a distribution over the next token. Training maximizes the log-likelihood of each token given all preceding tokens (teacher forcing), so all positions are predicted in parallel; at inference only the last position's prediction is used. The loss is averaged over all positions, so every token in the sequence contributes a gradient signal.
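A sketch of the head and loss with toy dimensions (PyTorch; the decoder outputs are random stand-ins): the one-position shift between logits and targets is what implements "predict the next token".

```python
# Language-modeling head + teacher-forcing loss, toy dimensions.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 16, 6
lm_head = nn.Linear(d_model, vocab_size)

hidden = torch.randn(seq_len, d_model)           # decoder outputs per position
token_ids = torch.randint(0, vocab_size, (seq_len,))

logits = lm_head(hidden)                         # (seq_len, vocab_size)

# Position i predicts token i+1, so logits[:-1] align with targets[1:].
loss = nn.functional.cross_entropy(logits[:-1], token_ids[1:])
print(loss.item())  # mean negative log-likelihood across positions
```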
Inference and sampling
Inference generates autoregressively: sample or greedily pick the next token, append it, and repeat until a stop condition (EOS token or max length). Sampling parameters (temperature, top-k, top-p) control diversity vs. determinism. Prompt engineering and fine-tuning shape task behavior on top of this mechanism.
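The sampling knobs can be shown in isolation; a minimal sketch, assuming a raw logits vector already produced by the model (PyTorch):

```python
# Temperature + top-k sampling from next-token logits (illustrative values).
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 5) -> int:
    logits = logits / temperature             # <1 sharpens, >1 flattens
    topk = torch.topk(logits, top_k)          # keep only the k best tokens
    masked = torch.full_like(logits, float("-inf"))
    masked[topk.indices] = topk.values
    probs = torch.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])  # stand-in model output
print(sample_next_token(logits))
```

As temperature approaches 0 this reduces to greedy argmax decoding; top-p (nucleus) sampling replaces the fixed k with the smallest token set whose cumulative probability exceeds p.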
When to use / When NOT to use
| Scenario | Use GPT-style? | Notes |
|---|---|---|
| Text generation, summarization, dialogue | Yes | The natural fit for autoregressive generation |
| Few-shot classification via prompting | Yes | GPT handles this well with few examples |
| Semantic search / dense retrieval | With caution | Bi-encoders (BERT-style) are more efficient |
| NER or token-level classification | With caution | Encoder models are more parameter-efficient |
| Long-context reasoning (>8K tokens) | Yes | Modern GPT models support very long contexts |
| Strict budget / edge deployment | No | GPT models are large; use distilled alternatives |
Comparisons
| Aspect | GPT (decoder-only) | BERT (encoder-only) |
|---|---|---|
| Context direction | Unidirectional (causal) | Bidirectional |
| Primary strength | Generation | Understanding / classification |
| Pretraining objective | Next-token prediction | Masked LM + NSP |
| Zero-shot capability | High | Low |
| Embedding quality (retrieval) | Moderate without fine-tuning | Excellent when fine-tuned as a bi-encoder (e.g. Sentence-BERT) |
| API access | OpenAI, Anthropic, Mistral, etc. | Hugging Face Hub |
Pros and cons
| Pros | Cons |
|---|---|
| Strong zero-shot and few-shot generation | Expensive to run (large parameter count) |
| Unified model for diverse tasks | Prone to hallucination |
| Instruction-following via prompts | No explicit bidirectional context |
| Easily extended with tools and RAG | Output must be validated / grounded |
Code examples
```python
# Chat completion with the OpenAI API, streaming the response
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain the difference between GPT and BERT in two sentences."},
    ],
    temperature=0.3,
    max_tokens=200,
    stream=True,  # yields chunks as tokens are generated
)

print("Response: ", end="", flush=True)
for chunk in response:
    delta = chunk.choices[0].delta  # incremental slice of the assistant message
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end
```
Practical resources
- Improving Language Understanding by Generative Pre-Training (OpenAI) — Original GPT-1 paper
- Hugging Face – GPT-2 — Model docs and hosted weights
- OpenAI API reference — Complete reference for chat completions endpoint