BERT

Bidirectional Encoder Representations from Transformers.

Definition

BERT is a transformer encoder model pretrained with masked language modeling (MLM) and next-sentence prediction (NSP). It produces contextual token embeddings, and the pretrained encoder is fine-tuned for downstream NLP tasks.

Unlike GPT-style decoders, BERT uses bidirectional context (both left and right of each token), which makes it well suited to understanding tasks (e.g. text classification, NER, QA) rather than open-ended generation. It is often used as a frozen or fine-tuned encoder in RAG and search pipelines.

BERT's pretraining objective is elegantly simple: randomly select 15% of the tokens in an input (in the original recipe, 80% of these are replaced with [MASK], 10% with a random token, 10% left unchanged) and train the model to predict them using the full surrounding context. This forces the encoder to develop rich, context-dependent representations for every token rather than memorizing surface statistics. At fine-tuning time, a small task head (one or two linear layers) is added on top of the pretrained encoder and trained on labeled data, often yielding strong performance with only a few thousand examples. Variants like RoBERTa (improved training recipe), DistilBERT (distilled for speed), and DeBERTa (disentangled attention) have improved upon the original while preserving the encoder-only paradigm.
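The MLM objective can be probed directly with the fill-mask pipeline; a minimal sketch against the public bert-base-uncased checkpoint:

# Probing the MLM objective: rank candidates for a masked position
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context on both sides of [MASK] when scoring candidates
for pred in fill("The capital of France is [MASK]."):
    print(f'{pred["token_str"]:>10}  score={pred["score"]:.3f}')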

How it works

Tokenization and embedding

Tokens are produced by the WordPiece tokenizer, which prepends a special [CLS] token at the start and inserts [SEP] between and after segments. Each token's input embedding is the sum of its token embedding, segment embedding, and positional embedding.
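A quick way to see the special tokens and segment IDs is to tokenize a sentence pair; a short sketch (the printed tokens are indicative, exact subword splits depend on the vocabulary):

# Tokenizing a sentence pair: WordPiece subwords plus [CLS]/[SEP] markers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("How does BERT work?", "It masks tokens.", return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'how', 'does', 'bert', 'work', '?', '[SEP]', 'it', 'masks', 'tokens', '.', '[SEP]']
print(enc["token_type_ids"][0])  # segment IDs: 0 for the first sentence, 1 for the second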

Bidirectional encoder

The encoder layers apply bidirectional self-attention: unlike causal models, every token can attend to every other token in both directions. This produces representations that are deeply context-aware. Stacking 12 or 24 such layers (BERT-Base / BERT-Large) yields powerful universal representations.
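The effect is easiest to see in the output shapes: each layer emits one contextual vector per token, computed with full two-sided attention. A minimal sketch:

# Inspecting per-layer outputs of the bidirectional encoder
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Bidirectional attention in action.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 13 states for BERT-Base: the embedding layer plus 12 encoder layers
print(len(outputs.hidden_states))        # 13
print(outputs.hidden_states[-1].shape)   # (1, seq_len, 768)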

Output and fine-tuning

The output can be a pooled vector (the [CLS] state for sentence-level tasks) or the full sequence of per-token states (for NER and QA). Fine-tuning adds a task head (e.g. a linear classifier) and updates either the entire model or just the head on labeled data.
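A sketch contrasting the two output views from the same forward pass:

# Pooled [CLS] output vs. per-token outputs
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT outputs one vector per token.", return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

cls_vector = last_hidden[:, 0, :]   # [CLS] position: input to sentence-level heads
token_vectors = last_hidden[0]      # one vector per token: input to NER/QA heads
print(cls_vector.shape, token_vectors.shape)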

When to use / When NOT to use

| Scenario | Use BERT? | Notes |
|---|---|---|
| Text classification (sentiment, intent) | Yes | [CLS] token + linear head is very effective |
| Named entity recognition (NER) | Yes | Per-token outputs suit span labeling |
| Semantic search / retrieval | Yes | Fine-tuned or bi-encoder variants (e.g. Sentence-BERT); see the sketch after this table |
| Open-ended text generation | No | Use a GPT-style decoder instead |
| Very long documents (> 512 tokens) | With caution | Use Longformer or chunking strategies |
| Zero-shot generation tasks | No | BERT has no autoregressive decoder, so it cannot generate zero-shot |
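The semantic-search row refers to bi-encoder setups, where BERT embeds queries and documents independently and retrieval ranks by vector similarity. A minimal mean-pooling sketch; the vanilla bert-base-uncased checkpoint is used only for illustration, while Sentence-BERT-style models are contrastively trained for exactly this use:

# Bi-encoder retrieval sketch: embed texts independently, rank by cosine similarity
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)           # zero out padding before pooling
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

query = embed(["How do I reset my password?"])
docs = embed(["Password reset instructions", "Shipping and returns policy"])
print(F.cosine_similarity(query, docs))  # higher score = closer match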

Comparisons

| Aspect | BERT (encoder-only) | GPT (decoder-only) |
|---|---|---|
| Context direction | Bidirectional | Unidirectional (causal) |
| Primary strength | Understanding / classification | Generation |
| Pretraining objective | Masked LM + NSP | Next-token prediction |
| Fine-tuning style | Add small task head | Prompting or supervised fine-tuning |
| Generation capability | Poor (not designed for it) | Excellent |
| Embedding quality (retrieval) | Excellent (with bi-encoder) | Moderate without fine-tuning |

Pros and cons

| Pros | Cons |
|---|---|
| Strong contextual representations | Cannot generate text autoregressively |
| Efficient fine-tuning on small datasets | Max 512 tokens in the base architecture (see the chunking sketch after this table) |
| Widely available pretrained variants | Requires labeled data for most tasks |
| Interpretable attention patterns | Weaker than GPT-4-class models on complex reasoning |
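For the 512-token limit, a common workaround is sliding-window chunking, which the tokenizer supports directly; a sketch:

# Sliding-window chunking for inputs longer than the 512-token limit
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = " ".join(["word"] * 2000)  # stand-in for a document longer than 512 tokens

enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,                       # 64-token overlap between consecutive chunks
    return_overflowing_tokens=True,  # return every chunk, not just the first
)
print(f"{len(enc['input_ids'])} chunks of up to 512 tokens")
# Downstream: run each chunk through BERT and aggregate scores (e.g. mean or max)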

Code examples

# Fine-tuning BERT for text classification with Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Minimal synthetic dataset for demonstration
texts  = ["I love this product!", "Terrible experience.", "It was okay I guess.", "Absolutely fantastic!"]
labels = [1, 0, 0, 1]  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(tokenize, batched=True)
dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=5,
    save_strategy="no",
    report_to="none",  # keep the demo free of external experiment trackers
)

trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
print("Fine-tuning complete.")
