Deep learning
Deep neural networks and representation learning.
Definition
Deep learning uses neural networks with many layers to learn hierarchical representations from data. It has driven progress in vision, language, and other domains by scaling data and compute.
It extends machine learning by using differentiable, layered models (see neural networks) that learn features automatically instead of hand-crafted ones. Depth allows the model to build increasingly abstract representations (e.g. edges -> textures -> parts -> objects in vision).
The defining characteristic of deep learning is end-to-end learning: raw inputs (pixels, tokens, audio samples) are transformed through successive non-linear layers, and the entire pipeline is optimized jointly by gradient descent. This removes the need for domain-specific feature engineering that traditional ML relies on. The tradeoff is that deep models need substantially more data and compute — GPUs, TPUs, and large memory — and are harder to interpret than classical models.
How it works
Forward pass
Data is fed into the input layer. Each layer applies a linear transformation (matrix multiply + bias) followed by a nonlinearity (e.g. ReLU). Stacking layers produces progressively more abstract representations. The final layer maps to the task output (class scores, regression value, or token logits).
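A minimal sketch of the forward pass in PyTorch; the layer sizes here are illustrative, not tied to any particular task:
# Forward pass sketch: linear map + ReLU, stacked, then an output layer
import torch
import torch.nn as nn
x = torch.randn(1, 4)              # one input with 4 features
hidden = nn.Linear(4, 8)           # linear transformation (matrix multiply + bias)
output = nn.Linear(8, 3)           # final layer mapping to 3 class scores
h = torch.relu(hidden(x))          # nonlinearity after the linear map
logits = output(h)
print(logits.shape)                # torch.Size([1, 3])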
Backward pass and optimization
The loss (e.g. cross-entropy for classification) is computed between predictions and targets. Backpropagation uses the chain rule to compute gradients of the loss with respect to every weight in the network. An optimizer (SGD, Adam) then updates the weights in the direction that reduces loss.
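One full optimization step, sketched with a toy one-layer model and random data (the sizes and learning rate are arbitrary):
# Backward pass sketch: loss -> backpropagation -> weight update
import torch
import torch.nn as nn
model = nn.Linear(4, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 4)                      # batch of 8 inputs
y = torch.randint(0, 3, (8,))              # integer class targets
loss = loss_fn(model(x), y)                # scalar loss on the batch
loss.backward()                            # chain rule fills .grad on every parameter
print(model.weight.grad.shape)             # torch.Size([3, 4])
opt.step()                                 # update weights to reduce the loss
opt.zero_grad()                            # clear gradients before the next step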
Architectures
Architecture choice tailors connectivity to data type: CNNs exploit spatial locality for images; RNNs handle variable-length sequences; Transformers use global self-attention and now dominate both vision and language tasks at scale.
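The difference in connectivity can be sketched directly; the shapes below are illustrative:
# Architecture sketch: local convolution vs. global self-attention
import torch
import torch.nn as nn
image = torch.randn(1, 3, 32, 32)                    # batch, channels, height, width
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)    # each output sees a 3x3 neighborhood
print(conv(image).shape)                             # torch.Size([1, 16, 32, 32])
tokens = torch.randn(1, 10, 64)                      # batch, sequence length, embedding dim
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)                # every position attends to every other
print(out.shape)                                     # torch.Size([1, 10, 64])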
When to use / When NOT to use
| Scenario | Use deep learning? | Notes |
|---|---|---|
| Large-scale image or video recognition | Yes | CNNs are the standard backbone |
| Text understanding or generation | Yes | Transformers set state-of-the-art across NLP |
| Small structured/tabular dataset | No | Gradient boosting typically outperforms |
| Need full model interpretability | No | Deep models are largely black boxes |
| Limited compute / edge deployment | With caution | Use quantization or distilled models (see the sketch after this table) |
| Speech and audio recognition | Yes | Deep models outperform classical signal processing |
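For the edge-deployment row, dynamic quantization is one low-effort option. A minimal sketch using PyTorch's built-in dynamic quantization (the untrained MLP here stands in for a trained model):
# Quantization sketch: int8 weights for smaller, faster CPU inference
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # quantize Linear weights to int8
)
print(quantized(torch.randn(1, 784)).shape)  # torch.Size([1, 10])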
Comparisons
| Aspect | Classical ML | Deep Learning |
|---|---|---|
| Feature engineering | Manual | Automatic (end-to-end) |
| Data requirements | Low to medium | High |
| Compute requirements | Low | High (GPU/TPU) |
| Interpretability | High (e.g. trees) | Low |
| Performance on unstructured data | Moderate | Very high |
Pros and cons
| Pros | Cons |
|---|---|
| Automatic feature learning | Data hungry |
| State-of-the-art on vision and language | Requires GPU/TPU |
| End-to-end optimization | Hard to interpret |
| Transfer learning reduces data needs (see sketch after this table) | Long training times |
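The transfer-learning row can be made concrete: reuse a pretrained backbone and train only a new head. A sketch with torchvision's ResNet-18 (assumes torchvision 0.13+; the 10-class head is an illustrative target task):
# Transfer learning sketch: frozen pretrained backbone, new task head
import torch.nn as nn
from torchvision import models
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                          # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, 10)       # new head; only this part trains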
Code examples
# Feedforward network with PyTorch for image classification (MNIST)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Data loaders
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_loader = DataLoader(
    datasets.MNIST('.', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
)
test_loader = DataLoader(
    datasets.MNIST('.', train=False, download=True, transform=transform),
    batch_size=1000
)
# Model definition
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )
    def forward(self, x):
        return self.net(x)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP().to(device)
opt = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# Training
for epoch in range(3):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
# Evaluation
model.eval()                     # idiomatic switch to evaluation mode
correct = 0
with torch.no_grad():            # no gradient tracking needed for inference
    for X, y in test_loader:
        X, y = X.to(device), y.to(device)
        correct += (model(X).argmax(1) == y).sum().item()
print(f"Test accuracy: {correct / len(test_loader.dataset):.2%}")
Practical resources
- Deep Learning (Goodfellow et al.) — Free online textbook covering theory in depth
- PyTorch – Introduction — Hands-on 60-minute deep learning tutorial
- fast.ai – Practical Deep Learning — Top-down course with real-world projects and code