Recurrent neural networks (RNN)

RNNs and sequential data.

Definition

RNNs process sequences by maintaining a hidden state that is updated at each step. They (and variants like LSTM) were the standard for sequence modeling before Transformers.

They are a natural fit for NLP, time series, and any ordered data where context from the past matters. Transformers have largely replaced them in language modeling due to parallelization and long-range dependency handling, but RNNs still appear in streaming or low-latency settings.

The fundamental idea is to share parameters across time: the same weight matrices are used at every step, so one model can handle sequences of any length. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) variants address the vanishing gradient problem of plain RNNs with gating mechanisms that control what information is stored, forgotten, or passed forward. These architectures remain competitive in resource-constrained settings, online learning scenarios, and any use case where a compact sequential model with bounded memory is preferable to the quadratic-cost attention of Transformers.
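
As a quick illustration of this length-agnostic parameter sharing, the sketch below feeds sequences of three different lengths through the same nn.RNN module (illustrative PyTorch; the dimensions are arbitrary placeholders):

# Same weights, any sequence length (illustrative dimensions)
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
for seq_len in (5, 50, 500):
    out, h_n = rnn(torch.randn(1, seq_len, 8))   # one batch, varying length
    print(seq_len, out.shape, h_n.shape)         # output grows with length; h_n stays (1, 1, 16)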

How it works

Recurrent computation

At each step, the model receives the current input (e.g. a token or frame) and the previous hidden state. It computes a new hidden state: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b), where W_h and W_x are the recurrent and input weight matrices and b is a bias. The hidden state summarizes all information from the beginning of the sequence up to step t.
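
A minimal sketch of this recurrence in plain PyTorch (random placeholder weights, toy dimensions):

# The recurrence h_t = tanh(W_h h_{t-1} + W_x x_t + b), step by step
import torch

input_dim, hidden_dim = 16, 32
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1  # recurrent weights (shared across steps)
W_x = torch.randn(hidden_dim, input_dim) * 0.1   # input weights (shared across steps)
b   = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)                      # h_0: initial hidden state
for x_t in torch.randn(20, input_dim):           # a toy 20-step sequence
    h = torch.tanh(W_h @ h + W_x @ x_t + b)      # h_t summarizes steps 1..t
print(h.shape)                                   # torch.Size([32])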

LSTM gating

LSTM and GRU variants replace the simple tanh cell with gating units. In an LSTM, the forget gate decides what to discard from the cell state, the input gate controls what new information to store, and the output gate determines what to expose as the hidden state; the GRU achieves a similar effect with just two gates (update and reset). Gating allows the network to learn long-range dependencies that plain RNNs struggle to capture.
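
To make the gates concrete, here is one LSTM step written out by hand; this is a sketch with placeholder weights and toy dimensions, not the optimized nn.LSTM kernel:

# One LSTM step, gate by gate (placeholder weights, toy dimensions)
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    z = W @ x_t + U @ h_prev + b                 # all four gates' pre-activations, stacked
    f, i, o, g = z.chunk(4)                      # forget, input, output, candidate
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)                            # candidate cell values
    c = f * c_prev + i * g                       # discard via f, store via i
    h = o * torch.tanh(c)                        # expose a gated view of the cell
    return h, c

input_dim, hidden_dim = 16, 32
W = torch.randn(4 * hidden_dim, input_dim) * 0.1
U = torch.randn(4 * hidden_dim, hidden_dim) * 0.1
b = torch.zeros(4 * hidden_dim)
h = c = torch.zeros(hidden_dim)
h, c = lstm_step(torch.randn(input_dim), h, c, W, U, b)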

Training: backprop through time

The recurrence is unrolled in time for training (backpropagation through time, BPTT). At inference, the hidden state is passed forward step by step. Input/output mappings can be one-to-one, one-to-many, many-to-one, or many-to-many depending on the task (e.g. sequence classification is many-to-one, sequence labeling is many-to-many).
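
In practice, BPTT over long sequences is often truncated: the sequence is processed in chunks and the hidden state is detached between them, so gradients flow back at most one chunk. The sketch below shows the pattern with a toy model and random data; the chunk length, dimensions, and loss are arbitrary choices for illustration:

# Truncated BPTT: detach the hidden state between chunks (toy model and data)
import torch
import torch.nn as nn

rnn  = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt  = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

seq, target = torch.randn(1, 100, 8), torch.randn(1, 100, 1)
h, chunk_len = None, 20
for start in range(0, 100, chunk_len):
    x, y = seq[:, start:start + chunk_len], target[:, start:start + chunk_len]
    out, h = rnn(x, h)                           # unroll only this chunk
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()                              # gradients reach at most chunk_len steps back
    opt.step()
    h = h.detach()                               # cut the graph at the chunk boundary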

When to use / When NOT to use

Scenario | Use RNN? | Notes
Streaming inference with low memory | Yes | RNNs process step-by-step with bounded state
Very long sequences with global context | No | Transformers handle this better
Time-series forecasting (moderate length) | Yes | LSTMs are competitive with lower compute
Parallelizable training required | No | RNNs are inherently sequential
NLP tasks at scale | No | Transformers dominate modern NLP
Embedded / edge devices | Yes | Small LSTM/GRU models are inference-efficient

Comparisons

Aspect | RNN / LSTM | CNN | Transformer
Primary use case | Sequences, time series | Images, grids | Text, multimodal
Parallelizable training | No (sequential) | Yes | Yes
Long-range dependencies | Moderate (with LSTM) | Poor | Excellent
Memory footprint (inference) | Very low (fixed state) | Low | High (KV cache)
Streaming / online inference | Excellent | N/A | Difficult
State-of-the-art NLP performance | No | No | Yes

Pros and cons

Pros | Cons
Natural fit for sequential data | Cannot be parallelized during training
Fixed memory footprint at inference | Struggles with very long dependencies
Efficient for streaming / online use cases | Largely superseded by Transformers for NLP
Compact models for edge deployment | Vanishing gradient (mitigated by LSTM/GRU)

Code examples

# LSTM for sentiment classification with PyTorch
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, num_layers=2, dropout=0.3)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)              # h_n: (num_layers, batch, hidden_dim)
        return self.classifier(h_n[-1])         # use last layer's final hidden state

# Dummy batch: 8 sequences, each 20 tokens, vocab of 5000
model   = LSTMClassifier(vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2)
tokens  = torch.randint(0, 5000, (8, 20))
logits  = model(tokens)
print(f"Output shape: {logits.shape}")          # (8, 2)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")
