The Transformer Architecture
A deep dive into every component: from positional encoding to multi-head attention to the full encoder block
Vaswani et al. • Google Brain • 2017
Overview
The Transformer revolutionized NLP by replacing recurrence with self-attention. Every token can attend to every other token in parallel, enabling massive speedups on GPUs/TPUs. This page breaks down all 6 core components that build up to a complete encoder block.
Architecture Overview
The 6 building blocks
Positional Encoding
Injecting sequence order
The Problem
Transformers process all tokens in parallel with no recurrence or convolution. "The cat sat on the mat" and "mat the on sat cat the" would look identical without position information.
The Formula
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Even dimensions: sine
- Odd dimensions: cosine
- Low dimensions: high frequency (short wavelengths)
- High dimensions: low frequency (long wavelengths)
Bounded [-1, 1]
No exploding values - sine and cosine are always bounded
Unique per Position
Each position gets a distinct encoding fingerprint
Relative Positions Learnable
PE(pos+k) is a linear function of PE(pos)
Extrapolation
Works for sequences longer than training data
Implementation
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Position indices: (seq_len, 1)
    pos = np.arange(seq_len, dtype=np.float32)[:, np.newaxis]
    # Dimension indices for the even slots: 0, 2, 4, ...
    i = np.arange(0, d_model, 2, dtype=np.float32)
    # Division term: 1/10000^(2i/d_model)
    # Using exp/log for numerical stability
    div_term = np.exp(-i * np.log(10000.0) / d_model)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(pos * div_term)  # even indices: sine
    pe[:, 1::2] = np.cos(pos * div_term)  # odd indices: cosine
    return pe
Example Output
pe = positional_encoding(seq_len=4, d_model=4)
# Position 0: [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]
# Position 1: [0.84, 0.54, 0.01, 1.0]
# high-frequency dims (left) change fast; low-frequency dims (right) change slowly
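As a quick numerical check of the "PE(pos+k) is a linear function of PE(pos)" property, the sketch below (assuming NumPy and the positional_encoding function above; offset_matrix is a helper introduced here for illustration) builds the fixed linear map for an offset k and confirms it reproduces the shifted encodings:

import numpy as np

def offset_matrix(k: int, d_model: int) -> np.ndarray:
    """Illustrative helper: a matrix M_k with PE(pos + k) = PE(pos) @ M_k."""
    i = np.arange(0, d_model, 2, dtype=np.float32)
    omega = np.exp(-i * np.log(10000.0) / d_model)   # one frequency per (sin, cos) pair
    M = np.zeros((d_model, d_model), dtype=np.float32)
    for p, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle addition: sin(w(pos+k)) = sin(w pos)*c + cos(w pos)*s
        #                 cos(w(pos+k)) = cos(w pos)*c - sin(w pos)*s
        M[2 * p, 2 * p], M[2 * p, 2 * p + 1] = c, -s
        M[2 * p + 1, 2 * p], M[2 * p + 1, 2 * p + 1] = s, c
    return M

pe = positional_encoding(seq_len=32, d_model=16)
M = offset_matrix(k=5, d_model=16)
assert np.allclose(pe[:-5] @ M, pe[5:], atol=1e-4)   # PE(pos+5) is linear in PE(pos)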
Scaled Dot-Product Attention
The heart of the Transformer
Core Intuition
This is where tokens "talk" to each other. Each query asks "what should I pay attention to?", keys answer "here's what I contain", and values provide "here's the information I'll give you".
The Formula
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Step-by-Step Data Flow
Compute Similarity Scores
QK^T: (batch, seq_q, d_k) @ (batch, d_k, seq_k) → (batch, seq_q, seq_k)
scores[i][j] = how much query_i should attend to key_j
Scale by √d_k
Large d_k → large dot products → softmax becomes peaked (near one-hot)
Scaling prevents gradient vanishing in softmax's flat regions
Apply Mask (optional)
Add -1e9 to positions that shouldn't attend
After softmax, these become ~0
Softmax over Keys
Convert scores to probabilities (sum to 1 per query)
Each query's attention weights form a probability distribution
Weighted Sum of Values
weights @ V: (batch, seq_q, seq_k) @ (batch, seq_k, d_v) → (batch, seq_q, d_v)
Output[i] = Σⱼ (attention_weight[i,j] × V[j])
Why Scale by √d_k?
# Without scaling:
d_k = 512
# For unit-variance q and k, the dot product q · k has variance d_k,
# so the spread between scores grows like sqrt(d_k) (~22 for d_k = 512)
softmax([1, 2, 3])    ≈ [0.09, 0.24, 0.67]  # Moderate spread: still okay
softmax([10, 20, 30]) ≈ [0.00, 0.00, 1.00]  # Large spread: near one-hot!
# Gradients in the flat regions of softmax are tiny → learning stops
# With scaling:
scaled = dot_product / sqrt(512)  # ≈ dot_product / 22.6
# Score differences shrink back to a range where softmax has useful gradients
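A quick empirical sketch of this effect (assuming the components of q and k are independent with unit variance; the seed is arbitrary): the standard deviation of the raw dot product grows like √d_k, and dividing by √d_k brings it back to roughly 1 for every d_k.

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = np.sum(q * k, axis=-1)  # 10,000 independent dot products
    print(d_k, round(dots.std(), 1), round((dots / np.sqrt(d_k)).std(), 2))
# std(q · k) ≈ √d_k (≈ 4, 8, 22.6); after scaling it is ≈ 1 regardless of d_k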
Masking Types
Padding Mask
Ignore <PAD> tokens in variable-length batches
Causal Mask
Prevent attending to future tokens (autoregressive decoding)
# Causal mask for seq_len=4
mask = np.triu(np.ones((4, 4)) * -1e9, k=1)
# [[ 0, -1e9, -1e9, -1e9],
# [ 0, 0, -1e9, -1e9],
# [ 0, 0, 0, -1e9],
# [ 0, 0, 0, 0]]
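A padding mask can be built in the same spirit (a sketch; it assumes token id 0 marks <PAD> and that the mask is added to the scores exactly like the causal mask above):

import numpy as np

token_ids = np.array([[7, 42, 13, 0]])                      # last position is <PAD>
pad = (token_ids == 0)                                      # (batch, seq_k) booleans
# -1e9 on padded key positions, broadcast over every query position
padding_mask = np.where(pad[:, np.newaxis, :], -1e9, 0.0)   # (batch, 1, seq_k)
# scores: (batch, seq_q, seq_k) + (batch, 1, seq_k) broadcasts over seq_q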
Implementation (Stable Softmax)
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    # QK^T / sqrt(d_k) -- swapaxes over the last two axes works for both
    # (batch, seq, d_k) and (batch, heads, seq, d_k) inputs
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d_k)
    # Apply mask (add large negative values)
    if mask is not None:
        scores = scores + mask
    # Stable softmax: subtract max to prevent overflow
    scores_max = np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    # Weighted sum of values
    output = np.matmul(weights, v)
    return output, weights
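A short usage sketch (random inputs, the 4×4 causal mask from above): each query's attention weights sum to 1, and no weight lands on future positions.

import numpy as np

rng = np.random.default_rng(0)
mask = np.triu(np.ones((4, 4)) * -1e9, k=1)    # causal mask from above
q = rng.standard_normal((1, 4, 8))             # (batch, seq, d_k)
k = rng.standard_normal((1, 4, 8))
v = rng.standard_normal((1, 4, 8))

out, weights = scaled_dot_product_attention(q, k, v, mask)
print(out.shape)                               # (1, 4, 8)
print(weights.sum(axis=-1))                    # each query's weights sum to 1
print(np.triu(weights[0], k=1).max())          # ≈ 0: no attention to the future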
Single Attention Head
Learning what to attend to
Why Projections?
Raw attention operates directly on input embeddings. Projections allow the model to learn: what to look for (Q), what to be matched against (K), and what information to extract (V).
The Three Projections
W_Q
"What am I looking for?"
Learns to extract the query - the question being asked
W_K
"What do I contain?"
Learns to extract the key - what can be matched against
W_V
"What info do I give?"
Learns to extract the value - the actual content to retrieve
Q = X @ W_Q  # (batch, seq, d_model) @ (d_model, d_k) → (batch, seq, d_k)
K = X @ W_K  # Same transformation, different learned weights
V = X @ W_V  # Same transformation, different learned weights
Head = Attention(Q, K, V)  # → (batch, seq, d_v)
Self-Attention vs Cross-Attention
Self-Attention
x_q = x_k = x_v
Used in the encoder and in the decoder's masked self-attention. Each token attends to all tokens in the same sequence.
Cross-Attention
x_q from decoder, x_k = x_v from encoder
Decoder attends to encoder outputs: "What in the input is relevant to what I'm generating?"
Implementation
def single_attention_head(x_q, x_k, x_v, W_q, W_k, W_v):
    """
    Args:
        x_q: Query input (batch, seq_q, d_model)
        x_k: Key input (batch, seq_k, d_model)
        x_v: Value input (batch, seq_k, d_model)
        W_q, W_k: (d_model, d_k)
        W_v: (d_model, d_v)
    Returns:
        Output (batch, seq_q, d_v)
    """
    # Project inputs
    Q = np.matmul(x_q, W_q)  # (batch, seq_q, d_k)
    K = np.matmul(x_k, W_k)  # (batch, seq_k, d_k)
    V = np.matmul(x_v, W_v)  # (batch, seq_k, d_v)
    # Apply scaled dot-product attention (the weights are discarded here)
    output, _ = scaled_dot_product_attention(Q, K, V)
    return output
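A shape-focused usage sketch with random weights (the dimensions are made up for illustration; it reuses the single_attention_head function above), covering both the self-attention and cross-attention cases:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 8
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

# Self-attention: queries, keys and values all come from the same sequence
x = rng.standard_normal((2, 5, d_model))
print(single_attention_head(x, x, x, W_q, W_k, W_v).shape)        # (2, 5, 8)

# Cross-attention: queries from the decoder, keys/values from the encoder
dec = rng.standard_normal((2, 3, d_model))                        # seq_q = 3
enc = rng.standard_normal((2, 7, d_model))                        # seq_k = 7
print(single_attention_head(dec, enc, enc, W_q, W_k, W_v).shape)  # (2, 3, 8)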
Multi-Head Attention
Learning multiple relationships
The Key Innovation
A single attention head can only focus on one type of relationship. Multiple heads can simultaneously learn: syntax (subject-verb), semantics (word meanings), positional patterns, long-range dependencies (coreference).
The Formula
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The Efficient Implementation Trick
Naive (Slow)
heads = [attention(Q @ W_Q[i],
                   K @ W_K[i],
                   V @ W_V[i])
         for i in range(h)]
output = concat(heads) @ W_O
Efficient (Used)
# ONE large projection
Q = x @ W_q                                 # W_q is (d_model, d_model)
# Reshape to split heads
Q = Q.reshape(B, S, h, d_k).transpose(0, 2, 1, 3)
# Run attention ONCE
# Heads broadcast as an extra batch dim
Detailed Shape Transformations
Example: batch=32, seq=100, d_model=512, num_heads=8, d_k=64
| Step | Operation | Shape |
|---|---|---|
| 1 | Input | (32, 100, 512) |
| 2 | Q = x @ W_q | (32, 100, 512) |
| 3 | reshape(B, S, h, d_k) | (32, 100, 8, 64) |
| 4 | transpose(0, 2, 1, 3) | (32, 8, 100, 64) |
| 5 | scores = Q @ K^T | (32, 8, 100, 100) |
| 6 | attn = softmax(scores) @ V | (32, 8, 100, 64) |
| 7 | transpose + reshape | (32, 100, 512) |
| 8 | @ W_o | (32, 100, 512) |
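The same trace can be reproduced in a few lines (a sketch with the table's sizes; only Q is followed, and zeros stand in for real activations since only the shapes matter here):

import numpy as np

B, S, d_model, h = 32, 100, 512, 8
d_k = d_model // h
x = np.zeros((B, S, d_model))
W_q = np.zeros((d_model, d_model))

Q = x @ W_q                                              # (32, 100, 512)
Q = Q.reshape(B, S, h, d_k)                              # (32, 100, 8, 64)
Q = Q.transpose(0, 2, 1, 3)                              # (32, 8, 100, 64)
scores = Q @ np.swapaxes(Q, -1, -2)                      # (32, 8, 100, 100)
attn = scores @ Q                                        # stands in for softmax(scores) @ V: (32, 8, 100, 64)
out = attn.transpose(0, 2, 1, 3).reshape(B, S, d_model)  # (32, 100, 512)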
Why the Output Projection W_O?
The concatenated heads have learned different things. W_O:
- Allows heads to interact and share information
- Projects back to the expected d_model dimension
- Adds another layer of learned transformation
Implementation
def multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o):
    batch, seq, d_model = x.shape
    d_k = d_model // num_heads
    # Step 1: Project to Q, K, V
    Q = np.matmul(x, W_q)  # (batch, seq, d_model)
    K = np.matmul(x, W_k)
    V = np.matmul(x, W_v)
    # Step 2-3: Split and transpose
    # (batch, seq, d_model) → (batch, num_heads, seq, d_k)
    Q = Q.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)
    K = K.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)
    V = V.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)
    # Step 4: Attention (batch & heads are both batch dimensions)
    attn_output, _ = scaled_dot_product_attention(Q, K, V)
    # Step 5: Concat heads
    # (batch, heads, seq, d_k) → (batch, seq, d_model)
    concat = attn_output.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)
    # Step 6: Output projection
    return np.matmul(concat, W_o)
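To sanity-check the reshape trick, the sketch below (random weights, small made-up sizes, reusing the functions above) runs the heads one at a time by slicing the projection matrices into per-head column blocks and confirms the result matches multi_head_attention:

import numpy as np

rng = np.random.default_rng(0)
batch, seq, d_model, h = 2, 5, 16, 4
d_k = d_model // h
x = rng.standard_normal((batch, seq, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

# Naive: run each head on its own column slice of the big projection matrices
heads = []
for i in range(h):
    cols = slice(i * d_k, (i + 1) * d_k)
    head, _ = scaled_dot_product_attention(
        x @ W_q[:, cols], x @ W_k[:, cols], x @ W_v[:, cols])
    heads.append(head)
naive = np.concatenate(heads, axis=-1) @ W_o

# Efficient: one big projection, reshape into heads, attend once
fast = multi_head_attention(x, h, W_q, W_k, W_v, W_o)
assert np.allclose(naive, fast)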
Layer Normalization
Stabilizing activations
Why Layer Norm, not Batch Norm?
Batch Norm normalizes across the batch dimension - needs large batches, problematic for variable-length sequences. Layer Norm normalizes across the feature dimension - works with any batch size, each sample independent.
The Formula
LayerNorm(x) = (x − mean(x)) / √(var(x) + ε), with mean and variance taken over the feature (d_model) dimension
Visual Intuition
Input tensor: (batch=2, seq=3, d_model=4)
Batch Norm normalizes ↓ (across the batch for each feature)
Layer Norm normalizes → (across the features for each position)
           d_model=4
          ┌─────────┐
batch=2   │ ■ ■ ■ ■ │  ← Layer Norm normalizes each row
  ×       │ ■ ■ ■ ■ │    independently to mean=0, var=1
seq=3     │ ■ ■ ■ ■ │
= 6 rows  │ ■ ■ ■ ■ │
          │ ■ ■ ■ ■ │
          │ ■ ■ ■ ■ │
          └─────────┘
Why keepdims=True?
x.shape = (2, 3, 4)
# Without keepdims:
mean = np.mean(x, axis=-1) # shape: (2, 3) - can't broadcast!
# With keepdims:
mean = np.mean(x, axis=-1, keepdims=True) # shape: (2, 3, 1)
(x - mean)  # (2, 3, 4) - (2, 3, 1) = (2, 3, 4) ✓ broadcasts!
Implementation
def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    Layer normalization over the last dimension.
    Args:
        x: Input (..., d_model). Typically (batch, seq, d_model)
        eps: Small constant to prevent division by zero
    Returns:
        Normalized array with mean≈0, var≈1 over the last axis
    """
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)
Example Walkthrough
x = [1.0, 2.0, 3.0]
# Step 1: Mean
mean = (1 + 2 + 3) / 3 = 2.0
# Step 2: Variance
var = ((1-2)² + (2-2)² + (3-2)²) / 3 = (1 + 0 + 1) / 3 = 0.6667
# Step 3: Standard deviation
std = sqrt(0.6667 + 1e-5) ≈ 0.8165
# Step 4: Normalize
x_norm = (x - 2.0) / 0.8165 = [-1.2247, 0.0, 1.2247]
# Verify:
mean(x_norm) = 0 ✓
var(x_norm) ≈ 1.0 ✓
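The same walkthrough, checked in code with the layer_norm function above:

import numpy as np

x = np.array([[1.0, 2.0, 3.0]])
y = layer_norm(x)
print(y)                 # [[-1.2247  0.  1.2247]]
print(y.mean(axis=-1))   # ≈ 0
print(y.var(axis=-1))    # ≈ 1 (a touch below 1.0 because of eps)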
Transformer Forward Pass
Complete encoder block
Bringing It All Together
The encoder block combines all previous components: positional encoding, multi-head attention, layer normalization, feed-forward network, and residual connections.
Architecture Diagram
Step 1: Add Positional Encoding
x = x + pos_encoding
# Input x: (batch, seq, d_model) - token embeddings
# pos_encoding: (seq, d_model) - broadcasts to each batch item
# Injects position information into embeddings
Step 2: Self-Attention Sub-layer
# 2a. Multi-head attention
attn_output = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
# 2b. Residual connection - allows gradients to flow directly
x = x + attn_output
# 2c. Layer normalization - stabilizes activations
x = layer_norm(x)
Residual Connection
Allows gradients to flow directly through the network. Makes it easier to learn identity mappings. Enables training very deep networks.
Layer Norm
Normalizes activations to prevent exploding/vanishing values. Applied after each sub-layer (Post-LN) or before (Pre-LN).
Step 3: Feed-Forward Sub-layer
# 3a. Position-wise FFN with ReLU
# First layer expands: d_model (512) → d_ff (2048)
hidden = np.maximum(0, np.matmul(x, W_ff1)) # ReLU activation
# Second layer contracts: d_ff (2048) → d_model (512)
ff_output = np.matmul(hidden, W_ff2)
# 3b. Residual connection
x = x + ff_output
# 3c. Layer normalization
x = layer_norm(x)
Why Expand then Contract?
The larger intermediate dimension (d_ff = 4 × d_model = 2048) allows learning more complex transformations before compressing back. ReLU introduces non-linearity.
Post-LN vs Pre-LN
Post-LN (Original Paper)
x → Attention → Add(x) → LayerNorm
What we implement here
Pre-LN (Modern)
x → LayerNorm → Attention → Add(x)
GPT-2 and most modern Transformers. More stable for deep networks.
Complete Implementation
def transformer_forward(x, pos_encoding, W_q, W_k, W_v, W_o,
                        W_ff1, W_ff2, num_heads):
    """
    Forward pass through a single Transformer encoder block.
    Args:
        x: Input embeddings (batch, seq, d_model)
        pos_encoding: Positional encoding (seq, d_model)
        W_q, W_k, W_v, W_o: Attention weights (d_model, d_model)
        W_ff1: FFN first layer (d_model, d_ff)
        W_ff2: FFN second layer (d_ff, d_model)
        num_heads: Number of attention heads
    Returns:
        Output (batch, seq, d_model)
    """
    # Step 1: Add positional encoding
    x = x + pos_encoding
    # Step 2: Multi-head self-attention sub-layer
    attn_output = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
    x = x + attn_output  # Residual
    x = layer_norm(x)
    # Step 3: Feed-forward sub-layer
    ff_output = np.matmul(np.maximum(0, np.matmul(x, W_ff1)), W_ff2)
    x = x + ff_output  # Residual
    x = layer_norm(x)
    return x
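For comparison, a Pre-LN version of the same block is sketched below (an illustration only, reusing the helpers above; it simply moves layer_norm in front of each sub-layer, and deep Pre-LN stacks usually add one final layer_norm after the last block):

import numpy as np

def transformer_forward_pre_ln(x, pos_encoding, W_q, W_k, W_v, W_o,
                               W_ff1, W_ff2, num_heads):
    """Pre-LN encoder block: normalize before each sub-layer, then add the residual."""
    x = x + pos_encoding
    # Self-attention sub-layer: LayerNorm -> attention -> residual add
    attn_output = multi_head_attention(layer_norm(x), num_heads, W_q, W_k, W_v, W_o)
    x = x + attn_output
    # Feed-forward sub-layer: LayerNorm -> FFN -> residual add
    hidden = np.maximum(0, np.matmul(layer_norm(x), W_ff1))  # ReLU
    x = x + np.matmul(hidden, W_ff2)
    return x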
Parameter Counts
Original paper settings
| Parameter | Value | Description |
|---|---|---|
| d_model | 512 | Model/embedding dimension |
| d_ff | 2048 | FFN inner dimension (4 × d_model) |
| num_heads | 8 | Attention heads |
| d_k = d_v | 64 | Per-head dimension (d_model / num_heads) |
| num_layers | 6 | Encoder/decoder layers (stacked) |
| Weights | Shape | Parameters |
|---|---|---|
| W_q, W_k, W_v, W_o | (512, 512) × 4 | ~1M |
| W_ff1 | (512, 2048) | ~1M |
| W_ff2 | (2048, 512) | ~1M |
Total per layer: ~3M parameters (excluding biases)
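The per-layer totals above can be reproduced with a few lines of arithmetic (biases excluded):

d_model, d_ff = 512, 2048
attention = 4 * d_model * d_model      # W_q, W_k, W_v, W_o -> 1,048,576
ffn = d_model * d_ff + d_ff * d_model  # W_ff1 + W_ff2      -> 2,097,152
print(attention + ffn)                 # 3,145,728 ≈ 3.1M per encoder layer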
Key Takeaways
Self-Attention Enables Parallelism
Every token can attend to every other token simultaneously. O(n²) complexity, but fully parallelizable on GPU/TPU.
Multi-Head = Multiple Perspectives
Each head learns different relationships (syntax, semantics, coreference). Concatenation combines these views.
Residual Connections Enable Depth
Skip connections allow gradients to flow directly through very deep networks. Essential for training 6+ layers.
Layer Norm Stabilizes Training
Normalizing across features (not batch) works with any batch size and variable-length sequences.
Positional Encoding is Crucial
Without it, the model has no sense of order. Sinusoidal encoding is elegant, bounded, and extrapolates.
Scale Factor √d_k Matters
Prevents softmax from becoming too peaked. Keeps gradients flowing through attention weights.
Based on implementations from "Attention Is All You Need" (Vaswani et al., 2017)