
"Attention Is All You Need" (2017)

The Transformer Architecture

A deep dive into every component: from positional encoding to multi-head attention to the full encoder block

Vaswani et al. β€’ Google Brain β€’ 2017

Overview

The Transformer revolutionized NLP by replacing recurrence with self-attention. Every token can attend to every other token in parallel, enabling massive speedups on GPUs/TPUs. This page breaks down all 6 core components that build up to a complete encoder block.

πŸ—οΈ

Architecture Overview

The 6 building blocks

1. Positional Encoding: sin/cos position info
2. Scaled Dot-Product Attention: softmax(QK^T/√d_k)V
3. Single Attention Head: Q, K, V projections
4. Multi-Head Attention: h parallel attention heads
5. Layer Normalization: stabilize activations
6. Transformer Forward Pass: complete encoder block (with a position-wise FFN/MLP sub-layer)
1. Positional Encoding

Injecting sequence order

The Problem

Transformers process all tokens in parallel with no recurrence or convolution. "The cat sat on the mat" and "mat the on sat cat the" would look identical without position information.

The Formula

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

- Even dimensions: sine
- Odd dimensions: cosine
- Low dimensions: high frequency (short wavelengths)
- High dimensions: low frequency (long wavelengths)
Key properties:

- Bounded [-1, 1]: no exploding values - sine and cosine are always bounded
- Unique per position: each position gets a distinct encoding fingerprint
- Relative positions learnable: PE(pos+k) is a linear function of PE(pos)
- Extrapolation: works for sequences longer than the training data

Implementation

import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Position indices: (seq_len, 1)
    pos = np.arange(seq_len, dtype=np.float32)[:, np.newaxis]

    # Dimension indices for even positions: 0, 2, 4, ...
    i = np.arange(0, d_model, 2, dtype=np.float32)

    # Division term: 1/10000^(2i/d_model)
    # Using exp/log for numerical stability
    div_term = np.exp(-i * np.log(10000.0) / d_model)

    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(pos * div_term)  # even indices
    pe[:, 1::2] = np.cos(pos * div_term)  # odd indices
    return pe

Example Output

pe = positional_encoding(seq_len=4, d_model=4)

# Position 0: [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]
# Position 1: [0.84, 0.54, 0.01, 1.0]
#              ↑ high freq changes fast   ↑ low freq changes slowly
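
As a quick sanity check, the short sketch below (reusing the positional_encoding function above; the sizes 128 and 64 are arbitrary choices) confirms two properties listed earlier: values stay bounded in [-1, 1] and every position gets a distinct encoding.

pe = positional_encoding(seq_len=128, d_model=64)
print(pe.min() >= -1.0 and pe.max() <= 1.0)   # True: sine/cosine keep values bounded
print(len(np.unique(pe, axis=0)) == 128)      # True: all 128 position rows are distinct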
2. Scaled Dot-Product Attention

The heart of the Transformer

Core Intuition

This is where tokens "talk" to each other. Each query asks "what should I pay attention to?", keys answer "here's what I contain", and values provide "here's the information I'll give you".

The Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Step-by-Step Data Flow

1. Compute Similarity Scores

QK^T: (batch, seq_q, d_k) @ (batch, d_k, seq_k) → (batch, seq_q, seq_k)

scores[i][j] = how much query_i should attend to key_j

2. Scale by √d_k

Large d_k → dot products with large variance → softmax becomes peaked (near one-hot)

Scaling prevents gradient vanishing in softmax's flat regions

3. Apply Mask (optional)

Add -1e9 to positions that shouldn't attend

After softmax, these become ~0

4. Softmax over Keys

Convert scores to probabilities (sum to 1 per query)

Each query's attention weights form a probability distribution

5. Weighted Sum of Values

weights @ V: (batch, seq_q, seq_k) @ (batch, seq_k, d_v) → (batch, seq_q, d_v)

Output[i] = Σⱼ (attention_weight[i,j] × V[j])

Why Scale by √d_k?

# Without scaling:
d_k = 512
dot_product = q · k  # variance grows with d_k, so score gaps get large (std ≈ √512 ≈ 22.6)

# Softmax over widely separated scores → extremely peaked distribution
softmax([1, 2, 3])    ≈ [0.09, 0.24, 0.67]   # Still okay
softmax([10, 20, 30]) ≈ [2e-9, 4.5e-5, 1.0]  # Near one-hot!

# Gradients in the flat regions of softmax are tiny → learning stops

# With scaling:
scaled = dot_product / sqrt(512) ≈ dot_product / 22.6
# Now score gaps stay in a range where softmax has good gradients
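
The claim that dot-product spread grows with d_k can be checked numerically. The sketch below (an illustration, not code from the paper) draws random unit-variance vectors and compares the spread of raw vs. scaled dot products.

rng = np.random.default_rng(0)
for d_k in (16, 64, 512):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=-1)           # 10,000 random dot products
    print(d_k, dots.std().round(1), (dots / np.sqrt(d_k)).std().round(2))
# std of q · k grows like √d_k (≈ 4, 8, 22.6); the scaled version stays ≈ 1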

Masking Types

Padding Mask

Ignore <PAD> tokens in variable-length batches

Causal Mask

Prevent attending to future tokens (autoregressive decoding)

# Causal mask for seq_len=4
mask = np.triu(np.ones((4, 4)) * -1e9, k=1)
# [[   0, -1e9, -1e9, -1e9],
#  [   0,    0, -1e9, -1e9],
#  [   0,    0,    0, -1e9],
#  [   0,    0,    0,    0]]
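
The padding mask can be built the same way. A minimal sketch, assuming a boolean array is_pad marking <PAD> positions (not something defined above):

is_pad = np.array([[False, False, True, True]])           # (batch=1, seq_k=4): last two tokens are padding
pad_mask = np.where(is_pad, -1e9, 0.0)[:, np.newaxis, :]  # (batch, 1, seq_k), broadcasts over all queries
# Adding pad_mask to the scores drives the <PAD> columns to ~0 after softmax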

Implementation (Stable Softmax)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]

    # QK^T / sqrt(d_k)
    scores = np.matmul(q, k.transpose(0, 2, 1)) / np.sqrt(d_k)

    # Apply mask (add large negative values)
    if mask is not None:
        scores = scores + mask

    # Stable softmax: subtract max to prevent overflow
    scores_max = np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Weighted sum of values
    output = np.matmul(weights, v)
    return output, weights
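
A brief usage sketch with the causal mask from above (random inputs, illustrative shapes): masked positions receive zero weight and each query's weights sum to 1.

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(1, 4, 8))
causal = np.triu(np.ones((4, 4)) * -1e9, k=1)          # broadcasts against the (1, 4, 4) scores
out, weights = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape)                                       # (1, 4, 8)
print(np.allclose(weights.sum(axis=-1), 1.0))          # True: each row is a probability distribution
print(np.allclose(np.triu(weights[0], k=1), 0.0))      # True: no attention to future positions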
3. Single Attention Head

Learning what to attend to

Why Projections?

Raw attention operates directly on input embeddings. Projections allow the model to learn: what to look for (Q), what to be matched against (K), and what information to extract (V).

The Three Projections

W_Q

"What am I looking for?"

Learns to extract the query - the question being asked

W_K

"What do I contain?"

Learns to extract the key - what can be matched against

W_V

"What info do I give?"

Learns to extract the value - the actual content to retrieve

Q = X @ W_Q    # (batch, seq, d_model) @ (d_model, d_k) → (batch, seq, d_k)
K = X @ W_K    # Same transformation, different learned weights
V = X @ W_V    # Same transformation, different learned weights

Head = Attention(Q, K, V)  # → (batch, seq, d_v)

Self-Attention vs Cross-Attention

Self-Attention

x_q = x_k = x_v

Used in encoder, decoder masked self-attention. Each token attends to all tokens in the same sequence.

Cross-Attention

x_q from decoder, x_k = x_v from encoder

Decoder attends to encoder outputs. "What in the input is relevant to what I'm generating?"

Implementation

def single_attention_head(x_q, x_k, x_v, W_q, W_k, W_v):
    """
    Args:
        x_q: Query input (batch, seq_q, d_model)
        x_k: Key input (batch, seq_k, d_model)
        x_v: Value input (batch, seq_k, d_model)
        W_q, W_k: (d_model, d_k)
        W_v: (d_model, d_v)

    Returns:
        Output (batch, seq_q, d_v)
    """
    # Project inputs
    Q = np.matmul(x_q, W_q)  # (batch, seq_q, d_k)
    K = np.matmul(x_k, W_k)  # (batch, seq_k, d_k)
    V = np.matmul(x_v, W_v)  # (batch, seq_k, d_v)

    # Apply scaled dot-product attention; keep the output, discard the weights
    output, _ = scaled_dot_product_attention(Q, K, V)
    return output
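
A usage sketch contrasting the two modes (all dimensions and random weights here are illustrative assumptions): self-attention feeds the same sequence as query, key, and value inputs, while cross-attention uses decoder states as queries against encoder outputs.

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 8
W_q = 0.1 * rng.normal(size=(d_model, d_k))
W_k = 0.1 * rng.normal(size=(d_model, d_k))
W_v = 0.1 * rng.normal(size=(d_model, d_v))

enc = rng.normal(size=(1, 6, d_model))   # encoder outputs (6 source tokens)
dec = rng.normal(size=(1, 3, d_model))   # decoder states (3 target tokens)

self_out = single_attention_head(enc, enc, enc, W_q, W_k, W_v)    # (1, 6, d_v)
cross_out = single_attention_head(dec, enc, enc, W_q, W_k, W_v)   # (1, 3, d_v)
print(self_out.shape, cross_out.shape)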
4. Multi-Head Attention

Learning multiple relationships

The Key Innovation

A single attention head can only focus on one type of relationship. Multiple heads can simultaneously learn: syntax (subject-verb), semantics (word meanings), positional patterns, long-range dependencies (coreference).

The Formula

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

The Efficient Implementation Trick

Naive (Slow)

heads = [attention(Q @ W_Q[i],
                  K @ W_K[i],
                  V @ W_V[i])
         for i in range(h)]
output = concat(heads) @ W_O

Efficient (Used)

# ONE large projection for all heads
Q = x @ W_q   # W_q: (d_model, d_model)
# Reshape to split heads
Q = Q.reshape(B, S, h, d_k).transpose(0, 2, 1, 3)
# Run attention ONCE
# Heads broadcast as a batch dim

Detailed Shape Transformations

Example: batch=32, seq=100, d_model=512, num_heads=8, d_k=64

Step  Operation                   Shape
1     Input                       (32, 100, 512)
2     Q = x @ W_q                 (32, 100, 512)
3     reshape(B, S, h, d_k)       (32, 100, 8, 64)
4     transpose(0, 2, 1, 3)       (32, 8, 100, 64)
5     scores = Q @ K^T            (32, 8, 100, 100)
6     attn = softmax(scores) @ V  (32, 8, 100, 64)
7     transpose + reshape         (32, 100, 512)
8     @ W_o                       (32, 100, 512)
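
Steps 3-4 and 7 are pure bookkeeping: splitting into heads and merging back loses no information. A small check (sizes follow the example above):

B, S, h, d_k = 32, 100, 8, 64
Q = np.random.default_rng(0).normal(size=(B, S, h * d_k))
split = Q.reshape(B, S, h, d_k).transpose(0, 2, 1, 3)         # (32, 8, 100, 64)
merged = split.transpose(0, 2, 1, 3).reshape(B, S, h * d_k)   # (32, 100, 512)
print(np.array_equal(Q, merged))                              # True: lossless round trip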

Why the Output Projection W_O?

The concatenated heads have learned different things. W_O:

- Allows heads to interact and share information
- Projects back to the expected d_model dimension
- Adds another layer of learned transformation

Implementation

def multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o):
    batch, seq, d_model = x.shape
    d_k = d_model // num_heads

    # Step 1: Project to Q, K, V
    Q = np.matmul(x, W_q)  # (batch, seq, d_model)
    K = np.matmul(x, W_k)
    V = np.matmul(x, W_v)

    # Step 2-3: Split and transpose
    # (batch, seq, d_model) → (batch, num_heads, seq, d_k)
    Q = Q.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)
    K = K.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)
    V = V.reshape(batch, seq, num_heads, d_k).transpose(0, 2, 1, 3)

    # Step 4: Attention (batch & heads are both batch dimensions)
    attn_output, _ = scaled_dot_product_attention(Q, K, V)  # returns (output, weights); keep the output

    # Step 5: Concat heads
    # (batch, heads, seq, d_k) → (batch, seq, d_model)
    concat = attn_output.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)

    # Step 6: Output projection
    return np.matmul(concat, W_o)
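
A usage sketch with random weights (the 0.02 initialization scale is an assumption; dimensions follow the shape table above):

rng = np.random.default_rng(0)
batch, seq, d_model, num_heads = 32, 100, 512, 8
x = rng.normal(size=(batch, seq, d_model)).astype(np.float32)
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)).astype(np.float32)
                      for _ in range(4))
out = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
print(out.shape)   # (32, 100, 512)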
5. Layer Normalization

Stabilizing activations

Why Layer Norm, not Batch Norm?

Batch Norm normalizes across the batch dimension - needs large batches, problematic for variable-length sequences. Layer Norm normalizes across the feature dimension - works with any batch size, each sample independent.

The Formula

Step 1: μ  = (1/d) Σᵢ xᵢ               (mean over features)
Step 2: σ² = (1/d) Σᵢ (xᵢ - μ)²        (variance)
Step 3: x̂  = (x - μ) / √(σ² + ε)       (normalize)

Visual Intuition

Input tensor: (batch=2, seq=3, d_model=4)

Batch Norm normalizes ↓ (across batch for each feature)
Layer Norm normalizes → (across features for each position)

         d_model=4
        ┌─────────┐
batch=2 │ → → → → │  Layer Norm normalizes each row
seq=3   │ → → → → │  independently to mean=0, var=1
        │ → → → → │
        │ → → → → │
        │ → → → → │
        │ → → → → │
        └─────────┘
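
The same contrast in code, as a tiny sketch (the toy tensor is an illustrative assumption): Layer Norm reduces over the last axis, Batch Norm over the batch and sequence axes.

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)   # (batch, seq, d_model)
ln = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)           # per (batch, seq) position
bn = (x - x.mean(axis=(0, 1), keepdims=True)) / x.std(axis=(0, 1), keepdims=True)   # per feature, across the batch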

Why keepdims=True?

x.shape = (2, 3, 4)

# Without keepdims:
mean = np.mean(x, axis=-1)  # shape: (2, 3) - can't broadcast!

# With keepdims:
mean = np.mean(x, axis=-1, keepdims=True)  # shape: (2, 3, 1)

(x - mean)  # (2, 3, 4) - (2, 3, 1) = (2, 3, 4) ✓ broadcasts!

Implementation

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    Layer normalization over the last dimension.

    Args:
        x: Input (..., d_model). Typically (batch, seq, d_model)
        eps: Small constant to prevent division by zero

    Returns:
        Normalized array with mean≈0, var≈1 over last axis
    """
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

Example Walkthrough

x = [1.0, 2.0, 3.0]

# Step 1: Mean
mean = (1 + 2 + 3) / 3 = 2.0

# Step 2: Variance
var = ((1-2)² + (2-2)² + (3-2)²) / 3 = (1 + 0 + 1) / 3 = 0.6667

# Step 3: Standard deviation
std = sqrt(0.6667 + 1e-5) ≈ 0.8165

# Step 4: Normalize
x_norm = (x - 2.0) / 0.8165 = [-1.2247, 0.0, 1.2247]

# Verify:
mean(x_norm) = 0 ✓
var(x_norm) = 1.0 ✓
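
The same walkthrough can be verified with the layer_norm function defined above:

x = np.array([[1.0, 2.0, 3.0]])
y = layer_norm(x)
print(y)                                # ≈ [[-1.2247  0.  1.2247]]
print(y.mean(axis=-1), y.var(axis=-1))  # ≈ [0.], ≈ [1.]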
6. Transformer Forward Pass

Complete encoder block

Bringing It All Together

The encoder block combines all previous components: positional encoding, multi-head attention, layer normalization, feed-forward network, and residual connections.

Architecture Diagram

Input (batch, seq, d_model)
  + Positional Encoding
Self-Attention Sub-layer:
  Multi-Head Attention → + Residual → LayerNorm
Feed-Forward Sub-layer:
  FFN: ReLU(xW₁)W₂ → + Residual → LayerNorm
Output (batch, seq, d_model)

Step 1: Add Positional Encoding

x = x + pos_encoding
# Input x: (batch, seq, d_model) - token embeddings
# pos_encoding: (seq, d_model) - broadcasts to each batch item
# Injects position information into embeddings

Step 2: Self-Attention Sub-layer

# 2a. Multi-head attention
attn_output = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)

# 2b. Residual connection - allows gradients to flow directly
x = x + attn_output

# 2c. Layer normalization - stabilizes activations
x = layer_norm(x)

Residual Connection

Allows gradients to flow directly through the network. Makes it easier to learn identity mappings. Enables training very deep networks.

Layer Norm

Normalizes activations to prevent exploding/vanishing values. Applied after each sub-layer (Post-LN) or before (Pre-LN).

Step 3: Feed-Forward Sub-layer

# 3a. Position-wise FFN with ReLU
# First layer expands: d_model (512) β†’ d_ff (2048)
hidden = np.maximum(0, np.matmul(x, W_ff1))  # ReLU activation

# Second layer contracts: d_ff (2048) β†’ d_model (512)
ff_output = np.matmul(hidden, W_ff2)

# 3b. Residual connection
x = x + ff_output

# 3c. Layer normalization
x = layer_norm(x)

Why Expand then Contract?

The larger intermediate dimension (d_ff = 4 × d_model = 2048) allows learning more complex transformations before compressing back. ReLU introduces non-linearity.

Post-LN vs Pre-LN

Post-LN (Original Paper)

x → Attention → Add(x) → LayerNorm

What we implement here

Pre-LN (Modern)

x → LayerNorm → Attention → Add(x)

GPT-2, modern transformers. More stable for deep networks.
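
For contrast, a minimal sketch of the Pre-LN ordering for the attention sub-layer, reusing the functions defined on this page (the complete implementation below keeps the original Post-LN ordering):

def pre_ln_attention_sublayer(x, num_heads, W_q, W_k, W_v, W_o):
    normed = layer_norm(x)                                                   # normalize first
    return x + multi_head_attention(normed, num_heads, W_q, W_k, W_v, W_o)   # attend, then add the residual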

Complete Implementation

def transformer_forward(x, pos_encoding, W_q, W_k, W_v, W_o,
                        W_ff1, W_ff2, num_heads):
    """
    Forward pass through a single Transformer encoder block.

    Args:
        x: Input embeddings (batch, seq, d_model)
        pos_encoding: Positional encoding (seq, d_model)
        W_q, W_k, W_v, W_o: Attention weights (d_model, d_model)
        W_ff1: FFN first layer (d_model, d_ff)
        W_ff2: FFN second layer (d_ff, d_model)
        num_heads: Number of attention heads

    Returns:
        Output (batch, seq, d_model)
    """
    # Step 1: Add positional encoding
    x = x + pos_encoding

    # Step 2: Multi-head self-attention sub-layer
    attn_output = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
    x = x + attn_output  # Residual
    x = layer_norm(x)

    # Step 3: Feed-forward sub-layer
    ff_output = np.matmul(np.maximum(0, np.matmul(x, W_ff1)), W_ff2)
    x = x + ff_output  # Residual
    x = layer_norm(x)

    return x
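
An end-to-end usage sketch with random weights (sizes follow the original paper settings listed below; the 0.02 initialization scale is an assumption):

rng = np.random.default_rng(0)
batch, seq, d_model, d_ff, num_heads = 2, 10, 512, 2048, 8

x = rng.normal(size=(batch, seq, d_model)).astype(np.float32)
pe = positional_encoding(seq, d_model)
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)).astype(np.float32)
                      for _ in range(4))
W_ff1 = 0.02 * rng.normal(size=(d_model, d_ff)).astype(np.float32)
W_ff2 = 0.02 * rng.normal(size=(d_ff, d_model)).astype(np.float32)

out = transformer_forward(x, pe, W_q, W_k, W_v, W_o, W_ff1, W_ff2, num_heads)
print(out.shape)   # (2, 10, 512)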

Parameter Counts

Original paper settings

Parameter     Value   Description
d_model       512     Model/embedding dimension
d_ff          2048    FFN inner dimension (4 × d_model)
num_heads     8       Attention heads
d_k = d_v     64      Per-head dimension (d_model / num_heads)
num_layers    6       Encoder/decoder layers (stacked)

W_q, W_k, W_v, W_o    (512, 512) × 4    ~1M
W_ff1                 (512, 2048)       ~1M
W_ff2                 (2048, 512)       ~1M

Total per layer: ~3M parameters (excluding biases)
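
The per-layer total is easy to verify:

attn_params = 4 * 512 * 512             # W_q, W_k, W_v, W_o
ffn_params = 512 * 2048 + 2048 * 512    # W_ff1 + W_ff2
print(attn_params + ffn_params)         # 3,145,728 ≈ 3M weights per encoder layer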


Key Takeaways

1. Self-Attention Enables Parallelism

Every token can attend to every other token simultaneously. O(n²) complexity, but fully parallelizable on GPU/TPU.

2. Multi-Head = Multiple Perspectives

Each head learns different relationships (syntax, semantics, coreference). Concatenation combines these views.

3. Residual Connections Enable Depth

Skip connections allow gradients to flow directly through very deep networks. Essential for training 6+ layers.

4. Layer Norm Stabilizes Training

Normalizing across features (not batch) works with any batch size and variable-length sequences.

5. Positional Encoding is Crucial

Without it, the model has no sense of order. Sinusoidal encoding is elegant, bounded, and extrapolates.

6. Scale Factor √d_k Matters

Prevents softmax from becoming too peaked. Keeps gradients flowing through attention weights.

Based on implementations from "Attention Is All You Need" (Vaswani et al., 2017)