🤖 nanochat

Full-stack LLM training in a minimal, hackable codebase

Train and deploy a ChatGPT-like model end-to-end on a single 8×H100 node for ~$100-$800. Includes tokenization, pretraining, midtraining, SFT, optional RL, evaluation, and web serving.

~8,000 lines · ~4 hours training · 561M params

🔄 Training Pipeline

📝 Tokenizer: RustBPE, 65K vocab
📚 Base Pretrain: FineWeb-Edu, 11B tokens
💬 Midtrain: conversation structure
🎯 SFT: task finetuning
🌐 Serve: FastAPI + Web UI

Base training: ~1 hour
Midtraining: ~30 min
SFT: ~30 min
8×H100 cost: ~$100
πŸ—οΈ

Model Architecture

Modern, minimal Transformer

Architecture Choices

Rotary Embeddings: relative positional encoding, no absolute position embeddings
RMSNorm (no params): purely functional normalization, no learnable scale/bias
QK Norm: query and key normalization in attention
ReLU² MLP: uses relu(x).square() instead of GELU
Untied Embeddings: separate wte (token) and lm_head (output) matrices
Logits Softcap: tanh-based capping at 15 for stability
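
Each of these is only a few lines in PyTorch. A minimal sketch (module and tensor names are illustrative, not nanochat's exact code):

import torch
import torch.nn.functional as F

def rms_norm(x):
    # parameter-free RMSNorm: rescale by the root-mean-square, no learnable gain/bias
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)

def relu2_mlp(x, c_fc, c_proj):
    # ReLU² nonlinearity in place of GELU: relu(x).square()
    return c_proj(F.relu(c_fc(x)).square())

def softcap(logits, cap=15.0):
    # tanh softcap: squashes logits smoothly into (-cap, cap) for stability
    return cap * torch.tanh(logits / cap)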

Model Sizing

# Sizing formula (depth=20 gives the 561M-param d20 model)
depth = 20
num_layers = depth                     # 20
model_dim = depth * 64                 # 1280
num_heads = (model_dim + 127) // 128   # ceil(1280 / 128) = 10

seq_len = 2048                         # sequence length in tokens
vocab_size = 65536                     # 2**16
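
The 561M figure is easy to sanity-check by hand. A back-of-the-envelope count, assuming untied embeddings, a 4× MLP expansion, and no biases (assumptions of this sketch, not guarantees about the exact code):

dim, layers, vocab = 1280, 20, 65536

embed = 2 * vocab * dim        # untied wte + lm_head: ~167.8M
attn = 4 * dim * dim           # q, k, v, o projections: ~6.6M per layer
mlp = 8 * dim * dim            # 4x up- and down-projections: ~13.1M per layer
total = embed + layers * (attn + mlp)
print(f"{total / 1e6:.0f}M")   # -> 561M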

Optimizers

Linear layers: Muon (LR=0.02)
Embeddings: AdamW (LR=0.2)
LM Head: AdamW (LR=0.004)
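
In practice this means two optimizers over disjoint parameter groups. A sketch, where Muon stands in for nanochat's own implementation (it is not a torch built-in) and the attribute names are illustrative:

from torch.optim import AdamW
# Muon is nanochat's own optimizer class; import omitted in this sketch

matrix_params = [p for n, p in model.named_parameters()
                 if p.ndim == 2 and "wte" not in n and "lm_head" not in n]
muon = Muon(matrix_params, lr=0.02)                       # hidden linear layers
adamw = AdamW([
    {"params": model.wte.parameters(), "lr": 0.2},        # token embeddings
    {"params": model.lm_head.parameters(), "lr": 0.004},  # output head
])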

📝 Tokenization

RustBPE + Special Tokens

RustBPE Tokenizer

  • Implemented in Rust for speed
  • GPT-4 style pre-tokenization (sketch below)
  • Numbers: 1-2 digits max (modified from GPT-4's 1-3)
  • Vocab size: 65,536 tokens
  • Tiktoken wrapper for inference
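
A sketch of the GPT-4-style split pattern with the number rule tightened from \p{N}{1,3} to \p{N}{1,2}; the authoritative pattern lives in the Rust code and may differ in detail. The possessive quantifiers need the third-party regex module rather than re:

import regex

SPLIT_PATTERN = (
    r"'(?i:[sdmt]|ll|ve|re)"         # contractions ('s, 'll, 've, ...)
    r"|[^\r\n\p{L}\p{N}]?+\p{L}+"    # words, optionally led by one symbol
    r"|\p{N}{1,2}"                   # numbers: at most 2 digits (GPT-4 allows 3)
    r"| ?[^\s\p{L}\p{N}]++[\r\n]*"   # punctuation runs
    r"|\s*[\r\n]|\s+(?!\S)|\s+"      # whitespace handling
)

print(regex.findall(SPLIT_PATTERN, "In 2024 we trained 561M params"))
# ['In', ' ', '20', '24', ' we', ' trained', ' ', '56', '1', 'M', ' params']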

Special Tokens

<|bos|>               document start
<|user_start|>        user message start
<|user_end|>          user message end
<|assistant_start|>   assistant message start
<|assistant_end|>     assistant message end
<|python_start|>      tool call start
<|python_end|>        tool call end
<|output_start|>      tool output start

Conversation Rendering

Conversations are tokenized with attention masks that control which tokens the model learns on.

User: mask = 0 (no loss)
Assistant: mask = 1 (loss applied)
Tool Output: mask = 0 (no loss)
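
A sketch of how a rendered conversation pairs token ids with that mask. The tokenizer helpers are illustrative stand-ins, not nanochat's exact API:

def render_conversation(tokenizer, messages):
    # returns (ids, mask); loss is computed only where mask == 1
    ids, mask = [tokenizer.special("<|bos|>")], [0]
    for msg in messages:
        role = msg["role"]                      # "user" or "assistant"
        body = tokenizer.encode(msg["content"])
        learn = 1 if role == "assistant" else 0
        ids += [tokenizer.special(f"<|{role}_start|>")] + body \
             + [tokenizer.special(f"<|{role}_end|>")]
        # never learn on the start token; learn on assistant body + end token
        mask += [0] + [learn] * len(body) + [learn]
    return ids, mask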

📚 Training Phases

1. Base Pretraining (scripts/base_train.py)

Data

  • FineWeb-Edu 100B dataset
  • ~240 parquet shards (~24GB)
  • 11.2B tokens for the d20 model
  • Chinchilla ratio: 20× params

Config

  • Batch: 524,288 tokens
  • Seq length: 2048
  • Gradient accumulation: derived automatically (sketch below)
  • Mixed precision: bfloat16
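
The accumulation step count falls out of the target batch size. A sketch of the arithmetic, with the per-device micro-batch of 16 rows as an assumed example value:

total_batch_tokens = 524_288     # 2**19 tokens per optimizer step
world_size = 8                   # GPUs
device_rows = 16                 # rows per GPU per micro-step (example value)
seq_len = 2048

tokens_per_micro_step = world_size * device_rows * seq_len      # 262,144
grad_accum_steps = total_batch_tokens // tokens_per_micro_step  # 2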

2. Midtraining (scripts/mid_train.py)

Purpose

  • Learn conversation structure
  • Special token usage
  • Tool scaffolding
  • Multiple-choice format

Data Mix

  • SmolTalk (460K conversations)
  • MMLU auxiliary (100K MC)
  • GSM8K (math with tool tags)
  • Identity conversations

3. Supervised Finetuning (scripts/chat_sft.py)

Purpose

  • Domain/behavior adaptation
  • Task-specific supervision
  • Lower learning rate (2% of base)

Data (~23K rows)

  • ARC-Easy/Challenge
  • GSM8K train
  • SmolTalk subset (10K)
  • Spelling drills

4. Reinforcement Learning (scripts/chat_rl.py, optional)

Algorithm

  • GRPO-like REINFORCE
  • Token-level advantages
  • Mean baseline
  • No KL penalty

Scope

  • GSM8K math problems only
  • Samples k completions per prompt
  • Optimizes for correct answers (sketch below)
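
A sketch of the update: sample k completions for a prompt, score them, and use the group mean reward as the baseline (no KL penalty, no learned critic). Names like model.sample and is_correct are illustrative:

import torch

def reinforce_step(model, prompt, k=16):
    completions = [model.sample(prompt) for _ in range(k)]   # k rollouts
    rewards = torch.tensor([float(is_correct(c)) for c in completions])
    advantages = rewards - rewards.mean()                    # mean baseline
    loss = 0.0
    for completion, adv in zip(completions, advantages):
        logp = model.logprob(prompt, completion)  # summed token log-probs
        loss = loss - adv * logp                  # REINFORCE objective
    (loss / k).backward()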

⚡ Inference Engine

Streaming generation with KV cache

KV Cache

  • Maintains keys/values per layer
  • Dynamic growth (1024-token chunks)
  • Batch prefill → replicate for N samples
  • Position tracking for RoPE offset

Sampling

  • Temperature-based softmax
  • Top-k filtering
  • Greedy when temperature=0
  • Streaming token generation
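
These sampling rules compose into a few lines. A minimal sketch over a (batch, vocab) logits tensor:

import torch

def sample_next(logits, temperature=1.0, top_k=50):
    if temperature == 0.0:
        return logits.argmax(dim=-1, keepdim=True)   # greedy decoding
    logits = logits / temperature
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[..., -1, None]] = -float("inf")  # keep only top-k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # one token per row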

Tool Use: Python Calculator

How it works

  1. Detect <|python_start|>
  2. Extract expression until <|python_end|>
  3. Evaluate safely with timeout
  4. Inject <|output_*|> tokens

Safety

  • Allowed: +, -, *, /, (), .count()
  • Blocked: __, import, exec, eval
  • 3-second timeout
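
A sketch of the guarded evaluation step using the blocklist above. The real implementation differs in detail, and the 3-second timeout would be enforced by the caller (e.g. by running this in a subprocess), which is omitted here:

def safe_eval(expr: str):
    # reject dunder access and anything import/exec/eval-shaped
    if any(bad in expr for bad in ("__", "import", "exec", "eval")):
        return None
    try:
        # empty __builtins__ blocks open(), getattr(), and friends
        return eval(expr, {"__builtins__": {}}, {})
    except Exception:
        return None

print(safe_eval("(2 + 3) * 7"))        # 35
print(safe_eval("__import__('os')"))   # None (blocked)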

🌐 Web Serving

FastAPI + ChatGPT-like UI

Backend (FastAPI)

  • Worker pool across GPUs
  • Async request handling
  • Queue-based distribution
  • OpenAI-compatible API

Endpoints

GET  /                   chat UI
POST /chat/completions   streaming API
GET  /health             status

Abuse Prevention

Max messages: 500
Max chars/message: 8,000
Max total chars: 32,000
Temperature: 0.0-2.0
Top-k: 1-200
Max tokens: 1-4,096
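
A sketch of the streaming endpoint's shape, using the routes from the table above; the handler internals (engine, worker pool) are illustrative stand-ins:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/completions")
async def chat_completions(request: dict):
    def token_stream():
        # engine.generate is a stand-in for the queue/worker-pool call
        for token in engine.generate(
            request["messages"],
            temperature=request.get("temperature", 0.8),
        ):
            yield token
    return StreamingResponse(token_stream(), media_type="text/event-stream")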

📊 Evaluation

Base Model Eval

  • CORE metric (Chinchilla benchmark)
  • Bits per byte on validation (formula below)
  • Sampling/generation tests
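
Bits per byte normalizes the validation loss by the raw text size, which makes it comparable across tokenizers with different vocabularies:

import math

# total_loss_nats: summed cross-entropy over all validation tokens (in nats)
# total_bytes: UTF-8 byte length of the validation text
bpb = total_loss_nats / (math.log(2) * total_bytes)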

Chat Model Eval

  • ARC-Easy/Challenge (MC accuracy)
  • MMLU (multi-domain MC)
  • GSM8K (math problems)
  • HumanEval (Python code)
  • SpellingBee (letter counting)
  • ChatCORE (mean-centered accuracy)

📦 Data & Tasks

Task Framework

Task: base class with indexing, slicing, evaluation
TaskMixture: combines tasks with deterministic shuffling
TaskSequence: sequential curriculum learning
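
A sketch of what the mixture abstraction implies (illustrative, not the repo's exact classes): flatten the member tasks and shuffle with a fixed seed, so every run sees the identical order:

import random

class TaskMixture:
    """Interleaves examples from several tasks with a deterministic shuffle."""
    def __init__(self, tasks, seed=42):
        self.examples = [ex for task in tasks for ex in task]
        random.Random(seed).shuffle(self.examples)  # same order every run
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i):
        return self.examples[i]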

Available Tasks

MMLU, ARC, GSM8K, HumanEval, SmolTalk, CustomJSON, SpellingBee, SimpleSpelling

🔧 Distributed Training

PyTorch DDP

  • Multi-GPU training
  • Rank-based work distribution
  • Barrier synchronization
  • Master rank handles checkpointing

Compilation

  • torch.compile()
  • dynamic=False for base (static shapes)
  • dynamic=True for SFT (variable lengths)
  • ~2-3x speedup
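
The dynamic flag is the main thing that changes between phases. A sketch:

import torch

# base pretraining: every batch has the same (rows, seq_len) shape,
# so static-shape compilation avoids recompiles entirely
model = torch.compile(model, dynamic=False)

# SFT: per-conversation sequence lengths vary,
# so mark shapes dynamic up front instead of recompiling per length
model = torch.compile(model, dynamic=True)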
πŸƒ

speedrun.sh

End-to-end in ~4 hours

1. Setup: create venv, install dependencies
2. Tokenizer: build RustBPE, download 8 shards, train vocab=65536
3. Base Training: download 240 shards, pretrain d20 (561M params, 11.2B tokens)
4. Midtraining: download identity data, train on task mixture
5. SFT: train on chat-style tasks
6. Report: generate markdown report card
7. Serve (optional): python -m scripts.chat_web

Artifacts Location

~/.cache/nanochat/
├── tokenizer/
├── base_data/
├── base_checkpoints/
├── mid_checkpoints/
├── chatsft_checkpoints/
└── report/

💡 Key Takeaways

Minimal but complete

~8000 lines covering tokenization, pretraining, SFT, RL, inference, and web serving. Easy to fork and customize.

Type-specific optimizers

Muon for linear layers, AdamW for embeddings. Different learning rates per parameter type.

Selective loss masking

Only train on assistant/tool tokens. User input and tool outputs are masked (loss=0).

Tool use via state machine

Python calculator integrated into generation. Parses <|python_*|> blocks and injects results.