NanoChat
Full-stack LLM training in a minimal, hackable codebase
Train and deploy a ChatGPT-like model end-to-end on a single 8×H100 node for ~$100-$800. Includes tokenization, pretraining, midtraining, SFT, optional RL, evaluation, and web serving.
Training Pipeline
Tokenizer
RustBPE, 65K vocab
Base Pretrain
FineWeb-Edu 11B tokens
Midtrain
Conversation structure
SFT
Task finetuning
Serve
FastAPI + Web UI
Model Architecture
Modern, minimal Transformer
Architecture Choices
Model Sizing
# Sizing formula (depth d20 = 561M params)
num_layers = depth                    # 20
model_dim  = depth * 64               # 1280
num_heads  = ceil(model_dim / 128)    # 10

# Sequence length
seq_len = 2048                        # tokens

# Vocab size
vocab_size = 65536                    # 2^16

Optimizers
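Per the Key Takeaways below, Muon optimizes the hidden linear (matrix) parameters while AdamW handles embeddings, with different learning rates per parameter type. A minimal sketch of how such a split might look; the name checks, learning rates, and the Muon import path are illustrative assumptions, not nanochat's exact code.

# Sketch: type-specific optimizer setup. Learning rates, the parameter-name
# checks, and the Muon import path are assumptions for illustration.
import torch
from muon import Muon   # assumed: a Muon implementation shipped with the repo

def build_optimizers(model, matrix_lr=0.02, embedding_lr=0.2):
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)     # hidden linear/attention matrices -> Muon
        else:
            other_params.append(p)      # embeddings, unembedding, 1-D params -> AdamW
    return [
        Muon(matrix_params, lr=matrix_lr, momentum=0.95),
        torch.optim.AdamW(other_params, lr=embedding_lr, betas=(0.9, 0.95), weight_decay=0.0),
    ]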
Tokenization
RustBPE + Special Tokens
RustBPE Tokenizer
- Implemented in Rust for speed
- GPT-4 style pre-tokenization
- Numbers: 1-2 digits max (modified from GPT-4's 1-3)
- Vocab size: 65,536 tokens
- Tiktoken wrapper for inference
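A rough sketch of how a RustBPE-trained vocab could be wrapped with tiktoken for inference. The split pattern below is only an approximation of the GPT-4 pattern with the 1-2-digit number change, and placing special tokens at the top of the 65,536-entry vocab is an assumption.

# Sketch: wrapping trained BPE merges with tiktoken. Split pattern and
# special-token placement are illustrative assumptions.
import tiktoken

SPLIT_PATTERN = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

SPECIAL_TOKENS = [
    "<|bos|>", "<|user_start|>", "<|user_end|>", "<|assistant_start|>",
    "<|assistant_end|>", "<|python_start|>", "<|python_end|>", "<|output_start|>",
]

def load_encoding(mergeable_ranks: dict) -> tiktoken.Encoding:
    """mergeable_ranks: bytes -> rank mapping produced by RustBPE training."""
    base = len(mergeable_ranks)
    return tiktoken.Encoding(
        name="nanochat",
        pat_str=SPLIT_PATTERN,
        mergeable_ranks=mergeable_ranks,
        special_tokens={tok: base + i for i, tok in enumerate(SPECIAL_TOKENS)},
    )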
Special Tokens
<|bos|> <|user_start|> <|user_end|> <|assistant_start|> <|assistant_end|> <|python_start|> <|python_end|> <|output_start|>

Conversation Rendering
Conversations are tokenized with loss masks that control which tokens the model learns on.
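A minimal sketch of what that rendering might look like, assuming a tiktoken-style encoder and the special tokens above; the exact format nanochat renders (and how it handles tool turns) may differ in detail.

# Sketch: render a conversation to token IDs plus a per-token mask
# (1 = compute loss, 0 = ignore). Simplified; tool-output handling omitted.
def render_conversation(messages, enc):
    ids, mask = [], []

    def emit(text, train, special=False):
        toks = enc.encode(text, allowed_special="all") if special else enc.encode(text)
        ids.extend(toks)
        mask.extend([int(train)] * len(toks))

    emit("<|bos|>", False, special=True)
    for msg in messages:
        if msg["role"] == "user":
            emit("<|user_start|>", False, special=True)
            emit(msg["content"], False)                    # user text: loss masked out
            emit("<|user_end|>", False, special=True)
        else:
            emit("<|assistant_start|>", False, special=True)
            emit(msg["content"], True)                     # assistant text: loss applied
            emit("<|assistant_end|>", True, special=True)  # learn to emit the stop token
    return ids, mask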
Training Phases
Base Pretraining
scripts/base_train.py
Data
- FineWeb-Edu 100B dataset
- ~240 parquet shards (~24GB)
- 11.2B tokens for d20 model
- Chinchilla ratio: 20× params
Config
- Batch: 524,288 tokens
- Seq length: 2048
- Gradient accumulation: auto
- Mixed precision: bfloat16
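The numbers above imply the following rough arithmetic; the per-device micro-batch of 32 sequences is an illustrative assumption, and gradient accumulation is derived from it.

# Sketch: how the token budget and step count fall out of the config above.
n_params      = 561e6                      # d20 model
target_tokens = 20 * n_params              # Chinchilla ratio: ~11.2B tokens
batch_tokens  = 524_288                    # tokens per optimizer step
num_steps     = int(target_tokens // batch_tokens)   # ~21,400 steps

# Gradient accumulation (device micro-batch of 32 sequences is an assumption)
seq_len, world_size, device_batch = 2048, 8, 32
tokens_per_fwd   = seq_len * device_batch * world_size    # 524,288
grad_accum_steps = max(1, batch_tokens // tokens_per_fwd)  # 1 here; grows as the
                                                           # micro-batch shrinks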
Midtraining
scripts/mid_train.py
Purpose
- Learn conversation structure
- Special token usage
- Tool scaffolding
- Multiple-choice format
Data Mix
- SmolTalk (460K conversations)
- MMLU auxiliary (100K MC)
- GSM8K (math with tool tags)
- Identity conversations
Supervised Finetuning
scripts/chat_sft.py
Purpose
- Domain/behavior adaptation
- Task-specific supervision
- Lower learning rate (2% of base)
Data (~23K rows)
- ARC-Easy/Challenge
- GSM8K train
- SmolTalk subset (10K)
- Spelling drills
Reinforcement Learning (Optional)
scripts/chat_rl.py
Algorithm
- GRPO-like REINFORCE
- Token-level advantages
- Mean baseline
- No KL penalty
Scope
- GSM8K math problems only
- Samples k completions
- Optimizes for correct answers
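A compact sketch of the objective for one prompt, assuming k sampled completions and binary correctness rewards; tensor shapes and function names are illustrative.

# Sketch: GRPO-style REINFORCE loss for one prompt (no KL penalty term).
import torch

def reinforce_loss(logprobs, mask, rewards):
    """
    logprobs: (k, T) log-probs of the sampled tokens for k completions
    mask:     (k, T) 1.0 for generated tokens that count toward the loss
    rewards:  (k,)   1.0 if the extracted answer was correct, else 0.0
    """
    advantages = rewards - rewards.mean()          # mean baseline over the k samples
    adv = advantages[:, None] * mask               # broadcast to token level
    # REINFORCE: maximize advantage-weighted log-likelihood of sampled tokens
    return -(adv * logprobs).sum() / mask.sum().clamp(min=1)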
Inference Engine
Streaming generation with KV cache
KV Cache
- Maintains keys/values per layer
- Dynamic growth (1024-token chunks)
- Batch prefill → replicate for N samples
- Position tracking for RoPE offset
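A simplified sketch of such a cache; the shapes, dtype, and advance() bookkeeping show one way to structure it and are assumptions, not nanochat's exact engine.

# Sketch: per-layer KV cache that grows in 1024-token chunks and tracks the
# current position (used as the RoPE offset for newly appended tokens).
import torch

class KVCache:
    def __init__(self, n_layers, batch, n_heads, head_dim, chunk=1024, device="cuda"):
        self.chunk, self.pos = chunk, 0
        shape = (n_layers, batch, n_heads, 0, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.v = torch.zeros(shape, device=device, dtype=torch.bfloat16)

    def insert(self, layer, k_new, v_new):
        """k_new/v_new: (batch, n_heads, T, head_dim) for the current forward pass."""
        T = k_new.size(2)
        if self.pos + T > self.k.size(3):                 # grow in fixed-size chunks
            grow = ((self.pos + T - self.k.size(3)) // self.chunk + 1) * self.chunk
            pad = self.k.new_zeros(*self.k.shape[:3], grow, self.k.size(4))
            self.k = torch.cat([self.k, pad], dim=3)
            self.v = torch.cat([self.v, pad], dim=3)
        self.k[layer, :, :, self.pos:self.pos + T] = k_new
        self.v[layer, :, :, self.pos:self.pos + T] = v_new
        return self.k[layer, :, :, :self.pos + T], self.v[layer, :, :, :self.pos + T]

    def advance(self, T):
        self.pos += T      # call once per forward pass, after the last layer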
Sampling
- Temperature-based softmax
- Top-k filtering
- Greedy when temperature=0
- Streaming token generation
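A sketch of a single decoding step under these rules; the default temperature and top-k values are illustrative.

# Sketch: temperature + top-k sampling for one decoding step (greedy at T=0).
import torch

def sample_next(logits, temperature=0.8, top_k=50):
    # logits: (batch, vocab) for the last position
    if temperature == 0.0:
        return logits.argmax(dim=-1)                      # greedy decoding
    logits = logits / temperature
    if top_k is not None:
        top_k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)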
Tool Use: Python Calculator
How it works
1. Detect <|python_start|>
2. Extract expression until <|python_end|>
3. Evaluate safely with timeout
4. Inject result wrapped in <|output_*|> tokens
Safety
- Allowed: +, -, *, /, (), .count()
- Blocked: __, import, exec, eval
- 3 second timeout
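A simplified sketch of the loop: the substring blocklist, the SIGALRM-based timeout (Unix, main thread only), and <|output_end|> as the closing output token are assumptions, not nanochat's exact implementation.

# Sketch: calculator tool hook around generation (guard and helpers are
# simplifying assumptions).
import signal

BLOCKED = ("__", "import", "exec", "eval")

def safe_eval(expr: str, timeout_s: int = 3):
    if any(bad in expr for bad in BLOCKED):
        return None
    def _alarm(signum, frame):
        raise TimeoutError
    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout_s)
    try:
        return eval(expr, {"__builtins__": {}}, {})   # arithmetic / .count() only
    except Exception:
        return None
    finally:
        signal.alarm(0)

def maybe_run_tool(text: str, enc):
    """If the model just closed a python block, return result tokens to inject."""
    start, end = "<|python_start|>", "<|python_end|>"
    if not text.endswith(end):
        return None
    expr = text.rsplit(start, 1)[-1][:-len(end)]
    result = safe_eval(expr)
    return enc.encode(f"<|output_start|>{result}<|output_end|>", allowed_special="all")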
Web Serving
FastAPI + ChatGPT-like UI
Backend (FastAPI)
- Worker pool across GPUs
- Async request handling
- Queue-based distribution
- OpenAI-compatible API
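A bare-bones sketch of the serving shape; the request schema, the generate_stream stub, and the event-stream framing are illustrative assumptions, not the repo's exact API.

# Sketch: streaming chat endpoint. In the real server a worker pool distributes
# requests across GPUs through a queue; generate_stream stands in for one worker.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    temperature: float = 0.8
    max_tokens: int = 512

async def generate_stream(messages, temperature, max_tokens):
    # Placeholder for a GPU worker's token generator.
    for piece in ("Hello", ",", " world"):
        yield piece

@app.post("/chat/completions")
async def chat_completions(req: ChatRequest):
    async def events():
        async for tok in generate_stream(req.messages, req.temperature, req.max_tokens):
            yield tok
    return StreamingResponse(events(), media_type="text/event-stream")

@app.get("/health")
async def health():
    return {"status": "ok"}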
Endpoints
- GET / (Chat UI)
- POST /chat/completions (Streaming API)
- GET /health (Status)

Abuse Prevention
Evaluation
Base Model Eval
- CORE metric (Chinchilla benchmark)
- Bits per byte on validation
- Sampling/generation tests
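Bits per byte normalizes the validation loss by raw text bytes rather than tokens, which keeps the number comparable across tokenizers; a minimal sketch of the computation:

# Sketch: bits-per-byte from summed next-token cross-entropy (in nats) over a
# validation set, divided by the byte length of the underlying text.
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    return total_nll_nats / (math.log(2) * total_bytes)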
Chat Model Eval
- ARC-Easy/Challenge (MC accuracy)
- MMLU (multi-domain MC)
- GSM8K (math problems)
- HumanEval (Python code)
- SpellingBee (letter counting)
- ChatCORE (mean-centered accuracy)
Data & Tasks
Task Framework
Available Tasks
Distributed Training
PyTorch DDP
- Multi-GPU training
- Rank-based work distribution
- Barrier synchronization
- Master rank handles checkpointing
Compilation
- torch.compile()
- dynamic=False for base (static shapes)
- dynamic=True for SFT (variable lengths)
- ~2-3× speedup
speedrun.sh
End-to-end in ~4 hours
Artifacts Location
~/.cache/nanochat/

Key Takeaways
Minimal but complete
~8000 lines covering tokenization, pretraining, SFT, RL, inference, and web serving. Easy to fork and customize.
Type-specific optimizers
Muon for linear layers, AdamW for embeddings. Different learning rates per parameter type.
Selective loss masking
Only train on assistant/tool tokens. User input and tool outputs are masked (loss=0).
Tool use via state machine
Python calculator integrated into generation. Parses <|python_*|> blocks and injects results.