NanoChat
Full-stack LLM training in a minimal, hackable codebase
Train and deploy a ChatGPT-like model end-to-end on a single 8×H100 node for ~$100-$800. Includes tokenization, pretraining, midtraining, SFT, optional RL, evaluation, and web serving.
Training Pipeline
Tokenizer
RustBPE, 65K vocab
Base Pretrain
FineWeb-Edu 11B tokens
Midtrain
Conversation structure
SFT
Task finetuning
Serve
FastAPI + Web UI
Model Architecture
Modern, minimal Transformer
Architecture Choices
Model Sizing
# Sizing formula (depth d20 = 561M params)
num_layers = depth                    # 20
model_dim  = depth * 64               # 1280
num_heads  = ceil(model_dim / 128)    # 10

# Sequence length
seq_len = 2048                        # tokens

# Vocab size
vocab_size = 65536                    # 2^16

Optimizers
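Per the Key Takeaways below, Muon optimizes the hidden linear (matrix) parameters while AdamW handles embeddings, with different learning rates per parameter type. A minimal sketch of how such a split might look; the name checks, learning rates, and the Muon import path are illustrative assumptions, not nanochat's exact code.

# Sketch: type-specific optimizer setup. Learning rates, the parameter-name
# checks, and the Muon import path are assumptions for illustration.
import torch
from muon import Muon   # assumed: a Muon implementation shipped with the repo

def build_optimizers(model, matrix_lr=0.02, embedding_lr=0.2):
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)     # hidden linear/attention matrices -> Muon
        else:
            other_params.append(p)      # embeddings, unembedding, 1-D params -> AdamW
    return [
        Muon(matrix_params, lr=matrix_lr, momentum=0.95),
        torch.optim.AdamW(other_params, lr=embedding_lr, betas=(0.9, 0.95), weight_decay=0.0),
    ]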
Tokenization
RustBPE + Special Tokens
RustBPE Tokenizer
- Implemented in Rust for speed
- GPT-4 style pre-tokenization
- Numbers: 1-2 digits max (modified from GPT-4's 1-3)
- Vocab size: 65,536 tokens
- Tiktoken wrapper for inference
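A rough sketch of how a RustBPE-trained vocab could be wrapped with tiktoken for inference. The split pattern below is only an approximation of the GPT-4 pattern with the 1-2-digit number change, and placing special tokens at the top of the 65,536-entry vocab is an assumption.

# Sketch: wrapping trained BPE merges with tiktoken. Split pattern and
# special-token placement are illustrative assumptions.
import tiktoken

SPLIT_PATTERN = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

SPECIAL_TOKENS = [
    "<|bos|>", "<|user_start|>", "<|user_end|>", "<|assistant_start|>",
    "<|assistant_end|>", "<|python_start|>", "<|python_end|>", "<|output_start|>",
]

def load_encoding(mergeable_ranks: dict) -> tiktoken.Encoding:
    """mergeable_ranks: bytes -> rank mapping produced by RustBPE training."""
    base = len(mergeable_ranks)
    return tiktoken.Encoding(
        name="nanochat",
        pat_str=SPLIT_PATTERN,
        mergeable_ranks=mergeable_ranks,
        special_tokens={tok: base + i for i, tok in enumerate(SPECIAL_TOKENS)},
    )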
Special Tokens
<|bos|> <|user_start|> <|user_end|> <|assistant_start|> <|assistant_end|> <|python_start|> <|python_end|> <|output_start|>

Conversation Rendering
Conversations are tokenized with loss masks that control which tokens the model learns on.
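A minimal sketch of what that rendering might look like, assuming a tiktoken-style encoder and the special tokens above; the exact format nanochat renders (and how it handles tool turns) may differ in detail.

# Sketch: render a conversation to token IDs plus a per-token mask
# (1 = compute loss, 0 = ignore). Simplified; tool-output handling omitted.
def render_conversation(messages, enc):
    ids, mask = [], []

    def emit(text, train, special=False):
        toks = enc.encode(text, allowed_special="all") if special else enc.encode(text)
        ids.extend(toks)
        mask.extend([int(train)] * len(toks))

    emit("<|bos|>", False, special=True)
    for msg in messages:
        if msg["role"] == "user":
            emit("<|user_start|>", False, special=True)
            emit(msg["content"], False)                    # user text: loss masked out
            emit("<|user_end|>", False, special=True)
        else:
            emit("<|assistant_start|>", False, special=True)
            emit(msg["content"], True)                     # assistant text: loss applied
            emit("<|assistant_end|>", True, special=True)  # learn to emit the stop token
    return ids, mask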
Training Phases
Base Pretraining
scripts/base_train.py
Data
- FineWeb-Edu 100B dataset
- ~240 parquet shards (~24GB)
- 11.2B tokens for d20 model
- Chinchilla ratio: 20× params
Config
- Batch: 524,288 tokens
- Seq length: 2048
- Gradient accumulation: auto
- Mixed precision: bfloat16
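The numbers above imply the following rough arithmetic; the per-device micro-batch of 32 sequences is an illustrative assumption, and gradient accumulation is derived from it.

# Sketch: how the token budget and step count fall out of the config above.
n_params      = 561e6                      # d20 model
target_tokens = 20 * n_params              # Chinchilla ratio: ~11.2B tokens
batch_tokens  = 524_288                    # tokens per optimizer step
num_steps     = int(target_tokens // batch_tokens)   # ~21,400 steps

# Gradient accumulation (device micro-batch of 32 sequences is an assumption)
seq_len, world_size, device_batch = 2048, 8, 32
tokens_per_fwd   = seq_len * device_batch * world_size    # 524,288
grad_accum_steps = max(1, batch_tokens // tokens_per_fwd)  # 1 here; grows as the
                                                           # micro-batch shrinks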
Midtraining
scripts/mid_train.py
Purpose
- Learn conversation structure
- Special token usage
- Tool scaffolding
- Multiple-choice format
Data Mix
- SmolTalk (460K conversations)
- MMLU auxiliary (100K MC)
- GSM8K (math with tool tags)
- Identity conversations
Supervised Finetuning
scripts/chat_sft.py
Purpose
- Domain/behavior adaptation
- Task-specific supervision
- Lower learning rate (2% of base)
Data (~23K rows)
- ARC-Easy/Challenge
- GSM8K train
- SmolTalk subset (10K)
- Spelling drills
Reinforcement Learning (Optional)
scripts/chat_rl.py
Algorithm
- GRPO-like REINFORCE
- Token-level advantages
- Mean baseline
- No KL penalty
Scope
- GSM8K math problems only
- Samples k completions
- Optimizes for correct answers
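A compact sketch of the objective for one prompt, assuming k sampled completions and binary correctness rewards; tensor shapes and function names are illustrative.

# Sketch: GRPO-style REINFORCE loss for one prompt (no KL penalty term).
import torch

def reinforce_loss(logprobs, mask, rewards):
    """
    logprobs: (k, T) log-probs of the sampled tokens for k completions
    mask:     (k, T) 1.0 for generated tokens that count toward the loss
    rewards:  (k,)   1.0 if the extracted answer was correct, else 0.0
    """
    advantages = rewards - rewards.mean()          # mean baseline over the k samples
    adv = advantages[:, None] * mask               # broadcast to token level
    # REINFORCE: maximize advantage-weighted log-likelihood of sampled tokens
    return -(adv * logprobs).sum() / mask.sum().clamp(min=1)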
Inference Engine
Streaming generation with KV cache
KV Cache
- Maintains keys/values per layer
- Dynamic growth (1024-token chunks)
- Batch prefill → replicate for N samples
- Position tracking for RoPE offset
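A simplified sketch of such a cache; the shapes, dtype, and advance() bookkeeping show one way to structure it and are assumptions, not nanochat's exact engine.

# Sketch: per-layer KV cache that grows in 1024-token chunks and tracks the
# current position (used as the RoPE offset for newly appended tokens).
import torch

class KVCache:
    def __init__(self, n_layers, batch, n_heads, head_dim, chunk=1024, device="cuda"):
        self.chunk, self.pos = chunk, 0
        shape = (n_layers, batch, n_heads, 0, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.v = torch.zeros(shape, device=device, dtype=torch.bfloat16)

    def insert(self, layer, k_new, v_new):
        """k_new/v_new: (batch, n_heads, T, head_dim) for the current forward pass."""
        T = k_new.size(2)
        if self.pos + T > self.k.size(3):                 # grow in fixed-size chunks
            grow = ((self.pos + T - self.k.size(3)) // self.chunk + 1) * self.chunk
            pad = self.k.new_zeros(*self.k.shape[:3], grow, self.k.size(4))
            self.k = torch.cat([self.k, pad], dim=3)
            self.v = torch.cat([self.v, pad], dim=3)
        self.k[layer, :, :, self.pos:self.pos + T] = k_new
        self.v[layer, :, :, self.pos:self.pos + T] = v_new
        return self.k[layer, :, :, :self.pos + T], self.v[layer, :, :, :self.pos + T]

    def advance(self, T):
        self.pos += T      # call once per forward pass, after the last layer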
Sampling
- Temperature-based softmax
- Top-k filtering
- Greedy when temperature=0
- Streaming token generation
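A sketch of a single decoding step under these rules; the default temperature and top-k values are illustrative.

# Sketch: temperature + top-k sampling for one decoding step (greedy at T=0).
import torch

def sample_next(logits, temperature=0.8, top_k=50):
    # logits: (batch, vocab) for the last position
    if temperature == 0.0:
        return logits.argmax(dim=-1)                      # greedy decoding
    logits = logits / temperature
    if top_k is not None:
        top_k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)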
Tool Use: Python Calculator
How it works
1. Detect <|python_start|>
2. Extract expression until <|python_end|>
3. Evaluate safely with timeout
4. Inject result wrapped in <|output_*|> tokens
Safety
- Allowed: +, -, *, /, (), .count()
- Blocked: __, import, exec, eval
- 3 second timeout
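A simplified sketch of the loop: the substring blocklist, the SIGALRM-based timeout (Unix, main thread only), and <|output_end|> as the closing output token are assumptions, not nanochat's exact implementation.

# Sketch: calculator tool hook around generation (guard and helpers are
# simplifying assumptions).
import signal

BLOCKED = ("__", "import", "exec", "eval")

def safe_eval(expr: str, timeout_s: int = 3):
    if any(bad in expr for bad in BLOCKED):
        return None
    def _alarm(signum, frame):
        raise TimeoutError
    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout_s)
    try:
        return eval(expr, {"__builtins__": {}}, {})   # arithmetic / .count() only
    except Exception:
        return None
    finally:
        signal.alarm(0)

def maybe_run_tool(text: str, enc):
    """If the model just closed a python block, return result tokens to inject."""
    start, end = "<|python_start|>", "<|python_end|>"
    if not text.endswith(end):
        return None
    expr = text.rsplit(start, 1)[-1][:-len(end)]
    result = safe_eval(expr)
    return enc.encode(f"<|output_start|>{result}<|output_end|>", allowed_special="all")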
Web Serving
FastAPI + ChatGPT-like UI
Backend (FastAPI)
- Worker pool across GPUs
- Async request handling
- Queue-based distribution
- OpenAI-compatible API
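A bare-bones sketch of the serving shape; the request schema, the generate_stream stub, and the event-stream framing are illustrative assumptions, not the repo's exact API.

# Sketch: streaming chat endpoint. In the real server a worker pool distributes
# requests across GPUs through a queue; generate_stream stands in for one worker.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    temperature: float = 0.8
    max_tokens: int = 512

async def generate_stream(messages, temperature, max_tokens):
    # Placeholder for a GPU worker's token generator.
    for piece in ("Hello", ",", " world"):
        yield piece

@app.post("/chat/completions")
async def chat_completions(req: ChatRequest):
    async def events():
        async for tok in generate_stream(req.messages, req.temperature, req.max_tokens):
            yield tok
    return StreamingResponse(events(), media_type="text/event-stream")

@app.get("/health")
async def health():
    return {"status": "ok"}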
Endpoints
- GET / (Chat UI)
- POST /chat/completions (Streaming API)
- GET /health (Status)

Abuse Prevention
Evaluation
Base Model Eval
- CORE metric (Chinchilla benchmark)
- Bits per byte on validation
- Sampling/generation tests
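Bits per byte normalizes the validation loss by raw text bytes rather than tokens, which keeps the number comparable across tokenizers; a minimal sketch of the computation:

# Sketch: bits-per-byte from summed next-token cross-entropy (in nats) over a
# validation set, divided by the byte length of the underlying text.
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    return total_nll_nats / (math.log(2) * total_bytes)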
Chat Model Eval
- ARC-Easy/Challenge (MC accuracy)
- MMLU (multi-domain MC)
- GSM8K (math problems)
- HumanEval (Python code)
- SpellingBee (letter counting)
- ChatCORE (mean-centered accuracy)
Data & Tasks
Task Framework
Available Tasks
Distributed Training
PyTorch DDP
- Multi-GPU training
- Rank-based work distribution
- Barrier synchronization
- Master rank handles checkpointing
Compilation
- torch.compile()
- dynamic=False for base (static shapes)
- dynamic=True for SFT (variable lengths)
- ~2-3× speedup
speedrun.sh
End-to-end in ~4 hours
Artifacts Location
~/.cache/nanochat/

Key Takeaways
Minimal but complete
~8000 lines covering tokenization, pretraining, SFT, RL, inference, and web serving. Easy to fork and customize.
Type-specific optimizers
Muon for linear layers, AdamW for embeddings. Different learning rates per parameter type.
Selective loss masking
Only train on assistant/tool tokens. User input and tool outputs are masked (loss=0).
Tool use via state machine
Python calculator integrated into generation. Parses <|python_*|> blocks and injects results.