
πŸ”₯

Torchless

LLM inference engine built from scratch in C++

A custom-built inference engine that runs Mistral 7B on CPU without PyTorch or any ML frameworks. Hand-coded transformer implementation with full tokenizer, quantization, and inference pipeline.

~1200 lines of C++ Β· CPU-only Β· INT8 quantization
✨

Why Torchless?

Understanding LLMs from first principles

Zero Dependencies

No PyTorch, TensorFlow, or ONNX. Every operation hand-coded in C++ with OpenMP for parallelization.

Educational Focus

Clear, readable implementation of every transformer component. Perfect for understanding how LLMs actually work.

Full Pipeline

Complete from model export to text generation. Includes tokenizer, quantization, and inference.

πŸ—οΈ

Architecture

End-to-end inference pipeline

πŸ“¦

Model Export

HF β†’ Binary

πŸ“

Tokenizer

BPE encoding

🧠

Transformer

32 layers

🎲

Sampling

Token generation

β€’ Target model: Mistral 7B
β€’ Transformer depth: 32 layers
β€’ Hidden size: 4096
β€’ KV heads: 8 (GQA)
πŸ“¦

Model Export Pipeline

Converting Hugging Face models to binary

Export Process

  • β€’ Load model from Hugging Face safetensors
  • β€’ Extract config, vocabulary, and BPE merges
  • β€’ Apply optional INT8 quantization
  • β€’ Package into single binary file
  • β€’ Memory-mapped loading for efficiency

Binary Format

[8-byte header size]
[JSON header]
  - config (dims, layers, heads)
  - vocab (token β†’ id mapping)
  - merges (BPE merge rules)
  - tensor offsets
[Padded tensor payload]
  - weights (f32 or int8)
  - quantization scales
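
As a sketch of how a loader might consume this layout (the function name and handling here are illustrative, not the repo's actual loader code):

#include <cstdint>
#include <cstdio>
#include <string>

// Sketch: read the 8-byte header length, then the JSON header text.
// The padded tensor payload follows and can be memory-mapped directly.
std::string read_header(const char* path, uint64_t& header_size) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::fread(&header_size, sizeof(uint64_t), 1, f);   // first 8 bytes: header size
    std::string json(header_size, '\0');
    std::fread(json.data(), 1, header_size, f);         // config, vocab, merges, tensor offsets
    std::fclose(f);
    return json;                                         // parse with any JSON library
}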

Quantization Strategy

Per-Group Symmetric Quantization

  • β€’ Group size: 64 values
  • β€’ Scale computed per group
  • β€’ Maps to INT8 range [-127, 127]
  • β€’ On-the-fly dequantization during matmul

Memory Savings

  • β€’ Float32: 4 bytes/param
  • β€’ INT8: ~1 byte/param + scales
  • β€’ ~4x reduction in model size
  • β€’ Faster memory bandwidth
πŸ“

Tokenization

Byte-Pair Encoding from scratch

BPE Implementation

  • β€’ Full BPE encoder/decoder
  • β€’ Loads vocab and merge rules from tokenizer.json
  • β€’ Mistral-specific pre-tokenization (Metaspace)
  • β€’ Byte fallback for unknown characters
  • β€’ ~209 lines of implementation

Tokenization Flow

1. Split text by whitespace/punctuation
2. Convert each word to bytes
3. Apply BPE merges iteratively
4. Map tokens to vocabulary IDs
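
Step 3 is the heart of BPE; a simplified sketch of the greedy merge loop (the data structures and names here are assumptions, not the tokenizer's exact code):

#include <climits>
#include <map>
#include <string>
#include <vector>

// Repeatedly merge the adjacent pair with the lowest merge rank until
// no pair in the word appears in the merge table.
std::vector<std::string> bpe_merge(std::vector<std::string> pieces,
                                   const std::map<std::string, int>& ranks) {
    while (pieces.size() > 1) {
        int best_rank = INT_MAX;
        int best_i = -1;
        for (size_t i = 0; i + 1 < pieces.size(); ++i) {
            auto it = ranks.find(pieces[i] + " " + pieces[i + 1]);
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = static_cast<int>(i);
            }
        }
        if (best_i < 0) break;                    // no applicable merge left
        pieces[best_i] += pieces[best_i + 1];     // fuse the winning pair
        pieces.erase(pieces.begin() + best_i + 1);
    }
    return pieces;                                 // step 4 maps these to vocab IDs
}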
🧠

Transformer Implementation

Every layer hand-coded in C++

Embedding

~15 lines

Token ID lookup in embedding table. Maps 32K vocab to 4096-dim vectors.

RMSNorm

~20 lines

Root Mean Square normalization. Simpler and faster than LayerNorm.
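
For reference, an RMSNorm kernel really does fit in a handful of lines; this sketch assumes flat float buffers and an elementwise weight:

#include <cmath>

// RMSNorm: out_i = w_i * x_i / sqrt(mean(x^2) + eps).
// No mean subtraction or bias, which is what makes it cheaper than LayerNorm.
void rmsnorm(float* out, const float* x, const float* w, int dim, float eps) {
    float ss = 0.f;
    for (int i = 0; i < dim; ++i) ss += x[i] * x[i];
    float inv_rms = 1.f / std::sqrt(ss / dim + eps);
    for (int i = 0; i < dim; ++i) out[i] = w[i] * x[i] * inv_rms;
}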

RoPE

~35 lines

Rotary Position Embeddings. Encodes position through rotation in complex plane.
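
A sketch of applying the rotation to one 128-dim head; the pairing of adjacent dimensions and the 10000 base follow the standard RoPE formulation and are assumptions about this codebase:

#include <cmath>

// Rotate each (even, odd) dimension pair of a head vector by a
// position-dependent angle: theta_j = pos / 10000^(j / head_dim).
void rope(float* vec, int head_dim, int pos) {
    for (int j = 0; j < head_dim; j += 2) {
        float theta = pos * std::pow(10000.0f, -static_cast<float>(j) / head_dim);
        float c = std::cos(theta);
        float s = std::sin(theta);
        float x0 = vec[j];
        float x1 = vec[j + 1];
        vec[j]     = x0 * c - x1 * s;   // 2-D rotation in the (j, j+1) plane
        vec[j + 1] = x0 * s + x1 * c;
    }
}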

Attention

~60 lines

Multi-head Grouped-Query Attention with KV cache. 32 query heads, 8 KV heads.

MLP

~25 lines

SwiGLU feedforward: gate * silu(up) projection with 14336 intermediate dim.
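
A sketch of that block, following the naming used above; the naive matmul helper and signatures are illustrative only:

#include <cmath>
#include <vector>

// Naive row-major matmul helper: out[r] = dot(W[r, :], x).
static void matmul(float* out, const float* W, const float* x, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (int c = 0; c < cols; ++c) acc += W[r * cols + c] * x[c];
        out[r] = acc;
    }
}

// SwiGLU feedforward as described above: gate(x) * SiLU(up(x)), projected back down.
// For Mistral 7B: dim = 4096, hidden = 14336.
void swiglu_mlp(float* out, const float* x,
                const float* Wgate, const float* Wup, const float* Wdown,
                int dim, int hidden) {
    std::vector<float> g(hidden), u(hidden);
    matmul(g.data(), Wgate, x, hidden, dim);
    matmul(u.data(), Wup, x, hidden, dim);
    for (int i = 0; i < hidden; ++i)
        g[i] *= u[i] / (1.f + std::exp(-u[i]));   // SiLU(up) gated elementwise
    matmul(out, Wdown, g.data(), dim, hidden);
}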

LM Head

~10 lines

Project final hidden state to vocabulary logits for next token prediction.

Grouped-Query Attention (GQA)

How GQA Works

  • β€’ 32 query heads with 128-dim each
  • β€’ 8 key-value head pairs (shared)
  • β€’ Reuse factor: 4 query heads per KV pair
  • β€’ Reduces KV cache memory by 4x

KV Cache Optimization

  • β€’ Stores K/V across sequence positions
  • β€’ Avoids recalculating for past tokens
  • β€’ Critical for efficient generation
  • β€’ Managed in InferenceState struct
⚑

CPU Kernels

Optimized math operations

matmul

Matrix multiplication with quantization support

row_matmul

Row-wise matmul for attention scores

softmax

Temperature-aware with numerical stability
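
A sketch of what "temperature-aware with numerical stability" means in practice (names illustrative):

#include <algorithm>
#include <cmath>

// Subtract the max logit before exponentiating so exp() cannot overflow;
// divide by temperature to sharpen (<1) or flatten (>1) the distribution.
void softmax(float* x, int n, float temperature) {
    float max_val = x[0];
    for (int i = 1; i < n; ++i) max_val = std::max(max_val, x[i]);
    float sum = 0.f;
    for (int i = 0; i < n; ++i) {
        x[i] = std::exp((x[i] - max_val) / temperature);
        sum += x[i];
    }
    for (int i = 0; i < n; ++i) x[i] /= sum;
}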

rope

Rotary embeddings application

silu

SiLU activation: x * sigmoid(x)

element_ops

Add, multiply, power, sqrt

Optimization Features

OpenMP parallelization Β· Loop unrolling Β· Cache-friendly access patterns Β· On-the-fly dequantization
πŸ—‚οΈ

Inference State Management

Pre-allocated memory for zero-copy inference

InferenceState Structure

  • β€’ Hidden states and residuals
  • β€’ Query/Key/Value projections
  • β€’ Attention scores and context
  • β€’ MLP intermediate activations
  • β€’ Output logits and probabilities
  • β€’ KV cache per layer

Memory Management

  • β€’ Arena allocator (256 MB)
  • β€’ Zero-copy tensor views
  • β€’ Memory-mapped model file
  • β€’ Single token processing (batch=1)
  • β€’ Reused buffers across tokens
πŸ”„

Inference Lifecycle

How text generation works

1. Loading: Memory-map the binary file, parse the JSON header, create tensor views
2. Tokenization: Convert input text to token IDs using BPE encoding
3. Transformer Forward: Embed β†’ (RMSNorm β†’ Attention β†’ MLP) Γ— 32 β†’ Final Norm β†’ LM Head
4. Sampling: Apply softmax to logits, sample the next token (greedy or multinomial)
5. Auto-regressive: Append the token, update the KV cache, repeat from step 3
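
Put together, the lifecycle reduces to a loop like this sketch; encode, forward, sample, and decode are placeholders standing in for the real entry points:

#include <cstdio>
#include <string>
#include <vector>

// Placeholder declarations for the real pipeline entry points.
std::vector<int> encode(const std::string& text);     // BPE tokenization
const float* forward(int token, int pos);              // transformer forward, returns logits
int sample(const float* logits, int vocab_size);       // greedy or multinomial
std::string decode(int token);                         // token id -> text

void generate(const std::string& prompt, int max_new_tokens, int vocab_size) {
    std::vector<int> tokens = encode(prompt);
    const float* logits = nullptr;
    int pos = 0;
    for (int t : tokens) logits = forward(t, pos++);    // prefill: fills the KV cache
    for (int i = 0; i < max_new_tokens; ++i) {
        int next = sample(logits, vocab_size);          // step 4: sampling
        std::printf("%s", decode(next).c_str());
        logits = forward(next, pos++);                  // step 5: append and repeat
    }
}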
πŸ”¬

Single Transformer Layer

Inside one of 32 layers

Input Hidden State
RMSNorm (pre-attention)
Q/K/V Projection β†’ RoPE β†’ Attention β†’ Output Proj
Residual Add
RMSNorm (pre-MLP)
Gate↑ Γ— SiLU(Up↑) β†’ Down↓ (SwiGLU MLP)
Residual Add β†’ Output Hidden State
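
In code form, the flow above reduces to two pre-norm sub-blocks, each wrapped in a residual add; the per-component functions here are placeholders for the kernels described earlier:

constexpr int kDim = 4096;

// Placeholder signatures for the per-component kernels (illustrative only).
void rmsnorm_attn(float* out, const float* x, int layer);
void attention_block(float* out, const float* x, int layer, int pos);  // QKV + RoPE + GQA + proj
void rmsnorm_mlp(float* out, const float* x, int layer);
void swiglu_block(float* out, const float* x, int layer);              // gate/up/down projections

// One of the 32 layers: pre-norm, sub-block, residual add, twice over.
void layer_forward(float* x, int layer, int pos) {
    float normed[kDim];
    float out[kDim];

    rmsnorm_attn(normed, x, layer);                 // RMSNorm (pre-attention)
    attention_block(out, normed, layer, pos);
    for (int i = 0; i < kDim; ++i) x[i] += out[i];  // residual add

    rmsnorm_mlp(normed, x, layer);                  // RMSNorm (pre-MLP)
    swiglu_block(out, normed, layer);
    for (int i = 0; i < kDim; ++i) x[i] += out[i];  // residual add
}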
πŸ’»

Usage

Building and running

Export Model

python3 export_mistral.py \
  --model_dir ../Mistral-7B-v0.1 \
  --out ./mistral.bin \
  --quant f32  # or int8

Build & Run

# Compile
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Run inference
./torchless ../mistral.bin "Paris is the capital of"
πŸ“

Code Structure

Source Files

src/tokenizer/ - BPE tokenizer (209 lines)
src/model/mistral/ - Transformer modules (181 lines)
src/backend/cpu/ - Math kernels (154 lines)
src/loader/ - Model loading (157 lines)
src/common/ - Tensor infrastructure (107 lines)
main.cpp - Inference entry point

Python Scripts

export_mistral.py - HF to binary converter
parity/ - Component validation tests

Build

CMakeLists.txt - CMake configuration
requirements.txt - Python dependencies
πŸš€

Development Focus

Current Work

  • β€’ Performance optimization
  • β€’ CPU SIMD vectorization
  • β€’ Custom CUDA kernels
  • β€’ Mistral 3 architecture support

Planned Features

  • β€’ Temperature scaling for sampling
  • β€’ CPU multithreading optimization
  • β€’ Chat/conversation interface
  • β€’ Ministral 3 3B support
πŸ’‘

Key Takeaways

Framework-free inference

Demonstrates that ML frameworks are conveniences, not requirements. Every operation can be implemented directly.

Zero-copy architecture

Memory-mapped files and tensor views eliminate data copying. Model weights accessed directly from disk.

Quantization from scratch

Per-group INT8 quantization with on-the-fly dequantization. Understand exactly how model compression works.

Educational codebase

~1200 lines of readable C++ covering the complete inference pipeline. Great for learning transformer internals.