Torchless
LLM inference engine built from scratch in C++
A custom-built inference engine that runs Mistral 7B on CPU without PyTorch or any ML frameworks. Hand-coded transformer implementation with full tokenizer, quantization, and inference pipeline.
Why Torchless?
Understanding LLMs from first principles
Zero Dependencies
No PyTorch, TensorFlow, or ONNX. Every operation hand-coded in C++ with OpenMP for parallelization.
Educational Focus
Clear, readable implementation of every transformer component. Perfect for understanding how LLMs actually work.
Full Pipeline
Complete from model export to text generation. Includes tokenizer, quantization, and inference.
Architecture
End-to-end inference pipeline
Model Export
HF → Binary
Tokenizer
BPE encoding
Transformer
32 layers
Sampling
Token generation
Model Export Pipeline
Converting Hugging Face models to binary
Export Process
- Load model from Hugging Face safetensors
- Extract config, vocabulary, and BPE merges
- Apply optional INT8 quantization
- Package into a single binary file
- Memory-mapped loading for efficiency (see the loader sketch below)
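To make the memory-mapped loading step concrete, here is a minimal sketch of how the consuming side could map the exported file and read the 8-byte header size, assuming the layout described under "Binary Format" and POSIX mmap. The names `MappedModel` and `map_model` are illustrative, not the project's actual API.

```cpp
// Sketch of memory-mapped loading (POSIX), assuming the exported layout:
// [8-byte little-endian header size][JSON header][padded tensor payload].
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedModel {
    const uint8_t* base = nullptr;    // start of the mapped file
    size_t size = 0;                  // total file size in bytes
    std::string header_json;          // JSON header (config, vocab, merges, offsets)
    const uint8_t* payload = nullptr; // start of the padded tensor payload
};

MappedModel map_model(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("cannot open model file");

    struct stat st;
    fstat(fd, &st);

    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) throw std::runtime_error("mmap failed");

    MappedModel m;
    m.base = static_cast<const uint8_t*>(base);
    m.size = static_cast<size_t>(st.st_size);

    uint64_t header_size = 0;
    std::memcpy(&header_size, m.base, sizeof(header_size));  // 8-byte size prefix
    m.header_json.assign(reinterpret_cast<const char*>(m.base + 8), header_size);
    m.payload = m.base + 8 + header_size;  // tensors are addressed via offsets in the JSON
    return m;
}
```

Because the weights live in the mapping, tensor views can point at `payload + offset` directly, with no copy into separately allocated buffers.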
Binary Format
[8-byte header size]
[JSON header]
- config (dims, layers, heads)
- vocab (token β id mapping)
- merges (BPE merge rules)
- tensor offsets
[Padded tensor payload]
- weights (f32 or int8)
- quantization scales
Quantization Strategy
Per-Group Symmetric Quantization
- Group size: 64 values
- Scale computed per group
- Maps to the INT8 range [-127, 127]
- On-the-fly dequantization during matmul (sketched below)
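The core of this scheme fits in a few lines. The sketch below is illustrative rather than the project's actual kernel: it quantizes one weight row in groups of 64 and shows matching on-the-fly dequantization inside a dot product (assumes the row length is a multiple of 64).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kGroupSize = 64;  // values per quantization group

// Symmetric per-group quantization: one float scale per 64 values,
// mapping each group onto the INT8 range [-127, 127] with zero-point 0.
void quantize_row(const float* w, int n, int8_t* q, float* scales) {
    for (int g = 0; g < n / kGroupSize; ++g) {
        const float* grp = w + g * kGroupSize;
        float max_abs = 0.f;
        for (int i = 0; i < kGroupSize; ++i)
            max_abs = std::max(max_abs, std::fabs(grp[i]));
        float scale = max_abs / 127.f;
        scales[g] = scale;
        for (int i = 0; i < kGroupSize; ++i)
            q[g * kGroupSize + i] = static_cast<int8_t>(
                std::lround(scale > 0.f ? grp[i] / scale : 0.f));
    }
}

// On-the-fly dequantization inside one matmul row (dot product with input x):
// accumulate each group in INT8-derived floats, then rescale once per group.
float dot_int8(const int8_t* q, const float* scales, const float* x, int n) {
    float acc = 0.f;
    for (int g = 0; g < n / kGroupSize; ++g) {
        float partial = 0.f;
        for (int i = 0; i < kGroupSize; ++i)
            partial += static_cast<float>(q[g * kGroupSize + i]) * x[g * kGroupSize + i];
        acc += partial * scales[g];  // dequantize one group at a time
    }
    return acc;
}
```

With one float scale per 64 weights, the overhead is about 0.06 bytes per parameter on top of the 1-byte INT8 value, which is where the roughly 4x size reduction versus float32 comes from.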
Memory Savings
- Float32: 4 bytes/param
- INT8: ~1 byte/param + scales
- ~4x reduction in model size
- Lower memory-bandwidth pressure per token
Tokenization
Byte-Pair Encoding from scratch
BPE Implementation
- Full BPE encoder/decoder (merge loop sketched below)
- Loads vocab and merge rules from tokenizer.json
- Mistral-specific pre-tokenization (Metaspace)
- Byte fallback for unknown characters
- ~209 lines of implementation
Tokenization Flow
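In outline, the flow is: pre-tokenize the input, split each piece into initial symbols, repeatedly apply the highest-priority merge rule until none applies, then map the surviving symbols to token ids via the vocab. The sketch below shows only the merge loop; types and names are illustrative, and the real tokenizer also handles Metaspace pre-tokenization and byte fallback.

```cpp
#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Merge rules loaded from tokenizer.json: pair of symbols -> priority
// (lower rank = applied earlier). Illustrative layout, not the project's.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe_encode(const std::string& word, const MergeRanks& ranks) {
    // Start from the individual characters of the pre-tokenized word.
    std::vector<std::string> symbols;
    for (char c : word) symbols.emplace_back(1, c);

    while (symbols.size() > 1) {
        // Find the adjacent pair with the best (lowest) merge rank.
        int best_rank = INT_MAX, best_pos = -1;
        for (size_t i = 0; i + 1 < symbols.size(); ++i) {
            auto it = ranks.find({symbols[i], symbols[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pos = static_cast<int>(i);
            }
        }
        if (best_pos < 0) break;  // no applicable merge rule left

        // Merge the chosen pair into a single symbol.
        symbols[best_pos] += symbols[best_pos + 1];
        symbols.erase(symbols.begin() + best_pos + 1);
    }
    return symbols;  // each surviving symbol maps to a token id via the vocab
}
```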
Transformer Implementation
Every layer hand-coded in C++
Embedding
~15 lines. Token ID lookup in the embedding table. Maps the 32K vocab to 4096-dim vectors.
RMSNorm
~20 lines. Root Mean Square normalization. Simpler and faster than LayerNorm.
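As a hedged sketch (illustrative signature, epsilon value assumed): RMSNorm scales each element by the reciprocal root-mean-square of the vector and multiplies by a learned weight, with no mean subtraction and no bias, which is what makes it cheaper than LayerNorm.

```cpp
#include <cmath>

// RMSNorm: out_i = x_i / rms(x) * weight_i, where rms(x) = sqrt(mean(x^2) + eps).
void rmsnorm(float* out, const float* x, const float* weight, int n, float eps = 1e-5f) {
    float ss = 0.f;
    for (int i = 0; i < n; ++i) ss += x[i] * x[i];
    float inv_rms = 1.f / std::sqrt(ss / n + eps);
    for (int i = 0; i < n; ++i) out[i] = x[i] * inv_rms * weight[i];
}
```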
RoPE
~35 lines. Rotary Position Embeddings. Encodes position through rotation in the complex plane.
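A sketch of applying rotary embeddings to one query or key head: consecutive pairs of dimensions are treated as 2-D points and rotated by a position-dependent angle. The pairing convention (interleaved vs. split halves) and the base frequency are assumptions here; this sketch uses the interleaved form with the common base of 10000.

```cpp
#include <cmath>

// Rotate pairs (x[2i], x[2i+1]) by an angle that depends on the token position
// and the pair index. head_dim is 128 for Mistral 7B.
void rope(float* x, int head_dim, int pos, float theta = 10000.f) {
    for (int i = 0; i < head_dim; i += 2) {
        float freq = 1.f / std::pow(theta, static_cast<float>(i) / head_dim);
        float angle = pos * freq;
        float c = std::cos(angle), s = std::sin(angle);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;  // standard 2-D rotation
        x[i + 1] = x0 * s + x1 * c;
    }
}
```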
Attention
~60 lines. Multi-head Grouped-Query Attention with KV cache. 32 query heads, 8 KV heads.
MLP
~25 lines. SwiGLU feedforward: gate * silu(up) projection with a 14336-dim intermediate.
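A sketch of the SwiGLU block: two projections up to the 14336-dim intermediate space, an element-wise product with a SiLU gate, then a down projection back to 4096. The `matvec` helper stands in for whatever (possibly quantized) matmul kernel is available, and the sketch follows the common convention of applying SiLU to the gate projection; which matrix is named gate vs. up may differ in the actual code.

```cpp
#include <cmath>

// Placeholder dense kernel: out[j] = sum_k W[j*n + k] * x[k]. Illustrative only.
void matvec(float* out, const float* W, const float* x, int rows, int n) {
    for (int j = 0; j < rows; ++j) {
        float acc = 0.f;
        for (int k = 0; k < n; ++k) acc += W[j * n + k] * x[k];
        out[j] = acc;
    }
}

// SwiGLU feed-forward: down( silu(gate(x)) * up(x) ).
// dim = 4096, hidden = 14336 for Mistral 7B.
void swiglu_mlp(float* out, const float* x,
                const float* w_gate, const float* w_up, const float* w_down,
                float* gate_buf, float* up_buf, int dim, int hidden) {
    matvec(gate_buf, w_gate, x, hidden, dim);
    matvec(up_buf, w_up, x, hidden, dim);
    for (int j = 0; j < hidden; ++j) {
        float g = gate_buf[j];
        float silu = g / (1.f + std::exp(-g));  // silu(g) = g * sigmoid(g)
        gate_buf[j] = silu * up_buf[j];
    }
    matvec(out, w_down, gate_buf, dim, hidden);
}
```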
LM Head
~10 lines. Projects the final hidden state to vocabulary logits for next-token prediction.
Grouped-Query Attention (GQA)
How GQA Works
- 32 query heads with 128 dims each
- 8 key-value head pairs (shared)
- Reuse factor: 4 query heads per KV pair
- Reduces KV cache memory by 4x
KV Cache Optimization
- Stores K/V across sequence positions
- Avoids recalculating for past tokens
- Critical for efficient generation
- Managed in the InferenceState struct (see the attention sketch below)
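The sketch below shows one query head attending over the KV cache under GQA: with 32 query heads and 8 KV heads, query head h reads from KV head h / 4, so four query heads share each cached K/V stream. The buffer layout (position-major, contiguous head_dim) and function name are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Attention for a single query head at position `pos`, reading a shared KV head.
// k_cache / v_cache hold one KV head's vectors for positions [0, pos].
void attend_one_head(float* out, const float* q,
                     const float* k_cache, const float* v_cache,
                     int pos, int head_dim) {
    std::vector<float> scores(pos + 1);
    float scale = 1.f / std::sqrt(static_cast<float>(head_dim));

    // Scores against every cached position: past K/V are reused, never recomputed.
    float max_score = -1e30f;
    for (int t = 0; t <= pos; ++t) {
        float dot = 0.f;
        for (int i = 0; i < head_dim; ++i) dot += q[i] * k_cache[t * head_dim + i];
        scores[t] = dot * scale;
        max_score = std::max(max_score, scores[t]);
    }

    // Softmax over positions (subtract the max for numerical stability).
    float sum = 0.f;
    for (int t = 0; t <= pos; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        sum += scores[t];
    }

    // Weighted sum of cached values.
    for (int i = 0; i < head_dim; ++i) out[i] = 0.f;
    for (int t = 0; t <= pos; ++t) {
        float w = scores[t] / sum;
        for (int i = 0; i < head_dim; ++i) out[i] += w * v_cache[t * head_dim + i];
    }
}
```

Because only 8 K/V streams are cached instead of 32, the cache is a quarter of the size it would be under full multi-head attention, which is where most of the memory win comes from.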
CPU Kernels
Optimized math operations
matmul: Matrix multiplication with quantization support
row_matmul: Row-wise matmul for attention scores
softmax: Temperature-aware, with numerical stability (sketched below)
rope: Rotary embeddings application
silu: SiLU activation: x * sigmoid(x)
element_ops: Add, multiply, power, sqrt
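As an example of the kernel style, a temperature-aware softmax can look like the sketch below (illustrative signature): dividing the logits by the temperature and subtracting the maximum before exponentiating keeps the computation stable for any temperature greater than zero.

```cpp
#include <algorithm>
#include <cmath>

// Softmax with temperature: p_i = exp((x_i - max) / T) / sum_j exp((x_j - max) / T).
// Subtracting the max before exponentiating avoids overflow for large logits.
void softmax(float* x, int n, float temperature = 1.0f) {
    float max_val = *std::max_element(x, x + n);
    float sum = 0.f;
    for (int i = 0; i < n; ++i) {
        x[i] = std::exp((x[i] - max_val) / temperature);
        sum += x[i];
    }
    for (int i = 0; i < n; ++i) x[i] /= sum;
}
```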
Optimization Features
Inference State Management
Pre-allocated memory for zero-copy inference
InferenceState Structure
- Hidden states and residuals
- Query/Key/Value projections
- Attention scores and context
- MLP intermediate activations
- Output logits and probabilities
- KV cache per layer
Memory Management
- Arena allocator (256 MB)
- Zero-copy tensor views
- Memory-mapped model file
- Single-token processing (batch=1)
- Buffers reused across tokens (sketched below)
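A sketch of what a pre-allocated state like this can look like; the struct layout, field names, and sizes are illustrative, not the actual InferenceState. Every buffer is carved out of one arena at startup with a bump pointer and reused for every token, so no allocation happens on the hot path.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative arena: one up-front allocation, bump-pointer carving, no frees.
struct Arena {
    std::vector<uint8_t> storage;
    size_t used = 0;
    explicit Arena(size_t bytes) : storage(bytes) {}
    float* alloc_floats(size_t count) {
        float* p = reinterpret_cast<float*>(storage.data() + used);
        used += count * sizeof(float);
        return p;  // no individual free: everything lives as long as the arena
    }
};

// Illustrative per-token state for batch size 1. Scratch buffers are reused
// across tokens; only the KV cache is logically extended, one position per token.
struct InferenceState {
    float* hidden;      // [dim] residual stream for the current token
    float* q;           // [dim] query projection (n_heads * head_dim)
    float* att_scores;  // [max_seq_len] attention scores for one head
    float* mlp_buf;     // [hidden_dim] SwiGLU intermediate
    float* logits;      // [vocab_size] output logits
    float* k_cache;     // [n_layers * max_seq_len * n_kv_heads * head_dim]
    float* v_cache;     // same shape as k_cache

    InferenceState(Arena& a, int dim, int hidden_dim, int vocab,
                   int n_layers, int n_kv_heads, int head_dim, int max_seq_len)
        : hidden(a.alloc_floats(dim)),
          q(a.alloc_floats(dim)),
          att_scores(a.alloc_floats(max_seq_len)),
          mlp_buf(a.alloc_floats(hidden_dim)),
          logits(a.alloc_floats(vocab)),
          k_cache(a.alloc_floats((size_t)n_layers * max_seq_len * n_kv_heads * head_dim)),
          v_cache(a.alloc_floats((size_t)n_layers * max_seq_len * n_kv_heads * head_dim)) {}
};
```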
Inference Lifecycle
How text generation works
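In outline, generation is a loop over single tokens (batch size 1): encode the prompt, run each prompt token through the model to fill the KV cache, then repeatedly sample from the logits and feed the sampled token back in. The sketch below uses hypothetical interface declarations (`encode`, `forward`, `sample`, `decode`) standing in for the engine's actual API; stopping at an end-of-sequence token is omitted.

```cpp
#include <string>
#include <vector>

// Hypothetical interfaces standing in for the real tokenizer, model, and sampler.
std::vector<int> encode(const std::string& prompt);
std::string decode(int token);
const float* forward(int token, int pos);  // returns logits for the next token
int sample(const float* logits, float temperature);

std::string generate(const std::string& prompt, int max_new_tokens, float temperature) {
    std::vector<int> tokens = encode(prompt);
    std::string out;

    // Prefill: run every prompt token so the KV cache covers positions [0, n).
    int pos = 0;
    const float* logits = nullptr;
    for (int t : tokens) logits = forward(t, pos++);

    // Decode: one token per step, each step reusing the cached K/V of all past tokens.
    for (int i = 0; i < max_new_tokens; ++i) {
        int next = sample(logits, temperature);
        out += decode(next);
        logits = forward(next, pos++);
    }
    return out;
}
```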
Single Transformer Layer
Inside one of 32 layers
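One layer in outline, as a sketch with placeholder weight types and sub-module declarations (the project's actual structs and signatures differ): Mistral uses pre-normalization, so each sub-block normalizes its input and adds its output back onto the residual stream.

```cpp
#include <vector>

// Illustrative placeholders; only the control flow of one decoder layer is shown.
struct LayerWeights { const float* attn_norm; const float* mlp_norm; /* ... */ };
struct InferenceState;  // KV cache and scratch buffers (see the sketch above)

void rmsnorm(float* out, const float* x, const float* weight, int n);
void attention(float* out, const float* x, const LayerWeights& w,
               InferenceState& s, int layer, int pos);
void swiglu_mlp(float* out, const float* x, const LayerWeights& w);

// One of the 32 decoder layers: pre-norm, sub-block, residual add, twice over.
void transformer_layer(float* x, const LayerWeights& w, InferenceState& s,
                       int layer, int pos, int dim) {
    std::vector<float> normed(dim), delta(dim);

    // 1) RMSNorm -> grouped-query attention -> add back to the residual stream.
    rmsnorm(normed.data(), x, w.attn_norm, dim);
    attention(delta.data(), normed.data(), w, s, layer, pos);
    for (int i = 0; i < dim; ++i) x[i] += delta[i];

    // 2) RMSNorm -> SwiGLU MLP -> add back to the residual stream.
    rmsnorm(normed.data(), x, w.mlp_norm, dim);
    swiglu_mlp(delta.data(), normed.data(), w);
    for (int i = 0; i < dim; ++i) x[i] += delta[i];
}
```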
Usage
Building and running
Export Model
python3 export_mistral.py \
--model_dir ../Mistral-7B-v0.1 \
--out ./mistral.bin \
--quant f32   # or int8
Build & Run
# Compile
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
# Run inference
./torchless ../mistral.bin "Paris is the capital of"
Code Structure
Source Files
- src/tokenizer/: BPE tokenizer (209 lines)
- src/model/mistral/: Transformer modules (181 lines)
- src/backend/cpu/: Math kernels (154 lines)
- src/loader/: Model loading (157 lines)
- src/common/: Tensor infrastructure (107 lines)
- main.cpp: Inference entry point
Python Scripts
- export_mistral.py: HF to binary converter
- parity/: Component validation tests
Build
- CMakeLists.txt: CMake configuration
- requirements.txt: Python dependencies
Development Focus
Current Work
- Performance optimization
- CPU SIMD vectorization
- Custom CUDA kernels
- Mistral 3 architecture support
Planned Features
- Temperature scaling for sampling
- CPU multithreading optimization
- Chat/conversation interface
- Ministral 3 3B support
Key Takeaways
Framework-free inference
Demonstrates that ML frameworks are conveniences, not requirements. Every operation can be implemented directly.
Zero-copy architecture
Memory-mapped files and tensor views eliminate data copying. Model weights accessed directly from disk.
Quantization from scratch
Per-group INT8 quantization with on-the-fly dequantization. Understand exactly how model compression works.
Educational codebase
~1200 lines of readable C++ covering the complete inference pipeline. Great for learning transformer internals.