
mini-sglang

LLM inference framework • Radix cache • Overlap scheduling • CUDA graphs

What is mini-sglang?

A lightweight yet high-performance LLM inference framework designed as a transparent reference implementation. It achieves production-grade throughput while remaining readable and modular (~5,000 lines of core Python, ~7,300 total with docs and tests).

🌲 Radix Cache: prefix matching for KV cache reuse across requests (sketched after this list)
Overlap Scheduling: hide CPU overhead by overlapping it with GPU compute
📊 CUDA Graphs: capture and replay decode graphs for minimal latency (sketched after this list)
🔀 Tensor Parallelism: multi-GPU inference with synchronized scheduling
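
To make the radix-cache idea concrete, here is a minimal sketch of prefix matching over cached token sequences. The names (RadixNode, RadixCache, match_prefix) and the one-node-per-token layout are illustrative assumptions, not mini-sglang's actual data structures; the real cache would attach handles to KV memory at the nodes, but the lookup follows the same idea.

# Minimal sketch of radix-style prefix matching (illustrative names, not the
# actual mini-sglang API). Each node maps one token id to a child; a node may
# carry a handle to the KV-cache entries computed for the prefix ending there.
from dataclasses import dataclass, field

@dataclass
class RadixNode:
    children: dict = field(default_factory=dict)   # token id -> RadixNode
    kv_handle: object = None                       # e.g. indices of cached KV pages

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kv_handle):
        # Record that KV entries for this exact token prefix are resident.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

    def match_prefix(self, tokens):
        # Return (matched_len, kv_handle) for the longest cached prefix.
        node, matched, handle = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                matched, handle = i + 1, node.kv_handle
        return matched, handle

cache = RadixCache()
cache.insert([1, 2, 3, 4], kv_handle="pages-A")           # e.g. a shared system prompt
matched, handle = cache.match_prefix([1, 2, 3, 4, 9, 9])
# matched == 4: only the last two tokens need prefill, the rest reuse "pages-A"

With a shared system prompt, most of each request's prefill becomes a cache hit, which is where the cross-request reuse comes from.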
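
The CUDA-graph path can be illustrated with PyTorch's public torch.cuda.CUDAGraph API. The tiny linear layer and static buffers below are placeholders standing in for one decode step of the real engine, not mini-sglang internals.

import torch

# Placeholder for one decode step: in the real engine this is a full forward
# pass over the current batch; here it is just a small linear layer.
model = torch.nn.Linear(16, 16, device="cuda")
static_input = torch.zeros(8, 16, device="cuda")   # fixed buffer reused every step

# Warm up on a side stream so capture starts from a clean state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Every subsequent step: copy new data into the static buffer and replay.
static_input.copy_(torch.randn(8, 16, device="cuda"))
graph.replay()                                     # relaunches all captured kernels at once
result = static_output.clone()

Replaying a captured graph skips per-kernel launch overhead on the CPU, which is what keeps small decode batches latency-bound on the GPU rather than on Python.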

System Overview

HTTP clients (OpenAI-compatible API) → FastAPI server (/v1/chat/completions, SSE streaming) → Tokenizer (N workers) → Scheduler (one per GPU) → Engine (GPU compute)

Process isolation + ZMQ message passing = fault tolerance + parallelism
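
The arrows above are separate OS processes connected by ZMQ sockets. A minimal sketch of that hand-off, using pyzmq PUSH/PULL over IPC (the endpoint name and message shape are illustrative, not the actual mini-sglang protocol):

import multiprocessing as mp
import zmq

def tokenizer_proc(addr: str) -> None:
    # Producer side: serializes a request and pushes it to the scheduler.
    sock = zmq.Context().socket(zmq.PUSH)
    sock.connect(addr)
    sock.send_pyobj({"rid": 1, "input_ids": [1, 2, 3, 4]})

def scheduler_proc(addr: str) -> None:
    # Consumer side: pulls requests, batches them, drives the GPU engine.
    sock = zmq.Context().socket(zmq.PULL)
    sock.bind(addr)
    req = sock.recv_pyobj()
    print("scheduler got request", req["rid"], "with", len(req["input_ids"]), "tokens")

if __name__ == "__main__":
    addr = "ipc:///tmp/minisgl-demo"          # illustrative endpoint
    sched = mp.Process(target=scheduler_proc, args=(addr,))
    sched.start()
    tok = mp.Process(target=tokenizer_proc, args=(addr,))
    tok.start()
    tok.join()
    sched.join()

Because each stage is its own process, a crash in one stage does not corrupt the others, and tokenization and scheduling run concurrently with GPU compute.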
Core code: ~5,000 lines of Python
Total: ~7,300 lines including docs/tests
Models: Llama-3 and Qwen-3 supported
Performance: 10K+ tok/s (0.6B model)
Usage
# Start server
python -m minisgl.server.launch --model meta-llama/Llama-3-8B

# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "stream": true
  }'

# Response streams via SSE
data: {"choices": [{"delta": {"content": "4"}}]}
data: {"choices": [{"delta": {"content": "!"}}]}
data: {"choices": [{"finish_reason": "stop"}]}
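
Since the server speaks the OpenAI chat-completions protocol, the stock openai Python client can be pointed at it as well. A minimal sketch, assuming a local deployment on port 8000 (the dummy API key and model string are assumptions about your setup):

from openai import OpenAI

# Point the standard OpenAI client at the local mini-sglang server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",          # model name as served locally
    messages=[{"role": "user", "content": "What is 2+2?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()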