LLM inference framework • Radix cache • Overlap scheduling • CUDA graphs
A lightweight yet high-performance LLM inference framework designed as a transparent reference implementation. It achieves production-grade throughput while remaining readable and modular (~7,300 lines of Python).
Core techniques, each sketched below in simplified form:

- Radix cache: prefix matching for KV-cache reuse across requests
- Overlap scheduling: hide CPU overhead by overlapping it with GPU compute
- CUDA graphs: capture and replay decode graphs to minimize kernel-launch latency
- Multi-GPU: inference across multiple devices with synchronized scheduling
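
To make the radix cache concrete, here is a minimal sketch of prefix matching over token IDs. It uses a plain per-token trie rather than a compressed radix tree, and the names (`RadixCache`, `match_prefix`, `insert`) are illustrative, not minisgl's actual API:

```python
# Sketch: a trie keyed by token IDs, where each node remembers the
# KV-cache slot that holds its token. Matching a new request against the
# trie tells us how many prompt tokens can skip prefill entirely.

class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_slot = None  # index of the cached KV entry for this token

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return KV slots for the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots  # these prefix tokens can reuse cached KV entries

    def insert(self, tokens, kv_slots):
        """Record that `tokens` now has KV entries stored at `kv_slots`."""
        node = self.root
        for t, slot in zip(tokens, kv_slots):
            node = node.children.setdefault(t, RadixNode())
            node.kv_slot = slot


cache = RadixCache()
cache.insert([1, 2, 3, 4], [0, 1, 2, 3])      # first request fills the cache
reused = cache.match_prefix([1, 2, 3, 9])     # second request shares a prefix
print(f"reuse {len(reused)} of 4 prefix tokens")  # -> reuse 3 of 4 prefix tokens
```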
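
Overlap scheduling exploits the fact that CUDA kernel launches are asynchronous: the call returns while the GPU is still working, so the CPU can prepare the next step in that window. A minimal sketch of the pattern (requires a CUDA device; both functions are stand-ins, not minisgl code):

```python
import torch

def forward(batch):
    # Stand-in for the model's decode step; kernel launches are async,
    # so this returns before the GPU has finished computing.
    return batch @ batch

def schedule_next_step():
    # Stand-in for CPU-side scheduler work (batch composition, cache
    # lookups, building metadata) that we want to hide behind GPU compute.
    return {"batch_size": 8, "positions": list(range(8))}

x = torch.randn(4096, 4096, device="cuda")
for _ in range(16):
    out = forward(x)             # GPU starts working immediately
    plan = schedule_next_step()  # CPU overhead overlaps with GPU compute
    torch.cuda.synchronize()     # wait only once the next step is planned
```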
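
CUDA graph usage follows the standard PyTorch pattern: warm up, capture one decode step with fixed shapes into static buffers, then replay the whole captured kernel sequence with a single launch. A simplified sketch, not the framework's capture code:

```python
import torch

# One "decode step" stand-in with fixed shapes (CUDA graphs require
# static shapes and static input/output buffers).
model = torch.nn.Linear(4096, 4096, device="cuda")
static_input = torch.randn(8, 4096, device="cuda")

# Warm up on a side stream before capture, as recommended by PyTorch.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer, then relaunch the whole
# captured kernel sequence with a single call instead of per-kernel launches.
for _ in range(16):
    static_input.copy_(torch.randn(8, 4096, device="cuda"))
    graph.replay()
```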
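
For multi-GPU inference, synchronized scheduling means every rank must execute the same batch at the same step. A minimal sketch of the broadcast-based idea with `torch.distributed` (run under `torchrun --nproc-per-node=N`; this illustrates the pattern only, not minisgl's scheduler):

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank = dist.get_rank()

if rank == 0:
    # Only the scheduler rank inspects the request queue and picks the batch.
    next_tokens = torch.tensor([101, 2054, 2003, 1029], device="cuda")
else:
    # Other ranks allocate an empty buffer of the agreed shape.
    next_tokens = torch.empty(4, dtype=torch.int64, device="cuda")

# After the broadcast, every rank sees the same batch and can run its
# shard of the model in lockstep.
dist.broadcast(next_tokens, src=0)
dist.destroy_process_group()
```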
Quick start:

```bash
# Start server
python -m minisgl.server.launch --model meta-llama/Llama-3-8B

# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "stream": true
  }'
```

The response streams back as server-sent events (SSE):

```text
data: {"choices": [{"delta": {"content": "4"}}]}
data: {"choices": [{"delta": {"content": "!"}}]}
data: {"choices": [{"finish_reason": "stop"}]}
```