LLM inference framework • Radix cache • Overlap scheduling • CUDA graphs
A lightweight yet high-performance LLM inference framework designed as a transparent reference implementation. It achieves production-grade throughput while remaining readable and modular (~7,300 lines of Python).
Core techniques, each sketched below in simplified form:

- Radix cache: prefix matching for KV-cache reuse across requests
- Overlap scheduling: hide CPU overhead by overlapping it with GPU compute
- CUDA graphs: capture and replay decode graphs to minimize kernel-launch latency
- Multi-GPU: inference across multiple devices with synchronized scheduling
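
To make the radix cache concrete, here is a minimal sketch of prefix matching over token IDs. It uses a plain per-token trie rather than a compressed radix tree, and the names (`RadixCache`, `match_prefix`, `insert`) are illustrative, not minisgl's actual API:

```python
# Sketch: a trie keyed by token IDs, where each node remembers the
# KV-cache slot that holds its token. Matching a new request against the
# trie tells us how many prompt tokens can skip prefill entirely.

class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.kv_slot = None  # index of the cached KV entry for this token

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return KV slots for the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return slots  # these prefix tokens can reuse cached KV entries

    def insert(self, tokens, kv_slots):
        """Record that `tokens` now has KV entries stored at `kv_slots`."""
        node = self.root
        for t, slot in zip(tokens, kv_slots):
            node = node.children.setdefault(t, RadixNode())
            node.kv_slot = slot


cache = RadixCache()
cache.insert([1, 2, 3, 4], [0, 1, 2, 3])      # first request fills the cache
reused = cache.match_prefix([1, 2, 3, 9])     # second request shares a prefix
print(f"reuse {len(reused)} of 4 prefix tokens")  # -> reuse 3 of 4 prefix tokens
```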
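
Overlap scheduling exploits the fact that CUDA kernel launches are asynchronous: the call returns while the GPU is still working, so the CPU can prepare the next step in that window. A minimal sketch of the pattern (requires a CUDA device; both functions are stand-ins, not minisgl code):

```python
import torch

def forward(batch):
    # Stand-in for the model's decode step; kernel launches are async,
    # so this returns before the GPU has finished computing.
    return batch @ batch

def schedule_next_step():
    # Stand-in for CPU-side scheduler work (batch composition, cache
    # lookups, building metadata) that we want to hide behind GPU compute.
    return {"batch_size": 8, "positions": list(range(8))}

x = torch.randn(4096, 4096, device="cuda")
for _ in range(16):
    out = forward(x)             # GPU starts working immediately
    plan = schedule_next_step()  # CPU overhead overlaps with GPU compute
    torch.cuda.synchronize()     # wait only once the next step is planned
```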
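
CUDA graph usage follows the standard PyTorch pattern: warm up, capture one decode step with fixed shapes into static buffers, then replay the whole captured kernel sequence with a single launch. A simplified sketch, not the framework's capture code:

```python
import torch

# One "decode step" stand-in with fixed shapes (CUDA graphs require
# static shapes and static input/output buffers).
model = torch.nn.Linear(4096, 4096, device="cuda")
static_input = torch.randn(8, 4096, device="cuda")

# Warm up on a side stream before capture, as recommended by PyTorch.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer, then relaunch the whole
# captured kernel sequence with a single call instead of per-kernel launches.
for _ in range(16):
    static_input.copy_(torch.randn(8, 4096, device="cuda"))
    graph.replay()
```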
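
For multi-GPU inference, synchronized scheduling means every rank must execute the same batch at the same step. A minimal sketch of the broadcast-based idea with `torch.distributed` (run under `torchrun --nproc-per-node=N`; this illustrates the pattern only, not minisgl's scheduler):

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank = dist.get_rank()

if rank == 0:
    # Only the scheduler rank inspects the request queue and picks the batch.
    next_tokens = torch.tensor([101, 2054, 2003, 1029], device="cuda")
else:
    # Other ranks allocate an empty buffer of the agreed shape.
    next_tokens = torch.empty(4, dtype=torch.int64, device="cuda")

# After the broadcast, every rank sees the same batch and can run its
# shard of the model in lockstep.
dist.broadcast(next_tokens, src=0)
dist.destroy_process_group()
```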
Quick start:

```bash
# Start server
python -m minisgl.server.launch --model meta-llama/Llama-3-8B

# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "stream": true
  }'
```

The response streams back as server-sent events (SSE):

```text
data: {"choices": [{"delta": {"content": "4"}}]}
data: {"choices": [{"delta": {"content": "!"}}]}
data: {"choices": [{"finish_reason": "stop"}]}
```