Build an LLM inference framework from scratch in Rust

Make it work, make it fast, make it serve.

Training frontier models gets the headlines; making them useful in the real world is a different craft. Inference engineering is the work of turning a raw checkpoint into a system that streams fast responses, keeps hardware saturated without blowing memory, and stays responsive when many users hit it at once.

In this series, you build that system end to end. One crate, inferno, grows chapter by chapter: tokenizer, tensors and GGUF loading, a forward pass you implement yourself, CPU and GPU kernels, KV caching and batching, and finally HTTP serving. It goes from an empty repo to an OpenAI-compatible server any chat client can talk to.

Nothing here is pseudocode or hand-waving. Every chapter builds working Rust, and all of it lives in the series repo. The commit history follows the chapters one to one: nineteen commits, one per chapter, named by act and chapter number. Each commit only adds code and stands on its own, so cargo check and cargo run work at every checkpoint.

Read the chapters in order, or check out a chapter's commit and read the diff with the text as commentary.

What you'll build

An OpenAI-compatible HTTP server that loads a real checkpoint, runs on GPU (with tuned CPU paths in the same codebase), and handles concurrent chat traffic: /v1/chat/completions JSON and SSE, continuous batching, paged KV, prefix caching. Don't worry if half those terms are unfamiliar: each gets its own chapter.

The nineteen chapters below are how we get there: in Rust, with no inference framework underneath, one abstraction at a time, and with every major optimization implemented and benchmarked.

What serving a request looks like

HTTP arrives. A chat template formats the messages; the tokenizer converts them to ids; the scheduler decides when and with whom this request batches; the model runs forward through the tensor kernels; token ids stream back out as SSE events.

Figure 1: Chat path. One /v1/chat/completions trace top-to-bottom; node colour = act. The red dashed loop on the right is the decode loop (the only backward arrow).

Every box in that diagram is a chapter. The ones that feel like they should "just work" (the tokenizer, the chat template) turn out to be surprisingly rich up close. The ones that sound complicated (continuous batching, paged attention) turn out to be straightforward once you have built the right abstractions.

Why build your own

Most tutorials pick one arc and stop. Two common shapes:

  1. PyTorch-first (or a similar stack). You get a model running, then the story is serving and deployment (containers, GPUs, batching APIs, autoscaling) while what happens inside forward / generate stays mostly a black box.
  2. Tensors and kernels first. You build memory layouts, maybe a matmul or attention block, climb up to a model that can sample tokens, and often end there, before you have to care about concurrent HTTP, KV fragmentation, or interleaved streams.

Each leaves half the system implicit. You can read about vLLM, SGLang, TGI, and llama.cpp all day, but until you have implemented a KV cache and seen it speed up decode 20×, written a Metal matmul kernel and watched the GPU take over from the CPU, scheduled a batch of heterogeneous requests and seen tokens stream out interleaved, the ideas stay abstract. This series walks both halves, hands on.

Scope: what "from scratch" means

"If you wish to make an apple pie from scratch, you must first invent the universe." -- Carl Sagan, Cosmos (1980)

We're not going quite that far, but we come close. No black-box generate(), no mystery layers. We start from an empty repo and build the full inference stack ourselves -- from loading raw model weights and running them on custom kernels, to serving completions over a standard protocol for any client to consume.

What you bring (outside this codebase):

  • A model file: a Qwen3 0.6B GGUF checkpoint on disk. The tokenizer (vocabulary and merge table) is read straight out of that file's metadata, so there is nothing else to download.
  • Callers: anything that can send HTTP to the address you bind: curl, a script, the Python openai SDK, a browser chat UI. The series implements the server; it does not ship a product front-end.
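To make the model-file bullet a little more concrete: a GGUF file opens with a small fixed header (magic bytes, a version, a tensor count, and a metadata key/value count), and the tokenizer data lives in the metadata that follows. The sketch below parses just that header from a byte buffer. It assumes GGUF version 2 or later, where both counts are 64-bit little-endian; the struct and function names are illustrative and not taken from the inferno crate.

```rust
use std::convert::TryInto;

/// Fixed-size prefix of a GGUF file (assumed: version 2 or later, little-endian).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Parse the first 24 bytes of a GGUF file.
/// Returns None if the buffer is too short or the magic bytes are wrong.
fn parse_gguf_header(bytes: &[u8]) -> Option<GgufHeader> {
    if bytes.len() < 24 || &bytes[0..4] != b"GGUF" {
        return None;
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().ok()?);
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().ok()?);
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().ok()?);
    Some(GgufHeader { version, tensor_count, metadata_kv_count })
}
```

In practice you would read the first 24 bytes of the checkpoint and pass them in; everything after the header (metadata, then the tensor index) needs a real parser, which is what chapter I.1 covers.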

Three acts

The series is organized by what the engine can do at the end of each act, not by which subsystem it touches. The arc "make it work, make it fast, make it serve" mirrors how real inference engines actually evolved (and how you would build one starting today).

Act I: Make it work. Six chapters, from file format and tokenizer through greedy generation. At the end you have model-generate: coherent text from a real Qwen3 GGUF file, one token at a time, in pure Rust. It is correct, but slow enough that Act II has room to run.
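The greedy loop that Act I ends on is small enough to sketch here. This is a minimal version under stated assumptions, not the series' actual code: `forward` is a hypothetical stand-in for the model (token ids in, next-token logits out), and the function names are made up for illustration. Note that it reruns the model over the whole sequence on every step, which is one reason the result is correct but slow.

```rust
/// Index of the largest logit; ties go to the lowest index.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in logits.iter().enumerate() {
        if x > logits[best] {
            best = i;
        }
    }
    best
}

/// Greedy decoding: run the model on everything so far, take the argmax
/// token, append it, and repeat until `eos` or `max_new` new tokens.
/// `forward` is a placeholder for the real model's forward pass.
fn generate_greedy<F>(mut forward: F, prompt: &[usize], max_new: usize, eos: usize) -> Vec<usize>
where
    F: FnMut(&[usize]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = forward(&tokens); // full recompute every step (no KV cache)
        let next = argmax(&logits);
        if next == eos {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```

Passing the model as a closure keeps the loop testable with a toy function long before a real forward pass exists.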

Act II: Make one request as fast as possible. Six chapters: a benchmark harness to measure with, then the optimization ladder of KV cache, SIMD CPU kernels, multithreading, a Metal GPU backend, and Q8 weight quantization. One request runs end-to-end on a tuned path, and every step is benchmarked.
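As a taste of the last rung, here is a rough sketch of block-wise 8-bit quantization in the style of GGML's Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus one signed byte per value. Two simplifications are assumptions of this sketch: the scale is kept as f32 (GGML stores it as f16), and the type and function names are hypothetical.

```rust
/// Values per block. 32 matches GGML's Q8_0 layout.
const BLOCK: usize = 32;

/// One quantized block: a shared scale plus one signed byte per value.
/// (Assumption: f32 scale here to avoid a dependency; GGML uses f16.)
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

/// Map the largest magnitude in the block to 127 and round the rest.
fn quantize_block(x: &[f32; BLOCK]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    // An all-zero block has scale 0; avoid dividing by it.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, v) in quants.iter_mut().zip(x.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, quants }
}

/// Recover approximate f32 values: each is off by at most half a scale step.
fn dequantize_block(b: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(b.quants.iter()) {
        *o = *q as f32 * b.scale;
    }
    out
}
```

Keeping the scale per block rather than per tensor limits how far one outlier weight can degrade the precision of its neighbours.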

Act III: Serve many requests at once. Seven chapters: chat templates, an OpenAI-compatible HTTP API, SSE streaming, paged KV, radix-tree prefix caching, a decode scheduler, and batched decode. Concurrent chats stream per user; work is batched and scheduled; shared prefixes reuse KV.
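The SSE part of Act III comes down to a simple wire format: each event is a `data:` line followed by a blank line, and OpenAI-style streams end with a `[DONE]` sentinel. A minimal sketch of that framing follows. The function names are hypothetical, and the payloads are assumed to be JSON already serialized on a single line, since a payload containing raw newlines would need one `data:` line per line.

```rust
/// Frame one single-line payload as a server-sent event:
/// a `data:` line followed by a blank line.
fn sse_event(payload: &str) -> String {
    format!("data: {}\n\n", payload)
}

/// Frame a sequence of already-serialized chunks, then append the
/// `[DONE]` sentinel that OpenAI-style streaming clients wait for.
fn sse_stream(chunks: &[&str]) -> String {
    let mut out = String::new();
    for c in chunks {
        out.push_str(&sse_event(c));
    }
    out.push_str(&sse_event("[DONE]"));
    out
}
```

A real server writes each event to the socket as soon as its token is ready instead of building one string; the framing is the same.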

What you should know already

You should be comfortable with systems programming (the codebase is Rust, but the patterns translate) and willing to read math (softmax(QK^T / √d)V and friends show up constantly). No prior ML experience is required. The math in each chapter is developed as needed, and the transformer itself gets built up layer by layer in chapter I.5.
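If the formula above looks opaque, it may help to see it as plain loops. The sketch below computes softmax(q·K^T / √d)V for a single query vector against a list of key and value rows. It is an unoptimized illustration with hypothetical function names, it assumes at least one key/value row, and it is not how the series' kernels are written.

```rust
/// Numerically stable softmax, in place: subtract the max before exponentiating.
fn softmax(xs: &mut [f32]) {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in xs.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

/// Attention output for one query `q` against `keys` and `values`
/// (one row per position): softmax(q·K^T / sqrt(d)) V.
/// Assumes `keys` and `values` have the same, non-zero number of rows.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d = q.len() as f32;
    // One scaled dot product per key row.
    let mut scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
        .collect();
    softmax(&mut scores);
    // Weighted sum of the value rows.
    let mut out = vec![0.0f32; values[0].len()];
    for (w, v) in scores.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

With two identical keys the weights come out equal, so the output is simply the average of the two value rows.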

The full chapter map

Each entry below is a chapter or an act page.

Act I · make it work: raw bytes to a talking model, in six chapters

Curtain up: zero to working LLM · everything built from scratch

  • I.1 GGUF: the file format · header, metadata, tensor index
  • I.2 Tokenizer: text ↔ token ids · BPE merges and vocabulary
  • I.3 Tensor: the core data structure · flat f32 buffer + shape
  • I.4 Backend: the compute trait · matmul, softmax, RoPE
  • I.5 Qwen3 forward: the architecture · embed, 28x block, logits
  • I.6 Greedy generation: the loop · argmax, decode, append, repeat

Curtain down: a working LLM · slow, FP32, single-threaded, and correct

Act II · make it fast: correct but slow → turn every speed dial, in six chapters

Curtain up: we have correctness · now measure, then optimise

  • II.1 Benchmark harness: measure before you optimise · tok/s, ms/step
  • II.2 KV cache: stop recomputing past keys and values
  • II.3 SIMD CPU backend: one instruction, many lanes · NEON
  • II.4 Multithreaded CPU backend: split the matmul across cores · rayon
  • II.5 Metal GPU backend: Apple Silicon compute shaders · MSL
  • II.6 Q8_0 quantization: 8-bit weights · half the bytes, near-same quality

Curtain down: a fast single-request LLM · ready for concurrency in Act III

Act III · make it serve: one fast request → many concurrent users, in seven chapters

Curtain up: fast, but single-user · now serve a crowd

  • III.1 Chat pipeline: render messages[] → prompt · chat template
  • III.2 HTTP API: OpenAI-compatible /v1/chat/completions
  • III.3 SSE streaming: tokens stream back · text/event-stream
  • III.4 Paged KV cache: KV cache as paged blocks · no fragmentation
  • III.5 Radix prefix cache: share prompt prefixes across users · radix tree
  • III.6 Decode scheduler: interleave many requests on one model
  • III.7 Batched decode: many slots · one fused forward pass

Curtain down: an OpenAI-compatible server · many chats, streaming, fast

Let's start Act I -- Make it work