Act 1: Make it work
It's tempting to skip the "make it work" phase, grab a library, call model.generate(), and dive straight into the performance work that looks interesting. Don't. The cost of that shortcut is that the performance work stays abstract: you never feel the thing you're speeding up. When the KV cache makes decode dozens of times faster in II.2, it matters that you're comparing it against a forward pass you built yourself and watched take seconds per token. Every Act 2 chapter is a reaction to something you built here.
So the point of Act 1 is narrow and non-negotiable: get a real Qwen3 checkpoint producing coherent text in a program you wrote every line of. Not fast. Not pretty. Not servable. Just working.
The reference model
We use Qwen3 0.6B as the running example: a modern decoder-only transformer small enough to iterate on quickly, large enough to surface every interesting bottleneck. Its architecture is representative of the current generation: RMSNorm, grouped-query attention with RoPE, and a SwiGLU MLP. Don't worry if those terms are unfamiliar; each is built up, from scratch, in chapter I.5.
Everything in the codebase is written generically against a Model trait, so swapping in a different architecture is a matter of writing one new file. Qwen3 0.6B just happens to be what every CLI calls.
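To make the "one new file per architecture" idea concrete, here is a minimal sketch of what such a seam could look like. The trait's methods, the `CountingModel` toy type, and the `most_likely_next` helper are all hypothetical names invented for illustration; the actual Model trait in the codebase may be shaped quite differently.

```rust
/// Hypothetical sketch of a `Model` trait; the real trait may differ.
pub trait Model {
    /// Number of entries in the vocabulary (the length of the logits vector).
    fn vocab_size(&self) -> usize;
    /// Run one forward pass over `tokens` and return next-token logits.
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Toy stand-in architecture (assumes `vocab > 0`): it always favors
/// the token id that follows the last one, wrapping around the vocabulary.
pub struct CountingModel {
    pub vocab: usize,
}

impl Model for CountingModel {
    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        let next = tokens.last().map_or(0, |&t| (t as usize + 1) % self.vocab);
        logits[next] = 1.0;
        logits
    }
}

/// Generic caller: works with any `Model` and knows nothing about
/// the architecture behind it.
pub fn most_likely_next<M: Model>(model: &mut M, tokens: &[u32]) -> u32 {
    let logits = model.forward(tokens);
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}
```

Under this sketch, supporting another architecture would mean adding one more type with its own `impl Model` block, while callers such as `most_likely_next` stay unchanged.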
The path
Six chapters climb from raw bytes on disk to a checkpoint generating real tokens. Each depends on the previous one, and each adds exactly one capability.
Figure: Act I at a glance. Each chapter produces a concrete artifact; each depends on the one above.
- I.1: GGUF. The file format ggml/llama.cpp uses. Parse the header, the typed key-value metadata (hyperparameters, vocab, chat template), and the tensor index. We don't run anything yet. We learn the format and build a gguf-inspect tool to dump what's inside.
- I.2: Tokenizer. Byte-pair encoding: turn text into token ids and back, handling UTF-8 correctly. The vocabulary and merge table come straight from the GGUF metadata parsed in I.1. Ships a tokenizer-demo binary.
- I.3: Tensor. A minimal Tensor type (a flat Vec<f32> plus a shape) that every weight and activation flows through. Small, but it pins down the memory layout the rest of the engine assumes.
- I.4: Backend. A Backend trait listing every numeric operation a transformer needs (matmul, softmax, the pieces of rmsnorm, silu, RoPE) and a plain scalar CPU implementation of all of them. This trait is the seam every faster backend in Act 2 plugs into.
- I.5: Qwen3 forward. Wire the 28-layer forward pass (token embedding, RMSNorm, grouped-query attention with RoPE, SwiGLU MLP, residuals, final norm, output head) and load the real weights out of the GGUF file. One forward pass turns a prompt into next-token logits.
- I.6: Greedy generation. The autoregressive loop: take the highest-probability token, feed it back in, run forward again, repeat until an end-of-sequence token or a length cap. End-to-end multi-token generation behind a model-generate CLI.
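The Tensor and Backend pieces from I.3 and I.4 can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not the code those chapters arrive at: the row-major layout, the field names, the `ScalarCpu` type, and the two method signatures shown are all guesses made for the example, and only two of the listed operations appear.

```rust
/// Minimal tensor: a flat buffer plus a shape.
/// Row-major layout is an assumption made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub data: Vec<f32>,
    pub shape: Vec<usize>,
}

impl Tensor {
    /// Build a tensor, checking that the buffer length matches the shape.
    pub fn new(data: Vec<f32>, shape: Vec<usize>) -> Self {
        assert_eq!(data.len(), shape.iter().product::<usize>(), "shape/data mismatch");
        Tensor { data, shape }
    }
}

/// Hypothetical slice of a `Backend` trait: only two operations are shown.
pub trait Backend {
    /// Multiply an [m, k] tensor by a [k, n] tensor, giving [m, n].
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor;
    /// Softmax over the last dimension.
    fn softmax(&self, x: &Tensor) -> Tensor;
}

/// Plain scalar CPU implementation: no SIMD, no threads, no blocking.
pub struct ScalarCpu;

impl Backend for ScalarCpu {
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert!(a.shape.len() == 2 && b.shape.len() == 2, "matmul expects 2-D tensors");
        let (m, k, n) = (a.shape[0], a.shape[1], b.shape[1]);
        assert_eq!(k, b.shape[0], "inner dimensions must agree");
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    // Row-major indexing: element (r, c) lives at r * cols + c.
                    acc += a.data[i * k + p] * b.data[p * n + j];
                }
                out[i * n + j] = acc;
            }
        }
        Tensor::new(out, vec![m, n])
    }

    fn softmax(&self, x: &Tensor) -> Tensor {
        // Treat the buffer as rows whose length is the last dimension.
        let cols = *x.shape.last().expect("softmax needs at least one dimension");
        let mut out = x.data.clone();
        for row in out.chunks_mut(cols) {
            // Subtract the row maximum before exponentiating, for numerical stability.
            let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for v in row.iter_mut() {
                *v = (*v - max).exp();
                sum += *v;
            }
            for v in row.iter_mut() {
                *v /= sum;
            }
        }
        Tensor::new(out, x.shape.clone())
    }
}
```

Because callers only see the trait, a faster implementation can later replace `ScalarCpu` without touching the code that calls `matmul` or `softmax`.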
Getting started
Create the crate:
cargo new --lib inferno
cd inferno

cargo new --lib generates Cargo.toml and src/lib.rs. The default Cargo.toml is just package metadata:

[package]
name = "inferno"
version = "0.1.0"
edition = "2024"

[dependencies]

src/lib.rs comes with a sample add function and a test; delete both. Chapter I.1 starts from an empty file.
What "done" looks like
At the end of Act 1, model-generate "Once upon a time" produces a sensible continuation from a real Qwen3 checkpoint. Every line of code was written deliberately; you understand every number that moves through the model.
It will also be slow: seconds per token on a modern laptop. That's not a bug. That's the baseline Act 2 is built against.
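For orientation, the loop behind that baseline can be sketched with a per-step timer, which is one way to see the seconds-per-token figure for yourself. The names `generate_greedy`, `forward`, `eos`, and `max_new` are illustrative and not taken from the codebase, and the sketch assumes there is no KV cache yet, so each step re-runs the model over the whole sequence.

```rust
use std::time::{Duration, Instant};

/// Index of the largest logit (ties go to the lowest index).
pub fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy decoding over any forward function, timing each forward pass.
/// Returns the full token sequence (prompt included) and one duration per step.
pub fn generate_greedy<F>(
    mut forward: F,
    prompt: &[u32],
    eos: u32,
    max_new: usize,
) -> (Vec<u32>, Vec<Duration>)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut step_times = Vec::new();
    for _ in 0..max_new {
        let start = Instant::now();
        // Assumed: no KV cache, so the whole sequence is fed in again each step.
        let logits = forward(&tokens);
        step_times.push(start.elapsed());

        let next = argmax(&logits);
        if next == eos {
            // Stop on end-of-sequence without appending it.
            break;
        }
        tokens.push(next);
    }
    (tokens, step_times)
}
```

Passing a closure that wraps the real forward pass would yield one duration per generated token; averaging them gives the number that later optimizations are measured against.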