Act 1: Recap

Six chapters in, model-generate loads a real GGUF checkpoint off disk, tokenizes a prompt, runs a 28-layer Qwen3 forward pass, and greedily samples tokens until it hits an end-of-sequence token or a length cap. Every piece is code you wrote, from the Tensor struct to the RoPE rotation. No ML framework, no tch, no candle. Just the Rust standard library and one regex crate for tokenization.

What you have

  • A GGUF parser (I.1) that reads the binary file format llama.cpp uses (header, typed key-value metadata, the tensor index, and tensor payloads), plus a gguf-inspect tool to dump what's inside.
  • A BPE tokenizer (I.2) that turns text into token ids and back, byte-level so it round-trips any UTF-8, with special-token handling and a tokenizer-demo binary.
  • A Tensor type (I.3): a flat Vec<f32> plus a shape, with the row-major layout pinned down, and GGUF::load_tensor to fill one from disk.
  • A Backend trait and a scalar CpuBackend (I.4): every numeric operation a transformer needs (matmul, softmax, the pieces of RMSNorm, SiLU, RoPE) implemented as plain loops. The trait is the seam every future backend plugs into.
  • A Qwen3 forward pass (I.5): token embedding, 28 blocks of RMSNorm + grouped-query attention with RoPE + SwiGLU MLP, residual connections, final norm, tied output head, built from backend primitives and loaded with real weights.
  • A greedy autoregressive loop (I.6) that predicts, appends, and repeats until EOS or a token cap, behind the model-generate CLI.
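
As a reminder of how small the last piece is, here is a minimal sketch of a greedy loop of that shape. It is not the code from I.6: the NextTokenLogits trait, generate_greedy, and the toy CountingModel are hypothetical names introduced for illustration, and leaving the EOS token out of the returned sequence is an assumption.

```rust
/// Hypothetical stand-in for the model from I.5: given the whole sequence
/// so far, return one logit per vocabulary entry for the next token.
trait NextTokenLogits {
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Greedy sampling: the index of the largest logit. Ties go to the lowest id.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Predict, append, repeat until EOS or the token cap.
/// Assumption: the EOS token itself is not appended to the output.
fn generate_greedy<M: NextTokenLogits>(
    model: &M,
    prompt: &[u32],
    eos: u32,
    max_new_tokens: usize,
) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        // The whole sequence goes back through the model on every step.
        let logits = model.forward(&tokens);
        let next = argmax(&logits);
        if next == eos {
            break;
        }
        tokens.push(next);
    }
    tokens
}

/// Toy model for illustration only: always predicts (last token + 1) % vocab.
struct CountingModel {
    vocab: usize,
}

impl NextTokenLogits for CountingModel {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        let last = *tokens.last().expect("prompt must not be empty") as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

With CountingModel { vocab: 8 }, a prompt of [1, 2], and EOS id 5, the loop appends 3 and 4, then stops when the model predicts 5. Note that forward receives the full token slice each time, which matters for the next section.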

That is a real inference engine, by a strict reading of the words. It runs the model. It produces tokens. Hand it "Once upon a time" and it hands back a coherent continuation.

What's wrong

It's slow. Painfully slow: seconds per token on a modern laptop, and worse as the context grows. A hundred-token completion takes minutes. The reasons are structural, not accidental, and naming them precisely is what sets up Act 2:

  • The forward pass is re-run from scratch every step. The generation loop calls model.forward on the entire sequence each time it wants one new token. Generating token 50 recomputes the keys and values for all 49 tokens before it, even though those tokens haven't changed. The work per step grows with the sequence length, so the total cost of a completion grows roughly quadratically, and nearly all of it is repeated.
  • The CPU backend uses one core, one float at a time. No SIMD, no threading. A modern laptop has eight to twelve cores and SIMD registers at least 128 bits wide (four f32 values per instruction), and the scalar CpuBackend uses a single core and a single lane.
  • No GPU. The machine has a processor optimized for exactly the dense matrix multiplies that dominate inference, sitting completely idle.
  • FP32 weights. Every weight is a full 32-bit float. During decode, each matmul streams its entire weight matrix from memory to produce a single token, so once the arithmetic itself is fast, memory bandwidth becomes the bottleneck. At that point, halving the weight bytes roughly halves decode latency.
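
Two of these costs are easy to put numbers on. The sketch below counts how many token positions pass through the model with and without cached keys and values, and how many weight bytes a decode step has to stream at a given precision. The function names are hypothetical, and the counts are a simplified model (one unit of work per token position, every weight read once per step), not measurements.

```rust
/// Token positions processed when every step re-runs the whole sequence:
/// the first step sees `prompt_len` tokens, the next `prompt_len + 1`,
/// and so on for `new_tokens` steps.
fn positions_without_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    (0..new_tokens).map(|i| prompt_len + i).sum()
}

/// Token positions processed when keys and values are cached: the prompt
/// is processed once (yielding the first new token), then each later
/// step processes exactly one token.
fn positions_with_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + (new_tokens - 1)
    }
}

/// Bytes of weight data for `params` weights at `bits_per_weight` bits
/// each. Assumption: a decode step reads every weight exactly once.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}
```

For a 4-token prompt and a 100-token completion, positions_without_cache(4, 100) is 5350 while positions_with_cache(4, 100) is 103. For a hypothetical model with one billion weights, weight_bytes(1_000_000_000, 32) is 4 GB streamed per decode step, and 2 GB at 16 bits per weight.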

None of these is a bug. They are the honest cost of "make it work first": the simplest correct version, with every shortcut deliberately not taken.

The bridge to Act 2

Before fixing any of it, we measure. Optimizing without a benchmark is how you make something slower and don't notice. Act 2 opens with a benchmark harness that times prefill (processing the prompt) separately from decode (generating tokens), because the two have different bottlenecks. Only then does it climb the optimization ladder: a KV cache so each step processes one token instead of all of them, SIMD and multithreading for the CPU, a Metal GPU backend, and Q8_0 quantization to cut the weight bytes. Each chapter is a measured reaction to a slow thing you built here and watched run.
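
To make the prefill/decode split concrete, here is a rough sketch of what such a measurement could look like using only std::time. It is not the harness Act 2 builds: BenchResult, bench_generation, and the step closure are hypothetical names, and treating the first call as prefill and every later call as decode is an assumption about where to draw the line.

```rust
use std::time::{Duration, Instant};

/// Timing for one generation run, split into the two phases.
struct BenchResult {
    prefill: Duration,
    decode: Duration,
    prompt_tokens: usize,
    decode_tokens: usize,
}

impl BenchResult {
    fn prefill_tokens_per_sec(&self) -> f64 {
        rate(self.prompt_tokens, self.prefill)
    }

    fn decode_tokens_per_sec(&self) -> f64 {
        rate(self.decode_tokens, self.decode)
    }
}

/// Tokens per second, or 0.0 if the elapsed time is too small to measure.
fn rate(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        tokens as f64 / secs
    } else {
        0.0
    }
}

/// `step` is a hypothetical closure standing in for "run the model and
/// sample": it takes the sequence so far and returns the next token id.
fn bench_generation<F>(prompt: &[u32], decode_tokens: usize, mut step: F) -> BenchResult
where
    F: FnMut(&[u32]) -> u32,
{
    let mut tokens = prompt.to_vec();

    // Prefill: the first call has to process the whole prompt.
    let t0 = Instant::now();
    let first = step(&tokens);
    let prefill = t0.elapsed();
    tokens.push(first);

    // Decode: every later call produces one more token.
    let t1 = Instant::now();
    for _ in 0..decode_tokens {
        let next = step(&tokens);
        tokens.push(next);
    }
    let decode = t1.elapsed();

    BenchResult {
        prefill,
        decode,
        prompt_tokens: prompt.len(),
        decode_tokens,
    }
}
```

Run against the Act 1 engine, the decode number would still include the cost of re-running the whole sequence on every step, which is exactly the kind of thing a split measurement is meant to expose.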

Continue to Act 2: Make it fast.