Act 2: Recap
The single-request path is fast. Between the Act 1 baseline (seconds per decode token on one scalar CPU core) and the end of Act 2, each chapter carved off a specific inefficiency, and every speedup was measured in the same harness built in II.1. No claim in this act was a vibe; every one of them moved a number we printed.
What moved the needle
Six chapters, in order of what they did:
- II.1: Benchmark harness. No speedup: the instrument. A Metrics type that times every forward pass and reports time-to-first-token and decode tokens/second. It established the baseline and made everything after it honest. It also handed us the first clue: forward passes that got slower as the response grew.
- II.2: KV cache. The single biggest lever, and an algorithmic fix rather than a faster kernel. Without it, generating token N re-runs the network over all N prior tokens, so total work grows as N². With it, each decode step processes one token and reuses cached per-layer keys and values. Decode collapsed from growing-per-token to constant-per-token; throughput jumped more than 20×.
- II.3: SIMD CPU backend. The first kernel win. NEON intrinsics let matmul process 4 floats per instruction (16 per loop iteration with unrolling) instead of one. It is a second Backend that delegates everything but matmul to the scalar one. Roughly 4–8×, on both prefill and decode.
- II.4: Multithreaded CPU backend. rayon spreads matmul rows across every core. Near-linear scaling for prefill, which has hundreds of output rows to split. Deliberately no benefit for decode: a single-row matrix-vector product has no rows to split across cores, so the size threshold sends it straight back to the SIMD path.
- II.5: Metal GPU backend. A Metal Shading Language matmul kernel, command buffers, and Apple Silicon's unified memory for copy-free data sharing. Big prefill matmuls run across thousands of GPU threads; small decode matmuls fall back to the CPU, again by threshold.
- II.6: Q8_0 quantization. The fix for decode. The model's weights ship as 8-bit quantized blocks, roughly a quarter the bytes of FP32, and the engine learned to multiply against them directly, with dequantization fused into the dot product. Decode, which is bandwidth-bound, nearly doubled.
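The KV cache entry in the list above is easiest to see as a count. The sketch below is only an illustration: the function names are hypothetical, and it counts token positions pushed through the layers, not the attention reads over cached keys and values, which still grow with sequence length.

```rust
/// Token positions pushed through the network to generate `n_new` tokens
/// after a prompt of `prompt_len` tokens, WITHOUT a KV cache: every step
/// re-runs the whole sequence so far.
fn positions_without_cache(prompt_len: u64, n_new: u64) -> u64 {
    // Step i (0-based) processes the prompt plus the i tokens generated so far.
    (0..n_new).map(|i| prompt_len + i).sum()
}

/// The same count WITH a KV cache: one prefill over the prompt (which also
/// yields the first new token), then one position per remaining decode step.
fn positions_with_cache(prompt_len: u64, n_new: u64) -> u64 {
    if n_new == 0 {
        return 0;
    }
    prompt_len + (n_new - 1)
}
```

For a 10-token prompt and 100 generated tokens, the uncached count is 5,950 positions and the cached count is 109. The first grows quadratically in the response length, the second linearly.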
The one idea that explains all of it
Every chapter maps onto the split the act intro opened with: inference is prefill (compute-bound: reading the prompt, big matrix-by-matrix multiplies) and decode (memory-bandwidth-bound: writing the response one token at a time, matrix-by-vector multiplies).
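One rough way to quantify that split is arithmetic intensity: how many floating-point operations a matmul performs per byte of weights it has to read. The helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores activation and output traffic and assumes each weight is read from memory once per matmul.

```rust
/// FLOPs performed per byte of weight read for Y = X * W, where X is
/// [rows x k] activations and W is [k x n] weights.
/// Prefill has many rows (one per prompt token); decode has one.
fn flops_per_weight_byte(rows: usize, k: usize, n: usize, bytes_per_weight: f64) -> f64 {
    // One multiply and one add per (row, k, n) term.
    let flops = 2.0 * rows as f64 * k as f64 * n as f64;
    // Simplifying assumption: the weights are read exactly once.
    let weight_bytes = k as f64 * n as f64 * bytes_per_weight;
    flops / weight_bytes
}
```

With FP32 weights, a single-row decode matmul does 0.5 FLOPs per weight byte regardless of the matrix size, while a 512-row prefill matmul does 256. The same weights are read either way; prefill simply gets far more arithmetic out of each byte.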
Read the ladder through that lens and it stops being a list of tricks:
- The KV cache helped decode: it fixed decode's algorithm.
- SIMD, threads, and Metal helped prefill most: they attack compute, and prefill's bottleneck is compute. They left decode throughput nearly flat across three chapters, and that flatness was not a failure: decode's bottleneck was never arithmetic.
- Q8_0 helped decode: it cut bytes, and bytes are decode's entire problem.
If you remember one thing from Act 2, remember that decode is bound by memory traffic, not FLOPs. It is why "just write a faster matmul" only takes you so far, and why cutting the weight bytes to roughly a quarter did what three faster matmul backends couldn't.
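The byte saving comes from the block layout. Below is a simplified sketch of a Q8_0-style block and a dot product with dequantization fused in; it is not the engine's actual code. The names BlockQ8, quantize_block, and dot_q8 are hypothetical, and the scale is stored as f32 for portability, whereas the real Q8_0 format stores a 16-bit float scale (34 bytes per 32 weights, against 128 bytes in FP32).

```rust
/// Weights per quantized block.
const BLOCK: usize = 32;

/// Simplified Q8_0-style block: one scale plus 32 signed 8-bit values.
/// Assumption: the scale is f32 here; real Q8_0 uses a 16-bit float.
struct BlockQ8 {
    scale: f32,
    qs: [i8; BLOCK],
}

/// Quantize 32 floats so the largest magnitude maps to +/-127.
fn quantize_block(x: &[f32; BLOCK]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    // An all-zero block gets scale 0 and all-zero quants.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut qs = [0i8; BLOCK];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, qs }
}

/// Dot product of quantized weights with f32 activations. Dequantization is
/// fused: accumulate q * x within a block, then apply the scale once,
/// so no FP32 copy of the weights is ever materialized.
fn dot_q8(weights: &[BlockQ8], x: &[f32]) -> f32 {
    assert_eq!(x.len(), weights.len() * BLOCK);
    let mut total = 0.0f32;
    for (block, xs) in weights.iter().zip(x.chunks_exact(BLOCK)) {
        let mut acc = 0.0f32;
        for (q, xv) in block.qs.iter().zip(xs.iter()) {
            acc += *q as f32 * *xv;
        }
        total += block.scale * acc;
    }
    total
}
```

The result is an approximation of the FP32 dot product, with per-weight error bounded by half a quantization step. The point of the sketch is the memory side: each decode step streams about a quarter of the weight bytes it did before.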
What the engine still can't do
Everything we built assumes exactly one user, one prompt, one response at a time:
- No server. There's a CLI binary and that's it. No HTTP, no streaming, no request/response shape another program could talk to.
- No chat formatting. Qwen3 was trained to respond to <|im_start|>... chat framing; we've been feeding it raw strings. Continuation prompts look fine; anything resembling a chat turn degrades.
- No concurrency. BasicKvCache is a per-request structure. Two concurrent requests would each allocate their own, and the layout (a tensor that gets reallocated on every appended token) makes batching their decodes together impossible.
- No reuse across requests. Every request re-prefills its prompt from scratch. In real workloads a large share of a prompt's tokens are identical across requests (a shared system prompt, a shared conversation prefix), and we pay the full prefill cost every single time.
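To make the last point concrete, here is a small sketch of how much prefill two requests could share in principle. The function name and the token ids in the usage note are hypothetical, and nothing like this exists in the Act 2 engine; it only measures the overlap that a later prefix cache could exploit.

```rust
/// Number of leading token ids two tokenized prompts have in common.
/// Everything before this index would produce identical keys and values,
/// so it is the prefill work a second request repeats today.
fn shared_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}
```

For two prompts tokenized as [1, 2, 3, 4, 5] and [1, 2, 3, 9, 5], the shared prefix is 3 tokens: the match stops at the first difference, even though a later token happens to agree again.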
None of this is a "keep stacking optimizations" fix. Serving many requests is a genuinely different problem. The Act 2 KV cache is excellent for one request and actively in the way of many, and the same is true of how we dispatch GPU work and structure the forward pass.
The bridge to Act 3
Act 3 turns this fast single-user engine into a real server: chat templates, an HTTP API, token-by-token SSE streaming, a paged KV layout that shares memory across concurrent sequences, a prefix cache that reuses prefill work across requests, and a scheduler that interleaves many requests at iteration granularity. Several Act 2 data structures get rewritten. The single-request path stays fast (we don't regress it), but the representations underneath it change to support more than one tenant.
Continue to Act 3: Make it serve.