II.3: SIMD CPU backend

The KV cache in II.2 fixed the algorithm: decode is now near-constant work per token, not growing work. But the harness still reports 53-odd milliseconds per decode forward pass, and the obvious next suspect is the kernel. The forward pass is the right amount of arithmetic; surely it's just running that arithmetic slowly?

Look at the code and the case seems open-and-shut. The forward pass is dominated by matmul, and the scalar matmul from I.4 is a triple-nested loop that multiplies and adds one pair of floats at a time. The CPU in your laptop can multiply-add four pairs of floats with a single instruction (that's what its SIMD unit is for), and we have been ignoring it entirely.

This chapter writes a second Backend implementation, SimdCpu, that uses those instructions. It does not change the algorithm, the model, or any result; it computes exactly the same numbers, just four (and with unrolling, sixteen) lanes at a time. We also add the --backend flag to choose between backends, and a small per-operation profiler so we can see where the time goes. Then we run it, and the measurement teaches us the single most important fact about decode. (Spoiler: the case was not open-and-shut.)

What SIMD is

SIMD stands for Single Instruction, Multiple Data. A normal CPU instruction operates on one value: "multiply these two floats." A SIMD instruction operates on a small fixed-size vector of values at once: "multiply these four pairs of floats, in parallel, in one instruction." The CPU has special wide registers (128 bits, enough to hold four 32-bit floats) and a set of instructions that act on all four lanes (the register's four float slots) simultaneously.

On Apple Silicon and other ARM chips, this instruction set is called NEON. Rust exposes it as intrinsics: functions in std::arch::aarch64 that map almost one-to-one onto NEON machine instructions. A few we use here:

vld1q_f32(ptr): load 4 consecutive floats from memory into a 128-bit register.
vst1q_f32(ptr, reg): store a register's 4 floats back to memory.
vdupq_n_f32(x): fill all 4 lanes of a register with the same scalar x.
vfmaq_f32(acc, a, b): fused multiply-add: compute acc + a * b lane-by-lane, all 4 lanes at once. The workhorse.
vaddq_f32(a, b): add two registers lane-by-lane.
vaddvq_f32(reg): horizontal add: sum the 4 lanes of one register down to a single scalar.

Using them means writing unsafe Rust, since these are raw, no-bounds-check, pointer-based operations. That's the price of getting at the hardware directly, and it's contained to two functions.

How matmul becomes vectorizable

Recall what matmul does. To multiply an m × n matrix A by an n × p matrix B, every output element is a dot product: C[i][j] is row i of A dotted with column j of B. The textbook triple loop does exactly that, but it's a poor fit for SIMD, because column j of B is strided: its elements are p floats apart in memory, and SIMD loads want 4 consecutive floats.

The fix is to reorder the loops. Instead of computing one output element at a time, compute one output row at a time, accumulating it across the contributions of A's columns:

PLAINTEXT

for each output row i:
    clear C[i]                       # a row of p floats
    for each j in 0..n:
        a_ij = A[i][j]               # one scalar
        C[i] += a_ij * B[j]          # B[j] is row j of B, p CONSECUTIVE floats

The inner operation is now C[i] += a_ij * B[j]: take a scalar a_ij, multiply a whole row of B by it, add the result into the output row. Both B[j] and C[i] are contiguous runs of p floats. That contiguity is everything: it means we can stride through them 4 floats at a time with vld1q_f32, and the multiply-add C[i] += a_ij * B[j] is exactly what vfmaq_f32 does, four lanes per instruction. This scalar × vector + vector pattern has a classic name: SAXPY.

So SimdCpu's matmul is the loop above, with the innermost row update (C[i] += a_ij * B[j]) replaced by a hand-vectorized NEON function.

A backend that delegates

SimdCpu only needs to make matmul fast, since that's where the time is. Everything else in the Backend trait (add, silu, softmax_rows, the 20-odd other operations) it can hand straight to the scalar CpuBackend from Act 1. So SimdCpu is generic over a fallback backend: it implements the few operations worth vectorizing itself and delegates the rest.

Diagram of one scalar multiply-add per instruction, four NEON lanes per instruction, and four unrolled groups of four lanes running as independent accumulators.

Figure: what "four lanes" buys. One instruction, four multiply-adds; unrolling keeps four of those in flight so the pipeline never waits on a single accumulator.

src/backend/simd_cpu.rs opens with the struct and a constructor:

RUST

use super::Backend;
use crate::tensor::{Tensor, TensorData};
 
pub struct SimdCpu<B: Backend> {
    fallback: B,
}
 
impl<B: Backend> SimdCpu<B> {
    pub fn new(fallback: B) -> Self {
        Self { fallback }
    }

SimdCpu<B> wraps any B: Backend. In practice B is CpuBackend, so a SimdCpu is "the scalar backend, but with a fast matmul bolted on." The next chapter wraps a SimdCpu in turn, which is why this is generic rather than hard-coding CpuBackend.

The matmul driver (the loop-reordered version from above):

RUST

    #[inline(always)]
    fn matmul_fp32_fp32(&self, a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Tensor {
        assert_eq!(a.len(), m * n);
        assert_eq!(b.len(), n * p);
        let mut data = vec![0.0f32; m * p];
        let ap = a.as_ptr();
        let bp = b.as_ptr();
        let op = data.as_mut_ptr();
 
        unsafe {
            for i in 0..m {
                let o = op.add(i * p);
                for j in 0..n {
                    let aij = *ap.add(i * n + j);
                    let b_row = bp.add(j * p);
                    Self::saxpy_row_f32(o, b_row, aij, p);
                }
            }
        }
        Tensor::new(data, vec![m, p])
    }
}

data is the output, m * p floats, zero-initialized. Then for each output row i and each j in 0..n: pull out the scalar a[i][j], point at row j of B, and call saxpy_row_f32 to do o += aij * b_row across the whole row. We work with raw pointers (as_ptr, .add(...)) because the NEON helper takes pointers and we want no per-element bounds checks in the hot loop. The assert_eq!s at the top establish the lengths once, so the indexing inside is provably in range.

The Backend impl

SimdCpu's Backend impl. Matmul and one other operation (sum_squares_axis) are done with NEON; everything else forwards to fallback. Start with name and matmul:

RUST

impl<B: Backend> Backend for SimdCpu<B> {
    fn name(&self) -> String {
        "simd".to_string()
    }
 
    #[inline(always)]
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape().len(), 2);
        assert_eq!(b.shape().len(), 2);
        let a_shape = a.shape();
        let m = a_shape[0];
        let n = a_shape[1];
        let a_data = match a.as_data() {
            TensorData::Fp32(d) => d,
        };
        match b.as_data() {
            TensorData::Fp32(b_data) => {
                let p = b.shape()[1];
                assert_eq!(n, b.shape()[0], "tensor shape mismatch");
                self.matmul_fp32_fp32(a_data, b_data, m, n, p)
            }
        }
    }

matmul pulls the flat float slices out of the two tensors, checks the inner dimensions agree (n == b.shape()[0]), and calls matmul_fp32_fp32. The match on TensorData has one arm (Fp32) because that is the only tensor data variant that exists right now. II.6 adds a Q8_0 variant and a second arm here.

RUST

    fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
        assert_eq!(x.shape().len(), 2, "sum_squares_axis only supports 2-D");
        assert!(axis < 2, "axis out of bounds");
        if axis == 1 {
            let rows = x.shape()[0];
            let cols = x.shape()[1];
            let mut data = vec![0.0f32; rows];
            let src = x.as_f32_slice();
            for r in 0..rows {
                data[r] = Self::dot_f32(
                    &src[r * cols..(r + 1) * cols],
                    &src[r * cols..(r + 1) * cols],
                );
            }
            Tensor::new(data, vec![rows])
        } else {
            self.fallback.sum_squares_axis(x, axis)
        }
    }

sum_squares_axis computes, for each row, the sum of its squared elements. It's a building block of RMS norm, which runs at every layer. Summing a row's squares is dotting the row with itself, and we already need a fast vectorized dot product for the SIMD work, so this gets it for free: dot_f32(row, row). Only axis == 1 (per-row) gets the NEON path; the rarer axis == 0 case delegates.

The remaining two dozen Backend methods are all one-line delegations to fallback. They're mechanical but the trait requires all of them; here are the first several to show the shape, and the rest follow identically:

RUST

    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.fallback.add(a, b)
    }
    fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.fallback.hadamard(a, b)
    }
    fn scale(&self, x: &Tensor, s: f32) -> Tensor {
        self.fallback.scale(x, s)
    }
    fn silu(&self, x: &Tensor) -> Tensor {
        self.fallback.silu(x)
    }
    fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor {
        self.fallback.add_scalar(a, s)
    }
    fn rsqrt_elem(&self, x: &Tensor) -> Tensor {
        self.fallback.rsqrt_elem(x)
    }
    fn broadcast_row_scalars(&self, t: &Tensor, d: usize) -> Tensor {
        self.fallback.broadcast_row_scalars(t, d)
    }
    fn transpose_2d(&self, a: &Tensor) -> Tensor {
        self.fallback.transpose_2d(a)
    }
    fn softmax_rows(&self, x: &Tensor) -> Tensor {
        self.fallback.softmax_rows(x)
    }
    fn gather_rows(&self, table: &Tensor, row_indices: &[usize]) -> Tensor {
        self.fallback.gather_rows(table, row_indices)
    }

…and so on for reshape_data, fill_strict_upper_tri, copy_2d_from_cols, copy_2d_into_cols, repeat_row_as_matrix, apply_rope, apply_rope_single_row, concat_dim0, copy_row_2d, copy_contiguous_into, and argmax_with_prob, each a one-liner forwarding to self.fallback. The point of the pattern: vectorize only what's hot, inherit a correct implementation of everything else for free.

One method in that list is new this chapter: copy_contiguous_into. It's a small addition to the Backend trait used by later code; we'll cover it in a moment.

The NEON kernels

Now the two unsafe functions that do the actual vectorization. First saxpy_row_f32 (out += aij * b, the inner step of matmul):

RUST

impl<B: Backend> SimdCpu<B> {
    #[inline(always)]
    pub fn saxpy_row_f32(out: *mut f32, b: *const f32, aij: f32, p: usize) {
        use std::arch::aarch64::*;
        unsafe {
            let a = vdupq_n_f32(aij);
            let mut k = 0usize;
            while k + 16 <= p {
                let o0 = vld1q_f32(out.add(k));
                let o4 = vld1q_f32(out.add(k + 4));
                let o8 = vld1q_f32(out.add(k + 8));
                let o12 = vld1q_f32(out.add(k + 12));
                let b0 = vld1q_f32(b.add(k));
                let b4 = vld1q_f32(b.add(k + 4));
                let b8 = vld1q_f32(b.add(k + 8));
                let b12 = vld1q_f32(b.add(k + 12));
                vst1q_f32(out.add(k), vfmaq_f32(o0, a, b0));
                vst1q_f32(out.add(k + 4), vfmaq_f32(o4, a, b4));
                vst1q_f32(out.add(k + 8), vfmaq_f32(o8, a, b8));
                vst1q_f32(out.add(k + 12), vfmaq_f32(o12, a, b12));
                k += 16;
            }
            while k + 4 <= p {
                vst1q_f32(
                    out.add(k),
                    vfmaq_f32(vld1q_f32(out.add(k)), a, vld1q_f32(b.add(k))),
                );
                k += 4;
            }
            while k < p {
                *out.add(k) += aij * *b.add(k);
                k += 1;
            }
        }
    }

The first line, vdupq_n_f32(aij), broadcasts the scalar aij into all 4 lanes of a register, so a vfmaq_f32 multiplies four b values by aij at once. Then three loops, in descending chunk size, which is a standard SIMD shape:

The 16-wide loop. Each iteration processes 16 floats: it loads four 4-float registers from out and four from b, does four vfmaq_f32s (out_chunk + aij * b_chunk), stores the four results back, and advances k by 16. Why 16 and not just 4? Instruction-level parallelism. The four independent vfmaq_f32s have no dependency on each other, so the CPU can keep its execution units busy on all four while none waits on a previous result. Unrolling to 16 hands the hardware four parallel chains instead of one.
The 4-wide loop. Cleans up whatever's left after the 16-wide loop (between 0 and 15 floats), four at a time.
The scalar tail. Handles the final 0–3 floats with ordinary +=. NEON works on full 4-float registers, so any width that isn't a multiple of 4 needs a plain-Rust remainder.

For Qwen3's matmuls, p is typically a clean multiple of 16, so almost all the work goes through the fast first loop; the other two exist only so the kernel is correct for any p.

Then dot_f32, the dot product used by sum_squares_axis:

RUST

    pub fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
        debug_assert_eq!(a.len(), b.len());
        use std::arch::aarch64::*;
        let n = a.len();
        let ap = a.as_ptr();
        let bp = b.as_ptr();
        unsafe {
            let mut acc0 = vdupq_n_f32(0.0);
            let mut acc1 = vdupq_n_f32(0.0);
            let mut acc2 = vdupq_n_f32(0.0);
            let mut acc3 = vdupq_n_f32(0.0);
            let mut i = 0usize;
            while i + 16 <= n {
                acc0 = vfmaq_f32(acc0, vld1q_f32(ap.add(i)), vld1q_f32(bp.add(i)));
                acc1 = vfmaq_f32(acc1, vld1q_f32(ap.add(i + 4)), vld1q_f32(bp.add(i + 4)));
                acc2 = vfmaq_f32(acc2, vld1q_f32(ap.add(i + 8)), vld1q_f32(bp.add(i + 8)));
                acc3 = vfmaq_f32(acc3, vld1q_f32(ap.add(i + 12)), vld1q_f32(bp.add(i + 12)));
                i += 16;
            }
            while i + 4 <= n {
                acc0 = vfmaq_f32(acc0, vld1q_f32(ap.add(i)), vld1q_f32(bp.add(i)));
                i += 4;
            }
            let combined = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
            let mut sum = vaddvq_f32(combined);
            while i < n {
                sum += *ap.add(i) * *bp.add(i);
                i += 1;
            }
            sum
        }
    }
}

Same shape: four accumulator registers (acc0..acc3), a 16-wide loop that feeds each one independently, a 4-wide cleanup, a scalar tail. The four separate accumulators are again about instruction-level parallelism: four independent vfmaq_f32 chains the CPU can interleave. The difference from saxpy_row_f32 is the finish: a dot product reduces to a single number, so after the loops we add the four accumulator registers together (vaddq_f32 twice) and then vaddvq_f32 does the horizontal sum, collapsing the surviving 4-lane register into one f32. The scalar tail adds the last 0–3 products directly.

The new backend trait method

The Backend trait gains one method this chapter, copy_contiguous_into:

RUST

    fn copy_contiguous_into(&self, x: &Tensor, start: usize, dst: &mut [f32]);

It copies a contiguous run of a tensor's floats into a caller-provided slice. The CpuBackend implementation is a one-liner:

RUST

    fn copy_contiguous_into(&self, x: &Tensor, start: usize, dst: &mut [f32]) {
        dst.copy_from_slice(&x.as_f32_slice()[start..start + dst.len()]);
    }

And SimdCpu delegates it like the rest. It's a plumbing primitive (no arithmetic, nothing to vectorize) added to the trait now so it's available to backend code that needs to splice tensor data without going through the Tensor constructor.

The tracing profiler

To justify the SIMD work, and to know which operations to vectorize next, we want a per-operation timing breakdown. src/backend/tracing.rs adds a TracingBackend: another generic wrapper that times every Backend call and emits a tracing event for it. It wraps any backend, so you can profile the scalar one or the SIMD one.

Twenty-odd trait methods would mean twenty-odd near-identical timing wrappers, so the file uses a macro to generate them:

RUST

use super::Backend;
use crate::tensor::Tensor;
use std::cell::Cell;
use std::time::Instant;
 
thread_local! {
    static TRACE_DEPTH: Cell<usize> = const { Cell::new(0) };
}
 
fn fmt_time(ms: f64) -> String {
    if ms < 1.0 {
        format!("{:.1}us", ms * 1000.0)
    } else {
        format!("{:.3}ms", ms)
    }
}

TRACE_DEPTH is a thread-local counter that tracks how deeply nested the current backend call is. Attention calls matmul, which is itself a backend call, so the profiler indents nested calls to show the call tree. fmt_time prints sub-millisecond durations in microseconds and longer ones in milliseconds.

The macro:

RUST

macro_rules! trace_fn {
    (fn $fn:ident (&self $(, $arg:ident : $ty:ty)*) -> $ret:ty) => {
        fn $fn(&self $(, $arg: $ty)*) -> $ret {
            let depth = TRACE_DEPTH.with(|d| {
                let n = d.get();
                d.set(n + 1);
                n
            });
 
            let mut args_vec = vec![];
            $( args_vec.push(format!("{}={:?}", stringify!($arg), $arg)); )*
 
            let t0 = Instant::now();
            let result = self.inner.$fn($($arg),*);
            let elapsed = t0.elapsed();
            let time_ms = elapsed.as_secs_f64() * 1000.0 + elapsed.subsec_nanos() as f64 / 1_000_000.0;
 
            let args = args_vec.join(", ");
            let prefix = "-> ".repeat(depth);
            tracing::trace!("{}[{}] [{}] ({})  {}", prefix, self.name, stringify!($fn), args, fmt_time(time_ms));
 
            TRACE_DEPTH.with(|d| { d.set(d.get() - 1); });
            result
        }
    };
    (fn $fn:ident (&self $(, $arg:ident : $ty:ty)*)) => {
        fn $fn(&self $(, $arg: $ty)*) {
            let depth = TRACE_DEPTH.with(|d| {
                let n = d.get();
                d.set(n + 1);
                n
            });
 
            let mut args_vec = vec![];
            $( args_vec.push(format!("{}={:?}", stringify!($arg), $arg)); )*
 
            let t0 = Instant::now();
            self.inner.$fn($($arg),*);
            let elapsed = t0.elapsed();
            let time_ms = elapsed.as_secs_f64() * 1000.0 + elapsed.subsec_nanos() as f64 / 1_000_000.0;
 
            let args = args_vec.join(", ");
            let prefix = "-> ".repeat(depth);
            tracing::trace!("{}[{}] [{}] ({})  {}", prefix, self.name, stringify!($fn), args, fmt_time(time_ms));
 
            TRACE_DEPTH.with(|d| { d.set(d.get() - 1); });
        }
    };
}

trace_fn! takes a method signature and generates a wrapper that does the same five things every time: bump the depth counter, format the arguments, time the inner call, emit a tracing::trace! event with an indent proportional to depth, restore the counter. There are two macro arms (one for methods that return a value, one for the -> () methods like copy_2d_into_cols) because the returning arm needs let result = ...; ...; result and the unit arm doesn't.

The wrapper struct and its impl:

RUST

pub struct TracingBackend<B: Backend> {
    inner: B,
    name: String,
}
 
impl<B: Backend> TracingBackend<B> {
    pub fn new(inner: B) -> Self {
        let name = inner.name().to_uppercase();
        Self { inner, name }
    }
}
 
impl<B: Backend> Backend for TracingBackend<B> {
    fn name(&self) -> String {
        self.name.clone()
    }
 
    trace_fn!(fn add              (&self, a: &Tensor, b: &Tensor) -> Tensor);
    trace_fn!(fn hadamard         (&self, a: &Tensor, b: &Tensor) -> Tensor);
    trace_fn!(fn matmul           (&self, a: &Tensor, b: &Tensor) -> Tensor);
    trace_fn!(fn concat_dim0      (&self, a: &Tensor, b: &Tensor) -> Tensor);
    trace_fn!(fn silu             (&self, x: &Tensor) -> Tensor);
    trace_fn!(fn rsqrt_elem       (&self, x: &Tensor) -> Tensor);
    trace_fn!(fn transpose_2d     (&self, x: &Tensor) -> Tensor);
    trace_fn!(fn softmax_rows     (&self, x: &Tensor) -> Tensor);
    trace_fn!(fn scale            (&self, x: &Tensor, s: f32) -> Tensor);
    trace_fn!(fn add_scalar       (&self, x: &Tensor, s: f32) -> Tensor);
    trace_fn!(fn sum_squares_axis (&self, x: &Tensor, axis: usize) -> Tensor);
    trace_fn!(fn broadcast_row_scalars (&self, x: &Tensor, d: usize) -> Tensor);
    trace_fn!(fn repeat_row_as_matrix  (&self, x: &Tensor, rows: usize) -> Tensor);
    trace_fn!(fn reshape_data     (&self, x: &Tensor, shape: Vec<usize>) -> Tensor);
    trace_fn!(fn gather_rows      (&self, x: &Tensor, indices: &[usize]) -> Tensor);
    trace_fn!(fn fill_strict_upper_tri (&self, x: &Tensor, value: f32) -> Tensor);
    trace_fn!(fn copy_2d_from_cols(&self, x: &Tensor, w: usize, col_offset: usize) -> Tensor);
    trace_fn!(fn copy_row_2d      (&self, x: &Tensor, row: usize) -> Tensor);
    trace_fn!(fn apply_rope       (&self, x: &Tensor, head_dim: usize, rope_theta: f32) -> Tensor);
    trace_fn!(fn apply_rope_single_row (&self, x: &Tensor, position: usize, head_dim: usize, rope_theta: f32) -> Tensor);
    trace_fn!(fn argmax_with_prob (&self, x: &Tensor) -> (usize, f32));
    trace_fn!(fn copy_2d_into_cols(&self, dst: &mut [f32], dst_cols: usize, src: &Tensor, col_offset: usize));
    trace_fn!(fn copy_contiguous_into (&self, x: &Tensor, start: usize, dst: &mut [f32]));
}

Each trace_fn!(...) line expands into a full timed wrapper. The whole Backend trait, every method, gets a profiling wrapper, in 22 lines. The profiler is opt-in: it only attaches when RUST_LOG requests trace-level output, which is the next piece.

The factory: selecting and wrapping a backend

src/backend/factory.rs now has to do two things: pick a backend by name, and optionally wrap it in TracingBackend:

RUST

use std::sync::Arc;
 
use super::Backend;
use super::{CpuBackend, SimdCpu, TracingBackend};
 
pub fn create_backend(name: &str, enable_tracing: bool) -> Result<Arc<dyn Backend>, String> {
    let name = name.trim();
    match name {
        "scalar" => Ok(wrap_scalar(enable_tracing)),
        "simd" => Ok(wrap_simd(enable_tracing)),
        other => Err(format!(
            "unknown backend {other:?} (supported: scalar, simd)"
        )),
    }
}
 
fn wrap_scalar(enable_tracing: bool) -> Arc<dyn Backend> {
    if enable_tracing {
        Arc::new(TracingBackend::new(CpuBackend))
    } else {
        Arc::new(CpuBackend)
    }
}
 
fn wrap_simd(enable_tracing: bool) -> Arc<dyn Backend> {
    let simd = SimdCpu::new(CpuBackend);
    if enable_tracing {
        Arc::new(TracingBackend::new(simd))
    } else {
        Arc::new(simd)
    }
}

create_backend gains an enable_tracing parameter and a new "simd" arm. Each wrap_* helper builds the backend and, if tracing is on, slips a TracingBackend around it. Note how the generics compose: wrap_simd builds SimdCpu::new(CpuBackend), a SIMD backend with the scalar one as fallback, and tracing wraps the whole SimdCpu<CpuBackend>. The wrapper types stack cleanly because each is generic over what it wraps.

The module file exports the new types:

RUST

mod backend_trait;
pub(crate) mod cpu;
mod factory;
pub(crate) mod simd_cpu;
pub(crate) mod tracing;
 
pub use backend_trait::Backend;
pub use factory::create_backend;
 
pub(crate) use cpu::CpuBackend;
pub(crate) use simd_cpu::SimdCpu;
pub(crate) use tracing::TracingBackend;

The --backend flag

src/cli/args.rs learns --backend. Unlike --kv (whose mode word is optional), --backend requires a value, so ArgCursor gets a helper that insists on one:

RUST

    fn expect_value(&mut self, flag: &str) -> String {
        match self.advance() {
            Some(v) if !v.starts_with('-') => v.to_string(),
            _ => panic!("{flag} requires a value"),
        }
    }

expect_value consumes the next argument and returns it. If there isn't one, or it looks like another flag, it panics with a clear message naming the offending flag.

CliArgs gets a backend field and the parse arm:

RUST

pub struct CliArgs {
    kv_mode: Option<&'static str>,
    backend: Option<String>,
    positionals: Vec<String>,
}

RUST

    pub fn parse(args: Vec<String>) -> Self {
        let mut kv_mode = None;
        let mut backend = None;
        let mut positionals = Vec::new();
 
        let mut cur = ArgCursor::new(&args);
        while cur.has_more() {
            match cur.peek() {
                Some("--kv") => {
                    cur.advance();
                    kv_mode = Some(parse_kv_mode(&mut cur));
                }
                Some("--backend") => {
                    cur.advance();
                    backend = Some(cur.expect_value("--backend"));
                }
                _ => positionals.push(cur.take()),
            }
        }
 
        Self {
            kv_mode,
            backend,
            positionals,
        }
    }

A --backend accessor with a caller-supplied default, plus a free function to decide whether tracing should be on:

RUST

    pub fn backend(&self, default: &str) -> String {
        self.backend
            .clone()
            .unwrap_or_else(|| default.to_string())
    }
 
    pub fn kv_cache_mode(&self) -> Option<&'static str> {
        self.kv_mode
    }
}
 
pub fn rust_log_enables_trace() -> bool {
    std::env::var("RUST_LOG")
        .map(|v| v.to_lowercase().contains("trace"))
        .unwrap_or(false)
}

rust_log_enables_trace checks whether RUST_LOG mentions trace. The per-operation profiler is expensive (it formats arguments and emits an event for every backend call), so we only attach the TracingBackend when the user has actually asked for trace-level output. The module re-export adds it:

RUST

mod args;
 
pub use args::{CliArgs, rust_log_enables_trace};

RUST

pub use cli::{CliArgs, rust_log_enables_trace};

Wiring it into the binary

model-generate reads --backend and passes the trace decision to create_backend:

RUST

use std::path::Path;
 
use inferno::{
    CliArgs, Metrics, create_backend, create_kv_cache, greedy_generate, load_from_gguf_path,
    rust_log_enables_trace,
};
 
fn usage() -> ! {
    eprintln!(
        "usage: model-generate [--kv [basic]] [--backend scalar|simd] <gguf_path> [prompt] [max_new_tokens]"
    );
    std::process::exit(2);
}

RUST

    let args = CliArgs::from_env();
    let backend_name = args.backend("simd");
    let kv_mode = args.kv_cache_mode();

RUST

    let backend = create_backend(&backend_name, rust_log_enables_trace()).unwrap_or_else(|e| {
        eprintln!("error: {e}");
        std::process::exit(2);
    });

RUST

    println!("backend: {}", backend_name);
    println!("kv cache: {}", kv_mode.unwrap_or("off"));

Two things changed. The default backend is now "simd", not "scalar". From this chapter on, model-generate uses the fast path unless you ask for --backend scalar. And create_backend gets rust_log_enables_trace(), so a RUST_LOG=trace run automatically gets the profiler.

Running it

Compare scalar against SIMD on a decode-heavy run, KV cache on:

BASH

cargo run --release --bin model-generate -- --kv --backend scalar path/to/Qwen3-0.6B-FP32.gguf "Once upon a time, in a small village by the sea, there lived a baker" 32
cargo run --release --bin model-generate -- --kv --backend simd   path/to/Qwen3-0.6B-FP32.gguf "Once upon a time, in a small village by the sea, there lived a baker" 32

Scalar:

PLAINTEXT

backend: scalar
kv cache: basic
 
metrics:
  time_to_first_token_ms: 612.232
  decode_tokens_per_second: 18.315
  per_forward_ms: min 53.267  max 612.232  mean 72.026  (n=32)

SIMD:

PLAINTEXT

backend: simd
kv cache: basic
 
metrics:
  time_to_first_token_ms: 601.787
  decode_tokens_per_second: 18.413
  per_forward_ms: min 53.289  max 601.787  mean 71.417  (n=32)

We quadrupled the arithmetic rate of the hottest kernel in the program, verified the instructions are really in the binary (disassemble it and count the fmla v...4s), and the outputs are token-for-token identical. The throughput did not move: 18.3 tokens per second scalar, 18.4 SIMD. Nothing is broken. This is the measurement doing its job, and it is the most instructive number in the act.

Bar chart: scalar backend at 18.3 tokens per second and SIMD backend at 18.4, both touching a dashed line marking the one-core memory-bandwidth ceiling.

Figure: both backends parked at the same wall. The dashed line is one core streaming 2.4 GB of weights per token at the ~45 GB/s the memory system delivers; neither kernel can sit above it.

Here's the arithmetic that explains it. A decode step reads every weight in the model once: ~600M weights × 4 bytes ≈ 2.4 GB pulled through the memory system per token. At ~18.4 tokens/second that is ~45 GB/s, which is about what one performance core can stream from DRAM on this machine. Decode was already limited by memory bandwidth, not arithmetic. The scalar loop's multiply-adds were never the holdup; they were waiting on the same weight bytes the NEON loop now waits on. Making the arithmetic 4× faster just means the vector unit idles harder between cache-line arrivals.

This is the central fact of inference performance, and you have now measured it rather than read it: decode is bandwidth-bound. Every decode win from here on comes from moving fewer bytes (quantization, II.6) or amortizing the bytes across more work (batching, III.7), not from faster math on the same bytes.

For the profiler, add RUST_LOG=trace and you get per-operation timing (matmul, softmax_rows, apply_rope, each with its duration), confirming where every millisecond of those 53 goes:

BASH

RUST_LOG=trace cargo run --release --bin model-generate -- --kv --backend simd path/to/Qwen3-0.6B-FP32.gguf "Hi" 1

Where this leaves us

SimdCpu is a second Backend, selectable with --backend simd, that vectorizes matmul with NEON and inherits everything else from the scalar backend, and the honest result is a wash on this workload, because decode saturates one core's memory bandwidth either way. So was the chapter wasted? No, three ways. First, the negative result is the payoff: you now know decode's real ceiling, which tells you in advance what will and won't work for the rest of the act. Second, compute-bound moments do exist: prefill over hundreds of prompt rows chews far more arithmetic per byte, which is where the next chapter's cores and II.5's GPU will land their wins. Third, the machinery built here is load-bearing: the decorator architecture and --backend flag carry every later backend, TracingBackend is the act's profiler, and in II.6 this same file's SIMD skills deliver the act's biggest decode win, with integer NEON, where the arithmetic is, for once, the bottleneck.

But first, the idle silicon: SimdCpu still runs on one core. Your laptop has eight, or ten, and right now nine of them sit idle while one grinds through every matmul row in sequence. One core's bandwidth ceiling is not the chip's bandwidth ceiling, and prefill has arithmetic to spare. The rows of a matmul are completely independent (row i of the output never depends on row j), which is the textbook setup for parallelism. The next chapter wraps the SIMD backend in a parallel one that spreads matmul rows across every core at once.