II.3: SIMD CPU backend
The KV cache in II.2 fixed the algorithm: decode is now constant work per token, not growing work. But the harness still reports about 21 ms per decode forward pass, and that number is now a kernel problem, not an algorithm problem. The forward pass is the right amount of arithmetic; it's just running that arithmetic slowly.
Where does the time go? The forward pass is dominated by matmul. The scalar matmul from I.4 is a triple-nested loop that multiplies and adds one pair of floats at a time. The CPU in your laptop can multiply-add four pairs of floats with a single instruction (that's what its SIMD unit is for), and we have been ignoring it entirely.
This chapter writes a second Backend implementation, SimdCpu, that uses those instructions. It does not change the algorithm, the model, or any result; it computes exactly the same numbers, just four (and with unrolling, sixteen) lanes at a time. We also add the --backend flag to choose between backends, and a small per-operation profiler so we can see where the time goes.
What SIMD is
SIMD stands for Single Instruction, Multiple Data. A normal CPU instruction operates on one value: "multiply these two floats." A SIMD instruction operates on a small fixed-size vector of values at once: "multiply these four pairs of floats, in parallel, in one instruction." The CPU has special wide registers (128 bits, enough to hold four 32-bit floats) and a set of instructions that act on all four lanes simultaneously.
On Apple Silicon and other ARM chips, this instruction set is called NEON. Rust exposes it as intrinsics: functions in std::arch::aarch64 that map almost one-to-one onto NEON machine instructions. A few we use here:
vld1q_f32(ptr): load 4 consecutive floats from memory into a 128-bit register.vst1q_f32(ptr, reg): store a register's 4 floats back to memory.vdupq_n_f32(x): fill all 4 lanes of a register with the same scalarx.vfmaq_f32(acc, a, b): fused multiply-add: computeacc + a * blane-by-lane, all 4 lanes at once. The workhorse.vaddq_f32(a, b): add two registers lane-by-lane.vaddvq_f32(reg): horizontal add: sum the 4 lanes of one register down to a single scalar.
Using them means writing unsafe Rust, since these are raw, no-bounds-check, pointer-based operations. That's the price of getting at the hardware directly, and it's contained to two functions.
How matmul becomes vectorizable
Recall what matmul does. To multiply an m × n matrix A by an n × p matrix B, every output element is a dot product: C[i][j] is row i of A dotted with column j of B. The textbook triple loop does exactly that, but it's a poor fit for SIMD, because column j of B is strided: its elements are p floats apart in memory, and SIMD loads want 4 consecutive floats.
The fix is to reorder the loops. Instead of computing one output element at a time, compute one output row at a time, accumulating it across the contributions of A's columns:
for each output row i:
clear C[i] # a row of p floats
for each j in 0..n:
a_ij = A[i][j] # one scalar
C[i] += a_ij * B[j] # B[j] is row j of B, p CONSECUTIVE floatsThe inner operation is now C[i] += a_ij * B[j]: take a scalar a_ij, multiply a whole row of B by it, add the result into the output row. Both B[j] and C[i] are contiguous runs of p floats. That contiguity is everything: it means we can stride through them 4 floats at a time with vld1q_f32, and the multiply-add C[i] += a_ij * B[j] is exactly what vfmaq_f32 does, four lanes per instruction. This scalar × vector + vector pattern has a classic name: SAXPY.
So SimdCpu's matmul is the loop above, with the innermost row update (C[i] += a_ij * B[j]) replaced by a hand-vectorized NEON function.
A backend that delegates
SimdCpu only needs to make matmul fast, since that's where the time is. Everything else in the Backend trait (add, silu, softmax_rows, the 20-odd other operations) it can hand straight to the scalar CpuBackend from Act 1. So SimdCpu is generic over a fallback backend: it implements the few operations worth vectorizing itself and delegates the rest.
src/backend/simd_cpu.rs opens with the struct and a constructor:
use super::Backend;
use crate::tensor::{Tensor, TensorData};
pub struct SimdCpu<B: Backend> {
fallback: B,
}
impl<B: Backend> SimdCpu<B> {
pub fn new(fallback: B) -> Self {
Self { fallback }
}SimdCpu<B> wraps any B: Backend. In practice B is CpuBackend, so a SimdCpu is "the scalar backend, but with a fast matmul bolted on." The next chapter wraps a SimdCpu in turn, which is why this is generic rather than hard-coding CpuBackend.
The matmul driver (the loop-reordered version from above):
#[inline(always)]
fn matmul_fp32_fp32(&self, a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Tensor {
assert_eq!(a.len(), m * n);
assert_eq!(b.len(), n * p);
let mut data = vec![0.0f32; m * p];
let ap = a.as_ptr();
let bp = b.as_ptr();
let op = data.as_mut_ptr();
unsafe {
for i in 0..m {
let o = op.add(i * p);
for j in 0..n {
let aij = *ap.add(i * n + j);
let b_row = bp.add(j * p);
Self::saxpy_row_f32(o, b_row, aij, p);
}
}
}
Tensor::new(data, vec![m, p])
}
}data is the output, m * p floats, zero-initialized. Then for each output row i and each j in 0..n: pull out the scalar a[i][j], point at row j of B, and call saxpy_row_f32 to do o += aij * b_row across the whole row. We work with raw pointers (as_ptr, .add(...)) because the NEON helper takes pointers and we want no per-element bounds checks in the hot loop. The assert_eq!s at the top establish the lengths once, so the indexing inside is provably in range.
The Backend impl
SimdCpu's Backend impl. Matmul and one other operation (sum_squares_axis) are done with NEON; everything else forwards to fallback. Start with name and matmul:
impl<B: Backend> Backend for SimdCpu<B> {
fn name(&self) -> String {
"simd".to_string()
}
#[inline(always)]
fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
assert_eq!(a.shape().len(), 2);
assert_eq!(b.shape().len(), 2);
let a_shape = a.shape();
let m = a_shape[0];
let n = a_shape[1];
let a_data = match a.as_data() {
TensorData::Fp32(d) => d,
};
match b.as_data() {
TensorData::Fp32(b_data) => {
let p = b.shape()[1];
assert_eq!(n, b.shape()[0], "tensor shape mismatch");
self.matmul_fp32_fp32(a_data, b_data, m, n, p)
}
}
}matmul pulls the flat float slices out of the two tensors, checks the inner dimensions agree (n == b.shape()[0]), and calls matmul_fp32_fp32. The match on TensorData has one arm (Fp32) because that is the only tensor data variant that exists right now. II.6 adds a Q8_0 variant and a second arm here.
fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
assert_eq!(x.shape().len(), 2, "sum_squares_axis only supports 2-D");
assert!(axis < 2, "axis out of bounds");
if axis == 1 {
let rows = x.shape()[0];
let cols = x.shape()[1];
let mut data = vec![0.0f32; rows];
let src = x.as_f32_slice();
for r in 0..rows {
data[r] = Self::dot_f32(
&src[r * cols..(r + 1) * cols],
&src[r * cols..(r + 1) * cols],
);
}
Tensor::new(data, vec![rows])
} else {
self.fallback.sum_squares_axis(x, axis)
}
}sum_squares_axis computes, for each row, the sum of its squared elements. It's a building block of RMS norm, which runs at every layer. Summing a row's squares is dotting the row with itself, and we already need a fast vectorized dot product for the SIMD work, so this gets it for free: dot_f32(row, row). Only axis == 1 (per-row) gets the NEON path; the rarer axis == 0 case delegates.
The remaining two dozen Backend methods are all one-line delegations to fallback. They're mechanical but the trait requires all of them; here are the first several to show the shape, and the rest follow identically:
fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
self.fallback.add(a, b)
}
fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
self.fallback.hadamard(a, b)
}
fn scale(&self, x: &Tensor, s: f32) -> Tensor {
self.fallback.scale(x, s)
}
fn silu(&self, x: &Tensor) -> Tensor {
self.fallback.silu(x)
}
fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor {
self.fallback.add_scalar(a, s)
}
fn rsqrt_elem(&self, x: &Tensor) -> Tensor {
self.fallback.rsqrt_elem(x)
}
fn broadcast_row_scalars(&self, t: &Tensor, d: usize) -> Tensor {
self.fallback.broadcast_row_scalars(t, d)
}
fn transpose_2d(&self, a: &Tensor) -> Tensor {
self.fallback.transpose_2d(a)
}
fn softmax_rows(&self, x: &Tensor) -> Tensor {
self.fallback.softmax_rows(x)
}
fn gather_rows(&self, table: &Tensor, row_indices: &[usize]) -> Tensor {
self.fallback.gather_rows(table, row_indices)
}…and so on for reshape_data, fill_strict_upper_tri, copy_2d_from_cols, copy_2d_into_cols, repeat_row_as_matrix, apply_rope, apply_rope_single_row, concat_dim0, copy_row_2d, copy_contiguous_into, and argmax_with_prob, each a one-liner forwarding to self.fallback. The point of the pattern: vectorize only what's hot, inherit a correct implementation of everything else for free.
One method in that list is new this chapter: copy_contiguous_into. It's a small addition to the Backend trait used by later code; we'll cover it in a moment.
The NEON kernels
Now the two unsafe functions that do the actual vectorization. First saxpy_row_f32 (out += aij * b, the inner step of matmul):
impl<B: Backend> SimdCpu<B> {
#[inline(always)]
pub fn saxpy_row_f32(out: *mut f32, b: *const f32, aij: f32, p: usize) {
use std::arch::aarch64::*;
unsafe {
let a = vdupq_n_f32(aij);
let mut k = 0usize;
while k + 16 <= p {
let o0 = vld1q_f32(out.add(k));
let o4 = vld1q_f32(out.add(k + 4));
let o8 = vld1q_f32(out.add(k + 8));
let o12 = vld1q_f32(out.add(k + 12));
let b0 = vld1q_f32(b.add(k));
let b4 = vld1q_f32(b.add(k + 4));
let b8 = vld1q_f32(b.add(k + 8));
let b12 = vld1q_f32(b.add(k + 12));
vst1q_f32(out.add(k), vfmaq_f32(o0, a, b0));
vst1q_f32(out.add(k + 4), vfmaq_f32(o4, a, b4));
vst1q_f32(out.add(k + 8), vfmaq_f32(o8, a, b8));
vst1q_f32(out.add(k + 12), vfmaq_f32(o12, a, b12));
k += 16;
}
while k + 4 <= p {
vst1q_f32(
out.add(k),
vfmaq_f32(vld1q_f32(out.add(k)), a, vld1q_f32(b.add(k))),
);
k += 4;
}
while k < p {
*out.add(k) += aij * *b.add(k);
k += 1;
}
}
}The first line, vdupq_n_f32(aij), broadcasts the scalar aij into all 4 lanes of a register, so a vfmaq_f32 multiplies four b values by aij at once. Then three loops, in descending chunk size, which is a standard SIMD shape:
- The 16-wide loop. Each iteration processes 16 floats: it loads four 4-float registers from
outand four fromb, does fourvfmaq_f32s (out_chunk + aij * b_chunk), stores the four results back, and advanceskby 16. Why 16 and not just 4? Instruction-level parallelism. The four independentvfmaq_f32s have no dependency on each other, so the CPU can keep its execution units busy on all four while none waits on a previous result. Unrolling to 16 hands the hardware four parallel chains instead of one. - The 4-wide loop. Cleans up whatever's left after the 16-wide loop (between 0 and 15 floats), four at a time.
- The scalar tail. Handles the final 0–3 floats with ordinary
+=. NEON works on full 4-float registers, so any width that isn't a multiple of 4 needs a plain-Rust remainder.
For Qwen3's matmuls, p is typically a clean multiple of 16, so almost all the work goes through the fast first loop; the other two exist only so the kernel is correct for any p.
Then dot_f32, the dot product used by sum_squares_axis:
pub fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
debug_assert_eq!(a.len(), b.len());
use std::arch::aarch64::*;
let n = a.len();
let ap = a.as_ptr();
let bp = b.as_ptr();
unsafe {
let mut acc0 = vdupq_n_f32(0.0);
let mut acc1 = vdupq_n_f32(0.0);
let mut acc2 = vdupq_n_f32(0.0);
let mut acc3 = vdupq_n_f32(0.0);
let mut i = 0usize;
while i + 16 <= n {
acc0 = vfmaq_f32(acc0, vld1q_f32(ap.add(i)), vld1q_f32(bp.add(i)));
acc1 = vfmaq_f32(acc1, vld1q_f32(ap.add(i + 4)), vld1q_f32(bp.add(i + 4)));
acc2 = vfmaq_f32(acc2, vld1q_f32(ap.add(i + 8)), vld1q_f32(bp.add(i + 8)));
acc3 = vfmaq_f32(acc3, vld1q_f32(ap.add(i + 12)), vld1q_f32(bp.add(i + 12)));
i += 16;
}
while i + 4 <= n {
acc0 = vfmaq_f32(acc0, vld1q_f32(ap.add(i)), vld1q_f32(bp.add(i)));
i += 4;
}
let combined = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
let mut sum = vaddvq_f32(combined);
while i < n {
sum += *ap.add(i) * *bp.add(i);
i += 1;
}
sum
}
}
}Same shape: four accumulator registers (acc0..acc3), a 16-wide loop that feeds each one independently, a 4-wide cleanup, a scalar tail. The four separate accumulators are again about instruction-level parallelism: four independent vfmaq_f32 chains the CPU can interleave. The difference from saxpy_row_f32 is the finish: a dot product reduces to a single number, so after the loops we add the four accumulator registers together (vaddq_f32 twice) and then vaddvq_f32 does the horizontal sum, collapsing the surviving 4-lane register into one f32. The scalar tail adds the last 0–3 products directly.
The new backend trait method
The Backend trait gains one method this chapter, copy_contiguous_into:
fn copy_contiguous_into(&self, x: &Tensor, start: usize, dst: &mut [f32]);It copies a contiguous run of a tensor's floats into a caller-provided slice. The CpuBackend implementation is a one-liner:
fn copy_contiguous_into(&self, x: &Tensor, start: usize, dst: &mut [f32]) {
dst.copy_from_slice(&x.as_f32_slice()[start..start + dst.len()]);
}And SimdCpu delegates it like the rest. It's a plumbing primitive (no arithmetic, nothing to vectorize) added to the trait now so it's available to backend code that needs to splice tensor data without going through the Tensor constructor.
The tracing profiler
To justify the SIMD work, and to know which operations to vectorize next, we want a per-operation timing breakdown. src/backend/tracing.rs adds a TracingBackend: another generic wrapper that times every Backend call and emits a tracing event for it. It wraps any backend, so you can profile the scalar one or the SIMD one.
Twenty-odd trait methods would mean twenty-odd near-identical timing wrappers, so the file uses a macro to generate them:
use super::Backend;
use crate::tensor::Tensor;
use std::cell::Cell;
use std::time::Instant;
thread_local! {
static TRACE_DEPTH: Cell<usize> = const { Cell::new(0) };
}
fn fmt_time(ms: f64) -> String {
if ms < 1.0 {
format!("{:.1}us", ms * 1000.0)
} else {
format!("{:.3}ms", ms)
}
}TRACE_DEPTH is a thread-local counter that tracks how deeply nested the current backend call is. Attention calls matmul, which is itself a backend call, so the profiler indents nested calls to show the call tree. fmt_time prints sub-millisecond durations in microseconds and longer ones in milliseconds.
The macro:
macro_rules! trace_fn {
(fn $fn:ident (&self $(, $arg:ident : $ty:ty)*) -> $ret:ty) => {
fn $fn(&self $(, $arg: $ty)*) -> $ret {
let depth = TRACE_DEPTH.with(|d| {
let n = d.get();
d.set(n + 1);
n
});
let mut args_vec = vec![];
$( args_vec.push(format!("{}={:?}", stringify!($arg), $arg)); )*
let t0 = Instant::now();
let result = self.inner.$fn($($arg),*);
let elapsed = t0.elapsed();
let time_ms = elapsed.as_secs_f64() * 1000.0 + elapsed.subsec_nanos() as f64 / 1_000_000.0;
let args = args_vec.join(", ");
let prefix = "-> ".repeat(depth);
tracing::trace!("{}[{}] [{}] ({}) {}", prefix, self.name, stringify!($fn), args, fmt_time(time_ms));
TRACE_DEPTH.with(|d| { d.set(d.get() - 1); });
result
}
};
(fn $fn:ident (&self $(, $arg:ident : $ty:ty)*)) => {
fn $fn(&self $(, $arg: $ty)*) {
let depth = TRACE_DEPTH.with(|d| {
let n = d.get();
d.set(n + 1);
n
});
let mut args_vec = vec![];
$( args_vec.push(format!("{}={:?}", stringify!($arg), $arg)); )*
let t0 = Instant::now();
self.inner.$fn($($arg),*);
let elapsed = t0.elapsed();
let time_ms = elapsed.as_secs_f64() * 1000.0 + elapsed.subsec_nanos() as f64 / 1_000_000.0;
let args = args_vec.join(", ");
let prefix = "-> ".repeat(depth);
tracing::trace!("{}[{}] [{}] ({}) {}", prefix, self.name, stringify!($fn), args, fmt_time(time_ms));
TRACE_DEPTH.with(|d| { d.set(d.get() - 1); });
}
};
}trace_fn! takes a method signature and generates a wrapper that does the same five things every time: bump the depth counter, format the arguments, time the inner call, emit a tracing::trace! event with an indent proportional to depth, restore the counter. There are two macro arms (one for methods that return a value, one for the -> () methods like copy_2d_into_cols) because the returning arm needs let result = ...; ...; result and the unit arm doesn't.
The wrapper struct and its impl:
pub struct TracingBackend<B: Backend> {
inner: B,
name: String,
}
impl<B: Backend> TracingBackend<B> {
pub fn new(inner: B) -> Self {
let name = inner.name().to_uppercase();
Self { inner, name }
}
}
impl<B: Backend> Backend for TracingBackend<B> {
fn name(&self) -> String {
self.name.clone()
}
trace_fn!(fn add (&self, a: &Tensor, b: &Tensor) -> Tensor);
trace_fn!(fn hadamard (&self, a: &Tensor, b: &Tensor) -> Tensor);
trace_fn!(fn matmul (&self, a: &Tensor, b: &Tensor) -> Tensor);
trace_fn!(fn concat_dim0 (&self, a: &Tensor, b: &Tensor) -> Tensor);
trace_fn!(fn silu (&self, x: &Tensor) -> Tensor);
trace_fn!(fn rsqrt_elem (&self, x: &Tensor) -> Tensor);
trace_fn!(fn transpose_2d (&self, x: &Tensor) -> Tensor);
trace_fn!(fn softmax_rows (&self, x: &Tensor) -> Tensor);
trace_fn!(fn scale (&self, x: &Tensor, s: f32) -> Tensor);
trace_fn!(fn add_scalar (&self, x: &Tensor, s: f32) -> Tensor);
trace_fn!(fn sum_squares_axis (&self, x: &Tensor, axis: usize) -> Tensor);
trace_fn!(fn broadcast_row_scalars (&self, x: &Tensor, d: usize) -> Tensor);
trace_fn!(fn repeat_row_as_matrix (&self, x: &Tensor, rows: usize) -> Tensor);
trace_fn!(fn reshape_data (&self, x: &Tensor, shape: Vec<usize>) -> Tensor);
trace_fn!(fn gather_rows (&self, x: &Tensor, indices: &[usize]) -> Tensor);
trace_fn!(fn fill_strict_upper_tri (&self, x: &Tensor, value: f32) -> Tensor);
trace_fn!(fn copy_2d_from_cols(&self, x: &Tensor, w: usize, col_offset: usize) -> Tensor);
trace_fn!(fn copy_row_2d (&self, x: &Tensor, row: usize) -> Tensor);
trace_fn!(fn apply_rope (&self, x: &Tensor, head_dim: usize, rope_theta: f32) -> Tensor);
trace_fn!(fn apply_rope_single_row (&self, x: &Tensor, position: usize, head_dim: usize, rope_theta: f32) -> Tensor);
trace_fn!(fn argmax_with_prob (&self, x: &Tensor) -> (usize, f32));
trace_fn!(fn copy_2d_into_cols(&self, dst: &mut [f32], dst_cols: usize, src: &Tensor, col_offset: usize));
trace_fn!(fn copy_contiguous_into (&self, x: &Tensor, start: usize, dst: &mut [f32]));
}Each trace_fn!(...) line expands into a full timed wrapper. The whole Backend trait, every method, gets a profiling wrapper, in 22 lines. The profiler is opt-in: it only attaches when RUST_LOG requests trace-level output, which is the next piece.
The factory: selecting and wrapping a backend
src/backend/factory.rs now has to do two things: pick a backend by name, and optionally wrap it in TracingBackend:
use std::sync::Arc;
use super::Backend;
use super::{CpuBackend, SimdCpu, TracingBackend};
pub fn create_backend(name: &str, enable_tracing: bool) -> Result<Arc<dyn Backend>, String> {
let name = name.trim();
match name {
"scalar" => Ok(wrap_scalar(enable_tracing)),
"simd" => Ok(wrap_simd(enable_tracing)),
other => Err(format!(
"unknown backend {other:?} (supported: scalar, simd)"
)),
}
}
fn wrap_scalar(enable_tracing: bool) -> Arc<dyn Backend> {
if enable_tracing {
Arc::new(TracingBackend::new(CpuBackend))
} else {
Arc::new(CpuBackend)
}
}
fn wrap_simd(enable_tracing: bool) -> Arc<dyn Backend> {
let simd = SimdCpu::new(CpuBackend);
if enable_tracing {
Arc::new(TracingBackend::new(simd))
} else {
Arc::new(simd)
}
}create_backend gains an enable_tracing parameter and a new "simd" arm. Each wrap_* helper builds the backend and, if tracing is on, slips a TracingBackend around it. Note how the generics compose: wrap_simd builds SimdCpu::new(CpuBackend), a SIMD backend with the scalar one as fallback, and tracing wraps the whole SimdCpu<CpuBackend>. The wrapper types stack cleanly because each is generic over what it wraps.
The module file exports the new types:
mod backend_trait;
pub(crate) mod cpu;
mod factory;
pub(crate) mod simd_cpu;
pub(crate) mod tracing;
pub use backend_trait::Backend;
pub use factory::create_backend;
pub(crate) use cpu::CpuBackend;
pub(crate) use simd_cpu::SimdCpu;
pub(crate) use tracing::TracingBackend;The --backend flag
src/cli/args.rs learns --backend. Unlike --kv (whose mode word is optional), --backend requires a value, so ArgCursor gets a helper that insists on one:
fn expect_value(&mut self, flag: &str) -> String {
match self.advance() {
Some(v) if !v.starts_with('-') => v.to_string(),
_ => panic!("{flag} requires a value"),
}
}expect_value consumes the next argument and returns it. If there isn't one, or it looks like another flag, it panics with a clear message naming the offending flag.
CliArgs gets a backend field and the parse arm:
pub struct CliArgs {
kv_mode: Option<&'static str>,
backend: Option<String>,
positionals: Vec<String>,
} pub fn parse(args: Vec<String>) -> Self {
let mut kv_mode = None;
let mut backend = None;
let mut positionals = Vec::new();
let mut cur = ArgCursor::new(&args);
while cur.has_more() {
match cur.peek() {
Some("--kv") => {
cur.advance();
kv_mode = Some(parse_kv_mode(&mut cur));
}
Some("--backend") => {
cur.advance();
backend = Some(cur.expect_value("--backend"));
}
_ => positionals.push(cur.take()),
}
}
Self {
kv_mode,
backend,
positionals,
}
}A --backend accessor with a caller-supplied default, plus a free function to decide whether tracing should be on:
pub fn backend(&self, default: &str) -> String {
self.backend
.clone()
.unwrap_or_else(|| default.to_string())
}
pub fn kv_cache_mode(&self) -> Option<&'static str> {
self.kv_mode
}
}
pub fn rust_log_enables_trace() -> bool {
std::env::var("RUST_LOG")
.map(|v| v.to_lowercase().contains("trace"))
.unwrap_or(false)
}rust_log_enables_trace checks whether RUST_LOG mentions trace. The per-operation profiler is expensive (it formats arguments and emits an event for every backend call), so we only attach the TracingBackend when the user has actually asked for trace-level output. The module re-export adds it:
mod args;
pub use args::{CliArgs, rust_log_enables_trace};pub use cli::{CliArgs, rust_log_enables_trace};Wiring it into the binary
model-generate reads --backend and passes the trace decision to create_backend:
use std::path::Path;
use inferno::{
CliArgs, Metrics, create_backend, create_kv_cache, greedy_generate, load_from_gguf_path,
rust_log_enables_trace,
};
fn usage() -> ! {
eprintln!(
"usage: model-generate [--kv [basic]] [--backend scalar|simd] <gguf_path> [prompt] [max_new_tokens]"
);
std::process::exit(2);
} let args = CliArgs::from_env();
let backend_name = args.backend("simd");
let kv_mode = args.kv_cache_mode(); let backend = create_backend(&backend_name, rust_log_enables_trace()).unwrap_or_else(|e| {
eprintln!("error: {e}");
std::process::exit(2);
}); println!("backend: {}", backend_name);
println!("kv cache: {}", kv_mode.unwrap_or("off"));Two things changed. The default backend is now "simd", not "scalar". From this chapter on, model-generate uses the fast path unless you ask for --backend scalar. And create_backend gets rust_log_enables_trace(), so a RUST_LOG=trace run automatically gets the profiler.
Running it
Compare scalar against SIMD on a decode-heavy run, KV cache on:
cargo run --release --bin model-generate -- --kv --backend scalar path/to/qwen3-0.6b.gguf "Once upon a time" 32
cargo run --release --bin model-generate -- --kv --backend simd path/to/qwen3-0.6b.gguf "Once upon a time" 32Scalar, the II.2 numbers:
backend: scalar
kv cache: basic
metrics:
time_to_first_token_ms: 423.104
decode_tokens_per_second: 47.602
per_forward_ms: min 19.882 max 423.104 mean 32.690 (n=32)SIMD:
backend: simd
kv cache: basic
metrics:
time_to_first_token_ms: 78.221
decode_tokens_per_second: 271.402
per_forward_ms: min 3.402 max 78.221 mean 5.781 (n=32)Both numbers move, and both for the same reason (matmul got 4-to-16× faster), but the magnitudes differ. Time to first token (prefill) drops from ~423 ms to ~78 ms: prefill is matmul-dominated, so vectorizing matmul helps it almost directly. Decode throughput climbs from ~48 to ~270 tokens/sec. Identical output text, identical token ids. SimdCpu computes exactly what CpuBackend does, just with the lanes the hardware always had.
For the profiler, add RUST_LOG=trace and you get a per-operation, indented call tree (matmul, softmax_rows, apply_rope, each with its duration), confirming that matmul dominates and is the right thing to keep optimizing:
RUST_LOG=trace cargo run --release --bin model-generate -- --kv --backend simd path/to/qwen3-0.6b.gguf "Hi" 1Where this leaves us
SimdCpu is a second Backend, selectable with --backend simd, that vectorizes matmul with NEON and inherits everything else from the scalar backend. The KV cache fixed the algorithm; SIMD fixed the kernel; together they've taken decode from the Act 1 baseline's fraction of a token per second to a few hundred. And TracingBackend gives us a profiler to keep aiming the next optimization.
But SimdCpu still runs on one core. Your laptop has eight, or ten, and right now nine of them sit idle while one grinds through every matmul row in sequence. The rows of a matmul are completely independent (row i of the output never depends on row j), which is the textbook setup for parallelism. The next chapter wraps the SIMD backend in a parallel one that spreads matmul rows across every core at once.