II.4: Multithreaded CPU backend
The SIMD backend from II.3 made each matmul instruction do more work: 16 floats at a time instead of one. But all those instructions still run on a single CPU core. The laptop this is built on has ten cores; nine of them are idle while one grinds through a matmul row by row.
This chapter fixes that. It adds a third Backend, Parallel, that splits matmul across every core at once. The result is another multiplied speedup on top of SIMD. Like SIMD, it changes nothing about what gets computed, only how the work is spread.
There is a subtlety worth stating up front, and the chapter's numbers will bear it out: multithreading helps prefill a great deal and decode much less. Understanding why is the point of the chapter, not a footnote to it.
Why matmul rows parallelize cleanly
A matmul produces an m × p output. Every output element C[i][j] is row i of A dotted with column j of B, and crucially no output element depends on any other. Computing C[5][2] needs nothing from C[4][7]. Each output row can be computed entirely independently of every other row.
That independence is exactly what makes parallelism safe and easy here. If we hand row 0 to core 0, row 1 to core 1, and so on, the cores never need to talk to each other, never wait on each other, and never write to the same memory. There are no locks, no shared mutable state, no race conditions to reason about. This is the friendliest possible kind of parallel work, embarrassingly parallel in the standard phrase.
So the plan: keep the SIMD saxpy kernel from II.3 exactly as it is, but instead of looping over output rows on one core, distribute the rows across all cores. Each core SIMD-computes its share of rows; when they're all done, the output is complete.
rayon: parallel iterators
We could spawn threads by hand with std::thread, but splitting work into balanced chunks, sizing a thread pool to the core count, and joining everything back up is fiddly to get right. So we use rayon, the standard data-parallelism crate for Rust.
rayon's core idea: take an ordinary iterator and turn it into a parallel iterator by changing .iter() to .par_iter() (or .chunks_mut(...) to .par_chunks_mut(...)). rayon then runs the iterator's work across a thread pool it manages (sized to your CPU's core count) using work-stealing so that if one core finishes early it grabs work from a busier one, keeping all cores loaded. The closure body you write is unchanged; only the iterator adapter differs.
It goes in Cargo.toml:
[dependencies]
regex = "1"
rayon = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }The Parallel backend
Same pattern as SimdCpu in II.3: a generic wrapper that does matmul itself and delegates everything else to the backend it wraps. src/backend/parallel_cpu.rs:
use super::Backend;
use super::simd_cpu::SimdCpu;
use crate::tensor::{Tensor, TensorData};
use rayon::prelude::*;
const MIN_ROWS_FOR_PARALLEL: usize = 4;
pub struct Parallel<B: Backend> {
inner: B,
}
impl<B: Backend> Parallel<B> {
pub fn new(inner: B) -> Self {
Self { inner }
}Parallel<B> wraps a backend B. In practice B is a SimdCpu<CpuBackend>, so a Parallel is "the SIMD backend, but with matmul spread over cores." We deliberately wrap the SIMD backend, not the scalar one: parallelism and vectorization are independent wins, and stacking them gives both. Each core runs the fast SIMD kernel; all cores run at once.
MIN_ROWS_FOR_PARALLEL = 4 is a threshold we'll come back to. It's the heart of the prefill-versus-decode story.
The parallel matmul:
fn matmul_fp32_fp32(
&self,
a_data: &[f32],
b_data: &[f32],
m: usize,
n: usize,
p: usize,
) -> Tensor {
assert_eq!(a_data.len(), m * n, "matmul_fp32_fp32: len(a) must be m*n");
assert_eq!(b_data.len(), n * p, "matmul_fp32_fp32: len(b) must be n*p");
let mut data = vec![0.0f32; m * p];
data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| {
let op = out_row.as_mut_ptr();
unsafe {
for j in 0..n {
let aij = a_data[i * n + j];
let b_row = b_data.as_ptr().add(j * p);
SimdCpu::<B>::saxpy_row_f32(op, b_row, aij, p);
}
}
});
Tensor::new(data, vec![m, p])
}
}This is the SIMD matmul from II.3 with exactly one line changed. The output buffer data is m * p floats, m rows of p. data.par_chunks_mut(p) cuts that buffer into m contiguous chunks of p floats (one chunk per output row) and hands them out in parallel. .enumerate() pairs each chunk with its row index i. The closure body (for each j, do the SIMD saxpy_row_f32 of B's row j scaled by a[i][j] into out_row) is the identical inner loop from SimdCpu::matmul_fp32_fp32.
The reason this is safe: par_chunks_mut gives each parallel task a disjoint mutable slice of data. Row i's task can only touch row i's p floats; no two tasks can ever touch the same memory. Rust's borrow checker accepts this; it's the whole reason par_chunks_mut exists rather than handing every thread the full &mut [f32]. So even with raw-pointer NEON code inside, the parallelism itself is provably race-free.
Compared to II.3's matmul_fp32_fp32, the change is precisely for i in 0..m { ... } becoming data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| { ... }). Same kernel, distributed.
The matmul method and the threshold
impl<B: Backend + Sync> Backend for Parallel<B> {
fn name(&self) -> String {
"parallel".to_string()
}
fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
assert_eq!(a.shape().len(), 2);
assert_eq!(b.shape().len(), 2);
let m = a.shape()[0];
if m < MIN_ROWS_FOR_PARALLEL {
return self.inner.matmul(a, b);
}
let a_shape = a.shape();
let b_shape = b.shape();
let n = a_shape[1];
match (a.as_data(), b.as_data()) {
(TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
let p = b_shape[1];
assert_eq!(n, b_shape[0], "tensor shape mismatch");
self.matmul_fp32_fp32(a_data, b_data, m, n, p)
}
}
}The impl is bounded B: Backend + Sync, meaning the wrapped backend must be Sync (safe to share across threads) because rayon's worker threads all reference it at once. CpuBackend and SimdCpu are both Sync, so this is satisfied.
Now the threshold. The first thing matmul does is check m, the number of output rows:
if m < MIN_ROWS_FOR_PARALLEL {
return self.inner.matmul(a, b);
}If the matmul has fewer than 4 output rows, don't parallelize it: fall straight through to the wrapped SIMD backend and run it on one core. This is the prefill/decode distinction, made into code, and it's worth being precise about.
Spreading work across cores is not free. rayon has to wake worker threads, hand each a chunk, and synchronize when they finish. That overhead is a fixed cost of maybe a few microseconds. If the matmul is big (a prefill matmul with hundreds of output rows, one per prompt token), that overhead is a rounding error against the work itself, and parallelism is a huge win. If the matmul is tiny (a decode matmul with one output row, because decode processes one token), there is nothing to split. Ten cores can't share one row. Worse, paying the thread-wakeup overhead to "parallelize" a one-row matmul makes it slower than just running it on one core.
So MIN_ROWS_FOR_PARALLEL is the guard: matmuls with m >= 4 rows go parallel; matmuls with fewer rows go straight to SIMD. And this is exactly why multithreading helps prefill far more than decode:
- Prefill processes the whole prompt at once. Its matmuls have one output row per prompt token, easily hundreds.
mis large, the threshold passes, all cores engage. Prefill scales nearly linearly with core count. - Decode processes one token. Its matmuls have one output row.
m == 1, the threshold fails every time, and the decode path runs entirely on the SIMD backend; it gets no benefit from this chapter at all.
That is not a flaw in the design; it's a consequence of the workloads. Decode's matmuls are matrix-vector products with a single output row, and a single row genuinely cannot be split across cores. Decode's real lever is reducing bytes, which is what II.6 is about. This chapter's win is prefill.
fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
assert_eq!(x.shape().len(), 2, "sum_squares_axis only supports 2-D");
assert!(axis < 2, "axis out of bounds");
if axis == 1 {
let rows = x.shape()[0];
let cols = x.shape()[1];
let src = x.as_f32_slice();
let data: Vec<f32> = (0..rows)
.into_par_iter()
.map(|r| {
SimdCpu::<B>::dot_f32(
&src[r * cols..(r + 1) * cols],
&src[r * cols..(r + 1) * cols],
)
})
.collect();
Tensor::new(data, vec![rows])
} else {
self.inner.sum_squares_axis(x, axis)
}
}sum_squares_axis gets the same treatment as in II.3, parallelized: (0..rows).into_par_iter() turns the range of row indices into a parallel iterator, and .map(...) computes each row's sum-of-squares (via the SIMD dot_f32) on whatever core rayon assigns. .collect() gathers the results in order. The rows are independent, so this is again race-free by construction.
Everything else delegates to inner, the same pattern as SimdCpu. Here are the first few; the rest follow identically:
fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
self.inner.add(a, b)
}
fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
self.inner.hadamard(a, b)
}
fn scale(&self, x: &Tensor, s: f32) -> Tensor {
self.inner.scale(x, s)
}
fn silu(&self, x: &Tensor) -> Tensor {
self.inner.silu(x)
}
fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor {
self.inner.add_scalar(a, s)
}
fn rsqrt_elem(&self, x: &Tensor) -> Tensor {
self.inner.rsqrt_elem(x)
}…and the same for broadcast_row_scalars, transpose_2d, softmax_rows, gather_rows, reshape_data, fill_strict_upper_tri, copy_2d_from_cols, copy_2d_into_cols, repeat_row_as_matrix, apply_rope, apply_rope_single_row, concat_dim0, copy_row_2d, copy_contiguous_into, and argmax_with_prob. Parallel vectorizes nothing on its own; it parallelizes the two operations that have parallelizable structure (matmul, sum_squares_axis) and lets the SIMD backend underneath handle the rest. Those other operations are cheap enough that the thread-pool overhead would not pay for itself.
Wiring it into the factory
src/backend/factory.rs gets a "parallel" arm:
use std::sync::Arc;
use super::Backend;
use super::{CpuBackend, Parallel, SimdCpu, TracingBackend};
pub fn create_backend(name: &str, enable_tracing: bool) -> Result<Arc<dyn Backend>, String> {
let name = name.trim();
match name {
"scalar" => Ok(wrap_scalar(enable_tracing)),
"simd" => Ok(wrap_simd(enable_tracing)),
"parallel" => Ok(wrap_parallel(enable_tracing)),
other => Err(format!(
"unknown backend {other:?} (supported: scalar, simd, parallel)"
)),
}
}fn wrap_parallel(enable_tracing: bool) -> Arc<dyn Backend> {
let parallel = Parallel::new(SimdCpu::new(CpuBackend));
if enable_tracing {
Arc::new(TracingBackend::new(parallel))
} else {
Arc::new(parallel)
}
}The line that matters is Parallel::new(SimdCpu::new(CpuBackend)). Read it inside-out: CpuBackend is the scalar base; SimdCpu::new(...) wraps it with vectorized matmul; Parallel::new(...) wraps that with cross-core distribution. The type is Parallel<SimdCpu<CpuBackend>>, three layers, each generic over the one below. A parallel matmul on a big input runs the SIMD saxpy kernel on every core; a tiny one falls through to a single-core SIMD run; anything that isn't matmul falls all the way to scalar. The generic wrappers from II.3 are what make this composition possible: no layer hard-codes what's beneath it.
The module exports it:
mod backend_trait;
pub(crate) mod cpu;
mod factory;
pub(crate) mod parallel_cpu;
pub(crate) mod simd_cpu;
pub(crate) mod tracing;
pub use backend_trait::Backend;
pub use factory::create_backend;
pub(crate) use cpu::CpuBackend;
pub(crate) use parallel_cpu::Parallel;
pub(crate) use simd_cpu::SimdCpu;
pub(crate) use tracing::TracingBackend;And model-generate's usage string adds the new option, the only change to the binary:
fn usage() -> ! {
eprintln!(
"usage: model-generate [--kv [basic]] [--backend scalar|simd|parallel] <gguf_path> [prompt] [max_new_tokens]"
);
std::process::exit(2);
}The --backend flag from II.3 already does all the parsing work; "parallel" simply joins "scalar" and "simd" as a valid value.
Running it
This chapter's win is prefill, so the run that shows it best uses a long prompt: more prompt tokens means more output rows in the prefill matmuls, which means more for the cores to chew on. Compare SIMD against parallel:
cargo run --release --bin model-generate -- --kv --backend simd path/to/qwen3-0.6b.gguf "$(cat long-prompt.txt)" 32
cargo run --release --bin model-generate -- --kv --backend parallel path/to/qwen3-0.6b.gguf "$(cat long-prompt.txt)" 32SIMD, on a ~512-token prompt:
backend: simd
kv cache: basic
metrics:
time_to_first_token_ms: 1840.115
decode_tokens_per_second: 268.901
per_forward_ms: min 3.624 max 1840.115 mean 60.842 (n=32)Parallel, same prompt:
backend: parallel
kv cache: basic
metrics:
time_to_first_token_ms: 312.774
decode_tokens_per_second: 281.140
per_forward_ms: min 3.503 max 312.774 mean 13.402 (n=32)The numbers say exactly what the threshold predicted. Time to first token (pure prefill) drops from ~1840 ms to ~313 ms, almost 6× on a ten-core machine (not a clean 10× because of memory bandwidth limits and the work that doesn't parallelize). Decode throughput barely moves, from ~269 to ~281 tokens/sec: decode's matmuls have one output row, the m < 4 threshold sends them straight to the single-core SIMD path, and so decode runs essentially the same code it did in II.3. That gap between the two numbers is not a disappointment; it is the design working as intended.
Where this leaves us
Parallel is a third Backend, --backend parallel, that spreads matmul rows across every CPU core while keeping the SIMD kernel underneath. Prefill, which has plenty of rows to split, scales close to the core count. Decode, whose matmuls are single-row matrix-vector products, deliberately falls through to the single-core SIMD path. There is nothing to parallelize, and trying would only add overhead.
We've now extracted most of what a CPU can give: 16-wide SIMD lanes, all the cores. A GPU has a different and larger budget: thousands of tiny arithmetic units instead of ten big ones. Matmul, with its perfectly independent output elements, is precisely the shape of work GPUs are built for. The next chapter writes a fourth backend that offloads matmul to the GPU through Apple's Metal API.