II.4: Multithreaded CPU backend

The SIMD backend from II.3 made each matmul instruction do more work: 16 floats at a time instead of one. But all those instructions still run on a single CPU core. The laptop this is built on has ten cores; nine of them are idle while one grinds through a matmul row by row.

This chapter fixes that. It adds a third Backend, Parallel, that splits matmul across every core at once. The result is another multiplied speedup on top of SIMD. Like SIMD, it changes nothing about what gets computed, only how the work is spread.

One subtlety up front, and the chapter's numbers will bear it out: multithreading helps prefill a great deal and decode much less. Understanding why is the point of the chapter, not a footnote to it.

Why matmul rows parallelize cleanly

A matmul produces an m × p output. Every output element C[i][j] is row i of A dotted with column j of B, and, crucially, no output element depends on any other. Computing C[5][2] needs nothing from C[4][7]. Each output row can be computed entirely independently of every other row.

That independence is what makes parallelism safe and easy here. If we hand row 0 to core 0, row 1 to core 1, and so on, the cores never need to talk to each other, never wait on each other, and never write to the same memory. There are no locks, no shared mutable state, no race conditions to reason about. This is the friendliest possible kind of parallel work, embarrassingly parallel in the standard phrase.

So the plan: keep the SIMD saxpy kernel from II.3 exactly as it is, but instead of looping over output rows on one core, distribute the rows across all cores. Each core SIMD-computes its share of rows; when they're all done, the output is complete.

rayon: parallel iterators

We could spawn threads by hand with std::thread, but splitting work into balanced chunks, sizing a thread pool to the core count, and joining everything back up is fiddly to get right. So we use rayon, the standard data-parallelism crate for Rust.

rayon's core idea: take an ordinary iterator and turn it into a parallel iterator by changing .iter() to .par_iter() (or .chunks_mut(...) to .par_chunks_mut(...)). rayon then runs the iterator's work across a thread pool it manages (sized to your CPU's core count) using work-stealing so that if one core finishes early it grabs work from a busier one, keeping all cores loaded. The closure body you write is unchanged; only the iterator adapter differs.

It goes in Cargo.toml:

TOML

[dependencies]
regex = "1"
rayon = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }

The Parallel backend

Same pattern as SimdCpu in II.3: a generic wrapper that does matmul itself and delegates everything else to the backend it wraps. src/backend/parallel_cpu.rs:

RUST

use super::Backend;
use super::simd_cpu::SimdCpu;
use crate::tensor::{Tensor, TensorData};
use rayon::prelude::*;
 
const MIN_ROWS_FOR_PARALLEL: usize = 4;
 
pub struct Parallel<B: Backend> {
    inner: B,
}
 
impl<B: Backend> Parallel<B> {
    pub fn new(inner: B) -> Self {
        Self { inner }
    }

Parallel<B> wraps a backend B. In practice B is a SimdCpu<CpuBackend>, so a Parallel is "the SIMD backend, but with matmul spread over cores." We deliberately wrap the SIMD backend, not the scalar one: parallelism and vectorization are independent wins, and stacking them gives both. Each core runs the fast SIMD kernel; all cores run at once.

MIN_ROWS_FOR_PARALLEL = 4 is a threshold we'll come back to. It's the heart of the prefill-versus-decode story.

The parallel matmul:

RUST

    fn matmul_fp32_fp32(
        &self,
        a_data: &[f32],
        b_data: &[f32],
        m: usize,
        n: usize,
        p: usize,
    ) -> Tensor {
        assert_eq!(a_data.len(), m * n, "matmul_fp32_fp32: len(a) must be m*n");
        assert_eq!(b_data.len(), n * p, "matmul_fp32_fp32: len(b) must be n*p");
        let mut data = vec![0.0f32; m * p];
 
        data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| {
            let op = out_row.as_mut_ptr();
            unsafe {
                for j in 0..n {
                    let aij = a_data[i * n + j];
                    let b_row = b_data.as_ptr().add(j * p);
                    SimdCpu::<B>::saxpy_row_f32(op, b_row, aij, p);
                }
            }
        });
 
        Tensor::new(data, vec![m, p])
    }
}

This is the SIMD matmul from II.3 with exactly one line changed. The output buffer data is m * p floats, m rows of p. data.par_chunks_mut(p) cuts that buffer into m contiguous chunks of p floats (one chunk per output row) and hands them out in parallel. .enumerate() pairs each chunk with its row index i. The closure body (for each j, do the SIMD saxpy_row_f32 of B's row j scaled by a[i][j] into out_row) is the identical inner loop from SimdCpu::matmul_fp32_fp32.

Output matrix drawn as eight row stripes, colored by which of four cores owns each row.

Figure: the split rayon makes. Each core owns whole output rows, so no synchronization is needed inside the loop.

The reason this is safe: par_chunks_mut gives each parallel task a disjoint mutable slice of data. Row i's task can only touch row i's p floats; no two tasks can ever touch the same memory. Rust's borrow checker accepts this; it's the whole reason par_chunks_mut exists rather than handing every thread the full &mut [f32]. So even with raw-pointer NEON code inside, the parallelism itself is provably race-free.

Compared to II.3's matmul_fp32_fp32, the change is for i in 0..m { ... } becoming data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| { ... }). Same kernel, distributed.

The matmul method and the threshold

RUST

impl<B: Backend + Sync> Backend for Parallel<B> {
    fn name(&self) -> String {
        "parallel".to_string()
    }
 
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape().len(), 2);
        assert_eq!(b.shape().len(), 2);
        let m = a.shape()[0];
        if m < MIN_ROWS_FOR_PARALLEL {
            return self.inner.matmul(a, b);
        }
        let a_shape = a.shape();
        let b_shape = b.shape();
        let n = a_shape[1];
        match (a.as_data(), b.as_data()) {
            (TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
                let p = b_shape[1];
                assert_eq!(n, b_shape[0], "tensor shape mismatch");
                self.matmul_fp32_fp32(a_data, b_data, m, n, p)
            }
        }
    }

The impl is bounded B: Backend + Sync, meaning the wrapped backend must be Sync (safe to share across threads) because rayon's worker threads all reference it at once. CpuBackend and SimdCpu are both Sync, so this is satisfied.

Now the threshold. The first thing matmul does is check m, the number of output rows:

RUST

if m < MIN_ROWS_FOR_PARALLEL {
    return self.inner.matmul(a, b);
}

If the matmul has fewer than 4 output rows, don't parallelize it: fall straight through to the wrapped SIMD backend and run it on one core. This is the prefill/decode distinction, made into code.

Spreading work across cores is not free. rayon has to wake worker threads, hand each a chunk, and synchronize when they finish. That overhead is a fixed cost of maybe a few microseconds. If the matmul is big (a prefill matmul with hundreds of output rows, one per prompt token), that overhead is a rounding error against the work itself, and parallelism is a huge win. If the matmul is tiny (a decode matmul with one output row, because decode processes one token), there is nothing to split. Ten cores can't share one row. Worse, paying the thread-wakeup overhead to "parallelize" a one-row matmul makes it slower than just running it on one core.

So MIN_ROWS_FOR_PARALLEL is the guard: matmuls with m >= 4 rows go parallel; matmuls with fewer rows go straight to SIMD. And this is exactly why multithreading helps prefill far more than decode:

Prefill processes the whole prompt at once. Its matmuls have one output row per prompt token, easily hundreds. m is large, the threshold passes, all cores engage. Prefill speeds up with core count.
Decode processes one token. Its matmuls have one output row. m == 1, the threshold fails every time, and the decode path runs entirely on the SIMD backend; it gets no benefit from this chapter at all.

That is a consequence of the workloads, not a flaw in the design. Decode's matmuls are matrix-vector products with a single output row, and a single row cannot be split across cores. Decode's real lever is reducing bytes, which is what II.6 is about. This chapter's win is prefill.

RUST

    fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
        assert_eq!(x.shape().len(), 2, "sum_squares_axis only supports 2-D");
        assert!(axis < 2, "axis out of bounds");
        if axis == 1 {
            let rows = x.shape()[0];
            let cols = x.shape()[1];
            let src = x.as_f32_slice();
 
            let data: Vec<f32> = (0..rows)
                .into_par_iter()
                .map(|r| {
                    SimdCpu::<B>::dot_f32(
                        &src[r * cols..(r + 1) * cols],
                        &src[r * cols..(r + 1) * cols],
                    )
                })
                .collect();
 
            Tensor::new(data, vec![rows])
        } else {
            self.inner.sum_squares_axis(x, axis)
        }
    }

sum_squares_axis gets the same treatment as in II.3, parallelized: (0..rows).into_par_iter() turns the range of row indices into a parallel iterator, and .map(...) computes each row's sum-of-squares (via the SIMD dot_f32) on whatever core rayon assigns. .collect() gathers the results in order. The rows are independent, so this is again race-free by construction.

Everything else delegates to inner, the same pattern as SimdCpu. Here are the first few; the rest follow identically:

RUST

    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.inner.add(a, b)
    }
    fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.inner.hadamard(a, b)
    }
    fn scale(&self, x: &Tensor, s: f32) -> Tensor {
        self.inner.scale(x, s)
    }
    fn silu(&self, x: &Tensor) -> Tensor {
        self.inner.silu(x)
    }
    fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor {
        self.inner.add_scalar(a, s)
    }
    fn rsqrt_elem(&self, x: &Tensor) -> Tensor {
        self.inner.rsqrt_elem(x)
    }

…and the same for broadcast_row_scalars, transpose_2d, softmax_rows, gather_rows, reshape_data, fill_strict_upper_tri, copy_2d_from_cols, copy_2d_into_cols, repeat_row_as_matrix, apply_rope, apply_rope_single_row, concat_dim0, copy_row_2d, copy_contiguous_into, and argmax_with_prob. Parallel vectorizes nothing on its own; it parallelizes the two operations that have parallelizable structure (matmul, sum_squares_axis) and lets the SIMD backend underneath handle the rest. Those other operations are cheap enough that the thread-pool overhead would not pay for itself.

Wiring it into the factory

src/backend/factory.rs gets a "parallel" arm:

RUST

use std::sync::Arc;
 
use super::Backend;
use super::{CpuBackend, Parallel, SimdCpu, TracingBackend};
 
pub fn create_backend(name: &str, enable_tracing: bool) -> Result<Arc<dyn Backend>, String> {
    let name = name.trim();
    match name {
        "scalar" => Ok(wrap_scalar(enable_tracing)),
        "simd" => Ok(wrap_simd(enable_tracing)),
        "parallel" => Ok(wrap_parallel(enable_tracing)),
        other => Err(format!(
            "unknown backend {other:?} (supported: scalar, simd, parallel)"
        )),
    }
}

RUST

fn wrap_parallel(enable_tracing: bool) -> Arc<dyn Backend> {
    let parallel = Parallel::new(SimdCpu::new(CpuBackend));
    if enable_tracing {
        Arc::new(TracingBackend::new(parallel))
    } else {
        Arc::new(parallel)
    }
}

The line that matters is Parallel::new(SimdCpu::new(CpuBackend)). Read it inside-out: CpuBackend is the scalar base; SimdCpu::new(...) wraps it with vectorized matmul; Parallel::new(...) wraps that with cross-core distribution. The type is Parallel<SimdCpu<CpuBackend>>, three layers, each generic over the one below. A parallel matmul on a big input runs the SIMD saxpy kernel on every core; a tiny one falls through to a single-core SIMD run; anything that isn't matmul falls all the way to scalar. The generic wrappers from II.3 are what make this composition possible: no layer hard-codes what's beneath it.

The module exports it:

RUST

mod backend_trait;
pub(crate) mod cpu;
mod factory;
pub(crate) mod parallel_cpu;
pub(crate) mod simd_cpu;
pub(crate) mod tracing;
 
pub use backend_trait::Backend;
pub use factory::create_backend;
 
pub(crate) use cpu::CpuBackend;
pub(crate) use parallel_cpu::Parallel;
pub(crate) use simd_cpu::SimdCpu;
pub(crate) use tracing::TracingBackend;

And model-generate's usage string adds the new option, the only change to the binary:

RUST

fn usage() -> ! {
    eprintln!(
        "usage: model-generate [--kv [basic]] [--backend scalar|simd|parallel] <gguf_path> [prompt] [max_new_tokens]"
    );
    std::process::exit(2);
}

The --backend flag from II.3 already does all the parsing work; "parallel" simply joins "scalar" and "simd" as a valid value.

Running it

This chapter's win is prefill, so the run that shows it best uses a long prompt: more prompt tokens means more output rows in the prefill matmuls, which means more for the cores to chew on. First build one, 51 repetitions of a pangram, which Qwen3's tokenizer turns into exactly 511 tokens:

BASH

printf 'The quick brown fox jumps over the lazy dog. %.0s' {1..51} > long-prompt.txt

Now compare SIMD against parallel:

BASH

cargo run --release --bin model-generate -- --kv --backend simd     path/to/Qwen3-0.6B-FP32.gguf "$(cat long-prompt.txt)" 32
cargo run --release --bin model-generate -- --kv --backend parallel path/to/Qwen3-0.6B-FP32.gguf "$(cat long-prompt.txt)" 32

SIMD, on a ~512-token prompt:

PLAINTEXT

backend: simd
kv cache: basic
 
metrics:
  time_to_first_token_ms: 19378.363
  decode_tokens_per_second: 10.726
  per_forward_ms: min 86.876  max 19378.363  mean 695.892  (n=32)

Parallel, same prompt:

PLAINTEXT

backend: parallel
kv cache: basic
 
metrics:
  time_to_first_token_ms: 3783.670
  decode_tokens_per_second: 10.630
  per_forward_ms: min 86.750  max 3783.670  mean 209.370  (n=32)

The numbers match what the threshold predicted. Time to first token (pure prefill) drops from ~19.4 s to ~3.8 s, a 5.1× win on a ten-core machine (not a clean 10× because of memory bandwidth limits and the work that doesn't parallelize). Decode throughput doesn't move at all: ~10.7 tokens/sec before, ~10.6 after. Decode's matmuls have one output row, the m < 4 threshold sends them straight to the single-core SIMD path, and so decode runs essentially the same code it did in II.3. (If you're comparing against II.3's ~18 tokens/sec, note the context: that was a 4-token prompt. With 511 tokens of history, attention and the cache append still walk the whole sequence on every decode step, so decode is slower here. The point is that the threads didn't change it, which is what the threshold was designed to do.) That gap between the two numbers is the design working as intended.

Where this leaves us

Parallel is a third Backend, --backend parallel, that spreads matmul rows across every CPU core while keeping the SIMD kernel underneath. Prefill, which has plenty of rows to split, gets a 5.1× speedup from the extra cores. Decode, whose matmuls are single-row matrix-vector products, deliberately falls through to the single-core SIMD path. There is nothing to parallelize, and trying would only add overhead.

We've now extracted most of what a CPU can give: 16-wide SIMD lanes, all the cores. A GPU has a different and larger budget: thousands of tiny arithmetic units instead of ten big ones. Matmul, with its perfectly independent output elements, is the shape of work GPUs are built for. The next chapter writes a fourth backend that offloads matmul to the GPU through Apple's Metal API.