II.5: Metal GPU backend

The CPU backends are about as fast as a CPU gets for this: SIMD packs four floats into one instruction (II.3), threads spread the work over every core (II.4). A ten-core laptop CPU has, very roughly, ten powerful arithmetic units running at once.

A GPU has thousands of small ones. Matmul, where every output element is an independent dot product, is the canonical workload they were built for. This chapter writes a fourth Backend, Metal, that runs matmul on the GPU through Apple's Metal API: a compute kernel written in Metal Shading Language, command buffers to launch it, and Apple Silicon's unified memory so the CPU and GPU share the same bytes without copying.

Like every backend before it, Metal changes only where matmul runs, not what it computes. Like Parallel, it uses a size threshold to decide when the GPU is worth the trip and when to fall back to the CPU.

What "offload to the GPU" means

For a generalist, here's the model. The GPU is a separate processor. It does not share the CPU's instruction stream or its program. To make it do work you must, every time:

1. Make sure the data the GPU needs lives in memory the GPU can read. On a discrete graphics card this means copying data across the PCIe bus into the card's own memory, which is slow. On Apple Silicon the CPU and GPU share one physical pool of RAM (unified memory), so "making the data available" can be free, which matters a lot here.
2. Tell the GPU which program to run. A GPU program is called a kernel (or shader). You write it in a small C-like language (for Metal, that is the Metal Shading Language, MSL) and the driver compiles it for the GPU.
3. Launch the kernel over a grid of threads. The GPU runs the same kernel on thousands of threads simultaneously, each knowing its own coordinate in the grid. For matmul, you launch enough threads to cover the output matrix, in groups that cooperate on one tile of C each.
4. Wait for it to finish, then read the results back.

Steps 1 and 4 (getting data to and from the GPU, and the round-trip of asking it to do something) are latency. They cost a fixed amount of time no matter how small the actual computation is. The GPU only pays off when step 3, the kernel, is doing enough work to dwarf that fixed overhead. That trade-off drives every design decision in this chapter, and it's the same trade-off as II.4's MIN_ROWS_FOR_PARALLEL threshold, just with a bigger fixed cost.
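To make the trade-off concrete, here is a minimal cost-model sketch. The function names and every number fed to them are illustrative assumptions, not measurements from this project: the CPU is modeled with no launch cost, the GPU with a fixed launch cost and a higher processing rate.

```rust
/// Time to finish `work` units at `rate` units per millisecond after paying a
/// fixed launch cost. Hypothetical model, not a measurement.
fn job_ms(fixed_ms: f64, work: f64, rate: f64) -> f64 {
    fixed_ms + work / rate
}

/// Amount of work at which the GPU and the CPU finish at the same moment,
/// assuming the CPU has no launch cost. Returns None when the GPU is not
/// faster per unit of work, because then it never catches up.
fn break_even_work(gpu_fixed_ms: f64, cpu_rate: f64, gpu_rate: f64) -> Option<f64> {
    if gpu_rate <= cpu_rate {
        return None;
    }
    // Solve work / cpu_rate = gpu_fixed_ms + work / gpu_rate for work.
    Some(gpu_fixed_ms / (1.0 / cpu_rate - 1.0 / gpu_rate))
}
```

With a made-up 3 ms launch cost and a GPU four times faster per unit of work, the break-even sits at 4 units: below that the CPU wins, above it the GPU does. A size threshold is a cheap stand-in for this calculation.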

Three crates for talking to Metal

We don't reimplement Metal; we bind to Apple's system framework through three crates:

TOML

[dependencies]
regex = "1"
rayon = "1"
mtl-gpu = "1.0"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }
mtl-foundation = "1.0.1"
mtl-sys = "1.0.1"

mtl-sys is the raw, unsafe layer: the Objective-C messaging machinery (Metal's API is Objective-C under the hood). mtl-foundation and mtl-gpu are safer Rust wrappers over the device, command queue, buffers, and pipelines. We mostly use mtl-gpu, dropping to mtl-sys once for one operation the wrapper doesn't expose.

The Metal code lives in a new submodule, src/backend/metal/, split three ways: shaders.rs (the kernel source), context.rs (device and dispatch plumbing), and backend.rs (the Backend impl). We'll take them in that order.

The matmul kernel

src/backend/metal/shaders.rs holds the GPU program as a Rust string constant, MSL source that the Metal driver compiles at startup:

RUST

pub const SHADERS: &str = r#"
#include <metal_stdlib>
using namespace metal;
 
kernel void matmul_fp32_fp32(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device       float* c [[buffer(2)]],
    constant     uint&  m [[buffer(3)]],
    constant     uint&  n [[buffer(4)]],
    constant     uint&  p [[buffer(5)]],
    uint2  tgid [[threadgroup_position_in_grid]],
    ushort sgid [[simdgroup_index_in_threadgroup]],
    ushort tid  [[thread_index_in_threadgroup]]
) {
    threadgroup float tA[32 * 32];
    threadgroup float tB[32 * 32];
    threadgroup float tC[32 * 32];
 
    const uint row0 = tgid.x * 32;
    const uint col0 = tgid.y * 32;
 
    const uint sg_row = (sgid / 2) * 16;
    const uint sg_col = (sgid % 2) * 16;
 
    simdgroup_float8x8 acc[2][2];
    for (uint i = 0; i < 2; i++) {
        for (uint j = 0; j < 2; j++) {
            acc[i][j] = simdgroup_float8x8(0.0f);
        }
    }
 
    for (uint k0 = 0; k0 < n; k0 += 32) {
        for (uint e = tid; e < 1024; e += 128) {
            const uint rr = e / 32;
            const uint kk = e % 32;
 
            const uint ar = row0 + rr;
            const uint ak = k0 + kk;
            tA[e] = (ar < m && ak < n) ? a[ar * n + ak] : 0.0f;
 
            const uint br = k0 + rr;
            const uint bc = col0 + kk;
            tB[e] = (br < n && bc < p) ? b[br * p + bc] : 0.0f;
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
 
        for (uint ks = 0; ks < 32; ks += 8) {
            simdgroup_float8x8 af[2];
            simdgroup_float8x8 bf[2];
            for (uint i = 0; i < 2; i++) {
                simdgroup_load(af[i], tA + (sg_row + i * 8) * 32 + ks, 32);
                simdgroup_load(bf[i], tB + ks * 32 + sg_col + i * 8, 32);
            }
            for (uint i = 0; i < 2; i++) {
                for (uint j = 0; j < 2; j++) {
                    simdgroup_multiply_accumulate(acc[i][j], af[i], bf[j], acc[i][j]);
                }
            }
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
 
    for (uint i = 0; i < 2; i++) {
        for (uint j = 0; j < 2; j++) {
            simdgroup_store(acc[i][j], tC + (sg_row + i * 8) * 32 + sg_col + j * 8, 32);
        }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
 
    for (uint e = tid; e < 1024; e += 128) {
        const uint rr = e / 32;
        const uint cc = e % 32;
        if (row0 + rr < m && col0 + cc < p) {
            c[(row0 + rr) * p + col0 + cc] = tC[e];
        }
    }
}
 
"#;

This computes C = A · B with A being m × n and B being n × p. The kernel keyword marks matmul_fp32_fp32 as a GPU entry point. Its parameters are bound to numbered slots: [[buffer(0)]] through [[buffer(5)]] are how Rust passes the three matrices and the three dimensions in. The last three parameters are filled in by the hardware and tell each thread where it sits: tgid is its threadgroup's coordinate in the launch grid, tid is its index within that threadgroup, and sgid is the index of its SIMD-group, the idea this whole kernel is built around.

A SIMD-group is the set of 32 GPU threads that execute in lockstep: one instruction, issued once, carried out by all 32 at the same moment. It is the same idea as the SIMD lanes from II.3, scaled up. Where the CPU packed a few float lanes into each instruction, the GPU yokes 32 whole threads to a single instruction stream. And on Apple GPUs a SIMD-group carries something extra: small matrix hardware that can multiply entire 8×8 matrices as single operations. Everything below exists to keep that hardware fed. Each threadgroup (a batch of GPU threads that run together and can share fast on-chip memory) here is 128 threads, which is four SIMD-groups.

The division of labor: each threadgroup owns one 32×32 tile of the output C, with its corner at (row0, col0), and its four SIMD-groups each own a 16×16 quadrant of that tile (sg_row, sg_col). A quadrant is held as acc: four 8×8 fragments of type simdgroup_float8x8, values that live in the matrix unit's own registers rather than in any single thread's. The zeroing loop at the top initializes them, and they accumulate results across the whole k0 loop without touching memory.
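The quadrant arithmetic is easy to check on the CPU. The sketch below is plain Rust, not kernel code, and its function names are made up for illustration. It reproduces the kernel's sg_row / sg_col formulas and counts how many 8×8 fragments land on each cell of the 32×32 tile.

```rust
/// Top-left corner, inside the 32x32 tile, of the 16x16 quadrant owned by
/// SIMD-group `sgid`. Same formulas as sg_row and sg_col in the kernel.
fn quadrant_origin(sgid: u32) -> (u32, u32) {
    ((sgid / 2) * 16, (sgid % 2) * 16)
}

/// Top-left corners of the four 8x8 accumulator fragments acc[i][j]
/// belonging to SIMD-group `sgid`.
fn fragment_origins(sgid: u32) -> [(u32, u32); 4] {
    let (r, c) = quadrant_origin(sgid);
    [(r, c), (r, c + 8), (r + 8, c), (r + 8, c + 8)]
}

/// For every cell of the 32x32 tile, how many fragments cover it.
fn coverage() -> [[u8; 32]; 32] {
    let mut hits = [[0u8; 32]; 32];
    for sgid in 0..4 {
        for (r0, c0) in fragment_origins(sgid) {
            for r in r0..r0 + 8 {
                for c in c0..c0 + 8 {
                    hits[r as usize][c as usize] += 1;
                }
            }
        }
    }
    hits
}
```

Every cell comes out covered exactly once: sixteen fragments, four per SIMD-group, tile the output with no gaps and no overlap.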

The k0 loop is the same memory story as everywhere else in Act 2. A naive kernel would stream rows of A and columns of B straight from slow main memory, with neighboring threads re-reading mostly the same data. Instead, each iteration stages one 32×32 slab of A and one of B into the small, fast threadgroup-shared arrays tA and tB. Each slab is 1024 elements and there are 128 threads, so the staging loop has every thread make 8 loads, and consecutive threads read consecutive addresses, which lets the memory system merge (coalesce) each round of loads into a few wide transactions. The (ar < m && ak < n) guards zero-pad the slab wherever it hangs past the matrix edge; m is a prompt length, not a multiple of 32, so the last row of tiles is nearly always ragged. A threadgroup_barrier makes every thread wait until both slabs are fully staged.
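The staging loop's thread-to-element mapping can be replayed on the CPU. This is a sketch in plain Rust with hypothetical helper names; it mirrors only the index arithmetic of `for (uint e = tid; e < 1024; e += 128)`, not the loads themselves.

```rust
const TILE: usize = 32;
const THREADS: usize = 128;

/// Slab element indices that thread `tid` stages, in load order.
fn staged_elements(tid: usize) -> Vec<usize> {
    (tid..TILE * TILE).step_by(THREADS).collect()
}

/// (row, column) inside the 32x32 slab for element index `e`,
/// matching `e / 32` and `e % 32` in the kernel.
fn slab_coords(e: usize) -> (usize, usize) {
    (e / TILE, e % TILE)
}
```

Thread 0 stages elements 0, 128, 256 and so on up to 896, eight in total, and in any given round thread t + 1 handles the element right after thread t's. Since 32 consecutive elements make up one slab row, each run of 32 neighboring threads reads one contiguous stretch of the source matrix.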

Then the matrix hardware takes over. simdgroup_load pulls one 8×8 fragment out of a staged slab into the matrix unit's registers. simdgroup_multiply_accumulate multiplies an 8×8 fragment of A by an 8×8 fragment of B and adds the result into an accumulator fragment, and that entire 8 × 8 × 8 computation, 512 multiply-adds, is one hardware operation. The natural first version of this kernel, a tile kernel where each thread computes one output element with its own scalar fused-multiply-add loop, issues one multiply-add per instruction. This issues 512.
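The operation counts follow directly from the loop bounds. A small bookkeeping sketch in Rust (the constant and function names are illustrative, chosen here rather than taken from the project):

```rust
const FRAG: u64 = 8; // edge of one simdgroup_float8x8 fragment
const TILE: u64 = 32; // edge of one threadgroup tile
const SIMD_GROUPS: u64 = 4; // SIMD-groups per threadgroup

/// Multiply-adds performed by one simdgroup_multiply_accumulate call.
fn macs_per_matrix_op() -> u64 {
    FRAG * FRAG * FRAG
}

/// Matrix operations one SIMD-group issues per staged slab:
/// TILE / FRAG steps of `ks`, each updating a 2x2 grid of accumulators.
fn matrix_ops_per_slab() -> u64 {
    (TILE / FRAG) * 2 * 2
}

/// Multiply-adds the whole threadgroup performs per staged slab.
fn macs_per_slab() -> u64 {
    SIMD_GROUPS * matrix_ops_per_slab() * macs_per_matrix_op()
}
```

Each SIMD-group issues 16 matrix operations per slab, and the four groups together perform 32,768 multiply-adds, which is exactly 32 × 32 × 32: the full product of the two staged slabs, with nothing computed twice.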

The gap shows up on the clock. We wrote that scalar tile version first, and it measured 2.46 s on the 511-token FP32 prefill; moving the inner product onto the SIMD-group matrix hardware (plus a small last-row fix to II.2's prefill path) brings the same prefill to 1.68 s. For calibration: llama.cpp's Metal prefill is still about 6× faster than ours, because it also runs flash attention (a tiled attention algorithm; the series conclusion says more) and encodes the whole forward pass into one command buffer instead of paying a round trip per matmul. Real headroom remains; this kernel is the biggest single step toward it.

After the last k0 iteration the accumulators hold finished 8×8 blocks of the output. simdgroup_store writes them to the shared tC tile, one more barrier lets every fragment land, and the 128 threads copy the 32×32 tile out to C with the same bounds checks the staging used, skipping anything past the matrix edge.
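For readers who want to test the tiling logic without a GPU, here is a CPU imitation of the kernel's structure in plain Rust. It is a sketch under assumptions: the function names are invented, the SIMD-group fragments are replaced by an ordinary triple loop over the staged slabs, and everything runs on one thread. What it keeps is the staging with zero padding and the bounds-checked copy-out.

```rust
const TILE: usize = 32;

/// Reference: plain triple loop, C = A (m x n) * B (n x p), row-major.
fn matmul_naive(a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * p];
    for i in 0..m {
        for k in 0..n {
            for j in 0..p {
                c[i * p + j] += a[i * n + k] * b[k * p + j];
            }
        }
    }
    c
}

/// Same result, computed tile by tile the way the kernel does it.
fn matmul_tiled(a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * p];
    let mut ta = vec![0.0f32; TILE * TILE];
    let mut tb = vec![0.0f32; TILE * TILE];
    for row0 in (0..m).step_by(TILE) {
        for col0 in (0..p).step_by(TILE) {
            let mut tc = vec![0.0f32; TILE * TILE];
            for k0 in (0..n).step_by(TILE) {
                // Stage one slab of A and one of B, zero-padding past the edges.
                for e in 0..TILE * TILE {
                    let (rr, kk) = (e / TILE, e % TILE);
                    let (ar, ak) = (row0 + rr, k0 + kk);
                    ta[e] = if ar < m && ak < n { a[ar * n + ak] } else { 0.0 };
                    let (br, bc) = (k0 + rr, col0 + kk);
                    tb[e] = if br < n && bc < p { b[br * p + bc] } else { 0.0 };
                }
                // Accumulate the slab product (stand-in for the 8x8 matrix ops).
                for r in 0..TILE {
                    for k in 0..TILE {
                        for cc in 0..TILE {
                            tc[r * TILE + cc] += ta[r * TILE + k] * tb[k * TILE + cc];
                        }
                    }
                }
            }
            // Copy the finished tile out, skipping anything past the matrix edge.
            for e in 0..TILE * TILE {
                let (rr, cc) = (e / TILE, e % TILE);
                if row0 + rr < m && col0 + cc < p {
                    c[(row0 + rr) * p + col0 + cc] = tc[e];
                }
            }
        }
    }
    c
}
```

Feeding both functions a shape that is not a multiple of 32 in any dimension (say 33 × 40 times 40 × 35) exercises every ragged edge; the padded zeros contribute nothing to the sums, so the two results should agree.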

The Metal context

src/backend/metal/context.rs owns the GPU connection (the device, the command queue, the compiled kernel) and the function that dispatches a matmul. It opens with kernel registration and two helpers:

RUST

use std::collections::HashMap;
use std::ffi::c_void;
 
use mtl_foundation::object::Referencing;
use mtl_gpu::device as mtl_device;
use mtl_gpu::{
    Buffer, CommandQueue, ComputeCommandEncoder, ComputePipelineState, Device, ResourceOptions,
    Size,
};
use mtl_sys::{msg_send_4, sel};
 
use super::shaders::SHADERS;
 
pub(crate) const KERNEL_MATMUL_FP32_FP32: &str = "matmul_fp32_fp32";
 
pub(crate) const KERNELS: &[&str] = &[KERNEL_MATMUL_FP32_FP32];

KERNELS is the list of kernel names to compile: one for now, with II.6 adding a second.

RUST

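/// Wraps `data` as a shared-storage Metal buffer without copying it.
///
/// # Safety
/// The returned buffer aliases `data`: the caller must keep the slice alive
/// and in place for as long as the GPU may read or write the buffer.
unsafe fn wrap_shared_storage<T: Copy>(device: &Device, data: &[T]) -> Option<Buffer> {
    unsafe {
        let ptr: *mut c_void = msg_send_4(
            device.as_ptr(),
            sel!(newBufferWithBytesNoCopy: length: options: deallocator:),
            data.as_ptr() as *mut c_void,
            // size_of_val already returns the byte length as usize.
            std::mem::size_of_val(data),
            ResourceOptions::STORAGE_MODE_SHARED,
            // No deallocator: Rust keeps ownership of the memory.
            std::ptr::null_mut::<c_void>(),
        );
        Buffer::from_raw(ptr)
    }
}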

This is the unified-memory trick, and it is the one place we drop to raw mtl-sys. newBufferWithBytesNoCopy tells Metal: "make a GPU buffer that points directly at this CPU memory; do not allocate, do not copy." Because Apple Silicon's CPU and GPU share one physical RAM pool (STORAGE_MODE_SHARED), the GPU can read the exact bytes the Rust Vec already holds. On a discrete GPU, getting data across to the card is a real copy over PCIe; here it is a pointer wrap. That is why "step 1" (getting data to the GPU) is nearly free on this hardware, and it's the reason a GPU backend is worth writing for a model this small.
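The difference between a wrap and a copy can be seen in plain Rust, with no Metal calls involved. The helper below is hypothetical; it extracts only the two facts a no-copy wrap needs about a slice, its starting address and its length in bytes.

```rust
/// Where a slice's bytes start and how many bytes there are.
/// Hypothetical helper for illustration, not part of the project.
fn raw_view<T: Copy>(data: &[T]) -> (*const u8, usize) {
    (data.as_ptr() as *const u8, std::mem::size_of_val(data))
}
```

For a Vec of 1024 f32 values the view reports the Vec's own address and 4096 bytes. A clone of that Vec reports a different address: that is the copy a discrete GPU forces and unified memory avoids.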

RUST

#[inline]
fn set_u32(enc: &ComputeCommandEncoder, val: u32, index: usize) {
    let raw = val.to_ne_bytes();
    enc.set_bytes(&raw, index);
}

set_u32 passes a scalar (a matrix dimension) into a kernel buffer slot, specifically those constant uint& m parameters.

The context struct and its constructor:

RUST

pub struct MetalContext {
    device: Device,
    queue: CommandQueue,
    pipelines: HashMap<&'static str, ComputePipelineState>,
}
 
impl MetalContext {
    pub fn new() -> Self {
        let device = mtl_device::system_default().unwrap();
 
        let library = device.new_library_with_source(SHADERS, None).unwrap();
 
        let pipelines: HashMap<&'static str, _> = KERNELS
            .iter()
            .copied()
            .map(|name| {
                let func = library.new_function_with_name(name).unwrap();
                let pipeline = device
                    .new_compute_pipeline_state_with_function(&func)
                    .unwrap();
                (name, pipeline)
            })
            .collect();
 
        let queue = device.new_command_queue().unwrap();
 
        eprintln!(
            "MetalBackend: device={} unified_memory={}",
            device.name(),
            device.has_unified_memory()
        );
 
        Self {
            device,
            queue,
            pipelines,
        }
    }

new does the one-time GPU setup, so it runs once at startup and never again. system_default() grabs the GPU. new_library_with_source(SHADERS, ...) hands our MSL string to the driver, which compiles it. For each kernel name, new_compute_pipeline_state_with_function builds a compute pipeline, the GPU-ready, launchable form of that kernel, and we stash them in a HashMap keyed by name. new_command_queue() creates the channel for submitting work. The eprintln! reports the GPU's name and confirms unified memory is on.

The dispatch function, the generic "run a kernel" routine:

RUST

    fn dispatch(&self, kernel: &str, bufs: &[&Buffer], scalars: &[u32], grid: Size, threads: Size) {
        let cmd = self.queue.command_buffer().unwrap();
        let enc =
            unsafe { ComputeCommandEncoder::from_raw(cmd.compute_command_encoder()) }.unwrap();
        let pipeline = self.pipelines.get(kernel).unwrap();
        enc.set_compute_pipeline_state(pipeline);
        for (i, buf) in bufs.iter().enumerate() {
            enc.set_buffer(buf, 0, i);
        }
        let buf_count = bufs.len();
        for (i, &val) in scalars.iter().enumerate() {
            set_u32(&enc, val, buf_count + i);
        }
        enc.dispatch_threadgroups(grid, threads);
        enc.end_encoding();
        cmd.commit();
        cmd.wait_until_completed();
    }

This is the universal shape of "make the GPU do something":

command_buffer(): a command buffer is a batch of GPU instructions you build up and then submit.
compute_command_encoder(): the encoder writes compute commands into that buffer.
set_compute_pipeline_state(pipeline): select which kernel to run.
The two loops bind the arguments: each tensor Buffer goes into slot i (matching [[buffer(0..2)]]), each scalar goes into the slots after them ([[buffer(3..5)]]).
dispatch_threadgroups(grid, threads): launch. threads is the threadgroup size; grid is how many threadgroups. Together they define the thread grid the kernel runs over.
end_encoding() / commit(): finish and submit to the GPU.
wait_until_completed(): block until the GPU is done.

That final wait_until_completed is the synchronous round-trip ("step 4"), and it is the latency the size threshold exists to amortize.
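The slot numbering in the two binding loops can be spelled out without touching the GPU. A sketch in plain Rust, where the `Arg` enum and `bind_plan` function are invented for illustration:

```rust
/// What ends up in a kernel argument slot. Hypothetical type.
#[derive(Debug, PartialEq)]
enum Arg {
    /// The i-th tensor buffer passed to dispatch.
    Buffer(usize),
    /// A u32 scalar such as a matrix dimension.
    Scalar(u32),
}

/// (slot, argument) pairs in the order dispatch would bind them:
/// buffers take slots 0..buf_count, scalars take the slots after that.
fn bind_plan(buf_count: usize, scalars: &[u32]) -> Vec<(usize, Arg)> {
    let bufs = (0..buf_count).map(|i| (i, Arg::Buffer(i)));
    let vals = scalars
        .iter()
        .enumerate()
        .map(|(i, &v)| (buf_count + i, Arg::Scalar(v)));
    bufs.chain(vals).collect()
}
```

With three buffers and the scalars m, n, p, the plan fills slots 0 through 5 in order, which is the layout the kernel signature expects. The scheme relies on the kernel declaring all its buffers before its scalars.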

wrap_f32 is the small convenience that turns a float slice into a shared-storage buffer:

RUST

    fn wrap_f32(&self, data: &[f32]) -> Buffer {
        unsafe { wrap_shared_storage(&self.device, data) }.unwrap()
    }

And the matmul entry point:

RUST

    pub fn matmul_fp32_fp32(&self, a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Vec<f32> {
        let an = m.checked_mul(n).unwrap();
        let np = n.checked_mul(p).unwrap();
        let mp = m.checked_mul(p).unwrap();
        assert_eq!(a.len(), an);
        assert_eq!(b.len(), np);
 
        let c_data = vec![0.0f32; mp];
 
        let a_buf = self.wrap_f32(a);
        let b_buf = self.wrap_f32(b);
        let c_buf = self.wrap_f32(&c_data);
 
        self.dispatch(
            KERNEL_MATMUL_FP32_FP32,
            &[&a_buf, &b_buf, &c_buf],
            &[m as u32, n as u32, p as u32],
            Size::new(m.div_ceil(32), p.div_ceil(32), 1),
            Size::new(128, 1, 1),
        );
 
        c_data
    }

It allocates the output c_data, wraps all three matrices as shared buffers (no copy), and dispatches. The geometry matches the kernel: Size::new(128, 1, 1) is one threadgroup of 128 threads (the four SIMD-groups), and the grid is (m.div_ceil(32), p.div_ceil(32), 1) threadgroups, enough 32×32 output tiles to cover the whole m × p output, rounding up. Then the crucial detail of unified memory: the kernel wrote its results straight into the memory c_data already owns, so once dispatch returns, c_data is the answer, with no read-back step. We just return it.
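The launch geometry is plain integer arithmetic. A sketch with illustrative names and an example shape picked for the arithmetic, not taken from the model:

```rust
const TILE: usize = 32;
const THREADS_PER_GROUP: usize = 128;

/// Threadgroups along each axis needed to cover an m x p output.
fn grid_for(m: usize, p: usize) -> (usize, usize) {
    (m.div_ceil(TILE), p.div_ceil(TILE))
}

/// Total GPU threads launched for an m x p output.
fn total_threads(m: usize, p: usize) -> usize {
    let (gx, gy) = grid_for(m, p);
    gx * gy * THREADS_PER_GROUP
}
```

For a 511 × 1024 output, `grid_for` gives (16, 32): fifteen full rows of tiles plus one ragged row, times 32 columns of tiles. That is 512 threadgroups and 65,536 threads in a single dispatch.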

The Metal backend

src/backend/metal/backend.rs is the Backend impl. Same delegation pattern as SimdCpu and Parallel: implement matmul, forward the rest:

RUST

use crate::tensor::{Tensor, TensorData};
 
use crate::backend::Backend;
 
use super::context::MetalContext;
 
const MIN_M_FOR_GPU_MATMUL: usize = 8;
 
pub struct Metal<B: Backend> {
    ctx: MetalContext,
    fallback: B,
}
 
impl<B: Backend> Metal<B> {
    pub fn new(fallback: B) -> Self {
        Self {
            ctx: MetalContext::new(),
            fallback,
        }
    }

Metal<B> holds the MetalContext and a fallback backend. MIN_M_FOR_GPU_MATMUL = 8 is this backend's size threshold, the GPU equivalent of Parallel's MIN_ROWS_FOR_PARALLEL, set higher because the GPU's fixed cost (the command-buffer round-trip, wait_until_completed) is larger than rayon's thread-wakeup cost.

RUST

    fn matmul_fp32_fp32(
        &self,
        a_data: &[f32],
        b_data: &[f32],
        m: usize,
        n: usize,
        p: usize,
    ) -> Tensor {
        assert_eq!(a_data.len(), m * n, "matmul_fp32_fp32: len(a) must be m*n");
        assert_eq!(b_data.len(), n * p, "matmul_fp32_fp32: len(b) must be n*p");
        let data = self.ctx.matmul_fp32_fp32(a_data, b_data, m, n, p);
        Tensor::new(data, vec![m, p])
    }
}

A thin adapter from the Backend world to the MetalContext world: validate the shapes, dispatch to the GPU, wrap the result back into a Tensor.

The Backend impl, starting with matmul:

RUST

impl<B: Backend> Backend for Metal<B> {
    fn name(&self) -> String {
        "metal".to_string()
    }
 
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape().len(), 2);
        assert_eq!(b.shape().len(), 2);
        let a_shape = a.shape();
        let b_shape = b.shape();
        let m = a_shape[0];
        let n = a_shape[1];
        match (a.as_data(), b.as_data()) {
            (TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
                let p = b_shape[1];
                assert_eq!(n, b_shape[0], "tensor shape mismatch");
                if m == 0 || p == 0 {
                    return Tensor::new(vec![], vec![m, p]);
                }
                if m < MIN_M_FOR_GPU_MATMUL {
                    return self.fallback.matmul(a, b);
                }
                self.matmul_fp32_fp32(a_data, b_data, m, n, p)
            }
        }
    }

Three guards before the GPU is touched. The shape assertion rejects operands whose inner dimensions disagree. An empty result (m == 0 || p == 0) short-circuits. And m < MIN_M_FOR_GPU_MATMUL, fewer than 8 output rows, falls back to the CPU. This is the same prefill/decode logic as II.4: a prefill matmul has one row per prompt token (dozens or hundreds, straight to the GPU), but a decode matmul has a single output row, far below 8, so it runs on the fallback. The GPU round-trip would dwarf the work of a one-row matmul. Decode, once again, stays on the CPU; the GPU is a prefill accelerator here.
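The routing decision on its own fits in a few lines. The `Route` enum and `route` function below are a hypothetical distillation for illustration; the constant matches the one in backend.rs.

```rust
const MIN_M_FOR_GPU_MATMUL: usize = 8;

/// Where a matmul with m output rows and p output columns ends up.
#[derive(Debug, PartialEq)]
enum Route {
    /// Nothing to compute: return an empty tensor.
    Empty,
    /// Too few rows to amortize the GPU round trip.
    CpuFallback,
    /// Enough rows to be worth a dispatch.
    Gpu,
}

fn route(m: usize, p: usize) -> Route {
    if m == 0 || p == 0 {
        Route::Empty
    } else if m < MIN_M_FOR_GPU_MATMUL {
        Route::CpuFallback
    } else {
        Route::Gpu
    }
}
```

A decode step (m = 1) and a 7-token prompt both land on `CpuFallback`; 8 rows is the first size that reaches the GPU.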

The rest of the trait delegates to fallback. The first few:

RUST

    fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
        self.fallback.sum_squares_axis(x, axis)
    }
    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.fallback.add(a, b)
    }
    fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
        self.fallback.hadamard(a, b)
    }
    fn scale(&self, x: &Tensor, s: f32) -> Tensor {
        self.fallback.scale(x, s)
    }
    fn silu(&self, x: &Tensor) -> Tensor {
        self.fallback.silu(x)
    }

…and identically for add_scalar, rsqrt_elem, broadcast_row_scalars, transpose_2d, softmax_rows, gather_rows, reshape_data, fill_strict_upper_tri, copy_2d_from_cols, copy_2d_into_cols, repeat_row_as_matrix, apply_rope, apply_rope_single_row, concat_dim0, copy_row_2d, copy_contiguous_into, and argmax_with_prob. Only the big matmuls go to the GPU; everything else (including, by the threshold, every decode matmul) stays on the CPU fallback.
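Stripped to its skeleton, the delegation pattern looks like the sketch below. `MiniBackend`, `Cpu`, and `Accel` are hypothetical stand-ins with two methods instead of the real trait's two dozen; `matmul_path` returns a label rather than a tensor so the routing is visible.

```rust
/// Hypothetical two-method stand-in for the real Backend trait.
trait MiniBackend {
    /// Name of the path that would handle a matmul with m output rows.
    fn matmul_path(&self, m: usize) -> &'static str;
    fn scale(&self, x: &[f32], s: f32) -> Vec<f32>;
}

struct Cpu;

impl MiniBackend for Cpu {
    fn matmul_path(&self, _m: usize) -> &'static str {
        "cpu"
    }
    fn scale(&self, x: &[f32], s: f32) -> Vec<f32> {
        x.iter().map(|v| v * s).collect()
    }
}

/// Wrapper that overrides one operation and forwards the rest.
struct Accel<B: MiniBackend> {
    fallback: B,
}

impl<B: MiniBackend> MiniBackend for Accel<B> {
    fn matmul_path(&self, m: usize) -> &'static str {
        // Small matmuls go to the fallback, large ones stay here.
        if m < 8 {
            self.fallback.matmul_path(m)
        } else {
            "accel"
        }
    }
    fn scale(&self, x: &[f32], s: f32) -> Vec<f32> {
        // Everything that is not a matmul is forwarded unchanged.
        self.fallback.scale(x, s)
    }
}
```

Because the wrapper is generic over its fallback, the same struct composes with any backend that implements the trait, which is what lets the factory stack layers.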

The module file exports Metal:

RUST

mod backend;
mod context;
mod shaders;
 
pub use backend::Metal;

Wiring it into the factory

src/backend/factory.rs gets the "metal" arm:

RUST

use std::sync::Arc;
 
use super::Backend;
use super::{CpuBackend, Metal, Parallel, SimdCpu, TracingBackend};
 
pub fn create_backend(name: &str, enable_tracing: bool) -> Result<Arc<dyn Backend>, String> {
    let name = name.trim();
    match name {
        "scalar" => Ok(wrap_scalar(enable_tracing)),
        "simd" => Ok(wrap_simd(enable_tracing)),
        "parallel" => Ok(wrap_parallel(enable_tracing)),
        "metal" => Ok(wrap_metal(enable_tracing)),
        other => Err(format!(
            "unknown backend {other:?} (supported: scalar, simd, parallel, metal)"
        )),
    }
}

RUST

fn wrap_metal(enable_tracing: bool) -> Arc<dyn Backend> {
    let metal = Metal::new(SimdCpu::new(CpuBackend));
    if enable_tracing {
        Arc::new(TracingBackend::new(metal))
    } else {
        Arc::new(metal)
    }
}

Metal::new(SimdCpu::new(CpuBackend)) gives the GPU backend a SIMD fallback. So a small (sub-8-row) matmul, which is every decode matmul, runs on the vectorized CPU kernel from II.3; a large one goes to the GPU. We pick SimdCpu rather than Parallel for the fallback because the only matmuls that ever reach the fallback are tiny ones (single-row decode matmuls), and those are below Parallel's threshold too, so the extra layer would do nothing.

The module exports it:

RUST

mod backend_trait;
pub(crate) mod cpu;
mod factory;
pub(crate) mod metal;
pub(crate) mod parallel_cpu;
pub(crate) mod simd_cpu;
pub(crate) mod tracing;
 
pub use backend_trait::Backend;
pub use factory::create_backend;
 
pub(crate) use cpu::CpuBackend;
pub(crate) use metal::Metal;
pub(crate) use parallel_cpu::Parallel;
pub(crate) use simd_cpu::SimdCpu;
pub(crate) use tracing::TracingBackend;

And model-generate's usage string adds metal:

RUST

fn usage() -> ! {
    eprintln!(
        "usage: model-generate [--kv [basic]] [--backend scalar|simd|parallel|metal] <gguf_path> [prompt] [max_new_tokens]"
    );
    std::process::exit(2);
}

Running it

The GPU's win is prefill, so use a long prompt and compare parallel against metal:

BASH

cargo run --release --bin model-generate -- --kv --backend parallel path/to/Qwen3-0.6B-FP32.gguf "$(cat long-prompt.txt)" 32
cargo run --release --bin model-generate -- --kv --backend metal    path/to/Qwen3-0.6B-FP32.gguf "$(cat long-prompt.txt)" 32

The parallel run re-establishes the II.4 baseline on this prompt:

PLAINTEXT

backend: parallel
kv cache: basic
 
metrics:
  time_to_first_token_ms: 3740.495
  decode_tokens_per_second: 10.403
  per_forward_ms: min 87.141  max 3740.495  mean 210.013  (n=32)

The Metal run first prints the device line from MetalContext::new, then the usual metrics:

PLAINTEXT

MetalBackend: device=Apple M2 Pro unified_memory=true
backend: metal
kv cache: basic
 
metrics:
  time_to_first_token_ms: 1676.500
  decode_tokens_per_second: 10.341
  per_forward_ms: min 87.739  max 1676.500  mean 146.075  (n=32)

Against the parallel baseline on the same ~512-token prompt (~3.7 s time to first token), prefill drops to roughly 1.7 s, a 2.2× win: the thousands of GPU threads chew through the big prefill matmuls far faster than ten CPU cores. Decode throughput is essentially unchanged (~10.4 vs ~10.3 tokens/sec), since decode's single-row matmuls are below MIN_M_FOR_GPU_MATMUL, so they fall back to the SIMD CPU kernel and never reach the GPU. The split is what the threshold guarantees: the GPU accelerates prefill; decode stays on the CPU.
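The headline numbers can be rechecked from the two metric blocks. The helpers below are a sketch with invented names. They assume the slowest forward pass in each run is the prefill, which holds for both outputs shown here since max equals time_to_first_token_ms.

```rust
/// Ratio of a baseline time to a new time.
fn speedup(baseline_ms: f64, new_ms: f64) -> f64 {
    baseline_ms / new_ms
}

/// Mean decode latency recovered from the per_forward summary, assuming
/// the single slowest pass (max) is the prefill and the other n - 1 are decode.
fn decode_mean_ms(mean_ms: f64, max_ms: f64, n: u32) -> f64 {
    (mean_ms * n as f64 - max_ms) / (n as f64 - 1.0)
}

/// Tokens per second implied by a per-token latency in milliseconds.
fn tokens_per_second(ms_per_token: f64) -> f64 {
    1000.0 / ms_per_token
}
```

Plugging in 3740.495 and 1676.500 gives a prefill speedup of about 2.23. Subtracting the prefill from each run's total leaves roughly 96 to 97 ms per decode step on both backends, which matches the reported ~10.4 and ~10.3 tokens per second.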

Horizontal bar chart of 511-token prefill time: SIMD on one core 19.4 seconds, parallel on all cores 3.8 seconds, Metal GPU 1.7 seconds.

Figure: the prefill ladder so far. One core, ten cores, thousands of GPU threads: each rung splits the same compute-bound matmuls across more hardware.

Where this leaves us

Metal is a fourth Backend, --backend metal, that compiles an MSL matmul kernel, dispatches big matmuls to the GPU through command buffers, and uses Apple Silicon's unified memory to share data with no copy. Prefill, the compute-bound half of inference, now runs on hardware built for this shape of work.

But notice what hasn't moved through the last three chapters: decode throughput. SIMD, parallel, and metal all left it essentially flat; every one of those backends sends decode's single-row matmuls down the same CPU path. This is the act intro's central point asserting itself: decode is memory-bandwidth-bound. The bottleneck is the time spent reading the model's weights out of memory, the same gigabytes loaded for every single token, not arithmetic throughput. No amount of faster arithmetic touches that. The only lever is reading fewer bytes. The final chapter of Act 2 pulls it: Q8_0 quantization, which stores the weights in 8 bits instead of 32 and speeds decode up 2.3× by shrinking the bytes read per token.