II.6: Q8_0 quantization

Decode throughput has been pinned at the same ~18 tokens per second since the KV cache landed in II.2. II.3 proved why by trying to move it with vector instructions and measuring a dead-flat result; threads (II.4) and the GPU (II.5) then delivered real prefill wins and left decode exactly where it was. Three chapters of faster arithmetic; zero decode movement. Now we finally act on what II.3 measured.

Decode is memory-bandwidth-bound. To generate one token, the model reads every weight (all ~600 million numbers of Qwen3 0.6B) out of memory and through the arithmetic units. With a single token in flight, those weights get used once and thrown away; almost all the wall-clock time is the reading, not the computing. Faster arithmetic can't help when arithmetic isn't the bottleneck. The only lever that moves decode is reading fewer bytes.

This chapter pulls that lever. We have been running the FP32 export of the checkpoint: every weight a 32-bit float, four bytes each, on disk and in memory. Qwen3 also ships as a second export whose weights are in an 8-bit quantized format called Q8_0. This chapter switches to that file, loads its weights as-is, and teaches the engine to keep them quantized in memory and multiply against them directly. Quarter the bytes per weight (actually a bit more than a quarter, as we'll see) and decode, whose whole problem is bytes, gets dramatically faster.

What quantization is

A weight is a number like 0.0426 or -0.0119. We've stored each in an f32: 32 bits, four bytes, with enormous range and precision. A trained model's weights don't need that range. Within any small group of weights, the values cluster tightly, all within a similar magnitude. Spending 32 bits on each is wasteful.

Quantization stores them in fewer bits by giving up precision the model can tolerate. Q8_0, the format Qwen3's quantized export ships in and the default across the llama.cpp ecosystem, does it like this:

Group the weights into blocks of 32 consecutive values.
For each block, find the largest-magnitude weight and derive a single scale factor from it.
Store each of the 32 weights as a signed 8-bit integer (i8, range −128…127), representing "this many scale-units." The real value is quant × scale.
Store the block's scale once, as a 16-bit half-precision float (f16).

So one block is: a 2-byte scale plus 32 one-byte quants = 34 bytes for 32 weights. In FP32 those same 32 weights would be 32 × 4 = 128 bytes. That's a 3.76× size reduction, close to 4× but slightly less because of the per-block scale overhead. The _0 in the name means "no zero-point": the quants are symmetric around zero, scale only. Loading a weight back is one multiply: f32_value = quant_i8 as f32 * scale.

A 34-byte strip: two blue scale bytes followed by 32 green quant bytes.

Figure: one block, byte by byte. This is also the exact memory layout of the Rust struct we define next.

The accuracy cost is real but small. Eight bits per weight, with a per-32-block scale, is enough that the model's outputs are nearly indistinguishable from the FP32 version, which is why this format is the default for shipping models. We get most of a 4× bandwidth win for a precision loss the model shrugs off.
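
The recipe above can be sketched for a single block. This is a minimal illustration, not the exporter's code: quantize_block and dequantize_block are hypothetical names, the scale is assumed to be absmax / 127, and it is kept as an f32 here instead of being rounded to f16 as the real format does.

```rust
/// Hypothetical helper: quantize 32 weights into (scale, quants).
/// Assumption: scale = absmax / 127, kept as f32 for simplicity
/// (the on-disk format stores it as f16).
fn quantize_block(weights: &[f32; 32]) -> (f32, [i8; 32]) {
    let absmax = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let mut quants = [0i8; 32];
    if absmax == 0.0 {
        return (0.0, quants); // all-zero block: nothing to encode
    }
    let scale = absmax / 127.0;
    for (q, w) in quants.iter_mut().zip(weights) {
        // "this many scale-units", rounded to the nearest integer
        *q = (w / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, quants)
}

/// Hypothetical helper: recover approximate weights, one multiply each.
fn dequantize_block(scale: f32, quants: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, q) in out.iter_mut().zip(quants) {
        *o = *q as f32 * scale;
    }
    out
}
```

Under these assumptions each recovered weight is off by at most half a scale-unit, which is the precision the format gives up.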

The Q8_0 block format

A new file, src/tensor/q8_0.rs, defines the block and the code to read it. The block first:

RUST

#[repr(C, packed)]
#[derive(Clone, Copy, Debug)]
pub(crate) struct Q8_0Block {
    scale: [u8; 2],
    quants: [i8; 32],
}

Q8_0Block is exactly the on-disk layout: 2 bytes of scale, then 32 signed quants. #[repr(C, packed)] is load-bearing: it tells Rust to lay the struct out in memory with C's field order and no padding, so a Q8_0Block in memory is byte-identical to a block in the GGUF file. That equivalence is what lets us, later, hand a &[Q8_0Block] straight to the GPU as raw bytes. The size is 2 + 32 = 34 bytes.
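
The layout claim is easy to check mechanically. The sketch below uses a stand-in struct named Block with the same fields; layout_is_packed and blocks_as_bytes are hypothetical helpers for illustration, not functions from the chapter's code.

```rust
use std::mem::{align_of, size_of, size_of_val};

// Stand-in with the same fields and attributes as the chapter's block type.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct Block {
    scale: [u8; 2],
    quants: [i8; 32],
}

/// True when the struct is exactly 34 bytes with alignment 1,
/// i.e. byte-identical to the on-disk block.
fn layout_is_packed() -> bool {
    size_of::<Block>() == 34 && align_of::<Block>() == 1
}

/// View a slice of blocks as the raw bytes they occupy (no copy).
fn blocks_as_bytes(blocks: &[Block]) -> &[u8] {
    // SAFETY: Block has no padding and alignment 1, and every byte of
    // its fields is a valid u8, so the memory can be read as plain bytes.
    unsafe { std::slice::from_raw_parts(blocks.as_ptr() as *const u8, size_of_val(blocks)) }
}
```

A byte view like this is the kind of thing a GPU upload path needs: the slice length is simply 34 times the block count.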

Reading a block out of a byte slice:

RUST

impl Q8_0Block {
    #[inline]
    pub(crate) fn from_bytes(src: &[u8]) -> Self {
        let mut scale = [0u8; 2];
        scale.copy_from_slice(&src[0..2]);
        let mut quants = [0i8; 32];
        for i in 0..32 {
            quants[i] = src[2 + i] as i8;
        }
        Self { scale, quants }
    }

Copy the first 2 bytes as the scale, then the next 32 bytes as the quants. The src[2 + i] as i8 reinterprets each byte as a signed integer: bit pattern 0xFF becomes -1, not 255, because Q8_0 quants are signed.

Decoding the scale is the one fiddly part:

RUST

    #[inline]
    pub(crate) fn scale_f32(&self) -> f32 {
        let bits = u16::from_le_bytes(self.scale);
        let sign = ((bits >> 15) & 1) as u32;
        let exp = ((bits >> 10) & 0x1F) as u32;
        let frac = (bits & 0x3FF) as u32;
        let f32_bits = (sign << 31) | ((exp.wrapping_add(112)) << 23) | (frac << 13);
        f32::from_bits(f32_bits)
    }

The scale is a 16-bit half-precision float (f16). Rust's standard library has no native f16 arithmetic, so we convert it to f32 by hand, bit by bit. An f16 packs a 1-bit sign, a 5-bit exponent, and a 10-bit fraction; an f32 packs a 1-bit sign, an 8-bit exponent, and a 23-bit fraction. The conversion pulls out the three f16 fields with shifts and masks, then reassembles them in f32 positions: sign to bit 31, fraction shifted up 13 bits to fill the wider field. The exponent needs a rebias: f16 biases its exponent by 15, f32 by 127, so wrapping_add(112) (which is 127 − 15) corrects it. f32::from_bits then reinterprets the assembled bit pattern as a float. (This handles normal numbers only, which we assume block scales always are. Zero, subnormal, infinity, and NaN patterns are not special-cased, so a scale below f16's smallest normal value, about 6.1e-5, would decode incorrectly.)
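
The conversion above covers normal numbers only. For comparison, here is a sketch of a decoder that also handles zero, subnormals, infinity, and NaN. The name f16_bits_to_f32 is hypothetical and this is not part of the chapter's code; it only shows what the extra cases would look like.

```rust
/// Hypothetical full f16 -> f32 decoder (all IEEE 754 binary16 classes).
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let frac = (bits & 0x3FF) as u32;
    let f32_bits = match (exp, frac) {
        // Signed zero: only the sign bit survives.
        (0, 0) => sign,
        // Subnormal: value is frac * 2^-24, exactly representable in f32.
        (0, _) => {
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign != 0 { -v } else { v };
        }
        // Infinity: all-ones exponent, zero fraction.
        (0x1F, 0) => sign | 0x7F80_0000,
        // NaN: keep it a quiet NaN and carry the payload bits along.
        (0x1F, _) => sign | 0x7FC0_0000 | (frac << 13),
        // Normal: rebias the exponent by 127 - 15 = 112.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(f32_bits)
}
```

For normal inputs the last arm produces the same bits as the hand conversion shown earlier; the other arms only matter if a file ever contains such scales.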

RUST

    #[inline]
    pub(crate) fn quants(&self) -> &[i8; 32] {
        &self.quants
    }
}
 
pub(crate) const Q8_0_BLOCK_SIZE: usize = 32;
 
pub(crate) const Q8_0_BLOCK_BYTES: usize = 2 + Q8_0_BLOCK_SIZE;
 
#[inline]
pub(crate) fn blocks_packed_byte_len(blocks: &[Q8_0Block]) -> usize {
    blocks.len() * Q8_0_BLOCK_BYTES
}

Q8_0_BLOCK_SIZE is the 32 weights per block; Q8_0_BLOCK_BYTES is the 34 bytes one block occupies. blocks_packed_byte_len gives the total byte length of a slice of blocks, used when handing blocks to the GPU as raw bytes.

And the bulk reader that turns a tensor's worth of bytes into blocks:

RUST

pub(super) fn blocks_from_bytes(bytes: &[u8]) -> Vec<Q8_0Block> {
    assert_eq!(
        bytes.len() % Q8_0_BLOCK_BYTES,
        0,
        "Q8_0 byte length {} not a multiple of {}",
        bytes.len(),
        Q8_0_BLOCK_BYTES
    );
    let n = bytes.len() / Q8_0_BLOCK_BYTES;
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let off = i * Q8_0_BLOCK_BYTES;
        out.push(Q8_0Block::from_bytes(&bytes[off..]));
    }
    out
}

The input byte count must be an exact multiple of 34; then it's n blocks, each parsed with from_bytes.

The Tensor type learns a second representation

Until now a Tensor's data has been one thing: Vec<f32>. src/tensor/mod.rs adds the module:

RUST

pub(crate) mod q8_0;
mod tensor;
 
pub use tensor::Tensor;

And src/tensor/tensor.rs gives TensorData a second variant:

RUST

use crate::tensor::q8_0::{Q8_0_BLOCK_SIZE, Q8_0Block};
 
#[derive(Clone)]
pub(crate) enum TensorData {
    Fp32(Vec<f32>),
    Q8_0(Vec<Q8_0Block>),
}
 
impl std::fmt::Debug for TensorData {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            TensorData::Fp32(v) => write!(f, "Fp32({} elems)", v.len()),
            TensorData::Q8_0(v) => write!(f, "Q8_0({} blocks)", v.len()),
        }
    }
}

This is why every Backend matmul throughout Act 2 has matched on TensorData, in those match a.as_data() { TensorData::Fp32(d) => d } blocks. A tensor can now be stored either as FP32 floats or as Q8_0 blocks, and code that touches tensor data has to handle both. The Debug impl reports block count for the Q8_0 case.

The constructor that builds a Q8_0 tensor from raw bytes:

RUST

    pub(crate) fn new_q8_0_from_bytes(bytes: &[u8], shape: Vec<usize>) -> Self {
        assert_eq!(
            shape.len(),
            2,
            "Q8_0 tensors must be 2-D, got shape {:?}",
            shape
        );
        let cols = shape[1];
        assert_eq!(
            cols % Q8_0_BLOCK_SIZE,
            0,
            "Q8_0 tensor cols {} not divisible by {}",
            cols,
            Q8_0_BLOCK_SIZE
        );
        let rows = shape[0];
        let expected_blocks = rows * (cols / Q8_0_BLOCK_SIZE);
        let blocks = super::q8_0::blocks_from_bytes(bytes);
        assert_eq!(
            blocks.len(),
            expected_blocks,
            "Q8_0 block count mismatch: got {} expected {}",
            blocks.len(),
            expected_blocks
        );
        Self {
            data: TensorData::Q8_0(blocks),
            shape,
        }
    }

A Q8_0 tensor is always 2-D (a weight matrix), and its column count must be a multiple of 32, since blocks tile each row, giving cols / 32 blocks per row and rows × cols / 32 total. The asserts check both invariants, and that the byte count produced the expected number of blocks. The shape is still stored in logical terms (rows and columns of weights), even though the data is now blocks, so the rest of the engine sees a normal [rows, cols] matrix and never has to know it's quantized underneath.

as_f32_slice, which several places call, can only answer for FP32 tensors, so it gets an explicit panic for the Q8_0 case:

RUST

    pub(crate) fn as_f32_slice(&self) -> &[f32] {
        match &self.data {
            TensorData::Fp32(v) => v,
            _ => panic!("expected FP32 tensor, got {:?}", self.data),
        }
    }
}

The Debug impl for Tensor also gains a Q8_0 byte-size arm (fmt_bytes(blocks_packed_byte_len(v))) so a quantized tensor reports its true on-the-wire size.

Loading Q8_0 tensors from the GGUF file

Way back in I.1, gguf-inspect showed most tensors carrying type tag 8. That tag is Q8_0. I.3 loaded only type 0 (FP32) and panicked on anything else; the checkpoint we've been running is FP32. Now src/gguf/gguf.rs learns to load type 8 directly:

RUST

            8 => {
                use crate::tensor::q8_0::{Q8_0_BLOCK_BYTES, Q8_0_BLOCK_SIZE};
                assert_eq!(
                    numel % Q8_0_BLOCK_SIZE as u64,
                    0,
                    "Q8_0 tensor {name} numel {numel} not divisible by {Q8_0_BLOCK_SIZE}"
                );
                let block_count = (numel as usize) / Q8_0_BLOCK_SIZE;
                let byte_count = block_count * Q8_0_BLOCK_BYTES;
                let bytes = self.read_tensor_bytes(offset, byte_count);
                Tensor::new_q8_0_from_bytes(&bytes, shape)
            }

This is a new arm in the match ggml_type block alongside the existing 0 => { ... } FP32 arm. The tensor has numel logical weights; in Q8_0 that's numel / 32 blocks at 34 bytes each. It reads exactly that many bytes off disk and hands them to new_q8_0_from_bytes. The crucial difference from the FP32 path: the FP32 arm expands every weight into an f32 as it reads; this arm keeps the bytes compact. A Q8_0 weight matrix now occupies in RAM what it occupied on disk, about a quarter of the FP32 footprint, and that smaller footprint is the entire decode win, because decode's cost is bytes moved.
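
The byte arithmetic in that arm can be worked through on its own. A small sketch, with hypothetical helper names (q8_0_byte_len, fp32_byte_len) and an arbitrary example shape chosen only for illustration:

```rust
const BLOCK_SIZE: usize = 32; // weights per block
const BLOCK_BYTES: usize = 34; // 2-byte scale + 32 one-byte quants

/// Bytes a tensor of `numel` weights occupies in Q8_0, or None when
/// `numel` does not divide into whole blocks.
fn q8_0_byte_len(numel: usize) -> Option<usize> {
    if numel % BLOCK_SIZE != 0 {
        return None;
    }
    Some(numel / BLOCK_SIZE * BLOCK_BYTES)
}

/// Bytes the same tensor occupies as FP32.
fn fp32_byte_len(numel: usize) -> usize {
    numel * std::mem::size_of::<f32>()
}
```

For an illustrative 1024 × 3072 matrix, numel is 3,145,728, which is 98,304 blocks or 3,342,336 bytes in Q8_0, against 12,582,912 bytes in FP32: the same 128 / 34 ≈ 3.76× ratio as a single block.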

Quantized matmul on the CPU

The forward pass now has FP32 activations (the running values flowing through the network, computed at runtime, still floats) multiplied against Q8_0 weights. We need a matmul that takes an FP32 left side and a Q8_0 right side. src/backend/cpu.rs gets two functions.

The core is matvec_q8_0, one vector times a Q8_0 matrix, which is exactly the decode case:

RUST

    pub(crate) fn matvec_q8_0(
        blocks: &[crate::tensor::q8_0::Q8_0Block],
        x: &[f32],
        rows: usize,
        cols: usize,
    ) -> Result<Vec<f32>, String> {
        use crate::tensor::q8_0::Q8_0_BLOCK_SIZE;
        if cols % Q8_0_BLOCK_SIZE != 0 {
            return Err(format!(
                "q8_0 matvec: cols {cols} not multiple of {Q8_0_BLOCK_SIZE}"
            ));
        }
        let blocks_per_row = cols / Q8_0_BLOCK_SIZE;
        let expected = rows * blocks_per_row;
        if blocks.len() != expected {
            return Err(format!(
                "q8_0 matvec: blocks {} != rows {rows} * bpr {blocks_per_row}",
                blocks.len()
            ));
        }
        if x.len() != cols {
            return Err(format!("q8_0 matvec: x len {} != cols {cols}", x.len()));
        }
        let mut out = vec![0.0_f32; rows];
        for r in 0..rows {
            let mut sum = 0.0f32;
            for b in 0..blocks_per_row {
                let block = &blocks[r * blocks_per_row + b];
                let scale = block.scale_f32();
                let quants = block.quants();
                let x_off = b * Q8_0_BLOCK_SIZE;
                let mut block_sum = 0.0f32;
                for j in 0..Q8_0_BLOCK_SIZE {
                    block_sum += quants[j] as f32 * x[x_off + j];
                }
                sum += scale * block_sum;
            }
            out[r] = sum;
        }
        Ok(out)
    }

After the validity checks, this computes out[r] = the dot product of input vector x with weight row r. Each weight row is blocks_per_row Q8_0 blocks. The inner loop over j is where the format pays off: it accumulates quants[j] as f32 * x[x_off + j] (the integer quant times the float activation) and only after the 32-element block sum does it multiply once by the block's scale. This is the key efficiency of the format. The dequantization (turning quants back into real weight values) does not happen as a separate pass that materializes a Vec<f32> of expanded weights. It is fused into the dot product: one scalar multiply per 32-element block, not per weight. We never build the FP32 weight matrix in memory at all. The compact bytes go straight into the arithmetic, which is what keeps the bandwidth win.
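
The fusion rests on a simple identity: the sum of (scale × quant) × x over a block equals scale times the sum of quant × x, because the scale is constant within the block. A sketch comparing the two orders for one block (both function names are hypothetical, and the scale is passed as an f32 to keep the example standalone):

```rust
/// Dequantize first, then dot: one scale multiply per weight plus a
/// temporary buffer of expanded weights.
fn dot_block_materialized(scale: f32, quants: &[i8; 32], x: &[f32; 32]) -> f32 {
    let weights: Vec<f32> = quants.iter().map(|&q| q as f32 * scale).collect();
    weights.iter().zip(x).map(|(w, xi)| w * xi).sum()
}

/// Fused: accumulate quant * x, then apply the scale once per block.
fn dot_block_fused(scale: f32, quants: &[i8; 32], x: &[f32; 32]) -> f32 {
    let block_sum: f32 = quants.iter().zip(x).map(|(&q, xi)| q as f32 * xi).sum();
    scale * block_sum
}
```

The two results agree up to floating-point rounding, and the fused form never allocates the expanded weights.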

matmul_fp32_q8_0 wraps it for the general FP32-matrix × Q8_0-matrix case (prefill, where the left side has several rows):

RUST

    pub(crate) fn matmul_fp32_q8_0(
        a: &[f32],
        a_shape: &[usize],
        b_blocks: &[crate::tensor::q8_0::Q8_0Block],
        b_shape: &[usize],
    ) -> Tensor {
        use crate::tensor::q8_0::Q8_0_BLOCK_SIZE;
        assert_eq!(a_shape.len(), 2);
        assert_eq!(b_shape.len(), 2);
        let m = a_shape[0];
        let n = a_shape[1];
        let p = b_shape[0];
        let n_b = b_shape[1];
        assert_eq!(n, n_b, "Q8_0 matmul inner dim mismatch");
        assert_eq!(a.len(), m * n, "matmul_fp32_q8_0: a len");
        let blocks_per_row = n / Q8_0_BLOCK_SIZE;
        assert_eq!(
            b_blocks.len(),
            p * blocks_per_row,
            "matmul_fp32_q8_0: b block count"
        );
        let mut out = Vec::with_capacity(m * p);
        for b_idx in 0..m {
            let x = &a[b_idx * n..(b_idx + 1) * n];
            out.extend_from_slice(
                &Self::matvec_q8_0(b_blocks, x, p, n).expect("matmul_fp32_q8_0 matvec failed"),
            );
        }
        Tensor::new(out, vec![m, p])
    }

It runs matvec_q8_0 once per row of the FP32 left matrix a. One detail in the shapes: the Q8_0 weight matrix's logical shape is [p, n], p output features each a row of n input weights, so the inner dimension n matches a's column count, and the result is [m, p]. (FP32 weights arrive in this same [p, n] layout, but our FP32 matmul was written to consume b as [n, p], which is why they get transposed at load time. matvec_q8_0 consumes the native orientation directly, so Q8_0 weights skip the transpose; more on that below.)
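
The shape convention is easier to see without quantization in the way. A plain FP32 reference for the same orientation, where matmul_a_wt is a hypothetical name used only for this sketch:

```rust
/// Shape reference only: `a` is [m, n] row-major, `w` is [p, n] row-major
/// (one row of n input weights per output feature). Returns [m, p].
fn matmul_a_wt(a: &[f32], w: &[f32], m: usize, n: usize, p: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * n);
    assert_eq!(w.len(), p * n);
    let mut out = vec![0.0f32; m * p];
    for i in 0..m {
        let a_row = &a[i * n..(i + 1) * n];
        for r in 0..p {
            // Output (i, r) is activation row i dotted with weight row r.
            let w_row = &w[r * n..(r + 1) * n];
            out[i * p + r] = a_row.iter().zip(w_row).map(|(x, y)| x * y).sum();
        }
    }
    out
}
```

Both operands are walked along contiguous rows of length n, which is why this orientation needs no transpose.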

The chapter also adds a unit test that exercises one hand-built block end to end:

RUST

#[cfg(test)]
mod q8_tests {
    use super::{Backend, CpuBackend};
    use crate::tensor::Tensor;
    use crate::tensor::q8_0::Q8_0_BLOCK_BYTES;
 
    #[test]
    fn q8_0_matmul_one_block() {
        let mut bytes = vec![0u8; Q8_0_BLOCK_BYTES];
        bytes[0] = 0x00;
        bytes[1] = 0x3c;
        for i in 0..32 {
            bytes[2 + i] = 1i8 as u8;
        }
        let w = Tensor::new_q8_0_from_bytes(&bytes, vec![1, 32]);
        let x = Tensor::new(vec![2.0f32; 32], vec![1, 32]);
        let cpu = CpuBackend;
        let y = cpu.matmul(&x, &w);
        assert_eq!(y.shape(), &[1, 1]);
        assert!((y.as_f32_slice()[0] - 64.0).abs() < 1e-3);
    }
}

It constructs a single block: scale bytes 0x00, 0x3c, which is the f16 bit pattern for 1.0, and 32 quants all equal to 1. So the block's weights all decode to 1.0 × 1 = 1.0. Multiplied against an input vector of 32 twos, the dot product is 32 × (1.0 × 2.0) = 64.0, and the test checks exactly that. A tiny, exact, hand-verifiable proof that the format decoding and the fused matmul agree.

Quantized matmul through the backends

The CpuBackend::matmul dispatch grows arms for the new tensor-data combination:

RUST

            (TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
                CpuBackend::matmul_fp32_fp32(a_data, a_shape, b_data, b_shape)
            }
            (TensorData::Fp32(a_data), TensorData::Q8_0(b_blocks)) => {
                CpuBackend::matmul_fp32_q8_0(a_data, a_shape, b_blocks, b_shape)
            }
            _ => panic!("matmul: LHS must be FP32"),

matmul now matches on the pair of data types. FP32 × FP32 is the Act 1 path; FP32 × Q8_0 is the new quantized path. The _ arm rejects anything else: the left operand (the activations) is always FP32; only the right operand (the weights) is ever quantized.

gather_rows needs the same treatment, because the token embedding table is also a Q8_0 tensor and looking a token up means reading and dequantizing one of its rows:

RUST

        match table.as_data() {
            TensorData::Fp32(data) => CpuBackend::gather_rows_fp32(data, shape, row_indices),
            TensorData::Q8_0(blocks) => CpuBackend::gather_rows_q8_0(blocks, shape, row_indices),
        }

gather_rows_q8_0 is straightforward: for each requested row index, walk its blocks and push quant × scale for every weight, producing an ordinary FP32 row. (transpose_2d gets a Q8_0(_) => panic!(...) arm too: quantized tensors are never transposed at runtime.)
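
Since the text describes gather_rows_q8_0 without showing it, here is a minimal sketch of what such a function could look like. It is an assumption-laden stand-in: the Block type below stores its scale as an f32 so the example does not depend on the f16 decoding, and the signature (a column count plus row indices, returning a flat Vec<f32>) is hypothetical rather than the engine's.

```rust
// Simplified stand-in for the chapter's block: the scale is already f32.
struct Block {
    scale: f32,
    quants: [i8; 32],
}

/// Hypothetical sketch of a Q8_0 row gather: returns the requested rows
/// of a [rows, cols] table as one flat FP32 buffer, in request order.
fn gather_rows_q8_0(blocks: &[Block], cols: usize, row_indices: &[usize]) -> Vec<f32> {
    assert_eq!(cols % 32, 0, "cols must be a multiple of the block size");
    let blocks_per_row = cols / 32;
    let mut out = Vec::with_capacity(row_indices.len() * cols);
    for &r in row_indices {
        // The blocks of row r sit contiguously in the block slice.
        let row = &blocks[r * blocks_per_row..(r + 1) * blocks_per_row];
        for block in row {
            out.extend(block.quants.iter().map(|&q| q as f32 * block.scale));
        }
    }
    out
}
```

Only the requested rows are dequantized, so a lookup touches cols / 32 blocks per token rather than the whole table.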

There is a trap here, and it is worth measuring before fixing: the scalar kernel above converts every quant to f32 inside the inner loop, one integer-to-float conversion per weight, before the actual multiply-accumulate. Measure it and the result is about 3.4 tokens/second, roughly five times slower than just running the FP32 file. We quartered the bytes and then spent all the savings (and more) converting them back one at a time. Quantization only pays when the kernel is built for it.

The fix is the trick every production engine uses (llama.cpp calls it Q8_0 × Q8_0): quantize the activation vector too, once per matvec, into the same 32-element block format. Then the inner loop is pure integer work (i8 × i8 products summed into an i32), and each block costs two scale multiplies at the end instead of 32 conversions inside. SimdCpu owns this kernel. First, quantizing one activation row:

RUST

fn quantize_row_i8(x: &[f32]) -> (Vec<i8>, Vec<f32>) {
    let blocks = x.len() / Q8_0_BLOCK_SIZE;
    let mut quants = vec![0i8; x.len()];
    let mut scales = vec![0.0f32; blocks];
    for b in 0..blocks {
        let chunk = &x[b * Q8_0_BLOCK_SIZE..(b + 1) * Q8_0_BLOCK_SIZE];
        let absmax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        if absmax == 0.0 {
            continue;
        }
        let scale = absmax / 127.0;
        let inv = 127.0 / absmax;
        scales[b] = scale;
        let out = &mut quants[b * Q8_0_BLOCK_SIZE..(b + 1) * Q8_0_BLOCK_SIZE];
        for (q, v) in out.iter_mut().zip(chunk) {
            *q = (v * inv).round().clamp(-127.0, 127.0) as i8;
        }
    }
    (quants, scales)
}

Per 32-element block: find the largest magnitude, scale so it maps to 127, round everything into an i8. Exactly the encoding the weights already use, computed on the fly. It costs a pass over one 1024-element vector per matvec, noise next to the 600M weight reads it unlocks. (Yes, this rounds the activations; the error is on the order of 0.4%, the same order as the weight quantization itself. Greedy decode on our test prompts picks identical tokens.)
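
The 0.4% figure can be sanity-checked. Rounding to the nearest of 127 steps per side is off by at most half a step, so the worst-case error relative to the block's largest magnitude is 0.5 / 127 ≈ 0.39%. A sketch that measures it for one block (max_relative_error is a hypothetical helper, and it divides by the scale where the chapter's quantizer multiplies by the inverse):

```rust
/// Largest round-trip error of symmetric i8 quantization over one block,
/// as a fraction of the block's largest magnitude.
fn max_relative_error(block: &[f32]) -> f32 {
    let absmax = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        return 0.0; // an all-zero block round-trips exactly
    }
    let scale = absmax / 127.0;
    block
        .iter()
        .map(|v| {
            // Quantize, dequantize, and compare against the original.
            let q = (v / scale).round().clamp(-127.0, 127.0);
            (q * scale - v).abs() / absmax
        })
        .fold(0.0f32, f32::max)
}
```

Note this bounds the error against the block maximum; small values inside a block with one large outlier lose proportionally more.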

Then the kernel. The CPU has a dedicated instruction for this shape of work: sdot. Give it two 16-byte registers of i8 values and, in each 32-bit lane of an accumulator register, it dots four i8 pairs and adds the result in: sixteen multiply-adds per instruction, integers end to end, no float conversion in sight, and an i32 accumulator with room to spare. Two sdots cover a 32-weight block. The catch is getting Rust to emit it: the intrinsic for sdot isn't stabilized yet, but inline asm! is stable, so we write the instruction ourselves. asm! exists for exactly this case, a machine instruction no library function exposes: you write the instruction as text, declare which values feed it and which register catches the result, and the compiler splices it into the generated code. That is all the asm! block below does. Two sdots per block, then one horizontal add and the two scales:

RUST

pub(crate) fn matvec_q8_0(
    blocks: &[Q8_0Block],
    x: &[f32],
    rows: usize,
    cols: usize,
) -> Result<Vec<f32>, String> {
    use std::arch::aarch64::*;
    use std::arch::asm;
    if cols % Q8_0_BLOCK_SIZE != 0 {
        return Err(format!(
            "q8_0 matvec: cols {cols} not multiple of {Q8_0_BLOCK_SIZE}"
        ));
    }
    let blocks_per_row = cols / Q8_0_BLOCK_SIZE;
    let expected = rows * blocks_per_row;
    if blocks.len() != expected {
        return Err(format!(
            "q8_0 matvec: blocks {} != rows {rows} * bpr {blocks_per_row}",
            blocks.len()
        ));
    }
    if x.len() != cols {
        return Err(format!("q8_0 matvec: x len {} != cols {cols}", x.len()));
    }
    let (xq, xs) = quantize_row_i8(x);
    let mut out = vec![0.0_f32; rows];
    let xqp = xq.as_ptr();
    unsafe {
        for r in 0..rows {
            let row_blocks = &blocks[r * blocks_per_row..(r + 1) * blocks_per_row];
            let mut sum = 0.0f32;
            for (b, block) in row_blocks.iter().enumerate() {
                let w = block.quants().as_ptr();
                let a = xqp.add(b * Q8_0_BLOCK_SIZE);
 
                let w0 = vld1q_s8(w);
                let w1 = vld1q_s8(w.add(16));
                let a0 = vld1q_s8(a);
                let a1 = vld1q_s8(a.add(16));
 
                let mut acc = vdupq_n_s32(0);
                asm!(
                    "sdot {acc:v}.4s, {w0:v}.16b, {a0:v}.16b",
                    "sdot {acc:v}.4s, {w1:v}.16b, {a1:v}.16b",
                    acc = inout(vreg) acc,
                    w0 = in(vreg) w0,
                    a0 = in(vreg) a0,
                    w1 = in(vreg) w1,
                    a1 = in(vreg) a1,
                    options(pure, nomem, nostack)
                );
 
                let isum = vaddvq_s32(acc);
                sum += block.scale_f32() * xs[b] * isum as f32;
            }
            out[r] = sum;
        }
    }
    Ok(out)
}
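
For readers without the instruction reference at hand, the integer part of the kernel can be modeled in portable Rust. This is a scalar model for understanding and testing, not a replacement for the asm! block; sdot_model and block_isum_model are hypothetical names.

```rust
/// Scalar model of one `sdot vd.4s, vn.16b, vm.16b`: each of the four
/// i32 lanes adds the dot product of its own four i8 pairs.
fn sdot_model(mut acc: [i32; 4], a: &[i8], b: &[i8]) -> [i32; 4] {
    assert_eq!(a.len(), 16);
    assert_eq!(b.len(), 16);
    for lane in 0..4 {
        for k in 0..4 {
            let i = lane * 4 + k;
            acc[lane] += a[i] as i32 * b[i] as i32;
        }
    }
    acc
}

/// One 32-element block: two sdot steps, then a horizontal add
/// (the role vaddvq_s32 plays in the kernel).
fn block_isum_model(w: &[i8; 32], x: &[i8; 32]) -> i32 {
    let mut acc = [0i32; 4];
    acc = sdot_model(acc, &w[..16], &x[..16]);
    acc = sdot_model(acc, &w[16..], &x[16..]);
    acc.iter().sum()
}
```

The model also shows the accumulator headroom: the largest possible block sum is 32 × 128 × 128 = 524,288, far below what an i32 can hold.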

This kernel went through two versions, and the gap between them is worth being honest about. The first used the older two-instruction NEON idiom: vmull_s8 to multiply i8 pairs into widened i16 products, then vpadalq_s16 to pairwise-accumulate them into i32 lanes. It reached about 37 tokens/sec. Swapping in sdot roughly halves the inner loop's instruction count and lands at about 41. Past that, the kernel stops being the limiter: the rest of the forward pass (attention bookkeeping, cache copies) sets the floor, so a sharper dot product shaves an ever-smaller slice. The act's lesson, a third time: the win comes from feeding the bottleneck, and once the kernel outruns it, the bottleneck moves.

The imports at the top of the file widen for the block types:

RUST

use crate::tensor::q8_0::{Q8_0_BLOCK_SIZE, Q8_0Block};

SimdCpu's matmul-level wrapper runs the kernel once per row of the FP32 left matrix:

RUST

    #[inline(always)]
    fn matmul_fp32_q8_0(
        &self,
        a: &[f32],
        a_shape: &[usize],
        b_blocks: &[Q8_0Block],
        b_shape: &[usize],
    ) -> Tensor {
        assert_eq!(a_shape.len(), 2);
        assert_eq!(b_shape.len(), 2);
        let m = a_shape[0];
        let n = a_shape[1];
        let p = b_shape[0];
        assert_eq!(n, b_shape[1], "Q8_0 matmul inner dim mismatch");
        assert_eq!(a.len(), m * n, "matmul_fp32_q8_0: a len");
        let mut out = Vec::with_capacity(m * p);
        for i in 0..m {
            let x = &a[i * n..(i + 1) * n];
            out.extend_from_slice(&matvec_q8_0(b_blocks, x, p, n).expect("simd Q8 matvec failed"));
        }
        Tensor::new(out, vec![m, p])
    }

and a test pins the NEON path to a plain-Rust integer reference, so the intrinsics can't silently drift (the tolerance is tight because both sides use the same quantized inputs; the only difference allowed is float summation order):

RUST

#[cfg(test)]
mod q8_simd_tests {
    use super::{matvec_q8_0, quantize_row_i8};
    use crate::tensor::q8_0::{Q8_0_BLOCK_BYTES, Q8_0_BLOCK_SIZE, Q8_0Block};
 
    #[test]
    fn neon_matvec_matches_integer_reference() {
        let rows = 3;
        let cols = 64;
        let blocks_per_row = cols / Q8_0_BLOCK_SIZE;
        let mut blocks = Vec::new();
        for i in 0..rows * blocks_per_row {
            let mut bytes = vec![0u8; Q8_0_BLOCK_BYTES];
            bytes[0] = 0x00;
            bytes[1] = 0x3c;
            for j in 0..Q8_0_BLOCK_SIZE {
                bytes[2 + j] = ((i * 37 + j * 11) % 251) as u8;
            }
            blocks.push(Q8_0Block::from_bytes(&bytes));
        }
        let x: Vec<f32> = (0..cols).map(|j| (j as f32 - 31.5) * 0.03125).collect();
 
        let fast = matvec_q8_0(&blocks, &x, rows, cols).unwrap();
 
        let (xq, xs) = quantize_row_i8(&x);
        for r in 0..rows {
            let mut expected = 0.0f32;
            for b in 0..blocks_per_row {
                let block = &blocks[r * blocks_per_row + b];
                let mut isum = 0i32;
                for j in 0..Q8_0_BLOCK_SIZE {
                    isum += block.quants()[j] as i32 * xq[b * Q8_0_BLOCK_SIZE + j] as i32;
                }
                expected += block.scale_f32() * xs[b] * isum as f32;
            }
            let tol = 1e-4 * expected.abs().max(1.0);
            assert!(
                (fast[r] - expected).abs() < tol,
                "row {r}: neon {} vs reference {expected}",
                fast[r]
            );
        }
    }
}

matmul routes to the wrapper:

RUST

        let a_data = match a.as_data() {
            TensorData::Fp32(d) => d,
            TensorData::Q8_0(_) => panic!("matmul: LHS must be FP32"),
        };
        match b.as_data() {
            TensorData::Q8_0(blocks) => {
                assert_eq!(n, b.shape()[1], "Q8_0 matmul inner dim mismatch");
                self.matmul_fp32_q8_0(a_data, a_shape, blocks, b.shape())
            }
            TensorData::Fp32(b_data) => {
                let p = b.shape()[1];
                assert_eq!(n, b.shape()[0], "tensor shape mismatch");
                self.matmul_fp32_fp32(a_data, b_data, m, n, p)
            }
        }

The Parallel backend gets a real parallel implementation, the same row-distribution trick from II.4 applied to quantized matmul: each core runs the integer kernel on its own rows. Its imports pick up the kernel alongside SimdCpu, plus the block type:

RUST

use super::simd_cpu::{SimdCpu, matvec_q8_0};
use crate::tensor::q8_0::Q8_0Block;

Then the method:

RUST

    fn matmul_fp32_q8_0(
        &self,
        a_data: &[f32],
        blocks: &[Q8_0Block],
        m: usize,
        n: usize,
        p: usize,
    ) -> Tensor {
        assert_eq!(a_data.len(), m * n, "matmul_fp32_q8_0: len(a) must be m*n");
        let mut data = vec![0.0f32; m * p];
        data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| {
            let x = &a_data[i * n..(i + 1) * n];
            let row = matvec_q8_0(blocks, x, p, n).expect("parallel_cpu Q8 matvec failed");
            out_row.copy_from_slice(&row);
        });
        Tensor::new(data, vec![m, p])
    }

par_chunks_mut(p) cuts the output into its m rows of p values each, one per row of the FP32 left matrix, and runs one matvec_q8_0 per output row across all cores: quantized matmul, parallelized. It's wired into Parallel::matmul with the same (Fp32, Q8_0) match arm.

Quantized matmul on the GPU

Metal gets a Q8_0 path too, and the design decision here matters more than the code. The tempting shortcut is a matrix-vector kernel (dequantize one weight row, dot it with the activation) launched once per prompt row. Don't. A 511-token prefill would mean ~7 matmuls × 511 rows × 28 layers (on the order of a hundred thousand sequential GPU dispatches), and at a fraction of a millisecond of launch-and-sync overhead each, the GPU spends its life waiting to be asked. We measured that shortcut at ~23 seconds for one prefill.

Bar chart comparing 511-token Q8_0 prefill on Metal: 23.2 seconds with one dispatch per row versus 1.5 seconds with the SIMD-group matrix kernel.

Figure: the cost of asking the GPU 100,000 times. Same weights, same arithmetic; the only difference is one kernel launch per prompt row versus one per matmul.
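
The dispatch count is worth working out explicitly. A small sketch using the chapter's own numbers, with hypothetical helper names; the per-dispatch figure is an upper bound that assumes the whole measured time is launch-and-sync overhead:

```rust
/// Dispatch count for the per-row shortcut: one launch per prompt row,
/// per matmul, per layer.
fn per_row_dispatches(matmuls_per_layer: u64, prompt_rows: u64, layers: u64) -> u64 {
    matmuls_per_layer * prompt_rows * layers
}

/// Average cost per dispatch implied by a measured wall-clock total,
/// in milliseconds.
fn implied_ms_per_dispatch(total_seconds: f64, dispatches: u64) -> f64 {
    total_seconds * 1000.0 / dispatches as f64
}
```

With 7 matmuls, 511 rows, and 28 layers the count is 100,156, and 23.2 seconds spread over that many launches is about 0.23 ms each. By the same rough count, one dispatch per matmul is 7 × 28 = 196 launches.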

So the Q8_0 kernel is a tiled matmul: one dispatch covers the whole output grid. It is the same SIMD-group matrix kernel as II.5's FP32 one, with the same 32×32 output tile per threadgroup and the same four SIMD-groups accumulating 8×8 fragments (the walkthrough there covers all of that), plus a single new step: as each 32×32 weight slab is staged into threadgroup memory, it is dequantized in place. Every dequantized weight then gets reused 32 times from fast threadgroup memory, once per row of the output tile (this is the same shape production engines use; llama.cpp's mul_mm kernels also stage dequantized tiles and feed them to the SIMD-group matrix hardware). Added to src/backend/metal/shaders.rs:

RUST

constant uint BLOCK_QLEN  = 32u;
constant uint BLOCK_BYTES = 34u;
 
inline float decode_scale(device const uchar* base) {
    ushort bits = ushort(base[0]) | (ushort(base[1]) << 8);
    return float(as_type<half>(bits));
}
 
kernel void matmul_fp32_q8_0(
    device const float* a       [[buffer(0)]],
    device const uchar* blocks  [[buffer(1)]],
    device       float* c       [[buffer(2)]],
    constant     uint&  m       [[buffer(3)]],
    constant     uint&  n       [[buffer(4)]],
    constant     uint&  p       [[buffer(5)]],
    uint2  tgid [[threadgroup_position_in_grid]],
    ushort sgid [[simdgroup_index_in_threadgroup]],
    ushort tid  [[thread_index_in_threadgroup]]
) {
    threadgroup float tA[32 * 32];
    threadgroup float tW[32 * 32];
    threadgroup float tC[32 * 32];
 
    const uint row0 = tgid.x * 32;
    const uint col0 = tgid.y * 32;
    const uint blocks_per_row = n / BLOCK_QLEN;
 
    const uint sg_row = (sgid / 2) * 16;
    const uint sg_col = (sgid % 2) * 16;
 
    simdgroup_float8x8 acc[2][2];
    for (uint i = 0; i < 2; i++) {
        for (uint j = 0; j < 2; j++) {
            acc[i][j] = simdgroup_float8x8(0.0f);
        }
    }
 
    for (uint k0 = 0; k0 < n; k0 += 32) {
        for (uint e = tid; e < 1024; e += 128) {
            const uint rr = e / 32;
            const uint kk = e % 32;
 
            const uint ar = row0 + rr;
            tA[e] = (ar < m) ? a[ar * n + k0 + kk] : 0.0f;
 
            const uint wr = col0 + rr;
            float wv = 0.0f;
            if (wr < p) {
                const uint k = k0 + kk;
                const uint off = (wr * blocks_per_row + k / BLOCK_QLEN) * BLOCK_BYTES;
                wv = decode_scale(blocks + off) * float(int(char(blocks[off + 2u + (k % BLOCK_QLEN)])));
            }
            tW[e] = wv;
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
 
        for (uint ks = 0; ks < 32; ks += 8) {
            simdgroup_float8x8 af[2];
            simdgroup_float8x8 wf[2];
            for (uint i = 0; i < 2; i++) {
                simdgroup_load(af[i], tA + (sg_row + i * 8) * 32 + ks, 32);
                simdgroup_load(wf[i], tW + (sg_col + i * 8) * 32 + ks, 32, ulong2(0, 0), true);
            }
            for (uint i = 0; i < 2; i++) {
                for (uint j = 0; j < 2; j++) {
                    simdgroup_multiply_accumulate(acc[i][j], af[i], wf[j], acc[i][j]);
                }
            }
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
 
    for (uint i = 0; i < 2; i++) {
        for (uint j = 0; j < 2; j++) {
            simdgroup_store(acc[i][j], tC + (sg_row + i * 8) * 32 + sg_col + j * 8, 32);
        }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
 
    for (uint e = tid; e < 1024; e += 128) {
        const uint rr = e / 32;
        const uint cc = e % 32;
        if (row0 + rr < m && col0 + cc < p) {
            c[(row0 + rr) * p + col0 + cc] = tC[e];
        }
    }
}

tA staging is identical to the FP32 kernel's. The tW staging is the new part: for each staged element, the thread locates weight W[wr][k] inside its 34-byte block (block index wr * blocks_per_row + k / 32, quant byte 2 + k % 32), decodes the block's half scale with decode_scale, and writes the dequantized float into the tile. One wrinkle: the weight slab is staged in the weights' native [out_features, in_features] orientation, so simdgroup_load reads tW with its transpose flag, the trailing true. From there the fragment loop is II.5's exactly, and the matrix hardware never knows the weights were ever quantized.
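The block lookup described above can be mirrored on the CPU, which is handy as a slow reference to compare the GPU output against. This is a minimal sketch, not the engine's code: the function names `f16_bits_to_f32`, `dequant_weight`, and `matmul_fp32_q8_0_reference` are hypothetical, and the half-to-float conversion is written by hand only because the standard library has no stable f16 type.

```rust
// CPU reference for the kernel's weight lookup. All names here are
// hypothetical stand-ins, not the engine's real API.
const BLOCK_QLEN: usize = 32; // weights per Q8_0 block
const BLOCK_BYTES: usize = 34; // 2-byte f16 scale + 32 int8 quants

/// Convert IEEE 754 binary16 bits to f32.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x03ff) as u32;
    match exp {
        // Zero and subnormals: value = frac * 2^-24.
        0 => {
            let mag = frac as f32 / 16_777_216.0;
            if sign != 0 { -mag } else { mag }
        }
        // Infinity and NaN keep an all-ones exponent.
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (frac << 13)),
        // Normal numbers: rebias the exponent from 15 to 127.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)),
    }
}

/// Dequantize W[wr][k] from row-major packed blocks, using the same
/// offset arithmetic as the kernel's tW staging.
fn dequant_weight(blocks: &[u8], blocks_per_row: usize, wr: usize, k: usize) -> f32 {
    let off = (wr * blocks_per_row + k / BLOCK_QLEN) * BLOCK_BYTES;
    let scale = f16_bits_to_f32(u16::from_le_bytes([blocks[off], blocks[off + 1]]));
    let quant = blocks[off + 2 + k % BLOCK_QLEN] as i8;
    scale * f32::from(quant)
}

/// Reference C = A * W^T, with A as [m, n] floats and W as [p, n] in
/// Q8_0 blocks (the weights' native [out_features, in_features] layout).
fn matmul_fp32_q8_0_reference(a: &[f32], blocks: &[u8], m: usize, n: usize, p: usize) -> Vec<f32> {
    assert_eq!(n % BLOCK_QLEN, 0);
    let blocks_per_row = n / BLOCK_QLEN;
    assert_eq!(blocks.len(), p * blocks_per_row * BLOCK_BYTES);
    assert_eq!(a.len(), m * n);
    let mut c = vec![0.0f32; m * p];
    for i in 0..m {
        for j in 0..p {
            let mut sum = 0.0f32;
            for k in 0..n {
                sum += a[i * n + k] * dequant_weight(blocks, blocks_per_row, j, k);
            }
            c[i * p + j] = sum;
        }
    }
    c
}
```

Because the reference dequantizes one weight at a time, it is far too slow for inference; its only job is to give a known-good answer for small shapes. Expect small floating-point differences against the GPU, since the summation order differs.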

context.rs registers the kernel alongside the FP32 one:

RUST

pub(crate) const KERNEL_MATMUL_FP32_FP32: &str = "matmul_fp32_fp32";
pub(crate) const KERNEL_MATMUL_FP32_Q8_0: &str = "matmul_fp32_q8_0";
 
pub(crate) const KERNELS: &[&str] = &[KERNEL_MATMUL_FP32_FP32, KERNEL_MATMUL_FP32_Q8_0];

The kernel reads the blocks as raw bytes, so the context gains a wrap_u8 twin of II.5's wrap_f32 (same no-copy shared-storage wrapping, different element type) and a dispatch wrapper shaped exactly like matmul_fp32_fp32's:

RUST

    fn wrap_u8(&self, data: &[u8]) -> Buffer {
        unsafe { wrap_shared_storage(&self.device, data) }.unwrap()
    }

RUST

    pub fn matmul_fp32_q8_0(
        &self,
        a: &[f32],
        blocks: &[u8],
        m: usize,
        n: usize,
        p: usize,
    ) -> Vec<f32> {
        use crate::tensor::q8_0::{Q8_0_BLOCK_BYTES, Q8_0_BLOCK_SIZE};
 
        assert_eq!(n % Q8_0_BLOCK_SIZE, 0);
        let blocks_per_row = n / Q8_0_BLOCK_SIZE;
        assert_eq!(blocks.len(), p * blocks_per_row * Q8_0_BLOCK_BYTES);
        assert_eq!(a.len(), m * n);
 
        let c_data = vec![0.0f32; m * p];
        let a_buf = self.wrap_f32(a);
        let b_buf = self.wrap_u8(blocks);
        let c_buf = self.wrap_f32(&c_data);
 
        self.dispatch(
            KERNEL_MATMUL_FP32_Q8_0,
            &[&a_buf, &b_buf, &c_buf],
            &[m as u32, n as u32, p as u32],
            Size::new(m.div_ceil(32), p.div_ceil(32), 1),
            Size::new(128, 1, 1),
        );
 
        c_data
    }

backend.rs imports the block type, reinterprets the block slice as the raw bytes the kernel wants (safe because Q8_0Block is #[repr(C, packed)], exactly 34 bytes, no padding), and makes one dispatch:

RUST

use crate::tensor::q8_0::{Q8_0Block, blocks_packed_byte_len};

RUST

    fn matmul_fp32_q8_0(
        &self,
        a_data: &[f32],
        b_blocks: &[Q8_0Block],
        m: usize,
        n: usize,
        p: usize,
    ) -> Tensor {
        use crate::tensor::q8_0::Q8_0_BLOCK_SIZE;
        assert_eq!(a_data.len(), m * n, "matmul_fp32_q8_0: len(a) must be m*n");
        let blocks_per_row = n / Q8_0_BLOCK_SIZE;
        assert_eq!(
            b_blocks.len(),
            p * blocks_per_row,
            "matmul_fp32_q8_0: block count must match shape"
        );
 
        let block_bytes = unsafe {
            std::slice::from_raw_parts(
                b_blocks.as_ptr() as *const u8,
                blocks_packed_byte_len(b_blocks),
            )
        };
        let out = self.ctx.matmul_fp32_q8_0(a_data, block_bytes, m, n, p);
        Tensor::new(out, vec![m, p])
    }
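The byte reinterpretation only works because the block type has no padding. Here is a small sketch of that invariant in isolation, using a hypothetical `Block` type that mirrors what the text says about Q8_0Block (the field names are assumptions, and `as_bytes` is an illustrative helper, not the engine's `blocks_packed_byte_len` path):

```rust
use std::mem::size_of;

// Hypothetical mirror of the engine's Q8_0Block; field names are assumed.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct Block {
    scale_bits: u16,  // f16 scale kept as raw bits
    quants: [i8; 32], // 32 signed 8-bit quants
}

// Compile-time check: if padding ever crept in, the byte view handed to
// the kernel would no longer match its 34-byte stride.
const _: () = assert!(size_of::<Block>() == 34);

/// View a slice of blocks as the raw bytes a kernel would read.
fn as_bytes(blocks: &[Block]) -> &[u8] {
    // SAFETY: Block is repr(C, packed) with no padding bytes, u8 has
    // alignment 1, and the returned slice borrows from `blocks`.
    unsafe {
        std::slice::from_raw_parts(
            blocks.as_ptr() as *const u8,
            std::mem::size_of_val(blocks),
        )
    }
}
```

Note that the scale lands in memory in the host's native byte order. The shader's decode_scale assembles it low byte first, so this view matches only on a little-endian host, which Apple Silicon is.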

Finally the (Fp32, Q8_0) arm in Metal::matmul, with the same MIN_M_FOR_GPU_MATMUL threshold logic from II.5: big quantized matmuls go to the GPU, small ones (decode's single row) fall back to the SIMD CPU integer kernel:

RUST

            (TensorData::Fp32(a_data), TensorData::Q8_0(blocks)) => {
                let p = b_shape[0];
                assert_eq!(n, b_shape[1], "Q8_0 matmul inner dim mismatch");
                if m == 0 || p == 0 {
                    return Tensor::new(vec![], vec![m, p]);
                }
                if m < MIN_M_FOR_GPU_MATMUL {
                    return self.fallback.matmul(a, b);
                }
                self.matmul_fp32_q8_0(a_data, blocks, m, n, p)
            }

One small change in the forward pass

The model code barely changes. That's the payoff of the Tensor/Backend abstraction. Two spots in src/model/qwen3/forward.rs need to handle a Q8_0 tensor where they previously assumed FP32.

FP32 weight matrices get transposed at load time, not because of how they arrive (every GGUF weight arrives [out_features, in_features]) but because our FP32 matmul consumes its right-hand side as [n, p]. The Q8_0 kernels were written to consume the native orientation (and transpose_2d would panic on quantized data anyway), so loading a Q8_0 weight is just: use it as is. Both spots are a single new match arm:

RUST

            match embed.as_data() {
                TensorData::Fp32(_) => ops.transpose_2d(&embed),
                TensorData::Q8_0(_) => embed.clone(),
            }

RUST

    match t.as_data() {
        TensorData::Fp32(_) => ops.transpose_2d(&t),
        TensorData::Q8_0(_) => t,
    }

The first is in the lm-head setup (when the output weights are tied to the embedding table); the second is in load_ggml_weight_for_matmul_rhs, the per-layer weight loader. That is the entire change to the model's forward pass. The 28-layer attention-and-MLP loop, every backend dispatch, the KV cache: none of it knows or cares that the weights are now 8-bit. Each matmul call sees a Tensor, and the Backend quietly picks the FP32 or Q8_0 kernel based on the data variant. The abstraction held.

Running it

Grab the Q8_0 export of the checkpoint. Point gguf-inspect from I.1 at it and you'll see the type-8 tensors the engine can now load directly. Then run generation against it:

BASH

cargo run --release --bin model-generate -- --kv --backend simd path/to/qwen3-0.6b-q8_0.gguf "Once upon a time, in a small village by the sea, there lived a baker" 64

Against the FP32 file on the same backend and prompt (both measured over 64 tokens):

PLAINTEXT

backend: simd
kv cache: basic
 
  -- FP32 weights --
  time_to_first_token_ms: 599.124
  decode_tokens_per_second: 17.824
  per_forward_ms: min 53.677  max 599.124  mean 64.590  (n=64)
 
  -- Q8_0 weights --
  time_to_first_token_ms: 256.579
  decode_tokens_per_second: 41.023
  per_forward_ms: min 22.070  max 256.579  mean 28.005  (n=64)

Decode throughput jumps 2.3×, from ~18 to ~41 tokens/sec, the first decode movement since the KV cache, after three chapters where faster arithmetic couldn't buy a single token per second. The reason is the one II.3 measured: decode reads every weight once per token, the weights now occupy about a quarter of the bytes, and decode is bandwidth-bound, so fewer bytes means proportionally less time. (It's a ~2.3× win rather than ~4× because the integer kernel adds some arithmetic back and activations, KV, and everything else stay FP32, but the weights are the bulk of the traffic, and that's what shows up in the number.)
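The gap between the byte savings and the measured win is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic only: it assumes every one of the ~600 million weights is stored in Q8_0 (a real export may keep some small tensors in FP32), and the function names are made up for illustration.

```rust
// Back-of-the-envelope check of the decode numbers above.
// Assumption: all weights are quantized; a real file may keep a few in FP32.
const FP32_BYTES_PER_WEIGHT: f64 = 4.0;
const Q8_0_BYTES_PER_WEIGHT: f64 = 34.0 / 32.0; // one 34-byte block holds 32 weights

/// Approximate weight bytes read to decode one token.
fn weight_bytes_per_token(weights: f64, bytes_per_weight: f64) -> f64 {
    weights * bytes_per_weight
}

/// Upper bound on the speedup if weight reads were the only cost.
fn ideal_speedup() -> f64 {
    FP32_BYTES_PER_WEIGHT / Q8_0_BYTES_PER_WEIGHT
}

/// Speedup actually observed, from two tokens-per-second readings.
fn measured_speedup(fp32_tps: f64, q8_0_tps: f64) -> f64 {
    q8_0_tps / fp32_tps
}
```

With 600e6 weights this gives about 2.4 GB per token for FP32 and about 0.64 GB for Q8_0, an ideal ratio of roughly 3.76×. Feeding in the measured 17.824 and 41.023 tokens/sec gives about 2.30×; the difference is the non-weight traffic and the extra integer arithmetic described above.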

The prefill story is a bonus we didn't order, and it's already visible in the block above: with a 17-token prompt, time to first token drops from 599 ms to 257 ms, roughly 2.3×. It gets bigger with a real prompt. On the multi-core backend the 511-token prompt from II.4 prefills in 2.3 s with Q8_0 versus 3.8 s with FP32: when all the cores hammer memory at once, prefill starts to feel the bandwidth wall too, and smaller weights pay there as well. And on Metal, the quantized SIMD-group matrix kernel makes the same 511-token prefill the fastest configuration in the whole act, 1.55 s with Q8_0 versus 1.68 s with FP32, the dequantization hidden entirely inside the tile staging. (Decode under Metal still falls back to the SIMD integer kernel through the m < 8 threshold, so it rides the same 2.3× win.)

Bar chart of decode speed for four Q8_0 kernels: naive dequantize-per-weight at 3.4 tokens per second, NEON float-convert at 18.9, vmull integer dot at 36.7, sdot integer dot at 41.0, with the FP32 baseline of 17.8 marked as a dashed line.

Figure: the same weights, four kernels. Quantized bytes only pay off once the kernel stops converting them one at a time; the naive version loses to FP32, the integer versions leave it behind, and the shipped sdot kernel beats it 2.3×.

And the output? Token-for-token identical to the FP32 run on this prompt, despite two layers of rounding (weights at export time, activations on the fly). That's not guaranteed in general (this is a 0.6B model being asked for greedy continuations), but it's a good check that nothing broke.

Where this leaves us, and the end of Act 2

Q8_0 quantization is the last rung, and the first decode win since the KV cache. By reading 8-bit weights instead of 32-bit ones, and by refusing to dequantize them until after the arithmetic, decode finally got fast. Decode was the half of inference that no faster arithmetic could touch, because its true bottleneck was always bytes.

Step back and look at the whole ladder, every number measured on the same M2 Pro. The benchmark harness made every claim checkable. The KV cache fixed the algorithm: 0.55 → 18.4 tokens/sec, a 33× decode win. SIMD vectorized the kernel and moved nothing, proving decode was bandwidth-bound, the fact the rest of the act stands on. Threads cut a 511-token prefill from 19.4 s to 3.8 s; Metal cut it again to 1.7 s. And Q8_0 cut decode's weight traffic to roughly a quarter and more than doubled its speed, 18 → 41 tokens/sec. A single request now runs end to end fast, and every speedup (and every non-speedup) traces to one specific chapter and one specific bottleneck.

What we still cannot do: serve this to anyone. There is one CLI binary, one prompt at a time, no second concurrent request, no HTTP, no chat formatting. The Act 2 recap takes stock of the speedups, and then Act 3 turns this fast single-user engine into a real server.