II.6: Q8_0 quantization
Through SIMD (II.3), threads (II.4), and the GPU (II.5), one number stayed stubbornly flat: decode throughput. Each of those backends made arithmetic faster, and each left decode roughly where it found it. The act intro said why this would happen, and now we have to actually act on it.
Decode is memory-bandwidth-bound. To generate one token, the model reads every weight (all ~600 million numbers of Qwen3 0.6B) out of memory and through the arithmetic units. With a single token in flight, those weights get used once and thrown away; almost all the wall-clock time is the reading, not the computing. Faster arithmetic can't help when arithmetic isn't the bottleneck. The only lever that moves decode is reading fewer bytes.
This chapter pulls that lever. We have been storing each weight as a 32-bit float. The model on disk doesn't: its weights are already in an 8-bit quantized format called Q8_0, and so far we've been expanding each one to a full FP32 float at load time. This chapter teaches the engine to keep the weights quantized and multiply against them directly. Quarter the bytes per weight (actually a bit better than a quarter, as we'll see) and decode, whose whole problem is bytes, gets dramatically faster.
What quantization is
A weight is a number like 0.0426 or -0.0119. We've stored each in an f32: 32 bits, four bytes, with enormous range and precision. A trained model's weights don't need that range. Within any small group of weights, the values cluster tightly, all within a similar magnitude. Spending 32 bits on each is wasteful.
Quantization stores them in fewer bits by giving up precision the model can tolerate. Q8_0, the format Qwen3's checkpoint ships in and used across the llama.cpp ecosystem, does it like this:
- Group the weights into blocks of 32 consecutive values.
- For each block, find the largest-magnitude weight and derive a single scale factor from it.
- Store each of the 32 weights as a signed 8-bit integer (
i8, range −128…127), representing "this many scale-units." The real value isquant × scale. - Store the block's scale once, as a 16-bit half-precision float (
f16).
So one block is: a 2-byte scale plus 32 one-byte quants = 34 bytes for 32 weights. In FP32 those same 32 weights would be 32 × 4 = 128 bytes. That's a 3.76× size reduction, close to 4× but slightly less because of the per-block scale overhead. The _0 in the name means "no zero-point": the quants are symmetric around zero, scale only. Loading a weight back is one multiply: f32_value = quant_i8 as f32 * scale.
The accuracy cost is real but small. Eight bits per weight, with a per-32-block scale, is enough that the model's outputs are nearly indistinguishable from the FP32 version, which is exactly why this format is the default for shipping models. We get most of a 4× bandwidth win for a precision loss the model shrugs off.
The Q8_0 block format
A new file, src/tensor/q8_0.rs, defines the block and the code to read it. The block first:
#[repr(C, packed)]
#[derive(Clone, Copy, Debug)]
pub(crate) struct Q8_0Block {
scale: [u8; 2],
quants: [i8; 32],
}Q8_0Block is exactly the on-disk layout: 2 bytes of scale, then 32 signed quants. #[repr(C, packed)] is load-bearing: it tells Rust to lay the struct out in memory with C's field order and no padding, so a Q8_0Block in memory is byte-identical to a block in the GGUF file. That equivalence is what lets us, later, hand a &[Q8_0Block] straight to the GPU as raw bytes. The size is 2 + 32 = 34 bytes.
Reading a block out of a byte slice:
impl Q8_0Block {
#[inline]
pub(crate) fn from_bytes(src: &[u8]) -> Self {
let mut scale = [0u8; 2];
scale.copy_from_slice(&src[0..2]);
let mut quants = [0i8; 32];
for i in 0..32 {
quants[i] = src[2 + i] as i8;
}
Self { scale, quants }
}Copy the first 2 bytes as the scale, then the next 32 bytes as the quants. The src[2 + i] as i8 reinterprets each byte as a signed integer: bit pattern 0xFF becomes -1, not 255, because Q8_0 quants are signed.
Decoding the scale is the one fiddly part:
#[inline]
pub(crate) fn scale_f32(&self) -> f32 {
let bits = u16::from_le_bytes(self.scale);
let sign = ((bits >> 15) & 1) as u32;
let exp = ((bits >> 10) & 0x1F) as u32;
let frac = (bits & 0x3FF) as u32;
let f32_bits = (sign << 31) | ((exp.wrapping_add(112)) << 23) | (frac << 13);
f32::from_bits(f32_bits)
}The scale is a 16-bit half-precision float (f16). Rust's standard library has no native f16 arithmetic, so we convert it to f32 by hand, bit by bit. An f16 packs a 1-bit sign, a 5-bit exponent, and a 10-bit fraction; an f32 packs a 1-bit sign, an 8-bit exponent, and a 23-bit fraction. The conversion: pull out the three f16 fields with shifts and masks, then reassemble them in f32 positions: sign to bit 31, fraction shifted up 13 bits to fill the wider field. The exponent needs a rebias: f16 biases its exponent by 15, f32 by 127, so wrapping_add(112) (which is 127 − 15) corrects it. f32::from_bits then reinterprets the assembled bit pattern as a float. (This handles normal numbers, which block scales always are.)
#[inline]
pub(crate) fn quants(&self) -> &[i8; 32] {
&self.quants
}
}
pub(crate) const Q8_0_BLOCK_SIZE: usize = 32;
pub(crate) const Q8_0_BLOCK_BYTES: usize = 2 + Q8_0_BLOCK_SIZE;
#[inline]
pub(crate) fn blocks_packed_byte_len(blocks: &[Q8_0Block]) -> usize {
blocks.len() * Q8_0_BLOCK_BYTES
}Q8_0_BLOCK_SIZE is the 32 weights per block; Q8_0_BLOCK_BYTES is the 34 bytes one block occupies. blocks_packed_byte_len gives the total byte length of a slice of blocks, used when handing blocks to the GPU as raw bytes.
And the bulk reader that turns a tensor's worth of bytes into blocks:
pub(super) fn blocks_from_bytes(bytes: &[u8]) -> Vec<Q8_0Block> {
assert_eq!(
bytes.len() % Q8_0_BLOCK_BYTES,
0,
"Q8_0 byte length {} not a multiple of {}",
bytes.len(),
Q8_0_BLOCK_BYTES
);
let n = bytes.len() / Q8_0_BLOCK_BYTES;
let mut out = Vec::with_capacity(n);
for i in 0..n {
let off = i * Q8_0_BLOCK_BYTES;
out.push(Q8_0Block::from_bytes(&bytes[off..]));
}
out
}The input byte count must be an exact multiple of 34; then it's n blocks, each parsed with from_bytes.
The Tensor type learns a second representation
Until now a Tensor's data has been one thing: Vec<f32>. src/tensor/mod.rs adds the module:
pub(crate) mod q8_0;
mod tensor;
pub use tensor::Tensor;And src/tensor/tensor.rs gives TensorData a second variant:
use crate::tensor::q8_0::{Q8_0_BLOCK_SIZE, Q8_0Block};
#[derive(Clone)]
pub(crate) enum TensorData {
Fp32(Vec<f32>),
Q8_0(Vec<Q8_0Block>),
}
impl std::fmt::Debug for TensorData {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
TensorData::Fp32(v) => write!(f, "Fp32({} elems)", v.len()),
TensorData::Q8_0(v) => write!(f, "Q8_0({} blocks)", v.len()),
}
}
}This is why every Backend matmul throughout Act 2 has matched on TensorData, in those match a.as_data() { TensorData::Fp32(d) => d } blocks. A tensor can now be stored either as FP32 floats or as Q8_0 blocks, and code that touches tensor data has to handle both. The Debug impl reports block count for the Q8_0 case.
The constructor that builds a Q8_0 tensor from raw bytes:
pub(crate) fn new_q8_0_from_bytes(bytes: &[u8], shape: Vec<usize>) -> Self {
assert_eq!(
shape.len(),
2,
"Q8_0 tensors must be 2-D, got shape {:?}",
shape
);
let cols = shape[1];
assert_eq!(
cols % Q8_0_BLOCK_SIZE,
0,
"Q8_0 tensor cols {} not divisible by {}",
cols,
Q8_0_BLOCK_SIZE
);
let rows = shape[0];
let expected_blocks = rows * (cols / Q8_0_BLOCK_SIZE);
let blocks = super::q8_0::blocks_from_bytes(bytes);
assert_eq!(
blocks.len(),
expected_blocks,
"Q8_0 block count mismatch: got {} expected {}",
blocks.len(),
expected_blocks
);
Self {
data: TensorData::Q8_0(blocks),
shape,
}
}A Q8_0 tensor is always 2-D (a weight matrix), and its column count must be a multiple of 32, since blocks tile each row, giving cols / 32 blocks per row and rows × cols / 32 total. The asserts check both invariants, and that the byte count produced the expected number of blocks. The shape is still stored in logical terms (rows and columns of weights), even though the data is now blocks, so the rest of the engine sees a normal [rows, cols] matrix and never has to know it's quantized underneath.
as_f32_slice, which several places call, can only answer for FP32 tensors, so it gets an explicit panic for the Q8_0 case:
pub(crate) fn as_f32_slice(&self) -> &[f32] {
match &self.data {
TensorData::Fp32(v) => v,
_ => panic!("expected FP32 tensor, got {:?}", self.data),
}
}
}The Debug impl for Tensor also gains a Q8_0 byte-size arm (fmt_bytes(blocks_packed_byte_len(v))) so a quantized tensor reports its true on-the-wire size.
Loading Q8_0 tensors from the GGUF file
Way back in I.1, gguf-inspect showed most tensors carrying type tag 8. That tag is Q8_0. I.3 loaded only type 0 (FP32) and panicked on anything else; the checkpoint we've been running is FP32. Now src/gguf/gguf.rs learns to load type 8 directly:
8 => {
use crate::tensor::q8_0::{Q8_0_BLOCK_BYTES, Q8_0_BLOCK_SIZE};
assert_eq!(
numel % Q8_0_BLOCK_SIZE as u64,
0,
"Q8_0 tensor {name} numel {numel} not divisible by {Q8_0_BLOCK_SIZE}"
);
let block_count = (numel as usize) / Q8_0_BLOCK_SIZE;
let byte_count = block_count * Q8_0_BLOCK_BYTES;
let bytes = self.read_tensor_bytes(offset, byte_count);
Tensor::new_q8_0_from_bytes(&bytes, shape)
}This is a new arm in the match ggml_type block alongside the existing 0 => { ... } FP32 arm. The tensor has numel logical weights; in Q8_0 that's numel / 32 blocks at 34 bytes each. It reads exactly that many bytes off disk and hands them to new_q8_0_from_bytes. The crucial difference from the FP32 path: the FP32 arm expands every weight into an f32 as it reads; this arm keeps the bytes compact. A Q8_0 weight matrix now occupies in RAM what it occupied on disk, about a quarter of the FP32 footprint, and that smaller footprint is the entire decode win, because decode's cost is bytes moved.
Quantized matmul on the CPU
The forward pass now has FP32 activations (the running values flowing through the network, computed at runtime, still floats) multiplied against Q8_0 weights. We need a matmul that takes an FP32 left side and a Q8_0 right side. src/backend/cpu.rs gets two functions.
The core is matvec_q8_0, one vector times a Q8_0 matrix, which is exactly the decode case:
pub(crate) fn matvec_q8_0(
blocks: &[crate::tensor::q8_0::Q8_0Block],
x: &[f32],
rows: usize,
cols: usize,
) -> Result<Vec<f32>, String> {
use crate::tensor::q8_0::Q8_0_BLOCK_SIZE;
if cols % Q8_0_BLOCK_SIZE != 0 {
return Err(format!(
"q8_0 matvec: cols {cols} not multiple of {Q8_0_BLOCK_SIZE}"
));
}
let blocks_per_row = cols / Q8_0_BLOCK_SIZE;
let expected = rows * blocks_per_row;
if blocks.len() != expected {
return Err(format!(
"q8_0 matvec: blocks {} != rows {rows} * bpr {blocks_per_row}",
blocks.len()
));
}
if x.len() != cols {
return Err(format!("q8_0 matvec: x len {} != cols {cols}", x.len()));
}
let mut out = vec![0.0_f32; rows];
for r in 0..rows {
let mut sum = 0.0f32;
for b in 0..blocks_per_row {
let block = &blocks[r * blocks_per_row + b];
let scale = block.scale_f32();
let quants = block.quants();
let x_off = b * Q8_0_BLOCK_SIZE;
let mut block_sum = 0.0f32;
for j in 0..Q8_0_BLOCK_SIZE {
block_sum += quants[j] as f32 * x[x_off + j];
}
sum += scale * block_sum;
}
out[r] = sum;
}
Ok(out)
}After the validity checks, this computes out[r] = the dot product of input vector x with weight row r. Each weight row is blocks_per_row Q8_0 blocks. The inner loop over j is the part worth seeing: it accumulates quants[j] as f32 * x[x_off + j] (the integer quant times the float activation) and only after the 32-element block sum does it multiply once by the block's scale. This is the key efficiency of the format. The dequantization (turning quants back into real weight values) does not happen as a separate pass that materializes a Vec<f32> of expanded weights. It is fused into the dot product: one scalar multiply per 32-element block, not per weight. We never build the FP32 weight matrix in memory at all. The compact bytes go straight into the arithmetic, which is exactly what keeps the bandwidth win.
matmul_fp32_q8_0 wraps it for the general FP32-matrix × Q8_0-matrix case (prefill, where the left side has several rows):
pub(crate) fn matmul_fp32_q8_0(
a: &[f32],
a_shape: &[usize],
b_blocks: &[crate::tensor::q8_0::Q8_0Block],
b_shape: &[usize],
) -> Tensor {
use crate::tensor::q8_0::Q8_0_BLOCK_SIZE;
assert_eq!(a_shape.len(), 2);
assert_eq!(b_shape.len(), 2);
let m = a_shape[0];
let n = a_shape[1];
let p = b_shape[0];
let n_b = b_shape[1];
assert_eq!(n, n_b, "Q8_0 matmul inner dim mismatch");
assert_eq!(a.len(), m * n, "matmul_fp32_q8_0: a len");
let blocks_per_row = n / Q8_0_BLOCK_SIZE;
assert_eq!(
b_blocks.len(),
p * blocks_per_row,
"matmul_fp32_q8_0: b block count"
);
let mut out = Vec::with_capacity(m * p);
for b_idx in 0..m {
let x = &a[b_idx * n..(b_idx + 1) * n];
out.extend_from_slice(
&Self::matvec_q8_0(b_blocks, x, p, n).expect("matmul_fp32_q8_0 matvec failed"),
);
}
Tensor::new(out, vec![m, p])
}It runs matvec_q8_0 once per row of the FP32 left matrix a. One detail in the shapes: the Q8_0 weight matrix's logical shape is [p, n], p output features each a row of n input weights, so the inner dimension n matches a's column count, and the result is [m, p]. (Q8_0 weight tensors come in already row-major per output feature, so unlike the FP32 weights they don't need a transpose at load time; more on that below.)
The chapter also adds a unit test that exercises one hand-built block end to end:
#[cfg(test)]
mod q8_tests {
use super::{Backend, CpuBackend};
use crate::tensor::Tensor;
use crate::tensor::q8_0::Q8_0_BLOCK_BYTES;
#[test]
fn q8_0_matmul_one_block() {
let mut bytes = vec![0u8; Q8_0_BLOCK_BYTES];
bytes[0] = 0x00;
bytes[1] = 0x3c;
for i in 0..32 {
bytes[2 + i] = 1i8 as u8;
}
let w = Tensor::new_q8_0_from_bytes(&bytes, vec![1, 32]);
let x = Tensor::new(vec![2.0f32; 32], vec![1, 32]);
let cpu = CpuBackend;
let y = cpu.matmul(&x, &w);
assert_eq!(y.shape(), &[1, 1]);
assert!((y.as_f32_slice()[0] - 64.0).abs() < 1e-3);
}
}It constructs a single block: scale bytes 0x00, 0x3c, which is the f16 bit pattern for 1.0, and 32 quants all equal to 1. So the block's weights all decode to 1.0 × 1 = 1.0. Multiplied against an input vector of 32 twos, the dot product is 32 × (1.0 × 2.0) = 64.0, and the test checks exactly that. A tiny, exact, hand-verifiable proof that the format decoding and the fused matmul agree.
Quantized matmul through the backends
The CpuBackend::matmul dispatch grows arms for the new tensor-data combination:
(TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
CpuBackend::matmul_fp32_fp32(a_data, a_shape, b_data, b_shape)
}
(TensorData::Fp32(a_data), TensorData::Q8_0(b_blocks)) => {
CpuBackend::matmul_fp32_q8_0(a_data, a_shape, b_blocks, b_shape)
}
_ => panic!("matmul: LHS must be FP32"),matmul now matches on the pair of data types. FP32 × FP32 is the Act 1 path; FP32 × Q8_0 is the new quantized path. The _ arm rejects anything else: the left operand (the activations) is always genuine FP32; only the right operand (the weights) is ever quantized.
gather_rows needs the same treatment, because the token embedding table is also a Q8_0 tensor and looking a token up means reading and dequantizing one of its rows:
match table.as_data() {
TensorData::Fp32(data) => CpuBackend::gather_rows_fp32(data, shape, row_indices),
TensorData::Q8_0(blocks) => CpuBackend::gather_rows_q8_0(blocks, shape, row_indices),
}gather_rows_q8_0 is straightforward: for each requested row index, walk its blocks and push quant × scale for every weight, producing an ordinary FP32 row. (transpose_2d gets a Q8_0(_) => panic!(...) arm too: quantized tensors are never transposed at runtime.)
The SimdCpu backend handles Q8_0 by simply delegating to the CPU's quantized matmul:
#[inline(always)]
fn matmul_fp32_q8_0(
&self,
a: &[f32],
a_shape: &[usize],
b_blocks: &[crate::tensor::q8_0::Q8_0Block],
b_shape: &[usize],
) -> Tensor {
CpuBackend::matmul_fp32_q8_0(a, a_shape, b_blocks, b_shape)
}and routes to it from matmul:
let a_data = match a.as_data() {
TensorData::Fp32(d) => d,
TensorData::Q8_0(_) => panic!("matmul: LHS must be FP32"),
};
match b.as_data() {
TensorData::Q8_0(blocks) => {
assert_eq!(n, b.shape()[1], "Q8_0 matmul inner dim mismatch");
self.matmul_fp32_q8_0(a_data, a_shape, blocks, b.shape())
}
TensorData::Fp32(b_data) => {
let p = b.shape()[1];
assert_eq!(n, b.shape()[0], "tensor shape mismatch");
self.matmul_fp32_fp32(a_data, b_data, m, n, p)
}
}The Parallel backend gets a real parallel implementation, the same row-distribution trick from II.4 applied to quantized matmul:
fn matmul_fp32_q8_0(
&self,
a_data: &[f32],
blocks: &[Q8_0Block],
m: usize,
n: usize,
p: usize,
) -> Tensor {
assert_eq!(a_data.len(), m * n, "matmul_fp32_q8_0: len(a) must be m*n");
let mut data = vec![0.0f32; m * p];
data.par_chunks_mut(p).enumerate().for_each(|(i, out_row)| {
let x = &a_data[i * n..(i + 1) * n];
let row =
CpuBackend::matvec_q8_0(blocks, x, p, n).expect("parallel_cpu Q8 matvec failed");
out_row.copy_from_slice(&row);
});
Tensor::new(data, vec![m, p])
}par_chunks_mut(p) cuts the output into rows and runs one matvec_q8_0 per row across all cores: quantized matmul, parallelized. It's wired into Parallel::matmul with the same (Fp32, Q8_0) match arm.
Quantized matmul on the GPU
Metal gets a Q8_0 path too: a second kernel that does the matrix-vector product directly against quantized blocks. The MSL kernel, added to src/backend/metal/shaders.rs:
constant uint BLOCK_QLEN = 32u;
constant uint BLOCK_BYTES = 34u;
constant uint SIMD_SIZE = 32u;
inline float decode_scale(device const uchar* base) {
ushort bits = ushort(base[0]) | (ushort(base[1]) << 8);
return float(as_type<half>(bits));
}
kernel void matvec_fp32_q8_0(
device const uchar* blocks [[buffer(0)]],
device const float* x [[buffer(1)]],
device float* out [[buffer(2)]],
constant uint& rows [[buffer(3)]],
constant uint& cols [[buffer(4)]],
uint3 tg_id [[threadgroup_position_in_grid]],
uint lid [[thread_index_in_threadgroup]]
) {
uint row = tg_id.x;
if (row >= rows) return;
uint blocks_per_row = cols / BLOCK_QLEN;
uint row_block_start = row * blocks_per_row;
float acc = 0.0f;
for (uint b = lid; b < blocks_per_row; b += SIMD_SIZE) {
uint byte_off = (row_block_start + b) * BLOCK_BYTES;
float scale = decode_scale(blocks + byte_off);
device const uchar* qptr = blocks + byte_off + 2;
uint sx_off = b * BLOCK_QLEN;
float4 sum4 = float4(0.0f);
sum4 += float4(float(int(char(qptr[0]))), float(int(char(qptr[1]))), float(int(char(qptr[2]))), float(int(char(qptr[3])))) * float4(x[sx_off + 0], x[sx_off + 1], x[sx_off + 2], x[sx_off + 3]);
sum4 += float4(float(int(char(qptr[4]))), float(int(char(qptr[5]))), float(int(char(qptr[6]))), float(int(char(qptr[7])))) * float4(x[sx_off + 4], x[sx_off + 5], x[sx_off + 6], x[sx_off + 7]);
sum4 += float4(float(int(char(qptr[8]))), float(int(char(qptr[9]))), float(int(char(qptr[10]))), float(int(char(qptr[11])))) * float4(x[sx_off + 8], x[sx_off + 9], x[sx_off + 10], x[sx_off + 11]);
sum4 += float4(float(int(char(qptr[12]))), float(int(char(qptr[13]))), float(int(char(qptr[14]))), float(int(char(qptr[15])))) * float4(x[sx_off + 12], x[sx_off + 13], x[sx_off + 14], x[sx_off + 15]);
sum4 += float4(float(int(char(qptr[16]))), float(int(char(qptr[17]))), float(int(char(qptr[18]))), float(int(char(qptr[19])))) * float4(x[sx_off + 16], x[sx_off + 17], x[sx_off + 18], x[sx_off + 19]);
sum4 += float4(float(int(char(qptr[20]))), float(int(char(qptr[21]))), float(int(char(qptr[22]))), float(int(char(qptr[23])))) * float4(x[sx_off + 20], x[sx_off + 21], x[sx_off + 22], x[sx_off + 23]);
sum4 += float4(float(int(char(qptr[24]))), float(int(char(qptr[25]))), float(int(char(qptr[26]))), float(int(char(qptr[27])))) * float4(x[sx_off + 24], x[sx_off + 25], x[sx_off + 26], x[sx_off + 27]);
sum4 += float4(float(int(char(qptr[28]))), float(int(char(qptr[29]))), float(int(char(qptr[30]))), float(int(char(qptr[31])))) * float4(x[sx_off + 28], x[sx_off + 29], x[sx_off + 30], x[sx_off + 31]);
acc += scale * (sum4.x + sum4.y + sum4.z + sum4.w);
}
acc = simd_sum(acc);
if (lid == 0) {
out[row] = acc;
}
}This kernel computes one output element per threadgroup: threadgroup row produces out[row]. Inside, the 32 threads of the group (SIMD_SIZE) split the row's blocks among themselves: thread lid handles blocks lid, lid + 32, lid + 64, … . Each thread decodes its block's f16 scale (with decode_scale, the GPU mirror of scale_f32), unpacks the 32 signed quants (int(char(...)) reinterprets a byte as a signed value), multiplies them lane-wise against the activations using float4 SIMD, and folds in the scale once per block, exactly the fused-dequant pattern from the CPU. Then simd_sum reduces the 32 threads' partial sums into one, and thread 0 writes it out. The kernel reads the compact 34-byte blocks straight from GPU memory; no expanded weight matrix is ever materialized on the GPU either.
context.rs registers the kernel and adds a dispatch wrapper, matvec_fp32_q8_0, that wraps the block bytes and activations as buffers and launches one threadgroup of 32 threads per output row. backend.rs gets a matmul_fp32_q8_0 method and an (Fp32, Q8_0) arm in matmul with the same MIN_M_FOR_GPU_MATMUL threshold logic from II.5: big quantized matmuls go to the GPU, small ones fall back to the SIMD CPU.
One small change in the forward pass
The model code barely changes. That's the payoff of the Tensor/Backend abstraction. Two spots in src/model/qwen3/forward.rs need to handle a Q8_0 tensor where they previously assumed FP32.
FP32 weight matrices have to be transposed at load time so matmul sees them in the right orientation. Q8_0 weight tensors already arrive row-major per output feature (and transpose_2d would panic on them anyway), so loading one is just: use it as is. Both spots are a single new match arm:
match embed.as_data() {
TensorData::Fp32(_) => ops.transpose_2d(&embed),
TensorData::Q8_0(_) => embed.clone(),
} match t.as_data() {
TensorData::Fp32(_) => ops.transpose_2d(&t),
TensorData::Q8_0(_) => t,
}The first is in the lm-head setup (when the output weights are tied to the embedding table); the second is in load_ggml_weight_for_matmul_rhs, the per-layer weight loader. That is the entire change to the model's forward pass. The 28-layer attention-and-MLP loop, every backend dispatch, the KV cache: none of it knows or cares that the weights are now 8-bit. Each matmul call sees a Tensor, and the Backend quietly picks the FP32 or Q8_0 kernel based on the data variant. The abstraction held.
Running it
Run the same Q8_0 checkpoint gguf-inspect showed back in I.1. The engine now loads its type-8 tensors directly instead of expanding them:
cargo run --release --bin model-generate -- --kv --backend simd path/to/qwen3-0.6b-q8_0.gguf "Once upon a time" 64Against the FP32 run from II.3 on the same backend:
backend: simd
kv cache: basic
-- FP32 weights --
time_to_first_token_ms: 78.221
decode_tokens_per_second: 271.402
-- Q8_0 weights --
time_to_first_token_ms: 71.004
decode_tokens_per_second: 503.918Decode throughput nearly doubles, from ~271 to ~504 tokens/sec, and the reason is precisely the one the act intro named. Decode reads every weight once per token; the weights now occupy roughly a quarter of the bytes; decode is bandwidth-bound, so a fraction of the bytes means a fraction of the time. (It's a ~2× decode win rather than ~4× because activations, the KV cache, and overhead are still FP32, but the weights, which are the bulk of the traffic, shrank, and that's what shows up in the number.) Time to first token barely moves: prefill is compute-bound, dequantization adds a little arithmetic, and the two roughly cancel. The win lands exactly where the model said it would: on decode.
Where this leaves us, and the end of Act 2
Q8_0 quantization is the last rung. By reading 8-bit weights instead of 32-bit ones, decode finally got fast. It was the half of inference that no faster arithmetic could touch, because its true bottleneck was always bytes.
Step back and look at the whole ladder. The benchmark harness made every claim measurable. The KV cache fixed the algorithm: decode became constant work per token instead of growing work. SIMD made each instruction do 16 floats; threads used every core; Metal moved big matmuls to the GPU. Those were three compounding wins for the compute-bound prefill. And Q8_0 cut decode's memory traffic. A single request now runs end to end fast, and every speedup between here and the Act 1 baseline traces to one specific chapter and one specific bottleneck.
What we still cannot do: serve this to anyone. There is one CLI binary, one prompt at a time, no second concurrent request, no HTTP, no chat formatting. The Act 2 recap takes stock of the speedups, and then Act 3 turns this fast single-user engine into a real server.