I.4: Backend

We have a Tensor, a flat buffer of floats with a shape. We can load real weights into one. But a tensor that just sits there is useless: a transformer is computation, and so far we cannot compute anything. This chapter fixes that. It introduces the Backend: the layer that does the actual arithmetic.

There is a design decision baked into this chapter that pays off for the rest of the series, so it's worth stating up front. We do not write the transformer's math directly against Tensor. Instead we define a trait, Backend, that lists every numeric operation a transformer needs (matrix multiply, softmax, the pieces of a normalization, an activation function, and so on) and then we write the model in I.5 purely in terms of that trait. The model never touches a raw float; it only ever calls backend.matmul(...), backend.softmax_rows(...).

Why bother? Because Act 2 is entirely about making those operations fast: SIMD, threads, a GPU. Each of those is a new implementation of the same trait. The model code, written once against the trait, never changes. Swapping in the GPU backend is constructing a different object; the forward pass is none the wiser. This chapter writes the trait and one implementation: CpuBackend, the plainest possible scalar version, straightforward loops, one core, no tricks. It will be slow, and that is the point: it is the correct, readable baseline that every Act 2 optimization is measured against and validated against.

The operations a transformer needs

Before the code, the menu. A decoder transformer, when you strip it to arithmetic, needs surprisingly few distinct operations. Here is the whole list, each with a one-line gloss. We'll meet them properly as the model gets built in I.5, but it helps to see the shape of the thing now.

Matrix multiply (matmul). The dominant cost of inference. Every weight matrix is applied by multiplying it against the activations. If you make one thing fast in Act 2, it's this.
Elementwise add, multiply, scale (add, hadamard, scale, add_scalar). Adding two tensors position by position; multiplying two tensors position by position (the "Hadamard" product); multiplying or adding a constant. Residual connections and normalization are built from these.
Activation (silu). A nonlinear function applied to every element. Without a nonlinearity a deep stack of matrix multiplies would collapse into a single one; the activation is what lets the model learn non-trivial functions.
Softmax (softmax_rows). Turns a row of arbitrary numbers into a probability distribution: all positive, summing to one. Attention uses it to decide how much weight each token gets.
Normalization pieces (rsqrt_elem, sum_squares_axis, broadcast_row_scalars). The arithmetic of RMSNorm, which keeps activations from drifting to extreme magnitudes as they pass through dozens of layers.
Reshaping and rearranging (transpose_2d, reshape_data). Swapping the two axes of a matrix; reinterpreting a buffer under a new shape. Plumbing, but necessary plumbing.
Gather (gather_rows). Pick out specified rows of a matrix. This is how token embedding works: the embedding table is a big matrix with one learned row of numbers per vocabulary token (that row is the token's embedding), and a token id is a row index into it.

That's it. A baker's dozen of primitives, and a transformer is built.

The trait

The backend module: backend_trait.rs for the trait, cpu.rs for the implementation, mod.rs to wire them:

RUST

mod backend_trait;
pub(crate) mod cpu;
 
pub use backend_trait::Backend;

The trait itself is just signatures: a list of method shapes, no bodies:

RUST

use crate::tensor::Tensor;
 
pub trait Backend: Send + Sync {
    fn name(&self) -> String;
 
    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor;
    fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor;
    fn silu(&self, x: &Tensor) -> Tensor;
    fn scale(&self, x: &Tensor, s: f32) -> Tensor;
    fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor;
    fn rsqrt_elem(&self, x: &Tensor) -> Tensor;
    fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor;
    fn broadcast_row_scalars(&self, t: &Tensor, d: usize) -> Tensor;
    fn transpose_2d(&self, a: &Tensor) -> Tensor;
    fn softmax_rows(&self, x: &Tensor) -> Tensor;
 
    fn reshape_data(&self, x: &Tensor, shape: Vec<usize>) -> Tensor;
 
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor;
 
    fn gather_rows(&self, table: &Tensor, row_indices: &[usize]) -> Tensor;
}

Every method takes &self and tensors, and returns a fresh Tensor. Nothing is mutated in place; each operation produces a new tensor. That is not the fastest possible style (it allocates on every call) but it is the simplest to reason about and the simplest to get correct, and correctness is Act 1's only job. name returns a human-readable backend name, handy for logs when there are several backends to choose from later.

Send + Sync again: a backend will be shared across request-handling threads in Act 3, so it must be safe to do so. A pure-computation object with no mutable state satisfies this for free.

That's the entire contract. Anything that wants to be a backend implements these methods. Now the one that does.

The CPU backend

CpuBackend is a zero-sized struct; it holds no state, it's just a name to hang the implementation on:

RUST

use super::Backend;
use crate::tensor::{Tensor, TensorData};
 
pub struct CpuBackend;

It needs to reach inside Tensor to get at the raw floats, so before the implementation we add a few accessors to the Tensor type. The tensor module file re-exports TensorData within the crate:

RUST

mod tensor;
 
pub use tensor::Tensor;
pub(crate) use tensor::TensorData;

And tensor.rs gains five methods:

RUST

    pub fn shape(&self) -> &[usize] {
        &self.shape
    }
 
    pub fn numel(&self) -> usize {
        self.shape.iter().product()
    }
 
    pub fn shape_vec(&self) -> Vec<usize> {
        self.shape.clone()
    }
 
    #[inline]
    pub(crate) fn as_data(&self) -> &TensorData {
        &self.data
    }
 
    pub(crate) fn as_f32_slice(&self) -> &[f32] {
        match &self.data {
            TensorData::Fp32(v) => v,
        }
    }

shape borrows the shape, numel is the element count, shape_vec clones the shape (needed when an operation produces an output with the same shape as its input). as_data exposes the TensorData enum so a kernel can match on the variant. as_f32_slice is the workhorse: it unwraps straight to the underlying &[f32], which is the form every scalar loop wants. (When Q8_0 arrives in Act 2, as_f32_slice will be the thing that doesn't apply to a quantized tensor; for now there's only one variant and it always succeeds.)

Two helper kernels

CpuBackend has a small private impl block with two helpers that the trait methods delegate to. First, matrix multiply for two FP32 tensors:

RUST

impl CpuBackend {
    pub(crate) fn matmul_fp32_fp32(
        a: &[f32],
        a_shape: &[usize],
        b: &[f32],
        b_shape: &[usize],
    ) -> Tensor {
        assert_eq!(a_shape.len(), 2);
        assert_eq!(b_shape.len(), 2);
        let m = a_shape[0];
        let n = a_shape[1];
        let n_b = b_shape[0];
        let p = b_shape[1];
        assert_eq!(n, n_b, "tensor shape mismatch");
        assert_eq!(a.len(), m * n, "matmul_fp32_fp32: a len");
        assert_eq!(b.len(), n * p, "matmul_fp32_fp32: b len");
        let mut data = vec![0.0f32; m * p];
        matmul_rowmajor_blocked(a, b, m, n, p, &mut data, 0, m);
        Tensor::new(data, vec![m, p])
    }

Matrix multiply: an m × n matrix times an n × p matrix gives an m × p result, where output element [i][k] is the dot product of row i of the first matrix with column k of the second. The n of the first must equal the n of the second; that's the assert_eq!(n, n_b, ...). This function just unpacks shapes, checks them, allocates the m * p output, and hands the flat slices to matmul_rowmajor_blocked (defined below) to do the loop.

Second helper, gather_rows:

RUST

    fn gather_rows_fp32(src: &[f32], table_shape: &[usize], row_indices: &[usize]) -> Tensor {
        assert_eq!(table_shape.len(), 2);
        let num_rows = table_shape[0];
        let cols = table_shape[1];
        assert_eq!(
            src.len(),
            num_rows * cols,
            "gather_rows_fp32: len(src) must match table shape"
        );
        let mut data = vec![0.0f32; row_indices.len() * cols];
        for (out_i, &row) in row_indices.iter().enumerate() {
            assert!(
                row < num_rows,
                "gather_rows: index {row} out of range {num_rows}"
            );
            let off = row * cols;
            data[out_i * cols..(out_i + 1) * cols].copy_from_slice(&src[off..off + cols]);
        }
        Tensor::new(data, vec![row_indices.len(), cols])
    }
}

Given a num_rows × cols table and a list of row indices, this builds a new matrix containing just those rows, in the order asked for. The row-major layout makes it trivial: row r of the table is the contiguous slice src[r*cols .. (r+1)*cols], so each gathered row is a single copy_from_slice. This is exactly token embedding: the embedding table is vocab_size × hidden_size, and gathering rows [9707, 11, 1879] produces the embedding vectors for those three tokens.

The blocked matmul loop

The actual multiply lives in a free function at the bottom of the file:

RUST

pub(crate) fn matmul_rowmajor_blocked(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    p: usize,
    out: &mut [f32],
    row_start: usize,
    row_end: usize,
) {
    assert_eq!(a.len(), m * n);
    assert_eq!(b.len(), n * p);
    assert_eq!(out.len(), m * p);
    assert!(row_start <= row_end && row_end <= m);
 
    for i in row_start..row_end {
        let o = i * p;
        out[o..o + p].fill(0.0);
    }
    for j in 0..n {
        let b_off = j * p;
        let b_row = &b[b_off..b_off + p];
        for i in row_start..row_end {
            let aij = a[i * n + j];
            let o = i * p;
            let row = &mut out[o..o + p];
            for k in 0..p {
                row[k] += aij * b_row[k];
            }
        }
    }
}

This computes the multiply for output rows row_start..row_end. Taking a row range rather than always doing all of m is forward planning: in II.4 we hand different row ranges to different threads, and this is the function each thread calls. Here row_start, row_end are always 0, m.

The loop order matters. The textbook way to write matmul is "for each output element, dot-product a row and a column." We don't do that. The column of b is not contiguous in row-major memory; striding down a column thrashes the cache. Instead this is the j-i-k order: the outer loop is over n (the shared dimension), and the inner work is out[i] += a[i][j] * b[j] where b[j] is a whole contiguous row of b. Both b_row and the output row are walked contiguously in the innermost k loop. That keeps memory access sequential, which is what the CPU cache and prefetcher want, and it's why even this "scalar baseline" isn't catastrophically slow. (out is zeroed first because the loop accumulates into it.)

Two panels comparing matmul loop orders: in ijk order the B matrix is walked down columns, strided and cache-hostile, while in jik order B is walked along rows, contiguous and cache-friendly

Figure: why the loop order matters. Same arithmetic either way. The textbook i-j-k order strides down columns of b; the j-i-k order walks whole rows of b and out in consecutive memory.

It is still single-threaded, still scalar: one float multiplied at a time. Act 2 attacks exactly this loop with SIMD and threads.

Implementing the trait

Now impl Backend for CpuBackend. Most methods are short scalar loops; we'll go through them in groups. The elementwise ones first:

RUST

impl Backend for CpuBackend {
    fn name(&self) -> String {
        "cpu".to_string()
    }
 
    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape(), b.shape(), "tensor shape mismatch");
        let n = a.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            data[i] = a.as_f32_slice()[i] + b.as_f32_slice()[i];
        }
        Tensor::new(data, a.shape_vec())
    }
 
    fn hadamard(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape(), b.shape(), "tensor shape mismatch");
        let n = a.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            data[i] = a.as_f32_slice()[i] * b.as_f32_slice()[i];
        }
        Tensor::new(data, a.shape_vec())
    }

add and hadamard are the two elementwise binary operations: position-by-position sum and position-by-position product. Both require the two inputs to have identical shapes (the assert_eq!), then walk the flat buffers in lockstep. The output keeps the input shape. "Hadamard" is just the formal name for elementwise multiply, distinct from matmul, which is the dot-product kind.

RUST

    fn silu(&self, x: &Tensor) -> Tensor {
        let n = x.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            let xi = x.as_f32_slice()[i];
            data[i] = xi * (1.0 / (1.0 + (-xi).exp()));
        }
        Tensor::new(data, x.shape_vec())
    }

silu is the activation function: the per-element nonlinearity. Its formula is x · σ(x) where σ(x) = 1/(1+e^-x) is the sigmoid. Intuitively: for large positive x it acts like the identity (σ approaches 1), for large negative x it squashes toward zero, and around zero it's a smooth dip. The transformer's MLP uses it; I.5 shows how. The only thing that matters for the backend is that it's applied independently to every element, an embarrassingly parallel loop.

RUST

    fn scale(&self, x: &Tensor, s: f32) -> Tensor {
        let n = x.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            data[i] = x.as_f32_slice()[i] * s;
        }
        Tensor::new(data, x.shape_vec())
    }
 
    fn add_scalar(&self, a: &Tensor, s: f32) -> Tensor {
        let n = a.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            data[i] = a.as_f32_slice()[i] + s;
        }
        Tensor::new(data, a.shape_vec())
    }
 
    fn rsqrt_elem(&self, x: &Tensor) -> Tensor {
        let n = x.as_f32_slice().len();
        let mut data = vec![0.0f32; n];
        for i in 0..n {
            data[i] = 1.0 / x.as_f32_slice()[i].sqrt();
        }
        Tensor::new(data, x.shape_vec())
    }

scale multiplies every element by a constant; add_scalar adds a constant to every element; rsqrt_elem replaces every element with its reciprocal square root, 1/√x. The last one looks oddly specific; it exists because it is one step of RMSNorm, which we'll assemble from these primitives in I.5. The pattern is: keep each backend method tiny and general, build the layer logic out of them.

Normalization helpers

The next three are the less-obvious pieces of RMSNorm:

RUST

    fn sum_squares_axis(&self, x: &Tensor, axis: usize) -> Tensor {
        assert_eq!(x.shape().len(), 2, "sum_squares_axis only supports 2-D");
        assert!(axis < 2, "axis out of bounds");
        if axis == 1 {
            let rows = x.shape()[0];
            let cols = x.shape()[1];
            let mut data = vec![0.0f32; rows];
            let src = x.as_f32_slice();
            for r in 0..rows {
                let mut sum = 0.0f32;
                for c in 0..cols {
                    let v = src[r * cols + c];
                    sum += v * v;
                }
                data[r] = sum;
            }
            Tensor::new(data, vec![rows])
        } else {
            let rows = x.shape()[0];
            let cols = x.shape()[1];
            let mut data = vec![0.0f32; cols];
            let src = x.as_f32_slice();
            for c in 0..cols {
                let mut sum = 0.0f32;
                for r in 0..rows {
                    let v = src[r * cols + c];
                    sum += v * v;
                }
                data[c] = sum;
            }
            Tensor::new(data, vec![cols])
        }
    }

sum_squares_axis collapses one axis of a 2-D tensor by summing squares along it. With axis == 1 it produces one number per row (the sum of the squares of that row's elements); with axis == 0, one per column. RMSNorm needs the per-row version: the "RMS" in RMSNorm is the root-mean-square of a row, and the sum of squares is the first step of computing it. Notice the two branches differ only in memory access pattern: the axis == 1 branch walks each row contiguously (fast); the axis == 0 branch strides down columns (slower). The model only uses axis == 1.

RUST

    fn broadcast_row_scalars(&self, t: &Tensor, d: usize) -> Tensor {
        assert_eq!(t.shape().len(), 1, "broadcast_row_scalars expects 1-D input");
        let rows = t.shape()[0];
        let mut data = vec![0.0f32; rows * d];
        let src = t.as_f32_slice();
        for r in 0..rows {
            let s = src[r];
            for c in 0..d {
                data[r * d + c] = s;
            }
        }
        Tensor::new(data, vec![rows, d])
    }

broadcast_row_scalars takes a 1-D tensor of rows values and a width d, and produces a rows × d matrix where every element of row r is the single value t[r]. This is broadcasting: turning a per-row scalar into a full matrix so it can be multiplied elementwise against another rows × d matrix. RMSNorm computes one normalization factor per row, then needs to apply it to every element of that row; broadcast_row_scalars is how the per-row factor becomes a full-shaped tensor that hadamard can consume.

RUST

    fn transpose_2d(&self, a: &Tensor) -> Tensor {
        assert_eq!(a.shape().len(), 2, "transpose_2d only supports 2-D");
        match a.as_data() {
            TensorData::Fp32(src) => {
                let m = a.shape()[0];
                let n = a.shape()[1];
                let mut data = vec![0.0f32; m * n];
                for i in 0..m {
                    for j in 0..n {
                        data[j * m + i] = src[i * n + j];
                    }
                }
                Tensor::new(data, vec![n, m])
            }
        }
    }

transpose_2d swaps the two axes: an m × n matrix becomes n × m, with element [i][j] moving to [j][i]. In flat-buffer terms, the element at i*n + j moves to j*m + i. We need this because GGUF stores weight matrices in one orientation and our matmul wants the other; I.5's weight loader calls transpose_2d once per matrix at load time, so the forward pass never pays for it.

Softmax

RUST

    fn softmax_rows(&self, x: &Tensor) -> Tensor {
        assert_eq!(x.shape().len(), 2, "softmax_rows only supports 2-D");
        let rows = x.shape()[0];
        let cols = x.shape()[1];
        let mut data = vec![0.0f32; rows * cols];
        let src = x.as_f32_slice();
        for r in 0..rows {
            let off = r * cols;
            let row = &src[off..off + cols];
            let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let mut sum = 0.0f32;
            for c in 0..cols {
                let e = (src[off + c] - max).exp();
                data[off + c] = e;
                sum += e;
            }
            for c in 0..cols {
                data[off + c] /= sum;
            }
        }
        Tensor::new(data, x.shape_vec())
    }

softmax_rows turns each row of a matrix into a probability distribution. The plain definition of softmax is "exponentiate every element, then divide by the total" (exp(x_i) / Σ exp(x_j)). That makes everything positive and makes the row sum to 1, with larger inputs getting exponentially larger shares. But the naive version is numerically dangerous: exp of a moderately large number overflows to infinity. The fix, used here, is to subtract the row's maximum from every element before exponentiating. exp(x_i - max) is always exp of something ≤ 0, so it's always in (0, 1] with no overflow, and the result is mathematically identical because the subtracted constant cancels in the division. This "subtract the max" trick is standard; every serious softmax does it. Attention will lean on this method heavily in I.5.

Matmul, gather, reshape

The last three trait methods are thin wrappers over the helpers we already wrote:

RUST

    fn matmul(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.shape().len(), 2);
        assert_eq!(b.shape().len(), 2);
        let a_shape = a.shape();
        let b_shape = b.shape();
        match (a.as_data(), b.as_data()) {
            (TensorData::Fp32(a_data), TensorData::Fp32(b_data)) => {
                CpuBackend::matmul_fp32_fp32(a_data, a_shape, b_data, b_shape)
            }
        }
    }
 
    fn gather_rows(&self, table: &Tensor, row_indices: &[usize]) -> Tensor {
        assert_eq!(table.shape().len(), 2);
        let shape = table.shape();
        match table.as_data() {
            TensorData::Fp32(data) => CpuBackend::gather_rows_fp32(data, shape, row_indices),
        }
    }
 
    fn reshape_data(&self, x: &Tensor, shape: Vec<usize>) -> Tensor {
        let expected: usize = shape.iter().product();
        assert_eq!(x.numel(), expected, "reshape size mismatch");
        Tensor::new(x.as_f32_slice().to_vec(), shape)
    }
}

matmul and gather_rows match on the TensorData variant and dispatch to the FP32 helpers. The match looks unnecessary with only one variant, but it's the seam: when Q8_0 arrives, a quantized matmul is a new arm here, and the model code calling backend.matmul(...) doesn't change. reshape_data reinterprets a buffer under a new shape; it checks the element count is unchanged, then makes a fresh tensor with the same data and the new shape. (It copies the data; an in-place reshape would be possible, but copying keeps the "every operation returns a fresh tensor" rule uniform.)

src/lib.rs adds mod backend; and re-exports Backend; the model in the next chapter will be written entirely against it.

Where this leaves us

There is no binary to run this chapter; a backend is a library component, not a program. What we have is the engine's compute layer: a Backend trait enumerating every numeric operation a transformer needs, and CpuBackend, a complete, correct, scalar implementation of all of them. Plain loops, one core, one float at a time. Slow, and meant to be.

The payoff is structural. The model in the next chapter is written purely against the Backend trait; it calls matmul, softmax_rows, silu, never a raw float. Which means I.5 can build the entire 28-layer Qwen3 forward pass without once thinking about performance, and Act 2 can make every one of these operations dramatically faster without touching a line of the model. The trait is the seam, and it's now in place.