I.5: Qwen3 forward

This is the chapter where the model becomes a model. We have a tokenizer, a tensor type, and a backend full of numeric operations. What we don't have is the thing that arranges those operations into Qwen3: the actual transformer. That's this chapter: the forward pass.

A forward pass is one trip through the network. You hand it a list of token ids (the prompt) and it hands back, for each position, a vector of logits: one score per vocabulary entry, saying how strongly the model predicts that token comes next. Run it once and you have everything needed to pick the next token. Run it repeatedly, feeding each prediction back in, and you have text generation, but that loop is I.6. Here we build the single pass.

We're going to develop the architecture from scratch. If "attention," "RMSNorm," "RoPE," and "SwiGLU" mean nothing to you, good. They'll be built up one at a time, with the math, as we hit each one. By the end you'll have written every line of a real transformer and understand every number that moves through it.

The shape of a decoder transformer

Qwen3 0.6B is a decoder-only transformer, the same family as GPT. Here is the whole thing, top to bottom:

PLAINTEXT

token ids ──> embedding lookup ──> x  (seq × hidden)
                                    │
        ┌───────────────────────────┤  repeated 28 times
        │   x ──> RMSNorm ──> attention ──┐
        │   x ◄──────────── add ◄─────────┘   (residual)
        │   x ──> RMSNorm ──> SwiGLU MLP ──┐
        │   x ◄──────────── add ◄──────────┘   (residual)
        └───────────────────────────┐
                                    │
                          final RMSNorm
                                    │
                           output head (matmul)
                                    │
                          logits  (seq × vocab)

The structure is regular: an embedding at the front, a stack of 28 identical transformer blocks, a final norm, and an output projection. Each block does two things (an attention sub-layer and an MLP sub-layer) and each of those is wrapped the same way: normalize the input, do the work, add the result back to what you started with. That "add it back" is the residual connection, and it's what lets you stack 28 layers without the signal degrading: each layer only has to compute a correction to x, not a whole new x.

x, the thing flowing down the diagram, is a matrix, seq × hidden: one row per token in the prompt, each row a hidden-dimensional vector (1024 for this model). It enters as raw embeddings and leaves as something the output head can turn into logits. Every block refines it.
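The "normalize, do the work, add it back" wrapper can be sketched on a single row before any real layer exists. This is a minimal illustration, not the chapter's code: residual_step is a hypothetical helper, and the norm and sublayer closures stand in for RMSNorm and for attention or the MLP.

```rust
/// One pre-norm residual step on a single row: x + sublayer(norm(x)).
/// `norm` and `sublayer` are placeholders for RMSNorm and attention/MLP.
fn residual_step(
    x: &[f32],
    norm: impl Fn(&[f32]) -> Vec<f32>,
    sublayer: impl Fn(&[f32]) -> Vec<f32>,
) -> Vec<f32> {
    // The sub-layer only ever sees the normalized copy of x.
    let update = sublayer(&norm(x));
    assert_eq!(update.len(), x.len());
    // The residual add: the sub-layer's output is a correction to x.
    x.iter().zip(&update).map(|(a, b)| a + b).collect()
}
```

If the sub-layer returns all zeros, x passes through unchanged, which is the sense in which each layer only computes a correction.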

We'll build the pieces bottom-up (normalization, attention, MLP) then assemble them.

New backend operations

The forward pass needs a few operations the I.4 backend didn't have. They go on the Backend trait:

RUST

    fn fill_strict_upper_tri(&self, x: &Tensor, value: f32) -> Tensor;
 
    fn copy_2d_from_cols(&self, src: &Tensor, w: usize, col_offset: usize) -> Tensor;
 
    fn copy_2d_into_cols(&self, dst: &mut [f32], dst_cols: usize, src: &Tensor, col_offset: usize);
 
    fn repeat_row_as_matrix(&self, weight: &Tensor, rows: usize) -> Tensor;
 
    fn apply_rope(&self, x: &Tensor, head_dim: usize, rope_theta: f32) -> Tensor;

fill_strict_upper_tri overwrites the strictly-upper-triangular part of a square matrix with a value (that's the causal mask, explained when we get to attention). copy_2d_from_cols and copy_2d_into_cols slice a contiguous block of columns out of a matrix and write one back, the plumbing that lets us treat one wide matrix as several side-by-side "heads." repeat_row_as_matrix stacks a single vector into a matrix with that vector as every row, for broadcasting RMSNorm's weight. apply_rope applies the rotary position encoding, the subject of its own section.

The CPU implementations are mostly index arithmetic. The four copy/repeat/mask ones first:

RUST

    fn fill_strict_upper_tri(&self, x: &Tensor, value: f32) -> Tensor {
        assert_eq!(x.shape().len(), 2);
        let n = x.shape()[0];
        assert_eq!(x.shape()[1], n);
        let mut data = x.as_f32_slice().to_vec();
        for i in 0..n {
            for j in (i + 1)..n {
                data[i * n + j] = value;
            }
        }
        Tensor::new(data, x.shape_vec())
    }
 
    fn copy_2d_from_cols(&self, src: &Tensor, w: usize, col_offset: usize) -> Tensor {
        assert_eq!(src.shape().len(), 2);
        let seq = src.shape()[0];
        let src_cols = src.shape()[1];
        assert!(col_offset + w <= src_cols);
        let mut data = vec![0.0f32; seq * w];
        for s in 0..seq {
            let from = s * src_cols + col_offset;
            let to = s * w;
            data[to..to + w].copy_from_slice(&src.as_f32_slice()[from..from + w]);
        }
        Tensor::new(data, vec![seq, w])
    }
 
    fn copy_2d_into_cols(&self, dst: &mut [f32], dst_cols: usize, src: &Tensor, col_offset: usize) {
        assert_eq!(src.shape().len(), 2);
        let seq = src.shape()[0];
        let w = src.shape()[1];
        assert_eq!(dst.len(), seq * dst_cols);
        assert!(col_offset + w <= dst_cols);
        for s in 0..seq {
            let dst_start = s * dst_cols + col_offset;
            let src_start = s * w;
            dst[dst_start..dst_start + w]
                .copy_from_slice(&src.as_f32_slice()[src_start..src_start + w]);
        }
    }
 
    fn repeat_row_as_matrix(&self, weight: &Tensor, rows: usize) -> Tensor {
        assert_eq!(weight.shape().len(), 1);
        let d = weight.shape()[0];
        let mut data = vec![0.0f32; rows * d];
        for r in 0..rows {
            data[r * d..(r + 1) * d].copy_from_slice(weight.as_f32_slice());
        }
        Tensor::new(data, vec![rows, d])
    }

fill_strict_upper_tri walks the cells above the diagonal (j > i) and overwrites them. copy_2d_from_cols copies, for every row, the w-wide window starting at col_offset, pulling one head's worth of columns out of a wider matrix. copy_2d_into_cols is the reverse, writing a narrow matrix back into a column window of a wide buffer. repeat_row_as_matrix copies a 1-D weight into every row of a rows × d matrix. None of it is interesting math; it's the bookkeeping that the head-by-head attention loop needs.
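To see the column-window bookkeeping in isolation, here is a sketch of the same two copies on plain row-major slices, with no Tensor or Backend involved. The names take_cols and put_cols are made up for this example; they mirror the index arithmetic above but are not part of the trait.

```rust
/// Copies the `w` columns starting at `off` out of a row-major `rows x cols` matrix.
fn take_cols(src: &[f32], rows: usize, cols: usize, w: usize, off: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    assert!(off + w <= cols);
    let mut out = Vec::with_capacity(rows * w);
    for r in 0..rows {
        let start = r * cols + off;
        out.extend_from_slice(&src[start..start + w]);
    }
    out
}

/// Writes a row-major `rows x w` block into columns `off..off + w` of `dst`.
fn put_cols(dst: &mut [f32], rows: usize, cols: usize, block: &[f32], w: usize, off: usize) {
    assert_eq!(dst.len(), rows * cols);
    assert_eq!(block.len(), rows * w);
    assert!(off + w <= cols);
    for r in 0..rows {
        let start = r * cols + off;
        dst[start..start + w].copy_from_slice(&block[r * w..(r + 1) * w]);
    }
}
```

Taking every window out of a matrix and putting each back at the same offset rebuilds the original, which is the round trip the attention loop relies on when it splits a wide matrix into heads and concatenates the results.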

apply_rope is interesting, and it gets the next section.

RoPE: telling the model where each token is

A transformer's attention, on its own, has no notion of order. To attention, "the dog bit the man" and "the man bit the dog" are bags of the same tokens; it sees no positions. Something has to inject the information "this token is at position 0, that one at position 7." That something is positional encoding, and Qwen3 uses the rotary kind, RoPE (Rotary Position Embedding).

The idea: take each token's vector, split it into pairs of numbers, and treat each pair as a 2-D point. Then rotate each point by an angle that depends on the token's position in the sequence. Token at position 0 gets rotated by 0 (unchanged); token at position 5 gets rotated more; and crucially, different pairs are rotated at different frequencies: some pairs spin fast as position increases, some slow. The set of rotation angles encodes the position, the way the hands of a bank of clocks running at different speeds together encode a time.

Three unit circles for positions 0, 1, and 2: the same two arrows rotate counterclockwise as position increases, the blue arrow sweeping faster than the green one.

Figure: one fast pair and one slow pair, walking through positions. Each pair of vector entries is a 2-D point; position rotates it, and each pair has its own speed.

Why rotation, specifically? Because rotation has a property attention can exploit. When attention later compares two tokens (a dot product of their vectors), the rotations interact so that the result depends only on the relative offset between the two positions, not their absolute values. A token learns "the word three positions back" rather than "the word at position 47." That's the kind of structure language has.

The rotation of one pair (x, y), for a token at index position and a pair with frequency freq, is the standard 2-D rotation:

PLAINTEXT

  x' = x·cos(θ) − y·sin(θ)
  y' = x·sin(θ) + y·cos(θ)        where  θ = position · freq

Here is that, for every pair of every head of one row:

RUST

/// Rotates the `n_heads` RoPE pairs in `out[row_offset .. row_offset + n_heads*head_dim]`
/// for a row at sequence `position`.
fn rope_rotate_row(
    out: &mut [f32],
    row_offset: usize,
    position: usize,
    n_heads: usize,
    head_dim: usize,
    rope_theta: f32,
) {
    let half = head_dim / 2;
    for h in 0..n_heads {
        let base = row_offset + h * head_dim;
        for i in 0..half {
            let freq = 1.0_f32 / rope_theta.powf((i as f32) / (half as f32));
            let (sn, c) = ((position as f32) * freq).sin_cos();
            let a = out[base + i];
            let b = out[base + i + half];
            out[base + i] = a * c - b * sn;
            out[base + i + half] = a * sn + b * c;
        }
    }
}

The pairing convention here is "first half / second half": element i pairs with element i + half, not i with i+1. The frequency for pair i is 1 / theta^(i/half): pair 0 spins fastest (frequency 1, so one radian per position), later pairs slower and slower, geometrically spaced. theta (the rope_theta config value, often 10000 or larger) sets the overall spread. sin_cos computes the sine and cosine in one call. Then the two paired elements are rotated by position · freq.
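The geometric spacing is easier to see with numbers. The sketch below pulls the frequency formula out into a hypothetical helper, rope_freqs, that is not part of the chapter's backend; it uses the same expression as rope_rotate_row.

```rust
/// Per-pair RoPE frequencies: freq[i] = theta^(-i / half), with half = head_dim / 2.
fn rope_freqs(head_dim: usize, rope_theta: f32) -> Vec<f32> {
    assert!(head_dim % 2 == 0, "RoPE requires even head_dim");
    let half = head_dim / 2;
    (0..half)
        .map(|i| 1.0_f32 / rope_theta.powf(i as f32 / half as f32))
        .collect()
}
```

With a toy head_dim of 8 and theta of 10000, the four pairs get frequencies of about 1, 0.1, 0.01, and 0.001: each pair turns ten times more slowly than the one before it.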

apply_rope runs that over every row of a tensor:

RUST

    fn apply_rope(&self, x: &Tensor, head_dim: usize, rope_theta: f32) -> Tensor {
        assert_eq!(x.shape().len(), 2);
        let seq = x.shape()[0];
        let total_width = x.shape()[1];
        let n_heads = total_width / head_dim;
        assert_eq!(total_width, n_heads * head_dim);
        assert!(head_dim % 2 == 0);
        let mut out = x.as_f32_slice().to_vec();
        for s in 0..seq {
            rope_rotate_row(&mut out, s * total_width, s, n_heads, head_dim, rope_theta);
        }
        Tensor::new(out, x.shape_vec())
    }

Row s is the token at position s, so it's rotated by position s. The width of the input is n_heads · head_dim; RoPE is applied per head, which is why it needs head_dim to slice the row into heads. We'll see what "head" means in a moment.
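The claim that rotated dot products depend only on the relative offset can be checked numerically on a single pair. This is a small sketch under simplifying assumptions: one 2-D pair, one frequency, and two hypothetical helpers (rotate and rope_pair_score) that exist only for this check.

```rust
/// Rotates the 2-D point (x, y) by `angle` radians.
fn rotate(x: f32, y: f32, angle: f32) -> (f32, f32) {
    let (s, c) = angle.sin_cos();
    (x * c - y * s, x * s + y * c)
}

/// Dot product of a query pair at position `m` and a key pair at position `n`,
/// both rotated with the same frequency.
fn rope_pair_score(q: (f32, f32), k: (f32, f32), m: usize, n: usize, freq: f32) -> f32 {
    let (qx, qy) = rotate(q.0, q.1, m as f32 * freq);
    let (kx, ky) = rotate(k.0, k.1, n as f32 * freq);
    qx * kx + qy * ky
}
```

Scoring positions (5, 2) and (47, 44) gives the same value up to floating-point error, because both are three apart; scoring (5, 3) gives a different one. A full head sums this quantity over all of its pairs, so the property carries over.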

RMSNorm: keeping activations in check

Stack 28 layers and, without intervention, the magnitude of x drifts; values blow up or vanish, and training (and inference) become unstable. The fix is normalization at the start of each sub-layer: rescale x so its values sit in a controlled range, regardless of what the previous layer did.

Qwen3 uses RMSNorm (root-mean-square normalization). For each row vector, compute its root-mean-square (√(mean of squares)) and divide every element by it. That forces the row to unit RMS. Then multiply elementwise by a learned per-element weight vector, so the model can scale each dimension back up or down as it sees fit. The formula for one row x of dimension d:

PLAINTEXT

  rms  = √( (Σ xᵢ²)/d + ε )
  out  = (x / rms) · weight

The ε (a tiny constant) inside the square root just prevents a divide-by-zero if a row is all zeros. Now watch how this is built entirely from I.4 backend primitives: no new kernel, just composition. The model/common module holds it:

RUST

pub fn rms_norm_weighted_last(ops: &dyn Backend, x: &Tensor, weight: &Tensor, eps: f32) -> Tensor {
    assert!(!x.shape().is_empty());
    let d = *x.shape().last().expect("non-empty shape");
    assert_eq!(weight.shape(), &[d][..]);
    assert_eq!(x.numel() % d, 0);
    let rows = x.numel() / d;
    let x_2d = ops.reshape_data(x, vec![rows, d]);
    let sum_sq = ops.sum_squares_axis(&x_2d, 1);
    let mean_sq = ops.scale(&sum_sq, 1.0 / d as f32);
    let denom = ops.add_scalar(&mean_sq, eps);
    let inv_rms = ops.rsqrt_elem(&denom);
    let inv_rms_2d = ops.broadcast_row_scalars(&inv_rms, d);
    let w_2d = ops.repeat_row_as_matrix(weight, rows);
    let scaled = ops.hadamard(&x_2d, &inv_rms_2d);
    let out_2d = ops.hadamard(&scaled, &w_2d);
    ops.reshape_data(&out_2d, x.shape_vec())
}

Read it as the formula, line by line. Reshape x to 2-D so each row is one vector. sum_squares_axis(_, 1) gives Σ xᵢ² per row. scale by 1/d makes it the mean. add_scalar adds ε. rsqrt_elem gives 1/√(...), the reciprocal RMS, one number per row. broadcast_row_scalars blows that per-row number up to a full rows × d matrix; repeat_row_as_matrix does the same for the learned weight. Two hadamards (multiply by the reciprocal RMS, then by the weight) and reshape back. That is RMSNorm, assembled from eight distinct backend operations (ten calls, counting both reshapes and both hadamards). This is the design from I.4 paying off: the layer logic is composition, and the backend is the only thing that ever has to be made fast.
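For cross-checking the composed version, it helps to have the formula written directly on one row. The following is a reference sketch on plain slices, not the chapter's implementation; rms_norm_row is a hypothetical name.

```rust
/// RMSNorm of one row: x / sqrt(mean(x^2) + eps), scaled elementwise by `weight`.
fn rms_norm_row(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    assert!(!x.is_empty());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

For the row [3, 4] with unit weights and ε = 0, the mean of squares is 12.5, the RMS is about 3.536, and the output is about [0.849, 1.131], whose own RMS is 1. An all-zero row with a small ε comes back as zeros rather than NaN.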

Qwen3 has a second variant. Inside attention it normalizes each head separately:

RUST

pub fn headwise_rms_norm_weighted(
    ops: &dyn Backend,
    x: &Tensor,
    n_heads: usize,
    head_dim: usize,
    weight: &Tensor,
    eps: f32,
) -> Tensor {
    assert_eq!(x.shape().len(), 2);
    let seq = x.shape()[0];
    let width = x.shape()[1];
    assert_eq!(width, n_heads * head_dim);
    assert_eq!(weight.shape(), &[head_dim][..]);
    let mut out = vec![0.0f32; seq * width];
    for h in 0..n_heads {
        let head = ops.copy_2d_from_cols(x, head_dim, h * head_dim);
        let head_normed = rms_norm_weighted_last(ops, &head, weight, eps);
        ops.copy_2d_into_cols(&mut out, width, &head_normed, h * head_dim);
    }
    Tensor::new(out, x.shape_vec())
}

It slices out each head's head_dim columns with copy_2d_from_cols, runs the ordinary RMSNorm on that slice, and writes it back with copy_2d_into_cols. Qwen3 applies this to the query and key vectors inside attention (the "QK-norm" trick); we'll see it called there. The module file re-exports both:

RUST

mod attention;
mod mask;
mod norm;
 
pub(crate) use attention::gqa_attention_forward_with_kv;
pub(crate) use norm::{headwise_rms_norm_weighted, rms_norm_weighted_last};

Attention: the heart of the transformer

Attention is the mechanism that lets a token look at other tokens. When the model processes the word "it" in "the dog chased the cat because it was fast," attention is what lets the "it" position pull in information from "dog" or "cat" to figure out what "it" refers to. It is the one operation in a transformer that mixes information across positions; everything else (norm, MLP) works on each token independently.

Here is the mechanism. For each token, the model produces three vectors by multiplying x against three learned weight matrices:

a query (Q): "what am I looking for?"
a key (K): "what do I offer to others looking?"
a value (V): "what do I contribute if attended to?"

Then, for a given token, attention works like a soft dictionary lookup. Take that token's query and compare it, by dot product, against the key of every token. A large dot product means "this key matches what I'm looking for." Those dot products are the attention scores. Run the scores through softmax to turn them into weights that sum to 1, and the token's output is the weighted average of all the values. A token attends strongly to the tokens whose keys matched its query, and copies their values.

Two refinements are essential and both appear in the code.

The causal mask. A decoder generates text left to right. When computing position 5's output, it must not look at positions 6, 7, … (those are the future, not yet generated). So before softmax, we set the scores for all future positions to −∞. After softmax, exp(−∞) = 0, so future tokens get exactly zero weight. The scores form a seq × seq matrix where row i, column j is "how much token i attends to token j"; "no looking at the future" means masking every cell where j > i, the strictly upper triangle. That's what fill_strict_upper_tri is for:

RUST

use crate::backend::Backend;
use crate::tensor::Tensor;
 
pub fn causal_mask_upper_tri(ops: &dyn Backend, scores: &Tensor) -> Tensor {
    ops.fill_strict_upper_tri(scores, f32::NEG_INFINITY)
}

The scale. The typical magnitude of a dot product grows with the dimension of the vectors (roughly as √head_dim when the entries have unit variance); large scores push softmax into a near-one-hot regime (nearly all the weight on a single position), with vanishing gradients in training and brittle behavior at inference. So scores are divided by √head_dim before the mask. Putting the whole single-head computation together gives softmax(QKᵀ / √d)·V, the canonical attention formula, with the causal mask applied just before the softmax:

RUST

fn gqa_attention_context_one_head(
    ops: &dyn Backend,
    q_h: &Tensor,
    k_h: &Tensor,
    v_h: &Tensor,
    scale_attn: f32,
) -> Tensor {
    let scores = ops.matmul(q_h, &ops.transpose_2d(k_h));
    let scores = ops.scale(&scores, scale_attn);
    let scores = causal_mask_upper_tri(ops, &scores);
    let attn = ops.softmax_rows(&scores);
    ops.matmul(&attn, v_h)
}

matmul(q_h, transpose(k_h)) is QKᵀ: every query dotted with every key, giving the seq × seq score matrix. scale divides by √d. The mask sets every future score to −∞, which softmax then turns into zero weight. softmax_rows normalizes each row to weights summing to 1. The final matmul(attn, v_h) is the weighted average of the values. Five lines, and it is the formula exactly.
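A naive loop version makes a handy reference when debugging the matrix form. The sketch below is an assumption-laden stand-in, not the chapter's code: causal_attention is a hypothetical function on plain row-major slices, and instead of writing −∞ into the future cells it simply never visits them, which yields the same weights.

```rust
/// Single-head causal attention on row-major slices.
/// `q`, `k`, `v` are each `seq x d`; returns the `seq x d` context.
fn causal_attention(q: &[f32], k: &[f32], v: &[f32], seq: usize, d: usize) -> Vec<f32> {
    assert!(q.len() == seq * d && k.len() == seq * d && v.len() == seq * d);
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0_f32; seq * d];
    for i in 0..seq {
        // Scores against keys 0..=i only; later keys are the masked future.
        let scores: Vec<f32> = (0..=i)
            .map(|j| (0..d).map(|t| q[i * d + t] * k[j * d + t]).sum::<f32>() * scale)
            .collect();
        // Softmax, shifted by the row maximum for numerical stability.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = exps.iter().sum();
        // Weighted average of the visible values.
        for (j, e) in exps.iter().enumerate() {
            let w = e / total;
            for t in 0..d {
                out[i * d + t] += w * v[j * d + t];
            }
        }
    }
    out
}
```

Two properties are worth checking against any faster implementation: row 0 of the output equals row 0 of v (position 0 can only attend to itself), and changing a later row of v never changes an earlier row of the output.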

Multi-head, and grouped-query attention

One set of Q/K/V would give the model one way to relate tokens. Transformers use multi-head attention: split the vectors into n_heads independent chunks, each head_dim wide, and run the attention computation separately per head. One head might learn to track syntax, another long-range references, another something else. The heads' outputs are concatenated back into one wide vector.

Qwen3 adds a memory-saving twist: grouped-query attention (GQA). Plain multi-head has one K head and one V head per query head. K and V are the expensive things to store (in Act 2 they become the KV cache). GQA gives the model many query heads but few key/value heads; several query heads share one K/V head. Qwen3 0.6B has 16 query heads and 8 KV heads, so each KV head is shared by 2 query heads. Fewer K/V heads means a smaller cache (here, half the size) almost for free.

Sixteen query-head boxes in a row, each pair connected to one of eight key/value head boxes below.

Figure: Qwen3 0.6B's head grouping. Sixteen query heads, eight K/V heads, two queries per K/V head.
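The grouping and the memory saving both come down to integer arithmetic. Here is a small sketch with two hypothetical helpers, kv_head_for and kv_floats_per_token, assuming query heads are assigned to KV heads in contiguous groups and using the 128-wide heads this chapter quotes for Qwen3 0.6B.

```rust
/// KV head that serves query head `q_head` when `n_q` query heads share `n_kv` KV heads.
fn kv_head_for(q_head: usize, n_q: usize, n_kv: usize) -> usize {
    assert!(n_kv > 0 && n_q % n_kv == 0, "query heads must split evenly");
    assert!(q_head < n_q);
    // Integer division puts consecutive query heads in the same group.
    q_head / (n_q / n_kv)
}

/// f32 values stored per token per layer for K plus V.
fn kv_floats_per_token(n_kv: usize, head_dim: usize) -> usize {
    2 * n_kv * head_dim
}
```

With 16 query heads and 8 KV heads, query heads 0 and 1 land on KV head 0 and query heads 14 and 15 on KV head 7. Storage per token per layer is 2048 floats with 8 KV heads, against 4096 if every query head had its own K and V.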

slice_head pulls one head's columns out of a wide Q, K, or V matrix:

RUST

pub(crate) fn slice_head(
    ops: &dyn Backend,
    x: &Tensor,
    seq: usize,
    n_heads: usize,
    head_dim: usize,
    head_idx: usize,
) -> Tensor {
    let n_cols = n_heads * head_dim;
    assert_eq!(x.shape(), &[seq, n_cols][..]);
    ops.copy_2d_from_cols(x, head_dim, head_idx * head_dim)
}

And here is the full attention sub-layer. It's long; we'll take it in three pieces. First, projections and per-head norms:

RUST

pub(crate) fn gqa_attention_forward_with_kv(
    ops: &dyn Backend,
    x: &Tensor,
    q_proj: &Tensor,
    k_proj: &Tensor,
    v_proj: &Tensor,
    o_proj: &Tensor,
    attn_q_norm: &Tensor,
    attn_k_norm: &Tensor,
    num_attention_heads: usize,
    num_key_value_heads: usize,
    head_dim: usize,
    rms_norm_eps: f32,
    rope_theta: f32,
) -> (Tensor, Tensor, Tensor) {
    let seq = x.shape()[0];
    let nh = num_attention_heads;
    let nkv = num_key_value_heads;
    let hd = head_dim;
    let qw = nh * hd;
    let kv_group = nh / nkv;
 
    let mut q = ops.matmul(x, q_proj);
    let mut k = ops.matmul(x, k_proj);
    let v = ops.matmul(x, v_proj);
 
    q = headwise_rms_norm_weighted(ops, &q, nh, hd, attn_q_norm, rms_norm_eps);
    k = headwise_rms_norm_weighted(ops, &k, nkv, hd, attn_k_norm, rms_norm_eps);

x is multiplied by q_proj, k_proj, v_proj (the three learned projection matrices) to produce the query, key, and value matrices. q is seq × (nh·hd), k and v are seq × (nkv·hd), narrower because there are fewer KV heads. Then q and k get the per-head RMSNorm we wrote earlier (Qwen3's QK-norm). kv_group is how many query heads share one KV head.

Second piece: RoPE, then the per-head loop:

RUST

    let q = ops.apply_rope(&q, hd, rope_theta);
    let k_rope = ops.apply_rope(&k, hd, rope_theta);
 
    let scale_attn = 1.0 / (hd as f32).sqrt();
    let mut concat = vec![0.0f32; seq * qw];
 
    for h_idx in 0..nh {
        let kv_h = h_idx / kv_group;
        let q_h = slice_head(ops, &q, seq, nh, hd, h_idx);
        let k_h = slice_head(ops, &k_rope, seq, nkv, hd, kv_h);
        let v_h = slice_head(ops, &v, seq, nkv, hd, kv_h);
 
        let ctx = gqa_attention_context_one_head(ops, &q_h, &k_h, &v_h, scale_attn);
 
        assert_eq!(ctx.shape(), &[seq, hd][..]);
        ops.copy_2d_into_cols(&mut concat, qw, &ctx, h_idx * hd);
    }

RoPE is applied to q and k (not v; only the things being compared need position information). Then the loop runs once per query head. The line kv_h = h_idx / kv_group is GQA in action: query heads 0 and 1 both map to KV head 0, query heads 2 and 3 to KV head 1, and so on. Each head's Q/K/V are sliced out, the single-head attention runs, and the seq × hd result is written into its slot in the concat buffer.

Third piece: the output projection:

RUST

    let merged = Tensor::new(concat, vec![seq, qw]);
    let out = ops.matmul(&merged, o_proj);
    (out, k_rope, v)
}

The concatenated per-head outputs are multiplied by o_proj, the output projection matrix, to mix the heads back into a seq × hidden result. The function also returns k_rope and v, the rotated keys and the values. We don't use those return values in Act 1's forward pass, but the signature anticipates Act 2: those are exactly the tensors a KV cache stores so they don't have to be recomputed every step. Returning them now means the KV cache in II.2 is a change to the caller, not to this function.

The model config

Before the model struct, the hyperparameters. Every number that defines this particular network (how many layers, how wide, how many heads) lives in the GGUF metadata, and Qwen3Config reads them out. First we extend the GGUF side with two small helpers. A typed metadata accessor:

RUST

use std::collections::HashMap;
 
use super::types::MetadataValue;
 
pub(crate) trait GgufMetadata {
    fn as_usize(&self, key: &str) -> usize;
    fn as_f32(&self, key: &str) -> Option<f32>;
}
 
impl GgufMetadata for HashMap<String, MetadataValue> {
    fn as_usize(&self, key: &str) -> usize {
        // unwrap_or_else builds the panic message only on the failure path.
        let v = self
            .get(key)
            .unwrap_or_else(|| panic!("missing GGUF metadata key {key:?}"));
        v.as_u64()
            .map(|u| u as usize)
            .or_else(|| v.as_f32().map(|f| f as usize))
            .unwrap_or_else(|| panic!("GGUF metadata {key:?} is not numeric"))
    }
 
    fn as_f32(&self, key: &str) -> Option<f32> {
        self.get(key).and_then(MetadataValue::as_f32)
    }
}

This is an extension trait on the metadata HashMap; as_usize("qwen3.embedding_length") reads a key and converts, panicking with a clear message if the key is missing or the wrong type. as_f32 instead returns an Option and leaves the missing-key decision to the caller. It needs an as_f32 getter on MetadataValue, added to types.rs:

RUST

    pub fn as_f32(&self) -> Option<f32> {
        match self {
            MetadataValue::Float32(x) => Some(*x),
            MetadataValue::Float64(x) => Some(*x as f32),
            MetadataValue::Uint32(x) => Some(*x as f32),
            MetadataValue::Int32(x) => Some(*x as f32),
            _ => None,
        }
    }

And a utility to count layers by scanning tensor names. There's no explicit layer-count metadata key we trust, so we find the highest index i among the blk.<i> prefixes and add one (recall from I.1 that per-layer tensors are named blk.0.attn_q.weight and so on):

RUST

pub(crate) fn count_layers(tensors: &[TensorInfo]) -> usize {
    let mut max_i = None::<usize>;
    for t in tensors {
        if let Some(rest) = t.name.strip_prefix("blk.") {
            if let Some((idx_str, _)) = rest.split_once('.') {
                if let Ok(i) = idx_str.parse::<usize>() {
                    max_i = Some(max_i.map_or(i, |m| m.max(i)));
                }
            }
        }
    }
    max_i.map_or(0, |v| v + 1)
}
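To make the behavior concrete without a GGUF file, here is the same scan over bare names. This is an illustrative variant only: count_layers_from_names is a hypothetical function taking string slices, where the real one takes TensorInfo values.

```rust
/// Layer count from tensor names: highest `blk.<i>.` index plus one, or 0 if none.
fn count_layers_from_names(names: &[&str]) -> usize {
    names
        .iter()
        .filter_map(|name| name.strip_prefix("blk."))
        .filter_map(|rest| rest.split_once('.'))
        // Names whose index does not parse (e.g. "blk.x.…") are skipped.
        .filter_map(|(idx, _)| idx.parse::<usize>().ok())
        .max()
        .map_or(0, |m| m + 1)
}
```

Note what this does and does not check: a file containing only blk.3.* tensors reports four layers, because the result is the highest index plus one rather than a count of distinct indices. Missing layers surface later, when the constructor fails to load their tensors.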

The gguf module file gains mod metadata_trait; and two new pub(crate) re-exports, GgufMetadata and count_layers (bringing it to three, alongside the MetadataValue one from I.2). The GGUF struct also gets a one-line has_tensor method:

RUST

    pub fn has_tensor(&self, name: &str) -> bool {
        self.tensors.iter().any(|t| t.name == name)
    }

Now Qwen3Config itself:

RUST

use crate::gguf::{GGUF, GgufMetadata, count_layers};
 
#[derive(Clone, Debug)]
pub(crate) struct Qwen3Config {
    pub vocab_size: usize,
    pub hidden_size: usize,
    pub num_hidden_layers: usize,
    pub num_attention_heads: usize,
    pub num_key_value_heads: usize,
 
    pub head_dim: usize,
 
    pub intermediate_size: usize,
    pub rms_norm_eps: f32,
    pub rope_theta: f32,
}
 
impl Qwen3Config {
    pub fn q_width(&self) -> usize {
        self.num_attention_heads * self.head_dim
    }
 
    pub fn kv_width(&self) -> usize {
        self.num_key_value_heads * self.head_dim
    }

Every field is one architectural number: hidden_size is the width of x (1024); num_hidden_layers is 28; num_attention_heads and num_key_value_heads are the 16 and 8 of GQA; head_dim is each head's width; intermediate_size is the MLP's inner width; rms_norm_eps and rope_theta are the ε and θ constants we met above. q_width and kv_width are the total widths of the query and key/value matrices.

from_gguf populates it:

RUST

    pub fn from_gguf(gguf: &GGUF) -> Self {
        let meta = &gguf.metadata;
        let hidden_size = meta.as_usize("qwen3.embedding_length");
        let te = gguf
            .tensors
            .iter()
            .find(|t| t.name == "token_embd.weight")
            .expect("GGUF missing token_embd.weight");
        assert_eq!(te.dims.len(), 2, "token_embd.weight must be 2-D");
        assert_eq!(
            te.dims[0] as usize, hidden_size,
            "token_embd.weight dim[0] must match GGUF embedding_length metadata"
        );
        let vocab_size = te.dims[1] as usize;
        let head_dim = meta.as_usize("qwen3.attention.key_length");
        let num_attention_heads = meta.as_usize("qwen3.attention.head_count");
        let num_key_value_heads = meta.as_usize("qwen3.attention.head_count_kv");
        let intermediate_size = meta.as_usize("qwen3.feed_forward_length");
        let num_hidden_layers = count_layers(&gguf.tensors);
        assert!(
            num_hidden_layers > 0,
            "GGUF has no blk.N.* tensors (layer count)"
        );
        let rope_theta = meta
            .as_f32("qwen3.rope.freq_base")
            .expect("GGUF missing rope.freq_base metadata");
        let rms_norm_eps = meta
            .as_f32("qwen3.attention.layer_norm_rms_epsilon")
            .expect("GGUF missing attention.layer_norm_rms_epsilon metadata");
        Self {
            vocab_size,
            hidden_size,
            num_hidden_layers,
            num_attention_heads,
            num_key_value_heads,
            head_dim,
            intermediate_size,
            rms_norm_eps,
            rope_theta,
        }
    }
 
    pub(crate) fn validate(&self) {
        assert_eq!(self.num_attention_heads % self.num_key_value_heads, 0);
        assert!(self.head_dim % 2 == 0, "RoPE requires even head_dim");
    }
}

Mostly metadata lookups. vocab_size is read from the shape of the embedding tensor rather than a metadata key: token_embd.weight is logically vocab_size × hidden_size, but the raw GGUF dims list the fastest-varying dimension first, so dims[0] is hidden_size and dims[1] is vocab_size. We cross-check dims[0] against the metadata to catch a mismatched file. validate asserts the two invariants the math depends on: the KV head count must divide evenly into the query head count (for GQA), and head_dim must be even (for RoPE's pairing).
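Plugging in this model's numbers shows what the derived widths come to. The struct below is a cut-down, hypothetical stand-in (DemoConfig) with only the fields needed here, filled with the Qwen3 0.6B values quoted in this chapter; it is not the real Qwen3Config.

```rust
/// Illustrative stand-in for the chapter's config, holding only the fields used here.
struct DemoConfig {
    hidden_size: usize,
    num_attention_heads: usize,
    num_key_value_heads: usize,
    head_dim: usize,
}

impl DemoConfig {
    /// Values for Qwen3 0.6B as quoted in this chapter.
    fn qwen3_0_6b() -> Self {
        Self { hidden_size: 1024, num_attention_heads: 16, num_key_value_heads: 8, head_dim: 128 }
    }

    fn q_width(&self) -> usize {
        self.num_attention_heads * self.head_dim
    }

    fn kv_width(&self) -> usize {
        self.num_key_value_heads * self.head_dim
    }

    /// The two invariants `validate` asserts, returned as a bool instead of panicking.
    fn is_valid(&self) -> bool {
        self.num_key_value_heads > 0
            && self.num_attention_heads % self.num_key_value_heads == 0
            && self.head_dim % 2 == 0
    }
}
```

The query width comes out to 2048 and the key/value width to 1024. The query width is twice hidden_size, which is why the attention sub-layer needs an output projection to get back to seq × hidden.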

The Qwen3 model

Flowchart of one Qwen3 transformer layer: RMSNorm feeds the query, key, and value projections, queries and keys get per-head RMSNorm and RoPE, grouped-query attention combines them, and after a residual add the MLP applies gate, up, SiLU, and down projections with a second residual add.

Figure: one layer, end to end. Everything in this diagram happens 28 times, once per layer, before the final norm and output head.

The same layer annotated with tensor shapes: hidden states of seq by 1024 become queries of seq by 2048 and keys and values of seq by 1024, sliced into heads of seq by 128, concatenated back, and expanded to seq by 3072 inside the MLP; after 28 layers the output head produces seq by 151936 logits.

Figure: the same walk, in shapes. If you can follow the shapes, you can debug the forward pass; every mismatch panics at one of these seams.

Now the model struct. A layer holds the eleven weight tensors of one transformer block:

RUST

pub(crate) struct Qwen3Layer {
    pub input_layernorm: Tensor,
 
    pub q_proj: Tensor,
 
    pub k_proj: Tensor,
    pub v_proj: Tensor,
 
    pub o_proj: Tensor,
 
    pub attn_q_norm: Tensor,
 
    pub attn_k_norm: Tensor,
    pub post_attention_layernorm: Tensor,
    pub gate_proj: Tensor,
    pub up_proj: Tensor,
    pub down_proj: Tensor,
}

input_layernorm is the RMSNorm weight before attention; q/k/v/o_proj are attention's four projection matrices; attn_q_norm/attn_k_norm are the QK-norm weights; post_attention_layernorm is the RMSNorm before the MLP; gate_proj/up_proj/down_proj are the MLP's three matrices. The model bundles the layers with the embedding and the head:

RUST

pub(crate) struct Qwen3Model {
    pub(crate) config: Qwen3Config,
    cpu_backend: Arc<dyn Backend>,
 
    embed: Tensor,
    layers: Vec<Qwen3Layer>,
 
    norm: Tensor,
 
    lm_head: Tensor,
}

embed is the token embedding table, layers the 28 blocks, norm the final RMSNorm weight, lm_head the output projection. cpu_backend is the Backend the whole forward pass runs against, held as an Arc<dyn Backend> so it can be swapped (Act 2) and shared (Act 3).

Loading the weights

The constructor reads every weight tensor out of the GGUF file. Two small loader helpers first:

RUST

fn load_ggml_weight_for_matmul_rhs(
    ops: &dyn Backend,
    gguf: &mut GGUF,
    name: &str,
    expected_ne: [usize; 2],
) -> Tensor {
    let t = gguf.load_tensor(name);
    assert_eq!(
        t.shape(),
        &[expected_ne[1], expected_ne[0]],
        "{name} shape mismatch"
    );
    match t.as_data() {
        TensorData::Fp32(_) => ops.transpose_2d(&t),
    }
}
 
fn load_vec_1d(gguf: &mut GGUF, name: &str, len: usize) -> Tensor {
    let t = gguf.load_tensor(name);
    assert_eq!(t.shape(), &[len], "{name} shape mismatch");
    t
}

load_vec_1d loads a 1-D tensor (a norm weight) and checks its length. load_ggml_weight_for_matmul_rhs loads a 2-D weight matrix and transposes it once, at load time. GGUF stores each weight matrix as out × in, the orientation a y = Wx convention wants; our forward pass multiplies x (as a row-major seq × in) on the left, so it needs W transposed to in × out. Doing the transpose here, once per matrix at startup, means the forward pass (run thousands of times) never pays for it. The expected_ne argument is the shape we expect after the transpose ([in, out]), and the assert checks that the tensor as loaded has the reverse shape.
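A tiny worked case shows why the transposed layout is the one a row-vector multiply wants. This is a sketch on plain slices with two hypothetical helpers, transpose_2d (a free function here, unlike the backend method of the same name) and row_times_matrix.

```rust
/// Transposes a row-major `rows x cols` matrix into a row-major `cols x rows` one.
fn transpose_2d(m: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    let mut out = vec![0.0_f32; m.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = m[r * cols + c];
        }
    }
    out
}

/// Row vector times matrix: `x` is `1 x n_in`, `w_t` is row-major `n_in x n_out`.
fn row_times_matrix(x: &[f32], w_t: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(x.len(), n_in);
    assert_eq!(w_t.len(), n_in * n_out);
    (0..n_out)
        .map(|o| (0..n_in).map(|i| x[i] * w_t[i * n_out + o]).sum::<f32>())
        .collect()
}
```

Take W as the 2 × 3 (out × in) matrix with rows [1, 2, 3] and [4, 5, 6], and x = [1, 0, −1]. Computing Wx by hand gives [−2, −2]; transposing W once and multiplying x on the left gives the same result.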

Mapping from GGUF tensor names like blk.7.attn_q.weight to the loader's fields like layers 7 q_proj, with a note on which weights are transposed at load.

Figure: file names to struct fields. The blk.{i} prefix picks the layer; the suffix picks the field; 2-D weights get their one-time transpose on the way in.

The constructor:

RUST

impl Qwen3Model {
    pub fn new(gguf: &mut GGUF, cpu_backend: Arc<dyn Backend>) -> Self {
        let config = Qwen3Config::from_gguf(gguf);
        config.validate();
        let h = config.hidden_size;
        let v = config.vocab_size;
        let qw = config.q_width();
        let kvw = config.kv_width();
        let hd = config.head_dim;
        let inter = config.intermediate_size;
        let ops = cpu_backend.as_ref();
 
        let embed = gguf.load_tensor("token_embd.weight");
        assert_eq!(embed.shape(), &[v, h], "token_embd.weight shape mismatch");
 
        let norm = load_vec_1d(gguf, "output_norm.weight", h);
 
        let lm_head = if gguf.has_tensor("output.weight") {
            load_ggml_weight_for_matmul_rhs(ops, gguf, "output.weight", [h, v])
        } else {
            match embed.as_data() {
                TensorData::Fp32(_) => ops.transpose_2d(&embed),
            }
        };

It builds the config, then loads the embedding table and the final norm. The output head is loaded with one wrinkle: many models tie the output head to the embedding table; they're the same matrix, transposed. So if the file has a separate output.weight we use it; otherwise we transpose embed to serve as the head. (Both branches are real for us: the FP32 export we're running ships a separate output.weight, so the if branch is live here, while the Q8_0 export we switch to in II.6 ties them and takes the else branch.)
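One way to see why a tied head works: multiplying the final hidden state by the transposed embedding table is the same as taking its dot product with every token's embedding row. The sketch below shows that for a single position on plain slices; tied_head_logits is a hypothetical helper written for this example, not part of the chapter's code.

```rust
/// Logits for one position under a tied head: logits[j] = dot(hidden, embed[j]).
/// `embed` is row-major vocab x hidden, the same layout as the lookup table.
fn tied_head_logits(hidden: &[f32], embed: &[f32], vocab: usize) -> Vec<f32> {
    let h = hidden.len();
    assert_eq!(embed.len(), vocab * h, "embedding table does not match shape");
    (0..vocab)
        .map(|j| {
            // Row j of the table is token j's embedding vector.
            let row = &embed[j * h..(j + 1) * h];
            row.iter().zip(hidden).map(|(e, x)| e * x).sum::<f32>()
        })
        .collect()
}
```

Tokens whose embedding points in the same direction as the hidden state score highest, which is the intuition behind sharing the matrix.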

RUST

        let mut layers = Vec::with_capacity(config.num_hidden_layers);
        for i in 0..config.num_hidden_layers {
            let p = format!("blk.{i}");
            layers.push(Qwen3Layer {
                input_layernorm: load_vec_1d(gguf, &format!("{p}.attn_norm.weight"), h),
                q_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.attn_q.weight"),
                    [h, qw],
                ),
                k_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.attn_k.weight"),
                    [h, kvw],
                ),
                v_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.attn_v.weight"),
                    [h, kvw],
                ),
                o_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.attn_output.weight"),
                    [qw, h],
                ),
                attn_q_norm: load_vec_1d(gguf, &format!("{p}.attn_q_norm.weight"), hd),
                attn_k_norm: load_vec_1d(gguf, &format!("{p}.attn_k_norm.weight"), hd),
                post_attention_layernorm: load_vec_1d(gguf, &format!("{p}.ffn_norm.weight"), h),
                gate_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.ffn_gate.weight"),
                    [h, inter],
                ),
                up_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.ffn_up.weight"),
                    [h, inter],
                ),
                down_proj: load_ggml_weight_for_matmul_rhs(
                    ops,
                    gguf,
                    &format!("{p}.ffn_down.weight"),
                    [inter, h],
                ),
            });
        }
 
        Self {
            config,
            cpu_backend,
            embed,
            layers,
            norm,
            lm_head,
        }
    }

The loop runs once per layer, building the GGUF tensor name from the blk.<i> prefix and loading all eleven tensors. The shapes wire up the dimensions: the query projection goes hidden → q_width, the MLP's up and gate go hidden → intermediate, down comes back intermediate → hidden. After this returns, every weight of the model is in memory, transposed and shape-checked.
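The name and shape wiring of that loop can be summarised as a table. The sketch below builds it as data; Dims, layer_tensors and layer_param_count are hypothetical names introduced for this example, and the shapes are the post-transpose ones passed as expected_ne above.

```rust
/// Hypothetical subset of the config fields the loader uses.
struct Dims {
    hidden: usize,
    q_width: usize,
    kv_width: usize,
    head_dim: usize,
    intermediate: usize,
}

/// GGUF tensor names for layer `i`, paired with the shape each has after loading
/// (2-D weights are listed in their transposed in x out orientation).
fn layer_tensors(i: usize, d: &Dims) -> Vec<(String, Vec<usize>)> {
    let p = format!("blk.{i}");
    vec![
        (format!("{p}.attn_norm.weight"), vec![d.hidden]),
        (format!("{p}.attn_q.weight"), vec![d.hidden, d.q_width]),
        (format!("{p}.attn_k.weight"), vec![d.hidden, d.kv_width]),
        (format!("{p}.attn_v.weight"), vec![d.hidden, d.kv_width]),
        (format!("{p}.attn_output.weight"), vec![d.q_width, d.hidden]),
        (format!("{p}.attn_q_norm.weight"), vec![d.head_dim]),
        (format!("{p}.attn_k_norm.weight"), vec![d.head_dim]),
        (format!("{p}.ffn_norm.weight"), vec![d.hidden]),
        (format!("{p}.ffn_gate.weight"), vec![d.hidden, d.intermediate]),
        (format!("{p}.ffn_up.weight"), vec![d.hidden, d.intermediate]),
        (format!("{p}.ffn_down.weight"), vec![d.intermediate, d.hidden]),
    ]
}

/// Number of scalar weights in one layer: the sum of each tensor's element count.
fn layer_param_count(i: usize, d: &Dims) -> usize {
    layer_tensors(i, d)
        .iter()
        .map(|(_, shape)| shape.iter().product::<usize>())
        .sum::<usize>()
}
```

Calling layer_tensors(7, &dims) returns eleven entries, starting with blk.7.attn_norm.weight. Feeding in the real config values gives a quick sanity check on how many weights each block holds.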

The forward pass

And finally the forward pass, the diagram from the top of the chapter, in code:

RUST

    pub fn forward(&self, token_ids: &[usize]) -> Tensor {
        self.forward_common(token_ids)
    }
 
    fn forward_common(&self, token_ids: &[usize]) -> Tensor {
        let ops = self.cpu_backend.as_ref();
        let cfg = &self.config;
        let mut x = ops.gather_rows(&self.embed, token_ids);
        for layer in self.layers.iter() {
            let normed = rms_norm_weighted_last(ops, &x, &layer.input_layernorm, cfg.rms_norm_eps);
            let (attn_out, _k_rope, _v) = gqa_attention_forward_with_kv(
                ops,
                &normed,
                &layer.q_proj,
                &layer.k_proj,
                &layer.v_proj,
                &layer.o_proj,
                &layer.attn_q_norm,
                &layer.attn_k_norm,
                cfg.num_attention_heads,
                cfg.num_key_value_heads,
                cfg.head_dim,
                cfg.rms_norm_eps,
                cfg.rope_theta,
            );
            x = ops.add(&x, &attn_out);
            let normed_mlp =
                rms_norm_weighted_last(ops, &x, &layer.post_attention_layernorm, cfg.rms_norm_eps);
            let mlp_out = mlp_forward(ops, &normed_mlp, layer);
            x = ops.add(&x, &mlp_out);
        }
        let x = rms_norm_weighted_last(ops, &x, &self.norm, cfg.rms_norm_eps);
        ops.matmul(&x, &self.lm_head)
    }
}

Read it against the diagram. gather_rows turns the token ids into x: the embedding lookup, picking one row per token from the embedding table. Then the loop over 28 layers, each doing the two-sub-layer pattern: RMSNorm, attention, x = x + attn_out (the residual add); RMSNorm again, MLP, x = x + mlp_out (the second residual). After all layers, the final RMSNorm, then matmul against lm_head to produce logits: a seq × vocab_size matrix, one row of vocabulary scores per input position. (The _k_rope and _v are the KV-cache tensors mentioned earlier, ignored for now.)
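The normalize, work, add-back pattern is easy to isolate. Below is a minimal sketch on a single vector, assuming an unweighted RMSNorm (all norm weights equal to 1) and an arbitrary sub-layer passed in as a closure; both function names are made up for this example and stand in for rms_norm_weighted_last and the attention or MLP call.

```rust
/// Unweighted RMSNorm over one vector: x / sqrt(mean(x^2) + eps).
fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * scale).collect()
}

/// One pre-norm residual step: x + sublayer(norm(x)).
/// The sub-layer only sees the normalized input; its output is a correction to x.
fn residual_step(x: &[f32], eps: f32, sublayer: impl Fn(&[f32]) -> Vec<f32>) -> Vec<f32> {
    let normed = rms_norm(x, eps);
    let delta = sublayer(&normed);
    assert_eq!(delta.len(), x.len(), "sub-layer must preserve the width");
    x.iter().zip(&delta).map(|(a, b)| a + b).collect()
}
```

A sub-layer that returns all zeros leaves x untouched, which is the sense in which each layer computes a correction: doing nothing is the default, and the stack of 28 blocks starts from an identity mapping.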

The MLP (Qwen3's SwiGLU) is a short helper:

RUST

fn mlp_forward(ops: &dyn Backend, x: &Tensor, layer: &Qwen3Layer) -> Tensor {
    let gate = ops.matmul(x, &layer.gate_proj);
    let up = ops.matmul(x, &layer.up_proj);
    let gate = ops.silu(&gate);
    let hidden = ops.hadamard(&gate, &up);
    ops.matmul(&hidden, &layer.down_proj)
}

A plain transformer MLP would be: project up to a wider dimension, apply an activation, project back down. SwiGLU adds a gate. It computes two up-projections (gate and up), runs silu (the activation from I.4) on gate, multiplies the two together elementwise, and projects the result down. The silu(gate) term acts as a soft, learned filter on up: it can suppress some dimensions and pass others. In practice gated MLPs like this tend to outperform the plain version at a similar parameter budget, which is why SwiGLU has become a common choice in recent transformers.
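The gating step on its own, stripped of the three projections, fits in a few lines. This is a sketch on plain slices, not the backend's silu and hadamard; swiglu_gate is a name invented for the example.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Elementwise gating step of SwiGLU: silu(gate) * up.
/// `gate` and `up` are the two projections of the same input, already computed.
fn swiglu_gate(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have the same width");
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

A strongly negative gate value drives the product toward zero whatever up holds, a zero gate blocks it exactly, and a large positive gate lets up through scaled by roughly the gate value.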

Tying it off

The Model trait (the generic interface a forward pass exposes) and the wiring:

RUST

use crate::tensor::Tensor;
 
pub trait Model: Send + Sync {
    fn forward(&self, token_ids: &[usize]) -> Tensor;
}

RUST

impl Model for Qwen3Model {
    fn forward(&self, token_ids: &[usize]) -> Tensor {
        Qwen3Model::forward(self, token_ids)
    }
}

Model has one method: take token ids, return logits. The generation loop in I.6 is written against this trait, so it never names Qwen3Model directly; a different architecture is a different impl Model.
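A side benefit of the trait is that code written against it can be exercised with a toy model, no weights required. The sketch below redeclares the trait with a hypothetical Logits alias standing in for the crate's Tensor, so that it compiles on its own; NextIdModel and seq_len_and_vocab are likewise invented for the example.

```rust
/// Hypothetical stand-in for the crate's Tensor: one row of vocab scores per position.
type Logits = Vec<Vec<f32>>;

/// Same shape as the chapter's Model trait, with Logits in place of Tensor.
trait Model: Send + Sync {
    fn forward(&self, token_ids: &[usize]) -> Logits;
}

/// Toy model that always predicts "current token id + 1", wrapping at the vocab size.
struct NextIdModel {
    vocab: usize,
}

impl Model for NextIdModel {
    fn forward(&self, token_ids: &[usize]) -> Logits {
        token_ids
            .iter()
            .map(|&t| {
                let mut row = vec![0.0; self.vocab];
                row[(t + 1) % self.vocab] = 1.0;
                row
            })
            .collect()
    }
}

/// Code written against the trait never names the concrete type.
fn seq_len_and_vocab(model: &dyn Model, token_ids: &[usize]) -> (usize, usize) {
    let logits = model.forward(token_ids);
    (logits.len(), logits.first().map_or(0, |row| row.len()))
}
```

Passing &NextIdModel { vocab: 5 } and three token ids to seq_len_and_vocab returns (3, 5): one row per input position, one score per vocabulary entry, the same contract the real model honours.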

model/load.rs is the public entry point. Three supporting functions, then the one a binary calls. First, architecture detection: reading general.architecture from the metadata and refusing anything that isn't Qwen3:

RUST

pub(crate) fn architecture_from_gguf(gguf: &GGUF) -> Result<String, String> {
    let meta = &gguf.metadata;
    if let Some(arch) = meta.get("general.architecture").and_then(|v| v.as_str()) {
        let a = arch.trim().to_ascii_lowercase();
        if a.contains("qwen3") || a == "qwen" {
            return Ok("qwen3".into());
        }
        if a.contains("qwen2") {
            return Err(
                "GGUF general.architecture looks like qwen2; this build only loads qwen3 layouts"
                    .into(),
            );
        }
        return Err(format!(
            "unsupported GGUF general.architecture {a:?} (supported: qwen3)"
        ));
    }
    if meta.keys().any(|k| k.starts_with("qwen3.")) {
        return Ok("qwen3".into());
    }
    Err(
        "could not infer model architecture from GGUF (missing general.architecture and no qwen3.* metadata)"
            .into(),
    )
}

Any architecture string containing qwen3 is accepted, as is the bare string qwen. A qwen2 file or any other architecture gets a clean error string, and if the general.architecture key is absent entirely the function falls back to sniffing for qwen3.* metadata keys. model_from_gguf uses the result to pick which Model to build:

RUST

pub(crate) fn model_from_gguf(
    gguf: &mut GGUF,
    backend: Arc<dyn Backend>,
) -> Result<(Arc<dyn Model>, String), String> {
    let arch = architecture_from_gguf(gguf)?;
    let model: Arc<dyn Model> = match arch.as_str() {
        "qwen3" => Arc::new(Qwen3Model::new(gguf, backend)),
        other => return Err(format!("unsupported architecture {other:?}")),
    };
    Ok((model, arch))
}

The match has one live arm: "qwen3" builds the Qwen3Model we wrote above. Adding a second architecture later means one new arm and one new file; nothing else changes. The tokenizer loader is the parallel for I.2's BpeTokenizer:

RUST

pub(crate) fn tokenizer_from_gguf_metadata(
    metadata: &HashMap<String, MetadataValue>,
) -> Result<Arc<dyn Tokenizer>, String> {
    match metadata
        .get("tokenizer.ggml.model")
        .and_then(|v| v.as_str())
    {
        Some("gpt2") => {}
        Some(other) => {
            return Err(format!(
                "tokenizer.ggml.model is {other:?}; only \"gpt2\" (BPE) is supported"
            ));
        }
        None => return Err("GGUF missing tokenizer.ggml.model".into()),
    }
    if !metadata.contains_key("tokenizer.ggml.tokens") {
        return Err("GGUF missing tokenizer.ggml.tokens (BPE vocab)".into());
    }
    Ok(Arc::new(BpeTokenizer::from_gguf_metadata(metadata)))
}

It checks the file carries a gpt2 BPE tokenizer with a vocabulary, then builds the BpeTokenizer. And the public entry point ties all three together:

RUST

pub fn load_from_gguf_path(
    path: &Path,
    backend: Arc<dyn Backend>,
) -> Result<(Arc<dyn Model>, Arc<dyn Tokenizer>), String> {
    let mut gguf = GGUF::parse(path);
    let tokenizer = tokenizer_from_gguf_metadata(&gguf.metadata)?;
    let (model, _) = model_from_gguf(&mut gguf, backend)?;
    Ok((model, tokenizer))
}

load_from_gguf_path is the one call a binary makes: give it a path and a backend, get back a model and a tokenizer, both ready to run. The module re-exports it, and src/lib.rs adds mod model; plus pub use model::{load_from_gguf_path, Model};.
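Since everything downstream hangs on the architecture check from earlier in this section, it can help to pin its behaviour down on a few inputs. The sketch below restates the same decision logic over a string-only HashMap, which is an assumption made so the example stands alone (the real metadata map holds MetadataValue). detect_arch and meta are hypothetical names, and the error messages are shortened.

```rust
use std::collections::HashMap;

/// Standalone restatement of the architecture check over string-only metadata.
fn detect_arch(meta: &HashMap<String, String>) -> Result<String, String> {
    if let Some(arch) = meta.get("general.architecture") {
        let a = arch.trim().to_ascii_lowercase();
        // Anything mentioning qwen3, or the bare string "qwen", is treated as qwen3.
        if a.contains("qwen3") || a == "qwen" {
            return Ok("qwen3".into());
        }
        return Err(format!("unsupported GGUF general.architecture {a:?}"));
    }
    // No architecture key at all: fall back to looking for qwen3.* keys.
    if meta.keys().any(|k| k.starts_with("qwen3.")) {
        return Ok("qwen3".into());
    }
    Err("could not infer model architecture".into())
}

/// Build a metadata map from string pairs (illustration only).
fn meta(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With this, " Qwen3 " (untrimmed, mixed case) resolves to qwen3, qwen2 and llama are rejected, a map holding only an illustrative qwen3.block_count key is accepted through the fallback, and an empty map is an error.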

Where this leaves us

There's no binary this chapter; a forward pass is a function, not a program, and a single pass produces logits, not text. But the hard part is done. We have built, from arithmetic primitives, a complete and correct Qwen3: embedding, 28 blocks of RMSNorm + grouped-query attention with RoPE + SwiGLU MLP, final norm, output head, and a loader that fills it with real weights from a GGUF file. model.forward(token_ids) runs the whole network and returns next-token logits for every position.

For generation, one forward pass yields one new token, read from the logits at the last position. Generating text means doing it over and over: predict, append, predict again. The next chapter writes that autoregressive loop, adds the argmax (take the highest-scoring entry) that turns logits into a token choice, and wraps it all in model-generate, the first binary in this series that produces real, coherent text.