I.2: Tokenizer

A language model does not read text. It reads numbers. Specifically, it reads a list of integers (token ids), each one an index into a fixed vocabulary the model was trained with. Before any matrix ever gets multiplied, something has to turn the string "Hello, world!" into a list like [9707, 11, 1879, 0], and after the model produces an answer, something has to turn the ids it emits back into readable text. That something is the tokenizer, and it is the subject of this chapter.

The vocabulary Qwen3 uses has roughly 150,000 entries. A token is usually not a whole word and usually not a single character; it is a chunk somewhere in between, a subword. Common words get their own token; rare words get split into pieces; arbitrary bytes always have a fallback. The scheme that decides those chunks is byte-pair encoding (BPE), and the entire vocabulary plus the rules for combining pieces were already sitting in the GGUF metadata we parsed in I.1. This chapter reads them out and turns them into a working encoder and decoder.

The goal: a BpeTokenizer built straight from GGUF metadata, plus a tokenizer-demo binary that shows a string going to ids and back. No tensors yet, no model. Just text in, integers out, integers in, text out.

What byte-pair encoding is

Start with the problem. You want a fixed vocabulary, small enough that the model's output layer (the final step that scores every vocabulary entry) stays a manageable size, but expressive enough that any string (English, code, emoji, a language the model barely saw) can be represented. Two extremes fail. A vocabulary of whole words can't spell anything it hasn't seen, and there are too many words. A vocabulary of single characters handles everything but makes sequences enormous, and the model has to relearn spelling from scratch every time.

BPE is the compromise, and it is built by a dead-simple training procedure. Start with text split into individual characters. Count every adjacent pair. Find the most frequent pair, say t followed by h, and merge it into a new symbol th. Record that merge. Now th is a unit; count pairs again, merge the most frequent again. Repeat tens of thousands of times, or more for a vocabulary as large as Qwen3's. Each merge is one new vocabulary entry, and the ordered list of merges is the algorithm: to tokenize new text, you replay the merges in the order they were learned, greedily, until no more apply.
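The training loop just described can be sketched in a few lines. This is a toy under stated assumptions, not how any production tokenizer was trained: train_bpe and the four-word corpus are hypothetical, it works on characters rather than bytes, and ties between equally frequent pairs are broken toward the lexicographically smallest pair purely to keep the output deterministic.

```rust
use std::collections::HashMap;

/// Learn up to `num_merges` BPE merges from a toy corpus of words.
/// Ties go to the lexicographically smallest pair (an assumption made here
/// only so the result is deterministic).
fn train_bpe(corpus: &[&str], num_merges: usize) -> Vec<(String, String)> {
    // Each word starts as one symbol per character.
    let mut words: Vec<Vec<String>> = corpus
        .iter()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();
    let mut merges = Vec::new();

    for _ in 0..num_merges {
        // Count every adjacent pair across the corpus.
        let mut counts: HashMap<(String, String), usize> = HashMap::new();
        for word in &words {
            for pair in word.windows(2) {
                *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
            }
        }

        // Pick the most frequent pair; stop if nothing is left to merge.
        let best = counts
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
            .map(|(pair, _)| pair);
        let Some((left, right)) = best else { break };

        // Fuse every occurrence of that pair into one symbol.
        for word in &mut words {
            let mut next = Vec::with_capacity(word.len());
            let mut i = 0;
            while i < word.len() {
                if i + 1 < word.len() && word[i] == left && word[i + 1] == right {
                    next.push(format!("{left}{right}"));
                    i += 2;
                } else {
                    next.push(word[i].clone());
                    i += 1;
                }
            }
            *word = next;
        }
        merges.push((left, right));
    }
    merges
}

fn main() {
    // "t"+"h" is the most frequent pair, then "th"+"e".
    let merges = train_bpe(&["the", "the", "then", "this"], 2);
    for (rank, (a, b)) in merges.iter().enumerate() {
        println!("rank {rank}: {a} + {b}");
    }
}
```

The position of each pair in the returned vector is its rank, which is all the encoder needs later.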

That ordered list (the merge table) and the resulting set of pieces (the vocabulary) are exactly the two things the GGUF file stores for us. We don't train a tokenizer here. We load one and run its encode/decode rules. Concretely, encoding one chunk of text means:

PLAINTEXT

"there"  ->  t h e r e            (start: one symbol per character)
         ->  t h e re             (merge "r"+"e", an early merge)
         ->  th e re              (merge "t"+"h")
         ->  the re               (merge "th"+"e")
         ->  there                (merge "the"+"re")

At each step we pick, among all adjacent pairs currently present, the one whose merge was learned earliest (lowest rank), apply it, and repeat. The final symbols are looked up in the vocabulary to get ids.
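The same walk can be replayed in code. The sketch below is a stand-alone toy, assuming a hypothetical four-entry merge table (toy_ranks) whose order matches the "there" example; replay_merges is an illustrative name, not part of the chapter's tokenizer.

```rust
use std::collections::HashMap;

/// Apply ranked merges to one word, returning every intermediate state.
fn replay_merges(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<Vec<String>> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    let mut steps = vec![symbols.clone()];

    loop {
        // Find the adjacent pair with the lowest rank (leftmost on ties).
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        let Some((_, i)) = best else { break };

        // Fuse symbols i and i + 1 into one.
        let right = symbols.remove(i + 1);
        symbols[i].push_str(&right);
        steps.push(symbols.clone());
    }
    steps
}

/// Hypothetical merge table: array position is the rank.
fn toy_ranks() -> HashMap<(String, String), usize> {
    [("r", "e"), ("t", "h"), ("th", "e"), ("the", "re")]
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

fn main() {
    // Prints the five states of the walk, from "t h e r e" down to "there".
    for step in replay_merges("there", &toy_ranks()) {
        println!("{}", step.join(" "));
    }
}
```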

Figure: the merges above, replayed. Watch the pair with the lowest rank collapse first each round.

Five-step diagram merging the word lower from the single characters l, o, w, e, r into one token, applying one merge per step

Figure: merges, step by step. Another word, same process: lower goes from five single-character symbols to one token in four merges, and every merge shortens the sequence without losing any text.

One wrinkle BPE on its own doesn't solve: text is bytes, and bytes include things like newlines, tabs, and the high half of UTF-8 that don't print nicely. GPT-2-style tokenizers (the family Qwen3 belongs to) handle this with a byte-level trick: before BPE ever runs, every one of the 256 possible bytes is mapped to a single safe, printable Unicode character. BPE then operates purely on those safe characters, and decoding maps them back to bytes. This guarantees there is no input the tokenizer can choke on: worst case, a string falls all the way back to one token per byte, each drawn from the 256 single-byte tokens. We'll build that byte mapping first.

Wiring it up

Two small bookkeeping changes before the tokenizer itself. The crate gains one dependency (the regex crate, which we need for splitting text: mainly into pre-tokenization chunks, and also around special tokens; more on both later) and declares the new binary:

TOML

[package]
name = "inferno"
version = "0.1.0"
edition = "2024"
 
[dependencies]
regex = "1"
 
[[bin]]
name = "gguf-inspect"
path = "src/bin/gguf-inspect.rs"
[[bin]]
name = "tokenizer-demo"
path = "src/bin/tokenizer-demo.rs"

The library root declares the tokenizer module and re-exports the two names binaries need:

RUST

mod cli;
mod gguf;
mod tokenizer;
 
pub use cli::CliArgs;
pub use gguf::{GGUF, TensorInfo};
pub use tokenizer::{BpeTokenizer, Tokenizer};

The tokenizer reads metadata, so it needs MetadataValue, the type-tagged enum from I.1. That type was internal to the gguf module; we make it visible within the crate (but not to the outside world) by adding one pub(crate) re-export to the gguf module file:

RUST

mod gguf;
mod read;
mod types;
 
pub use gguf::GGUF;
pub use types::TensorInfo;
 
pub(crate) use types::MetadataValue;

And the tokenizer needs two more getters on MetadataValue. I.1 added as_u64 for one numeric field; the tokenizer needs to pull strings (as_str) and arrays (as_array), because the vocabulary and the merge table are both stored as arrays of strings. We add both to the same impl MetadataValue block:

RUST

    pub fn as_str(&self) -> Option<&str> {
        match self {
            MetadataValue::String(s) => Some(s.as_str()),
            _ => None,
        }
    }
 
    pub fn as_array(&self) -> Option<&Vec<MetadataValue>> {
        match self {
            MetadataValue::Array(arr) => Some(arr),
            _ => None,
        }
    }
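To see how the two getters compose, here is a small stand-alone sketch. The MetadataValue below is a stand-in, not the real type from I.1: only the String and Array variants and the as_u64 name come from the text, the U32 variant is an assumption for the demo, and string_array is a hypothetical helper that returns None instead of unwrapping.

```rust
// Stand-in for the chapter's MetadataValue. String and Array come from the
// text; U32 is a hypothetical extra variant added for this demo.
#[derive(Debug)]
enum MetadataValue {
    U32(u32),
    String(String),
    Array(Vec<MetadataValue>),
}

impl MetadataValue {
    fn as_u64(&self) -> Option<u64> {
        match self {
            MetadataValue::U32(n) => Some(u64::from(*n)),
            _ => None,
        }
    }

    fn as_str(&self) -> Option<&str> {
        match self {
            MetadataValue::String(s) => Some(s.as_str()),
            _ => None,
        }
    }

    fn as_array(&self) -> Option<&Vec<MetadataValue>> {
        match self {
            MetadataValue::Array(arr) => Some(arr),
            _ => None,
        }
    }
}

/// Hypothetical helper: read an array of strings, or None if the value is
/// not an array or any element is not a string.
fn string_array(value: &MetadataValue) -> Option<Vec<String>> {
    value
        .as_array()?
        .iter()
        .map(|v| v.as_str().map(str::to_string))
        .collect()
}

fn main() {
    let tokens = MetadataValue::Array(vec![
        MetadataValue::String("Hello".to_string()),
        MetadataValue::String(",".to_string()),
    ]);
    println!("{:?}", string_array(&tokens));
    // A non-array value yields None rather than a panic.
    println!("{:?}", string_array(&MetadataValue::U32(7)));
    println!("{:?}", MetadataValue::U32(7).as_u64());
}
```

The chapter's constructor unwraps instead, which is reasonable for a file that is either well-formed or unusable; the Option-returning form is just one way to make the failure explicit.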

The tokenizer module

The new tokenizer module is three files. tokenizer_trait.rs defines the interface, bpe.rs is the implementation, and mod.rs ties them together:

RUST

mod bpe;
mod tokenizer_trait;
 
pub use bpe::BpeTokenizer;
pub use tokenizer_trait::Tokenizer;

A trait first, because later acts will have more than one tokenizer-shaped thing, and a chat pipeline that wants to be generic over them. The contract is small:

RUST

pub trait Tokenizer: Send + Sync {
    fn encode(&self, text: &str) -> Vec<usize>;
    fn decode(&self, ids: &[usize]) -> String;
 
    fn chat_template(&self) -> Option<&str>;
 
    fn eos_token_id(&self) -> usize;
}

encode and decode are the obvious pair. eos_token_id returns the id of the end-of-sequence token, the special token a model emits to say "I'm done"; the generation loop in I.6 watches for it. chat_template returns an optional string: chat-tuned models ship a template (a little templating snippet) describing how to format a conversation into a prompt. We don't use it in Act 1. It surfaces in III.1, but the tokenizer is where it lives, so the trait exposes it now.

Send + Sync on the trait means a tokenizer can be shared across threads, which matters once we serve concurrent requests in Act 3. It costs nothing here.
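A minimal sketch of what that bound buys, assuming a hypothetical ByteTokenizer (one id per UTF-8 byte) as a stand-in for the real implementation and a trimmed copy of the trait: because the trait requires Send + Sync, an Arc<dyn Tokenizer> can be handed to several threads without any extra wrapping.

```rust
use std::sync::Arc;
use std::thread;

// Same shape as the chapter's trait, trimmed to the two methods the demo needs.
trait Tokenizer: Send + Sync {
    fn encode(&self, text: &str) -> Vec<usize>;
    fn decode(&self, ids: &[usize]) -> String;
}

/// Hypothetical stand-in tokenizer: one id per UTF-8 byte.
struct ByteTokenizer;

impl Tokenizer for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<usize> {
        text.bytes().map(usize::from).collect()
    }

    fn decode(&self, ids: &[usize]) -> String {
        // Ids are assumed to be below 256; larger values are truncated.
        let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

/// Encode each prompt on its own thread, sharing one tokenizer.
fn encode_concurrently(tok: Arc<dyn Tokenizer>, prompts: Vec<String>) -> Vec<Vec<usize>> {
    let handles: Vec<_> = prompts
        .into_iter()
        .map(|prompt| {
            let tok = Arc::clone(&tok);
            thread::spawn(move || tok.encode(&prompt))
        })
        .collect();
    // Joining in spawn order keeps results aligned with the prompts.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let tok: Arc<dyn Tokenizer> = Arc::new(ByteTokenizer);
    let ids = encode_concurrently(Arc::clone(&tok), vec!["hi".to_string(), "ok".to_string()]);
    println!("{ids:?}");
    println!("{}", tok.decode(&ids[0]));
}
```

Without the Send + Sync supertraits, the thread::spawn call would not compile for a trait object.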

The byte-level codec

Now bpe.rs. The first piece is the byte-to-safe-character mapping described earlier: the GPT-2 byte-level trick. We give it its own struct, ByteCodec, holding the mapping in both directions:

RUST

use regex::Regex;
use std::collections::{HashMap, HashSet};
 
use crate::gguf::MetadataValue;
 
// ... pre_tokenize_pattern, defined later this chapter, sits here ...
 
#[derive(Clone)]
pub(crate) struct ByteCodec {
    byte_to_char: [char; 256],
    char_to_byte: HashMap<char, u8>,
}
 
impl ByteCodec {
    pub(crate) fn new() -> Self {
        let byte_to_char = Self::build_table();
        let char_to_byte: HashMap<char, u8> = byte_to_char
            .iter()
            .enumerate()
            .map(|(b, &ch)| (ch, b as u8))
            .collect();
        Self {
            byte_to_char,
            char_to_byte,
        }
    }

byte_to_char is a 256-entry array: given any byte, what printable character represents it. char_to_byte is the inverse, built by walking that array. The interesting part is build_table, which decides the mapping:

RUST

    fn build_table() -> [char; 256] {
        let mut order: Vec<u8> = Vec::with_capacity(256);
        order.extend(33..=126);
        order.extend(161..=172);
        order.extend(174..=255);
 
        let mut chars: Vec<char> = order
            .iter()
            .copied()
            .map(|b| char::from_u32(u32::from(b)).unwrap())
            .collect();
 
        let mut seen = [false; 256];
        for &b in &order {
            seen[b as usize] = true;
        }
 
        let mut extra = 0u32;
        for b in 0u16..=255u16 {
            let b = b as u8;
            if !seen[b as usize] {
                order.push(b);
                chars.push(char::from_u32(256 + extra).unwrap());
                extra += 1;
            }
        }
 
        let mut out = ['\0'; 256];
        for (i, &b) in order.iter().enumerate() {
            out[b as usize] = chars[i];
        }
        out
    }
}

The logic, in words: the byte ranges 33..=126, 161..=172, 174..=255 are bytes that are already printable, non-whitespace Unicode code points; those map to themselves. Every byte not in that set (0..=32, 127..=160, and 173: control characters, space, and a couple of others, exactly 68 of them) gets mapped to a code point starting at 256 and counting up, well clear of anything else. The result: all 256 bytes map to 256 distinct, visible, non-whitespace characters. This is the exact table the GPT-2 tokenizer defined, and Qwen3 inherits it.
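Those properties are easy to check. The sketch below uses a compact stand-alone function (byte_table, a hypothetical name) that should produce the same mapping as build_table, written as a single pass so it fits in a few lines.

```rust
use std::collections::HashSet;

/// Stand-alone version of the byte-to-character table described above.
fn byte_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut extra = 0u32;
    for b in 0u32..256 {
        let keeps_itself = (33..=126).contains(&b)
            || (161..=172).contains(&b)
            || (174..=255).contains(&b);
        table[b as usize] = if keeps_itself {
            char::from_u32(b).unwrap()
        } else {
            // Remapped bytes get code points 256, 257, ... in byte order.
            let ch = char::from_u32(256 + extra).unwrap();
            extra += 1;
            ch
        };
    }
    table
}

fn main() {
    let table = byte_table();
    let distinct: HashSet<char> = table.iter().copied().collect();
    let remapped = table.iter().filter(|&&ch| ch as u32 >= 256).count();

    // Space is the 33rd remapped byte, so it lands on U+0120; newline on U+010A.
    println!("space -> {:?}", table[b' ' as usize]);
    println!("newline -> {:?}", table[b'\n' as usize]);
    println!("{} distinct characters, {} remapped bytes", distinct.len(), remapped);
}
```

The characters for space and newline (U+0120 and U+010A) are the ones that show up at the start of so many entries when you browse a GPT-2-style vocabulary.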

Two helpers convert between raw bytes and the mapped string:

RUST

    fn encode_bytes(&self, bytes: &[u8]) -> String {
        bytes
            .iter()
            .map(|&b| self.byte_to_char[b as usize])
            .collect::<String>()
    }
 
    fn decode_chars(&self, mapped: &str) -> String {
        let mut bytes = Vec::with_capacity(mapped.len());
        for ch in mapped.chars() {
            if let Some(&b) = self.char_to_byte.get(&ch) {
                bytes.push(b);
            } else {
                let mut buf = [0u8; 4];
                let utf8 = ch.encode_utf8(&mut buf);
                bytes.extend_from_slice(utf8.as_bytes());
            }
        }
        String::from_utf8_lossy(&bytes).into_owned()
    }

encode_bytes is a straight lookup per byte. decode_chars is the inverse: each character in the mapped string should be in char_to_byte, so we recover the original byte. The else branch is defensive; if some character isn't in the table (it shouldn't happen for well-formed token pieces) we keep its UTF-8 bytes as-is rather than panic. from_utf8_lossy turns the recovered byte vector back into a String, substituting the replacement character for any invalid sequence.

These two methods sit between new and build_table in the file, shown separately here only so the explanation can follow each piece.
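A quick round trip makes the mapping concrete. This sketch uses free functions rather than the ByteCodec methods, and a compact hypothetical byte_table in place of build_table; unlike the chapter's decode_chars, it simply skips characters outside the table and returns raw bytes.

```rust
use std::collections::HashMap;

/// Compact stand-alone version of the byte-to-character table.
fn byte_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut extra = 0u32;
    for b in 0u32..256 {
        let keeps_itself = (33..=126).contains(&b)
            || (161..=172).contains(&b)
            || (174..=255).contains(&b);
        table[b as usize] = if keeps_itself {
            char::from_u32(b).unwrap()
        } else {
            let ch = char::from_u32(256 + extra).unwrap();
            extra += 1;
            ch
        };
    }
    table
}

/// Map raw bytes to their safe characters, one character per byte.
fn encode_bytes(table: &[char; 256], bytes: &[u8]) -> String {
    bytes.iter().map(|&b| table[b as usize]).collect()
}

/// Map safe characters back to raw bytes. Characters outside the table are
/// skipped in this sketch (the chapter keeps their UTF-8 bytes instead).
fn decode_chars(table: &[char; 256], mapped: &str) -> Vec<u8> {
    let inverse: HashMap<char, u8> = table
        .iter()
        .enumerate()
        .map(|(b, &ch)| (ch, b as u8))
        .collect();
    mapped
        .chars()
        .filter_map(|ch| inverse.get(&ch).copied())
        .collect()
}

fn main() {
    let table = byte_table();
    // The leading space becomes U+0120; ASCII letters map to themselves.
    println!("{}", encode_bytes(&table, " world".as_bytes()));
    // One two-byte character becomes two mapped characters (U+00C3, U+00A9).
    let mapped = encode_bytes(&table, "é".as_bytes());
    println!("{mapped}");
    let bytes = decode_chars(&table, &mapped);
    println!("{}", String::from_utf8_lossy(&bytes));
}
```

Note that the mapped form of a multi-byte character is one mapped character per byte, which is why BPE can later split such a character across two tokens.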

Pre-tokenization

Before BPE runs on a string, the string is chopped into coarse chunks, and BPE runs within each chunk, never across chunk boundaries. This step is called pre-tokenization, and its job is to keep merges from doing silly things: gluing the end of one word onto the start of the next, swallowing punctuation into words, mangling whitespace runs. GPT-2-family tokenizers define the chunking with a regular expression, and the GGUF metadata tells us which family we're dealing with via a tokenizer.ggml.pre field:

RUST

fn pre_tokenize_pattern(pre: &str) -> &'static str {
    match pre {
        "gpt2" | "qwen2" | "qwen3" => {
            r"(?i:'(?:[sdmt]|ll|ve|re))|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]|\s+"
        }
        _ => panic!("unsupported pretokenizer: {pre}"),
    }
}

The regex looks dense; it is just an alternation of cases. Reading left to right: contractions like 's 'll 're (case-insensitive); a run of letters optionally led by one character that is not a letter, digit, or line break (this is how a leading space ends up attached to a word, so " the" is one chunk); one to three digits (so a long number is chopped into small chunks rather than kept as one giant undifferentiated blob); a run of punctuation optionally led by a space and followed by any line breaks; optional whitespace ending in a line break; and finally any other whitespace run. Every character of the input lands in exactly one chunk. Apart from the special-token matcher built later in this chapter, this is the only thing we use the regex crate for.

A caveat: this is close to, not exactly, the reference qwen2/qwen3 pattern. The reference splits digits strictly one at a time (\p{N}, not \p{N}{1,3}) and ends with a \s+(?!\S) lookahead that Rust's regex crate can't express, so a run of several spaces before a word chunks slightly differently here than in HF or llama.cpp. In practice the final token ids almost always coincide anyway. The digit difference is harmless because the merge table contains no digit merges, so a three-digit chunk still comes out as three single-digit tokens. The whitespace difference is the one that can change ids, when several spaces directly precede a word.
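The digit case is simple enough to illustrate without the regex crate. The two hypothetical helpers below mimic only the digit alternative of each pattern, and assume the input is already a run of digits: chunk_digits groups greedily by three as \p{N}{1,3} does, and chunk_digits_reference splits one digit at a time as the reference \p{N} does.

```rust
/// Split a run of digits the way `\p{N}{1,3}` does: greedily, left to right,
/// at most three digits per chunk. Assumes `run` contains only digits.
fn chunk_digits(run: &str) -> Vec<String> {
    let digits: Vec<char> = run.chars().collect();
    digits.chunks(3).map(|c| c.iter().collect()).collect()
}

/// The reference pattern's bare `\p{N}` yields one chunk per digit instead.
fn chunk_digits_reference(run: &str) -> Vec<String> {
    run.chars().map(|c| c.to_string()).collect()
}

fn main() {
    println!("{:?}", chunk_digits("1234567"));
    println!("{:?}", chunk_digits_reference("1234567"));
}
```

With no digit merges in the table, BPE leaves every digit as its own symbol inside a chunk, so both chunkings end in the same sequence of single-digit tokens.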

The tokenizer struct

Now the main type. BpeTokenizer holds everything needed to encode and decode:

RUST

#[derive(Clone)]
pub struct BpeTokenizer {
    token_to_id: HashMap<String, usize>,
    id_to_token: Vec<String>,
    merges: HashMap<(String, String), usize>,
    codec: ByteCodec,
 
    special_id_set: HashSet<usize>,
 
    special_token_re: Option<Regex>,
    pretokenize_regex: Regex,
 
    eos_token_id: usize,
 
    chat_template: Option<String>,
}

The two vocabulary maps are inverses: id_to_token is a Vec indexed by id (decoding direction), token_to_id a HashMap from piece to id (encoding direction). merges is the merge table: it maps a pair of pieces to its rank (its position in the learned order; lower rank = learned earlier = higher priority). codec is the byte mapping we just built. pretokenize_regex is the compiled pre-tokenization pattern.

Three of the remaining fields are about special tokens. A vocabulary contains, alongside ordinary text pieces, a handful of control tokens (end-of-sequence, chat-role markers like <|im_start|>, and so on). These must never be split by BPE or produced by accident; they are matched literally. special_token_re is a regex that matches any special token's literal text, special_id_set is the set of their ids, and eos_token_id is the one specific special token the generation loop cares about. The last field, chat_template, stores the optional template string that the trait exposes.

Building it from GGUF metadata

The constructor reads the whole tokenizer out of the metadata map. It is long, so we take it in pieces. First the vocabulary:

RUST

impl BpeTokenizer {
    pub fn from_gguf_metadata(metadata: &std::collections::HashMap<String, MetadataValue>) -> Self {
        let model = metadata
            .get("tokenizer.ggml.model")
            .unwrap()
            .as_str()
            .unwrap();
        assert_eq!(model, "gpt2");
 
        let tokens = metadata
            .get("tokenizer.ggml.tokens")
            .unwrap()
            .as_array()
            .unwrap();
        let id_to_token: Vec<String> = tokens
            .iter()
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
        let token_to_id = id_to_token
            .iter()
            .enumerate()
            .map(|(i, tok)| (tok.clone(), i))
            .collect();

tokenizer.ggml.model names the tokenizer family; we assert it's gpt2, the only one we support. tokenizer.ggml.tokens is the vocabulary: an array of strings, one per id. The id of a token is its position in that array, so id_to_token is a direct collect, and token_to_id is the same data inverted into a map.

Next the merge table:

RUST

        let merges = metadata
            .get("tokenizer.ggml.merges")
            .unwrap()
            .as_array()
            .unwrap();
        let merges_map: HashMap<(String, String), usize> = merges
            .iter()
            .enumerate()
            .map(|(rank, value)| {
                let line = value.as_str().unwrap();
                let mut parts = line.split_whitespace();
                let a = parts.next().unwrap().to_string();
                let b = parts.next().unwrap().to_string();
                assert!(parts.next().is_none());
                ((a, b), rank)
            })
            .collect();

tokenizer.ggml.merges is an array of strings, each of the form "piece_a piece_b": the two pieces that merge, space-separated. The array order is the learned order, so the index in the array is the rank. We split each line into its two pieces and build a map from (a, b) to rank. The two unwraps on parts.next() insist on at least two pieces, and the assert!(parts.next().is_none()) rules out a third, so every line has exactly two.
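As a stand-alone illustration of that parsing step, here is a sketch that takes plain string slices instead of MetadataValue and reports a malformed line as an error rather than panicking. parse_merges and the three sample lines are hypothetical; real merge lines come from the GGUF file.

```rust
use std::collections::HashMap;

/// Parse "left right" merge lines into a (left, right) -> rank map.
/// Returns an error string instead of panicking (a choice made for this sketch).
fn parse_merges(lines: &[&str]) -> Result<HashMap<(String, String), usize>, String> {
    let mut ranks = HashMap::with_capacity(lines.len());
    for (rank, line) in lines.iter().enumerate() {
        let mut parts = line.split_whitespace();
        // Exactly two pieces: two Some values followed by None.
        match (parts.next(), parts.next(), parts.next()) {
            (Some(a), Some(b), None) => {
                ranks.insert((a.to_string(), b.to_string()), rank);
            }
            _ => return Err(format!("merge line {rank} is not two pieces: {line:?}")),
        }
    }
    Ok(ranks)
}

fn main() {
    // Pieces are in byte-mapped form, so a leading space appears as U+0120.
    let ranks = parse_merges(&["Ġ t", "h e", "Ġt he"]).unwrap();
    println!("{:?}", ranks.get(&("h".to_string(), "e".to_string())));
    println!("{:?}", parse_merges(&["a b c"]));
}
```

Splitting on whitespace is safe here because the pieces are byte-mapped: a real space inside a piece is stored as its mapped character, never as a literal space.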

Now special tokens. The metadata carries a parallel array, tokenizer.ggml.token_type, with one type code per token; codes 3 and 4 mark control and user-defined special tokens:

RUST

        let token_types = metadata
            .get("tokenizer.ggml.token_type")
            .unwrap()
            .as_array()
            .unwrap();
        assert_eq!(token_types.len(), id_to_token.len());
 
        let mut special_ids: Vec<usize> = token_types
            .iter()
            .enumerate()
            .filter_map(|(id, ty_val)| {
                let ty = ty_val.as_u64()?;
                if ty != 3 && ty != 4 {
                    return None;
                }
                Some(id)
            })
            .collect();
        special_ids.sort_by_key(|&id| std::cmp::Reverse(id_to_token[id].len()));
 
        let special_id_set: HashSet<usize> = special_ids.iter().copied().collect();

We collect every id whose type is 3 or 4, then sort them longest-piece-first. That ordering matters for the next step: when we build a regex that matches any special token, we want a longer token like <|im_start|> to win over a shorter one that might be its prefix. Regex alternation prefers earlier alternatives, so longest-first gives the correct, greedy match. special_id_set is the same set as a hash set, for fast membership checks during decode.

RUST

        let special_token_re = if special_ids.is_empty() {
            None
        } else {
            let pattern = special_ids
                .iter()
                .map(|&id| regex::escape(&id_to_token[id]))
                .collect::<Vec<_>>()
                .join("|");
            Some(Regex::new(&pattern).unwrap())
        };
 
        let pre = metadata
            .get("tokenizer.ggml.pre")
            .unwrap()
            .as_str()
            .unwrap();
        let pretokenize_regex = Regex::new(pre_tokenize_pattern(pre)).unwrap();

special_token_re is the alternation of every special token's literal text (regex::escape makes sure characters like | are matched literally, not as regex syntax). pretokenize_regex compiles the pre-tokenization pattern picked by tokenizer.ggml.pre.
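Why longest-first matters is easiest to see with a prefix pair. The sketch below avoids the regex crate and imitates a leftmost-first alternation with a plain loop over starts_with; the four-entry vocabulary, its type codes, and the token "<|end" are all hypothetical, chosen so that one special token is a prefix of another.

```rust
/// Ids whose type code is 3 (control) or 4 (user-defined), longest piece first.
fn special_ids(tokens: &[&str], token_types: &[u64]) -> Vec<usize> {
    let mut ids: Vec<usize> = token_types
        .iter()
        .enumerate()
        .filter(|&(_, &ty)| ty == 3 || ty == 4)
        .map(|(id, _)| id)
        .collect();
    ids.sort_by_key(|&id| std::cmp::Reverse(tokens[id].len()));
    ids
}

/// First special token, in the given order, that `text` starts with.
/// This mimics how a leftmost-first alternation picks a branch.
fn first_match<'a>(text: &str, tokens: &[&'a str], order: &[usize]) -> Option<&'a str> {
    order
        .iter()
        .map(|&id| tokens[id])
        .find(|tok| text.starts_with(*tok))
}

fn main() {
    // Hypothetical vocabulary: id 1 is a prefix of id 2.
    let tokens = ["hello", "<|end", "<|end|>", "!"];
    let types = [1u64, 4, 3, 1];

    let sorted = special_ids(&tokens, &types);
    println!("sorted ids: {sorted:?}");

    // In id order the shorter token wins and leaves "|>" behind.
    println!("{:?}", first_match("<|end|> bye", &tokens, &[1, 2]));
    // Longest-first picks the full token.
    println!("{:?}", first_match("<|end|> bye", &tokens, &sorted));
}
```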

Finally the two scalar metadata fields:

RUST

        let eos_token_id = metadata
            .get("tokenizer.ggml.eos_token_id")
            .and_then(|v| v.as_u64())
            .map(|n| n as usize)
            .expect("GGUF missing tokenizer.ggml.eos_token_id");
        assert!(
            eos_token_id < id_to_token.len(),
            "tokenizer.ggml.eos_token_id out of range for tokenizer.ggml.tokens"
        );
 
        let chat_template = metadata
            .get("tokenizer.chat_template")
            .and_then(|v| match v {
                MetadataValue::String(s) if !s.is_empty() => Some(s.clone()),
                _ => None,
            });
 
        Self {
            token_to_id,
            id_to_token,
            merges: merges_map,
            codec: ByteCodec::new(),
            special_id_set,
            special_token_re,
            pretokenize_regex,
            eos_token_id,
            chat_template,
        }
    }

eos_token_id is required; we panic if it's missing or out of range. chat_template is optional; we keep it only if it's a non-empty string. The struct is assembled and returned.

A couple of small accessors round out this part of the impl block:

RUST

    pub fn eos_token_piece(&self) -> &str {
        &self.id_to_token[self.eos_token_id]
    }
 
    pub fn lookup_token(&self, piece: &str) -> Option<usize> {
        self.token_to_id.get(piece).copied()
    }

eos_token_piece returns the text of the EOS token, lookup_token looks up a single piece's id. Neither is used this chapter; both are wanted by the chat pipeline in Act 3.

The BPE merge loop

This is the heart of the tokenizer: the function that takes one pre-tokenized chunk (already byte-mapped into safe characters) and applies the merge table to it:

RUST

    fn bpe(&self, token: &str) -> Vec<String> {
        let mut word: Vec<String> = token.chars().map(|c| c.to_string()).collect();
        if word.len() <= 1 {
            return word;
        }
 
        loop {
            let best = (0..word.len().saturating_sub(1))
                .filter_map(|i| {
                    let key = (word[i].clone(), word[i + 1].clone());
                    self.merges.get(&key).map(|&rank| (i, rank))
                })
                .min_by_key(|&(_, rank)| rank);
 
            let Some((merge_i, _)) = best else { break };
 
            let mut next = Vec::with_capacity(word.len() - 1);
            let mut i = 0usize;
            while i < word.len() {
                if i + 1 < word.len() && i == merge_i {
                    next.push(format!("{}{}", word[i], word[i + 1]));
                    i += 2;
                } else {
                    next.push(word[i].clone());
                    i += 1;
                }
            }
            word = next;
            if word.len() <= 1 {
                break;
            }
        }
        word
    }

word starts as one symbol per character. The loop does one merge per iteration. It scans every adjacent pair (word[i], word[i+1]), looks each up in the merge table, and keeps the one with the lowest rank: the merge learned earliest. If no adjacent pair is in the table, there's nothing left to merge and the loop breaks. Otherwise it rebuilds word with that one pair fused into a single symbol, and goes again. When the word collapses to a single symbol, or no merge applies, we're done. The result is the list of final pieces, exactly the "there" walk from the start of the chapter, in code.

This is the simple-and-correct version. It rescans all pairs every iteration, which is fine for chunks of a few characters but quadratic in chunk length. Real prompts pre-tokenize into short chunks, so it's never a bottleneck, and clarity wins here.

Two small helpers wrap the regexes:

RUST

    fn pre_tokenize<'a>(&'a self, text: &'a str) -> impl Iterator<Item = &'a str> {
        self.pretokenize_regex.find_iter(text).map(|m| m.as_str())
    }
 
    fn split_specials<'a>(&self, text: &'a str) -> Vec<(&'a str, bool)> {
        let Some(re) = &self.special_token_re else {
            return vec![(text, false)];
        };
 
        let mut result = Vec::new();
        let mut last_end = 0;
 
        for m in re.find_iter(text) {
            if m.start() > last_end {
                result.push((&text[last_end..m.start()], false));
            }
            result.push((m.as_str(), true));
            last_end = m.end();
        }
 
        if last_end < text.len() {
            result.push((&text[last_end..], false));
        }
 
        result
    }
}

pre_tokenize yields the coarse chunks. split_specials splits the input around special-token occurrences, returning a list of (fragment, is_special) pairs: special-token fragments are emitted as-is (the bool is true), ordinary text in between is emitted with false. This runs before pre-tokenization so that a special token like <|im_start|> is recognised whole and never fed through BPE.
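To see the shape of split_specials' output without pulling in the regex crate, here is a stand-alone variant that swaps the regex for a linear scan with starts_with. It is a sketch under assumptions: the special tokens are passed in as a slice already sorted longest-first, and the prompt string is just an example.

```rust
/// Split `text` around literal special tokens. `specials` must already be
/// sorted longest-first so a longer token wins over its own prefix.
fn split_specials<'a>(text: &'a str, specials: &[&str]) -> Vec<(&'a str, bool)> {
    let mut result = Vec::new();
    let mut plain_start = 0; // start of the current run of ordinary text
    let mut pos = 0;

    while pos < text.len() {
        let rest = &text[pos..];
        let hit = specials
            .iter()
            .find(|tok| !tok.is_empty() && rest.starts_with(**tok));
        if let Some(tok) = hit {
            // Flush the ordinary text collected so far, then the special token.
            if pos > plain_start {
                result.push((&text[plain_start..pos], false));
            }
            result.push((&text[pos..pos + tok.len()], true));
            pos += tok.len();
            plain_start = pos;
        } else {
            // Advance by one whole character so slices stay on UTF-8 boundaries.
            pos += rest.chars().next().map_or(1, char::len_utf8);
        }
    }
    if plain_start < text.len() {
        result.push((&text[plain_start..], false));
    }
    result
}

fn main() {
    let specials = ["<|im_start|>", "<|im_end|>"];
    for (fragment, is_special) in split_specials("<|im_start|>user\nhi<|im_end|>", &specials) {
        println!("{is_special}: {fragment:?}");
    }
}
```

The scan tries every special token at every position, so it is slower than a compiled regex; it is here only to make the fragment list easy to inspect.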

encode and decode

Flowchart of the encode pipeline: input text is split around special tokens, pre-tokenized into chunks, byte-mapped, merged by BPE, and looked up in the vocabulary to produce token ids 9707, 11, 1879, 0.

Figure: the whole encode path for one string. Every stage below is a function we have already written; encode just chains them.

The Tokenizer trait implementation puts it all together. encode:

RUST

impl super::Tokenizer for BpeTokenizer {
    fn encode(&self, text: &str) -> Vec<usize> {
        let mut out = Vec::new();
 
        for (fragment, is_special) in self.split_specials(text) {
            if is_special {
                out.push(self.token_to_id.get(fragment).copied().unwrap());
            } else {
                for token in self.pre_tokenize(fragment) {
                    let mapped = self.codec.encode_bytes(token.as_bytes());
                    for piece in self.bpe(&mapped) {
                        match self.token_to_id.get(&piece) {
                            Some(&id) => out.push(id),
                            None => {
                                for ch in piece.chars() {
                                    match self.token_to_id.get(&ch.to_string()) {
                                        Some(&id) => out.push(id),
                                        None => {
                                            eprintln!(
                                                "warning: unknown token piece '{ch}' (dropped)"
                                            )
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
 
        out
    }

The full encode pipeline, in nesting order: split into special and non-special fragments; a special fragment becomes its single id directly; a non-special fragment is pre-tokenized into chunks; each chunk's bytes are mapped to safe characters; BPE merges that into pieces; each piece is looked up to get an id. The None branch is the safety net: if a BPE piece somehow isn't in the vocabulary, fall back to its individual characters, which (thanks to the byte-level codec) always are. The very last None should be unreachable; if it ever fires, we warn and drop rather than crash.

decode runs the reverse:

RUST

    fn decode(&self, ids: &[usize]) -> String {
        let mut result = String::new();
        let mut bpe_chars = String::new();
 
        for &id in ids {
            if self.special_id_set.contains(&id) {
                if !bpe_chars.is_empty() {
                    result.push_str(&self.codec.decode_chars(&bpe_chars));
                    bpe_chars.clear();
                }
                result.push_str(&self.id_to_token[id]);
                continue;
            }
 
            if let Some(tok) = self.id_to_token.get(id) {
                bpe_chars.push_str(tok);
            }
        }
 
        if !bpe_chars.is_empty() {
            result.push_str(&self.codec.decode_chars(&bpe_chars));
        }
 
        result
    }

The subtlety here: an ordinary token's piece is in mapped (byte-level) characters, and a single UTF-8 character can be split across two adjacent tokens. So we can't decode token-by-token; we must concatenate the mapped pieces of a run of ordinary tokens, then run decode_chars once on the whole run to recover bytes correctly. bpe_chars is that accumulator. A special token isn't byte-mapped, so when we hit one we first flush the accumulated run, then append the special token's literal text verbatim. A final flush handles the trailing run.
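The split-character problem can be reproduced with nothing but the standard library. In this sketch each "token" is simply a slice of raw bytes (the byte mapping is left out to keep it short), and the two hypothetical functions contrast decoding token by token with decoding the concatenated run.

```rust
/// Decode each token's bytes separately (the tempting but wrong approach).
fn decode_per_token(tokens: &[&[u8]]) -> String {
    tokens
        .iter()
        .map(|t| String::from_utf8_lossy(t).into_owned())
        .collect()
}

/// Concatenate the bytes of the whole run first, then decode once.
fn decode_joined(tokens: &[&[u8]]) -> String {
    let bytes: Vec<u8> = tokens.iter().flat_map(|t| t.iter().copied()).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    // "café" with the two bytes of 'é' (0xC3 0xA9) split across two tokens.
    let tokens: [&[u8]; 2] = [&[0x63, 0x61, 0x66, 0xC3], &[0xA9]];

    // Each half of 'é' is invalid on its own, so it becomes U+FFFD.
    println!("{:?}", decode_per_token(&tokens));
    // Joined first, the two bytes form a valid character again.
    println!("{:?}", decode_joined(&tokens));
}
```

This is also why the demo binary's per-id splits can show replacement characters for non-Latin text even though the full decode is exact.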

The last two trait methods just expose stored fields:

RUST

    fn chat_template(&self) -> Option<&str> {
        self.chat_template.as_deref()
    }
 
    fn eos_token_id(&self) -> usize {
        self.eos_token_id
    }
}

A binary to see it work

tokenizer-demo takes a GGUF path and an optional string, builds the tokenizer, and prints the round trip:

RUST

use std::path::Path;
use std::process;
 
use inferno::{BpeTokenizer, CliArgs, GGUF, Tokenizer};
 
fn usage() -> ! {
    eprintln!("usage: tokenizer-demo <model.gguf> [text]");
    process::exit(2);
}
 
fn main() {
    let args = CliArgs::from_env();
    let positional = args.positionals();
    let gguf_path = positional.first().map(|s| s.as_str()).unwrap_or_else(|| usage());
    let text = positional
        .get(1)
        .map(|s| s.as_str())
        .unwrap_or("Hello, world!");
 
    let path = Path::new(gguf_path);
    let gguf = GGUF::parse(path);
 
    let tok = BpeTokenizer::from_gguf_metadata(&gguf.metadata);
 
    let ids = tok.encode(text);
    println!("ids ({}): {:?}", ids.len(), ids);
    let splits: Vec<String> = ids.iter().map(|&id| tok.decode(&[id])).collect();
    println!("splits ({}): {:?}", splits.len(), splits);
    println!("decode: {:?}", tok.decode(&ids));
}

It parses the GGUF metadata (no tensor data; the tokenizer only needs the metadata map), builds the BpeTokenizer, and encodes the text. Then it does something useful for understanding: it decodes each id individually into splits, so you can see the exact chunk each token represents. Finally it decodes the whole id list at once, which should reproduce the input.

Running it

BASH

cargo run --bin tokenizer-demo -- path/to/Qwen3-0.6B-FP32.gguf "Hello, world!"

PLAINTEXT

ids (4): [9707, 11, 1879, 0]
splits (4): ["Hello", ",", " world", "!"]
decode: "Hello, world!"

Four tokens for four pieces: "Hello", ",", " world" (note the leading space; that's pre-tokenization attaching the space to the following word), and "!". The decode line reproduces the input exactly. Try a word the vocabulary won't have whole, or some non-Latin text, and you'll see it fall back to smaller pieces, but it always round-trips, because the byte-level codec guarantees it.

Where this leaves us

Text is now numbers, and numbers are now text. BpeTokenizer::from_gguf_metadata reads the vocabulary and merge table out of the GGUF file and gives us a clean encode/decode pair: the two ends of the inference pipeline. Everything between them, from here on, is arithmetic on arrays of floats.

Which means it's time to build the thing those floats live in. The next chapter introduces the Tensor (a flat buffer plus a shape) and teaches the GGUF parser to read tensor payloads, not just the index, off disk.