III.6: Decode scheduler

The server has been one-request-at-a-time since III.2. chat_completions calls spawn_blocking, which runs a whole chat turn (prefill plus every decode step) start to finish, then returns. A second request arriving mid-turn waits in line for the first to finish entirely.

spawn_blocking will happily run two of those closures on two threads. But that doesn't help: there is one model, one set of weights, and they aren't safe to drive from two threads at once. Two concurrent turns would race on the model.

So concurrency needs a different shape. Instead of "each request runs its own turn to completion," we want: one component owns the model, holds several requests' worth of state side by side, and advances all of them by interleaving their decode steps: request A's step, request B's step, A's step, B's step. To the model it's still one caller; to the users it's concurrent.

That component is the decode scheduler. This chapter builds it: a background worker thread that owns the model, holds a fixed set of decode slots, admits requests into free slots, and runs the decode loop across all active slots. The HTTP handlers stop calling the model; they submit a job and await a result.

The architecture

PLAINTEXT
HTTP handler ──submit(job)──> ┌─────────────────────────────┐
HTTP handler ──submit(job)──> │  scheduler worker thread     │
HTTP handler ──submit(job)──> │   owns: model, tokenizer,    │
        ▲                     │         backend, prefix cache│
        │                     │                              │
        │                     │   slots: [A][B][ ][ ]        │
        │                     │   waiting queue: [C][D]      │
        └────result───────────│                              │
                              │   loop: drain submits        │
                              │         evict finished       │
                              │         admit from queue     │
                              │         decode all slots     │
                              └─────────────────────────────┘

One worker thread owns everything heavy. HTTP handlers communicate with it over channels: a handler sends a SubmitRequest (the job plus a reply channel) and blocks on the reply. The worker has a fixed array of slots (each slot is one in-flight request's full decode state) and a waiting queue for jobs that arrived when all slots were full.

The worker runs a loop: pull new submissions into the queue, retire any finished slots (sending their results back), admit queued jobs into freed slots, then run one decode step on every active slot. Round and round. Each pass advances every active request by one token, so N requests make progress together.

This chapter interleaves the slots; each slot still gets its own forward pass per tick. III.7 fuses those per-slot passes into one batched pass. The slot and worker structure built here is exactly what III.7 extends.

The III.6 commit adds a scheduler module with four files: a job type, the scheduler handle, the slot, and the worker loop.

src/scheduler/mod.rsRUST
mod job;
mod scheduler;
mod slot;
mod worker;
 
pub use job::ChatJob;
pub use scheduler::InferenceScheduler;

The job

src/scheduler/job.rs defines what a caller submits and the envelope it travels in:

src/scheduler/job.rsRUST
use std::sync::mpsc::Sender;
 
use tokio::sync::mpsc::UnboundedSender;
 
use crate::chat::{ChatTemplateMessage, ChatTurnResult};
 
pub struct ChatJob {
    pub messages: Vec<ChatTemplateMessage>,
    pub max_new_tokens: usize,
    pub kv_mode: &'static str,
    pub stream_delta: Option<UnboundedSender<String>>,
}
 
pub(crate) struct SubmitRequest {
    pub job: ChatJob,
    pub reply: Sender<Result<ChatTurnResult, String>>,
}

A ChatJob is one chat turn's request: the messages, the token cap, the KV cache mode. stream_delta is the interesting field: an optional tokio channel sender. If the request is streaming (SSE), the handler passes a sender here, and the slot pushes each text delta into it as tokens are generated. If non-streaming, it's None and the slot just builds the final result.

SubmitRequest wraps a job with a reply channel, a std::sync::mpsc::Sender the worker uses to send the finished ChatTurnResult back to whoever submitted. Note the two channel kinds: std::sync::mpsc for the reply (the worker thread and the blocking submitter are both plain threads) and tokio::sync::mpsc for stream deltas (the async runtime consumes those).

The scheduler handle

src/scheduler/scheduler.rs is the public face, what HTTP handlers hold. It's a thin handle around a channel to the worker:

src/scheduler/scheduler.rsRUST
use std::sync::Arc;
use std::sync::mpsc::{self, Sender};
 
use crate::cache::RadixPrefixCache;
use crate::chat::ChatTurnResult;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::{ChatJob, SubmitRequest};
use super::worker;
 
pub struct InferenceScheduler {
    submit_tx: Sender<SubmitRequest>,
}
 
impl InferenceScheduler {
    pub fn new(
        max_slots: usize,
        model: Arc<dyn Model>,
        tokenizer: Arc<dyn Tokenizer>,
        backend: Arc<dyn Backend>,
        prefix_cache: Option<Arc<RadixPrefixCache>>,
    ) -> Arc<Self> {
        assert!(max_slots >= 1, "scheduler max_slots must be >= 1");
        let submit_tx = worker::spawn(max_slots, model, tokenizer, backend, prefix_cache);
        Arc::new(Self { submit_tx })
    }

InferenceScheduler holds just one thing: submit_tx, the sending end of the channel into the worker. new spawns the worker thread (handing it the model, tokenizer, backend, and prefix cache; the worker owns them now) and keeps the returned sender. The prefix cache from III.5 finally has a home: the worker owns it, so every request the worker admits shares one cache.

Two submit methods. submit is synchronous, for callers already on a blocking thread:

src/scheduler/scheduler.rsRUST
    pub fn submit(&self, job: ChatJob) -> Result<ChatTurnResult, String> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.submit_tx
            .send(SubmitRequest {
                job,
                reply: reply_tx,
            })
            .map_err(|_| "inference scheduler shut down".to_string())?;
        reply_rx
            .recv()
            .map_err(|_| "scheduler reply channel closed".to_string())?
    }

Make a fresh one-shot reply channel, send the job plus the reply sender into the worker, then block on reply_rx.recv() until the worker sends the result back. The caller's thread parks; the worker does the work.

submit_async is the version an axum handler calls; it wraps submit in spawn_blocking so the async runtime isn't blocked:

src/scheduler/scheduler.rsRUST
    pub async fn submit_async(self: &Arc<Self>, job: ChatJob) -> Result<ChatTurnResult, String> {
        let sched = Arc::clone(self);
        match tokio::task::spawn_blocking(move || sched.submit(job)).await {
            Ok(result) => result,
            Err(e) => Err(format!("scheduler join error: {e}")),
        }
    }
}

The spawn_blocking here is not running the model; it's only parking on reply_rx.recv(). The model runs on the worker thread. The blocking thread this uses just waits.

The slot

src/scheduler/slot.rs is the in-flight state of one request inside the worker. It's essentially the ChatDecodeState from III.1 plus the bookkeeping a server needs: where to send the reply, whether to stream, whether it's done.

src/scheduler/slot.rsRUST
use std::sync::mpsc::Sender;
use std::sync::Arc;
 
use tokio::sync::mpsc::UnboundedSender;
 
use crate::cache::{KvCache, RadixPrefixCache};
use crate::chat::{
    ChatDecodeStep, ChatTurnResult, build_chat_turn_result, chat_decode_kv_one_step,
    prepare_chat_decode_state, stream_delta,
};
use crate::decode::Metrics;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::SubmitRequest;
 
pub(crate) struct Slot {
    pub prompt: String,
    pub prompt_ids: Vec<usize>,
    pub eos_token_id: usize,
    pub ids: Vec<usize>,
    pub next_id: usize,
    pub cache: Box<dyn KvCache>,
    pub metrics: Metrics,
    pub max_new_tokens: usize,
    pub reply: Sender<Result<ChatTurnResult, String>>,
    pub stream: Option<UnboundedSender<String>>,
    pub prev_text: String,
    pub finished: bool,
    pub err: Option<String>,
}

The first chunk, prompt through max_new_tokens, is the decode state: the prompt and its ids, the running id sequence, the predicted next token, the KV cache, timing, the token cap. The rest is server bookkeeping: reply is where the final result goes; stream is the optional delta channel; prev_text tracks streamed text for the delta computation; finished and err say whether and how the slot ended.

Building a slot runs the prefill: III.1's prepare_chat_decode_state, now with the prefix cache wired in:

src/scheduler/slot.rsRUST
impl Slot {
    /// Builds a slot, or sends the failure on the reply channel and returns `None`.
    pub fn build(
        req: SubmitRequest,
        model: &Arc<dyn Model>,
        tokenizer: &dyn Tokenizer,
        backend: &Arc<dyn Backend>,
        prefix_cache: Option<&Arc<RadixPrefixCache>>,
    ) -> Option<Slot> {
        let job = req.job;
 
        let st = match prepare_chat_decode_state(
            Arc::clone(model),
            tokenizer,
            Arc::clone(backend),
            &job.messages,
            job.kv_mode,
            prefix_cache,
        ) {
            Ok(st) => st,
            Err(e) => {
                let _ = req.reply.send(Err(e));
                return None;
            }
        };
 
        Some(Slot {
            prompt: st.prompt,
            prompt_ids: st.prompt_ids,
            eos_token_id: st.eos_token_id,
            ids: st.ids,
            next_id: st.next_id,
            cache: st.cache,
            metrics: st.metrics,
            max_new_tokens: job.max_new_tokens,
            reply: req.reply,
            stream: job.stream_delta,
            prev_text: String::new(),
            finished: false,
            err: None,
        })
    }
 
    pub fn is_active(&self) -> bool {
        !self.finished && self.err.is_none()
    }

build runs prefill, including a prefix-cache lookup since prepare_chat_decode_state does that now. If prefill fails, the error goes straight back on the reply channel and build returns None (no slot created). Otherwise the decode state moves field-by-field into a fresh Slot. is_active is the worker's "still has work" predicate.

decode_one advances the slot by one token, the per-tick step:

src/scheduler/slot.rsRUST
    pub fn decode_one(
        &mut self,
        model: &dyn Model,
        tokenizer: &dyn Tokenizer,
        backend: &dyn Backend,
    ) {
        if !self.is_active() {
            return;
        }
 
        let ids_before = self.ids.len();
        let step = chat_decode_kv_one_step(
            &mut self.ids,
            &mut self.next_id,
            self.cache.as_mut(),
            self.prompt_ids.len(),
            self.eos_token_id,
            self.max_new_tokens,
            model,
            backend,
            &mut self.metrics,
            |_| {},
        );
        self.finish_step(tokenizer, ids_before, step);
    }
 
    pub fn fail(&mut self, err: String) {
        self.err = Some(err);
        self.finished = true;
    }

decode_one calls chat_decode_kv_one_step from III.1, the same single-step function chat-repl uses. It records ids_before so finish_step can tell whether a token was actually produced, then hands the step result to finish_step.

finish_step handles streaming and end-of-turn:

src/scheduler/slot.rsRUST
    pub fn finish_step(
        &mut self,
        tokenizer: &dyn Tokenizer,
        ids_before: usize,
        step: Result<ChatDecodeStep, String>,
    ) {
        if self.stream.is_some() && self.ids.len() > ids_before {
            self.emit_stream_delta(tokenizer);
        }
        match step {
            Ok(ChatDecodeStep::Continue) => {}
            Ok(ChatDecodeStep::Finished { .. }) => self.finished = true,
            Err(e) => {
                self.err = Some(e);
                self.finished = true;
            }
        }
    }

If this slot is streaming and a token landed, push a delta. Then act on the step: Continue does nothing, Finished marks the slot done, an error records the error and marks it done.

When a slot finishes, send_result builds the final ChatTurnResult and sends it on the reply channel:

src/scheduler/slot.rsRUST
    pub fn send_result(self, tokenizer: &dyn Tokenizer) {
        if let Some(e) = self.err {
            let _ = self.reply.send(Err(e));
            return;
        }
 
        let turn = build_chat_turn_result(
            self.prompt,
            self.prompt_ids,
            self.eos_token_id,
            self.ids,
            tokenizer,
            &self.metrics,
        );
        let _ = self.reply.send(Ok(turn));
    }
 
    fn emit_stream_delta(&mut self, tokenizer: &dyn Tokenizer) {
        let new_tokens = &self.ids[self.prompt_ids.len()..];
        if let Some(delta) = stream_delta(tokenizer, new_tokens, &mut self.prev_text)
            && let Some(tx) = &self.stream
        {
            let _ = tx.send(delta);
        }
    }
}

send_result takes self by value; the slot is consumed. On error it sends the error; otherwise it builds the result with III.1's build_chat_turn_result and sends it. emit_stream_delta reuses III.1's stream_delta, decoding the generated tokens, diffing against prev_text, sending only the new text, and pushing it down the slot's stream channel.

The worker loop

src/scheduler/worker.rs is the engine. It spawns the thread and runs the loop. First, spawn:

src/scheduler/worker.rsRUST
use std::collections::VecDeque;
use std::sync::Arc;
use std::sync::mpsc::{self, Receiver, Sender, TryRecvError};
use std::thread;
 
use crate::cache::RadixPrefixCache;
use crate::model::Model;
use crate::backend::Backend;
use crate::tokenizer::Tokenizer;
 
use super::job::SubmitRequest;
use super::slot::Slot;
 
/// Spawns the worker thread and returns the submit sender.
pub(super) fn spawn(
    max_slots: usize,
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: Arc<dyn Backend>,
    prefix_cache: Option<Arc<RadixPrefixCache>>,
) -> Sender<SubmitRequest> {
    let (ready_tx, ready_rx) = mpsc::channel();
    thread::Builder::new()
        .name("inferno-scheduler".into())
        .spawn(move || {
            let (submit_tx, submit_rx) = mpsc::channel();
            ready_tx
                .send(submit_tx)
                .expect("hand off submit sender to InferenceScheduler");
            let mut worker = WorkerLoop {
                submit_rx,
                model,
                tokenizer,
                backend,
                prefix_cache,
                slots: (0..max_slots).map(|_| None).collect(),
                waiting_queue: VecDeque::new(),
                submit_gone: false,
            };
            worker.main_loop();
        })
        .expect("spawn inference scheduler thread");
    ready_rx
        .recv()
        .expect("scheduler worker failed to start")
}

spawn starts a named thread. Inside it, the worker creates its own submit channel (submit_tx/submit_rx) and hands submit_tx back to spawn over a one-shot ready channel. spawn returns that sender; it's what InferenceScheduler keeps. The dance ensures the channel is created on the worker side and the caller only gets the handle once the worker is genuinely running.

WorkerLoop is the state the loop owns:

src/scheduler/worker.rsRUST
struct WorkerLoop {
    submit_rx: Receiver<SubmitRequest>,
    model: Arc<dyn Model>,
    tokenizer: Arc<dyn Tokenizer>,
    backend: Arc<dyn Backend>,
    prefix_cache: Option<Arc<RadixPrefixCache>>,
    slots: Vec<Option<Slot>>,
    waiting_queue: VecDeque<SubmitRequest>,
    submit_gone: bool,
}

slots is a fixed-length Vec<Option<Slot>>, where None is a free slot and Some is occupied. waiting_queue holds submissions when every slot is full. submit_gone tracks whether all schedulers were dropped (time to shut down).

The main loop:

src/scheduler/worker.rsRUST
impl WorkerLoop {
    fn main_loop(&mut self) {
        loop {
            if !self.submit_gone {
                self.drain_submits();
            }
 
            self.evict_finished();
            self.admit();
 
            let any_active = self
                .slots
                .iter()
                .any(|s| s.as_ref().is_some_and(Slot::is_active));
 
            if any_active {
                self.decode_tick();
            } else if self.waiting_queue.is_empty() && self.slots.iter().all(Option::is_none) {
                if self.submit_gone {
                    return;
                }
                // Fully idle: block until the next job arrives.
                match self.submit_rx.recv() {
                    Ok(req) => self.waiting_queue.push_back(req),
                    Err(_) => {
                        self.submit_gone = true;
                        return;
                    }
                }
            }
        }
    }

Each iteration: drain newly submitted jobs into the queue, retire finished slots, admit queued jobs into free slots, then, if any slot is active, run one decode step on all of them. If nothing is active and nothing is queued, the worker is fully idle, so it blocks on submit_rx.recv() instead of spinning, sleeping until the next request wakes it. When all schedulers are dropped the channel closes and the worker returns.

The four steps. drain_submits non-blockingly pulls every queued submission:

src/scheduler/worker.rsRUST
    fn drain_submits(&mut self) {
        loop {
            match self.submit_rx.try_recv() {
                Ok(req) => self.waiting_queue.push_back(req),
                Err(TryRecvError::Empty) => return,
                Err(TryRecvError::Disconnected) => {
                    self.submit_gone = true;
                    return;
                }
            }
        }
    }

try_recv doesn't block; it grabs everything available and returns the moment the channel is empty, so a busy worker keeps decoding rather than waiting on new work.

evict_finished retires done slots:

src/scheduler/worker.rsRUST
    fn evict_finished(&mut self) {
        let tokenizer = self.tokenizer.as_ref();
        for entry in self.slots.iter_mut() {
            let done = entry.as_ref().is_some_and(|s| s.finished);
            if done {
                entry.take().unwrap().send_result(tokenizer);
            }
        }
    }

For each finished slot, take() removes it (leaving None) and send_result consumes it, sending the final result and freeing the slot for the next job.

admit fills free slots from the queue:

src/scheduler/worker.rsRUST
    fn admit(&mut self) {
        for entry in self.slots.iter_mut() {
            if entry.is_some() {
                continue;
            }
            let Some(req) = self.waiting_queue.pop_front() else { return };
 
            if let Some(slot) = Slot::build(
                req,
                &self.model,
                self.tokenizer.as_ref(),
                &self.backend,
                self.prefix_cache.as_ref(),
            ) {
                *entry = Some(slot);
            }
        }
    }

For each empty slot, pull the next queued job and Slot::build it, which runs that request's prefill. If build returns None (prefill failed; the error already went back to the caller), the slot stays empty for the next job.

decode_tick advances every active slot one token:

src/scheduler/worker.rsRUST
    fn decode_tick(&mut self) {
        for slot in self.slots.iter_mut().flatten() {
            slot.decode_one(
                self.model.as_ref(),
                self.tokenizer.as_ref(),
                self.backend.as_ref(),
            );
        }
    }
}

One decode_one per occupied slot. This is the interleave: across many ticks, every active request advances by one token per tick, so they all make steady progress. (It's also exactly the function III.7 replaces, fusing the per-slot forward passes into one.)

Wiring the server to the scheduler

ChatServerState no longer holds the model directly; it holds the scheduler:

src/openapi/router.rsRUST
pub struct ChatServerState {
    pub scheduler: Arc<InferenceScheduler>,
    pub default_max_tokens: usize,
    pub model_label: String,
    pub kv_cache_mode: &'static str,
}

The non-streaming handler stops calling the model and submits a job instead:

src/openapi/router.rsRUST
    let job = ChatJob {
        messages,
        max_new_tokens: max_tokens,
        kv_mode: state.kv_cache_mode,
        stream_delta: None,
    };
    let out = state.scheduler.submit_async(job).await.unwrap();

stream_delta: None, non-streaming. submit_async hands the job to the worker and awaits the result. The handler is now tiny: build a job, submit, shape the response.

Streaming is similar, but it bridges the slot's tokio delta channel to the SSE channel from III.3:

src/openapi/router.rsRUST
fn stream_response(
    messages: Vec<ChatTemplateMessage>,
    max_tokens: usize,
    model_name: String,
    sched: Arc<InferenceScheduler>,
    kv_mode: &'static str,
) -> Response {
    let (sse_tx, sse_rx) = tokio::sync::mpsc::unbounded_channel::<String>();
    let (delta_tx, mut delta_rx) = tokio::sync::mpsc::unbounded_channel::<String>();
 
    let id = new_completion_id();
    let created = unix_secs();
 
    {
        let tx = sse_tx.clone();
        let id = id.clone();
        let model = model_name.clone();
        tokio::spawn(async move {
            while let Some(token) = delta_rx.recv().await {
                let chunk = build_chunk(&id, created, &model, None, Some(token), None);
                if tx.send(chunk).is_err() {
                    break;
                }
            }
        });
    }

Two channels now. delta_tx/delta_rx is the slot's raw-text delta channel; sse_tx/sse_rx is the SSE chunk channel. A small tokio::spawned task bridges them: read each raw delta from delta_rx, wrap it with build_chunk into a chat.completion.chunk, push it to sse_tx. The slot doesn't know about SSE framing; it just sends text.

The job is submitted with delta_tx as its stream_delta:

src/openapi/router.rsRUST
    tokio::task::spawn_blocking(move || {
        sse_tx
            .send(build_chunk(
                &id,
                created,
                &model_name,
                Some("assistant"),
                None,
                None,
            ))
            .unwrap();
 
        let job = ChatJob {
            messages,
            max_new_tokens: max_tokens,
            kv_mode,
            stream_delta: Some(delta_tx),
        };
        let out = sched.submit(job).unwrap();
 
        let reason = if out.hit_stop { "stop" } else { "length" };
        let _ = sse_tx.send(build_chunk(
            &id,
            created,
            &model_name,
            None,
            None,
            Some(reason),
        ));
        let _ = sse_tx.send("[DONE]".into());
    });

Send the opening role chunk, submit the job (Some(delta_tx); the worker's slot now streams deltas through it), block on the result, then send the finish chunk and [DONE]. The rest of stream_response, turning sse_rx into the SSE body, is unchanged from III.3.

The chat-server binary

chat-server builds an InferenceScheduler instead of holding the model. It learns --max-concurrent for the slot count, defaulting to the machine's parallelism:

src/bin/chat-server.rsRUST
fn default_max_concurrent() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get().max(1))
        .unwrap_or(4)
}

and --prefix-cache-max from III.5 is read here too:

src/bin/chat-server.rsRUST
    let prefix_cache_max = args.prefix_cache_max(0);
src/bin/chat-server.rsRUST
    let max_concurrent = args.max_concurrent(default_max_concurrent());

--max-concurrent is one more arm in the CLI parser:

src/cli/args.rsRUST
                Some("--max-concurrent") => {
                    cur.advance();
                    let n = parse_usize(&mut cur, "--max-concurrent");
                    assert!(n >= 1, "--max-concurrent must be >= 1");
                    max_concurrent = Some(n);
                }

After loading the model, main builds the prefix cache and the scheduler, then puts the scheduler in the state:

src/bin/chat-server.rsRUST
    let prefix_cache = if prefix_cache_max > 0 {
        Some(RadixPrefixCache::new(prefix_cache_max))
    } else {
        None
    };
 
    eprintln!(
        "model: {model_label} ({}), backend: {}, kv: {kv_cache_mode}, prefix max: {}, slots: {}",
        gguf_path.display(),
        backend_name,
        prefix_cache_max,
        max_concurrent,
    );
 
    let scheduler =
        InferenceScheduler::new(max_concurrent, model, tokenizer, backend, prefix_cache);
 
    let state = Arc::new(ChatServerState {
        scheduler,
        default_max_tokens,
        model_label,
        kv_cache_mode,
    });

InferenceScheduler::new consumes the model, tokenizer, backend, and prefix cache, spawns the worker thread, and returns the handle. From here on, every request flows through the scheduler.

Running it

Start the server with 4 slots and a prefix cache:

BASH
cargo run --release --bin chat-server -- \
  --kv paged --max-concurrent 4 --prefix-cache-max 32 path/to/qwen3-0.6b.gguf 128
PLAINTEXT
model: qwen3-0.6b (path/to/qwen3-0.6b.gguf), backend: simd, kv: paged, prefix max: 32, slots: 4
listening on http://127.0.0.1:8000

Now fire several requests at once, three curls backgrounded in one line:

BASH
for q in "Name a color." "Name an animal." "Name a planet."; do
  curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H 'content-type: application/json' \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$q\"}]}" &
done; wait
PLAINTEXT
{"id":"chatcmpl-...","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"Blue is a color."},"finish_reason":"stop"}],...}
{"id":"chatcmpl-...","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"A dog is an animal."},"finish_reason":"stop"}],...}
{"id":"chatcmpl-...","object":"chat.completion",...,"choices":[{"index":0,"message":{"role":"assistant","content":"Earth is a planet."},"finish_reason":"stop"}],...}

All three are admitted into separate slots and decoded together; the worker interleaves their steps instead of running them strictly back to back. With --max-concurrent 1 the same three requests would queue and run one after another; with 4 slots they overlap.

Where this leaves us

The server handles concurrent requests. A background worker owns the model, tokenizer, backend, and prefix cache; HTTP handlers submit jobs and await results over channels. A fixed pool of slots holds in-flight requests, a waiting queue holds the overflow, and the worker loop admits, decodes, and retires slots in a steady cycle.

But the decode_tick runs each slot's forward pass separately: four active slots means four independent forward passes per tick. Each of those passes does a matrix-times-vector projection, leaving the hardware mostly idle. Run four sequences as one batched pass and the projections become a matrix-times-matrix, the work the hardware is actually built for. The last chapter fuses the per-slot passes into one batched decode.