III.2: HTTP API

III.1 built a reusable chat turn: hand run_chat_turn_streaming_with_prefix a list of messages and it renders the prompt, runs the model, and streams text back through a callback. chat-repl was the first caller. This chapter writes the second one, and it's the caller that turns the project from a CLI into a server.

The goal is an HTTP server that speaks OpenAI's API. That choice is deliberate. OpenAI's /v1/chat/completions endpoint is the de facto standard: dozens of clients, SDKs, and chat UIs already know how to talk to it. If our server accepts the same JSON, every one of those tools works against it for free, with no custom client and no glue code. Rather than invent a protocol, we implement one that already won.

This chapter does the non-streaming half: the client sends a request, the server runs the whole turn, and it sends back one JSON response with the full reply. Streaming (tokens arriving as they're generated) is III.3. We'll build the server with axum (an HTTP framework) on top of tokio (Rust's async runtime), define the OpenAI request and response types, and ship a chat-server binary.

What we're implementing

Three endpoints, all under /v1 except the health check:

GET /health: returns {"status": "ok"}. A load balancer or the benchmark harness pings this to know the server is up.
GET /v1/models: lists the models the server hosts. OpenAI clients call this to discover what's available; we host exactly one.
POST /v1/chat/completions: the main endpoint. Body is a JSON object with a messages array; response is the assistant's reply.

A minimal request body looks like:

JSON

{
  "model": "qwen3-0.6b-q8_0",
  "messages": [{"role": "user", "content": "What is 2 + 2?"}]
}

and the response we send back:

JSON

{
  "id": "chatcmpl-18c20ed4bd078cf0",
  "object": "chat.completion",
  "created": 1784004709,
  "model": "qwen3-0.6b-q8_0",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "<think>\nOkay, so the question is asking, [...] </think>\n\n2 + 2 equals 4. \n\n**Answer:** 4"},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 16, "completion_tokens": 274, "total_tokens": 290}
}

Every field has a job. choices is an array because the API can return multiple completions per request (we always return one). finish_reason is "stop" if the model emitted its end-of-sequence token and "length" if it hit the token cap first, exactly the hit_stop flag from ChatTurnResult in III.1. usage is the token accounting clients use for cost tracking.

The crate

Two new dependencies, one new binary:

TOML

tokio = { version = "1", features = ["rt-multi-thread", "macros", "sync"] }
axum = "0.8"

TOML

[[bin]]
name = "chat-server"
path = "src/bin/chat-server.rs"

tokio is the async runtime, the thing that drives many network connections concurrently without a thread per connection. axum is a web framework built on tokio: it handles HTTP parsing, routing a URL to a handler function, and serializing responses. We already pulled in serde/serde_json in III.1; axum uses them to turn request JSON into Rust structs and back.

The library gets an openapi module:

RUST

mod openapi;

RUST

pub use openapi::{chat_router, ChatServerState};

chat_router builds the axum router; ChatServerState is the shared state every handler can see. The module has two files, one for the JSON types and one for the router:

RUST

mod router;
pub(crate) mod types;
 
pub use router::{ChatServerState, chat_router};

The OpenAI JSON types

src/openapi/types.rs is a one-to-one Rust mirror of OpenAI's JSON. serde does the conversion: #[derive(Deserialize)] types are parsed from incoming JSON, #[derive(Serialize)] types are written to outgoing JSON, and the struct field names become the JSON keys.

One wrinkle first. OpenAI allows a message's content to be either a plain string or an array of typed parts ({"type": "text", "text": "..."}), the latter for multimodal inputs. We only do text, so we collapse both forms to a plain String with a custom deserializer:

RUST

use serde::{Deserialize, Deserializer, Serialize};
 
fn deserialize_chat_content<'de, D>(deserializer: D) -> Result<String, D::Error>
where
    D: Deserializer<'de>,
{
    let v = serde_json::Value::deserialize(deserializer)?;
    Ok(chat_content_to_plain_text(&v))
}
 
fn chat_content_to_plain_text(v: &serde_json::Value) -> String {
    match v {
        serde_json::Value::String(s) => s.clone(),
        serde_json::Value::Array(parts) => parts
            .iter()
            .filter_map(|p| {
                if let Some(t) = p.get("text").and_then(|x| x.as_str()) {
                    return Some(t.to_string());
                }
                p.as_str().map(String::from)
            })
            .collect::<Vec<_>>()
            .concat(),
        serde_json::Value::Null => String::new(),
        _ => String::new(),
    }
}

deserialize_chat_content parses the field into a generic serde_json::Value first, then chat_content_to_plain_text flattens it: a string stays a string, an array has its text parts concatenated, anything else becomes empty. A client that sends the array form just works.
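To see the flattening rule without serde in the way, here is a minimal std-only sketch. The Content and Part types are hypothetical stand-ins for serde_json::Value and are not part of the chapter's code; only the collapse logic is meant to mirror chat_content_to_plain_text.

```rust
/// Hypothetical stand-in for the JSON shapes a `content` field can take.
enum Content {
    Text(String),
    Parts(Vec<Part>),
    Other, // null, numbers, objects: anything we do not understand
}

/// Hypothetical stand-in for one element of the array form.
enum Part {
    /// An object such as {"type": "...", "text": "..."}; `text` may be absent.
    Typed { text: Option<String> },
    /// A bare string sitting directly inside the array.
    Bare(String),
}

/// Collapse either form to a plain String, the way the chapter's helper does.
fn flatten(content: &Content) -> String {
    match content {
        Content::Text(s) => s.clone(),
        Content::Parts(parts) => parts
            .iter()
            // Parts without usable text are skipped, not treated as errors.
            .filter_map(|p| match p {
                Part::Typed { text } => text.clone(),
                Part::Bare(s) => Some(s.clone()),
            })
            .collect::<Vec<_>>()
            .concat(), // joined with no separator
        Content::Other => String::new(),
    }
}
```

Two things are worth noticing: parts are joined with no separator, and a part that carries no text (an image part, say) silently disappears instead of failing the request.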

Now the request body:

RUST

#[derive(Debug, Deserialize)]
pub struct ChatCompletionRequest {
    #[serde(default)]
    pub model: String,
    pub messages: Vec<ChatCompletionMessage>,
    #[serde(default)]
    pub max_tokens: Option<u32>,
    #[serde(default)]
    pub max_completion_tokens: Option<u32>,
    #[serde(default)]
    pub stream: bool,
}
 
#[derive(Debug, Deserialize, Serialize, Clone)]
pub struct ChatCompletionMessage {
    pub role: String,
    #[serde(deserialize_with = "deserialize_chat_content")]
    pub content: String,
    #[serde(default)]
    pub reasoning_content: Option<String>,
}

#[serde(default)] makes a field optional: if the JSON omits it, the field gets its type's default ("", None, false). Only messages is required. There are two token-limit fields because OpenAI renamed max_tokens to max_completion_tokens; clients send one or the other, so we accept both. stream we read but ignore this chapter; it's III.3's job. The deserialize_with attribute on content is what routes that field through the flattener above.

The response side uses four structs that nest into the JSON shown earlier:

RUST

#[derive(Debug, Serialize)]
pub struct ChatCompletionResponse {
    pub id: String,
    pub object: &'static str,
    pub created: u64,
    pub model: String,
    pub choices: Vec<CompletionChoice>,
    pub usage: Usage,
}
 
#[derive(Debug, Serialize)]
pub struct CompletionChoice {
    pub index: u32,
    pub message: ChoiceMessage,
    pub finish_reason: String,
}
 
#[derive(Debug, Serialize)]
pub struct ChoiceMessage {
    pub role: &'static str,
    pub content: String,
}
 
#[derive(Debug, Serialize)]
pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}

object is &'static str because it's always the constant "chat.completion", never read from input, only written out. The structure is exactly the response JSON above: a top-level object holding a choices array, each choice holding a message, plus a usage block.

And the /v1/models types:

RUST

#[derive(Debug, Serialize)]
pub struct ModelObject {
    pub id: String,
    pub object: &'static str,
}
 
#[derive(Debug, Serialize)]
pub struct ModelsListResponse {
    pub object: &'static str,
    pub data: Vec<ModelObject>,
}

A list response with a data array of model descriptors, though our data always has exactly one entry.

The router

src/openapi/router.rs wires URLs to handlers. The imports and the shared state:

RUST

use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};
 
use axum::extract::State;
use axum::response::{IntoResponse, Response};
use axum::routing::{get, post};
use axum::{Json, Router};
 
use crate::backend::Backend;
use crate::chat::{ChatTemplateMessage, run_chat_turn_streaming_with_prefix};
use crate::decode::Metrics;
use crate::model::Model;
use crate::openapi::types::{
    ChatCompletionMessage, ChatCompletionRequest, ChatCompletionResponse, ChoiceMessage,
    CompletionChoice, ModelObject, ModelsListResponse, Usage,
};
use crate::tokenizer::Tokenizer;
 
pub struct ChatServerState {
    pub model: Arc<dyn Model>,
    pub tokenizer: Arc<dyn Tokenizer>,
    pub backend: Arc<dyn Backend>,
    pub default_max_tokens: usize,
    pub model_label: String,
    pub kv_cache_mode: &'static str,
}

ChatServerState holds everything the handlers need that doesn't change per request: the loaded model, tokenizer, and backend, plus a few config values. Every field is shared: Arc<dyn Model> is a reference-counted pointer, so the model is loaded once and the same instance answers every request. axum hands a clone of an Arc<ChatServerState> to each handler invocation.
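If the Arc part feels abstract, the sharing can be seen with nothing but the standard library. This is only a sketch: SharedState is a hypothetical one-field stand-in for ChatServerState, and plain threads stand in for handler invocations.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for ChatServerState; the real one holds the model.
struct SharedState {
    model_label: String,
}

/// Give each of `n` worker threads its own Arc clone and collect the label
/// each one reads through it.
fn labels_seen_by_workers(state: Arc<SharedState>, n: usize) -> Vec<String> {
    let handles: Vec<_> = (0..n)
        .map(|_| {
            // Cloning the Arc bumps a reference count; the state is not copied.
            let per_request = Arc::clone(&state);
            thread::spawn(move || per_request.model_label.clone())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Every worker reads the same label from the same allocation, and once the workers finish their clones are dropped, so the reference count falls back to the caller's single handle.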

chat_router builds the routing table:

RUST

pub fn chat_router(state: Arc<ChatServerState>) -> Router {
    Router::new()
        .route("/health", get(health))
        .route("/v1/models", get(list_models))
        .route("/v1/chat/completions", post(chat_completions))
        .with_state(state)
}

Three routes, each pairing a path with a method (get/post) and a handler function. with_state attaches the shared state so any handler can ask for it.

The two simple handlers:

RUST

async fn health() -> Json<serde_json::Value> {
    Json(serde_json::json!({ "status": "ok" }))
}
 
async fn list_models(State(state): State<Arc<ChatServerState>>) -> Json<ModelsListResponse> {
    Json(ModelsListResponse {
        object: "list",
        data: vec![ModelObject {
            id: state.model_label.clone(),
            object: "model",
        }],
    })
}

async fn because axum handlers run on the tokio runtime. Wrapping a value in Json(...) tells axum to serialize it and set the Content-Type header. health returns a fixed object. list_models asks for the state (State(state): State<...> is axum's syntax for "give me the shared state") and reports the single hosted model.

A few helpers for the main handler. First, converting the wire message type to the chat-pipeline message type from III.1:

RUST

fn map_messages(msgs: &[ChatCompletionMessage]) -> Vec<ChatTemplateMessage> {
    assert!(!msgs.is_empty(), "messages must not be empty");
    msgs.iter()
        .map(|m| ChatTemplateMessage {
            role: m.role.clone(),
            content: m.content.clone(),
            reasoning_content: m.reasoning_content.clone(),
        })
        .collect()
}

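Two near-identical structs: ChatCompletionMessage is the HTTP-layer type, ChatTemplateMessage is the chat-layer type. Keeping them separate means the chat module has no dependency on the HTTP module; map_messages is the small bridge between them. Note the assert!: an empty messages array panics inside the handler instead of producing an error response, the same fail-fast style as the unwrap() calls elsewhere in this chapter.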

How many tokens to generate, and which model name to echo back:

RUST

fn resolve_max_tokens(req: &ChatCompletionRequest, default: usize) -> usize {
    let from_req = req
        .max_tokens
        .or(req.max_completion_tokens)
        .and_then(|n| (n >= 1).then_some(n as usize));
    from_req.unwrap_or(default)
}
 
fn effective_model(req: &ChatCompletionRequest, state: &ChatServerState) -> String {
    if req.model.is_empty() {
        state.model_label.clone()
    } else {
        req.model.clone()
    }
}

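resolve_max_tokens prefers max_tokens, falls back to max_completion_tokens when max_tokens is absent, treats a value of zero as unset, and finally falls back to the server's configured default. Negative values never reach it: both fields are Option<u32>, so a negative number fails deserialization before the handler runs. effective_model echoes back whatever model name the client sent, or the server's own label if the client didn't say.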
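The Option chain has one corner case that is easy to miss, so here is a small sketch that pins it down. Limits is a hypothetical cut-down request holding just the two token fields; the chain itself is copied from resolve_max_tokens.

```rust
/// Hypothetical cut-down request: just the two token-limit fields.
struct Limits {
    max_tokens: Option<u32>,
    max_completion_tokens: Option<u32>,
}

/// The same Option chain as resolve_max_tokens.
fn resolve(req: &Limits, default: usize) -> usize {
    req.max_tokens
        // Only consult the second field when the first is absent.
        .or(req.max_completion_tokens)
        // A zero limit is treated as "not set".
        .and_then(|n| (n >= 1).then_some(n as usize))
        .unwrap_or(default)
}
```

With a default of 256: both fields set to 64 and 128 gives 64; only max_completion_tokens set to 128 gives 128; neither set gives 256. The corner case is max_tokens: 0 together with max_completion_tokens: 128, which gives 256 and not 128, because .or picks the first present value before the zero check runs.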

Two more helpers generate the response metadata:

RUST

fn unix_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs()
}
 
fn new_completion_id() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    format!("chatcmpl-{nanos:x}")
}

unix_secs is the created timestamp. new_completion_id builds the chatcmpl-... id: the current time in nanoseconds, hex-encoded, which is unique enough for our purposes.
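Because the id is just a hex-encoded timestamp, it can be decoded again, which is occasionally handy when reading logs. A minimal sketch; both function names are hypothetical helpers, not part of the chapter's code:

```rust
/// Build an id the way new_completion_id does, from a given nanosecond count.
fn completion_id_from_nanos(nanos: u128) -> String {
    format!("chatcmpl-{nanos:x}")
}

/// Recover the nanosecond timestamp from an id, if it has the expected shape.
fn nanos_from_completion_id(id: &str) -> Option<u128> {
    let hex = id.strip_prefix("chatcmpl-")?;
    u128::from_str_radix(hex, 16).ok()
}
```

Decoding the id from the sample response, chatcmpl-18c20ed4bd078cf0, and dividing by 1_000_000_000 gives 1784004709, the same second as that response's created field. It also shows the limit of "unique enough": two requests stamped in the same nanosecond would share an id.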

Now the handler that does the work:

RUST

async fn chat_completions(
    State(state): State<Arc<ChatServerState>>,
    Json(req): Json<ChatCompletionRequest>,
) -> Response {
    let messages = map_messages(&req.messages);
    let max_tokens = resolve_max_tokens(&req, state.default_max_tokens);
    let model_name = effective_model(&req, &state);
 
    let kv_mode = Some(state.kv_cache_mode);
    let job_state = Arc::clone(&state);
    let out = tokio::task::spawn_blocking(move || {
        let mut metrics = Metrics::default();
        run_chat_turn_streaming_with_prefix(
            job_state.model.clone(),
            job_state.tokenizer.clone(),
            &job_state.backend,
            &messages,
            max_tokens,
            kv_mode,
            &mut metrics,
            |_| {},
        )
    })
    .await
    .unwrap()
    .unwrap();

Json(req): Json<ChatCompletionRequest> is axum parsing the request body into the struct; if the body isn't valid JSON, axum rejects it with a 400 before the handler even runs, and JSON that parses but doesn't fit the struct (a missing messages field, say) gets a 422. We then map the messages and resolve the token cap and model name.

The important line is tokio::task::spawn_blocking. Running a forward pass is CPU-bound: it pegs a core for the whole turn. tokio's async runtime is built for I/O-bound work; if you ran the model directly inside an async fn, you'd block one of tokio's small pool of worker threads for seconds and starve every other connection. spawn_blocking moves the heavy work onto a separate thread pool meant for this. The handler awaits the result, leaving the async worker free.

Inside the blocking closure we call run_chat_turn_streaming_with_prefix from III.1. The on_delta callback is |_| {}, empty, because this chapter doesn't stream; we just want the final ChatTurnResult. The double .unwrap() unwraps the spawn_blocking join result and then the Result the chat turn returns.

The rest of the handler shapes the ChatTurnResult into the OpenAI response:

RUST

    let prompt_t = out.prompt_tokens as u32;
    let completion_t = out.generated_tokens as u32;
 
    Json(ChatCompletionResponse {
        id: new_completion_id(),
        object: "chat.completion",
        created: unix_secs(),
        model: model_name,
        choices: vec![CompletionChoice {
            index: 0,
            message: ChoiceMessage {
                role: "assistant",
                content: out.text,
            },
            finish_reason: if out.hit_stop { "stop" } else { "length" }.into(),
        }],
        usage: Usage {
            prompt_tokens: prompt_t,
            completion_tokens: completion_t,
            total_tokens: prompt_t + completion_t,
        },
    })
    .into_response()
}

One choices entry holding the decoded reply. finish_reason is the hit_stop flag from III.1 mapped to OpenAI's words: "stop" for end-of-sequence, "length" for the token cap. usage reports the prompt and completion token counts. .into_response() finishes it into the HTTP Response axum sends back.
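To make the mapping concrete, here is the same logic pulled out into two tiny functions and fed a turn that stopped on its own and one that ran into the cap. TurnSummary is a hypothetical stand-in carrying only the three ChatTurnResult fields the handler reads.

```rust
/// Hypothetical stand-in for the ChatTurnResult fields used by the handler.
struct TurnSummary {
    prompt_tokens: usize,
    generated_tokens: usize,
    hit_stop: bool,
}

/// OpenAI's wording for why generation ended.
fn finish_reason(turn: &TurnSummary) -> &'static str {
    if turn.hit_stop { "stop" } else { "length" }
}

/// (prompt, completion, total) as the u32 values the Usage struct carries.
fn usage(turn: &TurnSummary) -> (u32, u32, u32) {
    let prompt = turn.prompt_tokens as u32;
    let completion = turn.generated_tokens as u32;
    (prompt, completion, prompt + completion)
}
```

A turn with 16 prompt tokens and 274 generated tokens that hit the stop token reports "stop" and a total of 290, matching the sample response. The same prompt cut off at a 512-token cap would report "length" and a total of 528.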

The chat-server binary

src/bin/chat-server.rs loads the model, builds the state, and starts the server:

RUST

use std::net::SocketAddr;
use std::path::PathBuf;
use std::sync::Arc;
 
use axum::serve;
use inferno::{
    ChatServerState, CliArgs, chat_router, create_backend, load_from_gguf_path,
    rust_log_enables_trace,
};
use tokio::net::TcpListener;
 
#[tokio::main]
async fn main() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
 
    let args = CliArgs::from_env();
    let backend_name = args.backend("simd");
    let kv_cache_mode = args.kv_cache_mode().unwrap_or("basic");
    let bind: SocketAddr = args
        .bind("127.0.0.1:8000")
        .parse()
        .expect("invalid --bind address");

#[tokio::main] is the macro that wraps main in a tokio runtime, so the async/await machinery works. main then parses arguments, including a new --bind flag for the listen address, defaulting to 127.0.0.1:8000.

That --bind flag touches the parser in the four places every flag does: a field on CliArgs, a None-initialized local in parse, a match arm that fills it, and the field in the closing Self { ... } literal:

RUST

    bind: Option<String>,
// ...
        let mut bind = None;
// ...
                Some("--bind") => {
                    cur.advance();
                    bind = Some(cur.expect_value("--bind"));
                }
// ...
            bind,

Plus a getter:

RUST

    pub fn bind(&self, default: &str) -> String {
        self.bind
            .clone()
            .unwrap_or_else(|| default.to_string())
    }

Same additive pattern the parser has used since I.1, and the shape every later flag will repeat.
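Here is the whole path of the flag, from argument list to SocketAddr, in miniature. This is a sketch under assumptions: bind_flag is a hypothetical toy scanner, not the project's CliArgs parser, and it quietly ignores a --bind with no value where the real parser calls expect_value.

```rust
use std::net::SocketAddr;

/// Hypothetical miniature of the parser: scan the args for `--bind <addr>`.
fn bind_flag(args: &[&str]) -> Option<String> {
    let mut bind = None;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        if *arg == "--bind" {
            // The value is whatever argument follows the flag, if any.
            bind = it.next().map(|v| v.to_string());
        }
    }
    bind
}

/// Getter-with-default, then the same parse that main performs.
fn bind_addr(args: &[&str], default: &str) -> SocketAddr {
    bind_flag(args)
        .unwrap_or_else(|| default.to_string())
        .parse()
        .expect("invalid --bind address")
}
```

One practical consequence of parsing into SocketAddr: the value has to be a literal IP address and port. A hostname such as localhost:8000 does not parse, so it would trip the expect in main.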

Back in main, load the model and build the state:

RUST

    let positional = args.positionals();
    assert!(
        !positional.is_empty(),
        "usage: chat-server [options] <gguf_path> [max_tokens]"
    );
 
    let gguf_path = PathBuf::from(&positional[0]);
    let default_max_tokens: usize = positional
        .get(1)
        .map(|s| s.parse().unwrap())
        .unwrap_or(256);
 
    let backend = create_backend(&backend_name, rust_log_enables_trace()).unwrap();
    let (model, tokenizer) = load_from_gguf_path(gguf_path.as_path(), backend.clone()).unwrap();
 
    assert!(
        tokenizer.chat_template().is_some(),
        "GGUF has no tokenizer.chat_template metadata"
    );
 
    let model_label = gguf_path
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("local-model")
        .to_string();
 
    eprintln!(
        "model: {model_label} ({}), backend: {}, kv: {kv_cache_mode}",
        gguf_path.display(),
        backend_name,
    );
 
    let state = Arc::new(ChatServerState {
        model,
        tokenizer,
        backend,
        default_max_tokens,
        model_label,
        kv_cache_mode,
    });

Standard setup: a GGUF path and an optional default token cap (256). Build the backend and load the model once, here, then check it has a chat template. model_label is the GGUF filename without its extension, which is what /v1/models reports. The state goes into an Arc so it can be shared across every connection.
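The label derivation is easy to check in isolation. The function below is the same file_stem chain as in main, wrapped in a hypothetical helper so it can be called on a few paths:

```rust
use std::path::Path;

/// Same derivation as in main: the file name without its final extension,
/// or "local-model" when the path has no usable stem.
fn model_label(gguf_path: &Path) -> String {
    gguf_path
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("local-model")
        .to_string()
}
```

file_stem strips only the last extension, so path/to/qwen3-0.6b-q8_0.gguf becomes qwen3-0.6b-q8_0 with the dots inside the name left alone. A path with no file name at all, such as /, falls through to local-model.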

Finally, bind the socket and serve:

RUST

    let listener = TcpListener::bind(bind).await.unwrap();
    eprintln!("listening on http://{bind}");
 
    serve(listener, chat_router(state)).await.unwrap();
}

TcpListener::bind opens the port; axum::serve runs the accept loop, handing each connection to the router. serve never returns under normal operation; the process runs until killed.

Running it

Start the server:

BASH

cargo run --release --bin chat-server -- --kv basic path/to/qwen3-0.6b-q8_0.gguf 512

(Passing a cap of 512 instead of relying on the default of 256 is deliberate: Qwen3 spends a couple hundred tokens thinking before it answers, and we'd rather the demo end on "stop" than "length".)

PLAINTEXT

model: qwen3-0.6b-q8_0 (path/to/qwen3-0.6b-q8_0.gguf), backend: simd, kv: basic
listening on http://127.0.0.1:8000

In another terminal, hit the health check and list the model:

BASH

curl -s http://127.0.0.1:8000/health
curl -s http://127.0.0.1:8000/v1/models

PLAINTEXT

{"status":"ok"}
{"object":"list","data":[{"id":"qwen3-0.6b-q8_0","object":"model"}]}

Now a real chat completion:

BASH

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"qwen3-0.6b-q8_0","messages":[{"role":"user","content":"What is 2 + 2?"}]}'

JSON

{
  "id": "chatcmpl-18c20ed4bd078cf0",
  "object": "chat.completion",
  "created": 1784004709,
  "model": "qwen3-0.6b-q8_0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, so the question is asking, \"What is 2 + 2?\" Hmm, let me think. Well, addition is a basic arithmetic operation. [... ~200 more thinking tokens ...] \n</think>\n\n2 + 2 equals 4. \n\n**Answer:** 4"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 16, "completion_tokens": 274, "total_tokens": 290 }
}

The whole <think>...</think> block lands in content: the API layer doesn't hide the model's reasoning, it just reports what the model generated (the bracketed ellipsis above stands in for the middle of it). That JSON has the same shape as what api.openai.com returns: the same field names and nesting that OpenAI clients read. Point the official OpenAI Python SDK at base_url="http://127.0.0.1:8000/v1" and client.chat.completions.create(...) works against your own Qwen3.

Where this leaves us

The engine is a server. It speaks OpenAI's HTTP protocol, so the entire ecosystem of OpenAI clients can talk to it without modification. spawn_blocking keeps the CPU-bound forward pass off the async runtime's worker threads.

But the whole reply arrives in one lump: the client waits the full turn, then gets everything at once. A chat UI wants tokens to appear as they're generated. The chat pipeline from III.1 already streams text deltas through a callback; we just discarded them with |_| {}. The next chapter wires that callback to the network and implements stream: true.