Sipp Documentation

Sipp packages local and gateway-backed inference runtimes for browser, Node.js, Python, and Rust applications. The project is organized around one client model: register local and remote endpoints with SippClient.add, keep the returned endpoint reference, and choose that reference for query, chat, or embed.

This book starts with the published packages that application developers use. Source checkout, build orchestration, repository architecture, and contribution workflow live in the maintainer section.

Warning

Sipp is under active development and changes frequently. If you find a bug or need a feature, please raise it on GitHub or on the Discord server.

Start Here

  • Roadmap outlines the engineering milestones, memory architectures, and long-term vision.
  • Installation lists the published package install commands.
  • Quickstarts shows short Browser, Node.js, Python, Rust, and gateway paths.
  • Using the Core Library describes the public package surfaces in depth.
  • Gateway explains the first-party server, Docker workflows, configuration, testing, operations, toolkit, and architecture.
  • Frameworks covers Next.js, TanStack, and React/Vite integration patterns.
  • Gateway And Hybrid Inference explains when to use local endpoints, gateway endpoints, and provider endpoints.
  • Maintainers covers source builds, tests, repo structure, and contribution workflow.

Build The Book Locally

Use sipp docs from a source checkout:

sipp docs build
sipp docs serve

sipp docs build installs mdbook and mdbook-mermaid when missing, extracts the bundled Mermaid JavaScript assets, and writes the generated book to book/. If the sipp launcher is not active, use cargo xtask docs ... with the same arguments.

Sipp Technical Roadmap

This document outlines the engineering milestones and long-term research initiatives for Sipp Core, Sipp Gateway, and Sipp Platform.

Sipp is built around three core ideas: privacy-preserving inference, low-latency interactions, and high-performance compute across the edge and cloud.

The current core library has a powerful WebGPU backend for running models in-browser, as well as bare-metal GPU support for CUDA or Vulkan when running on device or server. We see the future of AI as hybrid, with edge-native AI processing and cloud-based AI processing working together seamlessly.

Research 1: Sipp Core: The Local Runtime Library

Sipp Core is built to be a high-performance powerhouse for running inference locally, either on bare-metal GPU/NPU hardware or via WebGPU for browser-based applications. It is built on a foundation of llama.cpp with a custom C++ and Rust runtime layer.

Key Initiatives

  • Edge-Native Local RAG & Memory Optimization: Integrate an in-memory, zero-dependency vector database (compiled directly to WASM) into the client SDK. This enables developers to run fully local vector searches, embed conversational state, and execute document retrievals with zero external API dependencies or cloud database costs.

  • Full-Spectrum Client Support (Apps, Web, and Games): Sipp currently supports browsers through WebGPU and desktops through the CUDA and Vulkan backends. The next phase expands backend support for hardware-accelerated inference across web, desktop, and mobile devices. This includes:

    • Desktop & Mobile Wrappers: Expand native compilation targets for Electron and Tauri apps, exposing direct access to NVIDIA CUDA, Metal, and Vulkan.

    • Gaming Runtimes: Lightweight SDK integration frameworks for Unreal Engine and Unity to support local, low-latency AI agents inside application loops.

  • Cross-Site & Cross-App Persistence Caching: Standard browser sandboxing isolates cache stores to individual web origins. We seek to solve this by building a lightweight, local background desktop daemon built in Rust. This daemon serves as a centralized, secure model registry mirror. If a user visits an Electron app or a website utilizing a specific model, the local runtime fetches it instantly from the daemon’s cache instead of re-downloading gigabytes of weights.

  • Client-Side Local Contextual Routing: Sometimes running a query locally will not produce good enough results, and re-routing to a cloud or provider model is needed. When this should happen, and how a query could be split apart, is an open question. We believe a solution lies in a hyper-lightweight, client-side small language model (sLLM) that makes those decisions dynamically. We see two applications:

    1. PII/PPI Stripping and Masking: A local model intercepts text inputs to detect and strip Protected Personal Information (PPI) or Personally Identifiable Information (PII), replacing sensitive entities with secure local tokenized hashes before any cloud handoff occurs.

    2. Contextual Query Splitting: The local engine analyzes incoming chats to determine what components can be handled instantly on the edge (e.g., immediate structural formatting, basic data verification) vs. what must be escalated to the cloud, dynamically stitching cloud completions back into the interface as they stream down.
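
The masking step in item 1 can be sketched as follows. This is an illustration only: the roadmap describes a local model doing the detection, whereas this sketch substitutes two simple regular expressions (email addresses and dash-separated phone numbers) as a stand-in. The function names mask_pii and unmask, the token format, and the salt value are all hypothetical and are not part of any Sipp package.

```python
import hashlib
import re

# Stand-in detectors; the roadmap envisions a local model here, not regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def mask_pii(text: str, salt: str = "local-salt") -> tuple[str, dict[str, str]]:
    """Replace detected entities with salted hash tokens.

    Returns the masked text and a local-only mapping from token to original
    value, so a cloud response can be un-masked on the device.
    """
    mapping: dict[str, str] = {}

    def replace(kind: str, match: re.Match) -> str:
        value = match.group(0)
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
        token = f"<{kind}:{digest}>"
        mapping[token] = value
        return token

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: replace(k, m), text)
    return text, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in text that still carries the tokens."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

Only the masked text would leave the device in this sketch; the mapping stays local so that streamed cloud output can be restored before display.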

Research 2: The Gateway Server (The Orchestration & Interception Layer)

The open-source Gateway Server serves as an autonomous “API Fortress” that acts as a secure, high-performance middleware layer between client networks and cloud endpoints.

                    ┌──────────────────────────────┐
                    │ Client Submits Prompt to GW  │
                    └──────────────┬───────────────┘
                                   │
                                   ▼
                    ┌──────────────────────────────┐
                    │ Preemptive Middle-Layer Cache│
                    │   (Vector & KV Intercept)    │
                    └──────────────┬───────────────┘
                                   │
              ┌────────────────────┴────────────────────┐
              ▼ (Cache Hit / Guided Path)              ▼ (Cache Miss)
  ┌───────────────────────┐                 ┌───────────────────────┐
  │ Route to Endpoint X   │                 │ Route to Endpoint Y   │
  │ (Low-cost/Fast Stream)│                 │ (Deep Processing/MoE) │
  └───────────────────────┘                 └───────────────────────┘
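
The decision in the diagram can be sketched as a similarity check against cached vectors. This is a minimal sketch under stated assumptions: embeddings are plain Python lists, a cache hit means cosine similarity at or above a chosen threshold, and the names route, endpoint-x, and endpoint-y are hypothetical labels taken from the diagram rather than gateway configuration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(prompt_vec: list[float], cache: list[list[float]], threshold: float = 0.9) -> str:
    """Return the fast endpoint on a cache hit, the deep endpoint on a miss."""
    hit = any(cosine(prompt_vec, cached) >= threshold for cached in cache)
    return "endpoint-x" if hit else "endpoint-y"
```

A linear scan is enough to show the idea; a real index would likely use an approximate nearest-neighbor structure, and the threshold would need tuning per embedding model.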

Key Initiatives

  • Gateway-Level Vector Memory & RAG Interception: The gateway implements an internal, stateful vector index layer to handle server-side memory optimization. It caches semantic embeddings of historical document fragments and prior system queries. When a client submits a prompt, the gateway performs a preemptive vector evaluation to determine whether a relevant context match exists, entirely bypassing the need to repeatedly re-fetch or re-encode massive RAG documents from central cloud instances.

  • Preemptive Middle-Layer Caching: In tandem with vector storage, the gateway features a stateful intermediate cache layer designed to intercept incoming requests before they hit large upstream models. If a cached structural completion matches the incoming footprint, the gateway can reroute traffic conditionally (e.g., “If cache footprint exists, route to fast Endpoint X; if not, route to reasoning Endpoint Y”).

  • Persistent Admin Control Dashboard: Expand the gateway dashboard and admin UI to visualize active routes, manage cryptographic client application identities, view live input/output token allocation metrics, manually map model fallback rules, and more.

  • Token-Aware Traffic Shaping: Implements token-bucket rate limiters directly inside the networking wrapper to monitor and throttle users based on their exact token throughput footprint, protecting downstream clusters from malicious execution loops or unexpected API bills.
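
Token-aware traffic shaping can be illustrated with a classic token bucket in which each request spends its model-token count rather than a flat unit. The sketch below is not the gateway's implementation; the class name TokenBucket and its parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow (a server would typically supply time.monotonic()).

```python
class TokenBucket:
    """Token bucket where each request costs its model-token count."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.available = capacity
        self.last = now

    def allow(self, cost: float, now: float) -> bool:
        """Refill for elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )
        self.last = now
        if cost <= self.available:
            self.available -= cost
            return True
        return False
```

With a capacity of 100 and a refill rate of 10 per second, a client that spends 80 tokens at once must wait before a 50-token request is admitted, which is the throttling behavior the bullet describes.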

Getting Started

Start here when adding Sipp to an application from published packages. Maintainer source builds are covered separately.

  • Installation lists the Browser, Node.js, Python, and Rust package install commands.
  • Quickstarts gives minimal local and gateway client examples.
  • Models And Backends explains GGUF model expectations and backend selection.
  • Source Builds covers checkout setup, sipp, cargo xtask, examples, and demos for maintainers.

Installation

Install the published package for the runtime your application uses. All public client packages use the same endpoint model: register an endpoint, keep the returned endpoint reference, and choose that endpoint for query, chat, or embed.

Package Installs

| Surface | Install | Use for |
| --- | --- | --- |
| Browser | npm install @sipp/sipp | Browser-local GGUF inference and browser gateway clients. |
| Node.js | npm install @sipp/sipp-server | Server-side local inference and framework route handlers. |
| Python | pip install sipppy | Python scripts, services, and gateway clients. |
| Python CUDA | GitHub release wheel | Python local inference with CUDA backend wheels. |
| Python Vulkan | pip install "sipppy[vulkan]" | Python local inference with Vulkan backend wheels. |
| Python Metal | pip install "sipppy[metal]" | Python local inference with Metal backend wheels on macOS. |
| Rust | cargo add sipp-rs | Rust applications and services. |

The current release workflow publishes browser npm, Node npm, Python wheels, and Rust crates. It does not yet publish a standalone gateway-server binary, container image, or cargo install target. Use the source checkout and Dockerfile when deploying the gateway server until a public server artifact is added.

Runtime Requirements

  • Local inference needs a compatible GGUF model file or browser-served GGUF asset.
  • Python wheels require Python 3.10 or newer.
  • Browser-local inference needs a modern browser with WebAssembly support; WebGPU acceleration depends on the browser and device. For details, please refer to Gateway.
  • Node installs use @sipp/sipp-server; npm resolves the matching optional platform binary package automatically. Python installs use the sipppy wheel (imported as sipp) for CPU and extras such as sipppy[cuda] for GPU backend wheels; the sipppy wheels currently ship from GitHub Releases while the full PyPI build matrix is in progress (see the Python package page). Use SIPP_NODE_BACKEND or SIPP_PYTHON_BACKEND when you need to force cpu, vulkan, cuda, or metal.
  • Gateway clients need only the gateway base URL, public target name, and application-owned authentication value.

Next Steps

Quickstarts

These snippets show the public call shapes for query, chat, and embed. query sends the exact prompt string and never applies a chat template. A plain prompt is only for completion-style/base models; for decoder-only chat or instruct GGUFs, render the model’s template yourself. Local query also supports encoder-decoder GGUF text models. chat sends role-tagged messages. embed returns vectors and needs an embedding-capable local model loaded with embedding mode enabled.
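
Because query never applies a chat template, the caller has to render one. The sketch below shows one way to turn role-tagged messages into a single raw prompt. The function name render_prompt is hypothetical, and the <|role|> markers are the same placeholders used in the quickstart snippets; a real model defines its own markers in its chat template.

```python
def render_prompt(messages: list[dict[str, str]]) -> str:
    """Render role-tagged messages into one raw prompt string.

    The <|role|> markers are placeholders; substitute the markers that the
    target model's own chat template defines.
    """
    lines: list[str] = []
    for message in messages:
        lines.append(f"<|{message['role']}|>")
        lines.append(message["content"])
    lines.append("<|assistant|>")  # cue the model to produce the reply
    return "\n".join(lines)
```

The resulting string is what would be passed as the prompt to query; chat takes the message list directly and leaves templating to the runtime.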

Local context naming differs only by language casing: browser and Node.js use contextKey; Python and Rust use context_key.

See Examples And Demos for runnable end-to-end files.

Browser Local

npm install @sipp/sipp

import { SippClient, type ChatMessage } from '@sipp/sipp';

const client = new SippClient();
const messages: readonly ChatMessage[] = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain local browser inference.' },
];
const queryPrompt = [
  '<|system|>',
  'Answer concisely.',
  '<|user|>',
  'Explain local browser inference.',
  '<|assistant|>',
].join('\n');

const textEndpoint = await client.add('text', {
  kind: 'local',
  source: '/models/chat.gguf',
  options: { backend: 'webgpu', runtime: { context: { n_ctx: 2048 } } },
});

// query: raw prompt; replace markers with the target model's template.
const query = await client.query(queryPrompt, {
  endpoint: textEndpoint,
  maxTokens: 64,
  contextKey: 'browser-query',
}).response;

// chat: role messages; local runtime uses tokenizer.chat_template.
const chat = await client.chat(messages, {
  endpoint: textEndpoint,
  maxTokens: 64,
  contextKey: 'browser-chat',
}).response;

const embedEndpoint = await client.add('embed', {
  kind: 'local',
  source: '/models/embed.gguf',
  options: {
    backend: 'webgpu',
    runtime: { context: { n_ctx: 2048, embeddings: true, pooling: 'mean' } },
  },
});

// embed: vector output; local endpoint must be embedding-capable.
const embedding = await client.embed('Sipp embedding input.', {
  endpoint: embedEndpoint,
  contextKey: 'browser-embed',
  normalize: true,
}).response;

console.log(query.text, chat.text, embedding.values.length);
await client.close();

Node.js Local

npm install @sipp/sipp-server

import { SippClient } from '@sipp/sipp-server';

const client = new SippClient();
const messages = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain local Node.js inference.' },
];
const queryPrompt = [
  '<|system|>',
  'Answer concisely.',
  '<|user|>',
  'Explain local Node.js inference.',
  '<|assistant|>',
].join('\n');
const textOptions = { maxTokens: 64 };
const textModel = process.argv[2] ?? 'chat.gguf';
const embedModel = process.argv[3] ?? 'embed.gguf';

const textEndpoint = await client.add('text', {
  kind: 'local',
  modelPath: textModel,
  config: { context: { n_ctx: 2048 } },
});

// query: raw prompt; replace markers with the target model's template.
const query = await client.query({
  endpoint: textEndpoint,
  prompt: queryPrompt,
  options: textOptions,
  local: { contextKey: 'node-query' },
}).response;

// chat: role messages; local runtime uses tokenizer.chat_template.
const chat = await client.chat({
  endpoint: textEndpoint,
  messages,
  options: textOptions,
  local: { contextKey: 'node-chat' },
}).response;

const embedEndpoint = await client.add('embed', {
  kind: 'local',
  modelPath: embedModel,
  config: { context: { n_ctx: 2048, embeddings: true, pooling: 'mean' } },
});

// embed: vector output; local endpoint must be embedding-capable.
const embedding = await client.embed({
  endpoint: embedEndpoint,
  input: 'Sipp embedding input.',
  local: { contextKey: 'node-embed', normalize: true },
}).response;

console.log(query.text, chat.text, embedding.values.length);

Local query also supports encoder-decoder GGUF text models, while many encoder-decoder models cannot use chat because they do not declare tokenizer.chat_template. Encoder-decoder text models do not produce embeddings through this runtime.
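
An application that loads arbitrary GGUF files may want to pick the call shape from the model's metadata. The sketch below assumes the application has already read the GGUF key-value metadata into a plain dictionary by some other means; the function name pick_call is hypothetical, and only the tokenizer.chat_template key comes from the text above.

```python
def pick_call(metadata: dict[str, str]) -> str:
    """Return "chat" when a non-empty chat template is declared, else "query"."""
    template = metadata.get("tokenizer.chat_template", "")
    return "chat" if template.strip() else "query"
```

A model without a declared template falls back to query, where the caller supplies the full prompt.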

Python Local

# The sipppy CUDA wheel is currently published via GitHub Releases; the full release matrix is in progress.
pip install sipppy

from sipp import (
    ChatMessage,
    SippClient,
    SippTextOptions,
    ContextRuntimeConfig,
    LocalEmbedOptions,
    LocalTextOptions,
    LocalModelDescriptor,
    NativeRuntimeConfig,
)

client = SippClient()
messages = [
    ChatMessage("system", "Answer concisely."),
    ChatMessage("user", "Explain local Python inference."),
]
query_prompt = "\n".join(
    [
        "<|system|>",
        "Answer concisely.",
        "<|user|>",
        "Explain local Python inference.",
        "<|assistant|>",
    ]
)
text_options = SippTextOptions(max_tokens=64)

text_endpoint = client.add("text", LocalModelDescriptor("chat.gguf"))

# query: raw prompt; replace markers with the target model's template.
query = client.query(
    query_prompt,
    endpoint=text_endpoint,
    options=text_options,
    local=LocalTextOptions(context_key="python-query"),
).result()

# chat: role messages; local runtime uses tokenizer.chat_template.
chat = client.chat(
    messages,
    endpoint=text_endpoint,
    options=text_options,
    local=LocalTextOptions(context_key="python-chat"),
).result()

embed_endpoint = client.add(
    "embed",
    LocalModelDescriptor(
        "embed.gguf",
        NativeRuntimeConfig(
            context=ContextRuntimeConfig(
                n_ctx=2048,
                embeddings=True,
                pooling="mean",
            ),
        ),
    ),
)

# embed: vector output; local endpoint must be embedding-capable.
embedding = client.embed(
    "Sipp embedding input.",
    endpoint=embed_endpoint,
    local=LocalEmbedOptions(context_key="python-embed", normalize=True),
).result()

print(query["text"], chat["text"], len(embedding["values"]))

Rust Local

cargo add sipp-rs

use sipp::engine::{
    ChatMessage, ChatRole, ContextRuntimeConfig, NativeRuntimeConfig, PoolingType,
};
use sipp::{
    SippChatRequest, SippClient, SippEmbedRequest, SippQueryRequest,
    SippTextOptions, EndpointDescriptor, LocalEmbedOptions, LocalTextOptions,
};

let mut client = SippClient::new();
let messages = vec![
    ChatMessage::new(ChatRole::System, "Answer concisely."),
    ChatMessage::new(ChatRole::User, "Explain local Rust inference."),
];
let query_prompt = [
    "<|system|>",
    "Answer concisely.",
    "<|user|>",
    "Explain local Rust inference.",
    "<|assistant|>",
]
.join("\n");
let text_options = SippTextOptions {
    max_tokens: Some(64),
    ..Default::default()
};

let text_endpoint = client
    .add("text", EndpointDescriptor::local("chat.gguf", Default::default()))
    .await?;

// query: raw prompt; replace markers with the target model's template.
let query = client
    .query(SippQueryRequest {
        endpoint: Some(text_endpoint.clone()),
        prompt: query_prompt,
        options: text_options.clone(),
        local: LocalTextOptions {
            context_key: Some("rust-query".to_string()),
            ..Default::default()
        },
        ..Default::default()
    })
    .await?;

// chat: role messages; local runtime uses tokenizer.chat_template.
let chat = client
    .chat(SippChatRequest {
        endpoint: Some(text_endpoint),
        messages,
        options: text_options,
        local: LocalTextOptions {
            context_key: Some("rust-chat".to_string()),
            ..Default::default()
        },
        ..Default::default()
    })
    .await?;

let embed_endpoint = client
    .add("embed", EndpointDescriptor::local("embed.gguf", embed_config()))
    .await?;

// embed: vector output; local endpoint must be embedding-capable.
let embedding = client
    .embed(SippEmbedRequest {
        endpoint: Some(embed_endpoint),
        input: "Sipp embedding input.".to_string(),
        local: LocalEmbedOptions {
            context_key: Some("rust-embed".to_string()),
            normalize: Some(true),
        },
        ..Default::default()
    })
    .await?;

println!("{}, {}, {}", query.text, chat.text, embedding.values.len());

fn embed_config() -> NativeRuntimeConfig {
    NativeRuntimeConfig {
        context: ContextRuntimeConfig {
            n_ctx: Some(2048),
            embeddings: Some(true),
            pooling: Some(PoolingType::Mean),
            ..Default::default()
        },
        ..Default::default()
    }
}

Gateway

Gateway clients keep model paths, provider credentials, target policy, and metrics in the gateway process. The example uses the browser package shape; Node.js uses the same request-object shape shown above.

import { SippClient, type ChatMessage } from '@sipp/sipp';

const client = new SippClient();
const endpoint = await client.add('gateway', {
  kind: 'gateway',
  target: 'local',
  baseUrl: 'https://gateway.example.com',
  authentication: { kind: 'bearer', value: await getGatewayToken() },
});
const messages: readonly ChatMessage[] = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain gateway inference.' },
];
const queryPrompt = [
  '<|system|>',
  'Answer concisely.',
  '<|user|>',
  'Explain gateway inference.',
  '<|assistant|>',
].join('\n');

// query: gateway forwards the raw prompt to the selected target.
const query = await client.query(queryPrompt, {
  endpoint,
  maxTokens: 64,
}).response;

// chat: gateway maps role messages for the selected provider/local target.
const chat = await client.chat(messages, { endpoint, maxTokens: 64 }).response;

// embed: target must support embeddings.
const embedding = await client.embed('Sipp embedding input.', {
  endpoint,
}).response;

console.log(query.text, chat.text, embedding.values.length);
await client.close();

Gateway query preserves the raw prompt, so it is the gateway path for custom templates or local encoder-decoder targets. Gateway embed requires the target to support embeddings.

Direct Provider

Use direct provider endpoints only in trusted server code (e.g. self-hosted service). Provider support is model-specific: query needs a completion-compatible provider or model, chat needs a chat model, and embed needs an embedding model.

import { SippClient } from '@sipp/sipp-server';

function env(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

const client = new SippClient();
const chatMessages = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain provider inference.' },
];

const completionEndpoint = await client.add('completion', {
  kind: 'provider',
  provider: 'openai_compatible',
  model: env('COMPLETION_MODEL'),
  baseUrl: env('COMPLETION_BASE_URL'),
  apiKey: env('COMPLETION_API_KEY'),
});
const chatEndpoint = await client.add('chat', {
  kind: 'provider',
  provider: 'openai',
  model: env('OPENAI_CHAT_MODEL'),
  apiKey: env('OPENAI_API_KEY'),
});
const embedEndpoint = await client.add('embed', {
  kind: 'provider',
  provider: 'openai',
  model: env('OPENAI_EMBED_MODEL'),
  apiKey: env('OPENAI_API_KEY'),
});

// query: raw completion prompt for a completion-compatible provider.
const query = await client.query({
  endpoint: completionEndpoint,
  prompt: 'Write one provider inference sentence.',
  options: { maxTokens: 64 },
}).response;

// chat: provider-native role messages.
const chat = await client.chat({
  endpoint: chatEndpoint,
  messages: chatMessages,
  options: { maxTokens: 64 },
}).response;

// embed: provider-native embedding model.
const embedding = await client.embed({
  endpoint: embedEndpoint,
  input: 'Sipp embedding input.',
}).response;

console.log(query.text, chat.text, embedding.values.length);

Runtime Tuning

Local endpoint tuning, browser WebGPU options, worker/threading choices, generation options, and provider/gateway option buckets are documented in Runtime Options.

Building and Running from Source Code

Runnable source examples and demos live in the maintainer lane: Source Builds.

Models And Backends

Sipp local inference uses GGUF model files. Text workflows need a text GGUF model, embedding workflows need a model that reports embedding support, and vision chat workflows need both a model GGUF and a projector GGUF.

Model Sources

For local package usage, pass an explicit GGUF model path in Node.js, Python, or Rust, or serve a GGUF model URL to browser code:

  • Browser: source: '/models/model.gguf'
  • Node.js: modelPath: '/path/to/model.gguf'
  • Python: LocalModelDescriptor('/path/to/model.gguf')
  • Rust: EndpointDescriptor::local(model_path, config)

Source examples and smoke workflows can use a cached sample model under .build/models; see Source Builds.

Native Backends

Backend names are shared across build and runtime selection:

  • cpu: portable default backend.
  • vulkan: GPU backend for Vulkan-capable systems.
  • cuda: NVIDIA CUDA backend.
  • metal: Apple Metal backend on macOS.

Runtime selection is package-specific:

  • Node.js: SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal
  • Python: SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal
  • CLI: --backend auto|cpu|cuda|metal|vulkan

Leave runtime backend variables unset for automatic selection.

Maintainer builds can produce backend-specific artifacts with sipp or cargo xtask; see Source Builds.

For the full package/backend matrix and llama.cpp/ggml operation support guidance, see Backend Matrix.

sipp CLI

sipp is the repo-local launcher for Sipp source checkout workflows. It forwards to cargo xtask after setup has installed wrapper scripts under .build/bin.

Use sipp when you are working from the repository and need to build native artifacts, run demos, start the gateway server, manage xtask toolchains, run cataloged tests, or build the documentation book. Published packages such as @sipp/sipp, @sipp/sipp-server, and the Python wheel (sipppy) do not require sipp.

Command Shape

Every sipp command has the same arguments as cargo xtask:

sipp doctor
sipp build node --backend cpu
sipp run examples serve browser
sipp test list
sipp docs build

If the launcher is not active in the current shell, use the same command after cargo xtask:

cargo xtask doctor
cargo xtask build node --backend cpu
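
Automation scripts that must work with or without an activated launcher can apply the same fallback. A minimal sketch, assuming the script only needs to build an argument list; the function name sipp_command is hypothetical, and the lookup is injectable so the logic can be exercised without touching PATH.

```python
import shutil


def sipp_command(args: list[str], which=shutil.which) -> list[str]:
    """Prefix args with `sipp` when it is on PATH, else with `cargo xtask`."""
    if which("sipp"):
        return ["sipp", *args]
    return ["cargo", "xtask", *args]
```

The result can be handed to subprocess.run from the repository root, for example sipp_command(["doctor"]).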

Pages

Setup

Run the setup script from the repository root. It builds the xtask binary when needed, installs sipp launchers under .build/bin, and can bootstrap managed toolchains and sample files for the selected workflow.

Unix Shells

source ./setup.sh
sipp doctor

Running ./setup.sh without source still performs setup, but it cannot modify the current shell PATH. It prints the environment script to source afterward.

Windows PowerShell

.\setup.ps1
sipp doctor

The PowerShell script updates PATH for the current PowerShell session and loads .build\bin\sipp-env.ps1 when setup succeeds.

Windows CMD

setup.cmd
sipp doctor

setup.cmd invokes the PowerShell setup script and activates .build\bin for the current CMD session.

Profiles

Use a profile when you know which development surface you need:

sipp setup --profile browser
sipp setup --profile bindings
sipp setup --profile full --yes

| Profile | Use for |
| --- | --- |
| browser | Browser package, WASM, WebGPU examples, and demos. |
| bindings | Native Node.js and Python binding development. |
| full | Full workspace development across browser and native bindings. |

Useful setup flags:

  • --yes: accept recommended actions without prompting.
  • --no-downloads: skip toolchain, dependency, and sample-model downloads.
  • --no-splash: skip the interactive splash.
  • --plain: disable bounded terminal rendering.

Generated Files

Setup writes only repo-local generated files:

  • .build/xtask/debug/xtask or .build\xtask\debug\xtask.exe
  • .build/bin/sipp, .build/bin/sipp.cmd, and .build/bin/sipp.ps1
  • .build/bin/sipp-env.sh and .build/bin/sipp-env.ps1
  • xtask-managed toolchains and caches under .build/toolchain

Commands

sipp groups source checkout automation into focused command families. Use sipp <group> --help for generated help and the current option list.

Health Checks

sipp doctor
sipp doctor --target wasm
sipp doctor --target node --backend vulkan
sipp toolchain status

doctor checks local readiness without installing or deleting anything. toolchain status reports xtask-managed tools such as Bun, Python, uv, Emscripten, and Ninja. CUDA is externally installed; xtask reports it but does not install or delete it.

Build

sipp build core
sipp build wasm
sipp build node --backend cpu
sipp build python --backend vulkan
sipp build cli --backend all
sipp build gateway-server --backend cpu
sipp build all

build all builds the main target families with default CPU native outputs. It does not build every backend variant for every package.

Backend values:

  • cpu: portable default.
  • cuda: NVIDIA CUDA backend; requires a local CUDA Toolkit.
  • metal: Apple Metal backend on macOS.
  • vulkan: Vulkan backend; xtask can bootstrap the Vulkan SDK when needed.
  • all: host-supported backend set for the selected target.

Run

sipp run examples serve browser --port 5173
sipp run examples serve gateway-local --model .build/models/model.gguf --bind 127.0.0.1:8787
sipp run examples gateway rust --case query
sipp run demos serve chat
sipp run tools serve playground
sipp run gateway-server check --config apps/gateway-server/config/local.toml
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend cpu

run commands are for long-lived demos, gateway processes, example servers, and non-test diagnostics. Test execution lives under sipp test.

Docs

sipp docs build
sipp docs serve
sipp docs build --lang zh

docs build installs mdbook and mdbook-mermaid when missing, extracts the bundled Mermaid JavaScript assets into theme/, and writes the generated book to book/.

Test

sipp test list
sipp test list --group unit --layer interface --cases --search router --format json
sipp test unit group full
sipp test unit suite rust-crates --package sipp-rs
sipp test unit suite node-package --backend cpu
sipp test unit suite browser --wasm-threading single-thread
sipp test smoke suite example-node --backend cpu
sipp test smoke group local-model --backend cpu
sipp test verify --changed
sipp test verify --target public-docs

Model-backed smoke tests use the setup sample model cache under .build/models when --model is omitted. See Testing for the full suite catalog.

Clean

sipp clean --dry-run
sipp clean
sipp clean --purge
sipp clean --toolchains

clean removes generated build outputs while preserving downloaded toolchains and dependency installs. --purge also removes workspace node_modules directories. --toolchains removes xtask-managed toolchains under .build/toolchain.

Output Flags

Most command groups accept the shared output flags:

  • --verbose: stream subprocess output directly.
  • --no-banner: disable decorative banners.
  • --plain: disable bounded inline rendering.

Troubleshooting

sipp Is Not Found

Run setup from the repository root and keep the environment active in the same shell:

Unix shells:

source ./setup.sh

Windows PowerShell:

.\setup.ps1

Windows CMD:

setup.cmd

If you cannot activate sipp, use cargo xtask with the same arguments:

cargo xtask doctor
cargo xtask test list

Setup Rebuilds xtask

The setup scripts rebuild .build/xtask/debug/xtask when xtask source files, workspace manifests, or Cargo configuration are newer than .build/xtask/sipp.stamp. This is expected after pulling changes that affect developer automation.
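
The rule is an ordinary modification-time comparison against a stamp file. The sketch below restates it in Python for clarity only; the real check lives in the setup scripts, and the function name needs_rebuild and the way inputs are listed are hypothetical.

```python
from pathlib import Path


def needs_rebuild(stamp: Path, inputs: list[Path]) -> bool:
    """True when the stamp is missing or any existing input is newer than it."""
    if not stamp.exists():
        return True
    stamp_mtime = stamp.stat().st_mtime
    return any(p.exists() and p.stat().st_mtime > stamp_mtime for p in inputs)
```

Pulling a change that touches an xtask source file or a workspace manifest updates that file's modification time, which is why the next setup run rebuilds.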

PowerShell Blocks Script Execution

Run the script with the current-user execution policy configured by your machine, or invoke it for the current process:

powershell -NoProfile -ExecutionPolicy Bypass -File .\setup.ps1

PATH Is Active Only In One Terminal

The launcher is installed under .build/bin. Setup activates that directory for the current shell session. Open a new terminal and run setup again, or source the generated environment script:

Unix shells:

source .build/bin/sipp-env.sh

Windows PowerShell:

. .build\bin\sipp-env.ps1

Toolchain Or Backend Is Missing

Use:

sipp doctor
sipp toolchain status

Then install xtask-managed components when appropriate:

sipp toolchain install uv
sipp toolchain install all

CUDA is not installed by xtask. Install CUDA through NVIDIA tooling and rerun sipp doctor --target node --backend cuda or the target you need.

Using the Core Library

Sipp exposes one endpoint-oriented client model across all public package surfaces. See the Library API Overview for the shared SippClient.add, query, chat, and embed contracts, endpoint descriptor reference, and gateway-client symmetry patterns.

Most developers should start here instead of building from source.

Package Surfaces

| Surface | Install | Primary use |
| --- | --- | --- |
| Library API Overview | | Shared add, query, chat, and embed contracts across all surfaces. |
| Browser | npm install @sipp/sipp | Browser-local GGUF inference, WebGPU/WASM runtime, and browser gateway clients. |
| Node.js | npm install @sipp/sipp-server | Node server processes, route handlers, and backend services. |
| Python | pip install sipppy | Python services, scripts, and gateway clients. |
| Rust | cargo add sipp-rs | Rust applications and services. |
| Gateway Server | Source-built today | First-party HTTP gateway for local and provider targets. |
| Gateway Docker | Docker from source | Local and production container workflows for the gateway server. |
| Gateway Toolkit | Source-built today | Rust toolkit for custom gateway applications. |

The current release workflow publishes browser npm, Node npm, Python wheels, and Rust crates. The gateway server is documented in the Gateway section as a user-facing deployment surface, but it does not yet have a published binary or public image.

Framework Guides

When integrating JavaScript packages with a framework, see the React And Vite, Next.js, and TanStack guides in the Frameworks section.

Supporting Reference

Library API Overview

The Sipp libraries for Rust, Node.js, Python, and Browser expose the same endpoint-oriented client model.

At a high level:

  1. Register an endpoint with add.
  2. Keep the returned EndpointRef.
  3. Pass that reference to query, chat, or embed.

This keeps application code the same whether inference runs locally, through a gateway, through a provider, or across a hybrid setup.

Core Client Methods

SippClient exposes four primary methods:

| Method | Purpose |
| --- | --- |
| add | Register a local, gateway, or provider endpoint and return an EndpointRef. |
| query | Generate text from a raw prompt string. No chat template is applied. |
| chat | Generate text from ordered { role, content } messages. |
| embed | Generate an embedding vector from text input. |

add() — Register an Endpoint

add(id: string, descriptor: EndpointDescriptor) -> EndpointRef

add registers an endpoint with the current client instance.

The id is caller-defined and scoped to the client. Reusing an id replaces the existing endpoint. The returned EndpointRef is a lightweight handle with:

| Field | Description |
| --- | --- |
| kind | Endpoint kind: "local", "gateway", or "provider". |
| id | The endpoint id registered on this client. |

Pass the returned EndpointRef to query, chat, or embed to choose where the operation runs.
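The registration rules above (client-scoped ids, replacement on reuse, and a small { kind, id } handle) can be illustrated with a toy in-memory registry. This is a sketch only: ToyClient, ToyDescriptor, and resolve are hypothetical names, no inference runs, and the real add is awaited in the package examples, whereas this stand-in is synchronous.

```typescript
// Toy stand-in for the registration semantics; not the Sipp implementation.
type EndpointKind = 'local' | 'gateway' | 'provider';

interface EndpointRef {
  readonly kind: EndpointKind;
  readonly id: string;
}

// Hypothetical descriptor: `label` stands in for the real descriptor fields.
interface ToyDescriptor {
  readonly kind: EndpointKind;
  readonly label: string;
}

class ToyClient {
  private readonly endpoints = new Map<string, ToyDescriptor>();

  // Ids are scoped to this instance; reusing an id replaces the earlier entry.
  add(id: string, descriptor: ToyDescriptor): EndpointRef {
    this.endpoints.set(id, descriptor);
    return { kind: descriptor.kind, id };
  }

  // Look up the descriptor currently registered under the reference's id.
  resolve(ref: EndpointRef): ToyDescriptor {
    const descriptor = this.endpoints.get(ref.id);
    if (descriptor === undefined) {
      throw new Error(`endpoint "${ref.id}" is not registered on this client`);
    }
    return descriptor;
  }

  get size(): number {
    return this.endpoints.size;
  }
}

const toyClient = new ToyClient();
const firstRef = toyClient.add('default', { kind: 'local', label: 'model-a.gguf' });
const secondRef = toyClient.add('default', {
  kind: 'gateway',
  label: 'https://gateway.example.com',
});

// One endpoint remains because the second add reused the id "default".
console.log(firstRef.kind, '->', secondRef.kind, 'registered:', toyClient.size);
```

Because a second ToyClient would hold its own map, the same id can be registered independently on each client, which is what "scoped to the client" means here.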

Local Endpoint

A local endpoint loads a GGUF model into the current process. The application owns model selection, runtime lifecycle, and cleanup.

| Field | Type | Description |
| --- | --- | --- |
| kind | "local" | Endpoint kind selector. |
| modelPath | string / PathBuf | Filesystem path or browser URL for the GGUF artifact. |
| config | NativeRuntimeConfig, optional | Load-time runtime configuration, including context size, GPU placement, scheduler policy, cache mode, sampling defaults, and observability. |

Use a local endpoint when the current process should own model execution.

Gateway Endpoint

A gateway endpoint sends requests to a remote Sipp gateway over HTTP. The gateway process owns provider credentials, local model paths, access policy, concurrency, and metrics.

| Field | Type | Description |
| --- | --- | --- |
| kind | "gateway" | Endpoint kind selector. |
| target | string | Public target name resolved by the gateway. Sent as the model field in gateway profile requests. |
| baseUrl | string | Absolute HTTP(S) URL of the gateway service. |
| authentication | { kind, value?, headerName? } | Auth strategy: "none", "bearer", or "header". |
| staticHeaders | { name, value }[], optional | Additional HTTP headers attached to every request. |
| timeoutMs / timeoutPolicy | number / struct, optional | Connection, request, and streaming read deadlines. |
| queryRoute | string, optional | Query route. Defaults to /v1/query. |
| chatRoute | string, optional | Chat route. Defaults to /v1/chat. |
| embedRoute | string, optional | Embedding route. Defaults to /v1/embed. |
| protocolOptions | map, optional | Profile-specific options merged into every request body. |

Use a gateway endpoint when a separate service should own model access and operational policy.
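As a rough illustration of how the descriptor fields combine, the sketch below resolves the request URL from baseUrl plus the documented route defaults and derives auth headers from the three authentication kinds. GatewaySketch, requestUrl, and authHeaders are hypothetical names, and mapping "bearer" to an Authorization: Bearer header is an assumption based on common HTTP practice rather than something the descriptor reference states.

```typescript
// Hypothetical mirror of the gateway descriptor fields; not the library's types.
type GatewayAuthentication =
  | { kind: 'none' }
  | { kind: 'bearer'; value: string }
  | { kind: 'header'; headerName: string; value: string };

interface GatewaySketch {
  target: string;
  baseUrl: string;
  authentication: GatewayAuthentication;
  queryRoute?: string;
  chatRoute?: string;
  embedRoute?: string;
}

type Operation = 'query' | 'chat' | 'embed';

// Route defaults as documented in the descriptor table.
const DEFAULT_ROUTES: Record<Operation, string> = {
  query: '/v1/query',
  chat: '/v1/chat',
  embed: '/v1/embed',
};

function requestUrl(descriptor: GatewaySketch, operation: Operation): string {
  // baseUrl must be an absolute HTTP(S) URL.
  if (!/^https?:\/\//i.test(descriptor.baseUrl)) {
    throw new Error('baseUrl must be an absolute HTTP(S) URL');
  }
  const overrides: Record<Operation, string | undefined> = {
    query: descriptor.queryRoute,
    chat: descriptor.chatRoute,
    embed: descriptor.embedRoute,
  };
  const route = overrides[operation] ?? DEFAULT_ROUTES[operation];
  // Drop trailing slashes so the joined URL has exactly one separator.
  return descriptor.baseUrl.replace(/\/+$/, '') + route;
}

// Assumption: bearer tokens travel in the conventional Authorization header.
function authHeaders(auth: GatewayAuthentication): Record<string, string> {
  switch (auth.kind) {
    case 'none':
      return {};
    case 'bearer':
      return { Authorization: `Bearer ${auth.value}` };
    case 'header':
      return { [auth.headerName]: auth.value };
  }
}

const sketch: GatewaySketch = {
  target: 'local',
  baseUrl: 'https://gateway.example.com/',
  authentication: { kind: 'bearer', value: 'example-token' },
};
console.log(requestUrl(sketch, 'chat'), authHeaders(sketch.authentication));
```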

Provider Endpoint

A provider endpoint calls a model provider directly. This is intended for trusted server-side code that manages its own credential lifecycle.

| Field | Type | Description |
| --- | --- | --- |
| kind | "provider" | Endpoint kind selector. |
| provider | "openai" / "anthropic" / "openai_compatible" | Provider adapter. |
| model | string | Provider model identifier. |
| apiKey | string, optional | Provider API key. |
| baseUrl | string, optional | Override for the provider base URL. |

Use a provider endpoint when server-side code should call a provider API directly without a Sipp gateway.


query() — Generate from a Raw Prompt

query(request: SippQueryRequest) -> SippTextRun

query sends the prompt string to the selected endpoint exactly as supplied. No chat template is applied.

Use query when the application owns the full prompt shape, including custom templates, completion-style models, encoder-decoder text models, few-shot prompts, or agent loops that render prompts themselves.
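When the application owns the prompt shape, it typically renders role messages into a single string before calling query. The sketch below shows one way to do that. renderRawPrompt and ChatTurn are hypothetical names, and the <|role|> markers are placeholders in the style of the package examples in this book; a real model defines its own template.

```typescript
interface ChatTurn {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Render turns into one raw prompt string. The markers are placeholders only;
// replace them with the target model's actual template.
function renderRawPrompt(turns: readonly ChatTurn[]): string {
  const lines: string[] = [];
  for (const turn of turns) {
    lines.push(`<|${turn.role}|>`, turn.content);
  }
  // End with an assistant marker so the model continues with its answer.
  lines.push('<|assistant|>');
  return lines.join('\n');
}

const rawPrompt = renderRawPrompt([
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain Sipp in one sentence.' },
]);
console.log(rawPrompt);
```

The resulting string is what would be passed as prompt. With chat, the endpoint performs this rendering instead.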

Request Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint | EndpointRef | Registered endpoint to target. May be omitted only when exactly one local endpoint supports the operation. |
| prompt | string | Raw prompt text. |
| options | SippTextOptions, optional | Shared generation options: maxTokens, temperature, topP, and stop. |
| local | LocalTextOptions, optional | Local-only options such as contextKey, grammar, jsonSchema, sampling overrides, and media inputs. Rejected by gateway endpoints. |
| endpointOptions | map, optional | Free-form options forwarded to gateway endpoint implementations. |
| providerOptions | map, optional | Free-form options forwarded to direct provider adapters. Rejected by gateway endpoints. |
| emitTokens | boolean | When true, stream TokenBatch values through the returned run handle. |
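The option groups are scoped by endpoint kind, which an application can check before sending a request. The sketch below encodes only the two rejections stated in the field descriptions (local and providerOptions on gateway endpoints). RequestSketch and rejectedOptionGroups are hypothetical names, and how the library itself reports a rejection is not covered here.

```typescript
type EndpointKind = 'local' | 'gateway' | 'provider';

// Hypothetical request shape with the option groups left untyped.
interface RequestSketch {
  prompt: string;
  local?: Record<string, unknown>;
  endpointOptions?: Record<string, unknown>;
  providerOptions?: Record<string, unknown>;
}

// Return the option groups that a gateway endpoint is documented to reject.
// Other endpoint kinds are not checked because the text states no rule for them.
function rejectedOptionGroups(kind: EndpointKind, request: RequestSketch): string[] {
  const rejected: string[] = [];
  if (kind === 'gateway') {
    if (request.local !== undefined) rejected.push('local');
    if (request.providerOptions !== undefined) rejected.push('providerOptions');
  }
  return rejected;
}

const candidate: RequestSketch = {
  prompt: 'Explain Sipp in one sentence.',
  local: { contextKey: 'example' },
  endpointOptions: { trace: true },
};
console.log(rejectedOptionGroups('gateway', candidate));
```

A pre-flight check like this is useful in hybrid setups, where the same request-building code may target a local endpoint in one call and a gateway endpoint in the next.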

Return Value

query returns a SippTextRun.

| Member | Type | Description |
| --- | --- | --- |
| response | Promise / Future | Resolves to SippTextResponse when generation completes. |
| tokens | Async iterable | Streams TokenBatch values when emitTokens is true. |
| cancel(reason) | method | Cancels an in-flight generation. |

SippTextResponse contains the generated text, finishReason, token usage, and optional localStats for local endpoints.


chat() — Generate from Role Messages

chat(request: SippChatRequest) -> SippTextRun

chat sends ordered role/content messages to the selected endpoint. The endpoint owns message rendering.

| Endpoint kind | Message handling |
| --- | --- |
| Local | Renders messages through the GGUF-declared tokenizer.chat_template. Fails if the model has no template. |
| Gateway | Forwards messages to the resolved gateway target. Provider targets handle their own message mapping. |
| Provider | Sends messages using the provider’s native chat-completions format. |

Request Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint | EndpointRef | Registered endpoint to target. |
| messages | { role, content }[] | Ordered conversation turns. |
| options | SippTextOptions | Same shared generation options as query. |
| local | LocalTextOptions | Same local-only options as query. |
| emitTokens | boolean | Same streaming control as query. |

Return Value

chat returns the same SippTextRun shape as query.


embed() — Generate an Embedding

embed(request: SippEmbedRequest) -> SippEmbeddingRun

embed produces a single embedding vector from text input. It does not accept generation options and does not stream tokens.

Request Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint | EndpointRef | Registered endpoint to target. |
| input | string | Text to vectorize. |
| local | LocalEmbedOptions, optional | Local embedding options, including contextKey and normalize. |
| endpointOptions | map, optional | Free-form options for gateway endpoint implementations. |
| providerOptions | map, optional | Free-form options for direct provider adapters. |

Return Value

embed returns a SippEmbeddingRun.

| Member | Type | Description |
| --- | --- | --- |
| response | Promise / Future | Resolves to SippEmbeddingResponse when encoding completes. |
| cancel(reason) | method | Cancels an in-flight embedding. |

SippEmbeddingResponse contains the float values array, optional token usage, the pooling strategy, and the normalized flag.
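The normalized flag matters when comparing vectors: for unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below works on bare number arrays and does not depend on the Sipp response types. l2Normalize and cosineSimilarity are illustrative helper names, not library functions.

```typescript
// Scale a vector to unit length (L2 norm of 1). A zero vector is returned unchanged.
function l2Normalize(values: readonly number[]): number[] {
  let sumSquares = 0;
  for (const v of values) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return values.slice();
  return values.map((v) => v / norm);
}

// Cosine similarity: the dot product of the two unit-length vectors.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  const ua = l2Normalize(a);
  const ub = l2Normalize(b);
  return ua.reduce((sum, v, i) => sum + v * (ub[i] ?? 0), 0);
}

console.log(l2Normalize([3, 4]));
console.log(cosineSimilarity([1, 0], [0, 1]));
```

If both embeddings already report normalized as true, the normalization step is redundant and the dot product alone gives the same ranking.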


Gateway and Client Symmetry

The same SippClient API works on both sides of the gateway boundary.

Server Side

A server process creates a SippClient, registers local endpoints, and maps HTTP routes to query, chat, or embed.

Server client:
  add("local-model", LocalDescriptor { modelPath, config })
  -> route handler decodes HTTP request
  -> route handler calls client.query/chat/embed
  -> route handler encodes HTTP response

The first-party Gateway Server uses this pattern. Application-owned Node, Python, or Rust servers can also use it through the gateway profile helpers.

Client Side

A client process creates a SippClient, registers gateway endpoints, and calls query, chat, or embed the same way it would call a local endpoint.

Client client:
  add("remote", GatewayDescriptor { target, baseUrl, authentication })
  -> client.query/chat/embed({ endpoint: ref, ... })
  -> request is sent to the gateway over HTTP

Hybrid Pattern

A single client can register multiple endpoint kinds. The application chooses where an operation runs by passing a different endpoint reference.

localRef = client.add("local", LocalDescriptor { ... })
gatewayRef = client.add("gateway", GatewayDescriptor { ... })

client.query({ endpoint: localRef, prompt, ... })
client.query({ endpoint: gatewayRef, prompt, ... })

The operation code stays the same. Only the endpoint reference changes.

Why the Endpoint Model Matters

The endpoint model gives applications one API surface across multiple deployment shapes.

| Benefit | Description |
| --- | --- |
| Stable operation code | query, chat, and embed are called the same way for local, gateway, provider, and hybrid setups. |
| Swappable execution targets | Move inference between local models, gateway targets, and direct providers by changing endpoint descriptors. |
| Clear ownership boundaries | Local endpoints keep lifecycle in-process; gateway endpoints move access, credentials, policy, and metrics to a service boundary. |
| Language symmetry | Patterns learned in one language package transfer directly to the others. |
| Extensible endpoint kinds | New endpoint kinds can be added without changing the operation call pattern. |

Visual Summary

flowchart LR
    %% -------------------------
    %% Node Styling
    %% -------------------------
    classDef client_node fill:#eef6ff,stroke:#4a90e2,stroke-width:1.5px,color:#111,rx:6,ry:6;
    classDef setup_node fill:#f7f7f7,stroke:#999,stroke-width:1px,color:#111,rx:6,ry:6;
    classDef runtime_node fill:#f3fff0,stroke:#52a852,stroke-width:2px,color:#111,rx:6,ry:6;
    classDef gateway_node fill:#fff7e6,stroke:#d99000,stroke-width:2px,color:#111,rx:6,ry:6;
    classDef provider_node fill:#f8f0ff,stroke:#8e44ad,stroke-width:1.5px,color:#111,rx:6,ry:6;

    %% -------------------------
    %% Client Process
    %% -------------------------
    subgraph CLIENT["Client Process"]
        direction TB
        CApp["Application Code"]:::client_node
        CClient["SippClient<br/>add(...) -> EndpointRef<br/>query / chat / embed"]:::client_node
        CApp --> CClient

        %% Logical grouping for endpoint registration options
        subgraph CSetup["Endpoint Setup (options)"]
            direction LR
            CLocalEP["local (GGUF)"]:::setup_node
            CGatewayEP["gateway (Remote)"]:::setup_node
            CProviderEP["provider (API)"]:::setup_node
        end
        CClient -. "Registers" .-> CSetup

        %% Local execution flow for local ref
        subgraph CLocalRuntime["Local Runtime"]
            direction LR
            CLocalRun["GGUF Runtime"]:::runtime_node
        end

        %% Connection for local usage
        CClient -- "Local Ref (query)" --> CLocalRun
    end

    %% -------------------------
    %% Server Process
    %% -------------------------
    subgraph SERVER["Server Process / Gateway Server"]
        direction TB
        SGateway["Gateway Server<br/>HTTP: /v1/query, /v1/chat, /v1/embed"]:::gateway_node
        SClient["SippClient (same lib)"]:::client_node
        SGateway --> SClient

        %% Logical grouping for endpoint registration options
        subgraph SSetup["Endpoint Setup (options)"]
            direction LR
            SLocalEP["local (GGUF)"]:::setup_node
            SProviderEP["provider (API)"]:::setup_node
        end
        SClient -. "Registers" .-> SSetup

        %% Local execution flow for local ref
        subgraph SLocalRuntime["Local Runtime"]
            direction LR
            SLocalRun["GGUF Runtime"]:::runtime_node
        end

        %% Connection for local usage
        SClient -- "Local Ref (query)" --> SLocalRun
    end

    %% -------------------------
    %% External Providers
    %% -------------------------
    Providers["Provider APIs<br/>OpenAI / Gemini / Anthropic / etc."]:::provider_node

    %% -------------------------
    %% Cross-process / Remote connections
    %% -------------------------
    CClient == "Gateway Ref (query)" ==> SGateway
    CClient == "Provider Ref (query)" ==> Providers
    SClient == "Provider Ref (query)" ==> Providers

    %% -------------------------
    %% Styling Assignment to Nodes
    %% -------------------------
    class CApp,CClient client_node;
    class CLocalEP,CGatewayEP,CProviderEP,SLocalEP,SProviderEP setup_node;
    class CLocalRun,SLocalRun runtime_node;
    class SGateway gateway_node;
    class Providers provider_node;

Browser Package

The browser package target is @sipp/sipp. It exposes SippClient for browser-local GGUF inference, gateway calls, provider descriptors where supported, token streaming, OPFS-backed model caching, and browser runtime lifecycle management.

See the Library API Overview for the shared add, query, chat, and embed contracts.

Install

npm install @sipp/sipp

Use this package in browser code. For server routes or Node services, use @sipp/sipp-server.

Use It For

  • Browser-local text and vision inference.
  • WebGPU or CPU execution through the browser runtime.
  • OPFS-backed model caching.
  • Gateway-backed query, chat, and embedding calls.
  • Character and director helpers used by demos.

Local GGUF Chat

import { SippClient, type ChatMessage } from '@sipp/sipp';

const client = new SippClient();
const endpoint = await client.add('default', {
  kind: 'local',
  source: '/models/model.gguf',
  options: {
    backend: 'webgpu',
    runtime: {
      context: { n_ctx: 2048 },
    },
  },
});

const messages: readonly ChatMessage[] = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain Sipp in one sentence.' },
];

const run = client.chat(messages, {
  endpoint,
  emitTokens: true,
  maxTokens: 64,
  contextKey: 'browser-local',
});

let streamed = '';
for await (const batch of run.tokens) {
  streamed += batch.text;
}
const response = await run.response;
console.log(streamed || response.text);
await client.close();

Use query when the prompt is already rendered for the target model. See the API overview for the query/chat/embed contracts.

Gateway Chat

Use gateway endpoints when a separate server owns model paths, provider credentials, target policy, and metrics.

const endpoint = await client.add('gateway', {
  kind: 'gateway',
  target: 'local',
  baseUrl: 'https://gateway.example.com',
  authentication: {
    kind: 'bearer',
    valueProvider: getShortLivedGatewayToken,
  },
});
const messages = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain gateway inference.' },
];

const run = client.chat(messages, {
  endpoint,
  maxTokens: 64,
});

Browser apps should use short-lived gateway tokens or proxy through an application server route. Do not ship provider credentials or long-lived gateway tokens in browser bundles.

Browser Runtime Options

The browser runtime links Sipp’s Rust WASM ABI with llama.cpp and ggml through Emscripten. It runs GGUF text and vision models with WebGPU when the browser exposes a compatible adapter, and falls back to CPU execution for compatible local workflows. OPFS-backed model caching keeps repeated browser loads local after the first model fetch or file import.

The package resolves its packaged JavaScript and WASM assets at runtime. Most apps should not override asset URLs. Use executionMode, wasmThreading, browserCache, and local endpoint options.runtime only when the application needs explicit control over browser execution, storage, or local runtime behavior.

See Runtime Options for SippClient options, WebGPU/backend selection, worker mode, pthread requirements, and local runtime config groups.

Node.js Package

The Node.js package target is @sipp/sipp-server. It exposes the native Sipp client API to Node server processes, route handlers, and framework server functions. Applications own framework routes, request validation, auth, and deployment policy.

See the Library API Overview for the shared add, query, chat, and embed contracts.

Install

npm install @sipp/sipp-server

Use this package only in Node runtime code. Browser components should use @sipp/sipp.

@sipp/sipp-server is a wrapper package. npm installs the matching optional platform package for the current OS and CPU, and the runtime loader selects the best packaged backend for that host.

Use It For

  • Server-side local GGUF inference.
  • Gateway-backed and provider-backed inference from server code.
  • Token streaming from Node processes.
  • Framework route handlers in Node runtimes.
  • Backend selection for native bindings.

Local GGUF Query

import { SippClient } from '@sipp/sipp-server';

const client = new SippClient();
const endpoint = await client.add('default', {
  kind: 'local',
  modelPath: process.argv[2],
  config: {
    context: { n_ctx: 2048 },
    scheduler: { continuous_batching: true, prefill_chunk_size: 0 },
    cache: { mode: 'live_slot_prefix' },
    observability: { runtime_metrics: true },
  },
});
const queryPrompt = [
  '<|system|>',
  'Answer concisely.',
  '<|user|>',
  'Explain Sipp in one sentence.',
  '<|assistant|>',
].join('\n');

const run = client.query({
  endpoint,
  // query: raw prompt; replace markers with the target model's template.
  prompt: queryPrompt,
  emitTokens: true,
  options: { maxTokens: 64, temperature: 0.7 },
  local: { contextKey: 'node-local' },
});

let streamed = '';
for await (const batch of run.tokens) {
  streamed += batch.text;
}
const response = await run.response;
console.log(streamed || response.text);

Set SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal to choose a native backend. By default, macOS tries metal then cpu; Windows and Linux try cuda, vulkan, then cpu. See Runtime Options for local runtime config groups and request option boundaries.

On Intel Macs with integrated GPUs, prefer SIPP_NODE_BACKEND=cpu. The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs. Apple Silicon can run x64 Node through Rosetta 2, but x64 packages are used only by an x64 Node process; native arm64 Node should use arm64 packages.
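The default backend order described above can be written down as a small function, which is handy for logging or tests in application code. This is an illustration of the documented order, not the package's loader. backendCandidates is a hypothetical name, and treating an explicit override as a single candidate with no fallback is an assumption.

```typescript
type NativeBackend = 'cpu' | 'vulkan' | 'cuda' | 'metal';
type HostPlatform = 'darwin' | 'linux' | 'win32';

const KNOWN_BACKENDS: readonly NativeBackend[] = ['cpu', 'vulkan', 'cuda', 'metal'];

// Return the backends to try, in order, for a host platform.
// `override` stands in for the value of SIPP_NODE_BACKEND, if set.
function backendCandidates(platform: HostPlatform, override?: string): NativeBackend[] {
  if (override !== undefined && override !== '') {
    const match = KNOWN_BACKENDS.find((backend) => backend === override);
    if (match === undefined) {
      throw new Error(`unknown backend: ${override}`);
    }
    // Assumption: an explicit choice is tried alone, without fallback.
    return [match];
  }
  // Documented defaults: macOS tries metal then cpu; Windows and Linux try
  // cuda, vulkan, then cpu.
  return platform === 'darwin' ? ['metal', 'cpu'] : ['cuda', 'vulkan', 'cpu'];
}

console.log(backendCandidates('darwin'));
console.log(backendCandidates('linux'));
console.log(backendCandidates('darwin', 'cpu'));
```

The last call corresponds to the Intel Mac recommendation: forcing cpu skips the Metal attempt entirely.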

Gateway Chat

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

const endpoint = await client.add('gateway', {
  kind: 'gateway',
  target: requiredEnv('SIPP_GATEWAY_TARGET'),
  baseUrl: requiredEnv('SIPP_GATEWAY_URL'),
  authentication: {
    kind: 'bearer',
    value: requiredEnv('SIPP_GATEWAY_TOKEN'),
  },
});
const messages = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain gateway inference.' },
];
const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 64 },
});
console.log((await run.response).text);

The application only needs the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.

Direct Provider Chat

Use direct provider endpoints only in trusted server code. Keep the provider key in the server environment; OPENAI_API_KEY="<mock-openai-key>" is only a placeholder value in examples.

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

const endpoint = await client.add('provider', {
  kind: 'provider',
  provider: 'openai',
  model: process.env.OPENAI_MODEL ?? 'gpt-5-mini',
  apiKey: requiredEnv('OPENAI_API_KEY'),
});
const messages = [
  { role: 'system', content: 'Answer concisely.' },
  { role: 'user', content: 'Explain provider inference.' },
];
const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 64 },
});
console.log((await run.response).text);

Pass provider-only request fields through providerOptions. See Providers for the full provider/gateway split.

Gateway Profile Helpers

Use the gateway profile helpers when a Node route should behave like a first-party gateway endpoint for browser kind: 'gateway' clients. The helpers decode model, prompt, messages, input, and snake_case generation options, then format JSON or SSE responses. The route can execute the decoded request against a provider, a local endpoint, or a separate gateway.

import {
  SippClient,
  decodeGatewayQueryBody,
  gatewayErrorResponse,
  gatewayTextResponseBody,
  gatewayTextStreamResponse,
} from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export async function handleQuery(request: Request): Promise<Response> {
  try {
    const decoded = decodeGatewayQueryBody(await request.json());
    const client = new SippClient();
    const endpoint = await client.add('provider', {
      kind: 'provider',
      provider: 'openai',
      model: decoded.target,
      apiKey: requiredEnv('OPENAI_API_KEY'),
    });
    const run = client.query({ ...decoded.request, endpoint });
    return decoded.stream
      ? gatewayTextStreamResponse(run)
      : Response.json(
          gatewayTextResponseBody(decoded.target, await run.response),
        );
  } catch (error) {
    const response = gatewayErrorResponse(error);
    return Response.json(response.body, response.init);
  }
}

Use decodeGatewayChatBody() and decodeGatewayEmbedBody() for /v1/chat and /v1/embed compatible routes. Use gatewayEmbeddingResponseBody() for finite embedding responses.
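To make the snake_case decoding step concrete, the sketch below maps generation options from a JSON body onto the camelCase shared options. It is not the implementation of the decode helpers. decodeTextOptions is a hypothetical name, the wire field names (max_tokens, temperature, top_p, stop) are assumed to be the snake_case forms of the shared options, and silently ignoring wrongly typed values is a choice made for this sketch.

```typescript
interface SharedTextOptions {
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stop?: string[];
}

// Map assumed snake_case wire fields to camelCase options, keeping only
// values of the expected type.
function decodeTextOptions(body: Record<string, unknown>): SharedTextOptions {
  const options: SharedTextOptions = {};
  const maxTokens = body['max_tokens'];
  const temperature = body['temperature'];
  const topP = body['top_p'];
  const stop = body['stop'];
  if (typeof maxTokens === 'number') options.maxTokens = maxTokens;
  if (typeof temperature === 'number') options.temperature = temperature;
  if (typeof topP === 'number') options.topP = topP;
  if (Array.isArray(stop) && stop.every((s) => typeof s === 'string')) {
    options.stop = stop as string[];
  }
  return options;
}

const decodedOptions = decodeTextOptions({
  max_tokens: 64,
  top_p: 0.9,
  temperature: 'hot', // wrong type, so it is dropped in this sketch
});
console.log(decodedOptions);
```

A stricter route could reject malformed fields with an error response instead of dropping them; which behavior the first-party helpers use is not stated here.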

Framework Routes

Use @sipp/sipp-server in server-only code such as Next.js App Router route handlers with runtime = 'nodejs', TanStack Start server functions, Express routes, or background workers. Do not import it from browser bundles.

Python Package

The Python wheel is named sipppy. Python code imports the sipp module, which exposes native descriptor classes, run handles, token streaming, and the same endpoint model as the Rust client.

Published wheels require Python 3.10 or newer.

See the Library API Overview for the shared add, query, chat, and embed contracts.

Install

Note

Python wheels currently ship from the project’s GitHub Releases, not PyPI. A full PyPI release with a complete build matrix (CPU and GPU backends across operating systems, architectures, and Python versions, in the style of PyTorch’s distribution matrix) is in progress. The package name sipppy and the sipp import are stable; only the distribution channel will change.

Download the sipppy wheel that matches your platform, Python version, and backend from the GitHub Releases page, then install it with pip. The default wheel includes the CPU backend:

pip install sipppy

Install PyPI-published GPU backends as extras:

pip install "sipppy[vulkan]"
pip install "sipppy[metal]"

The backend wheels are separate PyPI distributions. For example, sipppy[vulkan] installs the main sipppy wheel plus the matching sipppy-backend-vulkan wheel for the same release version. Python code still imports sipp. CUDA backend wheels are attached to GitHub releases for the first public release and will move to PyPI after the CUDA wheel size limit is raised.

Use It For

  • Python applications that need local GGUF inference.
  • Gateway-backed inference from Python services or scripts.
  • Direct provider descriptors where server-side credentials are appropriate.
  • Runtime metrics and backend selection in Python services.

Local GGUF Query

import sys

from sipp import (
    CacheRuntimeConfig,
    SippClient,
    SippTextOptions,
    ContextRuntimeConfig,
    LocalModelDescriptor,
    LocalTextOptions,
    NativeRuntimeConfig,
    ObservabilityRuntimeConfig,
    SchedulerRuntimeConfig,
)


client = SippClient()
endpoint = client.add(
    "default",
    LocalModelDescriptor(
        sys.argv[1],
        NativeRuntimeConfig(
            context=ContextRuntimeConfig(n_ctx=2048),
            scheduler=SchedulerRuntimeConfig(
                continuous_batching=True,
                prefill_chunk_size=0,
            ),
            cache=CacheRuntimeConfig(mode="live_slot_prefix"),
            observability=ObservabilityRuntimeConfig(runtime_metrics=True),
        ),
    ),
)
query_prompt = "\n".join(
    [
        "<|system|>",
        "Answer concisely.",
        "<|user|>",
        "Explain Sipp in one sentence.",
        "<|assistant|>",
    ]
)
run = client.query(
    # query: raw prompt; replace markers with the target model's template.
    query_prompt,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
    local=LocalTextOptions(context_key="python-local"),
)
print(run.result()["text"])

Set SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal to choose an installed native backend. See Runtime Options for local runtime config groups and request option boundaries.

On Intel Macs with integrated GPUs, prefer SIPP_PYTHON_BACKEND=cpu. The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs. Apple Silicon can run x64 Python through Rosetta 2, but x64 wheels are used only by an x64 Python process; native arm64 Python should use arm64 wheels.

Gateway Chat

import os

from sipp import ChatMessage, SippClient, SippTextOptions, GatewayDescriptor


client = SippClient()
endpoint = client.add(
    "gateway",
    GatewayDescriptor(
        os.environ["SIPP_GATEWAY_TARGET"],
        os.environ["SIPP_GATEWAY_URL"],
        authentication_kind="bearer",
        authentication_value=os.environ["SIPP_GATEWAY_TOKEN"],
    ),
)
messages = [
    ChatMessage("system", "Answer concisely."),
    ChatMessage("user", "Explain gateway inference."),
]
run = client.chat(
    messages,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
)
print(run.result()["text"])

Gateway clients need only the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.

Rust Package

The Rust package target is sipp-rs. It publishes the sipp library crate for Rust applications and re-exports the high-level client API plus selected runtime, backend, lifecycle, shard, provider, and gateway types.

sipp-rs depends on sipp-sys, the native llama.cpp FFI crate. Installing sipp-rs from crates.io builds the native backend from source on the target machine; it is not a binary wheel-style package.

See the Library API Overview for the shared add, query, chat, and embed contracts.

Install

cargo add sipp-rs

The release workflow publishes sipp-sys first, then publishes sipp-rs. Applications depend on the sipp-rs package and import the sipp crate.

Build Requirements

Rust applications that depend on sipp-rs need the normal Rust toolchain plus the native build tools used by sipp-sys:

  • A C/C++ compiler for the target platform.
  • CMake.
  • Ninja or a compatible CMake generator.
  • Platform SDKs required by the selected backend.

The CPU native backend is the baseline and does not require a Cargo feature. Backend features add their own requirements:

  • cuda: CUDA Toolkit plus a compatible NVIDIA driver.
  • metal: macOS with Xcode command line tools.
  • vulkan: Vulkan SDK or system Vulkan development libraries.
  • openmp: OpenMP compiler/runtime support for the target platform.

Use It For

  • Rust applications that need local GGUF inference.
  • Gateway-backed query, chat, and embedding calls.
  • Direct provider descriptors behind the providers feature.
  • Shared Sipp value types across application boundaries.

Local GGUF Query

#![allow(unused)]
fn main() {
use sipp::{
    SippClient, SippQueryRequest, SippTextOptions, EndpointDescriptor,
    LocalTextOptions,
};
use sipp::engine::{
    CacheRuntimeConfig, ContextRuntimeConfig, KvReuseMode, NativeRuntimeConfig,
    ObservabilityRuntimeConfig, SchedulerRuntimeConfig,
};

async fn run(
    model_path: std::path::PathBuf,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut client = SippClient::new();
    let endpoint = client
        .add(
            "default",
            EndpointDescriptor::local(model_path, runtime_config()),
        )
        .await?;

    let response = client
        .query(SippQueryRequest {
            endpoint: Some(endpoint),
            prompt: "Explain Sipp in one sentence.".to_string(),
            options: SippTextOptions {
                max_tokens: Some(64),
                ..Default::default()
            },
            local: LocalTextOptions {
                context_key: Some("rust-local".to_string()),
                ..Default::default()
            },
            ..Default::default()
        })
        .await?;
    println!("{}", response.text);
    Ok(())
}

fn runtime_config() -> NativeRuntimeConfig {
    NativeRuntimeConfig {
        context: ContextRuntimeConfig {
            n_ctx: Some(2048),
            ..Default::default()
        },
        scheduler: SchedulerRuntimeConfig {
            continuous_batching: true,
            prefill_chunk_size: 0,
            ..Default::default()
        },
        cache: CacheRuntimeConfig {
            mode: KvReuseMode::LiveSlotPrefix,
            ..Default::default()
        },
        observability: ObservabilityRuntimeConfig {
            runtime_metrics: true,
            backend_profiling: false,
        },
        ..Default::default()
    }
}
}

See Runtime Options for the shared runtime config groups and request option boundaries.

Gateway Query

use sipp::{
    SippClient, SippQueryRequest, SippTextOptions, EndpointDescriptor,
    GatewayAuthentication, GatewayEndpointConfig, GatewayRoutes, GatewaySecret,
    GatewayTimeoutPolicy,
};

// Wrapped in an async function so that `?` and `.await` are valid;
// call it from the application's async runtime.
async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = SippClient::new();
    let endpoint = client
        .add(
            "gateway",
            EndpointDescriptor::gateway(GatewayEndpointConfig {
                // Target, URL, and token come from the environment.
                target: std::env::var("SIPP_GATEWAY_TARGET")?,
                base_url: std::env::var("SIPP_GATEWAY_URL")?,
                routes: GatewayRoutes::default(),
                authentication: GatewayAuthentication::Bearer(GatewaySecret::new(
                    std::env::var("SIPP_GATEWAY_TOKEN")?,
                )),
                static_headers: Default::default(),
                timeouts: GatewayTimeoutPolicy::default(),
                protocol_options: Default::default(),
            }),
        )
        .await?;

    let response = client
        .query(SippQueryRequest {
            endpoint: Some(endpoint),
            prompt: "Explain gateway inference.".to_string(),
            options: SippTextOptions {
                max_tokens: Some(64),
                ..Default::default()
            },
            ..Default::default()
        })
        .await?;
    println!("{}", response.text);
    Ok(())
}

Frameworks

These guides show how to use the JavaScript-facing Sipp packages in common application frameworks. See the Library API Overview for the shared add, query, chat, and embed contracts.

Use the browser package, @sipp/sipp, when inference runs in the browser or when browser code calls a gateway. Use the Node package, @sipp/sipp-server, only in server-only code such as route handlers, server functions, API routes, workers, or services that run in a Node.js runtime.

Guides

  • React And Vite: Baseline browser-local setup, WebGPU/WASM asset behavior, OPFS model loading, and local development headers.
  • Next.js: App Router provider routes, Client Components, gateway-profile compatibility, and streaming.
  • TanStack: TanStack Start provider functions, server routes, and TanStack Query patterns.

Package Selection

| Environment | Package | Notes |
| --- | --- | --- |
| Browser component | @sipp/sipp | Use for browser-local GGUF inference or direct gateway calls. |
| Node server route | @sipp/sipp-server | Use for direct provider endpoints, local server inference, or gateway clients. |
| Gateway profile route | @sipp/sipp-server | Use when a browser kind: 'gateway' endpoint calls a framework route. |
| Gateway client | Either | Browser code can call a separate gateway with short-lived tokens, or server code can use server-held secrets. |

Provider-First Server Routes

Next.js and TanStack server routes should usually use direct provider endpoints when the framework server owns the credential. Register a provider in server-only code:

const endpoint = await client.add('provider', {
  kind: 'provider',
  provider: 'openai',
  model: requiredEnv('OPENAI_MODEL'),
  apiKey: requiredEnv('OPENAI_API_KEY'),
});

Use OPENAI_API_KEY="<mock-openai-key>" only as a placeholder in docs and examples. Do not expose real provider keys in browser bundles.

Gateway Route Field Names

Browser gateway descriptors require an absolute http or https baseUrl and use routes: { query, chat, embed } for route overrides. Node gateway descriptors use queryRoute, chatRoute, and embedRoute when server code calls a gateway through @sipp/sipp-server.
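
When the same route overrides are shared between browser and server code, a small helper can bridge the two naming schemes. The sketch below is illustrative only: BrowserGatewayRoutes, NodeGatewayRoutes, isAbsoluteHttpUrl, and toNodeGatewayRoutes are hypothetical names, not exports of either Sipp package. Only the field names and the absolute http or https rule come from the text above.

```typescript
// Hypothetical shapes for illustration; the real descriptor types ship with the Sipp packages.
interface BrowserGatewayRoutes {
  query?: string;
  chat?: string;
  embed?: string;
}

interface NodeGatewayRoutes {
  queryRoute?: string;
  chatRoute?: string;
  embedRoute?: string;
}

// Returns true only for absolute http or https URLs.
function isAbsoluteHttpUrl(value: string): boolean {
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    // Relative paths and malformed input fail to parse without a base URL.
    return false;
  }
}

// Copies browser-style route overrides onto the Node-style field names.
function toNodeGatewayRoutes(routes: BrowserGatewayRoutes): NodeGatewayRoutes {
  const out: NodeGatewayRoutes = {};
  if (routes.query !== undefined) out.queryRoute = routes.query;
  if (routes.chat !== undefined) out.chatRoute = routes.chat;
  if (routes.embed !== undefined) out.embedRoute = routes.embed;
  return out;
}
```

A check like isAbsoluteHttpUrl(baseUrl) can run before registering a browser gateway endpoint, so a relative path such as /api/sipp/query is caught early.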

Keep provider credentials and long-lived gateway tokens out of browser bundles. When a browser app needs gateway access, issue short-lived application tokens or proxy through a server route.

Use decodeGatewayQueryBody(), decodeGatewayChatBody(), decodeGatewayEmbedBody(), and the matching response helpers from @sipp/sipp-server when a framework route should be registered as a browser kind: 'gateway' endpoint. Those helpers keep route examples focused on auth, target policy, provider selection, and client lifecycle instead of gateway profile JSON shaping.

React And Vite

React and Vite are the baseline browser integration for the @sipp/sipp package. Use this guide for Vite-specific setup, local development headers, runtime asset overrides, and the source browser examples.

For the full local inference option map, see Local Inference and Runtime Options.

Install

npm install @sipp/sipp

Browser Local Query

Use @sipp/sipp only in browser code. A local endpoint source can be a model URL served by the app, a user-provided File, an installed model id, or shard sources.

import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalQuery(): JSX.Element {
  const [text, setText] = useState('');

  async function run(): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('default', {
        kind: 'local',
        source: '/models/model.gguf',
        options: {
          backend: 'webgpu',
          runtime: {
            context: { n_ctx: 2048 },
          },
        },
      });
      const response = await client.query('Explain Sipp.', {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run()}>
      {text || 'Run'}
    </button>
  );
}

Omit backend to let the browser runtime choose a compatible backend. Use backend: 'webgpu' when the UI should explicitly request WebGPU and surface errors or fallbacks itself.

Local Development Headers

The pthread WASM runtime requires SharedArrayBuffer and cross-origin isolation. Configure Vite dev and preview headers before using wasmThreading: 'pthread':

// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
  preview: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});

Use wasmThreading: 'single-thread' when the app cannot serve those headers. Use executionMode: 'main-thread' only for debugging or constrained hosts.
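
An app can decide between the two threading values at runtime instead of hard-coding one. The following is a minimal sketch, assuming the decision is made in application code before the Sipp runtime is configured. pickWasmThreading and IsolationProbe are hypothetical names; crossOriginIsolated and SharedArrayBuffer are standard browser globals.

```typescript
type WasmThreading = 'pthread' | 'single-thread';

// Minimal view of the globals this check reads; pass globalThis in a browser.
interface IsolationProbe {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// Chooses 'pthread' only when the page is cross-origin isolated and
// SharedArrayBuffer is exposed; otherwise falls back to 'single-thread'.
function pickWasmThreading(probe: IsolationProbe): WasmThreading {
  const isolated = probe.crossOriginIsolated === true;
  const hasSharedMemory = typeof probe.SharedArrayBuffer === 'function';
  return isolated && hasSharedMemory ? 'pthread' : 'single-thread';
}
```

In a browser, pickWasmThreading(globalThis) returns 'single-thread' whenever the COOP and COEP headers are missing, which matches the fallback described above.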

Runtime Asset Overrides

The browser package resolves its packaged Emscripten JavaScript and WASM assets at runtime. Most Vite apps can use new SippClient() without asset overrides.

Override runtime asset URLs only when your bundler or deployment moves package assets:

const client = new SippClient({
  moduleUrl: '/assets/sipp-wasm.js',
  wasmUrl: '/assets/sipp-wasm.wasm',
});

When overriding assets, provide both moduleUrl and wasmUrl. For pthread runtime assets, provide both pthreadModuleUrl and pthreadWasmUrl.
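
The pairing rule is easy to check before constructing the client. This is a sketch under stated assumptions: AssetOverrides and checkAssetOverrides are hypothetical, and only the four option names are taken from the text above.

```typescript
// Hypothetical shape; only the four option names come from the documentation.
interface AssetOverrides {
  moduleUrl?: string;
  wasmUrl?: string;
  pthreadModuleUrl?: string;
  pthreadWasmUrl?: string;
}

// Returns a list of problems; an empty list means every override is paired.
function checkAssetOverrides(overrides: AssetOverrides): string[] {
  const problems: string[] = [];
  const hasModule = overrides.moduleUrl !== undefined;
  const hasWasm = overrides.wasmUrl !== undefined;
  if (hasModule !== hasWasm) {
    problems.push('moduleUrl and wasmUrl must be provided together');
  }
  const hasPthreadModule = overrides.pthreadModuleUrl !== undefined;
  const hasPthreadWasm = overrides.pthreadWasmUrl !== undefined;
  if (hasPthreadModule !== hasPthreadWasm) {
    problems.push('pthreadModuleUrl and pthreadWasmUrl must be provided together');
  }
  return problems;
}
```

Running the check in development builds surfaces a half-configured override before the runtime tries to fetch a missing asset.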

Model Files And Cache

Serve model URLs from the application or let users select local .gguf files. The browser runtime stores model data through OPFS where available, so repeated loads can stay local after the first import or fetch.

Tune browser storage with browserCache on SippClient and tune local runtime behavior with options.runtime on the local endpoint descriptor. See Browser Caching and Runtime Options.

Existing Examples

Serve the source examples when working from a checkout:

sipp run examples serve browser

Then open the printed URL and use:

  • /query.html
  • /chat.html
  • /embed.html
  • /gateway_local.html
  • /gateway_query.html
  • /gateway_chat.html
  • /gateway_embed.html

The gateway pages demonstrate browser calls to gateway-profile endpoints. Keep production server routes in a route-owning framework, an application server, or the first-party gateway server.

Next.js

Use @sipp/sipp-server in App Router route handlers that run in the Node.js runtime. Use @sipp/sipp only in Client Components or browser-only modules.

Next.js App Router pages and layouts are Server Components by default. Add 'use client' only to modules that need browser APIs, state, event handlers, or browser-local Sipp runtime access.

Profile-Compatible Provider Route

Route handlers are a good place to keep provider credentials off the client. Set runtime = 'nodejs' for routes that import @sipp/sipp-server.

Routes that are registered from a browser kind: 'gateway' endpoint must speak the first-party gateway profile. Use the gateway profile helpers from @sipp/sipp-server to decode the incoming body and format JSON or SSE responses. The route can still execute the request against a direct provider endpoint.

Use OPENAI_API_KEY="<mock-openai-key>" as a placeholder in examples. In a real deployment, keep the key in your server environment or secret manager.

// app/api/sipp/query/route.ts
import {
  SippClient,
  decodeGatewayQueryBody,
  gatewayErrorResponse,
  gatewayTextResponseBody,
  gatewayTextStreamResponse,
} from '@sipp/sipp-server';

export const runtime = 'nodejs';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export async function POST(request: Request): Promise<Response> {
  try {
    const decoded = decodeGatewayQueryBody(await request.json());
    const client = new SippClient();
    const endpoint = await client.add('provider', {
      kind: 'provider',
      provider: 'openai',
      model: decoded.target,
      apiKey: requiredEnv('OPENAI_API_KEY'),
    });
    const run = client.query({
      ...decoded.request,
      endpoint,
    });
    if (decoded.stream) {
      return gatewayTextStreamResponse(run);
    }
    return Response.json(
      gatewayTextResponseBody(decoded.target, await run.response),
    );
  } catch (error) {
    const response = gatewayErrorResponse(error);
    return Response.json(response.body, response.init);
  }
}

Do not return an app-specific shape such as { text } from a route that the browser package calls through client.add({ kind: 'gateway' }). That route is an HTTP gateway endpoint from the browser client’s perspective, even when it is implemented inside the Next application. The server-side implementation can resolve the request to a provider, a local endpoint, or a separate gateway.

For high-throughput services, keep endpoint setup in a server-only module and reuse the client lifecycle according to your deployment model. Do not import that module from Client Components.
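
One way to reuse setup across requests is to memoize the asynchronous initialization in a server-only module. The sketch below is a generic pattern, not part of the Sipp API: once is a hypothetical helper, and whether a single long-lived client suits your deployment is an assumption to verify.

```typescript
// Wraps an async factory so it runs at most once per process.
// Concurrent callers share the same in-flight promise.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory().catch((error: unknown) => {
        // Clear the cache so a later request can retry a failed setup.
        cached = undefined;
        throw error;
      });
    }
    return cached;
  };
}
```

The factory passed to once would construct the SippClient and call client.add as the route above does, and each request handler would await the shared result instead of repeating the setup.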

Streaming Route Handler

Use a route handler when the browser should receive token updates but the server should keep the provider credential.

// app/api/sipp/stream/route.ts
import { SippClient } from '@sipp/sipp-server';

export const runtime = 'nodejs';

const encoder = new TextEncoder();

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export async function POST(request: Request): Promise<Response> {
  const { prompt } = await request.json() as { prompt?: string };
  if (prompt == null || prompt.trim() === '') {
    return Response.json({ error: 'prompt is required' }, { status: 400 });
  }

  const client = new SippClient();
  const endpoint = await client.add('provider', {
    kind: 'provider',
    provider: 'openai',
    model: requiredEnv('OPENAI_MODEL'),
    apiKey: requiredEnv('OPENAI_API_KEY'),
  });
  const run = client.query({
    endpoint,
    prompt,
    emitTokens: true,
    options: { maxTokens: 128 },
  });

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const batch of run.tokens) {
          controller.enqueue(encoder.encode(batch.text));
        }
        await run.response;
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
    cancel() {
      run.cancel('client_disconnected');
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

Browser-Local Client Component

Browser-local inference needs browser APIs and should live behind a Client Component boundary.

// app/local-chat/LocalChat.tsx
'use client';

import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalChat(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('default', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run('Explain local inference.')}>
      {text || 'Run'}
    </button>
  );
}

If you override moduleUrl, wasmUrl, pthreadModuleUrl, or pthreadWasmUrl, provide both the JavaScript and WASM asset URLs for the selected runtime. Use wasmThreading: 'pthread' only when the app is served with cross-origin isolation headers that enable SharedArrayBuffer.

Hybrid Client Component

Use one browser SippClient to register a browser-local endpoint and a same-origin provider route that speaks the gateway profile. Select the endpoint reference at request time; the query call stays the same.

// app/hybrid-chat/HybridChat.tsx
'use client';

import { useState } from 'react';
import { SippClient, type EndpointRef } from '@sipp/sipp';

type InferenceMode = 'local' | 'providerRoute';

export function HybridChat(): JSX.Element {
  const [mode, setMode] = useState<InferenceMode>('local');
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const localEndpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const providerRouteEndpoint = await client.add('app-route', {
        kind: 'gateway',
        target: 'gpt-5-mini',
        baseUrl: window.location.origin,
        routes: { query: '/api/sipp/query' },
        authentication: { kind: 'none' },
      });
      const endpoint: EndpointRef =
        mode === 'local' ? localEndpoint : providerRouteEndpoint;
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <>
      <select
        value={mode}
        onChange={(event) => setMode(event.currentTarget.value as InferenceMode)}
      >
        <option value="local">Browser local</option>
        <option value="providerRoute">Provider route</option>
      </select>
      <button type="button" onClick={() => void run('Explain hybrid inference.')}>
        {text || 'Run'}
      </button>
    </>
  );
}

Browser gateway descriptors require an absolute http or https baseUrl. For same-origin Next routes, use window.location.origin and set route overrides such as routes: { query: '/api/sipp/query' }. The target value becomes the provider model in the server route above.

Separate Gateway Pattern

Use a separate Sipp gateway when you want central target policy, shared provider credentials, local model hosting, rate controls, or metrics across multiple applications. For direct browser-to-gateway calls, do not embed a long-lived gateway token in the client bundle. Have a Next route issue a short-lived app token, then use a browser valueProvider:

const endpoint = await client.add('gateway', {
  kind: 'gateway',
  target: 'local',
  baseUrl: 'https://gateway.example.com',
  authentication: {
    kind: 'bearer',
    valueProvider: async () => {
      const response = await fetch('/api/sipp/token', { method: 'POST' });
      return await response.text();
    },
  },
});
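
If the client invokes valueProvider often, the token route may be hit more than necessary. The following is a minimal sketch of a cache around the fetch, assuming the application knows how long its tokens stay valid. cachedTokenProvider, ttlMs, and the injectable clock are hypothetical and not part of the Sipp packages.

```typescript
type TokenFetcher = () => Promise<string>;

// Returns a provider that reuses a fetched token until ttlMs has elapsed.
function cachedTokenProvider(
  fetchToken: TokenFetcher,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<string> {
  let cached: { value: Promise<string>; expiresAt: number } | undefined;
  return () => {
    const current = now();
    if (cached !== undefined && current < cached.expiresAt) {
      return cached.value;
    }
    const entry = { value: fetchToken(), expiresAt: current + ttlMs };
    cached = entry;
    // Drop a failed fetch so the next call retries instead of reusing the rejection.
    entry.value.catch(() => {
      if (cached === entry) cached = undefined;
    });
    return entry.value;
  };
}
```

The returned function has the same shape as the valueProvider above, so it could wrap the fetch('/api/sipp/token') call. Choose a ttlMs shorter than the real token lifetime so an expiring token is not reused.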

TanStack

TanStack apps usually need three Sipp patterns:

  • TanStack Start server functions for server-only Sipp work, provider credentials, local model paths, gateway tokens, and typed app RPC.
  • TanStack Start server routes when browser code should register the route as a kind: 'gateway' endpoint through the Sipp browser package.
  • TanStack Query for client-side final responses that can be cached or refetched by query key.

Use explicit component state or a custom hook for token streaming. TanStack Query is best for Promise-shaped final data, not for appending token batches as they arrive.

TanStack Start Server Function

Server functions run on the server and can be called from loaders, components, hooks, or other server functions. Keep @sipp/sipp-server, provider credentials, and gateway tokens in server-only functions.

Use OPENAI_API_KEY="<mock-openai-key>" as a placeholder in examples. In a real deployment, keep the key in your server environment or secret manager.

// src/server/sipp.ts
import { createServerFn } from '@tanstack/react-start';
import { SippClient } from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export const querySipp = createServerFn({ method: 'POST' })
  .inputValidator((data: { prompt: string }) => data)
  .handler(async ({ data }) => {
    const client = new SippClient();
    const endpoint = await client.add('provider', {
      kind: 'provider',
      provider: 'openai',
      model: requiredEnv('OPENAI_MODEL'),
      apiKey: requiredEnv('OPENAI_API_KEY'),
    });
    const run = client.query({
      endpoint,
      prompt: data.prompt,
      options: { maxTokens: 128 },
    });
    const response = await run.response;
    return { text: response.text, usage: response.usage };
  });

Validate server-function inputs with the same rigor as any public endpoint. Server functions are callable network endpoints, so apply application auth and tenant checks inside the function or middleware.
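
The pass-through validator in the example above only asserts a type at compile time. A runtime check could look like the following sketch. validatePromptInput and MAX_PROMPT_CHARS are hypothetical, and the length limit is an arbitrary application choice.

```typescript
interface PromptInput {
  prompt: string;
}

// Hypothetical application limit; tune it to your own cost and latency budget.
const MAX_PROMPT_CHARS = 4000;

// Narrows untrusted input to PromptInput or throws a descriptive error.
function validatePromptInput(data: unknown): PromptInput {
  if (typeof data !== 'object' || data === null) {
    throw new Error('input must be an object');
  }
  const prompt = (data as { prompt?: unknown }).prompt;
  if (typeof prompt !== 'string' || prompt.trim() === '') {
    throw new Error('prompt is required');
  }
  if (prompt.length > MAX_PROMPT_CHARS) {
    throw new Error(`prompt exceeds ${MAX_PROMPT_CHARS} characters`);
  }
  // Return only the validated field so extra properties are dropped.
  return { prompt };
}
```

A function of this shape is intended to take the place of the pass-through callback given to inputValidator. Authentication and tenant checks still belong in the handler or middleware.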

Server functions are a good fit for typed application calls that return application-owned shapes such as { text }. They are not the right surface for browser client.add({ kind: 'gateway' }) endpoints, because those endpoints expect the first-party gateway HTTP profile.

TanStack Start Provider Route

Use a server route when the browser package should call the framework route as a gateway endpoint. The route accepts the first-party query profile and returns the fields consumed by browser gateway endpoints. The gateway profile helpers decode the browser request and format JSON or SSE responses. The route can then execute the request against a direct provider endpoint.

// src/routes/api/sipp/query.ts
import { createFileRoute } from '@tanstack/react-router';
import {
  SippClient,
  decodeGatewayQueryBody,
  gatewayErrorResponse,
  gatewayTextResponseBody,
  gatewayTextStreamResponse,
} from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export const Route = createFileRoute('/api/sipp/query')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        try {
          const decoded = decodeGatewayQueryBody(await request.json());
          const client = new SippClient();
          const endpoint = await client.add('provider', {
            kind: 'provider',
            provider: 'openai',
            model: decoded.target,
            apiKey: requiredEnv('OPENAI_API_KEY'),
          });
          const run = client.query({
            ...decoded.request,
            endpoint,
          });
          if (decoded.stream) {
            return gatewayTextStreamResponse(run);
          }
          return Response.json(
            gatewayTextResponseBody(decoded.target, await run.response),
          );
        } catch (error) {
          const response = gatewayErrorResponse(error);
          return Response.json(response.body, response.init);
        }
      },
    },
  },
});

This route takes the model field of the browser's gateway-profile request, exposed by the decoder as decoded.target, and uses it as the provider model. The provider credential stays on the server. Add application auth or model allowlists before exposing the route to users.
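
A model allowlist can be a short guard that runs before the provider endpoint is registered. This is a sketch only: assertAllowedTarget and TargetNotAllowedError are hypothetical names, and how gatewayErrorResponse maps a custom error to an HTTP status is not covered here.

```typescript
// Hypothetical error type so callers can tell policy failures from other errors.
class TargetNotAllowedError extends Error {
  constructor(target: string) {
    super(`target "${target}" is not allowed`);
    this.name = 'TargetNotAllowedError';
  }
}

// Returns the target unchanged when it is on the allowlist, otherwise throws.
function assertAllowedTarget(
  target: string,
  allowed: ReadonlySet<string>,
): string {
  if (!allowed.has(target)) {
    throw new TargetNotAllowedError(target);
  }
  return target;
}
```

In the route, the guard would wrap the requested target, for example assertAllowedTarget(decoded.target, allowedModels), so a browser cannot select an arbitrary provider model.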

Use a separate Sipp gateway when you want central target policy, shared provider credentials, local model hosting, rate controls, or metrics across multiple applications.

TanStack Query For Final Responses

Use TanStack Query when the UI needs a final response and normal query cache behavior.

import { useQuery } from '@tanstack/react-query';
import { querySipp } from '../server/sipp';

export function Answer({ prompt }: { readonly prompt: string }): JSX.Element {
  const result = useQuery({
    queryKey: ['sipp-query', prompt],
    queryFn: () => querySipp({ data: { prompt } }),
    enabled: prompt.trim() !== '',
  });

  if (result.isPending) return <p>Loading...</p>;
  if (result.isError) return <p>{result.error.message}</p>;
  return <pre>{result.data.text}</pre>;
}

Keep the query key tied to the prompt, target, and any user-visible generation options that change the result.
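
A key builder keeps those inputs in one place. The sketch below assumes the UI exposes a target and a maxTokens option; sippQueryKey and SippQueryKeyInput are hypothetical names.

```typescript
interface SippQueryKeyInput {
  prompt: string;
  target: string;
  maxTokens?: number;
}

// Builds a stable key; an unset option is normalized to null so that
// "unset" always serializes the same way.
function sippQueryKey(
  input: SippQueryKeyInput,
): readonly [string, string, string, number | null] {
  return ['sipp-query', input.target, input.prompt, input.maxTokens ?? null];
}
```

Passing sippQueryKey(...) as queryKey means a change to the target or token limit produces a new cache entry instead of returning a stale answer.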

Streaming Tokens

For token streaming, create a server route or server function that returns a stream, then append chunks with component state.

import { useState } from 'react';

export function StreamingAnswer(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    setText('');
    const response = await fetch('/api/sipp/stream', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
    });
    if (response.body == null) {
      throw new Error('streaming response body is missing');
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
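    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      // Decode outside the state updater: updaters must be pure, and the
      // streaming decoder keeps internal state between calls.
      const chunk = decoder.decode(value, { stream: true });
      setText((current) => current + chunk);
    }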
  }

  return (
    <button type="button" onClick={() => void run('Explain streaming.')}>
      {text || 'Run'}
    </button>
  );
}

Browser Package

Use the browser package, @sipp/sipp, from components that run in the browser. That includes browser-local GGUF inference and gateway endpoints with short-lived tokens or same-origin server routes.

import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalAnswer(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run('Explain local inference.')}>
      {text || 'Run'}
    </button>
  );
}

Do not import @sipp/sipp-server from browser modules.

Browser Hybrid Endpoints

Register browser-local and same-origin gateway endpoints on one browser SippClient, then choose the endpoint reference for each request. The same-origin route can execute against a provider while still speaking the gateway profile to the browser client.

import { useState } from 'react';
import { SippClient, type EndpointRef } from '@sipp/sipp';

type InferenceMode = 'local' | 'providerRoute';

export function HybridAnswer(): JSX.Element {
  const [mode, setMode] = useState<InferenceMode>('local');
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const localEndpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const providerRouteEndpoint = await client.add('app-route', {
        kind: 'gateway',
        target: 'gpt-5-mini',
        baseUrl: window.location.origin,
        routes: { query: '/api/sipp/query' },
        authentication: { kind: 'none' },
      });
      const endpoint: EndpointRef =
        mode === 'local' ? localEndpoint : providerRouteEndpoint;
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <>
      <select
        value={mode}
        onChange={(event) => setMode(event.currentTarget.value as InferenceMode)}
      >
        <option value="local">Browser local</option>
        <option value="providerRoute">Provider route</option>
      </select>
      <button type="button" onClick={() => void run('Explain hybrid inference.')}>
        {text || 'Run'}
      </button>
    </>
  );
}

Browser gateway descriptors need an absolute http or https baseUrl. Same-origin TanStack routes should use window.location.origin and route overrides such as routes: { query: '/api/sipp/query' }. The target value becomes the provider model in the server route above.

Gateway

Sipp gateway workflows put one HTTP boundary in front of local GGUF targets and provider-backed targets. Applications still use the same client model: register an endpoint with SippClient.add, keep the returned endpoint reference, and choose that reference for query, chat, or embed.

Use a gateway when you want a separate process to own model paths, provider credentials, target access policy, concurrency limits, metrics, and operational routes.

Notices

Warning

The gateway server is in active development. Changes will be made frequently, and things will break. If you use it for production, be cautious and watch for release updates. Join our Discord server to follow development.

What To Use

| Need | Start here |
| --- | --- |
| Run the first-party server from a checkout | Server |
| Build and run the Docker image | Docker |
| Understand the TOML file | Configuration |
| Test with curl, Postman, or raw HTTP | Testing |
| Operate health, metrics, admin, and ingress | Operations |
| Build your own gateway application | Toolkit |
| Understand package boundaries | Architecture |
| Debug common failures | Troubleshooting |

The current release workflow publishes browser npm, Node npm, Python wheels, and Rust crates. It does not yet publish a standalone gateway-server binary, public container image, or cargo install target. Build the first-party server from the source checkout or with the provided Dockerfile.

Gateway Shapes

  • First-party server: apps/gateway-server provides TOML configuration, bearer-token policy, local and provider targets, management routes, metrics, and an Admin Dashboard.
  • Docker image: apps/gateway-server/Dockerfile builds the same staged gateway distribution and runs sipp-gateway serve --config /etc/sipp/gateway.toml.
  • Gateway toolkit: lib/gateway provides codecs, HTTP error helpers, authentication traits, observability traits, and the first-party JSON/SSE profile for custom applications.
  • Gateway clients: Browser, Node, Python, and Rust packages all register gateway endpoints through the same .add path used for local and provider endpoints.

Deployment Shapes

  • On-board GPU inference: configure a local GGUF target, build or run the gateway with vulkan, cuda, or metal, and mount or point at the model path the process can read.
  • Provider-only router: configure only provider targets such as openai, openai_compatible, or anthropic. No local model path or /models mount is required, and a CPU gateway image is sufficient because inference runs at the provider.
  • Hybrid: configure both a local GPU target and provider targets. Clients still send the public gateway target name in the request model field.

Default Routes

The first-party server examples use:

  • Public: /v1/query, /v1/chat, /v1/embed.
  • Management: /, /healthz, /readyz, /metrics, /admin.

Those paths are application configuration, not core library behavior. Custom gateway applications can choose their own routes.

Gateway Quickstart

Use the on-board local path when the gateway should load a GGUF model, or the provider-only path when it should route requests upstream. Read Server and Docker before production deployment.

On-Board Local From Source

cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml

Edit apps/gateway-server/config/local.toml:

  • Set the local target model to a GGUF file visible from the workspace root.
  • Keep local source binds on 127.0.0.1.
  • Keep admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD" unless you also change the .env secret name.

Load secrets and start:

set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend vulkan
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend vulkan

Use cuda for NVIDIA hosts or metal for macOS hosts when those are the intended on-board inference backends.

Provider-Only From Source

cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml

Set provider secrets in apps/gateway-server/.env, then run:

set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/provider-only.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/provider-only.toml --backend cpu

Use the request target openai-chat with the checked-in provider-only example.

Docker

Docker uses one secrets-only .env, one gateway TOML, and one explicit Compose file:

cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development.yml.example apps/gateway-server/development.yml
cp apps/gateway-server/config/development.toml.example apps/gateway-server/config/development.toml
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up

Use development-provider-only.yml.example and config/provider-only.toml.example for provider-only Docker.

First HTTP Request

In a second terminal:

set -a
. apps/gateway-server/.env
set +a
export GATEWAY_URL="http://127.0.0.1:8080"
export GATEWAY_MANAGEMENT_URL="http://127.0.0.1:9090"

curl --fail --silent "$GATEWAY_MANAGEMENT_URL/readyz"
curl -sS "$GATEWAY_URL/v1/query" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"local","prompt":"Explain gateway inference.","max_tokens":64}'

Use "model":"openai-chat" for the provider-only example.
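
The same request can be issued from TypeScript with fetch. The sketch below only assembles the request, so it can be inspected or tested without a running gateway. buildQueryRequest and its types are hypothetical; the path, headers, and body fields mirror the curl command above.

```typescript
interface GatewayQueryBody {
  model: string;
  prompt: string;
  max_tokens?: number;
}

interface BuiltRequest {
  url: string;
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
  };
}

// Assembles the URL and fetch options for a POST to /v1/query.
function buildQueryRequest(
  gatewayUrl: string,
  token: string,
  body: GatewayQueryBody,
): BuiltRequest {
  return {
    url: new URL('/v1/query', gatewayUrl).toString(),
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}
```

Pass the result to fetch(url, init) from server code or a script, with the token read from the environment rather than hard-coded.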

Open http://127.0.0.1:9090/admin and log in with the value of SIPP_GATEWAY_ADMIN_PASSWORD.

Gateway Server

The Sipp Gateway Server is the first-party HTTP application for teams that want one inference boundary for local GGUF targets and provider-backed targets. It lives in apps/gateway-server.

This page covers running the server from a source checkout and from the generated executable. Use Docker for container workflows and Configuration for the TOML schema.

The current release workflow does not publish a standalone binary, public container image, or cargo install target. Build it from the source checkout.

Source Workflow

Use sipp for source checkout workflows. sipp is the setup-installed launcher for cargo xtask; when the launcher is unavailable, use cargo xtask with the same arguments.

cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend vulkan
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend vulkan

Before running real on-board inference tests, update the ignored local TOML with the token env names, admin password env name, and model path. Update only secret values in the secrets env file.

sipp run gateway-server check builds the staged gateway distribution for the selected backend, then runs sipp-gateway check. The binary check command parses and validates TOML only. It does not read bearer-token environment variables, load model files, contact providers, or bind ports.

sipp run gateway-server serve builds the staged gateway distribution, then runs the generated sipp-gateway executable from the workspace root. It reads secret environment variables named by TOML, loads targets, binds both listeners, and exits cleanly on Ctrl-C.

Use --backend cpu|vulkan|cuda|metal|all to select the backend compiled into the staged gateway distribution.

Provider-Only Source Workflow

Provider-only gateways route to upstream APIs and do not load a local GGUF model. Use a CPU gateway build because inference happens at the provider:

cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/provider-only.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/provider-only.toml --backend cpu

Use Configuration for Anthropic and OpenAI-compatible target snippets.

Generated Executable

sipp build gateway-server --backend <backend> stages a runnable distribution in .build/artifacts/gateway-server. The directory contains the sipp-gateway executable, base runtime libraries, and selected GGML backend plugins. The build also compiles the React Admin Dashboard from apps/gateway-server/admin-ui and copies its Vite output to .build/artifacts/gateway-server/admin-ui. Keep the executable, dashboard asset directory, and runtime libraries together.

Direct execution must put the artifact directory on the dynamic loader path. The executable reads dashboard assets from admin-ui beside the binary unless SIPP_GATEWAY_ADMIN_ASSETS_DIR points at another Vite dist directory.

Linux:

set -a
. apps/gateway-server/.env
set +a
export LD_LIBRARY_PATH="$(pwd)/.build/artifacts/gateway-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
.build/artifacts/gateway-server/sipp-gateway check --config apps/gateway-server/config/local.toml
.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/local.toml

macOS:

set -a
. apps/gateway-server/.env
set +a
export DYLD_LIBRARY_PATH="$(pwd)/.build/artifacts/gateway-server${DYLD_LIBRARY_PATH:+:$DYLD_LIBRARY_PATH}"
.build/artifacts/gateway-server/sipp-gateway check --config apps/gateway-server/config/local.toml
.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/local.toml

Windows PowerShell:

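Get-Content apps\gateway-server\.env | ForEach-Object {
    $line = $_.Trim()
    # Skip blank lines, comments, and lines without a NAME=value pair.
    if ($line -and -not $line.StartsWith("#") -and $line.Contains("=")) {
        $name, $value = $line.Split("=", 2)
        Set-Item -Path "Env:$name" -Value $value
    }
}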
$dist = Join-Path (Get-Location) ".build\artifacts\gateway-server"
$env:PATH = "$dist;$env:PATH"
.\.build\artifacts\gateway-server\sipp-gateway.exe check --config apps\gateway-server\config\local.toml
.\.build\artifacts\gateway-server\sipp-gateway.exe serve --config apps\gateway-server\config\local.toml

Relative model paths in TOML are resolved from the process working directory. The sipp run gateway-server ... workflow runs from the workspace root. When running the executable from another directory, use absolute model paths or start the process from the workspace root.
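The working-directory rule can be illustrated with a small sketch. The helper name `resolve_model_path` is hypothetical and not part of the gateway; it only shows how a relative `model` value depends on where the process starts, while an absolute value does not.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve a TOML `model` value against the process
/// working directory, the way a relative filesystem path behaves.
pub fn resolve_model_path(working_dir: &Path, model: &str) -> PathBuf {
    // `Path::join` returns `model` unchanged when it is already absolute,
    // so absolute paths are independent of the working directory.
    working_dir.join(model)
}
```

With a working directory of `/srv/sipp`, the relative value `.build/models/m.gguf` lands under `/srv/sipp`, while `/models/model.gguf` stays as written.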

Backends

The gateway server supports the same native backend names as other native targets:

  • cpu: provider-only router build or local-inference diagnostic backend.
  • cuda: NVIDIA CUDA backend.
  • metal: Apple Metal backend on macOS.
  • vulkan: Vulkan backend.
  • all: host-supported backend set for build commands.

For on-board local target TOML, backend = "auto" selects the best compiled and available backend in this order: CUDA, Metal, Vulkan, then CPU. Production model-serving configs should use auto or an explicit GPU backend. Explicit cpu disables GPU offload and is intended only for diagnostics. Explicit GPU backends fail if that backend was not compiled or is unavailable.
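The selection rule above can be sketched as follows. This is an illustration of the documented order only, not the gateway's implementation: `Backend`, `select_backend`, and the `usable` slice (backends assumed to be both compiled in and detected on the host) are hypothetical names.

```rust
/// Backends named in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cuda,
    Metal,
    Vulkan,
    Cpu,
}

/// Documented preference order for `backend = "auto"`.
const AUTO_ORDER: [Backend; 4] = [Backend::Cuda, Backend::Metal, Backend::Vulkan, Backend::Cpu];

/// Hypothetical helper: `requested` is the TOML `backend` string and `usable`
/// lists backends that are assumed to be compiled in and available.
pub fn select_backend(requested: &str, usable: &[Backend]) -> Result<Backend, String> {
    let explicit = match requested {
        "auto" => {
            // Walk the preference order and take the first usable backend.
            return AUTO_ORDER
                .iter()
                .copied()
                .find(|b| usable.contains(b))
                .ok_or_else(|| "no usable backend".to_string());
        }
        "cuda" => Backend::Cuda,
        "metal" => Backend::Metal,
        "vulkan" => Backend::Vulkan,
        "cpu" => Backend::Cpu,
        other => return Err(format!("unknown backend: {other}")),
    };
    // Explicit choices never fall back to another backend.
    if usable.contains(&explicit) {
        Ok(explicit)
    } else {
        Err(format!("backend {requested} is not compiled or not available"))
    }
}
```

The point of the sketch is the asymmetry: `auto` degrades gracefully down to CPU, while an explicit GPU name is treated as a hard requirement.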

Admin Dashboard

The Admin Dashboard password is read from the env var named by TOML:

admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD"

Keep the real value in a secrets env file or production secret manager.

Gateway Docker

Gateway Docker workflows use explicit Compose files plus the gateway TOML and a secrets-only .env file.

The separation is strict:

  • .env contains secret values only.
  • TOML contains gateway application configuration.
  • Compose YAML contains Docker build, image, port, mount, healthcheck, and container orchestration settings.

The container runs:

sipp-gateway serve --config /etc/sipp/gateway.toml

Files

  • apps/gateway-server/Dockerfile builds the staged gateway distribution.
  • apps/gateway-server/.env.example is the secrets-only env template.
  • apps/gateway-server/development.yml.example builds and runs a local model-serving image.
  • apps/gateway-server/development-provider-only.yml.example builds and runs a provider-router image with no model mount.
  • apps/gateway-server/production.yml.example runs a prebuilt production model-serving image.
  • apps/gateway-server/production-provider-only.yml.example runs a prebuilt provider-router image with no model mount.
  • apps/gateway-server/config/*.toml.example are gateway application config templates.

Local Model-Serving Docker

From the repository root:

cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development.yml.example apps/gateway-server/development.yml
cp apps/gateway-server/config/development.toml.example apps/gateway-server/config/development.toml

Edit apps/gateway-server/.env and set only secrets:

SIPP_GATEWAY_ADMIN_PASSWORD=replace-me
SIPP_GATEWAY_TOKEN=replace-me
OPENAI_API_KEY=replace-me
ANTHROPIC_API_KEY=replace-me

Edit apps/gateway-server/config/development.toml:

  • Set the local target model to the path the container sees, usually /models/<file>.gguf.
  • Keep public_bind = "0.0.0.0:8080" and management_bind = "0.0.0.0:9090" so the gateway listens inside the container.
  • Keep admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD" unless the .env secret name also changes.

Edit apps/gateway-server/development.yml for Docker concerns such as image tag, build backend, build images, model mount, port publishing, and healthcheck.

Build and run with one backend profile. CPU works on Windows, macOS, and Linux. GPU containers require host-specific device support.

Warning

Windows Docker Desktop does not support the first-party Vulkan gateway path. NVIDIA Windows hosts should use the cuda profile. Do not use old vulkan-windows configs; ggml_vulkan: No devices found means the container cannot enumerate a usable Vulkan physical device.

# CPU, portable across Windows, macOS, and Linux
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-cpu
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-cpu

# CUDA, Linux or Windows Docker Desktop with NVIDIA GPU support
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-cuda
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-cuda

# Vulkan on native Linux, uses /dev/dri
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-vulkan-linux
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-vulkan-linux

If Compose reports orphan containers after switching service names, remove the old containers once:

docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml down --remove-orphans

Provider-Only Docker

Provider-only Docker runs use the provider-only Compose template and no model mount:

cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development-provider-only.yml.example apps/gateway-server/development-provider-only.yml
cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml

Set secrets in apps/gateway-server/.env, then run:

docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml config
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml build
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml up

The provider-only template builds a CPU gateway image because inference happens upstream.

Production Docker

Keep production TOML, Compose, and .env copies outside the repository:

mkdir -p /opt/sipp/gateway
cp apps/gateway-server/.env.example /opt/sipp/gateway/.env
cp apps/gateway-server/production.yml.example /opt/sipp/gateway/production.yml
cp apps/gateway-server/config/production.toml.example /opt/sipp/gateway/production.toml

Edit /opt/sipp/gateway/.env for secret values only. Edit /opt/sipp/gateway/production.toml for gateway runtime configuration. Edit /opt/sipp/gateway/production.yml for image names, host model mounts, ports, restart policy, and healthcheck.

Deploy with one backend profile:

# CPU
docker compose --profile cpu --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile cpu --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-cpu

# CUDA, requires NVIDIA Container Toolkit on the host
docker compose --profile cuda --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile cuda --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-cuda

# Vulkan on Linux hosts, requires /dev/dri rendering devices
docker compose --profile vulkan-linux --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile vulkan-linux --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-vulkan-linux

For provider-only production, copy production-provider-only.yml.example and config/provider-only.toml.example instead.

Bind And Mount Behavior

The TOML file always uses the same schema, but bind and path interpretation changes by runtime mode.

| Runtime | TOML bind values | Host exposure | Local target model path |
| --- | --- | --- | --- |
| Source/exe | Host addresses, usually 127.0.0.1:* for development | The process binds directly on the host | Path seen from the process working directory |
| Local Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Compose ports map host ports to 127.0.0.1 in local templates | /models/<file>.gguf |
| Production Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Compose exposes public and keeps management host-local by default | /models/<file>.gguf |
| Provider-only Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Provider-only templates follow the same port rules | No local model path |

Keep management private in production. Put public ingress, TLS, and external auth controls in front of the public listener when needed.

Raw Docker Build

Raw Docker commands are supported as an escape hatch. Supply every build arg explicitly:

docker build \
  --build-arg SIPP_GATEWAY_BACKEND=vulkan \
  --build-arg SIPP_GATEWAY_BUILDER_IMAGE=rust:bookworm \
  --build-arg SIPP_GATEWAY_RUNTIME_IMAGE=ubuntu:22.04 \
  --build-arg SIPP_GATEWAY_INSTALL_RUSTUP=0 \
  -f apps/gateway-server/Dockerfile \
  -t sipp-gateway:vulkan .

Backend Hardware & Docker Constraints

Published gateway images use backend-specific tags: latest-cpu, latest-cuda, and latest-vulkan.

Supported first-party Docker profiles:

| Host runtime | GPU vendor | Supported profile | Backend | Notes |
| --- | --- | --- | --- | --- |
| Linux Docker | NVIDIA | cuda | CUDA | Recommended NVIDIA GPU path. Requires NVIDIA drivers and container runtime support. |
| Linux Docker | AMD or Intel | vulkan-linux | Vulkan | Requires host /dev/dri rendering devices and a usable Vulkan driver stack. |
| Linux Docker | No supported GPU | cpu | CPU | Portable diagnostic and fallback path. |
| Windows Docker Desktop | NVIDIA | cuda | CUDA | Requires Docker Desktop WSL2 GPU support and NVIDIA container GPU passthrough. |
| Windows Docker Desktop | AMD or Intel | cpu | CPU | First-party Docker does not support Windows Vulkan GPU inference. |
| macOS Docker | Any | cpu | CPU | Metal is available only through native macOS execution, not Linux Docker. |

CPU Backend (latest-cpu / cpu profile)

  • Standard portable execution. Works on any host without special driver dependencies.
  • This is the Docker path for macOS local development.

CUDA Backend (latest-cuda / cuda profile)

  • Requires the NVIDIA Container Toolkit to be installed and configured on the host.
  • Requires NVIDIA host GPU drivers.
  • Exposed using Docker Compose GPU device reservation capabilities.
  • Supported on Linux and Windows Docker Desktop WSL2 hosts with NVIDIA GPU support.

CUDA Architecture Selection

Set SIPP_CUDA_ARCHITECTURES to control the compiled GPU architecture list. The value is passed verbatim to CMake, so use semicolon-separated entries. In Docker builds, pass it as the SIPP_CUDA_ARCHITECTURES build arg; the Compose CUDA service forwards it to the builder stage.

Defaults are layered:

  • cargo xtask build CUDA targets (node, python, cli, gateway-server) default to the portable cloud GPU list below so packaged artifacts stay deterministic across build hosts. Docker gateway builds run xtask, so they inherit the same default when the build arg is empty.
  • Raw cargo build of sipp-sys outside xtask does not set CMAKE_CUDA_ARCHITECTURES, which lets vendored llama.cpp choose CUDA-version-aware defaults for the local toolkit.

Portable cloud GPU release images use:

75-virtual;80-virtual;86-real;89-real;90-virtual;120a-real;121a-real

| Entry | Target GPUs |
| --- | --- |
| 75-virtual | T4 and other Turing cloud GPUs |
| 80-virtual | A100 and other Ampere data-center GPUs |
| 86-real | A10, A40, RTX A6000-class Ampere |
| 89-real | L4, L40S, Ada |
| 90-virtual | H100, H200 Hopper |
| 120a-real | Blackwell architecture-specific target |
| 121a-real | Newer Blackwell architecture-specific target |

For faster builds targeting a known GPU, narrow the list. For example, 80 for A100 only, 90 for H100/H200 only, or 89 for L4/L40S only.
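To make the list syntax concrete, here is a small sketch that splits a SIPP_CUDA_ARCHITECTURES value into entries. It relies on CMake's CUDA_ARCHITECTURES convention, where a `-real` suffix requests device code only, `-virtual` requests PTX only, and no suffix requests both. The names `CodeKind` and `parse_cuda_architectures` are hypothetical and exist only for illustration.

```rust
/// How CMake treats one CUDA_ARCHITECTURES entry.
#[derive(Debug, PartialEq, Eq)]
pub enum CodeKind {
    /// `-real`: device code for that exact architecture only.
    Real,
    /// `-virtual`: PTX only, compiled by the driver when the module loads.
    Virtual,
    /// No suffix: both device code and PTX.
    Both,
}

/// Hypothetical parser for a semicolon-separated architecture list.
pub fn parse_cuda_architectures(list: &str) -> Vec<(String, CodeKind)> {
    list.split(';')
        .filter(|entry| !entry.is_empty())
        .map(|entry| {
            if let Some(arch) = entry.strip_suffix("-real") {
                (arch.to_string(), CodeKind::Real)
            } else if let Some(arch) = entry.strip_suffix("-virtual") {
                (arch.to_string(), CodeKind::Virtual)
            } else {
                (entry.to_string(), CodeKind::Both)
            }
        })
        .collect()
}
```

A narrowed value such as `89` parses to a single entry that produces both device code and PTX for that architecture.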

CUDA 13 removes offline compilation support for GPU architectures before compute capability 7.5, so 61 (Pascal) and 70 (Volta) are excluded from CUDA 13 builds. Supporting those GPUs requires a separate legacy build using a CUDA 12.x toolkit image with an explicit SIPP_CUDA_ARCHITECTURES list.

The a-suffix Blackwell entries are architecture-specific and not forward-compatible; keep them aligned with the targets vendored llama.cpp uses. Plain TensorRT-free CUDA images are the default because the gateway links against CUDA runtime libraries only; use TensorRT images only if a TensorRT dependency is introduced.

Vulkan Backend (latest-vulkan image)

  • Supported first-party Docker profile is Linux-only: vulkan-linux.
  • Linux runs expose host rendering devices with /dev/dri:/dev/dri.
  • Windows Docker Desktop Vulkan is unsupported for gateway inference. NVIDIA Windows hosts should use cuda instead.
  • The runtime container packages libvulkan1 and mesa-vulkan-drivers for the supported Linux Vulkan profile.

Apple Metal Backend (macOS hypervisor constraints)

Warning

Metal cannot run inside a standard Linux Docker container. Docker on macOS runs containers inside a Linux virtual machine, and Apple does not support forwarding the Metal GPU API from macOS into Linux VMs.

Due to this hard architectural boundary:

  1. Docker Limitation: Running the gateway container on macOS results in a CPU-only fallback or a Vulkan device discovery failure. There is no Metal GPU acceleration.
  2. Native Execution: To use Apple Silicon GPU acceleration (Metal), macOS users must compile and run the gateway server natively:
    cargo xtask build gateway-server --backend metal
    ./.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/development.toml
    

Health Check

The Compose templates probe the management readiness route:

curl --fail --silent http://127.0.0.1:9090/readyz

If you change the readiness route in TOML, update the Compose healthcheck too.

Gateway Configuration

apps/gateway-server is configured by one TOML file. The same schema is used for source/exe runs and Docker runs; only path and bind interpretation changes. Use Gateway Server for source/exe commands and Docker for container commands.

Example

public_bind = "0.0.0.0:8080"
management_bind = "0.0.0.0:9090"
max_request_bytes = 1048576
max_concurrent_requests = 4
allowed_origins = []
admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD"

[security.client_ip]
source = "peer"
trusted_proxy_cidrs = []

[security.rate_limit]
enabled = false
requests_per_minute = 60
burst = 60

[routes]
query = "/v1/query"
chat = "/v1/chat"
embed = "/v1/embed"
index = "/"
health = "/healthz"
readiness = "/readyz"
metrics = "/metrics"
admin = "/admin"

[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "production-client"
targets = ["local"]

[[targets]]
name = "local"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"

Gateway Deployment Shapes

The same TOML schema supports three deployment shapes. Choose the shape by the configured targets.

On-Board GPU Inference

Use a local GGUF target when the gateway server owns model loading and GPU inference:

[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "gpu-client"
targets = ["local-gpu"]

[[targets]]
name = "local-gpu"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"

Use backend = "auto" or an explicit GPU backend such as cuda, metal, or vulkan. The process must be able to read the GGUF path. Docker runs usually mount the host model directory at /models.

Provider-Only Router

Use provider targets only when the gateway should hold provider credentials and route client prompts to upstream APIs without loading a local model:

[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "provider-client"
targets = ["openai-chat"]

[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"
timeout_seconds = 60

Provider-only configs have no type = "local" target, no model filesystem path, and no backend field. CPU gateway builds are appropriate here because the gateway is not performing on-board inference.

Hybrid

Use both target families when clients should be able to choose between a server-hosted local model and provider endpoints:

[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "hybrid-client"
targets = ["local-gpu", "openai-chat"]

[[targets]]
name = "local-gpu"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"

[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"
timeout_seconds = 60

Requests select the public target name through the request model field, for example local-gpu or openai-chat.
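As a rough sketch of that lookup, the request `model` value can be treated as a key into the configured target names. The `TargetKind` enum and `resolve_target` function are hypothetical; the 404 value mirrors the status this book lists for an unconfigured target.

```rust
use std::collections::HashMap;

/// Hypothetical tag for the two target families used in the hybrid example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetKind {
    Local,
    OpenAi,
}

/// Look up the request `model` field among configured `[[targets]]` names.
/// The error value is the HTTP status a caller would see.
pub fn resolve_target(targets: &HashMap<String, TargetKind>, model: &str) -> Result<TargetKind, u16> {
    // An unknown name is reported as 404 (target not configured).
    targets.get(model).copied().ok_or(404)
}
```

Clients never send a filesystem path or provider model id here; they send only the public name, such as `local-gpu`.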

Top-Level Fields

| Field | Meaning |
| --- | --- |
| public_bind | Address for public inference routes. Source/exe binds this on the host; Docker binds inside the container. |
| management_bind | Address for health, readiness, metrics, index, and admin routes. Must differ from public_bind. |
| max_request_bytes | Maximum HTTP request body size. Must be greater than zero. |
| max_concurrent_requests | Optional application-wide request admission limit. Omit for unbounded. |
| allowed_origins | CORS allowlist for browser requests to the public listener. Empty disables the CORS layer. |
| admin_password_env | Environment variable containing the Admin Dashboard password. Required and non-blank. |
| security | Required in-memory client identification and rate limiting settings. |

check validates these fields without reading secret env vars, loading models, contacting providers, or binding ports.
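A minimal sketch of the constraints in the field table, written as a pure function so it touches no environment, model, provider, or port. `check_top_level` is a hypothetical name, and treating "must differ" as plain socket-address inequality is an assumption.

```rust
use std::net::SocketAddr;

/// Hypothetical stand-in for the top-level field checks described above.
pub fn check_top_level(
    public_bind: &str,
    management_bind: &str,
    max_request_bytes: u64,
    admin_password_env: &str,
) -> Result<(), String> {
    let public = public_bind
        .parse::<SocketAddr>()
        .map_err(|e| format!("public_bind: {e}"))?;
    let management = management_bind
        .parse::<SocketAddr>()
        .map_err(|e| format!("management_bind: {e}"))?;
    // Assumption: "must differ" means the two socket addresses are not equal.
    if public == management {
        return Err("management_bind must differ from public_bind".into());
    }
    if max_request_bytes == 0 {
        return Err("max_request_bytes must be greater than zero".into());
    }
    // Only the variable name is checked; the secret value is never read here.
    if admin_password_env.trim().is_empty() {
        return Err("admin_password_env must be non-blank".into());
    }
    Ok(())
}
```

Because nothing is bound or loaded, a check of this shape is safe to run in CI against a production TOML copy.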

Secrets

TOML names secret environment variables. Secret values belong in a private .env file or production secret manager, not in TOML.

SIPP_GATEWAY_ADMIN_PASSWORD=replace-me
SIPP_GATEWAY_TOKEN=replace-me
OPENAI_API_KEY=replace-me
ANTHROPIC_API_KEY=replace-me

serve rejects missing or blank secret env values at startup. Bearer token values must also contain no whitespace.
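The startup rule can be sketched as a single validation function. The name `validate_secret` and its parameters are hypothetical; `value` stands for whatever the process environment returned for the variable named in TOML.

```rust
/// Hypothetical startup check for one secret environment variable.
pub fn validate_secret(name: &str, value: Option<&str>, is_bearer_token: bool) -> Result<(), String> {
    // A variable that is not set at all is rejected.
    let value = value.ok_or_else(|| format!("{name} is not set"))?;
    // A value made only of whitespace counts as blank.
    if value.trim().is_empty() {
        return Err(format!("{name} is blank"));
    }
    // Bearer tokens additionally must not contain any whitespace.
    if is_bearer_token && value.chars().any(char::is_whitespace) {
        return Err(format!("{name} must not contain whitespace"));
    }
    Ok(())
}
```

In this sketch a provider key with an interior space passes, while the same value used as a bearer token fails.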

Routes

query, chat, and embed are required public routes. The other routes are management routes:

  • index: optional management index JSON route.
  • health: optional liveness route returning ok.
  • readiness: optional readiness route returning ready.
  • metrics: optional Prometheus text route.
  • admin: optional Admin Dashboard route. Session JSON endpoints live under <admin>/api/session.

Routes must be absolute paths and must not contain query strings or fragments. Public routes cannot duplicate each other. Management routes cannot duplicate each other.
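These route rules are easy to express as a per-listener check. The function below is a hypothetical sketch, called once with the public routes and once with the management routes, since duplicates are only forbidden within a listener.

```rust
use std::collections::HashSet;

/// Hypothetical validator for the routes of one listener.
pub fn validate_routes(routes: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for route in routes {
        // Routes must be absolute paths.
        if !route.starts_with('/') {
            return Err(format!("{route}: must be an absolute path"));
        }
        // Query strings and fragments are not allowed.
        if route.contains('?') || route.contains('#') {
            return Err(format!("{route}: must not contain a query string or fragment"));
        }
        // `insert` returns false when the route was already present.
        if !seen.insert(*route) {
            return Err(format!("{route}: duplicate route on this listener"));
        }
    }
    Ok(())
}
```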

Tokens

Each [[tokens]] block maps one bearer-token environment variable to a caller label and a target allowlist:

[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "browser-client"
targets = ["local", "openai-chat"]

  • env names the environment variable containing the bearer token value.
  • caller is a stable label used in request metadata and diagnostics.
  • targets lists allowed [[targets]].name values. An empty list grants all configured targets.

Token values must be non-empty and contain no whitespace. They are read only when serve starts.
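The allowlist semantics, including the empty-list case, fit in a few lines. `TokenPolicy` is a hypothetical in-memory view of one `[[tokens]]` block and is not the gateway's own type.

```rust
/// Hypothetical view of one `[[tokens]]` block after `serve` has started.
pub struct TokenPolicy {
    pub caller: String,
    pub targets: Vec<String>,
}

impl TokenPolicy {
    /// Returns true when this token may use `target`.
    pub fn allows(&self, target: &str) -> bool {
        // An empty allowlist grants every configured target.
        self.targets.is_empty() || self.targets.iter().any(|t| t == target)
    }
}
```

A `false` result corresponds to the 403 case listed under Common HTTP Failures: the token is valid but not allowed to use the requested target.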

In-Memory Security Controls

Gateway security controls are process-local in the current version. Admin Dashboard sessions, CSRF tokens, rolling dashboard history, per-client rate-limit buckets, manual blocklist entries, and runtime control overrides disappear when the server restarts. The gateway does not write TOML, create a state file, or use an external cache or database for these controls.

The checked-in examples use the TCP peer address for client IP extraction:

[security.client_ip]
source = "peer"
trusted_proxy_cidrs = []

source can be peer, x_forwarded_for, or x_real_ip. Forwarded headers are ignored unless trusted_proxy_cidrs contains the proxy CIDR that is allowed to supply them. Keep source = "peer" unless the gateway sits behind a trusted reverse proxy that preserves the real client address.
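The trust rule can be sketched as below. This is an illustration under stated assumptions, not the gateway's code: it handles IPv4 only, takes the header value as an already-parsed address, and assumes that an ignored header means falling back to the TCP peer. `Cidr` and `resolve_client_ip` are hypothetical names.

```rust
use std::net::Ipv4Addr;

/// IPv4 CIDR block such as 10.0.0.0/8. IPv6 is omitted to keep the sketch short.
pub struct Cidr {
    pub network: Ipv4Addr,
    pub prefix: u8,
}

impl Cidr {
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        let bits = u32::from(self.prefix.min(32));
        // A /0 prefix matches everything; shifting a u32 by 32 would overflow.
        let mask = if bits == 0 { 0 } else { u32::MAX << (32 - bits) };
        (u32::from(ip) & mask) == (u32::from(self.network) & mask)
    }
}

/// Resolve the client address. `forwarded` is the address taken from the
/// configured header (x_forwarded_for or x_real_ip), if one was sent.
pub fn resolve_client_ip(
    source_is_peer: bool,
    peer: Ipv4Addr,
    forwarded: Option<Ipv4Addr>,
    trusted: &[Cidr],
) -> Ipv4Addr {
    if source_is_peer {
        return peer;
    }
    // Honor the header only when the direct peer is a trusted proxy.
    match forwarded {
        Some(ip) if trusted.iter().any(|c| c.contains(peer)) => ip,
        _ => peer,
    }
}
```

The check is on the peer, not on the forwarded value: any client can send a forwarded header, so only the proxy's own address decides whether it is believed.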

Per-client rate limiting is configured explicitly:

[security.rate_limit]
enabled = false
requests_per_minute = 60
burst = 60

When enabled, the limiter uses an in-memory token bucket keyed by the resolved client IP. requests_per_minute controls refill rate. burst controls bucket capacity.
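A generic token bucket with those two parameters looks roughly like this. It is a sketch of the technique, not the gateway's limiter: the `TokenBucket` type is hypothetical, and elapsed time is passed in explicitly so the behavior is deterministic.

```rust
/// Token bucket where `burst` is the capacity and `requests_per_minute`
/// sets the refill rate. Buckets start full.
pub struct TokenBucket {
    capacity: f64,
    refill_per_second: f64,
    tokens: f64,
}

impl TokenBucket {
    pub fn new(requests_per_minute: u32, burst: u32) -> Self {
        let capacity = f64::from(burst);
        Self {
            capacity,
            refill_per_second: f64::from(requests_per_minute) / 60.0,
            tokens: capacity,
        }
    }

    /// Refill for `elapsed_seconds` since the previous call, then try to
    /// take one token. Returns false when the request should be limited.
    pub fn try_acquire(&mut self, elapsed_seconds: f64) -> bool {
        // Refill never exceeds the bucket capacity.
        self.tokens = (self.tokens + elapsed_seconds * self.refill_per_second).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A per-client limiter would keep one such bucket per resolved client IP, for example in a map keyed by address. With `requests_per_minute = 60` and `burst = 60`, a client can send 60 requests at once and then about one per second.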

Targets

Each [[targets]] block publishes one model or provider endpoint under a stable target name.

Local GGUF

[[targets]]
name = "local"
type = "local"
model = ".build/models/qwen2.5-0.5b-instruct-q4_0.gguf"
backend = "auto"
stats = "basic"

  • model is the GGUF path seen by the process. Relative paths resolve from the process working directory.
  • backend can be auto, cpu, cuda, metal, or vulkan.
  • stats can be off, basic, or profile.
  • runtime can contain advanced native runtime settings from the shared runtime options schema.

For on-board inference, prefer backend = "auto" or an explicit GPU backend. backend = "auto" selects the best compiled and available backend in this order: CUDA, Metal, Vulkan, then CPU. Explicit cpu disables GPU offload and is intended only for diagnostics. Explicit GPU backends fail if that backend was not compiled or is unavailable.

stats = "off" disables runtime metrics and backend profiling. stats = "basic" enables runtime metrics. stats = "profile" enables runtime metrics and backend profiling.

OpenAI

[[targets]]
name = "openai-chat"
type = "openai"
model = "provider-model"
api_key_env = "OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"
timeout_seconds = 60

base_url and timeout_seconds are optional. The API key is read from api_key_env when serve starts.

OpenAI-Compatible

[[targets]]
name = "compatible-chat"
type = "openai_compatible"
model = "served-model"
base_url = "https://provider.example/v1"
token_env = "PROVIDER_TOKEN"
correlation_header = "x-request-id"
timeout_seconds = 60

base_url and token_env are required. correlation_header and timeout_seconds are optional.

Anthropic

[[targets]]
name = "anthropic-chat"
type = "anthropic"
model = "provider-model"
api_key_env = "ANTHROPIC_API_KEY"
version = "2023-06-01"
timeout_seconds = 60

base_url, version, and timeout_seconds are optional. The API key is read from api_key_env when serve starts.

Bind Behavior

Source/exe mode binds public_bind and management_bind directly on the host. Docker mode binds those addresses inside the container; Compose ports decide host exposure.

For Docker:

  • The gateway process should listen on container interfaces such as 0.0.0.0:8080 and 0.0.0.0:9090.
  • Local testing keeps both host ports on 127.0.0.1 through Compose port bindings.
  • Production exposes public traffic through the configured host port and keeps management on 127.0.0.1 by default.
  • Local model paths should match the container mount point in the Compose volume configuration.
  • Provider-only Docker configs do not need a model mount because no local GGUF target is loaded.

Admin Dashboard

The dashboard is served only on the management listener. It uses the value of admin_password_env for login, stores short-lived HTTP-only sessions, and does not render the password, bearer tokens, or provider secrets.

The dashboard serves a React single-page application from the gateway distribution’s admin-ui asset directory and exposes session-protected JSON endpoints under <admin>/api/*. Login uses POST <admin>/api/session, logout uses DELETE <admin>/api/session, and mutating admin API calls require the session CSRF token in the x-sipp-admin-csrf header. Runtime edits made from the dashboard affect only the running process and reset on restart.

Gateway Testing

Use this page when testing the first-party gateway with curl, Postman, or any other raw HTTP client. The examples assume the default routes from apps/gateway-server/config/*.toml.

Environment

Bash:

export GATEWAY_URL="http://127.0.0.1:8080"
export GATEWAY_MANAGEMENT_URL="http://127.0.0.1:9090"
export SIPP_GATEWAY_TOKEN="replace-me"
export SIPP_GATEWAY_TARGET="local"

PowerShell:

$env:GATEWAY_URL = "http://127.0.0.1:8080"
$env:GATEWAY_MANAGEMENT_URL = "http://127.0.0.1:9090"
$env:SIPP_GATEWAY_TOKEN = "replace-me"
$env:SIPP_GATEWAY_TARGET = "local"

Management Probes

Health, readiness, and metrics probes do not require bearer authentication:

curl --fail --silent "$GATEWAY_MANAGEMENT_URL/healthz"
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/readyz"
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/metrics"

The Admin Dashboard is available at:

http://127.0.0.1:9090/admin

Log in with the value of the env var named by admin_password_env in TOML.

Query

curl -sS "$GATEWAY_URL/v1/query" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -H "x-request-id: curl-query-1" \
  -d '{
    "model": "'"$SIPP_GATEWAY_TARGET"'",
    "prompt": "Explain gateway inference in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2
  }'

Finite text responses use JSON:

{
  "id": "response",
  "model": "local",
  "text": "A gateway centralizes inference behind an HTTP boundary.",
  "finish_reason": "stop"
}

When usage is available, the response also includes:

{
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20
  }
}

Chat

curl -sS "$GATEWAY_URL/v1/chat" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$SIPP_GATEWAY_TARGET"'",
    "messages": [
      { "role": "system", "content": "Answer briefly." },
      { "role": "user", "content": "What does the gateway own?" }
    ],
    "max_tokens": 64
  }'

Chat uses the same finite text response shape as query. Valid message roles are system, user, and assistant.

Embeddings

curl -sS "$GATEWAY_URL/v1/embed" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$SIPP_GATEWAY_TARGET"'",
    "input": "gateway inference"
  }'

Embedding responses use JSON:

{
  "id": "response",
  "model": "local",
  "embedding": [0.0123, -0.0456]
}

Embedding requires a target that supports embeddings. Text-only local models or provider targets can return an execution error for /v1/embed.

Streaming

Query and chat support server-sent events when the request contains "stream": true:

curl -N -sS "$GATEWAY_URL/v1/query" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$SIPP_GATEWAY_TARGET"'",
    "prompt": "Write one short sentence about gateways.",
    "max_tokens": 64,
    "stream": true
  }'

The stream content type is text/event-stream. Events are SSE frames, each terminated by a blank line:

event: token
data: {"text":"Gateways","sequence":0}

event: usage
data: {"input_tokens":8,"output_tokens":9,"total_tokens":17}

event: done
data: {"finish_reason":"stop"}

If an error happens after streaming has started, the stream emits:

event: error
data: {"error":{"code":"execution","message":"..."}}
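For clients that read the stream by hand, the frames above can be split with a small parser. This is a minimal sketch that handles only the `event:` and `data:` fields shown here and assumes one `data:` line per frame; `parse_sse` is a hypothetical helper, and the JSON payload is returned as raw text.

```rust
/// Split an SSE body into (event, data) pairs.
pub fn parse_sse(body: &str) -> Vec<(String, String)> {
    let mut frames = Vec::new();
    let (mut event, mut data) = (String::new(), String::new());
    // The trailing "" flushes a final frame that lacks a closing blank line.
    for line in body.lines().chain(std::iter::once("")) {
        if let Some(rest) = line.strip_prefix("event:") {
            event = rest.trim().to_string();
        } else if let Some(rest) = line.strip_prefix("data:") {
            data = rest.trim().to_string();
        } else if line.is_empty() && !(event.is_empty() && data.is_empty()) {
            // A blank line terminates the current frame.
            frames.push((std::mem::take(&mut event), std::mem::take(&mut data)));
        }
    }
    frames
}
```

A consumer would typically append `token` payloads to the output, record `usage`, and stop on either `done` or `error`.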

Postman

Create a Postman environment with these variables:

| Variable | Example |
| --- | --- |
| gateway_url | http://127.0.0.1:8080 |
| management_url | http://127.0.0.1:9090 |
| gateway_token | replace-me |
| gateway_target | local |

For public routes:

  • Method: POST.
  • Authorization: Bearer Token with {{gateway_token}}.
  • Header: Content-Type: application/json.
  • Body: raw JSON.
  • Query URL: {{gateway_url}}/v1/query.
  • Chat URL: {{gateway_url}}/v1/chat.
  • Embed URL: {{gateway_url}}/v1/embed.

For management probes:

  • Method: GET.
  • URLs: {{management_url}}/healthz, {{management_url}}/readyz, and {{management_url}}/metrics.
  • No bearer token is required.

Postman can display finite JSON responses directly. For streaming requests, use a client that preserves SSE frames, such as curl -N, when debugging token timing and terminal events.

Common HTTP Failures

| Status | Common cause |
| --- | --- |
| 400 | Invalid JSON, invalid route body, or unsupported request field value. |
| 401 | Missing bearer token or malformed Authorization header. |
| 403 | Bearer token is valid but not allowed to use the requested target. |
| 404 | Requested model target is not configured. |
| 413 | Request body exceeds max_request_bytes. |
| 429 | max_concurrent_requests admission limit is full. |
| 500 | Target load or execution failure. Check gateway logs and target config. |

Non-streaming errors use JSON:

{
  "error": {
    "code": "authorization",
    "message": "token is not allowed to access target"
  }
}

Gateway Operations

The first-party gateway has one public listener and one management listener. Keep those operational surfaces separate in deployment.

Public Listener

The public listener serves inference routes:

  • /v1/query
  • /v1/chat
  • /v1/embed

Every public request must include a bearer token accepted by the configured [[tokens]] policy. The request model field is the public target name. The gateway resolves that target to a local model or provider endpoint.

Put TLS, external authentication, rate limiting, and network ingress in front of the public listener when exposing it beyond a trusted network.

Management Listener

The management listener can serve:

  • /: optional index JSON route.
  • /healthz: liveness route returning ok.
  • /readyz: readiness route returning ready.
  • /metrics: Prometheus text metrics route.
  • /admin: password-protected Admin Dashboard.

Keep the management listener private. In Docker production, the Compose file binds the management host port to 127.0.0.1 by default.

Admin Dashboard

The Admin Dashboard uses the value of the env var named by admin_password_env in TOML for login. It stores short-lived HTTP-only sessions and does not render the password, bearer tokens, or provider secrets.

Use the dashboard to inspect configured routes, targets, selected local backends, and current request metrics. Do not expose it directly to the public internet.

Metrics

The metrics route renders low-cardinality Prometheus text. Current gateway metrics include request and error counters by operation, for example:

sipp_gateway_requests_total{operation="query"} 3
sipp_gateway_errors_total{operation="chat"} 1
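The lines above follow the Prometheus text exposition format: a metric name, a label set in braces, and a value. As an illustration of that shape only, a hypothetical `render_counter` helper could produce them like this.

```rust
/// Render per-operation counters as Prometheus text lines.
pub fn render_counter(name: &str, by_operation: &[(&str, u64)]) -> String {
    let mut out = String::new();
    for (operation, value) in by_operation {
        // `{{` and `}}` are escaped braces in Rust format strings.
        out.push_str(&format!("{name}{{operation=\"{operation}\"}} {value}\n"));
    }
    out
}
```

Keeping `operation` as the only label is what keeps cardinality low: the label takes a handful of fixed values rather than one per caller or request.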

Target-level local runtime metrics depend on the target stats setting:

  • off: disable runtime metrics and backend profiling.
  • basic: enable runtime metrics.
  • profile: enable runtime metrics and backend profiling.

Logging

The gateway uses tracing JSON logs. Set RUST_LOG in the process environment to control verbosity:

RUST_LOG=info
RUST_LOG=debug,sipp_gateway_server=trace

Do not log bearer token values, provider credentials, or production TOML contents.

CORS

allowed_origins controls browser access to the public listener. An empty array disables the CORS layer. Add only trusted browser origins:

allowed_origins = ["https://app.example.com"]

Browser clients should use short-lived gateway tokens supplied at runtime, not long-lived tokens embedded in bundles.
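As a rough model of the allowlist, the sketch below decides which value, if any, would be echoed in `Access-Control-Allow-Origin`. Exact string matching and the function name `allow_origin_header` are assumptions for illustration; the gateway's actual CORS layer may differ in detail.

```rust
/// Value for `Access-Control-Allow-Origin`, or `None` when no header is sent.
pub fn allow_origin_header<'a>(allowed_origins: &[&str], origin: &'a str) -> Option<&'a str> {
    // An empty allowlist never matches, which models the disabled CORS layer.
    allowed_origins
        .iter()
        .any(|allowed| *allowed == origin)
        .then_some(origin)
}
```

Note that scheme and port are part of the origin, so `http://app.example.com` does not match an allowlist entry of `https://app.example.com`.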

Secrets

The gateway uses two types of secrets:

  • admin_password_env: TOML field naming the dashboard password env var.
  • Token/provider env vars: names are configured in TOML; values are read from the process environment when serve starts.

Keep secrets env files private and outside source control. Use deployment secret stores where available.

Gateway Toolkit

sipp-gateway is a route-free Rust HTTP toolkit for applications that want to expose Sipp inference through their own server framework.

The toolkit provides codecs, authentication and observability traits, HTTP error helpers, and the first-party JSON/SSE profile. Applications bind sockets, register routes, load configuration, and define deployment policy.

Use Gateway Server when you want the first-party server application with TOML, bearer tokens, target policy, metrics, probes, and listener management.

Distribution

The toolkit crate target is sipp-gateway. crates.io publishing covers the sipp-rs and sipp-sys crates; the toolkit is intentionally source-distributed. Use Source Builds when consuming the toolkit from this checkout.

Use It For

  • Building application-owned HTTP gateway routes.
  • Translating request bodies into typed Sipp requests.
  • Encoding JSON and SSE responses.
  • Sharing the first-party protocol profile with Sipp clients.

Minimal Handler Shape

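use sipp_gateway::{GatewayCodec, ProtocolCodec};

// Body of an application-owned async route handler. `body`, `client`, and
// `resolve` are supplied by the surrounding application.
let codec = GatewayCodec;
// Decode the raw request body into a typed query plus its public target name.
let mut decoded = codec.decode_query(&body)?;
// Map the public target name to an endpoint reference the client knows.
decoded.request.endpoint = Some(resolve(&decoded.target)?);
let response = client.query(decoded.request).await?;
// Encode the finite text response for the wire.
let bytes = codec.encode_text(&decoded.target, &response)?;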

Custom gateway applications own sockets, route layout, authentication, configuration files, target policy, CORS, logging, and deployment defaults. Node route handlers can use the matching gateway profile helpers exported by @sipp/sipp-server when implementing the same first-party profile in framework routes.

Boundaries

lib/gateway supplies helpers, not an application:

  • It does not register routes.
  • It does not bind listeners.
  • It does not own bearer-token policy.
  • It does not own TOML, CORS, metrics, or deployment behavior.

Default /v1/query, /v1/chat, and /v1/embed paths belong only to applications that choose them.

Gateway Architecture

Gateway behavior is split into independent layers. The deleted gateway route-autowiring and remote endpoint APIs have no compatibility layer.

Core Execution

sipp::gateway_core (the gateway_core module of the sipp crate, behind the gateway feature) exposes only typed query, chat, and embed execution:

  • GatewayRequestContext and cancellation.
  • TargetResolver, Authorizer, AdmissionController, and GatewayExecutor.
  • GatewayPipeline ordering and admission-permit lifetime.
  • Protocol-neutral finite results and streaming events.

It does not depend on HTTP, Axum routes, JSON, SSE, bearer tokens, status codes, aliases, TOML, or fixed limits.

The sipp client API owns local, provider, and gateway endpoint registration through SippClient.add(...). Gateway endpoints call an HTTP gateway as a client transport and are never selected implicitly.

Developer Toolkit

lib/gateway contains route-free HTTP helpers for applications that choose to expose a gateway:

  • ProtocolCodec for request, response, stream, and error wire formats.
  • Authenticator for arbitrary authentication.
  • ErrorTranslator for application HTTP error mapping.
  • GatewayCodec for the first-party Sipp JSON/SSE profile.
  • GatewayHttpError and SSE/error response encoders.

It does not register routes, expose a router, or own handler paths. Applications decode requests, select targets, call client.query(), client.chat(), or client.embed() directly, and encode responses explicitly.

Public Endpoints

Rust, Node, Python, and browser packages expose gateway endpoint descriptors through the same .add path used for local and provider endpoints:

  • A protocol target.
  • A gateway base URL.
  • Query, chat, and embed routes.
  • Authentication strategy.
  • Static headers.
  • Timeout policy.
  • Protocol-specific request options.

The endpoint id is supplied only to .add. Local model, provider, and gateway descriptors are different descriptor kinds, but query, chat, and embed request shapes are identical once an endpoint ref is selected.

First-Party Applications

apps/gateway-server is one opinionated first-party application. Its bearer tokens, target access, concurrency limit, CORS, routes, management listener, metrics, and TOML format are application-owned.

examples/gateway demonstrates the canonical developer pattern:

  • Create a SippClient.
  • Add local, provider, or gateway endpoints with .add.
  • Define Axum routes in the example application.
  • Decode each route body, select an endpoint, call client.*, and encode the response.

Default /v1/query, /v1/chat, and /v1/embed paths belong only to applications that choose them. The library supplies codecs and endpoint transports, not route ownership.

Gateway Troubleshooting

Use this page when the first-party gateway starts, serves, or responds differently than expected.

check Succeeds But serve Fails

check parses and validates TOML only. It does not read token environment variables, load model files, contact providers, or bind ports.

If serve fails after check succeeds, verify:

  • Bearer token env vars named by [[tokens]].env are present and non-empty.
  • The env var named by admin_password_env is present and non-empty.
  • Provider secret env vars such as OPENAI_API_KEY are present for provider targets.
  • Local GGUF paths exist from the process point of view.
  • public_bind and management_bind are available and not already in use.
  • Requested GPU backends were compiled and are available on the host.
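
Because check does not read the environment, a small preflight script can catch missing secrets before serve runs. The sketch below is one way to do that; missingEnvVars is a hypothetical helper, and the variable names passed to it are whatever the TOML file references.

```typescript
// Hypothetical preflight helper; the names are illustrative and not part of the gateway.
type EnvMap = Record<string, string | undefined>;

// Returns the names of required variables that are unset or empty.
function missingEnvVars(requiredNames: readonly string[], env: EnvMap): string[] {
  return requiredNames.filter((envName) => {
    const value = env[envName];
    return value === undefined || value === '';
  });
}
```

In Node.js, pass process.env as the second argument and collect the first argument from the [[tokens]].env values, admin_password_env, and any provider key fields in the selected TOML file.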

Missing DLL Or Shared Library

When you run the executable directly, the .build/artifacts/gateway-server directory must be on the dynamic loader path. The staged executable depends on runtime libraries and GGML backend plugins in that same directory.

  • Windows: prepend the artifact directory to PATH.
  • Linux: prepend the artifact directory to LD_LIBRARY_PATH.
  • macOS: prepend the artifact directory to DYLD_LIBRARY_PATH.

The sipp run gateway-server ... workflow handles this automatically.
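
For launcher scripts that start the executable directly, the per-platform rule above can be sketched as follows. loaderPathEntry is a hypothetical helper; the platform strings follow Node.js process.platform values, and the separators are the usual ';' on Windows and ':' elsewhere.

```typescript
// Sketch only: computes the loader-path variable for a direct executable run.
type LoaderPlatform = 'win32' | 'linux' | 'darwin';

function loaderPathEntry(
  platform: LoaderPlatform,
  artifactDir: string,
  env: Record<string, string | undefined>,
): { variable: string; value: string } {
  const variable =
    platform === 'win32' ? 'PATH' : platform === 'linux' ? 'LD_LIBRARY_PATH' : 'DYLD_LIBRARY_PATH';
  // Windows separates path entries with ';'; Linux and macOS use ':'.
  const separator = platform === 'win32' ? ';' : ':';
  const existing = env[variable];
  // Prepend so the staged libraries are found before any system-wide copies.
  const value = existing ? `${artifactDir}${separator}${existing}` : artifactDir;
  return { variable, value };
}
```

The returned pair can be merged into the child process environment before spawning the gateway executable.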

Relative Model Path Is Wrong

Relative local target model paths resolve from the process working directory. sipp run gateway-server ... runs from the workspace root. Direct executable commands run wherever the shell is currently located.

Use absolute model paths when starting the executable from another directory. For Docker, use the container path, not the host path.

Docker Port Is Published But Host Cannot Connect

In Docker mode, public_bind and management_bind are addresses inside the container. Use container listener values such as:

public_bind = "0.0.0.0:8080"
management_bind = "0.0.0.0:9090"

Then use Compose ports to control host exposure. The local Compose templates map both host ports to 127.0.0.1 for workstation-only access.

401 Unauthorized

The public route did not receive a valid bearer token. Check:

  • Header is Authorization: Bearer <token>.
  • Token value matches the environment variable named by a [[tokens]] block.
  • Token contains no whitespace.
  • The gateway process was restarted after changing the token environment.

403 Forbidden

The bearer token is valid, but its targets allowlist does not include the request model target. Add the target name to the relevant [[tokens]] block or use a token that grants that target.

404 Target Not Found

The request model value does not match any configured [[targets]].name. The model field in public HTTP requests is a public gateway target name, not necessarily the provider model or GGUF file name.
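
The 401, 403, and 404 cases above can be summarized as a decision function. This is an illustrative model for debugging, not the gateway's code: classifyRequest and TokenGrant are hypothetical names, and the order of the 404 and 403 checks is an assumption.

```typescript
// Illustrative decision table for the three statuses above; not the gateway's code.
// Assumption: authentication is checked first, then target existence, then the allowlist.
interface TokenGrant {
  value: string; // bearer value read from the environment
  targets: readonly string[]; // target names this token may call
}

function classifyRequest(
  authorizationHeader: string | undefined,
  model: string,
  grants: readonly TokenGrant[],
  targetNames: readonly string[],
): 200 | 401 | 403 | 404 {
  // The header must be exactly "Bearer <token>" and the token must not contain whitespace.
  const match = /^Bearer (\S+)$/.exec(authorizationHeader ?? '');
  if (match === null) return 401;
  // A production server would compare token values in constant time.
  const grant = grants.find((candidate) => candidate.value === match[1]);
  if (grant === undefined) return 401;
  if (!targetNames.includes(model)) return 404;
  if (!grant.targets.includes(model)) return 403;
  return 200;
}
```

Walking a failing request through these checks in order usually identifies which configuration block needs attention: the token environment, the [[targets]] names, or the token's targets allowlist.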

CORS Failure In Browser

Browser requests require the public listener to allow the page origin. Add the exact origin to allowed_origins:

allowed_origins = ["http://localhost:5173"]

An empty allowed_origins array disables the CORS layer.

GPU Backend Fails

Explicit local target backends fail when the backend was not compiled or is not available at runtime. Use backend = "auto" to let the gateway pick the best compiled and available backend, or select a GPU backend that was included in the build. Explicit cpu disables GPU offload and is useful only for diagnosing local-inference setup issues.

Docker GPU builds also require host runtime support:

  • CUDA requires NVIDIA host drivers and container runtime support.
  • Vulkan requires GPU device access, Vulkan loader, and driver support.
  • Metal is macOS-only and not available from Linux Docker.

If Docker logs show ggml_vulkan: No devices found, the container has loaded the Vulkan backend but cannot enumerate a usable Vulkan physical device. On Windows Docker Desktop with NVIDIA GPUs, use the cuda profile instead.

Admin Dashboard Login Fails

The dashboard password is read from the env var named by admin_password_env in the selected TOML file. Confirm the secrets env file or secret manager has that value, and confirm the gateway is using the intended TOML through --config.

The dashboard is served on the management listener only.

Guides

Guides explain cross-package behavior and workflow choices that belong outside individual README files.

Local Inference

Local inference runs a GGUF model inside the current browser, Node.js, Python, Rust, or CLI process. The application owns model selection, runtime lifecycle, resource cleanup, and the request options that should be exposed to users.

Register a local endpoint with SippClient.add, keep the returned endpoint reference, and pass that reference to query, chat, or embed.

Endpoint Flow

  1. Choose a GGUF model that supports the requested capability.
  2. Register the model with a local descriptor.
  3. Set load-time runtime options on the endpoint descriptor.
  4. Pass request-time generation options to query, chat, or embed.
  5. Stream tokens or await the final response.
  6. Close the client when the page, worker, service, or script no longer needs the runtime.

Local endpoints do not route implicitly. A client can register multiple endpoints, but every request that should use a specific destination should pass the endpoint reference returned by add.

Model Sources

Browser local endpoints can load:

  • A model URL served by the application.
  • A user-selected File.
  • Multiple shard URLs or files.
  • An installed model id returned by browser model-management APIs.
  • A model plus projector pair for vision-capable models.

Node.js, Python, Rust, and CLI local endpoints use filesystem paths. Source examples and smoke workflows can use cached sample models under .build/models when running from a checkout.

Runtime And Request Options

Keep option layers separate:

  • Browser client options such as executionMode, wasmThreading, runtime asset URLs, and browserCache belong on new SippClient(...).
  • Local endpoint load options choose the model source, browser backend preference, progress callbacks, and NativeRuntimeConfig.
  • Runtime config groups such as context, sampling, scheduler, cache, placement, multimodal, residency, and observability describe stable local endpoint behavior.
  • Request options such as maxTokens, temperature, topP, stop, cancellation, and emitTokens belong on query, chat, or embed.
  • Local-only request options such as context keys, grammars, media inputs, and embedding normalization should not be sent to gateway or provider endpoints.

See Runtime Options for the canonical option map and field groups.

Threads And Browser Execution

Browser execution has two separate choices:

  • executionMode: 'worker' or 'auto' keeps inference work off the UI thread when workers are available.
  • wasmThreading: 'pthread' enables the pthread WASM runtime and requires SharedArrayBuffer plus cross-origin isolation headers.

Use wasmThreading: 'single-thread' when the app cannot serve COOP/COEP headers. Use executionMode: 'main-thread' mainly for debugging or constrained hosts.
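
An application can choose the threading mode at startup by probing the page. The sketch below is a minimal version of that decision; pickWasmThreading and IsolationProbe are hypothetical names, and only the two wasmThreading values come from this page.

```typescript
type WasmThreading = 'pthread' | 'single-thread';

interface IsolationProbe {
  crossOriginIsolated: boolean; // true when COOP/COEP headers are in effect
  hasSharedArrayBuffer: boolean; // true when SharedArrayBuffer is defined
}

// The pthread runtime needs both conditions; otherwise fall back to single-thread.
function pickWasmThreading(probe: IsolationProbe): WasmThreading {
  return probe.crossOriginIsolated && probe.hasSharedArrayBuffer ? 'pthread' : 'single-thread';
}
```

In a browser, the probe can be filled from globalThis.crossOriginIsolated === true and typeof SharedArrayBuffer !== 'undefined', and the result passed as wasmThreading when constructing the client.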

Native Node.js, Python, and Rust local endpoints can tune CPU thread counts with context.n_threads and context.n_threads_batch. Leave them unset for runtime defaults unless the application has measured a better value.

Text, Embeddings, And Vision

  • Query and chat require text generation support.
  • Embed requires a model/runtime that reports embedding support.
  • Vision chat requires a text/vision model plus projector data where the model family requires it.
  • Streaming text requires emitTokens: true, and the returned token iterable must be consumed before or alongside the final response.
  • GBNF grammars and media inputs are local-only request features.
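
The streaming rule above, consuming tokens before or alongside the final response, can be sketched with a stand-in run object. The tokens property name is an assumption made for this sketch; response resolving to an object with text matches the examples in this book. collectRun and fakeRun are hypothetical helpers.

```typescript
// Stand-in for the object returned by query or chat.
// Assumption: the token stream is exposed as an async iterable named `tokens`.
interface TextRun {
  tokens: AsyncIterable<string>;
  response: Promise<{ text: string }>;
}

// Drains the token stream while the final response is pending, then returns both.
async function collectRun(
  run: TextRun,
  onToken: (token: string) => void,
): Promise<{ streamed: string; final: string }> {
  let streamed = '';
  const drain = (async () => {
    for await (const token of run.tokens) {
      streamed += token;
      onToken(token);
    }
  })();
  // Await both so an unconsumed stream never stalls the final response.
  const [, response] = await Promise.all([drain, run.response]);
  return { streamed, final: response.text };
}

// Fake run used only to exercise the helper without a model.
function fakeRun(parts: string[]): TextRun {
  async function* emit(): AsyncGenerator<string> {
    for (const part of parts) yield part;
  }
  return { tokens: emit(), response: Promise.resolve({ text: parts.join('') }) };
}
```

Starting the drain before awaiting the response is the design point: the UI receives tokens as they arrive, and the final text is still available for logging or storage.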

Backend Matrix

Sipp local inference is built on llama.cpp and ggml. Sipp owns the client APIs, endpoint model, scheduling, package bindings, browser lifecycle, and gateway integration; llama.cpp and ggml provide the GGUF runtime and backend kernels.

Backend support therefore has two layers:

  • Sipp support: which backend names each package can select and how the backend is built or chosen.
  • ggml support: which tensor operations each ggml backend implements.

For the ggml operation-level matrix, use the upstream llama.cpp GGML operations table. That table is generated from llama.cpp backend probes and is the source of truth for per-operation support.

Sipp Backend Names

| Backend | Device class | Where Sipp exposes it | Notes |
| --- | --- | --- | --- |
| cpu | Host CPU | Browser, Node.js, Python, Rust/source, CLI, gateway server | Portable default. Native builds use ggml CPU; browser builds use WASM CPU with the browser runtime. |
| webgpu | Browser GPU through WebGPU | Browser package | Browser-only. Selected with browser local endpoint options.backend; requires a WebGPU-capable browser and adapter. |
| cuda | NVIDIA GPU | Native source builds, Node.js, Python, CLI, gateway server | Requires a local CUDA Toolkit and compatible NVIDIA driver. xtask reports CUDA readiness but does not install CUDA. |
| metal | Apple GPU through Metal | Native source builds, Node.js, Python, CLI, gateway server on macOS | macOS-only native backend. Best for Apple Silicon and tested AMD Macs; use CPU on Intel integrated GPUs. |
| vulkan | GPU through Vulkan | Native source builds, Node.js, Python, CLI, gateway server | Requires a Vulkan-capable system and driver. xtask can bootstrap the Vulkan SDK for builds. macOS Vulkan is source-build only and runs through a Metal translation layer. |

Upstream llama.cpp/ggml supports more backend families than Sipp currently exposes as package/runtime selectors, including BLAS, CANN, OpenCL, SYCL, ZenDNN, and zDNN. Those appear in the upstream operation matrix but are not first-party Sipp backend names at this time.

Package And Runtime Selection

| Surface | Supported backend selectors | How to select |
| --- | --- | --- |
| Browser local | auto, cpu, webgpu | client.add(..., { kind: 'local', options: { backend: 'webgpu' } }) |
| Node.js local | cpu, vulkan, cuda, metal | SIPP_NODE_BACKEND=cpu |
| Python local | cpu, vulkan, cuda, metal | SIPP_PYTHON_BACKEND=cpu |
| CLI | auto, cpu, cuda, metal, vulkan | sipp ... --backend <backend> |
| Gateway server | auto, cpu, cuda, metal, vulkan | Build or run with sipp ... --backend <backend>; target TOML can set backend = "auto" or a concrete backend. |
| Rust source/client workflows | Compiled native backend set | Build through sipp or cargo xtask; runtime availability follows the linked native artifacts. |

auto is a runtime selection policy. all is a build/test selector used by sipp and cargo xtask; it builds or checks the host-supported backend set for that target and is not a runtime backend name.

Mixing Backends

Keep build artifact selection separate from engine backend selection.

  • A build artifact decides which ggml GPU backends are compiled and loadable in the current process. A CUDA-only artifact does not make Vulkan available, and a Metal-only artifact does not make CUDA or Vulkan available.
  • cpu is the exception in the engine policy. When an engine is explicitly planned for cpu, Sipp disables GPU layers, device placement, GPU K/V offload, op offload, flash attention, and GPU residency leasing for that load.
  • Explicit GPU selections such as cuda, metal, vulkan, and webgpu must be both compiled into the active artifact and available on the host.
  • Node.js and Python choose the native binding at process load with SIPP_NODE_BACKEND or SIPP_PYTHON_BACKEND. Their local model descriptors do not carry a separate per-engine backend field, so use a different process or artifact when you need a different GPU backend.
  • Gateway, CLI, browser, and lower-level Rust lifecycle paths expose backend selectors at the target/load/run layer. They can select only from the backend set available to that artifact and host.

Practical examples:

| Active artifact/process | CPU engine | CUDA engine | Metal engine | Vulkan engine |
| --- | --- | --- | --- | --- |
| CUDA-only native artifact | Yes, where the surface exposes CPU selection | Yes, if the CUDA device is available | No | No |
| Metal-only native artifact | Yes, where the surface exposes CPU selection | No | Yes, on macOS | No |
| Vulkan-only native artifact | Yes, where the surface exposes CPU selection | No | No | Yes, if the Vulkan device is available |
| Multi-backend source build | Yes | Yes, if compiled and available | Yes, if compiled and available | Yes, if compiled and available |
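
The compiled-and-available rule for native backends can also be expressed as a small predicate. This is a reasoning aid, not Sipp's selection code: canSelect, resolveAuto, and BackendState are hypothetical names, the preference order used for auto is an assumption, and webgpu is left out because it is browser-only.

```typescript
type NativeBackend = 'cpu' | 'cuda' | 'metal' | 'vulkan';

interface BackendState {
  compiled: readonly NativeBackend[]; // GPU backends built into the active artifact
  available: readonly NativeBackend[]; // GPU backends with a usable device on this host
}

// Explicit selection: a GPU backend must be both compiled and available.
// Simplification: cpu is treated as always selectable.
function canSelect(backend: NativeBackend, state: BackendState): boolean {
  if (backend === 'cpu') return true;
  return state.compiled.includes(backend) && state.available.includes(backend);
}

// `auto`: first usable GPU backend in an assumed preference order, otherwise cpu.
function resolveAuto(state: BackendState): NativeBackend {
  const preference: NativeBackend[] = ['cuda', 'metal', 'vulkan'];
  return preference.find((backend) => canSelect(backend, state)) ?? 'cpu';
}
```

For a CUDA-only artifact on a host that also has a Vulkan device, canSelect('vulkan', state) is false because Vulkan was never compiled in.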

CLI examples:

# Build a CUDA-capable CLI artifact.
sipp build cli --backend cuda

# Use CUDA when the CUDA device is available.
sipp ./models/model.gguf "Explain this model." --chat --backend cuda

# Force CPU for a run; this disables GPU offload for that engine.
sipp ./models/model.gguf "Explain this model." --chat --backend cpu

# This requires a Vulkan-capable artifact; a CUDA-only artifact is not enough.
sipp ./models/model.gguf "Explain this model." --chat --backend vulkan

Gateway target examples:

# Same gateway process, different local targets.
# Each GPU backend must be compiled into the active gateway artifact.
[[targets]]
name = "local-cuda"
type = "local"
model = "./models/model.gguf"
backend = "cuda"

[[targets]]
name = "local-cpu"
type = "local"
model = "./models/model.gguf"
backend = "cpu"

Browser examples:

// Browser local supports CPU and WebGPU backend selection per local endpoint.
await client.add('local-webgpu', {
  kind: 'local',
  model: './models/model.gguf',
  options: { backend: 'webgpu' },
});

await client.add('local-cpu', {
  kind: 'local',
  model: './models/model.gguf',
  options: { backend: 'cpu' },
});

Node.js and Python examples:

# PowerShell: choose the native binding before starting the process.
$env:SIPP_NODE_BACKEND = "cuda"
node .\examples\node\chat.mjs .\models\model.gguf "Explain this model."

$env:SIPP_NODE_BACKEND = "cpu"
node .\examples\node\chat.mjs .\models\model.gguf "Explain this model."

# Bash: choose the native binding before starting the process.
SIPP_PYTHON_BACKEND=cuda \
  python examples/python/chat.py ./models/model.gguf "Explain this model."

SIPP_PYTHON_BACKEND=cpu \
  python examples/python/chat.py ./models/model.gguf "Explain this model."

Build Matrix

| Build command | Backend argument | Result |
| --- | --- | --- |
| sipp build wasm | none | Browser WASM package with CPU and WebGPU runtime support. |
| sipp build node --backend cpu | cpu, cuda, metal, vulkan, all | Node native binding artifacts for the selected backend set. |
| sipp build python --backend cpu | cpu, cuda, metal, vulkan, all | Python native binding artifacts for the selected backend set. |
| sipp build cli --backend cpu | cpu, cuda, metal, vulkan, all | Local sipp CLI distribution for the selected backend set. |
| sipp build gateway-server --backend cpu | cpu, cuda, metal, vulkan, all | Gateway server distribution for the selected backend set. |
| sipp build all | none | Core, WASM, Python CPU, Node CPU, and CLI CPU targets. |

sipp build all is intentionally conservative. Use an explicit backend build when you need CUDA, Metal, or Vulkan artifacts.

Operation Support

ggml backends do not all implement the same operation set. Common transformer inference paths are covered by the backends Sipp exposes, but support for a specific model family depends on the ggml operations used by that model and the selected backend.

Use these rules when diagnosing backend issues:

  • If a model works on cpu but fails on a GPU backend, check the upstream ggml operations matrix for the missing operation.
  • If a GPU backend lacks an operation, llama.cpp/ggml may fall back for some paths, keep tensors on CPU for that operation, or fail depending on the graph and backend policy.
  • If a package cannot see a backend at runtime, check that the artifact was built or installed for that backend and that the device driver/runtime is visible to the process.
  • Browser webgpu depends on both compiled WebGPU support and browser adapter availability. Use backend: 'cpu' to force the browser CPU path.

For local verification from a source checkout:

sipp doctor --target node --backend vulkan
sipp run llama backend-ops --backend vulkan --mode support
sipp run llama backend-ops --backend cuda --mode perf --op MUL_MAT

The llama backend-ops command builds llama.cpp’s backend operation tool for the selected backend and is useful when investigating operation coverage or performance outside the Sipp client path.

Practical Selection

Use cpu first when validating a model or reproducing correctness issues. Move to a GPU backend after the model, prompt format, and runtime config are known to work.

Use webgpu for browser-local acceleration when the application can require a modern WebGPU browser. Keep a CPU fallback for browsers, drivers, and devices that do not expose a compatible adapter.

Use cuda for NVIDIA-heavy native deployments and metal for Apple Silicon or tested AMD macOS deployments. On Intel Macs with integrated GPUs, use cpu unless the exact model, context size, and device have been tested and Metal is stable and faster than CPU. Use vulkan when you want a cross-vendor native GPU path and have tested the target driver stack. On macOS, prefer Metal over Vulkan unless you are specifically testing LunarG’s Vulkan-over-Metal drivers.

Gateway And Hybrid Inference

Gateway inference lets an application call a separate Sipp gateway over HTTP. Hybrid inference registers local and gateway endpoints in the same client so each request can choose where it runs.

When To Use A Gateway

  • Keep provider credentials out of browser or edge clients.
  • Centralize target access policy and concurrency limits.
  • Serve local models from a controlled machine.
  • Expose a stable HTTP boundary to multiple language clients.

Gateway Deployment Shapes

The first-party gateway can be deployed in three shapes:

  • On-board GPU inference: the gateway loads a local GGUF model and serves it through a GPU backend.
  • Provider-only router: the gateway has no local model and forwards requests to provider targets such as OpenAI, Anthropic, or OpenAI-compatible APIs.
  • Hybrid: the gateway exposes both local GPU targets and provider targets.

Endpoint Model

The client does not route implicitly. Every application registers descriptors and selects an endpoint reference:

  • Local descriptor: a GGUF model loaded by the current runtime.
  • Gateway descriptor: a base URL, target name, routes, and authentication.
  • Provider descriptor: direct provider adapter where the package supports it.

Gateway descriptors send the target as the first-party profile model field. The gateway process resolves that public target name to a local or provider endpoint.
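
The three descriptor kinds can be pictured as a discriminated union. The shapes below are illustrative only: field names are taken from the examples in this book (the local shape follows the Node.js-style modelPath form), and they are not the packages' published type definitions. describeDescriptor is a hypothetical helper.

```typescript
// Illustrative shapes only; not the published Sipp types.
type EndpointDescriptor =
  | { kind: 'local'; modelPath: string }
  | {
      kind: 'gateway';
      target: string;
      baseUrl: string;
      authentication: { kind: 'bearer'; value: string };
    }
  | { kind: 'provider'; provider: string; model: string; apiKey: string };

// Summarizes where a request would run without exposing credentials.
function describeDescriptor(descriptor: EndpointDescriptor): string {
  switch (descriptor.kind) {
    case 'local':
      return `local model at ${descriptor.modelPath}`;
    case 'gateway':
      return `gateway target ${descriptor.target} at ${descriptor.baseUrl}`;
    case 'provider':
      return `provider ${descriptor.provider} model ${descriptor.model}`;
  }
}
```

Switching on kind keeps routing explicit: the application decides which descriptor it registered, and nothing is inferred from the request itself.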

Authentication

Server and script environments use bearer values from environment variables. Browser applications use short-lived tokens supplied at runtime through a provider callback.

Browser Caching

Browser-local inference caches model data in browser storage so repeated loads avoid full network downloads when the runtime supports that path. Sipp browser examples and demos use this path for GGUF model loading.

Responsibilities

The browser package owns runtime integration and cache mechanics. Applications still own:

  • The model URL or file selection UI.
  • Progress display and cancellation behavior.
  • Storage-clearing controls when users need to reclaim space.
  • Fallback behavior when browser storage is unavailable.

Practical Guidance

  • Prefer model URLs that support range requests for large assets.
  • Keep default demo models small enough for first-run onboarding.
  • Treat browser storage as user-controlled and best-effort.
  • Close SippClient instances when a page, worker, or component no longer needs local runtime resources.

Use the browser examples for minimal flows and the playground for runtime diagnostics.

Providers

Sipp can call external providers directly from trusted server-side processes or indirectly through a Sipp gateway. Both paths use the same endpoint model: register a descriptor with SippClient.add, keep the endpoint reference, and pass it to query, chat, or embed.

Provider credentials must stay in trusted code. Do not ship long-lived provider keys in browser bundles.

Direct Provider Endpoints

Use a direct provider endpoint when the current server process owns the credential lifecycle and application policy. This is the recommended framework route pattern for Next.js and TanStack server code.

import { SippClient } from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

const client = new SippClient();
const endpoint = await client.add('provider', {
  kind: 'provider',
  provider: 'openai',
  model: process.env.OPENAI_MODEL ?? 'gpt-5-mini',
  apiKey: requiredEnv('OPENAI_API_KEY'),
});

const run = client.chat({
  endpoint,
  messages: [{ role: 'user', content: 'Explain provider inference.' }],
  options: { maxTokens: 128, temperature: 0.2 },
});
console.log((await run.response).text);

Use OPENAI_API_KEY="<mock-openai-key>" only as a placeholder in docs and examples. Real keys belong in environment variables or a secret manager.

Provider Options

Typed request fields should use Sipp’s request options. Provider-only fields belong in providerOptions:

const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});

providerOptions is for direct provider endpoints. Gateway-specific extensions belong in endpointOptions or descriptor-level protocolOptions, because the gateway implementation owns how those fields are interpreted.

Provider-Backed Gateway Targets

Use the first-party gateway when multiple applications should share target policy, provider credentials, local model hosting, admission control, metrics, or a stable HTTP boundary.

OpenAI target:

[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"

OpenAI-compatible target:

[[targets]]
name = "compatible-chat"
type = "openai_compatible"
model = "provider-model"
base_url = "https://provider.example/v1"
token_env = "COMPATIBLE_API_TOKEN"
correlation_header = "x-request-id"

Anthropic target:

[[targets]]
name = "anthropic-chat"
type = "anthropic"
model = "claude-3-5-sonnet-latest"
api_key_env = "ANTHROPIC_API_KEY"

Gateway clients receive only the public target name, gateway URL, and gateway authentication value. Provider credentials stay in the gateway process.

Browser Applications

Browser applications should usually call an application route or gateway, not a provider directly. If a BYOK browser flow is required, use short-lived provider keys supplied at runtime through the browser provider descriptor and keep the user-facing risks explicit.

Vision

Vision workflows send image data alongside chat messages. They are local-only unless an application gateway deliberately implements an equivalent media profile.

Local Vision

Local vision examples typically require:

  • A compatible vision-capable GGUF model.
  • A projector GGUF when required by the model family.
  • An image path, browser canvas export, or image byte payload.

The Rust, Node.js, and Python example directories include vision_chat examples. Browser demos show canvas and file-oriented workflows.

Browser Vision

Browser applications should keep captured image payloads small and focused on the task. The proactive drawing demo uses cropped JPEG captures so the model sees the relevant ink instead of a full page screenshot.

Reference

Reference pages collect command, configuration, testing, and application details that need a stable home outside package READMEs.

Inference Operations

Sipp separates the operation from the endpoint. Choose query, chat, or embed based on the input shape and expected output, then pass the endpoint reference that decides where the request runs.

Shared Contract

  1. Register a local, gateway, or provider descriptor with SippClient.add.
  2. Keep the returned endpoint reference.
  3. Pass that reference to query, chat, or embed.

query and chat both produce text. They share maxTokens, temperature, topP, stop, cancellation, and token streaming. embed produces vectors and does not use generation options or token streaming.

| Operation | Input | Output | Best fit |
| --- | --- | --- | --- |
| query | One already-rendered prompt string. | Generated text. | Raw completions, custom templates, encoder-decoder text generation, few-shot prompts, and agent loops that render prompts themselves. |
| chat | Ordered { role, content } messages. | Generated assistant text. | Conversation-shaped model calls where the endpoint owns the chat-template or provider-message mapping. |
| embed | One text input. | One embedding vector. | Retrieval, semantic search, ranking, clustering, and memory indexes. |

Local Inference

Local endpoints run a GGUF model in the current browser, Node.js, Python, Rust, or CLI process.

| Operation | What Sipp sends to the runtime | Template behavior | Local-only options |
| --- | --- | --- | --- |
| query | The prompt string exactly as supplied. Decoder-only models run the normal decode path; encoder-decoder models run an encoder pass and then the decoder loop. | No chat template is applied. Use this when the application owns a custom or generic prompt format. | Context keys, grammars, JSON schema, sampling overrides, media inputs. |
| chat | Messages are rendered to one prompt with llama.cpp chat-template support and add_assistant = true. | Requires the GGUF to declare tokenizer.chat_template. Sipp checks model metadata, not the llama.cpp fallback chain, before allowing local chat. | Same text options as query, including context keys and media inputs. |
| embed | The input text is encoded by the local embedding runtime. | No chat template and no generation. | Context key and embedding normalization. |

Local chat is a prompt renderer plus generation call, not a conversation store. Pass prior turns in messages when they should be visible to the model. Use a context key only for local KV-cache reuse. Encoder-decoder text models, such as T5 or BART GGUF files, use query for text generation. Encoder-only models do not generate text and should use embed when they expose pooled embeddings.
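
Because local chat does not store conversations, the application carries the history itself. A minimal sketch of that bookkeeping follows; appendTurn and ChatMessage are hypothetical names, and the roles match the ones used in this book's chat examples.

```typescript
type ChatRole = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Returns a new message list with the finished turn appended; the input is not mutated.
function appendTurn(
  priorMessages: readonly ChatMessage[],
  userText: string,
  assistantText: string,
): ChatMessage[] {
  return [
    ...priorMessages,
    { role: 'user', content: userText },
    { role: 'assistant', content: assistantText },
  ];
}
```

For the next request, pass the returned list plus the new user message as messages so the model sees the earlier turns.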

Local Query With A Custom Template

Use query when you want to own the full prompt shape, including a hand-written or application-provided chat template.

const endpoint = await client.add('local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
});

const prompt = [
  '<|system|>',
  'Answer with one concise paragraph.',
  '<|user|>',
  'Explain local query.',
  '<|assistant|>',
].join('\n');

const run = client.query({
  endpoint,
  prompt,
  options: { maxTokens: 128, temperature: 0.2 },
  local: { contextKey: 'docs-example' },
  emitTokens: true,
});

Local Chat With The Model Template

Use chat when the GGUF model declares the chat template it expects. Sipp passes the role messages to llama.cpp template rendering and then generates from the rendered prompt.

const run = client.chat({
  endpoint,
  messages: [
    { role: 'system', content: 'Answer with one concise paragraph.' },
    { role: 'user', content: 'Explain local chat.' },
  ],
  options: { maxTokens: 128, temperature: 0.2 },
  local: { contextKey: 'docs-example' },
  emitTokens: true,
});

If the model has no tokenizer.chat_template, local chat fails. Use query with an explicit prompt template for base models, legacy models, or any generic template the application wants to control.
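
When falling back to query, the application renders the messages itself. The sketch below reuses the illustrative <|role|> tags from the custom-template query example in this section; renderPrompt is a hypothetical helper, and the real tag format must match whatever the model was trained on.

```typescript
interface PromptMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Renders messages with illustrative <|role|> tags and leaves an open
// assistant tag so generation continues as the assistant.
function renderPrompt(messages: readonly PromptMessage[]): string {
  const lines: string[] = [];
  for (const message of messages) {
    lines.push(`<|${message.role}|>`, message.content);
  }
  lines.push('<|assistant|>');
  return lines.join('\n');
}
```

The resulting string can be passed as prompt to query, which applies no chat template of its own.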

Local Query With Encoder-Decoder Models

Use query for encoder-decoder GGUF models. The source prompt is encoded first; Sipp then drives the decoder from the model’s decoder-start token.

const endpoint = await client.add('t5-local', {
  kind: 'local',
  modelPath: '/models/t5-small-f16.gguf',
});

const run = client.query({
  endpoint,
  prompt: 'translate English to German: Hello, world.',
  options: { maxTokens: 64 },
});

Most encoder-decoder text models do not declare a GGUF chat template. In that case chat is rejected even though query works.

Local Embed

Use embed with a model/runtime that supports embeddings. Local embedding normalization is a local-only option.

const run = client.embed({
  endpoint,
  input: 'Vectorize this sentence for retrieval.',
  local: { normalize: true },
});

const embedding = (await run.response).values;
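
For retrieval, embedding vectors are usually compared with cosine similarity. The helper below is a generic sketch and is not part of Sipp; it accepts any array-like of numbers because the concrete type of values is not specified here.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; report zero similarity instead of dividing by zero.
  return denominator === 0 ? 0 : dot / denominator;
}
```

When both vectors are already unit length, as is typically the case for normalized embeddings, the denominator is 1 and the result equals the dot product.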

Remote Gateway

A gateway endpoint sends the operation over HTTP. The first-party profile uses separate routes and payload shapes:

| Operation | Default route | Required body fields |
| --- | --- | --- |
| query | /v1/query | model, prompt |
| chat | /v1/chat | model, messages |
| embed | /v1/embed | model, input |

model is the public gateway target name. The gateway resolves that target to a local GGUF endpoint, OpenAI endpoint, OpenAI-compatible endpoint, or Anthropic endpoint.
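
A custom route handler or test harness can encode the default routes and required fields as data. The sketch below does only that; gatewayProfile and missingBodyFields are hypothetical names, and the routes and field names are the first-party defaults listed above.

```typescript
type GatewayOperation = 'query' | 'chat' | 'embed';

// Default first-party routes and required body fields.
const gatewayProfile: Record<GatewayOperation, { route: string; required: readonly string[] }> = {
  query: { route: '/v1/query', required: ['model', 'prompt'] },
  chat: { route: '/v1/chat', required: ['model', 'messages'] },
  embed: { route: '/v1/embed', required: ['model', 'input'] },
};

// Returns the required fields that are absent from a candidate request body.
function missingBodyFields(operation: GatewayOperation, body: Record<string, unknown>): string[] {
  return gatewayProfile[operation].required.filter((field) => body[field] === undefined);
}
```

A non-empty result is a reasonable trigger for a client-side validation error before any HTTP request is sent.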

Gateway calls accept shared text options for query and chat, such as max_tokens, temperature, top_p, stop, and stream. Local-only fields such as contextKey, grammar, jsonSchema, sampling, media, and normalize are rejected by gateway endpoints. Direct-provider providerOptions are also rejected by gateway endpoints; a custom gateway must translate provider-specific extensions deliberately.
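Because gateway endpoints reject local-only fields, an application that shares request-building code between local and gateway endpoints may want a pre-flight check. This is a sketch only: `findLocalOnlyFields` is a hypothetical helper, and it inspects whatever bag of local options the caller passes in.

```typescript
// Local-only request fields named above; gateway endpoints reject them.
const LOCAL_ONLY_FIELDS = [
  'contextKey',
  'grammar',
  'jsonSchema',
  'sampling',
  'media',
  'normalize',
] as const;

// Hypothetical pre-flight check: returns the local-only fields that are set.
function findLocalOnlyFields(localOptions: Record<string, unknown>): string[] {
  return LOCAL_ONLY_FIELDS.filter((field) => field in localOptions);
}
```

An empty result means the options are safe to send to a gateway endpoint as far as this list is concerned; a non-empty result names the fields to strip or move.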

Gateway Target Mapping

| Gateway target type | query behavior | chat behavior | embed behavior |
| --- | --- | --- | --- |
| Local GGUF | Runs local raw-prompt generation. Decoder-only models decode directly; encoder-decoder models run encoder prefill plus decoder generation. No chat template is added. | Runs local chat rendering with the GGUF-declared chat template. Fails if the model has no template, including many encoder-decoder models. | Runs local embedding if the loaded model/runtime supports embeddings. Encoder-decoder text models do not produce embeddings through this runtime. |
| OpenAI | Sends an OpenAI completions request with prompt. | Sends an OpenAI chat-completions request with messages. | Sends an OpenAI embeddings request with input and encoding_format: "float". |
| OpenAI-compatible | Sends /completions with prompt. | Sends /chat/completions with messages. | Sends /embeddings with input and encoding_format: "float". |
| Anthropic | Wraps the prompt as one user message and sends an Anthropic /messages request. | Sends Anthropic /messages; system role messages are joined into the top-level system field, and user/assistant messages remain in messages. | Unsupported by the native Anthropic adapter. |

Provider support still depends on the upstream model and provider. For example, an OpenAI-compatible target may expose chat but not completions, so gateway chat can work while gateway query fails for that target.
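The Anthropic chat mapping described in the table can be pictured with a short sketch. The function name and the blank-line separator used to join system messages are assumptions; the text above only states that system messages are joined into the top-level system field.

```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical illustration of the mapping: system messages move to a
// top-level `system` string, everything else stays in `messages`.
function toAnthropicShape(
  messages: Message[],
): { system?: string; messages: Message[] } {
  const system = messages
    .filter((m) => m.role === 'system')
    .map((m) => m.content)
    .join('\n\n'); // separator is an assumption for this sketch
  const rest = messages.filter((m) => m.role !== 'system');
  return system.length > 0 ? { system, messages: rest } : { messages: rest };
}
```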

Gateway Client Chat

const endpoint = await client.add('gateway-openai', {
  kind: 'gateway',
  target: 'openai-chat',
  baseUrl: process.env.SIPP_GATEWAY_URL!,
  authentication: {
    kind: 'bearer',
    value: process.env.SIPP_GATEWAY_TOKEN!,
  },
});

const run = client.chat({
  endpoint,
  messages: [
    { role: 'system', content: 'Answer for application developers.' },
    { role: 'user', content: 'When should I use gateway chat?' },
  ],
  options: { maxTokens: 128, temperature: 0.2 },
});

First-Party Gateway HTTP Examples

Raw-prompt query:

curl -X POST "$SIPP_GATEWAY_URL/v1/query" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "compatible-completion",
    "prompt": "Explain gateway query in one sentence.",
    "max_tokens": 64
  }'

Chat:

curl -X POST "$SIPP_GATEWAY_URL/v1/chat" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "anthropic-chat",
    "messages": [
      { "role": "system", "content": "Answer briefly." },
      { "role": "user", "content": "Explain gateway chat." }
    ],
    "max_tokens": 128
  }'

Embedding:

curl -X POST "$SIPP_GATEWAY_URL/v1/embed" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "openai-embed",
    "input": "Text to index for retrieval."
  }'

Choosing Quickly

  • Use local query when the application must control every token in the prompt, including custom or generic chat templates, or when the target is an encoder-decoder text model.
  • Use local chat when the GGUF model declares its own chat template and the application already has role messages.
  • Use local embed when vectors should be produced in the current process and local normalization matters.
  • Use gateway query when the target supports raw-prompt generation, including local decoder-only or encoder-decoder GGUF targets and OpenAI-compatible completions targets.
  • Use gateway chat for provider chat models and for local GGUF chat models with declared templates.
  • Use gateway embed for local, OpenAI, or OpenAI-compatible embedding targets; do not use it with native Anthropic targets.

Runtime Options

Sipp keeps runtime configuration close to the endpoint that owns local inference. Request options stay on query, chat, or embed calls. Gateway and provider extensions use separate option buckets so applications can see which boundary receives each field.

Option Layers

| Layer | Browser package | Node.js package | Purpose |
| --- | --- | --- | --- |
| Client options | new SippClient(options) | Environment and process setup | Browser assets, workers, browser cache, and backend selection. |
| Local endpoint load options | client.add(..., { kind: 'local', options }) | client.add(..., { kind: 'local', config }) | Model source, backend preference, progress, and native runtime config. |
| Text request options | client.query(prompt, options) | client.query({ options }) | Output length, sampling shortcuts, streaming, cancellation, and stop strings. |
| Local request options | contextKey, grammar, media, normalize | local: { contextKey, grammar, media, normalize } | Local-only prompt state, grammars, images, and embedding normalization. |
| Gateway extensions | endpointOptions | endpointOptions | Extra fields consumed by gateway endpoint implementations. |
| Provider extensions | providerOptions | providerOptions | Provider-only fields merged into direct provider requests. |

Python and Rust expose the same concepts with language-native descriptors and runtime config classes or structs.

Browser Client Options

Browser SippClientOptions affect the WebAssembly runtime, worker transport, and browser storage. They do not select a model by themselves.

| Option | Use |
| --- | --- |
| executionMode | auto uses a worker when available. worker forces worker transport. main-thread is useful for debugging or constrained hosts. |
| wasmThreading | single-thread loads the single-thread WASM runtime. pthread loads the pthread runtime. |
| moduleUrl, wasmUrl | Override single-thread runtime asset URLs when a bundler or deployment moves package assets. Provide both together. |
| pthreadModuleUrl, pthreadWasmUrl | Override pthread runtime asset URLs. Provide both together. |
| browserCache | Tune OPFS split thresholds and direct-load behavior for browser GGUF storage. |
| trustedOrigins | Allow runtime asset URLs from additional origins. Defaults allow same-origin package assets. |
| workerUrl | Override the worker entry URL when the bundler cannot resolve the packaged worker. |

wasmThreading: 'pthread' requires SharedArrayBuffer, cross-origin isolation, and COOP/COEP headers. Use single-thread when the application cannot serve those headers.

const client = new SippClient({
  executionMode: 'worker',
  wasmThreading: 'single-thread',
});
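For hosts that do control response headers, the pthread requirement comes down to two headers on the page that loads the runtime. The sketch below merges them into an existing header map; `withCrossOriginIsolation` is a hypothetical helper, and how it is wired into a server or framework is left to the application.

```typescript
// Response headers that make a page cross-origin isolated.
const CROSS_ORIGIN_ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
} as const;

// Hypothetical helper: add the isolation headers to an existing header map.
function withCrossOriginIsolation(
  headers: Record<string, string>,
): Record<string, string> {
  return { ...headers, ...CROSS_ORIGIN_ISOLATION_HEADERS };
}

const pageHeaders = withCrossOriginIsolation({ 'content-type': 'text/html' });
```

With a plain Node.js HTTP handler, the result could be passed to `res.writeHead(200, pageHeaders)`.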

Local Endpoint Options

Browser local endpoints use source plus optional load options:

const endpoint = await client.add('browser-local', {
  kind: 'local',
  source: '/models/model.gguf',
  options: {
    backend: 'webgpu',
    runtime: {
      context: { n_ctx: 2048 },
    },
  },
});

Node.js local endpoints use modelPath and config:

const endpoint = await client.add('node-local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
  config: {
    context: { n_ctx: 2048, n_threads: 8, n_threads_batch: 8 },
  },
});

Browser backend accepts auto, cpu, or webgpu. Native package backend selection is package-specific: Node.js uses SIPP_NODE_BACKEND, Python uses SIPP_PYTHON_BACKEND, and the CLI uses --backend.

Native Runtime Config

NativeRuntimeConfig groups local runtime settings by responsibility.

| Group | Common fields | Use |
| --- | --- | --- |
| placement | devices, gpu_layers, split_mode, main_gpu, tensor_split, use_mmap, use_mlock, fit_params | Model placement, memory mapping, and GPU residency choices. |
| context | n_ctx, n_batch, n_ubatch, n_parallel, n_threads, n_threads_batch, flash_attention, offload_kqv | Context window, batch sizes, CPU thread counts, attention, and KV behavior. |
| sampling | samplers, seed, top_k, top_p, min_p, temperature, repeat_penalty, mirostat, logit_bias | Default local sampling behavior for text generation. |
| scheduler | continuous_batching, policy, prefill_chunk_size, max_running_requests, max_queued_requests | Request scheduling, batching, and queue limits. |
| cache | mode, retained_prefix_tokens, snapshot_interval_tokens, max_snapshot_entries, max_snapshot_bytes | Prefix KV reuse and snapshot behavior. |
| multimodal | projector_path, use_gpu, image_min_tokens, image_max_tokens | Vision projector and image-token settings. |
| residency | max_gpu_models_per_device, allow_cpu_models_while_gpu_loaded, require_gpu_lease | GPU model residency policy for native runtimes. |
| observability | runtime_metrics, backend_profiling | Runtime timing, throughput, and backend diagnostics. |

Use runtime config for stable endpoint behavior. Use request options for values that should vary per prompt, user action, or UI control.
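To make the split concrete, here is a sketch that keeps endpoint-level settings and per-request settings in separate objects. The group and field names come from the table above; the `RuntimeConfigSketch` interface, the value types, and the example values are assumptions for illustration and are not the package's real type definitions.

```typescript
// Hypothetical partial typing for this sketch only.
interface RuntimeConfigSketch {
  placement?: { gpu_layers?: number; use_mmap?: boolean };
  context?: { n_ctx?: number; n_threads?: number; n_threads_batch?: number };
  sampling?: { seed?: number; temperature?: number; top_p?: number };
  scheduler?: { max_running_requests?: number; max_queued_requests?: number };
}

// Stable endpoint behavior: set once when the endpoint is registered.
const endpointConfig: RuntimeConfigSketch = {
  context: { n_ctx: 4096, n_threads: 8, n_threads_batch: 8 },
  sampling: { seed: 42, temperature: 0.7 },
  scheduler: { max_running_requests: 4, max_queued_requests: 16 },
};

// Per-request values: vary with the prompt, user action, or UI control.
const requestOptions = { maxTokens: 128, temperature: 0.2 };
```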

Request Options

Text-producing calls share common generation controls:

| Option | Use |
| --- | --- |
| maxTokens | Maximum generated tokens for the response. |
| temperature | Request-local temperature shortcut. |
| topP | Request-local nucleus sampling shortcut. |
| stop | Stop strings for text generation. |
| signal | Cancellation through AbortSignal where supported. |
| emitTokens | Enables token streaming through the returned run handle. |

Local text calls can also use a prompt context key, GBNF grammar, and media inputs for vision-capable models. Embedding calls can set normalization through local embedding options.

Gateway-specific fields belong in endpointOptions. Direct provider-specific fields belong in providerOptions:

const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128, temperature: 0.2 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});

Provider options cannot override typed fields such as model, messages, prompt, temperature, or topP/top_p; set those through the typed request options where Sipp exposes them.

Device Support

Sipp runs across a range of devices, operating systems, browsers, and GPU accelerators. This page documents which configurations are supported, at what level, and any known limitations.

Compute Backends

Backend names are shared across build configuration and runtime selection. The same name selects the backend in each surface.

| Backend | Status | Feature flag | Default | Platforms | Notes |
| --- | --- | --- | --- | --- | --- |
| CPU | Supported | native | Yes | All | Portable fallback, no accelerator required |
| CUDA | Supported | cuda | No | Linux, Windows | NVIDIA GPUs, compute capability 7.5+ |
| Metal | Supported | metal | No | macOS | Apple Silicon and AMD GPUs; use CPU on Intel integrated GPUs |
| Vulkan | Supported | vulkan | No | Linux, Windows | Vulkan 1.2+ GPU required |
| WebGPU | Supported | GGML_WEBGPU (CMake) | No | WASM browsers | Browser-only, requires shader-f16 |

Runtime selection:

  • CLI: --backend auto|cpu|cuda|metal|vulkan
  • Node.js: SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal
  • Python: SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal
  • Browser: backend: 'auto' | 'cpu' | 'webgpu' in model load options

Leave SIPP_NODE_BACKEND or SIPP_PYTHON_BACKEND unset for automatic backend selection.
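An application that sets the backend variable from its own configuration may want to validate the value before starting the runtime. The sketch below is a hypothetical application-side check, not Sipp's own parsing; treating an empty string as unset and throwing on unknown values are assumptions.

```typescript
const NATIVE_BACKENDS = ['cpu', 'vulkan', 'cuda', 'metal'] as const;
type NativeBackend = (typeof NATIVE_BACKENDS)[number];

// Returns undefined when the variable is unset, which corresponds to
// automatic backend selection in the text above.
function readBackend(
  env: Record<string, string | undefined>,
  name: string,
): NativeBackend | undefined {
  const value = env[name];
  if (value === undefined || value === '') return undefined;
  const match = NATIVE_BACKENDS.find((backend) => backend === value);
  if (match === undefined) {
    throw new Error(
      `${name} must be one of ${NATIVE_BACKENDS.join('|')}, got "${value}"`,
    );
  }
  return match;
}
```

In a Node.js process this could be called as `readBackend(process.env, 'SIPP_NODE_BACKEND')`.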

Backend Availability by Package

BackendNode.jsPythonRustBrowser (WASM)Gateway
CPUYesYesYesYesYes
CUDAYesYesYesYes
MetalYesYesYes
VulkanYesYesYesYes
WebGPUYes

Additional llama.cpp Backends (Not Yet Exposed)

The vendored llama.cpp supports additional backends that Sipp does not currently expose as feature flags. Community contributions are welcome.

  • SYCL (Intel oneAPI)
  • HIP / ROCm (AMD)
  • OpenCL
  • OpenVINO
  • CANN (Huawei Ascend)
  • MUSA (Moore Threads)
  • Hexagon (Qualcomm DSP)
  • ZenDNN (AMD)
  • RPC (remote backend)

These backends require custom CMake flags on top of the vendored llama.cpp build and are not available through Sipp’s standard build or package commands.


Desktop Browser Support Matrix

The table below shows the first browser version where each feature is available for desktop operating systems. A dash (-) means the feature is not supported.

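| Browser | Support | WASM st | WASM pthread¹ | WebGPU | WebGPU + f16² | OPFS³ | Workers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chrome (Win, Mac, Linux) | ✅ Tested | 57 | 92⁴ | 113 | 113 | 86 | 4 |
| Edge (Win, Mac, Linux) | ❌ Untested | 79⁵ | 92⁴ | 113 | 113 | 86 | 79⁵ |
| Firefox (Windows) | ❌ Untested | 52 | 79⁴ | 141 | 141 | 111 | 3.5 |
| Firefox (macOS) | ❌ Untested | 52 | 79⁴ | 145⁶ | 145⁶ | 111 | 3.5 |
| Firefox (Linux) | ❌ Untested | 52 | 79⁴ | ⚠ Nightly | ⚠ Nightly | 111 | 3.5 |
| Safari (macOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 4 |
| Opera (Win, Mac, Linux) | ❌ Untested | 44 | 78⁴ | 99 | 99 | 72 | 11.5 |
| ChromeOS | ❌ Untested | 57 | 92⁴ | 113 | 113 | 86 | 4 |
| Other Chromium-based⁷ | ❌ Untested | 57+ | 92⁴ | 113 | 113 | 86+ | 4+ |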

Footnotes:

  • ¹ WASM pthread requires the server to send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp (or credentialless) HTTP headers. See WASM Threading below.
  • ² The shader-f16 WebGPU feature is required by Sipp’s browser WebGPU backend. Availability depends on GPU and driver support in addition to the browser version.
  • ³ Origin Private File System. Used for model data caching. Requires a secure context (HTTPS). Firefox support is behind the dom.fs.enabled preference until version 111.
  • ⁴ Version listed is when SharedArrayBuffer became available with cross-origin isolation headers. Earlier versions may have had the feature without the header requirement.
  • ⁵ Edge switched to a Chromium engine at version 79. Chromium-based Edge supports both WASM single-thread and Workers from version 79. The legacy EdgeHTML engine supported Workers from version 12 and WASM from version 16.
  • ⁶ Firefox 145 enables WebGPU on macOS version 26 (ARM64). Intel Mac support is in progress in Nightly.
  • ⁷ Includes Brave, Vivaldi, Arc, and other Chromium-derived browsers. Versions match their underlying Chromium release.

Mobile Browser Support Matrix

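| Browser | Support | WASM st | WASM pthread¹ | WebGPU | WebGPU + f16² | OPFS³ | Workers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chrome (Android) | 🟡 Pending | 57 | 92⁴ | 121⁵ | 121⁵ | 86 | 56 |
| Safari (iOS / iPadOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 5 |
| Safari (visionOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 5 |
| Samsung Internet (Android) | ❌ Untested | 8 | 16⁴ | 24 | 24 | 21 | 4 |
| Opera (Android) | ❌ Untested | 44 | 78⁴ | 80 | 80 | 72 | 11.5 |
| Firefox (Android) | ❌ Untested | 52 | 79⁴ | ⚠ Beta/Nightly | ⚠ Beta/Nightly | 150 | 52 |
| Android WebView | ❌ Untested | 57 | 92⁴ | ⚠ Flag⁶ | ⚠ Flag⁶ | 86 | 56 |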

Footnotes:

  • ¹ Requires COOP/COEP HTTP headers as described in WASM Threading.
  • ² The shader-f16 feature may not be available on all mobile GPU/driver combinations even when the browser version supports it.
  • ³ Origin Private File System. Chrome for Android and Samsung Internet support OPFS. iOS Safari supports OPFS from 16.4.
  • ⁴ Version listed is when SharedArrayBuffer became available with cross-origin isolation headers.
  • ⁵ Chrome 121 on Android 12+ with Qualcomm or ARM GPUs. Support on other GPU vendors (Imagination, Samsung Xclipse) is still rolling out.
  • ⁶ Android WebView requires the --enable-unsafe-webgpu flag. Not recommended for production use.

WASM Threading

Sipp ships two WASM runtime artifacts:

| Artifact | Thread count | Token streaming | Requirements |
| --- | --- | --- | --- |
| sipp-wasm.js (single-thread) | 1 | postMessage | None |
| sipp-wasm-pthread.js (pthread) | up to 4⁷ | SharedArrayBuffer ring | COOP + COEP headers, secure context |

⁷ Defaults to min(4, navigator.hardwareConcurrency). Override with runtime.context.n_threads in model load options.

The client auto-detects pthread availability at runtime:

function supportsWasmPthreads(): boolean {
  return (
    typeof SharedArrayBuffer !== 'undefined' &&
    globalThis.crossOriginIsolated === true &&
    typeof Worker !== 'undefined'
  );
}

Set wasmThreading: 'single-thread' in client options when the hosting environment cannot serve COOP/COEP headers (for example, GitHub Pages or shared hosting without header control).
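Putting the detection rule and the default thread count together, an application could plan its threading mode up front. This is a sketch under stated assumptions: `HostCapabilities` and `planWasmThreading` are hypothetical names, capabilities are passed in explicitly so the logic can be exercised outside a browser, and the lower bound of one thread is a guard added for this example.

```typescript
type WasmThreading = 'single-thread' | 'pthread';

// Hypothetical snapshot of the host checks described above.
interface HostCapabilities {
  hasSharedArrayBuffer: boolean;
  crossOriginIsolated: boolean;
  hasWorker: boolean;
  hardwareConcurrency: number;
}

function planWasmThreading(host: HostCapabilities): {
  wasmThreading: WasmThreading;
  threads: number;
} {
  const pthreadReady =
    host.hasSharedArrayBuffer && host.crossOriginIsolated && host.hasWorker;
  if (!pthreadReady) return { wasmThreading: 'single-thread', threads: 1 };
  // Mirrors the documented default of min(4, hardwareConcurrency).
  const threads = Math.max(1, Math.min(4, host.hardwareConcurrency));
  return { wasmThreading: 'pthread', threads };
}
```

In a browser, the capability fields would be filled from `typeof SharedArrayBuffer`, `globalThis.crossOriginIsolated`, `typeof Worker`, and `navigator.hardwareConcurrency`.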


Platform & OS Support

OSx64arm64Other architecturesAvailable bindings
Linux (glibc)YesYesarm, loong64, riscv64, ppc64, s390xNode.js, Python, Rust
Linux (musl)YesYesarm, loong64, riscv64Node.js
Windows (MSVC)YesYesia32Node.js, Python, Rust
Windows (GNU)YesNode.js
macOSYesYesuniversal2Node.js, Python, Rust
AndroidYesarm (eabi)Node.js
FreeBSDYesYesNode.js
OpenHarmonyYesYesarmNode.js

Docker Containers

ProfileBackendHost OSNotes
CPUCPULinux, macOS, WindowsWorks everywhere, no GPU passthrough
CUDACUDALinux, Windows (WSL2)Requires NVIDIA Container Toolkit
VulkanVulkanLinux onlyWindows Docker Desktop does not support Vulkan passthrough
MetalMetal unavailable inside Linux containers

GPU & Accelerator Support

NVIDIA CUDA

Sipp targets NVIDIA GPUs with compute capability 7.5 and above. CUDA 13 removes support for architectures below 7.5.

| Architecture | Compute Capability | Target GPUs |
| --- | --- | --- |
| Turing | 7.5 | T4, Quadro RTX, GeForce RTX 20-series |
| Ampere | 8.0, 8.6 | A100, A10, A40, RTX A6000, GeForce RTX 30-series |
| Ada Lovelace | 8.9 | L4, L40S, GeForce RTX 40-series |
| Hopper | 9.0 | H100, H200 |
| Blackwell (Data Center) | 10.0 | B100, B200, GB200 |
| Blackwell (Consumer/Edge) | 12.0, 12.1 | GeForce RTX 50-series, RTX PRO Blackwell |

Vulkan

Any GPU with Vulkan 1.2 or later driver support works on Linux and Windows. Tested on:

  • NVIDIA: Turing, Ampere, Ada Lovelace, Hopper (proprietary driver)
  • AMD: RDNA 2 and later (AMDGPU PRO or RADV)
  • Intel: Gen12/Xe and later (ANV)

Windows Docker Desktop does not support the Vulkan backend.

macOS source builds can compile Vulkan through the LunarG SDK, but LunarG’s macOS drivers translate Vulkan to Metal. Sipp does not publish macOS Vulkan packages because the native Metal backend is simpler for normal macOS use and macOS Vulkan adds loader/ICD runtime requirements.

Metal

  • Apple Silicon: M1, M2, M3, M4 series
  • AMD: GPUs supported by macOS (Radeon Pro series)

Metal is macOS-only and unavailable inside Docker containers. Intel integrated GPUs expose Metal, but Sipp does not treat them as a recommended Metal target; use the CPU backend on those Macs unless you have tested the exact model, context size, and device and confirmed that Metal is stable and faster than CPU.

Apple Silicon can run x64 processes through Rosetta 2. A darwin-x64 Node or Python native package is only used by an x64 Node/Python process; native arm64 Node/Python installations use the darwin-arm64 packages and are the preferred path on Apple Silicon.

WebGPU (Browser)

Any GPU that the host browser exposes as a WebGPU adapter may work, but Sipp requires the shader-f16 feature for WebGPU acceleration. Common configurations:

GPU FamilyChrome (D3D12)Chrome (Vulkan)Firefox (wgpu)Safari (Metal)
NVIDIAYesYes (Linux)Yes
AMDYesYes (Linux)YesYes
Intel integratedYesYes (Linux)YesYes
Apple SiliconYesYes
Qualcomm (Android)Yes
ARM MaliYes (Android)
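Since the shader-f16 requirement depends on the adapter and not only on the browser version, a feature check before requesting the WebGPU backend can avoid a failed load. The sketch below uses a minimal structural type in place of the browser's `GPUAdapter`; `hasShaderF16` and `pickBrowserBackend` are hypothetical helpers, and falling back to `cpu` is one possible policy.

```typescript
// Minimal structural type for the part of a WebGPU adapter this sketch reads.
interface AdapterLike {
  features: { has(name: string): boolean };
}

// True when an adapter exists and reports the shader-f16 feature.
function hasShaderF16(adapter: AdapterLike | null | undefined): boolean {
  return adapter != null && adapter.features.has('shader-f16');
}

// Hypothetical policy: request webgpu only when shader-f16 is present.
function pickBrowserBackend(
  adapter: AdapterLike | null | undefined,
): 'webgpu' | 'cpu' {
  return hasShaderF16(adapter) ? 'webgpu' : 'cpu';
}
```

In a browser, the adapter would typically come from `await navigator.gpu?.requestAdapter()`, which resolves to `null` when no suitable adapter is available.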

Language Binding Support

| Package | Install command | Status | Runtime | Primary use |
| --- | --- | --- | --- | --- |
| Browser (@sipp/sipp) | npm install @sipp/sipp | Published (npm) | WASM / WebGPU | Browser-local GGUF inference, gateway clients |
| Node.js (@sipp/sipp-server) | npm install @sipp/sipp-server | Published (npm) | N-API native | Server processes, route handlers, backend services |
| Python (sipppy) | pip install sipppy | Published (PyPI) | PyO3 native | Python services, scripts, gateway clients |
| Rust (sipp-rs) | cargo add sipp-rs | Published (crates.io) | Native-backed Rust crate | Rust applications and services |
| Gateway server | Source-built | Source only | Axum binary | HTTP gateway for local and provider targets |
| Gateway Docker | Docker from source | Source only | Container | Production container workflows |
| Gateway toolkit | Source artifact | Source only | Rust crate | Custom gateway applications |

Limitations & Work in Progress

  • Gateway server does not have a published binary or public container image yet. It must be built from source.
  • Windows Docker Vulkan is not supported. Use the CUDA or CPU profiles on Windows with WSL2.
  • macOS Docker is CPU-only. Metal cannot run inside a Linux Docker container.
  • Android and iOS are not first-class package targets. The browser WASM package works on mobile web browsers, but no native Android or iOS packages are published.
  • Chrome (desktop) is the primary tested browser target. Other desktop browsers (Edge, Firefox, Safari, Opera, Chromium derivatives) are untested.
  • Mobile browser support has not been validated yet. Chrome (Android) is the next target for testing.
  • Firefox WebGPU on Linux and Android is in active development (Nightly / Beta). Firefox WebGPU on macOS Intel is also in progress.
  • Gateways are compatible with OpenAI and OpenAI-compatible providers plus Anthropic. Additional provider support is added over time.

CLI

apps/cli builds the sipp command-line application for local GGUF text generation. It is useful for runtime smoke testing, manual model checks, and quick local prompts.

Build

cargo xtask build cli --backend cpu
cargo xtask build cli --backend all

Run

cargo run -p sipp-cli -- <model.gguf> "Explain Sipp."

Useful flags include:

  • --max-tokens
  • --ctx-size
  • --backend auto|cpu|cuda|metal|vulkan
  • --temperature
  • --stats off|basic|profile
  • --chat

Use cargo run -p sipp-cli -- --help for the full generated help.

Configuration

Sipp configuration is intentionally split by responsibility. Core crates do not own HTTP routes, authentication schemes, TOML files, or deployment policy.

Runtime Configuration

Local runtime configuration belongs to the endpoint descriptor or package-level runtime options. Common areas include context size, scheduler behavior, cache mode, observability, sampling, and backend selection. See Runtime Options for the shared option map.

Gateway Configuration

apps/gateway-server owns TOML configuration for the first-party gateway application:

  • [routes] selects public and management paths.
  • admin_password_env names the secret env var containing the Admin Dashboard password.
  • [[tokens]] maps bearer-token environment variables to caller labels and allowed targets.
  • [[targets]] defines local, OpenAI, OpenAI-compatible, or Anthropic targets. Local targets can select backend = "auto", cpu, cuda, metal, or vulkan. See Gateway Configuration for the full schema.

Custom wire formats, authentication schemes, and route layouts belong in separate applications composed from lib/gateway.

Environment Variables

  • SIPP_GATEWAY_TOKEN: development bearer token for examples and gateway server commands.
  • SIPP_GATEWAY_ADMIN_PASSWORD: Admin Dashboard password used by gateway examples.
  • SIPP_GATEWAY_URL: gateway base URL for client examples.
  • SIPP_NODE_BACKEND: Node runtime backend selection.
  • SIPP_PYTHON_BACKEND: Python runtime backend selection.
  • OPENAI_API_KEY: provider credential used by OpenAI examples and provider-backed gateway targets.
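Client examples read the gateway URL and token from the environment. As a sketch of a slightly more defensive pattern, the hypothetical `readGatewayEnv` helper below fails early with a clear message when either variable is missing. The variable names come from the list above; the helper and its return shape are assumptions.

```typescript
interface GatewayEnv {
  baseUrl: string;
  token: string;
}

// Hypothetical helper: read and validate the gateway client variables.
function readGatewayEnv(env: Record<string, string | undefined>): GatewayEnv {
  const baseUrl = env['SIPP_GATEWAY_URL'];
  const token = env['SIPP_GATEWAY_TOKEN'];
  if (!baseUrl) throw new Error('SIPP_GATEWAY_URL is not set');
  if (!token) throw new Error('SIPP_GATEWAY_TOKEN is not set');
  return { baseUrl, token };
}
```

In Node.js the call would be `readGatewayEnv(process.env)`, and the two fields map onto the `baseUrl` and bearer `value` of a gateway endpoint descriptor.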

Examples And Demos

Examples are small, runnable integrations. Demos are broader browser experiences for inspecting runtime behavior and user-facing workflows.

Examples

  • examples/rust: Rust query, chat, embed, vision, gateway, and provider examples.
  • examples/node: Node.js query, chat, embed, vision, and gateway examples.
  • examples/python: Python query, chat, embed, vision, and gateway examples.
  • examples/web: Vite browser pages for local and gateway workflows.
  • examples/gateway: minimal Axum gateway route composition.

Start with:

cargo xtask run examples gateway rust --case query
cargo xtask run examples serve browser

Demos

  • demos/chat: focused browser chat interface for local GGUF models.
  • demos/avatar: React and three.js character demo.
  • demos/proactive-ui: drawing-to-vision demo with runtime tracing.
  • demos/simulation: multi-agent simulation demo using director helpers.
  • tools/playground: browser runtime diagnostics and automation tool.

Start with:

cargo xtask run demos serve chat
cargo xtask run tools serve playground

Use cargo xtask test smoke group examples --backend cpu for model-backed example smoke coverage when validating broader runtime behavior.

Maintainers

This section is for developers working from the Sipp source checkout. It covers repository structure, build orchestration, tests, coverage, and contribution workflow.

Application developers who only need the published packages should start with Using the Core Library.

Source Builds

Use the source checkout when developing Sipp itself, validating package artifacts, running examples, or deploying the gateway server before a public server artifact exists.

Bootstrap

From the repository root:

source ./setup.sh
sipp doctor
sipp test list

On Windows, run .\setup.ps1 from PowerShell or setup.cmd from CMD. After setup, sipp is a repo-local alias for cargo xtask; use cargo xtask ... with the same arguments if the launcher is not active.

Build Targets

Use the xtask orchestrator instead of direct build commands when compiling Sipp targets. It manages native dependencies, backend toolchains, and package staging.

sipp build core
sipp build node --backend cpu
sipp build python --backend cpu
sipp build gateway-server --backend cpu
sipp build wasm
sipp build all

Use --backend vulkan, --backend cuda, --backend metal, or --backend all where a native package target supports those backends.

CUDA builds compile a portable cloud GPU architecture list by default. Set SIPP_CUDA_ARCHITECTURES (semicolon-separated CMake entries, for example 80 for A100 only) before building to narrow the list for faster local builds. See docs/gateway/docker.md for the full list and rationale.
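The semicolon-separated value can be derived from compute capabilities such as those in the NVIDIA CUDA table. The helper below is a hypothetical convenience for build scripts, not part of the Sipp tooling; it assumes each entry is simply the capability with the dot removed, as in the 80 example for A100.

```typescript
// Hypothetical helper: turn ['8.0', '8.6'] into '80;86'.
function cudaArchitectureList(capabilities: string[]): string {
  return capabilities
    .map((capability) => {
      // Accept only values shaped like "8.6" or "12.0".
      if (!/^\d+\.\d$/.test(capability)) {
        throw new Error(`unexpected compute capability: ${capability}`);
      }
      return capability.replace('.', '');
    })
    .join(';');
}

const ampereOnly = cudaArchitectureList(['8.0', '8.6']);
```

The resulting string would be exported as SIPP_CUDA_ARCHITECTURES before running the build.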

Examples And Demos

Run browser examples and demos through sipp. These commands start Vite dev servers and do not accept native backend flags:

sipp run examples serve browser
sipp run demos serve avatar
sipp run demos serve simulation

Gateway Hello World Examples

Gateway example workflows start a local gateway, run a client example, and stop the gateway when the client exits. They start examples/gateway and then run a client from examples/rust, examples/node, or examples/python.

Use --case query|chat|embed to choose the client case. Use --backend cpu|vulkan|cuda|metal when the gateway process should use a specific native backend.

sipp run examples gateway rust --case query
sipp run examples gateway node --case chat
sipp run examples gateway python --case embed --backend vulkan

Playground

The browser playground lives under tools/playground. Use it to inspect local inference, vision model setup, GGUF loading, runtime observability, and repeatable browser runtime smoke checks.

sipp run tools serve playground

Gateway Server

The release workflow does not yet publish a standalone gateway-server binary or container image. Use sipp for source checkout checks and raw Docker commands for container deployment. The canonical source guide is Gateway Server; Docker deployment is covered in Gateway Docker.

cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend cpu

The copied local config expects a local GGUF model under .build/models and a dashboard password env var named by the selected TOML file. Keep secrets env files private because they contain the Admin Dashboard password and provider credentials.

Validation

Use the narrowest relevant target from Testing. Common entry points are:

sipp test list
sipp test unit group full
sipp test smoke group examples --backend cpu
sipp test verify --target public-docs

Architecture

Sipp separates inference primitives from protocol and deployment policy. The public package surfaces compose lower-level crates without moving HTTP routes, serialized wire formats, or deployment defaults into core inference layers.

Published Crates

  • crates/sipp: the public sipp Rust library published as sipp-rs. Former foundational crates continue as module folders:
    • core: low-level shared types.
    • shard: GGUF cache planning and split-file utilities.
    • backend, engine, lifecycle, runtime: local inference, scheduling, lifecycle, and memory management.
    • client: typed endpoint registration and query, chat, embed dispatch, re-exported at the crate root.
    • providers (feature providers): explicitly selected external provider adapters.
    • gateway_core (feature gateway): protocol-neutral gateway execution traits and pipeline ordering.
  • crates/sys: the sipp-sys crate — unsafe FFI bindings, native llama.cpp shims, and the vendored llama.cpp/ source tree.

Public Libraries

  • lib/web: browser package source.
  • lib/node: Node.js server package source.
  • lib/python: Python package source.
  • lib/gateway: route-free HTTP gateway toolkit, consumed from source checkouts.

Applications And Examples

  • apps/gateway-server: opinionated first-party gateway application.
  • apps/cli: command-line local inference application.
  • examples: small copyable integrations.
  • demos: browser experiences built on the public package surfaces.
  • xtask: build, test, run, packaging, and maintenance orchestration.

For gateway-specific layering, read Gateway Architecture.

Gateway Architecture

Gateway architecture documentation now lives in Gateway Architecture.

Testing

Sipp tests are cataloged by cargo xtask test list. Use that command first when choosing a target or checking what CI runs.

Commands

cargo xtask test has four top-level actions:

  • list: list unit and smoke suites and optionally discover/search cheap cases.
  • unit: run deterministic code-flow and API-layer tests by suite or group.
  • smoke: run holistic integration smoke tests by suite or group.
  • verify: analyze existing coverage artifacts and validate test structure.

Common Commands

cargo xtask test list
cargo xtask test list --group unit --layer interface --cases --search router --format json
cargo xtask test unit group full
cargo xtask test unit group whitebox
cargo xtask test unit group interface
cargo xtask test unit suite xtask
cargo xtask test unit suite rust-crates --package sipp-rs
cargo xtask test unit suite browser --wasm-threading single-thread
cargo xtask test unit suite demos --wasm-threading single-thread
cargo xtask test unit suite node-package --backend cpu
cargo xtask test unit suite python-package --backend cpu
cargo xtask test smoke suite example-node --backend cpu
cargo xtask test smoke suite example-gateway --backend cpu --case query
cargo xtask test smoke suite playground-browser
cargo xtask test smoke group examples --backend cpu
cargo xtask test smoke group local-model --backend cpu
cargo xtask test smoke group full --backend cpu
cargo xtask test verify --target whitebox
cargo xtask test verify --changed

test unit owns deterministic tests. It is split into explicit namespaces:

  • test unit suite <name> runs exactly one deterministic unit suite.
  • test unit group <name> runs a named bundle of deterministic unit suites.

Unit suite names expose suite-specific options, such as test unit suite rust-crates --package <crate> and test unit suite node-package --backend cpu.

Unit Suites

| Command | What runs | Code location |
| --- | --- | --- |
| cargo xtask test unit suite xtask | xtask CLI and orchestration tests | xtask/src/tests |
| cargo xtask test unit suite rust-crates | Workspace crate unit tests | crates, lib/gateway, apps |
| cargo xtask test unit suite rust-bindings | Rust binding crate unit tests | bindings/node, bindings/python, bindings/wasm |
| cargo xtask test unit suite browser | Browser TypeScript tests | lib/web/tests |
| cargo xtask test unit suite demos | Browser demo TypeScript tests | demos |
| cargo xtask test unit suite api | Crate-level public API integration tests | crates/sipp/tests |
| cargo xtask test unit suite cli | CLI black-box integration tests | apps/cli/tests |
| cargo xtask test unit suite node-package | Deterministic Node package API tests | lib/node, bindings/node |
| cargo xtask test unit suite python-package | Deterministic Python package API tests | lib/python, bindings/python |

Unit Groups

| Command | Suites |
| --- | --- |
| cargo xtask test unit group whitebox | xtask, rust-crates, rust-bindings, browser, and demos |
| cargo xtask test unit group interface | api, cli, node-package, and python-package |
| cargo xtask test unit group full | Every deterministic unit suite |

Browser and demo unit suites accept --wasm-threading single-thread|pthread|all. CI uses single-thread to keep source validation fast. Release package builds continue to use cargo xtask build wasm, whose default is all.

test smoke owns holistic integration checks. It is split into explicit namespaces:

  • test smoke suite <name> runs exactly one smoke suite.
  • test smoke group <name> runs a named bundle of smoke suites.

Model-backed smoke suites default to the setup sample model cache under .build/models when --model is omitted. The Rust, Node, Python, gateway, and browser example smoke suites accept repeated --case query|chat|embed. Embedding cases require a model/runtime that reports embedding support.

Smoke Suites

| Command | What runs | Code location |
| --- | --- | --- |
| cargo xtask test smoke suite cli | Staged local CLI generation smoke | apps/cli |
| cargo xtask test smoke suite example-rust | Rust query/chat/embed examples | examples/rust |
| cargo xtask test smoke suite example-node | Node query.mjs/chat.mjs/embed.mjs examples | examples/node |
| cargo xtask test smoke suite example-python | Python query.py/chat.py/embed.py examples | examples/python |
| cargo xtask test smoke suite example-gateway | Embedded local gateway proxy plus Rust/Node/Python local-and-gateway clients | examples/gateway, examples/rust, examples/node, examples/python |
| cargo xtask test smoke suite example-browser | Browser query.html/chat.html/embed.html examples through Playwright | examples/web |
| cargo xtask test smoke suite playground-browser | Browser playground runtime smoke through Playwright | tools/playground |
| cargo xtask test smoke suite llama-backend-ops | llama.cpp backend operation correctness smoke | crates/sys/llama.cpp |

Smoke Groups

| Command | Suites |
| --- | --- |
| cargo xtask test smoke group examples | example-rust, example-node, example-python, example-gateway, and example-browser |
| cargo xtask test smoke group local-model | cli, example-rust, example-node, and example-python |
| cargo xtask test smoke group full | Every smoke suite, including playground, gateway, and llama checks |

Use cargo xtask run examples serve browser to manually serve browser examples. Use cargo xtask run examples serve gateway-local --model <model.gguf> to serve the minimal local gateway proxy. Provider-backed and production serving use apps/gateway-server; validate its configuration with sipp run gateway-server check --config <path> and use raw Docker commands from Gateway Docker for container testing. Use Gateway Testing for curl and Postman checks. Playground validation remains under test smoke suite playground-browser.

test unit and test smoke print a final suite and test/check summary, then write .build/test/run-report.json and .build/test/run-report.md. Coverage-capable unit suites also write fresh coverage artifacts under .build/coverage/.

test verify does not execute test suites. It validates test structure, catalog ownership, test/runtime code separation, optional changed-file coverage, and existing coverage artifacts.

Package Locations

  • lib/web publishes @noumena-labs/sipp and public @sipp/sipp.
  • lib/node publishes @noumena-labs/sipp-server and public @sipp/sipp-server.
  • lib/python publishes Python sipp.
  • crates/sipp publishes the Rust package sipp-rs with library crate sipp.

Coverage

Sipp coverage is driven through the same test catalog used by cargo xtask test list. General test command guidance lives in testing.md.

Commands

cargo xtask test list
cargo xtask test list --group unit --layer whitebox --cases --format json
cargo xtask test unit group whitebox
cargo xtask test verify --target whitebox
cargo xtask test verify --target node
cargo xtask test verify --changed

test unit is the command that executes deterministic coverage-capable suites and creates fresh coverage data. Rust writes coverage through cargo-llvm-cov, Node writes coverage through c8, and Python writes coverage through pytest-cov.

test verify defaults to all coverage-capable unit suites. It does not execute test suites, build bindings, download models, or run smoke tests. Use --target to narrow which existing coverage artifacts are analyzed. Explicitly selecting a unit target that is not coverage-capable fails with a clear error.

--changed validates that changed first-party source files owned by the selected unit suites have matching changed tests owned by the same catalog suites. test verify also checks catalog ownership and test/runtime code separation so tests do not live inside runtime source files.
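
To make the --changed rule concrete, here is a toy model of it, written under stated assumptions rather than taken from the xtask implementation. It assumes each suite owns a set of source roots and a set of test roots, and flags a suite when a source file under its roots changed but no test file under its test roots did. The testRoots key, the helper names, and the file names in the sample are all hypothetical.

```python
def under(path: str, roots: list[str]) -> bool:
    """Return True when `path` is one of `roots` or lives below one of them."""
    return any(
        path == root or path.startswith(root.rstrip("/") + "/") for root in roots
    )


def suites_missing_tests(changed: list[str], suites: list[dict]) -> list[str]:
    """List suite ids whose source changed without any changed test file."""
    missing = []
    for suite in suites:
        tests_changed = any(under(p, suite["testRoots"]) for p in changed)
        source_changed = any(
            under(p, suite["sourceRoots"]) and not under(p, suite["testRoots"])
            for p in changed
        )
        if source_changed and not tests_changed:
            missing.append(suite["id"])
    return missing


# Hypothetical ownership data; real ownership comes from the test catalog.
SUITES = [
    {"id": "cli", "sourceRoots": ["apps/cli"], "testRoots": ["apps/cli/tests"]},
    {"id": "api", "sourceRoots": ["crates/sipp"], "testRoots": ["crates/sipp/tests"]},
]

if __name__ == "__main__":
    changed = [
        "apps/cli/src/main.rs",
        "crates/sipp/src/lib.rs",
        "crates/sipp/tests/query.rs",
    ]
    print(suites_missing_tests(changed, SUITES))
```

The real check may match source files to tests more precisely than this root-level comparison; treat the sketch as a way to reason about the rule, not as its definition.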

test list --format json is the stable catalog surface used by CI and contributors. Each suite entry includes id, group, layer, description, requirements, sourceRoots, backendPolicy, coverage, and caseDiscovery. Use --cases when a tool needs discoverable files and case names that map to the suite runner.
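
A tool that consumes the catalog might filter suite entries by the documented group and layer fields. The sketch below assumes the JSON decodes to a list of suite entries and that group and layer hold the same strings accepted by --group and --layer; neither detail is confirmed here, and the sample entries are hypothetical and abbreviated.

```python
import json

# Hypothetical, abbreviated sample. Real entries also carry description,
# requirements, sourceRoots, backendPolicy, coverage, and caseDiscovery.
SAMPLE = """
[
  {"id": "rust-crates", "group": "unit", "layer": "whitebox"},
  {"id": "cli", "group": "unit", "layer": "interface"},
  {"id": "example-node", "group": "smoke", "layer": "examples"}
]
"""


def suite_ids(entries: list[dict], group: str, layer: str) -> list[str]:
    """Return the ids of entries that match both the group and the layer."""
    return [
        entry["id"]
        for entry in entries
        if entry.get("group") == group and entry.get("layer") == layer
    ]


if __name__ == "__main__":
    entries = json.loads(SAMPLE)
    print(suite_ids(entries, group="unit", layer="whitebox"))
```

In practice the input would come from cargo xtask test list --format json rather than an inline string; adjust the top-level access if the real output wraps the entries in an object.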

Tools

Coverage reporting uses the tools required by the selected report areas:

  • cargo-llvm-cov for Rust/native execution and report rendering.
  • c8 for Node wrapper coverage during test unit suite node-package.
  • pytest-cov for Python wrapper coverage during test unit suite python-package.

test verify only reads existing coverage artifacts and renders summaries from them.

Outputs

Reports are written under .build/coverage/:

  • rust/lcov.info and rust/html/
  • node/lcov.info
  • python/lcov.info, python/cobertura.xml, and python/html/
  • baseline.json
  • coverage-summary.md
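
The lcov.info files listed above use the standard LCOV tracefile format, where each per-file record carries LF (lines found) and LH (lines hit) totals. As a minimal sketch for inspecting one of these artifacts by hand, the following sums those totals; the function name and the file paths inside the sample are hypothetical.

```python
def lcov_line_totals(text: str) -> tuple[int, int]:
    """Sum the LF (lines found) and LH (lines hit) records in an LCOV tracefile."""
    found = hit = 0
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("LF:"):
            found += int(line[3:])
        elif line.startswith("LH:"):
            hit += int(line[3:])
    return found, hit


# Hypothetical two-file tracefile; the SF paths are illustrative only.
SAMPLE = """\
SF:crates/sipp/src/lib.rs
DA:1,1
DA:2,0
LF:2
LH:1
end_of_record
SF:crates/sipp/src/client.rs
DA:10,3
LF:1
LH:1
end_of_record
"""

if __name__ == "__main__":
    found, hit = lcov_line_totals(SAMPLE)
    print(f"{hit}/{found} lines hit")
    # For a real artifact, read .build/coverage/rust/lcov.info instead of SAMPLE.
```

coverage-summary.md remains the rendered summary; a script like this is only useful for ad hoc checks against a single area.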

Test command reports are written under .build/test/:

  • run-report.json and run-report.md
  • verify-report.json and verify-report.md

The baseline includes first-party crates/ and bindings/ code. It intentionally excludes generated outputs, caches, tests, examples, third_party/, and the vendored crates/sys/llama.cpp/ tree.

Policy

The current implementation records the baseline and does not fail on percentage thresholds. It does fail when an enabled coverage area produces an empty first-party report. Thresholds should be added after the baseline is stable and the largest uncovered first-party areas are addressed.
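
The policy can be pictured as a check that ignores percentages and only rejects empty areas. The sketch below is an illustration, not the xtask code: it assumes "empty" means an area reported zero first-party lines, and the function name and sample numbers are hypothetical.

```python
def empty_areas(totals: dict[str, tuple[int, int]]) -> list[str]:
    """Return enabled coverage areas whose report contains no lines.

    `totals` maps an area name to (lines_found, lines_hit). Low coverage is
    not a failure under the current policy; only an empty report is.
    """
    return sorted(area for area, (found, _hit) in totals.items() if found == 0)


if __name__ == "__main__":
    # Hypothetical numbers: python has very low coverage but is not empty.
    totals = {"rust": (120, 80), "node": (0, 0), "python": (40, 1)}
    failures = empty_areas(totals)
    if failures:
        raise SystemExit(f"empty coverage report for: {', '.join(failures)}")
```

A later threshold policy would add a percentage comparison alongside this check rather than replace it.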

Contributing

Sipp is a polyglot monorepo. Keep contributions focused, documented, and validated with the narrowest useful commands.

Before submitting issues or PRs, be ready to explain why the change matters and how it works. AI-assisted coding is fine, including agent-generated drafts, but the author is responsible for reviewing, understanding, and maintaining the final change.

Identify The Why

For issues and feature requests, explain the problem, who it affects, and how it could affect the system. This helps maintainers evaluate the priority and choose the right implementation path.

Explain The How

For PRs, describe what changed and how the implementation works. If you cannot explain the behavior, risks, and validation, revisit the change before asking for review.

Communication

Use your own words in issues and PRs. Keep the main message concise, then add supporting detail only when it helps reviewers understand the change.

For each issue or PR:

  • Explain why it matters.
  • Describe what changed.
  • Keep the scope atomic.
  • Avoid unrelated cleanup.

Before Editing

  • Read the root README and the relevant package or app README.
  • Use cargo xtask test list to inspect available validation targets.
  • Use cargo xtask commands for builds and long-running workflows.
  • Avoid changing vendored files under third_party/ or crates/sys/llama.cpp/ unless the task is explicitly about the vendor source.

Documentation Changes

  • Keep README files short and task-oriented.
  • Put detailed guides and references in this mdBook.
  • Prefer examples that can be copied and run from a clean checkout.
  • Update docs when public APIs, package behavior, commands, or configuration change.

Validation

For documentation-only changes:

sipp docs build
cargo xtask test list
cargo xtask test verify --target public-docs

For code changes, use the narrowest relevant test target from Testing. Run broader suites only when the change crosses package or runtime boundaries.

Known Issues

This page tracks current issues that users may hit when running Sipp.

Browser Pulse Animations Can Reduce WebGPU Decode Throughput

Status: open.

Continuous pulse animations in the page can slow down browser-local inference, especially WebGPU decode throughput. This has been observed as lower tokens-per-second while a demo or app is rendering pulsing UI or scene effects during generation.

Affected surface:

  • Browser-local inference through the @sipp/sipp browser package.
  • Demos or applications that keep pulse animations active while the model is decoding.

Workarounds:

  • Disable or pause pulse animations while a request is decoding.
  • Prefer static state indicators or lower-frequency updates during generation.
  • Test browser inference performance with visual animations disabled before comparing backend or model throughput.

Hybrid Graphics Laptops May Pick The Integrated GPU

Status: open.

On Windows laptops with both integrated and discrete graphics, the browser may choose the integrated GPU for WebGPU. Browser-local inference still runs, but decode throughput can be much lower than expected.

Workaround:

  1. Open Windows Settings.
  2. Go to System > Display > Graphics.
  3. Add the browser executable you use for Sipp, such as Chrome, Edge, or another Chromium-based browser.
  4. Set that browser to High performance.
  5. Restart the browser and reload the Sipp page.

This setting is more reliable than browser flags because it tells Windows which GPU the browser process should prefer.