Sipp Documentation
Sipp packages local and gateway-backed inference runtimes for browser,
Node.js, Python, and Rust applications. The project is organized around one
client model: register local and remote endpoints with SippClient.add, keep
the returned endpoint reference, and choose that reference for query, chat,
or embed.
This book starts with the published packages that application developers use. Source checkout, build orchestration, repository architecture, and contribution workflow live in the maintainer section.
Warning
Sipp is under active development, and changes land frequently. If you find a bug or need a feature, please raise it on GitHub or in the Discord server.
Start Here
- Roadmap outlines the engineering milestones, memory architectures, and long-term vision.
- Installation lists the published package install commands.
- Quickstarts shows short Browser, Node.js, Python, Rust, and gateway paths.
- Using the Core Library describes the public package surfaces in depth.
- Gateway explains the first-party server, Docker workflows, configuration, testing, operations, toolkit, and architecture.
- Frameworks covers Next.js, TanStack, and React/Vite integration patterns.
- Gateway And Hybrid Inference explains when to use local endpoints, gateway endpoints, and provider endpoints.
- Maintainers covers source builds, tests, repo structure, and contribution workflow.
Build The Book Locally
Use sipp docs from a source checkout:
sipp docs build
sipp docs serve
sipp docs build installs mdbook and mdbook-mermaid when missing, extracts
the bundled Mermaid JavaScript assets, and writes the generated book to
book/. If the sipp launcher is not active, use cargo xtask docs ...
with the same arguments.
Sipp Technical Roadmap
This document outlines the engineering milestones and long-term research initiatives for Sipp Core, Sipp Gateway, and Sipp Platform.
Sipp is built around three core ideas: maximizing privacy-preserving inference, low-latency interactions, and high-performance compute across the edge and cloud.
The current core library has a powerful WebGPU backend for running models in-browser, as well as bare-metal GPU support for CUDA or Vulkan when running on device or server. We see the future of AI as hybrid, with edge-native AI processing and cloud-based AI processing working together seamlessly.
Research 1: Sipp Core: The Local Runtime Library
Sipp Core is built to be a high-performance powerhouse for running inference locally, either on bare-metal GPU/NPU or via WebGPU for browser-based applications. It is built on a foundation of llama.cpp with a custom C++ and Rust runtime layer.
Key Initiatives
- Edge-Native Local RAG & Memory Optimization: Integrate an in-memory, zero-dependency vector database (compiled directly to WASM) into the client SDK. This enables developers to run fully local vector searches, embed conversational state, and execute document retrievals with zero external API dependencies or cloud database costs.
- Full-Spectrum Client Support (Apps, Web, and Games): Sipp currently supports browsers through WebGPU and desktops through the CUDA and Vulkan backends. The next phase expands backend support for hardware-accelerated inference across web, desktop, and mobile devices. This includes:
  - Desktop & Mobile Wrappers: Expand native compilation targets for Electron and Tauri apps, exposing direct access to NVIDIA CUDA, Metal, and Vulkan.
  - Gaming Runtimes: Lightweight SDK integration frameworks for Unreal Engine and Unity to support local, low-latency AI agents inside application loops.
- Cross-Site & Cross-App Persistence Caching: Standard browser sandboxing isolates cache stores to individual web origins. We aim to solve this with a lightweight, local background desktop daemon written in Rust. This daemon serves as a centralized, secure model registry mirror. If a user visits an Electron app or a website that uses a specific model, the local runtime fetches it instantly from the daemon’s cache instead of re-downloading gigabytes of weights.
- Client-Side Local Contextual Routing: A local query may not always produce good enough results, in which case re-routing to a cloud or provider model is needed. When that should happen, and how a query could be split apart, remain open questions. We believe the answer is a hyper-lightweight, client-side small language model (sLLM) that makes those decisions dynamically. We see two applications:
  - PII/PPI Stripping and Masking: A local model intercepts text inputs to detect and strip Protected Personal Information (PPI) or Personally Identifiable Information (PII), replacing sensitive entities with secure local tokenized hashes before any cloud handoff occurs.
  - Contextual Query Splitting: The local engine analyzes incoming chats to determine which components can be handled instantly on the edge (e.g., immediate structural formatting, basic data verification) and which must be escalated to the cloud, dynamically stitching cloud completions back into the interface as they stream down.
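The masking step described above can be sketched with plain pattern matching. This is a minimal illustration, not Sipp's implementation: the roadmap proposes a local model for detection, whereas this sketch swaps in two regular expressions (email addresses and simple phone numbers). The names `mask_pii` and `unmask`, the token format, and the `salt` parameter are all hypothetical.

```python
import hashlib
import re
from typing import Dict, Tuple

# Illustrative patterns only; a real detector would need far broader coverage.
# One combined pattern is used so that inserted tokens are never re-scanned.
PII_RE = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<PHONE>\+?\d[\d\s().-]{7,}\d)"
)


def mask_pii(text: str, salt: str = "local-salt") -> Tuple[str, Dict[str, str]]:
    """Replace matched entities with stable tokens; return the masked text and a local lookup."""
    lookup: Dict[str, str] = {}

    def _sub(match: "re.Match[str]") -> str:
        kind = match.lastgroup  # "EMAIL" or "PHONE"
        value = match.group(0)
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]
        token = f"<{kind}:{digest}>"
        lookup[token] = value  # kept on the device; only the token leaves it
        return token

    return PII_RE.sub(_sub, text), lookup


def unmask(text: str, lookup: Dict[str, str]) -> str:
    """Restore original values in a response that echoes the tokens."""
    for token, value in lookup.items():
        text = text.replace(token, value)
    return text
```

In this sketch the lookup table stays in local memory, so a cloud completion that repeats a token can be restored on the client with `unmask` before display.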
Research 2: The Gateway Server (The Orchestration & Interception Layer)
The open-source Gateway Server serves as an autonomous “API Fortress” that acts as a secure, high-performance middleware layer between client networks and cloud endpoints.
                  ┌──────────────────────────────┐
                  │ Client Submits Prompt to GW  │
                  └──────────────┬───────────────┘
                                 │
                                 ▼
                  ┌──────────────────────────────┐
                  │ Preemptive Middle-Layer Cache│
                  │   (Vector & KV Intercept)    │
                  └──────────────┬───────────────┘
                                 │
            ┌────────────────────┴────────────────────┐
            ▼ (Cache Hit / Guided Path)               ▼ (Cache Miss)
┌───────────────────────┐                 ┌───────────────────────┐
│  Route to Endpoint X  │                 │  Route to Endpoint Y  │
│ (Low-cost/Fast Stream)│                 │ (Deep Processing/MoE) │
└───────────────────────┘                 └───────────────────────┘
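The branch in the diagram can be sketched as a small routing function. This is an illustration under stated assumptions, not the gateway's design: the roadmap describes a vector and KV intercept, while this sketch swaps in an exact-match cache keyed by a hash of the normalized prompt. `CacheRouter`, `footprint`, and the endpoint names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


def footprint(prompt: str) -> str:
    """Hypothetical request footprint: hash of the lowercased, whitespace-normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class CacheRouter:
    fast_endpoint: str = "endpoint-x"  # low-cost / fast stream
    deep_endpoint: str = "endpoint-y"  # deep processing
    _seen: Set[str] = field(default_factory=set, init=False, repr=False)

    def route(self, prompt: str) -> str:
        key = footprint(prompt)
        if key in self._seen:
            return self.fast_endpoint  # cache hit: guided path
        self._seen.add(key)            # remember the footprint for next time
        return self.deep_endpoint      # cache miss: full processing
```

A semantic cache would replace `footprint` with an embedding lookup and a similarity threshold; the hit/miss branch itself would stay the same.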
Key Initiatives
- Gateway-Level Vector Memory & RAG Interception: The gateway implements an internal, stateful vector index layer to handle server-side memory optimization. It caches semantic embeddings of historical document fragments and prior system queries. When a client submits a prompt, the gateway performs a preemptive vector evaluation to determine whether a relevant context match exists, bypassing the need to repeatedly re-fetch or re-encode massive RAG documents from central cloud instances.
- Preemptive Middle-Layer Caching: In tandem with vector storage, the gateway features a stateful intermediate cache layer designed to intercept incoming requests before they hit large upstream models. If a cached structural completion matches the incoming footprint, the gateway can reroute traffic conditionally (e.g., “If cache footprint exists, route to fast Endpoint X; if not, route to reasoning Endpoint Y”).
- Persistent Admin Control Dashboard: Expand the gateway dashboard and admin UI to visualize active routes, manage cryptographic client application identities, view live input/output token allocation metrics, manually map model fallback rules, and more.
- Token-Aware Traffic Shaping: Implement token-bucket rate limiters directly inside the networking wrapper to monitor and throttle users based on their exact token throughput footprint, protecting downstream clusters from malicious execution loops or unexpected API bills.
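A token-bucket limiter of the kind named in the last item can be sketched in a few lines. This is a generic version of the technique, not the gateway's implementation: `TokenBucket`, its field names, and the choice to charge the bucket in model tokens per request are assumptions of the sketch.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    capacity: float           # largest burst of model tokens a client may spend
    refill_per_second: float  # sustained model-token throughput
    updated_at: float = field(default_factory=time.monotonic)
    tokens: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def try_consume(self, cost: float, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend `cost` if the bucket can cover it."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # caller should throttle or reject the request
```

A gateway would typically keep one bucket per client identity and call `try_consume` with the request's estimated token count before forwarding it upstream.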
Getting Started
Start here when adding Sipp to an application from published packages. Maintainer source builds are covered separately.
- Installation lists the Browser, Node.js, Python, and Rust package install commands.
- Quickstarts gives minimal local and gateway client examples.
- Models And Backends explains GGUF model expectations and backend selection.
- Source Builds covers checkout setup, sipp, cargo xtask, examples, and demos for maintainers.
Installation
Install the published package for the runtime your application uses. All
public client packages use the same endpoint model: register an endpoint, keep
the returned endpoint reference, and choose that endpoint for query, chat,
or embed.
Package Installs
| Surface | Install | Use for |
|---|---|---|
| Browser | npm install @sipp/sipp | Browser-local GGUF inference and browser gateway clients. |
| Node.js | npm install @sipp/sipp-server | Server-side local inference and framework route handlers. |
| Python | pip install sipppy | Python scripts, services, and gateway clients. |
| Python CUDA | GitHub release wheel | Python local inference with CUDA backend wheels. |
| Python Vulkan | pip install "sipppy[vulkan]" | Python local inference with Vulkan backend wheels. |
| Python Metal | pip install "sipppy[metal]" | Python local inference with Metal backend wheels on macOS. |
| Rust | cargo add sipp-rs | Rust applications and services. |
The current release workflow publishes browser npm, Node npm, Python wheels,
and Rust crates. It does not yet publish a standalone gateway-server
binary, container image, or cargo install target. Use the source checkout and
Dockerfile when deploying the gateway server until a public server artifact is
added.
Runtime Requirements
- Local inference needs a compatible GGUF model file or browser-served GGUF asset.
- Python wheels require Python 3.10 or newer.
- Browser-local inference needs a modern browser with WebAssembly support; WebGPU acceleration depends on the browser and device. For details, please refer to Gateway.
- Node installs use @sipp/sipp-server; npm resolves the matching optional platform binary package automatically. Python installs use the sipppy wheel (imported as sipp) for CPU and extras such as sipppy[cuda] for GPU backend wheels; the sipppy wheels currently ship from GitHub Releases while the full PyPI build matrix is in progress (see the Python package page). Use SIPP_NODE_BACKEND or SIPP_PYTHON_BACKEND when you need to force cpu, vulkan, cuda, or metal.
- Gateway clients need only the gateway base URL, public target name, and application-owned authentication value.
Next Steps
- sipp CLI for source checkouts
- Browser package
- Node.js package
- Python package
- Rust package
- Gateway
- Maintainer source builds
Quickstarts
These snippets show the public call shapes for query, chat, and embed.
query sends the exact prompt string and never applies a chat template. A
plain prompt is only for completion-style/base models; for decoder-only chat or
instruct GGUFs, render the model’s template yourself. Local query also supports encoder-decoder GGUF text models. chat sends role-tagged
messages. embed returns vectors and needs an embedding-capable local model
loaded with embedding mode enabled.
Local context naming differs only by language casing: browser and Node.js use
contextKey; Python and Rust use context_key.
See Examples And Demos for runnable end-to-end files.
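Because query never applies a chat template, the caller assembles the prompt string. The sketch below shows one way to do that from role-tagged messages. It assumes the placeholder markers used in the snippets on this page (`<|system|>`, `<|user|>`, `<|assistant|>`); a real model defines its own template, and `render_prompt` is a hypothetical helper, not part of any Sipp package.

```python
from typing import Iterable, List, Mapping

# Placeholder markers copied from the snippets on this page; substitute the
# markers that the target model's template actually defines.
ROLE_MARKERS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def render_prompt(messages: Iterable[Mapping[str, str]]) -> str:
    """Flatten role-tagged messages into one raw prompt string for query."""
    lines: List[str] = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_MARKERS:
            raise ValueError(f"unsupported role: {role}")
        lines.append(ROLE_MARKERS[role])
        lines.append(message["content"])
    # Trailing assistant marker: the model continues generating from here.
    lines.append(ROLE_MARKERS["assistant"])
    return "\n".join(lines)
```

The returned string has the same shape as the hand-written query prompts in the quickstarts that follow.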
Browser Local
npm install @sipp/sipp
import { SippClient, type ChatMessage } from '@sipp/sipp';
const client = new SippClient();
const messages: readonly ChatMessage[] = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain local browser inference.' },
];
const queryPrompt = [
'<|system|>',
'Answer concisely.',
'<|user|>',
'Explain local browser inference.',
'<|assistant|>',
].join('\n');
const textEndpoint = await client.add('text', {
kind: 'local',
source: '/models/chat.gguf',
options: { backend: 'webgpu', runtime: { context: { n_ctx: 2048 } } },
});
// query: raw prompt; replace markers with the target model's template.
const query = await client.query(queryPrompt, {
endpoint: textEndpoint,
maxTokens: 64,
contextKey: 'browser-query',
}).response;
// chat: role messages; local runtime uses tokenizer.chat_template.
const chat = await client.chat(messages, {
endpoint: textEndpoint,
maxTokens: 64,
contextKey: 'browser-chat',
}).response;
const embedEndpoint = await client.add('embed', {
kind: 'local',
source: '/models/embed.gguf',
options: {
backend: 'webgpu',
runtime: { context: { n_ctx: 2048, embeddings: true, pooling: 'mean' } },
},
});
// embed: vector output; local endpoint must be embedding-capable.
const embedding = await client.embed('Sipp embedding input.', {
endpoint: embedEndpoint,
contextKey: 'browser-embed',
normalize: true,
}).response;
console.log(query.text, chat.text, embedding.values.length);
await client.close();
Node.js Local
npm install @sipp/sipp-server
import { SippClient } from '@sipp/sipp-server';
const client = new SippClient();
const messages = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain local Node.js inference.' },
];
const queryPrompt = [
'<|system|>',
'Answer concisely.',
'<|user|>',
'Explain local Node.js inference.',
'<|assistant|>',
].join('\n');
const textOptions = { maxTokens: 64 };
const textModel = process.argv[2] ?? 'chat.gguf';
const embedModel = process.argv[3] ?? 'embed.gguf';
const textEndpoint = await client.add('text', {
kind: 'local',
modelPath: textModel,
config: { context: { n_ctx: 2048 } },
});
// query: raw prompt; replace markers with the target model's template.
const query = await client.query({
endpoint: textEndpoint,
prompt: queryPrompt,
options: textOptions,
local: { contextKey: 'node-query' },
}).response;
// chat: role messages; local runtime uses tokenizer.chat_template.
const chat = await client.chat({
endpoint: textEndpoint,
messages,
options: textOptions,
local: { contextKey: 'node-chat' },
}).response;
const embedEndpoint = await client.add('embed', {
kind: 'local',
modelPath: embedModel,
config: { context: { n_ctx: 2048, embeddings: true, pooling: 'mean' } },
});
// embed: vector output; local endpoint must be embedding-capable.
const embedding = await client.embed({
endpoint: embedEndpoint,
input: 'Sipp embedding input.',
local: { contextKey: 'node-embed', normalize: true },
}).response;
console.log(query.text, chat.text, embedding.values.length);
Local query also supports encoder-decoder GGUF text models, while many
encoder-decoder models cannot use chat because they do not declare
tokenizer.chat_template. Encoder-decoder text models do not produce
embeddings through this runtime.
Python Local
# The sipppy CUDA wheel is currently published via GitHub Releases; the full release matrix is in progress.
pip install sipppy
from sipp import (
ChatMessage,
SippClient,
SippTextOptions,
ContextRuntimeConfig,
LocalEmbedOptions,
LocalTextOptions,
LocalModelDescriptor,
NativeRuntimeConfig,
)
client = SippClient()
messages = [
ChatMessage("system", "Answer concisely."),
ChatMessage("user", "Explain local Python inference."),
]
query_prompt = "\n".join(
[
"<|system|>",
"Answer concisely.",
"<|user|>",
"Explain local Python inference.",
"<|assistant|>",
]
)
text_options = SippTextOptions(max_tokens=64)
text_endpoint = client.add("text", LocalModelDescriptor("chat.gguf"))
# query: raw prompt; replace markers with the target model's template.
query = client.query(
query_prompt,
endpoint=text_endpoint,
options=text_options,
local=LocalTextOptions(context_key="python-query"),
).result()
# chat: role messages; local runtime uses tokenizer.chat_template.
chat = client.chat(
messages,
endpoint=text_endpoint,
options=text_options,
local=LocalTextOptions(context_key="python-chat"),
).result()
embed_endpoint = client.add(
"embed",
LocalModelDescriptor(
"embed.gguf",
NativeRuntimeConfig(
context=ContextRuntimeConfig(
n_ctx=2048,
embeddings=True,
pooling="mean",
),
),
),
)
# embed: vector output; local endpoint must be embedding-capable.
embedding = client.embed(
"Sipp embedding input.",
endpoint=embed_endpoint,
local=LocalEmbedOptions(context_key="python-embed", normalize=True),
).result()
print(query["text"], chat["text"], len(embedding["values"]))
Rust Local
cargo add sipp-rs
#![allow(unused)]
fn main() {
use sipp::engine::{
ChatMessage, ChatRole, ContextRuntimeConfig, NativeRuntimeConfig, PoolingType,
};
use sipp::{
SippChatRequest, SippClient, SippEmbedRequest, SippQueryRequest,
SippTextOptions, EndpointDescriptor, LocalEmbedOptions, LocalTextOptions,
};
let mut client = SippClient::new();
let messages = vec![
ChatMessage::new(ChatRole::System, "Answer concisely."),
ChatMessage::new(ChatRole::User, "Explain local Rust inference."),
];
let query_prompt = [
"<|system|>",
"Answer concisely.",
"<|user|>",
"Explain local Rust inference.",
"<|assistant|>",
]
.join("\n");
let text_options = SippTextOptions {
max_tokens: Some(64),
..Default::default()
};
let text_endpoint = client
.add("text", EndpointDescriptor::local("chat.gguf", Default::default()))
.await?;
// query: raw prompt; replace markers with the target model's template.
let query = client
.query(SippQueryRequest {
endpoint: Some(text_endpoint.clone()),
prompt: query_prompt,
options: text_options.clone(),
local: LocalTextOptions {
context_key: Some("rust-query".to_string()),
..Default::default()
},
..Default::default()
})
.await?;
// chat: role messages; local runtime uses tokenizer.chat_template.
let chat = client
.chat(SippChatRequest {
endpoint: Some(text_endpoint),
messages,
options: text_options,
local: LocalTextOptions {
context_key: Some("rust-chat".to_string()),
..Default::default()
},
..Default::default()
})
.await?;
let embed_endpoint = client
.add("embed", EndpointDescriptor::local("embed.gguf", embed_config()))
.await?;
// embed: vector output; local endpoint must be embedding-capable.
let embedding = client
.embed(SippEmbedRequest {
endpoint: Some(embed_endpoint),
input: "Sipp embedding input.".to_string(),
local: LocalEmbedOptions {
context_key: Some("rust-embed".to_string()),
normalize: Some(true),
},
..Default::default()
})
.await?;
println!("{}, {}, {}", query.text, chat.text, embedding.values.len());
fn embed_config() -> NativeRuntimeConfig {
NativeRuntimeConfig {
context: ContextRuntimeConfig {
n_ctx: Some(2048),
embeddings: Some(true),
pooling: Some(PoolingType::Mean),
..Default::default()
},
..Default::default()
}
}
}
Gateway
Gateway clients keep model paths, provider credentials, target policy, and metrics in the gateway process. The example uses the browser package shape; Node.js uses the same request-object shape shown above.
import { SippClient, type ChatMessage } from '@sipp/sipp';
const client = new SippClient();
const endpoint = await client.add('gateway', {
kind: 'gateway',
target: 'local',
baseUrl: 'https://gateway.example.com',
authentication: { kind: 'bearer', value: await getGatewayToken() },
});
const messages: readonly ChatMessage[] = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain gateway inference.' },
];
const queryPrompt = [
'<|system|>',
'Answer concisely.',
'<|user|>',
'Explain gateway inference.',
'<|assistant|>',
].join('\n');
// query: gateway forwards the raw prompt to the selected target.
const query = await client.query(queryPrompt, {
endpoint,
maxTokens: 64,
}).response;
// chat: gateway maps role messages for the selected provider/local target.
const chat = await client.chat(messages, { endpoint, maxTokens: 64 }).response;
// embed: target must support embeddings.
const embedding = await client.embed('Sipp embedding input.', {
endpoint,
}).response;
console.log(query.text, chat.text, embedding.values.length);
await client.close();
Gateway query preserves the raw prompt, so it is the gateway path for custom
templates or local encoder-decoder targets. Gateway embed requires the target
to support embeddings.
Direct Provider
Use direct provider endpoints only in trusted server code (for example, a self-hosted service). Provider support is
model-specific: query needs a completion-compatible provider or model,
chat needs a chat model, and embed needs an embedding model.
import { SippClient } from '@sipp/sipp-server';
function env(name: string): string {
const value = process.env[name];
if (value == null || value === '') {
throw new Error(`${name} is required`);
}
return value;
}
const client = new SippClient();
const chatMessages = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain provider inference.' },
];
const completionEndpoint = await client.add('completion', {
kind: 'provider',
provider: 'openai_compatible',
model: env('COMPLETION_MODEL'),
baseUrl: env('COMPLETION_BASE_URL'),
apiKey: env('COMPLETION_API_KEY'),
});
const chatEndpoint = await client.add('chat', {
kind: 'provider',
provider: 'openai',
model: env('OPENAI_CHAT_MODEL'),
apiKey: env('OPENAI_API_KEY'),
});
const embedEndpoint = await client.add('embed', {
kind: 'provider',
provider: 'openai',
model: env('OPENAI_EMBED_MODEL'),
apiKey: env('OPENAI_API_KEY'),
});
// query: raw completion prompt for a completion-compatible provider.
const query = await client.query({
endpoint: completionEndpoint,
prompt: 'Write one provider inference sentence.',
options: { maxTokens: 64 },
}).response;
// chat: provider-native role messages.
const chat = await client.chat({
endpoint: chatEndpoint,
messages: chatMessages,
options: { maxTokens: 64 },
}).response;
// embed: provider-native embedding model.
const embedding = await client.embed({
endpoint: embedEndpoint,
input: 'Sipp embedding input.',
}).response;
console.log(query.text, chat.text, embedding.values.length);
Runtime Tuning
Local endpoint tuning, browser WebGPU options, worker/threading choices, generation options, and provider/gateway option buckets are documented in Runtime Options.
Building and Running from Source Code
Runnable source examples and demos live in the maintainer lane: Source Builds.
Models And Backends
Sipp local inference uses GGUF model files. Text workflows need a text GGUF model, embedding workflows need a model that reports embedding support, and vision chat workflows need both a model GGUF and a projector GGUF.
Model Sources
For local package usage, pass an explicit GGUF model path in Node.js, Python, or Rust, or serve a GGUF model URL to browser code:
- Browser: source: '/models/model.gguf'
- Node.js: modelPath: '/path/to/model.gguf'
- Python: LocalModelDescriptor('/path/to/model.gguf')
- Rust: EndpointDescriptor::local(model_path, config)
Source examples and smoke workflows can use a cached sample model under
.build/models; see Source Builds.
Native Backends
Backend names are shared across build and runtime selection:
- cpu: portable default backend.
- vulkan: GPU backend for Vulkan-capable systems.
- cuda: NVIDIA CUDA backend.
- metal: Apple Metal backend on macOS.
Runtime selection is package-specific:
- Node.js: SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal
- Python: SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal
- CLI: --backend auto|cpu|cuda|metal|vulkan
Leave runtime backend variables unset for automatic selection.
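An application that exposes the override to operators may want to validate it at startup. The sketch below is one way to do that. It does not reproduce how the Sipp packages resolve a backend internally; `forced_backend` is a hypothetical helper, and treating an empty value as "unset" is an assumption of the sketch.

```python
import os
from typing import Mapping, Optional

KNOWN_BACKENDS = ("cpu", "vulkan", "cuda", "metal")


def forced_backend(
    env: Mapping[str, str] = os.environ,
    variable: str = "SIPP_PYTHON_BACKEND",
) -> Optional[str]:
    """Return the forced backend name, or None when selection should stay automatic."""
    raw = env.get(variable, "").strip()
    if not raw:
        return None  # unset or empty: leave selection automatic
    if raw not in KNOWN_BACKENDS:
        allowed = ", ".join(KNOWN_BACKENDS)
        raise ValueError(f"{variable}={raw!r} is not one of: {allowed}")
    return raw
```

Pass `variable="SIPP_NODE_BACKEND"` to apply the same check to the Node.js variable, for example from a deployment script.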
Maintainer builds can produce backend-specific artifacts with sipp or
cargo xtask; see Source Builds.
For the full package/backend matrix and llama.cpp/ggml operation support guidance, see Backend Matrix.
sipp CLI
sipp is the repo-local launcher for Sipp source checkout workflows. It
forwards to cargo xtask after setup has installed wrapper scripts under
.build/bin.
Use sipp when you are working from the repository and need to build native
artifacts, run demos, start the gateway server, manage xtask toolchains, run
cataloged tests, or build the documentation book. Published packages such as
@sipp/sipp, @sipp/sipp-server, and the Python wheel (sipppy) do not require sipp.
Command Shape
Every sipp command has the same arguments as cargo xtask:
sipp doctor
sipp build node --backend cpu
sipp run examples serve browser
sipp test list
sipp docs build
If the launcher is not active in the current shell, use the same command after
cargo xtask:
cargo xtask doctor
cargo xtask build node --backend cpu
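Scripts that call into a checkout can apply the same fallback automatically. The sketch below picks the prefix based on whether sipp is on PATH; `launcher_command` is a hypothetical helper, and the injectable `which` parameter exists only to make the function easy to test.

```python
import shutil
from typing import Callable, List, Optional, Sequence


def launcher_command(
    args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Prefix args with `sipp` when it is on PATH, otherwise with `cargo xtask`."""
    if which("sipp"):
        return ["sipp", *args]
    return ["cargo", "xtask", *args]
```

The result can be handed to `subprocess.run(launcher_command(["doctor"]), check=True)` from the repository root.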
Setup
Run the setup script from the repository root. It builds the xtask binary when
needed, installs sipp launchers under .build/bin, and can bootstrap managed
toolchains and sample files for the selected workflow.
Unix Shells
source ./setup.sh
sipp doctor
Running ./setup.sh without source still performs setup, but it cannot modify
the current shell PATH. It prints the environment script to source afterward.
Windows PowerShell
.\setup.ps1
sipp doctor
The PowerShell script updates PATH for the current PowerShell session and
loads .build\bin\sipp-env.ps1 when setup succeeds.
Windows CMD
setup.cmd
sipp doctor
setup.cmd invokes the PowerShell setup script and activates .build\bin for
the current CMD session.
Profiles
Use a profile when you know which development surface you need:
sipp setup --profile browser
sipp setup --profile bindings
sipp setup --profile full --yes
| Profile | Use for |
|---|---|
| browser | Browser package, WASM, WebGPU examples, and demos. |
| bindings | Native Node.js and Python binding development. |
| full | Full workspace development across browser and native bindings. |
Useful setup flags:
- --yes: accept recommended actions without prompting.
- --no-downloads: skip toolchain, dependency, and sample-model downloads.
- --no-splash: skip the interactive splash.
- --plain: disable bounded terminal rendering.
Generated Files
Setup writes only repo-local generated files:
- .build/xtask/debug/xtask or .build\xtask\debug\xtask.exe
- .build/bin/sipp, .build/bin/sipp.cmd, and .build/bin/sipp.ps1
- .build/bin/sipp-env.sh and .build/bin/sipp-env.ps1
- xtask-managed toolchains and caches under .build/toolchain
Commands
sipp groups source checkout automation into focused command families. Use
sipp <group> --help for generated help and the current option list.
Health Checks
sipp doctor
sipp doctor --target wasm
sipp doctor --target node --backend vulkan
sipp toolchain status
doctor checks local readiness without installing or deleting anything.
toolchain status reports xtask-managed tools such as Bun, Python, uv,
Emscripten, and Ninja. CUDA is externally installed; xtask reports it but does
not install or delete it.
Build
sipp build core
sipp build wasm
sipp build node --backend cpu
sipp build python --backend vulkan
sipp build cli --backend all
sipp build gateway-server --backend cpu
sipp build all
build all builds the main target families with default CPU native outputs. It
does not build every backend variant for every package.
Backend values:
- cpu: portable default.
- cuda: NVIDIA CUDA backend; requires a local CUDA Toolkit.
- metal: Apple Metal backend on macOS.
- vulkan: Vulkan backend; xtask can bootstrap the Vulkan SDK when needed.
- all: host-supported backend set for the selected target.
Run
sipp run examples serve browser --port 5173
sipp run examples serve gateway-local --model .build/models/model.gguf --bind 127.0.0.1:8787
sipp run examples gateway rust --case query
sipp run demos serve chat
sipp run tools serve playground
sipp run gateway-server check --config apps/gateway-server/config/local.toml
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend cpu
run commands are for long-lived demos, gateway processes, example servers,
and non-test diagnostics. Test execution lives under sipp test.
Docs
sipp docs build
sipp docs serve
sipp docs build --lang zh
docs build installs mdbook and mdbook-mermaid when missing, extracts the
bundled Mermaid JavaScript assets into theme/, and writes the generated book
to book/.
Test
sipp test list
sipp test list --group unit --layer interface --cases --search router --format json
sipp test unit group full
sipp test unit suite rust-crates --package sipp-rs
sipp test unit suite node-package --backend cpu
sipp test unit suite browser --wasm-threading single-thread
sipp test smoke suite example-node --backend cpu
sipp test smoke group local-model --backend cpu
sipp test verify --changed
sipp test verify --target public-docs
Model-backed smoke tests use the setup sample model cache under .build/models
when --model is omitted. See Testing for the full suite
catalog.
Clean
sipp clean --dry-run
sipp clean
sipp clean --purge
sipp clean --toolchains
clean removes generated build outputs while preserving downloaded toolchains
and dependency installs. --purge also removes workspace node_modules
directories. --toolchains removes xtask-managed toolchains under
.build/toolchain.
Output Flags
Most command groups accept the shared output flags:
- --verbose: stream subprocess output directly.
- --no-banner: disable decorative banners.
- --plain: disable bounded inline rendering.
Troubleshooting
sipp Is Not Found
Run setup from the repository root and keep the environment active in the same shell:
source ./setup.sh
.\setup.ps1
setup.cmd
If you cannot activate sipp, use cargo xtask with the same arguments:
cargo xtask doctor
cargo xtask test list
Setup Rebuilds xtask
The setup scripts rebuild .build/xtask/debug/xtask when xtask source files,
workspace manifests, or Cargo configuration are newer than
.build/xtask/sipp.stamp. This is expected after pulling changes that affect
developer automation.
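The freshness rule above amounts to comparing modification times against the stamp file. The sketch below illustrates that comparison only; it is not the setup scripts' code. `needs_rebuild` is a hypothetical name, and treating a missing stamp as stale is an assumption.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(stamp: Path, inputs: Iterable[Path]) -> bool:
    """True when the stamp is missing or any existing input is newer than it."""
    if not stamp.exists():
        return True  # assumption: no stamp means nothing has been built yet
    stamp_mtime = stamp.stat().st_mtime_ns
    return any(
        path.exists() and path.stat().st_mtime_ns > stamp_mtime
        for path in inputs
    )
```

Under this rule, pulling a change that touches any tracked input makes the next setup run rebuild xtask once, after which the refreshed stamp keeps later runs fast.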
PowerShell Blocks Script Execution
Run the script with the current-user execution policy configured by your machine, or invoke it for the current process:
powershell -NoProfile -ExecutionPolicy Bypass -File .\setup.ps1
PATH Is Active Only In One Terminal
The launcher is installed under .build/bin. Setup activates that directory
for the current shell session. Open a new terminal and run setup again, or
source the generated environment script:
source .build/bin/sipp-env.sh
. .build\bin\sipp-env.ps1
Toolchain Or Backend Is Missing
Use:
sipp doctor
sipp toolchain status
Then install xtask-managed components when appropriate:
sipp toolchain install uv
sipp toolchain install all
CUDA is not installed by xtask. Install CUDA through NVIDIA tooling and rerun
sipp doctor --target node --backend cuda or the target you need.
Using the Core Library
Sipp exposes one endpoint-oriented client model across all public package
surfaces. See the Library API Overview for the shared
SippClient.add, query, chat, and embed contracts, endpoint descriptor
reference, and gateway-client symmetry patterns.
Most developers should start here instead of building from source.
Package Surfaces
| Surface | Install | Primary use |
|---|---|---|
| Library API Overview | — | Shared add, query, chat, and embed contracts across all surfaces. |
| Browser | npm install @sipp/sipp | Browser-local GGUF inference, WebGPU/WASM runtime, and browser gateway clients. |
| Node.js | npm install @sipp/sipp-server | Node server processes, route handlers, and backend services. |
| Python | pip install sipppy | Python services, scripts, and gateway clients. |
| Rust | cargo add sipp-rs | Rust applications and services. |
| Gateway Server | Source-built today | First-party HTTP gateway for local and provider targets. |
| Gateway Docker | Docker from source | Local and production container workflows for the gateway server. |
| Gateway Toolkit | Source-built today | Rust toolkit for custom gateway applications. |
The current release workflow publishes browser npm, Node npm, Python wheels, and Rust crates. The gateway server is documented in the Gateway section as a user-facing deployment surface, but it does not yet have a published binary or public image.
Framework Guides
When integrating JavaScript packages with a framework, see:
Supporting Reference
- Providers — provider and gateway provider split
- Runtime Options — option layer map and field reference
- Source Builds — developing from this checkout
Library API Overview
The Sipp libraries for Rust, Node.js, Python, and Browser expose the same endpoint-oriented client model.
At a high level:
- Register an endpoint with add.
- Keep the returned EndpointRef.
- Pass that reference to query, chat, or embed.
This keeps application code the same whether inference runs locally, through a gateway, through a provider, or across a hybrid setup.
Core Client Methods
SippClient exposes four primary methods:
| Method | Purpose |
|---|---|
| add | Register a local, gateway, or provider endpoint and return an EndpointRef. |
| query | Generate text from a raw prompt string. No chat template is applied. |
| chat | Generate text from ordered { role, content } messages. |
| embed | Generate an embedding vector from text input. |
add() — Register an Endpoint
add(id: string, descriptor: EndpointDescriptor) -> EndpointRef
add registers an endpoint with the current client instance.
The id is caller-defined and scoped to the client. Reusing an id replaces
the existing endpoint. The returned EndpointRef is a lightweight handle with:
| Field | Description |
|---|---|
| kind | Endpoint kind: "local", "gateway", or "provider". |
| id | The endpoint id registered on this client. |
Pass the returned EndpointRef to query, chat, or embed to choose where
the operation runs.
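The registration semantics above can be modeled in a few lines. This is a toy, self-contained sketch and not the Sipp implementation: `ToyClient` and `descriptor_for` are hypothetical names, and only the client-scoped id, the replace-on-reuse behavior, and the `kind`/`id` fields come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointRef:
    """Lightweight handle: the endpoint kind plus the caller-defined id."""
    kind: str
    id: str


class ToyClient:
    """Hypothetical stand-in that models only endpoint registration."""

    def __init__(self):
        # Ids are scoped to this client instance.
        self._endpoints = {}

    def add(self, endpoint_id, descriptor):
        kind = descriptor["kind"]
        if kind not in ("local", "gateway", "provider"):
            raise ValueError(f"unknown endpoint kind: {kind}")
        # Reusing an id replaces the descriptor registered earlier.
        self._endpoints[endpoint_id] = descriptor
        return EndpointRef(kind=kind, id=endpoint_id)

    def descriptor_for(self, ref):
        return self._endpoints[ref.id]


client = ToyClient()
first = client.add("default", {"kind": "local", "modelPath": "/models/a.gguf"})
second = client.add(
    "default",
    {"kind": "gateway", "target": "chat", "baseUrl": "https://gateway.example.com"},
)
```

After the second call, the id `default` resolves to the gateway descriptor, which is why the freshly returned reference is the one to keep.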
Local Endpoint
A local endpoint loads a GGUF model into the current process. The application owns model selection, runtime lifecycle, and cleanup.
| Field | Type | Description |
|---|---|---|
| kind | "local" | Endpoint kind selector. |
| modelPath | string / PathBuf | Filesystem path or browser URL for the GGUF artifact. |
| config | NativeRuntimeConfig optional | Load-time runtime configuration, including context size, GPU placement, scheduler policy, cache mode, sampling defaults, and observability. |
Use a local endpoint when the current process should own model execution.
Gateway Endpoint
A gateway endpoint sends requests to a remote Sipp gateway over HTTP. The gateway process owns provider credentials, local model paths, access policy, concurrency, and metrics.
| Field | Type | Description |
|---|---|---|
| kind | "gateway" | Endpoint kind selector. |
| target | string | Public target name resolved by the gateway. Sent as the model field in gateway profile requests. |
| baseUrl | string | Absolute HTTP(S) URL of the gateway service. |
| authentication | { kind, value?, headerName? } | Auth strategy: "none", "bearer", or "header". |
| staticHeaders | { name, value }[] optional | Additional HTTP headers attached to every request. |
| timeoutMs / timeoutPolicy | number / struct optional | Connection, request, and streaming read deadlines. |
| queryRoute | string optional | Query route. Defaults to /v1/query. |
| chatRoute | string optional | Chat route. Defaults to /v1/chat. |
| embedRoute | string optional | Embedding route. Defaults to /v1/embed. |
| protocolOptions | map optional | Profile-specific options merged into every request body. |
Use a gateway endpoint when a separate service should own model access and operational policy.
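To make the descriptor fields concrete, here is a rough sketch of how they could map onto an outgoing HTTP request. `build_gateway_request` is a hypothetical helper, not part of any Sipp package. The `Authorization: Bearer` header follows the usual HTTP bearer convention, and the merge order (payload overrides `protocolOptions`, `target` always wins as `model`) is an assumption made for illustration.

```python
DEFAULT_ROUTES = {"query": "/v1/query", "chat": "/v1/chat", "embed": "/v1/embed"}


def build_gateway_request(descriptor, operation, payload):
    """Hypothetical mapping from a gateway descriptor to URL, headers, and body."""
    # Per-operation route override, else the documented default.
    route = descriptor.get(f"{operation}Route") or DEFAULT_ROUTES[operation]
    url = descriptor["baseUrl"].rstrip("/") + route

    headers = {"Content-Type": "application/json"}
    for header in descriptor.get("staticHeaders", []):
        headers[header["name"]] = header["value"]

    auth = descriptor.get("authentication", {"kind": "none"})
    if auth["kind"] == "bearer":
        # Assumption: standard bearer scheme.
        headers["Authorization"] = f"Bearer {auth['value']}"
    elif auth["kind"] == "header":
        headers[auth["headerName"]] = auth["value"]
    elif auth["kind"] != "none":
        raise ValueError(f"unsupported authentication kind: {auth['kind']}")

    # Assumption: protocolOptions are merged first, then the payload,
    # and the public target is sent as the model field.
    body = {**descriptor.get("protocolOptions", {}), **payload}
    body["model"] = descriptor["target"]
    return {"url": url, "headers": headers, "body": body}


descriptor = {
    "kind": "gateway",
    "target": "chat-target",
    "baseUrl": "https://gateway.example.com/",
    "authentication": {"kind": "bearer", "value": "token-123"},
    "staticHeaders": [{"name": "X-App", "value": "demo"}],
}
request = build_gateway_request(descriptor, "chat", {"messages": []})
```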
Provider Endpoint
A provider endpoint calls a model provider directly. This is intended for trusted server-side code that manages its own credential lifecycle.
| Field | Type | Description |
|---|---|---|
| kind | "provider" | Endpoint kind selector. |
| provider | "openai" / "anthropic" / "openai_compatible" | Provider adapter. |
| model | string | Provider model identifier. |
| apiKey | string optional | Provider API key. |
| baseUrl | string optional | Override for the provider base URL. |
Use a provider endpoint when server-side code should call a provider API directly without a Sipp gateway.
query() — Generate from a Raw Prompt
query(request: SippQueryRequest) -> SippTextRun
query sends the prompt string to the selected endpoint exactly as supplied.
No chat template is applied.
Use query when the application owns the full prompt shape, including custom
templates, completion-style models, encoder-decoder text models, few-shot
prompts, or agent loops that render prompts themselves.
Request Fields
| Field | Type | Description |
|---|---|---|
| endpoint | EndpointRef | Registered endpoint to target. May be omitted only when exactly one local endpoint supports the operation. |
| prompt | string | Raw prompt text. |
| options | SippTextOptions optional | Shared generation options: maxTokens, temperature, topP, and stop. |
| local | LocalTextOptions optional | Local-only options such as contextKey, grammar, jsonSchema, sampling overrides, and media inputs. Rejected by gateway endpoints. |
| endpointOptions | map optional | Free-form options forwarded to gateway endpoint implementations. |
| providerOptions | map optional | Free-form options forwarded to direct provider adapters. Rejected by gateway endpoints. |
| emitTokens | boolean | When true, stream TokenBatch values through the returned run handle. |
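The option layers in this table imply a small validation rule: `local` and `providerOptions` are rejected when the target is a gateway endpoint. The sketch below illustrates that rule only. `check_request_options` and its error text are hypothetical, and the real libraries may report the rejection differently.

```python
# Request fields that a given endpoint kind refuses, per the table above.
REJECTED_FIELDS = {
    "gateway": ("local", "providerOptions"),
}


def check_request_options(endpoint_kind, request):
    """Raise ValueError if the request carries a layer the endpoint kind rejects."""
    for field in REJECTED_FIELDS.get(endpoint_kind, ()):
        # An absent or empty layer is treated as not supplied.
        if request.get(field):
            raise ValueError(
                f"{field} is not accepted by {endpoint_kind} endpoints"
            )


# Shared options are fine for every endpoint kind.
check_request_options(
    "gateway",
    {"prompt": "Explain Sipp.", "options": {"maxTokens": 64}},
)
```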
Return Value
query returns a SippTextRun.
| Member | Type | Description |
|---|---|---|
| response | Promise / Future | Resolves to SippTextResponse when generation completes. |
| tokens | Async iterable | Streams TokenBatch values when emitTokens is true. |
| cancel(reason) | method | Cancels an in-flight generation. |
SippTextResponse contains the generated text, finishReason, token
usage, and optional localStats for local endpoints.
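The run-handle shape (a token stream, a final response, and a cancel method) can be illustrated with plain asyncio. `ToyTextRun` is a hypothetical model and not the Sipp Python binding. The `finishReason` values `"stop"` and `"cancelled"` are assumptions chosen for the sketch.

```python
import asyncio


class ToyTextRun:
    """Hypothetical run handle: streamed tokens plus an awaitable final response."""

    def __init__(self, pieces):
        self._queue = asyncio.Queue()
        self._cancelled = False
        # The response resolves independently of whether tokens are consumed.
        self.response = asyncio.ensure_future(self._generate(pieces))

    async def _generate(self, pieces):
        emitted = []
        for piece in pieces:
            if self._cancelled:
                break
            emitted.append(piece)
            await self._queue.put(piece)
            await asyncio.sleep(0)  # let the consumer run between tokens
        await self._queue.put(None)  # sentinel: the stream is finished
        reason = "cancelled" if self._cancelled else "stop"
        return {"text": "".join(emitted), "finishReason": reason}

    @property
    def tokens(self):
        return self._iterate()

    async def _iterate(self):
        while True:
            piece = await self._queue.get()
            if piece is None:
                return
            yield piece

    def cancel(self, reason):
        self._cancelled = True


async def main():
    run = ToyTextRun(["Sipp ", "runs ", "models."])
    streamed = ""
    async for piece in run.tokens:
        streamed += piece
    return streamed, await run.response


streamed, response = asyncio.run(main())
```

The consumer can ignore `tokens` entirely and await `response` alone, which mirrors calling the operation without `emitTokens`.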
chat() — Generate from Role Messages
chat(request: SippChatRequest) -> SippTextRun
chat sends ordered role/content messages to the selected endpoint. The
endpoint owns message rendering.
| Endpoint kind | Message handling |
|---|---|
| Local | Renders messages through the GGUF-declared tokenizer.chat_template. Fails if the model has no template. |
| Gateway | Forwards messages to the resolved gateway target. Provider targets handle their own message mapping. |
| Provider | Sends messages using the provider’s native chat-completions format. |
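The difference between chat and query comes down to who renders the messages into a prompt. The sketch below shows the idea with placeholder role markers. `render_messages` is hypothetical and the `<|role|>` markers are not a real model's template: a local endpoint uses the GGUF-declared tokenizer.chat_template instead.

```python
def render_messages(messages):
    """Toy renderer: one marker line per role, then an open assistant turn."""
    lines = []
    for message in messages:
        lines.append(f"<|{message['role']}|>")
        lines.append(message["content"])
    # Leave the assistant turn open so the model continues from here.
    lines.append("<|assistant|>")
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "Explain Sipp in one sentence."},
]
prompt = render_messages(messages)
```

With chat, the endpoint performs this step. With query, the application produces a string like `prompt` itself and sends it unchanged.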
Request Fields
| Field | Type | Description |
|---|---|---|
| endpoint | EndpointRef | Registered endpoint to target. |
| messages | { role, content }[] | Ordered conversation turns. |
| options | SippTextOptions | Same shared generation options as query. |
| local | LocalTextOptions | Same local-only options as query. |
| emitTokens | boolean | Same streaming control as query. |
Return Value
chat returns the same SippTextRun shape as query.
embed() — Generate an Embedding
embed(request: SippEmbedRequest) -> SippEmbeddingRun
embed produces a single embedding vector from text input. It does not accept
generation options and does not stream tokens.
Request Fields
| Field | Type | Description |
|---|---|---|
| endpoint | EndpointRef | Registered endpoint to target. |
| input | string | Text to vectorize. |
| local | LocalEmbedOptions optional | Local embedding options, including contextKey and normalize. |
| endpointOptions | map optional | Free-form options for gateway endpoint implementations. |
| providerOptions | map optional | Free-form options for direct provider adapters. |
Return Value
embed returns a SippEmbeddingRun.
| Member | Type | Description |
|---|---|---|
| response | Promise / Future | Resolves to SippEmbeddingResponse when encoding completes. |
| cancel(reason) | method | Cancels an in-flight embedding. |
SippEmbeddingResponse contains the float values array, optional token
usage, the pooling strategy, and the normalized flag.
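As a reminder of what the normalized flag implies for downstream code, here is a short sketch of L2 normalization and cosine similarity over plain float lists. The helper names are illustrative and not part of the Sipp API. Whether Sipp normalizes with exactly this formula is not stated in this section.

```python
import math


def l2_normalize(values):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return list(values)
    return [v / norm for v in values]


def cosine_similarity(a, b):
    """Cosine similarity; for unit-length inputs this equals the dot product."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    unit_a, unit_b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(unit_a, unit_b))


unit = l2_normalize([3.0, 4.0])
same = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

When both vectors are already normalized, the similarity reduces to a single dot product, which is why some callers request normalized output up front.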
Gateway and Client Symmetry
The same SippClient API works on both sides of the gateway boundary.
Server Side
A server process creates a SippClient, registers local endpoints, and maps
HTTP routes to query, chat, or embed.
Server client:
add("local-model", LocalDescriptor { modelPath, config })
-> route handler decodes HTTP request
-> route handler calls client.query/chat/embed
-> route handler encodes HTTP response
The first-party Gateway Server uses this pattern. Application-owned Node, Python, or Rust servers can also use it through the gateway profile helpers.
Client Side
A client process creates a SippClient, registers gateway endpoints, and
calls query, chat, or embed the same way it would call a local endpoint.
Client client:
add("remote", GatewayDescriptor { target, baseUrl, authentication })
-> client.query/chat/embed({ endpoint: ref, ... })
-> request is sent to the gateway over HTTP
Hybrid Pattern
A single client can register multiple endpoint kinds. The application chooses where an operation runs by passing a different endpoint reference.
localRef = client.add("local", LocalDescriptor { ... })
gatewayRef = client.add("gateway", GatewayDescriptor { ... })
client.query({ endpoint: localRef, prompt, ... })
client.query({ endpoint: gatewayRef, prompt, ... })
The operation code stays the same. Only the endpoint reference changes.
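One way an application might use this is a routing policy that picks a reference per request. The sketch below is illustrative only: `choose_endpoint`, `build_query`, and the character budget are hypothetical, and the `EndpointRef` tuple merely mimics the kind/id handle described earlier.

```python
from typing import NamedTuple


class EndpointRef(NamedTuple):
    kind: str
    id: str


def choose_endpoint(prompt, local_ref, gateway_ref, local_char_budget=2000):
    """Hypothetical policy: short prompts stay local, long ones go to the gateway."""
    return local_ref if len(prompt) <= local_char_budget else gateway_ref


def build_query(prompt, endpoint):
    # The request shape is the same for every endpoint kind; only `endpoint` differs.
    return {"endpoint": endpoint, "prompt": prompt, "options": {"maxTokens": 64}}


local_ref = EndpointRef("local", "local")
gateway_ref = EndpointRef("gateway", "gateway")

short_prompt = "Explain Sipp."
long_prompt = "x" * 5000
short_request = build_query(
    short_prompt, choose_endpoint(short_prompt, local_ref, gateway_ref)
)
long_request = build_query(
    long_prompt, choose_endpoint(long_prompt, local_ref, gateway_ref)
)
```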
Why the Endpoint Model Matters
The endpoint model gives applications one API surface across multiple deployment shapes.
| Benefit | Description |
|---|---|
| Stable operation code | query, chat, and embed are called the same way for local, gateway, provider, and hybrid setups. |
| Swappable execution targets | Move inference between local models, gateway targets, and direct providers by changing endpoint descriptors. |
| Clear ownership boundaries | Local endpoints keep lifecycle in-process; gateway endpoints move access, credentials, policy, and metrics to a service boundary. |
| Language symmetry | Patterns learned in one language package transfer directly to the others. |
| Extensible endpoint kinds | New endpoint kinds can be added without changing the operation call pattern. |
Visual Summary
flowchart LR
%% -------------------------
%% Node Styling
%% -------------------------
classDef client_node fill:#eef6ff,stroke:#4a90e2,stroke-width:1.5px,color:#111,rx:6,ry:6;
classDef setup_node fill:#f7f7f7,stroke:#999,stroke-width:1px,color:#111,rx:6,ry:6;
classDef runtime_node fill:#f3fff0,stroke:#52a852,stroke-width:2px,color:#111,rx:6,ry:6;
classDef gateway_node fill:#fff7e6,stroke:#d99000,stroke-width:2px,color:#111,rx:6,ry:6;
classDef provider_node fill:#f8f0ff,stroke:#8e44ad,stroke-width:1.5px,color:#111,rx:6,ry:6;
%% -------------------------
%% Client Process
%% -------------------------
subgraph CLIENT["Client Process"]
direction TB
CApp["Application Code"]:::client_node
CClient["SippClient<br/>add(...) -> EndpointRef<br/>query / chat / embed"]:::client_node
CApp --> CClient
%% Logical grouping for endpoint registration options
subgraph CSetup["Endpoint Setup (options)"]
direction LR
CLocalEP["local (GGUF)"]:::setup_node
CGatewayEP["gateway (Remote)"]:::setup_node
CProviderEP["provider (API)"]:::setup_node
end
CClient -. "Registers" .-> CSetup
%% Local execution flow for local ref
subgraph CLocalRuntime["Local Runtime"]
direction LR
CLocalRun["GGUF Runtime"]:::runtime_node
end
%% Connection for local usage
CClient -- "Local Ref (query)" --> CLocalRun
end
%% -------------------------
%% Server Process
%% -------------------------
subgraph SERVER["Server Process / Gateway Server"]
direction TB
SGateway["Gateway Server<br/>HTTP: /v1/query, /v1/chat, /v1/embed"]:::gateway_node
SClient["SippClient (same lib)"]:::client_node
SGateway --> SClient
%% Logical grouping for endpoint registration options
subgraph SSetup["Endpoint Setup (options)"]
direction LR
SLocalEP["local (GGUF)"]:::setup_node
SProviderEP["provider (API)"]:::setup_node
end
SClient -. "Registers" .-> SSetup
%% Local execution flow for local ref
subgraph SLocalRuntime["Local Runtime"]
direction LR
SLocalRun["GGUF Runtime"]:::runtime_node
end
%% Connection for local usage
SClient -- "Local Ref (query)" --> SLocalRun
end
%% -------------------------
%% External Providers
%% -------------------------
Providers["Provider APIs<br/>OpenAI / Gemini / Anthropic / etc."]:::provider_node
%% -------------------------
%% Cross-process / Remote connections
%% -------------------------
CClient == "Gateway Ref (query)" ==> SGateway
CClient == "Provider Ref (query)" ==> Providers
SClient == "Provider Ref (query)" ==> Providers
%% -------------------------
%% Styling Assignment to Nodes
%% -------------------------
class CApp,CClient client_node;
class CLocalEP,CGatewayEP,CProviderEP,SLocalEP,SProviderEP setup_node;
class CLocalRun,SLocalRun runtime_node;
class SGateway gateway_node;
class Providers provider_node;
Related Docs
- Using the Core Library — per-language install steps and examples.
- Inference Operations — operation contracts, template behavior, and gateway target mapping.
- Local Inference — model sources, runtime options, threads, and browser execution.
- Gateway and Hybrid Inference — deployment shapes, endpoint model, and authentication patterns.
- Runtime Options — complete option layer map and field reference.
Browser Package
The browser package target is @sipp/sipp. It exposes SippClient for
browser-local GGUF inference, gateway calls, provider descriptors where
supported, token streaming, OPFS-backed model caching, and browser runtime
lifecycle management.
See the Library API Overview for the shared add, query,
chat, and embed contracts.
Install
npm install @sipp/sipp
Use this package in browser code. For server routes or Node services, use
@sipp/sipp-server.
Use It For
- Browser-local text and vision inference.
- WebGPU or CPU execution through the browser runtime.
- OPFS-backed model caching.
- Gateway-backed query, chat, and embedding calls.
- Character and director helpers used by demos.
Local GGUF Chat
import { SippClient, type ChatMessage } from '@sipp/sipp';
const client = new SippClient();
const endpoint = await client.add('default', {
kind: 'local',
source: '/models/model.gguf',
options: {
backend: 'webgpu',
runtime: {
context: { n_ctx: 2048 },
},
},
});
const messages: readonly ChatMessage[] = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain Sipp in one sentence.' },
];
const run = client.chat(messages, {
endpoint,
emitTokens: true,
maxTokens: 64,
contextKey: 'browser-local',
});
let streamed = '';
for await (const batch of run.tokens) {
streamed += batch.text;
}
const response = await run.response;
console.log(streamed || response.text);
await client.close();
Use query when the prompt is already rendered for the target model. See the
API overview for the
query/chat/embed contracts.
Gateway Chat
Use gateway endpoints when a separate server owns model paths, provider credentials, target policy, and metrics.
const endpoint = await client.add('gateway', {
kind: 'gateway',
target: 'local',
baseUrl: 'https://gateway.example.com',
authentication: {
kind: 'bearer',
valueProvider: getShortLivedGatewayToken,
},
});
const messages = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain gateway inference.' },
];
const run = client.chat(messages, {
endpoint,
maxTokens: 64,
});
Browser apps should use short-lived gateway tokens or proxy through an application server route. Do not ship provider credentials or long-lived gateway tokens in browser bundles.
Browser Runtime Options
The browser runtime links Sipp’s Rust WASM ABI with llama.cpp and ggml through Emscripten. It runs GGUF text and vision models with WebGPU when the browser exposes a compatible adapter, and falls back to CPU execution for compatible local workflows. OPFS-backed model caching keeps repeated browser loads local after the first model fetch or file import.
The package resolves its packaged JavaScript and WASM assets at runtime. Most
apps should not override asset URLs. Use executionMode, wasmThreading,
browserCache, and local endpoint options.runtime only when the application
needs explicit control over browser execution, storage, or local runtime
behavior.
See Runtime Options for SippClient
options, WebGPU/backend selection, worker mode, pthread requirements, and
local runtime config groups.
Related Docs
- Gateway
- Next.js
- TanStack
- React And Vite
- Local Inference
- Runtime Options
- Providers
- Browser Caching
- Gateway And Hybrid Inference
- Examples And Demos
- Maintainer source builds
Node.js Package
The Node.js package target is @sipp/sipp-server. It exposes the native
Sipp client API to Node server processes, route handlers, and framework
server functions. Applications own framework routes, request validation, auth,
and deployment policy.
See the Library API Overview for the shared add, query,
chat, and embed contracts.
Install
npm install @sipp/sipp-server
Use this package only in Node runtime code. Browser components should use
@sipp/sipp.
@sipp/sipp-server is a wrapper package. npm installs the matching optional
platform package for the current OS and CPU, and the runtime loader selects
the best packaged backend for that host.
Use It For
- Server-side local GGUF inference.
- Gateway-backed and provider-backed inference from server code.
- Token streaming from Node processes.
- Framework route handlers in Node runtimes.
- Backend selection for native bindings.
Local GGUF Query
import { SippClient } from '@sipp/sipp-server';
const client = new SippClient();
const endpoint = await client.add('default', {
kind: 'local',
modelPath: process.argv[2],
config: {
context: { n_ctx: 2048 },
scheduler: { continuous_batching: true, prefill_chunk_size: 0 },
cache: { mode: 'live_slot_prefix' },
observability: { runtime_metrics: true },
},
});
const queryPrompt = [
'<|system|>',
'Answer concisely.',
'<|user|>',
'Explain Sipp in one sentence.',
'<|assistant|>',
].join('\n');
const run = client.query({
endpoint,
// query: raw prompt; replace markers with the target model's template.
prompt: queryPrompt,
emitTokens: true,
options: { maxTokens: 64, temperature: 0.7 },
local: { contextKey: 'node-local' },
});
let streamed = '';
for await (const batch of run.tokens) {
streamed += batch.text;
}
const response = await run.response;
console.log(streamed || response.text);
Set SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal to choose a native backend.
By default, macOS tries metal then cpu; Windows and Linux try cuda,
vulkan, then cpu.
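The default order described above can be written as a small lookup. This is a sketch of the documented order and not the loader's actual code: `backend_candidates` is a hypothetical name, the platform strings follow the common `darwin`/`linux`/`win32` convention, and treating an explicit override as the only candidate (no fallback) is an assumption.

```python
KNOWN_BACKENDS = ("cpu", "vulkan", "cuda", "metal")


def backend_candidates(platform, override=None):
    """Return the backends to try, in order, for a host platform."""
    if override:
        if override not in KNOWN_BACKENDS:
            raise ValueError(f"unknown backend: {override}")
        # Assumption: an explicit choice is tried alone, without fallback.
        return [override]
    if platform == "darwin":
        return ["metal", "cpu"]
    # Windows and Linux share the same default order.
    return ["cuda", "vulkan", "cpu"]


mac_default = backend_candidates("darwin")
linux_default = backend_candidates("linux")
forced_cpu = backend_candidates("darwin", override="cpu")
```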
See Runtime Options for local runtime config
groups and request option boundaries.
On Intel Macs with integrated GPUs, prefer SIPP_NODE_BACKEND=cpu.
The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs.
Apple Silicon can run x64 Node through Rosetta 2, but x64 packages are used
only by an x64 Node process; native arm64 Node should use arm64 packages.
Gateway Chat
function requiredEnv(name: string): string {
const value = process.env[name];
if (value == null || value === '') {
throw new Error(`${name} is required`);
}
return value;
}
const endpoint = await client.add('gateway', {
kind: 'gateway',
target: requiredEnv('SIPP_GATEWAY_TARGET'),
baseUrl: requiredEnv('SIPP_GATEWAY_URL'),
authentication: {
kind: 'bearer',
value: requiredEnv('SIPP_GATEWAY_TOKEN'),
},
});
const messages = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain gateway inference.' },
];
const run = client.chat({
endpoint,
messages,
options: { maxTokens: 64 },
});
console.log((await run.response).text);
The application only needs the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.
Direct Provider Chat
Use direct provider endpoints only in trusted server code. Keep the provider
key in the server environment; OPENAI_API_KEY="<mock-openai-key>" is only a
placeholder value in examples.
function requiredEnv(name: string): string {
const value = process.env[name];
if (value == null || value === '') {
throw new Error(`${name} is required`);
}
return value;
}
const endpoint = await client.add('provider', {
kind: 'provider',
provider: 'openai',
model: process.env.OPENAI_MODEL ?? 'gpt-5-mini',
apiKey: requiredEnv('OPENAI_API_KEY'),
});
const messages = [
{ role: 'system', content: 'Answer concisely.' },
{ role: 'user', content: 'Explain provider inference.' },
];
const run = client.chat({
endpoint,
messages,
options: { maxTokens: 64 },
});
console.log((await run.response).text);
Pass provider-only request fields through providerOptions. See
Providers for the full provider/gateway split.
Gateway Profile Helpers
Use the gateway profile helpers when a Node route should behave like a
first-party gateway endpoint for browser kind: 'gateway' clients. The helpers
decode model, prompt, messages, input, and snake_case generation
options, then format JSON or SSE responses. The route can execute the decoded
request against a provider, a local endpoint, or a separate gateway.
import {
SippClient,
decodeGatewayQueryBody,
gatewayErrorResponse,
gatewayTextResponseBody,
gatewayTextStreamResponse,
} from '@sipp/sipp-server';
function requiredEnv(name: string): string {
const value = process.env[name];
if (value == null || value === '') {
throw new Error(`${name} is required`);
}
return value;
}
export async function handleQuery(request: Request): Promise<Response> {
try {
const decoded = decodeGatewayQueryBody(await request.json());
const client = new SippClient();
const endpoint = await client.add('provider', {
kind: 'provider',
provider: 'openai',
model: decoded.target,
apiKey: requiredEnv('OPENAI_API_KEY'),
});
const run = client.query({ ...decoded.request, endpoint });
return decoded.stream
? gatewayTextStreamResponse(run)
: Response.json(
gatewayTextResponseBody(decoded.target, await run.response),
);
} catch (error) {
const response = gatewayErrorResponse(error);
return Response.json(response.body, response.init);
}
}
Use decodeGatewayChatBody() and decodeGatewayEmbedBody() for /v1/chat
and /v1/embed compatible routes. Use gatewayEmbeddingResponseBody() for
finite embedding responses.
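For readers implementing a compatible route in another stack, the decoding step might look roughly like the sketch below. It is not the `@sipp/sipp-server` helper. `decode_query_body` is hypothetical, and the wire names `max_tokens`, `top_p`, and `stream` are assumptions inferred from the snake_case note and the `decoded.stream` usage above.

```python
# Assumed snake_case wire names mapped to the camelCase option names.
OPTION_NAMES = {
    "max_tokens": "maxTokens",
    "temperature": "temperature",
    "top_p": "topP",
    "stop": "stop",
}


def decode_query_body(body):
    """Hypothetical decoder: gateway-profile JSON to target, stream flag, request."""
    model = body.get("model")
    prompt = body.get("prompt")
    if not isinstance(model, str) or not model:
        raise ValueError("model is required")
    if not isinstance(prompt, str):
        raise ValueError("prompt must be a string")
    options = {
        camel: body[snake] for snake, camel in OPTION_NAMES.items() if snake in body
    }
    stream = bool(body.get("stream", False))
    return {
        "target": model,
        "stream": stream,
        "request": {"prompt": prompt, "options": options, "emitTokens": stream},
    }


decoded = decode_query_body(
    {"model": "chat-target", "prompt": "Explain Sipp.", "max_tokens": 64, "top_p": 0.9}
)
```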
Framework Routes
Use @sipp/sipp-server in server-only code such as Next.js App Router route
handlers with runtime = 'nodejs', TanStack Start server functions, Express
routes, or background workers. Do not import it from browser bundles.
Related Docs
- Gateway Server
- Next.js
- TanStack
- Local Inference
- Providers
- Runtime Options
- Gateway And Hybrid Inference
- Maintainer source builds
Python Package
The Python wheel is named sipppy. Python code imports the sipp module, which
exposes native descriptor classes, run handles, token streaming, and the same
endpoint model as the Rust client.
Published wheels require Python 3.10 or newer.
See the Library API Overview for the shared add, query,
chat, and embed contracts.
Install
Note
Python wheels currently ship from the project’s GitHub Releases, not PyPI. A full PyPI release with a complete build matrix (CPU and GPU backends across operating systems, architectures, and Python versions, in the style of PyTorch’s distribution matrix) is in progress. The package name sipppy and the sipp import are stable; only the distribution channel will change.
Download the sipppy wheel that matches your platform, Python version, and
backend from the GitHub Releases
page, then install it with pip. The default wheel includes the CPU backend:
pip install sipppy
Install PyPI-published GPU backends as extras:
pip install "sipppy[vulkan]"
pip install "sipppy[metal]"
The backend wheels are separate PyPI distributions. For example,
sipppy[vulkan] installs the main sipppy wheel plus the matching
sipppy-backend-vulkan wheel for the same release version. Python code still
imports sipp. CUDA backend wheels are attached to GitHub releases for the
first public release and will move to PyPI after the CUDA wheel size limit is
raised.
Use It For
- Python applications that need local GGUF inference.
- Gateway-backed inference from Python services or scripts.
- Direct provider descriptors where server-side credentials are appropriate.
- Runtime metrics and backend selection in Python services.
Local GGUF Query
import sys
from sipp import (
CacheRuntimeConfig,
SippClient,
SippTextOptions,
ContextRuntimeConfig,
LocalModelDescriptor,
LocalTextOptions,
NativeRuntimeConfig,
ObservabilityRuntimeConfig,
SchedulerRuntimeConfig,
)
client = SippClient()
endpoint = client.add(
"default",
LocalModelDescriptor(
sys.argv[1],
NativeRuntimeConfig(
context=ContextRuntimeConfig(n_ctx=2048),
scheduler=SchedulerRuntimeConfig(
continuous_batching=True,
prefill_chunk_size=0,
),
cache=CacheRuntimeConfig(mode="live_slot_prefix"),
observability=ObservabilityRuntimeConfig(runtime_metrics=True),
),
),
)
query_prompt = "\n".join(
[
"<|system|>",
"Answer concisely.",
"<|user|>",
"Explain Sipp in one sentence.",
"<|assistant|>",
]
)
run = client.query(
# query: raw prompt; replace markers with the target model's template.
query_prompt,
endpoint=endpoint,
options=SippTextOptions(max_tokens=64),
local=LocalTextOptions(context_key="python-local"),
)
print(run.result()["text"])
Set SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal to choose an installed native
backend. See Runtime Options for local
runtime config groups and request option boundaries.
On Intel Macs with integrated GPUs, prefer SIPP_PYTHON_BACKEND=cpu.
The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs.
Apple Silicon can run x64 Python through Rosetta 2, but x64 wheels are used
only by an x64 Python process; native arm64 Python should use arm64 wheels.
Gateway Chat
import os
from sipp import ChatMessage, SippClient, SippTextOptions, GatewayDescriptor
client = SippClient()
endpoint = client.add(
"gateway",
GatewayDescriptor(
os.environ["SIPP_GATEWAY_TARGET"],
os.environ["SIPP_GATEWAY_URL"],
authentication_kind="bearer",
authentication_value=os.environ["SIPP_GATEWAY_TOKEN"],
),
)
messages = [
ChatMessage("system", "Answer concisely."),
ChatMessage("user", "Explain gateway inference."),
]
run = client.chat(
messages,
endpoint=endpoint,
options=SippTextOptions(max_tokens=64),
)
print(run.result()["text"])
Gateway clients need only the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.
Related Docs
- Gateway Server
- Installation
- Local Inference
- Providers
- Runtime Options
- Gateway And Hybrid Inference
- Maintainer source builds
Rust Package
The Rust package target is sipp-rs. It publishes the sipp library crate
for Rust applications and re-exports the high-level client API plus selected
runtime, backend, lifecycle, shard, provider, and gateway types.
sipp-rs depends on sipp-sys, the native llama.cpp FFI crate. Installing
sipp-rs from crates.io builds the native backend from source on the target
machine; it is not a binary wheel-style package.
See the Library API Overview for the shared add, query,
chat, and embed contracts.
Install
cargo add sipp-rs
The release workflow publishes sipp-sys first, then publishes sipp-rs.
Applications depend on the sipp-rs package and import the sipp crate.
Build Requirements
Rust applications that depend on sipp-rs need the normal Rust toolchain plus
the native build tools used by sipp-sys:
- A C/C++ compiler for the target platform.
- CMake.
- Ninja or a compatible CMake generator.
- Platform SDKs required by the selected backend.
The CPU native backend is the baseline and does not require a Cargo feature. Backend features add their own requirements:
- cuda: CUDA Toolkit plus a compatible NVIDIA driver.
- metal: macOS with Xcode command line tools.
- vulkan: Vulkan SDK or system Vulkan development libraries.
- openmp: OpenMP compiler/runtime support for the target platform.
Use It For
- Rust applications that need local GGUF inference.
- Gateway-backed query, chat, and embedding calls.
- Direct provider descriptors behind the providers feature.
- Shared Sipp value types across application boundaries.
Local GGUF Query
#![allow(unused)]
fn main() {
use sipp::{
SippClient, SippQueryRequest, SippTextOptions, EndpointDescriptor,
LocalTextOptions,
};
use sipp::engine::{
CacheRuntimeConfig, ContextRuntimeConfig, KvReuseMode, NativeRuntimeConfig,
ObservabilityRuntimeConfig, SchedulerRuntimeConfig,
};
async fn run(
model_path: std::path::PathBuf,
) -> Result<(), Box<dyn std::error::Error>> {
let mut client = SippClient::new();
let endpoint = client
.add(
"default",
EndpointDescriptor::local(model_path, runtime_config()),
)
.await?;
let response = client
.query(SippQueryRequest {
endpoint: Some(endpoint),
prompt: "Explain Sipp in one sentence.".to_string(),
options: SippTextOptions {
max_tokens: Some(64),
..Default::default()
},
local: LocalTextOptions {
context_key: Some("rust-local".to_string()),
..Default::default()
},
..Default::default()
})
.await?;
println!("{}", response.text);
Ok(())
}
fn runtime_config() -> NativeRuntimeConfig {
NativeRuntimeConfig {
context: ContextRuntimeConfig {
n_ctx: Some(2048),
..Default::default()
},
scheduler: SchedulerRuntimeConfig {
continuous_batching: true,
prefill_chunk_size: 0,
..Default::default()
},
cache: CacheRuntimeConfig {
mode: KvReuseMode::LiveSlotPrefix,
..Default::default()
},
observability: ObservabilityRuntimeConfig {
runtime_metrics: true,
backend_profiling: false,
},
..Default::default()
}
}
}
See Runtime Options for the shared runtime config groups and request option boundaries.
Gateway Query
#![allow(unused)]
fn main() {
use sipp::{
    SippClient, SippQueryRequest, SippTextOptions, EndpointDescriptor,
    GatewayAuthentication, GatewayEndpointConfig, GatewayRoutes, GatewaySecret,
    GatewayTimeoutPolicy,
};

// Wrapped in an async function so that `.await` and `?` are valid here.
async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = SippClient::new();
    // Target, URL, and token come from the environment; a missing variable
    // is returned as an error instead of panicking.
    let endpoint = client
        .add(
            "gateway",
            EndpointDescriptor::gateway(GatewayEndpointConfig {
                target: std::env::var("SIPP_GATEWAY_TARGET")?,
                base_url: std::env::var("SIPP_GATEWAY_URL")?,
                routes: GatewayRoutes::default(),
                authentication: GatewayAuthentication::Bearer(GatewaySecret::new(
                    std::env::var("SIPP_GATEWAY_TOKEN")?,
                )),
                static_headers: Default::default(),
                timeouts: GatewayTimeoutPolicy::default(),
                protocol_options: Default::default(),
            }),
        )
        .await?;
    let response = client
        .query(SippQueryRequest {
            endpoint: Some(endpoint),
            prompt: "Explain gateway inference.".to_string(),
            options: SippTextOptions {
                max_tokens: Some(64),
                ..Default::default()
            },
            ..Default::default()
        })
        .await?;
    println!("{}", response.text);
    Ok(())
}
}
Related Docs
- Gateway Server
- Gateway Toolkit
- Local Inference
- Providers
- Runtime Options
- Gateway And Hybrid Inference
- Architecture
- Maintainer source builds
Frameworks
These guides show how to use the JavaScript-facing Sipp packages in common
application frameworks. See the Library API Overview for
the shared add, query, chat, and embed contracts.
Use the browser package, @sipp/sipp, when inference runs in the browser or when
browser code calls a gateway. Use the Node package, @sipp/sipp-server, only in
server-only code such as route handlers, server functions, API routes, workers,
or services that run in a Node.js runtime.
Guides
- React And Vite: Baseline browser-local setup, WebGPU/WASM asset behavior, OPFS model loading, and local development headers.
- Next.js: App Router provider routes, Client Components, gateway-profile compatibility, and streaming.
- TanStack: TanStack Start provider functions, server routes, and TanStack Query patterns.
Package Selection
| Environment | Package | Notes |
|---|---|---|
| Browser component | @sipp/sipp | Use for browser-local GGUF inference or direct gateway calls. |
| Node server route | @sipp/sipp-server | Use for direct provider endpoints, local server inference, or gateway clients. |
| Gateway profile route | @sipp/sipp-server | Use when a browser kind: 'gateway' endpoint calls a framework route. |
| Gateway client | Either | Browser code can call a separate gateway with short-lived tokens, or server code can use server-held secrets. |
Provider-First Server Routes
Next.js and TanStack server routes should usually use direct provider endpoints when the framework server owns the credential. Register a provider in server-only code:
const endpoint = await client.add('provider', {
kind: 'provider',
provider: 'openai',
model: requiredEnv('OPENAI_MODEL'),
apiKey: requiredEnv('OPENAI_API_KEY'),
});
Use OPENAI_API_KEY="<mock-openai-key>" only as a placeholder in docs and
examples. Do not expose real provider keys in browser bundles.
Gateway Route Field Names
Browser gateway descriptors require an absolute http or https baseUrl
and use routes: { query, chat, embed } for route overrides. Node gateway
descriptors use queryRoute, chatRoute, and embedRoute when server code
calls a gateway through @sipp/sipp-server.
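Because the two descriptor shapes name route overrides differently, shared configuration may need a small translation step. The helpers below are hypothetical and only rename keys between the `routes: { query, chat, embed }` shape and the `queryRoute`/`chatRoute`/`embedRoute` shape described above.

```python
OPERATIONS = ("query", "chat", "embed")


def browser_routes_to_node(routes):
    """Turn { query, chat, embed } overrides into queryRoute/chatRoute/embedRoute."""
    return {f"{op}Route": routes[op] for op in OPERATIONS if op in routes}


def node_routes_to_browser(descriptor):
    """Collect queryRoute/chatRoute/embedRoute fields into a routes mapping."""
    return {
        op: descriptor[f"{op}Route"]
        for op in OPERATIONS
        if f"{op}Route" in descriptor
    }


node_fields = browser_routes_to_node({"query": "/api/query", "chat": "/api/chat"})
browser_routes = node_routes_to_browser(node_fields)
```

Overrides that are absent stay absent in both directions, so each side keeps its own defaults.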
Keep provider credentials and long-lived gateway tokens out of browser bundles. When a browser app needs gateway access, issue short-lived application tokens or proxy through a server route.
Use decodeGatewayQueryBody(), decodeGatewayChatBody(),
decodeGatewayEmbedBody(), and the matching response helpers from
@sipp/sipp-server when a framework route should be registered as a browser
kind: 'gateway' endpoint. Those helpers keep route examples focused on auth,
target policy, provider selection, and client lifecycle instead of gateway
profile JSON shaping.
React And Vite
React and Vite are the baseline browser integration for the @sipp/sipp
package. Use this guide for Vite-specific setup, local development headers,
runtime asset overrides, and the source browser examples.
For the full local inference option map, see Local Inference and Runtime Options.
Install
npm install @sipp/sipp
Browser Local Query
Use @sipp/sipp only in browser code. A local endpoint source can be a model
URL served by the app, a user-provided File, an installed model id, or shard
sources.
import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalQuery(): JSX.Element {
  const [text, setText] = useState('');

  async function run(): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('default', {
        kind: 'local',
        source: '/models/model.gguf',
        options: {
          backend: 'webgpu',
          runtime: {
            context: { n_ctx: 2048 },
          },
        },
      });
      const response = await client.query('Explain Sipp.', {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run()}>
      {text || 'Run'}
    </button>
  );
}
Omit backend to let the browser runtime choose a compatible backend. Use
backend: 'webgpu' when the UI should explicitly request WebGPU and surface
errors or fallbacks itself.
Local Development Headers
The pthread WASM runtime requires SharedArrayBuffer and cross-origin
isolation. Configure Vite dev and preview headers before using
wasmThreading: 'pthread':
// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
  preview: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});
Use wasmThreading: 'single-thread' when the app cannot serve those headers.
Use executionMode: 'main-thread' only for debugging or constrained hosts.
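An app that may or may not be served with those headers can decide at runtime. The following is a minimal sketch, assuming the decision is made from the standard crossOriginIsolated and SharedArrayBuffer globals; pickWasmThreading and IsolationProbe are hypothetical names, and where the resulting value is passed depends on how your app sets wasmThreading.

```typescript
type WasmThreading = 'pthread' | 'single-thread';

// Minimal view of the globals this check reads; passed in so it can be tested.
interface IsolationProbe {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// Chooses 'pthread' only when the page is cross-origin isolated and
// SharedArrayBuffer is exposed; otherwise falls back to 'single-thread'.
function pickWasmThreading(probe: IsolationProbe): WasmThreading {
  const isolated = probe.crossOriginIsolated === true;
  const hasSharedMemory = typeof probe.SharedArrayBuffer === 'function';
  return isolated && hasSharedMemory ? 'pthread' : 'single-thread';
}

// In a browser, the probe can be the global object itself:
//   const wasmThreading = pickWasmThreading(globalThis);
```

Falling back silently keeps the page usable on hosts that cannot send the isolation headers, at the cost of single-threaded execution.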
Runtime Asset Overrides
The browser package resolves its packaged Emscripten JavaScript and WASM assets
at runtime. Most Vite apps can use new SippClient() without asset
overrides.
Override runtime asset URLs only when your bundler or deployment moves package assets:
const client = new SippClient({
  moduleUrl: '/assets/sipp-wasm.js',
  wasmUrl: '/assets/sipp-wasm.wasm',
});
When overriding assets, provide both moduleUrl and wasmUrl. For pthread
runtime assets, provide both pthreadModuleUrl and pthreadWasmUrl.
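The pairing rule can be checked before constructing the client. This is a sketch only: findUnpairedAssetOverrides and RuntimeAssetOverrides are hypothetical names introduced here, and the four option names are the ones listed above.

```typescript
// Field names come from this guide; the validator itself is a hypothetical helper.
interface RuntimeAssetOverrides {
  moduleUrl?: string;
  wasmUrl?: string;
  pthreadModuleUrl?: string;
  pthreadWasmUrl?: string;
}

// Returns a list of problems; an empty list means every override is paired.
function findUnpairedAssetOverrides(overrides: RuntimeAssetOverrides): string[] {
  const problems: string[] = [];
  const pairs: Array<[keyof RuntimeAssetOverrides, keyof RuntimeAssetOverrides]> = [
    ['moduleUrl', 'wasmUrl'],
    ['pthreadModuleUrl', 'pthreadWasmUrl'],
  ];
  for (const [jsKey, wasmKey] of pairs) {
    const hasJs = overrides[jsKey] !== undefined;
    const hasWasm = overrides[wasmKey] !== undefined;
    if (hasJs !== hasWasm) {
      problems.push(`${jsKey} and ${wasmKey} must be provided together`);
    }
  }
  return problems;
}
```

Running this in a build step or at app startup turns a half-configured override into a clear message instead of a runtime asset load failure.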
Model Files And Cache
Serve model URLs from the application or let users select local .gguf files.
The browser runtime stores model data through OPFS where available, so repeated
loads can stay local after the first import or fetch.
Tune browser storage with browserCache on SippClient and tune local
runtime behavior with options.runtime on the local endpoint descriptor. See
Browser Caching and
Runtime Options.
Existing Examples
Serve the source examples when working from a checkout:
sipp run examples serve browser
Then open the printed URL and use:
- /query.html
- /chat.html
- /embed.html
- /gateway_local.html
- /gateway_query.html
- /gateway_chat.html
- /gateway_embed.html
The gateway pages demonstrate browser calls to gateway-profile endpoints. Keep production server routes in a route-owning framework, an application server, or the first-party gateway server.
Next.js
Use @sipp/sipp-server in App Router route handlers that run in the Node.js
runtime. Use @sipp/sipp only in Client Components or browser-only modules.
Next.js App Router pages and layouts are Server Components by default. Add
'use client' only to modules that need browser APIs, state, event handlers,
or browser-local Sipp runtime access.
Profile-Compatible Provider Route
Route handlers are a good place to keep provider credentials off the client.
Set runtime = 'nodejs' for routes that import @sipp/sipp-server.
Routes that are registered from a browser kind: 'gateway' endpoint must speak
the first-party gateway profile. Use the gateway profile helpers from
@sipp/sipp-server to decode the incoming body and format JSON or SSE
responses. The route can still execute the request against a direct provider
endpoint.
Use OPENAI_API_KEY="<mock-openai-key>" as a placeholder in examples. In a
real deployment, keep the key in your server environment or secret manager.
// app/api/sipp/query/route.ts
import {
  SippClient,
  decodeGatewayQueryBody,
  gatewayErrorResponse,
  gatewayTextResponseBody,
  gatewayTextStreamResponse,
} from '@sipp/sipp-server';

export const runtime = 'nodejs';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export async function POST(request: Request): Promise<Response> {
  try {
    const decoded = decodeGatewayQueryBody(await request.json());
    const client = new SippClient();
    const endpoint = await client.add('provider', {
      kind: 'provider',
      provider: 'openai',
      model: decoded.target,
      apiKey: requiredEnv('OPENAI_API_KEY'),
    });
    const run = client.query({
      ...decoded.request,
      endpoint,
    });
    if (decoded.stream) {
      return gatewayTextStreamResponse(run);
    }
    return Response.json(
      gatewayTextResponseBody(decoded.target, await run.response),
    );
  } catch (error) {
    const response = gatewayErrorResponse(error);
    return Response.json(response.body, response.init);
  }
}
Do not return an app-specific shape such as { text } from a route that the
browser package calls through client.add({ kind: 'gateway' }). That route is
an HTTP gateway endpoint from the browser client’s perspective, even when it is
implemented inside the Next application. The server-side implementation can
resolve the request to a provider, a local endpoint, or a separate gateway.
For high-throughput services, keep endpoint setup in a server-only module and reuse the client lifecycle according to your deployment model. Do not import that module from Client Components.
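One way to reuse setup work is to memoize the asynchronous initialization in that server-only module. The sketch below shows only the memoization; lazyAsync is a hypothetical helper, and whether a long-lived client suits your deployment (for example, serverless instances that may not preserve module state) is an assumption to verify.

```typescript
// Wraps an async factory so it runs at most once per process.
function lazyAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory().catch((error: unknown) => {
        cached = undefined; // Allow a retry after a failed setup.
        throw error;
      });
    }
    return cached;
  };
}

// Usage sketch inside a server-only module (names are illustrative):
//   const getProvider = lazyAsync(async () => {
//     const client = new SippClient();
//     const endpoint = await client.add('provider', { /* descriptor */ });
//     return { client, endpoint };
//   });
//   const { client, endpoint } = await getProvider();
```

Concurrent requests that arrive during setup share the same pending promise, so the factory is not started twice.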
Streaming Route Handler
Use a route handler when the browser should receive token updates but the server should keep the provider credential.
// app/api/sipp/stream/route.ts
import { SippClient } from '@sipp/sipp-server';

export const runtime = 'nodejs';

const encoder = new TextEncoder();

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export async function POST(request: Request): Promise<Response> {
  const { prompt } = await request.json() as { prompt?: string };
  if (prompt == null || prompt.trim() === '') {
    return Response.json({ error: 'prompt is required' }, { status: 400 });
  }
  const client = new SippClient();
  const endpoint = await client.add('provider', {
    kind: 'provider',
    provider: 'openai',
    model: requiredEnv('OPENAI_MODEL'),
    apiKey: requiredEnv('OPENAI_API_KEY'),
  });
  const run = client.query({
    endpoint,
    prompt,
    emitTokens: true,
    options: { maxTokens: 128 },
  });
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const batch of run.tokens) {
          controller.enqueue(encoder.encode(batch.text));
        }
        await run.response;
        controller.close();
      } catch (error) {
        controller.error(error);
      }
    },
    cancel() {
      run.cancel('client_disconnected');
    },
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}
Browser-Local Client Component
Browser-local inference needs browser APIs and should live behind a Client Component boundary.
// app/local-chat/LocalChat.tsx
'use client';

import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalChat(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('default', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run('Explain local inference.')}>
      {text || 'Run'}
    </button>
  );
}
If you override moduleUrl, wasmUrl, pthreadModuleUrl, or
pthreadWasmUrl, provide both the JavaScript and WASM asset URLs for the
selected runtime. Use wasmThreading: 'pthread' only when the app is served
with cross-origin isolation headers that enable SharedArrayBuffer.
Hybrid Client Component
Use one browser SippClient to register a browser-local endpoint and a
same-origin provider route that speaks the gateway profile. Select the endpoint
reference at request time; the query call stays the same.
// app/hybrid-chat/HybridChat.tsx
'use client';

import { useState } from 'react';
import { SippClient, type EndpointRef } from '@sipp/sipp';

type InferenceMode = 'local' | 'providerRoute';

export function HybridChat(): JSX.Element {
  const [mode, setMode] = useState<InferenceMode>('local');
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const localEndpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const providerRouteEndpoint = await client.add('app-route', {
        kind: 'gateway',
        target: 'gpt-5-mini',
        baseUrl: window.location.origin,
        routes: { query: '/api/sipp/query' },
        authentication: { kind: 'none' },
      });
      const endpoint: EndpointRef =
        mode === 'local' ? localEndpoint : providerRouteEndpoint;
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <>
      <select
        value={mode}
        onChange={(event) => setMode(event.currentTarget.value as InferenceMode)}
      >
        <option value="local">Browser local</option>
        <option value="providerRoute">Provider route</option>
      </select>
      <button type="button" onClick={() => void run('Explain hybrid inference.')}>
        {text || 'Run'}
      </button>
    </>
  );
}
Browser gateway descriptors require an absolute http or https baseUrl.
For same-origin Next routes, use window.location.origin and set route
overrides such as routes: { query: '/api/sipp/query' }. The target
value becomes the provider model in the server route above.
Separate Gateway Pattern
Use a separate Sipp gateway when you want central target policy, shared
provider credentials, local model hosting, rate controls, or metrics across
multiple applications. For direct browser-to-gateway calls, do not embed a
long-lived gateway token in the client bundle. Have a Next route issue a
short-lived app token, then use a browser valueProvider:
const endpoint = await client.add('gateway', {
  kind: 'gateway',
  target: 'local',
  baseUrl: 'https://gateway.example.com',
  authentication: {
    kind: 'bearer',
    valueProvider: async () => {
      const response = await fetch('/api/sipp/token', { method: 'POST' });
      return await response.text();
    },
  },
});
TanStack
TanStack apps usually combine three Sipp patterns:
- TanStack Start server functions for server-only Sipp work, provider credentials, local model paths, gateway tokens, and typed app RPC.
- TanStack Start server routes when browser code should register the route as a kind: 'gateway' endpoint through the Sipp browser package.
- TanStack Query for client-side final responses that can be cached or refetched by query key.
Use explicit component state or a custom hook for token streaming. TanStack Query is best for Promise-shaped final data, not for appending token batches as they arrive.
TanStack Start Server Function
Server functions run on the server and can be called from loaders, components,
hooks, or other server functions. Keep @sipp/sipp-server, provider
credentials, and gateway tokens in server-only functions.
Use OPENAI_API_KEY="<mock-openai-key>" as a placeholder in examples. In a
real deployment, keep the key in your server environment or secret manager.
// src/server/sipp.ts
import { createServerFn } from '@tanstack/react-start';
import { SippClient } from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export const querySipp = createServerFn({ method: 'POST' })
  .inputValidator((data: { prompt: string }) => data)
  .handler(async ({ data }) => {
    const client = new SippClient();
    const endpoint = await client.add('provider', {
      kind: 'provider',
      provider: 'openai',
      model: requiredEnv('OPENAI_MODEL'),
      apiKey: requiredEnv('OPENAI_API_KEY'),
    });
    const run = client.query({
      endpoint,
      prompt: data.prompt,
      options: { maxTokens: 128 },
    });
    const response = await run.response;
    return { text: response.text, usage: response.usage };
  });
Validate server-function inputs with the same rigor as any public endpoint. Server functions are callable network endpoints, so apply application auth and tenant checks inside the function or middleware.
Server functions are a good fit for typed application calls that return
application-owned shapes such as { text }. They are not the right surface for
browser client.add({ kind: 'gateway' }) endpoints, because those endpoints
expect the first-party gateway HTTP profile.
TanStack Start Provider Route
Use a server route when the browser package should call the framework route as a gateway endpoint. The route accepts the first-party query profile and returns the fields consumed by browser gateway endpoints. The gateway profile helpers decode the browser request and format JSON or SSE responses. The route can then execute the request against a direct provider endpoint.
// src/routes/api/sipp/query.ts
import { createFileRoute } from '@tanstack/react-router';
import {
  SippClient,
  decodeGatewayQueryBody,
  gatewayErrorResponse,
  gatewayTextResponseBody,
  gatewayTextStreamResponse,
} from '@sipp/sipp-server';

function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}

export const Route = createFileRoute('/api/sipp/query')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        try {
          const decoded = decodeGatewayQueryBody(await request.json());
          const client = new SippClient();
          const endpoint = await client.add('provider', {
            kind: 'provider',
            provider: 'openai',
            model: decoded.target,
            apiKey: requiredEnv('OPENAI_API_KEY'),
          });
          const run = client.query({
            ...decoded.request,
            endpoint,
          });
          if (decoded.stream) {
            return gatewayTextStreamResponse(run);
          }
          return Response.json(
            gatewayTextResponseBody(decoded.target, await run.response),
          );
        } catch (error) {
          const response = gatewayErrorResponse(error);
          return Response.json(response.body, response.init);
        }
      },
    },
  },
});
This route uses the model field of the browser's gateway profile request as
the provider model and keeps the provider credential on the server. Add
application auth or model allowlists before exposing the route to users.
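A model allowlist can be a small guard that runs on the decoded target before the provider endpoint is registered. This is a minimal sketch under stated assumptions: requireAllowedModel, ModelNotAllowedError, and the ALLOWED_MODELS contents are hypothetical, and how gatewayErrorResponse maps a custom error to an HTTP status is not covered here.

```typescript
// Hypothetical allowlist; replace with the models your application permits.
const ALLOWED_MODELS: ReadonlySet<string> = new Set(['gpt-5-mini']);

class ModelNotAllowedError extends Error {
  constructor(target: string) {
    super(`model "${target}" is not allowed`);
    this.name = 'ModelNotAllowedError';
  }
}

// Returns the target unchanged when permitted, otherwise throws.
function requireAllowedModel(
  target: string,
  allowed: ReadonlySet<string> = ALLOWED_MODELS,
): string {
  if (!allowed.has(target)) {
    throw new ModelNotAllowedError(target);
  }
  return target;
}

// Usage sketch inside the handler:
//   model: requireAllowedModel(decoded.target),
```

Without a guard like this, any caller that can reach the route chooses which provider model the server credential pays for.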
Use a separate Sipp gateway when you want central target policy, shared provider credentials, local model hosting, rate controls, or metrics across multiple applications.
TanStack Query For Final Responses
Use TanStack Query when the UI needs a final response and normal query cache behavior.
import { useQuery } from '@tanstack/react-query';
import { querySipp } from '../server/sipp';

export function Answer({ prompt }: { readonly prompt: string }): JSX.Element {
  const result = useQuery({
    queryKey: ['sipp-query', prompt],
    queryFn: () => querySipp({ data: { prompt } }),
    enabled: prompt.trim() !== '',
  });
  if (result.isPending) return <p>Loading...</p>;
  if (result.isError) return <p>{result.error.message}</p>;
  return <pre>{result.data.text}</pre>;
}
Keep the query key tied to the prompt, target, and any user-visible generation options that change the result.
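A small key builder keeps that rule in one place. The sketch below is illustrative: sippQueryKey and AnswerKeyInput are hypothetical names, and target and maxTokens stand in for whichever options your UI actually exposes.

```typescript
// Every input that changes the generated answer belongs in the key.
interface AnswerKeyInput {
  prompt: string;
  target: string;
  maxTokens: number;
}

type SippQueryKey = readonly [string, string, string, number];

function sippQueryKey(input: AnswerKeyInput): SippQueryKey {
  return ['sipp-query', input.target, input.prompt, input.maxTokens];
}

// Usage sketch:
//   useQuery({ queryKey: sippQueryKey({ prompt, target, maxTokens }), queryFn: ... })
```

If an option is left out of the key, two requests that differ only in that option share a cache entry and the UI can show an answer generated with the wrong settings.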
Streaming Tokens
For token streaming, create a server route or server function that returns a stream, then append chunks with component state.
import { useState } from 'react';

export function StreamingAnswer(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    setText('');
    const response = await fetch('/api/sipp/stream', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
    });
    // Surface HTTP errors instead of rendering the error body as answer text.
    if (!response.ok) {
      throw new Error(`stream request failed with status ${response.status}`);
    }
    if (response.body == null) {
      throw new Error('streaming response body is missing');
    }
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      setText((current) => current + decoder.decode(value, { stream: true }));
    }
    // Flush any bytes the decoder buffered from a split multi-byte character.
    const tail = decoder.decode();
    if (tail !== '') setText((current) => current + tail);
  }

  return (
    <button type="button" onClick={() => void run('Explain streaming.')}>
      {text || 'Run'}
    </button>
  );
}
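The { stream: true } option matters because a network chunk can end in the middle of a multi-byte UTF-8 character. The following self-contained sketch uses only the standard TextEncoder and TextDecoder APIs; decodeChunks is a hypothetical helper and the chunk boundary is chosen by hand to force the split.

```typescript
const utf8Encoder = new TextEncoder();
const bytes = utf8Encoder.encode('a€b'); // '€' encodes to three UTF-8 bytes.

// Split in the middle of the '€' sequence to imitate a network chunk boundary.
const chunks: Uint8Array[] = [bytes.slice(0, 2), bytes.slice(2)];

// Decodes each chunk in order and concatenates the results.
function decodeChunks(parts: Uint8Array[], stream: boolean): string {
  const decoder = new TextDecoder();
  let text = '';
  for (const part of parts) {
    text += decoder.decode(part, { stream });
  }
  return text + decoder.decode(); // Flush any buffered trailing bytes.
}

const streamed = decodeChunks(chunks, true); // Partial bytes are held until complete.
const unstreamed = decodeChunks(chunks, false); // Partial bytes become U+FFFD.

console.log(streamed === 'a€b');
console.log(unstreamed.includes('\uFFFD'));
```

With streaming decoding the split character is reassembled, while decoding each chunk independently replaces the partial sequences with the U+FFFD replacement character.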
Browser Package
Use the browser package, @sipp/sipp, from components that run in the browser.
That includes browser-local GGUF inference and gateway endpoints with
short-lived tokens or same-origin server routes.
import { useState } from 'react';
import { SippClient } from '@sipp/sipp';

export function LocalAnswer(): JSX.Element {
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const endpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <button type="button" onClick={() => void run('Explain local inference.')}>
      {text || 'Run'}
    </button>
  );
}
Do not import @sipp/sipp-server from browser modules.
Browser Hybrid Endpoints
Register browser-local and same-origin gateway endpoints on one browser
SippClient, then choose the endpoint reference for each request. The
same-origin route can execute against a provider while still speaking the
gateway profile to the browser client.
import { useState } from 'react';
import { SippClient, type EndpointRef } from '@sipp/sipp';

type InferenceMode = 'local' | 'providerRoute';

export function HybridAnswer(): JSX.Element {
  const [mode, setMode] = useState<InferenceMode>('local');
  const [text, setText] = useState('');

  async function run(prompt: string): Promise<void> {
    const client = new SippClient();
    try {
      const localEndpoint = await client.add('browser-local', {
        kind: 'local',
        source: '/models/model.gguf',
      });
      const providerRouteEndpoint = await client.add('app-route', {
        kind: 'gateway',
        target: 'gpt-5-mini',
        baseUrl: window.location.origin,
        routes: { query: '/api/sipp/query' },
        authentication: { kind: 'none' },
      });
      const endpoint: EndpointRef =
        mode === 'local' ? localEndpoint : providerRouteEndpoint;
      const response = await client.query(prompt, {
        endpoint,
        maxTokens: 64,
      }).response;
      setText(response.text);
    } finally {
      await client.close();
    }
  }

  return (
    <>
      <select
        value={mode}
        onChange={(event) => setMode(event.currentTarget.value as InferenceMode)}
      >
        <option value="local">Browser local</option>
        <option value="providerRoute">Provider route</option>
      </select>
      <button type="button" onClick={() => void run('Explain hybrid inference.')}>
        {text || 'Run'}
      </button>
    </>
  );
}
Browser gateway descriptors need an absolute http or https baseUrl.
Same-origin TanStack routes should use window.location.origin and route
overrides such as routes: { query: '/api/sipp/query' }. The target
value becomes the provider model in the server route above.
Gateway
Sipp gateway workflows put one HTTP boundary in front of local GGUF
targets and provider-backed targets. Applications still use the same client
model: register an endpoint with SippClient.add, keep the returned
endpoint reference, and choose that reference for query, chat, or embed.
Use a gateway when you want a separate process to own model paths, provider credentials, target access policy, concurrency limits, metrics, and operational routes.
Notices
Warning
The gateway server is in active development. Changes will be made frequently, and things will break. If you use it in production, be cautious and watch for release updates. You can join our Discord server to follow development.
What To Use
| Need | Start here |
|---|---|
| Run the first-party server from a checkout | Server |
| Build and run the Docker image | Docker |
| Understand the TOML file | Configuration |
| Test with curl, Postman, or raw HTTP | Testing |
| Operate health, metrics, admin, and ingress | Operations |
| Build your own gateway application | Toolkit |
| Understand package boundaries | Architecture |
| Debug common failures | Troubleshooting |
The current release workflow publishes browser npm, Node npm, Python wheels,
and Rust crates. It does not yet publish a standalone gateway-server
binary, public container image, or cargo install target. Build the
first-party server from the source checkout or with the provided Dockerfile.
Gateway Shapes
- First-party server: apps/gateway-server provides TOML configuration, bearer-token policy, local and provider targets, management routes, metrics, and an Admin Dashboard.
- Docker image: apps/gateway-server/Dockerfile builds the same staged gateway distribution and runs sipp-gateway serve --config /etc/sipp/gateway.toml.
- Gateway toolkit: lib/gateway provides codecs, HTTP error helpers, authentication traits, observability traits, and the first-party JSON/SSE profile for custom applications.
- Gateway clients: Browser, Node, Python, and Rust packages all register gateway endpoints through the same .add path used for local and provider endpoints.
Deployment Shapes
- On-board GPU inference: configure a local GGUF target, build or run the gateway with vulkan, cuda, or metal, and mount or point at the model path the process can read.
- Provider-only router: configure only provider targets such as openai, openai_compatible, or anthropic. No local model path or /models mount is required, and a CPU gateway image is sufficient because inference runs at the provider.
- Hybrid: configure both a local GPU target and provider targets. Clients still send the public gateway target name in the request model field.
Default Routes
The first-party server examples use:
- Public: /v1/query, /v1/chat, /v1/embed.
- Management: /, /healthz, /readyz, /metrics, /admin.
Those paths are application configuration, not core library behavior. Custom gateway applications can choose their own routes.
Gateway Quickstart
Use the on-board local path when the gateway should load a GGUF model, or the provider-only path when it should route requests upstream. Read Server and Docker before production deployment.
On-Board Local From Source
cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml
Edit apps/gateway-server/config/local.toml:
- Set the local target model to a GGUF file visible from the workspace root.
- Keep local source binds on 127.0.0.1.
- Keep admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD" unless you also change the .env secret name.
Load secrets and start:
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend vulkan
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend vulkan
Use cuda for NVIDIA hosts or metal for macOS hosts when those are the
intended on-board inference backends.
Provider-Only From Source
cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml
Set provider secrets in apps/gateway-server/.env, then run:
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/provider-only.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/provider-only.toml --backend cpu
Use the request target openai-chat with the checked-in provider-only example.
Docker
Docker uses one secrets-only .env, one gateway TOML, and one explicit Compose
file:
cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development.yml.example apps/gateway-server/development.yml
cp apps/gateway-server/config/development.toml.example apps/gateway-server/config/development.toml
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up
Use development-provider-only.yml.example and
config/provider-only.toml.example for provider-only Docker.
First HTTP Request
In a second terminal:
set -a
. apps/gateway-server/.env
set +a
export GATEWAY_URL="http://127.0.0.1:8080"
export GATEWAY_MANAGEMENT_URL="http://127.0.0.1:9090"
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/readyz"
curl -sS "$GATEWAY_URL/v1/query" \
-H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
-d '{"model":"local","prompt":"Explain gateway inference.","max_tokens":64}'
Use "model":"openai-chat" for the provider-only example.
Open http://127.0.0.1:9090/admin and log in with the value of
SIPP_GATEWAY_ADMIN_PASSWORD.
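The same request can be issued from TypeScript. The sketch below only builds the URL and fetch options so it can run without a live gateway; buildQueryRequest and the two type names are hypothetical, while the route, headers, and body fields mirror the curl command above. The response shape is not described in this section, so the usage comment reads it as raw text.

```typescript
// Body fields copied from the curl example in this section.
interface GatewayQueryBody {
  model: string;
  prompt: string;
  max_tokens: number;
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the POST /v1/query request for a gateway base URL and bearer token.
function buildQueryRequest(
  gatewayUrl: string,
  token: string,
  body: GatewayQueryBody,
): BuiltRequest {
  return {
    url: new URL('/v1/query', gatewayUrl).toString(),
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage sketch (server-side, with the token read from the environment):
//   const { url, init } = buildQueryRequest(gatewayUrl, token, {
//     model: 'local',
//     prompt: 'Explain gateway inference.',
//     max_tokens: 64,
//   });
//   const response = await fetch(url, init);
//   console.log(response.status, await response.text());
```

Swap the model value for openai-chat when testing the provider-only example.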
Gateway Server
The Sipp Gateway Server is the first-party HTTP application for teams that
want one inference boundary for local GGUF targets and provider-backed targets.
It lives in apps/gateway-server.
This page covers source checkout and generated executable operation. Use Docker for container workflows and Configuration for the TOML schema.
The current release workflow does not publish a standalone binary, public
container image, or cargo install target. Build it from the source checkout.
Source Workflow
Use sipp for source checkout workflows. sipp is the setup-installed launcher
for cargo xtask; when the launcher is unavailable, use cargo xtask with
the same arguments.
cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend vulkan
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend vulkan
Before running real on-board inference tests, update the git-ignored local TOML with the token env names, admin password env name, and model path. Put only secret values in the secrets env file.
sipp run gateway-server check builds the staged gateway distribution for the
selected backend, then runs sipp-gateway check. The binary check
command parses and validates TOML only. It does not read bearer-token
environment variables, load model files, contact providers, or bind ports.
sipp run gateway-server serve builds the staged gateway distribution, then
runs the generated sipp-gateway executable from the workspace root. It
reads secret environment variables named by TOML, loads targets, binds both
listeners, and exits cleanly on Ctrl-C.
Use --backend cpu|vulkan|cuda|metal|all to select the backend compiled into
the staged gateway distribution.
Provider-Only Source Workflow
Provider-only gateways route to upstream APIs and do not load a local GGUF model. Use a CPU gateway build because inference happens at the provider:
cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/provider-only.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/provider-only.toml --backend cpu
Use Configuration for Anthropic and OpenAI-compatible target snippets.
Generated Executable
sipp build gateway-server --backend <backend> stages a runnable distribution
in .build/artifacts/gateway-server. The directory contains the
sipp-gateway executable, base runtime libraries, and selected GGML
backend plugins. The build also compiles the React Admin Dashboard from
apps/gateway-server/admin-ui and copies its Vite output to
.build/artifacts/gateway-server/admin-ui. Keep the executable, dashboard
asset directory, and runtime libraries together.
Direct execution must put the artifact directory on the dynamic loader path.
The executable reads dashboard assets from admin-ui beside the binary unless
SIPP_GATEWAY_ADMIN_ASSETS_DIR points at another Vite dist directory.
Linux:
set -a
. apps/gateway-server/.env
set +a
export LD_LIBRARY_PATH="$(pwd)/.build/artifacts/gateway-server${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
.build/artifacts/gateway-server/sipp-gateway check --config apps/gateway-server/config/local.toml
.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/local.toml
macOS:
set -a
. apps/gateway-server/.env
set +a
export DYLD_LIBRARY_PATH="$(pwd)/.build/artifacts/gateway-server${DYLD_LIBRARY_PATH:+:$DYLD_LIBRARY_PATH}"
.build/artifacts/gateway-server/sipp-gateway check --config apps/gateway-server/config/local.toml
.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/local.toml
Windows PowerShell:
Get-Content apps\gateway-server\.env | ForEach-Object {
  if ($_ -and -not $_.StartsWith("#")) {
    $name, $value = $_.Split("=", 2)
    Set-Item -Path "Env:$name" -Value $value
  }
}
$dist = Join-Path (Get-Location) ".build\artifacts\gateway-server"
$env:PATH = "$dist;$env:PATH"
.\.build\artifacts\gateway-server\sipp-gateway.exe check --config apps\gateway-server\config\local.toml
.\.build\artifacts\gateway-server\sipp-gateway.exe serve --config apps\gateway-server\config\local.toml
Relative model paths in TOML are resolved from the process working
directory. The sipp run gateway-server ... workflow runs from the workspace
root. When running the executable from another directory, use absolute model
paths or start the process from the workspace root.
Backends
The gateway server supports the same native backend names as other native targets:
- cpu: provider-only router build or local-inference diagnostic backend.
- cuda: NVIDIA CUDA backend.
- metal: Apple Metal backend on macOS.
- vulkan: Vulkan backend.
- all: host-supported backend set for build commands.
For on-board local target TOML, backend = "auto" selects the best compiled
and available backend in this order: CUDA, Metal, Vulkan, then CPU. Production
model-serving configs should use auto or an explicit GPU backend. Explicit
cpu disables GPU offload and is intended only for diagnostics. Explicit GPU
backends fail if that backend was not compiled or is unavailable.
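The selection rule can be modeled in a few lines. This is an illustrative model of the documented order, not the gateway's implementation: resolveBackend, AUTO_ORDER, and the usable set (backends that are both compiled and available on the host) are hypothetical names.

```typescript
type Backend = 'cuda' | 'metal' | 'vulkan' | 'cpu';
type BackendSetting = Backend | 'auto';

// Preference order documented for backend = "auto".
const AUTO_ORDER: readonly Backend[] = ['cuda', 'metal', 'vulkan', 'cpu'];

// Resolves a TOML backend setting against the compiled and available backends.
function resolveBackend(setting: BackendSetting, usable: ReadonlySet<Backend>): Backend {
  if (setting === 'auto') {
    const picked = AUTO_ORDER.find((backend) => usable.has(backend));
    if (picked === undefined) {
      throw new Error('no compiled and available backend');
    }
    return picked;
  }
  // Explicit settings never fall back to another backend.
  if (!usable.has(setting)) {
    throw new Error(`backend "${setting}" was not compiled or is unavailable`);
  }
  return setting;
}

// Usage sketch: a Vulkan build on a host without CUDA or Metal.
//   resolveBackend('auto', new Set<Backend>(['vulkan', 'cpu']));
```

The key difference is that auto degrades down the list, while an explicit GPU setting fails fast, which is usually the safer behavior when a production host is expected to have a specific GPU.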
Admin Dashboard
The Admin Dashboard password is read from the env var named by TOML:
admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD"
Keep the real value in a secrets env file or production secret manager.
Gateway Docker
Gateway Docker workflows use explicit Compose files plus the gateway TOML and a
secrets-only .env file.
The separation is strict:
- .env contains secret values only.
- TOML contains gateway application configuration.
- Compose YAML contains Docker build, image, port, mount, healthcheck, and container orchestration settings.
The container runs:
sipp-gateway serve --config /etc/sipp/gateway.toml
Files
- apps/gateway-server/Dockerfile builds the staged gateway distribution.
- apps/gateway-server/.env.example is the secrets-only env template.
- apps/gateway-server/development.yml.example builds and runs a local model-serving image.
- apps/gateway-server/development-provider-only.yml.example builds and runs a provider-router image with no model mount.
- apps/gateway-server/production.yml.example runs a prebuilt production model-serving image.
- apps/gateway-server/production-provider-only.yml.example runs a prebuilt provider-router image with no model mount.
- apps/gateway-server/config/*.toml.example are gateway application config templates.
Local Model-Serving Docker
From the repository root:
cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development.yml.example apps/gateway-server/development.yml
cp apps/gateway-server/config/development.toml.example apps/gateway-server/config/development.toml
Edit apps/gateway-server/.env and set only secrets:
SIPP_GATEWAY_ADMIN_PASSWORD=replace-me
SIPP_GATEWAY_TOKEN=replace-me
OPENAI_API_KEY=replace-me
ANTHROPIC_API_KEY=replace-me
Edit apps/gateway-server/config/development.toml:
- Set the local target model to the path the container sees, usually /models/<file>.gguf.
- Keep public_bind = "0.0.0.0:8080" and management_bind = "0.0.0.0:9090" so the gateway listens inside the container.
- Keep admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD" unless the .env secret name also changes.
Edit apps/gateway-server/development.yml for Docker concerns such as image
tag, build backend, build images, model mount, port publishing, and
healthcheck.
Build and run with one backend profile. CPU works on Windows, macOS, and Linux. GPU containers require host-specific device support.
Warning
Windows Docker Desktop does not support the first-party Vulkan gateway path. NVIDIA Windows hosts should use the cuda profile. Do not use old vulkan-windows configs; ggml_vulkan: No devices found means the container cannot enumerate a usable Vulkan physical device.
# CPU, portable across Windows, macOS, and Linux
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-cpu
docker compose --profile cpu --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-cpu
# CUDA, Linux or Windows Docker Desktop with NVIDIA GPU support
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-cuda
docker compose --profile cuda --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-cuda
# Vulkan on native Linux, uses /dev/dri
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml config
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml build gateway-vulkan-linux
docker compose --profile vulkan-linux --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml up gateway-vulkan-linux
If Compose reports orphan containers after switching service names, remove the old containers once:
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development.yml down --remove-orphans
Provider-Only Docker
Provider-only Docker runs use the provider-only Compose template and no model mount:
cp apps/gateway-server/.env.example apps/gateway-server/.env
cp apps/gateway-server/development-provider-only.yml.example apps/gateway-server/development-provider-only.yml
cp apps/gateway-server/config/provider-only.toml.example apps/gateway-server/config/provider-only.toml
Set secrets in apps/gateway-server/.env, then run:
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml config
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml build
docker compose --env-file apps/gateway-server/.env -f apps/gateway-server/development-provider-only.yml up
The provider-only template builds a CPU gateway image because inference happens upstream.
Production Docker
Keep production TOML, Compose, and .env copies outside the repository:
mkdir -p /opt/sipp/gateway
cp apps/gateway-server/.env.example /opt/sipp/gateway/.env
cp apps/gateway-server/production.yml.example /opt/sipp/gateway/production.yml
cp apps/gateway-server/config/production.toml.example /opt/sipp/gateway/production.toml
Edit /opt/sipp/gateway/.env for secret values only. Edit
/opt/sipp/gateway/production.toml for gateway runtime configuration.
Edit /opt/sipp/gateway/production.yml for image names, host model
mounts, ports, restart policy, and healthcheck.
Deploy with one backend profile:
# CPU
docker compose --profile cpu --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile cpu --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-cpu
# CUDA, requires NVIDIA Container Toolkit on the host
docker compose --profile cuda --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile cuda --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-cuda
# Vulkan on Linux hosts, requires /dev/dri rendering devices
docker compose --profile vulkan-linux --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml config
docker compose --profile vulkan-linux --env-file /opt/sipp/gateway/.env -f /opt/sipp/gateway/production.yml up -d gateway-vulkan-linux
For provider-only production, copy production-provider-only.yml.example and
config/provider-only.toml.example instead.
Bind And Mount Behavior
The TOML file always uses the same schema, but bind and path interpretation changes by runtime mode.
| Runtime | TOML bind values | Host exposure | Local target model path |
|---|---|---|---|
| Source/exe | Host addresses, usually 127.0.0.1:* for development | The process binds directly on the host | Path seen from the process working directory |
| Local Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Compose ports map host ports to 127.0.0.1 in local templates | /models/<file>.gguf |
| Production Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Compose exposes the public port and keeps the management port host-local by default | /models/<file>.gguf |
| Provider-only Compose | Container addresses, usually 0.0.0.0:8080 and 0.0.0.0:9090 | Provider-only templates follow the same port rules | No local model path |
Keep management private in production. Put public ingress, TLS, and external auth controls in front of the public listener when needed.
Raw Docker Build
Raw Docker commands are supported as an escape hatch. Supply every build arg explicitly:
docker build \
--build-arg SIPP_GATEWAY_BACKEND=vulkan \
--build-arg SIPP_GATEWAY_BUILDER_IMAGE=rust:bookworm \
--build-arg SIPP_GATEWAY_RUNTIME_IMAGE=ubuntu:22.04 \
--build-arg SIPP_GATEWAY_INSTALL_RUSTUP=0 \
-f apps/gateway-server/Dockerfile \
-t sipp-gateway:vulkan .
Backend Hardware & Docker Constraints
Published gateway images use backend-specific tags: latest-cpu,
latest-cuda, and latest-vulkan.
Supported first-party Docker profiles:
| Host runtime | GPU vendor | Supported profile | Backend | Notes |
|---|---|---|---|---|
| Linux Docker | NVIDIA | cuda | CUDA | Recommended NVIDIA GPU path. Requires NVIDIA drivers and container runtime support. |
| Linux Docker | AMD or Intel | vulkan-linux | Vulkan | Requires host /dev/dri rendering devices and a usable Vulkan driver stack. |
| Linux Docker | No supported GPU | cpu | CPU | Portable diagnostic and fallback path. |
| Windows Docker Desktop | NVIDIA | cuda | CUDA | Requires Docker Desktop WSL2 GPU support and NVIDIA container GPU passthrough. |
| Windows Docker Desktop | AMD or Intel | cpu | CPU | First-party Docker does not support Windows Vulkan GPU inference. |
| macOS Docker | Any | cpu | CPU | Metal is available only through native macOS execution, not Linux Docker. |
CPU Backend (latest-cpu / cpu profile)
- Standard portable execution. Works on any host without special driver dependencies.
- This is the Docker path for macOS local development.
CUDA Backend (latest-cuda / cuda profile)
- Requires the NVIDIA Container Toolkit to be installed and configured on the host.
- Requires NVIDIA host GPU drivers.
- Exposed using Docker Compose GPU device reservation capabilities.
- Supported on Linux and Windows Docker Desktop WSL2 hosts with NVIDIA GPU support.
CUDA Architecture Selection
Set SIPP_CUDA_ARCHITECTURES to control the compiled GPU architecture
list. The value is passed verbatim to CMake, so use semicolon-separated
entries. In Docker builds, pass it as the SIPP_CUDA_ARCHITECTURES build
arg; the Compose CUDA service forwards it to the builder stage.
Defaults are layered:
- cargo xtask build CUDA targets (node, python, cli, gateway-server) default to the portable cloud GPU list below so packaged artifacts stay deterministic across build hosts. Docker gateway builds run xtask, so they inherit the same default when the build arg is empty.
- Raw cargo build of sipp-sys outside xtask does not set CMAKE_CUDA_ARCHITECTURES, which lets vendored llama.cpp choose CUDA-version-aware defaults for the local toolkit.
Portable cloud GPU release images use:
75-virtual;80-virtual;86-real;89-real;90-virtual;120a-real;121a-real
| Entry | Target GPUs |
|---|---|
| 75-virtual | T4 and other Turing cloud GPUs |
| 80-virtual | A100 and other Ampere data-center GPUs |
| 86-real | A10, A40, RTX A6000-class Ampere |
| 89-real | L4, L40S, Ada |
| 90-virtual | H100, H200 Hopper |
| 120a-real | Blackwell architecture-specific target |
| 121a-real | Newer Blackwell architecture-specific target |
For faster builds targeting a known GPU, narrow the list. For example, 80
for A100 only, 90 for H100/H200 only, or 89 for L4/L40S only.
CUDA 13 removes offline compilation support for GPU architectures before
compute capability 7.5, so 61 (Pascal) and 70 (Volta) are excluded from
CUDA 13 builds. Supporting those GPUs requires a separate legacy build using a
CUDA 12.x toolkit image with an explicit SIPP_CUDA_ARCHITECTURES list.
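The CUDA 13 constraint above can be checked before a build starts. The following is a minimal sketch, not part of Sipp: the helper names `compute_capability` and `unsupported_on_cuda13` are hypothetical, and the parser only assumes the entry shapes shown in the list above (a number, an optional letter suffix, and an optional `-real` or `-virtual` suffix).

```python
import re

# Portable release list quoted from the section above.
PORTABLE_ARCHITECTURES = "75-virtual;80-virtual;86-real;89-real;90-virtual;120a-real;121a-real"


def compute_capability(entry: str) -> int:
    """Return the numeric compute capability of one architecture entry."""
    match = re.fullmatch(r"(\d+)[a-z]?(?:-(?:real|virtual))?", entry.strip())
    if match is None:
        raise ValueError(f"unrecognized architecture entry: {entry!r}")
    return int(match.group(1))


def unsupported_on_cuda13(architectures: str) -> list[str]:
    """List entries below compute capability 7.5, which CUDA 13 cannot compile offline."""
    entries = [entry.strip() for entry in architectures.split(";") if entry.strip()]
    return [entry for entry in entries if compute_capability(entry) < 75]
```

A non-empty result for a candidate SIPP_CUDA_ARCHITECTURES value suggests that the build needs the separate CUDA 12.x legacy path.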
The a-suffix Blackwell entries are architecture-specific and not
forward-compatible; keep them aligned with the targets vendored llama.cpp
uses. Plain TensorRT-free CUDA images are the default because the gateway
links against CUDA runtime libraries only; use TensorRT images only if a
TensorRT dependency is introduced.
Vulkan Backend (latest-vulkan image)
- The supported first-party Docker profile is Linux-only: vulkan-linux.
- Linux runs expose host rendering devices with /dev/dri:/dev/dri.
- Windows Docker Desktop Vulkan is unsupported for gateway inference. NVIDIA Windows hosts should use cuda instead.
- The runtime container packages libvulkan1 and mesa-vulkan-drivers for the supported Linux Vulkan profile.
Apple Metal Backend (macOS hypervisor constraints)
Warning
Metal cannot run inside a standard Linux Docker container. Docker on macOS runs within a virtualized Linux hypervisor VM. Apple does not support direct forwarding of the Metal GPU API from macOS into Linux VMs.
Due to this hard architectural boundary:
- Docker Limitation: Running the gateway container on macOS will result in a CPU-only fallback or Vulkan device discovery failure (no Metal GPU acceleration).
- Native Execution: To utilize Apple Silicon GPU acceleration (Metal), macOS users must compile and run the gateway server natively:
cargo xtask build gateway-server --backend metal
./.build/artifacts/gateway-server/sipp-gateway serve --config apps/gateway-server/config/development.toml
Health Check
The Compose templates probe the management readiness route:
curl --fail --silent http://127.0.0.1:9090/readyz
If you change the readiness route in TOML, update the Compose healthcheck too.
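For scripts that start the gateway and then wait for it, the same readiness probe can be polled from Python. This is a sketch under assumptions: the function names are hypothetical, the URL uses the default management bind and /readyz route from the templates, and only an HTTP 200 answer is treated as ready.

```python
import time
import urllib.error
import urllib.request


def probe_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True when the readiness route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx answers all count as not ready.
        return False


def wait_until_ready(url, attempts=10, delay=1.0, probe=probe_ready, sleep=time.sleep):
    """Poll the readiness route until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt + 1 < attempts:
            sleep(delay)
    return False
```

Calling `wait_until_ready("http://127.0.0.1:9090/readyz")` blocks for at most roughly `attempts * (delay + timeout)` seconds. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running gateway.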
Gateway Configuration
apps/gateway-server is configured by one TOML file. The same schema is used
for source/exe runs and Docker runs; only path and bind interpretation changes.
Use the Gateway Server page for source/exe commands and the Gateway Docker
page for container commands.
Example
public_bind = "0.0.0.0:8080"
management_bind = "0.0.0.0:9090"
max_request_bytes = 1048576
max_concurrent_requests = 4
allowed_origins = []
admin_password_env = "SIPP_GATEWAY_ADMIN_PASSWORD"
[security.client_ip]
source = "peer"
trusted_proxy_cidrs = []
[security.rate_limit]
enabled = false
requests_per_minute = 60
burst = 60
[routes]
query = "/v1/query"
chat = "/v1/chat"
embed = "/v1/embed"
index = "/"
health = "/healthz"
readiness = "/readyz"
metrics = "/metrics"
admin = "/admin"
[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "production-client"
targets = ["local"]
[[targets]]
name = "local"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"
Gateway Deployment Shapes
The same TOML schema supports three deployment shapes. Choose the shape by the configured targets.
On-Board GPU Inference
Use a local GGUF target when the gateway server owns model loading and GPU inference:
[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "gpu-client"
targets = ["local-gpu"]
[[targets]]
name = "local-gpu"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"
Use backend = "auto" or an explicit GPU backend such as cuda, metal, or
vulkan. The process must be able to read the GGUF path. Docker runs usually
mount the host model directory at /models.
Provider-Only Router
Use provider targets only when the gateway should hold provider credentials and route client prompts to upstream APIs without loading a local model:
[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "provider-client"
targets = ["openai-chat"]
[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"
timeout_seconds = 60
Provider-only configs have no type = "local" target, no model filesystem
path, and no backend field. CPU gateway builds are appropriate here because
the gateway is not performing on-board inference.
Hybrid
Use both target families when clients should be able to choose between a server-hosted local model and provider endpoints:
[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "hybrid-client"
targets = ["local-gpu", "openai-chat"]
[[targets]]
name = "local-gpu"
type = "local"
model = "/models/model.gguf"
backend = "auto"
stats = "basic"
[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"
timeout_seconds = 60
Requests select the public target name through the request model field, for
example local-gpu or openai-chat.
Top-Level Fields
| Field | Meaning |
|---|---|
| public_bind | Address for public inference routes. Source/exe binds this on the host; Docker binds inside the container. |
| management_bind | Address for health, readiness, metrics, index, and admin routes. Must differ from public_bind. |
| max_request_bytes | Maximum HTTP request body size. Must be greater than zero. |
| max_concurrent_requests | Optional application-wide request admission limit. Omit for unbounded. |
| allowed_origins | CORS allowlist for browser requests to the public listener. Empty disables the CORS layer. |
| admin_password_env | Environment variable containing the Admin Dashboard password. Required and non-blank. |
| security | Required in-memory client identification and rate limiting settings. |
check validates these fields without reading secret env vars, loading
models, contacting providers, or binding ports.
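The field rules in the table can be expressed as a small offline check. The sketch below is illustrative only and is not the gateway's validator: `validate_top_level` is a hypothetical name, the input is an already-parsed TOML document as a plain dict, and only the constraints stated in the table are checked.

```python
def validate_top_level(config: dict) -> list[str]:
    """Return human-readable problems for the documented top-level rules."""
    problems = []

    public = config.get("public_bind")
    management = config.get("management_bind")
    if public is not None and public == management:
        problems.append("management_bind must differ from public_bind")

    max_bytes = config.get("max_request_bytes")
    if isinstance(max_bytes, bool) or not isinstance(max_bytes, int) or max_bytes <= 0:
        problems.append("max_request_bytes must be greater than zero")

    # The field names the env var; the secret value itself is never read here.
    password_env = config.get("admin_password_env")
    if not isinstance(password_env, str) or not password_env.strip():
        problems.append("admin_password_env is required and must be non-blank")

    if "security" not in config:
        problems.append("security is required")

    return problems
```

Like check, this sketch never touches environment variables, model files, providers, or sockets.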
Secrets
TOML names secret environment variables. Secret values belong in a private
.env file or production secret manager, not in TOML.
SIPP_GATEWAY_ADMIN_PASSWORD=replace-me
SIPP_GATEWAY_TOKEN=replace-me
OPENAI_API_KEY=replace-me
ANTHROPIC_API_KEY=replace-me
serve rejects missing or blank secret env values at startup. Bearer token
values must also contain no whitespace.
Routes
query, chat, and embed are required public routes. The other routes are
management routes:
- index: optional management index JSON route.
- health: optional liveness route returning ok.
- readiness: optional readiness route returning ready.
- metrics: optional Prometheus text route.
- admin: optional Admin Dashboard route. Session JSON endpoints live under <admin>/api/session.
Routes must be absolute paths and must not contain query strings or fragments. Public routes cannot duplicate each other. Management routes cannot duplicate each other.
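The route rules above can be sketched as a check over the [routes] table. This is a hypothetical helper, not the gateway's own validation code; it takes the table as a plain dict and applies only the rules stated in this section.

```python
PUBLIC_ROUTES = ("query", "chat", "embed")


def validate_routes(routes: dict) -> list[str]:
    """Return problems found in a [routes] table, using the documented rules."""
    problems = []

    for name in PUBLIC_ROUTES:
        if name not in routes:
            problems.append(f"missing required public route: {name}")

    for name, path in routes.items():
        if not path.startswith("/"):
            problems.append(f"{name}: route must be an absolute path")
        if "?" in path or "#" in path:
            problems.append(f"{name}: route must not contain a query string or fragment")

    # Duplicates are checked per listener: public and management separately.
    public = [path for name, path in routes.items() if name in PUBLIC_ROUTES]
    management = [path for name, path in routes.items() if name not in PUBLIC_ROUTES]
    if len(public) != len(set(public)):
        problems.append("public routes must not duplicate each other")
    if len(management) != len(set(management)):
        problems.append("management routes must not duplicate each other")

    return problems
```

Because the two listeners are checked separately, a public route and a management route may share a path in this sketch, which matches the wording of the rule.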
Tokens
Each [[tokens]] block maps one bearer-token environment variable to a caller
label and a target allowlist:
[[tokens]]
env = "SIPP_GATEWAY_TOKEN"
caller = "browser-client"
targets = ["local", "openai-chat"]
- env names the environment variable containing the bearer token value.
- caller is a stable label used in request metadata and diagnostics.
- targets lists allowed [[targets]].name values. An empty list grants all configured targets.
Token values must be non-empty and contain no whitespace. They are read only
when serve starts.
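The token value rule and the allowlist semantics can be sketched as two small functions. Both names are hypothetical and the code is an illustration of the rules described above, not the gateway's implementation.

```python
def token_value_problem(value):
    """Return a problem string for a bearer token value, or None when it is valid."""
    if value is None or value == "":
        return "token value is missing or empty"
    if any(character.isspace() for character in value):
        return "token value contains whitespace"
    return None


def is_target_allowed(token_targets, configured_targets, target_name):
    """Apply a [[tokens]] allowlist; an empty list grants all configured targets."""
    if target_name not in configured_targets:
        return False
    return not token_targets or target_name in token_targets
```

With the example block above, `is_target_allowed(["local", "openai-chat"], configured, "local")` is true for any `configured` list that contains `local`.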
In-Memory Security Controls
Gateway security controls are process-local in the current version. Admin Dashboard sessions, CSRF tokens, rolling dashboard history, per-client rate-limit buckets, manual blocklist entries, and runtime control overrides disappear when the server restarts. The gateway does not write TOML, create a state file, or use an external cache or database for these controls.
The checked-in examples use the TCP peer address for client IP extraction:
[security.client_ip]
source = "peer"
trusted_proxy_cidrs = []
source can be peer, x_forwarded_for, or x_real_ip. Forwarded headers
are ignored unless trusted_proxy_cidrs contains the proxy CIDR that is
allowed to supply them. Keep source = "peer" unless the gateway sits behind
a trusted reverse proxy that preserves the real client address.
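The trust rule can be illustrated with the standard ipaddress module. This sketch makes assumptions that the text does not confirm: it takes the leftmost X-Forwarded-For entry, it falls back to the peer address when the header is missing or malformed, and the function name and lowercase header keys are hypothetical.

```python
import ipaddress


def resolve_client_ip(peer_ip, headers, source="peer", trusted_proxy_cidrs=()):
    """Pick the client IP for rate limiting (sketch of the documented trust rule)."""
    if source == "peer":
        return peer_ip

    peer = ipaddress.ip_address(peer_ip)
    trusted = any(peer in ipaddress.ip_network(cidr) for cidr in trusted_proxy_cidrs)
    if not trusted:
        # Forwarded headers are ignored unless the peer is a trusted proxy.
        return peer_ip

    if source == "x_forwarded_for":
        # Assumption: use the leftmost entry of the header.
        candidate = headers.get("x-forwarded-for", "").split(",")[0].strip()
    elif source == "x_real_ip":
        candidate = headers.get("x-real-ip", "").strip()
    else:
        raise ValueError(f"unknown client_ip source: {source!r}")

    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        # Assumption: fall back to the peer address on a missing or invalid header.
        return peer_ip
```

The untrusted-peer branch is the important one: without it, any client could choose its own rate-limit bucket by sending a forged header.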
Per-client rate limiting is configured explicitly:
[security.rate_limit]
enabled = false
requests_per_minute = 60
burst = 60
When enabled, the limiter uses an in-memory token bucket keyed by the resolved
client IP. requests_per_minute controls refill rate. burst controls bucket
capacity.
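A token bucket with these two parameters can be sketched in a few lines. This is a generic illustration of the technique, not the gateway's limiter: the class name is hypothetical, and the injectable clock exists only so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Token bucket: `burst` is the capacity, `requests_per_minute` the refill rate."""

    def __init__(self, requests_per_minute, burst, clock=time.monotonic):
        self.refill_per_second = requests_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # a new client starts with a full bucket
        self.clock = clock
        self.updated = clock()

    def allow(self):
        """Consume one token if available; return False when the client is limited."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-client limiter would keep one bucket per resolved client IP, for example in a dict. With requests_per_minute = 60 and burst = 60, a client can send 60 requests at once and then about one per second.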
Targets
Each [[targets]] block publishes one model or provider endpoint under a
stable target name.
Local GGUF
[[targets]]
name = "local"
type = "local"
model = ".build/models/qwen2.5-0.5b-instruct-q4_0.gguf"
backend = "auto"
stats = "basic"
- model is the GGUF path seen by the process. Relative paths resolve from the process working directory.
- backend can be auto, cpu, cuda, metal, or vulkan.
- stats can be off, basic, or profile.
- runtime can contain advanced native runtime settings from the shared runtime options schema.
For on-board inference, prefer backend = "auto" or an explicit GPU backend.
backend = "auto" selects the best compiled and available backend in this
order: CUDA, Metal, Vulkan, then CPU. Explicit cpu disables GPU offload and
is intended only for diagnostics. Explicit GPU backends fail if that backend
was not compiled or is unavailable.
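The selection order can be written out as a short function. This is a sketch of the documented behavior only: `select_backend` is a hypothetical name, and `usable` stands for the set of backends that are both compiled in and available on the host, however the runtime actually detects that.

```python
# Preference order for backend = "auto", as described above.
AUTO_ORDER = ("cuda", "metal", "vulkan", "cpu")


def select_backend(requested, usable):
    """Resolve a target's backend setting against the usable backends."""
    if requested == "auto":
        for backend in AUTO_ORDER:
            if backend in usable:
                return backend
        raise RuntimeError("no usable backend")
    if requested not in usable:
        # Explicit backends never fall back silently.
        raise RuntimeError(f"backend {requested!r} was not compiled or is unavailable")
    return requested
```

The asymmetry is deliberate: auto degrades toward CPU, while an explicit backend fails instead of masking a missing driver or build feature.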
stats = "off" disables runtime metrics and backend profiling.
stats = "basic" enables runtime metrics. stats = "profile" enables runtime
metrics and backend profiling.
OpenAI
[[targets]]
name = "openai-chat"
type = "openai"
model = "provider-model"
api_key_env = "OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"
timeout_seconds = 60
base_url and timeout_seconds are optional. The API key is read from
api_key_env when serve starts.
OpenAI-Compatible
[[targets]]
name = "compatible-chat"
type = "openai_compatible"
model = "served-model"
base_url = "https://provider.example/v1"
token_env = "PROVIDER_TOKEN"
correlation_header = "x-request-id"
timeout_seconds = 60
base_url and token_env are required. correlation_header and
timeout_seconds are optional.
Anthropic
[[targets]]
name = "anthropic-chat"
type = "anthropic"
model = "provider-model"
api_key_env = "ANTHROPIC_API_KEY"
version = "2023-06-01"
timeout_seconds = 60
base_url, version, and timeout_seconds are optional. The API key is read
from api_key_env when serve starts.
Bind Behavior
Source/exe mode binds public_bind and management_bind directly on the
host. Docker mode binds those addresses inside the container; Compose ports
decide host exposure.
For Docker:
- The gateway process should listen on container interfaces such as 0.0.0.0:8080 and 0.0.0.0:9090.
- Local testing keeps both host ports on 127.0.0.1 through Compose port bindings.
- Production exposes public traffic through the configured host port and keeps management on 127.0.0.1 by default.
- Local model paths should match the container mount point in the Compose volume configuration.
- Provider-only Docker configs do not need a model mount because no local GGUF target is loaded.
Admin Dashboard
The dashboard is served only on the management listener. It uses
the value of admin_password_env for login, stores short-lived HTTP-only
sessions, and does not render the password, bearer tokens, or provider secrets.
The dashboard serves a React single-page application from the gateway
distribution’s admin-ui asset directory and exposes session-protected JSON
endpoints under <admin>/api/*. Login uses POST <admin>/api/session, logout
uses DELETE <admin>/api/session, and mutating admin API calls require the
session CSRF token in the x-sipp-admin-csrf header. Runtime edits made
from the dashboard affect only the running process and reset on restart.
Gateway Testing
Use this page when testing the first-party gateway with curl, Postman, or any
other raw HTTP client. The examples assume the default routes from
apps/gateway-server/config/*.toml.
Environment
Bash:
export GATEWAY_URL="http://127.0.0.1:8080"
export GATEWAY_MANAGEMENT_URL="http://127.0.0.1:9090"
export SIPP_GATEWAY_TOKEN="replace-me"
export SIPP_GATEWAY_TARGET="local"
PowerShell:
$env:GATEWAY_URL = "http://127.0.0.1:8080"
$env:GATEWAY_MANAGEMENT_URL = "http://127.0.0.1:9090"
$env:SIPP_GATEWAY_TOKEN = "replace-me"
$env:SIPP_GATEWAY_TARGET = "local"
Management Probes
Health, readiness, and metrics probes do not require bearer authentication:
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/healthz"
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/readyz"
curl --fail --silent "$GATEWAY_MANAGEMENT_URL/metrics"
The Admin Dashboard is available at:
http://127.0.0.1:9090/admin
Log in with the value of the env var named by admin_password_env in TOML.
Query
curl -sS "$GATEWAY_URL/v1/query" \
-H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
-H "x-request-id: curl-query-1" \
-d '{
"model": "'"$SIPP_GATEWAY_TARGET"'",
"prompt": "Explain gateway inference in one sentence.",
"max_tokens": 64,
"temperature": 0.2
}'
Finite text responses use JSON:
{
"id": "response",
"model": "local",
"text": "A gateway centralizes inference behind an HTTP boundary.",
"finish_reason": "stop"
}
When usage is available, the response also includes:
{
"usage": {
"input_tokens": 8,
"output_tokens": 12,
"total_tokens": 20
}
}
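The same query can be issued from Python with the standard library. This is a minimal sketch of a raw HTTP client, not the Sipp Python package: the function names are hypothetical, and the route and field names are taken from the curl example and response shapes above.

```python
import json
import urllib.request


def build_query_request(gateway_url, token, target, prompt, max_tokens=64, temperature=0.2):
    """Build a POST request for the default /v1/query route without sending it."""
    body = {
        "model": target,  # the public target name, for example "local"
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/v1/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def read_text(response_bytes):
    """Return (text, usage) from a finite text response; usage may be None."""
    payload = json.loads(response_bytes)
    return payload["text"], payload.get("usage")
```

To send the request, pass it to `urllib.request.urlopen` and hand the response body to `read_text`. `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx statuses, and the JSON error body can be read from that exception.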
Chat
curl -sS "$GATEWAY_URL/v1/chat" \
-H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$SIPP_GATEWAY_TARGET"'",
"messages": [
{ "role": "system", "content": "Answer briefly." },
{ "role": "user", "content": "What does the gateway own?" }
],
"max_tokens": 64
}'
Chat uses the same finite text response shape as query. Valid message roles are
system, user, and assistant.
Embeddings
curl -sS "$GATEWAY_URL/v1/embed" \
-H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$SIPP_GATEWAY_TARGET"'",
"input": "gateway inference"
}'
Embedding responses use JSON:
{
"id": "response",
"model": "local",
"embedding": [0.0123, -0.0456]
}
Embedding requires a target that supports embeddings. Text-only local models or
provider targets can return an execution error for /v1/embed.
Streaming
Query and chat support server-sent events when the request contains
"stream": true:
curl -N -sS "$GATEWAY_URL/v1/query" \
-H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$SIPP_GATEWAY_TARGET"'",
"prompt": "Write one short sentence about gateways.",
"max_tokens": 64,
"stream": true
}'
The stream content type is text/event-stream. Events are newline-delimited
SSE frames:
event: token
data: {"text":"Gateways","sequence":0}
event: usage
data: {"input_tokens":8,"output_tokens":9,"total_tokens":17}
event: done
data: {"finish_reason":"stop"}
If an error happens after streaming has started, the stream emits:
event: error
data: {"error":{"code":"execution","message":"..."}}
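A client that consumes the stream needs to split it into events. The sketch below parses SSE text into (event, data) pairs. It assumes standard SSE framing, where a blank line ends each event (the blank lines are not visible in the listings above), and the function name is hypothetical.

```python
import json


def parse_sse(text):
    """Parse SSE text into a list of (event_name, decoded_json_data) tuples."""
    events = []
    event_name, data_lines = "message", []
    # The extra empty string flushes a final event that lacks a trailing blank line.
    for line in text.splitlines() + [""]:
        if line == "":
            if data_lines:
                events.append((event_name, json.loads("\n".join(data_lines))))
            event_name, data_lines = "message", []
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and unknown fields are ignored.
    return events
```

A consumer would typically concatenate the text of token events, record the usage event when present, and stop at done or error.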
Postman
Create a Postman environment with these variables:
| Variable | Example |
|---|---|
| gateway_url | http://127.0.0.1:8080 |
| management_url | http://127.0.0.1:9090 |
| gateway_token | replace-me |
| gateway_target | local |
For public routes:
- Method: POST.
- Authorization: Bearer Token with {{gateway_token}}.
- Header: Content-Type: application/json.
- Body: raw JSON.
- Query URL: {{gateway_url}}/v1/query.
- Chat URL: {{gateway_url}}/v1/chat.
- Embed URL: {{gateway_url}}/v1/embed.
For management probes:
- Method: GET.
- URLs: {{management_url}}/healthz, {{management_url}}/readyz, and {{management_url}}/metrics.
- No bearer token is required.
Postman can display finite JSON responses directly. For streaming requests,
use a client that preserves SSE frames, such as curl -N, when debugging token
timing and terminal events.
Common HTTP Failures
| Status | Common cause |
|---|---|
| 400 | Invalid JSON, invalid route body, or unsupported request field value. |
| 401 | Missing bearer token or malformed Authorization header. |
| 403 | Bearer token is valid but not allowed to use the requested target. |
| 404 | Requested model target is not configured. |
| 413 | Request body exceeds max_request_bytes. |
| 429 | max_concurrent_requests admission limit is full. |
| 500 | Target load or execution failure. Check gateway logs and target config. |
Non-streaming errors use JSON:
{
"error": {
"code": "authorization",
"message": "token is not allowed to access target"
}
}
Gateway Operations
The first-party gateway has one public listener and one management listener. Keep those operational surfaces separate in deployment.
Public Listener
The public listener serves inference routes:
- /v1/query
- /v1/chat
- /v1/embed
Every public request must include a bearer token accepted by the configured
[[tokens]] policy. The request model field is the public target name. The
gateway resolves that target to a local model or provider endpoint.
Put TLS, external authentication, rate limiting, and network ingress in front of the public listener when exposing it beyond a trusted network.
Management Listener
The management listener can serve:
- /: optional index JSON route.
- /healthz: liveness route returning ok.
- /readyz: readiness route returning ready.
- /metrics: Prometheus text metrics route.
- /admin: password-protected Admin Dashboard.
Keep the management listener private. In Docker production, the Compose file
binds the management host port to 127.0.0.1 by default.
Admin Dashboard
The Admin Dashboard uses the value of the env var named by
admin_password_env in TOML for login. It stores short-lived HTTP-only
sessions and does not render the password, bearer tokens, or provider secrets.
Use the dashboard to inspect configured routes, targets, selected local backends, and current request metrics. Do not expose it directly to the public internet.
Metrics
The metrics route renders low-cardinality Prometheus text. Current gateway metrics include request and error counters by operation, for example:
sipp_gateway_requests_total{operation="query"} 3
sipp_gateway_errors_total{operation="chat"} 1
Target-level local runtime metrics depend on the target stats setting:
- off: disable runtime metrics and backend profiling.
- basic: enable runtime metrics.
- profile: enable runtime metrics and backend profiling.
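For quick checks without a Prometheus server, the counter lines shown above can be parsed directly. This is a simplified sketch with a hypothetical function name: it handles `name{labels} value` samples, skips comment lines, and ignores anything it does not recognize (including samples with timestamps).

```python
import re

_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Map (metric_name, sorted_label_pairs) to a float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # unsupported line shape in this simplified parser
        labels = tuple(sorted(_LABEL.findall(match.group("labels") or "")))
        samples[(match.group("name"), labels)] = float(match.group("value"))
    return samples
```

Fetching the /metrics body and looking up `("sipp_gateway_errors_total", (("operation", "chat"),))` gives the chat error count from the example above.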
Logging
The gateway uses tracing JSON logs. Set RUST_LOG in the process
environment to control verbosity:
RUST_LOG=info
RUST_LOG=debug,sipp_gateway_server=trace
Do not log bearer token values, provider credentials, or production TOML contents.
CORS
allowed_origins controls browser access to the public listener. An empty
array disables the CORS layer. Add only trusted browser origins:
allowed_origins = ["https://app.example.com"]
Browser clients should use short-lived gateway tokens supplied at runtime, not long-lived tokens embedded in bundles.
Secrets
The gateway uses two types of secrets:
- admin_password_env: TOML field naming the dashboard password env var.
- Token/provider env vars: names are configured in TOML; values are read from the process environment when serve starts.
Keep secrets env files private and outside source control. Use deployment secret stores where available.
Gateway Toolkit
sipp-gateway is a route-free Rust HTTP toolkit for applications that want
to expose Sipp inference through their own server framework.
The toolkit provides codecs, authentication and observability traits, HTTP error helpers, and the first-party JSON/SSE profile. Applications bind sockets, register routes, load configuration, and define deployment policy.
Use Gateway Server when you want the first-party server application with TOML, bearer tokens, target policy, metrics, probes, and listener management.
Distribution
The toolkit crate target is sipp-gateway. crates.io publishing covers
the sipp-rs and sipp-sys crates; the toolkit is intentionally
source-distributed. Use Source Builds when
consuming the toolkit from this checkout.
Use It For
- Building application-owned HTTP gateway routes.
- Translating request bodies into typed Sipp requests.
- Encoding JSON and SSE responses.
- Sharing the first-party protocol profile with Sipp clients.
Minimal Handler Shape
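use sipp_gateway::{GatewayCodec, ProtocolCodec};

// `body`, `resolve`, and `client` are supplied by the surrounding application handler.
let codec = GatewayCodec;
// Decode the raw request body into a typed query plus its public target name.
let mut decoded = codec.decode_query(&body)?;
// Map the public target name to an endpoint chosen by the application.
decoded.request.endpoint = Some(resolve(&decoded.target)?);
let response = client.query(decoded.request).await?;
// Encode the finite text response for the wire.
let bytes = codec.encode_text(&decoded.target, &response)?;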
Custom gateway applications own sockets, route layout, authentication,
configuration files, target policy, CORS, logging, and deployment defaults.
Node route handlers can use the matching gateway profile helpers exported by
@sipp/sipp-server when implementing the same first-party profile in framework
routes.
Boundaries
lib/gateway supplies helpers, not an application:
- It does not register routes.
- It does not bind listeners.
- It does not own bearer-token policy.
- It does not own TOML, CORS, metrics, or deployment behavior.
Default /v1/query, /v1/chat, and /v1/embed paths belong only to
applications that choose them.
Gateway Architecture
Gateway behavior is split into independent layers. There is no compatibility layer for deleted gateway route-autowiring or remote endpoint APIs.
Core Execution
sipp::gateway_core (the gateway_core module of the sipp crate,
behind the gateway feature) exposes only typed query, chat, and embed
execution:
- GatewayRequestContext and cancellation.
- TargetResolver, Authorizer, AdmissionController, and GatewayExecutor.
- GatewayPipeline ordering and admission-permit lifetime.
- Protocol-neutral finite results and streaming events.
It does not depend on HTTP, Axum routes, JSON, SSE, bearer tokens, status codes, aliases, TOML, or fixed limits.
The sipp client API owns local, provider, and gateway endpoint
registration through SippClient.add(...). Gateway endpoints call an HTTP
gateway as a client transport and are never selected implicitly.
Developer Toolkit
lib/gateway contains route-free HTTP helpers for applications that choose to
expose a gateway:
- ProtocolCodec for request, response, stream, and error wire formats.
- Authenticator for arbitrary authentication.
- ErrorTranslator for application HTTP error mapping.
- GatewayCodec for the first-party Sipp JSON/SSE profile.
- GatewayHttpError and SSE/error response encoders.
It does not register routes, expose a router, or own handler paths.
Applications decode requests, select targets, call client.query(),
client.chat(), or client.embed() directly, and encode responses
explicitly.
Public Endpoints
Rust, Node, Python, and browser packages expose gateway endpoint descriptors
through the same .add path used for local and provider endpoints:
- A protocol target.
- A gateway base URL.
- Query, chat, and embed routes.
- Authentication strategy.
- Static headers.
- Timeout policy.
- Protocol-specific request options.
The endpoint id is supplied only to .add. Local model, provider, and gateway
descriptors are different descriptor kinds, but query, chat, and embed
request shapes are identical once an endpoint ref is selected.
First-Party Applications
apps/gateway-server is one opinionated first-party application. Its bearer
tokens, target access, concurrency limit, CORS, routes, management listener,
metrics, and TOML format are application-owned.
examples/gateway demonstrates the canonical developer pattern:
- Create a SippClient.
- Add local, provider, or gateway endpoints with .add.
- Define Axum routes in the example application.
- Decode each route body, select an endpoint, call client.*, and encode the response.
Default /v1/query, /v1/chat, and /v1/embed paths belong only to
applications that choose them. The library supplies codecs and endpoint
transports, not route ownership.
Gateway Troubleshooting
Use this page when the first-party gateway starts, serves, or responds differently than expected.
check Succeeds But serve Fails
check parses and validates TOML only. It does not read token environment
variables, load model files, contact providers, or bind ports.
If serve fails after check succeeds, verify:
- Bearer token env vars named by [[tokens]].env are present and non-empty.
- The env var named by admin_password_env is present and non-empty.
- Provider secret env vars such as OPENAI_API_KEY are present for provider targets.
- Local GGUF paths exist from the process point of view.
- public_bind and management_bind are available and not already in use.
- Requested GPU backends were compiled and are available on the host.
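Because check does not read environment variables, a small preflight step can catch the first three items before serve runs. The sketch below is a hypothetical helper, not part of the gateway; GATEWAY_TOKEN and ADMIN_PASSWORD are placeholder names for whatever your TOML references.

```typescript
// Hypothetical preflight helper; the gateway does not ship this function.
type EnvMap = Record<string, string | undefined>;

// Returns the names of required variables that are unset or empty.
function missingEnvVars(required: string[], env: EnvMap): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value === "";
  });
}

// Names collected by hand from [[tokens]].env, admin_password_env, and provider targets.
const requiredNames = ["GATEWAY_TOKEN", "ADMIN_PASSWORD", "OPENAI_API_KEY"];
const exampleEnv: EnvMap = { GATEWAY_TOKEN: "abc", ADMIN_PASSWORD: "" };
const missing = missingEnvVars(requiredNames, exampleEnv);
// missing is ["ADMIN_PASSWORD", "OPENAI_API_KEY"]
```

In Node.js, pass process.env as the second argument to check the real environment.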
Missing DLL Or Shared Library
Direct executable runs must put .build/artifacts/gateway-server on the
dynamic loader path. The staged executable depends on runtime libraries and
GGML backend plugins in that same directory.
- Windows: prepend the artifact directory to PATH.
- Linux: prepend the artifact directory to LD_LIBRARY_PATH.
- macOS: prepend the artifact directory to DYLD_LIBRARY_PATH.
The sipp run gateway-server ... workflow handles this automatically.
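For a launcher script that starts the staged executable directly, the per-platform rule above can be expressed as a small helper. This is a hypothetical sketch; the function names are not part of Sipp, and the platform strings follow Node.js process.platform values.

```typescript
// Hypothetical helper: which variable to extend on each platform.
type Platform = "win32" | "linux" | "darwin";

function loaderPathVariable(platform: Platform): string {
  switch (platform) {
    case "win32":
      return "PATH";
    case "linux":
      return "LD_LIBRARY_PATH";
    case "darwin":
      return "DYLD_LIBRARY_PATH";
  }
}

// Prepends the artifact directory, keeping any existing value after it.
function prependLoaderPath(platform: Platform, artifactDir: string, current: string | undefined): string {
  const separator = platform === "win32" ? ";" : ":";
  if (current === undefined || current === "") return artifactDir;
  return `${artifactDir}${separator}${current}`;
}
```

A launcher would set the variable named by loaderPathVariable to the value from prependLoaderPath in the child process environment before spawning the gateway.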
Relative Model Path Is Wrong
Relative local target model paths resolve from the process working directory.
sipp run gateway-server ... runs from the workspace root. Direct executable
commands run wherever the shell is currently located.
Use absolute model paths when starting the executable from another directory. For Docker, use the container path, not the host path.
Docker Port Is Published But Host Cannot Connect
In Docker mode, public_bind and management_bind are addresses inside the
container. Use container listener values such as:
public_bind = "0.0.0.0:8080"
management_bind = "0.0.0.0:9090"
Then use Compose ports to control host exposure. The local Compose templates
map both host ports to 127.0.0.1 for workstation-only access.
401 Unauthorized
The public route did not receive a valid bearer token. Check:
- Header is Authorization: Bearer <token>.
- Token value matches the environment variable named by a [[tokens]] block.
- Token contains no whitespace.
- The gateway process was restarted after changing the token environment.
403 Forbidden
The bearer token is valid, but its targets allowlist does not include the
request model target. Add the target name to the relevant [[tokens]] block
or use a token that grants that target.
404 Target Not Found
The request model value does not match any configured [[targets]].name.
The model field in public HTTP requests is a public gateway target name, not
necessarily the provider model or GGUF file name.
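The three status codes above can be read as one decision chain. The sketch below is a hypothetical model for reasoning about a failing request, not the gateway's code; in particular, the order in which the 404 and 403 checks run is an assumption.

```typescript
// Hypothetical shape mirroring a resolved [[tokens]] entry.
interface TokenGrant {
  value: string;      // resolved from the env var named by the [[tokens]] block
  targets: string[];  // allowlist of public target names
}

// Extracts the token from an Authorization header value, or null if malformed.
function parseBearer(header: string | undefined): string | null {
  if (header === undefined) return null;
  const match = /^Bearer (\S+)$/.exec(header);
  return match === null ? null : match[1] ?? null;
}

// Assumed check order: token validity, then target lookup, then allowlist.
function publicRouteStatus(
  header: string | undefined,
  grants: TokenGrant[],
  configuredTargets: string[],
  model: string,
): 200 | 401 | 403 | 404 {
  const token = parseBearer(header);
  const grant = grants.find((g) => token !== null && g.value === token);
  if (grant === undefined) return 401;
  if (!configuredTargets.includes(model)) return 404;
  if (!grant.targets.includes(model)) return 403;
  return 200;
}
```

Walking a failing request through this chain shows which configuration block to inspect: the header and token environment for 401, [[targets]] names for 404, and the token allowlist for 403.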
CORS Failure In Browser
Browser requests require the public listener to allow the page origin. Add the
exact origin to allowed_origins:
allowed_origins = ["http://localhost:5173"]
An empty allowed_origins array disables the CORS layer.
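As a mental model for why a near-miss origin fails, the rule can be sketched as an exact string match. The function below is hypothetical and is not the gateway's CORS layer; it assumes that a disabled layer adds no CORS headers at all.

```typescript
// Hypothetical model of the allowed_origins rule.
// Returns the origin to echo back, or null when no CORS header would be sent.
function corsAllowOrigin(requestOrigin: string, allowedOrigins: string[]): string | null {
  // An empty list means the CORS layer is disabled.
  if (allowedOrigins.length === 0) return null;
  // Origins match exactly: scheme, host, and port must all agree.
  return allowedOrigins.includes(requestOrigin) ? requestOrigin : null;
}
```

Under this model, http://localhost:5173 and http://127.0.0.1:5173 are different origins, so list each one the page may be served from.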
GPU Backend Fails
Explicit local target backends fail when the backend was not compiled or is not
available at runtime. Use backend = "auto" to let the gateway pick the best
compiled and available backend, or select a GPU backend that was included in
the build. Explicit cpu disables GPU offload and is useful only for
diagnosing local-inference setup issues.
Docker GPU builds also require host runtime support:
- CUDA requires NVIDIA host drivers and container runtime support.
- Vulkan requires GPU device access, Vulkan loader, and driver support.
- Metal is macOS-only and not available from Linux Docker.
If Docker logs show ggml_vulkan: No devices found, the container has loaded
the Vulkan backend but cannot enumerate a usable Vulkan physical device. On
Windows Docker Desktop with NVIDIA GPUs, use the cuda profile instead.
Admin Dashboard Login Fails
The dashboard password is read from the env var named by admin_password_env
in the selected TOML file. Confirm the secrets env file or secret manager has
that value, and confirm the gateway is using the intended TOML through
--config.
The dashboard is served on the management listener only.
Guides
Guides explain cross-package behavior and workflow choices that belong outside individual README files.
Local Inference
Local inference runs a GGUF model inside the current browser, Node.js, Python, Rust, or CLI process. The application owns model selection, runtime lifecycle, resource cleanup, and the request options that should be exposed to users.
Register a local endpoint with SippClient.add, keep the returned endpoint
reference, and pass that reference to query, chat, or embed.
Endpoint Flow
- Choose a GGUF model that supports the requested capability.
- Register the model with a local descriptor.
- Set load-time runtime options on the endpoint descriptor.
- Pass request-time generation options to query, chat, or embed.
- Stream tokens or await the final response.
- Close the client when the page, worker, service, or script no longer needs the runtime.
Local endpoints do not route implicitly. A client can register multiple
endpoints, but every request that should use a specific destination should pass
the endpoint reference returned by add.
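The "no implicit routing" rule can be illustrated with a tiny stand-in registry. This is not the Sipp implementation or its types; EndpointRef and EndpointRegistry are hypothetical names used only to show why the returned reference must be kept and passed explicitly.

```typescript
// Illustrative stand-in for the registration pattern; not the Sipp client.
interface EndpointRef {
  readonly id: string;
}

class EndpointRegistry {
  private readonly endpoints = new Map<string, string>();

  // Registers a model path under an id and returns the reference to keep.
  add(id: string, modelPath: string): EndpointRef {
    this.endpoints.set(id, modelPath);
    return { id };
  }

  // Every request names its endpoint; there is no default or fallback route.
  resolve(ref: EndpointRef | undefined): string {
    if (ref === undefined) throw new Error("no endpoint reference supplied");
    const modelPath = this.endpoints.get(ref.id);
    if (modelPath === undefined) throw new Error(`unknown endpoint ${ref.id}`);
    return modelPath;
  }
}

const registry = new EndpointRegistry();
const smallModel = registry.add("small", "/models/small.gguf");
const largeModel = registry.add("large", "/models/large.gguf");
```

With two endpoints registered, a request only reaches the large model when largeModel is passed; omitting the reference is an error rather than a silent default.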
Model Sources
Browser local endpoints can load:
- A model URL served by the application.
- A user-selected File.
- Multiple shard URLs or files.
- An installed model id returned by browser model-management APIs.
- A model plus projector pair for vision-capable models.
Node.js, Python, Rust, and CLI local endpoints use filesystem paths. Source
examples and smoke workflows can use cached sample models under .build/models
when running from a checkout.
Runtime And Request Options
Keep option layers separate:
- Browser client options such as executionMode, wasmThreading, runtime asset URLs, and browserCache belong on new SippClient(...).
- Local endpoint load options choose the model source, browser backend preference, progress callbacks, and NativeRuntimeConfig.
- Runtime config groups such as context, sampling, scheduler, cache, placement, multimodal, residency, and observability describe stable local endpoint behavior.
- Request options such as maxTokens, temperature, topP, stop, cancellation, and emitTokens belong on query, chat, or embed.
- Local-only request options such as context keys, grammars, media inputs, and embedding normalization should not be sent to gateway or provider endpoints.
See Runtime Options for the canonical option map and field groups.
Threads And Browser Execution
Browser execution has two separate choices:
- executionMode: 'worker' or 'auto' keeps inference work off the UI thread when workers are available.
- wasmThreading: 'pthread' enables the pthread WASM runtime and requires SharedArrayBuffer plus cross-origin isolation headers.
Use wasmThreading: 'single-thread' when the app cannot serve COOP/COEP
headers. Use executionMode: 'main-thread' mainly for debugging or constrained
hosts.
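An application that cannot know ahead of time whether its host serves COOP/COEP headers can probe at startup and choose a threading mode. The helper below is a hypothetical sketch; only the two returned strings come from the wasmThreading values named above.

```typescript
// Hypothetical probe result gathered by the application at startup.
interface IsolationProbe {
  crossOriginIsolated: boolean;   // true only when COOP/COEP headers are served
  hasSharedArrayBuffer: boolean;  // typeof SharedArrayBuffer !== "undefined"
}

// Picks pthread only when both requirements for the pthread runtime hold.
function pickWasmThreading(probe: IsolationProbe): "pthread" | "single-thread" {
  return probe.crossOriginIsolated && probe.hasSharedArrayBuffer ? "pthread" : "single-thread";
}
```

In a browser, the probe can be filled from globalThis.crossOriginIsolated and a typeof SharedArrayBuffer check, and the result passed as wasmThreading when constructing the client.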
Native Node.js, Python, and Rust local endpoints can tune CPU thread counts
with context.n_threads and context.n_threads_batch. Leave them unset for
runtime defaults unless the application has measured a better value.
Text, Embeddings, And Vision
- Query and chat require text generation support.
- Embed requires a model/runtime that reports embedding support.
- Vision chat requires a text/vision model plus projector data where the model family requires it.
- Streaming text requires emitTokens and consuming the returned token iterable before or alongside the final response.
- GBNF grammars and media inputs are local-only request features.
Backend Matrix
Sipp local inference is built on llama.cpp and ggml. Sipp owns the client APIs, endpoint model, scheduling, package bindings, browser lifecycle, and gateway integration; llama.cpp and ggml provide the GGUF runtime and backend kernels.
Backend support therefore has two layers:
- Sipp support: which backend names each package can select and how the backend is built or chosen.
- ggml support: which tensor operations each ggml backend implements.
For the ggml operation-level matrix, use the upstream llama.cpp GGML operations table. That table is generated from llama.cpp backend probes and is the source of truth for per-operation support.
Sipp Backend Names
| Backend | Device class | Where Sipp exposes it | Notes |
|---|---|---|---|
| cpu | Host CPU | Browser, Node.js, Python, Rust/source, CLI, gateway server | Portable default. Native builds use ggml CPU; browser builds use WASM CPU with the browser runtime. |
| webgpu | Browser GPU through WebGPU | Browser package | Browser-only. Selected with browser local endpoint options.backend; requires a WebGPU-capable browser and adapter. |
| cuda | NVIDIA GPU | Native source builds, Node.js, Python, CLI, gateway server | Requires a local CUDA Toolkit and compatible NVIDIA driver. xtask reports CUDA readiness but does not install CUDA. |
| metal | Apple GPU through Metal | Native source builds, Node.js, Python, CLI, gateway server on macOS | macOS-only native backend. Best for Apple Silicon and tested AMD Macs; use CPU on Intel integrated GPUs. |
| vulkan | GPU through Vulkan | Native source builds, Node.js, Python, CLI, gateway server | Requires a Vulkan-capable system and driver. xtask can bootstrap the Vulkan SDK for builds. macOS Vulkan is source-build only and runs through a Metal translation layer. |
Upstream llama.cpp/ggml supports more backend families than Sipp currently exposes as package/runtime selectors, including BLAS, CANN, OpenCL, SYCL, ZenDNN, and zDNN. Those appear in the upstream operation matrix but are not first-party Sipp backend names at this time.
Package And Runtime Selection
| Surface | Supported backend selectors | How to select |
|---|---|---|
| Browser local | auto, cpu, webgpu | client.add(..., { kind: 'local', options: { backend: 'webgpu' } }) |
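| Node.js local | cpu, vulkan, cuda, metal | Set SIPP_NODE_BACKEND before the process starts, for example SIPP_NODE_BACKEND=cpu. |
| Python local | cpu, vulkan, cuda, metal | Set SIPP_PYTHON_BACKEND before the process starts, for example SIPP_PYTHON_BACKEND=cpu. |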
| CLI | auto, cpu, cuda, metal, vulkan | sipp ... --backend <backend> |
| Gateway server | auto, cpu, cuda, metal, vulkan | Build or run with sipp ... --backend <backend>; target TOML can set backend = "auto" or a concrete backend. |
| Rust source/client workflows | Compiled native backend set | Build through sipp or cargo xtask; runtime availability follows the linked native artifacts. |
auto is a runtime selection policy. all is a build/test selector used by
sipp and cargo xtask; it builds or checks the host-supported backend set for
that target and is not a runtime backend name.
Mixing Backends
Keep build artifact selection separate from engine backend selection.
- A build artifact decides which ggml GPU backends are compiled and loadable in the current process. A CUDA-only artifact does not make Vulkan available, and a Metal-only artifact does not make CUDA or Vulkan available.
- cpu is the exception in the engine policy. When an engine is explicitly planned for cpu, Sipp disables GPU layers, device placement, GPU K/V offload, op offload, flash attention, and GPU residency leasing for that load.
- Explicit GPU selections such as cuda, metal, vulkan, and webgpu must be both compiled into the active artifact and available on the host.
- Node.js and Python choose the native binding at process load with SIPP_NODE_BACKEND or SIPP_PYTHON_BACKEND. Their local model descriptors do not carry a separate per-engine backend field, so use a different process or artifact when you need a different GPU backend.
- Gateway, CLI, browser, and lower-level Rust lifecycle paths expose backend selectors at the target/load/run layer. They can select only from the backend set available to that artifact and host.
Practical examples:
| Active artifact/process | CPU engine | CUDA engine | Metal engine | Vulkan engine |
|---|---|---|---|---|
| CUDA-only native artifact | Yes, where the surface exposes CPU selection | Yes, if the CUDA device is available | No | No |
| Metal-only native artifact | Yes, where the surface exposes CPU selection | No | Yes, on macOS | No |
| Vulkan-only native artifact | Yes, where the surface exposes CPU selection | No | No | Yes, if the Vulkan device is available |
| Multi-backend source build | Yes | Yes, if compiled and available | Yes, if compiled and available | Yes, if compiled and available |
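The table reduces to one rule: CPU does not depend on a compiled GPU backend, and every GPU backend must be both compiled into the artifact and available on the host. The sketch below restates that rule with hypothetical names; it is not Sipp's selection code, and it ignores whether a given surface exposes CPU selection.

```typescript
type NativeBackend = "cpu" | "cuda" | "metal" | "vulkan";

// Hypothetical description of the active artifact and host.
interface ProcessCapabilities {
  compiled: NativeBackend[];       // GPU backends built into the active artifact
  hostAvailable: NativeBackend[];  // GPU backends whose device and driver are visible
}

// Restates the table: CPU is always runnable, GPU needs compiled AND available.
function canRunEngine(requested: NativeBackend, caps: ProcessCapabilities): boolean {
  if (requested === "cpu") return true;
  return caps.compiled.includes(requested) && caps.hostAvailable.includes(requested);
}

// A CUDA-only artifact on a host that also has a working Vulkan driver.
const cudaOnly: ProcessCapabilities = { compiled: ["cuda"], hostAvailable: ["cuda", "vulkan"] };
```

For cudaOnly, a vulkan request is refused even though the host could run Vulkan, because the artifact never compiled that backend.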
CLI examples:
# Build a CUDA-capable CLI artifact.
sipp build cli --backend cuda
# Use CUDA when the CUDA device is available.
sipp ./models/model.gguf "Explain this model." --chat --backend cuda
# Force CPU for a run; this disables GPU offload for that engine.
sipp ./models/model.gguf "Explain this model." --chat --backend cpu
# This requires a Vulkan-capable artifact; a CUDA-only artifact is not enough.
sipp ./models/model.gguf "Explain this model." --chat --backend vulkan
Gateway target examples:
# Same gateway process, different local targets.
# Each GPU backend must be compiled into the active gateway artifact.
[[targets]]
name = "local-cuda"
type = "local"
model = "./models/model.gguf"
backend = "cuda"
[[targets]]
name = "local-cpu"
type = "local"
model = "./models/model.gguf"
backend = "cpu"
Browser examples:
// Browser local supports CPU and WebGPU backend selection per local endpoint.
await client.add('local-webgpu', {
  kind: 'local',
  model: './models/model.gguf',
  options: { backend: 'webgpu' },
});
await client.add('local-cpu', {
  kind: 'local',
  model: './models/model.gguf',
  options: { backend: 'cpu' },
});
Node.js and Python examples:
# PowerShell: choose the native binding before starting the process.
$env:SIPP_NODE_BACKEND = "cuda"
node .\examples\node\chat.mjs .\models\model.gguf "Explain this model."
$env:SIPP_NODE_BACKEND = "cpu"
node .\examples\node\chat.mjs .\models\model.gguf "Explain this model."
# Bash: choose the native binding before starting the process.
SIPP_PYTHON_BACKEND=cuda \
python examples/python/chat.py ./models/model.gguf "Explain this model."
SIPP_PYTHON_BACKEND=cpu \
python examples/python/chat.py ./models/model.gguf "Explain this model."
Build Matrix
| Build command | Backend argument | Result |
|---|---|---|
| sipp build wasm | none | Browser WASM package with CPU and WebGPU runtime support. |
| sipp build node --backend cpu | cpu, cuda, metal, vulkan, all | Node native binding artifacts for the selected backend set. |
| sipp build python --backend cpu | cpu, cuda, metal, vulkan, all | Python native binding artifacts for the selected backend set. |
| sipp build cli --backend cpu | cpu, cuda, metal, vulkan, all | Local sipp CLI distribution for the selected backend set. |
| sipp build gateway-server --backend cpu | cpu, cuda, metal, vulkan, all | Gateway server distribution for the selected backend set. |
| sipp build all | none | Core, WASM, Python CPU, Node CPU, and CLI CPU targets. |
sipp build all is intentionally conservative. Use an explicit backend build
when you need CUDA, Metal, or Vulkan artifacts.
Operation Support
ggml backends do not all implement the same operation set. Common transformer inference paths are covered by the backends Sipp exposes, but support for a specific model family depends on the ggml operations used by that model and the selected backend.
Use these rules when diagnosing backend issues:
- If a model works on cpu but fails on a GPU backend, check the upstream ggml operations matrix for the missing operation.
- If a GPU backend lacks an operation, llama.cpp/ggml may fall back for some paths, keep tensors on CPU for that operation, or fail depending on the graph and backend policy.
- If a package cannot see a backend at runtime, check that the artifact was built or installed for that backend and that the device driver/runtime is visible to the process.
- Browser webgpu depends on both compiled WebGPU support and browser adapter availability. Use backend: 'cpu' to force the browser CPU path.
For local verification from a source checkout:
sipp doctor --target node --backend vulkan
sipp run llama backend-ops --backend vulkan --mode support
sipp run llama backend-ops --backend cuda --mode perf --op MUL_MAT
The llama backend-ops command builds llama.cpp’s backend operation tool for
the selected backend and is useful when investigating operation coverage or
performance outside the Sipp client path.
Practical Selection
Use cpu first when validating a model or reproducing correctness issues. Move
to a GPU backend after the model, prompt format, and runtime config are known to
work.
Use webgpu for browser-local acceleration when the application can require a
modern WebGPU browser. Keep a CPU fallback for browsers, drivers, and devices
that do not expose a compatible adapter.
Use cuda for NVIDIA-heavy native deployments and metal for Apple Silicon or
tested AMD macOS deployments. On Intel Macs with integrated GPUs, use cpu
unless the exact model, context size, and device have been tested and Metal is
stable and faster than CPU. Use vulkan when you want a cross-vendor native GPU
path and have tested the target driver stack. On macOS, prefer Metal over Vulkan
unless you are specifically testing LunarG’s Vulkan-over-Metal drivers.
Gateway And Hybrid Inference
Gateway inference lets an application call a separate Sipp gateway over HTTP. Hybrid inference registers local and gateway endpoints in the same client so each request can choose where it runs.
When To Use A Gateway
- Keep provider credentials out of browser or edge clients.
- Centralize target access policy and concurrency limits.
- Serve local models from a controlled machine.
- Expose a stable HTTP boundary to multiple language clients.
Gateway Deployment Shapes
The first-party gateway can be deployed in three shapes:
- On-board GPU inference: the gateway loads a local GGUF model and serves it through a GPU backend.
- Provider-only router: the gateway has no local model and forwards requests to provider targets such as OpenAI, Anthropic, or OpenAI-compatible APIs.
- Hybrid: the gateway exposes both local GPU targets and provider targets.
Endpoint Model
The client does not route implicitly. Every application registers descriptors and selects an endpoint reference:
- Local descriptor: a GGUF model loaded by the current runtime.
- Gateway descriptor: a base URL, target name, routes, and authentication.
- Provider descriptor: direct provider adapter where the package supports it.
Gateway descriptors send the target as the first-party profile model field.
The gateway process resolves that public target name to a local or provider
endpoint.
Authentication
Server and script environments use bearer values from environment variables. Browser applications use short-lived tokens supplied at runtime through a provider callback.
Browser Caching
Browser-local inference caches model data in browser storage so repeated loads avoid full network downloads when the runtime supports that path. Sipp browser examples and demos use this path for GGUF model loading.
Responsibilities
The browser package owns runtime integration and cache mechanics. Applications still own:
- The model URL or file selection UI.
- Progress display and cancellation behavior.
- Storage-clearing controls when users need to reclaim space.
- Fallback behavior when browser storage is unavailable.
Practical Guidance
- Prefer model URLs that support range requests for large assets.
- Keep default demo models small enough for first-run onboarding.
- Treat browser storage as user-controlled and best-effort.
- Close SippClient instances when a page, worker, or component no longer needs local runtime resources.
Use the browser examples for minimal flows and the playground for runtime diagnostics.
Providers
Sipp can call external providers directly from trusted server-side
processes or indirectly through a Sipp gateway. Both paths use the same
endpoint model: register a descriptor with SippClient.add, keep the
endpoint reference, and pass it to query, chat, or embed.
Provider credentials must stay in trusted code. Do not ship long-lived provider keys in browser bundles.
Direct Provider Endpoints
Use a direct provider endpoint when the current server process owns the credential lifecycle and application policy. This is the recommended framework route pattern for Next.js and TanStack server code.
import { SippClient } from '@sipp/sipp-server';
function requiredEnv(name: string): string {
  const value = process.env[name];
  if (value == null || value === '') {
    throw new Error(`${name} is required`);
  }
  return value;
}
const client = new SippClient();
const endpoint = await client.add('provider', {
  kind: 'provider',
  provider: 'openai',
  model: process.env.OPENAI_MODEL ?? 'gpt-5-mini',
  apiKey: requiredEnv('OPENAI_API_KEY'),
});
const run = client.chat({
  endpoint,
  messages: [{ role: 'user', content: 'Explain provider inference.' }],
  options: { maxTokens: 128, temperature: 0.2 },
});
console.log((await run.response).text);
Use OPENAI_API_KEY="<mock-openai-key>" only as a placeholder in docs and
examples. Real keys belong in environment variables or a secret manager.
Provider Options
Typed request fields should use Sipp’s request options. Provider-only
fields belong in providerOptions:
const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});
providerOptions is for direct provider endpoints. Gateway-specific extensions
belong in endpointOptions or descriptor-level protocolOptions, because the
gateway implementation owns how those fields are interpreted.
Provider-Backed Gateway Targets
Use the first-party gateway when multiple applications should share target policy, provider credentials, local model hosting, admission control, metrics, or a stable HTTP boundary.
OpenAI target:
[[targets]]
name = "openai-chat"
type = "openai"
model = "gpt-5-mini"
api_key_env = "OPENAI_API_KEY"
OpenAI-compatible target:
[[targets]]
name = "compatible-chat"
type = "openai_compatible"
model = "provider-model"
base_url = "https://provider.example/v1"
token_env = "COMPATIBLE_API_TOKEN"
correlation_header = "x-request-id"
Anthropic target:
[[targets]]
name = "anthropic-chat"
type = "anthropic"
model = "claude-3-5-sonnet-latest"
api_key_env = "ANTHROPIC_API_KEY"
Gateway clients receive only the public target name, gateway URL, and gateway authentication value. Provider credentials stay in the gateway process.
Browser Applications
Browser applications should usually call an application route or gateway, not a provider directly. If a BYOK browser flow is required, use short-lived provider keys supplied at runtime through the browser provider descriptor and keep the user-facing risks explicit.
Vision
Vision workflows send image data alongside chat messages. They are local-only unless an application gateway deliberately implements an equivalent media profile.
Local Vision
Local vision examples typically require:
- A compatible vision-capable GGUF model.
- A projector GGUF when required by the model family.
- An image path, browser canvas export, or image byte payload.
The Rust, Node.js, and Python example directories include vision_chat
examples. Browser demos show canvas and file-oriented workflows.
Browser Vision
Browser applications should keep captured image payloads small and limited to what the task needs. The proactive drawing demo uses cropped JPEG captures so the model sees the relevant ink instead of a full page screenshot.
Reference
Reference pages collect command, configuration, testing, and application details that need a stable home outside package READMEs.
Inference Operations
Sipp separates the operation from the endpoint. Choose query, chat, or
embed based on the input shape and expected output, then pass the endpoint
reference that decides where the request runs.
Shared Contract
- Register a local, gateway, or provider descriptor with SippClient.add.
- Keep the returned endpoint reference.
- Pass that reference to query, chat, or embed.
query and chat both produce text. They share maxTokens, temperature,
topP, stop, cancellation, and token streaming. embed produces vectors and
does not use generation options or token streaming.
| Operation | Input | Output | Best fit |
|---|---|---|---|
| query | One already-rendered prompt string. | Generated text. | Raw completions, custom templates, encoder-decoder text generation, few-shot prompts, and agent loops that render prompts themselves. |
| chat | Ordered { role, content } messages. | Generated assistant text. | Conversation-shaped model calls where the endpoint owns the chat-template or provider-message mapping. |
| embed | One text input. | One embedding vector. | Retrieval, semantic search, ranking, clustering, and memory indexes. |
Local Inference
Local endpoints run a GGUF model in the current browser, Node.js, Python, Rust, or CLI process.
| Operation | What Sipp sends to the runtime | Template behavior | Local-only options |
|---|---|---|---|
| query | The prompt string exactly as supplied. Decoder-only models run the normal decode path; encoder-decoder models run an encoder pass and then the decoder loop. | No chat template is applied. Use this when the application owns a custom or generic prompt format. | Context keys, grammars, JSON schema, sampling overrides, media inputs. |
| chat | Messages are rendered to one prompt with llama.cpp chat-template support and add_assistant = true. | Requires the GGUF to declare tokenizer.chat_template. Sipp checks model metadata, not the llama.cpp fallback chain, before allowing local chat. | Same text options as query, including context keys and media inputs. |
| embed | The input text is encoded by the local embedding runtime. | No chat template and no generation. | Context key and embedding normalization. |
Local chat is a prompt renderer plus generation call, not a conversation
store. Pass prior turns in messages when they should be visible to the model.
Use a context key only for local KV-cache reuse.
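Since local chat keeps no conversation state, the application carries the transcript between calls. The sketch below shows one way to do that; ChatMessage and withTurn are hypothetical names, and the message shape follows the { role, content } form used in this reference.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// The application, not the endpoint, keeps the transcript between calls.
// Returns a new array so earlier snapshots of the transcript stay unchanged.
function withTurn(transcript: readonly ChatMessage[], userText: string, assistantText: string): ChatMessage[] {
  return [
    ...transcript,
    { role: "user", content: userText },
    { role: "assistant", content: assistantText },
  ];
}
```

After each completed call, append the user text and the generated reply with withTurn, then pass the whole transcript plus the next user message as messages on the following chat call.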
Encoder-decoder text models, such as T5 or BART GGUF files, use query for
text generation. Encoder-only models do not generate text and should use
embed when they expose pooled embeddings.
Local Query With A Custom Template
Use query when you want to own the full prompt shape, including a hand-written
or application-provided chat template.
const endpoint = await client.add('local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
});
const prompt = [
  '<|system|>',
  'Answer with one concise paragraph.',
  '<|user|>',
  'Explain local query.',
  '<|assistant|>',
].join('\n');
const run = client.query({
  endpoint,
  prompt,
  options: { maxTokens: 128, temperature: 0.2 },
  local: { contextKey: 'docs-example' },
  emitTokens: true,
});
Local Chat With The Model Template
Use chat when the GGUF model declares the chat template it expects. Sipp
passes the role messages to llama.cpp template rendering and then generates
from the rendered prompt.
const run = client.chat({
  endpoint,
  messages: [
    { role: 'system', content: 'Answer with one concise paragraph.' },
    { role: 'user', content: 'Explain local chat.' },
  ],
  options: { maxTokens: 128, temperature: 0.2 },
  local: { contextKey: 'docs-example' },
  emitTokens: true,
});
If the model has no tokenizer.chat_template, local chat fails. Use query
with an explicit prompt template for base models, legacy models, or any generic
template the application wants to control.
Local Query With Encoder-Decoder Models
Use query for encoder-decoder GGUF models. The source prompt is encoded
first; Sipp then drives the decoder from the model’s decoder-start token.
const endpoint = await client.add('t5-local', {
  kind: 'local',
  modelPath: '/models/t5-small-f16.gguf',
});
const run = client.query({
  endpoint,
  prompt: 'translate English to German: Hello, world.',
  options: { maxTokens: 64 },
});
Most encoder-decoder text models do not declare a GGUF chat template. In that
case chat is rejected even though query works.
Local Embed
Use embed with a model/runtime that supports embeddings. Local embedding
normalization is a local-only option.
const run = client.embed({
  endpoint,
  input: 'Vectorize this sentence for retrieval.',
  local: { normalize: true },
});
const embedding = (await run.response).values;
Remote Gateway
A gateway endpoint sends the operation over HTTP. The first-party profile uses separate routes and payload shapes:
| Operation | Default route | Required body fields |
|---|---|---|
| query | /v1/query | model, prompt |
| chat | /v1/chat | model, messages |
| embed | /v1/embed | model, input |
model is the public gateway target name. The gateway resolves that target to
a local GGUF endpoint, OpenAI endpoint, OpenAI-compatible endpoint, or
Anthropic endpoint.
Gateway calls accept shared text options for query and chat, such as
max_tokens, temperature, top_p, stop, and stream. Local-only fields
such as contextKey, grammar, jsonSchema, sampling, media, and
normalize are rejected by gateway endpoints. Direct-provider
providerOptions are also rejected by gateway endpoints; a custom gateway must
translate provider-specific extensions deliberately.
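A caller that builds gateway requests by hand can apply both rules before sending: put the public target name in model, and fail early if a local-only field slipped in. The sketch below is hypothetical; the camelCase to snake_case mapping is an assumption inferred from the option names listed above, not a documented codec.

```typescript
// Hypothetical client-side option shape; field names follow the shared text options.
interface TextOptions {
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stop?: string[];
}

// Fields the text above lists as rejected by gateway endpoints.
const LOCAL_ONLY_FIELDS = ["contextKey", "grammar", "jsonSchema", "sampling", "media", "normalize"];

// Returns the local-only field names present on a request object.
function localOnlyFieldsIn(request: Record<string, unknown>): string[] {
  return LOCAL_ONLY_FIELDS.filter((field) => request[field] !== undefined);
}

// Builds a first-party /v1/query body with the required model and prompt fields.
function buildQueryBody(target: string, prompt: string, options: TextOptions): Record<string, unknown> {
  const body: Record<string, unknown> = { model: target, prompt };
  if (options.maxTokens !== undefined) body["max_tokens"] = options.maxTokens;
  if (options.temperature !== undefined) body["temperature"] = options.temperature;
  if (options.topP !== undefined) body["top_p"] = options.topP;
  if (options.stop !== undefined) body["stop"] = options.stop;
  return body;
}
```

Calling localOnlyFieldsIn on the outgoing request and refusing to send when the result is non-empty turns a gateway rejection into a clearer local error.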
Gateway Target Mapping
| Gateway target type | query behavior | chat behavior | embed behavior |
|---|---|---|---|
| Local GGUF | Runs local raw-prompt generation. Decoder-only models decode directly; encoder-decoder models run encoder prefill plus decoder generation. No chat template is added. | Runs local chat rendering with the GGUF-declared chat template. Fails if the model has no template, including many encoder-decoder models. | Runs local embedding if the loaded model/runtime supports embeddings. Encoder-decoder text models do not produce embeddings through this runtime. |
| OpenAI | Sends an OpenAI completions request with prompt. | Sends an OpenAI chat-completions request with messages. | Sends an OpenAI embeddings request with input and encoding_format: "float". |
| OpenAI-compatible | Sends /completions with prompt. | Sends /chat/completions with messages. | Sends /embeddings with input and encoding_format: "float". |
| Anthropic | Wraps the prompt as one user message and sends an Anthropic /messages request. | Sends Anthropic /messages; system role messages are joined into the top-level system field, and user/assistant messages remain in messages. | Unsupported by the native Anthropic adapter. |
Provider support still depends on the upstream model and provider. For example,
an OpenAI-compatible target may expose chat but not completions, so gateway
chat can work while gateway query fails for that target.
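The Anthropic row is the one mapping that reshapes messages, so a sketch may help. The functions below are hypothetical and cover only the message portion of a request; other required fields such as the model are omitted, and the "\n\n" separator is an assumption because the table only says system messages are joined.

```typescript
interface RoleMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Only the message-related part of an Anthropic /messages request.
interface AnthropicMessagePart {
  system?: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

// Chat: system messages are lifted into the top-level system field.
function toAnthropicChat(messages: RoleMessage[]): AnthropicMessagePart {
  const systemParts = messages.filter((m) => m.role === "system").map((m) => m.content);
  const turns: AnthropicMessagePart["messages"] = [];
  for (const m of messages) {
    if (m.role !== "system") turns.push({ role: m.role, content: m.content });
  }
  // Assumed separator; the source does not specify how system messages are joined.
  return systemParts.length > 0
    ? { system: systemParts.join("\n\n"), messages: turns }
    : { messages: turns };
}

// Query: the raw prompt is wrapped as a single user message.
function toAnthropicQuery(prompt: string): AnthropicMessagePart {
  return { messages: [{ role: "user", content: prompt }] };
}
```

This also shows why query and chat behave alike on Anthropic targets: both end up as a /messages request, and query simply has one user turn and no system field.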
Gateway Client Chat
const endpoint = await client.add('gateway-openai', {
  kind: 'gateway',
  target: 'openai-chat',
  baseUrl: process.env.SIPP_GATEWAY_URL!,
  authentication: {
    kind: 'bearer',
    value: process.env.SIPP_GATEWAY_TOKEN!,
  },
});
const run = client.chat({
  endpoint,
  messages: [
    { role: 'system', content: 'Answer for application developers.' },
    { role: 'user', content: 'When should I use gateway chat?' },
  ],
  options: { maxTokens: 128, temperature: 0.2 },
});
First-Party Gateway HTTP Examples
Raw-prompt query:
curl -X POST "$SIPP_GATEWAY_URL/v1/query" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "compatible-completion",
    "prompt": "Explain gateway query in one sentence.",
    "max_tokens": 64
  }'
Chat:
curl -X POST "$SIPP_GATEWAY_URL/v1/chat" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "anthropic-chat",
    "messages": [
      { "role": "system", "content": "Answer briefly." },
      { "role": "user", "content": "Explain gateway chat." }
    ],
    "max_tokens": 128
  }'
Embedding:
curl -X POST "$SIPP_GATEWAY_URL/v1/embed" \
  -H "Authorization: Bearer $SIPP_GATEWAY_TOKEN" \
  -H "content-type: application/json" \
  -d '{
    "model": "openai-embed",
    "input": "Text to index for retrieval."
  }'
Choosing Quickly
- Use local query when the application must control every token in the prompt, including custom or generic chat templates, or when the target is an encoder-decoder text model.
- Use local chat when the GGUF model declares its own chat template and the application already has role messages.
- Use local embed when vectors should be produced in the current process and local normalization matters.
- Use gateway query when the target supports raw-prompt generation, including local decoder-only or encoder-decoder GGUF targets and OpenAI-compatible completions targets.
- Use gateway chat for provider chat models and for local GGUF chat models with declared templates.
- Use gateway embed for local, OpenAI, or OpenAI-compatible embedding targets; do not use it with native Anthropic targets.
Runtime Options
Sipp keeps runtime configuration close to the endpoint that owns local
inference. Request options stay on query, chat, or embed calls. Gateway
and provider extensions use separate option buckets so applications can see
which boundary receives each field.
Option Layers
| Layer | Browser package | Node.js package | Purpose |
|---|---|---|---|
| Client options | new SippClient(options) | Environment and process setup | Browser assets, workers, browser cache, and backend selection. |
| Local endpoint load options | client.add(..., { kind: 'local', options }) | client.add(..., { kind: 'local', config }) | Model source, backend preference, progress, and native runtime config. |
| Text request options | client.query(prompt, options) | client.query({ options }) | Output length, sampling shortcuts, streaming, cancellation, and stop strings. |
| Local request options | contextKey, grammar, media, normalize | local: { contextKey, grammar, media, normalize } | Local-only prompt state, grammars, images, and embedding normalization. |
| Gateway extensions | endpointOptions | endpointOptions | Extra fields consumed by gateway endpoint implementations. |
| Provider extensions | providerOptions | providerOptions | Provider-only fields merged into direct provider requests. |
Python and Rust expose the same concepts with language-native descriptors and runtime config classes or structs.
Browser Client Options
Browser SippClientOptions affect the WebAssembly runtime, worker transport,
and browser storage. They do not select a model by themselves.
| Option | Use |
|---|---|
| executionMode | auto uses a worker when available. worker forces worker transport. main-thread is useful for debugging or constrained hosts. |
| wasmThreading | single-thread loads the single-thread WASM runtime. pthread loads the pthread runtime. |
| moduleUrl, wasmUrl | Override single-thread runtime asset URLs when a bundler or deployment moves package assets. Provide both together. |
| pthreadModuleUrl, pthreadWasmUrl | Override pthread runtime asset URLs. Provide both together. |
| browserCache | Tune OPFS split thresholds and direct-load behavior for browser GGUF storage. |
| trustedOrigins | Allow runtime asset URLs from additional origins. Defaults allow same-origin package assets. |
| workerUrl | Override the worker entry URL when the bundler cannot resolve the packaged worker. |
wasmThreading: 'pthread' requires SharedArrayBuffer, cross-origin
isolation, and COOP/COEP headers. Use single-thread when the application
cannot serve those headers.
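For hosts that can set response headers, the two cross-origin isolation headers can be produced by a small helper and attached to every response that serves the application. This is a sketch: the function name is hypothetical, and how the headers are attached depends on the server or framework in use.

```typescript
// The two embedder policies that enable cross-origin isolation.
type EmbedderPolicy = 'require-corp' | 'credentialless';

// Hypothetical helper returning the COOP/COEP pair needed before
// wasmThreading: 'pthread' can work.
function crossOriginIsolationHeaders(
  embedderPolicy: EmbedderPolicy = 'require-corp',
): Record<string, string> {
  return {
    'Cross-Origin-Opener-Policy': 'same-origin',
    'Cross-Origin-Embedder-Policy': embedderPolicy,
  };
}
```

After deploying, `globalThis.crossOriginIsolated` evaluating to `true` in the page confirms that the browser accepted both headers.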
const client = new SippClient({
  executionMode: 'worker',
  wasmThreading: 'single-thread',
});
Local Endpoint Options
Browser local endpoints use source plus optional load options:
const endpoint = await client.add('browser-local', {
  kind: 'local',
  source: '/models/model.gguf',
  options: {
    backend: 'webgpu',
    runtime: {
      context: { n_ctx: 2048 },
    },
  },
});
Node.js local endpoints use modelPath and config:
const endpoint = await client.add('node-local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
  config: {
    context: { n_ctx: 2048, n_threads: 8, n_threads_batch: 8 },
  },
});
Browser backend accepts auto, cpu, or webgpu. Native package backend
selection is package-specific: Node.js uses SIPP_NODE_BACKEND, Python
uses SIPP_PYTHON_BACKEND, and the CLI uses --backend.
Native Runtime Config
NativeRuntimeConfig groups local runtime settings by responsibility.
| Group | Common fields | Use |
|---|---|---|
| placement | devices, gpu_layers, split_mode, main_gpu, tensor_split, use_mmap, use_mlock, fit_params | Model placement, memory mapping, and GPU residency choices. |
| context | n_ctx, n_batch, n_ubatch, n_parallel, n_threads, n_threads_batch, flash_attention, offload_kqv | Context window, batch sizes, CPU thread counts, attention, and KV behavior. |
| sampling | samplers, seed, top_k, top_p, min_p, temperature, repeat_penalty, mirostat, logit_bias | Default local sampling behavior for text generation. |
| scheduler | continuous_batching, policy, prefill_chunk_size, max_running_requests, max_queued_requests | Request scheduling, batching, and queue limits. |
| cache | mode, retained_prefix_tokens, snapshot_interval_tokens, max_snapshot_entries, max_snapshot_bytes | Prefix KV reuse and snapshot behavior. |
| multimodal | projector_path, use_gpu, image_min_tokens, image_max_tokens | Vision projector and image-token settings. |
| residency | max_gpu_models_per_device, allow_cpu_models_while_gpu_loaded, require_gpu_lease | GPU model residency policy for native runtimes. |
| observability | runtime_metrics, backend_profiling | Runtime timing, throughput, and backend diagnostics. |
Use runtime config for stable endpoint behavior. Use request options for values that should vary per prompt, user action, or UI control.
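One way to picture the split is as a fallback: endpoint-level sampling defaults apply unless a request supplies its own shortcut. The sketch below assumes that request-level values take precedence, which this page implies but does not state. The interface and function names are hypothetical; only the field names come from the tables on this page.

```typescript
// Subset of the runtime config `sampling` group (endpoint-level defaults).
interface SamplingDefaults {
  temperature: number;
  top_p: number;
}

// Subset of the per-request generation controls.
interface RequestSamplingOptions {
  maxTokens?: number;
  temperature?: number;
  topP?: number;
}

// Assumption: a request shortcut, when present, overrides the endpoint default.
function effectiveSampling(
  defaults: SamplingDefaults,
  request: RequestSamplingOptions = {},
): SamplingDefaults {
  return {
    temperature: request.temperature ?? defaults.temperature,
    top_p: request.topP ?? defaults.top_p,
  };
}
```

For example, a UI slider that controls temperature would feed `request.temperature`, while the endpoint keeps a fixed `top_p` for every call.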
Request Options
Text-producing calls share common generation controls:
| Option | Use |
|---|---|
| maxTokens | Maximum generated tokens for the response. |
| temperature | Request-local temperature shortcut. |
| topP | Request-local nucleus sampling shortcut. |
| stop | Stop strings for text generation. |
| signal | Cancellation through AbortSignal where supported. |
| emitTokens | Enables token streaming through the returned run handle. |
Local text calls can also use a prompt context key, GBNF grammar, and media inputs for vision-capable models. Embedding calls can set normalization through local embedding options.
Gateway-specific fields belong in endpointOptions. Direct provider-specific
fields belong in providerOptions:
const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128, temperature: 0.2 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});
Provider options cannot override typed fields such as model, messages,
prompt, temperature, or topP/top_p; set those through the typed request
options where Sipp exposes them.
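The rule can be pictured as a merge in which typed fields always win. The sketch below is illustrative only: the reserved-key list contains just the fields named above (the text says "such as", so the real list may be longer), and silently skipping a reserved key is an assumption. Sipp may reject such keys instead.

```typescript
// Only the typed fields named in the paragraph above; the real list may be longer.
const RESERVED_KEYS = ['model', 'messages', 'prompt', 'temperature', 'topP', 'top_p'] as const;

// Hypothetical merge: provider-only fields are added to the request body,
// but a key that collides with a typed field is skipped (assumption).
function mergeProviderOptions(
  typedBody: Record<string, unknown>,
  providerOptions: Record<string, unknown> = {},
): Record<string, unknown> {
  const merged: Record<string, unknown> = { ...typedBody };
  for (const [key, value] of Object.entries(providerOptions)) {
    if ((RESERVED_KEYS as readonly string[]).includes(key)) {
      continue; // typed value stays in place
    }
    merged[key] = value;
  }
  return merged;
}
```

With this shape, `reasoning_effort` passes through to the provider while an attempt to set `temperature` through providerOptions has no effect.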
Device Support
Sipp runs across a range of devices, operating systems, browsers, and GPU accelerators. This page documents which configurations are supported, at what level, and any known limitations.
Compute Backends
Backend names are shared across build configuration and runtime selection. The same name selects the backend in each surface.
| Backend | Status | Feature flag | Default | Platforms | Notes |
|---|---|---|---|---|---|
| CPU | Supported | native | Yes | All | Portable fallback, no accelerator required |
| CUDA | Supported | cuda | No | Linux, Windows | NVIDIA GPUs, compute capability 7.5+ |
| Metal | Supported | metal | No | macOS | Apple Silicon and AMD GPUs; use CPU on Intel integrated GPUs |
| Vulkan | Supported | vulkan | No | Linux, Windows | Vulkan 1.2+ GPU required |
| WebGPU | Supported | GGML_WEBGPU (CMake) | No | WASM browsers | Browser-only, requires shader-f16 |
Runtime selection:
- CLI: --backend auto|cpu|cuda|metal|vulkan
- Node.js: SIPP_NODE_BACKEND=cpu|vulkan|cuda|metal
- Python: SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal
- Browser: backend: 'auto' | 'cpu' | 'webgpu' in model load options
For Node.js and Python, leave the environment variable unset for automatic backend selection.
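An application can validate the Node.js variable before loading a model so that a typo fails early. This is an application-side sketch, not the package's own parsing: the function name is hypothetical, and treating an empty string as unset and throwing on unknown values are both assumptions.

```typescript
// Values accepted by SIPP_NODE_BACKEND according to the list above.
const NODE_BACKENDS = ['cpu', 'vulkan', 'cuda', 'metal'] as const;
type NodeBackend = (typeof NODE_BACKENDS)[number];

// Returns 'auto' when the variable is unset (or empty, by assumption),
// the validated backend name otherwise.
function resolveNodeBackend(env: Record<string, string | undefined>): NodeBackend | 'auto' {
  const raw = env['SIPP_NODE_BACKEND'];
  if (raw === undefined || raw === '') {
    return 'auto';
  }
  const match = NODE_BACKENDS.find((backend) => backend === raw);
  if (match === undefined) {
    throw new Error(`Unsupported SIPP_NODE_BACKEND value: ${raw}`);
  }
  return match;
}
```

In a Node.js process this would be called as `resolveNodeBackend(process.env)`.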
Backend Availability by Package
| Backend | Node.js | Python | Rust | Browser (WASM) | Gateway |
|---|---|---|---|---|---|
| CPU | Yes | Yes | Yes | Yes | Yes |
| CUDA | Yes | Yes | Yes | — | Yes |
| Metal | Yes | Yes | Yes | — | — |
| Vulkan | Yes | Yes | Yes | — | Yes |
| WebGPU | — | — | — | Yes | — |
Additional llama.cpp Backends (Not Yet Exposed)
The vendored llama.cpp supports additional backends that Sipp does not currently expose as feature flags. Community contributions are welcome.
- SYCL (Intel oneAPI)
- HIP / ROCm (AMD)
- OpenCL
- OpenVINO
- CANN (Huawei Ascend)
- MUSA (Moore Threads)
- Hexagon (Qualcomm DSP)
- ZenDNN (AMD)
- RPC (remote backend)
These backends require custom CMake flags on top of the vendored llama.cpp build and are not available through Sipp’s standard build or package commands.
Desktop Browser Support Matrix
The table below shows the first browser version where each feature is available for desktop operating systems. A dash (—) means the feature is not supported.
| Browser | Support | WASM single-thread | WASM pthread¹ | WebGPU | WebGPU + f16² | OPFS³ | Workers |
|---|---|---|---|---|---|---|---|
| Chrome (Win, Mac, Linux) | ✅ Tested | 57 | 92⁴ | 113 | 113 | 86 | 4 |
| Edge (Win, Mac, Linux) | ❌ Untested | 79⁵ | 92⁴ | 113 | 113 | 86 | 79⁵ |
| Firefox (Windows) | ❌ Untested | 52 | 79⁴ | 141 | 141 | 111 | 3.5 |
| Firefox (macOS) | ❌ Untested | 52 | 79⁴ | 145⁶ | 145⁶ | 111 | 3.5 |
| Firefox (Linux) | ❌ Untested | 52 | 79⁴ | ⚠ Nightly | ⚠ Nightly | 111 | 3.5 |
| Safari (macOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 4 |
| Opera (Win, Mac, Linux) | ❌ Untested | 44 | 78⁴ | 99 | 99 | 72 | 11.5 |
| ChromeOS | ❌ Untested | 57 | 92⁴ | 113 | 113 | 86 | 4 |
| Other Chromium-based⁷ | ❌ Untested | 57+ | 92⁴ | 113 | 113 | 86+ | 4+ |
Footnotes:
- ¹ WASM pthread requires the server to send Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp (or credentialless) HTTP headers. See WASM Threading below.
- ² The shader-f16 WebGPU feature is required by Sipp’s browser WebGPU backend. Availability depends on GPU and driver support in addition to the browser version.
- ³ Origin Private File System. Used for model data caching. Requires a secure context (HTTPS). Firefox support is behind the dom.fs.enabled preference until version 111.
- ⁴ Version listed is when SharedArrayBuffer became available with cross-origin isolation headers. Earlier versions may have had the feature without the header requirement.
- ⁵ Edge switched to a Chromium engine at version 79. The Chromium-based Edge supports WASM single-thread from 79, Workers from 79. The legacy EdgeHTML engine supported Workers from version 12 and WASM from version 16.
- ⁶ Firefox 145 enables WebGPU on macOS version 26 (ARM64). Intel Mac support is in progress in Nightly.
- ⁷ Includes Brave, Vivaldi, Arc, and other Chromium-derived browsers. Versions match their underlying Chromium release.
Mobile Browser Support Matrix
| Browser | Support | WASM single-thread | WASM pthread¹ | WebGPU | WebGPU + f16² | OPFS³ | Workers |
|---|---|---|---|---|---|---|---|
| Chrome (Android) | 🟡 Pending | 57 | 92⁴ | 121⁵ | 121⁵ | 86 | 56 |
| Safari (iOS / iPadOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 5 |
| Safari (visionOS) | ❌ Untested | 11 | 15.2⁴ | 26 | 26 | 16.4 | 5 |
| Samsung Internet (Android) | ❌ Untested | 8 | 16⁴ | 24 | 24 | 21 | 4 |
| Opera (Android) | ❌ Untested | 44 | 78⁴ | 80 | 80 | 72 | 11.5 |
| Firefox (Android) | ❌ Untested | 52 | 79⁴ | ⚠ Beta/Nightly | ⚠ Beta/Nightly | 150 | 52 |
| Android WebView | ❌ Untested | 57 | 92⁴ | ⚠ Flag⁶ | ⚠ Flag⁶ | 86 | 56 |
Footnotes:
- ¹ Requires COOP/COEP HTTP headers as described in WASM Threading.
- ² The shader-f16 feature may not be available on all mobile GPU/driver combinations even when the browser version supports it.
- ³ Origin Private File System. Chrome for Android and Samsung Internet support OPFS. iOS Safari supports OPFS from 16.4.
- ⁴ Version listed is when SharedArrayBuffer became available with cross-origin isolation headers.
- ⁵ Chrome 121 on Android 12+ with Qualcomm or ARM GPUs. Support on other GPU vendors (Imagination, Samsung Xclipse) is still rolling out.
- ⁶ Android WebView requires the --enable-unsafe-webgpu flag. Not recommended for production use.
WASM Threading
Sipp ships two WASM runtime artifacts:
| Artifact | Thread count | Token streaming | Requirements |
|---|---|---|---|
| sipp-wasm.js (single-thread) | 1 | postMessage | None |
| sipp-wasm-pthread.js (pthread) | up to 4⁷ | SharedArrayBuffer ring | COOP + COEP headers, secure context |

⁷ Defaults to min(4, navigator.hardwareConcurrency). Override with runtime.context.n_threads in model load options.
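The default in footnote ⁷ can be written out directly. In this sketch the function name is hypothetical, and returning an explicit override unchanged is an assumption; the page does not say whether the runtime clamps an override.

```typescript
// Default pthread count: min(4, hardwareConcurrency), unless the caller
// supplies runtime.context.n_threads (passed here as `override`).
function defaultWasmThreads(hardwareConcurrency: number, override?: number): number {
  if (override !== undefined) {
    return override; // assumption: the override is used as-is
  }
  return Math.min(4, hardwareConcurrency);
}
```

In a browser, the first argument would be `navigator.hardwareConcurrency`, so an 8-core machine defaults to 4 threads and a 2-core machine to 2.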
The client auto-detects pthread availability at runtime:
function supportsWasmPthreads(): boolean {
  return (
    typeof SharedArrayBuffer !== 'undefined' &&
    globalThis.crossOriginIsolated === true &&
    typeof Worker !== 'undefined'
  );
}
Set wasmThreading: 'single-thread' in client options when the hosting environment cannot serve COOP/COEP headers (for example, GitHub Pages or shared hosting without header control).
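An application that wants to make the choice itself can run the same three checks and pass the result as wasmThreading. The sketch takes the host capabilities as a plain object so it can run outside a browser; the interface and function names are hypothetical.

```typescript
// Capability flags an application would read from the browser globals.
interface HostCapabilities {
  hasSharedArrayBuffer: boolean;
  crossOriginIsolated: boolean;
  hasWorker: boolean;
}

// Picks the pthread runtime only when all three requirements hold,
// otherwise falls back to the single-thread runtime.
function chooseWasmThreading(host: HostCapabilities): 'pthread' | 'single-thread' {
  const canUsePthreads = host.hasSharedArrayBuffer && host.crossOriginIsolated && host.hasWorker;
  return canUsePthreads ? 'pthread' : 'single-thread';
}
```

A page served without COOP/COEP headers reports `crossOriginIsolated: false`, so this returns 'single-thread' even in a browser that ships SharedArrayBuffer.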
Platform & OS Support
| OS | x64 | arm64 | Other architectures | Available bindings |
|---|---|---|---|---|
| Linux (glibc) | Yes | Yes | arm, loong64, riscv64, ppc64, s390x | Node.js, Python, Rust |
| Linux (musl) | Yes | Yes | arm, loong64, riscv64 | Node.js |
| Windows (MSVC) | Yes | Yes | ia32 | Node.js, Python, Rust |
| Windows (GNU) | Yes | — | — | Node.js |
| macOS | Yes | Yes | universal2 | Node.js, Python, Rust |
| Android | — | Yes | arm (eabi) | Node.js |
| FreeBSD | Yes | Yes | — | Node.js |
| OpenHarmony | Yes | Yes | arm | Node.js |
Docker Containers
| Profile | Backend | Host OS | Notes |
|---|---|---|---|
| CPU | CPU | Linux, macOS, Windows | Works everywhere, no GPU passthrough |
| CUDA | CUDA | Linux, Windows (WSL2) | Requires NVIDIA Container Toolkit |
| Vulkan | Vulkan | Linux only | Windows Docker Desktop does not support Vulkan passthrough |
| Metal | — | — | Metal unavailable inside Linux containers |
GPU & Accelerator Support
NVIDIA CUDA
Sipp targets NVIDIA GPUs with compute capability 7.5 and above. CUDA 13 removes support for architectures below 7.5.
| Architecture | Compute Capability | Target GPUs |
|---|---|---|
| Turing | 7.5 | T4, Quadro RTX, GeForce RTX 20-series |
| Ampere | 8.0, 8.6 | A100, A10, A40, RTX A6000, GeForce RTX 30-series |
| Ada Lovelace | 8.9 | L4, L40S, GeForce RTX 40-series |
| Hopper | 9.0 | H100, H200 |
| Blackwell (Data Center) | 10.0 | B100, B200, GB200 |
| Blackwell (Consumer/Edge) | 12.0, 12.1 | GeForce RTX 50-series, RTX PRO Blackwell |
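The 7.5 floor is a comparison on the major and minor parts of the compute capability, not on the string or a floating-point value. A small sketch, with a hypothetical function name, that a deployment script could use to pre-check a reported capability:

```typescript
// Returns true when a "major.minor" compute capability meets the 7.5 minimum.
function isSupportedComputeCapability(capability: string): boolean {
  const match = /^(\d+)\.(\d+)$/.exec(capability);
  if (match === null) {
    return false; // not in "major.minor" form
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major > 7 || (major === 7 && minor >= 5);
}
```

Comparing the parts separately keeps values such as 12.1 from being misordered against 7.5, which a plain string comparison would get wrong.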
Vulkan
Any GPU with Vulkan 1.2 or later driver support works on Linux and Windows. Tested on:
- NVIDIA: Turing, Ampere, Ada Lovelace, Hopper (proprietary driver)
- AMD: RDNA 2 and later (AMDGPU PRO or RADV)
- Intel: Gen12/Xe and later (ANV)
Windows Docker Desktop does not support the Vulkan backend.
macOS source builds can compile Vulkan through the LunarG SDK, but LunarG’s macOS drivers translate Vulkan to Metal. Sipp does not publish macOS Vulkan packages because the native Metal backend is simpler for normal macOS use and macOS Vulkan adds loader/ICD runtime requirements.
Metal
- Apple Silicon: M1, M2, M3, M4 series
- AMD: GPUs supported by macOS (Radeon Pro series)
Metal is macOS-only and unavailable inside Docker containers. Intel integrated GPUs expose Metal, but Sipp does not treat them as a recommended Metal target; use the CPU backend on those Macs unless you have tested the exact model, context size, and device and confirmed that Metal is stable and faster than CPU.
Apple Silicon can run x64 processes through Rosetta 2. A darwin-x64 Node or
Python native package is only used by an x64 Node/Python process; native arm64
Node/Python installations use the darwin-arm64 packages and are the preferred
path on Apple Silicon.
WebGPU (Browser)
Any GPU that the host browser exposes as a WebGPU adapter may work, but Sipp requires the shader-f16 feature for WebGPU acceleration. Common configurations:
| GPU Family | Chrome (D3D12) | Chrome (Vulkan) | Firefox (wgpu) | Safari (Metal) |
|---|---|---|---|---|
| NVIDIA | Yes | Yes (Linux) | Yes | — |
| AMD | Yes | Yes (Linux) | Yes | Yes |
| Intel integrated | Yes | Yes (Linux) | Yes | Yes |
| Apple Silicon | — | — | Yes | Yes |
| Qualcomm (Android) | Yes | — | — | — |
| ARM Mali | Yes (Android) | — | — | — |
Language Binding Support
| Package | Install command | Status | Runtime | Primary use |
|---|---|---|---|---|
| Browser (@sipp/sipp) | npm install @sipp/sipp | Published (npm) | WASM / WebGPU | Browser-local GGUF inference, gateway clients |
| Node.js (@sipp/sipp-server) | npm install @sipp/sipp-server | Published (npm) | N-API native | Server processes, route handlers, backend services |
| Python (sipppy) | pip install sipppy | Published (PyPI) | PyO3 native | Python services, scripts, gateway clients |
| Rust (sipp-rs) | cargo add sipp-rs | Published (crates.io) | Native-backed Rust crate | Rust applications and services |
| Gateway server | Source-built | Source only | Axum binary | HTTP gateway for local and provider targets |
| Gateway Docker | Docker from source | Source only | Container | Production container workflows |
| Gateway toolkit | Source artifact | Source only | Rust crate | Custom gateway applications |
Limitations & Work in Progress
- Gateway server does not have a published binary or public container image yet. It must be built from source.
- Windows Docker Vulkan is not supported. Use the CUDA or CPU profiles on Windows with WSL2.
- macOS Docker is CPU-only. Metal cannot run inside a Linux Docker container.
- Android and iOS are not first-class package targets. The browser WASM package works on mobile web browsers, but no native Android or iOS packages are published.
- Chrome (desktop) is the primary tested browser target. Other desktop browsers (Edge, Firefox, Safari, Opera, Chromium derivatives) are untested.
- Mobile browser support has not been validated yet. Chrome (Android) is the next target for testing.
- Firefox WebGPU on Linux and Android is in active development (Nightly / Beta). Firefox WebGPU on macOS Intel is also in progress.
- Gateways are compatible with OpenAI and OpenAI-compatible providers plus Anthropic. Additional provider support is added over time.
CLI
apps/cli builds the sipp command-line application for local GGUF text
generation. It is useful for runtime smoke testing, manual model checks, and
quick local prompts.
Build
cargo xtask build cli --backend cpu
cargo xtask build cli --backend all
Run
cargo run -p sipp-cli -- <model.gguf> "Explain Sipp."
Useful flags include:
- --max-tokens
- --ctx-size
- --backend auto|cpu|cuda|metal|vulkan
- --temperature
- --stats off|basic|profile
- --chat
Use cargo run -p sipp-cli -- --help for the full generated help.
Configuration
Sipp configuration is intentionally split by responsibility. Core crates do not own HTTP routes, authentication schemes, TOML files, or deployment policy.
Runtime Configuration
Local runtime configuration belongs to the endpoint descriptor or package-level runtime options. Common areas include context size, scheduler behavior, cache mode, observability, sampling, and backend selection. See Runtime Options for the shared option map.
Gateway Configuration
apps/gateway-server owns TOML configuration for the first-party gateway
application:
- [routes] selects public and management paths.
- admin_password_env names the secret env var containing the Admin Dashboard password.
- [[tokens]] maps bearer-token environment variables to caller labels and allowed targets.
- [[targets]] defines local, OpenAI, OpenAI-compatible, or Anthropic targets. Local targets can select backend = "auto", cpu, cuda, metal, or vulkan.

See Gateway Configuration for the full schema.
Custom wire formats, authentication schemes, and route layouts belong in
separate applications composed from lib/gateway.
Environment Variables
- SIPP_GATEWAY_TOKEN: development bearer token for examples and gateway server commands.
- SIPP_GATEWAY_ADMIN_PASSWORD: Admin Dashboard password used by gateway examples.
- SIPP_GATEWAY_URL: gateway base URL for client examples.
- SIPP_NODE_BACKEND: Node runtime backend selection.
- SIPP_PYTHON_BACKEND: Python runtime backend selection.
- OPENAI_API_KEY: provider credential used by OpenAI examples and provider-backed gateway targets.
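Client code that reads these variables can fail with a clear message instead of passing undefined into an endpoint descriptor. The sketch below covers only the two client-side gateway variables; the function and interface names are hypothetical.

```typescript
interface GatewayEnv {
  baseUrl: string;
  token: string;
}

// Reads SIPP_GATEWAY_URL and SIPP_GATEWAY_TOKEN from an env-like object and
// throws when either is missing or empty.
function readGatewayEnv(env: Record<string, string | undefined>): GatewayEnv {
  const baseUrl = env['SIPP_GATEWAY_URL'];
  const token = env['SIPP_GATEWAY_TOKEN'];
  if (!baseUrl) {
    throw new Error('SIPP_GATEWAY_URL is not set');
  }
  if (!token) {
    throw new Error('SIPP_GATEWAY_TOKEN is not set');
  }
  return { baseUrl, token };
}
```

In Node.js, `readGatewayEnv(process.env)` returns values that can be passed as the gateway endpoint's baseUrl and bearer token.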
Examples And Demos
Examples are small, runnable integrations. Demos are broader browser experiences for inspecting runtime behavior and user-facing workflows.
Examples
- examples/rust: Rust query, chat, embed, vision, gateway, and provider examples.
- examples/node: Node.js query, chat, embed, vision, and gateway examples.
- examples/python: Python query, chat, embed, vision, and gateway examples.
- examples/web: Vite browser pages for local and gateway workflows.
- examples/gateway: minimal Axum gateway route composition.
Start with:
cargo xtask run examples gateway rust --case query
cargo xtask run examples serve browser
Demos
- demos/chat: focused browser chat interface for local GGUF models.
- demos/avatar: React and three.js character demo.
- demos/proactive-ui: drawing-to-vision demo with runtime tracing.
- demos/simulation: multi-agent simulation demo using director helpers.
- tools/playground: browser runtime diagnostics and automation tool.
Start with:
cargo xtask run demos serve chat
cargo xtask run tools serve playground
Use cargo xtask test smoke group examples --backend cpu for model-backed
example smoke coverage when validating broader runtime behavior.
Maintainers
This section is for developers working from the Sipp source checkout. It covers repository structure, build orchestration, tests, coverage, and contribution workflow.
Application developers who only need the published packages should start with Using the Core Library.
Start Here
- Source Builds covers checkout setup, sipp, source examples, demos, and package build targets.
- Architecture explains crate and package boundaries.
- Gateway Architecture explains gateway layering.
- Testing lists the cataloged test suites.
- Coverage covers coverage commands and outputs.
- Contributing documents contribution expectations.
Source Builds
Use the source checkout when developing Sipp itself, validating package artifacts, running examples, or deploying the gateway server before a public server artifact exists.
Bootstrap
From the repository root:
source ./setup.sh
sipp doctor
sipp test list
On Windows, run .\setup.ps1 from PowerShell or setup.cmd from CMD. After
setup, sipp is a repo-local alias for cargo xtask; use cargo xtask ...
with the same arguments if the launcher is not active.
Build Targets
Use the xtask orchestrator instead of direct build commands when compiling Sipp targets. It manages native dependencies, backend toolchains, and package staging.
sipp build core
sipp build node --backend cpu
sipp build python --backend cpu
sipp build gateway-server --backend cpu
sipp build wasm
sipp build all
Use --backend vulkan, --backend cuda, --backend metal, or
--backend all where a native package target supports those backends.
CUDA builds compile a portable cloud GPU architecture list by default. Set
SIPP_CUDA_ARCHITECTURES (semicolon-separated CMake entries, for example
80 for A100 only) before building to narrow the list for faster local
builds. See docs/gateway/docker.md for the full list
and rationale.
Examples And Demos
Run browser examples and demos through sipp. These commands start Vite dev
servers and do not accept native backend flags:
sipp run examples serve browser
sipp run demos serve avatar
sipp run demos serve simulation
Gateway Hello World Examples
Gateway example workflows start a local gateway, run a client example, and stop
the gateway when the client exits. They start examples/gateway and then run a
client from examples/rust, examples/node, or examples/python.
Use --case query|chat|embed to choose the client case. Use
--backend cpu|vulkan|cuda|metal when the gateway process should use a
specific native backend.
sipp run examples gateway rust --case query
sipp run examples gateway node --case chat
sipp run examples gateway python --case embed --backend vulkan
Playground
The browser playground lives under tools/playground. Use it to inspect local
inference, vision model setup, GGUF loading, runtime observability, and
repeatable browser runtime smoke checks.
sipp run tools serve playground
Gateway Server
The release workflow does not yet publish a standalone gateway-server binary or
container image. Use sipp for source checkout checks and raw Docker commands
for container deployment. The canonical source guide is
Gateway Server; Docker deployment is covered in
Gateway Docker.
cp apps/gateway-server/config/local.toml.example apps/gateway-server/config/local.toml
cp apps/gateway-server/.env.example apps/gateway-server/.env
set -a
. apps/gateway-server/.env
set +a
sipp run gateway-server check --config apps/gateway-server/config/local.toml --backend cpu
sipp run gateway-server serve --config apps/gateway-server/config/local.toml --backend cpu
The copied local config expects a local GGUF model under .build/models and a
dashboard password env var named by the selected TOML file. Keep env files that
hold secrets private because they contain the Admin Dashboard password and
provider credentials.
Validation
Use the narrowest relevant target from Testing. Common entry points are:
sipp test list
sipp test unit group full
sipp test smoke group examples --backend cpu
sipp test verify --target public-docs
Architecture
Sipp separates inference primitives from protocol and deployment policy. The public package surfaces compose lower-level crates without moving HTTP routes, serialized wire formats, or deployment defaults into core inference layers.
Published Crates
- crates/sipp: the public sipp Rust library published as sipp-rs. Former foundational crates continue as module folders:
  - core: low-level shared types.
  - shard: GGUF cache planning and split-file utilities.
  - backend, engine, lifecycle, runtime: local inference, scheduling, lifecycle, and memory management.
  - client: typed endpoint registration and query, chat, embed dispatch, re-exported at the crate root.
  - providers (feature providers): explicitly selected external provider adapters.
  - gateway_core (feature gateway): protocol-neutral gateway execution traits and pipeline ordering.
- crates/sys: the sipp-sys crate, with unsafe FFI bindings, native llama.cpp shims, and the vendored llama.cpp/ source tree.
Public Libraries
- lib/web: browser package source.
- lib/node: Node.js server package source.
- lib/python: Python package source.
- lib/gateway: route-free HTTP gateway toolkit, consumed from source checkouts.
Applications And Examples
- apps/gateway-server: opinionated first-party gateway application.
- apps/cli: command-line local inference application.
- examples: small copyable integrations.
- demos: browser experiences built on the public package surfaces.
- xtask: build, test, run, packaging, and maintenance orchestration.
For gateway-specific layering, read Gateway Architecture.
Gateway Architecture
Gateway architecture documentation now lives in Gateway Architecture.
Use:
- Gateway for the full gateway section.
- Gateway Architecture for package boundaries.
- Gateway Toolkit for custom gateway applications.
Testing
Sipp tests are cataloged by cargo xtask test list. Use that command first
when choosing a target or checking what CI runs.
Commands
cargo xtask test has four top-level actions:
- list: list unit and smoke suites and optionally discover/search cheap cases.
- unit: run deterministic code-flow and API-layer tests by suite or group.
- smoke: run holistic integration smoke tests by suite or group.
- verify: analyze existing coverage artifacts and validate test structure.
Common Commands
cargo xtask test list
cargo xtask test list --group unit --layer interface --cases --search router --format json
cargo xtask test unit group full
cargo xtask test unit group whitebox
cargo xtask test unit group interface
cargo xtask test unit suite xtask
cargo xtask test unit suite rust-crates --package sipp-rs
cargo xtask test unit suite browser --wasm-threading single-thread
cargo xtask test unit suite demos --wasm-threading single-thread
cargo xtask test unit suite node-package --backend cpu
cargo xtask test unit suite python-package --backend cpu
cargo xtask test smoke suite example-node --backend cpu
cargo xtask test smoke suite example-gateway --backend cpu --case query
cargo xtask test smoke suite playground-browser
cargo xtask test smoke group examples --backend cpu
cargo xtask test smoke group local-model --backend cpu
cargo xtask test smoke group full --backend cpu
cargo xtask test verify --target whitebox
cargo xtask test verify --changed
test unit owns deterministic tests. It is split into explicit namespaces:
- test unit suite <name> runs exactly one deterministic unit suite.
- test unit group <name> runs a named bundle of deterministic unit suites.
Unit suite names expose suite-specific options, such as
test unit suite rust-crates --package <crate> and
test unit suite node-package --backend cpu.
Unit Suites
| Command | What runs | Code location |
|---|---|---|
| cargo xtask test unit suite xtask | xtask CLI and orchestration tests | xtask/src/tests |
| cargo xtask test unit suite rust-crates | Workspace crate unit tests | crates, lib/gateway, apps |
| cargo xtask test unit suite rust-bindings | Rust binding crate unit tests | bindings/node, bindings/python, bindings/wasm |
| cargo xtask test unit suite browser | Browser TypeScript tests | lib/web/tests |
| cargo xtask test unit suite demos | Browser demo TypeScript tests | demos |
| cargo xtask test unit suite api | Crate-level public API integration tests | crates/sipp/tests |
| cargo xtask test unit suite cli | CLI black-box integration tests | apps/cli/tests |
| cargo xtask test unit suite node-package | Deterministic Node package API tests | lib/node, bindings/node |
| cargo xtask test unit suite python-package | Deterministic Python package API tests | lib/python, bindings/python |
Unit Groups
| Command | Suites |
|---|---|
| cargo xtask test unit group whitebox | xtask, rust-crates, rust-bindings, browser, and demos |
| cargo xtask test unit group interface | api, cli, node-package, and python-package |
| cargo xtask test unit group full | Every deterministic unit suite |
Browser and demo unit suites accept --wasm-threading single-thread|pthread|all.
CI uses single-thread to keep source validation fast. Release package builds
continue to use cargo xtask build wasm, whose default is all.
test smoke owns holistic integration checks. It is split into explicit
namespaces:
- test smoke suite <name> runs exactly one smoke suite.
- test smoke group <name> runs a named bundle of smoke suites.
Model-backed smoke suites default to the setup sample model cache under
.build/models when --model is omitted. Rust, Node, Python, gateway, and
browser example smoke accept repeated --case query|chat|embed. Embedding
cases require a model/runtime that reports embedding support.
Smoke Suites
| Command | What runs | Code location |
|---|---|---|
| cargo xtask test smoke suite cli | Staged local CLI generation smoke | apps/cli |
| cargo xtask test smoke suite example-rust | Rust query/chat/embed examples | examples/rust |
| cargo xtask test smoke suite example-node | Node query.mjs/chat.mjs/embed.mjs examples | examples/node |
| cargo xtask test smoke suite example-python | Python query.py/chat.py/embed.py examples | examples/python |
| cargo xtask test smoke suite example-gateway | Embedded local gateway proxy plus Rust/Node/Python local-and-gateway clients | examples/gateway, examples/rust, examples/node, examples/python |
| cargo xtask test smoke suite example-browser | Browser query.html/chat.html/embed.html examples through Playwright | examples/web |
| cargo xtask test smoke suite playground-browser | Browser playground runtime smoke through Playwright | tools/playground |
| cargo xtask test smoke suite llama-backend-ops | llama.cpp backend operation correctness smoke | crates/sys/llama.cpp |
Smoke Groups
| Command | Suites |
|---|---|
| cargo xtask test smoke group examples | example-rust, example-node, example-python, example-gateway, and example-browser |
| cargo xtask test smoke group local-model | cli, example-rust, example-node, and example-python |
| cargo xtask test smoke group full | Every smoke suite, including playground, gateway, and llama checks |
Use cargo xtask run examples serve browser to manually serve browser examples.
Use cargo xtask run examples serve gateway-local --model <model.gguf> to
serve the minimal local gateway proxy. Provider-backed and production serving
use apps/gateway-server; validate its configuration with
sipp run gateway-server check --config <path> and use raw Docker commands from
Gateway Docker for container testing. Use
Gateway Testing for curl and Postman checks. Playground
validation remains under test smoke suite playground-browser.
test unit and test smoke print a final suite and test/check summary, then
write .build/test/run-report.json and .build/test/run-report.md.
Coverage-capable unit suites also write fresh coverage artifacts under
.build/coverage/.
test verify does not execute test suites. It validates test structure,
catalog ownership, test/runtime code separation, optional changed-file coverage,
and existing coverage artifacts.
Package Locations
- lib/web publishes @noumena-labs/sipp and public @sipp/sipp.
- lib/node publishes @noumena-labs/sipp-server and public @sipp/sipp-server.
- lib/python publishes Python sipp.
- crates/sipp publishes the Rust package sipp-rs with library crate sipp.
Coverage
Sipp coverage is driven through the same test catalog used by
cargo xtask test list. General test command guidance lives in
testing.md.
Commands
cargo xtask test list
cargo xtask test list --group unit --layer whitebox --cases --format json
cargo xtask test unit group whitebox
cargo xtask test verify --target whitebox
cargo xtask test verify --target node
cargo xtask test verify --changed
test unit is the command that executes deterministic coverage-capable suites
and creates fresh coverage data. Rust writes coverage through cargo-llvm-cov,
Node writes coverage through c8, and Python writes coverage through
pytest-cov.
test verify defaults to all coverage-capable unit suites. It does not execute
test suites, build bindings, download models, or run smoke tests. Use
--target to narrow which existing coverage artifacts are analyzed. Explicitly
selecting a unit target that is not coverage-capable fails with a clear error.
--changed validates that changed first-party source files owned by the
selected unit suites have matching changed tests owned by the same catalog
suites. test verify also checks catalog ownership and test/runtime code
separation so tests do not live inside runtime source files.
test list --format json is the stable catalog surface used by CI and
contributors. Each suite entry includes id, group, layer, description,
requirements, sourceRoots, backendPolicy, coverage, and
caseDiscovery. Use --cases when a tool needs discoverable files and case
names that map to the suite runner.
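As an illustration of consuming the catalog, the Python sketch below filters suite entries by group and layer. It assumes the JSON document is a top-level array of suite objects; the real output may wrap the entries differently. The sample entries and their group and layer values are hypothetical, and only the field names id, group, and layer are taken from the description above.

```python
import json


def select_suites(catalog, group=None, layer=None):
    """Return the ids of suites that match the optional group and layer filters."""
    return [
        suite["id"]
        for suite in catalog
        if (group is None or suite.get("group") == group)
        and (layer is None or suite.get("layer") == layer)
    ]


# Hypothetical stand-in for the output of `cargo xtask test list --format json`.
SAMPLE_CATALOG = json.loads("""
[
  {"id": "node-package", "group": "unit", "layer": "whitebox"},
  {"id": "python-package", "group": "unit", "layer": "blackbox"},
  {"id": "example-rust", "group": "smoke", "layer": "blackbox"}
]
""")

unit_whitebox = select_suites(SAMPLE_CATALOG, group="unit", layer="whitebox")
```

In a real script the catalog would come from the command's standard output, for example by passing it to json.loads after running the command with subprocess.run.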
Tools
Coverage reporting uses the tools required by the selected report areas:
- cargo-llvm-cov for Rust/native execution and report rendering.
- c8 for Node wrapper coverage during test unit suite node-package.
- pytest-cov for Python wrapper coverage during test unit suite python-package.
test verify only reads existing coverage artifacts and renders summaries from
them.
Outputs
Reports are written under .build/coverage/:
- rust/lcov.info and rust/html/
- node/lcov.info
- python/lcov.info, python/cobertura.xml, and python/html/
- baseline.json
- coverage-summary.md
Test command reports are written under .build/test/:
- run-report.json and run-report.md
- verify-report.json and verify-report.md
The baseline includes first-party crates/ and bindings/ code. It
intentionally excludes generated outputs, caches, tests, examples,
third_party/, and the vendored crates/sys/llama.cpp/ tree.
Policy
The current implementation records the baseline and does not fail on percentage thresholds. It does fail when an enabled coverage area produces an empty first-party report. Thresholds should be added after the baseline is stable and the largest uncovered first-party areas are addressed.
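The policy can be sketched against an lcov.info file in Python. This is an illustration, not the xtask code. It assumes SF: paths are relative to the repository root, treats "empty" as "no first-party file records", and filters only on the prefixes named above; the generated-output, cache, test, and example exclusions are left out because their paths are not listed here. The SF, LF, LH, and end_of_record records are part of the standard LCOV tracefile format.

```python
FIRST_PARTY_PREFIXES = ("crates/", "bindings/")
EXCLUDED_PREFIXES = ("third_party/", "crates/sys/llama.cpp/")


def is_first_party(path: str) -> bool:
    """Assumed baseline filter: first-party prefixes minus vendored trees."""
    normalized = path.replace("\\", "/")
    if normalized.startswith(EXCLUDED_PREFIXES):
        return False
    return normalized.startswith(FIRST_PARTY_PREFIXES)


def summarize_lcov(text: str) -> dict:
    """Sum LF (lines found) and LH (lines hit) over first-party records."""
    files = found = hit = 0
    counted = False  # True while inside a first-party SF ... end_of_record block
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            counted = is_first_party(line[3:])
            if counted:
                files += 1
        elif line == "end_of_record":
            counted = False
        elif counted and line.startswith("LF:"):
            found += int(line[3:])
        elif counted and line.startswith("LH:"):
            hit += int(line[3:])
    return {"files": files, "lines_found": found, "lines_hit": hit}


def check_area(name: str, text: str) -> dict:
    """Fail on an empty first-party report; apply no percentage threshold."""
    summary = summarize_lcov(text)
    if summary["files"] == 0:
        raise ValueError(f"coverage area {name!r} produced an empty first-party report")
    return summary
```

The check never looks at the hit ratio, which mirrors the current policy: a low percentage is recorded in the baseline, while a report with no first-party files is treated as a failure.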
Contributing
Sipp is a polyglot monorepo. Keep contributions focused, documented, and validated with the narrowest useful commands.
Before submitting issues or PRs, be ready to explain why the change matters and how it works. AI-assisted coding is fine, including agent-generated drafts, but the author is responsible for reviewing, understanding, and maintaining the final change.
Identify The Why
For issues and feature requests, explain the problem, who it affects, and how it could affect the system. This helps maintainers evaluate the priority and choose the right implementation path.
Explain The How
For PRs, describe what changed and how the implementation works. If you cannot explain the behavior, risks, and validation, revisit the change before asking for review.
Communication
Use your own words in issues and PRs. Keep the main message concise, then add supporting detail only when it helps reviewers understand the change.
For each issue or PR:
- Explain why it matters.
- Describe what changed.
- Keep the scope atomic.
- Avoid unrelated cleanup.
Before Editing
- Read the root README and the relevant package or app README.
- Use cargo xtask test list to inspect available validation targets.
- Use cargo xtask commands for builds and long-running workflows.
- Avoid changing vendored files under third_party/ or crates/sys/llama.cpp/ unless the task is explicitly about the vendor source.
Documentation Changes
- Keep README files short and task-oriented.
- Put detailed guides and references in this mdBook.
- Prefer examples that can be copied and run from a clean checkout.
- Update docs when public APIs, package behavior, commands, or configuration change.
Validation
For documentation-only changes:
sipp docs build
cargo xtask test list
cargo xtask test verify --target public-docs
For code changes, use the narrowest relevant test target from Testing. Run broader suites only when the change crosses package or runtime boundaries.
Known Issues
This page tracks current issues that users may hit when running Sipp.
Browser Pulse Animations Can Reduce WebGPU Decode Throughput
Status: open.
Continuous pulse animations in the page can slow down browser-local inference, especially WebGPU decode throughput. This has been observed as lower tokens-per-second while a demo or app is rendering pulsing UI or scene effects during generation.
Affected surface:
- Browser-local inference through the @sipp/sipp browser package.
- Demos or applications that keep pulse animations active while the model is decoding.
Workarounds:
- Disable or pause pulse animations while a request is decoding.
- Prefer static state indicators or lower-frequency updates during generation.
- Test browser inference performance with visual animations disabled before comparing backend or model throughput.
Hybrid Graphics Laptops May Pick The Integrated GPU
Status: open.
On Windows laptops with both integrated and discrete graphics, the browser may choose the integrated GPU for WebGPU. Browser-local inference still runs, but decode throughput can be much lower than expected.
Workaround:
- Open Windows Settings.
- Go to System > Display > Graphics.
- Add the browser executable you use for Sipp, such as Chrome, Edge, or another Chromium-based browser.
- Set that browser to High performance.
- Restart the browser and reload the Sipp page.
This setting is more reliable than browser flags because it tells Windows which GPU the browser process should prefer.