Runtime Options

Sipp keeps runtime configuration close to the endpoint that owns local inference. Request options stay on query, chat, or embed calls. Gateway and provider extensions use separate option buckets so applications can see which boundary receives each field.

Option Layers

| Layer | Browser package | Node.js package | Purpose |
| --- | --- | --- | --- |
| Client options | `new SippClient(options)` | Environment and process setup | Browser assets, workers, browser cache, and backend selection. |
| Local endpoint load options | `client.add(..., { kind: 'local', options })` | `client.add(..., { kind: 'local', config })` | Model source, backend preference, progress, and native runtime config. |
| Text request options | `client.query(prompt, options)` | `client.query({ options })` | Output length, sampling shortcuts, streaming, cancellation, and stop strings. |
| Local request options | `contextKey`, `grammar`, `media`, `normalize` | `local: { contextKey, grammar, media, normalize }` | Local-only prompt state, grammars, images, and embedding normalization. |
| Gateway extensions | `endpointOptions` | `endpointOptions` | Extra fields consumed by gateway endpoint implementations. |
| Provider extensions | `providerOptions` | `providerOptions` | Provider-only fields merged into direct provider requests. |

Python and Rust expose the same concepts with language-native descriptors and runtime config classes or structs.

Browser Client Options

Browser SippClientOptions affect the WebAssembly runtime, worker transport, and browser storage. They do not select a model by themselves.

| Option | Use |
| --- | --- |
| `executionMode` | `auto` uses a worker when available. `worker` forces worker transport. `main-thread` is useful for debugging or constrained hosts. |
| `wasmThreading` | `single-thread` loads the single-thread WASM runtime. `pthread` loads the pthread runtime. |
| `moduleUrl`, `wasmUrl` | Override single-thread runtime asset URLs when a bundler or deployment moves package assets. Provide both together. |
| `pthreadModuleUrl`, `pthreadWasmUrl` | Override pthread runtime asset URLs. Provide both together. |
| `browserCache` | Tune OPFS split thresholds and direct-load behavior for browser GGUF storage. |
| `trustedOrigins` | Allow runtime asset URLs from additional origins. Defaults allow same-origin package assets. |
| `workerUrl` | Override the worker entry URL when the bundler cannot resolve the packaged worker. |

wasmThreading: 'pthread' requires SharedArrayBuffer, which browsers expose only to cross-origin isolated pages, so the application must serve COOP and COEP headers. Use single-thread when the application cannot serve those headers.

const client = new SippClient({
  executionMode: 'worker',
  wasmThreading: 'single-thread',
});
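The threading requirement above can be checked at startup before constructing the client. The following is a minimal sketch, not part of the Sipp API: `pickWasmThreading`, `ThreadingEnv`, and `ISOLATION_HEADERS` are hypothetical names introduced for illustration. It relies only on the standard `crossOriginIsolated` and `SharedArrayBuffer` globals, and the header values shown are one common pair that enables cross-origin isolation.

```typescript
// Hypothetical helper, not part of Sipp: decide which WASM runtime is usable.
type WasmThreading = 'single-thread' | 'pthread';

// The two standard globals that matter for threaded WASM.
interface ThreadingEnv {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// One common header pair that makes a page cross-origin isolated.
const ISOLATION_HEADERS: Record<string, string> = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

function pickWasmThreading(env: ThreadingEnv): WasmThreading {
  const hasSharedMemory = typeof env.SharedArrayBuffer !== 'undefined';
  // pthread needs both shared memory and an isolated page.
  return env.crossOriginIsolated === true && hasSharedMemory
    ? 'pthread'
    : 'single-thread';
}

// Detect against the current global scope (browser window or worker).
const detectedThreading = pickWasmThreading(globalThis as unknown as ThreadingEnv);
```

The result could then be passed as the `wasmThreading` client option, so hosts without the isolation headers fall back to the single-thread runtime instead of failing to load.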

Local Endpoint Options

Browser local endpoints use source plus optional load options:

const endpoint = await client.add('browser-local', {
  kind: 'local',
  source: '/models/model.gguf',
  options: {
    backend: 'webgpu',
    runtime: {
      context: { n_ctx: 2048 },
    },
  },
});

Node.js local endpoints use modelPath and config:

const endpoint = await client.add('node-local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
  config: {
    context: { n_ctx: 2048, n_threads: 8, n_threads_batch: 8 },
  },
});

The browser backend option accepts auto, cpu, or webgpu. Native backend selection is package-specific: Node.js uses SIPP_NODE_BACKEND, Python uses SIPP_PYTHON_BACKEND, and the CLI uses --backend.

Native Runtime Config

NativeRuntimeConfig groups local runtime settings by responsibility.

| Group | Common fields | Use |
| --- | --- | --- |
| `placement` | `devices`, `gpu_layers`, `split_mode`, `main_gpu`, `tensor_split`, `use_mmap`, `use_mlock`, `fit_params` | Model placement, memory mapping, and GPU residency choices. |
| `context` | `n_ctx`, `n_batch`, `n_ubatch`, `n_parallel`, `n_threads`, `n_threads_batch`, `flash_attention`, `offload_kqv` | Context window, batch sizes, CPU thread counts, attention, and KV behavior. |
| `sampling` | `samplers`, `seed`, `top_k`, `top_p`, `min_p`, `temperature`, `repeat_penalty`, `mirostat`, `logit_bias` | Default local sampling behavior for text generation. |
| `scheduler` | `continuous_batching`, `policy`, `prefill_chunk_size`, `max_running_requests`, `max_queued_requests` | Request scheduling, batching, and queue limits. |
| `cache` | `mode`, `retained_prefix_tokens`, `snapshot_interval_tokens`, `max_snapshot_entries`, `max_snapshot_bytes` | Prefix KV reuse and snapshot behavior. |
| `multimodal` | `projector_path`, `use_gpu`, `image_min_tokens`, `image_max_tokens` | Vision projector and image-token settings. |
| `residency` | `max_gpu_models_per_device`, `allow_cpu_models_while_gpu_loaded`, `require_gpu_lease` | GPU model residency policy for native runtimes. |
| `observability` | `runtime_metrics`, `backend_profiling` | Runtime timing, throughput, and backend diagnostics. |

Use runtime config for stable endpoint behavior. Use request options for values that should vary per prompt, user action, or UI control.
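To illustrate the split between endpoint-level and per-call settings, here is a rough sketch of a grouped runtime config next to a per-call options object. `RuntimeConfigSketch` is a hypothetical, partial typing written for this example: the group and field names come from the table above, but the value types and the specific values are assumptions, not Sipp's published definitions.

```typescript
// Hypothetical partial typing: names from the table, value types assumed.
interface RuntimeConfigSketch {
  placement?: { gpu_layers?: number; use_mmap?: boolean };
  context?: {
    n_ctx?: number;
    n_batch?: number;
    n_threads?: number;
    n_threads_batch?: number;
  };
  sampling?: { seed?: number; top_k?: number; top_p?: number; temperature?: number };
  scheduler?: { max_running_requests?: number; max_queued_requests?: number };
  observability?: { runtime_metrics?: boolean };
}

// Stable endpoint behavior: set once when the endpoint is added.
const endpointConfig: RuntimeConfigSketch = {
  placement: { gpu_layers: 0, use_mmap: true },
  context: { n_ctx: 2048, n_threads: 8, n_threads_batch: 8 },
  sampling: { seed: 42, temperature: 0.7 },
  scheduler: { max_running_requests: 4, max_queued_requests: 16 },
  observability: { runtime_metrics: true },
};

// Per-call values: chosen for each prompt, user action, or UI control.
const perCallOptions = { maxTokens: 128, temperature: 0.2 };
```

In the Node.js package, an object shaped like `endpointConfig` would go in the `config` field of `client.add`, while `perCallOptions` would travel with each query or chat call.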

Request Options

Text-producing calls share common generation controls:

| Option | Use |
| --- | --- |
| `maxTokens` | Maximum generated tokens for the response. |
| `temperature` | Request-local temperature shortcut. |
| `topP` | Request-local nucleus sampling shortcut. |
| `stop` | Stop strings for text generation. |
| `signal` | Cancellation through AbortSignal where supported. |
| `emitTokens` | Enables token streaming through the returned run handle. |
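As an illustration of how these controls might be combined, the sketch below builds one options object with a cancellable signal. `TextRequestOptionsSketch` and `cancelGeneration` are hypothetical names for this example, and the value types (for instance `stop` as a string array) are assumptions. Only the option names come from the table above; `AbortController` is the standard web and Node.js API.

```typescript
// Hypothetical typing: option names from the table, value types assumed.
interface TextRequestOptionsSketch {
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stop?: string[];
  signal?: AbortSignal;
  emitTokens?: boolean;
}

// Standard AbortController; its signal is what the request receives.
const controller = new AbortController();

const textOptions: TextRequestOptionsSketch = {
  maxTokens: 256,
  temperature: 0.7,
  topP: 0.9,
  stop: ['\n\n'],
  signal: controller.signal,
  emitTokens: true,
};

// Could be wired to a "Stop" button in a UI.
function cancelGeneration(): void {
  controller.abort();
}
```

Keeping the controller outside the options object lets the calling code cancel a run without holding a reference to the request itself.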

Local text calls can also use a prompt context key, GBNF grammar, and media inputs for vision-capable models. Embedding calls can set normalization through local embedding options.

Gateway-specific fields belong in endpointOptions. Direct provider-specific fields belong in providerOptions:

const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128, temperature: 0.2 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});

Provider options cannot override typed fields such as model, messages, prompt, temperature, or topP/top_p; set those through the typed request options where Sipp exposes them.
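The precedence rule above can be pictured as a merge in which typed fields always win. This is a minimal sketch under stated assumptions, not Sipp's implementation: `mergeProviderOptions` and `TYPED_FIELDS` are hypothetical, the reserved list holds only the fields named in the text and may be incomplete, and the source does not say whether Sipp drops or rejects a conflicting key. Here conflicting keys are dropped.

```typescript
// Fields named in the text as not overridable; the real list may be longer.
const TYPED_FIELDS = ['model', 'messages', 'prompt', 'temperature', 'topP', 'top_p'] as const;

// Hypothetical merge: provider extras are added, typed fields always win.
function mergeProviderOptions(
  typed: Record<string, unknown>,
  providerOptions: Record<string, unknown> = {},
): Record<string, unknown> {
  const reserved = new Set<string>(TYPED_FIELDS);
  const extras: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(providerOptions)) {
    // Assumption: a conflicting key is silently dropped rather than rejected.
    if (!reserved.has(key)) extras[key] = value;
  }
  return { ...extras, ...typed };
}

// The provider-side temperature is ignored; reasoning_effort passes through.
const requestBody = mergeProviderOptions(
  { model: 'example-model', temperature: 0.2 },
  { reasoning_effort: 'low', temperature: 1.5 },
);
```

Spreading `typed` last is a second safeguard: even if a reserved key slipped past the filter, the typed value would still replace it.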