Runtime Options

Sipp keeps runtime configuration close to the endpoint that owns local inference. Request options stay on query, chat, or embed calls. Gateway and provider extensions use separate option buckets so applications can see which boundary receives each field.

Option Layers

Layer	Browser package	Node.js package	Purpose
Client options	`new SippClient(options)`	Environment and process setup	Browser assets, workers, browser cache, and backend selection.
Local endpoint load options	`client.add(..., { kind: 'local', options })`	`client.add(..., { kind: 'local', config })`	Model source, backend preference, progress, and native runtime config.
Text request options	`client.query(prompt, options)`	`client.query({ options })`	Output length, sampling shortcuts, streaming, cancellation, and stop strings.
Local request options	`contextKey`, `grammar`, `media`, `normalize`	`local: { contextKey, grammar, media, normalize }`	Local-only prompt state, grammars, images, and embedding normalization.
Gateway extensions	`endpointOptions`	`endpointOptions`	Extra fields consumed by gateway endpoint implementations.
Provider extensions	`providerOptions`	`providerOptions`	Provider-only fields merged into direct provider requests.

Python and Rust expose the same concepts with language-native descriptors and runtime config classes or structs.

Browser Client Options

Browser SippClientOptions affect the WebAssembly runtime, worker transport, and browser storage. They do not select a model by themselves.

Option	Use
`executionMode`	`auto` uses a worker when available. `worker` forces worker transport. `main-thread` is useful for debugging or constrained hosts.
`wasmThreading`	`single-thread` loads the single-thread WASM runtime. `pthread` loads the pthread runtime.
`moduleUrl`, `wasmUrl`	Override single-thread runtime asset URLs when a bundler or deployment moves package assets. Provide both together.
`pthreadModuleUrl`, `pthreadWasmUrl`	Override pthread runtime asset URLs. Provide both together.
`browserCache`	Tune OPFS split thresholds and direct-load behavior for browser GGUF storage.
`trustedOrigins`	Allow runtime asset URLs from additional origins. Defaults allow same-origin package assets.
`workerUrl`	Override the worker entry URL when the bundler cannot resolve the packaged worker.

Setting wasmThreading to 'pthread' requires SharedArrayBuffer, which browsers expose only on cross-origin isolated pages, so the application must serve COOP and COEP headers. Use single-thread when the application cannot serve those headers.

const client = new SippClient({
  executionMode: 'worker',
  wasmThreading: 'single-thread',
});
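
Because the pthread runtime depends on cross-origin isolation, an application may prefer to choose wasmThreading at startup instead of hard-coding it. The following is a minimal sketch, not part of Sipp: `pickWasmThreading` and `IsolationProbe` are hypothetical names, and the probe is passed in as a plain object so the function can run outside a browser. In a page, `globalThis.crossOriginIsolated` and `typeof SharedArrayBuffer` are the standard checks.

```typescript
// Hypothetical helper, not a Sipp API: picks a wasmThreading value
// from the host's cross-origin isolation state.
type WasmThreading = 'single-thread' | 'pthread';

interface IsolationProbe {
  // True only when the page is cross-origin isolated (COOP/COEP in effect).
  crossOriginIsolated?: boolean;
  // Result of `typeof SharedArrayBuffer !== 'undefined'` on the host.
  hasSharedArrayBuffer: boolean;
}

function pickWasmThreading(probe: IsolationProbe): WasmThreading {
  const isolated = probe.crossOriginIsolated === true;
  // Fall back to the single-thread runtime unless both conditions hold.
  return isolated && probe.hasSharedArrayBuffer ? 'pthread' : 'single-thread';
}

// In a browser, the probe could be built like this (assumed usage):
// const probe = {
//   crossOriginIsolated: globalThis.crossOriginIsolated,
//   hasSharedArrayBuffer: typeof SharedArrayBuffer !== 'undefined',
// };
// const client = new SippClient({ wasmThreading: pickWasmThreading(probe) });
```

Falling back to single-thread keeps the same build usable on hosts that cannot set the isolation headers.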

Local Endpoint Options

Browser local endpoints use source plus optional load options:

const endpoint = await client.add('browser-local', {
  kind: 'local',
  source: '/models/model.gguf',
  options: {
    backend: 'webgpu',
    runtime: {
      context: { n_ctx: 2048 },
    },
  },
});

Node.js local endpoints use modelPath and config:

const endpoint = await client.add('node-local', {
  kind: 'local',
  modelPath: '/models/model.gguf',
  config: {
    context: { n_ctx: 2048, n_threads: 8, n_threads_batch: 8 },
  },
});

The browser backend load option accepts auto, cpu, or webgpu. Native packages select a backend through package-specific settings: Node.js uses SIPP_NODE_BACKEND, Python uses SIPP_PYTHON_BACKEND, and the CLI uses --backend.

Native Runtime Config

NativeRuntimeConfig groups local runtime settings by responsibility.

Group	Common fields	Use
`placement`	`devices`, `gpu_layers`, `split_mode`, `main_gpu`, `tensor_split`, `use_mmap`, `use_mlock`, `fit_params`	Model placement, memory mapping, and GPU residency choices.
`context`	`n_ctx`, `n_batch`, `n_ubatch`, `n_parallel`, `n_threads`, `n_threads_batch`, `flash_attention`, `offload_kqv`	Context window, batch sizes, CPU thread counts, attention, and KV behavior.
`sampling`	`samplers`, `seed`, `top_k`, `top_p`, `min_p`, `temperature`, `repeat_penalty`, `mirostat`, `logit_bias`	Default local sampling behavior for text generation.
`scheduler`	`continuous_batching`, `policy`, `prefill_chunk_size`, `max_running_requests`, `max_queued_requests`	Request scheduling, batching, and queue limits.
`cache`	`mode`, `retained_prefix_tokens`, `snapshot_interval_tokens`, `max_snapshot_entries`, `max_snapshot_bytes`	Prefix KV reuse and snapshot behavior.
`multimodal`	`projector_path`, `use_gpu`, `image_min_tokens`, `image_max_tokens`	Vision projector and image-token settings.
`residency`	`max_gpu_models_per_device`, `allow_cpu_models_while_gpu_loaded`, `require_gpu_lease`	GPU model residency policy for native runtimes.
`observability`	`runtime_metrics`, `backend_profiling`	Runtime timing, throughput, and backend diagnostics.

Use runtime config for stable endpoint behavior. Use request options for values that should vary per prompt, user action, or UI control.
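
The split between endpoint-level config and per-call options can be sketched as follows. This is illustrative only: the group and field names come from the table above, but `RuntimeConfigSketch`, the value types, and every number are assumptions rather than Sipp's published typings.

```typescript
// Illustrative shape only. Field names follow the Native Runtime Config table;
// the value types and the numbers below are assumptions.
interface RuntimeConfigSketch {
  placement?: { gpu_layers?: number; use_mmap?: boolean };
  context?: { n_ctx?: number; n_batch?: number; n_threads?: number; n_threads_batch?: number };
  sampling?: { seed?: number; temperature?: number; top_p?: number };
  scheduler?: { max_running_requests?: number; max_queued_requests?: number };
}

// Stable endpoint behavior: chosen once, when the endpoint is added.
const endpointConfig: RuntimeConfigSketch = {
  placement: { gpu_layers: 32, use_mmap: true },
  context: { n_ctx: 4096, n_batch: 512, n_threads: 8, n_threads_batch: 8 },
  sampling: { seed: 42, temperature: 0.7, top_p: 0.9 },
  scheduler: { max_running_requests: 4, max_queued_requests: 16 },
};

// Per-prompt values: supplied with each call and free to differ between calls.
const requestOptions = { maxTokens: 256, temperature: 0.2 };
```

Here the sampling group holds the endpoint's defaults, while requestOptions carries the values a UI control or user action would change.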

Request Options

Text-producing calls share common generation controls:

Option	Use
`maxTokens`	Maximum generated tokens for the response.
`temperature`	Request-local temperature shortcut.
`topP`	Request-local nucleus sampling shortcut.
`stop`	Stop strings for text generation.
`signal`	Cancellation through `AbortSignal` where supported.
`emitTokens`	Enables token streaming through the returned run handle.

Local text calls can also use a prompt context key, GBNF grammar, and media inputs for vision-capable models. Embedding calls can set normalization through local embedding options.
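
As a sketch of how the `signal` option might be wired up, the helper below pairs an `AbortController` with a timeout and returns request options alongside a manual cancel function. `withTimeout` and `TextRequestOptionsSketch` are hypothetical names, and treating `stop` as an array of strings is an assumption; only `AbortController`, `setTimeout`, and `clearTimeout` are standard platform APIs.

```typescript
// Hypothetical option shape: maxTokens, stop, and signal are named in the
// table above, but the exact types are assumptions.
interface TextRequestOptionsSketch {
  maxTokens: number;
  stop: string[];
  signal: AbortSignal;
}

// Builds request options whose signal aborts after timeoutMs,
// or earlier if the caller invokes cancel().
function withTimeout(timeoutMs: number): { options: TextRequestOptionsSketch; cancel: () => void } {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const cancel = (): void => {
    clearTimeout(timer); // avoid a stray timer after manual cancellation
    controller.abort();
  };
  return {
    options: { maxTokens: 128, stop: ['\n\n'], signal: controller.signal },
    cancel,
  };
}

// Assumed browser usage, following the call shape in the Option Layers table:
// const { options, cancel } = withTimeout(10_000);
// const run = client.query(prompt, options);
// cancelButton.onclick = cancel;
```

Whether a given endpoint honors the signal depends on the "where supported" caveat in the table, so the application should still handle a run that completes after cancellation was requested.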

Gateway-specific fields belong in endpointOptions. Direct provider-specific fields belong in providerOptions:

const run = client.chat({
  endpoint,
  messages,
  options: { maxTokens: 128, temperature: 0.2 },
  providerOptions: {
    reasoning_effort: 'low',
  },
});

Provider options cannot override typed fields such as model, messages, prompt, temperature, or topP/top_p; set those through the typed request options where Sipp exposes them.
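
An application that assembles providerOptions from user input or stored settings may want to catch reserved keys before sending a request. The following is an application-side sketch, not Sipp behavior: `splitProviderOptions` is a hypothetical helper, the reserved list contains only the fields named above (the real set may be larger), and how Sipp itself treats a reserved key is not specified here.

```typescript
// Typed fields named in the text above. The real set may include more.
const RESERVED_KEYS: readonly string[] = [
  'model', 'messages', 'prompt', 'temperature', 'topP', 'top_p',
];

// Hypothetical helper: separates provider-only fields from keys that
// belong in typed request options.
function splitProviderOptions(
  input: Record<string, unknown>,
): { allowed: Record<string, unknown>; reserved: string[] } {
  const allowed: Record<string, unknown> = {};
  const reserved: string[] = [];
  for (const key of Object.keys(input)) {
    if (RESERVED_KEYS.indexOf(key) !== -1) {
      reserved.push(key); // should move to typed options such as `options.temperature`
    } else {
      allowed[key] = input[key];
    }
  }
  return { allowed, reserved };
}

// Example: `temperature` is flagged, `reasoning_effort` passes through.
const checked = splitProviderOptions({ reasoning_effort: 'low', temperature: 0.9 });
```

Reporting the reserved keys, rather than dropping them silently, makes it easier to move them into the typed request options.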
