Python Package

The Python wheel is named sipppy. Python code imports the sipp module, which exposes native descriptor classes, run handles, token streaming, and the same endpoint model as the Rust client.

Published wheels require Python 3.10 or newer.
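A quick interpreter check can catch an unsupported Python before you download a wheel. The sketch below uses only the standard library; the helper name python_is_supported is illustrative and not part of sipp.

```python
import sys

# Minimum interpreter version stated for the published wheels.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True when (major, minor) meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"sipppy wheels need Python 3.10 or newer, found {found}", file=sys.stderr)
```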

See the Library API Overview for the shared add, query, chat, and embed contracts.

Install

Note

Python wheels currently ship from the project’s GitHub Releases, not PyPI. A full PyPI release with a complete build matrix (CPU and GPU backends across operating systems, architectures, and Python versions, in the style of PyTorch’s distribution matrix) is in progress. The package name sipppy and the import name sipp are stable; only the distribution channel will change.

Download the sipppy wheel that matches your platform, Python version, and backend from the GitHub Releases page, then install it with pip:

pip install sipppy

The default wheel includes the CPU backend. Install PyPI-published GPU backends as extras:

pip install "sipppy[vulkan]"
pip install "sipppy[metal]"

The backend wheels are separate PyPI distributions. For example, sipppy[vulkan] installs the main sipppy wheel plus the matching sipppy-backend-vulkan wheel for the same release version. Python code still imports sipp. CUDA backend wheels are attached to GitHub releases for the first public release and will move to PyPI after the CUDA wheel size limit is raised.
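Because the backend wheel has to match the main wheel's release version, it can help to compare the installed versions after installing. This is a minimal sketch using importlib.metadata from the standard library. The helper names are illustrative, and the sipppy-backend-<name> pattern is taken from the Vulkan example above; treating other backends as following the same pattern is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def backend_matches(backend, lookup=installed_version):
    """True when sipppy and sipppy-backend-<backend> share one version.

    Assumption: every backend distribution follows the
    sipppy-backend-<name> pattern shown for Vulkan.
    """
    main = lookup("sipppy")
    extra = lookup(f"sipppy-backend-{backend}")
    return main is not None and main == extra


if __name__ == "__main__":
    print("vulkan backend matches:", backend_matches("vulkan"))
```

The lookup parameter exists only so the comparison can be exercised without the wheels installed.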

Use It For

  • Python applications that need local GGUF inference.
  • Gateway-backed inference from Python services or scripts.
  • Direct provider descriptors where server-side credentials are appropriate.
  • Runtime metrics and backend selection in Python services.

Local GGUF Query

import sys

from sipp import (
    CacheRuntimeConfig,
    SippClient,
    SippTextOptions,
    ContextRuntimeConfig,
    LocalModelDescriptor,
    LocalTextOptions,
    NativeRuntimeConfig,
    ObservabilityRuntimeConfig,
    SchedulerRuntimeConfig,
)


client = SippClient()
endpoint = client.add(
    "default",
    LocalModelDescriptor(
        sys.argv[1],
        NativeRuntimeConfig(
            context=ContextRuntimeConfig(n_ctx=2048),
            scheduler=SchedulerRuntimeConfig(
                continuous_batching=True,
                prefill_chunk_size=0,
            ),
            cache=CacheRuntimeConfig(mode="live_slot_prefix"),
            observability=ObservabilityRuntimeConfig(runtime_metrics=True),
        ),
    ),
)
query_prompt = "\n".join(
    [
        "<|system|>",
        "Answer concisely.",
        "<|user|>",
        "Explain Sipp in one sentence.",
        "<|assistant|>",
    ]
)
run = client.query(
    # query: raw prompt; replace markers with the target model's template.
    query_prompt,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
    local=LocalTextOptions(context_key="python-local"),
)
print(run.result()["text"])
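Since query takes a raw prompt, the role markers in the example are placeholders for whatever template the target model expects. One way to keep them swappable is to build the prompt from (marker, text) pairs. The helper build_raw_prompt below is hypothetical and not part of sipp; it only does string assembly.

```python
def build_raw_prompt(turns, open_assistant="<|assistant|>"):
    """Join (marker, text) pairs with newlines and end on the assistant marker.

    The markers are placeholders; pass the ones your model's template uses.
    """
    lines = []
    for marker, text in turns:
        lines.append(marker)
        lines.append(text)
    lines.append(open_assistant)
    return "\n".join(lines)


# Reproduces the prompt string used in the example above.
query_prompt = build_raw_prompt(
    [
        ("<|system|>", "Answer concisely."),
        ("<|user|>", "Explain Sipp in one sentence."),
    ]
)
```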

Set SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal to choose an installed native backend. See Runtime Options for local runtime config groups and request option boundaries.
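The variable can also be set from Python. The sketch below validates the name against the four values listed above before writing it. The helper select_backend is illustrative, and it assumes the variable is read when sipp loads its native backend, so call it before importing sipp.

```python
import os

# The values documented for SIPP_PYTHON_BACKEND.
KNOWN_BACKENDS = ("cpu", "vulkan", "cuda", "metal")


def select_backend(name, environ=os.environ):
    """Validate a backend name and store it in SIPP_PYTHON_BACKEND."""
    if name not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {', '.join(KNOWN_BACKENDS)}"
        )
    environ["SIPP_PYTHON_BACKEND"] = name
    return name


# Assumption: sipp reads the variable at load time, so set it first.
# select_backend("cpu")
# import sipp
```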

On Intel Macs with integrated GPUs, prefer SIPP_PYTHON_BACKEND=cpu. The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs. Apple Silicon can run x64 Python through Rosetta 2, but x64 wheels are used only by an x64 Python process; native arm64 Python should use arm64 wheels.
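The macOS guidance can be roughly approximated from the interpreter's own architecture. The sketch below uses platform.machine(), which reports the architecture of the running Python process, so an x64 Python under Rosetta 2 reports x86_64. The function suggest_backend is hypothetical. It cannot detect a discrete AMD GPU, so it conservatively suggests cpu for every x64 process on macOS.

```python
import platform


def suggest_backend(system=None, machine=None):
    """Suggest a SIPP_PYTHON_BACKEND value on macOS, or None elsewhere.

    Simplification: any non-arm64 Python on macOS gets "cpu", even on
    Intel Macs with an AMD GPU where Metal may be usable.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if system != "Darwin":
        return None  # this sketch only covers the macOS guidance
    if machine == "arm64":
        return "metal"
    return "cpu"


if __name__ == "__main__":
    print("suggested backend:", suggest_backend())
```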

Gateway Chat

import os

from sipp import ChatMessage, SippClient, SippTextOptions, GatewayDescriptor


client = SippClient()
endpoint = client.add(
    "gateway",
    GatewayDescriptor(
        os.environ["SIPP_GATEWAY_TARGET"],
        os.environ["SIPP_GATEWAY_URL"],
        authentication_kind="bearer",
        authentication_value=os.environ["SIPP_GATEWAY_TOKEN"],
    ),
)
messages = [
    ChatMessage("system", "Answer concisely."),
    ChatMessage("user", "Explain gateway inference."),
]
run = client.chat(
    messages,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
)
print(run.result()["text"])

Gateway clients need only the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.
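Because the example reads all three settings with os.environ[...], a missing variable surfaces as a bare KeyError. A small pre-flight check can report every missing name at once. The helper missing_gateway_settings is illustrative and not part of sipp; the variable names are the ones used in the example above.

```python
import os

# Environment variables read by the gateway example.
REQUIRED_GATEWAY_VARS = (
    "SIPP_GATEWAY_TARGET",
    "SIPP_GATEWAY_URL",
    "SIPP_GATEWAY_TOKEN",
)


def missing_gateway_settings(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_GATEWAY_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_gateway_settings()
    if missing:
        raise SystemExit("missing gateway settings: " + ", ".join(missing))
```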