Python Package
The Python wheel is named sipppy. Python code imports the sipp module, which
exposes native descriptor classes, run handles, token streaming, and the same
endpoint model as the Rust client.
Published wheels require Python 3.10 or newer.
See the Library API Overview for the shared add, query,
chat, and embed contracts.
Install
Note
Python wheels currently ship from the project’s GitHub Releases, not PyPI. A full PyPI release with a complete build matrix (CPU and GPU backends across operating systems, architectures, and Python versions, in the style of PyTorch’s distribution matrix) is in progress. The package name
sipppy and the sipp import are stable; only the distribution channel will change.
Download the sipppy wheel that matches your platform, Python version, and
backend from the GitHub Releases
page, then install it with pip. The default wheel includes the CPU backend:
pip install sipppy
Install PyPI-published GPU backends as extras:
pip install "sipppy[vulkan]"
pip install "sipppy[metal]"
The backend wheels are separate PyPI distributions. For example,
sipppy[vulkan] installs the main sipppy wheel plus the matching
sipppy-backend-vulkan wheel for the same release version. Python code still
imports sipp. CUDA backend wheels are attached to GitHub releases for the
first public release and will move to PyPI after the CUDA wheel size limit is
raised.
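After installing, a quick preflight check can confirm the interpreter meets the 3.10 requirement and that the sipp module is importable. The sketch below uses only the standard library; check_environment is a hypothetical helper written for this page, not part of sipp.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated for the published wheels


def check_environment(version_info=sys.version_info, module_name="sipp"):
    """Return a list of problems; an empty list means the basics look fine.

    Hypothetical helper for illustration; not part of the sipp API.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is older than 3.10"
        )
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module {module_name!r} is not importable")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("problem:", problem)
```

This only checks that the module can be found. It does not verify that a particular native backend wheel is installed.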
Use It For
- Python applications that need local GGUF inference.
- Gateway-backed inference from Python services or scripts.
- Direct provider descriptors where server-side credentials are appropriate.
- Runtime metrics and backend selection in Python services.
Local GGUF Query
import sys

from sipp import (
    CacheRuntimeConfig,
    SippClient,
    SippTextOptions,
    ContextRuntimeConfig,
    LocalModelDescriptor,
    LocalTextOptions,
    NativeRuntimeConfig,
    ObservabilityRuntimeConfig,
    SchedulerRuntimeConfig,
)

client = SippClient()
endpoint = client.add(
    "default",
    LocalModelDescriptor(
        sys.argv[1],
        NativeRuntimeConfig(
            context=ContextRuntimeConfig(n_ctx=2048),
            scheduler=SchedulerRuntimeConfig(
                continuous_batching=True,
                prefill_chunk_size=0,
            ),
            cache=CacheRuntimeConfig(mode="live_slot_prefix"),
            observability=ObservabilityRuntimeConfig(runtime_metrics=True),
        ),
    ),
)

query_prompt = "\n".join(
    [
        "<|system|>",
        "Answer concisely.",
        "<|user|>",
        "Explain Sipp in one sentence.",
        "<|assistant|>",
    ]
)

run = client.query(
    # query: raw prompt; replace markers with the target model's template.
    query_prompt,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
    local=LocalTextOptions(context_key="python-local"),
)
print(run.result()["text"])
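Because query takes a raw prompt, the role markers have to be assembled by the caller. One way to keep them in a single place is a small builder, sketched below. build_raw_prompt is a hypothetical helper, and the default markers are the placeholders from the example above, not a real model template.

```python
# Placeholder markers copied from the example; swap in the target model's
# actual chat template before using this with a real GGUF model.
DEFAULT_MARKERS = ("<|system|>", "<|user|>", "<|assistant|>")


def build_raw_prompt(system, user, markers=DEFAULT_MARKERS):
    """Join a system and user message into one raw prompt string.

    Hypothetical helper for illustration; not part of the sipp API.
    """
    system_tag, user_tag, assistant_tag = markers
    # The trailing assistant marker cues the model to begin its reply.
    return "\n".join([system_tag, system, user_tag, user, assistant_tag])


query_prompt = build_raw_prompt(
    "Answer concisely.",
    "Explain Sipp in one sentence.",
)
```

The resulting string matches the query_prompt built inline in the example, so it can be passed to client.query unchanged.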
Set SIPP_PYTHON_BACKEND=cpu|vulkan|cuda|metal to choose an installed native
backend. See Runtime Options for local
runtime config groups and request option boundaries.
On Intel Macs with integrated GPUs, prefer SIPP_PYTHON_BACKEND=cpu.
The Metal backend is intended for Apple Silicon and tested AMD Mac GPUs.
Apple Silicon can run x64 Python through Rosetta 2, but x64 wheels are used
only by an x64 Python process; native arm64 Python should use arm64 wheels.
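The platform guidance above can be turned into a conservative default for SIPP_PYTHON_BACKEND. The sketch below is an illustration with hypothetical helper names. It assumes the variable should be set before sipp is imported, which this page does not confirm. It also cannot detect the GPU vendor, so x64 Macs (including x64 Python under Rosetta 2) fall back to cpu even where a tested AMD GPU might work with metal.

```python
import os
import platform

# Values accepted by SIPP_PYTHON_BACKEND, as listed above.
VALID_BACKENDS = {"cpu", "vulkan", "cuda", "metal"}


def suggest_backend(system, machine):
    """Pick a conservative default backend for a platform.

    Hypothetical helper; not part of the sipp API.
    """
    # Native arm64 Python on macOS means Apple Silicon, where Metal is intended.
    if system == "Darwin" and machine == "arm64":
        return "metal"
    # Everything else defaults to the CPU backend shipped in the default wheel.
    return "cpu"


def configure_backend(environ=os.environ):
    """Respect an explicit setting, otherwise fill in the suggested default."""
    chosen = environ.get("SIPP_PYTHON_BACKEND")
    if chosen is None:
        chosen = suggest_backend(platform.system(), platform.machine())
        environ["SIPP_PYTHON_BACKEND"] = chosen
    if chosen not in VALID_BACKENDS:
        raise ValueError(f"unsupported SIPP_PYTHON_BACKEND: {chosen!r}")
    return chosen
```

Call configure_backend() early in the program. To use vulkan or cuda, set the variable explicitly, since this helper never selects them on its own.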
Gateway Chat
import os

from sipp import ChatMessage, SippClient, SippTextOptions, GatewayDescriptor

client = SippClient()
endpoint = client.add(
    "gateway",
    GatewayDescriptor(
        os.environ["SIPP_GATEWAY_TARGET"],
        os.environ["SIPP_GATEWAY_URL"],
        authentication_kind="bearer",
        authentication_value=os.environ["SIPP_GATEWAY_TOKEN"],
    ),
)

messages = [
    ChatMessage("system", "Answer concisely."),
    ChatMessage("user", "Explain gateway inference."),
]

run = client.chat(
    messages,
    endpoint=endpoint,
    options=SippTextOptions(max_tokens=64),
)
print(run.result()["text"])
Gateway clients need only the gateway URL, bearer token, and public target. Provider credentials and local model paths stay in the gateway process.
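The example reads the three gateway settings straight from os.environ, so a missing variable surfaces as a bare KeyError. A small loader can report all missing settings at once before any descriptor is built. The sketch below uses only the standard library; load_gateway_settings is a hypothetical helper, and the variable names are the ones used in the example above.

```python
import os

# Environment variables read by the gateway example above.
REQUIRED_VARS = (
    "SIPP_GATEWAY_TARGET",
    "SIPP_GATEWAY_URL",
    "SIPP_GATEWAY_TOKEN",
)


def load_gateway_settings(environ=os.environ):
    """Return the gateway settings, or raise naming every missing variable.

    Hypothetical helper for illustration; not part of the sipp API.
    """
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing gateway settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to GatewayDescriptor in the same positions as in the example.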