tessl/pypi-kserve

tessl install tessl/pypi-kserve@0.16.1

KServe is a comprehensive Python SDK that provides standardized interfaces for building and deploying machine learning model serving infrastructure on Kubernetes.


KServe Python SDK

KServe is a comprehensive Python SDK for building and deploying ML model serving infrastructure on Kubernetes. It provides a Control Plane Client for managing InferenceService resources and a Serving Runtime SDK with FastAPI-based servers supporting the V1, V2 (Open Inference Protocol), and OpenAI protocols.

Quick Start

Installation:

pip install kserve              # Base package
pip install kserve[storage]     # With S3, GCS, Azure support
pip install kserve[llm]         # With OpenAI protocol support

Core Imports:

from kserve import Model, ModelServer                    # Model serving
from kserve import InferenceRESTClient, InferenceGRPCClient  # Clients
from kserve import KServeClient                          # Kubernetes control
from kserve import InferRequest, InferResponse           # Protocol types

Basic Model Server:

from kserve import Model, ModelServer

class MyModel(Model):
    def load(self):
        # load_my_model() is a placeholder for your own artifact loading.
        self.model = load_my_model()
        self.ready = True

    def predict(self, payload, headers=None):
        # V1 protocol: read "instances" from the request body.
        return {"predictions": self.model.predict(payload["instances"])}

if __name__ == "__main__":
    model = MyModel("my-model")
    model.load()
    ModelServer().start([model])
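
Once the server is running, its V1 prediction route can be exercised with any HTTP client; the host, port, and model name below simply match the defaults used in the example above:

import httpx

# POST a V1-protocol request to the server started above.
resp = httpx.post(
    "http://localhost:8080/v1/models/my-model:predict",
    json={"instances": [[1.0, 2.0, 3.0, 4.0]]},
)
print(resp.json())  # e.g. {"predictions": [...]}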

Complete Quick Start Guide

Architecture Overview

Serving Runtime SDK

  • Model Class - Custom inference logic with lifecycle hooks
  • ModelServer - FastAPI server with health endpoints
  • Protocols - REST v1/v2, gRPC v2, OpenAI-compatible
  • Model Repository - Dynamic model loading/unloading
  • Storage - GCS, S3, Azure Blob, PVC, HTTP/HTTPS
  • Observability - Prometheus metrics, structured logging

Control Plane Client

  • KServeClient - Kubernetes API operations
  • Resource Management - InferenceServices, TrainedModels, InferenceGraphs
  • Credentials - Storage configuration for GCS, S3, Azure
  • Status Tracking - Resource readiness monitoring
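
For example, a minimal control-plane sketch that defines and submits an InferenceService (the resource name, namespace, and storage URI below are placeholders, not part of the SDK):

from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

# Define an InferenceService with a scikit-learn predictor.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(storage_uri="gs://your-bucket/sklearn/model")
        )
    ),
)

kserve_client = KServeClient()
kserve_client.create(isvc)                                            # Submit the resource
kserve_client.wait_isvc_ready("sklearn-iris", namespace="default")    # Block until Ready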

Core Components

| Component | Description | Reference |
|---|---|---|
| Model | Base class for custom models | custom-models.md |
| ModelServer | FastAPI-based server | model-server.md |
| Inference Clients | REST/gRPC clients | inference-clients.md |
| KServeClient | Kubernetes control plane | kserve-client.md |
| Protocol Types | InferRequest/InferResponse | protocol-types.md |
| ModelRepository | Dynamic model management | model-repository.md |
| Configuration | Client/server config | configuration.md |
| Errors | Exception handling | errors.md |
| Logging/Metrics | Observability | logging-metrics.md |
| Constants/Utils | Helper functions | constants-utils.md |
| Kubernetes Models | Resource definitions | kubernetes-models.md |
| OpenAI Protocol | LLM serving | openai-protocol.md |

Quick Reference

Model Lifecycle Methods

class Model:
    def load(self) -> None: ...                          # Load model artifacts
    def preprocess(self, body, headers=None): ...        # Transform input
    def predict(self, payload, headers=None): ...        # Run inference
    def postprocess(self, response, headers=None): ...   # Transform output
    def explain(self, payload, headers=None): ...        # Generate explanations
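
A sketch of a model that overrides the full request path; the transformation logic and field names are illustrative only:

from kserve import Model, ModelServer

class PipelineModel(Model):
    def load(self):
        # Stand-in for loading real artifacts.
        self.model = lambda rows: [sum(row) for row in rows]
        self.ready = True

    def preprocess(self, payload, headers=None):
        # Pull raw rows out of a V1-style request body.
        return payload["instances"]

    def predict(self, payload, headers=None):
        # payload is whatever preprocess returned.
        return self.model(payload)

    def postprocess(self, response, headers=None):
        # Wrap raw outputs back into a V1-style response body.
        return {"predictions": response}

if __name__ == "__main__":
    model = PipelineModel("pipeline-model")
    model.load()
    ModelServer().start([model])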

Inference Clients

# REST Client
url = "http://localhost:8080"
client = InferenceRESTClient(url=url)
response = await client.infer(base_url=url, model_name="my-model", data={...})

# gRPC Client
client = InferenceGRPCClient(url="localhost:8081")
response = await client.infer(model_name="my-model", inputs=[...])

Kubernetes Operations

client = KServeClient()
client.create(inferenceservice)                          # Create resource
client.get(name, namespace)                              # Get resource
client.patch(name, inferenceservice, namespace)          # Update resource
client.delete(name, namespace)                           # Delete resource
client.wait_isvc_ready(name, namespace, timeout)         # Wait for ready

Protocol Data Types

# Input tensor
input = InferInput(name="input-0", shape=[1, 4], datatype="FP32", data=[[...]])
input.set_data_from_numpy(array)                         # From NumPy
array = input.as_numpy()                                 # To NumPy

# Request/Response
request = InferRequest(model_name="model", infer_inputs=[input])
response = InferResponse(model_name="model", infer_outputs=[output])

Server Configuration

# Start server with options
ModelServer(
    http_port=8080,
    grpc_port=8081,
    workers=4,
    enable_grpc=True,
    enable_docs_url=True
).start([model])

Protocol Support

| Protocol | Endpoints | Use Case |
|---|---|---|
| REST v1 | /v1/models/:predict, /v1/models/:explain | Legacy compatibility |
| REST v2 | /v2/models/:infer, /v2/health/* | Standard inference |
| gRPC v2 | ModelInfer, ServerMetadata | High performance |
| OpenAI | /v1/chat/completions, /v1/embeddings | LLM serving |
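
The OpenAI-compatible route can be called with any HTTP client; this sketch assumes a model served with kserve[llm] on the default port and uses a placeholder model name:

import httpx

resp = httpx.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "my-llm",                                    # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())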

Storage Backends

Supported with pip install kserve[storage]:

  • Google Cloud Storage (gs://)
  • S3 Compatible (s3://)
  • Azure Blob Storage (https://)
  • Local filesystem (file://)
  • Persistent Volume Claims (pvc://)
  • HTTP/HTTPS URLs
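
Inside a custom model, artifacts are typically pulled from one of these backends during load(); this sketch assumes the kserve[storage] extra is installed, and load_from_disk is a hypothetical loader:

from kserve import Model
from kserve.storage import Storage

class DownloadingModel(Model):
    def __init__(self, name: str, storage_uri: str):
        super().__init__(name)
        self.storage_uri = storage_uri     # e.g. "s3://my-bucket/model" (placeholder)

    def load(self):
        # Download remote artifacts to a local directory, then load them from disk.
        local_path = Storage.download(self.storage_uri)
        self.model = load_from_disk(local_path)   # hypothetical loader
        self.ready = True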

Framework Support

| Framework | Predictor Spec | Example |
|---|---|---|
| Scikit-learn | V1beta1SKLearnSpec | sklearn={"storageUri": "gs://..."} |
| XGBoost | V1beta1XGBoostSpec | xgboost={"storageUri": "s3://..."} |
| TensorFlow | V1beta1TFServingSpec | tensorflow={"storageUri": "..."} |
| PyTorch | V1beta1TorchServeSpec | pytorch={"storageUri": "..."} |
| ONNX | V1beta1ONNXRuntimeSpec | onnx={"storageUri": "..."} |
| Triton | V1beta1TritonSpec | triton={"storageUri": "..."} |
| Hugging Face | V1beta1HuggingFaceRuntimeSpec | huggingface={"storageUri": "..."} |
| LightGBM | V1beta1LightGBMSpec | lightgbm={"storageUri": "..."} |
| PMML | V1beta1PMMLSpec | pmml={"storageUri": "..."} |

Data Types

| KServe Type | NumPy Type | Description |
|---|---|---|
| BOOL | np.bool_ | Boolean |
| UINT8/16/32/64 | np.uint8/16/32/64 | Unsigned integers |
| INT8/16/32/64 | np.int8/16/32/64 | Signed integers |
| FP16/32/64 | np.float16/32/64 | Floating point |
| BYTES | np.object_ | Variable-length bytes |
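
For example, variable-length strings ride on the BYTES/np.object_ mapping (the tensor name and contents are illustrative):

import numpy as np
from kserve import InferInput

text = np.array([b"hello", b"world"], dtype=np.object_)
inp = InferInput(name="text", shape=list(text.shape), datatype="BYTES")
inp.set_data_from_numpy(text)                             # Encode as a BYTES tensor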

Error Handling

from kserve.errors import (
    InferenceError,      # Inference execution failure (500)
    InvalidInput,        # Invalid input data (400)
    ModelNotFound,       # Model doesn't exist (404)
    ModelNotReady,       # Model not initialized (503)
    UnsupportedProtocol, # Unknown protocol
    ServerNotReady       # Server not ready (503)
)
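
Raising these from model code maps directly to the listed HTTP status codes; a minimal sketch of input validation in preprocess (the field name is illustrative):

from kserve import Model
from kserve.errors import InvalidInput

class ValidatedModel(Model):
    def preprocess(self, payload, headers=None):
        # Reject malformed V1 requests before they reach predict(); returned as 400.
        if "instances" not in payload:
            raise InvalidInput("request body must contain an 'instances' field")
        return payload["instances"]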

Health & Monitoring

Health Endpoints:

  • GET /v2/health/live - Server liveness
  • GET /v2/health/ready - Server readiness
  • GET /v2/models/{name}/ready - Model readiness
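
These can be probed from Python as well; the host, port, and model name are assumptions:

import httpx

base = "http://localhost:8080"
print(httpx.get(f"{base}/v2/health/live").status_code)            # 200 when the server is live
print(httpx.get(f"{base}/v2/health/ready").status_code)           # 200 when the server is ready
print(httpx.get(f"{base}/v2/models/my-model/ready").status_code)  # 200 when the model is ready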

Metrics Endpoint:

  • GET /metrics - Prometheus metrics

Key Metrics:

  • request_preprocess_seconds - Preprocessing latency
  • request_predict_seconds - Prediction latency
  • request_postprocess_seconds - Postprocessing latency
  • request_explain_seconds - Explanation latency

Command-Line Options

python model.py \
  --http_port 8080 \
  --grpc_port 8081 \
  --workers 4 \
  --max_threads 8 \
  --enable_grpc true \
  --enable_docs_url true \
  --log_config_file /path/to/config.yaml

Additional Resources

  • KServe Documentation
  • GitHub Repository
  • Examples Repository
  • Community Slack

Version Information

  • Package: kserve
  • Version: 0.16.0
  • Python: 3.8+
  • Kubernetes: 1.20+