```bash
tessl install tessl/pypi-kserve@0.16.1
```

KServe is a comprehensive Python SDK that provides standardized interfaces for building and deploying machine learning model serving infrastructure on Kubernetes. It includes a Control Plane Client for managing InferenceService resources and a Serving Runtime SDK with FastAPI-based servers supporting the V1, V2 (Open Inference Protocol), and OpenAI protocols.
Installation:
```bash
pip install kserve            # Base package
pip install kserve[storage]   # With S3, GCS, Azure support
pip install kserve[llm]      # With OpenAI protocol support
```

Core Imports:
```python
from kserve import Model, ModelServer                        # Model serving
from kserve import InferenceRESTClient, InferenceGRPCClient  # Clients
from kserve import KServeClient                              # Kubernetes control plane
from kserve import InferRequest, InferResponse               # Protocol types
```

Basic Model Server:
```python
from kserve import Model, ModelServer

class MyModel(Model):
    def load(self):
        self.model = load_my_model()
        self.ready = True

    def predict(self, payload, headers=None):
        return {"predictions": self.model.predict(payload["instances"])}

if __name__ == "__main__":
    model = MyModel("my-model")
    model.load()
    ModelServer().start([model])
```

| Component | Description | Reference |
|---|---|---|
| Model | Base class for custom models | custom-models.md |
| ModelServer | FastAPI-based server | model-server.md |
| InferenceClients | REST/gRPC clients | inference-clients.md |
| KServeClient | Kubernetes control plane | kserve-client.md |
| Protocol Types | InferRequest/InferResponse | protocol-types.md |
| ModelRepository | Dynamic model management | model-repository.md |
| Configuration | Client/server config | configuration.md |
| Errors | Exception handling | errors.md |
| Logging/Metrics | Observability | logging-metrics.md |
| Constants/Utils | Helper functions | constants-utils.md |
| Kubernetes Models | Resource definitions | kubernetes-models.md |
| OpenAI Protocol | LLM serving | openai-protocol.md |
```python
class Model:
    def load(self) -> None: ...                         # Load model artifacts
    def preprocess(self, body, headers=None): ...       # Transform input
    def predict(self, payload, headers=None): ...       # Run inference
    def postprocess(self, response, headers=None): ...  # Transform output
    def explain(self, payload, headers=None): ...       # Generate explanations
```
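To illustrate how these hooks compose, here is a minimal sketch of a custom model that implements the full lifecycle; the joblib loader, artifact path, and model name are illustrative assumptions rather than SDK requirements:

```python
import joblib
from kserve import Model, ModelServer

class IrisClassifier(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None

    def load(self):
        # Placeholder path; any in-memory estimator works here.
        self.model = joblib.load("/mnt/models/model.joblib")
        self.ready = True

    def preprocess(self, body, headers=None):
        # Reduce the V1-style request body to the raw feature rows.
        return body["instances"]

    def predict(self, payload, headers=None):
        # payload is whatever preprocess returned.
        return {"predictions": self.model.predict(payload).tolist()}

    def postprocess(self, response, headers=None):
        # Attach metadata before the response is serialized.
        response["model_name"] = self.name
        return response

if __name__ == "__main__":
    model = IrisClassifier("iris-classifier")
    model.load()
    ModelServer().start([model])
```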
```python
# REST Client
client = InferenceRESTClient(url="http://localhost:8080")
response = await client.infer(base_url=url, model_name="my-model", data={...})

# gRPC Client
client = InferenceGRPCClient(url="localhost:8081")
response = await client.infer(model_name="my-model", inputs=[...])
```
```python
client = KServeClient()
client.create(inferenceservice)                   # Create resource
client.get(name, namespace)                       # Get resource
client.patch(name, inferenceservice, namespace)   # Update resource
client.delete(name, namespace)                    # Delete resource
client.wait_isvc_ready(name, namespace, timeout)  # Wait for ready
```
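As a concrete sketch, the following creates a scikit-learn InferenceService and waits for it to become ready; the service name, namespace, and storage URI are placeholders:

```python
from kubernetes import client as k8s
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

# Build the InferenceService resource (placeholder name, namespace, and storage URI).
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=k8s.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(storage_uri="gs://my-bucket/models/sklearn/iris")
        )
    ),
)

kserve_client = KServeClient()
kserve_client.create(isvc)
kserve_client.wait_isvc_ready("sklearn-iris", namespace="default")
```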
```python
from kserve import InferInput, InferRequest, InferResponse

# Input tensor
input = InferInput(name="input-0", shape=[1, 4], datatype="FP32", data=[[...]])
input.set_data_from_numpy(array)  # From NumPy
array = input.as_numpy()          # To NumPy

# Request/Response
request = InferRequest(model_name="model", infer_inputs=[input])
response = InferResponse(model_name="model", infer_outputs=[output])
```
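For example, a small sketch that builds a request from a concrete NumPy array; the tensor name, feature values, and model name are arbitrary:

```python
import numpy as np
from kserve import InferInput, InferRequest

batch = np.array([[6.8, 2.8, 4.8, 1.4]], dtype=np.float32)

# Declare the tensor, then populate it directly from the NumPy array.
infer_input = InferInput(name="input-0", shape=list(batch.shape), datatype="FP32")
infer_input.set_data_from_numpy(batch)

request = InferRequest(model_name="my-model", infer_inputs=[infer_input])
```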
```python
# Start server with options
ModelServer(
    http_port=8080,
    grpc_port=8081,
    workers=4,
    enable_grpc=True,
    enable_docs_url=True
).start([model])
```

| Protocol | Endpoints | Use Case |
|---|---|---|
| REST v1 | /v1/models/{name}:predict, /v1/models/{name}:explain | Legacy compatibility |
| REST v2 | /v2/models/{name}/infer, /v2/health/* | Standard inference |
| gRPC v2 | ModelInfer, ServerMetadata | High performance |
| OpenAI | /v1/chat/completions, /v1/embeddings | LLM serving |
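To make the endpoint shapes concrete, here is a sketch using the requests library against a locally running server; the host, port, model name, and feature values are placeholders:

```python
import requests

BASE = "http://localhost:8080"

# V1 protocol: "instances"-style payload
v1 = requests.post(
    f"{BASE}/v1/models/my-model:predict",
    json={"instances": [[6.8, 2.8, 4.8, 1.4]]},
)

# V2 protocol (Open Inference Protocol): named tensors
v2 = requests.post(
    f"{BASE}/v2/models/my-model/infer",
    json={
        "inputs": [
            {
                "name": "input-0",
                "shape": [1, 4],
                "datatype": "FP32",
                "data": [6.8, 2.8, 4.8, 1.4],
            }
        ]
    },
)

print(v1.json())
print(v2.json())
```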
Storage URI schemes supported with pip install kserve[storage]:
- Google Cloud Storage (gs://)
- Amazon S3 (s3://)
- HTTP/HTTPS (https://)
- Local filesystem (file://)
- Persistent Volume Claim (pvc://)
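The same URIs can be resolved manually via the storage helper; a minimal sketch, assuming the storage extra is installed (the bucket path and output directory are placeholders):

```python
from kserve.storage import Storage

# Downloads the artifacts behind the URI to a local directory and returns the local path.
local_dir = Storage.download("gs://my-bucket/models/sklearn/iris", out_dir="/tmp/model")
```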
| Framework | Predictor Spec | Example |
|---|---|---|
| Scikit-learn | V1beta1SKLearnSpec | sklearn={"storageUri": "gs://..."} |
| XGBoost | V1beta1XGBoostSpec | xgboost={"storageUri": "s3://..."} |
| TensorFlow | V1beta1TFServingSpec | tensorflow={"storageUri": "..."} |
| PyTorch | V1beta1TorchServeSpec | pytorch={"storageUri": "..."} |
| ONNX | V1beta1ONNXRuntimeSpec | onnx={"storageUri": "..."} |
| Triton | V1beta1TritonSpec | triton={"storageUri": "..."} |
| Hugging Face | V1beta1HuggingFaceRuntimeSpec | huggingface={"storageUri": "..."} |
| LightGBM | V1beta1LightGBMSpec | lightgbm={"storageUri": "..."} |
| PMML | V1beta1PMMLSpec | pmml={"storageUri": "..."} |
| KServe Type | NumPy Type | Description |
|---|---|---|
| BOOL | np.bool_ | Boolean |
| UINT8/16/32/64 | np.uint8/16/32/64 | Unsigned integers |
| INT8/16/32/64 | np.int8/16/32/64 | Signed integers |
| FP16/32/64 | np.float16/32/64 | Floating point |
| BYTES | np.object_ | Variable-length bytes |
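As a quick check of the mapping, a sketch constructing tensors for two of the types; the tensor names and values are arbitrary:

```python
import numpy as np
from kserve import InferInput

# FP32 <-> np.float32
dense = np.zeros((2, 3), dtype=np.float32)
dense_input = InferInput(name="dense", shape=list(dense.shape), datatype="FP32")
dense_input.set_data_from_numpy(dense)

# BYTES <-> np.object_ (variable-length strings/bytes)
text = np.array([b"hello", b"world"], dtype=np.object_)
text_input = InferInput(name="text", shape=list(text.shape), datatype="BYTES")
text_input.set_data_from_numpy(text)
```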
```python
from kserve.errors import (
    InferenceError,       # Inference execution failure (500)
    InvalidInput,         # Invalid input data (400)
    ModelNotFound,        # Model doesn't exist (404)
    ModelNotReady,        # Model not initialized (503)
    UnsupportedProtocol,  # Unknown protocol
    ServerNotReady        # Server not ready (503)
)
```
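Raising these from the model hooks maps directly to HTTP status codes; a minimal sketch, where the required field name is illustrative:

```python
from kserve import Model
from kserve.errors import InvalidInput

class ValidatingModel(Model):
    def preprocess(self, body, headers=None):
        # InvalidInput is returned to the caller as HTTP 400.
        if "instances" not in body:
            raise InvalidInput("request body must contain an 'instances' field")
        return body["instances"]

    def predict(self, payload, headers=None):
        return {"predictions": payload}
```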
Health Endpoints:

- GET /v2/health/live - Server liveness
- GET /v2/health/ready - Server readiness
- GET /v2/models/{name}/ready - Model readiness
Metrics Endpoint:

- GET /metrics - Prometheus metrics
Key Metrics:

- request_preprocess_seconds - Preprocessing latency
- request_predict_seconds - Prediction latency
- request_postprocess_seconds - Postprocessing latency
- request_explain_seconds - Explanation latency

```bash
python model.py \
    --http_port 8080 \
    --grpc_port 8081 \
    --workers 4 \
    --max_threads 8 \
    --enable_grpc true \
    --enable_docs_url true \
    --log_config_file /path/to/config.yaml
```