tessl install tessl/pypi-kserve@0.16.1

KServe is a comprehensive Python SDK that provides standardized interfaces for building and deploying machine learning model serving infrastructure on Kubernetes.
Common issues and solutions for KServe model servers.
Debug model loading with detailed logging:

import os
import joblib
from kserve import Model, logger

class DebuggableModel(Model):
    def __init__(self, name: str, model_path: str = "/mnt/models/model.pkl"):
        super().__init__(name)
        self.model_path = model_path
        self.model = None

    def load(self):
        """Load with detailed logging"""
        try:
            logger.info(f"Starting model load for {self.name}")
            logger.info(f"Model path: {self.model_path}")
            # Check file exists
            if not os.path.exists(self.model_path):
                logger.error(f"Model file not found: {self.model_path}")
                raise FileNotFoundError(self.model_path)
            # Check file size
            size_mb = os.path.getsize(self.model_path) / (1024 * 1024)
            logger.info(f"Model file size: {size_mb:.2f}MB")
            # Load model
            logger.info("Loading model...")
            self.model = joblib.load(self.model_path)
            logger.info("Model loaded successfully")
            # Test prediction
            logger.info("Testing model...")
            test_result = self.model.predict([[1, 2, 3, 4]])
            logger.info(f"Test prediction successful: {test_result}")
            self.ready = True
            logger.info(f"Model {self.name} is ready")
        except Exception as e:
            logger.error(f"Failed to load model {self.name}: {e}", exc_info=True)
            self.ready = False
            raise

Check server logs:

# Kubernetes
kubectl logs <pod-name> -n <namespace>
# Docker
docker logs <container-id>

Check the model file on disk:

# Check file exists
ls -lh /mnt/models/model.pkl
# Check permissions
ls -la /mnt/models/
# Check disk space
df -h /mnt/models/

Test loading the model manually:

import joblib

try:
    model = joblib.load("/mnt/models/model.pkl")
    print("Model loaded successfully")
    print(f"Model type: {type(model)}")
    # Test prediction
    result = model.predict([[1, 2, 3, 4]])
    print(f"Test prediction: {result}")
except Exception as e:
    print(f"Failed to load: {e}")

Monitor memory during load and predict:

from kserve import Model, logger
import psutil
import gc
import joblib

class MemoryMonitoredModel(Model):
    def load(self):
        """Load with memory monitoring"""
        # Check memory before load
        mem_before = psutil.virtual_memory()
        logger.info(f"Memory before load: {mem_before.percent}% used")
        self.model = joblib.load("/mnt/models/model.pkl")
        # Check memory after load
        mem_after = psutil.virtual_memory()
        logger.info(f"Memory after load: {mem_after.percent}% used")
        logger.info(f"Model memory: {mem_after.used - mem_before.used} bytes")
        self.ready = True

    def predict(self, payload, headers=None):
        """Predict with memory monitoring"""
        mem_before = psutil.virtual_memory()
        predictions = self.model.predict(payload["instances"])
        mem_after = psutil.virtual_memory()
        logger.debug(f"Prediction memory delta: {mem_after.used - mem_before.used} bytes")
        return {"predictions": predictions.tolist()}

Raise the container memory limits:

resources:
  limits:
    memory: "8Gi"  # Increase from 4Gi
  requests:
    memory: "4Gi"
Force garbage collection after each request:

import gc
from kserve import Model
class GCModel(Model):
    def predict(self, payload, headers=None):
        try:
            result = self.model.predict(payload["instances"])
            return {"predictions": result.tolist()}
        finally:
            gc.collect()

# Quantize model to reduce memory
import torch
model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

Load the model lazily on first use:

class LazyModel(Model):
    def load(self):
        # Don't load the model yet; mark the server ready immediately
        self.model = None
        self.ready = True

    def predict(self, payload, headers=None):
        # Load on first use
        if self.model is None:
            self.model = joblib.load("/mnt/models/model.pkl")
        return {"predictions": self.model.predict(payload["instances"]).tolist()}
Diagnose client-side connection timeouts:

from kserve import InferenceRESTClient, RESTConfig
import time
async def diagnose_timeout():
    """Diagnose connection timeout issues"""
    client = InferenceRESTClient()
    # Test with increasing timeouts
    for timeout in [5, 10, 30, 60]:
        try:
            start = time.time()
            response = await client.infer(
                base_url="http://localhost:8080",
                model_name="my-model",
                data={"instances": [[1, 2, 3, 4]]},
                timeout=timeout
            )
            elapsed = time.time() - start
            print(f"Success with {timeout}s timeout (took {elapsed:.2f}s)")
            break
        except Exception as e:
            print(f"Failed with {timeout}s timeout: {e}")
    await client.close()

Increase the client timeout:

config = RESTConfig(
protocol="v2",
timeout=120, # Increase to 2 minutes
retries=1
)
client = InferenceRESTClient(config=config)# Use smaller batch sizes
# Enable GPU acceleration
# Optimize preprocessing# Test connection
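One way to keep individual calls under the timeout is to split a large payload into smaller batches on the client. A minimal sketch reusing the infer() call shape from the timeout example above; the model name, URL, and batch size are placeholders:

import asyncio
from kserve import InferenceRESTClient, RESTConfig

async def infer_in_batches(instances, batch_size=32):
    """Send a large request as several smaller ones so each call stays fast."""
    client = InferenceRESTClient(config=RESTConfig(protocol="v1", timeout=120))
    results = []
    try:
        for i in range(0, len(instances), batch_size):
            chunk = instances[i:i + batch_size]
            response = await client.infer(
                base_url="http://localhost:8080",
                model_name="my-model",
                data={"instances": chunk},
            )
            results.append(response)
        return results
    finally:
        await client.close()

# Example: asyncio.run(infer_in_batches([[1, 2, 3, 4]] * 256))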
Test connectivity from the client side:

# Test connection
curl -v http://localhost:8080/v2/health/live
# Check DNS resolution
nslookup model-service.default.svc.cluster.local
# Test latency
ping model-service.default.svc.cluster.local

Configure a resilient gRPC client:

from kserve import InferenceGRPCClient
import grpc
import asyncio
async def resilient_grpc_client():
    """Create resilient gRPC client"""
    # Configure channel options
    channel_args = [
        ('grpc.keepalive_time_ms', 30000),
        ('grpc.keepalive_timeout_ms', 10000),
        ('grpc.keepalive_permit_without_calls', True),
        ('grpc.http2.max_pings_without_data', 0),
        ('grpc.max_connection_idle_ms', 60000),
        ('grpc.max_connection_age_ms', 300000)
    ]
    client = InferenceGRPCClient(
        url="localhost:8081",
        channel_args=channel_args,
        timeout=60,
        retries=3
    )
    return client

Verify storage credentials:

from kserve import KServeClient
import os
# Check S3 credentials
client = KServeClient()
client.set_credentials(
storage_type="S3",
namespace="default",
service_account="kserve-sa",
aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
)

# Test S3 access
aws s3 ls s3://my-bucket/models/
# Test GCS access
gsutil ls gs://my-bucket/models/
# Test Azure access
az storage blob list --account-name myaccount --container-name models
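The same S3 check can be run from Python with boto3, assuming credentials come from the environment or the mounted service account; the bucket name and prefix are placeholders:

import boto3

# List the model artifacts the server should be able to see
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="models/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])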
Check the service account and RBAC:

# Kubernetes
kubectl get serviceaccount kserve-sa -n default -o yaml
# Check role bindings
kubectl get rolebindings -n default | grep kserve

from kserve import InferenceRESTClient, RESTConfig
# Ensure protocol matches server
# For v1 protocol
v1_config = RESTConfig(protocol="v1")
v1_client = InferenceRESTClient(config=v1_config)
# For v2 protocol
v2_config = RESTConfig(protocol="v2")
v2_client = InferenceRESTClient(config=v2_config)
# Check server protocol
response = await v2_client.get_server_metadata(base_url="http://localhost:8080")
print(f"Server version: {response}")import cProfile
import pstats
from kserve import Model
class ProfiledModel(Model):
    def predict(self, payload, headers=None):
        """Profile prediction performance"""
        profiler = cProfile.Profile()
        profiler.enable()
        result = self.model.predict(payload["instances"])
        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(20)  # Top 20 functions
        return {"predictions": result.tolist()}

Cache repeated predictions:

from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_predict(input_tuple):
    return model.predict([list(input_tuple)])
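The cache key must be hashable, so callers convert each instance to a tuple before the lookup. A minimal sketch of wiring this into a predict method, assuming the module-level model and cached_predict from the snippet above:

from kserve import Model

class CachedModel(Model):
    def predict(self, payload, headers=None):
        predictions = []
        for instance in payload["instances"]:
            # Tuples are hashable, so repeated instances hit the lru_cache
            value = cached_predict(tuple(instance))[0]
            predictions.append(value.item() if hasattr(value, "item") else value)
        return {"predictions": predictions}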
# Increase batch size
from kserve import ModelServer

ModelServer(max_batch_size=64).start([model])

# Enable GPU acceleration
import torch
model = model.to('cuda')

# Use more workers and threads
python model.py --workers 4 --max_threads 8
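The same scaling knobs can be set when the server is started programmatically; a minimal sketch, assuming the ModelServer constructor accepts the workers and max_threads keyword arguments that back the CLI flags above:

from kserve import ModelServer

if __name__ == "__main__":
    model = LazyModel("my-model")  # any Model subclass from this guide
    model.load()
    # Mirror the CLI flags: more worker processes and handler threads
    ModelServer(workers=4, max_threads=8).start([model])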
Check events:

kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Common causes: image pull failures, insufficient CPU or memory, model download errors from the storage URI, and failed readiness probes while the model is still loading.
Check status:
kubectl get inferenceservice sklearn-iris -n default -o yaml
kubectl describe inferenceservice sklearn-iris -n default

Check underlying resources:
kubectl get pods -n default -l serving.kserve.io/inferenceservice=sklearn-iris
kubectl get services -n default -l serving.kserve.io/inferenceservice=sklearn-iris
kubectl get virtualservices -n default
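The same status information is available from Python through KServeClient; a minimal sketch, assuming the sklearn-iris service in the default namespace:

from kserve import KServeClient

client = KServeClient()

# Fetch the InferenceService and inspect its status conditions
isvc = client.get("sklearn-iris", namespace="default")
for condition in isvc.get("status", {}).get("conditions", []):
    print(condition.get("type"), condition.get("status"), condition.get("reason"))

# Or simply check overall readiness
print("Ready:", client.is_isvc_ready("sklearn-iris", namespace="default"))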
os.environ["KSERVE_LOGLEVEL"] = "DEBUG"
from kserve import logger
logger.setLevel("DEBUG")python model.py --enable_docs_url true
# Access at http://localhost:8080/docs

# Get Prometheus metrics
curl http://localhost:8080/metrics
# Check specific metric
curl http://localhost:8080/metrics | grep request_predict_seconds
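Custom metrics defined with prometheus_client appear alongside KServe's built-in metrics, assuming the server exposes the default prometheus_client registry on its /metrics endpoint (the case in recent releases); a minimal sketch with a hypothetical histogram:

import time
from prometheus_client import Histogram
from kserve import Model

# Hypothetical custom metric; shows up on the same /metrics endpoint
PREPROCESS_SECONDS = Histogram(
    "custom_preprocess_seconds",
    "Time spent preprocessing each request",
)

class InstrumentedModel(Model):
    def predict(self, payload, headers=None):
        start = time.time()
        instances = payload["instances"]  # preprocessing would go here
        PREPROCESS_SECONDS.observe(time.time() - start)
        return {"predictions": self.model.predict(instances).tolist()}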