Comprehensive Python SDK for Amazon SageMaker that provides a unified interface for machine learning workflows, including model training, deployment, and MLOps.
```bash
tessl install tessl/pypi-sagemaker@3.3.0
```

The SageMaker Python SDK is a comprehensive Python library for training and deploying machine learning models on Amazon SageMaker. It provides a unified interface for the complete machine learning workflow, from data preparation and distributed training to model deployment, monitoring, and pipeline orchestration.
```bash
pip install sagemaker
```

```python
# Training
from sagemaker.train import ModelTrainer, Session, get_execution_role
from sagemaker.train.tuner import HyperparameterTuner
# Serving/Inference
from sagemaker.serve import ModelBuilder, InferenceSpec, ModelServer
# MLOps/Pipelines
from sagemaker.mlops.workflow import Pipeline, TrainingStep, ProcessingStep
# Core functionality
from sagemaker.core import Processor, Transformer
from sagemaker.core.workflow import ParameterString, ConditionEquals
```

```python
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import InputData, Compute

# Create trainer
trainer = ModelTrainer(
    training_image="my-training-image",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    compute=Compute(instance_type="ml.m5.xlarge", instance_count=1)
)

# Prepare training data
train_data = InputData(
    channel_name="training",
    data_source="s3://my-bucket/train"
)

# Start training
trainer.train(input_data_config=[train_data])
```

```python
from sagemaker.serve import ModelBuilder
# Build model
builder = ModelBuilder(
    model="my-model",
    model_path="s3://my-bucket/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.xlarge"
)

# Deploy to endpoint
model = builder.build()
endpoint = builder.deploy(endpoint_name="my-endpoint")

# Make predictions
result = endpoint.invoke(data=input_data)
```

```python
from sagemaker.mlops.workflow import Pipeline, TrainingStep, ProcessingStep
from sagemaker.train import ModelTrainer
from sagemaker.core import Processor

# Define steps (trainer and processor are configured as in the examples above)
training_step = TrainingStep(name="Train", estimator=trainer)
processing_step = ProcessingStep(name="Process", processor=processor)

# Create pipeline
pipeline = Pipeline(
    name="my-pipeline",
    steps=[processing_step, training_step]
)

# Execute pipeline
execution = pipeline.start()
```

The SageMaker Python SDK V3 uses a modular architecture with four sub-packages:

- `sagemaker.train`: model training, hyperparameter tuning, fine-tuning, and evaluation
- `sagemaker.serve`: model building, deployment, and inference
- `sagemaker.mlops`: pipeline orchestration and workflow steps
- `sagemaker.core`: processing, monitoring, and shared building blocks
All packages use namespace packaging under sagemaker.* for unified imports.
Comprehensive model training capabilities including distributed training, hyperparameter tuning, and fine-tuning for foundation models.
```python
from sagemaker.train import ModelTrainer
from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.train.sft_trainer import SFTTrainer
from sagemaker.train.dpo_trainer import DPOTrainer
from sagemaker.train.rlaif_trainer import RLAIFTrainer
from sagemaker.train.rlvr_trainer import RLVRTrainer
```
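The quick-start trainer above can be wrapped in a tuner. A minimal sketch follows, assuming the V3 `HyperparameterTuner` keeps keyword names similar to earlier SDK versions (`objective_metric_name`, `hyperparameter_ranges`, `max_jobs`, `max_parallel_jobs`); the range format and the `fit` signature are likewise assumptions to verify against the API reference.

```python
# Illustrative sketch only: keyword names and the range format are assumed
# from earlier SDK versions and may differ in V3.
from sagemaker.train.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    trainer,                                    # ModelTrainer from the quick start
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={                     # placeholder range format
        "learning_rate": (1e-5, 1e-2),
        "batch_size": (16, 128),
    },
    max_jobs=10,                                # total training jobs to launch
    max_parallel_jobs=2,                        # concurrency cap
)
tuner.fit(input_data_config=[train_data])       # reuses the InputData channel
```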
Unified interface for model deployment with support for multiple frameworks, model servers, and deployment modes (SageMaker endpoint, local container, in-process).

```python
from sagemaker.serve import ModelBuilder, InferenceSpec, ModelServer, Mode
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.utils.payload_translator import CustomPayloadTranslator
```
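The deployment modes listed above map to the `Mode` enum, and `SchemaBuilder` lets the SDK derive request/response (de)serialization from sample payloads. A sketch, assuming the v2-style `SchemaBuilder(sample_input, sample_output)` signature carries over to V3:

```python
# Sketch: build a model with an explicit schema and test it in a local
# container before creating a SageMaker endpoint. Signatures assumed from v2.
from sagemaker.serve import ModelBuilder, Mode
from sagemaker.serve.builder.schema_builder import SchemaBuilder

schema = SchemaBuilder(
    sample_input={"inputs": "What is Amazon SageMaker?"},
    sample_output=[{"generated_text": "Amazon SageMaker is ..."}],
)

builder = ModelBuilder(
    model="my-model",
    schema_builder=schema,
    mode=Mode.LOCAL_CONTAINER,   # or Mode.SAGEMAKER_ENDPOINT / Mode.IN_PROCESS
)
model = builder.build()
```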
Pipeline orchestration with 13+ step types for building complex ML workflows with conditional execution, parallelism, and retry policies.

```python
from sagemaker.mlops.workflow import Pipeline, PipelineGraph
from sagemaker.mlops.workflow import TrainingStep, ProcessingStep, TransformStep, TuningStep
from sagemaker.mlops.workflow import ConditionStep, LambdaStep, ModelStep
```
Data processing and batch transformation capabilities for preprocessing, feature engineering, and batch inference.

```python
from sagemaker.core import Processor, ScriptProcessor, FrameworkProcessor, Transformer
```
Building blocks for creating parameterized, conditional workflows with pipeline variables, parameters, functions, and conditions.

```python
from sagemaker.core.workflow import (
    ParameterString, ParameterInteger, ParameterFloat, ParameterBoolean,
    ConditionEquals, ConditionGreaterThan, ConditionLessThan,
    Join, JsonGet, Properties, ExecutionVariables
)
```
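These building blocks combine with the step types from `sagemaker.mlops.workflow`. A sketch of a parameterized, conditional pipeline, assuming v2-style constructor arguments (`conditions`, `if_steps`, `else_steps`, `parameters`):

```python
# Sketch: branch a pipeline on a string parameter. Argument names follow
# earlier SDK versions and should be checked against the V3 reference.
from sagemaker.core.workflow import ParameterString, ConditionEquals
from sagemaker.mlops.workflow import Pipeline, ConditionStep

deploy_env = ParameterString(name="DeployEnv", default_value="staging")

cond_step = ConditionStep(
    name="CheckEnv",
    conditions=[ConditionEquals(left=deploy_env, right="production")],
    if_steps=[training_step],      # steps defined as in the pipeline example
    else_steps=[processing_step],
)

pipeline = Pipeline(
    name="conditional-pipeline",
    parameters=[deploy_env],
    steps=[cond_step],
)
execution = pipeline.start(parameters={"DeployEnv": "production"})
```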
Comprehensive model evaluation with benchmark evaluations, custom scorers, and LLM-as-judge evaluations.

```python
from sagemaker.train import (
    BenchMarkEvaluator, CustomScorerEvaluator, LLMAsJudgeEvaluator,
    EvaluationPipelineExecution, get_benchmarks, get_builtin_metrics
)
```
Monitor model quality, data quality, bias, and explainability in production with customizable monitoring schedules and alerts.

```python
from sagemaker.core.model_monitor import (
    ModelMonitor, DefaultModelMonitor, ModelQualityMonitor,
    ModelBiasMonitor, ModelExplainabilityMonitor,
    DataCaptureConfig, MonitoringSchedule
)
```
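A sketch of wiring these together, assuming the v2-style `DataCaptureConfig` fields and `DefaultModelMonitor.create_monitoring_schedule` arguments carry over to V3:

```python
# Sketch: capture endpoint traffic, then run an hourly data-quality check.
# Field and method names are assumptions based on earlier SDK versions.
from sagemaker.core.model_monitor import DataCaptureConfig, DefaultModelMonitor

capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,                        # capture every request
    destination_s3_uri="s3://my-bucket/data-capture/",
)
# Pass capture_config when deploying the endpoint, then schedule checks:

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
monitor.create_monitoring_schedule(
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitoring/",
    schedule_cron_expression="cron(0 * ? * * *)",   # hourly
)
```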
Track, organize, and compare machine learning experiments with integration for MLflow.

```python
from sagemaker.core.experiments import Experiment, Run
```
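A sketch of the run-tracking pattern, assuming the v2 `Run` context manager and its `log_parameter`/`log_metric` methods are unchanged in V3:

```python
# Sketch: track a training run. Method names assumed from earlier versions.
from sagemaker.core.experiments import Run

with Run(experiment_name="churn-model", run_name="trial-1") as run:
    run.log_parameter("learning_rate", 0.01)
    # ... train the model ...
    run.log_metric(name="validation:accuracy", value=0.93)
```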
Access pre-trained models, example notebooks, and solution templates from SageMaker JumpStart.

```python
from sagemaker.core.jumpstart import (
    JumpStartModelsAccessor, JumpStartConfig,
    SageMakerSettings
)
```
Execute Python functions remotely on SageMaker infrastructure with automatic dependency management.

```python
from sagemaker.core.remote_function import remote, RemoteExecutor
```
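A sketch of the decorator pattern, assuming the v2-style `@remote` keywords (`instance_type`, `dependencies`) carry over to the V3 module path shown above:

```python
# Sketch: run a plain Python function as a SageMaker job.
from sagemaker.core.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies="./requirements.txt")
def divide(a: float, b: float) -> float:
    # Executes on SageMaker; dependencies are packaged automatically
    return a / b

result = divide(3.0, 2.0)   # blocks until the remote job completes
```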
Debug and profile training jobs with TensorBoard integration and rule-based monitoring.

```python
from sagemaker.core.debugger import (
    DebuggerHookConfig, TensorBoardOutputConfig, Rule, ProfilerRule,
    ProfilerConfig, FrameworkProfile
)
```
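A minimal sketch of constructing these configs; where they attach in V3 (for example, a keyword on `ModelTrainer`) is an assumption, so treat this as illustrative:

```python
# Sketch: debug output to S3 plus TensorBoard artifacts. The trainer-side
# keyword that accepts these objects in V3 is an assumption.
from sagemaker.core.debugger import DebuggerHookConfig, TensorBoardOutputConfig

hook_config = DebuggerHookConfig(s3_output_path="s3://my-bucket/debug/")
tensorboard_config = TensorBoardOutputConfig(
    s3_output_path="s3://my-bucket/tensorboard/",
    container_local_output_path="/opt/ml/output/tensorboard",
)
```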
Detect bias and explain model predictions using SageMaker Clarify.

```python
from sagemaker.core.clarify import (
    SageMakerClarifyProcessor, DataConfig, BiasConfig, ModelConfig,
    SHAPConfig, PDPConfig
)
```
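A sketch of a pre-training bias check, assuming the v2-style `SageMakerClarifyProcessor`, `DataConfig`, and `BiasConfig` signatures carry over to V3:

```python
# Sketch: analyze a training dataset for bias before training.
# Argument names are assumptions based on earlier SDK versions.
from sagemaker.core.clarify import SageMakerClarifyProcessor, DataConfig, BiasConfig

processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="target",
    dataset_type="text/csv",
)
bias_config = BiasConfig(
    label_values_or_threshold=[1],   # the positive outcome
    facet_name="age",                # sensitive attribute to analyze
)
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)
```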
Serializers and deserializers for various data formats including JSON, CSV, NumPy, Pandas, PyTorch tensors, and more.

```python
from sagemaker.core.serializers import (
    JSONSerializer, CSVSerializer, NumpySerializer,
    TorchTensorSerializer, IdentitySerializer
)
from sagemaker.core.deserializers import (
    JSONDeserializer, CSVDeserializer, NumpyDeserializer,
    PandasDeserializer, BytesDeserializer
)
```
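A quick round trip, assuming the v2 convention that serializers emit request bytes/strings and deserializers consume a response stream plus content type:

```python
# Sketch: manual JSON round trip, mirroring what an endpoint invocation does.
import io
from sagemaker.core.serializers import JSONSerializer
from sagemaker.core.deserializers import JSONDeserializer

payload = JSONSerializer().serialize({"inputs": [1, 2, 3]})        # JSON string
stream = io.BytesIO(payload.encode("utf-8"))                       # fake response body
result = JSONDeserializer().deserialize(stream, "application/json")
```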
Track lineage of ML artifacts, models, datasets, and their relationships.

```python
from sagemaker.core.lineage import (
    Action, Artifact, Association, Context,
    LineageQuery, LineageFilter
)
```
Auto-generated resource classes providing direct access to 110+ SageMaker APIs for advanced use cases.

```python
from sagemaker.core.resources import (
    TrainingJob, ProcessingJob, TransformJob,
    Model, Endpoint, EndpointConfig,
    ModelPackage, ModelPackageGroup, ModelCard,
    Pipeline, PipelineExecution, Experiment, Trial
)
```
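A sketch of direct resource access, assuming these classes follow the sagemaker-core convention of `get`/`create` classmethods and snake_case attributes:

```python
# Sketch: read a training job's status without the higher-level trainer.
# The .get classmethod and attribute names are assumptions for V3.
from sagemaker.core.resources import TrainingJob

job = TrainingJob.get(training_job_name="my-training-job")
print(job.training_job_status)   # e.g. "Completed"
```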
Helper utilities for uploading and downloading files to/from Amazon S3.

```python
from sagemaker.core.s3 import S3Uploader, S3Downloader
from sagemaker.core.s3 import parse_s3_url, is_s3_url, s3_path_join
```

Quick Example:
```python
from sagemaker.core.s3 import S3Uploader, S3Downloader, s3_path_join

# Upload files
s3_uri = S3Uploader.upload(
    local_path="./model.tar.gz",
    desired_s3_uri="s3://my-bucket/models/"
)

# Download files
files = S3Downloader.download(
    s3_uri="s3://my-bucket/models/model.tar.gz",
    local_path="./downloaded/"
)

# Build S3 paths
path = s3_path_join("s3://", "bucket", "prefix", "file.txt")
```
Manage datasets and evaluators in the SageMaker AI Registry Hub for model customization workflows.

```python
from sagemaker.ai_registry.dataset import DataSet
from sagemaker.ai_registry.evaluator import Evaluator, EvaluatorMethod
from sagemaker.ai_registry.dataset_utils import DataSetMethod
from sagemaker.ai_registry.air_constants import HubContentStatus
```

Quick Example:
```python
from sagemaker.ai_registry.dataset import DataSet
from sagemaker.ai_registry.evaluator import Evaluator

# Create dataset
dataset = DataSet.create(
    name="training-data",
    source="./data/train.jsonl",
    wait=True
)

# Create evaluator
evaluator = Evaluator.create(
    name="reward-function",
    type="RewardFunction",
    source="arn:aws:lambda:us-west-2:123456789012:function:reward",
    wait=True
)

# List datasets
datasets = DataSet.list()
```
Configuration classes for SageMaker Clarify explainability to interpret model predictions using SHAP values.

```python
from sagemaker.core.explainer import (
    ClarifyExplainerConfig, ClarifyShapConfig,
    ClarifyInferenceConfig, ClarifyShapBaselineConfig, ClarifyTextConfig
)
```

Quick Example:
```python
from sagemaker.core.explainer import (
    ClarifyExplainerConfig, ClarifyShapConfig, ClarifyShapBaselineConfig
)

# Configure explainability
baseline_config = ClarifyShapBaselineConfig(
    mime_type="text/csv",
    shap_baseline="0,0,0,0"
)
shap_config = ClarifyShapConfig(
    shap_baseline_config=baseline_config,
    number_of_samples=100
)
explainer_config = ClarifyExplainerConfig(shap_config=shap_config)

# Deploy with explainability
endpoint = builder.deploy(
    endpoint_name="explainable-endpoint",
    explainer_config=explainer_config
)
```

All SageMaker operations require appropriate IAM permissions. The execution role must have:
- The AmazonSageMakerFullAccess managed policy, or
- A custom policy with the minimum required permissions.

Minimum permissions include:
- `sagemaker:CreateTrainingJob`, `sagemaker:DescribeTrainingJob` for training
- `sagemaker:CreateEndpoint`, `sagemaker:InvokeEndpoint` for inference
- `s3:GetObject`, `s3:PutObject` for S3 data access
- `logs:CreateLogGroup`, `logs:CreateLogStream` for CloudWatch logs

Security Best Practices:
Regional Considerations:

Not all SageMaker features and instance types are available in all regions. Check the AWS documentation for regional availability of specific features and instance types.
Cost Optimization Strategies:
- Use Spot Instances: Enable `use_spot_instances=True` for training to reduce costs by up to 90%. Always set `max_wait_time_in_seconds` appropriately (see the sketch after this list).
- Right-size Instances: Start with smaller instances and scale up as needed. Profile the workload before choosing expensive GPU instances.
- Enable Logging: Always enable CloudWatch logs for debugging, with appropriate retention periods. Use structured logging in training code.
- Tag Resources: Use consistent tagging for cost tracking and resource management. Include project, environment, and owner tags.
- Clean Up Resources: Delete endpoints when not in use to avoid ongoing charges. Implement automated cleanup for test resources.
- Use Checkpoints: Enable checkpointing for long-running training jobs. This is critical for spot instance training, to resume after interruptions.
- Monitor Metrics: Track training metrics to detect issues early. Set up CloudWatch alarms for anomalies.
- Version Models: Use the Model Registry to track model versions and lineage. Implement approval workflows for production deployments.
- Implement Retry Logic: Use retry strategies for transient failures. Configure appropriate backoff and maximum attempts.
- VPC Configuration: Use a VPC for production workloads that require network isolation. Configure security groups and NACLs appropriately.
- Data Validation: Validate input data before training. Implement data quality checks in processing steps.
- Experiment Tracking: Use Experiments to organize and compare runs. Log all hyperparameters and metrics systematically.
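A sketch of the spot-training settings referenced above; whether these flags live on `ModelTrainer` or on the `Compute` config in V3 is an assumption, and the keyword names follow this document's own terminology:

```python
# Sketch: managed spot training with checkpointing for interruption recovery.
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import Compute

trainer = ModelTrainer(
    training_image="my-training-image",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    compute=Compute(instance_type="ml.m5.xlarge", instance_count=1),
    use_spot_instances=True,             # interruptible capacity, up to ~90% cheaper
    max_runtime_in_seconds=3600,         # cap on billed training time
    max_wait_time_in_seconds=7200,       # must exceed max_runtime_in_seconds
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume after interruption
)
```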
Always implement proper error handling when using the SageMaker SDK:
```python
from botocore.exceptions import ClientError, WaiterError
from sagemaker.exceptions import CapacityError

try:
    trainer.train(input_data_config=[train_data])
except ClientError as e:
    error_code = e.response['Error']['Code']
    if error_code == 'ResourceLimitExceeded':
        print("Instance limit exceeded, try different instance type or region")
    elif error_code == 'ValidationException':
        print(f"Invalid configuration: {e}")
    elif error_code == 'ThrottlingException':
        print("API rate limit exceeded, implement exponential backoff")
    else:
        raise
except CapacityError as e:
    print(f"Insufficient capacity: {e}. Try different instance type or region")
except WaiterError as e:
    print(f"Training job did not reach expected state: {e}")
except Exception as e:
    print(f"Training failed: {e}")
    # Implement appropriate recovery logic
```

Common Error Scenarios:
- `ResourceLimitExceeded`: Exceeded the service quota for an instance type or for concurrent jobs. Request a quota increase or use different resources.
- `CapacityError`: Insufficient capacity in the availability zone. Retry in a different region or use a different instance type.
- `ValidationException`: Invalid parameter values or configurations. Review the API documentation for valid ranges and formats.
- `AccessDeniedException`: Insufficient IAM permissions. Review and update the execution role's permissions.
- `ThrottlingException`: API rate limits exceeded. Implement exponential backoff and retry logic (see the sketch after this list).
- `AlgorithmError`: Error in the training script. Check CloudWatch logs for stack traces and debugging information.
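For `ThrottlingException` in particular, a plain-Python backoff wrapper (independent of any SDK-specific signature) looks like this:

```python
# Exponential backoff with full jitter around any throttled call.
import random
import time
from botocore.exceptions import ClientError

def with_backoff(fn, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except ClientError as e:
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise                       # not a throttle; surface immediately
            if attempt == max_attempts - 1:
                raise                       # retries exhausted
            # Sleep between 0 and base_delay * 2^attempt seconds
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

with_backoff(lambda: trainer.train(input_data_config=[train_data]))
```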
- Use Pipe input mode for large datasets to stream data instead of downloading it; this significantly reduces training start time.
- Enable managed spot training for cost savings on interruptible workloads. Set `max_wait_time_in_seconds` greater than `max_runtime_in_seconds`.
- Use distributed training for large models and datasets. Choose an appropriate strategy (data parallel, model parallel, or hybrid).
- Configure appropriate instance types for the workload.
- Use SageMaker Processing for data preparation to parallelize across multiple instances with automatic data distribution.
- Enable caching in pipelines to avoid re-running unchanged steps. Specify `cache_config` with an appropriate expiration (see the sketch after this list).
- Optimize Docker images: use multi-stage builds, minimize layers, and cache dependencies appropriately.
- Batch predictions efficiently: use appropriate `max_payload` and `max_concurrent_transforms` values for transform jobs.
- Use keep-alive for endpoints: configure warm pools to reduce cold-start latency for repeated invocations.
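A sketch of the step caching mentioned above; the `CacheConfig` import path in V3 is an assumption (in earlier versions it lives under `sagemaker.workflow.steps`):

```python
# Sketch: cache a pipeline step so unchanged inputs skip re-execution.
from sagemaker.mlops.workflow import TrainingStep, CacheConfig  # path assumed

cache_config = CacheConfig(enable_caching=True, expire_after="P30D")  # ISO 8601 duration
training_step = TrainingStep(name="Train", estimator=trainer, cache_config=cache_config)
```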
Service quotas apply per account and region to training jobs, endpoints, pipelines, and monitoring schedules; check the Service Quotas console for current limits before scaling workloads.