Experiment tracking and management for organizing ML workflows, comparing runs, and managing model metadata.
Experiment management for organizing related trials and runs.
class Experiment:
"""
Experiment management for ML workflows.
Parameters:
experiment_name: str - Experiment name (required)
- 1-120 characters
- Alphanumeric, hyphens, underscores
description: Optional[str] - Experiment description
- Maximum 3072 characters
tags: Optional[List[Tag]] - Resource tags
sagemaker_session: Optional[Session] - SageMaker session
Methods:
create(experiment_name, description=None, tags=None, sagemaker_session=None) -> Experiment
Create new experiment.
Parameters:
experiment_name: str - Experiment name (required)
description: Optional[str] - Description
tags: Optional[List[Tag]] - Tags
sagemaker_session: Optional[Session] - Session
Returns:
Experiment: Created experiment
Raises:
ValueError: If experiment_name invalid
ClientError: If experiment already exists
load(experiment_name, sagemaker_session=None) -> Experiment
Load existing experiment.
Parameters:
experiment_name: str - Experiment name (required)
sagemaker_session: Optional[Session] - Session
Returns:
Experiment: Loaded experiment
Raises:
ClientError: If experiment doesn't exist
list(sort_by="CreationTime", sort_order="Descending", max_results=100,
sagemaker_session=None) -> List[Experiment]
List experiments.
Parameters:
sort_by: str - Sort field (default: "CreationTime")
- "CreationTime", "Name"
sort_order: str - Sort order (default: "Descending")
- "Ascending", "Descending"
max_results: int - Maximum results (default: 100, max: 100)
sagemaker_session: Optional[Session] - Session
Returns:
List[Experiment]: Experiments list
delete() -> None
Delete the experiment.
Raises:
ClientError: If experiment has associated runs (delete runs first)
Attributes:
experiment_name: str - Experiment name
experiment_arn: str - Experiment ARN
description: Optional[str] - Experiment description
creation_time: datetime - Creation timestamp
created_by: Dict - Creator information
last_modified_time: datetime - Last modification timestamp
last_modified_by: Dict - Last modifier information
Notes:
- Experiments organize related runs/trials
- Delete all runs before deleting experiment
- Tags useful for cost tracking and organization
- Cannot rename experiment after creation
"""Usage:
from sagemaker.core.experiments import Experiment
from botocore.exceptions import ClientError
# Create experiment for project
try:
experiment = Experiment.create(
experiment_name="customer-churn-prediction",
description="Experiments for customer churn prediction model",
tags=[
{"Key": "Project", "Value": "CustomerChurn"},
{"Key": "Team", "Value": "DataScience"}
]
)
print(f"Experiment created: {experiment.experiment_arn}")
except ClientError as e:
if e.response['Error']['Code'] == 'ResourceInUse':
print("Experiment already exists, loading...")
experiment = Experiment.load("customer-churn-prediction")
# List all experiments
experiments = Experiment.list(
sort_by="CreationTime",
sort_order="Descending",
max_results=20
)
print(f"\nRecent experiments:")
for exp in experiments[:5]:
print(f" {exp.experiment_name} - {exp.description}")
# Delete experiment (after deleting all runs)
# experiment.delete()
Run management for individual training runs within experiments.
class Run:
"""
Run management for tracking training executions.
Parameters:
experiment_name: str - Parent experiment name (required)
run_name: Optional[str] - Run name (auto-generated if not provided)
- Format: auto-generated includes timestamp
sagemaker_session: Optional[Session] - SageMaker session
Methods:
log_parameter(name, value) -> None
Log single parameter.
Parameters:
name: str - Parameter name (required)
value: Union[str, int, float, bool] - Parameter value (required)
Raises:
ValueError: If value not JSON-serializable
log_parameters(parameters) -> None
Log multiple parameters.
Parameters:
parameters: Dict[str, Any] - Parameters dictionary (required)
log_metric(name, value, step=None, timestamp=None) -> None
Log metric value.
Parameters:
name: str - Metric name (required)
value: float - Metric value (required)
step: Optional[int] - Training step/epoch
timestamp: Optional[datetime] - Timestamp
log_metrics(metrics, step=None) -> None
Log multiple metrics.
Parameters:
metrics: Dict[str, float] - Metrics dictionary (required)
step: Optional[int] - Training step/epoch
log_artifact(name, value, media_type="text/plain") -> None
Log artifact.
Parameters:
name: str - Artifact name (required)
value: str - Artifact value (required)
media_type: str - Media type (default: "text/plain")
log_file(file_path, name=None, media_type=None, is_output=True) -> None
Log file as artifact.
Parameters:
file_path: str - Local file path (required)
name: Optional[str] - Artifact name (default: filename)
media_type: Optional[str] - Media type (auto-detected)
is_output: bool - Is output artifact (default: True)
Raises:
FileNotFoundError: If file doesn't exist
log_model(model_data_uri, model_type=None, framework=None, framework_version=None) -> None
Log model artifact.
Parameters:
model_data_uri: str - S3 URI for model (required)
model_type: Optional[str] - Model type
framework: Optional[str] - Framework name
framework_version: Optional[str] - Framework version
wait() -> None
Wait for run to complete (if associated with job).
list(experiment_name, sort_by="CreationTime", sort_order="Descending",
max_results=100) -> List[Run]
List runs in experiment.
Parameters:
experiment_name: str - Experiment name (required)
sort_by: str - Sort field
sort_order: str - Sort order
max_results: int - Maximum results (1-100)
Returns:
List[Run]: Runs list
Context Manager:
Use with 'with' statement for automatic resource management and cleanup.
Attributes:
run_name: str - Run name
experiment_name: str - Parent experiment name
run_arn: str - Run ARN
status: str - Run status
Notes:
- Use context manager for automatic cleanup
- Log parameters before training
- Log metrics during/after training
- Log model and artifacts after training
- Parameters immutable after logging
- Metrics can be logged multiple times (time series)
"""Usage:
from sagemaker.core.experiments import Run
import json
# Create and use run with context manager (session: an existing SageMaker Session object)
with Run(
experiment_name="customer-churn-prediction",
run_name="xgboost-trial-1",
sagemaker_session=session
) as run:
# Log hyperparameters at start
run.log_parameter("algorithm", "xgboost")
run.log_parameter("learning_rate", 0.1)
run.log_parameter("max_depth", 5)
run.log_parameter("num_rounds", 100)
# Or log all at once
run.log_parameters({
"min_child_weight": 3,
"subsample": 0.8,
"colsample_bytree": 0.8
})
# Train model (pseudo-code)
model, history = train_xgboost_model()
# Log metrics during training
for epoch, metrics in enumerate(history):
run.log_metrics({
"train_loss": metrics["train_loss"],
"train_accuracy": metrics["train_acc"],
"val_loss": metrics["val_loss"],
"val_accuracy": metrics["val_acc"]
}, step=epoch)
# Log final metrics
run.log_metric("final_accuracy", 0.94)
run.log_metric("final_f1", 0.92)
run.log_metric("auc_roc", 0.96)
# Log model
model_uri = "s3://my-bucket/models/xgboost-model.tar.gz"
run.log_model(
model_data_uri=model_uri,
model_type="xgboost",
framework="xgboost",
framework_version="1.7.3"
)
# Log artifacts
run.log_file(
file_path="confusion_matrix.png",
name="confusion_matrix",
media_type="image/png"
)
run.log_file(
file_path="feature_importance.json",
name="feature_importance",
media_type="application/json"
)
# Log custom artifact
config = {
"preprocessing": "standard_scaler",
"feature_selection": "top_20",
"class_weights": {0: 1.0, 1: 2.5}
}
run.log_artifact(
name="training_config",
value=json.dumps(config),
media_type="application/json"
)
# Run automatically closed and finalized
print(f"Run completed: {run.run_name}")from sagemaker.train import ModelTrainer
from sagemaker.core.experiments import Run, Experiment
# Create experiment if needed
try:
experiment = Experiment.create(
experiment_name="hyperparameter-search",
description="Finding optimal hyperparameters for ResNet"
)
except ClientError:
experiment = Experiment.load("hyperparameter-search")
# Run training with experiment tracking
hyperparams_to_test = [
{"learning_rate": 0.01, "batch_size": 32},
{"learning_rate": 0.001, "batch_size": 64},
{"learning_rate": 0.0001, "batch_size": 128}
]
best_accuracy = 0
best_run = None
for i, hyperparams in enumerate(hyperparams_to_test):
with Run(
experiment_name=experiment.experiment_name,
run_name=f"trial-{i+1}"
) as run:
# Log hyperparameters
run.log_parameters(hyperparams)
run.log_parameter("optimizer", "adam")
run.log_parameter("epochs", 10)
# Create and train model (Compute, role, train_data, val_data assumed defined elsewhere)
trainer = ModelTrainer(
training_image="pytorch-image",
role=role,
compute=Compute(
instance_type="ml.p3.2xlarge",
instance_count=1
),
hyperparameters=hyperparams
)
trainer.train(input_data_config=[train_data, val_data])
# Get metrics from training job
job = trainer._latest_training_job
final_metrics = job.final_metric_data_list
# Log metrics
for metric in final_metrics:
metric_name = metric["MetricName"]
metric_value = metric["Value"]
run.log_metric(metric_name, metric_value)
if metric_name == "validation:accuracy" and metric_value > best_accuracy:
best_accuracy = metric_value
best_run = run.run_name
# Log model artifact
run.log_model(
model_data_uri=job.model_artifacts["S3ModelArtifacts"],
model_type="pytorch",
framework="pytorch",
framework_version="2.0"
)
print(f"\nBest run: {best_run} with accuracy: {best_accuracy}")from sagemaker.mlops.workflow import Pipeline, TrainingStep, PipelineExperimentConfig
from sagemaker.core.workflow import ExecutionVariables
# Configure pipeline with experiment tracking
experiment_config = PipelineExperimentConfig(
experiment_name="pipeline-experiment",
trial_name=ExecutionVariables.PIPELINE_EXECUTION_ID # Unique per execution
)
# Create pipeline
pipeline = Pipeline(
name="training-pipeline",
steps=[preprocess_step, train_step, evaluate_step],
pipeline_experiment_config=experiment_config
)
# Each execution creates a new trial/run
execution1 = pipeline.start() # Creates trial with execution ID 1
execution2 = pipeline.start() # Creates trial with execution ID 2
# List runs created by pipeline
runs = Run.list(experiment_name="pipeline-experiment")
print(f"Total pipeline runs: {len(runs)}")from sagemaker.core.experiments import Run
import pandas as pd
# Get all runs from experiment
runs = Run.list(
experiment_name="hyperparameter-search",
sort_by="CreationTime",
sort_order="Descending"
)
# Extract metrics and parameters
results = []
for run in runs:
# Load run details
run_obj = Run(
experiment_name=run.experiment_name,
run_name=run.run_name
)
# Get logged data (access via SageMaker API)
run_details = {
"run_name": run.run_name,
"creation_time": run.creation_time,
# Parameters and metrics retrieved via describe API
}
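# Hypothetical sketch (not part of the documented Run API): one way to pull the
# logged parameters and metrics is the boto3 DescribeTrialComponent call, assuming
# each run is backed by a trial component whose name matches the run name in your
# account. Merging the values into run_details provides the learning_rate,
# batch_size, and validation_accuracy columns used below.
# import boto3
# tc = boto3.client("sagemaker").describe_trial_component(TrialComponentName=run.run_name)
# run_details.update({k: v.get("NumberValue", v.get("StringValue"))
#                     for k, v in tc.get("Parameters", {}).items()})
# run_details.update({m["MetricName"]: m.get("Last") for m in tc.get("Metrics", [])})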
results.append(run_details)
# Create comparison DataFrame
df = pd.DataFrame(results)
# Find best run by metric
best_run = df.loc[df['validation_accuracy'].idxmax()]
print(f"\nBest run: {best_run['run_name']}")
print(f"Parameters: learning_rate={best_run['learning_rate']}, batch_size={best_run['batch_size']}")
print(f"Validation accuracy: {best_run['validation_accuracy']}")
# Visualize results
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(df['learning_rate'], df['validation_accuracy'], s=df['batch_size'])
plt.xlabel('Learning Rate')
plt.ylabel('Validation Accuracy')
plt.title('Hyperparameter Search Results')
plt.xscale('log')
plt.show()
Organizing parent and child runs for ensemble models.
# Parent run for ensemble
with Run(
experiment_name="ensemble-models",
run_name="ensemble-voting-v1"
) as parent_run:
parent_run.log_parameter("ensemble_type", "voting")
parent_run.log_parameter("voting_strategy", "soft")
parent_run.log_parameter("num_models", 3)
models = ["xgboost", "random_forest", "neural_net"]
model_scores = []
# Child runs for individual models
for i, model_type in enumerate(models):
with Run(
experiment_name="ensemble-models",
run_name=f"ensemble-v1-model-{i}-{model_type}"
) as child_run:
# Link to parent
child_run.log_parameter("parent_run", parent_run.run_name)
child_run.log_parameter("model_type", model_type)
child_run.log_parameter("ensemble_index", i)
# Train individual model (pseudo-code helper; also returns elapsed training time)
model, accuracy, training_time = train_model(model_type)
model_scores.append(accuracy)
# Log child metrics
child_run.log_metric("accuracy", accuracy)
child_run.log_metric("training_time", training_time)
# Log model
child_run.log_model(
model_data_uri=f"s3://bucket/models/{model_type}.tar.gz",
model_type=model_type
)
# Log ensemble metrics
ensemble_predictions = create_ensemble(models)
ensemble_accuracy = evaluate_ensemble(ensemble_predictions)
parent_run.log_metric("ensemble_accuracy", ensemble_accuracy)
parent_run.log_metric("improvement_over_best",
ensemble_accuracy - max(model_scores))
parent_run.log_parameters({
"model_1_accuracy": model_scores[0],
"model_2_accuracy": model_scores[1],
"model_3_accuracy": model_scores[2]
})
print(f"Ensemble run completed: {parent_run.run_name}")from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.experiments import Experiment, Run
# Create experiment for HPO
experiment = Experiment.create(
experiment_name="hpo-experiment",
description="Hyperparameter optimization for CNN"
)
# Each tuning trial automatically tracked as run
tuner = HyperparameterTuner(
model_trainer=trainer,
objective_metric_name="validation:accuracy",
hyperparameter_ranges=hyperparameter_ranges,
max_jobs=20,
max_parallel_jobs=3
)
# Start tuning
tuner.tune()
# Trials automatically logged as runs
runs = Run.list(experiment_name="hpo-experiment")
print(f"Total HPO trials: {len(runs)}")
# Each run contains:
# - Hyperparameter values
# - Training metrics
# - Model artifacts
# - Training job details
Logging rich artifacts (configuration, plots, and dataset statistics) to a run.
import matplotlib.pyplot as plt
import json
import numpy as np
with Run(experiment_name="model-analysis", run_name="visualization-run") as run:
# Log configuration as JSON
config = {
"architecture": "resnet50",
"preprocessing": {
"normalization": "imagenet",
"augmentation": ["flip", "rotate", "crop"]
},
"training": {
"optimizer": "adam",
"loss": "cross_entropy",
"metrics": ["accuracy", "f1"]
}
}
run.log_artifact(
name="config",
value=json.dumps(config, indent=2),
media_type="application/json"
)
# Generate and log training curve
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Training Progress')
plt.savefig("loss_curve.png")
run.log_file("loss_curve.png", name="training_curve", media_type="image/png")
# Log confusion matrix
plt.figure(figsize=(8, 8))
plot_confusion_matrix(confusion_matrix)
plt.savefig("confusion_matrix.png")
run.log_file("confusion_matrix.png", media_type="image/png")
# Log dataset statistics
stats = {
"total_samples": 50000,
"class_distribution": {
"class_0": 25000,
"class_1": 15000,
"class_2": 10000
},
"split": {
"train": 0.7,
"val": 0.15,
"test": 0.15
},
"features": {
"image_size": [224, 224, 3],
"normalization": "imagenet"
}
}
run.log_artifact(
name="dataset_stats",
value=json.dumps(stats, indent=2),
media_type="application/json"
)
# Log feature importance
feature_importance = calculate_feature_importance(model)
run.log_artifact(
name="feature_importance",
value=json.dumps(feature_importance),
media_type="application/json"
)
Capturing environment details for reproducible experiments.
import random
import numpy as np
import torch
import sys
import os
# Set all random seeds
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
seed = 42
set_seed(seed)
with Run(experiment_name="reproducible-experiment", run_name="trial-1") as run:
# Log all environment details
run.log_parameter("random_seed", seed)
run.log_parameter("python_version", sys.version)
run.log_parameter("torch_version", torch.__version__)
run.log_parameter("numpy_version", np.__version__)
run.log_parameter("cuda_version", torch.version.cuda if torch.cuda.is_available() else "none")
# Log hardware info
run.log_parameter("device", "cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
run.log_parameter("gpu_name", torch.cuda.get_device_name(0))
run.log_parameter("gpu_memory_gb", torch.cuda.get_device_properties(0).total_memory / 1e9)
# Log data hash for verification
import hashlib
data_hash = hashlib.sha256(training_data.tobytes()).hexdigest()
run.log_parameter("data_hash", data_hash)
# Training code with deterministic behavior
model = train_deterministic_model()
# Log model hash
model_hash = compute_model_hash(model)
run.log_parameter("model_hash", model_hash)
# Results fully reproducible with same seed
MLflow is automatically integrated when using mlflow_resource_arn parameters in evaluators and trainers.
# Create MLflow tracking server in SageMaker
import boto3
sm_client = boto3.client('sagemaker')
# Create tracking server
response = sm_client.create_mlflow_tracking_server(
TrackingServerName='my-mlflow-server',
ArtifactStoreUri='s3://my-bucket/mlflow',
RoleArn='arn:aws:iam::123456789012:role/MLflowRole',
AutomaticModelRegistration=True
)
mlflow_arn = response['TrackingServerArn']
# Use with evaluations
evaluator = BenchMarkEvaluator(
benchmark="MMLU",
model="my-model",
mlflow_resource_arn=mlflow_arn,
mlflow_experiment_name="model-evaluation",
mlflow_run_name="mmlu-baseline"
)
# Results automatically logged to MLflow
execution = evaluator.evaluate()
execution.wait()
# Access via MLflow UI or Python client
import mlflow
mlflow.set_tracking_uri(mlflow_arn)  # the tracking server ARN serves as the tracking URI with the sagemaker-mlflow plugin
experiment = mlflow.get_experiment_by_name("model-evaluation")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(runs[['run_id', 'metrics.accuracy', 'params.model']])
Legacy classes maintained for backward compatibility with SDK V2.
class _Trial:
"""
Internal trial management (deprecated, use Run instead).
Legacy class for backward compatibility with SDK V2.
New code should use Run class which provides same functionality
with improved API design.
Notes:
- Deprecated in SDK V3
- Use Run for new code
- Existing _Trial code continues to work
"""class _TrialComponent:
"""
Internal trial component management (deprecated, use Run instead).
Legacy class for backward compatibility with SDK V2.
New code should use Run class.
Notes:
- Deprecated in SDK V3
- Run class provides equivalent functionality
"""class _RunContext:
"""
Internal run context management.
Manages run lifecycle and resource cleanup.
Automatically used by Run context manager.
Notes:
- Internal implementation detail
- Handles run lifecycle: create, log, finalize
- Ensures proper cleanup on context exit
- Not intended for direct use
"""Experiment Already Exists:
Run Not Finalized:
with Run(...) as run: for automatic finalizationParameter Type Error:
File Not Found:
Metric Not Numeric:
Run Name Collision:
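A minimal sketch of avoiding run-name collisions, using the experiment from the examples above; the timestamp suffix is an illustrative convention, and the auto-generated alternative relies on the behavior noted in the Run parameter docs.
from datetime import datetime, timezone
from sagemaker.core.experiments import Run
# Option 1: append a UTC timestamp so repeated trials never collide
run_name = f"xgboost-trial-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"
with Run(experiment_name="customer-churn-prediction", run_name=run_name) as run:
    run.log_parameter("algorithm", "xgboost")
# Option 2: omit run_name and let the SDK auto-generate a unique, timestamped name
with Run(experiment_name="customer-churn-prediction") as run:
    run.log_parameter("algorithm", "xgboost")
    print(f"Auto-generated run name: {run.run_name}")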