Experiment tracking and management for organizing ML workflows, comparing runs, and managing model metadata.
Experiment management for organizing related trials and runs.
class Experiment:
"""
Experiment management for ML workflows.
Parameters:
experiment_name: str - Experiment name (required)
- 1-120 characters
- Alphanumeric, hyphens, underscores
description: Optional[str] - Experiment description
- Maximum 3072 characters
tags: Optional[List[Tag]] - Resource tags
sagemaker_session: Optional[Session] - SageMaker session
Methods:
create(experiment_name, description=None, tags=None, sagemaker_session=None) -> Experiment
Create new experiment.
Parameters:
experiment_name: str - Experiment name (required)
description: Optional[str] - Description
tags: Optional[List[Tag]] - Tags
sagemaker_session: Optional[Session] - Session
Returns:
Experiment: Created experiment
Raises:
ValueError: If experiment_name invalid
ClientError: If experiment already exists
load(experiment_name, sagemaker_session=None) -> Experiment
Load existing experiment.
Parameters:
experiment_name: str - Experiment name (required)
sagemaker_session: Optional[Session] - Session
Returns:
Experiment: Loaded experiment
Raises:
ClientError: If experiment doesn't exist
list(sort_by="CreationTime", sort_order="Descending", max_results=100,
sagemaker_session=None) -> List[Experiment]
List experiments.
Parameters:
sort_by: str - Sort field (default: "CreationTime")
- "CreationTime", "Name"
sort_order: str - Sort order (default: "Descending")
- "Ascending", "Descending"
max_results: int - Maximum results (default: 100, max: 100)
sagemaker_session: Optional[Session] - Session
Returns:
List[Experiment]: Experiments list
delete() -> None
Delete the experiment.
Raises:
ClientError: If experiment has associated runs (delete runs first)
Attributes:
experiment_name: str - Experiment name
experiment_arn: str - Experiment ARN
description: Optional[str] - Experiment description
creation_time: datetime - Creation timestamp
created_by: Dict - Creator information
last_modified_time: datetime - Last modification timestamp
last_modified_by: Dict - Last modifier information
Notes:
- Experiments organize related runs/trials
- Delete all runs before deleting experiment
- Tags useful for cost tracking and organization
- Cannot rename experiment after creation
"""Usage:
from sagemaker.core.experiments import Experiment
from botocore.exceptions import ClientError
# Create experiment for project
try:
experiment = Experiment.create(
experiment_name="customer-churn-prediction",
description="Experiments for customer churn prediction model",
tags=[
{"Key": "Project", "Value": "CustomerChurn"},
{"Key": "Team", "Value": "DataScience"}
]
)
print(f"Experiment created: {experiment.experiment_arn}")
except ClientError as e:
if e.response['Error']['Code'] == 'ResourceInUse':
print("Experiment already exists, loading...")
experiment = Experiment.load("customer-churn-prediction")
# List all experiments
experiments = Experiment.list(
sort_by="CreationTime",
sort_order="Descending",
max_results=20
)
print(f"\nRecent experiments:")
for exp in experiments[:5]:
print(f" {exp.experiment_name} - {exp.description}")
# Delete experiment (after deleting all runs)
# experiment.delete()
Run management for individual training runs within experiments.
class Run:
"""
Run management for tracking training executions.
Parameters:
experiment_name: str - Parent experiment name (required)
run_name: Optional[str] - Run name (auto-generated if not provided)
- Format: auto-generated includes timestamp
sagemaker_session: Optional[Session] - SageMaker session
Methods:
log_parameter(name, value) -> None
Log single parameter.
Parameters:
name: str - Parameter name (required)
value: Union[str, int, float, bool] - Parameter value (required)
Raises:
ValueError: If value not JSON-serializable
log_parameters(parameters) -> None
Log multiple parameters.
Parameters:
parameters: Dict[str, Any] - Parameters dictionary (required)
log_metric(name, value, step=None, timestamp=None) -> None
Log metric value.
Parameters:
name: str - Metric name (required)
value: float - Metric value (required)
step: Optional[int] - Training step/epoch
timestamp: Optional[datetime] - Timestamp
log_metrics(metrics, step=None) -> None
Log multiple metrics.
Parameters:
metrics: Dict[str, float] - Metrics dictionary (required)
step: Optional[int] - Training step/epoch
log_artifact(name, value, media_type="text/plain") -> None
Log artifact.
Parameters:
name: str - Artifact name (required)
value: str - Artifact value (required)
media_type: str - Media type (default: "text/plain")
log_file(file_path, name=None, media_type=None, is_output=True) -> None
Log file as artifact.
Parameters:
file_path: str - Local file path (required)
name: Optional[str] - Artifact name (default: filename)
media_type: Optional[str] - Media type (auto-detected)
is_output: bool - Is output artifact (default: True)
Raises:
FileNotFoundError: If file doesn't exist
log_model(model_data_uri, model_type=None, framework=None, framework_version=None) -> None
Log model artifact.
Parameters:
model_data_uri: str - S3 URI for model (required)
model_type: Optional[str] - Model type
framework: Optional[str] - Framework name
framework_version: Optional[str] - Framework version
wait() -> None
Wait for run to complete (if associated with job).
list(experiment_name, sort_by="CreationTime", sort_order="Descending",
max_results=100) -> List[Run]
List runs in experiment.
Parameters:
experiment_name: str - Experiment name (required)
sort_by: str - Sort field
sort_order: str - Sort order
max_results: int - Maximum results (1-100)
Returns:
List[Run]: Runs list
Context Manager:
Use with 'with' statement for automatic resource management and cleanup.
Attributes:
run_name: str - Run name
experiment_name: str - Parent experiment name
run_arn: str - Run ARN
status: str - Run status
Notes:
- Use context manager for automatic cleanup
- Log parameters before training
- Log metrics during/after training
- Log model and artifacts after training
- Parameters immutable after logging
- Metrics can be logged multiple times (time series)
"""Usage:
from sagemaker.core.experiments import Run
import json
# Create and use run with context manager (session: an existing SageMaker Session object)
with Run(
experiment_name="customer-churn-prediction",
run_name="xgboost-trial-1",
sagemaker_session=session
) as run:
# Log hyperparameters at start
run.log_parameter("algorithm", "xgboost")
run.log_parameter("learning_rate", 0.1)
run.log_parameter("max_depth", 5)
run.log_parameter("num_rounds", 100)
# Or log all at once
run.log_parameters({
"min_child_weight": 3,
"subsample": 0.8,
"colsample_bytree": 0.8
})
# Train model (pseudo-code)
model, history = train_xgboost_model()
# Log metrics during training
for epoch, metrics in enumerate(history):
run.log_metrics({
"train_loss": metrics["train_loss"],
"train_accuracy": metrics["train_acc"],
"val_loss": metrics["val_loss"],
"val_accuracy": metrics["val_acc"]
}, step=epoch)
# Log final metrics
run.log_metric("final_accuracy", 0.94)
run.log_metric("final_f1", 0.92)
run.log_metric("auc_roc", 0.96)
# Log model
model_uri = "s3://my-bucket/models/xgboost-model.tar.gz"
run.log_model(
model_data_uri=model_uri,
model_type="xgboost",
framework="xgboost",
framework_version="1.7.3"
)
# Log artifacts
run.log_file(
file_path="confusion_matrix.png",
name="confusion_matrix",
media_type="image/png"
)
run.log_file(
file_path="feature_importance.json",
name="feature_importance",
media_type="application/json"
)
# Log custom artifact
config = {
"preprocessing": "standard_scaler",
"feature_selection": "top_20",
"class_weights": {0: 1.0, 1: 2.5}
}
run.log_artifact(
name="training_config",
value=json.dumps(config),
media_type="application/json"
)
# Run automatically closed and finalized
print(f"Run completed: {run.run_name}")from sagemaker.train import ModelTrainer
from sagemaker.core.experiments import Run, Experiment
# Create experiment if needed
try:
experiment = Experiment.create(
experiment_name="hyperparameter-search",
description="Finding optimal hyperparameters for ResNet"
)
except ClientError:
experiment = Experiment.load("hyperparameter-search")
# Run training with experiment tracking
hyperparams_to_test = [
{"learning_rate": 0.01, "batch_size": 32},
{"learning_rate": 0.001, "batch_size": 64},
{"learning_rate": 0.0001, "batch_size": 128}
]
best_accuracy = 0
best_run = None
for i, hyperparams in enumerate(hyperparams_to_test):
with Run(
experiment_name=experiment.experiment_name,
run_name=f"trial-{i+1}"
) as run:
# Log hyperparameters
run.log_parameters(hyperparams)
run.log_parameter("optimizer", "adam")
run.log_parameter("epochs", 10)
# Create and train model (Compute, role, train_data, val_data assumed defined elsewhere)
trainer = ModelTrainer(
training_image="pytorch-image",
role=role,
compute=Compute(
instance_type="ml.p3.2xlarge",
instance_count=1
),
hyperparameters=hyperparams
)
trainer.train(input_data_config=[train_data, val_data])
# Get metrics from training job
job = trainer._latest_training_job
final_metrics = job.final_metric_data_list
# Log metrics
for metric in final_metrics:
metric_name = metric["MetricName"]
metric_value = metric["Value"]
run.log_metric(metric_name, metric_value)
if metric_name == "validation:accuracy" and metric_value > best_accuracy:
best_accuracy = metric_value
best_run = run.run_name
# Log model artifact
run.log_model(
model_data_uri=job.model_artifacts["S3ModelArtifacts"],
model_type="pytorch",
framework="pytorch",
framework_version="2.0"
)
print(f"\nBest run: {best_run} with accuracy: {best_accuracy}")from sagemaker.mlops.workflow import Pipeline, TrainingStep, PipelineExperimentConfig
from sagemaker.core.workflow import ExecutionVariables
# Configure pipeline with experiment tracking
experiment_config = PipelineExperimentConfig(
experiment_name="pipeline-experiment",
trial_name=ExecutionVariables.PIPELINE_EXECUTION_ID # Unique per execution
)
# Create pipeline
pipeline = Pipeline(
name="training-pipeline",
steps=[preprocess_step, train_step, evaluate_step],
pipeline_experiment_config=experiment_config
)
# Each execution creates a new trial/run
execution1 = pipeline.start() # Creates trial with execution ID 1
execution2 = pipeline.start() # Creates trial with execution ID 2
# List runs created by pipeline
runs = Run.list(experiment_name="pipeline-experiment")
print(f"Total pipeline runs: {len(runs)}")from sagemaker.core.experiments import Run
import pandas as pd
# Get all runs from experiment
runs = Run.list(
experiment_name="hyperparameter-search",
sort_by="CreationTime",
sort_order="Descending"
)
# Extract metrics and parameters
results = []
for run in runs:
# Load run details
run_obj = Run(
experiment_name=run.experiment_name,
run_name=run.run_name
)
# Get logged data (access via SageMaker API)
run_details = {
"run_name": run.run_name,
"creation_time": run.creation_time,
# Parameters and metrics retrieved via describe API
}
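# Hypothetical sketch (not part of the documented Run API): one way to pull the
# logged parameters and metrics is the boto3 DescribeTrialComponent call, assuming
# each run is backed by a trial component whose name matches the run name in your
# account. Merging the values into run_details provides the learning_rate,
# batch_size, and validation_accuracy columns used below.
# import boto3
# tc = boto3.client("sagemaker").describe_trial_component(TrialComponentName=run.run_name)
# run_details.update({k: v.get("NumberValue", v.get("StringValue"))
#                     for k, v in tc.get("Parameters", {}).items()})
# run_details.update({m["MetricName"]: m.get("Last") for m in tc.get("Metrics", [])})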
results.append(run_details)
# Create comparison DataFrame
df = pd.DataFrame(results)
# Find best run by metric
best_run = df.loc[df['validation_accuracy'].idxmax()]
print(f"\nBest run: {best_run['run_name']}")
print(f"Parameters: learning_rate={best_run['learning_rate']}, batch_size={best_run['batch_size']}")
print(f"Validation accuracy: {best_run['validation_accuracy']}")
# Visualize results
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(df['learning_rate'], df['validation_accuracy'], s=df['batch_size'])
plt.xlabel('Learning Rate')
plt.ylabel('Validation Accuracy')
plt.title('Hyperparameter Search Results')
plt.xscale('log')
plt.show()
Organizing parent and child runs for ensemble models.
# Parent run for ensemble
with Run(
experiment_name="ensemble-models",
run_name="ensemble-voting-v1"
) as parent_run:
parent_run.log_parameter("ensemble_type", "voting")
parent_run.log_parameter("voting_strategy", "soft")
parent_run.log_parameter("num_models", 3)
models = ["xgboost", "random_forest", "neural_net"]
model_scores = []
# Child runs for individual models
for i, model_type in enumerate(models):
with Run(
experiment_name="ensemble-models",
run_name=f"ensemble-v1-model-{i}-{model_type}"
) as child_run:
# Link to parent
child_run.log_parameter("parent_run", parent_run.run_name)
child_run.log_parameter("model_type", model_type)
child_run.log_parameter("ensemble_index", i)
# Train individual model (pseudo-code helper; also returns elapsed training time)
model, accuracy, training_time = train_model(model_type)
model_scores.append(accuracy)
# Log child metrics
child_run.log_metric("accuracy", accuracy)
child_run.log_metric("training_time", training_time)
# Log model
child_run.log_model(
model_data_uri=f"s3://bucket/models/{model_type}.tar.gz",
model_type=model_type
)
# Log ensemble metrics
ensemble_predictions = create_ensemble(models)
ensemble_accuracy = evaluate_ensemble(ensemble_predictions)
parent_run.log_metric("ensemble_accuracy", ensemble_accuracy)
parent_run.log_metric("improvement_over_best",
ensemble_accuracy - max(model_scores))
parent_run.log_parameters({
"model_1_accuracy": model_scores[0],
"model_2_accuracy": model_scores[1],
"model_3_accuracy": model_scores[2]
})
print(f"Ensemble run completed: {parent_run.run_name}")from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.experiments import Experiment, Run
# Create experiment for HPO
experiment = Experiment.create(
experiment_name="hpo-experiment",
description="Hyperparameter optimization for CNN"
)
# Each tuning trial automatically tracked as run
tuner = HyperparameterTuner(
model_trainer=trainer,
objective_metric_name="validation:accuracy",
hyperparameter_ranges=hyperparameter_ranges,
max_jobs=20,
max_parallel_jobs=3
)
# Start tuning
tuner.tune()
# Trials automatically logged as runs
runs = Run.list(experiment_name="hpo-experiment")
print(f"Total HPO trials: {len(runs)}")
# Each run contains:
# - Hyperparameter values
# - Training metrics
# - Model artifacts
# - Training job details
Logging rich artifacts (configuration, plots, and dataset statistics) to a run.
import matplotlib.pyplot as plt
import json
import numpy as np
with Run(experiment_name="model-analysis", run_name="visualization-run") as run:
# Log configuration as JSON
config = {
"architecture": "resnet50",
"preprocessing": {
"normalization": "imagenet",
"augmentation": ["flip", "rotate", "crop"]
},
"training": {
"optimizer": "adam",
"loss": "cross_entropy",
"metrics": ["accuracy", "f1"]
}
}
run.log_artifact(
name="config",
value=json.dumps(config, indent=2),
media_type="application/json"
)
# Generate and log training curve
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Training Progress')
plt.savefig("loss_curve.png")
run.log_file("loss_curve.png", name="training_curve", media_type="image/png")
# Log confusion matrix
plt.figure(figsize=(8, 8))
plot_confusion_matrix(confusion_matrix)
plt.savefig("confusion_matrix.png")
run.log_file("confusion_matrix.png", media_type="image/png")
# Log dataset statistics
stats = {
"total_samples": 50000,
"class_distribution": {
"class_0": 25000,
"class_1": 15000,
"class_2": 10000
},
"split": {
"train": 0.7,
"val": 0.15,
"test": 0.15
},
"features": {
"image_size": [224, 224, 3],
"normalization": "imagenet"
}
}
run.log_artifact(
name="dataset_stats",
value=json.dumps(stats, indent=2),
media_type="application/json"
)
# Log feature importance
feature_importance = calculate_feature_importance(model)
run.log_artifact(
name="feature_importance",
value=json.dumps(feature_importance),
media_type="application/json"
)
Capturing environment details for reproducible experiments.
import random
import numpy as np
import torch
import sys
import os
# Set all random seeds
def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
seed = 42
set_seed(seed)
with Run(experiment_name="reproducible-experiment", run_name="trial-1") as run:
# Log all environment details
run.log_parameter("random_seed", seed)
run.log_parameter("python_version", sys.version)
run.log_parameter("torch_version", torch.__version__)
run.log_parameter("numpy_version", np.__version__)
run.log_parameter("cuda_version", torch.version.cuda if torch.cuda.is_available() else "none")
# Log hardware info
run.log_parameter("device", "cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
run.log_parameter("gpu_name", torch.cuda.get_device_name(0))
run.log_parameter("gpu_memory_gb", torch.cuda.get_device_properties(0).total_memory / 1e9)
# Log data hash for verification
import hashlib
data_hash = hashlib.sha256(training_data.tobytes()).hexdigest()
run.log_parameter("data_hash", data_hash)
# Training code with deterministic behavior
model = train_deterministic_model()
# Log model hash
model_hash = compute_model_hash(model)
run.log_parameter("model_hash", model_hash)
# Results fully reproducible with same seed
MLflow is automatically integrated when using mlflow_resource_arn parameters in evaluators and trainers.
# Create MLflow tracking server in SageMaker
import boto3
sm_client = boto3.client('sagemaker')
# Create tracking server
response = sm_client.create_mlflow_tracking_server(
TrackingServerName='my-mlflow-server',
ArtifactStoreUri='s3://my-bucket/mlflow',
RoleArn='arn:aws:iam::123456789012:role/MLflowRole',
AutomaticModelRegistration=True
)
mlflow_arn = response['TrackingServerArn']
# Use with evaluations
evaluator = BenchMarkEvaluator(
benchmark="MMLU",
model="my-model",
mlflow_resource_arn=mlflow_arn,
mlflow_experiment_name="model-evaluation",
mlflow_run_name="mmlu-baseline"
)
# Results automatically logged to MLflow
execution = evaluator.evaluate()
execution.wait()
# Access via MLflow UI or Python client
import mlflow
mlflow.set_tracking_uri(mlflow_arn)  # the tracking server ARN serves as the tracking URI with the sagemaker-mlflow plugin
experiment = mlflow.get_experiment_by_name("model-evaluation")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(runs[['run_id', 'metrics.accuracy', 'params.model']])
Legacy classes maintained for backward compatibility with SDK V2.
class _Trial:
"""
Internal trial management (deprecated, use Run instead).
Legacy class for backward compatibility with SDK V2.
New code should use Run class which provides same functionality
with improved API design.
Notes:
- Deprecated in SDK V3
- Use Run for new code
- Existing _Trial code continues to work
"""class _TrialComponent:
"""
Internal trial component management (deprecated, use Run instead).
Legacy class for backward compatibility with SDK V2.
New code should use Run class.
Notes:
- Deprecated in SDK V3
- Run class provides equivalent functionality
"""class _RunContext:
"""
Internal run context management.
Manages run lifecycle and resource cleanup.
Automatically used by Run context manager.
Notes:
- Internal implementation detail
- Handles run lifecycle: create, log, finalize
- Ensures proper cleanup on context exit
- Not intended for direct use
"""Experiment Already Exists:
Run Not Finalized:
with Run(...) as run: for automatic finalizationParameter Type Error:
File Not Found:
Metric Not Numeric:
Run Name Collision:
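A minimal sketch of avoiding run-name collisions, using the experiment from the examples above; the timestamp suffix is an illustrative convention, and the auto-generated alternative relies on the behavior noted in the Run parameter docs.
from datetime import datetime, timezone
from sagemaker.core.experiments import Run
# Option 1: append a UTC timestamp so repeated trials never collide
run_name = f"xgboost-trial-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"
with Run(experiment_name="customer-churn-prediction", run_name=run_name) as run:
    run.log_parameter("algorithm", "xgboost")
# Option 2: omit run_name and let the SDK auto-generate a unique, timestamped name
with Run(experiment_name="customer-churn-prediction") as run:
    run.log_parameter("algorithm", "xgboost")
    print(f"Auto-generated run name: {run.run_name}")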