
# Training BOM Management

Complete lifecycle tracking for machine learning training processes with structured metadata capture. The Training BOM (Bill of Materials) documents ML training workflows end to end, covering dataset information, feature engineering details, and model specifications, and integrates with MLflow for experiment tracking and reproducibility.

## Capabilities

### TrainingBOM Core Model

The main data model for capturing training workflow metadata. It nests structured information about datasets, features, and models, and is designed for integration with MLflow tracking systems.

```python { .api }
class TrainingBOM(BaseModel):
    """
    Represents a Bill of Materials for model training.

    Attributes:
    - id: str - Training identifier
    - start_time: str - Training start timestamp
    - end_time: str - Training end timestamp
    - dataset_info: DatasetInfo - Dataset information
    - feature_info: FeatureInfo - Feature engineering information
    - model_info: ModelInfo - Model information
    - mlflow_params: Dict - MLflow parameters
    - mlflow_metrics: Dict - MLflow metrics
    """
    id: str
    start_time: str
    end_time: str
    dataset_info: DatasetInfo
    feature_info: FeatureInfo
    model_info: ModelInfo
    mlflow_params: Dict
    mlflow_metrics: Dict
```

### Dataset Information Tracking

Captures essential metadata about a training dataset, namely its origin and size, for data lineage and reproducibility tracking.

```python { .api }
class TrainingBOM.DatasetInfo(BaseModel):
    """
    Represents training dataset information for the Bill of Materials.

    Attributes:
    - origin: str - Dataset origin
    - size: int - Dataset size (default: 0)
    """
    origin: str
    size: int = 0
```

### Feature Engineering Metadata

Tracks the feature engineering and selection process, recording both the original feature set and the selected subset for training transparency and reproducibility.

```python { .api }
class TrainingBOM.FeatureInfo(BaseModel):
    """
    Represents feature engineering/selection information for the Bill of Materials.

    Attributes:
    - original_features: List[str] - Original features (default: empty list)
    - selected_features: List[str] - Selected features (default: empty list)
    """
    original_features: List[str] = []
    selected_features: List[str] = []
```

### Model Architecture Documentation

Captures model specifications, including the model type and architectural details, for documentation and version tracking.

```python { .api }
class TrainingBOM.ModelInfo(BaseModel):
    """
    Represents training model information for the Bill of Materials.

    Attributes:
    - type: str - Model type
    - architecture: str - Model architecture
    """
    type: str
    architecture: str
```

## Usage Examples

### Basic Training BOM Creation

```python
from aissemble_core_bom.training_bom import TrainingBOM
from datetime import datetime

# Create dataset information
dataset_info = TrainingBOM.DatasetInfo(
    origin="s3://ml-data-bucket/training-data-v2.0.parquet",
    size=1500000
)

# Define feature engineering details
feature_info = TrainingBOM.FeatureInfo(
    original_features=["age", "income", "credit_score", "employment_years", "debt_ratio"],
    selected_features=["age", "income", "credit_score", "employment_years"]
)

# Specify model information
model_info = TrainingBOM.ModelInfo(
    type="RandomForestClassifier",
    architecture="ensemble_tree_based"
)

# Create complete training BOM
training_bom = TrainingBOM(
    id="training-run-2024-09-05-001",
    start_time=datetime.now().isoformat(),
    end_time=datetime.now().isoformat(),
    dataset_info=dataset_info,
    feature_info=feature_info,
    model_info=model_info,
    mlflow_params={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 2,
        "random_state": 42
    },
    mlflow_metrics={
        "accuracy": 0.87,
        "precision": 0.84,
        "recall": 0.89,
        "f1_score": 0.86,
        "auc_roc": 0.91
    }
)

print(f"Created Training BOM: {training_bom.id}")
print(f"Dataset origin: {training_bom.dataset_info.origin}")
print(f"Selected features: {training_bom.feature_info.selected_features}")
print(f"Model type: {training_bom.model_info.type}")
print(f"Training accuracy: {training_bom.mlflow_metrics['accuracy']}")
```

### MLflow Integration Pattern

```python
import mlflow
from datetime import datetime
from aissemble_core_bom.training_bom import TrainingBOM

def create_training_bom_from_mlflow(run_id: str) -> TrainingBOM:
    """Create a Training BOM from MLflow run data"""

    # Get MLflow run information
    run = mlflow.get_run(run_id)

    # Extract parameters and metrics
    params = run.data.params
    metrics = run.data.metrics

    # MLflow reports run timestamps as epoch milliseconds; convert to ISO-8601
    # to match the string format used elsewhere in the BOM
    start_time = datetime.fromtimestamp(run.info.start_time / 1000).isoformat()
    end_time = datetime.fromtimestamp(run.info.end_time / 1000).isoformat() if run.info.end_time else ""

    # Create BOM with MLflow data
    training_bom = TrainingBOM(
        id=f"mlflow-{run_id}",
        start_time=start_time,
        end_time=end_time,
        dataset_info=TrainingBOM.DatasetInfo(
            origin=params.get("dataset_path", "unknown"),
            size=int(params.get("dataset_size", 0))
        ),
        feature_info=TrainingBOM.FeatureInfo(
            original_features=params.get("original_features", "").split(",") if params.get("original_features") else [],
            selected_features=params.get("selected_features", "").split(",") if params.get("selected_features") else []
        ),
        model_info=TrainingBOM.ModelInfo(
            type=params.get("model_type", "unknown"),
            architecture=params.get("model_architecture", "unknown")
        ),
        mlflow_params=params,
        mlflow_metrics=metrics
    )

    return training_bom

# Usage example
with mlflow.start_run() as run:
    # Training code here...
    mlflow.log_param("dataset_path", "s3://bucket/data.parquet")
    mlflow.log_param("model_type", "XGBoostClassifier")
    mlflow.log_metric("accuracy", 0.92)

# Create the BOM after the run ends so MLflow has recorded an end_time
bom = create_training_bom_from_mlflow(run.info.run_id)
```

### Comprehensive Workflow Tracking

```python
from aissemble_core_bom.training_bom import TrainingBOM
from datetime import datetime

class TrainingWorkflowTracker:
    """Utility class for comprehensive training workflow tracking"""

    def __init__(self):
        self.training_id = None
        self.start_time = None
        self.dataset_info = None
        self.feature_info = None
        self.model_info = None
        self.mlflow_params = {}
        self.mlflow_metrics = {}

    def start_training(self, training_id: str):
        """Initialize training session"""
        self.training_id = training_id
        self.start_time = datetime.now().isoformat()
        print(f"Started training session: {training_id}")

    def register_dataset(self, origin: str, size: int):
        """Register dataset information"""
        self.dataset_info = TrainingBOM.DatasetInfo(origin=origin, size=size)
        print(f"Registered dataset from: {origin}")

    def register_features(self, original_features: list, selected_features: list):
        """Register feature engineering information"""
        self.feature_info = TrainingBOM.FeatureInfo(
            original_features=original_features,
            selected_features=selected_features
        )
        print(f"Registered {len(selected_features)} selected features from {len(original_features)} original")

    def register_model(self, model_type: str, architecture: str):
        """Register model information"""
        self.model_info = TrainingBOM.ModelInfo(type=model_type, architecture=architecture)
        print(f"Registered model: {model_type}")

    def log_parameter(self, key: str, value):
        """Log training parameter"""
        self.mlflow_params[key] = value

    def log_metric(self, key: str, value: float):
        """Log training metric"""
        self.mlflow_metrics[key] = value

    def finalize_training(self) -> TrainingBOM:
        """Create final Training BOM"""
        end_time = datetime.now().isoformat()

        bom = TrainingBOM(
            id=self.training_id,
            start_time=self.start_time,
            end_time=end_time,
            dataset_info=self.dataset_info,
            feature_info=self.feature_info,
            model_info=self.model_info,
            mlflow_params=self.mlflow_params,
            mlflow_metrics=self.mlflow_metrics
        )

        print(f"Training completed: {self.training_id}")
        return bom

# Example usage
tracker = TrainingWorkflowTracker()

# Initialize training
tracker.start_training("fraud-detection-v3.2")

# Register components
tracker.register_dataset("s3://fraud-data/transactions-2024.parquet", 2500000)
tracker.register_features(
    original_features=["amount", "merchant", "timestamp", "location", "card_type", "user_age"],
    selected_features=["amount", "merchant", "timestamp", "location"]
)
tracker.register_model("GradientBoostingClassifier", "tree_ensemble")

# Log training parameters
tracker.log_parameter("n_estimators", 200)
tracker.log_parameter("learning_rate", 0.1)
tracker.log_parameter("max_depth", 8)

# Log training metrics
tracker.log_metric("accuracy", 0.94)
tracker.log_metric("precision", 0.91)
tracker.log_metric("recall", 0.96)

# Finalize and get BOM
final_bom = tracker.finalize_training()

# Serialize BOM for storage or transmission
bom_json = final_bom.model_dump_json(indent=2)
print("Training BOM JSON:", bom_json)
```
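
Because `TrainingBOM` is a Pydantic model, the serialized JSON can be loaded back into a fully validated object. A minimal round-trip sketch (assuming Pydantic v2, which provides `model_dump_json` and `model_validate_json`; the file name is illustrative):

```python
# Rehydrate a BOM from previously serialized JSON; field validation runs automatically
restored_bom = TrainingBOM.model_validate_json(bom_json)
assert restored_bom.id == final_bom.id

# Persist the BOM for archival or transmission (hypothetical file name)
with open(f"{restored_bom.id}-bom.json", "w", encoding="utf-8") as f:
    f.write(bom_json)
```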

## Best Practices

### Comprehensive Data Capture

- Always include dataset origin paths for reproducibility
- Track both original and selected features for transparency
- Record model architecture details for version management
- Capture complete MLflow parameters and metrics (all four points are checked in the sketch below)
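
A lightweight completeness check can enforce these practices before a BOM is published. The helper below is a hypothetical sketch (`check_bom_completeness` is not part of the library):

```python
def check_bom_completeness(bom: TrainingBOM) -> list:
    """Return a list of gaps found in a Training BOM (hypothetical helper)."""
    gaps = []
    if not bom.dataset_info.origin or bom.dataset_info.origin == "unknown":
        gaps.append("dataset_info.origin missing: record the source path for reproducibility")
    if not bom.feature_info.original_features or not bom.feature_info.selected_features:
        gaps.append("feature lists empty: track both original and selected features")
    if not bom.model_info.type or not bom.model_info.architecture:
        gaps.append("model details missing: record type and architecture")
    if not bom.mlflow_params or not bom.mlflow_metrics:
        gaps.append("MLflow data missing: capture the full parameter and metric set")
    return gaps

for gap in check_bom_completeness(final_bom):
    print(f"WARNING: {gap}")
```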

### Integration Patterns

- Use consistent training IDs across systems
- Integrate with MLflow experiment tracking (see the sketch below)
- Store BOMs in centralized metadata systems
- Link BOMs to model deployment records
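
One way to keep IDs consistent and tie a BOM to its experiment is to derive the BOM id from the MLflow run and attach the serialized BOM to the run as an artifact. A sketch assuming an active MLflow run (the artifact name `training_bom.json` is arbitrary):

```python
import mlflow

with mlflow.start_run() as run:
    # ... training and BOM assembly as in the earlier examples ...

    # Reuse the MLflow run ID so the BOM and the experiment share one identifier
    final_bom.id = f"mlflow-{run.info.run_id}"

    # Attach the BOM to the run; a central metadata store can later resolve it by run ID
    mlflow.log_dict(final_bom.model_dump(), "training_bom.json")
```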

### Validation and Quality

- Validate dataset accessibility before training (see the sketch below)
- Ensure feature consistency across runs
- Track data quality metrics in the mlflow_metrics dictionary
- Monitor training metrics for anomalies
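
The accessibility check can be automated before training starts. A best-effort sketch (`dataset_is_accessible` is a hypothetical helper; the S3 branch assumes boto3 and valid credentials):

```python
import os
from urllib.parse import urlparse

def dataset_is_accessible(origin: str) -> bool:
    """Best-effort pre-training check that a dataset origin is reachable (hypothetical helper)."""
    parsed = urlparse(origin)
    if parsed.scheme == "s3":
        # For S3 origins, a HEAD request confirms the object exists
        import boto3
        from botocore.exceptions import ClientError
        try:
            boto3.client("s3").head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
            return True
        except ClientError:
            return False
    # Fall back to a local filesystem check
    return os.path.exists(origin)

if not dataset_is_accessible("s3://fraud-data/transactions-2024.parquet"):
    raise RuntimeError("Dataset origin is not reachable; aborting training")
```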