# Training BOM Management

Complete lifecycle tracking for machine learning training processes with structured metadata capture. The Training BOM (Bill of Materials) provides comprehensive documentation of ML training workflows, including dataset information, feature engineering details, model specifications, and integration with MLflow for experiment tracking and reproducibility.

## Capabilities

### TrainingBOM Core Model

The main data model for capturing comprehensive training workflow metadata, with nested information structures for datasets, features, and models, designed for integration with MLflow tracking systems.
```python { .api }
from typing import Dict

from pydantic import BaseModel

class TrainingBOM(BaseModel):
    """
    Represents a Bill of Materials for model training.

    Attributes:
    - id: str - Training identifier
    - start_time: str - Training start timestamp
    - end_time: str - Training end timestamp
    - dataset_info: DatasetInfo - Dataset information
    - feature_info: FeatureInfo - Feature engineering information
    - model_info: ModelInfo - Model information
    - mlflow_params: Dict - MLflow parameters
    - mlflow_metrics: Dict - MLflow metrics
    """
    id: str
    start_time: str
    end_time: str
    dataset_info: DatasetInfo
    feature_info: FeatureInfo
    model_info: ModelInfo
    mlflow_params: Dict
    mlflow_metrics: Dict
```
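Because the BOM is a pydantic model, it serializes cleanly to JSON for storage or transmission. As a minimal sketch of the document shape (using plain dicts and the standard-library `json` module rather than the package itself; the field layout below simply mirrors the attributes documented above):

```python
import json

# A BOM document shaped like the TrainingBOM attributes above
# (plain dicts stand in for the pydantic model in this sketch).
bom_doc = {
    "id": "training-run-001",
    "start_time": "2024-09-05T10:00:00",
    "end_time": "2024-09-05T11:30:00",
    "dataset_info": {"origin": "s3://bucket/data.parquet", "size": 1000},
    "feature_info": {"original_features": ["a", "b"], "selected_features": ["a"]},
    "model_info": {"type": "RandomForestClassifier", "architecture": "ensemble_tree_based"},
    "mlflow_params": {"n_estimators": 100},
    "mlflow_metrics": {"accuracy": 0.87},
}

# Serialize and parse back; the round trip preserves every field.
restored = json.loads(json.dumps(bom_doc))
assert restored == bom_doc
```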
### Dataset Information Tracking

Captures essential metadata about training datasets, including origin sources and size metrics, for comprehensive data lineage and reproducibility tracking.

```python { .api }
class TrainingBOM.DatasetInfo(BaseModel):
    """
    Represents training dataset information for the Bill of Materials.

    Attributes:
    - origin: str - Dataset origin
    - size: int - Dataset size (default: 0)
    """
    origin: str
    size: int = 0
```
### Feature Engineering Metadata

Comprehensive tracking of feature engineering and selection processes, including original feature sets and selected subsets, for model training transparency and reproducibility.

```python { .api }
class TrainingBOM.FeatureInfo(BaseModel):
    """
    Represents feature engineering/selection information for the Bill of Materials.

    Attributes:
    - original_features: List[str] - Original features (default: empty list)
    - selected_features: List[str] - Selected features (default: empty list)
    """
    original_features: List[str] = []
    selected_features: List[str] = []
```
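Keeping both lists makes it easy to audit what feature selection removed. A small helper that derives the dropped set (a hypothetical illustration, not part of the package):

```python
def dropped_features(original: list[str], selected: list[str]) -> list[str]:
    """Return features present in the original set but not selected,
    preserving the original ordering."""
    selected_set = set(selected)
    return [f for f in original if f not in selected_set]

dropped = dropped_features(
    ["age", "income", "credit_score", "debt_ratio"],
    ["age", "income", "credit_score"],
)
# dropped == ["debt_ratio"]
```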
### Model Architecture Documentation

Structured capture of model specifications, including type classification and architectural details, for comprehensive model documentation and version tracking.

```python { .api }
class TrainingBOM.ModelInfo(BaseModel):
    """
    Represents training model information for the Bill of Materials.

    Attributes:
    - type: str - Model type
    - architecture: str - Model architecture
    """
    type: str
    architecture: str
```
## Usage Examples

### Basic Training BOM Creation

```python
from datetime import datetime

from aissemble_core_bom.training_bom import TrainingBOM

# Create dataset information
dataset_info = TrainingBOM.DatasetInfo(
    origin="s3://ml-data-bucket/training-data-v2.0.parquet",
    size=1500000
)

# Define feature engineering details
feature_info = TrainingBOM.FeatureInfo(
    original_features=["age", "income", "credit_score", "employment_years", "debt_ratio"],
    selected_features=["age", "income", "credit_score", "employment_years"]
)

# Specify model information
model_info = TrainingBOM.ModelInfo(
    type="RandomForestClassifier",
    architecture="ensemble_tree_based"
)

# Create the complete training BOM
training_bom = TrainingBOM(
    id="training-run-2024-09-05-001",
    start_time=datetime.now().isoformat(),
    end_time=datetime.now().isoformat(),
    dataset_info=dataset_info,
    feature_info=feature_info,
    model_info=model_info,
    mlflow_params={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 2,
        "random_state": 42
    },
    mlflow_metrics={
        "accuracy": 0.87,
        "precision": 0.84,
        "recall": 0.89,
        "f1_score": 0.86,
        "auc_roc": 0.91
    }
)

print(f"Created Training BOM: {training_bom.id}")
print(f"Dataset origin: {training_bom.dataset_info.origin}")
print(f"Selected features: {training_bom.feature_info.selected_features}")
print(f"Model type: {training_bom.model_info.type}")
print(f"Training accuracy: {training_bom.mlflow_metrics['accuracy']}")
```
### MLflow Integration Pattern

```python
import mlflow

from aissemble_core_bom.training_bom import TrainingBOM

def create_training_bom_from_mlflow(run_id: str) -> TrainingBOM:
    """Create a Training BOM from MLflow run data."""

    # Get MLflow run information
    run = mlflow.get_run(run_id)

    # Extract parameters and metrics
    params = run.data.params
    metrics = run.data.metrics

    # Create the BOM with MLflow data
    training_bom = TrainingBOM(
        id=f"mlflow-{run_id}",
        start_time=str(run.info.start_time),
        end_time=str(run.info.end_time),
        dataset_info=TrainingBOM.DatasetInfo(
            origin=params.get("dataset_path", "unknown"),
            size=int(params.get("dataset_size", 0))
        ),
        feature_info=TrainingBOM.FeatureInfo(
            original_features=params.get("original_features", "").split(",") if params.get("original_features") else [],
            selected_features=params.get("selected_features", "").split(",") if params.get("selected_features") else []
        ),
        model_info=TrainingBOM.ModelInfo(
            type=params.get("model_type", "unknown"),
            architecture=params.get("model_architecture", "unknown")
        ),
        mlflow_params=params,
        mlflow_metrics=metrics
    )

    return training_bom

# Usage example
with mlflow.start_run() as run:
    # Training code here...
    mlflow.log_param("dataset_path", "s3://bucket/data.parquet")
    mlflow.log_param("model_type", "XGBoostClassifier")
    mlflow.log_metric("accuracy", 0.92)

    # Create the BOM after training
    bom = create_training_bom_from_mlflow(run.info.run_id)
```
### Comprehensive Workflow Tracking

```python
from datetime import datetime

from aissemble_core_bom.training_bom import TrainingBOM

class TrainingWorkflowTracker:
    """Utility class for comprehensive training workflow tracking."""

    def __init__(self):
        self.training_id = None
        self.start_time = None
        self.dataset_info = None
        self.feature_info = None
        self.model_info = None
        self.mlflow_params = {}
        self.mlflow_metrics = {}

    def start_training(self, training_id: str):
        """Initialize the training session."""
        self.training_id = training_id
        self.start_time = datetime.now().isoformat()
        print(f"Started training session: {training_id}")

    def register_dataset(self, origin: str, size: int):
        """Register dataset information."""
        self.dataset_info = TrainingBOM.DatasetInfo(origin=origin, size=size)
        print(f"Registered dataset from: {origin}")

    def register_features(self, original_features: list, selected_features: list):
        """Register feature engineering information."""
        self.feature_info = TrainingBOM.FeatureInfo(
            original_features=original_features,
            selected_features=selected_features
        )
        print(f"Registered {len(selected_features)} selected features from {len(original_features)} original")

    def register_model(self, model_type: str, architecture: str):
        """Register model information."""
        self.model_info = TrainingBOM.ModelInfo(type=model_type, architecture=architecture)
        print(f"Registered model: {model_type}")

    def log_parameter(self, key: str, value):
        """Log a training parameter."""
        self.mlflow_params[key] = value

    def log_metric(self, key: str, value: float):
        """Log a training metric."""
        self.mlflow_metrics[key] = value

    def finalize_training(self) -> TrainingBOM:
        """Create the final Training BOM."""
        end_time = datetime.now().isoformat()

        bom = TrainingBOM(
            id=self.training_id,
            start_time=self.start_time,
            end_time=end_time,
            dataset_info=self.dataset_info,
            feature_info=self.feature_info,
            model_info=self.model_info,
            mlflow_params=self.mlflow_params,
            mlflow_metrics=self.mlflow_metrics
        )

        print(f"Training completed: {self.training_id}")
        return bom

# Example usage
tracker = TrainingWorkflowTracker()

# Initialize training
tracker.start_training("fraud-detection-v3.2")

# Register components
tracker.register_dataset("s3://fraud-data/transactions-2024.parquet", 2500000)
tracker.register_features(
    original_features=["amount", "merchant", "timestamp", "location", "card_type", "user_age"],
    selected_features=["amount", "merchant", "timestamp", "location"]
)
tracker.register_model("GradientBoostingClassifier", "tree_ensemble")

# Log training parameters
tracker.log_parameter("n_estimators", 200)
tracker.log_parameter("learning_rate", 0.1)
tracker.log_parameter("max_depth", 8)

# Log training metrics
tracker.log_metric("accuracy", 0.94)
tracker.log_metric("precision", 0.91)
tracker.log_metric("recall", 0.96)

# Finalize and get the BOM
final_bom = tracker.finalize_training()

# Serialize the BOM for storage or transmission
bom_json = final_bom.model_dump_json(indent=2)
print("Training BOM JSON:", bom_json)
```
## Best Practices

### Comprehensive Data Capture

- Always include dataset origin paths for reproducibility
- Track both original and selected features for transparency
- Record model architecture details for version management
- Capture complete MLflow parameters and metrics
### Integration Patterns

- Use consistent training IDs across systems
- Integrate with MLflow experiment tracking
- Store BOMs in centralized metadata systems
- Link BOMs to model deployment records
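One simple storage pattern, sketched here with the standard library (the file layout and helper names are illustrative, not part of the package): persist each serialized BOM as JSON under its training ID, so deployment records can link back to a BOM by ID alone.

```python
import json
import tempfile
from pathlib import Path

def store_bom(bom_json: str, bom_id: str, root: Path) -> Path:
    """Write a serialized BOM under its training ID and return the path."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{bom_id}.json"
    path.write_text(bom_json)
    return path

def load_bom(bom_id: str, root: Path) -> dict:
    """Load a stored BOM back into a dict by its training ID."""
    return json.loads((root / f"{bom_id}.json").read_text())

# Usage: round-trip a serialized BOM through the store.
root = Path(tempfile.mkdtemp())
store_bom('{"id": "training-run-001"}', "training-run-001", root)
assert load_bom("training-run-001", root)["id"] == "training-run-001"
```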
### Validation and Quality

- Validate dataset accessibility before training
- Ensure feature consistency across runs
- Track data quality metrics in `additionalValues`
- Monitor training metrics for anomalies
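The first two checks above can be sketched at the metadata level with plain functions (hypothetical helpers, not part of the package; a real accessibility check would also probe the storage backend):

```python
def validate_features(original: list[str], selected: list[str]) -> list[str]:
    """Return selected features that do not appear in the original set;
    an empty result means the selection is consistent."""
    known = set(original)
    return [f for f in selected if f not in known]

def validate_dataset(origin: str, size: int) -> list[str]:
    """Return a list of problems with the dataset metadata."""
    problems = []
    if not origin or origin == "unknown":
        problems.append("dataset origin is missing")
    if size <= 0:
        problems.append("dataset size is not positive")
    return problems
```

Running these before constructing a `TrainingBOM` catches inconsistent feature lists and placeholder dataset metadata early, instead of discovering them in a stored BOM later.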