# Training Framework

ModelScope's training framework provides comprehensive tools for training and fine-tuning models across different domains. The framework supports epoch-based training with hooks, metrics, evaluation, and checkpoint management.

## Capabilities

### Epoch-Based Trainer

Main trainer class for epoch-based training workflows.

```python { .api }
class EpochBasedTrainer:
    """
    Main epoch-based trainer for ModelScope models.
    """

    def __init__(
        self,
        model: Optional[Union[TorchModel, nn.Module, str]] = None,
        cfg_file: Optional[str] = None,
        cfg_modify_fn: Optional[Callable] = None,
        arg_parse_fn: Optional[Callable] = None,
        data_collator: Optional[Union[Callable, Dict[str, Callable]]] = None,
        train_dataset: Optional[Union[MsDataset, Dataset]] = None,
        eval_dataset: Optional[Union[MsDataset, Dataset]] = None,
        preprocessor: Optional[Union[Preprocessor, Dict[str, Preprocessor]]] = None,
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler._LRScheduler] = (None, None),
        model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
        seed: int = 42,
        callbacks: Optional[List[Hook]] = None,
        samplers: Optional[Union[Sampler, Dict[str, Sampler]]] = None,
        efficient_tuners: Union[Dict[str, TunerConfig], TunerConfig] = None,
        **kwargs
    ):
        """
        Initialize trainer with model and training configuration.

        Parameters:
        - model: Model instance to train (TorchModel, nn.Module, or model identifier string)
        - cfg_file: Path to configuration file
        - cfg_modify_fn: Function to modify configuration dynamically
        - arg_parse_fn: Custom argument parsing function
        - data_collator: Data collation function(s) for batching
        - train_dataset: Training dataset (MsDataset or Dataset)
        - eval_dataset: Evaluation dataset (MsDataset or Dataset)
        - preprocessor: Data preprocessor(s) for input processing
        - optimizers: Tuple of (optimizer, lr_scheduler) instances
        - model_revision: Model revision/version (default: DEFAULT_MODEL_REVISION)
        - seed: Random seed for reproducibility (default: 42)
        - callbacks: List of training hooks/callbacks
        - samplers: Data sampler(s) for training and evaluation
        - efficient_tuners: Parameter-efficient tuning configurations
        - **kwargs: Additional trainer-specific parameters
        """

    def train(self):
        """
        Start the training process.
        """

    def evaluate(self, eval_dataset=None):
        """
        Evaluate model on evaluation dataset.

        Parameters:
        - eval_dataset: Dataset for evaluation (optional)

        Returns:
        Evaluation metrics dictionary
        """

    def save_checkpoint(self, checkpoint_dir: str):
        """
        Save training checkpoint.

        Parameters:
        - checkpoint_dir: Directory to save checkpoint
        """

    def load_checkpoint(self, checkpoint_path: str):
        """
        Load training checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """

    def resume_training(self, checkpoint_path: str):
        """
        Resume training from checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """
```

### Training Arguments

Configuration class for training parameters and hyperparameters.

```python { .api }
@dataclass(init=False)
class TrainingArgs(DatasetArgs, TrainArgs, ModelArgs):
    """
    Configuration container for training parameters.
    Inherits from DatasetArgs, TrainArgs, and ModelArgs dataclasses.
    """

    use_model_config: bool = field(
        default=False,
        metadata={
            'help': 'Use the configuration of the model'
        }
    )

    def __init__(self, **kwargs):
        """
        Initialize training arguments with flexible keyword arguments.

        Parameters:
        - **kwargs: Training configuration parameters including:
            - output_dir: Directory for saving model and checkpoints
            - max_epochs: Maximum number of training epochs
            - learning_rate: Learning rate for optimizer
            - train_batch_size: Batch size for training
            - eval_batch_size: Batch size for evaluation
            - eval_strategy: Evaluation strategy ('no', 'steps', 'epoch')
            - save_strategy: Checkpoint saving strategy ('no', 'steps', 'epoch')
            - logging_steps: Steps between logging outputs
            - save_steps: Steps between saving checkpoints
            - eval_steps: Steps between evaluations
            - use_model_config: Whether to use model configuration

        Note: This class uses dataclass fields and supports all parameters
        from DatasetArgs, TrainArgs, and ModelArgs parent classes.
        """
        self.manual_args = list(kwargs.keys())
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])
        self._unknown_args = {}
```
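
The keyword-driven `__init__` above can be tried in isolation. The following is a minimal sketch of the same pattern using only the standard library; `DemoArgs` and its two fields are hypothetical stand-ins, and collecting unrecognized keys into `_unknown_args` is an assumption made for illustration (the listing above only initializes that attribute to an empty dict).

```python
from dataclasses import dataclass, field, fields


@dataclass(init=False)
class DemoArgs:
    """Hypothetical stand-in for TrainingArgs; not part of ModelScope."""

    max_epochs: int = field(default=1, metadata={'help': 'Maximum number of epochs'})
    learning_rate: float = field(default=1e-3, metadata={'help': 'Optimizer learning rate'})

    def __init__(self, **kwargs):
        # Remember which names the caller passed explicitly.
        self.manual_args = list(kwargs.keys())
        # Only keys that match a declared dataclass field are set.
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])
        # Assumption for this sketch: keep unrecognized keys aside.
        known = {f.name for f in fields(self)}
        self._unknown_args = {k: v for k, v in kwargs.items() if k not in known}


args = DemoArgs(learning_rate=2e-5, warmup=100)
print(args.max_epochs)     # falls back to the field default
print(args.learning_rate)  # value passed by the caller
print(args._unknown_args)  # keys that matched no field
```

Because `init=False` suppresses the generated constructor, fields that are not passed keep their class-level defaults, and a misspelled key does not raise an error in this pattern.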

### Hook System

Training hooks for customizing the training process at different stages.

```python { .api }
class Hook:
    """
    Base class for training hooks.
    """

    def before_run(self, trainer):
        """
        Called before training starts.

        Parameters:
        - trainer: Trainer instance
        """

    def after_run(self, trainer):
        """
        Called after training completes.

        Parameters:
        - trainer: Trainer instance
        """

    def before_epoch(self, trainer):
        """
        Called before each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def after_epoch(self, trainer):
        """
        Called after each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def before_iter(self, trainer):
        """
        Called before each iteration.

        Parameters:
        - trainer: Trainer instance
        """

    def after_iter(self, trainer):
        """
        Called after each iteration.

        Parameters:
        - trainer: Trainer instance
        """

class Priority:
    """
    Priority levels for hook execution order.
    """
    HIGHEST = 0
    HIGH = 10
    NORMAL = 50
    LOW = 70
    LOWEST = 100
```
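
To make the interaction between stages and priorities concrete, here is a minimal, self-contained sketch of a loop that only dispatches hooks. `MiniTrainer`, `call_hook`, and `Recorder` are hypothetical names, and the rule that lower priority values run first is an assumption; the real trainer's dispatch logic may differ.

```python
class Priority:
    HIGHEST = 0
    HIGH = 10
    NORMAL = 50
    LOW = 70
    LOWEST = 100


class Hook:
    """Base hook with no-op stage methods."""

    def before_run(self, trainer): pass
    def after_run(self, trainer): pass
    def before_epoch(self, trainer): pass
    def after_epoch(self, trainer): pass
    def before_iter(self, trainer): pass
    def after_iter(self, trainer): pass


class MiniTrainer:
    """Hypothetical loop that only fires hooks; no model is trained."""

    def __init__(self, max_epochs, iters_per_epoch):
        self.max_epochs = max_epochs
        self.iters_per_epoch = iters_per_epoch
        self.epoch = 0
        self.hooks = []

    def register_hook(self, hook, priority=Priority.NORMAL):
        self.hooks.append((priority, hook))
        # Assumption: lower values run first. sorted() is stable, so hooks
        # with equal priority keep their registration order.
        self.hooks = sorted(self.hooks, key=lambda item: item[0])

    def call_hook(self, stage):
        for _, hook in self.hooks:
            getattr(hook, stage)(self)

    def train(self):
        self.call_hook('before_run')
        for epoch in range(self.max_epochs):
            self.epoch = epoch
            self.call_hook('before_epoch')
            for _ in range(self.iters_per_epoch):
                self.call_hook('before_iter')
                self.call_hook('after_iter')
            self.call_hook('after_epoch')
        self.call_hook('after_run')


events = []


class Recorder(Hook):
    def __init__(self, name):
        self.name = name

    def after_epoch(self, trainer):
        events.append((self.name, trainer.epoch))


trainer = MiniTrainer(max_epochs=2, iters_per_epoch=3)
trainer.register_hook(Recorder('low'), Priority.LOW)
trainer.register_hook(Recorder('high'), Priority.HIGH)
trainer.train()
print(events)  # [('high', 0), ('low', 0), ('high', 1), ('low', 1)]
```

Although the `low` recorder is registered first, the `high` recorder fires first in every epoch under the ordering assumed here.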

### Dataset Builder

Utility functions for creating datasets from various sources.

```python { .api }
def build_dataset_from_file(
    data_files: str,
    split: str = None,
    cache_dir: str = None,
    **kwargs
):
    """
    Build dataset from file paths.

    Parameters:
    - data_files: Path to data file(s)
    - split: Dataset split name
    - cache_dir: Directory for caching processed data
    - **kwargs: Additional dataset parameters

    Returns:
    Dataset instance
    """

def build_trainer(cfg: dict, default_args: dict = None):
    """
    Build trainer from configuration.

    Parameters:
    - cfg: Trainer configuration dictionary
    - default_args: Default arguments to merge

    Returns:
    Trainer instance
    """
```
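
One common way to implement a "build from configuration" function is a name-to-class registry combined with a merge of default arguments. The sketch below illustrates that general pattern only: the `TRAINERS` registry, the `register` decorator, the `'type'` key, and `build_from_cfg` are all hypothetical and are not taken from ModelScope.

```python
TRAINERS = {}  # hypothetical registry: name -> trainer class


def register(name):
    """Class decorator that records a trainer class under a name."""
    def decorator(cls):
        TRAINERS[name] = cls
        return cls
    return decorator


@register('demo')
class DemoTrainer:
    def __init__(self, model=None, seed=42):
        self.model = model
        self.seed = seed


def build_from_cfg(cfg, default_args=None):
    """Merge defaults with cfg, then instantiate the registered class."""
    params = dict(default_args or {})
    params.update(cfg)  # explicit cfg values override the defaults
    trainer_type = params.pop('type')  # assumption: cfg names the class via 'type'
    return TRAINERS[trainer_type](**params)


trainer = build_from_cfg(
    {'type': 'demo', 'seed': 7},
    default_args={'model': 'my-model', 'seed': 1},
)
print(type(trainer).__name__, trainer.model, trainer.seed)  # DemoTrainer my-model 7
```

In this sketch, values in `cfg` take precedence, so `default_args` only fills in what the configuration leaves out.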

### Specialized Trainers

Domain-specific trainer implementations for specialized tasks.

```python { .api }
class NlpEpochBasedTrainer(EpochBasedTrainer):
    """
    NLP-specific trainer with text processing optimizations.
    """
    pass

class VecoTrainer(EpochBasedTrainer):
    """
    Specialized trainer for Veco models.
    """
    pass
```

## Usage Examples

### Basic Training Setup

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load pre-trained model
model = Model.from_pretrained('damo/nlp_structbert_base_chinese')

# Build dataset
train_dataset = build_dataset_from_file('train.json')
eval_dataset = build_dataset_from_file('eval.json')

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./output',
    max_epochs=10,
    learning_rate=2e-5,
    train_batch_size=16,
    eval_batch_size=32,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_steps=100
)

# Create trainer
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()
```

### Custom Training with Hooks

```python
from modelscope import EpochBasedTrainer, Hook, Priority

class CustomLoggingHook(Hook):
    def __init__(self, log_interval=100):
        self.log_interval = log_interval
        self.step = 0

    def after_iter(self, trainer):
        # Count iterations and log the loss every `log_interval` steps
        self.step += 1
        if self.step % self.log_interval == 0:
            print(f"Step {self.step}: Loss = {trainer.loss}")

    def after_epoch(self, trainer):
        print(f"Epoch {trainer.epoch} completed")

class ModelCheckpointHook(Hook):
    def __init__(self, save_interval=5):
        self.save_interval = save_interval

    def after_epoch(self, trainer):
        # Save a checkpoint every `save_interval` epochs
        if trainer.epoch % self.save_interval == 0:
            trainer.save_checkpoint(f'./checkpoints/epoch_{trainer.epoch}')

# Create trainer with custom hooks
# (`model`, `training_args`, and `train_dataset` are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Register hooks
trainer.register_hook(CustomLoggingHook(log_interval=50), Priority.HIGH)
trainer.register_hook(ModelCheckpointHook(save_interval=2), Priority.NORMAL)

# Start training
trainer.train()
```

### Fine-tuning with Evaluation

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load model for fine-tuning
model = Model.from_pretrained('damo/nlp_bert_base_chinese')

# Prepare datasets
train_data = build_dataset_from_file('fine_tune_train.json')
eval_data = build_dataset_from_file('fine_tune_eval.json')

# Configure fine-tuning arguments
fine_tune_args = TrainingArgs(
    output_dir='./fine_tuned_model',
    max_epochs=5,
    learning_rate=1e-5,  # Lower learning rate for fine-tuning
    train_batch_size=8,
    eval_batch_size=16,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='steps',
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    greater_is_better=True
)

# Create trainer
trainer = EpochBasedTrainer(
    model=model,
    args=fine_tune_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

# Train and evaluate
trainer.train()
final_metrics = trainer.evaluate()
print(f"Final evaluation metrics: {final_metrics}")
```
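
The step-based settings in this example interact: evaluation runs every 200 steps while checkpoints are written every 500. The sketch below works through that arithmetic under the assumption that an action fires whenever the 1-based step count is a multiple of its interval. The helper names and the accuracy values are invented for illustration and do not come from a real run.

```python
def due_steps(total_steps, interval):
    """Return the 1-based steps at which an action with this interval fires."""
    return [s for s in range(1, total_steps + 1) if s % interval == 0]


def pick_best(history, greater_is_better=True):
    """history is a list of (step, metric); return the best entry."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda item: item[1])


eval_at = due_steps(1000, 200)  # evaluation steps
save_at = due_steps(1000, 500)  # checkpoint steps

# Made-up accuracies, one per evaluation step, for illustration only.
history = list(zip(eval_at, [0.71, 0.78, 0.83, 0.81, 0.82]))
best_step, best_value = pick_best(history, greater_is_better=True)

print(eval_at)                 # [200, 400, 600, 800, 1000]
print(save_at)                 # [500, 1000]
print(best_step, best_value)   # 600 0.83
print(best_step in save_at)    # False
```

With these intervals the best-scoring evaluation step has no checkpoint of its own, which is one reason to consider making `save_steps` a multiple of, or equal to, `eval_steps` when the best model should be recoverable.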

### Resume Training from Checkpoint

```python
from modelscope import EpochBasedTrainer, TrainingArgs

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./continued_training',
    max_epochs=20,
    resume_from_checkpoint='./checkpoints/epoch_10'
)

# Create trainer
# (`model` and `train_dataset` are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Resume training from checkpoint
trainer.resume_training('./checkpoints/epoch_10/checkpoint.pth')
```

### Custom Trainer Implementation

```python
from modelscope import EpochBasedTrainer

class CustomTrainer(EpochBasedTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Custom initialization

    def compute_loss(self, model, inputs):
        """
        Custom loss computation.

        Parameters:
        - model: Model instance
        - inputs: Batch inputs

        Returns:
        Loss tensor
        """
        outputs = model(inputs)
        # Custom loss calculation
        # (`custom_loss_function` is a placeholder for a user-supplied function)
        loss = custom_loss_function(outputs, inputs['labels'])
        return loss

    def evaluate(self, eval_dataset=None):
        """
        Custom evaluation logic.
        """
        # Custom evaluation implementation
        metrics = super().evaluate(eval_dataset)

        # Add custom metrics
        # (`compute_custom_metric` is a placeholder method to be implemented by the user)
        custom_metric = self.compute_custom_metric()
        metrics['custom_metric'] = custom_metric

        return metrics

# Use custom trainer
# (`model`, `training_args`, and the datasets are defined as in Basic Training Setup)
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
```

### Multi-GPU Training

```python
from modelscope import EpochBasedTrainer, TrainingArgs
import torch

# Check for multiple GPUs
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")

# Configure for multi-GPU training
training_args = TrainingArgs(
    output_dir='./multi_gpu_output',
    max_epochs=10,
    train_batch_size=32,  # Total batch size across all GPUs
    eval_batch_size=64,
    dataloader_num_workers=4,
    fp16=True,  # Mixed precision training
    gradient_accumulation_steps=2
)

# Create trainer (will automatically use multiple GPUs)
# (`model` and the datasets are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
```
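
The batch-size settings above combine with the GPU count and gradient accumulation. As a rough sketch of the arithmetic, assuming `train_batch_size` is the total across all GPUs (as the comment in the example states) and assuming a hypothetical machine with 4 GPUs, the per-device and effective batch sizes could be computed like this; `batch_plan` is an illustrative helper, not a ModelScope function.

```python
def batch_plan(total_batch_size, num_gpus, gradient_accumulation_steps):
    """Return (per-device batch size, samples per optimizer step)."""
    if total_batch_size % num_gpus != 0:
        raise ValueError('total batch size must be divisible by the GPU count')
    per_device = total_batch_size // num_gpus
    # Gradients from several forward/backward passes are accumulated
    # before each optimizer update.
    effective = total_batch_size * gradient_accumulation_steps
    return per_device, effective


per_device, effective = batch_plan(
    total_batch_size=32, num_gpus=4, gradient_accumulation_steps=2
)
print(per_device, effective)  # 8 64
```

Under these assumptions each GPU processes 8 samples per pass, and every optimizer update reflects 64 samples, which is worth keeping in mind when choosing the learning rate.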

### Learning Rate Scheduling

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

from modelscope import EpochBasedTrainer, TrainingArgs, Hook

class LearningRateSchedulerHook(Hook):
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def after_epoch(self, trainer):
        # Advance the schedule once per epoch and report the new rate
        self.scheduler.step()
        current_lr = self.scheduler.get_last_lr()[0]
        print(f"Learning rate updated to: {current_lr}")

# Setup training with learning rate scheduling
# (`model`, `training_args`, and `train_dataset` are defined as in Basic Training Setup)

# Create optimizer and scheduler before the trainer, so the trainer
# trains with the same optimizer that the scheduler adjusts
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)

trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None)  # the hook below steps the scheduler
)

# Register scheduler hook
trainer.register_hook(LearningRateSchedulerHook(scheduler))

trainer.train()
```
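
For reference, the schedule produced by `StepLR(optimizer, step_size=3, gamma=0.5)` with a base rate of `1e-4` follows the closed form `base_lr * gamma ** (epoch // step_size)`. The short sketch below evaluates that formula in plain Python, without PyTorch, so the expected values can be checked by hand; `step_lr` is an illustrative helper name.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate in effect during a 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(1e-4, epoch, step_size=3, gamma=0.5) for epoch in range(7)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

The rate stays at `1e-4` for epochs 0 to 2, halves to `5e-5` for epochs 3 to 5, and halves again to `2.5e-5` at epoch 6.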