Tessl Tile for pypi/modelscope@1.29.0

# Training Framework

ModelScope's training framework provides comprehensive tools for training and fine-tuning models across different domains. The framework supports epoch-based training with hooks, metrics, evaluation, and checkpoint management.

## Capabilities

### Epoch-Based Trainer

Main trainer class for epoch-based training workflows.

```python { .api }
class EpochBasedTrainer:
    """
    Main epoch-based trainer for ModelScope models.
    """

    def __init__(
        self,
        model: Optional[Union[TorchModel, nn.Module, str]] = None,
        cfg_file: Optional[str] = None,
        cfg_modify_fn: Optional[Callable] = None,
        arg_parse_fn: Optional[Callable] = None,
        data_collator: Optional[Union[Callable, Dict[str, Callable]]] = None,
        train_dataset: Optional[Union[MsDataset, Dataset]] = None,
        eval_dataset: Optional[Union[MsDataset, Dataset]] = None,
        preprocessor: Optional[Union[Preprocessor, Dict[str, Preprocessor]]] = None,
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler._LRScheduler] = (None, None),
        model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
        seed: int = 42,
        callbacks: Optional[List[Hook]] = None,
        samplers: Optional[Union[Sampler, Dict[str, Sampler]]] = None,
        efficient_tuners: Union[Dict[str, TunerConfig], TunerConfig] = None,
        **kwargs
    ):
        """
        Initialize trainer with model and training configuration.

        Parameters:
        - model: Model instance to train (TorchModel, nn.Module, or model identifier string)
        - cfg_file: Path to configuration file
        - cfg_modify_fn: Function to modify configuration dynamically
        - arg_parse_fn: Custom argument parsing function
        - data_collator: Data collation function(s) for batching
        - train_dataset: Training dataset (MsDataset or Dataset)
        - eval_dataset: Evaluation dataset (MsDataset or Dataset)
        - preprocessor: Data preprocessor(s) for input processing
        - optimizers: Tuple of (optimizer, lr_scheduler) instances
        - model_revision: Model revision/version (default: DEFAULT_MODEL_REVISION)
        - seed: Random seed for reproducibility (default: 42)
        - callbacks: List of training hooks/callbacks
        - samplers: Data sampler(s) for training and evaluation
        - efficient_tuners: Parameter-efficient tuning configurations
        - **kwargs: Additional trainer-specific parameters
        """

    def train(self):
        """
        Start the training process.
        """

    def evaluate(self, eval_dataset=None):
        """
        Evaluate model on evaluation dataset.

        Parameters:
        - eval_dataset: Dataset for evaluation (optional)

        Returns:
        Evaluation metrics dictionary
        """

    def save_checkpoint(self, checkpoint_dir: str):
        """
        Save training checkpoint.

        Parameters:
        - checkpoint_dir: Directory to save checkpoint
        """

    def load_checkpoint(self, checkpoint_path: str):
        """
        Load training checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """

    def resume_training(self, checkpoint_path: str):
        """
        Resume training from checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """
```

### Training Arguments

Configuration class for training parameters and hyperparameters.

```python { .api }
@dataclass(init=False)
class TrainingArgs(DatasetArgs, TrainArgs, ModelArgs):
    """
    Configuration container for training parameters.
    Inherits from DatasetArgs, TrainArgs, and ModelArgs dataclasses.
    """

    use_model_config: bool = field(
        default=False,
        metadata={
            'help': 'Use the configuration of the model'
        }
    )

    def __init__(self, **kwargs):
        """
        Initialize training arguments with flexible keyword arguments.

        Parameters:
        - **kwargs: Training configuration parameters including:
          - output_dir: Directory for saving model and checkpoints
          - max_epochs: Maximum number of training epochs
          - learning_rate: Learning rate for optimizer
          - train_batch_size: Batch size for training
          - eval_batch_size: Batch size for evaluation
          - eval_strategy: Evaluation strategy ('no', 'steps', 'epoch')
          - save_strategy: Checkpoint saving strategy ('no', 'steps', 'epoch')
          - logging_steps: Steps between logging outputs
          - save_steps: Steps between saving checkpoints
          - eval_steps: Steps between evaluations
          - use_model_config: Whether to use model configuration

        Note: This class uses dataclass fields and supports all parameters
        from DatasetArgs, TrainArgs, and ModelArgs parent classes.
        """
        self.manual_args = list(kwargs.keys())
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])
        self._unknown_args = {}
```
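The keyword-driven `__init__` shown for `TrainingArgs` can be illustrated with a small stand-in dataclass. This is a minimal sketch rather than ModelScope's class: `DemoArgs` and its two fields are hypothetical, and only the `init=False` decorator and the `fields()` loop mirror the listing.

```python
from dataclasses import dataclass, field, fields


@dataclass(init=False)
class DemoArgs:
    """Hypothetical stand-in that mimics the TrainingArgs init pattern."""

    max_epochs: int = field(default=1, metadata={'help': 'Number of epochs'})
    learning_rate: float = field(default=1e-3, metadata={'help': 'Optimizer learning rate'})

    def __init__(self, **kwargs):
        # Remember which keys the caller passed explicitly.
        self.manual_args = list(kwargs.keys())
        # Copy only the keys that match a declared dataclass field.
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])
        self._unknown_args = {}


args = DemoArgs(max_epochs=5, not_a_field=True)
print(args.max_epochs)               # value passed by the caller
print(args.learning_rate)            # falls back to the field default
print(hasattr(args, 'not_a_field'))  # unmatched keys are not set as attributes
```

In this sketch, fields that are not passed keep their declared defaults, and keys that match no field are recorded in `manual_args` but never become attributes.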

### Hook System

Training hooks for customizing the training process at different stages.

```python { .api }
class Hook:
    """
    Base class for training hooks.
    """

    def before_run(self, trainer):
        """
        Called before training starts.

        Parameters:
        - trainer: Trainer instance
        """

    def after_run(self, trainer):
        """
        Called after training completes.

        Parameters:
        - trainer: Trainer instance
        """

    def before_epoch(self, trainer):
        """
        Called before each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def after_epoch(self, trainer):
        """
        Called after each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def before_iter(self, trainer):
        """
        Called before each iteration.

        Parameters:
        - trainer: Trainer instance
        """

    def after_iter(self, trainer):
        """
        Called after each iteration.

        Parameters:
        - trainer: Trainer instance
        """

class Priority:
    """
    Priority levels for hook execution order.
    """
    HIGHEST = 0
    HIGH = 10
    NORMAL = 50
    LOW = 70
    LOWEST = 100
```
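To make the stage methods and priority levels concrete, the sketch below shows one way a training loop could dispatch hooks. Everything here is a stand-in written for illustration: `TinyTrainer`, `TraceHook`, and the `iteration` attribute are hypothetical, and the rule that a lower priority number runs earlier is an assumption based on `HIGHEST = 0`.

```python
class Hook:
    """Stand-in base class: every stage is a no-op unless overridden."""
    def before_run(self, trainer): pass
    def after_run(self, trainer): pass
    def before_epoch(self, trainer): pass
    def after_epoch(self, trainer): pass
    def before_iter(self, trainer): pass
    def after_iter(self, trainer): pass


class TraceHook(Hook):
    """Records which stages fired, tagged with this hook's name."""
    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def before_epoch(self, trainer):
        self.trace.append((self.name, 'before_epoch', trainer.epoch))

    def after_iter(self, trainer):
        self.trace.append((self.name, 'after_iter', trainer.iteration))


class TinyTrainer:
    """Hypothetical loop that only dispatches hooks; it trains nothing."""
    def __init__(self, max_epochs, iters_per_epoch):
        self.max_epochs = max_epochs
        self.iters_per_epoch = iters_per_epoch
        self.epoch = 0
        self.iteration = 0
        self._hooks = []

    def register_hook(self, hook, priority=50):
        # Assumption: a lower number runs earlier (HIGHEST = 0).
        self._hooks.append((priority, hook))
        # sort() is stable, so equal priorities keep registration order.
        self._hooks.sort(key=lambda pair: pair[0])

    def _call(self, stage):
        for _, hook in self._hooks:
            getattr(hook, stage)(self)

    def train(self):
        self._call('before_run')
        for epoch in range(self.max_epochs):
            self.epoch = epoch
            self._call('before_epoch')
            for _ in range(self.iters_per_epoch):
                self._call('before_iter')
                self.iteration += 1  # a real trainer would run a training step here
                self._call('after_iter')
            self._call('after_epoch')
        self._call('after_run')


trace = []
demo_trainer = TinyTrainer(max_epochs=1, iters_per_epoch=2)
demo_trainer.register_hook(TraceHook('normal', trace), priority=50)
demo_trainer.register_hook(TraceHook('high', trace), priority=10)
demo_trainer.train()
for event in trace:
    print(event)
```

Although the `normal` hook is registered first, the `high` hook fires first at every stage because the hooks are kept sorted by priority.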

### Dataset Builder

Utility functions for building datasets from files and trainers from configuration.

```python { .api }
def build_dataset_from_file(
    data_files: str,
    split: str = None,
    cache_dir: str = None,
    **kwargs
):
    """
    Build dataset from file paths.

    Parameters:
    - data_files: Path to data file(s)
    - split: Dataset split name
    - cache_dir: Directory for caching processed data
    - **kwargs: Additional dataset parameters

    Returns:
    Dataset instance
    """

def build_trainer(cfg: dict, default_args: dict = None):
    """
    Build trainer from configuration.

    Parameters:
    - cfg: Trainer configuration dictionary
    - default_args: Default arguments to merge

    Returns:
    Trainer instance
    """
```
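Building an object from a configuration dictionary plus default arguments is often implemented with a registry. The sketch below illustrates that general pattern only; it is not ModelScope's implementation, and the `TRAINERS` dictionary, the `'type'` key, `register_trainer`, `build_from_cfg`, and `DemoTrainer` are all hypothetical names.

```python
TRAINERS = {}  # hypothetical registry: name -> trainer class


def register_trainer(name):
    """Class decorator that records a trainer class under `name`."""
    def decorator(cls):
        TRAINERS[name] = cls
        return cls
    return decorator


@register_trainer('demo')
class DemoTrainer:
    def __init__(self, model=None, max_epochs=1):
        self.model = model
        self.max_epochs = max_epochs


def build_from_cfg(cfg, default_args=None):
    # Start from the defaults so explicit cfg entries override them.
    merged = dict(default_args or {})
    merged.update(cfg)
    trainer_type = merged.pop('type')  # assumed key naming the registered class
    return TRAINERS[trainer_type](**merged)


built = build_from_cfg(
    {'type': 'demo', 'max_epochs': 3},
    default_args={'model': 'my-model', 'max_epochs': 1},
)
print(type(built).__name__, built.model, built.max_epochs)
```

Merging the defaults first means a value given in `cfg` wins over the same key in `default_args`, which is the usual expectation for "default" arguments.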

### Specialized Trainers

Domain-specific trainer implementations for specialized tasks.

```python { .api }
class NlpEpochBasedTrainer(EpochBasedTrainer):
    """
    NLP-specific trainer with text processing optimizations.
    """
    pass

class VecoTrainer(EpochBasedTrainer):
    """
    Specialized trainer for Veco models.
    """
    pass
```

## Usage Examples

### Basic Training Setup

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load pre-trained model
model = Model.from_pretrained('damo/nlp_structbert_base_chinese')

# Build dataset
train_dataset = build_dataset_from_file('train.json')
eval_dataset = build_dataset_from_file('eval.json')

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./output',
    max_epochs=10,
    learning_rate=2e-5,
    train_batch_size=16,
    eval_batch_size=32,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_steps=100
)

# Create trainer
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()
```

### Custom Training with Hooks

```python
from modelscope import EpochBasedTrainer, Hook, Priority

class CustomLoggingHook(Hook):
    """Print the loss every `log_interval` iterations and announce each finished epoch."""

    def __init__(self, log_interval=100):
        self.log_interval = log_interval
        self.step = 0

    def after_iter(self, trainer):
        self.step += 1
        if self.step % self.log_interval == 0:
            print(f"Step {self.step}: Loss = {trainer.loss}")

    def after_epoch(self, trainer):
        print(f"Epoch {trainer.epoch} completed")

class ModelCheckpointHook(Hook):
    """Save a checkpoint every `save_interval` epochs."""

    def __init__(self, save_interval=5):
        self.save_interval = save_interval

    def after_epoch(self, trainer):
        if trainer.epoch % self.save_interval == 0:
            trainer.save_checkpoint(f'./checkpoints/epoch_{trainer.epoch}')

# Create trainer with custom hooks
# (model, training_args, and train_dataset come from Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Register hooks
trainer.register_hook(CustomLoggingHook(log_interval=50), Priority.HIGH)
trainer.register_hook(ModelCheckpointHook(save_interval=2), Priority.NORMAL)

# Start training
trainer.train()
```

### Fine-tuning with Evaluation

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load model for fine-tuning
model = Model.from_pretrained('damo/nlp_bert_base_chinese')

# Prepare datasets
train_data = build_dataset_from_file('fine_tune_train.json')
eval_data = build_dataset_from_file('fine_tune_eval.json')

# Configure fine-tuning arguments
fine_tune_args = TrainingArgs(
    output_dir='./fine_tuned_model',
    max_epochs=5,
    learning_rate=1e-5,  # Lower learning rate for fine-tuning
    train_batch_size=8,
    eval_batch_size=16,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='steps',
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    greater_is_better=True
)

# Create trainer
trainer = EpochBasedTrainer(
    model=model,
    args=fine_tune_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

# Train and evaluate
trainer.train()
final_metrics = trainer.evaluate()
print(f"Final evaluation metrics: {final_metrics}")
```
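The interaction between `eval_steps=200` and `save_steps=500` is easier to see as arithmetic. The sketch below assumes that a `'steps'` strategy fires whenever the 1-based global step is a multiple of the interval; that rule, the helper `trigger_steps`, and the run length of 1000 steps are assumptions made for illustration.

```python
def trigger_steps(total_steps, interval):
    """Return the 1-based steps at which an every-`interval`-steps action fires."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]


total_steps = 1000  # hypothetical run length
eval_at = trigger_steps(total_steps, 200)
save_at = trigger_steps(total_steps, 500)
both = sorted(set(eval_at) & set(save_at))

print(eval_at)  # [200, 400, 600, 800, 1000]
print(save_at)  # [500, 1000]
print(both)     # [1000]
```

Under this assumption, only step 1000 is both evaluated and saved. Choosing `save_steps` as a multiple of `eval_steps` makes every saved checkpoint coincide with an evaluation.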

### Resume Training from Checkpoint

```python
from modelscope import EpochBasedTrainer, TrainingArgs

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./continued_training',
    max_epochs=20,
    resume_from_checkpoint='./checkpoints/epoch_10'
)

# Create trainer
# (model and train_dataset come from Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Resume training from checkpoint
trainer.resume_training('./checkpoints/epoch_10/checkpoint.pth')
```
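The idea behind resuming is that a checkpoint stores enough state (at minimum the epoch counter and the model weights) for a later run to continue where the earlier one stopped. The toy sketch below demonstrates that round trip with JSON and a single number standing in for the weights. It says nothing about ModelScope's actual checkpoint format, and `save_state`, `load_state`, and `run` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_state(path, epoch, weights):
    """Write the toy training state to disk."""
    with open(path, 'w') as f:
        json.dump({'epoch': epoch, 'weights': weights}, f)


def load_state(path):
    """Read the toy training state back."""
    with open(path) as f:
        return json.load(f)


def run(max_epochs, state=None):
    """Toy loop: each epoch adds 1.0 to every 'weight'."""
    start = state['epoch'] if state else 0
    weights = state['weights'] if state else [0.0]
    for _ in range(start, max_epochs):
        weights = [w + 1.0 for w in weights]
    return {'epoch': max_epochs, 'weights': weights}


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, 'epoch_10.json')
    first = run(max_epochs=10)
    save_state(ckpt, first['epoch'], first['weights'])
    # Continue from epoch 10 up to epoch 20 using the saved state.
    resumed = run(max_epochs=20, state=load_state(ckpt))

uninterrupted = run(max_epochs=20)
print(resumed == uninterrupted)
```

The resumed run ends in the same state as an uninterrupted 20-epoch run because the loop starts from the stored epoch instead of from zero.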

### Custom Trainer Implementation

```python
from modelscope import EpochBasedTrainer

class CustomTrainer(EpochBasedTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Custom initialization

    def compute_loss(self, model, inputs):
        """
        Custom loss computation.

        Parameters:
        - model: Model instance
        - inputs: Batch inputs

        Returns:
        Loss tensor
        """
        outputs = model(inputs)
        # Custom loss calculation
        # (custom_loss_function is a placeholder for a user-defined loss)
        loss = custom_loss_function(outputs, inputs['labels'])
        return loss

    def evaluate(self, eval_dataset=None):
        """
        Custom evaluation logic.
        """
        # Run the standard evaluation first
        metrics = super().evaluate(eval_dataset)

        # Add custom metrics
        # (compute_custom_metric is a placeholder method to define on this class)
        custom_metric = self.compute_custom_metric()
        metrics['custom_metric'] = custom_metric

        return metrics

# Use custom trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
```
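The pattern of extending `evaluate` by calling `super()` and adding a key to the returned dictionary can be run end to end with a stand-in base class. In this sketch `BaseTrainer` is hypothetical and simply compares stored predictions with labels, and the extra metric (the share of positive predictions) is an arbitrary choice for illustration.

```python
class BaseTrainer:
    """Stand-in for a trainer whose evaluate() returns a metrics dict."""
    def __init__(self, predictions, labels):
        self.predictions = predictions
        self.labels = labels

    def evaluate(self, eval_dataset=None):
        correct = sum(p == y for p, y in zip(self.predictions, self.labels))
        return {'accuracy': correct / len(self.labels)}


class DemoCustomTrainer(BaseTrainer):
    def compute_custom_metric(self):
        # Hypothetical metric: fraction of predictions in the positive class.
        return sum(1 for p in self.predictions if p == 1) / len(self.predictions)

    def evaluate(self, eval_dataset=None):
        # Keep the base metrics and add one more key.
        metrics = super().evaluate(eval_dataset)
        metrics['custom_metric'] = self.compute_custom_metric()
        return metrics


demo_metrics = DemoCustomTrainer(
    predictions=[1, 0, 1, 1],
    labels=[1, 0, 0, 1],
).evaluate()
print(demo_metrics)
```

Because the subclass returns the base dictionary with one extra entry, any code that reads the standard metrics keeps working unchanged.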

### Multi-GPU Training

```python
from modelscope import EpochBasedTrainer, TrainingArgs
import torch

# Check for multiple GPUs
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")

    # Configure for multi-GPU training
    training_args = TrainingArgs(
        output_dir='./multi_gpu_output',
        max_epochs=10,
        train_batch_size=32,  # Total batch size across all GPUs
        eval_batch_size=64,
        dataloader_num_workers=4,
        fp16=True,  # Mixed precision training
        gradient_accumulation_steps=2
    )

    # Create trainer (will automatically use multiple GPUs)
    trainer = EpochBasedTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )

    trainer.train()
```
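The batch-size settings in the multi-GPU example combine as simple arithmetic. The sketch below assumes, as the comment in that example states, that `train_batch_size` is the total across all GPUs, and that gradients are accumulated over `gradient_accumulation_steps` batches before each optimizer update. The helper `batch_breakdown` and the GPU count of 4 are hypothetical.

```python
def batch_breakdown(total_batch_size, num_gpus, accumulation_steps):
    """Return (samples per GPU per step, samples per optimizer update)."""
    if total_batch_size % num_gpus != 0:
        raise ValueError('total_batch_size must be divisible by num_gpus')
    per_gpu = total_batch_size // num_gpus
    per_update = total_batch_size * accumulation_steps
    return per_gpu, per_update


per_gpu, per_update = batch_breakdown(
    total_batch_size=32,
    num_gpus=4,  # hypothetical device count
    accumulation_steps=2,
)
print(per_gpu, per_update)  # 8 64
```

Under these assumptions each GPU processes 8 samples per step, and every optimizer update reflects 64 samples, which is the number to keep in mind when tuning the learning rate.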

### Learning Rate Scheduling

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

from modelscope import EpochBasedTrainer, TrainingArgs, Hook

class LearningRateSchedulerHook(Hook):
    """Step the wrapped scheduler once at the end of every epoch."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def after_epoch(self, trainer):
        self.scheduler.step()
        current_lr = self.scheduler.get_last_lr()[0]
        print(f"Learning rate updated to: {current_lr}")

# Create optimizer and scheduler
# (model, training_args, and train_dataset come from Basic Training Setup)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)

# Setup training with learning rate scheduling.
# The optimizer is passed to the trainer so that the scheduler stepped by
# the hook adjusts the optimizer that training actually uses.
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None)
)

# Register scheduler hook
trainer.register_hook(LearningRateSchedulerHook(scheduler))

trainer.train()
```
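With `StepLR(optimizer, step_size=3, gamma=0.5)` the learning rate is multiplied by `gamma` once every `step_size` scheduler steps. The closed form can be checked without PyTorch, as in this minimal sketch. The helper `step_lr` is hypothetical, and `epoch` here means the number of `scheduler.step()` calls made so far.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate after `epoch` scheduler steps under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(1e-4, epoch, step_size=3, gamma=0.5) for epoch in range(9)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

Steps 0 to 2 keep the base rate, steps 3 to 5 use half of it, and steps 6 to 8 use a quarter.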
