# Training Framework

ModelScope's training framework provides tools for training and fine-tuning models across different domains. It is built around an epoch-based trainer and supports hooks, metrics, evaluation, and checkpoint management.

## Capabilities

### Epoch-Based Trainer

Main trainer class for epoch-based training workflows.

```python { .api }
class EpochBasedTrainer:
    """
    Main epoch-based trainer for ModelScope models.
    """

    def __init__(
        self,
        model: Optional[Union[TorchModel, nn.Module, str]] = None,
        cfg_file: Optional[str] = None,
        cfg_modify_fn: Optional[Callable] = None,
        arg_parse_fn: Optional[Callable] = None,
        data_collator: Optional[Union[Callable, Dict[str, Callable]]] = None,
        train_dataset: Optional[Union[MsDataset, Dataset]] = None,
        eval_dataset: Optional[Union[MsDataset, Dataset]] = None,
        preprocessor: Optional[Union[Preprocessor, Dict[str, Preprocessor]]] = None,
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler._LRScheduler] = (None, None),
        model_revision: Optional[str] = DEFAULT_MODEL_REVISION,
        seed: int = 42,
        callbacks: Optional[List[Hook]] = None,
        samplers: Optional[Union[Sampler, Dict[str, Sampler]]] = None,
        efficient_tuners: Union[Dict[str, TunerConfig], TunerConfig] = None,
        **kwargs
    ):
        """
        Initialize trainer with model and training configuration.

        Parameters:
        - model: Model instance to train (TorchModel, nn.Module, or model identifier string)
        - cfg_file: Path to configuration file
        - cfg_modify_fn: Function to modify configuration dynamically
        - arg_parse_fn: Custom argument parsing function
        - data_collator: Data collation function(s) for batching
        - train_dataset: Training dataset (MsDataset or Dataset)
        - eval_dataset: Evaluation dataset (MsDataset or Dataset)
        - preprocessor: Data preprocessor(s) for input processing
        - optimizers: Tuple of (optimizer, lr_scheduler) instances
        - model_revision: Model revision/version (default: DEFAULT_MODEL_REVISION)
        - seed: Random seed for reproducibility (default: 42)
        - callbacks: List of training hooks/callbacks
        - samplers: Data sampler(s) for training and evaluation
        - efficient_tuners: Parameter-efficient tuning configurations
        - **kwargs: Additional trainer-specific parameters
        """

    def train(self):
        """
        Start the training process.
        """

    def evaluate(self, eval_dataset=None):
        """
        Evaluate model on evaluation dataset.

        Parameters:
        - eval_dataset: Dataset for evaluation (optional)

        Returns:
        Evaluation metrics dictionary
        """

    def save_checkpoint(self, checkpoint_dir: str):
        """
        Save training checkpoint.

        Parameters:
        - checkpoint_dir: Directory to save checkpoint
        """

    def load_checkpoint(self, checkpoint_path: str):
        """
        Load training checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """

    def resume_training(self, checkpoint_path: str):
        """
        Resume training from checkpoint.

        Parameters:
        - checkpoint_path: Path to checkpoint file
        """
```

### Training Arguments

Configuration class for training parameters and hyperparameters.

```python { .api }
@dataclass(init=False)
class TrainingArgs(DatasetArgs, TrainArgs, ModelArgs):
    """
    Configuration container for training parameters.
    Inherits from DatasetArgs, TrainArgs, and ModelArgs dataclasses.
    """

    use_model_config: bool = field(
        default=False,
        metadata={
            'help': 'Use the configuration of the model'
        }
    )

    def __init__(self, **kwargs):
        """
        Initialize training arguments with flexible keyword arguments.

        Parameters:
        - **kwargs: Training configuration parameters including:
            - output_dir: Directory for saving model and checkpoints
            - max_epochs: Maximum number of training epochs
            - learning_rate: Learning rate for optimizer
            - train_batch_size: Batch size for training
            - eval_batch_size: Batch size for evaluation
            - eval_strategy: Evaluation strategy ('no', 'steps', 'epoch')
            - save_strategy: Checkpoint saving strategy ('no', 'steps', 'epoch')
            - logging_steps: Steps between logging outputs
            - save_steps: Steps between saving checkpoints
            - eval_steps: Steps between evaluations
            - use_model_config: Whether to use model configuration

        Note: This class uses dataclass fields and supports all parameters
        from DatasetArgs, TrainArgs, and ModelArgs parent classes.
        """
        self.manual_args = list(kwargs.keys())
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])
        self._unknown_args = {}
```
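
The `__init__` shown above copies only those keyword arguments that match a declared dataclass field, and records every key the caller passed in `manual_args`. The following is a minimal, self-contained sketch of that pattern. `ToyArgs` and its two fields are hypothetical stand-ins, not the real `TrainingArgs` fields.

```python
from dataclasses import dataclass, field, fields


@dataclass(init=False)
class ToyArgs:
    # Hypothetical fields for illustration; not the real TrainingArgs fields.
    learning_rate: float = 1e-3
    max_epochs: int = field(default=1, metadata={'help': 'Number of epochs'})

    def __init__(self, **kwargs):
        # Remember every key the caller passed, whether or not it is a field.
        self.manual_args = list(kwargs.keys())
        # Copy only the keys that match a declared dataclass field.
        for f in fields(self):
            if f.name in kwargs:
                setattr(self, f.name, kwargs[f.name])


args = ToyArgs(learning_rate=0.1, warmup=5)
print(args.learning_rate)       # value supplied by the caller
print(args.max_epochs)          # not supplied, so the class-level default applies
print(args.manual_args)         # all keys passed, including the unknown one
print(hasattr(args, 'warmup'))  # unknown keys are not set as attributes
```

In this sketch, fields that are not passed keep their defaults because `@dataclass` stores simple defaults as class attributes even when `init=False`.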

### Hook System

Training hooks for customizing the training process at different stages.

```python { .api }
class Hook:
    """
    Base class for training hooks.
    """

    def before_run(self, trainer):
        """
        Called before training starts.

        Parameters:
        - trainer: Trainer instance
        """

    def after_run(self, trainer):
        """
        Called after training completes.

        Parameters:
        - trainer: Trainer instance
        """

    def before_epoch(self, trainer):
        """
        Called before each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def after_epoch(self, trainer):
        """
        Called after each epoch.

        Parameters:
        - trainer: Trainer instance
        """

    def before_iter(self, trainer):
        """
        Called before each iteration.

        Parameters:
        - trainer: Trainer instance
        """

    def after_iter(self, trainer):
        """
        Called after each iteration.

        Parameters:
        - trainer: Trainer instance
        """

class Priority:
    """
    Priority levels for hook execution order.
    """
    HIGHEST = 0
    HIGH = 10
    NORMAL = 50
    LOW = 70
    LOWEST = 100
```
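
To make the hook stages concrete, here is a rough sketch of how an epoch-based loop might call them. `ToyTrainer` and `Recorder` are hypothetical stand-ins, not ModelScope's implementation, and the sketch assumes that a lower priority number means a hook runs earlier. `Hook` and `Priority` are re-declared only so the example is self-contained.

```python
class Priority:
    HIGHEST = 0
    HIGH = 10
    NORMAL = 50
    LOW = 70
    LOWEST = 100


class Hook:
    """Base hook: every stage is a no-op unless a subclass overrides it."""
    def before_run(self, trainer): pass
    def after_run(self, trainer): pass
    def before_epoch(self, trainer): pass
    def after_epoch(self, trainer): pass
    def before_iter(self, trainer): pass
    def after_iter(self, trainer): pass


class ToyTrainer:
    """Hypothetical stand-in for a trainer; it only dispatches hooks."""

    def __init__(self, max_epochs, iters_per_epoch):
        self.max_epochs = max_epochs
        self.iters_per_epoch = iters_per_epoch
        self.epoch = 0
        self._hooks = []  # (priority, hook) pairs

    def register_hook(self, hook, priority=Priority.NORMAL):
        self._hooks.append((priority, hook))
        # Assumption: lower number runs first. The sort is stable, so hooks
        # with equal priority keep their registration order.
        self._hooks.sort(key=lambda pair: pair[0])

    def _call(self, stage):
        for _, hook in self._hooks:
            getattr(hook, stage)(self)

    def train(self):
        self._call('before_run')
        for epoch in range(self.max_epochs):
            self.epoch = epoch
            self._call('before_epoch')
            for _ in range(self.iters_per_epoch):
                self._call('before_iter')
                # A real trainer would run the forward/backward pass here.
                self._call('after_iter')
            self._call('after_epoch')
        self._call('after_run')


class Recorder(Hook):
    """Appends (name, epoch) to a shared list after each epoch."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def after_epoch(self, trainer):
        self.log.append((self.name, trainer.epoch))


events = []
trainer = ToyTrainer(max_epochs=2, iters_per_epoch=3)
trainer.register_hook(Recorder('low', events), Priority.LOW)
trainer.register_hook(Recorder('high', events), Priority.HIGH)
trainer.train()
print(events)  # [('high', 0), ('low', 0), ('high', 1), ('low', 1)]
```

Although the `low` recorder is registered first, the `high` recorder fires first in every epoch because dispatch order follows priority rather than registration order.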

### Dataset Builder

Utility functions for building datasets from files and trainers from configuration.

```python { .api }
def build_dataset_from_file(
    data_files: str,
    split: str = None,
    cache_dir: str = None,
    **kwargs
):
    """
    Build dataset from file paths.

    Parameters:
    - data_files: Path to data file(s)
    - split: Dataset split name
    - cache_dir: Directory for caching processed data
    - **kwargs: Additional dataset parameters

    Returns:
    Dataset instance
    """

def build_trainer(cfg: dict, default_args: dict = None):
    """
    Build trainer from configuration.

    Parameters:
    - cfg: Trainer configuration dictionary
    - default_args: Default arguments to merge

    Returns:
    Trainer instance
    """
```
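
The idea of building an object from a configuration dictionary plus default arguments can be sketched generically as follows. This is not ModelScope's implementation: the `TRAINERS` registry, the `register` decorator, the `'type'` key, and the rule that `cfg` overrides `default_args` are all assumptions made for illustration.

```python
TRAINERS = {}  # hypothetical registry mapping a name to a trainer class


def register(name):
    """Class decorator that records a trainer class under `name`."""
    def decorator(cls):
        TRAINERS[name] = cls
        return cls
    return decorator


@register('toy')
class ToyTrainer:
    def __init__(self, max_epochs=1, seed=42):
        self.max_epochs = max_epochs
        self.seed = seed


def build_from_cfg(cfg, default_args=None):
    """Merge defaults with cfg, then instantiate the class named by cfg['type']."""
    params = dict(default_args or {})
    params.update(cfg)  # assumption: explicit cfg values win over defaults
    trainer_type = params.pop('type')
    return TRAINERS[trainer_type](**params)


trainer = build_from_cfg(
    {'type': 'toy', 'max_epochs': 3},
    default_args={'seed': 7, 'max_epochs': 1},
)
print(trainer.max_epochs, trainer.seed)  # 3 7
```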

### Specialized Trainers

Domain-specific trainer implementations for specialized tasks.

```python { .api }
class NlpEpochBasedTrainer(EpochBasedTrainer):
    """
    NLP-specific trainer with text processing optimizations.
    """
    pass

class VecoTrainer(EpochBasedTrainer):
    """
    Specialized trainer for Veco models.
    """
    pass
```

## Usage Examples

### Basic Training Setup

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load pre-trained model
model = Model.from_pretrained('damo/nlp_structbert_base_chinese')

# Build dataset
train_dataset = build_dataset_from_file('train.json')
eval_dataset = build_dataset_from_file('eval.json')

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./output',
    max_epochs=10,
    learning_rate=2e-5,
    train_batch_size=16,
    eval_batch_size=32,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_steps=100
)

# Create trainer
# Note: `args` is not a named parameter in the constructor signature above,
# so it is passed through **kwargs.
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()
```

### Custom Training with Hooks

```python
from modelscope import EpochBasedTrainer, Hook, Priority

class CustomLoggingHook(Hook):
    def __init__(self, log_interval=100):
        self.log_interval = log_interval
        self.step = 0

    def after_iter(self, trainer):
        # Count iterations and log the loss every `log_interval` steps
        self.step += 1
        if self.step % self.log_interval == 0:
            print(f"Step {self.step}: Loss = {trainer.loss}")

    def after_epoch(self, trainer):
        print(f"Epoch {trainer.epoch} completed")

class ModelCheckpointHook(Hook):
    def __init__(self, save_interval=5):
        self.save_interval = save_interval

    def after_epoch(self, trainer):
        # Save a checkpoint every `save_interval` epochs
        if trainer.epoch % self.save_interval == 0:
            trainer.save_checkpoint(f'./checkpoints/epoch_{trainer.epoch}')

# Create trainer with custom hooks
# (model, training_args, and train_dataset are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Register hooks
trainer.register_hook(CustomLoggingHook(log_interval=50), Priority.HIGH)
trainer.register_hook(ModelCheckpointHook(save_interval=2), Priority.NORMAL)

# Start training
trainer.train()
```

### Fine-tuning with Evaluation

```python
from modelscope import Model, EpochBasedTrainer, TrainingArgs
from modelscope import build_dataset_from_file

# Load model for fine-tuning
model = Model.from_pretrained('damo/nlp_bert_base_chinese')

# Prepare datasets
train_data = build_dataset_from_file('fine_tune_train.json')
eval_data = build_dataset_from_file('fine_tune_eval.json')

# Configure fine-tuning arguments
fine_tune_args = TrainingArgs(
    output_dir='./fine_tuned_model',
    max_epochs=5,
    learning_rate=1e-5,  # Lower learning rate for fine-tuning
    train_batch_size=8,
    eval_batch_size=16,
    eval_strategy='steps',
    eval_steps=200,
    save_strategy='steps',
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    greater_is_better=True
)

# Create trainer
trainer = EpochBasedTrainer(
    model=model,
    args=fine_tune_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

# Train and evaluate
trainer.train()
final_metrics = trainer.evaluate()
print(f"Final evaluation metrics: {final_metrics}")
```
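
The `'no'`, `'steps'`, and `'epoch'` strategies used for `eval_strategy` and `save_strategy` can be pictured as a simple gate on the step counter. The sketch below shows one plausible interpretation, not ModelScope's actual scheduling logic. `should_trigger` is a hypothetical helper and `steps_per_epoch = 1000` is an assumed value.

```python
def should_trigger(strategy, step, interval, steps_per_epoch):
    """Return True if an action is due after `step` completed steps (1-based)."""
    if strategy == 'no':
        return False
    if strategy == 'steps':
        return step % interval == 0
    if strategy == 'epoch':
        return step % steps_per_epoch == 0
    raise ValueError(f"Unknown strategy: {strategy!r}")


steps_per_epoch = 1000  # assumed for illustration

# Mirror eval_steps=200 and save_steps=500 from the example above
eval_at = [s for s in range(1, steps_per_epoch + 1)
           if should_trigger('steps', s, 200, steps_per_epoch)]
save_at = [s for s in range(1, steps_per_epoch + 1)
           if should_trigger('steps', s, 500, steps_per_epoch)]

print(eval_at)  # [200, 400, 600, 800, 1000]
print(save_at)  # [500, 1000]
```

Under this reading, evaluation runs five times per 1000-step epoch and a checkpoint is written twice.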

### Resume Training from Checkpoint

```python
from modelscope import EpochBasedTrainer, TrainingArgs

# Configure training arguments
training_args = TrainingArgs(
    output_dir='./continued_training',
    max_epochs=20,
    resume_from_checkpoint='./checkpoints/epoch_10'
)

# Create trainer
# (model and train_dataset are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Resume training from checkpoint
trainer.resume_training('./checkpoints/epoch_10/checkpoint.pth')
```
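
Conceptually, resuming means restoring the saved state (at minimum the epoch counter and the model weights) and continuing the loop from there. The toy sketch below illustrates this with a JSON file and a single number standing in for the weights. It says nothing about the real checkpoint format, and `save_state`, `load_state`, and `run` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_state(path, epoch, weights):
    """Write the number of completed epochs and the toy weights to disk."""
    with open(path, 'w') as f:
        json.dump({'epoch': epoch, 'weights': weights}, f)


def load_state(path):
    with open(path) as f:
        return json.load(f)


def run(max_epochs, ckpt_path, start_epoch=0, weights=None, stop_after=None):
    """Toy loop: each epoch adds 1.0 to a single weight, then checkpoints."""
    weights = list(weights or [0.0])
    for epoch in range(start_epoch, max_epochs):
        weights[0] += 1.0
        save_state(ckpt_path, epoch + 1, weights)
        if stop_after is not None and epoch + 1 == stop_after:
            break  # simulate an interrupted run
    return weights


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, 'checkpoint.json')

    # First run stops after 10 of 20 epochs
    run(max_epochs=20, ckpt_path=ckpt, stop_after=10)

    # Resume: restore the state and continue up to max_epochs
    state = load_state(ckpt)
    resumed = run(max_epochs=20, ckpt_path=ckpt,
                  start_epoch=state['epoch'], weights=state['weights'])

    # Reference: the same training without interruption
    uninterrupted = run(max_epochs=20, ckpt_path=ckpt)

print(state['epoch'], resumed, uninterrupted)  # 10 [20.0] [20.0]
```

The resumed run ends in the same state as the uninterrupted one, which is the property a resume mechanism is meant to provide.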

### Custom Trainer Implementation

```python
from modelscope import EpochBasedTrainer

class CustomTrainer(EpochBasedTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Custom initialization

    def compute_loss(self, model, inputs):
        """
        Custom loss computation.

        Parameters:
        - model: Model instance
        - inputs: Batch inputs

        Returns:
        Loss tensor
        """
        outputs = model(inputs)
        # Custom loss calculation; custom_loss_function is a placeholder you define
        loss = custom_loss_function(outputs, inputs['labels'])
        return loss

    def evaluate(self, eval_dataset=None):
        """
        Custom evaluation logic.
        """
        # Start from the metrics produced by the base trainer
        metrics = super().evaluate(eval_dataset)

        # Add custom metrics; compute_custom_metric is a placeholder method you implement
        custom_metric = self.compute_custom_metric()
        metrics['custom_metric'] = custom_metric

        return metrics

# Use custom trainer
# (model, training_args, and the datasets are defined as in Basic Training Setup)
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
```

### Multi-GPU Training

```python
from modelscope import EpochBasedTrainer, TrainingArgs
import torch

# Check for multiple GPUs
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")

# Configure for multi-GPU training
training_args = TrainingArgs(
    output_dir='./multi_gpu_output',
    max_epochs=10,
    train_batch_size=32,  # Total batch size across all GPUs
    eval_batch_size=64,
    dataloader_num_workers=4,
    fp16=True,  # Mixed precision training
    gradient_accumulation_steps=2
)

# Create trainer (will automatically use multiple GPUs)
# (model and the datasets are defined as in Basic Training Setup)
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
```
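
The batch-size settings above interact with the GPU count and with gradient accumulation. The short sketch below works through the arithmetic, assuming (as the comment in the example states) that `train_batch_size` is the total across all GPUs. `batch_layout` is a hypothetical helper and the 4-GPU machine is an assumed setup.

```python
def batch_layout(total_batch_size, num_gpus, accumulation_steps):
    """Return (samples per GPU per forward pass, samples per optimizer step)."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("total_batch_size must be divisible by num_gpus")
    per_gpu = total_batch_size // num_gpus
    # Gradients from `accumulation_steps` batches are summed before each update
    effective = total_batch_size * accumulation_steps
    return per_gpu, effective


# train_batch_size=32 and gradient_accumulation_steps=2 on an assumed 4-GPU machine
per_gpu, effective = batch_layout(32, num_gpus=4, accumulation_steps=2)
print(per_gpu, effective)  # 8 64
```

Each GPU processes 8 samples per forward pass, and each optimizer update reflects 64 samples, which is the figure to keep in mind when choosing a learning rate.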

### Learning Rate Scheduling

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

from modelscope import EpochBasedTrainer, TrainingArgs, Hook

class LearningRateSchedulerHook(Hook):
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def after_epoch(self, trainer):
        # Advance the schedule once per epoch and report the new rate
        self.scheduler.step()
        current_lr = self.scheduler.get_last_lr()[0]
        print(f"Learning rate updated to: {current_lr}")

# Create optimizer and scheduler
# (model, training_args, and train_dataset are defined as in Basic Training Setup)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)

# Setup training with learning rate scheduling.
# The optimizer is passed to the trainer so the scheduler adjusts the optimizer
# that training actually uses; the scheduler slot stays None because the hook steps it.
trainer = EpochBasedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None)
)

# Register scheduler hook
trainer.register_hook(LearningRateSchedulerHook(scheduler))

trainer.train()
```
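
For reference, `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps, so the rate during epoch `e` (0-based) is `base_lr * gamma ** (e // step_size)`. The pure-Python sketch below reproduces that schedule for the values used above without requiring PyTorch. `step_lr` is a hypothetical helper name.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Closed form of a step decay schedule for a 0-based epoch index."""
    return base_lr * gamma ** (epoch // step_size)


# lr=1e-4, step_size=3, gamma=0.5, as in the example above
schedule = [step_lr(1e-4, epoch, step_size=3, gamma=0.5) for epoch in range(9)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

The rate stays at 1e-4 for epochs 0 to 2, drops to 5e-5 for epochs 3 to 5, and to 2.5e-5 for epochs 6 to 8.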