# Task Evaluators

Task-specific evaluators provide high-level evaluation pipelines that integrate models, datasets, and metrics for common machine learning tasks. They simplify the evaluation process by handling data loading, preprocessing, inference, and metric computation in a unified workflow.

## Capabilities

### Evaluator Factory Function

The `evaluator` function is the primary way to create task-specific evaluators:

```python { .api }
def evaluator(task: str) -> Evaluator:
    """Factory function to create task-specific evaluators.

    Args:
        task: Task type string specifying which evaluator to create.
            Must be one of the supported task types.

    Returns:
        Task-specific evaluator instance with default metric configured.

    Raises:
        ImportError: If transformers is not installed (required for evaluators)
        KeyError: If task type is not supported
    """
```

Supported tasks:

- `"text-classification"` (alias: `"sentiment-analysis"`)
- `"image-classification"`
- `"question-answering"`
- `"token-classification"`
- `"text-generation"`
- `"text2text-generation"`
- `"summarization"`
- `"translation"`
- `"automatic-speech-recognition"`
- `"audio-classification"`

**Usage Example:**

```python
import evaluate

# Create task-specific evaluators
text_evaluator = evaluate.evaluator("text-classification")
qa_evaluator = evaluate.evaluator("question-answering")
img_evaluator = evaluate.evaluator("image-classification")
```
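
The `"sentiment-analysis"` alias listed above resolves to the same text classification evaluator; a minimal, purely illustrative check:

```python
import evaluate

# "sentiment-analysis" is an alias for "text-classification"
alias_evaluator = evaluate.evaluator("sentiment-analysis")
text_evaluator = evaluate.evaluator("text-classification")

print(type(alias_evaluator).__name__)  # TextClassificationEvaluator
print(type(text_evaluator).__name__)   # TextClassificationEvaluator
```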

### Base Evaluator Class

All task evaluators inherit from the base `Evaluator` class:

```python { .api }
class Evaluator:
    """Abstract base class for task-specific evaluators."""

    def compute(
        self,
        model_or_pipeline,
        data,
        subset: Optional[str] = None,
        split: Optional[str] = None,
        metric: Optional[Union[str, EvaluationModule]] = None,
        tokenizer: Optional[str] = None,
        feature_extractor: Optional[str] = None,
        strategy: str = "simple",
        confidence_level: float = 0.95,
        n_resamples: int = 9999,
        device: Optional[int] = None,
        random_state: Optional[int] = None,
        input_column: str = "text",
        label_column: str = "label",
        label_mapping: Optional[Dict[str, Number]] = None
    ) -> Dict[str, float]

    def load_data(
        self,
        data: Union[str, Dataset],
        subset: Optional[str] = None,
        split: Optional[str] = None
    ) -> Dataset

    def prepare_data(
        self,
        data: Dataset,
        input_column: str,
        label_column: str,
        *args,
        **kwargs
    ) -> Dataset

    def prepare_pipeline(
        self,
        model_or_pipeline,
        tokenizer: Optional[str] = None,
        feature_extractor: Optional[str] = None,
        device: Optional[int] = None
    )

    def prepare_metric(self, metric: Union[str, EvaluationModule]) -> EvaluationModule
```

**Usage Example:**

```python
import evaluate

# Create evaluator
evaluator = evaluate.evaluator("text-classification")

# Evaluate a model on a dataset
results = evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:100]",
    metric="accuracy",
    input_column="text",
    label_column="label"
)

print(results)  # {'accuracy': 0.85}
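
The `strategy`, `confidence_level`, `n_resamples`, and `random_state` parameters in the signature above enable bootstrap resampling. A sketch reusing the same model and data; the exact shape of the bootstrapped result dictionary may vary:

```python
import evaluate

evaluator = evaluate.evaluator("text-classification")

# Bootstrap the metric to get an uncertainty estimate alongside the score
results = evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:100]",
    metric="accuracy",
    strategy="bootstrap",
    confidence_level=0.95,
    n_resamples=200,   # fewer than the default 9999 to keep the example fast
    random_state=0
)

print(results)  # accuracy score plus a bootstrap confidence interval
```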

### Text Classification Evaluator

Evaluates text classification models using accuracy as the default metric:

```python { .api }
class TextClassificationEvaluator(Evaluator):
    """Evaluator for text classification tasks."""
    # Default metric: "accuracy"
```

**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("text-classification")

# Evaluate with a Transformers pipeline
from transformers import pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

results = evaluator.compute(
    model_or_pipeline=classifier,
    data="glue",
    subset="sst2",
    split="validation[:100]",
    metric="accuracy",
    # Map the pipeline's label names to the dataset's integer labels
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
```
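
Because the evaluator holds no model state, one instance can score several checkpoints on the same data slice. A sketch; the second checkpoint name is a placeholder, and `label_mapping` must match each model's label names:

```python
import evaluate

evaluator = evaluate.evaluator("text-classification")

checkpoints = [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "your-org/your-sst2-model",  # placeholder for another fine-tuned checkpoint
]

for checkpoint in checkpoints:
    results = evaluator.compute(
        model_or_pipeline=checkpoint,
        data="glue",
        subset="sst2",
        split="validation[:100]",
        metric="accuracy",
        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # adjust to the model's label names
    )
    print(checkpoint, results)
```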

### Question Answering Evaluator

Evaluates question answering models using the SQuAD metric as the default:

```python { .api }
class QuestionAnsweringEvaluator(Evaluator):
    """Evaluator for question answering tasks."""
    # Default metric: "squad"
```

**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("question-answering")

results = evaluator.compute(
    model_or_pipeline="distilbert-base-cased-distilled-squad",
    data="squad",
    split="validation[:100]",
    metric="squad"
)

print(results)  # {'exact_match': 78.5, 'f1': 86.2}
```

### Token Classification Evaluator

Evaluates named entity recognition and other token classification tasks:

```python { .api }
class TokenClassificationEvaluator(Evaluator):
    """Evaluator for token classification tasks."""
    # Default metric: "seqeval"
```

**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("token-classification")

results = evaluator.compute(
    model_or_pipeline="dbmdz/bert-large-cased-finetuned-conll03-english",
    data="conll2003",
    split="test[:100]",
    metric="seqeval"
)
```

### Image Classification Evaluator

Evaluates image classification models:

```python { .api }
class ImageClassificationEvaluator(Evaluator):
    """Evaluator for image classification tasks."""
    # Default metric: "accuracy"
```

**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("image-classification")

results = evaluator.compute(
    model_or_pipeline="google/vit-base-patch16-224",
    data="imagenet-1k",
    split="validation[:100]",
    metric="accuracy",
    input_column="image",
    label_column="label"
)
```
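
Input and label column names differ between datasets. A sketch using the smaller `beans` dataset, whose label column is `labels`; the checkpoint is an illustrative fine-tuned ViT:

```python
import evaluate

evaluator = evaluate.evaluator("image-classification")

results = evaluator.compute(
    model_or_pipeline="nateraw/vit-base-beans",  # illustrative checkpoint fine-tuned on beans
    data="beans",
    split="test[:50]",
    metric="accuracy",
    input_column="image",
    label_column="labels"   # note: "labels", not "label", in this dataset
)
```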

### Text Generation Evaluators

Multiple evaluators for different text generation tasks:

```python { .api }
class TextGenerationEvaluator(Evaluator):
    """Evaluator for general text generation tasks."""
    # Default metric: "word_count"

class Text2TextGenerationEvaluator(Evaluator):
    """Evaluator for text-to-text generation tasks."""
    # Default metric: "bleu"

class SummarizationEvaluator(Evaluator):
    """Evaluator for summarization tasks."""
    # Default metric: "rouge"

class TranslationEvaluator(Evaluator):
    """Evaluator for translation tasks."""
    # Default metric: "bleu"
```

**Usage Examples:**

```python
import evaluate

# Summarization (cnn_dailymail stores inputs in "article" and references in "highlights")
sum_evaluator = evaluate.evaluator("summarization")
results = sum_evaluator.compute(
    model_or_pipeline="facebook/bart-large-cnn",
    data="cnn_dailymail",
    subset="3.0.0",
    split="test[:100]",
    input_column="article",
    label_column="highlights"
)

# Translation
# Note: wmt14 stores each pair in a nested "translation" column; you may need
# to flatten it into input/label columns first (see the sketch below)
trans_evaluator = evaluate.evaluator("translation")
results = trans_evaluator.compute(
    model_or_pipeline="Helsinki-NLP/opus-mt-en-de",
    data="wmt14",
    subset="de-en",
    split="test[:100]"
)
```
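
If the source and target texts live in a nested column, as in `wmt14`, a small preprocessing step can map them to the columns the evaluator expects. A sketch assuming the `datasets` library and the wmt14 column layout:

```python
import evaluate
from datasets import load_dataset

# Each wmt14 row holds a {"de": ..., "en": ...} dict under "translation";
# flatten it into the default "text"/"label" columns before evaluating.
data = load_dataset("wmt14", "de-en", split="test[:100]")
data = data.map(lambda x: {"text": x["translation"]["en"], "label": x["translation"]["de"]})

trans_evaluator = evaluate.evaluator("translation")
results = trans_evaluator.compute(
    model_or_pipeline="Helsinki-NLP/opus-mt-en-de",
    data=data   # preprocessed Dataset object instead of a dataset name
)
```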

### Audio Evaluators

Evaluators for audio processing tasks:

```python { .api }
class AudioClassificationEvaluator(Evaluator):
    """Evaluator for audio classification tasks."""
    # Default metric: "accuracy"

class AutomaticSpeechRecognitionEvaluator(Evaluator):
    """Evaluator for automatic speech recognition tasks."""
    # Default metric: "wer"
```

**Usage Examples:**

```python
import evaluate

# Audio classification
audio_evaluator = evaluate.evaluator("audio-classification")
results = audio_evaluator.compute(
    model_or_pipeline="facebook/wav2vec2-base-960h",
    data="superb",
    subset="ks",
    split="test[:100]"
)

# Speech recognition
asr_evaluator = evaluate.evaluator("automatic-speech-recognition")
results = asr_evaluator.compute(
    model_or_pipeline="facebook/wav2vec2-base-960h",
    data="librispeech_asr",
    split="test.clean[:100]",
    metric="wer"
)
```

## Error Handling

Task evaluators may raise these exceptions:

- `KeyError`: Unknown task name provided to `evaluator()`
- `ImportError`: Missing transformers library (required for evaluators)
- `ValueError`: Invalid data format or model incompatibility
- `RuntimeError`: Evaluation pipeline errors

**Example:**

```python
import evaluate

try:
    evaluator = evaluate.evaluator("unknown-task")
except KeyError as e:
    print(f"Unsupported task: {e}")

try:
    # This will fail if transformers is not installed
    evaluator = evaluate.evaluator("text-classification")
except ImportError:
    print("Install transformers: pip install evaluate[transformers]")
```
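
Errors raised while `compute` runs, such as a column name that does not exist in the dataset, can be handled the same way. A sketch; the deliberately wrong column name is illustrative:

```python
import evaluate

evaluator = evaluate.evaluator("text-classification")

try:
    results = evaluator.compute(
        model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
        data="glue",
        subset="sst2",
        split="validation[:10]",
        input_column="nonexistent_column"  # deliberately wrong column name
    )
except (ValueError, RuntimeError) as e:
    print(f"Evaluation failed: {e}")
```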