# Task Evaluators

Task-specific evaluators provide high-level evaluation pipelines that integrate models, datasets, and metrics for common machine learning tasks. They simplify the evaluation process by handling data loading, preprocessing, inference, and metric computation in a unified workflow.

## Capabilities

### Evaluator Factory Function

The `evaluator` function is the primary way to create task-specific evaluators:
```python { .api }
def evaluator(task: str) -> Evaluator:
    """Factory function to create task-specific evaluators.

    Args:
        task: Task type string specifying which evaluator to create.
            Must be one of the supported task types.

    Returns:
        Task-specific evaluator instance with default metric configured.

    Raises:
        ImportError: If transformers is not installed (required for evaluators)
        KeyError: If task type is not supported
    """
```
Supported tasks:

- `"text-classification"` (alias: `"sentiment-analysis"`)
- `"image-classification"`
- `"question-answering"`
- `"token-classification"`
- `"text-generation"`
- `"text2text-generation"`
- `"summarization"`
- `"translation"`
- `"automatic-speech-recognition"`
- `"audio-classification"`

**Usage Example:**
```python
import evaluate

# Create task-specific evaluators
text_evaluator = evaluate.evaluator("text-classification")
qa_evaluator = evaluate.evaluator("question-answering")
img_evaluator = evaluate.evaluator("image-classification")
```
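Because `"sentiment-analysis"` is documented above as an alias for `"text-classification"`, both strings should resolve to the same evaluator class; a quick sketch to confirm:

```python
import evaluate

# "sentiment-analysis" is listed above as an alias for "text-classification",
# so both calls should return the same evaluator type.
clf_evaluator = evaluate.evaluator("text-classification")
alias_evaluator = evaluate.evaluator("sentiment-analysis")

print(type(clf_evaluator).__name__)                   # TextClassificationEvaluator
print(type(clf_evaluator) is type(alias_evaluator))   # True
```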
### Base Evaluator Class

All task evaluators inherit from the base `Evaluator` class:

```python { .api }
class Evaluator:
    """Abstract base class for task-specific evaluators."""

    def compute(
        self,
        model_or_pipeline,
        data,
        subset: Optional[str] = None,
        split: Optional[str] = None,
        metric: Optional[Union[str, EvaluationModule]] = None,
        tokenizer: Optional[str] = None,
        feature_extractor: Optional[str] = None,
        strategy: str = "simple",
        confidence_level: float = 0.95,
        n_resamples: int = 9999,
        device: Optional[int] = None,
        random_state: Optional[int] = None,
        input_column: str = "text",
        label_column: str = "label",
        label_mapping: Optional[Dict[str, Number]] = None
    ) -> Dict[str, float]

    def load_data(
        self,
        data: Union[str, Dataset],
        subset: Optional[str] = None,
        split: Optional[str] = None
    ) -> Dataset

    def prepare_data(
        self,
        data: Dataset,
        input_column: str,
        label_column: str,
        *args,
        **kwargs
    ) -> Dataset

    def prepare_pipeline(
        self,
        model_or_pipeline,
        tokenizer: Optional[str] = None,
        feature_extractor: Optional[str] = None,
        device: Optional[int] = None
    )

    def prepare_metric(self, metric: Union[str, EvaluationModule]) -> EvaluationModule
```
**Usage Example:**

```python
import evaluate

# Create evaluator
evaluator = evaluate.evaluator("text-classification")

# Evaluate a model on a dataset
results = evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:100]",
    metric="accuracy",
    input_column="text",
    label_column="label"
)

print(results)  # e.g. {'accuracy': 0.85, ...}
```
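Beyond the simple strategy shown above, `compute()` also accepts a pre-loaded `Dataset` for `data` and supports `strategy="bootstrap"` for confidence intervals. A minimal sketch, assuming the `imdb` dataset and the `distilbert-base-uncased-finetuned-sst-2-english` checkpoint (neither is part of the API reference above):

```python
import evaluate
from datasets import load_dataset

task_evaluator = evaluate.evaluator("text-classification")

# `data` can be a pre-loaded Dataset object instead of a dataset name.
data = load_dataset("imdb", split="test[:100]")

# strategy="bootstrap" resamples the predictions to estimate confidence
# intervals (confidence_level and n_resamples are documented above).
results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    metric="accuracy",
    strategy="bootstrap",
    confidence_level=0.95,
    n_resamples=200,
    random_state=0,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline label strings to dataset integers
)
print(results)
```

With the bootstrap strategy, each metric is reported with a confidence interval and standard error rather than a single point estimate, at the cost of extra resampling time.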
### Text Classification Evaluator

Evaluates text classification models using accuracy as the default metric:

```python { .api }
class TextClassificationEvaluator(Evaluator):
    """Evaluator for text classification tasks."""
    # Default metric: "accuracy"
```
**Usage Example:**

```python
import evaluate
from transformers import pipeline

evaluator = evaluate.evaluator("text-classification")

# Evaluate with a Transformers pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

results = evaluator.compute(
    model_or_pipeline=classifier,
    data="glue",
    subset="sst2",
    split="validation[:100]",
    metric="accuracy",
    input_column="sentence",  # SST-2 stores its text in the "sentence" column
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}  # map the pipeline's string labels to the dataset's integer labels
)
```
### Question Answering Evaluator

Evaluates question answering models using the SQuAD metric by default:

```python { .api }
class QuestionAnsweringEvaluator(Evaluator):
    """Evaluator for question answering tasks."""
    # Default metric: "squad"
```
**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("question-answering")

results = evaluator.compute(
    model_or_pipeline="distilbert-base-cased-distilled-squad",
    data="squad",
    split="validation[:100]",
    metric="squad"
)

print(results)  # e.g. {'exact_match': 78.5, 'f1': 86.2, ...}
```
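For datasets with unanswerable questions, the same evaluator can be used in SQuAD v2 mode. A minimal sketch, assuming a `squad_v2_format` flag on `compute()` as well as the `squad_v2` dataset/metric and the `deepset/roberta-base-squad2` checkpoint (none of these appear in the reference above):

```python
import evaluate

qa_evaluator = evaluate.evaluator("question-answering")

# Assumes the evaluator exposes a squad_v2_format flag for datasets that
# contain unanswerable questions.
results = qa_evaluator.compute(
    model_or_pipeline="deepset/roberta-base-squad2",
    data="squad_v2",
    split="validation[:100]",
    metric="squad_v2",
    squad_v2_format=True,
)
print(results)
```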
### Token Classification Evaluator

Evaluates named entity recognition and other token classification tasks:

```python { .api }
class TokenClassificationEvaluator(Evaluator):
    """Evaluator for token classification tasks."""
    # Default metric: "seqeval"
```
**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("token-classification")

results = evaluator.compute(
    model_or_pipeline="dbmdz/bert-large-cased-finetuned-conll03-english",
    data="conll2003",
    split="test[:100]",
    metric="seqeval"
)
```
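Token classification datasets are usually pre-tokenized, so the word and tag columns can be named explicitly. A minimal sketch, assuming conll2003's `tokens`/`ner_tags` columns and a `join_by` argument for rejoining words into a single string (the latter is not shown in the reference above):

```python
import evaluate

evaluator = evaluate.evaluator("token-classification")

# conll2003 stores pre-split words in "tokens" and tag ids in "ner_tags";
# join_by (assumed here) controls how the words are joined back into one
# string before being passed to the pipeline.
results = evaluator.compute(
    model_or_pipeline="dbmdz/bert-large-cased-finetuned-conll03-english",
    data="conll2003",
    split="test[:100]",
    metric="seqeval",
    input_column="tokens",
    label_column="ner_tags",
    join_by=" ",
)
```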
### Image Classification Evaluator

Evaluates image classification models:

```python { .api }
class ImageClassificationEvaluator(Evaluator):
    """Evaluator for image classification tasks."""
    # Default metric: "accuracy"
```
**Usage Example:**

```python
import evaluate

evaluator = evaluate.evaluator("image-classification")

results = evaluator.compute(
    model_or_pipeline="google/vit-base-patch16-224",
    data="imagenet-1k",
    split="validation[:100]",
    metric="accuracy",
    input_column="image",
    label_column="label"
)
```
### Text Generation Evaluators

Multiple evaluators for different text generation tasks:

```python { .api }
class TextGenerationEvaluator(Evaluator):
    """Evaluator for general text generation tasks."""
    # Default metric: "word_count"

class Text2TextGenerationEvaluator(Evaluator):
    """Evaluator for text-to-text generation tasks."""
    # Default metric: "bleu"

class SummarizationEvaluator(Evaluator):
    """Evaluator for summarization tasks."""
    # Default metric: "rouge"

class TranslationEvaluator(Evaluator):
    """Evaluator for translation tasks."""
    # Default metric: "bleu"
```
**Usage Examples:**

```python
import evaluate
from datasets import load_dataset

# Summarization: CNN/DailyMail keeps inputs in "article" and reference
# summaries in "highlights", so point the evaluator at those columns
sum_evaluator = evaluate.evaluator("summarization")
results = sum_evaluator.compute(
    model_or_pipeline="facebook/bart-large-cnn",
    data="cnn_dailymail",
    subset="3.0.0",
    split="test[:100]",
    input_column="article",
    label_column="highlights"
)

# Translation: wmt14 nests both languages in a "translation" column,
# so flatten it into the default "text"/"label" columns first
trans_evaluator = evaluate.evaluator("translation")
data = load_dataset("wmt14", "de-en", split="test[:100]")
data = data.map(lambda x: {"text": x["translation"]["en"], "label": x["translation"]["de"]})
results = trans_evaluator.compute(
    model_or_pipeline="Helsinki-NLP/opus-mt-en-de",
    data=data
)
```
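The plain text generation evaluator only needs an input column and scores the generated continuations with its default `word_count` metric. A minimal sketch, assuming the `gpt2` checkpoint and the `rotten_tomatoes` dataset's `text` column (both are illustrative choices, not part of the reference above):

```python
import evaluate

gen_evaluator = evaluate.evaluator("text-generation")

# Uses the default "word_count" metric, which summarizes the generated text;
# the model and dataset here are illustrative, any causal LM and any dataset
# with a text column should work the same way.
results = gen_evaluator.compute(
    model_or_pipeline="gpt2",
    data="rotten_tomatoes",
    split="test[:10]",
    input_column="text",
)
print(results)
```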
### Audio Evaluators

Evaluators for audio processing tasks:

```python { .api }
class AudioClassificationEvaluator(Evaluator):
    """Evaluator for audio classification tasks."""
    # Default metric: "accuracy"

class AutomaticSpeechRecognitionEvaluator(Evaluator):
    """Evaluator for automatic speech recognition tasks."""
    # Default metric: "wer"
```
**Usage Examples:**

```python
import evaluate

# Audio classification (keyword spotting on superb/ks; a label_mapping may be
# needed when the pipeline returns string labels while the dataset stores
# integer labels)
audio_evaluator = evaluate.evaluator("audio-classification")
results = audio_evaluator.compute(
    model_or_pipeline="superb/wav2vec2-base-superb-ks",
    data="superb",
    subset="ks",
    split="test[:100]"
)

# Speech recognition (set input_column / label_column if the dataset's audio
# and transcript columns differ from the evaluator's defaults)
asr_evaluator = evaluate.evaluator("automatic-speech-recognition")
results = asr_evaluator.compute(
    model_or_pipeline="facebook/wav2vec2-base-960h",
    data="librispeech_asr",
    split="test.clean[:100]",
    metric="wer"
)
```
## Error Handling

Task evaluators may raise these exceptions:

- `KeyError`: Unknown task name provided to `evaluator()`
- `ImportError`: Missing transformers library (required for evaluators)
- `ValueError`: Invalid data format or model incompatibility
- `RuntimeError`: Evaluation pipeline errors

**Example:**
```python
import evaluate

try:
    evaluator = evaluate.evaluator("unknown-task")
except KeyError as e:
    print(f"Unsupported task: {e}")

try:
    # This will fail if transformers is not installed
    evaluator = evaluate.evaluator("text-classification")
except ImportError:
    print("Install transformers: pip install transformers")
```