# Core Evaluation

The core evaluation functionality provides the fundamental building blocks for model evaluation: loading evaluation modules, working with metrics, comparisons, and measurements, and combining multiple evaluations into unified workflows.

## Capabilities

### Loading Evaluation Modules

The primary way to access evaluation functionality is through the `load` function, which retrieves evaluation modules from the Hugging Face Hub or from local paths.

```python { .api }
def load(
    path: str,
    config_name: Optional[str] = None,
    module_type: Optional[str] = None,
    process_id: int = 0,
    num_process: int = 1,
    cache_dir: Optional[str] = None,
    experiment_id: Optional[str] = None,
    keep_in_memory: bool = False,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[DownloadMode] = None,
    revision: Optional[Union[str, Version]] = None,
    **init_kwargs
) -> EvaluationModule:
    """Load an EvaluationModule (metric, comparison, or measurement).

    Args:
        path: Path to evaluation module or module identifier from Hub
        config_name: Configuration name for the module (e.g., GLUE subset)
        module_type: Type of module ('metric', 'comparison', 'measurement')
        process_id: Process ID for distributed evaluation (0-based)
        num_process: Total number of processes in distributed setup
        cache_dir: Directory for caching downloaded modules
        experiment_id: Unique identifier for experiment tracking
        keep_in_memory: Store all data in memory (not for distributed)
        download_config: Configuration for downloading from Hub
        download_mode: How to handle existing cached data
        revision: Specific revision/version to load
        **init_kwargs: Additional initialization arguments for the module
    """
```

**Usage Example:**

```python
import evaluate

# Load popular metrics
accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Load with a specific configuration (e.g., a GLUE subset)
glue_mrpc = evaluate.load("glue", config_name="mrpc")

# Load a local evaluation module
custom_metric = evaluate.load("./path/to/custom_metric.py")
```
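
The `process_id`, `num_process`, and `experiment_id` arguments support distributed evaluation, where each process scores its own shard of the data. A minimal sketch, assuming two processes; the rank value and the `"my_eval_run"` identifier are placeholders you would normally get from your launcher:

```python
import evaluate

rank = 0        # this process's 0-based rank (placeholder)
world_size = 2  # total number of processes (placeholder)

# Every process loads the same module with the same experiment_id so their
# partial results end up in a shared cache.
accuracy = evaluate.load(
    "accuracy",
    num_process=world_size,
    process_id=rank,
    experiment_id="my_eval_run",
)

# Each process adds only its own shard of predictions...
accuracy.add_batch(predictions=[1, 0], references=[1, 1])

# ...and compute() gathers the shards; the final result is returned on the
# main process.
result = accuracy.compute()
```
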
### Base Evaluation Module

All evaluation functionality inherits from the `EvaluationModule` base class, providing a consistent API across metrics, comparisons, and measurements.

```python { .api }
class EvaluationModule:
    """Base class for all evaluation modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Optional[Dict[str, Any]]:
        """Compute evaluation results from accumulated predictions and references."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add a batch of predictions and references."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add a single prediction and reference pair."""

    def download_and_prepare(
        self,
        download_config: Optional[DownloadConfig] = None,
        dl_manager: Optional[DownloadManager] = None
    ):
        """Download and prepare the evaluation module."""

    # Properties
    @property
    def name(self) -> str:
        """Name of the evaluation module."""

    @property
    def description(self) -> str:
        """Description of what the module evaluates."""

    @property
    def citation(self) -> str:
        """Citation information for the evaluation method."""

    @property
    def features(self) -> Features:
        """Expected input features schema."""

    @property
    def inputs_description(self) -> str:
        """Description of expected inputs."""

    @property
    def homepage(self) -> Optional[str]:
        """Homepage URL for the evaluation method."""

    @property
    def license(self) -> str:
        """License information."""

    @property
    def codebase_urls(self) -> List[str]:
        """URLs to relevant codebases."""

    @property
    def reference_urls(self) -> List[str]:
        """URLs to reference papers or documentation."""
```

**Usage Example:**

```python
import evaluate

# Load and use a metric
accuracy = evaluate.load("accuracy")

# Add individual predictions
accuracy.add(prediction=1, reference=1)
accuracy.add(prediction=0, reference=1)

# Add batch predictions
accuracy.add_batch(
    predictions=[1, 0, 1, 1],
    references=[1, 1, 0, 1]
)

# Compute final results over everything added so far
result = accuracy.compute()
print(result)  # {'accuracy': 0.5}  (3 of the 6 predictions match)

# Access module information
print(accuracy.description)
print(accuracy.citation)
```
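
For one-off evaluations, `compute` can also take the inputs directly instead of relying on previously accumulated `add`/`add_batch` calls. A minimal sketch:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# Pass predictions and references in a single call; nothing needs to be
# accumulated beforehand.
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```
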
### Specialized Evaluation Classes

The library provides specialized classes for different types of evaluation:

```python { .api }
class Metric(EvaluationModule):
    """Specialized evaluation module for metrics."""

class Comparison(EvaluationModule):
    """Specialized evaluation module for comparisons between models."""

class Measurement(EvaluationModule):
    """Specialized evaluation module for measurements."""
```

These classes inherit all functionality from `EvaluationModule` but may have specialized behavior for their specific evaluation type.
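
As an illustration of the module types, the sketch below loads a measurement and a comparison via `module_type`. The `word_length` and `mcnemar` module names and their input fields are assumptions based on modules published on the Hub; check the Hub pages for the exact signatures.

```python
import evaluate

# Measurement: describes properties of a dataset rather than model quality.
# Assumes a "word_length" measurement that accepts a `data` argument.
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "model evaluation workflows"]))

# Comparison: contrasts two models' predictions on the same references.
# Assumes a "mcnemar" comparison with predictions1/predictions2/references inputs.
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(
    predictions1=[1, 0, 1, 1],
    predictions2=[1, 1, 0, 1],
    references=[1, 0, 0, 1],
))
```
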

### Combining Multiple Evaluations

The `combine` function allows you to run multiple evaluation modules together as a single unit:

```python { .api }
def combine(
    evaluations: Union[List[Union[str, EvaluationModule]], Dict[str, Union[str, EvaluationModule]]],
    force_prefix: bool = False
) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object.

    Args:
        evaluations: List or dict of evaluation modules. Can be module names (str)
            or loaded EvaluationModule objects. If a dict, keys are used as
            prefixes for results.
        force_prefix: If True, all results are prefixed with module names
    """
```

```python { .api }
class CombinedEvaluations:
    """Container for multiple evaluation modules."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add prediction/reference to all contained modules."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add batch predictions/references to all contained modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Dict[str, Any]:
        """Compute results from all contained modules."""
```

**Usage Example:**

```python
import evaluate

# Combine multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# Use like a single metric
combined.add_batch(
    predictions=[1, 0, 1, 0],
    references=[1, 1, 0, 0]
)

results = combined.compute()
print(results)
# {
#     'accuracy': 0.5,
#     'f1': 0.5,
#     'precision': 0.5,
#     'recall': 0.5
# }

# Combine with custom names to avoid key conflicts: dict keys become result prefixes
combined_with_prefix = evaluate.combine(
    {
        "acc": evaluate.load("accuracy"),
        "f1": evaluate.load("f1"),
    },
    force_prefix=True,
)
```

## Error Handling

Evaluation modules may raise the following exceptions:

- `FileNotFoundError`: Evaluation module not found on the Hub or at the given local path
- `ValueError`: Invalid input data or configuration
- `TypeError`: Incorrect data types for predictions or references
- `ImportError`: Missing required dependencies for specific metrics
- `ConnectionError`: Network issues when downloading from the Hub

**Example:**

```python
import evaluate

try:
    metric = evaluate.load("nonexistent_metric")
except FileNotFoundError:
    print("Metric not found")

try:
    accuracy = evaluate.load("accuracy")
    accuracy.compute(predictions=[1, 2], references=[1])  # Mismatched lengths
except ValueError as e:
    print(f"Input validation error: {e}")
```