HuggingFace community-driven open-source library of evaluation metrics for machine learning models and datasets.
npx @tessl/cli install tessl/pypi-evaluate@0.4.00
# Evaluate

A comprehensive evaluation library for machine learning models and datasets, providing implementations of dozens of popular metrics for tasks ranging from NLP to Computer Vision. The library features dataset-specific metrics, easy integration with any ML framework (NumPy/Pandas/PyTorch/TensorFlow/JAX), type checking for input validation, metric cards with descriptions and usage examples, and community-driven extensibility through the Hugging Face Hub.

## Package Information

- **Package Name**: evaluate
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install evaluate`

## Core Imports
```python
import evaluate
```

For specific components:

```python
from evaluate import load, combine, push_to_hub, save
from evaluate import Metric, Comparison, Measurement, EvaluationModule
from evaluate import evaluator
```
## Basic Usage

```python
import evaluate

# Load a metric from the Hub
accuracy = evaluate.load("accuracy")

# Add predictions and references
accuracy.add_batch(predictions=[0, 2, 1, 3], references=[0, 1, 2, 3])
accuracy.add(prediction=1, reference=1)

# Compute the final score (3 of the 5 predictions match their references)
score = accuracy.compute()
print(score)  # {'accuracy': 0.6}

# Combine multiple metrics
combined_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined_metrics.compute(predictions=[0, 1, 1], references=[0, 1, 0])
print(results)  # {'accuracy': 0.667, 'f1': 0.667, 'precision': 0.5, 'recall': 1.0}

# Use task-specific evaluators
task_evaluator = evaluate.evaluator("text-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:40]"
)
```
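Because metrics validate and convert their inputs, `compute()` also accepts framework-native containers rather than plain Python lists. A minimal sketch with NumPy arrays (PyTorch or TensorFlow tensors work the same way; the values are illustrative):

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

# NumPy arrays are accepted directly; no manual conversion to lists is needed
predictions = np.array([0, 1, 1, 0])
references = np.array([0, 1, 0, 0])

print(accuracy.compute(predictions=predictions, references=references))
# {'accuracy': 0.75}
```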
## Architecture

The evaluate library is built around several key components:

- **EvaluationModule**: Base class for all evaluation functionality (metrics, comparisons, measurements)
- **Evaluators**: Task-specific evaluation pipelines that integrate models, datasets, and metrics
- **Hub Integration**: Seamless loading from and pushing to Hugging Face Hub
- **Combined Evaluations**: Unified interface for running multiple evaluation modules together

The library provides both low-level evaluation primitives for custom workflows and high-level evaluators for common ML tasks, enabling standardized model evaluation and comparison across the machine learning ecosystem.

## Capabilities
### Core Evaluation

Core functionality for loading and using evaluation modules, including metrics, comparisons, and measurements. Provides the fundamental building blocks for model evaluation workflows.

```python { .api }
def load(path: str, config_name: Optional[str] = None, **kwargs) -> EvaluationModule:
    """Load evaluation modules from the Hub or local paths."""

def combine(evaluations: List[str], force_prefix: bool = False) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object."""

class EvaluationModule:
    """Base class for all evaluation modules."""
    def compute(self, *, predictions=None, references=None, **kwargs) -> Optional[dict]: ...
    def add_batch(self, *, predictions=None, references=None, **kwargs): ...
    def add(self, *, prediction=None, reference=None, **kwargs): ...
```
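Some modules expose multiple configurations selected through the `config_name` argument; the GLUE metric, for example, needs the subtask name. A brief sketch (values are illustrative):

```python
import evaluate

# config_name is the second positional argument of load()
mrpc_metric = evaluate.load("glue", "mrpc")
print(mrpc_metric.compute(predictions=[0, 1], references=[0, 1]))
# {'accuracy': 1.0, 'f1': 1.0}

# Typical evaluation loop: add_batch() per batch, compute() once at the end
accuracy = evaluate.load("accuracy")
for batch_preds, batch_refs in [([0, 1], [0, 1]), ([1, 0], [1, 1])]:
    accuracy.add_batch(predictions=batch_preds, references=batch_refs)
print(accuracy.compute())  # {'accuracy': 0.75}
```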
[Core Evaluation](./core-evaluation.md)
### Task-Specific Evaluators

High-level evaluators for common machine learning tasks that integrate models, datasets, and metrics into streamlined evaluation pipelines.

```python { .api }
def evaluator(task: str) -> Evaluator:
    """Factory function to create task-specific evaluators."""

class Evaluator:
    """Base class for task-specific evaluators."""
    def compute(self, model_or_pipeline, data, **kwargs) -> dict: ...

# Specialized evaluator classes
class TextClassificationEvaluator(Evaluator): ...
class ImageClassificationEvaluator(Evaluator): ...
class QuestionAnsweringEvaluator(Evaluator): ...
class TokenClassificationEvaluator(Evaluator): ...
class TextGenerationEvaluator(Evaluator): ...
class Text2TextGenerationEvaluator(Evaluator): ...
class SummarizationEvaluator(Evaluator): ...
class TranslationEvaluator(Evaluator): ...
class AutomaticSpeechRecognitionEvaluator(Evaluator): ...
class AudioClassificationEvaluator(Evaluator): ...
```
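A hedged sketch of evaluating an existing `transformers` pipeline on a dataset slice; the model and dataset names are illustrative, and `label_mapping` aligns the pipeline's string labels with the dataset's integer labels:

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
data = load_dataset("imdb", split="test[:100]")

task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)  # accuracy plus latency/throughput statistics
```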
[Task Evaluators](./task-evaluators.md)
### Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata.

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    metric_type: str,
    metric_value: float,
    **kwargs
): ...

def save(path_or_file, **data): ...
```
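A hedged sketch of the round trip: compute a score, persist it locally, then report it to a model card on the Hub. The repository and dataset identifiers are placeholders, the extra `dataset_name`/`metric_name` fields are passed through the `**kwargs` shown above, and pushing requires write access to the model repository:

```python
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[1, 0, 0, 1], references=[0, 1, 0, 1])

# Write the result and any extra metadata to a timestamped JSON file
evaluate.save("./results/", experiment="run-42", **result)

# Report the score on the model's card (placeholder identifiers)
evaluate.push_to_hub(
    model_id="username/my-model",
    task_type="text-classification",
    dataset_type="imdb",
    dataset_name="IMDb",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=result["accuracy"],
)
```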
[Hub Integration](./hub-integration.md)
### Module Discovery

Tools for discovering, listing, and inspecting available evaluation modules from the Hugging Face Hub and local sources.

```python { .api }
def list_evaluation_modules(
    module_type: Optional[str] = None,
    include_community: bool = True,
    with_details: bool = False
): ...

def inspect_evaluation_module(
    path: str,
    local_path: str,
    **kwargs
): ...
```
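A short sketch of both helpers; the local path is a placeholder, and the exact fields returned with `with_details=True` may vary by release:

```python
import evaluate

# List canonical (non-community) metrics together with their metadata
metrics = evaluate.list_evaluation_modules(
    module_type="metric",
    include_community=False,
    with_details=True,
)
print(len(metrics), metrics[0])

# Copy a module's script locally to inspect or customize it (placeholder path)
evaluate.inspect_evaluation_module("accuracy", local_path="./accuracy_module")
```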
[Module Discovery](./module-discovery.md)
### Evaluation Suites

Comprehensive evaluation workflows that run multiple tasks and datasets together for thorough model evaluation.

```python { .api }
class EvaluationSuite:
    """Multi-task, multi-dataset evaluation suite."""
    @staticmethod
    def load(path: str, **kwargs) -> EvaluationSuite: ...
    def run(self, model_or_pipeline) -> dict: ...
```
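A minimal sketch of the intended flow, assuming a suite definition already exists on the Hub or as a local script; the suite path and model name are placeholders:

```python
from evaluate import EvaluationSuite

# Load a suite definition (placeholder path)
suite = EvaluationSuite.load("username/my-evaluation-suite")

# Run every sub-task against a single model or pipeline and collect the results
results = suite.run("distilbert-base-uncased-finetuned-sst-2-english")
print(results)  # one result entry per sub-task in the suite
```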
[Evaluation Suites](./evaluation-suites.md)
### Utilities

Helper functions for logging control and Gradio integration for interactive evaluation experiences.

```python { .api }
# Logging utilities
def enable_progress_bar(): ...
def disable_progress_bar(): ...
def is_progress_bar_enabled() -> bool: ...

# Gradio integration
def launch_gradio_widget(evaluation_module): ...
```
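A brief sketch, assuming the progress-bar helpers are exposed at the package level as listed above (in current releases they also live under `evaluate.utils.logging`, and `launch_gradio_widget` is importable from `evaluate.utils` and needs the optional `gradio` dependency):

```python
import evaluate
from evaluate.utils import launch_gradio_widget

# Silence tqdm progress bars around a long run, then restore them
evaluate.disable_progress_bar()
print(evaluate.is_progress_bar_enabled())  # False
evaluate.enable_progress_bar()

# Launch an interactive Gradio demo for manually probing a module
module = evaluate.load("word_length", module_type="measurement")
launch_gradio_widget(module)
```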
[Utilities](./utilities.md)
## Types

```python { .api }
from typing import Dict, List, Optional, Union, Any
from datasets import Dataset

# Core evaluation types
class EvaluationModuleInfo:
    """Information about evaluation modules."""
    description: str
    citation: str
    features: Any
    inputs_description: str
    homepage: Optional[str]
    license: str
    codebase_urls: List[str]
    reference_urls: List[str]

class MetricInfo(EvaluationModuleInfo):
    """Information specific to metrics."""

class ComparisonInfo(EvaluationModuleInfo):
    """Information specific to comparisons."""

class MeasurementInfo(EvaluationModuleInfo):
    """Information specific to measurements."""

# Combined evaluation result type
CombinedResults = Dict[str, Union[float, Dict[str, float], List]]

# Configuration and download types (re-exported from `datasets`; stubs below describe their shape)
from datasets import DownloadConfig, DownloadMode
from datasets.utils.version import Version

# Download configuration for Hub modules
class DownloadConfig:
    """Configuration for downloading modules from the Hub."""
    cache_dir: Optional[str]
    force_download: bool
    resume_download: bool
    use_auth_token: Optional[str]

# Download mode enumeration
class DownloadMode:
    """Download behavior for cached modules."""
    REUSE_DATASET_IF_EXISTS: str
    REUSE_CACHE_IF_EXISTS: str
    FORCE_REDOWNLOAD: str

# Version handling for modules
class Version:
    """Version specification for modules."""
    def __init__(self, version_str: str): ...
    def __str__(self) -> str: ...
```
```