HuggingFace community-driven open-source library of evaluation metrics for machine learning models and datasets.
npx @tessl/cli install tessl/pypi-evaluate@0.4.00
# Evaluate

A comprehensive evaluation library for machine learning models and datasets, providing implementations of dozens of popular metrics for tasks ranging from NLP to Computer Vision. The library features dataset-specific metrics, easy integration with any ML framework (NumPy/Pandas/PyTorch/TensorFlow/JAX), type checking for input validation, metric cards with descriptions and usage examples, and community-driven extensibility through the Hugging Face Hub.

## Package Information

- **Package Name**: evaluate
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install evaluate`

## Core Imports
```python
import evaluate
```

For specific components:

```python
from evaluate import load, combine, push_to_hub, save
from evaluate import Metric, Comparison, Measurement, EvaluationModule
from evaluate import evaluator
```
## Basic Usage

```python
import evaluate

# Load a metric from the Hub
accuracy = evaluate.load("accuracy")

# Add predictions and references
accuracy.add_batch(predictions=[0, 2, 1, 3], references=[0, 1, 2, 3])
accuracy.add(prediction=1, reference=1)

# Compute the final score (3 of the 5 predictions match their references)
score = accuracy.compute()
print(score)  # {'accuracy': 0.6}

# Combine multiple metrics
combined_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined_metrics.compute(predictions=[0, 1, 1], references=[0, 1, 0])
print(results)  # {'accuracy': 0.667, 'f1': 0.667, 'precision': 0.5, 'recall': 1.0}

# Use task-specific evaluators
task_evaluator = evaluate.evaluator("text-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:40]"
)
```
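Because metrics validate and convert their inputs, `compute()` also accepts framework-native containers rather than plain Python lists. A minimal sketch with NumPy arrays (PyTorch or TensorFlow tensors work the same way; the values are illustrative):

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

# NumPy arrays are accepted directly; no manual conversion to lists is needed
predictions = np.array([0, 1, 1, 0])
references = np.array([0, 1, 0, 0])

print(accuracy.compute(predictions=predictions, references=references))
# {'accuracy': 0.75}
```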
## Architecture

The evaluate library is built around several key components:

- **EvaluationModule**: Base class for all evaluation functionality (metrics, comparisons, measurements)
- **Evaluators**: Task-specific evaluation pipelines that integrate models, datasets, and metrics
- **Hub Integration**: Seamless loading from and pushing to Hugging Face Hub
- **Combined Evaluations**: Unified interface for running multiple evaluation modules together

The library provides both low-level evaluation primitives for custom workflows and high-level evaluators for common ML tasks, enabling standardized model evaluation and comparison across the machine learning ecosystem.

## Capabilities
### Core Evaluation

Core functionality for loading and using evaluation modules, including metrics, comparisons, and measurements. Provides the fundamental building blocks for model evaluation workflows.

```python { .api }
def load(path: str, config_name: Optional[str] = None, **kwargs) -> EvaluationModule:
    """Load evaluation modules from the Hub or local paths."""

def combine(evaluations: List[str], force_prefix: bool = False) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object."""

class EvaluationModule:
    """Base class for all evaluation modules."""
    def compute(self, *, predictions=None, references=None, **kwargs) -> Optional[dict]: ...
    def add_batch(self, *, predictions=None, references=None, **kwargs): ...
    def add(self, *, prediction=None, reference=None, **kwargs): ...
```
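Some modules expose multiple configurations selected through the `config_name` argument; the GLUE metric, for example, needs the subtask name. A brief sketch (values are illustrative):

```python
import evaluate

# config_name is the second positional argument of load()
mrpc_metric = evaluate.load("glue", "mrpc")
print(mrpc_metric.compute(predictions=[0, 1], references=[0, 1]))
# {'accuracy': 1.0, 'f1': 1.0}

# Typical evaluation loop: add_batch() per batch, compute() once at the end
accuracy = evaluate.load("accuracy")
for batch_preds, batch_refs in [([0, 1], [0, 1]), ([1, 0], [1, 1])]:
    accuracy.add_batch(predictions=batch_preds, references=batch_refs)
print(accuracy.compute())  # {'accuracy': 0.75}
```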
[Core Evaluation](./core-evaluation.md)
### Task-Specific Evaluators

High-level evaluators for common machine learning tasks that integrate models, datasets, and metrics into streamlined evaluation pipelines.

```python { .api }
def evaluator(task: str) -> Evaluator:
    """Factory function to create task-specific evaluators."""

class Evaluator:
    """Base class for task-specific evaluators."""
    def compute(self, model_or_pipeline, data, **kwargs) -> dict: ...

# Specialized evaluator classes
class TextClassificationEvaluator(Evaluator): ...
class ImageClassificationEvaluator(Evaluator): ...
class QuestionAnsweringEvaluator(Evaluator): ...
class TokenClassificationEvaluator(Evaluator): ...
class TextGenerationEvaluator(Evaluator): ...
class Text2TextGenerationEvaluator(Evaluator): ...
class SummarizationEvaluator(Evaluator): ...
class TranslationEvaluator(Evaluator): ...
class AutomaticSpeechRecognitionEvaluator(Evaluator): ...
class AudioClassificationEvaluator(Evaluator): ...
```
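A hedged sketch of evaluating an existing `transformers` pipeline on a dataset slice; the model and dataset names are illustrative, and `label_mapping` aligns the pipeline's string labels with the dataset's integer labels:

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
data = load_dataset("imdb", split="test[:100]")

task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)  # accuracy plus latency/throughput statistics
```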
[Task Evaluators](./task-evaluators.md)
### Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata.

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    metric_type: str,
    metric_value: float,
    **kwargs
): ...

def save(path_or_file, **data): ...
```
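A hedged sketch of the round trip: compute a score, persist it locally, then report it to a model card on the Hub. The repository and dataset identifiers are placeholders, the extra `dataset_name`/`metric_name` fields are passed through the `**kwargs` shown above, and pushing requires write access to the model repository:

```python
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[1, 0, 0, 1], references=[0, 1, 0, 1])

# Write the result and any extra metadata to a timestamped JSON file
evaluate.save("./results/", experiment="run-42", **result)

# Report the score on the model's card (placeholder identifiers)
evaluate.push_to_hub(
    model_id="username/my-model",
    task_type="text-classification",
    dataset_type="imdb",
    dataset_name="IMDb",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=result["accuracy"],
)
```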
[Hub Integration](./hub-integration.md)
### Module Discovery

Tools for discovering, listing, and inspecting available evaluation modules from the Hugging Face Hub and local sources.

```python { .api }
def list_evaluation_modules(
    module_type: Optional[str] = None,
    include_community: bool = True,
    with_details: bool = False
): ...

def inspect_evaluation_module(
    path: str,
    local_path: str,
    **kwargs
): ...
```
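A short sketch of both helpers; the local path is a placeholder, and the exact fields returned with `with_details=True` may vary by release:

```python
import evaluate

# List canonical (non-community) metrics together with their metadata
metrics = evaluate.list_evaluation_modules(
    module_type="metric",
    include_community=False,
    with_details=True,
)
print(len(metrics), metrics[0])

# Copy a module's script locally to inspect or customize it (placeholder path)
evaluate.inspect_evaluation_module("accuracy", local_path="./accuracy_module")
```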
[Module Discovery](./module-discovery.md)
### Evaluation Suites

Comprehensive evaluation workflows that run multiple tasks and datasets together for thorough model evaluation.

```python { .api }
class EvaluationSuite:
    """Multi-task, multi-dataset evaluation suite."""
    @staticmethod
    def load(path: str, **kwargs) -> EvaluationSuite: ...
    def run(self, model_or_pipeline) -> dict: ...
```
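A minimal sketch of the intended flow, assuming a suite definition already exists on the Hub or as a local script; the suite path and model name are placeholders:

```python
from evaluate import EvaluationSuite

# Load a suite definition (placeholder path)
suite = EvaluationSuite.load("username/my-evaluation-suite")

# Run every sub-task against a single model or pipeline and collect the results
results = suite.run("distilbert-base-uncased-finetuned-sst-2-english")
print(results)  # one result entry per sub-task in the suite
```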
[Evaluation Suites](./evaluation-suites.md)
### Utilities

Helper functions for logging control and Gradio integration for interactive evaluation experiences.

```python { .api }
# Logging utilities
def enable_progress_bar(): ...
def disable_progress_bar(): ...
def is_progress_bar_enabled() -> bool: ...

# Gradio integration
def launch_gradio_widget(evaluation_module): ...
```
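A brief sketch, assuming the progress-bar helpers are exposed at the package level as listed above (in current releases they also live under `evaluate.utils.logging`, and `launch_gradio_widget` is importable from `evaluate.utils` and needs the optional `gradio` dependency):

```python
import evaluate
from evaluate.utils import launch_gradio_widget

# Silence tqdm progress bars around a long run, then restore them
evaluate.disable_progress_bar()
print(evaluate.is_progress_bar_enabled())  # False
evaluate.enable_progress_bar()

# Launch an interactive Gradio demo for manually probing a module
module = evaluate.load("word_length", module_type="measurement")
launch_gradio_widget(module)
```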
[Utilities](./utilities.md)
## Types

```python { .api }
from typing import Dict, List, Optional, Union, Any
from datasets import Dataset

# Core evaluation types
class EvaluationModuleInfo:
    """Information about evaluation modules."""
    description: str
    citation: str
    features: Any
    inputs_description: str
    homepage: Optional[str]
    license: str
    codebase_urls: List[str]
    reference_urls: List[str]

class MetricInfo(EvaluationModuleInfo):
    """Information specific to metrics."""

class ComparisonInfo(EvaluationModuleInfo):
    """Information specific to comparisons."""

class MeasurementInfo(EvaluationModuleInfo):
    """Information specific to measurements."""

# Combined evaluation result type
CombinedResults = Dict[str, Union[float, Dict[str, float], List]]

# Configuration and download types (re-exported from `datasets`; stubs below describe their shape)
from datasets import DownloadConfig, DownloadMode
from datasets.utils.version import Version

# Download configuration for Hub modules
class DownloadConfig:
    """Configuration for downloading modules from the Hub."""
    cache_dir: Optional[str]
    force_download: bool
    resume_download: bool
    use_auth_token: Optional[str]

# Download mode enumeration
class DownloadMode:
    """Download behavior for cached modules."""
    REUSE_DATASET_IF_EXISTS: str
    REUSE_CACHE_IF_EXISTS: str
    FORCE_REDOWNLOAD: str

# Version handling for modules
class Version:
    """Version specification for modules."""
    def __init__(self, version_str: str): ...
    def __str__(self) -> str: ...
```
```