# Core Evaluation

The core evaluation functionality provides the fundamental building blocks for model evaluation: loading evaluation modules, working with metrics, comparisons, and measurements, and combining multiple evaluations into unified workflows.

## Capabilities

### Loading Evaluation Modules

The primary way to access evaluation functionality is through the `load` function, which retrieves evaluation modules from the Hugging Face Hub or from local paths.

```python { .api }
def load(
    path: str,
    config_name: Optional[str] = None,
    module_type: Optional[str] = None,
    process_id: int = 0,
    num_process: int = 1,
    cache_dir: Optional[str] = None,
    experiment_id: Optional[str] = None,
    keep_in_memory: bool = False,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[DownloadMode] = None,
    revision: Optional[Union[str, Version]] = None,
    **init_kwargs
) -> EvaluationModule:
    """Load an EvaluationModule (metric, comparison, or measurement).

    Args:
        path: Path to evaluation module or module identifier from Hub
        config_name: Configuration name for the module (e.g., GLUE subset)
        module_type: Type of module ('metric', 'comparison', 'measurement')
        process_id: Process ID for distributed evaluation (0-based)
        num_process: Total number of processes in distributed setup
        cache_dir: Directory for caching downloaded modules
        experiment_id: Unique identifier for experiment tracking
        keep_in_memory: Store all data in memory (not for distributed)
        download_config: Configuration for downloading from Hub
        download_mode: How to handle existing cached data
        revision: Specific revision/version to load
        **init_kwargs: Additional initialization arguments for the module
    """
```

**Usage Example:**

```python
import evaluate

# Load popular metrics
accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Load with a specific configuration (e.g., a GLUE subset)
glue_mrpc = evaluate.load("glue", config_name="mrpc")

# Load a local evaluation module
custom_metric = evaluate.load("./path/to/custom_metric.py")
```
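
The `process_id`, `num_process`, and `experiment_id` arguments support distributed evaluation, where each process scores its own shard of the data. A minimal sketch, assuming two processes; the rank value and the `"my_eval_run"` identifier are placeholders you would normally get from your launcher:

```python
import evaluate

rank = 0        # this process's 0-based rank (placeholder)
world_size = 2  # total number of processes (placeholder)

# Every process loads the same module with the same experiment_id so their
# partial results end up in a shared cache.
accuracy = evaluate.load(
    "accuracy",
    num_process=world_size,
    process_id=rank,
    experiment_id="my_eval_run",
)

# Each process adds only its own shard of predictions...
accuracy.add_batch(predictions=[1, 0], references=[1, 1])

# ...and compute() gathers the shards; the final result is returned on the
# main process.
result = accuracy.compute()
```
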
### Base Evaluation Module

All evaluation functionality inherits from the `EvaluationModule` base class, providing a consistent API across metrics, comparisons, and measurements.

```python { .api }
class EvaluationModule:
    """Base class for all evaluation modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Optional[Dict[str, Any]]:
        """Compute evaluation results from accumulated predictions and references."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add a batch of predictions and references."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add a single prediction and reference pair."""

    def download_and_prepare(
        self,
        download_config: Optional[DownloadConfig] = None,
        dl_manager: Optional[DownloadManager] = None
    ):
        """Download and prepare the evaluation module."""

    # Properties
    @property
    def name(self) -> str:
        """Name of the evaluation module."""

    @property
    def description(self) -> str:
        """Description of what the module evaluates."""

    @property
    def citation(self) -> str:
        """Citation information for the evaluation method."""

    @property
    def features(self) -> Features:
        """Expected input features schema."""

    @property
    def inputs_description(self) -> str:
        """Description of expected inputs."""

    @property
    def homepage(self) -> Optional[str]:
        """Homepage URL for the evaluation method."""

    @property
    def license(self) -> str:
        """License information."""

    @property
    def codebase_urls(self) -> List[str]:
        """URLs to relevant codebases."""

    @property
    def reference_urls(self) -> List[str]:
        """URLs to reference papers or documentation."""
```

**Usage Example:**

```python
import evaluate

# Load and use a metric
accuracy = evaluate.load("accuracy")

# Add individual predictions
accuracy.add(prediction=1, reference=1)
accuracy.add(prediction=0, reference=1)

# Add batch predictions
accuracy.add_batch(
    predictions=[1, 0, 1, 1],
    references=[1, 1, 0, 1]
)

# Compute final results over everything added so far
result = accuracy.compute()
print(result)  # {'accuracy': 0.5}  (3 of the 6 predictions match)

# Access module information
print(accuracy.description)
print(accuracy.citation)
```
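
For one-off evaluations, `compute` can also take the inputs directly instead of relying on previously accumulated `add`/`add_batch` calls. A minimal sketch:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# Pass predictions and references in a single call; nothing needs to be
# accumulated beforehand.
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```
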
### Specialized Evaluation Classes

The library provides specialized classes for different types of evaluation:

```python { .api }
class Metric(EvaluationModule):
    """Specialized evaluation module for metrics."""

class Comparison(EvaluationModule):
    """Specialized evaluation module for comparisons between models."""

class Measurement(EvaluationModule):
    """Specialized evaluation module for measurements."""
```

These classes inherit all functionality from `EvaluationModule` but may have specialized behavior for their specific evaluation type.
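
As an illustration of the module types, the sketch below loads a measurement and a comparison via `module_type`. The `word_length` and `mcnemar` module names and their input fields are assumptions based on modules published on the Hub; check the Hub pages for the exact signatures.

```python
import evaluate

# Measurement: describes properties of a dataset rather than model quality.
# Assumes a "word_length" measurement that accepts a `data` argument.
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "model evaluation workflows"]))

# Comparison: contrasts two models' predictions on the same references.
# Assumes a "mcnemar" comparison with predictions1/predictions2/references inputs.
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(
    predictions1=[1, 0, 1, 1],
    predictions2=[1, 1, 0, 1],
    references=[1, 0, 0, 1],
))
```
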

### Combining Multiple Evaluations

The `combine` function allows you to run multiple evaluation modules together as a single unit:

```python { .api }
def combine(
    evaluations: Union[List[Union[str, EvaluationModule]], Dict[str, Union[str, EvaluationModule]]],
    force_prefix: bool = False
) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object.

    Args:
        evaluations: List or dict of evaluation modules. Can be module names (str)
            or loaded EvaluationModule objects. If a dict, keys are used as
            prefixes for results.
        force_prefix: If True, all results are prefixed with module names
    """
```

```python { .api }
class CombinedEvaluations:
    """Container for multiple evaluation modules."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add prediction/reference to all contained modules."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add batch predictions/references to all contained modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Dict[str, Any]:
        """Compute results from all contained modules."""
```

**Usage Example:**

```python
import evaluate

# Combine multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# Use like a single metric
combined.add_batch(
    predictions=[1, 0, 1, 0],
    references=[1, 1, 0, 0]
)

results = combined.compute()
print(results)
# {
#     'accuracy': 0.5,
#     'f1': 0.5,
#     'precision': 0.5,
#     'recall': 0.5
# }

# Combine with custom names to avoid key conflicts: dict keys become result prefixes
combined_with_prefix = evaluate.combine(
    {
        "acc": evaluate.load("accuracy"),
        "f1": evaluate.load("f1"),
    },
    force_prefix=True,
)
```

## Error Handling

Evaluation modules may raise the following exceptions:

- `FileNotFoundError`: Evaluation module not found on the Hub or at the given local path
- `ValueError`: Invalid input data or configuration
- `TypeError`: Incorrect data types for predictions or references
- `ImportError`: Missing required dependencies for specific metrics
- `ConnectionError`: Network issues when downloading from the Hub

**Example:**

```python
import evaluate

try:
    metric = evaluate.load("nonexistent_metric")
except FileNotFoundError:
    print("Metric not found")

try:
    accuracy = evaluate.load("accuracy")
    accuracy.compute(predictions=[1, 2], references=[1])  # Mismatched lengths
except ValueError as e:
    print(f"Input validation error: {e}")
```