**tessl/pypi-evaluate**

Hugging Face's community-driven, open-source library of evaluation metrics for machine learning models and datasets.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: `pkg:pypi/evaluate@0.4.x` (PyPI)

To install, run:

```
npx @tessl/cli install tessl/pypi-evaluate@0.4.0
```

# Evaluate

A comprehensive evaluation library for machine learning models and datasets, providing implementations of dozens of popular metrics spanning NLP to Computer Vision tasks. The library features dataset-specific metrics, easy integration with any ML framework (NumPy/Pandas/PyTorch/TensorFlow/JAX), type checking for input validation, metric cards with descriptions and usage examples, and community-driven extensibility through the Hugging Face Hub.

## Package Information

- **Package Name**: evaluate
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install evaluate`

## Core Imports

```python
import evaluate
```

For specific components:

```python
from evaluate import load, combine, push_to_hub, save
from evaluate import Metric, Comparison, Measurement, EvaluationModule
from evaluate import evaluator
```

## Basic Usage

```python
import evaluate

# Load a metric from the Hub
accuracy = evaluate.load("accuracy")

# Add predictions and references
accuracy.add_batch(predictions=[0, 2, 1, 3], references=[0, 1, 2, 3])
accuracy.add(prediction=1, reference=1)

# Compute final score (3 of 5 predictions match)
score = accuracy.compute()
print(score)  # {'accuracy': 0.6}

# Combine multiple metrics
combined_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined_metrics.compute(predictions=[0, 1, 1], references=[0, 1, 0])
print(results)  # {'accuracy': 0.6667, 'f1': 0.6667, 'precision': 0.5, 'recall': 1.0}

# Use task-specific evaluators
task_evaluator = evaluate.evaluator("text-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline="cardiffnlp/twitter-roberta-base-emotion",
    data="emotion",
    subset="split",
    split="test[:40]"
)
```

## Architecture

The evaluate library is built around several key components:

- **EvaluationModule**: Base class for all evaluation functionality (metrics, comparisons, measurements)
- **Evaluators**: Task-specific evaluation pipelines that integrate models, datasets, and metrics
- **Hub Integration**: Seamless loading from and pushing to the Hugging Face Hub
- **Combined Evaluations**: Unified interface for running multiple evaluation modules together

The library provides both low-level evaluation primitives for custom workflows and high-level evaluators for common ML tasks, enabling standardized model evaluation and comparison across the machine learning ecosystem.

## Capabilities

### Core Evaluation

Core functionality for loading and using evaluation modules including metrics, comparisons, and measurements. Provides the fundamental building blocks for model evaluation workflows. A usage sketch follows the API summary below.

```python { .api }
def load(path: str, config_name: Optional[str] = None, **kwargs) -> EvaluationModule:
    """Load evaluation modules from Hub or local paths."""

def combine(evaluations: List[str], force_prefix: bool = False) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object."""

class EvaluationModule:
    """Base class for all evaluation modules."""
    def compute(self, *, predictions=None, references=None, **kwargs) -> Optional[dict]: ...
    def add_batch(self, *, predictions=None, references=None, **kwargs): ...
    def add(self, *, prediction=None, reference=None, **kwargs): ...
```
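
Beyond metrics, the same `load` entry point serves measurements and comparisons via the `module_type` keyword. A minimal sketch, assuming the Hub-hosted `word_length` measurement and `mcnemar` comparison modules from the upstream documentation:

```python
import evaluate

# Measurements analyze a dataset directly and take a `data` argument
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "foo bar baz"]))
# -> {'average_word_length': ...}

# Comparisons contrast two models' predictions against shared references
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(predictions1=[0, 1, 1], predictions2=[1, 1, 1], references=[0, 1, 0]))
```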

[Core Evaluation](./core-evaluation.md)

### Task-Specific Evaluators

High-level evaluators for common machine learning tasks that integrate models, datasets, and metrics into streamlined evaluation pipelines. See the bootstrap sketch after the API summary below.

```python { .api }
def evaluator(task: str) -> Evaluator:
    """Factory function to create task-specific evaluators."""

class Evaluator:
    """Base class for task-specific evaluators."""
    def compute(self, model_or_pipeline, data, **kwargs) -> dict: ...

# Specialized evaluator classes
class TextClassificationEvaluator(Evaluator): ...
class ImageClassificationEvaluator(Evaluator): ...
class QuestionAnsweringEvaluator(Evaluator): ...
class TokenClassificationEvaluator(Evaluator): ...
class TextGenerationEvaluator(Evaluator): ...
class Text2TextGenerationEvaluator(Evaluator): ...
class SummarizationEvaluator(Evaluator): ...
class TranslationEvaluator(Evaluator): ...
class AutomaticSpeechRecognitionEvaluator(Evaluator): ...
class AudioClassificationEvaluator(Evaluator): ...
```
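
Evaluators can also report uncertainty. A sketch of bootstrapped confidence intervals, assuming the `strategy`, `n_resamples`, and `label_mapping` keyword arguments described in the upstream evaluator documentation; the model and dataset names are illustrative:

```python
import evaluate

task_evaluator = evaluate.evaluator("text-classification")

# strategy="bootstrap" resamples the evaluation split to attach
# confidence intervals and standard errors to each reported metric
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data="imdb",
    split="test[:100]",
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    strategy="bootstrap",
    n_resamples=100,
)
```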

[Task Evaluators](./task-evaluators.md)

### Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata. A usage sketch follows the API summary below.

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    metric_type: str,
    metric_value: float,
    **kwargs
): ...

def save(path_or_file, **data): ...
```
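
A sketch of both entry points; the model ID, metric values, and the extra keyword arguments (`dataset_name`, `metric_name`) are illustrative and follow the upstream documentation's example:

```python
import evaluate

# save() writes a timestamped JSON file containing the passed values
# plus metadata (timestamp, hostname, Python version, ...)
evaluate.save("./results/", accuracy=0.6, experiment="demo-run")

# push_to_hub() adds the result to the model card of the given Hub repo
evaluate.push_to_hub(
    model_id="huggingface/gpt2-wikitext2",
    task_type="text-generation",
    dataset_type="wikitext",
    dataset_name="WikiText",
    metric_type="bleu",
    metric_name="BLEU",
    metric_value=0.5,
)
```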

[Hub Integration](./hub-integration.md)

### Module Discovery

Tools for discovering, listing, and inspecting available evaluation modules from the Hugging Face Hub and local sources. A usage sketch follows the API summary below.

```python { .api }
def list_evaluation_modules(
    module_type: Optional[str] = None,
    include_community: bool = True,
    with_details: bool = False
): ...

def inspect_evaluation_module(
    path: str,
    local_path: str,
    **kwargs
): ...
```
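
For example, to enumerate the canonical (non-community) metrics together with their Hub metadata; the exact keys in the returned dictionaries are illustrative:

```python
import evaluate

metrics = evaluate.list_evaluation_modules(
    module_type="metric",
    include_community=False,
    with_details=True,
)
# Each entry is a dict of Hub metadata, e.g.
# {'name': 'accuracy', 'type': 'metric', 'community': False, 'likes': ...}
print(metrics[0])
```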

[Module Discovery](./module-discovery.md)

### Evaluation Suites

Comprehensive evaluation workflows that run multiple tasks and datasets together for thorough model evaluation. A usage sketch follows the API summary below.

```python { .api }
class EvaluationSuite:
    """Multi-task, multi-dataset evaluation suite."""
    @staticmethod
    def load(path: str, **kwargs) -> EvaluationSuite: ...
    def run(self, model_or_pipeline) -> dict: ...
```
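
A minimal sketch of running a suite end to end; the suite and model identifiers are hypothetical placeholders for real Hub repositories:

```python
import evaluate

# A suite script on the Hub declares a list of subtask definitions
# (task, data, subset/split, args); run() evaluates the model on each
suite = evaluate.EvaluationSuite.load("my-org/demo-evaluation-suite")
results = suite.run("lvwerra/distilbert-imdb")
print(results)
```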

[Evaluation Suites](./evaluation-suites.md)

### Utilities

Helper functions for logging control and Gradio integration for interactive evaluation experiences. A usage sketch follows the API summary below.

```python { .api }
# Logging utilities
def enable_progress_bar(): ...
def disable_progress_bar(): ...
def is_progress_bar_enabled() -> bool: ...

# Gradio integration
def launch_gradio_widget(evaluation_module): ...
```
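
A short sketch of both helpers, using the names exactly as listed in the API summary above; `launch_gradio_widget` additionally requires the optional `gradio` dependency:

```python
import evaluate

# Silence tqdm progress bars, e.g. in scheduled batch jobs
evaluate.disable_progress_bar()
assert not evaluate.is_progress_bar_enabled()

# Launch an interactive demo UI for any loaded module
accuracy = evaluate.load("accuracy")
evaluate.launch_gradio_widget(accuracy)
```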

[Utilities](./utilities.md)

## Types

```python { .api }
from typing import Dict, List, Optional, Union, Any
from datasets import Dataset

# Core evaluation types
class EvaluationModuleInfo:
    """Information about evaluation modules."""
    description: str
    citation: str
    features: Any
    inputs_description: str
    homepage: Optional[str]
    license: str
    codebase_urls: List[str]
    reference_urls: List[str]

class MetricInfo(EvaluationModuleInfo):
    """Information specific to metrics."""

class ComparisonInfo(EvaluationModuleInfo):
    """Information specific to comparisons."""

class MeasurementInfo(EvaluationModuleInfo):
    """Information specific to measurements."""

# Combined evaluation result type
CombinedResults = Dict[str, Union[float, Dict[str, float], List]]

# Configuration and download types
from datasets import DownloadConfig, DownloadMode
from datasets.utils.version import Version

# Download configuration for Hub modules
class DownloadConfig:
    """Configuration for downloading modules from Hub."""
    cache_dir: Optional[str]
    force_download: bool
    resume_download: bool
    use_auth_token: Optional[str]

# Download mode enumeration
class DownloadMode:
    """Download behavior for cached modules."""
    REUSE_DATASET_IF_EXISTS: str
    REUSE_CACHE_IF_EXISTS: str
    FORCE_REDOWNLOAD: str

# Version handling for modules
class Version:
    """Version specification for modules."""
    def __init__(self, version_str: str): ...
    def __str__(self) -> str: ...
```
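
These download types surface through `evaluate.load`, which forwards them to the underlying `datasets` download machinery. A sketch, assuming `load` accepts the `download_config` and `download_mode` keywords as in the upstream library:

```python
import evaluate
from datasets import DownloadConfig, DownloadMode

# Cache module scripts in a custom directory and force a fresh download
config = DownloadConfig(cache_dir="./eval_cache")
bleu = evaluate.load(
    "bleu",
    download_config=config,
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
```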