# Core Evaluation

The core evaluation functionality provides the fundamental building blocks for model evaluation workflows. This includes loading evaluation modules, working with metrics, comparisons, and measurements, and combining multiple evaluations into a single workflow.

## Capabilities

### Loading Evaluation Modules

The primary way to access evaluation functionality is through the `load` function, which retrieves evaluation modules from the Hugging Face Hub or local paths.

```python { .api }
def load(
    path: str,
    config_name: Optional[str] = None,
    module_type: Optional[str] = None,
    process_id: int = 0,
    num_process: int = 1,
    cache_dir: Optional[str] = None,
    experiment_id: Optional[str] = None,
    keep_in_memory: bool = False,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[DownloadMode] = None,
    revision: Optional[Union[str, Version]] = None,
    **init_kwargs
) -> EvaluationModule:
    """Load an EvaluationModule (metric, comparison, or measurement).

    Args:
        path: Path to evaluation module or module identifier from Hub
        config_name: Configuration name for the module (e.g., GLUE subset)
        module_type: Type of module ('metric', 'comparison', 'measurement')
        process_id: Process ID for distributed evaluation (0-based)
        num_process: Total number of processes in distributed setup
        cache_dir: Directory for caching downloaded modules
        experiment_id: Unique identifier for experiment tracking
        keep_in_memory: Store all data in memory (not for distributed)
        download_config: Configuration for downloading from Hub
        download_mode: How to handle existing cached data
        revision: Specific revision/version to load
        **init_kwargs: Additional initialization arguments for the module
    """
```

**Usage Example:**

```python
import evaluate

# Load popular metrics
accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Load with a specific configuration (the GLUE metric requires a subset name)
glue_metric = evaluate.load("glue", config_name="mrpc")

# Load a local evaluation module
custom_metric = evaluate.load("./path/to/custom_metric.py")
```
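
The `process_id` and `num_process` arguments support distributed setups in which each worker scores its own shard and the result is gathered on the main process. A minimal sketch, assuming two worker processes launched externally (e.g., via `torchrun`):

```python
import evaluate

# Each worker loads the same metric with its own process_id (0 or 1 here);
# experiment_id keeps concurrent runs from sharing a cache.
metric = evaluate.load(
    "accuracy",
    num_process=2,
    process_id=0,  # would be 1 on the second worker
    experiment_id="distributed_demo",
)

metric.add_batch(predictions=[1, 0], references=[1, 1])

# compute() returns the gathered result on process 0 and None on other ranks
result = metric.compute()
```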

### Base Evaluation Module

All evaluation functionality inherits from the `EvaluationModule` base class, providing a consistent API across metrics, comparisons, and measurements.

```python { .api }
class EvaluationModule:
    """Base class for all evaluation modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Optional[Dict[str, Any]]:
        """Compute evaluation results from accumulated predictions and references."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add a batch of predictions and references."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add a single prediction and reference pair."""

    def download_and_prepare(
        self,
        download_config: Optional[DownloadConfig] = None,
        dl_manager: Optional[DownloadManager] = None
    ):
        """Download and prepare the evaluation module."""

    # Properties
    @property
    def name(self) -> str:
        """Name of the evaluation module."""

    @property
    def description(self) -> str:
        """Description of what the module evaluates."""

    @property
    def citation(self) -> str:
        """Citation information for the evaluation method."""

    @property
    def features(self) -> Features:
        """Expected input features schema."""

    @property
    def inputs_description(self) -> str:
        """Description of expected inputs."""

    @property
    def homepage(self) -> Optional[str]:
        """Homepage URL for the evaluation method."""

    @property
    def license(self) -> str:
        """License information."""

    @property
    def codebase_urls(self) -> List[str]:
        """URLs to relevant codebases."""

    @property
    def reference_urls(self) -> List[str]:
        """URLs to reference papers or documentation."""
```

**Usage Example:**

```python
import evaluate

# Load and use a metric
accuracy = evaluate.load("accuracy")

# Add individual predictions
accuracy.add(prediction=1, reference=1)
accuracy.add(prediction=0, reference=1)

# Add batch predictions
accuracy.add_batch(
    predictions=[1, 0, 1, 1],
    references=[1, 1, 0, 1]
)

# Compute final results (3 of the 6 accumulated pairs match)
result = accuracy.compute()
print(result)  # {'accuracy': 0.5}

# Access module information
print(accuracy.description)
print(accuracy.citation)
```
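
The remaining properties are handy for discovering what a module expects before wiring it into a pipeline; for instance:

```python
# Inspect the expected input schema and its documentation
print(accuracy.features)            # datasets.Features describing the inputs
print(accuracy.inputs_description)  # human-readable description of inputs
```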

### Specialized Evaluation Classes

The library provides specialized classes for different types of evaluation:

```python { .api }
class Metric(EvaluationModule):
    """Specialized evaluation module for metrics."""

class Comparison(EvaluationModule):
    """Specialized evaluation module for comparisons between models."""

class Measurement(EvaluationModule):
    """Specialized evaluation module for measurements."""
```

These classes inherit all functionality from `EvaluationModule` but may have specialized behavior for their specific evaluation type.
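
In practice the three types differ mainly in what they consume: metrics take predictions and references, measurements describe a single dataset, and comparisons contrast two sets of predictions. A short sketch, assuming the `word_length` measurement and `mcnemar` comparison modules are available on the Hub:

```python
import evaluate

# A measurement characterizes a dataset rather than model output
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "foo bar baz"]))

# A comparison contrasts two models' predictions on the same references
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(
    predictions1=[1, 0, 1],
    predictions2=[1, 1, 0],
    references=[1, 0, 1],
))
```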

### Combining Multiple Evaluations

The `combine` function allows you to run multiple evaluation modules together as a single unit:

```python { .api }
def combine(
    evaluations: Union[List[Union[str, EvaluationModule]], Dict[str, Union[str, EvaluationModule]]],
    force_prefix: bool = False
) -> CombinedEvaluations:
    """Combine multiple evaluation modules into a single object.

    Args:
        evaluations: List or dict of evaluation modules. Can be module names (str)
            or loaded EvaluationModule objects. If dict, keys are used as
            prefixes for results.
        force_prefix: If True, all results are prefixed with module names
    """
```

```python { .api }
class CombinedEvaluations:
    """Container for multiple evaluation modules."""

    def add(
        self,
        *,
        prediction=None,
        reference=None,
        **kwargs
    ):
        """Add prediction/reference to all contained modules."""

    def add_batch(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ):
        """Add batch predictions/references to all contained modules."""

    def compute(
        self,
        *,
        predictions=None,
        references=None,
        **kwargs
    ) -> Dict[str, Any]:
        """Compute results from all contained modules."""
```

**Usage Example:**

```python
import evaluate

# Combine multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# Use like a single metric
combined.add_batch(
    predictions=[1, 0, 1, 0],
    references=[1, 1, 0, 0]
)

results = combined.compute()
print(results)
# {
#     'accuracy': 0.5,
#     'f1': 0.5,
#     'precision': 0.5,
#     'recall': 0.5
# }

# Combine with custom names to avoid key conflicts; dict keys become
# result prefixes (compute-time options such as average="macro" are
# passed to compute(), not to load())
combined_with_prefix = evaluate.combine({
    "acc": evaluate.load("accuracy"),
    "f1_macro": evaluate.load("f1"),
}, force_prefix=True)
```

## Error Handling

Evaluation modules may raise the following exceptions:

- `FileNotFoundError`: Evaluation module not found locally or on the Hub
- `ValueError`: Invalid input data or configuration
- `TypeError`: Incorrect data types for predictions or references
- `ImportError`: Missing required dependencies for specific metrics
- `ConnectionError`: Network issues when downloading from the Hub

**Example:**

```python
import evaluate

try:
    metric = evaluate.load("nonexistent_metric")
except FileNotFoundError:
    print("Metric not found")

try:
    accuracy = evaluate.load("accuracy")
    accuracy.compute(predictions=[1, 2], references=[1])  # mismatched lengths
except ValueError as e:
    print(f"Input validation error: {e}")
```
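
Dependency errors surface the same way: a module that wraps an external package raises `ImportError` with install instructions when that package is missing. A small sketch, assuming the `sacrebleu` package is not installed:

```python
import evaluate

try:
    # the sacrebleu metric wraps the external sacrebleu package
    bleu = evaluate.load("sacrebleu")
except ImportError as e:
    print(f"Missing dependency: {e}")
```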