
# Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata. These tools enable reproducible evaluation workflows and result sharing within the ML community.

## Capabilities

### Push Results to Hub

Push evaluation results directly to model metadata on the Hugging Face Hub:

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    dataset_name: str,
    metric_type: str,
    metric_name: str,
    metric_value: float,
    task_name: Optional[str] = None,
    dataset_config: Optional[str] = None,
    dataset_split: Optional[str] = None,
    dataset_revision: Optional[str] = None,
    dataset_args: Optional[Dict[str, int]] = None,
    metric_config: Optional[str] = None,
    metric_args: Optional[Dict[str, int]] = None,
    overwrite: bool = False
):
    """Push evaluation results to a model's metadata on Hugging Face Hub.

    Args:
        model_id: Model identifier on the Hub (e.g., "username/model-name")
        task_type: Task type (must be from Hub's allowed tasks)
        dataset_type: Dataset identifier from Hub
        dataset_name: Human-readable dataset name
        metric_type: Metric identifier from Hub
        metric_name: Human-readable metric name
        metric_value: Computed metric score
        task_name: Human-readable task name (optional)
        dataset_config: Dataset configuration/subset name (optional)
        dataset_split: Dataset split used ("train", "test", "validation")
        dataset_revision: Specific dataset revision/commit (optional)
        dataset_args: Additional dataset parameters (optional)
        metric_config: Metric configuration name (optional)
        metric_args: Additional metric parameters (optional)
        overwrite: Whether to overwrite existing results (default: False)
    """
```

**Usage Example:**

```python
import evaluate

# Evaluate a model
accuracy = evaluate.load("accuracy")
accuracy.add_batch(predictions=[1, 0, 1], references=[1, 1, 0])
result = accuracy.compute()

# Push results to the model's Hub page
evaluate.push_to_hub(
    model_id="my-username/my-model",
    task_type="text-classification",
    dataset_type="glue",
    dataset_name="sst2",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=result["accuracy"],
    dataset_config="default",
    dataset_split="validation"
)
```

**Advanced Example with Multiple Metrics:**

```python
import evaluate

# Evaluate with multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined.compute(predictions=[1, 0, 1, 0], references=[1, 1, 0, 0])

# Push each metric separately
for metric_name, metric_value in results.items():
    evaluate.push_to_hub(
        model_id="my-username/my-classification-model",
        task_type="text-classification",
        dataset_type="custom",
        dataset_name="my-dataset",
        metric_type=metric_name,
        metric_name=metric_name,
        metric_value=metric_value,
        dataset_split="test",
        overwrite=True  # Update existing results
    )
```

### Save Results Locally

Save evaluation results to local JSON files with comprehensive metadata:

```python { .api }
def save(path_or_file: Union[str, Path, TextIOWrapper], **data)
```

The function automatically includes system metadata such as:

- Timestamp of evaluation
- Python version and platform information
- Package version information
- System specifications

**Usage Example:**

```python
import evaluate

# Run evaluation
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(
    predictions=["hello there", "general kenobi"],
    references=[["hello there"], ["general kenobi"]]
)

rouge_result = rouge.compute(
    predictions=["hello there", "general kenobi"],
    references=["hello there", "general kenobi"]
)

# Save results with metadata
evaluate.save(
    "evaluation_results.json",
    model_name="my-model-v1.0",
    dataset="custom-test-set",
    bleu_score=bleu_result,
    rouge_scores=rouge_result,
    notes="Initial baseline evaluation"
)
```

**Example Output Structure:**

```json
{
  "model_name": "my-model-v1.0",
  "dataset": "custom-test-set",
  "bleu_score": {"bleu": 1.0},
  "rouge_scores": {
    "rouge1": 1.0,
    "rouge2": 1.0,
    "rougeL": 1.0,
    "rougeLsum": 1.0
  },
  "notes": "Initial baseline evaluation",
  "_timestamp": "2023-12-07T15:30:45.123456",
  "_python_version": "3.9.7",
  "_evaluate_version": "0.4.5",
  "_platform": "Linux-5.4.0-x86_64"
}
```
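
Because the automatically added metadata keys are underscore-prefixed (as shown above), they are easy to separate from your own fields when reading results back. A minimal sketch using only the standard library, assuming the `evaluation_results.json` file written in the example and that your installed `evaluate` version follows the same underscore convention:

```python
import json

# Load the file written by evaluate.save() above
with open("evaluation_results.json") as f:
    record = json.load(f)

# Keys starting with "_" hold the automatically added system metadata;
# everything else is the data passed as keyword arguments to save()
metadata = {k: v for k, v in record.items() if k.startswith("_")}
payload = {k: v for k, v in record.items() if not k.startswith("_")}

print("Evaluated with evaluate", metadata.get("_evaluate_version"))
print("Scores and notes:", payload)
```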


**Save to File Object:**

```python
import evaluate

# Evaluate model
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[1, 0, 1], references=[1, 1, 0])

# Save to an open file object
with open("results.json", "w") as f:
    evaluate.save(
        f,
        experiment_id="exp_001",
        model="bert-base-uncased",
        accuracy=result["accuracy"],
        hyperparameters={"lr": 0.001, "batch_size": 32}
    )
```

**Batch Results Saving:**

```python
import evaluate

# Run multiple evaluations
evaluator = evaluate.evaluator("text-classification")

models = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base"
]

all_results = {}

for model_name in models:
    results = evaluator.compute(
        model_or_pipeline=model_name,
        data="imdb",
        split="test[:100]"
    )
    all_results[model_name] = results

# Save comprehensive comparison
evaluate.save(
    "model_comparison.json",
    experiment_name="IMDB Classification Comparison",
    dataset="imdb",
    results=all_results,
    evaluation_config={
        "split": "test[:100]",
        "metric": "accuracy",
        "task": "text-classification"
    }
)
```
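
The saved comparison can later be read back with the standard library. A minimal sketch, assuming the `model_comparison.json` file written above; the exact keys in each per-model entry depend on the evaluator's metric configuration:

```python
import json

# Load the comparison written by evaluate.save() above
with open("model_comparison.json") as f:
    comparison = json.load(f)

# Print each model's result dict as saved under the "results" field
for model_name, result in comparison["results"].items():
    print(f"{model_name}: {result}")
```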


## Integration with Evaluation Workflows

**Complete Evaluation and Sharing Workflow:**

```python
import evaluate

# Setup evaluation
model_name = "cardiffnlp/twitter-roberta-base-emotion"
evaluator = evaluate.evaluator("text-classification")

# Run evaluation
results = evaluator.compute(
    model_or_pipeline=model_name,
    data="emotion",
    split="test[:200]",
    metric="accuracy"
)

# Save detailed results locally
evaluate.save(
    f"evaluation_{model_name.replace('/', '_')}.json",
    model=model_name,
    dataset="emotion",
    split="test[:200]",
    results=results,
    evaluation_date="2023-12-07"
)

# Share key results on the Hub
evaluate.push_to_hub(
    model_id=model_name,
    task_type="text-classification",
    dataset_type="emotion",
    dataset_name="emotion",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=results["accuracy"],
    dataset_split="test"
)

print(f"Evaluation complete. Accuracy: {results['accuracy']:.3f}")
```

## Error Handling

Hub integration functions may raise:

- `ConnectionError`: Network connectivity issues
- `HTTPError`: Hub API authentication or permission errors
- `ValueError`: Invalid model_id format or missing required parameters
- `FileNotFoundError`: Invalid local file paths for saving
- `PermissionError`: Insufficient file system permissions


**Example:**

```python
import evaluate

try:
    evaluate.push_to_hub(
        model_id="invalid/model/name/format",
        task_type="text-classification",
        # ... other parameters
    )
except ValueError as e:
    print(f"Invalid model ID: {e}")

try:
    evaluate.save("/invalid/path/results.json", data="test")
except (FileNotFoundError, PermissionError) as e:
    print(f"Cannot write to path: {e}")
```
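
For the network-related failures listed above (`ConnectionError`, `HTTPError`), a simple retry loop is often sufficient. A minimal sketch, assuming `HTTPError` is the one from `requests` and reusing the example model and dataset identifiers from earlier sections; the exact exception types raised can vary with the Hub client version, and the metric value here is a placeholder:

```python
import time

import evaluate
from requests.exceptions import HTTPError

# Retry transient network failures a few times before giving up
for attempt in range(3):
    try:
        evaluate.push_to_hub(
            model_id="my-username/my-model",
            task_type="text-classification",
            dataset_type="glue",
            dataset_name="sst2",
            metric_type="accuracy",
            metric_name="accuracy",
            metric_value=0.91,  # placeholder score for illustration
        )
        break
    except (ConnectionError, HTTPError) as e:
        print(f"Push failed (attempt {attempt + 1}): {e}")
        time.sleep(2 ** attempt)  # simple exponential backoff
```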