# Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata. These tools enable reproducible evaluation workflows and result sharing within the ML community.

## Capabilities

### Push Results to Hub

Push evaluation results directly to a model's metadata on the Hugging Face Hub:

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    dataset_name: str,
    metric_type: str,
    metric_name: str,
    metric_value: float,
    task_name: Optional[str] = None,
    dataset_config: Optional[str] = None,
    dataset_split: Optional[str] = None,
    dataset_revision: Optional[str] = None,
    dataset_args: Optional[Dict[str, int]] = None,
    metric_config: Optional[str] = None,
    metric_args: Optional[Dict[str, int]] = None,
    overwrite: bool = False
):
    """Push evaluation results to a model's metadata on the Hugging Face Hub.

    Args:
        model_id: Model identifier on the Hub (e.g., "username/model-name")
        task_type: Task type (must be one of the Hub's allowed task types)
        dataset_type: Dataset identifier on the Hub
        dataset_name: Human-readable dataset name
        metric_type: Metric identifier on the Hub
        metric_name: Human-readable metric name
        metric_value: Computed metric score
        task_name: Human-readable task name (optional)
        dataset_config: Dataset configuration/subset name (optional)
        dataset_split: Dataset split used ("train", "test", "validation")
        dataset_revision: Specific dataset revision/commit (optional)
        dataset_args: Additional dataset parameters (optional)
        metric_config: Metric configuration name (optional)
        metric_args: Additional metric parameters (optional)
        overwrite: Whether to overwrite existing results (default: False)
    """
```

**Usage Example:**
```python
import evaluate

# Evaluate a model
accuracy = evaluate.load("accuracy")
accuracy.add_batch(predictions=[1, 0, 1], references=[1, 1, 0])
result = accuracy.compute()

# Push results to the model's Hub page
evaluate.push_to_hub(
    model_id="my-username/my-model",
    task_type="text-classification",
    dataset_type="glue",
    dataset_name="sst2",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=result["accuracy"],
    dataset_config="sst2",
    dataset_split="validation"
)
```

**Advanced Example with Multiple Metrics:**
```python
import evaluate

# Evaluate with multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined.compute(predictions=[1, 0, 1, 0], references=[1, 1, 0, 0])

# Push each metric separately
for metric_name, metric_value in results.items():
    evaluate.push_to_hub(
        model_id="my-username/my-classification-model",
        task_type="text-classification",
        dataset_type="custom",
        dataset_name="my-dataset",
        metric_type=metric_name,
        metric_name=metric_name,
        metric_value=metric_value,
        dataset_split="test",
        overwrite=True  # Update existing results
    )
```

### Save Results Locally

Save evaluation results to local JSON files with comprehensive metadata:

```python { .api }
def save(path_or_file: Union[str, Path, TextIOWrapper], **data)
```

The function automatically includes system metadata such as:
- Timestamp of evaluation
- Python version and platform information
- Package version information
- System specifications

**Usage Example:**
```python
import evaluate

# Run evaluation
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(
    predictions=["hello there", "general kenobi"],
    references=[["hello there"], ["general kenobi"]]
)

rouge_result = rouge.compute(
    predictions=["hello there", "general kenobi"],
    references=["hello there", "general kenobi"]
)

# Save results with metadata
evaluate.save(
    "evaluation_results.json",
    model_name="my-model-v1.0",
    dataset="custom-test-set",
    bleu_score=bleu_result,
    rouge_scores=rouge_result,
    notes="Initial baseline evaluation"
)
```

**Example Output Structure:**
```json
{
  "model_name": "my-model-v1.0",
  "dataset": "custom-test-set",
  "bleu_score": {"bleu": 1.0},
  "rouge_scores": {
    "rouge1": 1.0,
    "rouge2": 1.0,
    "rougeL": 1.0,
    "rougeLsum": 1.0
  },
  "notes": "Initial baseline evaluation",
  "_timestamp": "2023-12-07T15:30:45.123456",
  "_python_version": "3.9.7",
  "_evaluate_version": "0.4.5",
  "_platform": "Linux-5.4.0-x86_64"
}
```
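When results are read back later, the system metadata added by `evaluate.save()` can be separated from the fields you passed in; as the example output above shows, those keys are prefixed with an underscore. A minimal sketch of that pattern, assuming the `evaluation_results.json` file written above and the underscore convention shown in the output:

```python
import json

# Load a results file written by evaluate.save() and split the
# underscore-prefixed system metadata from the user-supplied fields.
# Assumes the "evaluation_results.json" file from the example above.
with open("evaluation_results.json") as f:
    saved = json.load(f)

metadata = {k: v for k, v in saved.items() if k.startswith("_")}
results = {k: v for k, v in saved.items() if not k.startswith("_")}

print("System metadata:", metadata)
print("Evaluation fields:", results)
```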
**Save to File Object:**
```python
import evaluate

# Evaluate model
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[1, 0, 1], references=[1, 1, 0])

# Save to an open file object
with open("results.json", "w") as f:
    evaluate.save(
        f,
        experiment_id="exp_001",
        model="bert-base-uncased",
        accuracy=result["accuracy"],
        hyperparameters={"lr": 0.001, "batch_size": 32}
    )
```

**Batch Results Saving:**
```python
import evaluate

# Run multiple evaluations
evaluator = evaluate.evaluator("text-classification")

models = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base"
]

all_results = {}

for model_name in models:
    results = evaluator.compute(
        model_or_pipeline=model_name,
        data="imdb",
        split="test[:100]"
    )
    all_results[model_name] = results

# Save comprehensive comparison
evaluate.save(
    "model_comparison.json",
    experiment_name="IMDB Classification Comparison",
    dataset="imdb",
    results=all_results,
    evaluation_config={
        "split": "test[:100]",
        "metric": "accuracy",
        "task": "text-classification"
    }
)
```

## Integration with Evaluation Workflows

**Complete Evaluation and Sharing Workflow:**
```python
import evaluate

# Setup evaluation
model_name = "cardiffnlp/twitter-roberta-base-emotion"
evaluator = evaluate.evaluator("text-classification")

# Run evaluation
results = evaluator.compute(
    model_or_pipeline=model_name,
    data="emotion",
    split="test[:200]",
    metric="accuracy"
)

# Save detailed results locally
evaluate.save(
    f"evaluation_{model_name.replace('/', '_')}.json",
    model=model_name,
    dataset="emotion",
    split="test[:200]",
    results=results,
    evaluation_date="2023-12-07"
)

# Share key results on the Hub
evaluate.push_to_hub(
    model_id=model_name,
    task_type="text-classification",
    dataset_type="emotion",
    dataset_name="emotion",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=results["accuracy"],
    dataset_split="test"
)

print(f"Evaluation complete. Accuracy: {results['accuracy']:.3f}")
```

## Error Handling

Hub integration functions may raise:

- `ConnectionError`: Network connectivity issues
- `HTTPError`: Hub API authentication or permission errors
- `ValueError`: Invalid model_id format or missing required parameters
- `FileNotFoundError`: Invalid local file paths for saving
- `PermissionError`: Insufficient file system permissions

**Example:**
```python
import evaluate

try:
    evaluate.push_to_hub(
        model_id="invalid/model/name/format",
        task_type="text-classification",
        # ... other parameters
    )
except ValueError as e:
    print(f"Invalid model ID: {e}")

try:
    evaluate.save("/invalid/path/results.json", data="test")
except PermissionError as e:
    print(f"Cannot write to path: {e}")
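
The example above covers `ValueError` and `PermissionError`; for the network and authentication failures listed earlier, a retry-and-report wrapper is one possible pattern. A minimal sketch, assuming the exceptions surface as `requests` exceptions and using placeholder model and metric values:

```python
import time

import evaluate
from requests.exceptions import ConnectionError, HTTPError  # assumed exception sources

# Retry transient network failures; give up immediately on auth/permission errors.
# The retry policy, exception sources, and all argument values are illustrative.
for attempt in range(3):
    try:
        evaluate.push_to_hub(
            model_id="my-username/my-model",   # hypothetical model
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDB",
            metric_type="accuracy",
            metric_name="accuracy",
            metric_value=0.91,                 # placeholder score
        )
        break
    except ConnectionError:
        # Transient connectivity issue: back off and retry.
        time.sleep(2 ** attempt)
    except HTTPError as e:
        # Authentication or permission problem: retrying will not help.
        print(f"Hub rejected the request: {e}")
        break
```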