# Hub Integration

Functions for sharing evaluation results with the Hugging Face Hub and saving results locally with comprehensive metadata. These tools enable reproducible evaluation workflows and result sharing within the ML community.

## Capabilities

### Push Results to Hub

Push evaluation results directly to a model's metadata on the Hugging Face Hub:

```python { .api }
def push_to_hub(
    model_id: str,
    task_type: str,
    dataset_type: str,
    dataset_name: str,
    metric_type: str,
    metric_name: str,
    metric_value: float,
    task_name: Optional[str] = None,
    dataset_config: Optional[str] = None,
    dataset_split: Optional[str] = None,
    dataset_revision: Optional[str] = None,
    dataset_args: Optional[Dict[str, int]] = None,
    metric_config: Optional[str] = None,
    metric_args: Optional[Dict[str, int]] = None,
    overwrite: bool = False
):
    """Push evaluation results to a model's metadata on the Hugging Face Hub.

    Args:
        model_id: Model identifier on the Hub (e.g., "username/model-name")
        task_type: Task type (must be one of the Hub's allowed task types)
        dataset_type: Dataset identifier on the Hub
        dataset_name: Human-readable dataset name
        metric_type: Metric identifier on the Hub
        metric_name: Human-readable metric name
        metric_value: Computed metric score
        task_name: Human-readable task name (optional)
        dataset_config: Dataset configuration/subset name (optional)
        dataset_split: Dataset split used ("train", "test", "validation")
        dataset_revision: Specific dataset revision/commit (optional)
        dataset_args: Additional dataset parameters (optional)
        metric_config: Metric configuration name (optional)
        metric_args: Additional metric parameters (optional)
        overwrite: Whether to overwrite existing results (default: False)
    """
```

**Usage Example:**
```python
import evaluate

# Evaluate a model
accuracy = evaluate.load("accuracy")
accuracy.add_batch(predictions=[1, 0, 1], references=[1, 1, 0])
result = accuracy.compute()

# Push results to the model's Hub page
evaluate.push_to_hub(
    model_id="my-username/my-model",
    task_type="text-classification",
    dataset_type="glue",
    dataset_name="sst2",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=result["accuracy"],
    dataset_config="sst2",
    dataset_split="validation"
)
```

**Advanced Example with Multiple Metrics:**
```python
import evaluate

# Evaluate with multiple metrics
combined = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = combined.compute(predictions=[1, 0, 1, 0], references=[1, 1, 0, 0])

# Push each metric separately
for metric_name, metric_value in results.items():
    evaluate.push_to_hub(
        model_id="my-username/my-classification-model",
        task_type="text-classification",
        dataset_type="custom",
        dataset_name="my-dataset",
        metric_type=metric_name,
        metric_name=metric_name,
        metric_value=metric_value,
        dataset_split="test",
        overwrite=True  # Update existing results
    )
```

### Save Results Locally

Save evaluation results to local JSON files with comprehensive metadata:

```python { .api }
def save(path_or_file: Union[str, Path, TextIOWrapper], **data)
```

The function automatically includes system metadata such as:
- Timestamp of evaluation
- Python version and platform information
- Package version information
- System specifications

**Usage Example:**
```python
import evaluate

# Run evaluation
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(
    predictions=["hello there", "general kenobi"],
    references=[["hello there"], ["general kenobi"]]
)

rouge_result = rouge.compute(
    predictions=["hello there", "general kenobi"],
    references=["hello there", "general kenobi"]
)

# Save results with metadata
evaluate.save(
    "evaluation_results.json",
    model_name="my-model-v1.0",
    dataset="custom-test-set",
    bleu_score=bleu_result,
    rouge_scores=rouge_result,
    notes="Initial baseline evaluation"
)
```

**Example Output Structure:**
```json
{
  "model_name": "my-model-v1.0",
  "dataset": "custom-test-set",
  "bleu_score": {"bleu": 1.0},
  "rouge_scores": {
    "rouge1": 1.0,
    "rouge2": 1.0,
    "rougeL": 1.0,
    "rougeLsum": 1.0
  },
  "notes": "Initial baseline evaluation",
  "_timestamp": "2023-12-07T15:30:45.123456",
  "_python_version": "3.9.7",
  "_evaluate_version": "0.4.5",
  "_platform": "Linux-5.4.0-x86_64"
}
```
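When results are read back later, the system metadata added by `evaluate.save()` can be separated from the fields you passed in; as the example output above shows, those keys are prefixed with an underscore. A minimal sketch of that pattern, assuming the `evaluation_results.json` file written above and the underscore convention shown in the output:

```python
import json

# Load a results file written by evaluate.save() and split the
# underscore-prefixed system metadata from the user-supplied fields.
# Assumes the "evaluation_results.json" file from the example above.
with open("evaluation_results.json") as f:
    saved = json.load(f)

metadata = {k: v for k, v in saved.items() if k.startswith("_")}
results = {k: v for k, v in saved.items() if not k.startswith("_")}

print("System metadata:", metadata)
print("Evaluation fields:", results)
```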
**Save to File Object:**
```python
import evaluate

# Evaluate model
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[1, 0, 1], references=[1, 1, 0])

# Save to an open file object
with open("results.json", "w") as f:
    evaluate.save(
        f,
        experiment_id="exp_001",
        model="bert-base-uncased",
        accuracy=result["accuracy"],
        hyperparameters={"lr": 0.001, "batch_size": 32}
    )
```

**Batch Results Saving:**
```python
import evaluate

# Run multiple evaluations
evaluator = evaluate.evaluator("text-classification")

models = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base"
]

all_results = {}

for model_name in models:
    results = evaluator.compute(
        model_or_pipeline=model_name,
        data="imdb",
        split="test[:100]"
    )
    all_results[model_name] = results

# Save comprehensive comparison
evaluate.save(
    "model_comparison.json",
    experiment_name="IMDB Classification Comparison",
    dataset="imdb",
    results=all_results,
    evaluation_config={
        "split": "test[:100]",
        "metric": "accuracy",
        "task": "text-classification"
    }
)
```

## Integration with Evaluation Workflows

**Complete Evaluation and Sharing Workflow:**
```python
import evaluate

# Setup evaluation
model_name = "cardiffnlp/twitter-roberta-base-emotion"
evaluator = evaluate.evaluator("text-classification")

# Run evaluation
results = evaluator.compute(
    model_or_pipeline=model_name,
    data="emotion",
    split="test[:200]",
    metric="accuracy"
)

# Save detailed results locally
evaluate.save(
    f"evaluation_{model_name.replace('/', '_')}.json",
    model=model_name,
    dataset="emotion",
    split="test[:200]",
    results=results,
    evaluation_date="2023-12-07"
)

# Share key results on the Hub
evaluate.push_to_hub(
    model_id=model_name,
    task_type="text-classification",
    dataset_type="emotion",
    dataset_name="emotion",
    metric_type="accuracy",
    metric_name="accuracy",
    metric_value=results["accuracy"],
    dataset_split="test"
)

print(f"Evaluation complete. Accuracy: {results['accuracy']:.3f}")
```

## Error Handling

Hub integration functions may raise:

- `ConnectionError`: Network connectivity issues
- `HTTPError`: Hub API authentication or permission errors
- `ValueError`: Invalid model_id format or missing required parameters
- `FileNotFoundError`: Invalid local file paths for saving
- `PermissionError`: Insufficient file system permissions

**Example:**
```python
import evaluate

try:
    evaluate.push_to_hub(
        model_id="invalid/model/name/format",
        task_type="text-classification",
        # ... other parameters
    )
except ValueError as e:
    print(f"Invalid model ID: {e}")

try:
    evaluate.save("/invalid/path/results.json", data="test")
except PermissionError as e:
    print(f"Cannot write to path: {e}")
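
The example above covers `ValueError` and `PermissionError`; for the network and authentication failures listed earlier, a retry-and-report wrapper is one possible pattern. A minimal sketch, assuming the exceptions surface as `requests` exceptions and using placeholder model and metric values:

```python
import time

import evaluate
from requests.exceptions import ConnectionError, HTTPError  # assumed exception sources

# Retry transient network failures; give up immediately on auth/permission errors.
# The retry policy, exception sources, and all argument values are illustrative.
for attempt in range(3):
    try:
        evaluate.push_to_hub(
            model_id="my-username/my-model",   # hypothetical model
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDB",
            metric_type="accuracy",
            metric_name="accuracy",
            metric_value=0.91,                 # placeholder score
        )
        break
    except ConnectionError:
        # Transient connectivity issue: back off and retry.
        time.sleep(2 ** attempt)
    except HTTPError as e:
        # Authentication or permission problem: retrying will not help.
        print(f"Hub rejected the request: {e}")
        break
```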