# Evaluation Metrics

Comprehensive evaluation metrics for assessing the quality of conformal prediction intervals and sets, including coverage, width, and calibration metrics. These metrics help evaluate the performance and reliability of uncertainty quantification methods.

## Capabilities

### Regression Metrics

Metrics for evaluating prediction intervals in regression tasks, focusing on coverage guarantees, interval width efficiency, and distributional properties.

```python { .api }
def regression_coverage_score(y_true, y_intervals):
    """
    Compute coverage score for regression prediction intervals.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals (shape: n_samples x 2 x n_alpha)

    Returns:
    NDArray: coverage scores for each confidence level
    """

def regression_mean_width_score(y_intervals):
    """
    Compute mean width of prediction intervals.

    Parameters:
    - y_intervals: ArrayLike, prediction intervals (shape: n_samples x 2 x n_alpha)

    Returns:
    NDArray: mean interval widths for each confidence level
    """

def regression_ssc(y_true, y_intervals):
    """
    Size-stratified coverage score for regression.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals

    Returns:
    NDArray: size-stratified coverage scores
    """

def regression_ssc_score(y_true, y_intervals, num_bins=10):
    """
    Size-stratified coverage score with binning.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals
    - num_bins: int, number of bins for stratification (default: 10)

    Returns:
    NDArray: binned size-stratified coverage scores
    """

def hsic(x, y, kernel="gaussian"):
    """
    Hilbert-Schmidt Independence Criterion for testing independence.

    Parameters:
    - x: ArrayLike, first variable
    - y: ArrayLike, second variable
    - kernel: str, kernel type ("gaussian", "linear") (default: "gaussian")

    Returns:
    float: HSIC statistic
    """

def coverage_width_based(y_true, y_intervals, eta=1.0):
    """
    Coverage-width-based metric balancing coverage and efficiency.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals
    - eta: float, weight parameter for width penalty (default: 1.0)

    Returns:
    NDArray: coverage-width-based scores
    """

def regression_mwi_score(y_intervals, num_bins=10):
    """
    Mean width interval score with binning.

    Parameters:
    - y_intervals: ArrayLike, prediction intervals
    - num_bins: int, number of bins (default: 10)

    Returns:
    NDArray: mean width scores per bin
    """
```

### Classification Metrics

Metrics for evaluating prediction sets in classification tasks, measuring set coverage, size efficiency, and distributional properties.

```python { .api }
def classification_coverage_score(y_true, y_pred_set):
    """
    Compute coverage score for classification prediction sets.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets (binary matrix: n_samples x n_classes)

    Returns:
    NDArray: coverage scores
    """

def classification_mean_width_score(y_pred_set):
    """
    Compute mean size of prediction sets.

    Parameters:
    - y_pred_set: ArrayLike, prediction sets (binary matrix)

    Returns:
    float: mean prediction set size
    """

def classification_ssc(y_true, y_pred_set):
    """
    Size-stratified coverage for classification.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets

    Returns:
    NDArray: size-stratified coverage scores
    """

def classification_ssc_score(y_true, y_pred_set, num_bins=10):
    """
    Size-stratified coverage score with binning for classification.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets
    - num_bins: int, number of bins for stratification (default: 10)

    Returns:
    NDArray: binned size-stratified coverage scores
    """
```

### Calibration Metrics

Metrics for evaluating probability calibration quality, testing whether predicted probabilities accurately reflect true confidence levels.

```python { .api }
def expected_calibration_error(y_true, y_scores, num_bins=50, split_strategy=None):
    """
    Expected Calibration Error (ECE) for probability predictions.

    Parameters:
    - y_true: ArrayLike, true binary labels (0/1)
    - y_scores: ArrayLike, predicted probabilities
    - num_bins: int, number of bins for reliability diagram (default: 50)
    - split_strategy: Optional[str], binning strategy ("uniform", "quantile")

    Returns:
    float: expected calibration error
    """

def top_label_ece(y_true, y_scores, num_bins=50, split_strategy=None):
    """
    Top-label Expected Calibration Error for multi-class problems.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_scores: ArrayLike, predicted class probabilities (n_samples x n_classes)
    - num_bins: int, number of bins (default: 50)
    - split_strategy: Optional[str], binning strategy

    Returns:
    float: top-label expected calibration error
    """

def kolmogorov_smirnov_statistic(y_true, y_score):
    """
    Kolmogorov-Smirnov test statistic for calibration assessment.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: KS test statistic
    """

def kolmogorov_smirnov_p_value(y_true, y_score):
    """
    P-value for Kolmogorov-Smirnov calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: KS test p-value
    """

def kuiper_statistic(y_true, y_score):
    """
    Kuiper test statistic for calibration (circular KS test).

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Kuiper test statistic
    """

def kuiper_p_value(y_true, y_score):
    """
    P-value for Kuiper calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Kuiper test p-value
    """

def spiegelhalter_statistic(y_true, y_score):
    """
    Spiegelhalter test statistic for calibration assessment.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Spiegelhalter test statistic
    """

def spiegelhalter_p_value(y_true, y_score):
    """
    P-value for Spiegelhalter calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Spiegelhalter test p-value
    """
```

## Usage Examples

### Regression Metrics Evaluation

```python
import numpy as np

from mapie.metrics.regression import (
    regression_coverage_score,
    regression_mean_width_score,
    regression_ssc_score,
    coverage_width_based
)

# Assume we have predictions from a MAPIE regressor:
# y_pred: point predictions
# y_intervals: prediction intervals (shape: n_samples x 2 x n_alpha)
# y_test: true values

# Coverage evaluation
coverage_scores = regression_coverage_score(y_test, y_intervals)
print(f"Coverage scores: {coverage_scores}")

# Width evaluation
mean_widths = regression_mean_width_score(y_intervals)
print(f"Mean interval widths: {mean_widths}")

# Size-stratified coverage
ssc_scores = regression_ssc_score(y_test, y_intervals, num_bins=10)
print(f"Size-stratified coverage: {ssc_scores}")

# Coverage-width trade-off
cwb_scores = coverage_width_based(y_test, y_intervals, eta=0.5)
print(f"Coverage-width-based scores: {cwb_scores}")
```

### Classification Metrics Evaluation

```python
from mapie.metrics.classification import (
    classification_coverage_score,
    classification_mean_width_score,
    classification_ssc_score
)

# Assume we have prediction sets from a MAPIE classifier:
# y_pred_sets: binary matrix (n_samples x n_classes)
# y_test: true class labels

# Coverage evaluation
coverage = classification_coverage_score(y_test, y_pred_sets)
print(f"Empirical coverage: {coverage:.3f}")

# Set size evaluation
mean_set_size = classification_mean_width_score(y_pred_sets)
print(f"Mean prediction set size: {mean_set_size:.2f}")

# Size-stratified coverage
ssc_scores = classification_ssc_score(y_test, y_pred_sets, num_bins=5)
print(f"Size-stratified coverage by bin: {ssc_scores}")
```

### Calibration Assessment

```python
from mapie.metrics.calibration import (
    expected_calibration_error,
    top_label_ece,
    kolmogorov_smirnov_statistic,
    spiegelhalter_p_value
)

# Binary classification calibration
y_proba_binary = classifier.predict_proba(X_test)[:, 1]
y_binary = (y_test == positive_class).astype(int)

# Expected Calibration Error
ece = expected_calibration_error(y_binary, y_proba_binary, num_bins=10)
print(f"Expected Calibration Error: {ece:.4f}")

# Kolmogorov-Smirnov test
ks_stat = kolmogorov_smirnov_statistic(y_binary, y_proba_binary)
print(f"KS statistic: {ks_stat:.4f}")

# Multi-class calibration
y_proba_multi = classifier.predict_proba(X_test)
top_ece = top_label_ece(y_test, y_proba_multi)
print(f"Top-label ECE: {top_ece:.4f}")

# Statistical significance test
spieg_pval = spiegelhalter_p_value(y_binary, y_proba_binary)
print(f"Spiegelhalter p-value: {spieg_pval:.4f}")
```

## Advanced Analysis

### Comprehensive Regression Evaluation

```python
import numpy as np

from mapie.metrics.regression import (
    regression_coverage_score,
    regression_mean_width_score
)

def evaluate_regression_intervals(y_true, y_pred, y_intervals, confidence_levels):
    """Comprehensive evaluation of regression prediction intervals."""
    results = {}

    for i, alpha in enumerate(confidence_levels):
        # Select the intervals for this confidence level
        level_intervals = y_intervals[:, :, i] if y_intervals.ndim == 3 else y_intervals

        # Coverage
        coverage = regression_coverage_score(y_true, level_intervals)

        # Width
        width = regression_mean_width_score(level_intervals)

        # Efficiency (width relative to the empirical residual quantile)
        residuals = np.abs(y_true - y_pred)
        empirical_quantile = np.quantile(residuals, alpha)
        efficiency = width / (2 * empirical_quantile) if empirical_quantile > 0 else np.inf

        results[f"confidence_{alpha}"] = {
            "coverage": coverage,
            "mean_width": width,
            "efficiency": efficiency
        }

    return results

# Usage
results = evaluate_regression_intervals(
    y_test, y_pred, y_intervals,
    confidence_levels=[0.8, 0.9, 0.95]
)
```

### Classification Set Analysis

```python
import numpy as np

from mapie.metrics.classification import classification_coverage_score

def analyze_prediction_sets(y_true, y_pred_sets, class_names=None):
    """Analyze prediction set characteristics."""
    n_samples, n_classes = y_pred_sets.shape

    # Set sizes
    set_sizes = np.sum(y_pred_sets, axis=1)

    # Coverage
    coverage = classification_coverage_score(y_true, y_pred_sets)

    # Size distribution
    size_counts = np.bincount(set_sizes.astype(int), minlength=n_classes + 1)
    size_dist = size_counts / n_samples

    # Per-class inclusion rates
    inclusion_rates = np.mean(y_pred_sets, axis=0)

    results = {
        "overall_coverage": coverage,
        "mean_set_size": np.mean(set_sizes),
        "set_size_distribution": size_dist,
        "inclusion_rates": dict(zip(class_names or range(n_classes), inclusion_rates))
    }

    return results

# Usage
analysis = analyze_prediction_sets(y_test, y_pred_sets, class_names=['A', 'B', 'C'])
```

### Calibration Reliability Diagram

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

from mapie.metrics.calibration import expected_calibration_error

def plot_reliability_diagram(y_true, y_proba, n_bins=10):
    """Plot reliability diagram for calibration assessment."""
    # Compute calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_proba, n_bins=n_bins
    )

    # Plot
    plt.figure(figsize=(8, 6))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    plt.plot(mean_predicted_value, fraction_of_positives, 's-',
             label=f'Model (ECE = {expected_calibration_error(y_true, y_proba):.3f})')

    plt.xlabel('Mean Predicted Probability')
    plt.ylabel('Fraction of Positives')
    plt.title('Reliability Diagram')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Usage
plot_reliability_diagram(y_binary, y_proba_binary, n_bins=10)
```

### Independence Testing with HSIC

```python
import numpy as np

from mapie.metrics.regression import hsic

def test_interval_independence(residuals, interval_widths, kernel="gaussian"):
    """Test independence between residuals and interval widths using HSIC."""
    # Compute HSIC statistic on the observed pairs
    hsic_stat = hsic(residuals, interval_widths, kernel=kernel)

    # Permutation-based p-value approximation
    n_permutations = 1000
    permuted_stats = []

    for _ in range(n_permutations):
        # Shuffle one variable to break any dependence
        shuffled_widths = np.random.permutation(interval_widths)
        permuted_stats.append(hsic(residuals, shuffled_widths, kernel=kernel))

    # P-value: fraction of permuted statistics at least as large as the observed one
    p_value = np.mean(np.array(permuted_stats) >= hsic_stat)

    return {
        "hsic_statistic": hsic_stat,
        "p_value": p_value,
        "is_independent": p_value > 0.05
    }

# Usage (widths taken at the first alpha level of the n_samples x 2 x n_alpha array)
residuals = np.abs(y_test - y_pred)
widths = y_intervals[:, 1, 0] - y_intervals[:, 0, 0]
independence_test = test_interval_independence(residuals, widths)
```

## Metric Interpretation

### Coverage Metrics
- **Target**: Empirical coverage should match the nominal confidence level (e.g., 0.9 for 90% intervals)
- **Under-coverage**: Intervals are too narrow and understate the true uncertainty
- **Over-coverage**: Intervals are too wide; conservative but inefficient
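
The coverage notions above can be made concrete with a few lines of NumPy. This is a plain-array sketch independent of the MAPIE API; `coverage_gap` is a hypothetical helper name used only for illustration:

```python
import numpy as np

def coverage_gap(y_true, lower, upper, nominal=0.9):
    """Empirical coverage of [lower, upper] intervals and its gap to the nominal level."""
    covered = (y_true >= lower) & (y_true <= upper)
    empirical = covered.mean()
    return empirical, empirical - nominal

# Toy data: the intervals miss two of the five points
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lower = np.array([0.5, 1.5, 3.5, 3.5, 5.5])
upper = np.array([1.5, 2.5, 4.5, 4.5, 6.5])

empirical, gap = coverage_gap(y_true, lower, upper, nominal=0.9)
# empirical == 0.6 and gap ≈ -0.3: clear under-coverage at the 90% level
```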

### Width/Size Metrics
- **Regression**: Narrower intervals are better, provided nominal coverage is maintained
- **Classification**: Smaller sets are better (more decisive predictions)
- **Trade-off**: Balance coverage against efficiency
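
One common way to resolve this trade-off is to select, among methods whose empirical coverage reaches the nominal level, the one with the narrowest intervals. A minimal sketch; `pick_efficient`, the method names, and the numbers are all hypothetical:

```python
def pick_efficient(methods, nominal=0.9, tol=0.02):
    """Among methods whose coverage reaches nominal (within tol), pick the narrowest."""
    valid = {name: m for name, m in methods.items()
             if m["coverage"] >= nominal - tol}
    if not valid:
        return None  # nothing meets the coverage requirement
    return min(valid, key=lambda name: valid[name]["mean_width"])

# Hypothetical results from three interval methods at the 90% level
methods = {
    "split": {"coverage": 0.91, "mean_width": 2.4},
    "cv_plus": {"coverage": 0.90, "mean_width": 2.1},
    "naive": {"coverage": 0.78, "mean_width": 1.2},  # narrow but under-covers
}
best = pick_efficient(methods)  # "cv_plus": valid coverage, narrowest intervals
```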

### Calibration Metrics
- **ECE < 0.05**: Well-calibrated probabilities (common rule of thumb)
- **ECE > 0.1**: Poorly calibrated; recalibration is advisable
- **Statistical tests**: p-value < 0.05 indicates significant miscalibration
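
As a reference point for these thresholds, ECE is the bin-weighted average gap between mean predicted confidence and observed frequency. The sketch below (`ece_manual` is illustrative; MAPIE's `expected_calibration_error` may differ in binning details) shows what the number measures:

```python
import numpy as np

def ece_manual(y_true, y_prob, num_bins=10):
    """Binned ECE: weighted mean |observed frequency - mean confidence| per bin."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if mask.any():
            acc = y_true[mask].mean()    # observed frequency of positives in the bin
            conf = y_prob[mask].mean()   # mean predicted probability in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Perfectly calibrated toy data: events predicted at 0.1 occur 10% of the time, etc.
y_prob = np.array([0.1] * 10 + [0.9] * 10)
y_true = np.array([0] * 9 + [1] + [1] * 9 + [0])
well_calibrated = ece_manual(y_true, y_prob)   # ~0.0

# Overconfident predictions: 0.9 predicted, but only half are positives
overconfident = ece_manual(np.array([1, 0, 1, 0]), np.array([0.9] * 4))   # ~0.4
```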

### Size-Stratified Coverage
- **Conditional validity**: Coverage should be consistent across different interval sizes
- **Adaptive methods**: Should maintain coverage even when intervals vary significantly
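
The idea can be sketched directly: bin the intervals by width and compute coverage separately in each bin. `ssc_manual` below is a hypothetical illustration of the principle, not MAPIE's implementation:

```python
import numpy as np

def ssc_manual(y_true, lower, upper, num_bins=3):
    """Coverage computed separately within interval-width quantile bins."""
    widths = upper - lower
    covered = (y_true >= lower) & (y_true <= upper)
    # Width-quantile bin edges; each interval is assigned to one bin
    edges = np.quantile(widths, np.linspace(0, 1, num_bins + 1))
    idx = np.clip(np.searchsorted(edges, widths, side="right") - 1, 0, num_bins - 1)
    return np.array([covered[idx == b].mean() if (idx == b).any() else np.nan
                     for b in range(num_bins)])

# Toy intervals: the narrowest bin under-covers while the wider bins are fine
lower = np.zeros(6)
upper = np.array([1.0, 1.0, 2.0, 2.0, 4.0, 4.0])
y_true = np.array([0.5, 2.0, 1.0, 1.5, 3.0, 2.0])
per_bin = ssc_manual(y_true, lower, upper, num_bins=3)  # ~[0.5, 1.0, 1.0]
```

A uniform per-bin coverage profile indicates conditional validity; a dip in one width bin, as in the narrow bin above, reveals where the marginal coverage guarantee hides local failures.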