
# Evaluation Metrics

Comprehensive evaluation metrics for assessing the quality of conformal prediction intervals and sets, including coverage, width, and calibration metrics. These metrics help evaluate the performance and reliability of uncertainty quantification methods.

## Capabilities

### Regression Metrics

Metrics for evaluating prediction intervals in regression tasks, focusing on coverage guarantees, interval width efficiency, and distributional properties.

```python { .api }
def regression_coverage_score(y_true, y_intervals):
    """
    Compute coverage score for regression prediction intervals.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals (shape: n_samples x 2 x n_alpha)

    Returns:
    NDArray: coverage scores for each confidence level
    """

def regression_mean_width_score(y_intervals):
    """
    Compute mean width of prediction intervals.

    Parameters:
    - y_intervals: ArrayLike, prediction intervals (shape: n_samples x 2 x n_alpha)

    Returns:
    NDArray: mean interval widths for each confidence level
    """

def regression_ssc(y_true, y_intervals):
    """
    Size-stratified coverage score for regression.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals

    Returns:
    NDArray: size-stratified coverage scores
    """

def regression_ssc_score(y_true, y_intervals, num_bins=10):
    """
    Size-stratified coverage score with binning.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals
    - num_bins: int, number of bins for stratification (default: 10)

    Returns:
    NDArray: binned size-stratified coverage scores
    """

def hsic(x, y, kernel="gaussian"):
    """
    Hilbert-Schmidt Independence Criterion for testing independence.

    Parameters:
    - x: ArrayLike, first variable
    - y: ArrayLike, second variable
    - kernel: str, kernel type ("gaussian", "linear") (default: "gaussian")

    Returns:
    float: HSIC statistic
    """

def coverage_width_based(y_true, y_intervals, eta=1.0):
    """
    Coverage-width-based metric balancing coverage and efficiency.

    Parameters:
    - y_true: ArrayLike, true target values
    - y_intervals: ArrayLike, prediction intervals
    - eta: float, weight parameter for width penalty (default: 1.0)

    Returns:
    NDArray: coverage-width-based scores
    """

def regression_mwi_score(y_intervals, num_bins=10):
    """
    Mean width interval score with binning.

    Parameters:
    - y_intervals: ArrayLike, prediction intervals
    - num_bins: int, number of bins (default: 10)

    Returns:
    NDArray: mean width scores per bin
    """
```

### Classification Metrics

Metrics for evaluating prediction sets in classification tasks, measuring set coverage, size efficiency, and distributional properties.

```python { .api }
def classification_coverage_score(y_true, y_pred_set):
    """
    Compute coverage score for classification prediction sets.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets (binary matrix: n_samples x n_classes)

    Returns:
    NDArray: coverage scores
    """

def classification_mean_width_score(y_pred_set):
    """
    Compute mean size of prediction sets.

    Parameters:
    - y_pred_set: ArrayLike, prediction sets (binary matrix)

    Returns:
    float: mean prediction set size
    """

def classification_ssc(y_true, y_pred_set):
    """
    Size-stratified coverage for classification.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets

    Returns:
    NDArray: size-stratified coverage scores
    """

def classification_ssc_score(y_true, y_pred_set, num_bins=10):
    """
    Size-stratified coverage score with binning for classification.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_pred_set: ArrayLike, prediction sets
    - num_bins: int, number of bins for stratification (default: 10)

    Returns:
    NDArray: binned size-stratified coverage scores
    """
```

### Calibration Metrics

Metrics for evaluating probability calibration quality, testing whether predicted probabilities accurately reflect true confidence levels.

```python { .api }
def expected_calibration_error(y_true, y_scores, num_bins=50, split_strategy=None):
    """
    Expected Calibration Error (ECE) for probability predictions.

    Parameters:
    - y_true: ArrayLike, true binary labels (0/1)
    - y_scores: ArrayLike, predicted probabilities
    - num_bins: int, number of bins for reliability diagram (default: 50)
    - split_strategy: Optional[str], binning strategy ("uniform", "quantile")

    Returns:
    float: expected calibration error
    """

def top_label_ece(y_true, y_scores, num_bins=50, split_strategy=None):
    """
    Top-label Expected Calibration Error for multi-class problems.

    Parameters:
    - y_true: ArrayLike, true class labels
    - y_scores: ArrayLike, predicted class probabilities (n_samples x n_classes)
    - num_bins: int, number of bins (default: 50)
    - split_strategy: Optional[str], binning strategy

    Returns:
    float: top-label expected calibration error
    """

def kolmogorov_smirnov_statistic(y_true, y_score):
    """
    Kolmogorov-Smirnov test statistic for calibration assessment.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: KS test statistic
    """

def kolmogorov_smirnov_p_value(y_true, y_score):
    """
    P-value for the Kolmogorov-Smirnov calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: KS test p-value
    """

def kuiper_statistic(y_true, y_score):
    """
    Kuiper test statistic for calibration (circular KS test).

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Kuiper test statistic
    """

def kuiper_p_value(y_true, y_score):
    """
    P-value for the Kuiper calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Kuiper test p-value
    """

def spiegelhalter_statistic(y_true, y_score):
    """
    Spiegelhalter test statistic for calibration assessment.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Spiegelhalter test statistic
    """

def spiegelhalter_p_value(y_true, y_score):
    """
    P-value for the Spiegelhalter calibration test.

    Parameters:
    - y_true: ArrayLike, true binary labels
    - y_score: ArrayLike, predicted probabilities

    Returns:
    float: Spiegelhalter test p-value
    """
```

## Usage Examples

### Regression Metrics Evaluation

```python
from mapie.metrics.regression import (
    regression_coverage_score,
    regression_mean_width_score,
    regression_ssc_score,
    coverage_width_based,
)
import numpy as np

# Assume we have predictions from a MAPIE regressor
# y_pred: point predictions
# y_intervals: prediction intervals (shape: n_samples x 2 x n_alpha)
# y_test: true values

# Coverage evaluation
coverage_scores = regression_coverage_score(y_test, y_intervals)
print(f"Coverage scores: {coverage_scores}")

# Width evaluation
mean_widths = regression_mean_width_score(y_intervals)
print(f"Mean interval widths: {mean_widths}")

# Size-stratified coverage
ssc_scores = regression_ssc_score(y_test, y_intervals, num_bins=10)
print(f"Size-stratified coverage: {ssc_scores}")

# Coverage-width trade-off
cwb_scores = coverage_width_based(y_test, y_intervals, eta=0.5)
print(f"Coverage-width-based scores: {cwb_scores}")
```

### Classification Metrics Evaluation

```python
from mapie.metrics.classification import (
    classification_coverage_score,
    classification_mean_width_score,
    classification_ssc_score,
)

# Assume we have prediction sets from a MAPIE classifier
# y_pred_sets: binary matrix (n_samples x n_classes)
# y_test: true class labels

# Coverage evaluation
coverage = classification_coverage_score(y_test, y_pred_sets)
print(f"Empirical coverage: {coverage:.3f}")

# Set size evaluation
mean_set_size = classification_mean_width_score(y_pred_sets)
print(f"Mean prediction set size: {mean_set_size:.2f}")

# Size-stratified coverage
ssc_scores = classification_ssc_score(y_test, y_pred_sets, num_bins=5)
print(f"Size-stratified coverage by bin: {ssc_scores}")
```

### Calibration Assessment

```python
from mapie.metrics.calibration import (
    expected_calibration_error,
    top_label_ece,
    kolmogorov_smirnov_statistic,
    spiegelhalter_p_value,
)

# Assume classifier is a fitted probabilistic classifier and
# positive_class is the label treated as the positive outcome

# Binary classification calibration
y_proba_binary = classifier.predict_proba(X_test)[:, 1]
y_binary = (y_test == positive_class).astype(int)

# Expected Calibration Error
ece = expected_calibration_error(y_binary, y_proba_binary, num_bins=10)
print(f"Expected Calibration Error: {ece:.4f}")

# Kolmogorov-Smirnov test
ks_stat = kolmogorov_smirnov_statistic(y_binary, y_proba_binary)
print(f"KS statistic: {ks_stat:.4f}")

# Multi-class calibration
y_proba_multi = classifier.predict_proba(X_test)
top_ece = top_label_ece(y_test, y_proba_multi)
print(f"Top-label ECE: {top_ece:.4f}")

# Statistical significance test
spieg_pval = spiegelhalter_p_value(y_binary, y_proba_binary)
print(f"Spiegelhalter p-value: {spieg_pval:.4f}")
```

## Advanced Analysis

### Comprehensive Regression Evaluation

```python
import numpy as np

from mapie.metrics.regression import (
    regression_coverage_score,
    regression_mean_width_score,
)

def evaluate_regression_intervals(y_true, y_pred, y_intervals, confidence_levels):
    """Comprehensive evaluation of regression prediction intervals."""
    results = {}

    for i, level in enumerate(confidence_levels):
        level_intervals = y_intervals[:, :, i] if y_intervals.ndim == 3 else y_intervals

        # Coverage
        coverage = regression_coverage_score(y_true, level_intervals)

        # Width
        width = regression_mean_width_score(level_intervals)

        # Efficiency (width relative to the empirical residual quantile)
        residuals = np.abs(y_true - y_pred)
        empirical_quantile = np.quantile(residuals, level)
        efficiency = width / (2 * empirical_quantile) if empirical_quantile > 0 else np.inf

        results[f"confidence_{level}"] = {
            "coverage": coverage,
            "mean_width": width,
            "efficiency": efficiency,
        }

    return results

# Usage
results = evaluate_regression_intervals(
    y_test, y_pred, y_intervals,
    confidence_levels=[0.8, 0.9, 0.95],
)
```

### Classification Set Analysis

```python
import numpy as np

from mapie.metrics.classification import classification_coverage_score

def analyze_prediction_sets(y_true, y_pred_sets, class_names=None):
    """Analyze prediction set characteristics."""
    n_samples, n_classes = y_pred_sets.shape

    # Set sizes
    set_sizes = np.sum(y_pred_sets, axis=1)

    # Coverage
    coverage = classification_coverage_score(y_true, y_pred_sets)

    # Size distribution
    size_counts = np.bincount(set_sizes.astype(int), minlength=n_classes + 1)
    size_dist = size_counts / n_samples

    # Per-class inclusion rates
    inclusion_rates = np.mean(y_pred_sets, axis=0)

    results = {
        "overall_coverage": coverage,
        "mean_set_size": np.mean(set_sizes),
        "set_size_distribution": size_dist,
        "inclusion_rates": dict(zip(class_names or range(n_classes), inclusion_rates)),
    }

    return results

# Usage
analysis = analyze_prediction_sets(y_test, y_pred_sets, class_names=['A', 'B', 'C'])
```

### Calibration Reliability Diagram

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

from mapie.metrics.calibration import expected_calibration_error

def plot_reliability_diagram(y_true, y_proba, n_bins=10):
    """Plot reliability diagram for calibration assessment."""

    # Compute calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_proba, n_bins=n_bins
    )

    # Plot
    plt.figure(figsize=(8, 6))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    plt.plot(mean_predicted_value, fraction_of_positives, 's-',
             label=f'Model (ECE = {expected_calibration_error(y_true, y_proba):.3f})')
    plt.xlabel('Mean Predicted Probability')
    plt.ylabel('Fraction of Positives')
    plt.title('Reliability Diagram')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Usage
plot_reliability_diagram(y_binary, y_proba_binary, n_bins=10)
```

### Independence Testing with HSIC

```python
import numpy as np

from mapie.metrics.regression import hsic

def test_interval_independence(residuals, interval_widths, kernel="gaussian"):
    """Test independence between residuals and interval widths using HSIC."""

    # Compute HSIC statistic
    hsic_stat = hsic(residuals, interval_widths, kernel=kernel)

    # Permutation-based p-value approximation
    n_permutations = 1000
    permuted_stats = []

    for _ in range(n_permutations):
        # Shuffle one variable to break any dependence
        shuffled_widths = np.random.permutation(interval_widths)
        permuted_stats.append(hsic(residuals, shuffled_widths, kernel=kernel))

    # P-value: fraction of permuted statistics at least as large
    p_value = np.mean(np.array(permuted_stats) >= hsic_stat)

    return {
        "hsic_statistic": hsic_stat,
        "p_value": p_value,
        "is_independent": p_value > 0.05,
    }

# Usage (widths taken at the first confidence level of the 3-D interval array)
residuals = np.abs(y_test - y_pred)
widths = y_intervals[:, 1, 0] - y_intervals[:, 0, 0]
independence_test = test_interval_independence(residuals, widths)
```

## Metric Interpretation

### Coverage Metrics

- **Target**: Empirical coverage should match the nominal confidence level (e.g., 0.9 for 90% intervals)
- **Under-coverage**: Intervals are too narrow; uncertainty is underestimated
- **Over-coverage**: Intervals are too wide; conservative but inefficient
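As a quick sanity check, empirical coverage is simply the fraction of true values that fall inside their intervals. A minimal NumPy sketch on synthetic data (independent of MAPIE; the interval construction here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)

# Hypothetical 90% intervals: symmetric about 0, with half-width set to the
# 90th percentile of |y|, so empirical coverage should land near 0.90
half_width = np.quantile(np.abs(y_true), 0.9)
lower, upper = -half_width, half_width

# Empirical coverage: fraction of true values inside their interval
coverage = np.mean((y_true >= lower) & (y_true <= upper))
print(f"Empirical coverage: {coverage:.3f} (nominal: 0.90)")
```

A result noticeably below 0.90 would signal under-coverage; noticeably above, over-coverage.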

### Width/Size Metrics

- **Regression**: Narrower intervals are better, provided nominal coverage is maintained
- **Classification**: Smaller sets are better (more decisive predictions)
- **Trade-off**: Balance coverage against efficiency
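One standard way to score this trade-off in regression is the Winkler interval score, which adds the interval width to a penalty scaled by 2/alpha whenever the true value falls outside the interval. A generic sketch (this is not MAPIE's `coverage_width_based`):

```python
import numpy as np

def interval_score(y_true, lower, upper, alpha=0.1):
    """Winkler interval score: mean width plus scaled penalties for misses."""
    width = upper - lower
    below = np.maximum(lower - y_true, 0.0)  # distance below the interval
    above = np.maximum(y_true - upper, 0.0)  # distance above the interval
    return np.mean(width + (2.0 / alpha) * (below + above))

# The third point misses its interval by 3, incurring a (2 / 0.1) * 3 penalty
y = np.array([0.0, 1.0, 5.0])
score = interval_score(y, np.array([-1.0, 0.0, 0.0]), np.array([1.0, 2.0, 2.0]))
print(score)
```

Lower is better: wide intervals and missed points both inflate the score.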

### Calibration Metrics

- **ECE < 0.05**: Generally considered well-calibrated
- **ECE > 0.1**: Poorly calibrated; recalibration is advisable
- **Statistical tests**: A p-value < 0.05 indicates significant miscalibration
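ECE is the occupancy-weighted average gap between the mean predicted probability and the observed positive rate within each bin. A minimal uniform-binning sketch (a simplified illustration, not MAPIE's implementation):

```python
import numpy as np

def ece_uniform(y_true, y_scores, num_bins=10):
    """Expected Calibration Error with uniform probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_scores = np.asarray(y_scores, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Assign each score to a bin, folding 1.0 into the last bin
    bin_idx = np.clip(np.digitize(y_scores, edges[1:-1]), 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(y_scores[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin occupancy
    return ece

# Perfectly calibrated toy data: score 0.25 -> 25% positives, 0.75 -> 75%
scores = np.array([0.25] * 4 + [0.75] * 4)
labels = np.array([1, 0, 0, 0, 1, 1, 1, 0])
print(ece_uniform(labels, scores, num_bins=4))  # -> 0.0
```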

### Size-Stratified Coverage

- **Conditional validity**: Coverage should be consistent across different interval sizes
- **Adaptive methods**: Should maintain coverage even when interval sizes vary widely
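The idea can be sketched directly: bin samples by interval width and compute coverage inside each bin. A rough NumPy illustration (not MAPIE's `regression_ssc`), using synthetic intervals whose width adapts per sample:

```python
import numpy as np

def coverage_by_width_bin(y_true, lower, upper, num_bins=3):
    """Coverage computed separately within width-quantile bins."""
    widths = upper - lower
    # Quantile edges so each bin holds roughly the same number of samples
    inner_edges = np.quantile(widths, np.linspace(0, 1, num_bins + 1))[1:-1]
    bin_idx = np.clip(np.searchsorted(inner_edges, widths, side="right"), 0, num_bins - 1)
    covered = (y_true >= lower) & (y_true <= upper)
    return np.array([covered[bin_idx == b].mean() for b in range(num_bins)])

rng = np.random.default_rng(1)
y = rng.normal(size=600)
# Adaptive-looking intervals: half-width at least the 90% normal quantile
hw = 1.64 + 0.5 * np.abs(rng.normal(size=600))
per_bin = coverage_by_width_bin(y, -hw, hw, num_bins=3)
print(per_bin)  # per-bin coverage; ideally every value sits near the nominal level
```

If coverage collapses in the narrow-interval bins, the method is only marginally valid, not conditionally valid.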