# Processing Tools

Data preprocessing, discretization, bootstrapping, and CDF manipulation tools required for forecast verification workflows. These utilities prepare data for scoring functions and provide statistical resampling capabilities.

## Capabilities

### Data Discretization

Tools for converting continuous data to categorical or binary formats required for categorical scoring.

#### Comparative Discretization

Discretizes both forecast and observation data using common bin edges.

```python { .api }
def comparative_discretise(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    bin_edges: Sequence[float],
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> Tuple[XarrayLike, XarrayLike]:
    """
    Discretize forecast and observation data using common bin edges.

    Args:
        fcst: Continuous forecast data
        obs: Continuous observation data
        bin_edges: Bin edge values for discretization
        reduce_dims: Dimensions to reduce during binning
        preserve_dims: Dimensions to preserve

    Returns:
        Tuple of (discretized_forecast, discretized_observation)

    Notes:
        - Creates consistent categorical bins for both datasets
        - Bin edges define category boundaries
        - Returns integer category labels starting from 0
        - Values outside bin range assigned to nearest bin
    """
```
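As a rough illustration of the idea (a sketch only, not the `scores` implementation), shared-edge binning can be expressed with `numpy.digitize`, clipping out-of-range values into the nearest bin as the notes describe:

```python
import numpy as np

# Illustrative sketch: bin both arrays with the same edges so they share
# category definitions. Labels start from 0; out-of-range values are clipped
# into the nearest bin.
bin_edges = [0.0, 1.0, 5.0, 10.0]
fcst = np.array([0.5, 3.0, 7.0, 12.0])
obs = np.array([0.2, 6.0, 0.9, 4.0])

fcst_cat = np.clip(np.digitize(fcst, bin_edges) - 1, 0, len(bin_edges) - 2)
obs_cat = np.clip(np.digitize(obs, bin_edges) - 1, 0, len(bin_edges) - 2)
print(fcst_cat.tolist())  # [0, 1, 2, 2]
print(obs_cat.tolist())   # [0, 2, 0, 1]
```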

#### Binary Discretization

Converts continuous data to binary events using threshold comparison.

```python { .api }
def binary_discretise(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert continuous data to binary events using threshold.

    Args:
        data: Input continuous data
        threshold: Threshold value(s) for binary conversion
        threshold_operator: Comparison operator (ge, le, gt, lt)

    Returns:
        Binary data {0, 1} where 1 indicates event occurrence

    Formula:
        binary_result = threshold_operator(data, threshold)

    Notes:
        - Default: event = data >= threshold
        - Threshold can be scalar or array with compatible dimensions
        - NaN values preserved as NaN in output
    """
```
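The default behaviour described above — an event wherever `data >= threshold`, with NaN preserved — can be sketched in plain NumPy (illustrative only; the actual function operates on xarray objects):

```python
import operator

import numpy as np

# Illustrative sketch: threshold comparison plus explicit NaN preservation.
data = np.array([1.0, 6.5, np.nan, 5.0])
threshold = 5.0
result = operator.ge(data, threshold).astype(float)
result[np.isnan(data)] = np.nan  # comparisons with NaN yield False, so restore NaN
print(result)
```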

#### Proportion-based Binary Discretization

Creates binary events based on data percentiles rather than fixed thresholds.

```python { .api }
def binary_discretise_proportion(
    data: XarrayLike,
    proportion: float,
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert data to binary events using proportion-based threshold.

    Args:
        data: Input continuous data
        proportion: Proportion of data to classify as events (0-1)
        dims: Dimensions over which to calculate percentiles
        threshold_operator: Comparison operator

    Returns:
        Binary data where specified proportion are events

    Notes:
        - proportion=0.1 creates events for top 10% of values
        - Threshold determined from data percentiles
        - Useful for relative event definitions
    """
```
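The percentile-based thresholding described in the notes can be sketched with `numpy.quantile` (a conceptual sketch under the default `>=` operator, not the library implementation):

```python
import numpy as np

# Illustrative sketch: derive the threshold from the data's own quantiles,
# then binarise. proportion=0.2 marks roughly the top 20% of values as events.
rng = np.random.default_rng(0)
data = rng.exponential(3.0, 10_000)
proportion = 0.2
threshold = np.quantile(data, 1.0 - proportion)
events = (data >= threshold).astype(float)
print(round(float(events.mean()), 2))
```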

#### Proportion Exceeding Threshold

Calculates the fraction of data exceeding a specified threshold.

```python { .api }
def proportion_exceeding(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Calculate proportion of data exceeding threshold.

    Args:
        data: Input data
        threshold: Threshold value(s)
        dims: Dimensions over which to calculate proportion
        threshold_operator: Comparison operator

    Returns:
        Proportion values [0, 1]

    Notes:
        - Returns fraction of values satisfying threshold condition
        - Useful for climate statistics and event frequencies
    """
```
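Conceptually, the proportion exceeding a threshold is the mean of the binary exceedance indicator over valid values; a minimal NumPy sketch (not the library implementation):

```python
import numpy as np

# Illustrative sketch: fraction of non-NaN values satisfying the condition.
data = np.array([1.0, 2.0, 6.0, 8.0, np.nan])
valid = ~np.isnan(data)
prop = (data[valid] >= 5.0).mean()
print(prop)  # 0.5
```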

### Data Matching and Broadcasting

Utilities for aligning and preparing data arrays for scoring.

#### Broadcast and Match NaN

Ensures consistent NaN patterns between forecast and observation arrays.

```python { .api }
def broadcast_and_match_nan(
    *args: XarrayLike,
) -> Tuple[XarrayLike, ...]:
    """
    Broadcast arrays and match NaN patterns.

    Args:
        *args: Variable number of xarray data arrays

    Returns:
        Tuple of arrays with matching dimensions and NaN patterns

    Notes:
        - Broadcasts arrays to common dimension set
        - Propagates NaN values across all arrays
        - Ensures consistent missing data handling
        - Required preprocessing for many scoring functions
    """
```
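The two steps in the notes — broadcasting, then propagating the union of the NaN masks — can be sketched in NumPy (illustrative only; the actual function works on xarray objects and aligns named dimensions):

```python
import numpy as np

# Illustrative sketch: broadcast to a common shape, then write the union of
# the NaN masks into every array so all arrays miss the same points.
a = np.array([1.0, np.nan, 3.0, 4.0])
b = np.array([np.nan, 2.0, 3.0, 4.0])
a2, b2 = (x.copy() for x in np.broadcast_arrays(a, b))
mask = np.isnan(a2) | np.isnan(b2)
a2[mask] = np.nan
b2[mask] = np.nan
print(np.isnan(a2).sum(), np.isnan(b2).sum())  # 2 2
```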

### Statistical Resampling

Bootstrap methods for uncertainty quantification and confidence interval estimation.

#### Block Bootstrap

Block bootstrap resampling for time series data with temporal dependence.

```python { .api }
def block_bootstrap(
    data: xr.DataArray,
    *,
    block_length: int,
    n_samples: int = 1000,
    rng: Optional[Union[int, np.random.Generator]] = None,
    dim: str = "time",
) -> xr.DataArray:
    """
    Generate block bootstrap samples from time series data.

    Args:
        data: Input time series data
        block_length: Length of each bootstrap block
        n_samples: Number of bootstrap samples to generate
        rng: Random number generator or seed
        dim: Name of time dimension for resampling

    Returns:
        Bootstrap samples with new 'bootstrap_sample' dimension

    Notes:
        - Preserves temporal autocorrelation within blocks
        - Block length should reflect data correlation timescale
        - Suitable for dependent time series data
        - Output shape: original + (n_samples,)
    """
```
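The core resampling idea — drawing fixed-length blocks with replacement so within-block temporal ordering (and hence autocorrelation) survives — can be sketched in NumPy. This is a conceptual moving-block sketch, not the `scores` implementation:

```python
import numpy as np

# Illustrative moving-block bootstrap sketch: draw block start indices with
# replacement and concatenate fixed-length blocks.
rng = np.random.default_rng(42)
series = np.arange(30, dtype=float)
block_length = 5
n_blocks = len(series) // block_length

starts = rng.integers(0, len(series) - block_length + 1, size=n_blocks)
sample = np.concatenate([series[s:s + block_length] for s in starts])
print(sample.shape)  # (30,)
```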

### Isotonic Regression

Isotonic fitting for reliability diagrams and calibration analysis.

```python { .api }
def isotonic_fit(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> XarrayLike:
    """
    Fit isotonic regression for reliability diagram analysis.

    Args:
        fcst: Forecast probability values [0, 1]
        obs: Binary observation values {0, 1}
        reduce_dims: Dimensions to reduce
        preserve_dims: Dimensions to preserve

    Returns:
        Isotonic regression fit values

    Notes:
        - Creates monotonic mapping from forecast to observed frequency
        - Used for probability forecast calibration assessment
        - Output represents conditional event frequency given forecast
    """
```
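Isotonic regression is commonly computed with the pool-adjacent-violators algorithm (PAVA): adjacent groups whose means violate monotonicity are merged until the fitted values are non-decreasing. A compact illustrative sketch (not the `scores` implementation):

```python
import numpy as np

# Illustrative PAVA sketch: merge adjacent blocks until block means increase.
def pava(y):
    blocks = [[v, 1] for v in y]  # each block holds [mean, count]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:  # monotonicity violated: pool
            m0, c0 = blocks[i]
            m1, c1 = blocks[i + 1]
            blocks[i] = [(m0 * c0 + m1 * c1) / (c0 + c1), c0 + c1]
            del blocks[i + 1]
            i = max(i - 1, 0)  # re-check against the previous block
        else:
            i += 1
    return np.concatenate([[m] * c for m, c in blocks])

# Binary outcomes ordered by increasing forecast probability.
obs_sorted_by_fcst = [0, 1, 0, 1, 1]
fit = pava(obs_sorted_by_fcst)
print(fit.tolist())  # [0.0, 0.5, 0.5, 1.0, 1.0]
```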

### CDF Processing Functions

Specialized functions for manipulating Cumulative Distribution Function data.

#### CDF Manipulation

```python { .api }
from scores.processing.cdf import (
    round_values,
    propagate_nan,
    observed_cdf,
    integrate_square_piecewise_linear,
    add_thresholds,
    fill_cdf,
    decreasing_cdfs,
    cdf_envelope,
)

def round_values(
    values: xr.DataArray,
    precision: int = 6
) -> xr.DataArray:
    """Round values for numerical stability in CDF processing."""

def propagate_nan(
    data: xr.DataArray,
    threshold_dim: str
) -> xr.DataArray:
    """Propagate NaN values consistently through CDF dimensions."""

def observed_cdf(
    obs: xr.DataArray,
    thresholds: xr.DataArray,
    threshold_dim: str
) -> xr.DataArray:
    """Create empirical CDF from observation data."""

def integrate_square_piecewise_linear(
    values: xr.DataArray,
    coords: xr.DataArray,
    coord_dim: str
) -> xr.DataArray:
    """Integrate squared piecewise linear functions for CRPS calculation."""

def add_thresholds(
    cdf_data: xr.DataArray,
    additional_thresholds: xr.DataArray,
    threshold_dim: str
) -> xr.DataArray:
    """Add additional threshold points to CDF data."""

def fill_cdf(
    cdf_data: xr.DataArray,
    method: str = "linear",
    threshold_dim: str = "threshold"
) -> xr.DataArray:
    """Fill missing values in CDF using specified interpolation method."""

def decreasing_cdfs(
    cdf_data: xr.DataArray,
    threshold_dim: str
) -> xr.DataArray:
    """Handle and correct decreasing CDF values."""

def cdf_envelope(
    cdf_data: xr.DataArray,
    threshold_dim: str
) -> xr.DataArray:
    """Create CDF envelope for ensemble of CDFs."""
```
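To fix intuition for what `observed_cdf` produces: the empirical CDF of a single observation is a step function that jumps from 0 to 1 at the observed value, evaluated on the threshold grid. A minimal NumPy sketch of that concept (illustrative only):

```python
import numpy as np

# Illustrative sketch: step-function CDF for one observation on a threshold grid.
obs_value = 7.0
thresholds = np.linspace(0, 20, 21)
cdf = (thresholds >= obs_value).astype(float)
print(cdf[6], cdf[7], cdf[8])  # 0.0 1.0 1.0
```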

## Usage Patterns

### Basic Data Preprocessing

```python
from scores.processing import (
    binary_discretise, comparative_discretise,
    broadcast_and_match_nan
)
import xarray as xr
import numpy as np

# Create sample continuous data
forecast = xr.DataArray(np.random.exponential(3, 1000))
observation = xr.DataArray(np.random.exponential(3, 1000))

# Convert to binary events (>= 5.0)
fcst_binary = binary_discretise(forecast, 5.0)
obs_binary = binary_discretise(observation, 5.0)

print(f"Event frequency in forecast: {fcst_binary.mean().values:.3f}")
print(f"Event frequency in observations: {obs_binary.mean().values:.3f}")

# Create multicategorical data
bin_edges = [0, 1, 5, 10, 20, 100]
fcst_categorical, obs_categorical = comparative_discretise(
    forecast, observation, bin_edges=bin_edges
)

print(f"Categorical forecast range: {fcst_categorical.min().values} to {fcst_categorical.max().values}")
```

### Proportion-based Event Definition

```python
# Define events based on data distribution
from scores.processing import binary_discretise_proportion, proportion_exceeding

# Top 20% of values as events
fcst_top20 = binary_discretise_proportion(forecast, 0.2)
obs_top20 = binary_discretise_proportion(observation, 0.2)

# Verify proportion
print(f"Forecast top 20% frequency: {fcst_top20.mean().values:.3f}")
print(f"Observation top 20% frequency: {obs_top20.mean().values:.3f}")

# Calculate actual proportions exceeding various thresholds
thresholds = [1, 2, 5, 10, 15]
for threshold in thresholds:
    fcst_prop = proportion_exceeding(forecast, threshold)
    obs_prop = proportion_exceeding(observation, threshold)
    print(f"Threshold {threshold:2d}: fcst={fcst_prop.values:.3f}, obs={obs_prop.values:.3f}")
```

### Data Alignment and Broadcasting

```python
# Handle misaligned data with NaN patterns
forecast_with_nan = forecast.copy()
observation_with_nan = observation.copy()

# Introduce different NaN patterns
forecast_with_nan[::10] = np.nan  # Every 10th value
observation_with_nan[5::15] = np.nan  # Different pattern

# Align and match NaN patterns
aligned_fcst, aligned_obs = broadcast_and_match_nan(
    forecast_with_nan, observation_with_nan
)

print(f"Original forecast NaN count: {forecast_with_nan.isnull().sum().values}")
print(f"Original observation NaN count: {observation_with_nan.isnull().sum().values}")
print(f"Aligned data NaN count: {aligned_fcst.isnull().sum().values}")
```

### Bootstrap Uncertainty Estimation

```python
from scores.processing import block_bootstrap
from scores.continuous import mse
import pandas as pd

# Create time series data
time_index = pd.date_range("2023-01-01", periods=365, freq="D")
forecast_ts = xr.DataArray(
    np.random.normal(10, 2, 365) + 3 * np.sin(np.arange(365) * 2 * np.pi / 365),
    dims=["time"],
    coords={"time": time_index}
)
observation_ts = xr.DataArray(
    forecast_ts.values + np.random.normal(0, 1, 365),
    dims=["time"],
    coords={"time": time_index}
)

# Generate bootstrap samples
bootstrap_samples = block_bootstrap(
    forecast_ts,
    block_length=14,  # 2-week blocks
    n_samples=1000,
    dim="time"
)

# Calculate MSE for each bootstrap sample
mse_bootstrap = []
for i in range(1000):
    boot_fcst = bootstrap_samples.isel(bootstrap_sample=i)
    boot_obs = observation_ts.sel(time=boot_fcst.time)  # Match by time labels
    mse_val = mse(boot_fcst, boot_obs)
    mse_bootstrap.append(mse_val.values)

mse_bootstrap = np.array(mse_bootstrap)
print(f"Bootstrap MSE: {np.mean(mse_bootstrap):.3f} ± {np.std(mse_bootstrap):.3f}")
print(f"95% CI: [{np.percentile(mse_bootstrap, 2.5):.3f}, {np.percentile(mse_bootstrap, 97.5):.3f}]")
```

### Probability Calibration Analysis

```python
from scores.processing import isotonic_fit

# Create probability forecast and binary observations
prob_forecast = xr.DataArray(np.random.beta(2, 2, 500))
# Generate observations with some relationship to forecast
binary_obs = xr.DataArray(
    np.random.binomial(1, 0.3 + 0.4 * prob_forecast.values)
)

# Fit isotonic regression for reliability analysis
reliability_fit = isotonic_fit(prob_forecast, binary_obs)

print(f"Original forecast range: [{prob_forecast.min().values:.3f}, {prob_forecast.max().values:.3f}]")
print(f"Reliability fit range: [{reliability_fit.min().values:.3f}, {reliability_fit.max().values:.3f}]")

# Perfect reliability would have reliability_fit ≈ prob_forecast
reliability_bias = (reliability_fit - prob_forecast).mean()
print(f"Mean reliability bias: {reliability_bias.values:.3f}")
```

### Multi-dimensional Processing

```python
# Process multi-dimensional data
forecast_3d = xr.DataArray(
    np.random.exponential(2, (50, 20, 30)),  # time, lat, lon
    dims=["time", "lat", "lon"]
)
observation_3d = xr.DataArray(
    np.random.exponential(2, (50, 20, 30)),
    dims=["time", "lat", "lon"]
)

# Binary discretization preserves all dimensions
fcst_events_3d = binary_discretise(forecast_3d, 5.0)
obs_events_3d = binary_discretise(observation_3d, 5.0)

# Calculate event frequencies over different dimensions
temporal_frequency = fcst_events_3d.mean(dim="time")  # Spatial map of event frequency
spatial_frequency = fcst_events_3d.mean(dim=["lat", "lon"])  # Time series of event frequency
overall_frequency = fcst_events_3d.mean()  # Overall frequency

print(f"Spatial frequency range: {temporal_frequency.min().values:.3f} to {temporal_frequency.max().values:.3f}")
print(f"Overall event frequency: {overall_frequency.values:.3f}")

# Proportion-based events calculated over specific dimensions:
# top 10% events at each grid point over time
spatial_top10 = binary_discretise_proportion(
    forecast_3d, 0.1, dims="time"
)
print(f"Spatial top 10% events shape: {spatial_top10.shape}")
```

### CDF Data Processing

```python
from scores.processing.cdf import fill_cdf, add_thresholds, observed_cdf

# Create sample CDF data with missing values
thresholds = xr.DataArray(np.linspace(0, 20, 21), dims=["threshold"])
cdf_values = xr.DataArray(
    np.cumsum(np.random.exponential(0.1, (100, 21)), axis=1),
    dims=["sample", "threshold"],
    coords={"threshold": thresholds}
)
# Normalize to [0, 1]
cdf_values = cdf_values / cdf_values.isel(threshold=-1, drop=True)

# Introduce missing values
cdf_values = cdf_values.where(np.random.random(cdf_values.shape) > 0.05)

# Fill missing values
filled_cdf = fill_cdf(cdf_values, method="linear", threshold_dim="threshold")

# Add additional threshold points
additional_thresh = xr.DataArray([2.5, 7.5, 12.5, 17.5])
extended_cdf = add_thresholds(filled_cdf, additional_thresh, "threshold")

print(f"Original CDF shape: {cdf_values.shape}")
print(f"Extended CDF shape: {extended_cdf.shape}")
print(f"Missing values before: {cdf_values.isnull().sum().values}")
print(f"Missing values after: {extended_cdf.isnull().sum().values}")
```