# Processing Tools

Data preprocessing, discretization, bootstrapping, and CDF manipulation tools required for forecast verification workflows. These utilities prepare data for scoring functions and provide statistical resampling capabilities.

## Capabilities

### Data Discretization

Tools for converting continuous data to the categorical or binary formats required for categorical scoring.

#### Comparative Discretization

Discretizes both forecast and observation data using common bin edges.

```python { .api }
def comparative_discretise(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    bin_edges: Sequence[float],
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> Tuple[XarrayLike, XarrayLike]:
    """
    Discretize forecast and observation data using common bin edges.

    Args:
        fcst: Continuous forecast data
        obs: Continuous observation data
        bin_edges: Bin edge values for discretization
        reduce_dims: Dimensions to reduce during binning
        preserve_dims: Dimensions to preserve

    Returns:
        Tuple of (discretized_forecast, discretized_observation)

    Notes:
        - Creates consistent categorical bins for both datasets
        - Bin edges define category boundaries
        - Returns integer category labels starting from 0
        - Values outside the bin range are assigned to the nearest bin
    """
```
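
The common-bin-edge behaviour can be sketched with plain `numpy`/`xarray`, without assuming the library internals. Here `np.digitize` stands in for the binning step (note this sketch labels out-of-range values with an extra category rather than snapping to the nearest bin):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for comparative_discretise: both arrays are
# binned against the SAME edges, so category labels are directly
# comparable between forecast and observation.
bin_edges = [0.0, 1.0, 5.0, 10.0]
fcst = xr.DataArray([0.3, 4.2, 9.9, 12.0])
obs = xr.DataArray([0.9, 6.0, 2.0, 0.1])

fcst_cat = xr.DataArray(np.digitize(fcst, bin_edges) - 1, dims=fcst.dims)
obs_cat = xr.DataArray(np.digitize(obs, bin_edges) - 1, dims=obs.dims)

print(fcst_cat.values)  # [0 1 2 3] -> 12.0 falls above the last edge
print(obs_cat.values)   # [0 2 1 0]
```

Because both arrays share one set of edges, category `1` means "between 1.0 and 5.0" in both, which is what makes the discretized pair usable in a contingency table.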

#### Binary Discretization

Converts continuous data to binary events using threshold comparison.

```python { .api }
def binary_discretise(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert continuous data to binary events using a threshold.

    Args:
        data: Input continuous data
        threshold: Threshold value(s) for binary conversion
        threshold_operator: Comparison operator (ge, le, gt, lt)

    Returns:
        Binary data {0, 1} where 1 indicates event occurrence

    Formula:
        binary_result = threshold_operator(data, threshold)

    Notes:
        - Default: event = data >= threshold
        - Threshold can be a scalar or an array with compatible dimensions
        - NaN values are preserved as NaN in the output
    """
```
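
The documented semantics (operator-based comparison plus NaN preservation) can be reproduced in a few lines of plain `xarray`; this is a sketch of the assumed behaviour, not the library implementation:

```python
import operator
import numpy as np
import xarray as xr

# Apply the comparison operator, then re-mask NaNs so missing data
# stays missing instead of silently becoming 0/False.
data = xr.DataArray([1.0, 5.0, np.nan, 7.5])
threshold = 5.0
op = operator.ge  # event = data >= threshold

binary = xr.where(data.isnull(), np.nan, op(data, threshold).astype(float))
print(binary.values)  # [ 0.  1. nan  1.]
```

The explicit re-masking matters: a raw `data >= threshold` evaluates to `False` at NaN positions, which would count missing data as non-events.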

#### Proportion-based Binary Discretization

Creates binary events based on data percentiles rather than fixed thresholds.

```python { .api }
def binary_discretise_proportion(
    data: XarrayLike,
    proportion: float,
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert data to binary events using a proportion-based threshold.

    Args:
        data: Input continuous data
        proportion: Proportion of data to classify as events (0-1)
        dims: Dimensions over which to calculate percentiles
        threshold_operator: Comparison operator

    Returns:
        Binary data where the specified proportion are events

    Notes:
        - proportion=0.1 creates events for the top 10% of values
        - Threshold is determined from data percentiles
        - Useful for relative event definitions
    """
```
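
Under the assumed semantics (operator `>=`, proportion `p`), the threshold is just the `1 - p` quantile of the data; a minimal sketch:

```python
import numpy as np
import xarray as xr

# With proportion p and operator >=, the threshold is the (1 - p)
# quantile, so approximately the top p fraction of values become events.
rng = np.random.default_rng(0)
data = xr.DataArray(rng.exponential(3.0, 1000))
p = 0.2

threshold = data.quantile(1 - p)
events = (data >= threshold).astype(int)
print(float(events.mean()))  # close to 0.2 by construction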

#### Proportion Exceeding Threshold

Calculates the fraction of data exceeding a specified threshold.

```python { .api }
def proportion_exceeding(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Calculate the proportion of data exceeding a threshold.

    Args:
        data: Input data
        threshold: Threshold value(s)
        dims: Dimensions over which to calculate the proportion
        threshold_operator: Comparison operator

    Returns:
        Proportion values in [0, 1]

    Notes:
        - Returns the fraction of values satisfying the threshold condition
        - Useful for climate statistics and event frequencies
    """
```
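
Conceptually this is the mean of a boolean comparison; the only subtlety is missing data, since a raw comparison treats NaN as `False`. A hedged one-liner equivalent:

```python
import numpy as np
import xarray as xr

# Fraction of valid (non-NaN) values meeting the condition.
data = xr.DataArray([1.0, 2.0, 6.0, np.nan, 10.0])
threshold = 5.0

prop = (data >= threshold).where(data.notnull()).mean(skipna=True)
print(float(prop))  # 0.5 -> 2 of the 4 valid values exceed 5.0
```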

### Data Matching and Broadcasting

Utilities for aligning and preparing data arrays for scoring.

#### Broadcast and Match NaN

Ensures consistent NaN patterns between forecast and observation arrays.

```python { .api }
def broadcast_and_match_nan(
    *args: XarrayLike,
) -> Tuple[XarrayLike, ...]:
    """
    Broadcast arrays and match NaN patterns.

    Args:
        *args: Variable number of xarray data arrays

    Returns:
        Tuple of arrays with matching dimensions and NaN patterns

    Notes:
        - Broadcasts arrays to a common dimension set
        - Propagates NaN values across all arrays
        - Ensures consistent missing-data handling
        - Required preprocessing for many scoring functions
    """
```
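
The two documented steps — broadcast to a common dimension set, then propagate NaNs — can be sketched with `xr.broadcast` and a shared validity mask (assumed semantics, not the library source):

```python
import numpy as np
import xarray as xr

# Broadcast to a common shape, build the union of missing-data masks,
# and apply it to every array so all NaN patterns match.
a = xr.DataArray([1.0, np.nan, 3.0], dims="x")
b = xr.DataArray([[1.0, 2.0, np.nan]], dims=("y", "x"))

a_b, b_b = xr.broadcast(a, b)
valid = a_b.notnull() & b_b.notnull()  # valid only where ALL inputs are valid
a_m, b_m = a_b.where(valid), b_b.where(valid)

print(int(a_m.isnull().sum()), int(b_m.isnull().sum()))  # 2 2
```

After matching, both arrays are missing at the same positions, so pairwise scores are computed over an identical sample.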

### Statistical Resampling

Bootstrap methods for uncertainty quantification and confidence interval estimation.

#### Block Bootstrap

Block bootstrap resampling for time series data with temporal dependence.

```python { .api }
def block_bootstrap(
    data: xr.DataArray,
    *,
    block_length: int,
    n_samples: int = 1000,
    rng: Optional[Union[int, np.random.Generator]] = None,
    dim: str = "time",
) -> xr.DataArray:
    """
    Generate block bootstrap samples from time series data.

    Args:
        data: Input time series data
        block_length: Length of each bootstrap block
        n_samples: Number of bootstrap samples to generate
        rng: Random number generator or seed
        dim: Name of the time dimension for resampling

    Returns:
        Bootstrap samples with a new 'bootstrap_sample' dimension

    Notes:
        - Preserves temporal autocorrelation within blocks
        - Block length should reflect the data's correlation timescale
        - Suitable for dependent time series data
        - Output shape: original + (n_samples,)
    """
```
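
The core idea — resample contiguous blocks rather than individual points, so within-block autocorrelation survives — can be sketched in plain `numpy` (a generic moving-block bootstrap, not necessarily the library's exact scheme):

```python
import numpy as np

# Moving-block bootstrap: sample random block start positions,
# concatenate the blocks, and trim to the original series length.
rng = np.random.default_rng(42)
series = rng.normal(10.0, 2.0, 365)
block_length, n_samples = 14, 200

n_blocks = int(np.ceil(len(series) / block_length))
samples = np.empty((n_samples, len(series)))
for i in range(n_samples):
    starts = rng.integers(0, len(series) - block_length + 1, n_blocks)
    blocks = [series[s : s + block_length] for s in starts]
    samples[i] = np.concatenate(blocks)[: len(series)]

print(samples.shape)  # (200, 365)
```

Every resampled value comes from the original series; only the block ordering is randomised.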

### Isotonic Regression

Isotonic fitting for reliability diagrams and calibration analysis.

```python { .api }
def isotonic_fit(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> XarrayLike:
    """
    Fit isotonic regression for reliability diagram analysis.

    Args:
        fcst: Forecast probability values in [0, 1]
        obs: Binary observation values {0, 1}
        reduce_dims: Dimensions to reduce
        preserve_dims: Dimensions to preserve

    Returns:
        Isotonic regression fit values

    Notes:
        - Creates a monotonic mapping from forecast value to observed frequency
        - Used for probability forecast calibration assessment
        - Output represents the conditional event frequency given the forecast
    """
```
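
Isotonic regression is classically solved with the pool-adjacent-violators (PAV) algorithm; a compact sketch (illustrative only, not the library's implementation) shows how the monotonic mapping arises:

```python
import numpy as np

# Pool-adjacent-violators: walk the sequence, merging adjacent groups
# whenever their means decrease, until the fitted means are
# non-decreasing. The result is the monotonic least-squares fit.
def pav(y):
    out = []  # stack of [group mean, group size]
    for v in y:
        out.append([v, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, n2 = out.pop()
            m1, n1 = out.pop()
            out.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([[m] * n for m, n in out])

fcst = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
obs = np.array([0.0, 1.0, 0.0, 1.0, 1.0])  # binary outcomes
fit = pav(obs[np.argsort(fcst)])           # sort outcomes by forecast first
print(fit)  # [0.  0.5 0.5 1.  1. ]
```

The fitted values are the conditional event frequencies, with adjacent forecast bins pooled wherever the raw frequencies were non-monotonic.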

### CDF Processing Functions

Specialized functions for manipulating cumulative distribution function (CDF) data.

#### CDF Manipulation

```python { .api }
from scores.processing.cdf import (
    round_values,
    propagate_nan,
    observed_cdf,
    integrate_square_piecewise_linear,
    add_thresholds,
    fill_cdf,
    decreasing_cdfs,
    cdf_envelope,
)

def round_values(
    values: xr.DataArray,
    precision: int = 6,
) -> xr.DataArray:
    """Round values for numerical stability in CDF processing."""

def propagate_nan(
    data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Propagate NaN values consistently through CDF dimensions."""

def observed_cdf(
    obs: xr.DataArray,
    thresholds: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Create an empirical CDF from observation data."""

def integrate_square_piecewise_linear(
    values: xr.DataArray,
    coords: xr.DataArray,
    coord_dim: str,
) -> xr.DataArray:
    """Integrate squared piecewise-linear functions for CRPS calculation."""

def add_thresholds(
    cdf_data: xr.DataArray,
    additional_thresholds: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Add additional threshold points to CDF data."""

def fill_cdf(
    cdf_data: xr.DataArray,
    method: str = "linear",
    threshold_dim: str = "threshold",
) -> xr.DataArray:
    """Fill missing values in a CDF using the specified interpolation method."""

def decreasing_cdfs(
    cdf_data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Detect and correct decreasing CDF values."""

def cdf_envelope(
    cdf_data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Create a CDF envelope for an ensemble of CDFs."""
```
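
Two of these helpers have simple closed forms worth sketching under assumed semantics (not the library internals): `observed_cdf` as a Heaviside step at each observation, and `integrate_square_piecewise_linear`, where a segment with endpoint values `a`, `b` and width `dx` contributes exactly `dx * (a*a + a*b + b*b) / 3`:

```python
import numpy as np
import xarray as xr

thresholds = np.linspace(0.0, 10.0, 11)
obs = xr.DataArray([2.0, 7.0], dims="sample")

# observed_cdf-style step function: 1 where threshold >= observation
step_cdf = (xr.DataArray(thresholds, dims="threshold") >= obs).astype(float)
print(step_cdf.isel(sample=0).values)  # steps from 0 to 1 at threshold 2

def integrate_square_piecewise_linear(values, coords):
    # Exact integral of f(x)^2 for piecewise-linear f, segment by segment
    a, b = values[:-1], values[1:]
    dx = np.diff(coords)
    return float(np.sum(dx * (a * a + a * b + b * b) / 3.0))

# Sanity check: f(x) = x on [0, 1] gives the integral of x^2, i.e. 1/3
print(integrate_square_piecewise_linear(np.array([0.0, 1.0]),
                                        np.array([0.0, 1.0])))
```

This pairing is the building block of threshold-based CRPS: the score integrates the squared difference between the forecast CDF and the observed step CDF.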

## Usage Patterns

### Basic Data Preprocessing

```python
from scores.processing import (
    binary_discretise,
    comparative_discretise,
    broadcast_and_match_nan,
)
import xarray as xr
import numpy as np

# Create sample continuous data
forecast = xr.DataArray(np.random.exponential(3, 1000))
observation = xr.DataArray(np.random.exponential(3, 1000))

# Convert to binary events (>= 5.0)
fcst_binary = binary_discretise(forecast, 5.0)
obs_binary = binary_discretise(observation, 5.0)

print(f"Event frequency in forecast: {fcst_binary.mean().values:.3f}")
print(f"Event frequency in observations: {obs_binary.mean().values:.3f}")

# Create multi-category data
bin_edges = [0, 1, 5, 10, 20, 100]
fcst_categorical, obs_categorical = comparative_discretise(
    forecast, observation, bin_edges=bin_edges
)

print(f"Categorical forecast range: {fcst_categorical.min().values} to {fcst_categorical.max().values}")
```

### Proportion-based Event Definition

```python
# Define events based on the data distribution
from scores.processing import binary_discretise_proportion, proportion_exceeding

# Top 20% of values as events
fcst_top20 = binary_discretise_proportion(forecast, 0.2)
obs_top20 = binary_discretise_proportion(observation, 0.2)

# Verify the proportion
print(f"Forecast top 20% frequency: {fcst_top20.mean().values:.3f}")
print(f"Observation top 20% frequency: {obs_top20.mean().values:.3f}")

# Calculate the proportions exceeding various thresholds
thresholds = [1, 2, 5, 10, 15]
for threshold in thresholds:
    fcst_prop = proportion_exceeding(forecast, threshold)
    obs_prop = proportion_exceeding(observation, threshold)
    print(f"Threshold {threshold:2d}: fcst={fcst_prop.values:.3f}, obs={obs_prop.values:.3f}")
```

### Data Alignment and Broadcasting

```python
# Handle data with differing NaN patterns
forecast_with_nan = forecast.copy()
observation_with_nan = observation.copy()

# Introduce different NaN patterns
forecast_with_nan[::10] = np.nan      # Every 10th value
observation_with_nan[5::15] = np.nan  # A different pattern

# Align and match NaN patterns
aligned_fcst, aligned_obs = broadcast_and_match_nan(
    forecast_with_nan, observation_with_nan
)

print(f"Original forecast NaN count: {forecast_with_nan.isnull().sum().values}")
print(f"Original observation NaN count: {observation_with_nan.isnull().sum().values}")
print(f"Aligned data NaN count: {aligned_fcst.isnull().sum().values}")
```

### Bootstrap Uncertainty Estimation

```python
import pandas as pd

from scores.processing import block_bootstrap
from scores.continuous import mse

# Create time series data
time_index = pd.date_range("2023-01-01", periods=365, freq="D")
forecast_ts = xr.DataArray(
    np.random.normal(10, 2, 365) + 3 * np.sin(np.arange(365) * 2 * np.pi / 365),
    dims=["time"],
    coords={"time": time_index},
)
observation_ts = xr.DataArray(
    forecast_ts.values + np.random.normal(0, 1, 365),
    dims=["time"],
    coords={"time": time_index},
)

# Generate bootstrap samples
bootstrap_samples = block_bootstrap(
    forecast_ts,
    block_length=14,  # 2-week blocks
    n_samples=1000,
    dim="time",
)

# Calculate MSE for each bootstrap sample
mse_bootstrap = []
for i in range(1000):
    boot_fcst = bootstrap_samples.isel(bootstrap_sample=i)
    boot_obs = observation_ts.sel(time=boot_fcst["time"])  # Match by time label
    mse_val = mse(boot_fcst, boot_obs)
    mse_bootstrap.append(mse_val.values)

mse_bootstrap = np.array(mse_bootstrap)
print(f"Bootstrap MSE: {np.mean(mse_bootstrap):.3f} ± {np.std(mse_bootstrap):.3f}")
print(f"95% CI: [{np.percentile(mse_bootstrap, 2.5):.3f}, {np.percentile(mse_bootstrap, 97.5):.3f}]")
```

### Probability Calibration Analysis

```python
from scores.processing import isotonic_fit

# Create probability forecasts and binary observations
prob_forecast = xr.DataArray(np.random.beta(2, 2, 500))
# Generate observations with some relationship to the forecast
binary_obs = xr.DataArray(
    np.random.binomial(1, 0.3 + 0.4 * prob_forecast.values)
)

# Fit isotonic regression for reliability analysis
reliability_fit = isotonic_fit(prob_forecast, binary_obs)

print(f"Original forecast range: [{prob_forecast.min().values:.3f}, {prob_forecast.max().values:.3f}]")
print(f"Reliability fit range: [{reliability_fit.min().values:.3f}, {reliability_fit.max().values:.3f}]")

# Perfect reliability would have reliability_fit ≈ prob_forecast
reliability_bias = (reliability_fit - prob_forecast).mean()
print(f"Mean reliability bias: {reliability_bias.values:.3f}")
```
443
444
### Multi-dimensional Processing
445
446
```python
447
# Process multi-dimensional data
448
forecast_3d = xr.DataArray(
449
np.random.exponential(2, (50, 20, 30)), # time, lat, lon
450
dims=["time", "lat", "lon"]
451
)
452
observation_3d = xr.DataArray(
453
np.random.exponential(2, (50, 20, 30)),
454
dims=["time", "lat", "lon"]
455
)
456
457
# Binary discretization preserves all dimensions
458
fcst_events_3d = binary_discretise(forecast_3d, 5.0)
459
obs_events_3d = binary_discretise(observation_3d, 5.0)
460
461
# Calculate event frequencies over different dimensions
462
temporal_frequency = fcst_events_3d.mean(dim="time") # Spatial event frequency
463
spatial_frequency = fcst_events_3d.mean(dim=["lat", "lon"]) # Temporal event frequency
464
overall_frequency = fcst_events_3d.mean() # Overall frequency
465
466
print(f"Spatial frequency range: {temporal_frequency.min().values:.3f} to {temporal_frequency.max().values:.3f}")
467
print(f"Overall event frequency: {overall_frequency.values:.3f}")
468
469
# Proportion-based events calculated over specific dimensions
470
# Top 10% events at each grid point over time
471
spatial_top10 = binary_discretise_proportion(
472
forecast_3d, 0.1, dims="time"
473
)
474
print(f"Spatial top 10% events shape: {spatial_top10.shape}")
475
```

### CDF Data Processing

```python
from scores.processing.cdf import fill_cdf, add_thresholds, observed_cdf

# Create sample CDF data with missing values
thresholds = np.linspace(0, 20, 21)
cdf_values = xr.DataArray(
    np.cumsum(np.random.exponential(0.1, (100, 21)), axis=1),
    dims=["sample", "threshold"],
    coords={"threshold": thresholds},
)
# Normalize to [0, 1]
cdf_values = cdf_values / cdf_values.isel(threshold=-1, drop=True)

# Introduce missing values at roughly 5% of points
mask = xr.DataArray(
    np.random.random(cdf_values.shape) > 0.05, dims=cdf_values.dims
)
cdf_values = cdf_values.where(mask)

# Fill missing values
filled_cdf = fill_cdf(cdf_values, method="linear", threshold_dim="threshold")

# Add additional threshold points
additional_thresh = xr.DataArray([2.5, 7.5, 12.5, 17.5])
extended_cdf = add_thresholds(filled_cdf, additional_thresh, "threshold")

print(f"Original CDF shape: {cdf_values.shape}")
print(f"Extended CDF shape: {extended_cdf.shape}")
print(f"Missing values before: {cdf_values.isnull().sum().values}")
print(f"Missing values after: {extended_cdf.isnull().sum().values}")
```