# Processing Tools

Data preprocessing, discretization, bootstrapping, and CDF manipulation tools required for forecast verification workflows. These utilities prepare data for scoring functions and provide statistical resampling capabilities.

## Capabilities

### Data Discretization

Tools for converting continuous data to the categorical or binary formats required for categorical scoring.

#### Comparative Discretization

Discretizes both forecast and observation data using common bin edges.

```python { .api }
def comparative_discretise(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    bin_edges: Sequence[float],
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> Tuple[XarrayLike, XarrayLike]:
    """
    Discretize forecast and observation data using common bin edges.

    Args:
        fcst: Continuous forecast data
        obs: Continuous observation data
        bin_edges: Bin edge values for discretization
        reduce_dims: Dimensions to reduce during binning
        preserve_dims: Dimensions to preserve

    Returns:
        Tuple of (discretized_forecast, discretized_observation)

    Notes:
        - Creates consistent categorical bins for both datasets
        - Bin edges define category boundaries
        - Returns integer category labels starting from 0
        - Values outside the bin range are assigned to the nearest bin
    """
```
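
The common-bin-edge behaviour can be sketched with plain `numpy`/`xarray`, without assuming the library internals. Here `np.digitize` stands in for the binning step (note this sketch labels out-of-range values with an extra category rather than snapping to the nearest bin):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for comparative_discretise: both arrays are
# binned against the SAME edges, so category labels are directly
# comparable between forecast and observation.
bin_edges = [0.0, 1.0, 5.0, 10.0]
fcst = xr.DataArray([0.3, 4.2, 9.9, 12.0])
obs = xr.DataArray([0.9, 6.0, 2.0, 0.1])

fcst_cat = xr.DataArray(np.digitize(fcst, bin_edges) - 1, dims=fcst.dims)
obs_cat = xr.DataArray(np.digitize(obs, bin_edges) - 1, dims=obs.dims)

print(fcst_cat.values)  # [0 1 2 3] -> 12.0 falls above the last edge
print(obs_cat.values)   # [0 2 1 0]
```

Because both arrays share one set of edges, category `1` means "between 1.0 and 5.0" in both, which is what makes the discretized pair usable in a contingency table.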

#### Binary Discretization

Converts continuous data to binary events using threshold comparison.

```python { .api }
def binary_discretise(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert continuous data to binary events using a threshold.

    Args:
        data: Input continuous data
        threshold: Threshold value(s) for binary conversion
        threshold_operator: Comparison operator (ge, le, gt, lt)

    Returns:
        Binary data {0, 1} where 1 indicates event occurrence

    Formula:
        binary_result = threshold_operator(data, threshold)

    Notes:
        - Default: event = data >= threshold
        - Threshold can be a scalar or an array with compatible dimensions
        - NaN values are preserved as NaN in the output
    """
```
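
The documented semantics (operator-based comparison plus NaN preservation) can be reproduced in a few lines of plain `xarray`; this is a sketch of the assumed behaviour, not the library implementation:

```python
import operator
import numpy as np
import xarray as xr

# Apply the comparison operator, then re-mask NaNs so missing data
# stays missing instead of silently becoming 0/False.
data = xr.DataArray([1.0, 5.0, np.nan, 7.5])
threshold = 5.0
op = operator.ge  # event = data >= threshold

binary = xr.where(data.isnull(), np.nan, op(data, threshold).astype(float))
print(binary.values)  # [ 0.  1. nan  1.]
```

The explicit re-masking matters: a raw `data >= threshold` evaluates to `False` at NaN positions, which would count missing data as non-events.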

#### Proportion-based Binary Discretization

Creates binary events based on data percentiles rather than fixed thresholds.

```python { .api }
def binary_discretise_proportion(
    data: XarrayLike,
    proportion: float,
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Convert data to binary events using a proportion-based threshold.

    Args:
        data: Input continuous data
        proportion: Proportion of data to classify as events (0-1)
        dims: Dimensions over which to calculate percentiles
        threshold_operator: Comparison operator

    Returns:
        Binary data where the specified proportion are events

    Notes:
        - proportion=0.1 creates events for the top 10% of values
        - Threshold is determined from data percentiles
        - Useful for relative event definitions
    """
```
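
Under the assumed semantics (operator `>=`, proportion `p`), the threshold is just the `1 - p` quantile of the data; a minimal sketch:

```python
import numpy as np
import xarray as xr

# With proportion p and operator >=, the threshold is the (1 - p)
# quantile, so approximately the top p fraction of values become events.
rng = np.random.default_rng(0)
data = xr.DataArray(rng.exponential(3.0, 1000))
p = 0.2

threshold = data.quantile(1 - p)
events = (data >= threshold).astype(int)
print(float(events.mean()))  # close to 0.2 by construction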

#### Proportion Exceeding Threshold

Calculates the fraction of data exceeding a specified threshold.

```python { .api }
def proportion_exceeding(
    data: XarrayLike,
    threshold: Union[float, int, XarrayLike],
    *,
    dims: Optional[FlexibleDimensionTypes] = None,
    threshold_operator: Callable = operator.ge,
) -> XarrayLike:
    """
    Calculate the proportion of data exceeding a threshold.

    Args:
        data: Input data
        threshold: Threshold value(s)
        dims: Dimensions over which to calculate the proportion
        threshold_operator: Comparison operator

    Returns:
        Proportion values in [0, 1]

    Notes:
        - Returns the fraction of values satisfying the threshold condition
        - Useful for climate statistics and event frequencies
    """
```
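
Conceptually this is the mean of a boolean comparison; the only subtlety is missing data, since a raw comparison treats NaN as `False`. A hedged one-liner equivalent:

```python
import numpy as np
import xarray as xr

# Fraction of valid (non-NaN) values meeting the condition.
data = xr.DataArray([1.0, 2.0, 6.0, np.nan, 10.0])
threshold = 5.0

prop = (data >= threshold).where(data.notnull()).mean(skipna=True)
print(float(prop))  # 0.5 -> 2 of the 4 valid values exceed 5.0
```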

### Data Matching and Broadcasting

Utilities for aligning and preparing data arrays for scoring.

#### Broadcast and Match NaN

Ensures consistent NaN patterns between forecast and observation arrays.

```python { .api }
def broadcast_and_match_nan(
    *args: XarrayLike,
) -> Tuple[XarrayLike, ...]:
    """
    Broadcast arrays and match NaN patterns.

    Args:
        *args: Variable number of xarray data arrays

    Returns:
        Tuple of arrays with matching dimensions and NaN patterns

    Notes:
        - Broadcasts arrays to a common dimension set
        - Propagates NaN values across all arrays
        - Ensures consistent missing-data handling
        - Required preprocessing for many scoring functions
    """
```
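
The two documented steps — broadcast to a common dimension set, then propagate NaNs — can be sketched with `xr.broadcast` and a shared validity mask (assumed semantics, not the library source):

```python
import numpy as np
import xarray as xr

# Broadcast to a common shape, build the union of missing-data masks,
# and apply it to every array so all NaN patterns match.
a = xr.DataArray([1.0, np.nan, 3.0], dims="x")
b = xr.DataArray([[1.0, 2.0, np.nan]], dims=("y", "x"))

a_b, b_b = xr.broadcast(a, b)
valid = a_b.notnull() & b_b.notnull()  # valid only where ALL inputs are valid
a_m, b_m = a_b.where(valid), b_b.where(valid)

print(int(a_m.isnull().sum()), int(b_m.isnull().sum()))  # 2 2
```

After matching, both arrays are missing at the same positions, so pairwise scores are computed over an identical sample.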

### Statistical Resampling

Bootstrap methods for uncertainty quantification and confidence interval estimation.

#### Block Bootstrap

Block bootstrap resampling for time series data with temporal dependence.

```python { .api }
def block_bootstrap(
    data: xr.DataArray,
    *,
    block_length: int,
    n_samples: int = 1000,
    rng: Optional[Union[int, np.random.Generator]] = None,
    dim: str = "time",
) -> xr.DataArray:
    """
    Generate block bootstrap samples from time series data.

    Args:
        data: Input time series data
        block_length: Length of each bootstrap block
        n_samples: Number of bootstrap samples to generate
        rng: Random number generator or seed
        dim: Name of the time dimension for resampling

    Returns:
        Bootstrap samples with a new 'bootstrap_sample' dimension

    Notes:
        - Preserves temporal autocorrelation within blocks
        - Block length should reflect the data's correlation timescale
        - Suitable for dependent time series data
        - Output shape: original + (n_samples,)
    """
```
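
The core idea — resample contiguous blocks rather than individual points, so within-block autocorrelation survives — can be sketched in plain `numpy` (a generic moving-block bootstrap, not necessarily the library's exact scheme):

```python
import numpy as np

# Moving-block bootstrap: sample random block start positions,
# concatenate the blocks, and trim to the original series length.
rng = np.random.default_rng(42)
series = rng.normal(10.0, 2.0, 365)
block_length, n_samples = 14, 200

n_blocks = int(np.ceil(len(series) / block_length))
samples = np.empty((n_samples, len(series)))
for i in range(n_samples):
    starts = rng.integers(0, len(series) - block_length + 1, n_blocks)
    blocks = [series[s : s + block_length] for s in starts]
    samples[i] = np.concatenate(blocks)[: len(series)]

print(samples.shape)  # (200, 365)
```

Every resampled value comes from the original series; only the block ordering is randomised.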

### Isotonic Regression

Isotonic fitting for reliability diagrams and calibration analysis.

```python { .api }
def isotonic_fit(
    fcst: XarrayLike,
    obs: XarrayLike,
    *,
    reduce_dims: Optional[FlexibleDimensionTypes] = None,
    preserve_dims: Optional[FlexibleDimensionTypes] = None,
) -> XarrayLike:
    """
    Fit isotonic regression for reliability diagram analysis.

    Args:
        fcst: Forecast probability values in [0, 1]
        obs: Binary observation values {0, 1}
        reduce_dims: Dimensions to reduce
        preserve_dims: Dimensions to preserve

    Returns:
        Isotonic regression fit values

    Notes:
        - Creates a monotonic mapping from forecast value to observed frequency
        - Used for probability forecast calibration assessment
        - Output represents the conditional event frequency given the forecast
    """
```
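
Isotonic regression is classically solved with the pool-adjacent-violators (PAV) algorithm; a compact sketch (illustrative only, not the library's implementation) shows how the monotonic mapping arises:

```python
import numpy as np

# Pool-adjacent-violators: walk the sequence, merging adjacent groups
# whenever their means decrease, until the fitted means are
# non-decreasing. The result is the monotonic least-squares fit.
def pav(y):
    out = []  # stack of [group mean, group size]
    for v in y:
        out.append([v, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, n2 = out.pop()
            m1, n1 = out.pop()
            out.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([[m] * n for m, n in out])

fcst = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
obs = np.array([0.0, 1.0, 0.0, 1.0, 1.0])  # binary outcomes
fit = pav(obs[np.argsort(fcst)])           # sort outcomes by forecast first
print(fit)  # [0.  0.5 0.5 1.  1. ]
```

The fitted values are the conditional event frequencies, with adjacent forecast bins pooled wherever the raw frequencies were non-monotonic.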

### CDF Processing Functions

Specialized functions for manipulating cumulative distribution function (CDF) data.

#### CDF Manipulation

```python { .api }
from scores.processing.cdf import (
    round_values,
    propagate_nan,
    observed_cdf,
    integrate_square_piecewise_linear,
    add_thresholds,
    fill_cdf,
    decreasing_cdfs,
    cdf_envelope,
)

def round_values(
    values: xr.DataArray,
    precision: int = 6,
) -> xr.DataArray:
    """Round values for numerical stability in CDF processing."""

def propagate_nan(
    data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Propagate NaN values consistently through CDF dimensions."""

def observed_cdf(
    obs: xr.DataArray,
    thresholds: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Create an empirical CDF from observation data."""

def integrate_square_piecewise_linear(
    values: xr.DataArray,
    coords: xr.DataArray,
    coord_dim: str,
) -> xr.DataArray:
    """Integrate squared piecewise-linear functions for CRPS calculation."""

def add_thresholds(
    cdf_data: xr.DataArray,
    additional_thresholds: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Add additional threshold points to CDF data."""

def fill_cdf(
    cdf_data: xr.DataArray,
    method: str = "linear",
    threshold_dim: str = "threshold",
) -> xr.DataArray:
    """Fill missing values in a CDF using the specified interpolation method."""

def decreasing_cdfs(
    cdf_data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Detect and correct decreasing CDF values."""

def cdf_envelope(
    cdf_data: xr.DataArray,
    threshold_dim: str,
) -> xr.DataArray:
    """Create a CDF envelope for an ensemble of CDFs."""
```
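
Two of these helpers have simple closed forms worth sketching under assumed semantics (not the library internals): `observed_cdf` as a Heaviside step at each observation, and `integrate_square_piecewise_linear`, where a segment with endpoint values `a`, `b` and width `dx` contributes exactly `dx * (a*a + a*b + b*b) / 3`:

```python
import numpy as np
import xarray as xr

thresholds = np.linspace(0.0, 10.0, 11)
obs = xr.DataArray([2.0, 7.0], dims="sample")

# observed_cdf-style step function: 1 where threshold >= observation
step_cdf = (xr.DataArray(thresholds, dims="threshold") >= obs).astype(float)
print(step_cdf.isel(sample=0).values)  # steps from 0 to 1 at threshold 2

def integrate_square_piecewise_linear(values, coords):
    # Exact integral of f(x)^2 for piecewise-linear f, segment by segment
    a, b = values[:-1], values[1:]
    dx = np.diff(coords)
    return float(np.sum(dx * (a * a + a * b + b * b) / 3.0))

# Sanity check: f(x) = x on [0, 1] gives the integral of x^2, i.e. 1/3
print(integrate_square_piecewise_linear(np.array([0.0, 1.0]),
                                        np.array([0.0, 1.0])))
```

This pairing is the building block of threshold-based CRPS: the score integrates the squared difference between the forecast CDF and the observed step CDF.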

## Usage Patterns

### Basic Data Preprocessing

```python
from scores.processing import (
    binary_discretise,
    comparative_discretise,
    broadcast_and_match_nan,
)
import xarray as xr
import numpy as np

# Create sample continuous data
forecast = xr.DataArray(np.random.exponential(3, 1000))
observation = xr.DataArray(np.random.exponential(3, 1000))

# Convert to binary events (>= 5.0)
fcst_binary = binary_discretise(forecast, 5.0)
obs_binary = binary_discretise(observation, 5.0)

print(f"Event frequency in forecast: {fcst_binary.mean().values:.3f}")
print(f"Event frequency in observations: {obs_binary.mean().values:.3f}")

# Create multi-category data
bin_edges = [0, 1, 5, 10, 20, 100]
fcst_categorical, obs_categorical = comparative_discretise(
    forecast, observation, bin_edges=bin_edges
)

print(f"Categorical forecast range: {fcst_categorical.min().values} to {fcst_categorical.max().values}")
```

### Proportion-based Event Definition

```python
# Define events based on the data distribution
from scores.processing import binary_discretise_proportion, proportion_exceeding

# Top 20% of values as events
fcst_top20 = binary_discretise_proportion(forecast, 0.2)
obs_top20 = binary_discretise_proportion(observation, 0.2)

# Verify the proportion
print(f"Forecast top 20% frequency: {fcst_top20.mean().values:.3f}")
print(f"Observation top 20% frequency: {obs_top20.mean().values:.3f}")

# Calculate the proportions exceeding various thresholds
thresholds = [1, 2, 5, 10, 15]
for threshold in thresholds:
    fcst_prop = proportion_exceeding(forecast, threshold)
    obs_prop = proportion_exceeding(observation, threshold)
    print(f"Threshold {threshold:2d}: fcst={fcst_prop.values:.3f}, obs={obs_prop.values:.3f}")
```

### Data Alignment and Broadcasting

```python
# Handle data with differing NaN patterns
forecast_with_nan = forecast.copy()
observation_with_nan = observation.copy()

# Introduce different NaN patterns
forecast_with_nan[::10] = np.nan      # Every 10th value
observation_with_nan[5::15] = np.nan  # A different pattern

# Align and match NaN patterns
aligned_fcst, aligned_obs = broadcast_and_match_nan(
    forecast_with_nan, observation_with_nan
)

print(f"Original forecast NaN count: {forecast_with_nan.isnull().sum().values}")
print(f"Original observation NaN count: {observation_with_nan.isnull().sum().values}")
print(f"Aligned data NaN count: {aligned_fcst.isnull().sum().values}")
```

### Bootstrap Uncertainty Estimation

```python
import pandas as pd

from scores.processing import block_bootstrap
from scores.continuous import mse

# Create time series data
time_index = pd.date_range("2023-01-01", periods=365, freq="D")
forecast_ts = xr.DataArray(
    np.random.normal(10, 2, 365) + 3 * np.sin(np.arange(365) * 2 * np.pi / 365),
    dims=["time"],
    coords={"time": time_index},
)
observation_ts = xr.DataArray(
    forecast_ts.values + np.random.normal(0, 1, 365),
    dims=["time"],
    coords={"time": time_index},
)

# Generate bootstrap samples
bootstrap_samples = block_bootstrap(
    forecast_ts,
    block_length=14,  # 2-week blocks
    n_samples=1000,
    dim="time",
)

# Calculate MSE for each bootstrap sample
mse_bootstrap = []
for i in range(1000):
    boot_fcst = bootstrap_samples.isel(bootstrap_sample=i)
    boot_obs = observation_ts.sel(time=boot_fcst["time"])  # Match by time label
    mse_val = mse(boot_fcst, boot_obs)
    mse_bootstrap.append(mse_val.values)

mse_bootstrap = np.array(mse_bootstrap)
print(f"Bootstrap MSE: {np.mean(mse_bootstrap):.3f} ± {np.std(mse_bootstrap):.3f}")
print(f"95% CI: [{np.percentile(mse_bootstrap, 2.5):.3f}, {np.percentile(mse_bootstrap, 97.5):.3f}]")
```

### Probability Calibration Analysis

```python
from scores.processing import isotonic_fit

# Create probability forecasts and binary observations
prob_forecast = xr.DataArray(np.random.beta(2, 2, 500))
# Generate observations with some relationship to the forecast
binary_obs = xr.DataArray(
    np.random.binomial(1, 0.3 + 0.4 * prob_forecast.values)
)

# Fit isotonic regression for reliability analysis
reliability_fit = isotonic_fit(prob_forecast, binary_obs)

print(f"Original forecast range: [{prob_forecast.min().values:.3f}, {prob_forecast.max().values:.3f}]")
print(f"Reliability fit range: [{reliability_fit.min().values:.3f}, {reliability_fit.max().values:.3f}]")

# Perfect reliability would have reliability_fit ≈ prob_forecast
reliability_bias = (reliability_fit - prob_forecast).mean()
print(f"Mean reliability bias: {reliability_bias.values:.3f}")
```
443
444
### Multi-dimensional Processing
445
446
```python
447
# Process multi-dimensional data
448
forecast_3d = xr.DataArray(
449
np.random.exponential(2, (50, 20, 30)), # time, lat, lon
450
dims=["time", "lat", "lon"]
451
)
452
observation_3d = xr.DataArray(
453
np.random.exponential(2, (50, 20, 30)),
454
dims=["time", "lat", "lon"]
455
)
456
457
# Binary discretization preserves all dimensions
458
fcst_events_3d = binary_discretise(forecast_3d, 5.0)
459
obs_events_3d = binary_discretise(observation_3d, 5.0)
460
461
# Calculate event frequencies over different dimensions
462
temporal_frequency = fcst_events_3d.mean(dim="time") # Spatial event frequency
463
spatial_frequency = fcst_events_3d.mean(dim=["lat", "lon"]) # Temporal event frequency
464
overall_frequency = fcst_events_3d.mean() # Overall frequency
465
466
print(f"Spatial frequency range: {temporal_frequency.min().values:.3f} to {temporal_frequency.max().values:.3f}")
467
print(f"Overall event frequency: {overall_frequency.values:.3f}")
468
469
# Proportion-based events calculated over specific dimensions
470
# Top 10% events at each grid point over time
471
spatial_top10 = binary_discretise_proportion(
472
forecast_3d, 0.1, dims="time"
473
)
474
print(f"Spatial top 10% events shape: {spatial_top10.shape}")
475
```

### CDF Data Processing

```python
from scores.processing.cdf import fill_cdf, add_thresholds, observed_cdf

# Create sample CDF data with missing values
thresholds = np.linspace(0, 20, 21)
cdf_values = xr.DataArray(
    np.cumsum(np.random.exponential(0.1, (100, 21)), axis=1),
    dims=["sample", "threshold"],
    coords={"threshold": thresholds},
)
# Normalize to [0, 1]
cdf_values = cdf_values / cdf_values.isel(threshold=-1, drop=True)

# Introduce missing values at roughly 5% of points
mask = xr.DataArray(
    np.random.random(cdf_values.shape) > 0.05, dims=cdf_values.dims
)
cdf_values = cdf_values.where(mask)

# Fill missing values
filled_cdf = fill_cdf(cdf_values, method="linear", threshold_dim="threshold")

# Add additional threshold points
additional_thresh = xr.DataArray([2.5, 7.5, 12.5, 17.5])
extended_cdf = add_thresholds(filled_cdf, additional_thresh, "threshold")

print(f"Original CDF shape: {cdf_values.shape}")
print(f"Extended CDF shape: {extended_cdf.shape}")
print(f"Missing values before: {cdf_values.isnull().sum().values}")
print(f"Missing values after: {extended_cdf.isnull().sum().values}")
```