# Sample Data

Built-in sample data generation for tutorials, testing, and experimentation with various data formats and structures. Provides realistic synthetic data that demonstrates the features and capabilities of the scores package.

## Capabilities

### Simple Data Generation

Basic one-dimensional data arrays for quick testing and tutorials.

#### Simple Forecast Data
```python { .api }
def simple_forecast() -> xr.DataArray:
    """
    Generate a simple series of prediction values for tutorials.

    Returns:
        DataArray with simple forecast values

    Characteristics:
        - Single-dimension array
        - Realistic forecast values
        - No missing data
        - Suitable for basic scoring function demos
    """
```
#### Simple Observation Data

```python { .api }
def simple_observations() -> xr.DataArray:
    """
    Generate a simple series of observation values for tutorials.

    Returns:
        DataArray with simple observation values

    Characteristics:
        - Matches the simple_forecast() structure
        - Corresponding observation values
        - Suitable for basic verification examples
    """
```
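Because the two helpers are designed to be used as a pair, it can be worth asserting that their outputs line up before scoring. A minimal sketch of such a check, using plain NumPy stand-ins for the helpers' output (the `check_pair` function below is illustrative, not part of the package):

```python
import numpy as np

def check_pair(fcst, obs):
    """Return a truthy value if forecast and observation arrays line up for scoring."""
    fcst = np.asarray(fcst)
    obs = np.asarray(obs)
    same_shape = fcst.shape == obs.shape          # identical dimensions
    no_gaps = not (np.isnan(fcst).any() or np.isnan(obs).any())  # no missing data
    return same_shape and no_gaps

# Stand-ins with the same one-dimensional, gap-free structure the helpers return
fcst = np.array([10.2, 11.0, 9.8, 10.5])
obs = np.array([10.0, 11.3, 9.5, 10.4])
print(check_pair(fcst, obs))  # True for matching, gap-free arrays
```

The same shape-and-gaps check applies unchanged to the real `simple_forecast()` / `simple_observations()` output, since `xr.DataArray` supports `np.asarray` and `np.isnan`.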
### Pandas Data Generation

Simple data generation for pandas-based workflows.

#### Pandas Forecast Series

```python { .api }
def simple_forecast_pandas() -> pd.Series:
    """
    Generate a simple pandas Series of prediction values.

    Returns:
        Pandas Series with forecast values

    Notes:
        - Pandas Series format instead of xarray
        - Compatible with pandas-specific scoring functions
        - Useful for traditional pandas workflows
    """
```
#### Pandas Observation Series

```python { .api }
def simple_observations_pandas() -> pd.Series:
    """
    Generate a simple pandas Series of observation values.

    Returns:
        Pandas Series with observation values

    Notes:
        - Matches the simple_forecast_pandas() structure
        - Corresponding observation data
        - Suitable for pandas-based verification
    """
```
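One property worth remembering with the pandas variants: pandas arithmetic aligns on the index, so mismatched indexes silently produce NaNs in the error series. A quick sanity check on illustrative stand-ins (plain float Series shaped like the helpers' output, not the package's actual values):

```python
import pandas as pd

# Stand-ins shaped like the helpers' output: float Series on a shared index
forecast_pd = pd.Series([10.2, 11.0, 9.8], index=[0, 1, 2])
observations_pd = pd.Series([10.0, 11.3, 9.5], index=[0, 1, 2])

# Subtraction aligns on the index; mismatched indexes would yield NaNs here
errors = forecast_pd - observations_pd
print(int(errors.isna().sum()))  # 0 when the indexes line up
```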
### Multi-dimensional Continuous Data

Realistic multi-dimensional arrays for comprehensive testing of scoring functions.

#### Continuous Observations

```python { .api }
def continuous_observations(*, large_size: bool = False) -> xr.DataArray:
    """
    Create a continuous observation array with synthetic data.

    Args:
        large_size: Generate a larger dataset for performance testing

    Returns:
        Multi-dimensional DataArray with synthetic observation data

    Dimensions:
        - time: Temporal dimension with regular intervals
        - station: Spatial stations (if multi-dimensional)
        - Additional dimensions based on configuration

    Characteristics:
        - Realistic temporal and spatial patterns
        - Seasonal cycles and trends
        - Missing data patterns
        - Labeled coordinates with metadata
    """
```
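Because the continuous observations deliberately include missing-data patterns, most workflows start by measuring how much of the array is usable. A sketch of that check on a small NumPy stand-in (the real helper returns an `xr.DataArray`, which supports the same `np.isnan` treatment):

```python
import numpy as np

# Stand-in: a small time x station grid with a few gaps, mimicking the helper
obs = np.array([[10.0, np.nan, 9.5],
                [11.2, 10.8, np.nan],
                [9.9, 10.1, 10.4]])

n_missing = int(np.isnan(obs).sum())        # count the gaps
frac_valid = 1.0 - n_missing / obs.size     # fraction of usable values
print(n_missing, round(frac_valid, 2))      # 2 gaps -> 0.78 of the grid usable
```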
#### Continuous Forecast Data

```python { .api }
def continuous_forecast(*, large_size: bool = False, lead_days: bool = False) -> xr.DataArray:
    """
    Create a continuous forecast array with synthetic data.

    Args:
        large_size: Generate a larger dataset for performance testing
        lead_days: Include a lead time dimension for forecast horizons

    Returns:
        Multi-dimensional DataArray with synthetic forecast data

    Dimensions:
        - time: Valid time dimension
        - station: Spatial stations (if multi-dimensional)
        - lead_time: Forecast lead times (if lead_days=True)

    Characteristics:
        - Corresponds to the continuous_observations() structure
        - Realistic forecast errors and biases
        - Lead time dependencies (if enabled)
        - Ensemble-like variations
    """
```
### CDF/Probability Data

Specialized data for probabilistic verification and CDF-based scoring.

#### CDF Forecast Data

```python { .api }
def cdf_forecast(*, lead_days: bool = False) -> xr.DataArray:
    """
    Create a forecast array with a CDF at each point.

    Args:
        lead_days: Include a lead time dimension

    Returns:
        DataArray with CDF forecast values in [0, 1]

    Dimensions:
        - time: Valid time dimension
        - threshold: CDF threshold values
        - lead_time: Forecast lead times (if lead_days=True)

    Characteristics:
        - Monotonically increasing CDFs
        - Realistic probability distributions
        - Multiple threshold levels
        - Suitable for CRPS calculations
    """
```
#### CDF Observation Data

```python { .api }
def cdf_observations() -> xr.DataArray:
    """
    Create an observation array compatible with CDF forecasts.

    Returns:
        DataArray with observation values

    Characteristics:
        - Compatible with cdf_forecast() output
        - Continuous values for CDF evaluation
        - Matching temporal structure
        - Realistic value ranges
    """
```
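A CDF forecast is only well-formed if each curve is non-decreasing along the threshold dimension and stays within [0, 1]. A sketch of that validity check on a NumPy stand-in (the real `cdf_forecast()` output would be checked the same way along its `threshold` dimension):

```python
import numpy as np

# Stand-in: two CDF curves over five thresholds (rows = time, cols = threshold)
cdf = np.array([[0.0, 0.2, 0.5, 0.8, 1.0],
                [0.1, 0.3, 0.6, 0.9, 1.0]])

# Non-decreasing along the threshold axis, and bounded in [0, 1]
monotone = bool(np.all(np.diff(cdf, axis=-1) >= 0))
bounded = bool(cdf.min() >= 0.0 and cdf.max() <= 1.0)
print(monotone and bounded)  # True for a well-formed CDF forecast
```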
## Usage Patterns

### Basic Tutorial Examples

```python
from scores.sample_data import simple_forecast, simple_observations
from scores.continuous import mse, rmse, mae

# Generate simple tutorial data
forecast = simple_forecast()
observations = simple_observations()

print(f"Forecast shape: {forecast.shape}")
print(f"Forecast range: [{forecast.min().values:.2f}, {forecast.max().values:.2f}]")

# Calculate basic scores
mse_score = mse(forecast, observations)
rmse_score = rmse(forecast, observations)
mae_score = mae(forecast, observations)

print("\nBasic Scores:")
print(f"MSE: {mse_score.values:.3f}")
print(f"RMSE: {rmse_score.values:.3f}")
print(f"MAE: {mae_score.values:.3f}")
```
### Pandas Workflow Example

```python
from scores.sample_data import simple_forecast_pandas, simple_observations_pandas
from scores.pandas import mse as pandas_mse

# Generate pandas data
forecast_pd = simple_forecast_pandas()
observations_pd = simple_observations_pandas()

print(f"Pandas forecast type: {type(forecast_pd)}")
print(f"Pandas data length: {len(forecast_pd)}")

# Use pandas-specific scoring functions
mse_pd = pandas_mse(forecast_pd, observations_pd)
print(f"Pandas MSE: {mse_pd:.3f}")
```
### Multi-dimensional Data Analysis

```python
from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse, kge, pearsonr

# Generate multi-dimensional data
forecast = continuous_forecast()
observations = continuous_observations()

print("Multi-dimensional data:")
print(f"Forecast dimensions: {forecast.dims}")
print(f"Forecast shape: {forecast.shape}")
print(f"Coordinates: {list(forecast.coords.keys())}")

# Analyze different aspects: collapsing time leaves one score per station
# (a spatial map), while preserving time gives one score per time step
spatial_mse = mse(forecast, observations, reduce_dims="time")
temporal_mse = mse(forecast, observations, preserve_dims="time")
overall_mse = mse(forecast, observations)

print("\nMulti-dimensional Analysis:")
print(f"Spatial MSE shape: {spatial_mse.shape}")
print(f"Temporal MSE shape: {temporal_mse.shape}")
print(f"Overall MSE: {overall_mse.values:.3f}")

# Advanced metrics
kge_score = kge(forecast, observations)
correlation = pearsonr(forecast, observations)

print(f"KGE: {kge_score.values:.3f}")
print(f"Correlation: {correlation.values:.3f}")
```
### Large Dataset Testing

```python
import time

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate large datasets for performance testing
large_forecast = continuous_forecast(large_size=True)
large_observations = continuous_observations(large_size=True)

print(f"Large dataset dimensions: {large_forecast.shape}")
print(f"Memory usage estimate: ~{large_forecast.nbytes / 1e6:.1f} MB")

# Performance timing example (perf_counter is preferred over time.time for timing)
start_time = time.perf_counter()
large_mse = mse(large_forecast, large_observations)
end_time = time.perf_counter()

print(f"Large dataset MSE: {large_mse.values:.3f}")
print(f"Computation time: {end_time - start_time:.3f} seconds")
```
### Lead Time Analysis

```python
from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate forecast with lead times
lead_forecast = continuous_forecast(lead_days=True)

print(f"Lead time forecast dimensions: {lead_forecast.dims}")
print(f"Lead time coordinate: {lead_forecast.lead_time.values}")

# Analyze performance by lead time
observations = continuous_observations()

# Collapse time and station so one score remains per lead time
mse_by_lead = mse(lead_forecast, observations, reduce_dims=["time", "station"])

print("MSE by lead time:")
for lead in mse_by_lead.lead_time.values:
    lead_mse = mse_by_lead.sel(lead_time=lead)
    print(f"  Lead {lead}: MSE = {lead_mse.values:.3f}")
```
### CDF and Probabilistic Data

```python
from scores.sample_data import cdf_forecast, cdf_observations
from scores.probability import crps_cdf

# Generate CDF forecast data
cdf_fcst = cdf_forecast()
cdf_obs = cdf_observations()

print(f"CDF forecast dimensions: {cdf_fcst.dims}")
print(f"CDF thresholds: {len(cdf_fcst.threshold)} points")
print(f"Threshold range: [{cdf_fcst.threshold.min().values:.1f}, {cdf_fcst.threshold.max().values:.1f}]")

# Verify CDF properties
print(f"CDF starts at: {cdf_fcst.isel(threshold=0).values.mean():.3f}")
print(f"CDF ends at: {cdf_fcst.isel(threshold=-1).values.mean():.3f}")

# Calculate CRPS for CDF forecasts
crps_score = crps_cdf(cdf_fcst, cdf_obs, threshold_dim="threshold")
print(f"CRPS for CDF forecast: {crps_score.values:.3f}")
```
### Data Exploration and Visualization

```python
import matplotlib.pyplot as plt
import numpy as np

from scores.sample_data import (
    simple_forecast,
    simple_observations,
    continuous_forecast,
    continuous_observations,
)

# Generate various sample data
simple_fcst = simple_forecast()
simple_obs = simple_observations()
cont_fcst = continuous_forecast()
cont_obs = continuous_observations()

# Create comparison plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 10))

# Simple data scatter plot
ax1.scatter(simple_fcst, simple_obs, alpha=0.6)
ax1.plot([simple_fcst.min(), simple_fcst.max()],
         [simple_fcst.min(), simple_fcst.max()], 'r--', alpha=0.7)
ax1.set_xlabel('Simple Forecast')
ax1.set_ylabel('Simple Observation')
ax1.set_title('Simple Data Scatter Plot')
ax1.grid(True)

# Time series plot (if time dimension exists)
if 'time' in cont_fcst.dims:
    time_slice = cont_fcst.isel(station=0) if 'station' in cont_fcst.dims else cont_fcst
    obs_slice = cont_obs.isel(station=0) if 'station' in cont_obs.dims else cont_obs

    ax2.plot(time_slice, label='Forecast', alpha=0.7)
    ax2.plot(obs_slice, label='Observation', alpha=0.7)
    ax2.set_xlabel('Time Index')
    ax2.set_ylabel('Value')
    ax2.set_title('Time Series Comparison')
    ax2.legend()
    ax2.grid(True)

# Distribution comparison
ax3.hist(simple_fcst, bins=20, alpha=0.6, label='Forecast', density=True)
ax3.hist(simple_obs, bins=20, alpha=0.6, label='Observation', density=True)
ax3.set_xlabel('Value')
ax3.set_ylabel('Density')
ax3.set_title('Distribution Comparison')
ax3.legend()
ax3.grid(True)

# Error analysis
errors = simple_fcst - simple_obs
ax4.hist(errors, bins=20, alpha=0.7, color='green')
ax4.axvline(0, color='red', linestyle='--', alpha=0.7)
ax4.set_xlabel('Forecast Error')
ax4.set_ylabel('Frequency')
ax4.set_title('Error Distribution')
ax4.grid(True)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nData Summary Statistics:")
print(f"Simple forecast: mean={simple_fcst.mean().values:.3f}, std={simple_fcst.std().values:.3f}")
print(f"Simple observation: mean={simple_obs.mean().values:.3f}, std={simple_obs.std().values:.3f}")
print(f"Error statistics: mean={errors.mean().values:.3f}, std={errors.std().values:.3f}")
```
### Custom Data Integration

```python
import numpy as np

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Use sample data as templates for custom data generation
template_fcst = continuous_forecast()
template_obs = continuous_observations()

# Create custom data with similar structure but different values
custom_forecast = template_fcst.copy()
custom_forecast.values = np.random.normal(
    template_fcst.mean(),
    template_fcst.std() * 1.2,  # 20% more variable
    template_fcst.shape
)

custom_observations = template_obs.copy()
custom_observations.values = np.random.normal(
    template_obs.mean(),
    template_obs.std(),
    template_obs.shape
)

# Verify custom data maintains structure
print("Custom data verification:")
print(f"Dimensions match: {custom_forecast.dims == template_fcst.dims}")
print(f"Coordinates match: {list(custom_forecast.coords.keys()) == list(template_fcst.coords.keys())}")

# Score custom data
custom_mse = mse(custom_forecast, custom_observations)
print(f"Custom data MSE: {custom_mse.values:.3f}")
```
### Batch Data Generation

```python
import numpy as np

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate multiple datasets for ensemble analysis
n_datasets = 10
forecast_ensemble = []
observation_ensemble = []

for i in range(n_datasets):
    # Vary the random seed so each generated dataset differs
    np.random.seed(42 + i)
    fcst = continuous_forecast()
    obs = continuous_observations()

    forecast_ensemble.append(fcst)
    observation_ensemble.append(obs)

# Analyze ensemble statistics
ensemble_mses = [mse(f, o).values for f, o in zip(forecast_ensemble, observation_ensemble)]

print("Ensemble MSE Statistics:")
print(f"Mean MSE: {np.mean(ensemble_mses):.3f}")
print(f"MSE Range: [{np.min(ensemble_mses):.3f}, {np.max(ensemble_mses):.3f}]")
print(f"MSE Std Dev: {np.std(ensemble_mses):.3f}")
```