# Sample Data

Built-in sample data generation for tutorials, testing, and experimentation with various data formats and structures. Provides realistic synthetic data that demonstrates the features and capabilities of the scores package.

## Capabilities

### Simple Data Generation

Basic one-dimensional data arrays for quick testing and tutorials.

#### Simple Forecast Data

```python { .api }
def simple_forecast() -> xr.DataArray:
    """
    Generate a simple series of prediction values for tutorials.

    Returns:
        DataArray with simple forecast values

    Characteristics:
        - Single-dimension array
        - Realistic forecast values
        - No missing data
        - Suitable for basic scoring function demos
    """
```

#### Simple Observation Data

```python { .api }
def simple_observations() -> xr.DataArray:
    """
    Generate a simple series of observation values for tutorials.

    Returns:
        DataArray with simple observation values

    Characteristics:
        - Matches the simple_forecast() structure
        - Corresponding observation values
        - Suitable for basic verification examples
    """
```
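Neither function's exact values are specified above. As a rough sketch of the documented shape (one-dimensional, no missing data), here are hand-built stand-ins and a by-hand mean squared error; the values below are illustrative, not what `simple_forecast()` actually returns:

```python
import xarray as xr

# Illustrative stand-ins for simple_forecast() / simple_observations():
# one-dimensional, realistic values, no missing data. The actual values
# returned by the scores package are not specified on this page.
forecast = xr.DataArray([10.0, 10.5, 11.1, 12.0, 11.4], dims="time")
observations = xr.DataArray([10.2, 10.3, 11.5, 11.8, 11.2], dims="time")

# MSE computed by hand -- the quantity scores.continuous.mse reduces to
# when no dimensions are preserved:
mse_by_hand = float(((forecast - observations) ** 2).mean())
print(f"MSE: {mse_by_hand:.3f}")  # -> MSE: 0.064
```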

### Pandas Data Generation

Simple data generation for pandas-based workflows.

#### Pandas Forecast Series

```python { .api }
def simple_forecast_pandas() -> pd.Series:
    """
    Generate a simple pandas Series of prediction values.

    Returns:
        Pandas Series with forecast values

    Notes:
        - Pandas Series format instead of xarray
        - Compatible with pandas-specific scoring functions
        - Useful for traditional pandas workflows
    """
```

#### Pandas Observation Series

```python { .api }
def simple_observations_pandas() -> pd.Series:
    """
    Generate a simple pandas Series of observation values.

    Returns:
        Pandas Series with observation values

    Notes:
        - Matches the simple_forecast_pandas() structure
        - Corresponding observation data
        - Suitable for pandas-based verification
    """
```
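The pandas generators mirror the xarray ones, so a series can be promoted to a DataArray when an xarray-based scoring function is needed. A minimal sketch of that conversion (the values below are illustrative; the real series contents are unspecified here):

```python
import pandas as pd
import xarray as xr

# Illustrative stand-ins for simple_forecast_pandas() /
# simple_observations_pandas(); the real values are unspecified here.
forecast_pd = pd.Series([1.0, 2.0, 3.0], name="forecast")
observations_pd = pd.Series([1.5, 1.5, 3.5], name="observations")

# pd.Series -> xr.DataArray: the series index becomes a coordinate, so
# the converted arrays align automatically in xarray arithmetic.
forecast_xr = xr.DataArray.from_series(forecast_pd)
observations_xr = xr.DataArray.from_series(observations_pd)

squared_error = (forecast_xr - observations_xr) ** 2
print(f"MSE via xarray: {float(squared_error.mean()):.3f}")  # -> MSE via xarray: 0.250
```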

### Multi-dimensional Continuous Data

Realistic multi-dimensional arrays for comprehensive testing of scoring functions.

#### Continuous Observations

```python { .api }
def continuous_observations(*, large_size: bool = False) -> xr.DataArray:
    """
    Create a continuous observation array with synthetic data.

    Args:
        large_size: Generate a larger dataset for performance testing

    Returns:
        Multi-dimensional DataArray with synthetic observation data

    Dimensions:
        - time: Temporal dimension with regular intervals
        - station: Spatial stations (if multi-dimensional)
        - Additional dimensions based on configuration

    Characteristics:
        - Realistic temporal and spatial patterns
        - Seasonal cycles and trends
        - Missing data patterns
        - Labeled coordinates with metadata
    """
```

#### Continuous Forecast Data

```python { .api }
def continuous_forecast(*, large_size: bool = False, lead_days: bool = False) -> xr.DataArray:
    """
    Create a continuous forecast array with synthetic data.

    Args:
        large_size: Generate a larger dataset for performance testing
        lead_days: Include a lead time dimension for forecast horizons

    Returns:
        Multi-dimensional DataArray with synthetic forecast data

    Dimensions:
        - time: Valid time dimension
        - station: Spatial stations (if multi-dimensional)
        - lead_time: Forecast lead times (if lead_days=True)

    Characteristics:
        - Corresponds to the continuous_observations() structure
        - Realistic forecast errors and biases
        - Lead time dependencies (if enabled)
        - Ensemble-like variations
    """
```
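The two continuous generators are documented to share structure, and xarray's labeled coordinates are what make that pairing convenient: reductions can target any dimension by name. A hedged sketch with toy `(time, station)` arrays; the real dimension sizes and coordinates of `continuous_forecast()` are assumptions here:

```python
import numpy as np
import xarray as xr

# Toy (time, station) arrays mimicking the documented dimensions; the
# actual shapes and coordinates of continuous_observations() /
# continuous_forecast() are not specified on this page.
rng = np.random.default_rng(0)
obs = xr.DataArray(
    rng.normal(20.0, 2.0, size=(4, 3)),
    dims=("time", "station"),
    coords={"time": range(4), "station": ["a", "b", "c"]},
)
fcst = obs + rng.normal(0.5, 1.0, size=obs.shape)  # biased, noisy "forecast"

# Because both arrays carry labeled coordinates, per-dimension summaries
# are a single named reduction away:
per_station_bias = (fcst - obs).mean(dim="time")
print(per_station_bias.shape)  # one value per station
```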

### CDF/Probability Data

Specialized data for probabilistic verification and CDF-based scoring.

#### CDF Forecast Data

```python { .api }
def cdf_forecast(*, lead_days: bool = False) -> xr.DataArray:
    """
    Create a forecast array with a CDF at each point.

    Args:
        lead_days: Include a lead time dimension

    Returns:
        DataArray with CDF forecast values in [0, 1]

    Dimensions:
        - time: Valid time dimension
        - threshold: CDF threshold values
        - lead_time: Forecast lead times (if lead_days=True)

    Characteristics:
        - Monotonically increasing CDFs
        - Realistic probability distributions
        - Multiple threshold levels
        - Suitable for CRPS calculations
    """
```

#### CDF Observation Data

```python { .api }
def cdf_observations() -> xr.DataArray:
    """
    Create an observation array compatible with CDF forecasts.

    Returns:
        DataArray with observation values

    Characteristics:
        - Compatible with cdf_forecast() output
        - Continuous values for CDF evaluation
        - Matching temporal structure
        - Realistic value ranges
    """
```
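The documented CDF invariants (values in [0, 1], monotonically increasing along the threshold dimension) are easy to sanity-check. A sketch using a hand-built logistic CDF as a stand-in for `cdf_forecast()`, whose actual distributions are unspecified here:

```python
import numpy as np
import xarray as xr

# A hand-built CDF forecast: one logistic CDF per time step on a shared
# threshold grid, standing in for cdf_forecast(). The real distributions
# used by the scores package are not specified on this page.
thresholds = np.linspace(0.0, 40.0, 21)
means = np.array([18.0, 20.0, 22.0])
cdf_vals = 1.0 / (1.0 + np.exp(-(thresholds[None, :] - means[:, None]) / 3.0))
cdf_fcst = xr.DataArray(
    cdf_vals,
    dims=("time", "threshold"),
    coords={"time": range(3), "threshold": thresholds},
)

# Check the two documented invariants:
in_unit_interval = bool(((cdf_fcst >= 0) & (cdf_fcst <= 1)).all())
monotone = bool((cdf_fcst.diff("threshold") >= 0).all())
print(in_unit_interval, monotone)  # -> True True
```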

## Usage Patterns

### Basic Tutorial Examples

```python
from scores.sample_data import simple_forecast, simple_observations
from scores.continuous import mse, rmse, mae

# Generate simple tutorial data
forecast = simple_forecast()
observations = simple_observations()

print(f"Forecast shape: {forecast.shape}")
print(f"Forecast range: [{forecast.min().values:.2f}, {forecast.max().values:.2f}]")

# Calculate basic scores
mse_score = mse(forecast, observations)
rmse_score = rmse(forecast, observations)
mae_score = mae(forecast, observations)

print("\nBasic Scores:")
print(f"MSE: {mse_score.values:.3f}")
print(f"RMSE: {rmse_score.values:.3f}")
print(f"MAE: {mae_score.values:.3f}")
```

### Pandas Workflow Example

```python
from scores.sample_data import simple_forecast_pandas, simple_observations_pandas
from scores.pandas import mse as pandas_mse

# Generate pandas data
forecast_pd = simple_forecast_pandas()
observations_pd = simple_observations_pandas()

print(f"Pandas forecast type: {type(forecast_pd)}")
print(f"Pandas data length: {len(forecast_pd)}")

# Use pandas-specific scoring functions
mse_pd = pandas_mse(forecast_pd, observations_pd)
print(f"Pandas MSE: {mse_pd:.3f}")
```

### Multi-dimensional Data Analysis

```python
from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse, kge, pearsonr

# Generate multi-dimensional data
forecast = continuous_forecast()
observations = continuous_observations()

print("Multi-dimensional data:")
print(f"Forecast dimensions: {forecast.dims}")
print(f"Forecast shape: {forecast.shape}")
print(f"Coordinates: {list(forecast.coords.keys())}")

# Analyze different aspects
temporal_mse = mse(forecast, observations, reduce_dims="time")
spatial_mse = mse(forecast, observations, preserve_dims="time")
overall_mse = mse(forecast, observations)

print("\nMulti-dimensional Analysis:")
print(f"Temporal MSE shape: {temporal_mse.shape}")
print(f"Spatial MSE shape: {spatial_mse.shape}")
print(f"Overall MSE: {overall_mse.values:.3f}")

# Advanced metrics
kge_score = kge(forecast, observations)
correlation = pearsonr(forecast, observations)

print(f"KGE: {kge_score.values:.3f}")
print(f"Correlation: {correlation.values:.3f}")
```

### Large Dataset Testing

```python
import time

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate large datasets for performance testing
large_forecast = continuous_forecast(large_size=True)
large_observations = continuous_observations(large_size=True)

print(f"Large dataset dimensions: {large_forecast.shape}")
print(f"Memory usage estimate: ~{large_forecast.nbytes / 1e6:.1f} MB")

# Performance timing example
start_time = time.time()
large_mse = mse(large_forecast, large_observations)
end_time = time.time()

print(f"Large dataset MSE: {large_mse.values:.3f}")
print(f"Computation time: {end_time - start_time:.3f} seconds")
```

### Lead Time Analysis

```python
from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate forecast with lead times
lead_forecast = continuous_forecast(lead_days=True)

print(f"Lead time forecast dimensions: {lead_forecast.dims}")
print(f"Lead time coordinate: {lead_forecast.lead_time.values}")

# Analyze performance by lead time
observations = continuous_observations()

# Calculate MSE for each lead time
mse_by_lead = mse(lead_forecast, observations, reduce_dims=["time", "station"])

print("MSE by lead time:")
for lead in mse_by_lead.lead_time.values:
    lead_mse = mse_by_lead.sel(lead_time=lead)
    print(f"  Lead {lead}: MSE = {lead_mse.values:.3f}")
```

### CDF and Probabilistic Data

```python
from scores.sample_data import cdf_forecast, cdf_observations
from scores.probability import crps_cdf

# Generate CDF forecast data
cdf_fcst = cdf_forecast()
cdf_obs = cdf_observations()

print(f"CDF forecast dimensions: {cdf_fcst.dims}")
print(f"CDF thresholds: {len(cdf_fcst.threshold)} points")
print(f"Threshold range: [{cdf_fcst.threshold.min().values:.1f}, {cdf_fcst.threshold.max().values:.1f}]")

# Verify CDF properties
print(f"CDF starts at: {cdf_fcst.isel(threshold=0).values.mean():.3f}")
print(f"CDF ends at: {cdf_fcst.isel(threshold=-1).values.mean():.3f}")

# Calculate CRPS for CDF forecasts
crps_score = crps_cdf(cdf_fcst, cdf_obs, threshold_dim="threshold")
print(f"CRPS for CDF forecast: {crps_score.values:.3f}")
```

### Data Exploration and Visualization

```python
import matplotlib.pyplot as plt

from scores.sample_data import (
    simple_forecast,
    simple_observations,
    continuous_forecast,
    continuous_observations,
)

# Generate various sample data
simple_fcst = simple_forecast()
simple_obs = simple_observations()
cont_fcst = continuous_forecast()
cont_obs = continuous_observations()

# Create comparison plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 10))

# Simple data scatter plot
ax1.scatter(simple_fcst, simple_obs, alpha=0.6)
ax1.plot(
    [simple_fcst.min(), simple_fcst.max()],
    [simple_fcst.min(), simple_fcst.max()],
    'r--',
    alpha=0.7,
)
ax1.set_xlabel('Simple Forecast')
ax1.set_ylabel('Simple Observation')
ax1.set_title('Simple Data Scatter Plot')
ax1.grid(True)

# Time series plot (if time dimension exists)
if 'time' in cont_fcst.dims:
    time_slice = cont_fcst.isel(station=0) if 'station' in cont_fcst.dims else cont_fcst
    obs_slice = cont_obs.isel(station=0) if 'station' in cont_obs.dims else cont_obs

    ax2.plot(time_slice, label='Forecast', alpha=0.7)
    ax2.plot(obs_slice, label='Observation', alpha=0.7)
    ax2.set_xlabel('Time Index')
    ax2.set_ylabel('Value')
    ax2.set_title('Time Series Comparison')
    ax2.legend()
    ax2.grid(True)

# Distribution comparison
ax3.hist(simple_fcst, bins=20, alpha=0.6, label='Forecast', density=True)
ax3.hist(simple_obs, bins=20, alpha=0.6, label='Observation', density=True)
ax3.set_xlabel('Value')
ax3.set_ylabel('Density')
ax3.set_title('Distribution Comparison')
ax3.legend()
ax3.grid(True)

# Error analysis
errors = simple_fcst - simple_obs
ax4.hist(errors, bins=20, alpha=0.7, color='green')
ax4.axvline(0, color='red', linestyle='--', alpha=0.7)
ax4.set_xlabel('Forecast Error')
ax4.set_ylabel('Frequency')
ax4.set_title('Error Distribution')
ax4.grid(True)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nData Summary Statistics:")
print(f"Simple forecast: mean={simple_fcst.mean().values:.3f}, std={simple_fcst.std().values:.3f}")
print(f"Simple observation: mean={simple_obs.mean().values:.3f}, std={simple_obs.std().values:.3f}")
print(f"Error statistics: mean={errors.mean().values:.3f}, std={errors.std().values:.3f}")
```

### Custom Data Integration

```python
import numpy as np

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Use sample data as templates for custom data generation
template_fcst = continuous_forecast()
template_obs = continuous_observations()

# Create custom data with similar structure but different values
custom_forecast = template_fcst.copy()
custom_forecast.values = np.random.normal(
    float(template_fcst.mean()),
    float(template_fcst.std()) * 1.2,  # 20% more variable
    template_fcst.shape,
)

custom_observations = template_obs.copy()
custom_observations.values = np.random.normal(
    float(template_obs.mean()),
    float(template_obs.std()),
    template_obs.shape,
)

# Verify custom data maintains structure
print("Custom data verification:")
print(f"Dimensions match: {custom_forecast.dims == template_fcst.dims}")
print(f"Coordinates match: {list(custom_forecast.coords.keys()) == list(template_fcst.coords.keys())}")

# Score custom data
custom_mse = mse(custom_forecast, custom_observations)
print(f"Custom data MSE: {custom_mse.values:.3f}")
```

### Batch Data Generation

```python
import numpy as np

from scores.sample_data import continuous_forecast, continuous_observations
from scores.continuous import mse

# Generate multiple datasets for ensemble analysis
n_datasets = 10
forecast_ensemble = []
observation_ensemble = []

for i in range(n_datasets):
    # Vary the random seed between members
    np.random.seed(42 + i)
    fcst = continuous_forecast()
    obs = continuous_observations()

    forecast_ensemble.append(fcst)
    observation_ensemble.append(obs)

# Analyze ensemble statistics
ensemble_mses = [mse(f, o).values for f, o in zip(forecast_ensemble, observation_ensemble)]

print("Ensemble MSE Statistics:")
print(f"Mean MSE: {np.mean(ensemble_mses):.3f}")
print(f"MSE Range: [{np.min(ensemble_mses):.3f}, {np.max(ensemble_mses):.3f}]")
print(f"MSE Std Dev: {np.std(ensemble_mses):.3f}")
```