
# Data Processing and Utilities

Utilities for data handling, downsampling, sample data generation, and type processing. They keep large tables responsive and provide ready-made data for development, testing, and demonstration purposes.

## Capabilities

### Data Downsampling

Functions that automatically reduce a DataFrame when it exceeds the specified limits, keeping tables responsive while preserving the overall structure and a representative subset of the data.

```python { .api }
def downsample(df, max_rows=0, max_columns=0, max_bytes=0):
    """
    Return a subset of the DataFrame that fits the specified limits.

    Parameters:
    - df: Pandas/Polars DataFrame or Series to downsample
    - max_rows (int): Maximum number of rows (0 = unlimited)
    - max_columns (int): Maximum number of columns (0 = unlimited)
    - max_bytes (int | str): Maximum memory usage ("64KB", "1MB", or integer bytes)

    Returns:
    tuple[DataFrame, str]: (downsampled_df, warning_message)
    - warning_message is an empty string if no downsampling occurred
    """

def nbytes(df):
    """
    Calculate the memory usage of a DataFrame.

    Parameters:
    - df: Pandas/Polars DataFrame or Series

    Returns:
    int: Memory usage in bytes
    """

def as_nbytes(mem):
    """
    Convert a memory specification to bytes.

    Parameters:
    - mem (int | float | str): Memory specification ("64KB", "1MB", etc., or numeric)

    Returns:
    int: Memory size in bytes

    Raises:
    ValueError: If the specification format is invalid or too large (>= 1GB)
    """
```
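To make the `"64KB"`/`"1MB"` convention concrete, here is a rough, self-contained sketch of the kind of parsing `as_nbytes` performs. This is an illustration only — `parse_mem` and `UNITS` are hypothetical names, and the library's actual implementation and accepted formats may differ:

```python
import re

# Units for a simplified memory-string parser (illustrative; the real
# as_nbytes may accept more formats than this sketch does).
UNITS = {"B": 1, "KB": 2**10, "MB": 2**20}

def parse_mem(mem):
    """Convert an int/float/str memory specification to bytes."""
    if isinstance(mem, (int, float)):
        return int(mem)
    match = re.fullmatch(r"\s*([\d.]+)\s*(B|KB|MB)\s*", str(mem), re.IGNORECASE)
    if match is None:
        raise ValueError(f"Invalid memory specification: {mem!r}")
    value = float(match.group(1)) * UNITS[match.group(2).upper()]
    if value >= 2**30:  # mirror as_nbytes' rejection of specs >= 1GB
        raise ValueError(f"Memory specification too large: {mem!r}")
    return int(value)

print(parse_mem("64KB"))  # 65536
print(parse_mem("1MB"))   # 1048576
```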

### Sample Data Generation

A comprehensive collection of functions for generating test data with various data types, structures, and complexities for development, testing, and demonstration purposes.

```python { .api }
def get_countries(html=False, climate_zone=False):
    """
    Return DataFrame with world countries data from the World Bank.

    Parameters:
    - html (bool): If True, include HTML-formatted country/capital links and flag images
    - climate_zone (bool): If True, add climate zone and hemisphere columns

    Returns:
    pd.DataFrame: Countries data with columns: region, country, capital, longitude, latitude
    """

def get_population():
    """
    Return Series with world population data from the World Bank.

    Returns:
    pd.Series: Population data indexed by country name
    """

def get_indicators():
    """
    Return DataFrame with a subset of World Bank indicators.

    Returns:
    pd.DataFrame: World Bank indicators data
    """

def get_df_complex_index():
    """
    Return DataFrame with a complex multi-level index for testing.

    Returns:
    pd.DataFrame: DataFrame with MultiIndex (region, country) and MultiIndex columns
    """

def get_dict_of_test_dfs(N=100, M=100):
    """
    Return dictionary of test DataFrames with various data types and structures.

    Parameters:
    - N (int): Number of rows for generated data
    - M (int): Number of columns for the wide DataFrame

    Returns:
    dict[str, pd.DataFrame]: Test DataFrames including empty, boolean, int, float,
    string, datetime, categorical, object, multiindex, and complex index types
    """

def get_dict_of_polars_test_dfs(N=100, M=100):
    """
    Return dictionary of Polars test DataFrames.

    Parameters:
    - N (int): Number of rows for generated data
    - M (int): Number of columns for the wide DataFrame

    Returns:
    dict[str, pl.DataFrame]: Polars versions of the test DataFrames with the same
    structure as the pandas versions
    """

def generate_random_df(rows, columns, column_types=None):
    """
    Generate random DataFrame with specified dimensions and data types.

    Parameters:
    - rows (int): Number of rows to generate
    - columns (int): Number of columns to generate
    - column_types (list, optional): List of data types to use (default: COLUMN_TYPES)

    Returns:
    pd.DataFrame: Random DataFrame with mixed data types
    """

def generate_random_series(rows, type):
    """
    Generate random Series of the specified type and length.

    Parameters:
    - rows (int): Number of rows to generate
    - type (str): Data type ("bool", "int", "float", "str", "categories",
      "boolean", "Int64", "date", "datetime", "timedelta")

    Returns:
    pd.Series: Random Series of the specified type
    """

def get_dict_of_test_series():
    """
    Return dictionary of test Series with various data types.

    Returns:
    dict[str, pd.Series]: Test Series including boolean, int, float, string,
    categorical, datetime, and complex types
    """

def get_dict_of_polars_test_series():
    """
    Return dictionary of Polars test Series.

    Returns:
    dict[str, pl.Series]: Polars versions of the test Series
    """

def generate_date_series():
    """
    Generate Series with various date formats and edge cases.

    Returns:
    pd.Series: Date series with timezones, leap years, and boundary dates
    """

def get_pandas_styler():
    """
    Return styled Pandas DataFrame with background colors and tooltips.

    Returns:
    pd.Styler: Styled DataFrame with trigonometric data and formatting
    """
```

### Package Utilities

Helper functions for accessing ITables package resources and internal file management.

```python { .api }
def find_package_file(*path):
    """
    Return the full path to a file within the ITables package.

    Parameters:
    - *path (str): Path components relative to the package root

    Returns:
    Path: Full path to the package file
    """

def read_package_file(*path):
    """
    Read and return the content of a file within the ITables package.

    Parameters:
    - *path (str): Path components relative to the package root

    Returns:
    str: File content as a string
    """
```

## Usage Examples

### Automatic Downsampling

```python
import numpy as np
import pandas as pd
from itables.downsample import downsample

# Create a large DataFrame
df = pd.DataFrame({
    'data': range(10000),
    'values': np.random.randn(10000)
})

# Downsample to fit the limits
small_df, warning = downsample(df, max_rows=1000, max_bytes="1MB")

if warning:
    print(f"Downsampling applied: {warning}")
print(f"Original shape: {df.shape}, New shape: {small_df.shape}")
```

### Sample Data Usage

```python
from itables.sample_dfs import get_countries, get_dict_of_test_dfs
from itables import show

# Display world countries data
countries = get_countries(html=True, climate_zone=True)
show(countries, caption="World Countries with Climate Data")

# Get various test DataFrames
test_dfs = get_dict_of_test_dfs(N=50, M=10)

# Display different data types
show(test_dfs['float'], caption="Float Data Types")
show(test_dfs['time'], caption="Time Data Types")
show(test_dfs['multiindex'], caption="MultiIndex Example")
```

### Random Data Generation

```python
from itables.sample_dfs import generate_random_df, COLUMN_TYPES
from itables import show

# Generate a random DataFrame
random_df = generate_random_df(
    rows=100,
    columns=8,
    column_types=['int', 'float', 'str', 'bool', 'date', 'categories']
)
show(random_df, caption="Random Generated Data")

# Generate with all supported types
full_random = generate_random_df(rows=50, columns=len(COLUMN_TYPES))
show(full_random, caption="All Data Types")
```

### Styled DataFrames

```python
from itables.sample_dfs import get_pandas_styler
from itables import show

# Get a pre-styled DataFrame
styled_df = get_pandas_styler()
show(styled_df,
     caption="Styled Trigonometric Data",
     allow_html=True)  # Required for styled DataFrames
```

### Memory Analysis

```python
import pandas as pd
from itables.downsample import nbytes, as_nbytes

# Analyze DataFrame memory usage
df = pd.DataFrame({
    'A': range(1000),
    'B': ['text'] * 1000,
    'C': pd.date_range('2020-01-01', periods=1000)
})

memory_usage = nbytes(df)
print(f"DataFrame uses {memory_usage:,} bytes")

# Convert memory specifications
print(f"64KB = {as_nbytes('64KB'):,} bytes")
print(f"1MB = {as_nbytes('1MB'):,} bytes")
print(f"Direct int: {as_nbytes(1024)} bytes")
```

### Custom Test Data

```python
from itables.sample_dfs import get_dict_of_test_dfs, get_dict_of_test_series
from itables import show

# Get all test DataFrames
test_data = get_dict_of_test_dfs(N=20, M=5)

# Show specific interesting cases
show(test_data['empty'], caption="Empty DataFrame")
show(test_data['duplicated_columns'], caption="Duplicated Column Names")
show(test_data['big_integers'], caption="Large Integer Handling")

# Test Series data
test_series = get_dict_of_test_series()
for name, series in list(test_series.items())[:3]:
    show(series.to_frame(), caption=f"Series: {name}")
```

### Package Resource Access

```python
from itables.utils import find_package_file, read_package_file

# Find package files
dt_bundle_path = find_package_file("html", "dt_bundle.js")
print(f"DataTables bundle located at: {dt_bundle_path}")

# Read package content (for advanced use cases)
init_html = read_package_file("html", "init_datatables.html")
print(f"Init HTML template length: {len(init_html)} characters")
```

## Data Type Support

### Supported Column Types

The `COLUMN_TYPES` constant defines all supported data types for random generation:

```python
COLUMN_TYPES = [
    "bool",        # Boolean values
    "int",         # Integer values
    "float",       # Floating point (with NaN, inf handling)
    "str",         # String values
    "categories",  # Categorical data
    "boolean",     # Nullable boolean (pandas extension)
    "Int64",       # Nullable integer (pandas extension)
    "date",        # Date values
    "datetime",    # Datetime values
    "timedelta",   # Time duration values
]
```
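As an illustration of what a few of these type names mean in practice, here is a minimal pure-Python sketch of per-type value generators. `GENERATORS` is a hypothetical helper invented for this document — `generate_random_series` in itables has its own internals and covers the full list above:

```python
import datetime
import random

# Hypothetical per-type generators, illustrating a few of the type names
# above; this is not itables' implementation.
GENERATORS = {
    "bool": lambda: random.choice([True, False]),
    "int": lambda: random.randint(-1000, 1000),
    "float": lambda: random.uniform(-1.0, 1.0),
    "str": lambda: random.choice(["alpha", "beta", "gamma"]),
    "date": lambda: datetime.date(2020, 1, 1)
        + datetime.timedelta(days=random.randrange(365)),
}

column = [GENERATORS["int"]() for _ in range(5)]
print(column)  # five random ints, e.g. [417, -23, 980, -512, 7]
```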

368

369

### Special Value Handling

370

371

- **NaN/Null values**: Automatically handled for appropriate data types

372

- **Infinite values**: Properly encoded for JSON serialization

373

- **Large integers**: Preserved without precision loss

374

- **Complex objects**: Converted to string representation with warnings

375

- **Polars types**: Full compatibility including unsigned integers and struct types
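The large-integer guarantee matters because a float64 mantissa has only 53 bits; a quick standalone illustration of the precision loss that naive float conversion would cause:

```python
# 2**60 + 1 needs 61 bits of precision, more than float64's 53-bit
# mantissa, so converting to float silently collapses distinct values.
big = 2**60 + 1
print(float(big) == float(big - 1))  # True: indistinguishable as floats
print(big == big - 1)                # False: exact as Python ints
```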

### Memory Optimization

The downsampling system is designed to:

- Preserve data structure (first/last rows are kept for temporal continuity)
- Maintain aspect ratios when possible
- Provide clear warnings about data reduction
- Support row and column limits simultaneously
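The head-and-tail idea behind the first point can be sketched in a few lines of plain Python. This is a simplified illustration of the principle, not the library's actual algorithm:

```python
# Keep the first and last rows when trimming, so the start and end of a
# time series survive downsampling (simplified; not itables' exact code).
def sample_rows(rows, max_rows):
    """Return at most max_rows rows, preserving the head and the tail."""
    if max_rows <= 0 or len(rows) <= max_rows:
        return rows  # 0 means unlimited, matching the downsample() API
    tail = max_rows // 2
    return rows[:max_rows - tail] + rows[len(rows) - tail:]

print(sample_rows(list(range(10)), 4))  # [0, 1, 8, 9]
```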