# Data Processing and Utilities

Utilities for data handling, downsampling, sample data generation, and type processing. These helpers keep tables responsive on large inputs and supply sample data for development, testing, and demonstration.

## Capabilities

### Data Downsampling

Functions that automatically reduce a DataFrame when it exceeds specified limits, keeping the table responsive while preserving the data's structure and overall shape.

```python { .api }
def downsample(df, max_rows=0, max_columns=0, max_bytes=0):
    """
    Return a subset of the DataFrame that fits the specified limits.

    Parameters:
    - df: Pandas/Polars DataFrame or Series to downsample
    - max_rows (int): Maximum number of rows (0 = unlimited)
    - max_columns (int): Maximum number of columns (0 = unlimited)
    - max_bytes (int | str): Maximum memory usage ("64KB", "1MB", or integer bytes; 0 = unlimited)

    Returns:
    tuple[DataFrame, str]: (downsampled_df, warning_message)
    - warning_message is an empty string if no downsampling occurred
    """

def nbytes(df):
    """
    Calculate the memory usage of a DataFrame.

    Parameters:
    - df: Pandas/Polars DataFrame or Series

    Returns:
    int: Memory usage in bytes
    """

def as_nbytes(mem):
    """
    Convert a memory specification to bytes.

    Parameters:
    - mem (int | float | str): Memory specification ("64KB", "1MB", etc., or a number)

    Returns:
    int: Memory size in bytes

    Raises:
    ValueError: If the specification format is invalid or too large (>= 1GB)
    """
```
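
To illustrate the idea behind `as_nbytes`, here is a minimal, self-contained sketch of a `"64KB"`/`"1MB"`-style parser. `parse_nbytes` and `_UNITS` are illustrative names introduced here, not the actual itables implementation, which may differ in detail.

```python
import re

# Bytes per unit for "64KB"/"1MB"-style specifications (illustrative sketch).
_UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30}

def parse_nbytes(mem):
    """Convert a memory spec (number or '64KB'-style string) to bytes.

    Hypothetical re-implementation sketch, not the itables function."""
    if isinstance(mem, (int, float)):
        return int(mem)
    match = re.fullmatch(r"\s*([0-9.]+)\s*([KMG]?B)\s*", str(mem), re.IGNORECASE)
    if not match:
        raise ValueError(f"Invalid memory specification: {mem!r}")
    value, unit = match.groups()
    n = int(float(value) * _UNITS[unit.upper()])
    if n >= 2**30:  # mirror the documented >= 1GB limit
        raise ValueError(f"Memory specification too large: {mem!r}")
    return n

print(parse_nbytes("64KB"))  # 65536
print(parse_nbytes("1MB"))   # 1048576
print(parse_nbytes(1024))    # 1024
```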

### Sample Data Generation

A collection of functions for generating test data with various data types, structures, and complexities for development, testing, and demonstration.

```python { .api }
def get_countries(html=False, climate_zone=False):
    """
    Return a DataFrame of world countries data from the World Bank.

    Parameters:
    - html (bool): If True, include HTML-formatted country/capital links and flag images
    - climate_zone (bool): If True, add climate zone and hemisphere columns

    Returns:
    pd.DataFrame: Countries data with columns: region, country, capital, longitude, latitude
    """

def get_population():
    """
    Return a Series of world population data from the World Bank.

    Returns:
    pd.Series: Population data indexed by country name
    """

def get_indicators():
    """
    Return a DataFrame with a subset of World Bank indicators.

    Returns:
    pd.DataFrame: World Bank indicators data
    """

def get_df_complex_index():
    """
    Return a DataFrame with a complex multi-level index for testing.

    Returns:
    pd.DataFrame: DataFrame with a MultiIndex (region, country) and MultiIndex columns
    """

def get_dict_of_test_dfs(N=100, M=100):
    """
    Return a dictionary of test DataFrames with various data types and structures.

    Parameters:
    - N (int): Number of rows for generated data
    - M (int): Number of columns for the wide DataFrame

    Returns:
    dict[str, pd.DataFrame]: Test DataFrames including empty, boolean, int, float,
    string, datetime, categorical, object, multiindex, and complex index types
    """

def get_dict_of_polars_test_dfs(N=100, M=100):
    """
    Return a dictionary of Polars test DataFrames.

    Parameters:
    - N (int): Number of rows for generated data
    - M (int): Number of columns for the wide DataFrame

    Returns:
    dict[str, pl.DataFrame]: Polars versions of the test DataFrames, with the
    same structure as the pandas versions
    """

def generate_random_df(rows, columns, column_types=None):
    """
    Generate a random DataFrame with the specified dimensions and data types.

    Parameters:
    - rows (int): Number of rows to generate
    - columns (int): Number of columns to generate
    - column_types (list, optional): Data types to use (default: COLUMN_TYPES)

    Returns:
    pd.DataFrame: Random DataFrame with mixed data types
    """

def generate_random_series(rows, type):
    """
    Generate a random Series of the specified type and length.

    Parameters:
    - rows (int): Number of rows to generate
    - type (str): Data type ("bool", "int", "float", "str", "categories",
      "boolean", "Int64", "date", "datetime", "timedelta")

    Returns:
    pd.Series: Random Series of the specified type
    """
def get_dict_of_test_series():
    """
    Return a dictionary of test Series with various data types.

    Returns:
    dict[str, pd.Series]: Test Series including boolean, int, float, string,
    categorical, datetime, and complex types
    """

def get_dict_of_polars_test_series():
    """
    Return a dictionary of Polars test Series.

    Returns:
    dict[str, pl.Series]: Polars versions of the test Series
    """

def generate_date_series():
    """
    Generate a Series with various date formats and edge cases.

    Returns:
    pd.Series: Date series covering timezones, leap years, and boundary dates
    """

def get_pandas_styler():
    """
    Return a styled pandas DataFrame with background colors and tooltips.

    Returns:
    Styler: Styled DataFrame with trigonometric data and formatting
    """
```

### Package Utilities

Helper functions for accessing ITables package resources and internal file management.

```python { .api }
def find_package_file(*path):
    """
    Return full path to file within ITables package.

    Parameters:
    - *path (str): Path components relative to package root

    Returns:
    Path: Full path to package file
    """

def read_package_file(*path):
    """
    Read and return content of file within ITables package.

    Parameters:
    - *path (str): Path components relative to package root

    Returns:
    str: File content as string
    """
```

## Usage Examples

### Automatic Downsampling

```python
import numpy as np
import pandas as pd
from itables.downsample import downsample

# Create large DataFrame
df = pd.DataFrame({
    'data': range(10000),
    'values': np.random.randn(10000),
})

# Downsample to fit limits
small_df, warning = downsample(df, max_rows=1000, max_bytes="1MB")

if warning:
    print(f"Downsampling applied: {warning}")
print(f"Original shape: {df.shape}, New shape: {small_df.shape}")
```

### Sample Data Usage

```python
from itables.sample_dfs import get_countries, get_dict_of_test_dfs
from itables import show

# Display world countries data
countries = get_countries(html=True, climate_zone=True)
show(countries, caption="World Countries with Climate Data")

# Get various test DataFrames
test_dfs = get_dict_of_test_dfs(N=50, M=10)

# Display different data types
show(test_dfs['float'], caption="Float Data Types")
show(test_dfs['time'], caption="Time Data Types")
show(test_dfs['multiindex'], caption="MultiIndex Example")
```

### Random Data Generation

```python
from itables.sample_dfs import generate_random_df, COLUMN_TYPES
from itables import show

# Generate random DataFrame
random_df = generate_random_df(
    rows=100,
    columns=8,
    column_types=['int', 'float', 'str', 'bool', 'date', 'categories'],
)

show(random_df, caption="Random Generated Data")

# Generate with all supported types
full_random = generate_random_df(rows=50, columns=len(COLUMN_TYPES))
show(full_random, caption="All Data Types")
```

### Styled DataFrames

```python
from itables.sample_dfs import get_pandas_styler
from itables import show

# Get pre-styled DataFrame
styled_df = get_pandas_styler()
show(styled_df,
     caption="Styled Trigonometric Data",
     allow_html=True)  # Required for styled DataFrames
```

### Memory Analysis

```python
from itables.downsample import nbytes, as_nbytes
import pandas as pd

# Analyze DataFrame memory usage
df = pd.DataFrame({
    'A': range(1000),
    'B': ['text'] * 1000,
    'C': pd.date_range('2020-01-01', periods=1000),
})

memory_usage = nbytes(df)
print(f"DataFrame uses {memory_usage:,} bytes")

# Convert memory specifications
print(f"64KB = {as_nbytes('64KB'):,} bytes")
print(f"1MB = {as_nbytes('1MB'):,} bytes")
print(f"Direct int: {as_nbytes(1024)} bytes")
```

### Custom Test Data

```python
from itables.sample_dfs import get_dict_of_test_dfs, get_dict_of_test_series
from itables import show

# Get all test DataFrames
test_data = get_dict_of_test_dfs(N=20, M=5)

# Show specific interesting cases
show(test_data['empty'], caption="Empty DataFrame")
show(test_data['duplicated_columns'], caption="Duplicated Column Names")
show(test_data['big_integers'], caption="Large Integer Handling")

# Test Series data
test_series = get_dict_of_test_series()
for name, series in list(test_series.items())[:3]:
    show(series.to_frame(), caption=f"Series: {name}")
```

### Package Resource Access

```python
from itables.utils import find_package_file, read_package_file

# Find package files
dt_bundle_path = find_package_file("html", "dt_bundle.js")
print(f"DataTables bundle located at: {dt_bundle_path}")

# Read package content (for advanced use cases)
init_html = read_package_file("html", "init_datatables.html")
print(f"Init HTML template length: {len(init_html)} characters")
```

## Data Type Support

### Supported Column Types

The `COLUMN_TYPES` constant defines all supported data types for random generation:

```python
COLUMN_TYPES = [
    "bool",        # Boolean values
    "int",         # Integer values
    "float",       # Floating point (with NaN, inf handling)
    "str",         # String values
    "categories",  # Categorical data
    "boolean",     # Nullable boolean (pandas extension)
    "Int64",       # Nullable integer (pandas extension)
    "date",        # Date values
    "datetime",    # Datetime values
    "timedelta",   # Time duration values
]
```

### Special Value Handling

- **NaN/Null values**: Automatically handled for appropriate data types
- **Infinite values**: Properly encoded for JSON serialization
- **Large integers**: Preserved without precision loss
- **Complex objects**: Converted to string representation with warnings
- **Polars types**: Full compatibility including unsigned integers and struct types
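
To see why infinite values need special encoding: Python's `json` module emits `Infinity`/`NaN` tokens by default, which strict JSON parsers reject. A generic workaround, shown here as a sketch rather than the actual itables encoding, is to replace non-finite floats before serializing (`encode_special` is a hypothetical helper):

```python
import json
import math

# By default json.dumps emits "Infinity"/"NaN" tokens, which are not valid
# JSON -- this is why non-finite floats need special encoding.
print(json.dumps([1.0, float("inf"), float("nan")]))  # [1.0, Infinity, NaN]

def encode_special(values):
    """Replace non-finite floats with string markers before serializing.

    Illustrative workaround sketch, not the itables implementation."""
    return [
        v if isinstance(v, (int, float)) and math.isfinite(v) else str(v)
        for v in values
    ]

print(json.dumps(encode_special([1.0, float("inf"), float("nan")])))
# [1.0, "inf", "nan"]
```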

### Memory Optimization

The downsampling system:

- Preserves data structure (first/last rows for temporal continuity)
- Maintains aspect ratios when possible
- Provides clear warnings about data reduction
- Supports row, column, and byte limits simultaneously
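
The first/last-row idea can be sketched on a plain Python list, independent of pandas. `downsample_rows` is an illustrative helper introduced here; the actual itables algorithm may differ in detail:

```python
# Illustrative sketch of head/tail row downsampling on a plain list;
# not the actual itables algorithm.
def downsample_rows(rows, max_rows):
    """Keep the first and last rows and drop the middle, so both ends of
    the data (e.g. the start and end of a time series) survive."""
    if max_rows <= 0 or len(rows) <= max_rows:
        return rows, ""  # no downsampling needed
    head = (max_rows + 1) // 2       # rows kept from the start
    tail = max_rows - head           # rows kept from the end
    kept = rows[:head] + (rows[-tail:] if tail else [])
    warning = f"Showing {len(kept)} of {len(rows)} rows"
    return kept, warning

rows = list(range(10))
kept, warning = downsample_rows(rows, max_rows=4)
print(kept)     # [0, 1, 8, 9]
print(warning)  # Showing 4 of 10 rows
```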