# Schema and Types

Tools for working with parquet schemas, type conversions, and metadata management to ensure proper data representation and compatibility between pandas and parquet formats.

## Capabilities

### Schema Management

#### SchemaHelper Class

Utility class for working with parquet schema structures and navigating complex nested schemas.

```python { .api }
class SchemaHelper:
    """
    Helper class for parquet schema navigation and analysis.

    Provides methods to understand and work with parquet schema structures,
    including nested types and logical type conversions.
    """

    def __init__(self, schema_elements):
        """
        Initialize SchemaHelper with parquet schema elements.

        Parameters:
        - schema_elements: list, parquet schema element list
        """

    def schema_element(self, name):
        """
        Get schema element by name or path.

        Parameters:
        - name: str or list, column name or path in schema

        Returns:
        SchemaElement: The schema element for the specified path
        """

    def is_required(self, name):
        """
        Check if field is required (not optional).

        Parameters:
        - name: str, column name to check

        Returns:
        bool: True if field is required, False if optional
        """

    def max_repetition_level(self, parts):
        """
        Calculate maximum repetition level for column path.

        Parameters:
        - parts: list, column path components

        Returns:
        int: Maximum repetition level
        """

    def max_definition_level(self, parts):
        """
        Calculate maximum definition level for column path.

        Parameters:
        - parts: list, column path components

        Returns:
        int: Maximum definition level
        """

    @property
    def text(self):
        """Human-readable schema representation."""
```
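
How the two level calculations relate to a column path can be sketched in a few lines. This is an illustrative reimplementation of the Dremel-style rule, not fastparquet's actual code: every non-required (`OPTIONAL` or `REPEATED`) element along the path adds a definition level, while only `REPEATED` elements add a repetition level. The `max_levels` helper and its input format are hypothetical.

```python
def max_levels(path_repetitions):
    """Given the repetition type of each element along a column path
    ('REQUIRED', 'OPTIONAL' or 'REPEATED'), return the pair
    (max_definition_level, max_repetition_level)."""
    definition = 0
    repetition = 0
    for rep in path_repetitions:
        if rep != 'REQUIRED':   # OPTIONAL and REPEATED both add a definition level
            definition += 1
        if rep == 'REPEATED':   # only REPEATED fields add a repetition level
            repetition += 1
    return definition, repetition

# A nested column a (OPTIONAL) -> b (REPEATED) -> c (REQUIRED):
print(max_levels(['OPTIONAL', 'REPEATED', 'REQUIRED']))  # (2, 1)
```

A flat required column therefore has levels `(0, 0)`, which is why simple non-nullable columns need no level data at all.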

#### Schema Structure Functions

Functions for building and analyzing parquet schema structures.

```python { .api }
def schema_tree(schema, i=0):
    """
    Build tree structure from flat schema list.

    Parameters:
    - schema: list, flat list of schema elements
    - i: int, starting index in schema list

    Returns:
    int: Final index after processing tree
    """

def schema_to_text(root, indent=[]):
    """
    Convert schema to human-readable text representation.

    Parameters:
    - root: SchemaElement, root schema element
    - indent: list, indentation tracking for nested elements

    Returns:
    str: Human-readable schema representation
    """

def flatten(schema_helper, path=None):
    """
    Flatten nested schema into list of column paths.

    Parameters:
    - schema_helper: SchemaHelper, schema navigation helper
    - path: list, current path for recursion

    Returns:
    list: List of flattened column paths
    """

def _is_list_like(helper, name):
    """
    Check if column represents a list-like structure.

    Parameters:
    - helper: SchemaHelper, schema navigation helper
    - name: str, column name to check

    Returns:
    bool: True if column is list-like, False otherwise
    """

def _is_map_like(helper, name):
    """
    Check if column represents a map-like structure.

    Parameters:
    - helper: SchemaHelper, schema navigation helper
    - name: str, column name to check

    Returns:
    bool: True if column is map-like, False otherwise
    """
```
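
What flattening produces can be shown with a minimal, self-contained sketch: every leaf in the nested schema tree becomes one dotted column path. The dict-based tree and the `flatten_paths` helper below are hypothetical stand-ins for real `SchemaElement` objects, not fastparquet's implementation.

```python
def flatten_paths(node, path=()):
    """Yield a dotted path for each leaf of a nested schema tree."""
    children = node.get('children', {})
    if not children:                      # a leaf: emit its full path
        yield '.'.join(path)
        return
    for name, child in children.items():  # recurse into nested groups
        yield from flatten_paths(child, path + (name,))

schema = {'children': {
    'id': {},
    'user': {'children': {'name': {}, 'email': {}}},
}}
print(list(flatten_paths(schema)))  # ['id', 'user.name', 'user.email']
```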

### Type Conversion System

#### Pandas to Parquet Conversion

Convert pandas data types to appropriate parquet representations.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None):
    """
    Determine appropriate parquet type codes for pandas Series.

    Parameters:
    - data: pandas.Series, input data to analyze
    - fixed_text: int, fixed-length string size for string/bytes columns
    - object_encoding: str, encoding method for object dtype columns
    - times: str, timestamp encoding format ('int64' or 'int96')
    - is_index: bool, whether data represents DataFrame index

    Returns:
    tuple: (SchemaElement, parquet_type_code)
    """

def convert(data, se):
    """
    Convert pandas data according to schema element specification.

    Parameters:
    - data: pandas.Series, input data to convert
    - se: SchemaElement, target parquet schema element

    Returns:
    numpy.ndarray: Converted data ready for parquet encoding
    """
```

#### Object Encoding Functions

Handle encoding of Python object types to parquet-compatible formats.

```python { .api }
def infer_object_encoding(data):
    """
    Automatically infer appropriate encoding for object dtype column.

    Parameters:
    - data: pandas.Series, object dtype data to analyze

    Returns:
    str: Inferred encoding type ('utf8', 'bytes', 'json', 'bool', etc.)
    """
```
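
The kind of inference involved can be sketched in plain Python: sample the values of an object column and pick an encoding from the Python types encountered. This is illustrative only, assuming the encodings listed above; it is not fastparquet's actual `infer_object_encoding`, and the `guess_object_encoding` name is hypothetical.

```python
def guess_object_encoding(values):
    """Pick an object encoding from the Python types present in the data."""
    types = {type(v) for v in values if v is not None}
    if types <= {str}:
        return 'utf8'
    if types <= {bytes}:
        return 'bytes'
    if types <= {bool}:
        return 'bool'
    if types <= {int}:
        return 'int'
    if types <= {float, int}:
        return 'float'
    return 'json'   # fall back to JSON for dicts, lists and mixed types

print(guess_object_encoding(['a', 'b']))        # utf8
print(guess_object_encoding([b'a', None]))      # bytes
print(guess_object_encoding([{'a': 1}, [2]]))   # json
```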

### Parquet to Pandas Conversion

#### Type Mapping and Conversion

Convert parquet data back to appropriate pandas types.

```python { .api }
def convert(data, schema_element, metadata=None):
    """
    Convert raw parquet data to appropriate pandas types.

    Parameters:
    - data: numpy.ndarray, raw parquet data
    - schema_element: SchemaElement, parquet schema information
    - metadata: dict, additional pandas metadata for conversion

    Returns:
    numpy.ndarray: Converted data suitable for pandas
    """

# Type mapping constants
simple = {
    # Mapping from parquet primitive types to numpy dtypes
    'INT32': 'int32',
    'INT64': 'int64',
    'FLOAT': 'float32',
    'DOUBLE': 'float64',
    'BOOLEAN': 'bool',
    'BYTE_ARRAY': 'object',
    'FIXED_LEN_BYTE_ARRAY': 'object'
}

complex = {
    # Mapping from parquet logical types to numpy dtypes
    'UTF8': 'object',
    'JSON': 'object',
    'BSON': 'object',
    'DECIMAL': 'float64',
    'TIMESTAMP_MILLIS': 'datetime64[ms]',
    'TIMESTAMP_MICROS': 'datetime64[us]',
    'TIME_MICROS': 'timedelta64[us]'
}
```
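
How the two mappings combine can be sketched as follows: the logical (converted) type, when present, takes precedence over the primitive type. The dict contents mirror the listing above; the lookup function `target_dtype` is illustrative, and `complex_` is renamed here only to avoid shadowing Python's built-in `complex`.

```python
simple = {'INT32': 'int32', 'INT64': 'int64', 'FLOAT': 'float32',
          'DOUBLE': 'float64', 'BOOLEAN': 'bool', 'BYTE_ARRAY': 'object',
          'FIXED_LEN_BYTE_ARRAY': 'object'}
complex_ = {'UTF8': 'object', 'JSON': 'object', 'BSON': 'object',
            'DECIMAL': 'float64', 'TIMESTAMP_MILLIS': 'datetime64[ms]',
            'TIMESTAMP_MICROS': 'datetime64[us]', 'TIME_MICROS': 'timedelta64[us]'}

def target_dtype(primitive, converted=None):
    """Return the numpy dtype string a column will be converted to:
    the logical type overrides the primitive type when both are set."""
    if converted is not None and converted in complex_:
        return complex_[converted]
    return simple[primitive]

print(target_dtype('INT64'))                      # int64
print(target_dtype('INT64', 'TIMESTAMP_MICROS'))  # datetime64[us]
print(target_dtype('BYTE_ARRAY', 'UTF8'))         # object
```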

### Binary Data Handling

#### BSON Support

Functions for working with Binary JSON (BSON) encoded data.

```python { .api }
def tobson(obj):
    """
    Convert Python object to BSON binary format.

    Parameters:
    - obj: Any, Python object to encode

    Returns:
    bytes: BSON-encoded binary data
    """

def unbson(data):
    """
    Convert BSON binary data back to Python object.

    Parameters:
    - data: bytes, BSON-encoded binary data

    Returns:
    Any: Decoded Python object
    """
```

### Column Metadata Generation

#### Pandas Metadata Creation

Generate pandas-compatible metadata for columns during the writing process.

```python { .api }
def get_column_metadata(column, name, object_dtype=None):
    """
    Generate pandas column metadata for parquet storage.

    Parameters:
    - column: pandas.Series, source column data
    - name: str, column name
    - object_dtype: str, specific object encoding type

    Returns:
    dict: Pandas metadata dictionary with type and encoding info
    """

def get_numpy_type(dtype):
    """
    Get numpy type string representation for pandas dtype.

    Parameters:
    - dtype: pandas.dtype, input pandas data type

    Returns:
    str: String representation of equivalent numpy type
    """
```

### Type Inference and Validation

#### Automatic Type Detection

Utilities for inferring and validating data types during conversion.

```python { .api }
def infer_dtype(column):
    """
    Infer pandas dtype of column data.

    Parameters:
    - column: pandas.Series, data to analyze

    Returns:
    str: Inferred pandas dtype string
    """

def groupby_types(iterable):
    """
    Group objects by their Python type.

    Parameters:
    - iterable: Iterable, collection of objects to group

    Returns:
    dict: Mapping from type to list of objects of that type
    """
```
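
A pure-Python sketch of the grouping behaviour described for `groupby_types`, assuming the type-to-objects mapping above (this is an illustration, not fastparquet's implementation):

```python
from collections import defaultdict

def groupby_types(iterable):
    """Group the objects in iterable by their exact Python type."""
    groups = defaultdict(list)
    for obj in iterable:
        groups[type(obj)].append(obj)
    return dict(groups)

grouped = groupby_types([1, 'a', 2.5, 'b', 3])
print({t.__name__: v for t, v in grouped.items()})
# {'int': [1, 3], 'str': ['a', 'b'], 'float': [2.5]}
```

A single-entry result (all values of one type) is the signal that a column can be written with one consistent object encoding.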

## Usage Examples

### Working with Schemas

```python
from fastparquet import ParquetFile

# Read schema information
pf = ParquetFile('data.parquet')
schema = pf.schema

# Print schema structure
print(schema.text)  # Human-readable schema representation

# Access individual schema elements
for column_name in pf.columns:
    element = schema.schema_element([column_name])
    print(f"{column_name}: {element.type}")
```

### Type Conversion Examples

```python
import pandas as pd
from fastparquet.writer import find_type, convert

# Analyze pandas data for parquet conversion
data = pd.Series([1, 2, 3, 4, 5], name='numbers')
schema_element, type_code = find_type(data)

print(f"Parquet type: {schema_element.type}")
print(f"Converted type: {schema_element.converted_type}")

# Convert data for writing
converted_data = convert(data, schema_element)
print(f"Converted shape: {converted_data.shape}")
```

### Object Encoding Examples

```python
import pandas as pd
from fastparquet.writer import find_type

# String data (UTF-8 encoding)
text_data = pd.Series(['hello', 'world', 'test'])
se, _ = find_type(text_data, object_encoding='utf8')

# JSON data
json_data = pd.Series([{'a': 1}, {'b': 2}, {'c': 3}])
se, _ = find_type(json_data, object_encoding='json')

# Binary data
binary_data = pd.Series([b'data1', b'data2', b'data3'])
se, _ = find_type(binary_data, object_encoding='bytes')

# Automatic inference of the encoding from the values
inferred_data = pd.Series(['text1', 'text2', 'text3'])
se, _ = find_type(inferred_data, object_encoding='infer')
```

### Custom Type Handling

```python
import pandas as pd
from decimal import Decimal
from fastparquet.writer import find_type

# Fixed-length strings
fixed_text_data = pd.Series(['ABC', 'DEF', 'GHI'])
se, _ = find_type(fixed_text_data, fixed_text=3)

# Decimal data (object_encoding applies to object dtype columns)
decimal_data = pd.Series([Decimal('1.234'), Decimal('5.678'), Decimal('9.012')])
se, _ = find_type(decimal_data, object_encoding='decimal')

# Timestamps with different encodings
timestamp_data = pd.Series(pd.date_range('2023-01-01', periods=3))

# 64-bit integer timestamps (default)
se_int64, _ = find_type(timestamp_data, times='int64')

# 96-bit timestamps (legacy compatibility)
se_int96, _ = find_type(timestamp_data, times='int96')
```

### Metadata Generation

```python
import pandas as pd
from fastparquet.util import get_column_metadata

# Generate metadata for different column types
df = pd.DataFrame({
    'int_col': [1, 2, 3],
    'float_col': [1.1, 2.2, 3.3],
    'str_col': ['a', 'b', 'c'],
    'cat_col': pd.Categorical(['X', 'Y', 'Z']),
    'date_col': pd.date_range('2023-01-01', periods=3)
})

for col_name, col_data in df.items():
    metadata = get_column_metadata(col_data, col_name)
    print(f"{col_name}: {metadata['pandas_type']} -> {metadata['numpy_type']}")
```

## Type Definitions

```python { .api }
# Object encoding options
ObjectEncoding = Literal[
    'infer',    # Automatically detect encoding
    'utf8',     # UTF-8 text encoding
    'bytes',    # Raw binary data
    'json',     # JSON serialization
    'bson',     # Binary JSON encoding
    'bool',     # Boolean values
    'int',      # Integer values (64-bit)
    'int32',    # Integer values (32-bit)
    'float',    # Floating point values
    'decimal'   # Decimal number handling
]

# Timestamp encoding formats
TimeEncoding = Literal[
    'int64',  # 64-bit integer timestamps (default)
    'int96'   # 96-bit format (legacy compatibility)
]

# Parquet primitive types
ParquetPrimitiveType = Literal[
    'BOOLEAN',
    'INT32',
    'INT64',
    'FLOAT',
    'DOUBLE',
    'BYTE_ARRAY',
    'FIXED_LEN_BYTE_ARRAY',
    'INT96'
]

# Parquet logical types
ParquetLogicalType = Literal[
    'UTF8',
    'JSON',
    'BSON',
    'DECIMAL',
    'TIMESTAMP_MILLIS',
    'TIMESTAMP_MICROS',
    'TIME_MICROS',
    'INT_8',
    'INT_16',
    'INT_32',
    'INT_64',
    'UINT_8',
    'UINT_16',
    'UINT_32',
    'UINT_64'
]

# Schema element structure
SchemaElement = Any  # parquet_thrift.SchemaElement

# Column metadata structure
ColumnMetadata = Dict[str, Union[str, Dict[str, Any]]]
```