# Reading Parquet Files

Comprehensive functionality for reading parquet files into pandas DataFrames with high performance and flexible data access patterns.

## Capabilities

### ParquetFile Class

The main class for reading parquet files, providing access to metadata, schema information, and efficient data reading methods.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None):
        """
        Initialize ParquetFile for reading parquet data.

        Parameters:
        - fn: str, path/URL or list of paths to parquet file(s)
        - verify: bool, test file start/end byte markers
        - open_with: function, custom file opener with signature func(path, mode)
        - root: str, dataset root directory for partitioned data
        - fs: fsspec filesystem, alternative to open_with
        - pandas_nulls: bool, use pandas nullable types for int/bool with nulls
        - dtypes: dict, override column dtypes
        """
```
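
The `fs` and `open_with` parameters follow the usual fsspec conventions. Below is a minimal sketch of opening a remote file; the S3 path and the s3fs backend are illustrative assumptions, not part of this API.

```python
import fsspec
from fastparquet import ParquetFile

# Hypothetical remote location; requires the matching fsspec backend (e.g. s3fs).
fs = fsspec.filesystem("s3", anon=True)
pf_remote = ParquetFile("s3://example-bucket/data.parquet", fs=fs)

# Same idea via open_with, a callable with signature func(path, mode).
pf_remote2 = ParquetFile("s3://example-bucket/data.parquet",
                         open_with=lambda path, mode: fs.open(path, mode))
```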

### Data Reading Methods

#### Complete Data Reading

Read entire parquet file or filtered subset into a pandas DataFrame.

```python { .api }
def to_pandas(self, columns=None, categories=None, filters=[],
              index=None, row_filter=False, dtypes=None):
    """
    Read parquet data into pandas DataFrame.

    Parameters:
    - columns: list, column names to load (None for all)
    - categories: list or dict, columns to treat as categorical
    - filters: list, row filtering conditions
    - index: str or list, column(s) to use as DataFrame index
    - row_filter: bool or array, enable row-wise filtering
    - dtypes: dict, override column data types

    Returns:
    pandas.DataFrame: The loaded data
    """
```
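
A short sketch of combining the column-selection parameters above; the column names (`id`, `city`, `value`) are placeholders for illustration.

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
df = pf.to_pandas(
    columns=["id", "city", "value"],   # only these columns are read from disk
    categories=["city"],               # decode 'city' as a pandas Categorical
    index="id",                        # use 'id' as the DataFrame index
)
```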

#### Partial Data Reading

Get a limited number of rows from the beginning of the dataset.

```python { .api }
def head(self, nrows, **kwargs):
    """
    Get the first nrows of data.

    Parameters:
    - nrows: int, number of rows to return
    - **kwargs: additional arguments passed to to_pandas()

    Returns:
    pandas.DataFrame: First nrows of data
    """
```

#### Row Group Iteration

Iterate through the dataset one row group at a time for memory-efficient processing.

```python { .api }
def iter_row_groups(self, filters=None, **kwargs):
    """
    Iterate dataset by row groups.

    Parameters:
    - filters: list, optional filters to skip row groups
    - **kwargs: additional arguments passed to to_pandas()

    Yields:
    pandas.DataFrame: One DataFrame per row group
    """
```
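
Combining iteration with `filters` skips row groups whose statistics or partition values rule them out before any data is read. A minimal sketch, using placeholder column names:

```python
# pf is an open ParquetFile; 'year' and 'value' are placeholder columns.
for chunk in pf.iter_row_groups(filters=[("year", "==", 2023)],
                                columns=["year", "value"]):
    print(len(chunk))
```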

### Data Access and Slicing

#### Row Group Selection

Access specific row groups using indexing and slicing operations.

```python { .api }
def __getitem__(self, item):
    """
    Select row groups using integer indexing or slicing.

    Parameters:
    - item: int, slice, or list, row group selector

    Returns:
    ParquetFile: New ParquetFile with selected row groups
    """

def __len__(self):
    """
    Return number of row groups.

    Returns:
    int: Number of row groups in the file
    """
```
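
Because selection returns a new ParquetFile rather than a DataFrame, slicing composes with the reading methods above. A brief sketch:

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
n_groups = len(pf)          # number of row groups
first = pf[0].to_pandas()   # read only the first row group
every_other = pf[::2]       # new ParquetFile holding every second row group
df = every_other.to_pandas()
```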

#### Row Count

Get total number of rows with optional filtering.

```python { .api }
def count(self, filters=None, row_filter=False):
    """
    Total number of rows in the dataset.

    Parameters:
    - filters: list, optional row filtering conditions
    - row_filter: bool, enable row-wise filtering

    Returns:
    int: Total number of rows
    """
```
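
A small sketch contrasting the plain count with a filtered count, using the parameters documented above; the `age` column is a placeholder.

```python
# pf is an open ParquetFile
total = pf.count()                                                # all rows
adults = pf.count(filters=[("age", ">=", 18)], row_filter=True)  # rows passing the filter
```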

### Metadata and Schema Access

#### Properties

Access file metadata, schema, and structural information.

```python { .api }
@property
def columns(self):
    """Column names available in the dataset."""

@property
def dtypes(self):
    """Expected output types for each column."""

@property
def schema(self):
    """SchemaHelper object representing column structure."""

@property
def statistics(self):
    """Per-column statistics (min, max, distinct_count, null_count)."""

@property
def key_value_metadata(self):
    """Additional metadata key-value pairs."""

@property
def pandas_metadata(self):
    """Pandas-specific metadata, if available."""

@property
def info(self):
    """Dataset summary information."""

@property
def file_scheme(self):
    """File organization scheme ('simple', 'hive', 'mixed', 'empty')."""
```
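
A brief sketch of inspecting a file before reading it, using only the properties listed above; the exact nesting of the statistics mapping may vary between fastparquet versions.

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
print(pf.info)                 # summary: rows, row groups, columns, ...
print(pf.dtypes)               # expected pandas dtypes per column
print(pf.schema)               # SchemaHelper describing the parquet schema
print(pf.key_value_metadata)   # key/value metadata written by the producer
print(pf.statistics)           # per-column statistics gathered from row-group metadata
```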

### Low-Level Reading Functions

#### Row Group Reading

Direct row group reading functions for advanced use cases and performance optimization.

```python { .api }
def read_row_group(file, rg, columns, categories, schema=None,
                   cats=None, index=None, assign=None,
                   scheme='hive', pandas_nulls=True, dtypes=None):
    """
    Read single row group from parquet file.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical column specifications
    - schema: SchemaHelper, parquet schema object
    - cats: dict, partition categories
    - index: str or list, index column specifications
    - assign: dict, values to assign for partitioned columns
    - scheme: str, partitioning scheme
    - pandas_nulls: bool, use pandas nullable types
    - dtypes: dict, column data type overrides

    Returns:
    pandas.DataFrame: Row group data
    """

def read_row_group_arrays(file, rg, columns, categories, schema=None,
                          cats=None, assign=None, scheme='hive'):
    """
    Read row group into numpy arrays.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical specifications
    - schema: SchemaHelper, parquet schema
    - cats: dict, partition categories
    - assign: dict, partition value assignments
    - scheme: str, partitioning scheme

    Returns:
    dict: Column name to numpy array mapping
    """
```
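
The metadata objects these functions consume all come from a ParquetFile instance. The sketch below only shows where each documented parameter would come from; most callers should prefer the high-level methods, which invoke the row-group readers internally, and the internal signatures can differ between fastparquet versions.

```python
from fastparquet import ParquetFile

pf = ParquetFile("data.parquet")
rg = pf.row_groups[0]      # -> the `rg` parameter (RowGroup metadata)
schema = pf.schema         # -> the `schema` parameter (SchemaHelper)
cols = pf.columns          # -> the `columns` parameter

# In practice, reading a single row group is usually done through the
# high-level API, which calls read_row_group on your behalf:
df_rg = pf[0].to_pandas()
```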

#### Column Reading

Functions for reading individual columns and their data pages.

```python { .api }
def read_col(column, schema_helper, infile, use_cat=True,
             assign=None, row_filter=None):
    """
    Read single column from parquet file.

    Parameters:
    - column: ColumnChunk, column metadata object
    - schema_helper: SchemaHelper, schema navigation helper
    - infile: file-like object, open parquet file
    - use_cat: bool, use categorical optimization
    - assign: any, value to assign for partition columns
    - row_filter: array, boolean row selection mask

    Returns:
    numpy.ndarray: Column data
    """

def read_data_page(infile, page, compressed_size, uncompressed_size,
                   column, schema, use_cat=True, selfmade=True,
                   assign=None, decoders=None, row_filter=None):
    """
    Read and decode single data page.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size in bytes
    - uncompressed_size: int, uncompressed page size in bytes
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, use categorical optimization
    - selfmade: bool, file created by fastparquet
    - assign: any, partition column assignment value
    - decoders: dict, custom decoder functions
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_data_page_v2(infile, page, compressed_size, uncompressed_size,
                      column, schema, use_cat=True, selfmade=True,
                      assign=None, decoders=None, row_filter=None):
    """
    Read and decode data page in v2 format.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size
    - uncompressed_size: int, uncompressed page size
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, categorical optimization
    - selfmade: bool, fastparquet-created file
    - assign: any, partition value assignment
    - decoders: dict, custom decoders
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_dictionary_page(infile, schema_helper):
    """
    Read dictionary page for categorical columns.

    Parameters:
    - infile: file-like object, open parquet file
    - schema_helper: SchemaHelper, schema navigation helper

    Returns:
    numpy.ndarray: Dictionary values
    """
```

### Filtering and Statistics

#### Filter Functions

Utility functions for working with parquet file filters and statistics.

```python { .api }
def filter_row_groups(pf, filters, as_idx=False):
    """
    Select row groups using filters.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, filtering conditions
    - as_idx: bool, return indices instead of row groups

    Returns:
    list: Filtered row groups or their indices
    """

def statistics(obj):
    """
    Return per-column statistics.

    Parameters:
    - obj: ParquetFile, ColumnChunk, or RowGroup

    Returns:
    dict: Mapping of statistic name (min, max, distinct_count, null_count) to per-column values
    """

def sorted_partitioned_columns(pf, filters=None):
    """
    Find columns that are sorted partition-by-partition.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, optional filtering conditions

    Returns:
    dict: Column names to min/max value ranges
    """
```
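
A short sketch of using these helpers to decide which data to read, assuming they are importable from `fastparquet.api` (the module location may differ between versions):

```python
from fastparquet import ParquetFile
from fastparquet.api import statistics, sorted_partitioned_columns  # module path assumed

pf = ParquetFile("data.parquet")

# Per-statistic, per-column values gathered from the row-group metadata.
stats = statistics(pf)
print(stats["min"], stats["max"])

# Columns whose min/max ranges are ordered from one row group to the next,
# which makes range filters on them cheap to push down.
print(sorted_partitioned_columns(pf))
```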

## Usage Examples

### Basic File Reading

```python
from fastparquet import ParquetFile

# Open parquet file
pf = ParquetFile('data.parquet')

# Read all data
df = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['col1', 'col2'])

# Check file info
print(pf.info)
print(f"Columns: {pf.columns}")
print(f"Row count: {pf.count()}")
```

### Filtering Data

```python
# Single condition filter
df_filtered = pf.to_pandas(filters=[('age', '>', 25)])

# Multiple conditions (AND)
df_filtered = pf.to_pandas(filters=[('age', '>', 25), ('score', '>=', 80)])

# Multiple condition groups (OR)
df_filtered = pf.to_pandas(filters=[
    [('category', '==', 'A'), ('value', '>', 100)],  # Group 1
    [('category', '==', 'B'), ('value', '>', 200)]   # Group 2
])
```

### Memory-Efficient Processing

```python
# Process large files in chunks
total_rows = 0
for chunk in pf.iter_row_groups():
    # Process each row group
    processed = chunk.groupby('category').sum()
    total_rows += len(chunk)

print(f"Processed {total_rows} rows")

# Get sample of large file
sample = pf.head(1000)
```

### Working with Partitioned Datasets

```python
# Read partitioned dataset
pf = ParquetFile('/path/to/partitioned/dataset/')

# Access partition information
print(f"Partitions: {list(pf.cats.keys())}")
print(f"File scheme: {pf.file_scheme}")

# Filter by partition values
df = pf.to_pandas(filters=[('year', '==', 2023), ('month', 'in', [1, 2, 3])])
```

## Type Definitions

```python { .api }
from typing import Any, Callable, List, Literal, Tuple, Union

# Filter specification
FilterCondition = Tuple[str, str, Any]              # (column, operator, value)
FilterGroup = List[FilterCondition]                 # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOp = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# File opening function signature
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like object

# Filesystem interface
FileSystem = Any  # fsspec.AbstractFileSystem compatible
```