# Reading Parquet Files

Comprehensive functionality for reading parquet files into pandas DataFrames with high performance and flexible data access patterns.

## Capabilities

### ParquetFile Class

The main class for reading parquet files, providing access to metadata, schema information, and efficient data reading methods.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None):
        """
        Initialize ParquetFile for reading parquet data.

        Parameters:
        - fn: str, path/URL or list of paths to parquet file(s)
        - verify: bool, test file start/end byte markers
        - open_with: function, custom file opener with signature func(path, mode)
        - root: str, dataset root directory for partitioned data
        - fs: fsspec filesystem, alternative to open_with
        - pandas_nulls: bool, use pandas nullable types for int/bool with nulls
        - dtypes: dict, override column dtypes
        """
```
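
The `fs` and `open_with` parameters follow the usual fsspec conventions. Below is a minimal sketch of opening a remote file; the S3 path and the s3fs backend are illustrative assumptions, not part of this API.

```python
import fsspec
from fastparquet import ParquetFile

# Hypothetical remote location; requires the matching fsspec backend (e.g. s3fs).
fs = fsspec.filesystem("s3", anon=True)
pf_remote = ParquetFile("s3://example-bucket/data.parquet", fs=fs)

# Same idea via open_with, a callable with signature func(path, mode).
pf_remote2 = ParquetFile("s3://example-bucket/data.parquet",
                         open_with=lambda path, mode: fs.open(path, mode))
```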

### Data Reading Methods

#### Complete Data Reading

Read entire parquet file or filtered subset into a pandas DataFrame.

```python { .api }
def to_pandas(self, columns=None, categories=None, filters=[],
              index=None, row_filter=False, dtypes=None):
    """
    Read parquet data into pandas DataFrame.

    Parameters:
    - columns: list, column names to load (None for all)
    - categories: list or dict, columns to treat as categorical
    - filters: list, row filtering conditions
    - index: str or list, column(s) to use as DataFrame index
    - row_filter: bool or array, enable row-wise filtering
    - dtypes: dict, override column data types

    Returns:
    pandas.DataFrame: The loaded data
    """
```
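
A short sketch of combining the column-selection parameters above; the column names (`id`, `city`, `value`) are placeholders for illustration.

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
df = pf.to_pandas(
    columns=["id", "city", "value"],   # only these columns are read from disk
    categories=["city"],               # decode 'city' as a pandas Categorical
    index="id",                        # use 'id' as the DataFrame index
)
```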

#### Partial Data Reading

Get a limited number of rows from the beginning of the dataset.

```python { .api }
def head(self, nrows, **kwargs):
    """
    Get the first nrows of data.

    Parameters:
    - nrows: int, number of rows to return
    - **kwargs: additional arguments passed to to_pandas()

    Returns:
    pandas.DataFrame: First nrows of data
    """
```

#### Row Group Iteration

Iterate through the dataset one row group at a time for memory-efficient processing.

```python { .api }
def iter_row_groups(self, filters=None, **kwargs):
    """
    Iterate dataset by row groups.

    Parameters:
    - filters: list, optional filters to skip row groups
    - **kwargs: additional arguments passed to to_pandas()

    Yields:
    pandas.DataFrame: One DataFrame per row group
    """
```
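
Combining iteration with `filters` skips row groups whose statistics or partition values rule them out before any data is read. A minimal sketch, using placeholder column names:

```python
# pf is an open ParquetFile; 'year' and 'value' are placeholder columns.
for chunk in pf.iter_row_groups(filters=[("year", "==", 2023)],
                                columns=["year", "value"]):
    print(len(chunk))
```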

### Data Access and Slicing

#### Row Group Selection

Access specific row groups using indexing and slicing operations.

```python { .api }
def __getitem__(self, item):
    """
    Select row groups using integer indexing or slicing.

    Parameters:
    - item: int, slice, or list, row group selector

    Returns:
    ParquetFile: New ParquetFile with selected row groups
    """

def __len__(self):
    """
    Return number of row groups.

    Returns:
    int: Number of row groups in the file
    """
```
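
Because selection returns a new ParquetFile rather than a DataFrame, slicing composes with the reading methods above. A brief sketch:

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
n_groups = len(pf)          # number of row groups
first = pf[0].to_pandas()   # read only the first row group
every_other = pf[::2]       # new ParquetFile holding every second row group
df = every_other.to_pandas()
```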

#### Row Count

Get total number of rows with optional filtering.

```python { .api }
def count(self, filters=None, row_filter=False):
    """
    Total number of rows in the dataset.

    Parameters:
    - filters: list, optional row filtering conditions
    - row_filter: bool, enable row-wise filtering

    Returns:
    int: Total number of rows
    """
```
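
A small sketch contrasting the plain count with a filtered count, using the parameters documented above; the `age` column is a placeholder.

```python
# pf is an open ParquetFile
total = pf.count()                                                # all rows
adults = pf.count(filters=[("age", ">=", 18)], row_filter=True)  # rows passing the filter
```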

### Metadata and Schema Access

#### Properties

Access file metadata, schema, and structural information.

```python { .api }
@property
def columns(self):
    """Column names available in the dataset."""

@property
def dtypes(self):
    """Expected output types for each column."""

@property
def schema(self):
    """SchemaHelper object representing column structure."""

@property
def statistics(self):
    """Per-column statistics (min, max, distinct_count, null_count)."""

@property
def key_value_metadata(self):
    """Additional metadata key-value pairs."""

@property
def pandas_metadata(self):
    """Pandas-specific metadata, if available."""

@property
def info(self):
    """Dataset summary information."""

@property
def file_scheme(self):
    """File organization scheme ('simple', 'hive', 'mixed', 'empty')."""
```
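
A brief sketch of inspecting a file before reading it, using only the properties listed above; the exact nesting of the statistics mapping may vary between fastparquet versions.

```python
# pf is an open ParquetFile, e.g. ParquetFile("data.parquet")
print(pf.info)                 # summary: rows, row groups, columns, ...
print(pf.dtypes)               # expected pandas dtypes per column
print(pf.schema)               # SchemaHelper describing the parquet schema
print(pf.key_value_metadata)   # key/value metadata written by the producer
print(pf.statistics)           # per-column statistics gathered from row-group metadata
```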

### Low-Level Reading Functions

#### Row Group Reading

Direct row group reading functions for advanced use cases and performance optimization.

```python { .api }
def read_row_group(file, rg, columns, categories, schema=None,
                   cats=None, index=None, assign=None,
                   scheme='hive', pandas_nulls=True, dtypes=None):
    """
    Read single row group from parquet file.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical column specifications
    - schema: SchemaHelper, parquet schema object
    - cats: dict, partition categories
    - index: str or list, index column specifications
    - assign: dict, values to assign for partitioned columns
    - scheme: str, partitioning scheme
    - pandas_nulls: bool, use pandas nullable types
    - dtypes: dict, column data type overrides

    Returns:
    pandas.DataFrame: Row group data
    """

def read_row_group_arrays(file, rg, columns, categories, schema=None,
                          cats=None, assign=None, scheme='hive'):
    """
    Read row group into numpy arrays.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical specifications
    - schema: SchemaHelper, parquet schema
    - cats: dict, partition categories
    - assign: dict, partition value assignments
    - scheme: str, partitioning scheme

    Returns:
    dict: Column name to numpy array mapping
    """
```
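
The metadata objects these functions consume all come from a ParquetFile instance. The sketch below only shows where each documented parameter would come from; most callers should prefer the high-level methods, which invoke the row-group readers internally, and the internal signatures can differ between fastparquet versions.

```python
from fastparquet import ParquetFile

pf = ParquetFile("data.parquet")
rg = pf.row_groups[0]      # -> the `rg` parameter (RowGroup metadata)
schema = pf.schema         # -> the `schema` parameter (SchemaHelper)
cols = pf.columns          # -> the `columns` parameter

# In practice, reading a single row group is usually done through the
# high-level API, which calls read_row_group on your behalf:
df_rg = pf[0].to_pandas()
```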

#### Column Reading

Functions for reading individual columns and their data pages.

```python { .api }
def read_col(column, schema_helper, infile, use_cat=True,
             assign=None, row_filter=None):
    """
    Read single column from parquet file.

    Parameters:
    - column: ColumnChunk, column metadata object
    - schema_helper: SchemaHelper, schema navigation helper
    - infile: file-like object, open parquet file
    - use_cat: bool, use categorical optimization
    - assign: any, value to assign for partition columns
    - row_filter: array, boolean row selection mask

    Returns:
    numpy.ndarray: Column data
    """

def read_data_page(infile, page, compressed_size, uncompressed_size,
                   column, schema, use_cat=True, selfmade=True,
                   assign=None, decoders=None, row_filter=None):
    """
    Read and decode single data page.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size in bytes
    - uncompressed_size: int, uncompressed page size in bytes
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, use categorical optimization
    - selfmade: bool, file created by fastparquet
    - assign: any, partition column assignment value
    - decoders: dict, custom decoder functions
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_data_page_v2(infile, page, compressed_size, uncompressed_size,
                      column, schema, use_cat=True, selfmade=True,
                      assign=None, decoders=None, row_filter=None):
    """
    Read and decode data page in v2 format.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size
    - uncompressed_size: int, uncompressed page size
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, categorical optimization
    - selfmade: bool, fastparquet-created file
    - assign: any, partition value assignment
    - decoders: dict, custom decoders
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_dictionary_page(infile, schema_helper):
    """
    Read dictionary page for categorical columns.

    Parameters:
    - infile: file-like object, open parquet file
    - schema_helper: SchemaHelper, schema navigation helper

    Returns:
    numpy.ndarray: Dictionary values
    """
```

### Filtering and Statistics

#### Filter Functions

Utility functions for working with parquet file filters and statistics.

```python { .api }
def filter_row_groups(pf, filters, as_idx=False):
    """
    Select row groups using filters.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, filtering conditions
    - as_idx: bool, return indices instead of row groups

    Returns:
    list: Filtered row groups or their indices
    """

def statistics(obj):
    """
    Return per-column statistics.

    Parameters:
    - obj: ParquetFile, ColumnChunk, or RowGroup

    Returns:
    dict: Mapping of statistic name (min, max, distinct_count, null_count) to per-column values
    """

def sorted_partitioned_columns(pf, filters=None):
    """
    Find columns that are sorted partition-by-partition.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, optional filtering conditions

    Returns:
    dict: Column names to min/max value ranges
    """
```
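
A short sketch of using these helpers to decide which data to read, assuming they are importable from `fastparquet.api` (the module location may differ between versions):

```python
from fastparquet import ParquetFile
from fastparquet.api import statistics, sorted_partitioned_columns  # module path assumed

pf = ParquetFile("data.parquet")

# Per-statistic, per-column values gathered from the row-group metadata.
stats = statistics(pf)
print(stats["min"], stats["max"])

# Columns whose min/max ranges are ordered from one row group to the next,
# which makes range filters on them cheap to push down.
print(sorted_partitioned_columns(pf))
```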

## Usage Examples

### Basic File Reading

```python
from fastparquet import ParquetFile

# Open parquet file
pf = ParquetFile('data.parquet')

# Read all data
df = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['col1', 'col2'])

# Check file info
print(pf.info)
print(f"Columns: {pf.columns}")
print(f"Row count: {pf.count()}")
```

### Filtering Data

```python
# Single condition filter
df_filtered = pf.to_pandas(filters=[('age', '>', 25)])

# Multiple conditions (AND)
df_filtered = pf.to_pandas(filters=[('age', '>', 25), ('score', '>=', 80)])

# Multiple condition groups (OR)
df_filtered = pf.to_pandas(filters=[
    [('category', '==', 'A'), ('value', '>', 100)],  # Group 1
    [('category', '==', 'B'), ('value', '>', 200)]   # Group 2
])
```

### Memory-Efficient Processing

```python
# Process large files in chunks
total_rows = 0
for chunk in pf.iter_row_groups():
    # Process each row group
    processed = chunk.groupby('category').sum()
    total_rows += len(chunk)

print(f"Processed {total_rows} rows")

# Get sample of large file
sample = pf.head(1000)
```

### Working with Partitioned Datasets

```python
# Read partitioned dataset
pf = ParquetFile('/path/to/partitioned/dataset/')

# Access partition information
print(f"Partitions: {list(pf.cats.keys())}")
print(f"File scheme: {pf.file_scheme}")

# Filter by partition values
df = pf.to_pandas(filters=[('year', '==', 2023), ('month', 'in', [1, 2, 3])])
```

## Type Definitions

```python { .api }
from typing import Any, Callable, List, Literal, Tuple, Union

# Filter specification
FilterCondition = Tuple[str, str, Any]              # (column, operator, value)
FilterGroup = List[FilterCondition]                 # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOp = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# File opening function signature
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like object

# Filesystem interface
FileSystem = Any  # fsspec.AbstractFileSystem compatible
```