# fastparquet

A high-performance Python implementation of the Apache Parquet columnar storage format, designed for seamless integration with pandas and the Python data science ecosystem. It provides fast reading and writing of parquet files with excellent compression and query performance.

## Package Information

- **Package Name**: fastparquet
- **Language**: Python
- **Installation**: `pip install fastparquet` or `conda install -c conda-forge fastparquet`
- **Python Requirements**: >=3.9

## Core Imports

```python
import fastparquet
```

For reading parquet files:

```python
from fastparquet import ParquetFile
```

For writing parquet files:

```python
from fastparquet import write, update_file_custom_metadata
```

## Basic Usage

```python
import pandas as pd
from fastparquet import ParquetFile, write

# Writing parquet files
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A']
})

# Write to parquet file
write('data.parquet', df)

# Write with compression
write('data_compressed.parquet', df, compression='GZIP')

# Reading parquet files
pf = ParquetFile('data.parquet')

# Read entire file to pandas DataFrame
df_read = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['id', 'value'])

# Read with filters; by default filters only exclude whole row groups
# based on their statistics, row_filter=True applies them to individual rows
df_filtered = pf.to_pandas(
    filters=[('category', '==', 'A'), ('value', '>', 1500)],
    row_filter=True
)
```

## Architecture

fastparquet is built around several core components:

- **ParquetFile**: Main class for reading parquet files, handling metadata, schema, and data access
- **Writer Functions**: High-level functions for writing pandas DataFrames to parquet format with various options
- **Schema System**: Tools for handling parquet schema definitions and type conversions
- **Compression Support**: Built-in support for multiple compression algorithms (GZIP, Snappy, Brotli, LZ4, Zstandard)
- **Partitioning**: Support for hive-style and drill-style partitioned datasets

The library emphasizes performance and compatibility with the broader Python data ecosystem, particularly pandas, while providing comprehensive support for parquet format features.
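
The sketch below ties two of these components together: hive-style partitioning plus per-column compression. It is illustrative only; the `sales_dataset` path and column names are made up, and the `_default` key follows fastparquet's documented dict form for per-column compression.

```python
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'sales': [10.0, 12.5, 9.2, 14.1],
    'note': ['a', 'b', 'c', 'd'],
})

# Hive-style partitioning: one subdirectory per value of 'region',
# with a _metadata summary file written at the dataset root
write('sales_dataset', df, file_scheme='hive', partition_on=['region'],
      compression={'sales': 'GZIP', '_default': 'SNAPPY'})

# The dataset root can then be opened like a single logical file
pf = ParquetFile('sales_dataset')
print(pf.to_pandas())
```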

## Capabilities

### Reading Parquet Files

Core functionality for reading parquet files into pandas DataFrames, with support for selective column reading, filtering, and efficient memory usage through row-group iteration.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None): ...
    def to_pandas(self, columns=None, categories=None, filters=[],
                  index=None, row_filter=False, dtypes=None): ...
    def head(self, nrows, **kwargs): ...
    def iter_row_groups(self, filters=None, **kwargs): ...
    def count(self, filters=None, row_filter=False): ...
    def __getitem__(self, item): ...  # Row group selection
    def __len__(self): ...  # Number of row groups
    def read_row_group_file(self, rg, columns, categories, index=None,
                            assign=None, partition_meta=None, row_filter=False): ...
    def write_row_groups(self, data, row_group_offsets=None, **kwargs): ...
    def remove_row_groups(self, rgs, **kwargs): ...
    def check_categories(self, cats): ...
    def pre_allocate(self, size, columns, categories, index, dtypes=None): ...
```
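
A minimal sketch of memory-bounded reading with the row-group API above, reusing the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

print(len(pf))      # number of row groups
print(pf.count())   # total number of rows
print(pf.head(5))   # first 5 rows, reading as little data as possible

# Process one row group at a time to keep memory use bounded
total = 0
for chunk in pf.iter_row_groups(columns=['value']):
    total += chunk['value'].sum()

# Indexing selects a subset of row groups and returns a new ParquetFile
first_rg = pf[0].to_pandas()
```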

[Reading Parquet Files](./reading.md)

### Writing Parquet Files

Comprehensive functionality for writing pandas DataFrames to parquet format with options for compression, partitioning, encoding, and metadata management.

```python { .api }
def write(filename, data, row_group_offsets=None, compression=None,
          file_scheme='simple', has_nulls=None, write_index=None,
          partition_on=[], append=False, object_encoding=None,
          fixed_text=None, times='int64', custom_metadata=None,
          stats="auto", open_with=None, mkdirs=None): ...

def update_file_custom_metadata(fn, custom_metadata, open_with=None): ...
```
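
A short sketch of the write options above: user metadata in the footer, in-place metadata updates, and appending row groups to a hive data-set. The paths (`events.parquet`, `events_ds`) are hypothetical.

```python
import pandas as pd
from fastparquet import write, update_file_custom_metadata

df = pd.DataFrame({'id': range(5), 'value': range(5, 10)})

# Single-file layout with user key-value metadata stored in the footer
write('events.parquet', df, compression='ZSTD',
      custom_metadata={'source': 'example'})

# Rewrite only the key-value metadata, leaving the data pages untouched
update_file_custom_metadata('events.parquet', {'source': 'example-v2'})

# Multi-file (hive) layout; append=True adds new row groups to the data-set
write('events_ds', df, file_scheme='hive')
write('events_ds', df, file_scheme='hive', append=True)
```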

[Writing Parquet Files](./writing.md)

### Schema and Types

Tools for working with parquet schemas, type conversions, and metadata management to ensure proper data representation and compatibility.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None): ...
def convert(data, se): ...

class SchemaHelper:
    def __init__(self, schema_elements): ...
    def schema_element(self, name): ...
    def is_required(self, name): ...
    def max_repetition_level(self, parts): ...
    def max_definition_level(self, parts): ...
```
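
In practice the parsed schema is usually reached through a `ParquetFile`, whose `schema` attribute is a `SchemaHelper`. A small sketch of schema inspection, again assuming the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

print(pf.columns)   # column names from the parquet schema
print(pf.dtypes)    # dtypes the data will be loaded as
print(pf.schema)    # SchemaHelper wrapping the schema elements

# Per-column schema details via SchemaHelper
se = pf.schema.schema_element('value')
print(se.type, pf.schema.is_required('value'))
```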

[Schema and Types](./schema-types.md)

### Dataset Management

Advanced features for working with partitioned datasets, including reading from and writing to multi-file parquet collections with directory-based partitioning.

```python { .api }
def merge(file_list, verify_schema=True, open_with=None, root=False): ...

def metadata_from_many(file_list, verify_schema=False, open_with=None,
                       root=False, fs=None): ...
```
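
A sketch of combining independent files into one logical data-set with `merge` (imported here from `fastparquet.writer`; the `parts/` layout is made up for illustration):

```python
import glob
import os
import pandas as pd
from fastparquet import write
from fastparquet.writer import merge

os.makedirs('parts', exist_ok=True)

# Write several single-file parquet files sharing the same schema
for i in range(3):
    df = pd.DataFrame({'id': range(i * 10, (i + 1) * 10)})
    write(f'parts/part.{i}.parquet', df)

# merge writes a _metadata summary file and returns the combined ParquetFile
pf = merge(sorted(glob.glob('parts/part.*.parquet')))
print(len(pf), pf.count())   # 3 row groups, 30 rows
```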

[Dataset Management](./dataset-management.md)

## Common Types

```python { .api }
class ParquetException(Exception):
    """Generic Exception related to unexpected data format when reading parquet file."""

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill', 'flat', 'empty', 'mixed']

# Compression algorithm options
CompressionType = Union[str, Dict[str, Union[str, Dict[str, Any]]]]
# Supported: 'UNCOMPRESSED', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4', 'ZSTD'

# Filter specification
FilterCondition = Tuple[str, str, Any]  # (column, operator, value)
FilterGroup = List[FilterCondition]  # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOperator = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# Object encoding options
ObjectEncoding = Literal['infer', 'utf8', 'json', 'bytes', 'bson']

# Timestamp encoding options
TimeEncoding = Literal['int64', 'int96']

# SchemaElement type
class SchemaElement:
    name: str
    type: Optional[str]
    type_length: Optional[int]
    repetition_type: Optional[str]  # 'REQUIRED', 'OPTIONAL', 'REPEATED'
    num_children: int
    converted_type: Optional[str]
    scale: Optional[int]
    precision: Optional[int]
    field_id: Optional[int]
    logical_type: Optional[dict]

# Thrift metadata types
class FileMetaData:
    version: int
    schema: List[SchemaElement]
    num_rows: int
    row_groups: List['RowGroup']
    key_value_metadata: Optional[List['KeyValue']]
    created_by: Optional[str]
    column_orders: Optional[List['ColumnOrder']]

class RowGroup:
    columns: List['ColumnChunk']
    total_byte_size: int
    num_rows: int
    sorting_columns: Optional[List['SortingColumn']]
    file_offset: Optional[int]
    total_compressed_size: Optional[int]
    ordinal: Optional[int]

class ColumnChunk:
    file_path: Optional[str]
    file_offset: int
    meta_data: Optional['ColumnMetaData']
    offset_index_offset: Optional[int]
    offset_index_length: Optional[int]
    column_index_offset: Optional[int]
    column_index_length: Optional[int]

# Function type signatures
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like
MkdirsFunction = Callable[[str], None]  # (path) -> None
RemoveFunction = Callable[[str], None]  # (path) -> None
FileSystem = Any  # fsspec.AbstractFileSystem compatible interface
```
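
As a usage note on the filter specification above, a flat list of tuples is combined with AND, while a list of lists expresses OR between AND groups. A minimal sketch, assuming the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

# Single AND group: category == 'A' AND value > 1500
and_filter = [('category', '==', 'A'), ('value', '>', 1500)]

# OR of AND groups:
# (category == 'A' AND value > 1500) OR (category == 'C')
or_filter = [
    [('category', '==', 'A'), ('value', '>', 1500)],
    [('category', '==', 'C')],
]

# row_filter=True applies the filter to individual rows instead of
# only excluding whole row groups via their statistics
df = pf.to_pandas(filters=or_filter, row_filter=True)
```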