# Writing Parquet Files

Comprehensive functionality for writing pandas DataFrames to parquet format with extensive options for compression, partitioning, encoding, and performance optimization.

## Capabilities

### Main Write Function

The primary function for writing pandas DataFrames to parquet files with full control over format options.

```python { .api }
def write(filename, data, row_group_offsets=None, compression=None,
          file_scheme='simple', open_with=None, mkdirs=None,
          has_nulls=True, write_index=None, partition_on=[],
          fixed_text=None, append=False, object_encoding='infer',
          times='int64', custom_metadata=None, stats="auto"):
    """
    Write pandas DataFrame to parquet file.

    Parameters:
    - filename: str, output parquet file or directory path
    - data: pandas.DataFrame, data to write
    - row_group_offsets: int or list, row group size control
    - compression: str or dict, compression algorithm(s) to use
    - file_scheme: str, file organization ('simple', 'hive', 'drill')
    - open_with: function, custom file opener
    - mkdirs: function, directory creation function
    - has_nulls: bool or list, null value handling specification
    - write_index: bool, whether to write DataFrame index as column
    - partition_on: list, columns to partition data by
    - fixed_text: dict, fixed-length string specifications
    - append: bool, append to existing dataset
    - object_encoding: str or dict, object column encoding method
    - times: str, timestamp encoding format ('int64' or 'int96')
    - custom_metadata: dict, additional metadata to store
    - stats: bool or list, statistics calculation control
    """
```
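
The quickest way to see what `write` produces is a small round trip: write a frame with the defaults, then read it back. This minimal sketch uses only the public `write` and `ParquetFile` entry points.

```python
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({"id": range(5), "name": list("abcde")})

# Defaults: file_scheme='simple', no compression, stats="auto".
write("example.parquet", df)

# Read it back to confirm the round trip.
print(ParquetFile("example.parquet").to_pandas())
```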

### Specialized Write Functions

#### Simple File Writing

Write all data to a single parquet file.

```python { .api }
def write_simple(fn, data, fmd, row_group_offsets=None, compression=None,
                 open_with=None, has_nulls=None, append=False, stats=True):
    """
    Write to single parquet file.

    Parameters:
    - fn: str, output file path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - has_nulls: bool or list, null handling specification
    - append: bool, append to existing file
    - stats: bool or list, statistics calculation control
    """
```
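
In practice the single-file path is usually reached through the top-level `write` with the default `file_scheme='simple'`, which this section describes as the single-file writer underneath; calling `write_simple` directly also requires building the `fmd` metadata yourself (see `make_metadata` below). A minimal sketch via the public entry point:

```python
import pandas as pd
from fastparquet import write

df = pd.DataFrame({"id": range(10), "value": range(10, 20)})

# The default 'simple' scheme produces one self-contained parquet file,
# footer metadata included, rather than a directory of parts.
write("single_file.parquet", df, file_scheme="simple")
```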

#### Multi-File Writing

Write data across multiple files with partitioning support.

```python { .api }
def write_multi(dn, data, fmd, row_group_offsets=None, compression=None,
                file_scheme='hive', write_fmd=True, open_with=None,
                mkdirs=None, partition_on=[], append=False, stats=True):
    """
    Write to multiple parquet files with partitioning.

    Parameters:
    - dn: str, output directory path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - file_scheme: str, partitioning scheme ('hive', 'drill', 'flat')
    - write_fmd: bool, write common metadata files
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - partition_on: list, partitioning column names
    - append: bool, append to existing dataset
    - stats: bool or list, statistics calculation control
    """
```
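
As with the single-file case, the multi-file layout is normally produced through the top-level `write` by choosing a directory scheme. The sketch below writes a hive-style dataset; the exact part-file names shown in the comment are an assumption about the on-disk layout.

```python
import pandas as pd
from fastparquet import write

df = pd.DataFrame({
    "id": range(6),
    "category": ["A", "A", "B", "B", "C", "C"],
})

# 'hive' writes a directory containing shared metadata plus one
# subdirectory per partition value, roughly:
#   multi_dataset/_metadata
#   multi_dataset/_common_metadata
#   multi_dataset/category=A/part.0.parquet
#   multi_dataset/category=B/...
write("multi_dataset", df, file_scheme="hive", partition_on=["category"])
```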

### Data Type and Schema Functions

#### Type Detection

Determine appropriate parquet types for pandas data.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None):
    """
    Determine appropriate parquet type codes for pandas Series.

    Parameters:
    - data: pandas.Series, input data to analyze
    - fixed_text: int, fixed-length string size
    - object_encoding: str, encoding method for object columns
    - times: str, timestamp format ('int64' or 'int96')
    - is_index: bool, whether data represents an index column

    Returns:
    tuple: (schema_element, type_code)
    """
```
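
A quick way to see what `find_type` decides is to hand it a couple of Series directly. This is a sketch that assumes the helper is importable from `fastparquet.writer`; nothing is relied on beyond the `name` and `type` attributes of the returned schema element.

```python
import pandas as pd
from fastparquet.writer import find_type

ints = pd.Series([1, 2, 3], name="id")
text = pd.Series(["a", "bb", "ccc"], name="label")

# Numeric dtypes map directly; object columns need an encoding hint
# (or 'infer') so find_type knows how to treat them.
se_int, type_int = find_type(ints)
se_txt, type_txt = find_type(text, object_encoding="utf8")

print(se_int.name, se_int.type)
print(se_txt.name, se_txt.type)
```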

#### Data Conversion

Convert pandas data to parquet-compatible format.

```python { .api }
def convert(data, se):
    """
    Convert pandas data according to schema element specification.

    Parameters:
    - data: pandas.Series, input data to convert
    - se: SchemaElement, parquet schema element describing target format

    Returns:
    numpy.ndarray: Converted data ready for parquet encoding
    """
```
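
`convert` is normally paired with `find_type`: the schema element chosen for a Series tells `convert` what representation to produce. A minimal sketch, again assuming both helpers live in `fastparquet.writer` and making no assumptions about the converted values beyond their container type:

```python
import pandas as pd
from fastparquet.writer import convert, find_type

text = pd.Series(["north", "south", "east"], name="region")

# Pick a parquet type for the column, then convert the values into the
# form the low-level encoders expect.
se, _ = find_type(text, object_encoding="utf8")
raw = convert(text, se)

print(type(raw))
```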

#### Metadata Creation

Generate parquet file metadata from pandas DataFrame.

```python { .api }
def make_metadata(data, has_nulls=True, ignore_columns=None,
                  fixed_text=None, object_encoding=None, times='int64',
                  index_cols=None, partition_cols=None, cols_dtype="object"):
    """
    Create parquet file metadata from pandas DataFrame.

    Parameters:
    - data: pandas.DataFrame, source data
    - has_nulls: bool or list, null value specifications
    - ignore_columns: list, columns to exclude from metadata
    - fixed_text: dict, fixed-length text specifications
    - object_encoding: str or dict, object encoding methods
    - times: str, timestamp encoding format
    - index_cols: list, index column specifications
    - partition_cols: list, partition column names
    - cols_dtype: str, default column dtype

    Returns:
    FileMetaData: Parquet metadata object
    """
```
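
The `fmd` object that `write_simple` and `write_multi` take is built here. A short sketch, assuming `make_metadata` is importable from `fastparquet.writer` and that the returned thrift object exposes a `schema` list as described:

```python
import pandas as pd
from fastparquet.writer import make_metadata

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Object columns need an encoding; everything else is derived from dtype.
fmd = make_metadata(df, object_encoding={"name": "utf8"})

# The schema starts with a root element followed by one element per column.
print([s.name for s in fmd.schema])
```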

### Column-Level Writing

#### Individual Column Writing

Write single column data with full control over encoding and compression.

```python { .api }
def write_column(f, data0, selement, compression=None,
                 datapage_version=None, stats=True):
    """
    Write single column to parquet file.

    Parameters:
    - f: file, open binary file for writing
    - data0: pandas.Series, column data to write
    - selement: SchemaElement, column schema specification
    - compression: str or dict, compression settings
    - datapage_version: int, parquet data page version (1 or 2)
    - stats: bool, calculate and write column statistics

    Returns:
    ColumnChunk: Parquet column chunk metadata
    """
```

### Metadata Management

#### Common Metadata Writing

Write shared metadata files for multi-file datasets.

```python { .api }
def write_common_metadata(fn, fmd, open_with=None, no_row_groups=True):
    """
    Write parquet schema to shared metadata file.

    Parameters:
    - fn: str, metadata file path
    - fmd: FileMetaData, metadata to write
    - open_with: function, file opening function
    - no_row_groups: bool, exclude row group info for common metadata
    """
```
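
This is the routine behind the `_common_metadata` file of a hive dataset: the schema without any row-group entries. A sketch under two assumptions, that the function is importable from `fastparquet.writer` and that an open `ParquetFile` exposes its `FileMetaData` as `.fmd`:

```python
from fastparquet import ParquetFile
from fastparquet.writer import write_common_metadata

# "multi_dataset" is the hive-style directory written earlier.
pf = ParquetFile("multi_dataset")

# no_row_groups=True drops the row-group entries, which is what a
# _common_metadata file is expected to contain.
write_common_metadata("multi_dataset/_common_metadata", pf.fmd,
                      no_row_groups=True)
```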

#### Custom Metadata Updates

Update file metadata without rewriting data.

```python { .api }
def update_file_custom_metadata(path, custom_metadata, is_metadata_file=None):
    """
    Update custom metadata in parquet file without rewriting data.

    Parameters:
    - path: str, path to parquet file
    - custom_metadata: dict, metadata key-value pairs to update
    - is_metadata_file: bool, whether target is pure metadata file
    """
```
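
Because only the footer is rewritten, this is a cheap way to tag files after the fact. A sketch, assuming the function is importable from `fastparquet.writer` and that values are plain strings:

```python
from fastparquet.writer import update_file_custom_metadata

# Rewrites the key/value metadata in the footer of an existing file;
# the data pages themselves are left untouched.
update_file_custom_metadata(
    "output.parquet",
    {"pipeline_stage": "validated", "schema_version": "2"},
)
```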

### Low-Level Writing Functions

#### Row Group and Partition Writing

Low-level functions for creating individual row groups and partition files.

```python { .api }
def make_row_group(df, schema, compression=None, stats=True,
                   has_nulls=True, fmd=None):
    """
    Create row group metadata from DataFrame.

    Parameters:
    - df: pandas.DataFrame, data for the row group
    - schema: list, parquet schema elements
    - compression: str or dict, compression settings
    - stats: bool or list, statistics calculation control
    - has_nulls: bool or list, null value specifications
    - fmd: FileMetaData, file metadata object

    Returns:
    RowGroup: Row group metadata object
    """

def make_part_file(filename, rg, schema, fmd, compression=None,
                   open_with=None, sep=None):
    """
    Write single partition file.

    Parameters:
    - filename: str, output file path
    - rg: RowGroup, row group to write
    - schema: list, parquet schema elements
    - fmd: FileMetaData, file metadata
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - sep: str, path separator for platform compatibility

    Returns:
    int: Bytes written to file
    """
```

#### Data Encoding Functions

Functions for encoding column data in different formats.

```python { .api }
def encode_plain(data, se):
    """
    Encode data using plain encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    bytes: Encoded data
    """

def encode_dict(data, se):
    """
    Encode data using dictionary encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    tuple: (encoded_data, dictionary_data)
    """
```

### Dataset Operations

#### Appending and Row Group Management

Add new data to existing parquet datasets.

```python { .api }
# ParquetFile methods for dataset modification
def write_row_groups(self, data, row_group_offsets=None, sort_key=None,
                     sort_pnames=False, compression=None, write_fmd=True,
                     open_with=None, mkdirs=None, stats="auto"):
    """
    Write data as new row groups to existing dataset.

    Parameters:
    - data: pandas.DataFrame or iterable, data to add
    - row_group_offsets: int or list, row group size control
    - sort_key: function, sorting key for row group ordering
    - sort_pnames: bool, align partition file names with positions
    - compression: str or dict, compression settings
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - stats: bool or list, statistics calculation control
    """

def remove_row_groups(self, rgs, sort_pnames=False, write_fmd=True,
                      open_with=None, remove_with=None):
    """
    Remove row groups from existing dataset.

    Parameters:
    - rgs: list, row group indices to remove
    - sort_pnames: bool, align partition file names
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - remove_with: function, file removal function
    """
```
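
These are methods on `ParquetFile`, so appending to a dataset is a matter of opening it and handing over a new frame. A minimal sketch of the append path only; removal is left out because the exact form `rgs` takes (indices versus the objects in `pf.row_groups`) can vary by version:

```python
import pandas as pd
from fastparquet import ParquetFile, write

# Start a small hive dataset.
df = pd.DataFrame({"id": range(10), "value": range(10)})
write("dataset_dir", df, file_scheme="hive")

# Re-open it and append more rows as one or more new row groups.
pf = ParquetFile("dataset_dir")
extra = pd.DataFrame({"id": range(10, 20), "value": range(10, 20)})
pf.write_row_groups(extra)

print(len(pf.row_groups))
```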

#### Dataset Merging and Overwriting

Advanced dataset management operations.

```python { .api }
def merge(file_list, verify_schema=True, open_with=None, root=False):
    """
    Create logical dataset from multiple parquet files.

    Parameters:
    - file_list: list, paths to parquet files or ParquetFile instances
    - verify_schema: bool, verify schema consistency across files
    - open_with: function, file opening function
    - root: str or False, dataset root directory (inferred when False)

    Returns:
    ParquetFile: Merged dataset representation
    """

def overwrite(dirpath, data, row_group_offsets=None, sort_pnames=True,
              compression=None, open_with=None, mkdirs=None,
              remove_with=None, stats=True):
    """
    Overwrite partitions in existing parquet dataset.

    Parameters:
    - dirpath: str, dataset directory path
    - data: pandas.DataFrame, new data to write
    - row_group_offsets: int or list, row group size specification
    - sort_pnames: bool, align partition file names
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - remove_with: function, file removal function
    - stats: bool or list, statistics calculation control
    """
```
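
`merge` stitches several already-written files into one logical dataset with a shared `_metadata` file. A sketch, assuming `merge` is importable from `fastparquet.writer`; the files are written into one directory first so a common root can be inferred:

```python
import os

import pandas as pd
from fastparquet import write
from fastparquet.writer import merge

os.makedirs("merged", exist_ok=True)
for i in range(3):
    chunk = pd.DataFrame({"id": range(i * 10, (i + 1) * 10)})
    write(f"merged/part.{i}.parquet", chunk)

# Writes merged/_metadata and returns a ParquetFile spanning all parts.
pf = merge([f"merged/part.{i}.parquet" for i in range(3)])
print(pf.to_pandas().shape)
```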

## Usage Examples

### Basic Writing

```python
import pandas as pd
from fastparquet import write

# Create sample data
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A'],
    'timestamp': pd.date_range('2023-01-01', periods=1000, freq='h')
})

# Write to parquet file
write('output.parquet', df)

# Write with compression
write('output_compressed.parquet', df, compression='GZIP')

# Write specific columns only
write('output_subset.parquet', df[['id', 'value']])
```

### Compression Options

```python
# String compression (applied to all columns)
write('data.parquet', df, compression='SNAPPY')

# Per-column compression
write('data.parquet', df, compression={
    'id': 'GZIP',
    'value': 'SNAPPY',
    'category': 'LZ4',
    'timestamp': None,   # No compression
    '_default': 'GZIP'   # Default for unlisted columns
})

# Advanced compression with arguments
write('data.parquet', df, compression={
    'value': {
        'type': 'LZ4',
        'args': {'mode': 'high_compression', 'compression': 9}
    },
    'category': {
        'type': 'SNAPPY',
        'args': None
    }
})
```

### Partitioned Datasets

```python
# Partition by single column
write('partitioned_data', df,
      file_scheme='hive',
      partition_on=['category'])

# Partition by multiple columns (derive 'year' from the timestamp first)
df['year'] = df['timestamp'].dt.year
write('partitioned_data', df,
      file_scheme='hive',
      partition_on=['category', 'year'])

# Drill-style partitioning (directory names contain only the values)
write('partitioned_data', df,
      file_scheme='drill',
      partition_on=['category'])
```

### Advanced Options

```python
# Control row group sizes
write('data.parquet', df, row_group_offsets=50000)  # ~50k rows per group
write('data.parquet', df, row_group_offsets=[0, 250, 500, 750])  # Explicit start offsets

# Handle object columns
write('data.parquet', df, object_encoding={
    'text_col': 'utf8',
    'json_col': 'json',
    'binary_col': 'bytes'
})

# Write with custom metadata
write('data.parquet', df, custom_metadata={
    'created_by': 'my_application',
    'version': '1.0.0',
    'description': 'Sample dataset'
})

# Control statistics calculation
write('data.parquet', df, stats=['id', 'value'])  # Only for specific columns
write('data.parquet', df, stats=False)            # Disable statistics
write('data.parquet', df, stats="auto")           # Auto-detect (default)
```

### Appending Data

```python
from fastparquet import ParquetFile

# Append to existing file
new_data = pd.DataFrame({'id': [1001, 1002], 'value': [2001, 2002]})
write('existing.parquet', new_data, append=True)

# Append using ParquetFile methods
pf = ParquetFile('existing.parquet')
pf.write_row_groups(new_data)
```

## Type Definitions

```python { .api }
from typing import Any, Dict, List, Literal, Union

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill']

# Compression specification
CompressionType = Union[
    str,                                          # Algorithm name
    Dict[str, Union[str, None, Dict[str, Any]]]   # Per-column with options
]

# Object encoding options
ObjectEncoding = Union[
    Literal['infer', 'utf8', 'bytes', 'json', 'bson', 'bool', 'int', 'int32', 'float', 'decimal'],
    Dict[str, str]  # Per-column encoding
]

# Row group size specification
RowGroupSpec = Union[int, List[int]]

# Statistics specification
StatsSpec = Union[bool, Literal["auto"], List[str]]

# Null handling specification
NullsSpec = Union[bool, Literal['infer'], List[str]]

# Custom metadata
CustomMetadata = Dict[str, Union[str, bytes]]
```