Writing Parquet Files

Comprehensive functionality for writing pandas DataFrames to parquet format with extensive options for compression, partitioning, encoding, and performance optimization.

Capabilities

Main Write Function

The primary function for writing pandas DataFrames to parquet files with full control over format options.

def write(filename, data, row_group_offsets=None, compression=None, 
          file_scheme='simple', open_with=None, mkdirs=None, 
          has_nulls=True, write_index=None, partition_on=[], 
          fixed_text=None, append=False, object_encoding='infer', 
          times='int64', custom_metadata=None, stats="auto"):
    """
    Write pandas DataFrame to parquet file.

    Parameters:
    - filename: str, output parquet file or directory path
    - data: pandas.DataFrame, data to write
    - row_group_offsets: int or list, row group size control
    - compression: str or dict, compression algorithm(s) to use
    - file_scheme: str, file organization ('simple', 'hive', 'drill')
    - open_with: function, custom file opener
    - mkdirs: function, directory creation function
    - has_nulls: bool or list, null value handling specification
    - write_index: bool, whether to write DataFrame index as column
    - partition_on: list, columns to partition data by
    - fixed_text: dict, fixed-length string specifications
    - append: bool, append to existing dataset
    - object_encoding: str or dict, object column encoding method
    - times: str, timestamp encoding format ('int64' or 'int96')
    - custom_metadata: dict, additional metadata to store
    - stats: bool or list, statistics calculation control
    """

Specialized Write Functions

Simple File Writing

Write all data to a single parquet file.

def write_simple(fn, data, fmd, row_group_offsets=None, compression=None,
                 open_with=None, has_nulls=None, append=False, stats=True):
    """
    Write to single parquet file.

    Parameters:
    - fn: str, output file path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - has_nulls: bool or list, null handling specification
    - append: bool, append to existing file
    - stats: bool or list, statistics calculation control
    """

Multi-File Writing

Write data across multiple files with partitioning support.

def write_multi(dn, data, fmd, row_group_offsets=None, compression=None,
                file_scheme='hive', write_fmd=True, open_with=None,
                mkdirs=None, partition_on=[], append=False, stats=True):
    """
    Write to multiple parquet files with partitioning.

    Parameters:
    - dn: str, output directory path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - file_scheme: str, partitioning scheme ('hive', 'drill', 'flat')
    - write_fmd: bool, write common metadata files
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - partition_on: list, partitioning column names
    - append: bool, append to existing dataset
    - stats: bool or list, statistics calculation control
    """

Data Type and Schema Functions

Type Detection

Determine appropriate parquet types for pandas data.

def find_type(data, fixed_text=None, object_encoding=None, 
              times='int64', is_index=None):
    """
    Determine appropriate parquet type codes for pandas Series.

    Parameters:
    - data: pandas.Series, input data to analyze
    - fixed_text: int, fixed-length string size
    - object_encoding: str, encoding method for object columns
    - times: str, timestamp format ('int64' or 'int96')
    - is_index: bool, whether data represents an index column

    Returns:
    tuple: (schema_element, type_code)
    """

Data Conversion

Convert pandas data to parquet-compatible format.

def convert(data, se):
    """
    Convert pandas data according to schema element specification.

    Parameters:
    - data: pandas.Series, input data to convert
    - se: SchemaElement, parquet schema element describing target format

    Returns:
    numpy.ndarray: Converted data ready for parquet encoding
    """

Metadata Creation

Generate parquet file metadata from pandas DataFrame.

def make_metadata(data, has_nulls=True, ignore_columns=None, 
                  fixed_text=None, object_encoding=None, times='int64',
                  index_cols=None, partition_cols=None, cols_dtype="object"):
    """
    Create parquet file metadata from pandas DataFrame.

    Parameters:
    - data: pandas.DataFrame, source data
    - has_nulls: bool or list, null value specifications
    - ignore_columns: list, columns to exclude from metadata
    - fixed_text: dict, fixed-length text specifications
    - object_encoding: str or dict, object encoding methods
    - times: str, timestamp encoding format
    - index_cols: list, index column specifications
    - partition_cols: list, partition column names
    - cols_dtype: str, default column dtype

    Returns:
    FileMetaData: Parquet metadata object
    """

Column-Level Writing

Individual Column Writing

Write single column data with full control over encoding and compression.

def write_column(f, data0, selement, compression=None, 
                 datapage_version=None, stats=True):
    """
    Write single column to parquet file.

    Parameters:
    - f: file, open binary file for writing
    - data0: pandas.Series, column data to write
    - selement: SchemaElement, column schema specification
    - compression: str or dict, compression settings
    - datapage_version: int, parquet data page version (1 or 2)
    - stats: bool, calculate and write column statistics

    Returns:
    ColumnChunk: Parquet column chunk metadata
    """

Metadata Management

Common Metadata Writing

Write shared metadata files for multi-file datasets.

def write_common_metadata(fn, fmd, open_with=None, no_row_groups=True):
    """
    Write parquet schema to shared metadata file.

    Parameters:
    - fn: str, metadata file path
    - fmd: FileMetaData, metadata to write
    - open_with: function, file opening function
    - no_row_groups: bool, exclude row group info for common metadata
    """

Custom Metadata Updates

Update file metadata without rewriting data.

def update_file_custom_metadata(path, custom_metadata, is_metadata_file=None):
    """
    Update custom metadata in parquet file without rewriting data.

    Parameters:
    - path: str, path to parquet file
    - custom_metadata: dict, metadata key-value pairs to update
    - is_metadata_file: bool, whether target is pure metadata file
    """

Low-Level Writing Functions

Row Group and Partition Writing

Low-level functions for creating individual row groups and partition files.

def make_row_group(df, schema, compression=None, stats=True, 
                   has_nulls=True, fmd=None):
    """
    Create row group metadata from DataFrame.

    Parameters:
    - df: pandas.DataFrame, data for the row group
    - schema: list, parquet schema elements
    - compression: str or dict, compression settings
    - stats: bool or list, statistics calculation control
    - has_nulls: bool or list, null value specifications
    - fmd: FileMetaData, file metadata object

    Returns:
    RowGroup: Row group metadata object
    """

def make_part_file(filename, rg, schema, fmd, compression=None,
                   open_with=None, sep=None):
    """
    Write single partition file.

    Parameters:
    - filename: str, output file path
    - rg: RowGroup, row group to write
    - schema: list, parquet schema elements
    - fmd: FileMetaData, file metadata
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - sep: str, path separator for platform compatibility

    Returns:
    int: Bytes written to file
    """

Data Encoding Functions

Functions for encoding column data in different formats.

def encode_plain(data, se):
    """
    Encode data using plain encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    bytes: Encoded data
    """

def encode_dict(data, se):
    """
    Encode data using dictionary encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    tuple: (encoded_data, dictionary_data)
    """

Dataset Operations

Appending and Row Group Management

Add new data to existing parquet datasets.

# ParquetFile methods for dataset modification
def write_row_groups(self, data, row_group_offsets=None, sort_key=None,
                     sort_pnames=False, compression=None, write_fmd=True,
                     open_with=None, mkdirs=None, stats="auto"):
    """
    Write data as new row groups to existing dataset.

    Parameters:
    - data: pandas.DataFrame or iterable, data to add
    - row_group_offsets: int or list, row group size control
    - sort_key: function, sorting key for row group ordering
    - sort_pnames: bool, align partition file names with positions
    - compression: str or dict, compression settings
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - stats: bool or list, statistics calculation control
    """

def remove_row_groups(self, rgs, sort_pnames=False, write_fmd=True,
                      open_with=None, remove_with=None):
    """
    Remove row groups from existing dataset.

    Parameters:
    - rgs: list, row group indices to remove
    - sort_pnames: bool, align partition file names
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - remove_with: function, file removal function
    """

Dataset Merging and Overwriting

Advanced dataset management operations.

def merge(file_list, verify_schema=True, open_with=None, root=False):
    """
    Create logical dataset from multiple parquet files.

    Parameters:
    - file_list: list, paths to parquet files or ParquetFile instances
    - verify_schema: bool, verify schema consistency across files
    - open_with: function, file opening function
    - root: str or False, dataset root directory; inferred from file_list when False

    Returns:
    ParquetFile: Merged dataset representation
    """

def overwrite(dirpath, data, row_group_offsets=None, sort_pnames=True,
              compression=None, open_with=None, mkdirs=None,
              remove_with=None, stats=True):
    """
    Overwrite partitions in existing parquet dataset.

    Parameters:
    - dirpath: str, dataset directory path
    - data: pandas.DataFrame, new data to write
    - row_group_offsets: int or list, row group size specification
    - sort_pnames: bool, align partition file names
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - remove_with: function, file removal function
    - stats: bool or list, statistics calculation control
    """

Usage Examples

Basic Writing

import pandas as pd
from fastparquet import write

# Create sample data
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A'],
    'timestamp': pd.date_range('2023-01-01', periods=1000, freq='H')
})

# Write to parquet file
write('output.parquet', df)

# Write with compression
write('output_compressed.parquet', df, compression='GZIP')

# Write specific columns only
write('output_subset.parquet', df[['id', 'value']])

Compression Options

# String compression (applied to all columns)
write('data.parquet', df, compression='SNAPPY')

# Per-column compression
write('data.parquet', df, compression={
    'id': 'GZIP',
    'value': 'SNAPPY',
    'category': 'LZ4',
    'timestamp': None,  # No compression
    '_default': 'GZIP'  # Default for unlisted columns
})

# Advanced compression with arguments
write('data.parquet', df, compression={
    'value': {
        'type': 'LZ4',
        'args': {'mode': 'high_compression', 'compression': 9}
    },
    'category': {
        'type': 'SNAPPY',
        'args': None
    }
})

Partitioned Datasets

# Partition by single column
write('partitioned_data', df, 
      file_scheme='hive', 
      partition_on=['category'])

# Partition by multiple columns (derive a 'year' column first, since
# the sample DataFrame above does not include one)
df['year'] = df['timestamp'].dt.year
write('partitioned_data', df,
      file_scheme='hive',
      partition_on=['category', 'year'])

# Drill-style partitioning (directory names as values)
write('partitioned_data', df,
      file_scheme='drill',
      partition_on=['category'])

Advanced Options

# Control row group sizes
write('data.parquet', df, row_group_offsets=50000)  # ~50k rows per group
write('data.parquet', df, row_group_offsets=[0, 250, 500, 750])  # Explicit row-group start indices

# Handle object columns
write('data.parquet', df, object_encoding={
    'text_col': 'utf8',
    'json_col': 'json',
    'binary_col': 'bytes'
})

# Write with custom metadata
write('data.parquet', df, custom_metadata={
    'created_by': 'my_application',
    'version': '1.0.0',
    'description': 'Sample dataset'
})

# Control statistics calculation
write('data.parquet', df, stats=['id', 'value'])  # Only for specific columns
write('data.parquet', df, stats=False)  # Disable statistics
write('data.parquet', df, stats="auto")  # Auto-detect (default)

Appending Data

from fastparquet import ParquetFile

# Append to an existing file (the schema of the appended data must
# match the schema already stored in the file)
new_data = pd.DataFrame({'id': [1001, 1002], 'value': [2001, 2002]})
write('existing.parquet', new_data, append=True)

# Append using ParquetFile methods
pf = ParquetFile('existing.parquet')
pf.write_row_groups(new_data)

Type Definitions

from typing import Any, Dict, List, Literal, Union

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill']

# Compression specification
CompressionType = Union[
    str,  # Algorithm name
    Dict[str, Union[str, None, Dict[str, Any]]]  # Per-column with options
]

# Object encoding options
ObjectEncoding = Union[
    Literal['infer', 'utf8', 'bytes', 'json', 'bson', 'bool', 'int', 'int32', 'float', 'decimal'],
    Dict[str, str]  # Per-column encoding
]

# Row group size specification
RowGroupSpec = Union[int, List[int]]

# Statistics specification
StatsSpec = Union[bool, Literal["auto"], List[str]]

# Null handling specification
NullsSpec = Union[bool, Literal['infer'], List[str]]

# Custom metadata
CustomMetadata = Dict[str, Union[str, bytes]]
