# fastparquet

A high-performance Python implementation of the Apache Parquet columnar storage format, designed for seamless integration with pandas and the Python data science ecosystem. It provides fast reading and writing of parquet files with excellent compression and query performance.

## Package Information

- **Package Name**: fastparquet
- **Language**: Python
- **Installation**: `pip install fastparquet` or `conda install -c conda-forge fastparquet`
- **Python Requirements**: >=3.9

## Core Imports

```python
import fastparquet
```

For reading parquet files:

```python
from fastparquet import ParquetFile
```

For writing parquet files:

```python
from fastparquet import write, update_file_custom_metadata
```

## Basic Usage

```python
import pandas as pd
from fastparquet import ParquetFile, write

# Writing parquet files
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A']
})

# Write to parquet file
write('data.parquet', df)

# Write with compression
write('data_compressed.parquet', df, compression='GZIP')

# Reading parquet files
pf = ParquetFile('data.parquet')

# Read entire file to pandas DataFrame
df_read = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['id', 'value'])

# Read with filters (applied at row-group level by default;
# pass row_filter=True for exact row-level filtering)
df_filtered = pf.to_pandas(
    filters=[('category', '==', 'A'), ('value', '>', 1500)]
)
```

## Architecture

fastparquet is built around several core components:

- **ParquetFile**: Main class for reading parquet files, handling metadata, schema, and data access
- **Writer Functions**: High-level functions for writing pandas DataFrames to parquet format with various options
- **Schema System**: Tools for handling parquet schema definitions and type conversions
- **Compression Support**: Built-in support for multiple compression algorithms (GZIP, Snappy, Brotli, LZ4, Zstandard)
- **Partitioning**: Support for hive-style and drill-style partitioned datasets (see the sketch below)

The library emphasizes performance and compatibility with the broader Python data ecosystem, particularly pandas, while providing comprehensive support for parquet format features.
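
These pieces combine when writing and reading partitioned data. The following is a minimal sketch, not part of the API reference: the `sales_data/` output directory and the sample DataFrame are hypothetical, and it assumes the snappy codec is installed.

```python
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({
    'id': range(6),
    'value': [10.0, 11.5, 9.2, 8.8, 12.1, 10.3],
    'category': ['A', 'B', 'C', 'A', 'B', 'C'],
})

# Hive-style layout: one subdirectory per distinct 'category' value,
# with Snappy-compressed data files and a _metadata summary file.
write('sales_data', df, file_scheme='hive',
      partition_on=['category'], compression='SNAPPY')

# The partition column is reconstructed from the directory names on read.
pf = ParquetFile('sales_data')
print(pf.to_pandas(columns=['id', 'value', 'category']))
```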

## Capabilities

### Reading Parquet Files

Core functionality for reading parquet files into pandas DataFrames, with support for selective column reading, filtering, and efficient memory usage through row-group iteration.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None): ...
    def to_pandas(self, columns=None, categories=None, filters=[],
                  index=None, row_filter=False, dtypes=None): ...
    def head(self, nrows, **kwargs): ...
    def iter_row_groups(self, filters=None, **kwargs): ...
    def count(self, filters=None, row_filter=False): ...
    def __getitem__(self, item): ...  # Row group selection
    def __len__(self): ...  # Number of row groups
    def read_row_group_file(self, rg, columns, categories, index=None,
                            assign=None, partition_meta=None, row_filter=False): ...
    def write_row_groups(self, data, row_group_offsets=None, **kwargs): ...
    def remove_row_groups(self, rgs, **kwargs): ...
    def check_categories(self, cats): ...
    def pre_allocate(self, size, columns, categories, index, dtypes=None): ...
```
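
A brief usage sketch of the reading API above, reusing the `data.parquet` file from the Basic Usage example; memory stays bounded in the loop because each row group is materialized on its own.

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

print(len(pf))      # number of row groups
print(pf.count())   # total number of rows

# Peek at the first rows without loading the whole file
preview = pf.head(10)

# Stream row groups one at a time for out-of-core processing
total = 0.0
for chunk in pf.iter_row_groups(columns=['value']):
    total += chunk['value'].sum()
print(total)
```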

[Reading Parquet Files](./reading.md)

### Writing Parquet Files

Comprehensive functionality for writing pandas DataFrames to parquet format with options for compression, partitioning, encoding, and metadata management.

```python { .api }
def write(filename, data, row_group_offsets=None, compression=None,
          file_scheme='hive', has_nulls=None, write_index=None,
          partition_on=[], append=False, object_encoding=None,
          fixed_text=None, times='int64', custom_metadata=None,
          stats="auto", open_with=None, mkdirs=None): ...

def update_file_custom_metadata(fn, custom_metadata, open_with=None): ...
```
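
A sketch of appending to a multi-file dataset and updating footer metadata. The `events/` path and the metadata keys are hypothetical; appending assumes the target was written with a multi-file scheme such as `'hive'`.

```python
import pandas as pd
from fastparquet import ParquetFile, write, update_file_custom_metadata

batch1 = pd.DataFrame({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})
batch2 = pd.DataFrame({'id': [4, 5], 'value': [0.4, 0.5]})

# Initial write: directory layout with a _metadata summary file
write('events', batch1, file_scheme='hive', compression='GZIP',
      custom_metadata={'source': 'ingest-v1'})

# Append a second batch as additional row groups / data files
write('events', batch2, file_scheme='hive', compression='GZIP', append=True)

# Rewrite only the key/value metadata in the footer, leaving data untouched
update_file_custom_metadata('events/_metadata', {'source': 'ingest-v2'})

print(ParquetFile('events').count())  # 5 rows after the append
```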

[Writing Parquet Files](./writing.md)

### Schema and Types

Tools for working with parquet schemas, type conversions, and metadata management to ensure proper data representation and compatibility.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None): ...

def convert(data, se): ...

class SchemaHelper:
    def __init__(self, schema_elements): ...
    def schema_element(self, name): ...
    def is_required(self, name): ...
    def max_repetition_level(self, parts): ...
    def max_definition_level(self, parts): ...
```
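
As a rough sketch of how these helpers might be used to inspect a file's schema, assuming the `ParquetFile.schema` attribute exposes a `SchemaHelper` as described above (that attribute is not part of the listing here):

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')
helper = pf.schema  # assumed to be a SchemaHelper over the file's schema elements

se = helper.schema_element('value')       # thrift SchemaElement for one column
print(se.type, se.converted_type)         # physical and converted type codes
print(helper.is_required('value'))        # False for OPTIONAL columns
print(helper.max_definition_level(['value']))
```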

[Schema and Types](./schema-types.md)

### Dataset Management

Advanced features for working with partitioned datasets, including reading from and writing to multi-file parquet collections with directory-based partitioning.

```python { .api }
def merge(file_list, verify_schema=True, open_with=None, root=False): ...

def metadata_from_many(file_list, verify_schema=False, open_with=None,
                       root=False, fs=None): ...
```
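
A sketch of consolidating independently written files into one logical dataset with `merge`. The `parts/` directory is hypothetical, and the import path is an assumption: depending on the fastparquet version, `merge` may live in `fastparquet.writer` rather than the top-level package.

```python
import glob
import os
import pandas as pd
from fastparquet import write
from fastparquet.writer import merge  # assumed import location

os.makedirs('parts', exist_ok=True)

# Several files sharing one schema, written independently
for i in range(3):
    write(f'parts/part.{i}.parquet',
          pd.DataFrame({'id': [i], 'value': [i * 1.5]}),
          file_scheme='simple')

# Write a combined _metadata file and get a ParquetFile over the collection
pf = merge(sorted(glob.glob('parts/part.*.parquet')))
print(pf.count())  # total rows across all member files
df = pf.to_pandas()
```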

[Dataset Management](./dataset-management.md)

## Common Types

```python { .api }
from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union

class ParquetException(Exception):
    """Generic exception related to unexpected data format when reading a parquet file."""

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill', 'flat', 'empty', 'mixed']

# Compression algorithm options
CompressionType = Union[str, Dict[str, Union[str, Dict[str, Any]]]]
# Supported: 'UNCOMPRESSED', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4', 'ZSTD'

# Filter specification
FilterCondition = Tuple[str, str, Any]              # (column, operator, value)
FilterGroup = List[FilterCondition]                 # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOperator = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# Object encoding options
ObjectEncoding = Literal['infer', 'utf8', 'json', 'bytes', 'bson']

# Timestamp encoding options
TimeEncoding = Literal['int64', 'int96']

# SchemaElement type
class SchemaElement:
    name: str
    type: Optional[str]
    type_length: Optional[int]
    repetition_type: Optional[str]  # 'REQUIRED', 'OPTIONAL', 'REPEATED'
    num_children: int
    converted_type: Optional[str]
    scale: Optional[int]
    precision: Optional[int]
    field_id: Optional[int]
    logical_type: Optional[dict]

# Thrift metadata types
class FileMetaData:
    version: int
    schema: List[SchemaElement]
    num_rows: int
    row_groups: List['RowGroup']
    key_value_metadata: Optional[List['KeyValue']]
    created_by: Optional[str]
    column_orders: Optional[List['ColumnOrder']]

class RowGroup:
    columns: List['ColumnChunk']
    total_byte_size: int
    num_rows: int
    sorting_columns: Optional[List['SortingColumn']]
    file_offset: Optional[int]
    total_compressed_size: Optional[int]
    ordinal: Optional[int]

class ColumnChunk:
    file_path: Optional[str]
    file_offset: int
    meta_data: Optional['ColumnMetaData']
    offset_index_offset: Optional[int]
    offset_index_length: Optional[int]
    column_index_offset: Optional[int]
    column_index_length: Optional[int]

# Function type signatures
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like
MkdirsFunction = Callable[[str], None]    # (path) -> None
RemoveFunction = Callable[[str], None]    # (path) -> None
FileSystem = Any  # fsspec.AbstractFileSystem compatible interface
```
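
Building on the filter types above, a hedged sketch of filter composition: a list of groups expresses an OR over AND groups, while the flat list used in Basic Usage is the single-group shorthand. Paths and values are illustrative.

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

# Flat list of conditions: category == 'A' AND value > 1500
and_filter = [('category', '==', 'A'), ('value', '>', 1500)]

# List of groups: (category == 'A' AND value > 1500) OR category in {'B', 'C'}
or_filter = [
    [('category', '==', 'A'), ('value', '>', 1500)],
    [('category', 'in', ['B', 'C'])],
]

df = pf.to_pandas(filters=or_filter)
```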