# fastparquet

A high-performance Python implementation of the Apache Parquet columnar storage format, designed for seamless integration with pandas and the Python data science ecosystem. It provides fast reading and writing of parquet files with excellent compression and query performance.

## Package Information

- **Package Name**: fastparquet
- **Language**: Python
- **Installation**: `pip install fastparquet` or `conda install -c conda-forge fastparquet`
- **Python Requirements**: >=3.9

## Core Imports

```python
import fastparquet
```

For reading parquet files:

```python
from fastparquet import ParquetFile
```

For writing parquet files:

```python
from fastparquet import write, update_file_custom_metadata
```

## Basic Usage

```python
import pandas as pd
from fastparquet import ParquetFile, write

# Writing parquet files
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A']
})

# Write to parquet file
write('data.parquet', df)

# Write with compression
write('data_compressed.parquet', df, compression='GZIP')

# Reading parquet files
pf = ParquetFile('data.parquet')

# Read entire file to pandas DataFrame
df_read = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['id', 'value'])

# Read with filters; by default filters only exclude whole row groups
# based on their statistics, row_filter=True applies them to individual rows
df_filtered = pf.to_pandas(
    filters=[('category', '==', 'A'), ('value', '>', 1500)],
    row_filter=True
)
```

## Architecture

fastparquet is built around several core components:

- **ParquetFile**: Main class for reading parquet files, handling metadata, schema, and data access
- **Writer Functions**: High-level functions for writing pandas DataFrames to parquet format with various options
- **Schema System**: Tools for handling parquet schema definitions and type conversions
- **Compression Support**: Built-in support for multiple compression algorithms (GZIP, Snappy, Brotli, LZ4, Zstandard)
- **Partitioning**: Support for hive-style and drill-style partitioned datasets

The library emphasizes performance and compatibility with the broader Python data ecosystem, particularly pandas, while providing comprehensive support for parquet format features.
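
The sketch below ties two of these components together: hive-style partitioning plus per-column compression. It is illustrative only; the `sales_dataset` path and column names are made up, and the `_default` key follows fastparquet's documented dict form for per-column compression.

```python
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'sales': [10.0, 12.5, 9.2, 14.1],
    'note': ['a', 'b', 'c', 'd'],
})

# Hive-style partitioning: one subdirectory per value of 'region',
# with a _metadata summary file written at the dataset root
write('sales_dataset', df, file_scheme='hive', partition_on=['region'],
      compression={'sales': 'GZIP', '_default': 'SNAPPY'})

# The dataset root can then be opened like a single logical file
pf = ParquetFile('sales_dataset')
print(pf.to_pandas())
```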

## Capabilities

### Reading Parquet Files

Core functionality for reading parquet files into pandas DataFrames, with support for selective column reading, filtering, and efficient memory usage through row-group iteration.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None): ...
    def to_pandas(self, columns=None, categories=None, filters=[],
                  index=None, row_filter=False, dtypes=None): ...
    def head(self, nrows, **kwargs): ...
    def iter_row_groups(self, filters=None, **kwargs): ...
    def count(self, filters=None, row_filter=False): ...
    def __getitem__(self, item): ...  # Row group selection
    def __len__(self): ...  # Number of row groups
    def read_row_group_file(self, rg, columns, categories, index=None,
                            assign=None, partition_meta=None, row_filter=False): ...
    def write_row_groups(self, data, row_group_offsets=None, **kwargs): ...
    def remove_row_groups(self, rgs, **kwargs): ...
    def check_categories(self, cats): ...
    def pre_allocate(self, size, columns, categories, index, dtypes=None): ...
```
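
A minimal sketch of memory-bounded reading with the row-group API above, reusing the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

print(len(pf))      # number of row groups
print(pf.count())   # total number of rows
print(pf.head(5))   # first 5 rows, reading as little data as possible

# Process one row group at a time to keep memory use bounded
total = 0
for chunk in pf.iter_row_groups(columns=['value']):
    total += chunk['value'].sum()

# Indexing selects a subset of row groups and returns a new ParquetFile
first_rg = pf[0].to_pandas()
```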

[Reading Parquet Files](./reading.md)

### Writing Parquet Files

Comprehensive functionality for writing pandas DataFrames to parquet format with options for compression, partitioning, encoding, and metadata management.

```python { .api }
def write(filename, data, row_group_offsets=None, compression=None,
          file_scheme='simple', has_nulls=None, write_index=None,
          partition_on=[], append=False, object_encoding=None,
          fixed_text=None, times='int64', custom_metadata=None,
          stats="auto", open_with=None, mkdirs=None): ...

def update_file_custom_metadata(fn, custom_metadata, open_with=None): ...
```
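
A short sketch of the write options above: user metadata in the footer, in-place metadata updates, and appending row groups to a hive data-set. The paths (`events.parquet`, `events_ds`) are hypothetical.

```python
import pandas as pd
from fastparquet import write, update_file_custom_metadata

df = pd.DataFrame({'id': range(5), 'value': range(5, 10)})

# Single-file layout with user key-value metadata stored in the footer
write('events.parquet', df, compression='ZSTD',
      custom_metadata={'source': 'example'})

# Rewrite only the key-value metadata, leaving the data pages untouched
update_file_custom_metadata('events.parquet', {'source': 'example-v2'})

# Multi-file (hive) layout; append=True adds new row groups to the data-set
write('events_ds', df, file_scheme='hive')
write('events_ds', df, file_scheme='hive', append=True)
```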

[Writing Parquet Files](./writing.md)

### Schema and Types

Tools for working with parquet schemas, type conversions, and metadata management to ensure proper data representation and compatibility.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None): ...
def convert(data, se): ...

class SchemaHelper:
    def __init__(self, schema_elements): ...
    def schema_element(self, name): ...
    def is_required(self, name): ...
    def max_repetition_level(self, parts): ...
    def max_definition_level(self, parts): ...
```
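
In practice the parsed schema is usually reached through a `ParquetFile`, whose `schema` attribute is a `SchemaHelper`. A small sketch of schema inspection, again assuming the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

print(pf.columns)   # column names from the parquet schema
print(pf.dtypes)    # dtypes the data will be loaded as
print(pf.schema)    # SchemaHelper wrapping the schema elements

# Per-column schema details via SchemaHelper
se = pf.schema.schema_element('value')
print(se.type, pf.schema.is_required('value'))
```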

[Schema and Types](./schema-types.md)

### Dataset Management

Advanced features for working with partitioned datasets, including reading from and writing to multi-file parquet collections with directory-based partitioning.

```python { .api }
def merge(file_list, verify_schema=True, open_with=None, root=False): ...

def metadata_from_many(file_list, verify_schema=False, open_with=None,
                       root=False, fs=None): ...
```
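
A sketch of combining independent files into one logical data-set with `merge` (imported here from `fastparquet.writer`; the `parts/` layout is made up for illustration):

```python
import glob
import os
import pandas as pd
from fastparquet import write
from fastparquet.writer import merge

os.makedirs('parts', exist_ok=True)

# Write several single-file parquet files sharing the same schema
for i in range(3):
    df = pd.DataFrame({'id': range(i * 10, (i + 1) * 10)})
    write(f'parts/part.{i}.parquet', df)

# merge writes a _metadata summary file and returns the combined ParquetFile
pf = merge(sorted(glob.glob('parts/part.*.parquet')))
print(len(pf), pf.count())   # 3 row groups, 30 rows
```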

[Dataset Management](./dataset-management.md)

## Common Types

```python { .api }
class ParquetException(Exception):
    """Generic Exception related to unexpected data format when reading parquet file."""

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill', 'flat', 'empty', 'mixed']

# Compression algorithm options
CompressionType = Union[str, Dict[str, Union[str, Dict[str, Any]]]]
# Supported: 'UNCOMPRESSED', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI', 'LZ4', 'ZSTD'

# Filter specification
FilterCondition = Tuple[str, str, Any]  # (column, operator, value)
FilterGroup = List[FilterCondition]  # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOperator = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# Object encoding options
ObjectEncoding = Literal['infer', 'utf8', 'json', 'bytes', 'bson']

# Timestamp encoding options
TimeEncoding = Literal['int64', 'int96']

# SchemaElement type
class SchemaElement:
    name: str
    type: Optional[str]
    type_length: Optional[int]
    repetition_type: Optional[str]  # 'REQUIRED', 'OPTIONAL', 'REPEATED'
    num_children: int
    converted_type: Optional[str]
    scale: Optional[int]
    precision: Optional[int]
    field_id: Optional[int]
    logical_type: Optional[dict]

# Thrift metadata types
class FileMetaData:
    version: int
    schema: List[SchemaElement]
    num_rows: int
    row_groups: List['RowGroup']
    key_value_metadata: Optional[List['KeyValue']]
    created_by: Optional[str]
    column_orders: Optional[List['ColumnOrder']]

class RowGroup:
    columns: List['ColumnChunk']
    total_byte_size: int
    num_rows: int
    sorting_columns: Optional[List['SortingColumn']]
    file_offset: Optional[int]
    total_compressed_size: Optional[int]
    ordinal: Optional[int]

class ColumnChunk:
    file_path: Optional[str]
    file_offset: int
    meta_data: Optional['ColumnMetaData']
    offset_index_offset: Optional[int]
    offset_index_length: Optional[int]
    column_index_offset: Optional[int]
    column_index_length: Optional[int]

# Function type signatures
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like
MkdirsFunction = Callable[[str], None]  # (path) -> None
RemoveFunction = Callable[[str], None]  # (path) -> None
FileSystem = Any  # fsspec.AbstractFileSystem compatible interface
```
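
As a usage note on the filter specification above, a flat list of tuples is combined with AND, while a list of lists expresses OR between AND groups. A minimal sketch, assuming the `data.parquet` file from the Basic Usage example:

```python
from fastparquet import ParquetFile

pf = ParquetFile('data.parquet')

# Single AND group: category == 'A' AND value > 1500
and_filter = [('category', '==', 'A'), ('value', '>', 1500)]

# OR of AND groups:
# (category == 'A' AND value > 1500) OR (category == 'C')
or_filter = [
    [('category', '==', 'A'), ('value', '>', 1500)],
    [('category', '==', 'C')],
]

# row_filter=True applies the filter to individual rows instead of
# only excluding whole row groups via their statistics
df = pf.to_pandas(filters=or_filter, row_filter=True)
```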