# Reading Parquet Files

Comprehensive functionality for reading parquet files into pandas DataFrames with high performance and flexible data access patterns.

## Capabilities

### ParquetFile Class

The main class for reading parquet files, providing access to metadata, schema information, and efficient data reading methods.

```python { .api }
class ParquetFile:
    def __init__(self, fn, verify=False, open_with=None, root=False,
                 sep=None, fs=None, pandas_nulls=True, dtypes=None):
        """
        Initialize ParquetFile for reading parquet data.

        Parameters:
        - fn: str, path/URL or list of paths to parquet file(s)
        - verify: bool, test file start/end byte markers
        - open_with: function, custom file opener with signature func(path, mode)
        - root: str, dataset root directory for partitioned data
        - fs: fsspec filesystem, alternative to open_with
        - pandas_nulls: bool, use pandas nullable types for int/bool with nulls
        - dtypes: dict, override column dtypes
        """
```
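
Remote data can be opened the same way by supplying an fsspec filesystem. A minimal sketch, assuming an S3 location (the bucket path and anonymous access are placeholders):

```python
import fsspec
from fastparquet import ParquetFile

# Any fsspec-compatible filesystem can be passed via fs=.
fs = fsspec.filesystem('s3', anon=True)
pf = ParquetFile('my-bucket/data/part.parquet', fs=fs)

print(len(pf))     # number of row groups
print(pf.count())  # total number of rows
```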

### Data Reading Methods

#### Complete Data Reading

Read an entire parquet file, or a filtered subset of it, into a pandas DataFrame.

```python { .api }
def to_pandas(self, columns=None, categories=None, filters=[],
              index=None, row_filter=False, dtypes=None):
    """
    Read parquet data into pandas DataFrame.

    Parameters:
    - columns: list, column names to load (None for all)
    - categories: list or dict, columns to treat as categorical
    - filters: list, row filtering conditions
    - index: str or list, column(s) to use as DataFrame index
    - row_filter: bool or array, enable row-wise filtering
    - dtypes: dict, override column data types

    Returns:
    pandas.DataFrame: The loaded data
    """
```
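
For example, a subset of columns can be loaded with one column promoted to the index and another decoded as a pandas Categorical (a sketch; the column names are placeholders):

```python
# Load two columns, decode 'category' as a pandas Categorical,
# and promote 'id' to the DataFrame index.
df = pf.to_pandas(
    columns=['id', 'category'],
    categories=['category'],
    index='id',
)
```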

#### Partial Data Reading

Get a limited number of rows from the beginning of the dataset.

```python { .api }
def head(self, nrows, **kwargs):
    """
    Get the first nrows of data.

    Parameters:
    - nrows: int, number of rows to return
    - **kwargs: additional arguments passed to to_pandas()

    Returns:
    pandas.DataFrame: First nrows of data
    """
```
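
Because keyword arguments are forwarded to to_pandas(), a preview can also be limited to a few columns (column names are placeholders):

```python
# Peek at the first five rows of two columns; keyword args reach to_pandas().
preview = pf.head(5, columns=['id', 'value'])
print(preview)
```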

#### Row Group Iteration

Iterate through the dataset one row group at a time for memory-efficient processing.

```python { .api }
def iter_row_groups(self, filters=None, **kwargs):
    """
    Iterate dataset by row groups.

    Parameters:
    - filters: list, optional filters to skip row groups
    - **kwargs: additional arguments passed to to_pandas()

    Yields:
    pandas.DataFrame: One DataFrame per row group
    """
```
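
Filters can be combined with streaming to skip whole row groups and read only the needed columns. A sketch with placeholder column names:

```python
# Stream the file, keeping only row groups whose statistics allow year == 2023
# and reading just the 'value' column from each.
total = 0.0
for chunk in pf.iter_row_groups(filters=[('year', '==', 2023)], columns=['value']):
    total += chunk['value'].sum()
print(total)
```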

### Data Access and Slicing

#### Row Group Selection

Access specific row groups using indexing and slicing operations.

```python { .api }
def __getitem__(self, item):
    """
    Select row groups using integer indexing or slicing.

    Parameters:
    - item: int, slice, or list, row group selector

    Returns:
    ParquetFile: New ParquetFile with selected row groups
    """

def __len__(self):
    """
    Return number of row groups.

    Returns:
    int: Number of row groups in the file
    """
```
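
A sketch of selecting row groups before reading; slicing returns a smaller ParquetFile that is read like any other:

```python
print(len(pf))             # number of row groups in the file

first = pf[0].to_pandas()  # read only the first row group
subset = pf[2:5]           # new ParquetFile covering row groups 2-4
df_subset = subset.to_pandas()
```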

#### Row Count

Get total number of rows with optional filtering.

```python { .api }
def count(self, filters=None, row_filter=False):
    """
    Total number of rows in the dataset.

    Parameters:
    - filters: list, optional row filtering conditions
    - row_filter: bool, enable row-wise filtering

    Returns:
    int: Total number of rows
    """
```
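
A sketch of counting matching rows without materialising them (the column name is a placeholder):

```python
total = pf.count()
matching = pf.count(filters=[('age', '>', 25)], row_filter=True)
print(total, matching)
```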

### Metadata and Schema Access

#### Properties

Access file metadata, schema, and structural information.

```python { .api }
@property
def columns(self):
    """Column names available in the dataset."""

@property
def dtypes(self):
    """Expected output types for each column."""

@property
def schema(self):
    """SchemaHelper object representing column structure."""

@property
def statistics(self):
    """Per-column statistics (min, max, count, null_count)."""

@property
def key_value_metadata(self):
    """Additional metadata key-value pairs."""

@property
def pandas_metadata(self):
    """Pandas-specific metadata if available."""

@property
def info(self):
    """Dataset summary information."""

@property
def file_scheme(self):
    """File organization scheme ('simple', 'hive', 'mixed', 'empty')."""
```
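
A sketch of inspecting a file before deciding what to read:

```python
print(pf.columns)             # column names
print(pf.dtypes)              # expected pandas dtypes per column
print(pf.file_scheme)         # e.g. 'simple' for a single file
print(pf.statistics)          # per-column min/max/null_count by row group
print(pf.key_value_metadata)  # extra metadata written alongside the data
```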

### Low-Level Reading Functions

#### Row Group Reading

Direct row group reading functions for advanced use cases and performance optimization.

```python { .api }
def read_row_group(file, rg, columns, categories, schema=None,
                   cats=None, index=None, assign=None,
                   scheme='hive', pandas_nulls=True, dtypes=None):
    """
    Read single row group from parquet file.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical column specifications
    - schema: SchemaHelper, parquet schema object
    - cats: dict, partition categories
    - index: str or list, index column specifications
    - assign: dict, values to assign for partitioned columns
    - scheme: str, partitioning scheme
    - pandas_nulls: bool, use pandas nullable types
    - dtypes: dict, column data type overrides

    Returns:
    pandas.DataFrame: Row group data
    """

def read_row_group_arrays(file, rg, columns, categories, schema=None,
                          cats=None, assign=None, scheme='hive'):
    """
    Read row group into numpy arrays.

    Parameters:
    - file: file-like object or ParquetFile
    - rg: RowGroup, row group metadata object
    - columns: list, column names to read
    - categories: list or dict, categorical specifications
    - schema: SchemaHelper, parquet schema
    - cats: dict, partition categories
    - assign: dict, partition value assignments
    - scheme: str, partitioning scheme

    Returns:
    dict: Column name to numpy array mapping
    """
```
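
These are the internals behind to_pandas() and iter_row_groups(); most code never calls them directly. For planning chunked reads, the row-group metadata on the ParquetFile object is often sufficient. A sketch, noting that the row_groups attribute and its fields are assumptions not documented above:

```python
# Per-row-group sizes, useful for planning chunked reads (assumed attributes).
for i, rg in enumerate(pf.row_groups):
    print(i, rg.num_rows, rg.total_byte_size)

# Read only the largest row group via the slicing interface documented above.
largest = max(range(len(pf)), key=lambda i: pf.row_groups[i].num_rows)
df = pf[largest].to_pandas()
```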

#### Column Reading

Functions for reading individual columns and their data pages.

```python { .api }
def read_col(column, schema_helper, infile, use_cat=True,
             assign=None, row_filter=None):
    """
    Read single column from parquet file.

    Parameters:
    - column: ColumnChunk, column metadata object
    - schema_helper: SchemaHelper, schema navigation helper
    - infile: file-like object, open parquet file
    - use_cat: bool, use categorical optimization
    - assign: any, value to assign for partition columns
    - row_filter: array, boolean row selection mask

    Returns:
    numpy.ndarray: Column data
    """

def read_data_page(infile, page, compressed_size, uncompressed_size,
                   column, schema, use_cat=True, selfmade=True,
                   assign=None, decoders=None, row_filter=None):
    """
    Read and decode single data page.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size in bytes
    - uncompressed_size: int, uncompressed page size in bytes
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, use categorical optimization
    - selfmade: bool, file created by fastparquet
    - assign: any, partition column assignment value
    - decoders: dict, custom decoder functions
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_data_page_v2(infile, page, compressed_size, uncompressed_size,
                      column, schema, use_cat=True, selfmade=True,
                      assign=None, decoders=None, row_filter=None):
    """
    Read and decode data page in v2 format.

    Parameters:
    - infile: file-like object, open parquet file
    - page: PageHeader, page metadata object
    - compressed_size: int, compressed page size
    - uncompressed_size: int, uncompressed page size
    - column: ColumnChunk, column metadata
    - schema: SchemaHelper, schema navigation
    - use_cat: bool, categorical optimization
    - selfmade: bool, fastparquet-created file
    - assign: any, partition value assignment
    - decoders: dict, custom decoders
    - row_filter: array, row selection mask

    Returns:
    tuple: (values, definition_levels, repetition_levels)
    """

def read_dictionary_page(infile, schema_helper):
    """
    Read dictionary page for categorical columns.

    Parameters:
    - infile: file-like object, open parquet file
    - schema_helper: SchemaHelper, schema navigation helper

    Returns:
    numpy.ndarray: Dictionary values
    """
```

### Filtering and Statistics

#### Filter Functions

Utility functions for working with parquet file filters and statistics.

```python { .api }
def filter_row_groups(pf, filters, as_idx=False):
    """
    Select row groups using filters.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, filtering conditions
    - as_idx: bool, return indices instead of row groups

    Returns:
    list: Filtered row groups or their indices
    """

def statistics(obj):
    """
    Return per-column statistics for a ParquetFile.

    Parameters:
    - obj: ParquetFile, ColumnChunk, or RowGroup

    Returns:
    dict: Statistics mapping (min, max, distinct_count, null_count) to columns
    """

def sorted_partitioned_columns(pf, filters=None):
    """
    Find columns that are sorted partition-by-partition.

    Parameters:
    - pf: ParquetFile, the parquet file object
    - filters: list, optional filtering conditions

    Returns:
    dict: Column names to min/max value ranges
    """
```
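
A sketch of using these helpers to preview which row groups a filter keeps and which columns are sorted across the dataset; the fastparquet.api import path and column names are assumptions:

```python
from fastparquet.api import filter_row_groups, sorted_partitioned_columns, statistics

keep = filter_row_groups(pf, [('year', '==', 2023)], as_idx=True)
print(keep)                            # indices of surviving row groups

print(statistics(pf))                  # per-column min/max/null_count
print(sorted_partitioned_columns(pf))  # columns monotonic across row groups
```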

## Usage Examples

### Basic File Reading

```python
from fastparquet import ParquetFile

# Open parquet file
pf = ParquetFile('data.parquet')

# Read all data
df = pf.to_pandas()

# Read specific columns
df_subset = pf.to_pandas(columns=['col1', 'col2'])

# Check file info
print(pf.info)
print(f"Columns: {pf.columns}")
print(f"Row count: {pf.count()}")
```

### Filtering Data

```python
# Single condition filter
df_filtered = pf.to_pandas(filters=[('age', '>', 25)])

# Multiple conditions (AND)
df_filtered = pf.to_pandas(filters=[('age', '>', 25), ('score', '>=', 80)])

# Multiple condition groups (OR)
df_filtered = pf.to_pandas(filters=[
    [('category', '==', 'A'), ('value', '>', 100)],  # Group 1
    [('category', '==', 'B'), ('value', '>', 200)],  # Group 2
])
```
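
Filters select whole row groups from their statistics, so non-matching rows can remain in the result; passing row_filter=True additionally applies the conditions to individual rows. A sketch:

```python
# Exact filtering: rows inside the surviving row groups are also tested.
df_exact = pf.to_pandas(filters=[('age', '>', 25)], row_filter=True)
```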

### Memory-Efficient Processing

```python
# Process large files in chunks
total_rows = 0
results = []
for chunk in pf.iter_row_groups():
    # Process each row group
    results.append(chunk.groupby('category').sum())
    total_rows += len(chunk)

print(f"Processed {total_rows} rows")

# Get sample of large file
sample = pf.head(1000)
```

### Working with Partitioned Datasets

```python
# Read partitioned dataset
pf = ParquetFile('/path/to/partitioned/dataset/')

# Access partition information
print(f"Partitions: {list(pf.cats.keys())}")
print(f"File scheme: {pf.file_scheme}")

# Filter by partition values
df = pf.to_pandas(filters=[('year', '==', 2023), ('month', 'in', [1, 2, 3])])
```

## Type Definitions

```python { .api }
# Filter specification
FilterCondition = Tuple[str, str, Any]  # (column, operator, value)
FilterGroup = List[FilterCondition]  # AND conditions
Filter = List[Union[FilterCondition, FilterGroup]]  # OR groups

# Supported filter operators
FilterOp = Literal['==', '=', '!=', '<', '<=', '>', '>=', 'in', 'not in']

# File opening function signature
OpenFunction = Callable[[str, str], Any]  # (path, mode) -> file-like object

# Filesystem interface
FileSystem = Any  # fsspec.AbstractFileSystem compatible
```
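
The aliases above describe expected shapes rather than names importable from fastparquet. A sketch that re-declares them locally to annotate a filter (column names are placeholders):

```python
from typing import Any, List, Tuple, Union

# Documentation-only aliases, re-declared here for the example.
FilterCondition = Tuple[str, str, Any]
FilterGroup = List[FilterCondition]
Filter = List[Union[FilterCondition, FilterGroup]]

# (age > 25) OR (status == 'vip'), expressed as OR-of-AND groups.
adults_or_vips: Filter = [
    [('age', '>', 25)],
    [('status', '==', 'vip')],
]
df = pf.to_pandas(filters=adults_or_vips)
```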