# Writing Parquet Files

Functions for writing pandas DataFrames to the parquet format, with options for compression, partitioning, encoding, and performance tuning.

## Capabilities

### Main Write Function

The primary function for writing pandas DataFrames to parquet files with full control over format options.

```python { .api }
def write(filename, data, row_group_offsets=None, compression=None,
          file_scheme='simple', open_with=None, mkdirs=None,
          has_nulls=True, write_index=None, partition_on=[],
          fixed_text=None, append=False, object_encoding='infer',
          times='int64', custom_metadata=None, stats="auto"):
    """
    Write pandas DataFrame to parquet file.

    Parameters:
    - filename: str, output parquet file or directory path
    - data: pandas.DataFrame, data to write
    - row_group_offsets: int or list, row group size control
    - compression: str or dict, compression algorithm(s) to use
    - file_scheme: str, file organization ('simple', 'hive', 'drill')
    - open_with: function, custom file opener
    - mkdirs: function, directory creation function
    - has_nulls: bool or list, null value handling specification
    - write_index: bool, whether to write DataFrame index as column
    - partition_on: list, columns to partition data by
    - fixed_text: dict, fixed-length string specifications
    - append: bool, append to existing dataset
    - object_encoding: str or dict, object column encoding method
    - times: str, timestamp encoding format ('int64' or 'int96')
    - custom_metadata: dict, additional metadata to store
    - stats: bool, list, or "auto", statistics calculation control
    """
```

### Specialized Write Functions

#### Simple File Writing

Write all data to a single parquet file.

```python { .api }
def write_simple(fn, data, fmd, row_group_offsets=None, compression=None,
                 open_with=None, has_nulls=None, append=False, stats=True):
    """
    Write to single parquet file.

    Parameters:
    - fn: str, output file path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - has_nulls: bool or list, null handling specification
    - append: bool, append to existing file
    - stats: bool or list, statistics calculation control
    """
```
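
In practice the single-file path is usually reached through the top-level `write()`, which builds the `FileMetaData` object itself and then hands off to `write_simple`. A minimal sketch of the equivalent public-API call, assuming a DataFrame `df`:

```python
from fastparquet import write

# The default file_scheme='simple' produces one parquet file; write() constructs
# the metadata and delegates the actual writing to write_simple internally.
write('single_file.parquet', df, compression='SNAPPY')
```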

#### Multi-File Writing

Write data across multiple files with partitioning support.

```python { .api }
def write_multi(dn, data, fmd, row_group_offsets=None, compression=None,
                file_scheme='hive', write_fmd=True, open_with=None,
                mkdirs=None, partition_on=[], append=False, stats=True):
    """
    Write to multiple parquet files with partitioning.

    Parameters:
    - dn: str, output directory path
    - data: pandas.DataFrame or iterable of DataFrames
    - fmd: FileMetaData, parquet metadata object
    - row_group_offsets: int or list, row group size specification
    - compression: str or dict, compression settings
    - file_scheme: str, partitioning scheme ('hive', 'drill', 'flat')
    - write_fmd: bool, write common metadata files
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - partition_on: list, partitioning column names
    - append: bool, append to existing dataset
    - stats: bool or list, statistics calculation control
    """
```
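
Partitioned output is likewise normally produced through `write()` with `file_scheme='hive'` or `'drill'`, which routes to `write_multi`. A sketch of a hive-style write and the kind of layout it produces (directory and file names are illustrative):

```python
from fastparquet import write

# Expected layout (roughly):
#   dataset_dir/_metadata
#   dataset_dir/_common_metadata
#   dataset_dir/category=A/part.0.parquet
#   dataset_dir/category=B/part.0.parquet
write('dataset_dir', df, file_scheme='hive', partition_on=['category'])
```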

### Data Type and Schema Functions

#### Type Detection

Determine appropriate parquet types for pandas data.

```python { .api }
def find_type(data, fixed_text=None, object_encoding=None,
              times='int64', is_index=None):
    """
    Determine appropriate parquet type codes for pandas Series.

    Parameters:
    - data: pandas.Series, input data to analyze
    - fixed_text: int, fixed-length string size
    - object_encoding: str, encoding method for object columns
    - times: str, timestamp format ('int64' or 'int96')
    - is_index: bool, whether data represents an index column

    Returns:
    tuple: (schema_element, type_code)
    """
```

#### Data Conversion

Convert pandas data to parquet-compatible format.

```python { .api }
def convert(data, se):
    """
    Convert pandas data according to schema element specification.

    Parameters:
    - data: pandas.Series, input data to convert
    - se: SchemaElement, parquet schema element describing target format

    Returns:
    numpy.ndarray: Converted data ready for parquet encoding
    """
```
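
A minimal sketch of how these two internals fit together, assuming both are importable from `fastparquet.writer` and accept a pandas Series as documented above (they are not part of the public API):

```python
import pandas as pd
from fastparquet.writer import convert, find_type

s = pd.Series(pd.date_range('2023-01-01', periods=3))
# Map the Series to a parquet schema element plus thrift type code
se, type_code = find_type(s, times='int64')
# Convert the datetime64 values to the int64 representation parquet stores
encoded = convert(s, se)
```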

#### Metadata Creation

Generate parquet file metadata from pandas DataFrame.

```python { .api }
def make_metadata(data, has_nulls=True, ignore_columns=None,
                  fixed_text=None, object_encoding=None, times='int64',
                  index_cols=None, partition_cols=None, cols_dtype="object"):
    """
    Create parquet file metadata from pandas DataFrame.

    Parameters:
    - data: pandas.DataFrame, source data
    - has_nulls: bool or list, null value specifications
    - ignore_columns: list, columns to exclude from metadata
    - fixed_text: dict, fixed-length text specifications
    - object_encoding: str or dict, object encoding methods
    - times: str, timestamp encoding format
    - index_cols: list, index column specifications
    - partition_cols: list, partition column names
    - cols_dtype: str, default column dtype

    Returns:
    FileMetaData: Parquet metadata object
    """
```
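
A short sketch of building metadata directly; normally `write()` does this for you, and `make_metadata` is assumed here to be importable from `fastparquet.writer`:

```python
from fastparquet.writer import make_metadata

# Describe df's columns as a parquet schema; object columns stored as UTF-8 strings
fmd = make_metadata(df, object_encoding='utf8')
print(fmd.num_rows)  # row count recorded in the footer metadata
```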

### Column-Level Writing

#### Individual Column Writing

Write single column data with full control over encoding and compression.

```python { .api }
def write_column(f, data0, selement, compression=None,
                 datapage_version=None, stats=True):
    """
    Write single column to parquet file.

    Parameters:
    - f: file, open binary file for writing
    - data0: pandas.Series, column data to write
    - selement: SchemaElement, column schema specification
    - compression: str or dict, compression settings
    - datapage_version: int, parquet data page version (1 or 2)
    - stats: bool, calculate and write column statistics

    Returns:
    ColumnChunk: Parquet column chunk metadata
    """
```

### Metadata Management

#### Common Metadata Writing

Write shared metadata files for multi-file datasets.

```python { .api }
def write_common_metadata(fn, fmd, open_with=None, no_row_groups=True):
    """
    Write parquet schema to shared metadata file.

    Parameters:
    - fn: str, metadata file path
    - fmd: FileMetaData, metadata to write
    - open_with: function, file opening function
    - no_row_groups: bool, exclude row group info for common metadata
    """
```
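
As a hedged sketch, the shared metadata file of an existing hive dataset could be regenerated like this (the dataset path and the use of the built-in `open` as the opener are assumptions):

```python
from fastparquet import ParquetFile
from fastparquet.writer import write_common_metadata

pf = ParquetFile('dataset_dir')
# Write a schema-only _common_metadata file next to the data files
write_common_metadata('dataset_dir/_common_metadata', pf.fmd, open_with=open)
```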

#### Custom Metadata Updates

Update file metadata without rewriting data.

```python { .api }
def update_file_custom_metadata(path, custom_metadata, is_metadata_file=None):
    """
    Update custom metadata in parquet file without rewriting data.

    Parameters:
    - path: str, path to parquet file
    - custom_metadata: dict, metadata key-value pairs to update
    - is_metadata_file: bool, whether target is pure metadata file
    """
```
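
For example, a sketch of tagging an already-written file; the key/value pairs end up in the footer's key-value metadata, and the import path `fastparquet.writer` is an assumption:

```python
from fastparquet.writer import update_file_custom_metadata

# Rewrites only the file footer, leaving the data pages untouched
update_file_custom_metadata('output.parquet',
                            {'pipeline_run': '2023-06-01', 'owner': 'etl-team'})
```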

### Low-Level Writing Functions

#### Row Group and Partition Writing

Low-level functions for creating individual row groups and partition files.

```python { .api }
def make_row_group(df, schema, compression=None, stats=True,
                   has_nulls=True, fmd=None):
    """
    Create row group metadata from DataFrame.

    Parameters:
    - df: pandas.DataFrame, data for the row group
    - schema: list, parquet schema elements
    - compression: str or dict, compression settings
    - stats: bool or list, statistics calculation control
    - has_nulls: bool or list, null value specifications
    - fmd: FileMetaData, file metadata object

    Returns:
    RowGroup: Row group metadata object
    """

def make_part_file(filename, rg, schema, fmd, compression=None,
                   open_with=None, sep=None):
    """
    Write single partition file.

    Parameters:
    - filename: str, output file path
    - rg: RowGroup, row group to write
    - schema: list, parquet schema elements
    - fmd: FileMetaData, file metadata
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - sep: str, path separator for platform compatibility

    Returns:
    int: Bytes written to file
    """
```

#### Data Encoding Functions

Functions for encoding column data in different formats.

```python { .api }
def encode_plain(data, se):
    """
    Encode data using plain encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    bytes: Encoded data
    """

def encode_dict(data, se):
    """
    Encode data using dictionary encoding.

    Parameters:
    - data: numpy.ndarray, data to encode
    - se: SchemaElement, schema element specification

    Returns:
    tuple: (encoded_data, dictionary_data)
    """
```

### Dataset Operations

#### Appending and Row Group Management

Add new data to existing parquet datasets.

```python { .api }
# ParquetFile methods for dataset modification
def write_row_groups(self, data, row_group_offsets=None, sort_key=None,
                     sort_pnames=False, compression=None, write_fmd=True,
                     open_with=None, mkdirs=None, stats="auto"):
    """
    Write data as new row groups to existing dataset.

    Parameters:
    - data: pandas.DataFrame or iterable, data to add
    - row_group_offsets: int or list, row group size control
    - sort_key: function, sorting key for row group ordering
    - sort_pnames: bool, align partition file names with positions
    - compression: str or dict, compression settings
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - stats: bool, list, or "auto", statistics calculation control
    """

def remove_row_groups(self, rgs, sort_pnames=False, write_fmd=True,
                      open_with=None, remove_with=None):
    """
    Remove row groups from existing dataset.

    Parameters:
    - rgs: list, row groups to remove
    - sort_pnames: bool, align partition file names
    - write_fmd: bool, update common metadata
    - open_with: function, file opening function
    - remove_with: function, file removal function
    """
```
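
A sketch of trimming a dataset in place, assuming `rgs` takes the row-group objects exposed by `ParquetFile.row_groups`:

```python
from fastparquet import ParquetFile

pf = ParquetFile('dataset_dir')
# Drop the first row group; with write_fmd=True (the default) the shared
# _metadata file is rewritten to reflect the removal
pf.remove_row_groups(pf.row_groups[:1])
```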

#### Dataset Merging and Overwriting

Advanced dataset management operations.

```python { .api }
def merge(file_list, verify_schema=True, open_with=None, root=False):
    """
    Create logical dataset from multiple parquet files.

    Parameters:
    - file_list: list, paths to parquet files or ParquetFile instances
    - verify_schema: bool, verify schema consistency across files
    - open_with: function, file opening function
    - root: str, dataset root directory

    Returns:
    ParquetFile: Merged dataset representation
    """

def overwrite(dirpath, data, row_group_offsets=None, sort_pnames=True,
              compression=None, open_with=None, mkdirs=None,
              remove_with=None, stats=True):
    """
    Overwrite partitions in existing parquet dataset.

    Parameters:
    - dirpath: str, dataset directory path
    - data: pandas.DataFrame, new data to write
    - row_group_offsets: int or list, row group size specification
    - sort_pnames: bool, align partition file names
    - compression: str or dict, compression settings
    - open_with: function, file opening function
    - mkdirs: function, directory creation function
    - remove_with: function, file removal function
    - stats: bool or list, statistics calculation control
    """
```
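
A sketch of stitching independently written files into one logical dataset; `merge` is assumed importable from `fastparquet.writer`, and when the inputs share a root directory it also writes a `_metadata` file there:

```python
from fastparquet.writer import merge

# Build one logical dataset over two single-file writes
pf = merge(['dataset_dir/part.0.parquet', 'dataset_dir/part.1.parquet'])
print(len(pf.row_groups))  # row groups pooled from both inputs
```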

## Usage Examples

### Basic Writing

```python
import pandas as pd
from fastparquet import write

# Create sample data
df = pd.DataFrame({
    'id': range(1000),
    'value': range(1000, 2000),
    'category': ['A', 'B', 'C'] * 333 + ['A'],
    'timestamp': pd.date_range('2023-01-01', periods=1000, freq='H')
})

# Write to parquet file
write('output.parquet', df)

# Write with compression
write('output_compressed.parquet', df, compression='GZIP')

# Write specific columns only
write('output_subset.parquet', df[['id', 'value']])
```

### Compression Options

```python
# String compression (applied to all columns)
write('data.parquet', df, compression='SNAPPY')

# Per-column compression
write('data.parquet', df, compression={
    'id': 'GZIP',
    'value': 'SNAPPY',
    'category': 'LZ4',
    'timestamp': None,  # No compression
    '_default': 'GZIP'  # Default for unlisted columns
})

# Advanced compression with arguments
write('data.parquet', df, compression={
    'value': {
        'type': 'LZ4',
        'args': {'mode': 'high_compression', 'compression': 9}
    },
    'category': {
        'type': 'SNAPPY',
        'args': None
    }
})
```

424

425

### Partitioned Datasets

426

427

```python

428

# Partition by single column

429

write('partitioned_data', df,

430

file_scheme='hive',

431

partition_on=['category'])

432

433

# Partition by multiple columns

434

write('partitioned_data', df,

435

file_scheme='hive',

436

partition_on=['category', 'year'])

437

438

# Drill-style partitioning (directory names as values)

439

write('partitioned_data', df,

440

file_scheme='drill',

441

partition_on=['category'])

442

```

443

444

### Advanced Options

445

446

```python

447

# Control row group sizes

448

write('data.parquet', df, row_group_offsets=50000) # ~50k rows per group

449

write('data.parquet', df, row_group_offsets=[0, 100, 500, 1000]) # Explicit offsets

450

451

# Handle object columns

452

write('data.parquet', df, object_encoding={

453

'text_col': 'utf8',

454

'json_col': 'json',

455

'binary_col': 'bytes'

456

})

457

458

# Write with custom metadata

459

write('data.parquet', df, custom_metadata={

460

'created_by': 'my_application',

461

'version': '1.0.0',

462

'description': 'Sample dataset'

463

})

464

465

# Control statistics calculation

466

write('data.parquet', df, stats=['id', 'value']) # Only for specific columns

467

write('data.parquet', df, stats=False) # Disable statistics

468

write('data.parquet', df, stats="auto") # Auto-detect (default)

469

```

### Appending Data

```python
from fastparquet import ParquetFile

# Append to existing file
new_data = pd.DataFrame({'id': [1001, 1002], 'value': [2001, 2002]})
write('existing.parquet', new_data, append=True)

# Append using ParquetFile methods
pf = ParquetFile('existing.parquet')
pf.write_row_groups(new_data)
```

## Type Definitions

```python { .api }
from typing import Any, Dict, List, Literal, Union

# File scheme options
FileScheme = Literal['simple', 'hive', 'drill']

# Compression specification
CompressionType = Union[
    str,  # Algorithm name
    Dict[str, Union[str, None, Dict[str, Any]]]  # Per-column with options
]

# Object encoding options
ObjectEncoding = Union[
    Literal['infer', 'utf8', 'bytes', 'json', 'bson', 'bool', 'int', 'int32', 'float', 'decimal'],
    Dict[str, str]  # Per-column encoding
]

# Row group size specification
RowGroupSpec = Union[int, List[int]]

# Statistics specification
StatsSpec = Union[bool, Literal["auto"], List[str]]

# Null handling specification
NullsSpec = Union[bool, Literal['infer'], List[str]]

# Custom metadata
CustomMetadata = Dict[str, Union[str, bytes]]
```