# PETL - Python ETL Library

A comprehensive Python library for extracting, transforming and loading tables of data. PETL enables developers to efficiently process, manipulate, and transform data tables through a fluent API that supports various data sources including CSV, JSON, databases, Excel files, and remote sources. The library offers extensive transformation capabilities including filtering, sorting, joining, aggregating, pivoting, and data type conversions, with built-in support for handling large datasets through memory-efficient iterators and streaming operations.

## Package Information

- **Package Name**: petl
- **Language**: Python
- **Installation**: `pip install petl`

## Core Imports

```python
import petl
```

Common usage pattern:

```python
import petl as etl
```

All functions can be imported directly:

```python
from petl import fromcsv, tocsv, cut, select, join
```

## Basic Usage

```python
import petl as etl

# Load data from CSV (all values are read as strings)
table = etl.fromcsv('data.csv')

# Transform data with method chaining
result = (table
          .cut('name', 'age', 'city')
          .convert('age', int)                  # parse ages so the numeric comparison works
          .select('age', lambda age: age > 18)
          .sort('name'))

# Save to new file
result.tocsv('adults.csv')

# Alternative functional style
etl.tocsv(
    etl.sort(
        etl.select(
            etl.convert(
                etl.cut(
                    etl.fromcsv('data.csv'),
                    'name', 'age', 'city'
                ),
                'age', int
            ),
            'age', lambda age: age > 18
        ),
        'name'
    ),
    'adults.csv'
)
```

## Architecture

PETL is built around the **Table** abstraction - an iterable container that represents tabular data. The core design principles include:

- **Lazy Evaluation**: Transformations are applied only when data is consumed, enabling efficient processing of large datasets
- **Functional Programming**: Operations return new table objects, allowing for method chaining and immutable transformations
- **Memory Efficiency**: Built-in support for external sorting, buffering, and streaming for memory-constrained environments
- **Source Agnostic**: Unified interface for reading/writing data regardless of source format
- **Iterator-based**: All operations use Python iterators and generators for optimal memory usage

The library is organized into six main modules: **util** (table operations), **transform** (data transformations), **io** (input/output), **comparison** (sorting utilities), **errors** (exceptions), and **config** (settings).

## Capabilities

### Table Creation and Access

73

74

Core functionality for creating, accessing, and manipulating table structures. Includes table creation from various sources, header manipulation, data access, and conversion to different formats.

75

76

```python { .api }

77

def empty() -> Table: ...

78

def header(table): ...

79

def fieldnames(table): ...

80

def data(table, *sliceargs): ...

81

def records(table, *sliceargs, **kwargs): ...

82

def dicts(table, *sliceargs, **kwargs): ...

83

def namedtuples(table, *sliceargs, **kwargs): ...

84

def values(table, *field, **kwargs): ...

85

def columns(table, missing=None): ...

86

def facetcolumns(table, key, missing=None): ...

87

def expr(expression_text, trusted=True): ...

88

def rowgroupby(table, key, value=None): ...

89

```

90

91

[Table Operations](./table-operations.md)

### Data Input/Output

Comprehensive support for reading and writing data from various sources including CSV, JSON, XML, Excel, databases, and many specialized formats. Supports both file-based and streaming I/O operations.

```python { .api }
def fromcsv(source=None, **kwargs) -> Table: ...
def tocsv(table, source=None, **kwargs): ...
def fromjson(source, **kwargs) -> Table: ...
def tojson(table, source=None, **kwargs): ...
def fromdb(dbo, query, **kwargs) -> Table: ...
def todb(table, dbo, tablename, **kwargs): ...
```

[Data I/O](./data-io.md)

### Data Transformations

Core transformation operations for reshaping, filtering, and manipulating table data. Includes field operations, row selection, data conversion, and structural transformations.

```python { .api }
def cut(table, *args, **kwargs) -> Table: ...
def select(table, *args, **kwargs) -> Table: ...
def convert(table, *args, **kwargs) -> Table: ...
def rename(table, *args, **kwargs) -> Table: ...
def addfield(table, field, value=None, **kwargs) -> Table: ...
```

[Basic Transformations](./basic-transformations.md)

### Sorting and Joining

Advanced operations for sorting data and combining multiple tables through various join types. Includes both memory-based and disk-based sorting for large datasets.

```python { .api }
def sort(table, key=None, reverse=False, **kwargs) -> Table: ...
def join(table1, table2, key=None, **kwargs) -> Table: ...
def leftjoin(left, right, key=None, **kwargs) -> Table: ...
def outerjoin(left, right, key=None, **kwargs) -> Table: ...
def hashjoin(left, right, key=None, **kwargs) -> Table: ...
```

[Sorting and Joins](./sorting-joins.md)

### Aggregation and Grouping

Functions for grouping data and performing aggregation operations. Supports custom aggregation functions and reduction operations on grouped data.

```python { .api }
def aggregate(table, key, aggregation=None, **kwargs) -> Table: ...
def rowreduce(table, key, reducer, **kwargs) -> Table: ...
def merge(*tables, **kwargs) -> Table: ...
def fold(table, key, f, **kwargs) -> Table: ...
```

[Aggregation](./aggregation.md)

### Data Reshaping

Advanced reshaping operations including pivoting, melting, transposing, and data normalization. Essential for converting between wide and long data formats.

```python { .api }
def melt(table, key=None, **kwargs) -> Table: ...
def recast(table, key=None, **kwargs) -> Table: ...
def pivot(table, f1, f2, f3, aggfun, **kwargs) -> Table: ...
def transpose(table) -> Table: ...
def flatten(table) -> Table: ...
```

[Data Reshaping](./data-reshaping.md)

### Validation and Analysis

Tools for data validation, quality assessment, and statistical analysis. Includes data profiling, constraint validation, and summary statistics.

```python { .api }
def validate(table, constraints=None, **kwargs) -> Table: ...
def look(table, limit=0, **kwargs): ...
def see(table, limit=0, **kwargs): ...
def stats(table, field): ...
def valuecounts(table, *field, **kwargs) -> Table: ...
def limits(table, field): ...
def nrows(table): ...
def typecounts(table, field) -> Table: ...
```

[Validation and Analysis](./validation-analysis.md)

## Types

```python { .api }
class Table(IterContainer):
    """Main table abstraction supporting iteration and transformations."""
    def __iter__(self): ...
    def __getitem__(self, item): ...  # Field access via string or row slicing
    def cut(self, *args, **kwargs) -> 'Table': ...
    def select(self, *args, **kwargs) -> 'Table': ...
    def sort(self, key=None, **kwargs) -> 'Table': ...
    # ... all transformation functions are also available as methods

class Record(tuple):
    """Named tuple subclass for table rows with field access by name or index."""
    def __init__(self, row, flds, missing=None): ...
    def __getitem__(self, f): ...  # Access by field name or index
    def __getattr__(self, f): ...  # Access by field name as attribute
    def get(self, key, default=None): ...  # Safe field access

class RecordsView(IterContainer):
    """Container for table records with record-based access."""
    def __iter__(self): ...
    def __getitem__(self, key): ...

class ValuesView(IterContainer):
    """Container for values from a specific field."""
    def __iter__(self): ...
    def __getitem__(self, key): ...

class IterContainer:
    """Base container class for iterable data structures with utility methods."""
    def __iter__(self): ...
    def __getitem__(self, key): ...
    def __len__(self): ...
    def __contains__(self, item): ...
    def list(self): ...
    def tuple(self): ...
    def set(self): ...
    def counter(self): ...

# Source classes for data I/O
class FileSource:
    """File-based data source."""
    def __init__(self, filename, mode='r', **kwargs): ...

class URLSource:
    """URL-based data source for web resources."""
    def __init__(self, url, **kwargs): ...

class GzipSource:
    """Gzip-compressed file source."""
    def __init__(self, filename, mode='r', **kwargs): ...

class ZipSource:
    """ZIP archive file source."""
    def __init__(self, filename, member=None, **kwargs): ...

# Exception classes
class DuplicateKeyError(Exception):
    """Raised when duplicate keys are found where unique keys are expected."""

class FieldSelectionError(Exception):
    """Raised when field selection is invalid or ambiguous."""

class ArgumentError(Exception):
    """Raised when function arguments are invalid."""
```