# PETL - Python ETL Library

A comprehensive Python library for extracting, transforming, and loading tables of data. PETL enables developers to efficiently process, manipulate, and transform data tables through a fluent API that supports various data sources including CSV, JSON, databases, Excel files, and remote sources. The library offers extensive transformation capabilities including filtering, sorting, joining, aggregating, pivoting, and data type conversions, with built-in support for handling large datasets through memory-efficient iterators and streaming operations.

## Package Information

- **Package Name**: petl
- **Language**: Python
- **Installation**: `pip install petl`
## Core Imports

```python
import petl
```

Common usage pattern:

```python
import petl as etl
```

All functions can be imported directly:

```python
from petl import fromcsv, tocsv, cut, select, join
```
## Basic Usage

```python
import petl as etl

# Load data from CSV (values are read as strings)
table = etl.fromcsv('data.csv')

# Transform data with method chaining
result = (
    table
    .cut('name', 'age', 'city')
    .convert('age', int)  # parse age before the numeric comparison
    .select('age', lambda age: age > 18)
    .sort('name')
)

# Save to new file
result.tocsv('adults.csv')

# Alternative functional style (tocsv writes the file and returns None)
etl.tocsv(
    etl.sort(
        etl.select(
            etl.convert(
                etl.cut(
                    etl.fromcsv('data.csv'),
                    'name', 'age', 'city'
                ),
                'age', int
            ),
            'age', lambda age: age > 18
        ),
        'name'
    ),
    'adults.csv'
)
```
## Architecture

PETL is built around the **Table** abstraction - an iterable container that represents tabular data. The core design principles include:

- **Lazy Evaluation**: Transformations are applied only when data is consumed, enabling efficient processing of large datasets
- **Functional Programming**: Operations return new table objects, allowing for method chaining and immutable transformations
- **Memory Efficiency**: Built-in support for external sorting, buffering, and streaming for memory-constrained environments
- **Source Agnostic**: Unified interface for reading/writing data regardless of source format
- **Iterator-based**: All operations use Python iterators and generators for optimal memory usage

The library is organized into six main modules: **util** (table operations), **transform** (data transformations), **io** (input/output), **comparison** (sorting utilities), **errors** (exceptions), and **config** (settings).
## Capabilities
### Table Creation and Access

Core functionality for creating, accessing, and manipulating table structures. Includes table creation from various sources, header manipulation, data access, and conversion to different formats.

```python { .api }
def empty() -> Table: ...
def header(table): ...
def fieldnames(table): ...
def data(table, *sliceargs): ...
def records(table, *sliceargs, **kwargs): ...
def dicts(table, *sliceargs, **kwargs): ...
def namedtuples(table, *sliceargs, **kwargs): ...
def values(table, *field, **kwargs): ...
def columns(table, missing=None): ...
def facetcolumns(table, key, missing=None): ...
def expr(expression_text, trusted=True): ...
def rowgroupby(table, key, value=None): ...
```

[Table Operations](./table-operations.md)
### Data Input/Output

Comprehensive support for reading and writing data from various sources including CSV, JSON, XML, Excel, databases, and many specialized formats. Supports both file-based and streaming I/O operations.

```python { .api }
def fromcsv(source=None, **kwargs) -> Table: ...
def tocsv(table, source=None, **kwargs): ...
def fromjson(source, **kwargs) -> Table: ...
def tojson(table, source=None, **kwargs): ...
def fromdb(dbo, query, **kwargs) -> Table: ...
def todb(table, dbo, tablename, **kwargs): ...
```

[Data I/O](./data-io.md)
### Data Transformations

Core transformation operations for reshaping, filtering, and manipulating table data. Includes field operations, row selection, data conversion, and structural transformations.

```python { .api }
def cut(table, *args, **kwargs) -> Table: ...
def select(table, *args, **kwargs) -> Table: ...
def convert(table, *args, **kwargs) -> Table: ...
def rename(table, *args, **kwargs) -> Table: ...
def addfield(table, field, value=None, **kwargs) -> Table: ...
```

[Basic Transformations](./basic-transformations.md)
### Sorting and Joining

Advanced operations for sorting data and combining multiple tables through various join types. Includes both memory-based and disk-based sorting for large datasets.

```python { .api }
def sort(table, key=None, reverse=False, **kwargs) -> Table: ...
def join(table1, table2, key=None, **kwargs) -> Table: ...
def leftjoin(left, right, key=None, **kwargs) -> Table: ...
def outerjoin(left, right, key=None, **kwargs) -> Table: ...
def hashjoin(left, right, key=None, **kwargs) -> Table: ...
```

[Sorting and Joins](./sorting-joins.md)
### Aggregation and Grouping

Functions for grouping data and performing aggregation operations. Supports custom aggregation functions and reduction operations on grouped data.

```python { .api }
def aggregate(table, key, aggregation=None, value=None, **kwargs) -> Table: ...
def rowreduce(table, key, reducer, **kwargs) -> Table: ...
def merge(*tables, key=None, **kwargs) -> Table: ...
def fold(table, key, f, **kwargs) -> Table: ...
```

[Aggregation](./aggregation.md)
### Data Reshaping

Advanced reshaping operations including pivoting, melting, transposing, and data normalization. Essential for converting between wide and long data formats.

```python { .api }
def melt(table, key=None, **kwargs) -> Table: ...
def recast(table, key=None, **kwargs) -> Table: ...
def pivot(table, f1, f2, f3, aggfun, **kwargs) -> Table: ...
def transpose(table) -> Table: ...
def flatten(table) -> Table: ...
```

[Data Reshaping](./data-reshaping.md)
### Validation and Analysis

Tools for data validation, quality assessment, and statistical analysis. Includes data profiling, constraint validation, and summary statistics.

```python { .api }
def validate(table, constraints=None, **kwargs) -> Table: ...
def look(table, limit=0, **kwargs): ...
def see(table, limit=0, **kwargs): ...
def stats(table, field): ...
def valuecounts(table, *field, **kwargs) -> Table: ...
def limits(table, field): ...
def nrows(table): ...
def typecounts(table, field) -> Table: ...
```

[Validation and Analysis](./validation-analysis.md)
## Types

```python { .api }
class Table(IterContainer):
    """Main table abstraction supporting iteration and transformations."""
    def __iter__(self): ...
    def __getitem__(self, item): ...  # Field access via string or row slicing
    def cut(self, *args, **kwargs) -> 'Table': ...
    def select(self, *args, **kwargs) -> 'Table': ...
    def sort(self, key=None, **kwargs) -> 'Table': ...
    # ... all transformation functions are also available as methods

class Record(tuple):
    """Tuple subclass for table rows with field access by name or index."""
    def __init__(self, row, flds, missing=None): ...
    def __getitem__(self, f): ...  # Access by field name or index
    def __getattr__(self, f): ...  # Access by field name as attribute
    def get(self, key, default=None): ...  # Safe field access

class RecordsView(IterContainer):
    """Container for table records with record-based access."""
    def __iter__(self): ...
    def __getitem__(self, key): ...

class ValuesView(IterContainer):
    """Container for values from a specific field."""
    def __iter__(self): ...
    def __getitem__(self, key): ...

class IterContainer:
    """Base container class for iterable data structures with utility methods."""
    def __iter__(self): ...
    def __getitem__(self, key): ...
    def __len__(self): ...
    def __contains__(self, item): ...
    def list(self): ...
    def tuple(self): ...
    def set(self): ...
    def counter(self): ...

# Source classes for data I/O
class FileSource:
    """File-based data source."""
    def __init__(self, filename, mode='r', **kwargs): ...

class URLSource:
    """URL-based data source for web resources."""
    def __init__(self, url, **kwargs): ...

class GzipSource:
    """Gzip-compressed file source."""
    def __init__(self, filename, mode='r', **kwargs): ...

class ZipSource:
    """ZIP archive file source."""
    def __init__(self, filename, member=None, **kwargs): ...

# Exception classes
class DuplicateKeyError(Exception):
    """Raised when duplicate keys are found where unique keys are expected."""

class FieldSelectionError(Exception):
    """Raised when field selection is invalid or ambiguous."""

class ArgumentError(Exception):
    """Raised when function arguments are invalid."""
```