Python package for manipulating 2-dimensional tabular data structures with emphasis on speed and big data support
npx @tessl/cli install tessl/pypi-datatable@1.1.00
# datatable

A high-performance Python library for manipulating 2-dimensional tabular data structures, with an emphasis on speed and support for big data (up to 100GB) on single-node machines. It features column-oriented data storage with a native-C implementation, fast CSV reading, multi-threaded processing, and an expressive query syntax similar to R's data.table.

## Package Information

- **Package Name**: datatable
- **Language**: Python
- **Installation**: `pip install datatable`

## Core Imports

```python
import datatable as dt
from datatable import f, g, by, join
```

Common pattern for data manipulation:

```python
import datatable as dt
from datatable import f, g, by
```

## Basic Usage

```python
import datatable as dt
from datatable import f, g, by

# Read data from CSV
DT = dt.fread("data.csv")

# Create a Frame from data
DT = dt.Frame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Basic operations
result = DT[:, f.A]              # Select column A
result = DT[f.A > 2, :]          # Filter rows where A > 2
result = DT[:, dt.sum(f.A)]      # Aggregate sum of column A

# Groupby operations
result = DT[:, dt.sum(f.A), by(f.B)]  # Sum A grouped by B

# Update operations
DT[:, dt.update(D=f.A * 2)]      # Add new column D

# Join operations
DT2 = dt.Frame({'B': ['a', 'b'], 'X': [10, 20]})
DT2.key = 'B'                    # The joined frame must be keyed on the join column
result = DT[:, :, dt.join(DT2)]  # Left join on the key column B
```

## Architecture

datatable follows a columnar storage architecture for performance:

- **Frame**: Main data structure representing a 2D table with column-oriented storage
- **Expression System**: f/g objects for column references and expression building
- **Type System**: Comprehensive stype/ltype system for precise data type control
- **Native-C Core**: Performance-critical operations implemented in C for speed
- **Memory Mapping**: Support for out-of-memory operations on large datasets

The library is designed specifically for machine learning applications requiring fast feature generation from large datasets, with copy-on-write semantics and rowindex views to minimize data copying.

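The idea of column-oriented storage combined with rowindex views can be pictured with a toy sketch in plain Python. This is an illustration only and not datatable's implementation; the `ColumnStore` and `RowIndexView` classes are hypothetical names introduced here.

```python
# Toy illustration only: NOT datatable's actual implementation.
# ColumnStore and RowIndexView are hypothetical names for this sketch.

class ColumnStore:
    """Stores a table as one Python list per column."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values


class RowIndexView:
    """A filtered view that remembers which rows to show and copies no column data."""

    def __init__(self, store, rowindex):
        self.store = store
        self.rowindex = rowindex  # list of row positions in the parent store

    def column(self, name):
        # Values are materialized only when a column is actually requested.
        data = self.store.columns[name]
        return [data[i] for i in self.rowindex]


store = ColumnStore({"A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"]})

# Filtering "A > 2" only needs to scan column A and record the matching rows.
keep = [i for i, a in enumerate(store.columns["A"]) if a > 2]
view = RowIndexView(store, keep)

print(view.column("B"))  # ['c', 'd', 'e']
```

The view shares the parent's column lists, so filtering costs one list of row positions rather than a copy of every column.
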
## Capabilities

### Core Data Structure

The Frame class provides the main interface for tabular data manipulation with high-performance columnar storage and comprehensive data type support.

```python { .api }
class Frame:
    def __init__(self, data=None, *, names=None, stypes=None,
                 stype=None, types=None, type=None): ...

    @property
    def shape(self) -> tuple: ...
    @property
    def names(self) -> tuple: ...
    @property
    def stypes(self) -> tuple: ...

    def __getitem__(self, key): ...
    def __setitem__(self, key, value): ...
```

[Core Data Structures](./core-data-structures.md)

### Expression System

Column references and expression building using f and g objects for flexible data queries and transformations.

```python { .api }
# Column reference objects
f: object  # Primary column reference
g: object  # Secondary column reference (for joins)

class FExpr:
    """Expression object for column operations"""
    pass

class Namespace:
    """Namespace for organizing column references"""
    pass
```

[Expression System](./expression-system.md)

### File I/O Operations

High-performance reading and writing of various file formats with automatic type detection and memory-efficient processing.

```python { .api }
def fread(anysource=None, *, file=None, text=None, cmd=None,
          url=None, **kwargs) -> Frame: ...

def iread(anysource=None, *, file=None, text=None, cmd=None,
          url=None, **kwargs): ...  # Iterator version
```

[File I/O](./file-io.md)

### Data Manipulation Functions

Comprehensive set of functions for combining, transforming, and reshaping data frames.

```python { .api }
def cbind(*frames) -> Frame: ...
def rbind(*frames, force=False, bynames=True) -> Frame: ...
def unique(frame, *cols) -> Frame: ...
def sort(frame, *cols) -> Frame: ...
def update(**kwargs): ...      # Update/add columns
def fillna(frame, value): ...  # Fill missing values
def repeat(frame, n): ...      # Repeat rows n times
def shift(frame, n): ...       # Shift values by n positions
```

[Data Manipulation](./data-manipulation.md)

### Reduction and Aggregation

Statistical and mathematical reduction functions for data analysis and aggregation operations.

```python { .api }
def sum(expr): ...
def mean(expr): ...
def count(expr=None): ...
def min(expr): ...
def max(expr): ...
def median(expr): ...
def sd(expr): ...  # Standard deviation
def nunique(expr): ...
```

[Reductions and Aggregations](./reductions-aggregations.md)

### Mathematical Functions

Comprehensive mathematical operations including trigonometric, logarithmic, and statistical functions.

```python { .api }
def abs(x): ...
def exp(x): ...
def log(x): ...
def log10(x): ...
def sqrt(x): ...
def isna(x): ...
def ifelse(condition, x, y): ...  # Conditional selection
```

[Mathematical Functions](./mathematical-functions.md)

### Set Operations

Mathematical set operations for combining and comparing data frames.

```python { .api }
def union(*frames) -> Frame: ...
def intersect(*frames) -> Frame: ...
def setdiff(frame1, frame2) -> Frame: ...
def symdiff(frame1, frame2) -> Frame: ...
```

[Set Operations](./set-operations.md)

### Row-wise Operations

Element-wise operations across columns within rows for complex transformations.

```python { .api }
def rowall(*cols): ...
def rowany(*cols): ...
def rowcount(*cols): ...
def rowsum(*cols): ...
def rowmean(*cols): ...
```

[Row-wise Operations](./row-operations.md)

### String Operations

Text processing and manipulation functions for string columns.

```python { .api }
# String module functions
def len(x): ...                      # String length
def slice(x, start, stop=None): ...  # String slicing
```

[String Operations](./string-operations.md)

### Time Operations

Date and time manipulation functions for temporal data analysis.

```python { .api }
def year(x): ...
def month(x): ...
def day(x): ...
def hour(x): ...
def minute(x): ...
def second(x): ...
```

[Time Operations](./time-operations.md)

### Type System and Conversion

Comprehensive type system with storage types (stype) and logical types (ltype) for precise data type control.

```python { .api }
class stype(Enum):
    void = 0
    bool8 = 1
    int8 = 2
    int16 = 3
    int32 = 4
    int64 = 5
    float32 = 6
    float64 = 7
    str32 = 11
    str64 = 12
    obj64 = 21

def as_type(frame, new_type): ...
```

[Type System](./type-system.md)

### Data Binning and Encoding

Functions for data discretization and categorical encoding operations.

```python { .api }
def cut(x, bins, right=True, labels=None): ...   # Bin values into discrete intervals
def qcut(x, q, labels=None): ...                 # Quantile-based discretization
def split_into_nhot(frame, delimiter=","): ...   # One-hot encoding for delimited strings
```

## Global Objects

```python { .api }
# Module alias
dt = datatable  # Common alias for the datatable module

# Configuration
options: Config  # Global configuration system

# Display initialization
def init_styles(): ...  # Initialize display styles (auto-run in Jupyter)
```