Distributed DataFrames for Multimodal Data, with a high-performance query engine and support for complex nested data structures, AI/ML operations, and seamless cloud storage integration.
npx @tessl/cli install tessl/pypi-daft@0.6.00
# Daft

Daft is a distributed query engine for large-scale data processing that provides both a Python DataFrame API and a SQL interface, implemented in Rust for high performance. It specializes in multimodal data types including Images, URLs, Tensors, and complex nested data structures, and is built on Apache Arrow for seamless interchange and record-setting I/O performance with cloud storage systems like S3.

## Package Information

- **Package Name**: daft
- **Language**: Python
- **Installation**: `pip install daft`
- **Optional Dependencies**: `pip install 'daft[aws,azure,gcp,ray,pandas,sql,iceberg,deltalake,unity]'`

## Core Imports

```python
import daft
```

For common operations:

```python
from daft import DataFrame, col, lit, when, coalesce
import daft.functions as F
```

For specific functionality:

```python
from daft import (
    # Data conversion functions
    from_pydict, from_pandas, from_arrow, from_ray_dataset, from_dask_dataframe,

    # Data I/O functions
    read_parquet, read_csv, read_json, read_deltalake, read_iceberg,
    read_sql, read_lance, read_video_frames, read_warc, read_mcap,
    read_huggingface, from_glob_path,

    # Session and catalog management
    current_session, set_catalog, attach_catalog, list_tables,

    # SQL interface
    sql, sql_expr,

    # UDF creation
    func, udf,

    # Configuration
    set_execution_config, set_planning_config,

    # Types and utilities
    DataType, Schema, Window, ResourceRequest,
    ImageFormat, ImageMode, TimeUnit
)
```

## Basic Usage

```python
import daft
from daft import col

# Create a DataFrame from Python data
df = daft.from_pydict({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "London", "Tokyo"]
})

# Basic operations
result = (
    df
    .filter(col("age") > 28)
    .select("name", "city", (col("age") + 1).alias("next_age"))
    .collect()
)

# SQL interface (DataFrames in scope can be referenced by variable name)
df2 = daft.sql("SELECT name, city FROM df WHERE age > 28")

# Read from various formats
parquet_df = daft.read_parquet("s3://bucket/data/*.parquet")
csv_df = daft.read_csv("data.csv")
delta_df = daft.read_deltalake("s3://bucket/delta-table")
```

## Architecture

Daft follows a distributed, lazy-evaluation architecture optimized for modern data workloads (see the sketch after this list):

- **DataFrames**: Distributed data structures supporting both relational and multimodal operations
- **Expressions**: Column-level computations with type safety and optimization
- **IO Layer**: High-performance readers for 10+ data formats with cloud storage optimization
- **Query Engine**: Rust-based execution with intelligent caching and predicate pushdown
- **Catalog Integration**: Native support for data catalogs (Iceberg, Delta, Unity, Glue)
- **AI/ML Integration**: Built-in functions for embeddings, LLM operations, and model inference

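A minimal sketch of the lazy-evaluation model: transformations only extend the logical plan, `explain()` prints that plan, and nothing executes until a materializing call such as `collect()`. The S3 path is a placeholder.

```python
import daft
from daft import col

df = daft.read_parquet("s3://bucket/data/*.parquet")   # no data is read yet

plan = (
    df
    .filter(col("age") > 28)    # still lazy: only the query plan grows
    .select("name", "age")
)

plan.explain()            # inspect the query plan without executing it
result = plan.collect()   # execution happens here, across partitions
```
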
## Capabilities

### DataFrame Operations

Core DataFrame functionality including creation, selection, filtering, grouping, aggregation, and joining operations. Supports both lazy and eager evaluation with distributed processing.

```python { .api }
class DataFrame:
    def select(*columns: ColumnInputType, **projections: Expression) -> DataFrame: ...
    def filter(predicate: Union[Expression, str]) -> DataFrame: ...
    def groupby(*group_by: ManyColumnsInputType) -> GroupedDataFrame: ...
    def collect(num_preview_rows: Optional[int] = 8) -> DataFrame: ...
```

[DataFrame Operations](./dataframe-operations.md)
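
A short sketch of the filter/group/aggregate flow (column names are illustrative):

```python
import daft
from daft import col

df = daft.from_pydict({
    "city": ["NYC", "NYC", "London", "Tokyo"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# filter, group, and aggregate; nothing executes until collect()
per_city = (
    df
    .filter(col("amount") > 5)
    .groupby("city")
    .agg(col("amount").sum().alias("total_amount"))
    .collect()
)
```
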
### Data Input/Output

Reading and writing data from multiple formats including CSV, Parquet, JSON, Delta Lake, Apache Iceberg, Hudi, Lance, and databases. Optimized for cloud storage with support for AWS S3, Azure Blob, and Google Cloud Storage.

```python { .api }
def read_parquet(path: Union[str, List[str]], **kwargs) -> DataFrame: ...
def read_csv(path: Union[str, List[str]], **kwargs) -> DataFrame: ...
def read_deltalake(table_uri: str, **kwargs) -> DataFrame: ...
def read_iceberg(table: str, **kwargs) -> DataFrame: ...
```

[Data Input/Output](./data-io.md)
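
A minimal read/write round trip; the bucket paths are placeholders, and `write_parquet` is assumed to be available as in current Daft releases:

```python
import daft

# read a partitioned Parquet dataset from object storage (placeholder path)
df = daft.read_parquet("s3://bucket/events/*.parquet")

# ... transform df ...

# write the result back out as Parquet (placeholder path)
df.write_parquet("s3://bucket/events_cleaned/")
```
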
### Expressions and Functions

Column expressions for data transformation, computation, and manipulation. Includes mathematical operations, string processing, date/time handling, and conditional logic.

```python { .api }
def col(name: str) -> Expression: ...
def lit(value: Any) -> Expression: ...
def coalesce(*exprs: Expression) -> Expression: ...
def when(predicate: Expression) -> Expression: ...
```

[Expressions and Functions](./expressions.md)
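
A small sketch combining the expression constructors listed above (column names are illustrative):

```python
import daft
from daft import col, lit, coalesce

df = daft.from_pydict({
    "name": ["Alice", None, "Charlie"],
    "bonus": [100, None, 250],
})

df = df.select(
    coalesce(col("name"), lit("unknown")).alias("name"),            # fill missing names
    (coalesce(col("bonus"), lit(0)) * lit(2)).alias("doubled_bonus"),
)
```
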
### User-Defined Functions

Support for custom Python functions with three execution modes: row-wise (1-to-1), async row-wise, and generator (1-to-many). Functions can be decorated to work seamlessly with DataFrame operations.

```python { .api }
@daft.func
def custom_function(input: str) -> str: ...

@daft.func
async def async_function(input: str) -> str: ...

@daft.func
def generator_function(input: str) -> Iterator[str]: ...
```

[User-Defined Functions](./udf.md)
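
A minimal sketch of a row-wise UDF applied to a column, assuming `@daft.func` infers the return type from the annotation as in recent Daft releases:

```python
import daft
from daft import col

@daft.func
def shout(s: str) -> str:
    # plain Python runs once per row
    return s.upper() + "!"

df = daft.from_pydict({"name": ["alice", "bob"]})
df = df.select(shout(col("name")).alias("loud_name"))
```
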
### SQL Interface

Execute SQL queries directly on DataFrames and registered tables. Supports standard SQL syntax with extensions for multimodal data operations.

```python { .api }
def sql(query: str) -> DataFrame: ...
def sql_expr(expression: str) -> Expression: ...
```

[SQL Interface](./sql.md)
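
A brief sketch of mixing SQL with the DataFrame API: `daft.sql` can reference DataFrames that are in scope by variable name, and `sql_expr` turns a SQL fragment into an Expression:

```python
import daft

people = daft.from_pydict({"name": ["Alice", "Bob"], "age": [25, 31]})

# query an in-scope DataFrame by name
adults = daft.sql("SELECT name FROM people WHERE age >= 30")

# use a SQL fragment as an expression inside the DataFrame API
older = people.filter(daft.sql_expr("age + 1 > 30"))
```
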
### AI/ML Functions

Built-in functions for AI and machine learning workflows including text embeddings, LLM generation, and model inference operations.

```python { .api }
def embed_text(text: Expression, model: str) -> Expression: ...
def llm_generate(prompt: Expression, model: str) -> Expression: ...
```

[AI/ML Functions](./ai-ml.md)
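
A hedged sketch of the text-embedding flow. It assumes `embed_text` is exposed under `daft.functions` with the signature shown above, and the model name is a hypothetical placeholder for whatever embedding model your environment provides:

```python
import daft
from daft import col
import daft.functions as F  # assumes embed_text is exposed here

df = daft.from_pydict({"review": ["great product", "terrible support"]})

df = df.select(
    col("review"),
    F.embed_text(col("review"), model="sentence-transformers/all-MiniLM-L6-v2").alias("embedding"),
)
```
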
### Data Catalog Integration

Integration with data catalogs for metadata management, table discovery, and governance. Supports Unity Catalog, Apache Iceberg, AWS Glue, and custom catalog implementations.

```python { .api }
class Catalog:
    def list_tables(pattern: str = None) -> List[Identifier]: ...
    def get_table(identifier: Union[Identifier, str]) -> Table: ...
    def create_table(identifier: Union[Identifier, str], source: Union[Schema, DataFrame]) -> Table: ...
```

[Data Catalog Integration](./catalog.md)
### Session Management

Session-based configuration and resource management for distributed computing. Handles catalog connections, temporary tables, and execution settings.

```python { .api }
def set_execution_config(config: ExecutionConfig) -> None: ...
def set_planning_config(config: PlanningConfig) -> None: ...
def current_session() -> Session: ...
```

[Session Management](./session.md)
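
One common configuration pattern, sketched under the assumption that your Daft version exposes `daft.io.IOConfig`/`S3Config` and a `default_io_config` keyword on `set_planning_config`:

```python
import daft
from daft.io import IOConfig, S3Config

# route reads through anonymous S3 access in a fixed region
io_config = IOConfig(s3=S3Config(region_name="us-west-2", anonymous=True))
daft.set_planning_config(default_io_config=io_config)

# inspect the active session (catalogs, temporary tables, settings)
session = daft.current_session()
```
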
## Core Data Types

### Series

Column-level data container and operations.

```python { .api }
class Series:
    @property
    def name(self) -> str:
        """Get series name."""

    def rename(self, name: str) -> Series:
        """Rename series."""

    def datatype(self) -> DataType:
        """Get data type."""

    def __len__(self) -> int:
        """Get length."""

    def to_arrow(self) -> "pyarrow.Array":
        """Convert to Apache Arrow array."""

    def to_pylist(self) -> List[Any]:
        """Convert to Python list."""

    def cast(self, dtype: DataType) -> Series:
        """Cast to different data type."""

    def filter(self, mask: Series) -> Series:
        """Filter by boolean mask."""

    def take(self, idx: Series) -> Series:
        """Take values by indices."""

    def slice(self, start: int, end: int) -> Series:
        """Slice series."""
```
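
A quick sketch of constructing and inspecting a Series, assuming the `Series.from_pylist` constructor available in current Daft:

```python
from daft import DataType, Series

# build a Series from Python values (constructor name assumed from current Daft)
s = Series.from_pylist([1, 2, 3], name="scores")

print(len(s))               # 3
print(s.datatype())         # an integer DataType

floats = s.cast(DataType.float64())
print(floats.to_pylist())   # [1.0, 2.0, 3.0]
```
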
### File

File metadata and operations.

```python { .api }
class File:
    """File handling and metadata operations."""

    @property
    def path(self) -> str:
        """Get file path."""

    @property
    def size(self) -> int:
        """Get file size in bytes."""

    def read(self) -> bytes:
        """Read file contents."""
```

### Schema

Schema definitions for DataFrames.

```python { .api }
class Schema:
    """Schema definition for DataFrame structure."""

    def column_names(self) -> List[str]:
        """Get column names."""

    def to_pydict(self) -> Dict[str, DataType]:
        """Convert to Python dictionary."""
```
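
A small sketch of inspecting a DataFrame's schema:

```python
import daft

df = daft.from_pydict({"id": [1, 2], "name": ["a", "b"]})

schema = df.schema()
print(schema.column_names())   # ['id', 'name']
```
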
### Data Types

```python { .api }
class DataType:
    @staticmethod
    def int8() -> DataType: ...
    @staticmethod
    def int16() -> DataType: ...
    @staticmethod
    def int32() -> DataType: ...
    @staticmethod
    def int64() -> DataType: ...
    @staticmethod
    def uint8() -> DataType: ...
    @staticmethod
    def uint16() -> DataType: ...
    @staticmethod
    def uint32() -> DataType: ...
    @staticmethod
    def uint64() -> DataType: ...
    @staticmethod
    def float32() -> DataType: ...
    @staticmethod
    def float64() -> DataType: ...
    @staticmethod
    def bool() -> DataType: ...
    @staticmethod
    def string() -> DataType: ...
    @staticmethod
    def binary() -> DataType: ...
    @staticmethod
    def date() -> DataType: ...
    @staticmethod
    def timestamp(unit: TimeUnit) -> DataType: ...
    @staticmethod
    def list(inner: DataType) -> DataType: ...
    @staticmethod
    def struct(fields: Dict[str, DataType]) -> DataType: ...
    @staticmethod
    def image(mode: ImageMode = None) -> DataType: ...
    @staticmethod
    def tensor(dtype: DataType) -> DataType: ...

class TimeUnit(Enum):
    Nanoseconds = ...
    Microseconds = ...
    Milliseconds = ...
    Seconds = ...

class ImageMode(Enum):
    L = ...     # 8-bit grayscale
    LA = ...    # 8-bit grayscale with alpha
    RGB = ...   # 8-bit RGB
    RGBA = ...  # 8-bit RGB with alpha

class ImageFormat(Enum):
    PNG = ...
    JPEG = ...
    TIFF = ...
    GIF = ...
    BMP = ...
```
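
A minimal sketch of using the `DataType` factories to cast a column and build nested types:

```python
import daft
from daft import DataType, col

df = daft.from_pydict({"id": [1, 2, 3], "score": ["1.5", "2.0", "3.25"]})

# cast a string column to float64
df = df.with_column("score", col("score").cast(DataType.float64()))

# nested types are built compositionally
tags_type = DataType.list(DataType.string())
point_type = DataType.struct({"x": DataType.float64(), "y": DataType.float64()})

print(df.schema())
```
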
### Resource Management

```python { .api }
class ResourceRequest:
    """Resource allocation specification for distributed tasks."""

    def __init__(
        self,
        num_cpus: Optional[float] = None,
        num_gpus: Optional[float] = None,
        memory_bytes: Optional[int] = None
    ): ...

def refresh_logger() -> None:
    """Refresh Daft's internal Rust logging to the current Python log level."""
```
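
A tiny sketch of specifying resources with the constructor shown above; how the request is attached to an operation (for example via UDF options) depends on your Daft version:

```python
from daft import ResourceRequest

# reserve 1 CPU, 1 GPU, and 4 GiB of memory for a heavy task
req = ResourceRequest(num_cpus=1, num_gpus=1, memory_bytes=4 * 1024**3)
```
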
### Visualization

```python { .api }
def register_viz_hook(hook_fn: Callable) -> None:
    """Register custom visualization hook for DataFrame display."""
```