tessl/pypi-daft

Distributed Dataframes for Multimodal Data, with a high-performance query engine and support for complex nested data structures, AI/ML operations, and seamless cloud storage integration.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/daft@0.6.x

To install, run:

```
npx @tessl/cli install tessl/pypi-daft@0.6.0
```

# Daft

Daft is a distributed query engine for large-scale data processing. It provides both a Python DataFrame API and a SQL interface, and is implemented in Rust for high performance. Daft specializes in multimodal data types, including images, URLs, tensors, and complex nested data structures, and is built on Apache Arrow for seamless interchange and record-setting I/O performance with cloud storage systems such as S3.

## Package Information

- **Package Name**: daft
- **Language**: Python
- **Installation**: `pip install daft`
- **Optional Dependencies**: `pip install 'daft[aws,azure,gcp,ray,pandas,sql,iceberg,deltalake,unity]'`

## Core Imports

```python
import daft
```

For common operations:

```python
from daft import DataFrame, col, lit, when, coalesce
import daft.functions as F
```

For specific functionality:

```python
from daft import (
    # Data conversion functions
    from_pydict, from_pandas, from_arrow, from_ray_dataset, from_dask_dataframe,

    # Data I/O functions
    read_parquet, read_csv, read_json, read_deltalake, read_iceberg,
    read_sql, read_lance, read_video_frames, read_warc, read_mcap,
    read_huggingface, from_glob_path,

    # Session and catalog management
    current_session, set_catalog, attach_catalog, list_tables,

    # SQL interface
    sql, sql_expr,

    # UDF creation
    func, udf,

    # Configuration
    set_execution_config, set_planning_config,

    # Types and utilities
    DataType, Schema, Window, ResourceRequest,
    ImageFormat, ImageMode, TimeUnit,
)
```

## Basic Usage

```python
import daft
from daft import col

# Create a DataFrame from Python data
df = daft.from_pydict({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "London", "Tokyo"],
})

# Basic operations
result = (
    df
    .filter(col("age") > 28)
    .select("name", "city", (col("age") + 1).alias("next_age"))
    .collect()
)

# SQL interface (referencing the in-scope DataFrame by variable name)
df2 = daft.sql("SELECT name, city FROM df WHERE age > 28")

# Read from various formats
parquet_df = daft.read_parquet("s3://bucket/data/*.parquet")
csv_df = daft.read_csv("data.csv")
delta_df = daft.read_deltalake("s3://bucket/delta-table")
```

## Architecture

Daft follows a distributed, lazy-evaluation architecture optimized for modern data workloads:

- **DataFrames**: Distributed data structures supporting both relational and multimodal operations
- **Expressions**: Column-level computations with type safety and optimization
- **IO Layer**: High-performance readers for 10+ data formats with cloud storage optimization
- **Query Engine**: Rust-based execution with intelligent caching and predicate pushdown
- **Catalog Integration**: Native support for data catalogs (Iceberg, Delta, Unity, Glue)
- **AI/ML Integration**: Built-in functions for embeddings, LLM operations, and model inference

## Capabilities

### DataFrame Operations

Core DataFrame functionality including creation, selection, filtering, grouping, aggregation, and joining operations. Supports both lazy and eager evaluation with distributed processing.

```python { .api }
class DataFrame:
    def select(self, *columns: ColumnInputType, **projections: Expression) -> DataFrame: ...
    def filter(self, predicate: Union[Expression, str]) -> DataFrame: ...
    def groupby(self, *group_by: ManyColumnsInputType) -> GroupedDataFrame: ...
    def collect(self, num_preview_rows: Optional[int] = 8) -> DataFrame: ...
```

[DataFrame Operations](./dataframe-operations.md)

### Data Input/Output

Reading and writing data from multiple formats including CSV, Parquet, JSON, Delta Lake, Apache Iceberg, Hudi, Lance, and databases. Optimized for cloud storage, with support for AWS S3, Azure Blob Storage, and Google Cloud Storage.

```python { .api }
def read_parquet(path: Union[str, List[str]], **kwargs) -> DataFrame: ...
def read_csv(path: Union[str, List[str]], **kwargs) -> DataFrame: ...
def read_deltalake(table_uri: str, **kwargs) -> DataFrame: ...
def read_iceberg(table: str, **kwargs) -> DataFrame: ...
```

[Data Input/Output](./data-io.md)

### Expressions and Functions

Column expressions for data transformation, computation, and manipulation. Includes mathematical operations, string processing, date/time handling, and conditional logic.

```python { .api }
def col(name: str) -> Expression: ...
def lit(value: Any) -> Expression: ...
def coalesce(*exprs: Expression) -> Expression: ...
def when(predicate: Expression) -> Expression: ...
```

[Expressions and Functions](./expressions.md)

### User-Defined Functions

Support for custom Python functions with three execution modes: row-wise (1-to-1), async row-wise, and generator (1-to-many). Functions can be decorated to work seamlessly with DataFrame operations.

```python { .api }
@daft.func
def custom_function(input: str) -> str: ...

@daft.func
async def async_function(input: str) -> str: ...

@daft.func
def generator_function(input: str) -> Iterator[str]: ...
```

[User-Defined Functions](./udf.md)

### SQL Interface

Execute SQL queries directly on DataFrames and registered tables. Supports standard SQL syntax with extensions for multimodal data operations.

```python { .api }
def sql(query: str) -> DataFrame: ...
def sql_expr(expression: str) -> Expression: ...
```

[SQL Interface](./sql.md)

### AI/ML Functions

Built-in functions for AI and machine learning workflows, including text embeddings, LLM generation, and model inference operations.

```python { .api }
def embed_text(text: Expression, model: str) -> Expression: ...
def llm_generate(prompt: Expression, model: str) -> Expression: ...
```

[AI/ML Functions](./ai-ml.md)

### Data Catalog Integration

Integration with data catalogs for metadata management, table discovery, and governance. Supports Unity Catalog, Apache Iceberg, AWS Glue, and custom catalog implementations.

```python { .api }
class Catalog:
    def list_tables(self, pattern: Optional[str] = None) -> List[Identifier]: ...
    def get_table(self, identifier: Union[Identifier, str]) -> Table: ...
    def create_table(self, identifier: Union[Identifier, str], source: Union[Schema, DataFrame]) -> Table: ...
```

[Data Catalog Integration](./catalog.md)

### Session Management

Session-based configuration and resource management for distributed computing. Handles catalog connections, temporary tables, and execution settings.

```python { .api }
def set_execution_config(config: ExecutionConfig) -> None: ...
def set_planning_config(config: PlanningConfig) -> None: ...
def current_session() -> Session: ...
```

[Session Management](./session.md)

## Core Data Types

### Series

Column-level data container and operations.

```python { .api }
class Series:
    @property
    def name(self) -> str:
        """Get the series name."""

    def rename(self, name: str) -> Series:
        """Rename the series."""

    def datatype(self) -> DataType:
        """Get the data type."""

    def __len__(self) -> int:
        """Get the length."""

    def to_arrow(self) -> "pyarrow.Array":
        """Convert to an Apache Arrow array."""

    def to_pylist(self) -> List[Any]:
        """Convert to a Python list."""

    def cast(self, dtype: DataType) -> Series:
        """Cast to a different data type."""

    def filter(self, mask: Series) -> Series:
        """Filter by a boolean mask."""

    def take(self, idx: Series) -> Series:
        """Take values by indices."""

    def slice(self, start: int, end: int) -> Series:
        """Slice the series."""
```

### File

File metadata and operations.

```python { .api }
class File:
    """File handling and metadata operations."""

    @property
    def path(self) -> str:
        """Get the file path."""

    @property
    def size(self) -> int:
        """Get the file size in bytes."""

    def read(self) -> bytes:
        """Read the file contents."""
```

### Schema

Schema definitions for DataFrames.

```python { .api }
class Schema:
    """Schema definition for DataFrame structure."""

    def column_names(self) -> List[str]:
        """Get the column names."""

    def to_pydict(self) -> Dict[str, DataType]:
        """Convert to a Python dictionary."""
```

### Data Types

```python { .api }
class DataType:
    @staticmethod
    def int8() -> DataType: ...
    @staticmethod
    def int16() -> DataType: ...
    @staticmethod
    def int32() -> DataType: ...
    @staticmethod
    def int64() -> DataType: ...
    @staticmethod
    def uint8() -> DataType: ...
    @staticmethod
    def uint16() -> DataType: ...
    @staticmethod
    def uint32() -> DataType: ...
    @staticmethod
    def uint64() -> DataType: ...
    @staticmethod
    def float32() -> DataType: ...
    @staticmethod
    def float64() -> DataType: ...
    @staticmethod
    def bool() -> DataType: ...
    @staticmethod
    def string() -> DataType: ...
    @staticmethod
    def binary() -> DataType: ...
    @staticmethod
    def date() -> DataType: ...
    @staticmethod
    def timestamp(unit: TimeUnit) -> DataType: ...
    @staticmethod
    def list(inner: DataType) -> DataType: ...
    @staticmethod
    def struct(fields: Dict[str, DataType]) -> DataType: ...
    @staticmethod
    def image(mode: Optional[ImageMode] = None) -> DataType: ...
    @staticmethod
    def tensor(dtype: DataType) -> DataType: ...

enum TimeUnit:
    Nanoseconds
    Microseconds
    Milliseconds
    Seconds

enum ImageMode:
    L     # 8-bit grayscale
    LA    # 8-bit grayscale with alpha
    RGB   # 8-bit RGB
    RGBA  # 8-bit RGB with alpha

enum ImageFormat:
    PNG
    JPEG
    TIFF
    GIF
    BMP
```

### Resource Management

```python { .api }
class ResourceRequest:
    """Resource allocation specification for distributed tasks."""

    def __init__(
        self,
        num_cpus: Optional[float] = None,
        num_gpus: Optional[float] = None,
        memory_bytes: Optional[int] = None,
    ): ...

def refresh_logger() -> None:
    """Refresh Daft's internal Rust logging to the current Python log level."""
```

### Visualization

```python { .api }
def register_viz_hook(hook_fn: Callable) -> None:
    """Register a custom visualization hook for DataFrame display."""
```