0
# SQL Interface
1
2
Execute SQL queries directly on DataFrames and registered tables with full support for standard SQL syntax, joins, aggregations, and Daft-specific extensions for multimodal data operations.
3
4
## Capabilities
5
6
### SQL Query Execution
7
8
Execute SQL queries that return DataFrames.
9
10
```python { .api }
11
def sql(query: str) -> DataFrame:
12
"""
13
Execute SQL query and return DataFrame.
14
15
Parameters:
16
- query: SQL query string with standard SQL syntax
17
18
Returns:
19
DataFrame: Result of SQL query
20
21
Examples:
22
>>> df = daft.sql("SELECT name, age FROM my_table WHERE age > 25")
23
>>> result = daft.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")
24
"""
25
```
26
27
### SQL Expression Creation
28
29
Create SQL expressions for use in DataFrame operations.
30
31
```python { .api }
32
def sql_expr(expression: str) -> Expression:
33
"""
34
Create expression from SQL string.
35
36
Parameters:
37
- expression: SQL expression string
38
39
Returns:
40
Expression: Expression object for DataFrame operations
41
42
Examples:
43
>>> expr = daft.sql_expr("age * 2 + 1")
44
>>> df.select(sql_expr("UPPER(name)").alias("upper_name"))
45
"""
46
```
47
48
## Usage Examples
49
50
### Basic SQL Queries
51
```python
52
import daft
53
54
# Create sample data
55
df = daft.from_pydict({
56
"name": ["Alice", "Bob", "Charlie", "Diana"],
57
"age": [25, 30, 35, 28],
58
"department": ["Engineering", "Sales", "Engineering", "Marketing"],
59
"salary": [75000, 65000, 85000, 70000]
60
})
61
62
# Register DataFrame for SQL queries
63
daft.attach_table(df, "employees")
64
65
# Simple SELECT query
66
result1 = daft.sql("SELECT name, age FROM employees WHERE age > 27")
67
68
# Aggregation query
69
result2 = daft.sql("""
70
SELECT department,
71
COUNT(*) as employee_count,
72
AVG(salary) as avg_salary,
73
MAX(age) as max_age
74
FROM employees
75
GROUP BY department
76
ORDER BY avg_salary DESC
77
""")
78
```
79
80
### Advanced SQL Operations
81
```python
82
# Complex JOIN operations
83
orders = daft.from_pydict({
84
"order_id": [1, 2, 3, 4],
85
"employee_name": ["Alice", "Bob", "Alice", "Charlie"],
86
"amount": [1500, 2000, 1200, 1800],
87
"order_date": ["2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18"]
88
})
89
90
daft.attach_table(orders, "orders")
91
92
# JOIN with aggregation
93
sales_summary = daft.sql("""
94
SELECT e.name,
95
e.department,
96
COUNT(o.order_id) as total_orders,
97
SUM(o.amount) as total_sales,
98
AVG(o.amount) as avg_order_value
99
FROM employees e
100
LEFT JOIN orders o ON e.name = o.employee_name
101
GROUP BY e.name, e.department
102
HAVING COUNT(o.order_id) > 0
103
ORDER BY total_sales DESC
104
""")
105
```
106
107
### Window Functions in SQL
108
```python
109
# Window functions for ranking and running totals
110
windowed_results = daft.sql("""
111
SELECT name,
112
department,
113
salary,
114
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
115
RANK() OVER (ORDER BY salary DESC) as overall_rank,
116
SUM(salary) OVER (PARTITION BY department) as dept_total_salary,
117
AVG(salary) OVER () as company_avg_salary
118
FROM employees
119
ORDER BY department, salary DESC
120
""")
121
```
122
123
### Common Table Expressions (CTEs)
124
```python
125
# Using WITH clauses for complex queries
126
cte_query = daft.sql("""
127
WITH department_stats AS (
128
SELECT department,
129
COUNT(*) as emp_count,
130
AVG(salary) as avg_salary,
131
MIN(age) as min_age,
132
MAX(age) as max_age
133
FROM employees
134
GROUP BY department
135
),
136
high_performing_depts AS (
137
SELECT department
138
FROM department_stats
139
WHERE avg_salary > 70000 AND emp_count >= 2
140
)
141
SELECT e.name, e.age, e.salary, e.department
142
FROM employees e
143
INNER JOIN high_performing_depts h ON e.department = h.department
144
ORDER BY e.salary DESC
145
""")
146
```
147
148
### Subqueries and Nested Operations
149
```python
150
# Subqueries for complex filtering
151
subquery_result = daft.sql("""
152
SELECT name, age, salary, department,
153
salary - (SELECT AVG(salary) FROM employees) as salary_diff
154
FROM employees
155
WHERE salary > (
156
SELECT AVG(salary)
157
FROM employees
158
WHERE department = 'Engineering'
159
)
160
ORDER BY salary_diff DESC
161
""")
162
```
163
164
### Mixing SQL with DataFrame Operations
165
```python
166
# Combine SQL queries with DataFrame methods
167
sql_df = daft.sql("SELECT * FROM employees WHERE age BETWEEN 25 AND 35")
168
169
# Continue with DataFrame operations
170
final_result = (sql_df
171
.filter(daft.col("salary") > 70000)
172
.select("name", "department", (daft.col("salary") * 1.1).alias("with_bonus"))
173
.collect()
174
)
175
```
176
177
### Using SQL Expressions in DataFrames
178
```python
179
# SQL expressions in DataFrame select
180
df_with_sql_expr = df.select(
181
daft.col("name"),
182
daft.sql_expr("age * 2").alias("double_age"),
183
daft.sql_expr("UPPER(department)").alias("dept_upper"),
184
daft.sql_expr("CASE WHEN salary > 70000 THEN 'High' ELSE 'Standard' END").alias("salary_tier")
185
)
186
187
# SQL expressions in filters
188
filtered_df = df.filter(
189
daft.sql_expr("age BETWEEN 25 AND 35 AND department IN ('Engineering', 'Sales')")
190
)
191
```
192
193
### Working with Multiple Tables
194
```python
195
# Register multiple tables
196
customers = daft.from_pydict({
197
"customer_id": [1, 2, 3],
198
"name": ["Corp A", "Corp B", "Corp C"],
199
"industry": ["Tech", "Finance", "Healthcare"]
200
})
201
202
daft.attach_table(customers, "customers")
203
204
# Multi-table JOIN
205
multi_table_query = daft.sql("""
206
SELECT c.name as customer_name,
207
c.industry,
208
e.name as employee_name,
209
e.department,
210
o.amount,
211
o.order_date
212
FROM customers c
213
JOIN orders o ON c.customer_id = o.order_id -- Simplified relationship
214
JOIN employees e ON o.employee_name = e.name
215
WHERE c.industry = 'Tech'
216
ORDER BY o.order_date DESC
217
""")
218
```
219
220
### Data Type Handling in SQL
221
```python
222
# Working with different data types
223
mixed_data = daft.from_pydict({
224
"id": [1, 2, 3],
225
"created_at": ["2024-01-01 10:00:00", "2024-01-02 15:30:00", "2024-01-03 09:15:00"],
226
"tags": [["python", "sql"], ["data", "analytics"], ["machine", "learning"]],
227
"metadata": ['{"priority": 1}', '{"priority": 2}', '{"priority": 3}']
228
})
229
230
daft.attach_table(mixed_data, "mixed_data")
231
232
# SQL with complex data types
233
complex_query = daft.sql("""
234
SELECT id,
235
EXTRACT(YEAR FROM created_at) as year,
236
EXTRACT(MONTH FROM created_at) as month,
237
ARRAY_LENGTH(tags) as tag_count,
238
JSON_EXTRACT(metadata, '$.priority') as priority
239
FROM mixed_data
240
WHERE ARRAY_LENGTH(tags) > 1
241
ORDER BY priority DESC
242
""")
243
```
244
245
### Temporary Tables and Views
246
```python
247
# Create temporary table from query result
248
temp_result = daft.sql("""
249
SELECT department, AVG(salary) as avg_salary
250
FROM employees
251
GROUP BY department
252
""")
253
254
daft.attach_table(temp_result, "dept_averages")
255
256
# Use temporary table in subsequent queries
257
comparison = daft.sql("""
258
SELECT e.name,
259
e.department,
260
e.salary,
261
d.avg_salary,
262
e.salary - d.avg_salary as salary_diff
263
FROM employees e
264
JOIN dept_averages d ON e.department = d.department
265
ORDER BY salary_diff DESC
266
""")
267
```
268
269
### Error Handling and Validation
270
```python
271
try:
272
# Execute potentially problematic query
273
result = daft.sql("""
274
SELECT name, salary / 0 as invalid_calc
275
FROM employees
276
""")
277
except Exception as e:
278
print(f"SQL error: {e}")
279
280
# Validate table exists before querying
281
if daft.has_table("employees"):
282
result = daft.sql("SELECT COUNT(*) FROM employees")
283
else:
284
print("Table 'employees' not found")
285
```
286
287
## SQL Feature Support
288
289
Daft's SQL interface supports:
290
291
### Standard SQL Features
292
- SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
293
- JOINs (INNER, LEFT, RIGHT, FULL OUTER)
294
- Subqueries and CTEs (WITH clauses)
295
- Window functions (ROW_NUMBER, RANK, DENSE_RANK, etc.)
296
- Aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.)
297
- Mathematical and string functions
298
- Date/time functions
299
- CASE expressions and conditional logic
300
301
### Daft-Specific Extensions
302
- Integration with registered DataFrames and temporary tables
303
- Support for complex data types (arrays, structs, nested data)
304
- Multimodal data operations
305
- Cloud storage path references
306
- Custom UDF integration in SQL context
307
308
## Table Management for SQL
309
310
```python { .api }
311
def attach_table(df: DataFrame, name: str) -> None:
312
"""Register DataFrame as named table for SQL queries."""
313
314
def detach_table(name: str) -> None:
315
"""Remove registered table from SQL context."""
316
317
def has_table(name: str) -> bool:
318
"""Check if table is registered for SQL queries."""
319
320
def list_tables() -> List[str]:
321
"""List all registered tables."""
322
```
323
324
SQL queries can reference any registered table by name, enabling complex multi-table operations and reusable query patterns.