# SQL Functionality

Native SQL support lets you query DataFrames and LazyFrames using familiar SQL syntax, fully integrated with the Polars ecosystem. Execute SQL queries directly on your data structures or as part of broader data pipelines.

## Capabilities

### SQL Context

Manage and execute SQL queries across multiple named DataFrames and LazyFrames.
```python { .api }
class SQLContext:
    def __init__(self, frames: dict[str, DataFrame | LazyFrame] | None = None):
        """
        Create a SQL context for managing multiple data sources.

        Parameters:
        - frames: Dictionary of {name: DataFrame/LazyFrame} to register
        """

    def register(self, name: str, frame: DataFrame | LazyFrame) -> SQLContext:
        """
        Register a DataFrame or LazyFrame under a name for SQL queries.

        Parameters:
        - name: Table name to use in SQL queries
        - frame: DataFrame or LazyFrame to register

        Returns:
        - SQLContext: Self for method chaining
        """

    def register_globals(self, n: int | None = None) -> SQLContext:
        """
        Register DataFrames/LazyFrames from the calling scope's global namespace.

        Parameters:
        - n: If given, register only the n most recently defined frames

        Returns:
        - SQLContext: Self for method chaining
        """

    def register_many(self, frames: dict[str, DataFrame | LazyFrame]) -> SQLContext:
        """
        Register multiple frames at once.

        Parameters:
        - frames: Dictionary of {name: frame} pairs

        Returns:
        - SQLContext: Self for method chaining
        """

    def unregister(self, name: str) -> SQLContext:
        """
        Unregister a table from the context.

        Parameters:
        - name: Table name to unregister

        Returns:
        - SQLContext: Self for method chaining
        """

    def execute(
        self,
        query: str,
        *,
        eager: bool = False
    ) -> DataFrame | LazyFrame:
        """
        Execute a SQL query and return the result.

        Parameters:
        - query: SQL query string
        - eager: Return a DataFrame (True) or a LazyFrame (False)

        Returns:
        - DataFrame | LazyFrame: Query result
        """

    def tables(self) -> list[str]:
        """
        Get the list of registered table names.

        Returns:
        - list[str]: Registered table names
        """
```
### SQL Query Function

Execute SQL queries directly on DataFrames and LazyFrames.

```python { .api }
def sql(
    query: str,
    *,
    eager: bool = True,
    **frames: DataFrame | LazyFrame
) -> DataFrame | LazyFrame:
    """
    Execute a SQL query on the provided frames.

    Parameters:
    - query: SQL query string
    - eager: Return a DataFrame (True) or a LazyFrame (False)
    - frames: Keyword arguments mapping table names to DataFrames/LazyFrames

    Returns:
    - DataFrame | LazyFrame: Query result
    """
```
### DataFrame/LazyFrame SQL Methods

Execute SQL queries directly on individual frames.

```python { .api }
# Available on DataFrame and LazyFrame classes
def sql(self, query: str, *, table_name: str = "self") -> DataFrame | LazyFrame:
    """
    Execute a SQL query with this frame as the primary table.

    Parameters:
    - query: SQL query string (can reference table_name)
    - table_name: Name to use for this frame in the query

    Returns:
    - DataFrame | LazyFrame: Query result (same type as the caller)
    """
```
### SQL Expression Function

Create expressions from SQL fragments for use in `select`, `filter`, and similar operations.

```python { .api }
def sql_expr(sql: str) -> Expr:
    """
    Create an expression from a SQL fragment.

    Parameters:
    - sql: SQL expression string

    Returns:
    - Expr: Expression parsed from the SQL fragment
    """
```
## Supported SQL Features

### Query Clauses
- **SELECT**: Column selection, aliases, expressions
- **FROM**: Table references, subqueries
- **WHERE**: Filtering conditions
- **GROUP BY**: Grouping operations
- **HAVING**: Group filtering
- **ORDER BY**: Sorting
- **LIMIT/OFFSET**: Result limiting
- **WITH**: Common Table Expressions (CTEs)

### Set Operations
- **UNION/UNION ALL**: Set union (distinct or all rows)
- **INTERSECT**: Set intersection
- **EXCEPT**: Set difference
### Joins
- **INNER JOIN**: Inner joins
- **LEFT/RIGHT JOIN**: Left and right outer joins
- **FULL OUTER JOIN**: Full outer joins
- **CROSS JOIN**: Cartesian product
### Functions
- **Aggregate**: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE
- **String**: UPPER, LOWER, LENGTH, SUBSTRING, CONCAT
- **Math**: ABS, ROUND, CEIL, FLOOR, SQRT, POW
- **Date/Time**: EXTRACT, DATE_TRUNC, NOW, CURRENT_DATE
- **Conditional**: CASE WHEN, COALESCE, NULLIF
- **Window**: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
### Data Types
- **Numeric**: INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL
- **Text**: VARCHAR, TEXT, CHAR
- **Temporal**: DATE, TIME, TIMESTAMP
- **Boolean**: BOOLEAN
- **Complex**: ARRAY, STRUCT
## Usage Examples

### Basic SQL Queries

```python
import polars as pl

# Create sample data
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "salary": [50000, 60000, 70000, 55000, 65000],
    "department": ["IT", "HR", "IT", "Finance", "IT"]
})

# Simple SELECT query
result = df.sql("SELECT name, age, salary FROM self WHERE age > 30")
print(result)

# Aggregation query
agg_result = df.sql("""
    SELECT
        department,
        COUNT(*) AS employee_count,
        AVG(salary) AS avg_salary,
        MAX(age) AS max_age
    FROM self
    GROUP BY department
    ORDER BY avg_salary DESC
""")
```
### SQL Context Usage

```python
# Create multiple DataFrames
employees = pl.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "dept_id": [1, 2, 1, 3]
})

departments = pl.DataFrame({
    "dept_id": [1, 2, 3],
    "dept_name": ["IT", "HR", "Finance"],
    "budget": [100000, 50000, 75000]
})

# Create SQL context and register tables
ctx = pl.SQLContext()
ctx.register("employees", employees)
ctx.register("departments", departments)

# Execute join query; eager=True returns a DataFrame directly
result = ctx.execute("""
    SELECT
        e.name,
        d.dept_name,
        d.budget
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY d.budget DESC
""", eager=True)
```
### Using sql() Function

```python
# Execute SQL with keyword arguments
result = pl.sql(
    """
    SELECT
        e.name,
        d.dept_name,
        d.budget / 1000.0 AS budget_in_thousands
    FROM emp e
    JOIN dept d ON e.dept_id = d.dept_id
    """,
    emp=employees,
    dept=departments
)
```
### Complex Queries with CTEs

```python
sales_data = pl.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3"],
    "revenue": [1000, 1500, 1200, 800, 1800, 1100]
})

result = sales_data.sql("""
    WITH quarterly_totals AS (
        SELECT
            quarter,
            SUM(revenue) AS total_revenue,
            COUNT(*) AS product_count
        FROM self
        GROUP BY quarter
    ),
    product_performance AS (
        SELECT
            product,
            SUM(revenue) AS total_product_revenue,
            AVG(revenue) AS avg_product_revenue
        FROM self
        GROUP BY product
    )
    SELECT
        qt.quarter,
        qt.total_revenue,
        pp.product,
        pp.avg_product_revenue
    FROM quarterly_totals qt
    CROSS JOIN product_performance pp
    WHERE pp.avg_product_revenue > 1000
    ORDER BY qt.quarter, pp.avg_product_revenue DESC
""")
```
### Window Functions

```python
from datetime import date

time_series = pl.DataFrame({
    "date": pl.date_range(date(2023, 1, 1), date(2023, 1, 10), "1d", eager=True),
    "value": [100, 105, 102, 108, 110, 107, 112, 115, 109, 118]
})

result = time_series.sql("""
    SELECT
        date,
        value,
        LAG(value, 1) OVER (ORDER BY date) AS prev_value,
        value - LAG(value, 1) OVER (ORDER BY date) AS change,
        AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3,
        ROW_NUMBER() OVER (ORDER BY value DESC) AS value_rank
    FROM self
    ORDER BY date
""")
```
### Subqueries and Set Operations

```python
large_dataset = pl.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "value": [10, 20, 15, 25, 30, 12, 28, 22],
    "region": ["North", "South", "North", "West", "South", "West", "North", "West"]
})

# Scalar subquery: select rows above the overall average, then append
# the high-value rows from the North region
result = large_dataset.sql("""
    SELECT category, region, value
    FROM self
    WHERE value > (
        SELECT AVG(value) FROM self
    )

    UNION ALL

    SELECT category, region, value
    FROM self
    WHERE region = 'North' AND value > 20

    ORDER BY category, value DESC
""")
```
### SQL Expressions in Polars Operations

```python
# Use SQL expressions within regular Polars operations
df_with_sql_expr = df.select([
    pl.col("name"),
    pl.sql_expr("age * 12").alias("age_in_months"),
    pl.sql_expr("CASE WHEN salary > 60000 THEN 'High' ELSE 'Low' END").alias("salary_category"),
    pl.sql_expr("UPPER(department)").alias("dept_upper")
])

# Filter using SQL expressions
filtered_df = df.filter(
    pl.sql_expr("age BETWEEN 25 AND 35 AND department IN ('IT', 'Finance')")
)
```
### LazyFrame SQL Integration

```python
# Create a LazyFrame from a CSV scan
lazy_df = pl.scan_csv("large_dataset.csv")

# Execute SQL on the LazyFrame (returns a LazyFrame)
lazy_result = lazy_df.sql("""
    SELECT
        category,
        COUNT(*) AS count,
        AVG(value) AS avg_value,
        SUM(CASE WHEN value > 100 THEN 1 ELSE 0 END) AS high_value_count
    FROM self
    WHERE date >= '2023-01-01'
    GROUP BY category
    HAVING AVG(value) > 50
    ORDER BY avg_value DESC
""")

# Collect the lazy result
final_result = lazy_result.collect()
```
### Advanced SQL Context Operations

```python
# Register DataFrames from the global namespace
df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})

ctx = pl.SQLContext()
ctx.register_globals()  # Automatically registers df1 and df2

# Register many frames at once (customer_df, order_df, and product_df
# are assumed to be defined elsewhere)
additional_frames = {
    "customers": customer_df,
    "orders": order_df,
    "products": product_df
}
ctx.register_many(additional_frames)

# Complex multi-table query
result = ctx.execute("""
    SELECT
        c.customer_name,
        p.product_name,
        o.quantity,
        o.price * o.quantity AS total_amount,
        RANK() OVER (PARTITION BY c.customer_id ORDER BY o.price * o.quantity DESC) AS purchase_rank
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN products p ON o.product_id = p.product_id
    WHERE o.order_date >= '2023-01-01'
      AND p.category = 'Electronics'
    ORDER BY c.customer_name, purchase_rank
""", eager=True)

# Get the list of registered tables
tables_list = ctx.tables()
print(f"Registered tables: {tables_list}")
```
## SQL and Polars Integration Benefits

### Performance Optimization
- **Predicate Pushdown**: WHERE conditions pushed to scan level
- **Projection Pushdown**: Only required columns read from storage
- **Query Optimization**: Polars optimizer works with SQL queries
- **Lazy Evaluation**: SQL queries on LazyFrames remain lazy
### Type Safety
- **Schema Inference**: Automatic type inference for SQL results
- **Type Preservation**: Polars types maintained through SQL operations
- **Error Handling**: Clear error messages for type mismatches

### Interoperability
- **Mixed Syntax**: Combine SQL with Polars expressions
- **DataFrame Methods**: SQL results work with all DataFrame methods
- **Streaming**: SQL queries work with streaming operations
- **Memory Efficiency**: Zero-copy operations where possible
## Limitations and Considerations

### SQL Feature Limitations
- Some advanced SQL features may not be supported
- Complex recursive queries not available
- Database-specific functions not included
- DDL operations (CREATE, DROP) not supported

### Performance Considerations
- SQL parsing adds slight overhead compared to native Polars
- Complex queries may benefit from native Polars expressions
- String-based queries lack compile-time type checking
- Very large queries may hit parser limits

### Best Practices
- Use SQL for complex analytical queries
- Prefer native Polars for simple transformations
- Test SQL query performance against native equivalents
- Use parameterized approaches for dynamic queries