or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

configuration.mdcore-classes.mddata-types.mdexpressions.mdfunctions.mdindex.mdio-operations.mdsql-functionality.md

sql-functionality.mddocs/

0

# SQL Functionality

1

2

Native SQL support allowing you to query DataFrames and LazyFrames using familiar SQL syntax with full integration into the polars ecosystem. Execute SQL queries directly on your data structures or within the context of broader data pipelines.

3

4

## Capabilities

5

6

### SQL Context

7

8

Manage and execute SQL queries across multiple named DataFrames and LazyFrames.

9

10

```python { .api }

11

class SQLContext:

12

def __init__(self, frames: dict[str, DataFrame | LazyFrame] | None = None):

13

"""

14

Create SQL context for managing multiple data sources.

15

16

Parameters:

17

- frames: Dictionary of {name: DataFrame/LazyFrame} to register

18

"""

19

20

def register(self, name: str, frame: DataFrame | LazyFrame) -> SQLContext:

21

"""

22

Register DataFrame or LazyFrame with a name for SQL queries.

23

24

Parameters:

25

- name: Table name to use in SQL queries

26

- frame: DataFrame or LazyFrame to register

27

28

Returns:

29

- SQLContext: Self for method chaining

30

"""

31

32

def register_globals(self, n: int = 1) -> SQLContext:

33

"""

34

Register DataFrames/LazyFrames from global namespace.

35

36

Parameters:

37

- n: Number of stack frames to look back

38

39

Returns:

40

- SQLContext: Self for method chaining

41

"""

42

43

def register_many(self, frames: dict[str, DataFrame | LazyFrame]) -> SQLContext:

44

"""

45

Register multiple frames at once.

46

47

Parameters:

48

- frames: Dictionary of {name: frame} pairs

49

50

Returns:

51

- SQLContext: Self for method chaining

52

"""

53

54

def unregister(self, name: str) -> SQLContext:

55

"""

56

Unregister a table from the context.

57

58

Parameters:

59

- name: Table name to unregister

60

61

Returns:

62

- SQLContext: Self for method chaining

63

"""

64

65

def execute(

66

self,

67

query: str,

68

*,

69

eager: bool = False

70

) -> DataFrame | LazyFrame:

71

"""

72

Execute SQL query and return result.

73

74

Parameters:

75

- query: SQL query string

76

- eager: Return DataFrame (True) or LazyFrame (False)

77

78

Returns:

79

- DataFrame | LazyFrame: Query result

80

"""

81

82

def tables(self) -> list[str]:

83

"""

84

Get list of registered table names.

85

86

Returns:

87

- list[str]: Registered table names

88

"""

89

```

90

91

### SQL Query Function

92

93

Execute SQL queries directly on DataFrames and LazyFrames.

94

95

```python { .api }

96

def sql(

97

query: str,

98

*,

99

eager: bool = True,

100

**frames: DataFrame | LazyFrame

101

) -> DataFrame | LazyFrame:

102

"""

103

Execute SQL query on provided frames.

104

105

Parameters:

106

- query: SQL query string

107

- eager: Return DataFrame (True) or LazyFrame (False)

108

- frames: Keyword arguments mapping table names to DataFrames/LazyFrames

109

110

Returns:

111

- DataFrame | LazyFrame: Query result

112

"""

113

```

114

115

### DataFrame/LazyFrame SQL Methods

116

117

Execute SQL queries directly on individual frames.

118

119

```python { .api }

120

# Available on DataFrame and LazyFrame classes

121

def sql(self, query: str, *, table_name: str = "self") -> DataFrame | LazyFrame:

122

"""

123

Execute SQL query with this frame as the primary table.

124

125

Parameters:

126

- query: SQL query string (can reference table_name)

127

- table_name: Name to use for this frame in the query

128

129

Returns:

130

- DataFrame | LazyFrame: Query result (same type as caller)

131

"""

132

```

133

134

### SQL Expression Function

135

136

Create expressions from SQL fragments for use in select, filter, etc.

137

138

```python { .api }

139

def sql_expr(sql: str) -> Expr:

140

"""

141

Create expression from SQL fragment.

142

143

Parameters:

144

- sql: SQL expression string

145

146

Returns:

147

- Expr: Expression from SQL fragment

148

"""

149

```

150

151

## Supported SQL Features

152

153

### DDL Operations

154

- **SELECT**: Column selection, aliases, expressions

155

- **FROM**: Table references, subqueries

156

- **WHERE**: Filtering conditions

157

- **GROUP BY**: Grouping operations

158

- **HAVING**: Group filtering

159

- **ORDER BY**: Sorting

160

- **LIMIT/OFFSET**: Result limiting

161

- **WITH**: Common Table Expressions (CTEs)

162

163

### DML Operations

164

- **UNION/UNION ALL**: Set operations

165

- **INTERSECT**: Set intersection

166

- **EXCEPT**: Set difference

167

168

### Joins

169

- **INNER JOIN**: Inner joins

170

- **LEFT/RIGHT JOIN**: Outer joins

171

- **FULL OUTER JOIN**: Full outer joins

172

- **CROSS JOIN**: Cartesian product

173

174

### Functions

175

- **Aggregate**: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE

176

- **String**: UPPER, LOWER, LENGTH, SUBSTRING, CONCAT

177

- **Math**: ABS, ROUND, CEIL, FLOOR, SQRT, POW

178

- **Date/Time**: EXTRACT, DATE_TRUNC, NOW, CURRENT_DATE

179

- **Conditional**: CASE WHEN, COALESCE, NULLIF

180

- **Window**: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

181

182

### Data Types

183

- **Numeric**: INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL

184

- **Text**: VARCHAR, TEXT, CHAR

185

- **Temporal**: DATE, TIME, TIMESTAMP

186

- **Boolean**: BOOLEAN

187

- **Complex**: ARRAY, STRUCT

188

189

## Usage Examples

190

191

### Basic SQL Queries

192

193

```python

194

import polars as pl

195

196

# Create sample data

197

df = pl.DataFrame({

198

"id": [1, 2, 3, 4, 5],

199

"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],

200

"age": [25, 30, 35, 28, 32],

201

"salary": [50000, 60000, 70000, 55000, 65000],

202

"department": ["IT", "HR", "IT", "Finance", "IT"]

203

})

204

205

# Simple SELECT query

206

result = df.sql("SELECT name, age, salary FROM self WHERE age > 30")

207

print(result)

208

209

# Aggregation query

210

agg_result = df.sql("""

211

SELECT

212

department,

213

COUNT(*) as employee_count,

214

AVG(salary) as avg_salary,

215

MAX(age) as max_age

216

FROM self

217

GROUP BY department

218

ORDER BY avg_salary DESC

219

""")

220

```

221

222

### SQL Context Usage

223

224

```python

225

# Create multiple DataFrames

226

employees = pl.DataFrame({

227

"emp_id": [1, 2, 3, 4],

228

"name": ["Alice", "Bob", "Charlie", "Diana"],

229

"dept_id": [1, 2, 1, 3]

230

})

231

232

departments = pl.DataFrame({

233

"dept_id": [1, 2, 3],

234

"dept_name": ["IT", "HR", "Finance"],

235

"budget": [100000, 50000, 75000]

236

})

237

238

# Create SQL context and register tables

239

ctx = pl.SQLContext()

240

ctx.register("employees", employees)

241

ctx.register("departments", departments)

242

243

# Execute join query

244

result = ctx.execute("""

245

SELECT

246

e.name,

247

d.dept_name,

248

d.budget

249

FROM employees e

250

JOIN departments d ON e.dept_id = d.dept_id

251

ORDER BY d.budget DESC

252

""")

253

```

254

255

### Using sql() Function

256

257

```python

258

# Execute SQL with keyword arguments

259

result = pl.sql(

260

"""

261

SELECT

262

e.name,

263

d.dept_name,

264

e.salary / d.budget * 100 as salary_percentage

265

FROM emp e

266

JOIN dept d ON e.dept_id = d.dept_id

267

""",

268

emp=employees,

269

dept=departments

270

)

271

```

272

273

### Complex Queries with CTEs

274

275

```python

276

sales_data = pl.DataFrame({

277

"product": ["A", "B", "A", "C", "B", "A"],

278

"quarter": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3"],

279

"revenue": [1000, 1500, 1200, 800, 1800, 1100]

280

})

281

282

result = sales_data.sql("""

283

WITH quarterly_totals AS (

284

SELECT

285

quarter,

286

SUM(revenue) as total_revenue,

287

COUNT(*) as product_count

288

FROM self

289

GROUP BY quarter

290

),

291

product_performance AS (

292

SELECT

293

product,

294

SUM(revenue) as total_product_revenue,

295

AVG(revenue) as avg_product_revenue

296

FROM self

297

GROUP BY product

298

)

299

SELECT

300

qt.quarter,

301

qt.total_revenue,

302

pp.product,

303

pp.avg_product_revenue

304

FROM quarterly_totals qt

305

CROSS JOIN product_performance pp

306

WHERE pp.avg_product_revenue > 1000

307

ORDER BY qt.quarter, pp.avg_product_revenue DESC

308

""")

309

```

310

311

### Window Functions

312

313

```python

314

time_series = pl.DataFrame({

315

"date": pl.date_range(pl.date(2023, 1, 1), pl.date(2023, 1, 10), "1d", eager=True),

316

"value": [100, 105, 102, 108, 110, 107, 112, 115, 109, 118]

317

})

318

319

result = time_series.sql("""

320

SELECT

321

date,

322

value,

323

LAG(value, 1) OVER (ORDER BY date) as prev_value,

324

value - LAG(value, 1) OVER (ORDER BY date) as change,

325

AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3,

326

ROW_NUMBER() OVER (ORDER BY value DESC) as value_rank

327

FROM self

328

ORDER BY date

329

""")

330

```

331

332

### Subqueries and Set Operations

333

334

```python

335

large_dataset = pl.DataFrame({

336

"category": ["A", "B", "A", "C", "B", "A", "C", "B"],

337

"value": [10, 20, 15, 25, 30, 12, 28, 22],

338

"region": ["North", "South", "North", "West", "South", "West", "North", "West"]

339

})

340

341

result = large_dataset.sql("""

342

SELECT category, region, value

343

FROM self

344

WHERE value > (

345

SELECT AVG(value)

346

FROM self s2

347

WHERE s2.category = self.category

348

)

349

350

UNION ALL

351

352

SELECT category, region, value

353

FROM self

354

WHERE region = 'North' AND value > 20

355

356

ORDER BY category, value DESC

357

""")

358

```

359

360

### SQL Expressions in Polars Operations

361

362

```python

363

# Use SQL expressions within regular Polars operations

364

df_with_sql_expr = df.select([

365

pl.col("name"),

366

pl.sql_expr("age * 12").alias("age_in_months"),

367

pl.sql_expr("CASE WHEN salary > 60000 THEN 'High' ELSE 'Low' END").alias("salary_category"),

368

pl.sql_expr("UPPER(department)").alias("dept_upper")

369

])

370

371

# Filter using SQL expressions

372

filtered_df = df.filter(

373

pl.sql_expr("age BETWEEN 25 AND 35 AND department IN ('IT', 'Finance')")

374

)

375

```

376

377

### LazyFrame SQL Integration

378

379

```python

380

# Create LazyFrame with SQL

381

lazy_df = pl.scan_csv("large_dataset.csv")

382

383

# Execute SQL on LazyFrame (returns LazyFrame)

384

lazy_result = lazy_df.sql("""

385

SELECT

386

category,

387

COUNT(*) as count,

388

AVG(value) as avg_value,

389

SUM(CASE WHEN value > 100 THEN 1 ELSE 0 END) as high_value_count

390

FROM self

391

WHERE date >= '2023-01-01'

392

GROUP BY category

393

HAVING avg_value > 50

394

ORDER BY avg_value DESC

395

""")

396

397

# Collect the lazy result

398

final_result = lazy_result.collect()

399

```

400

401

### Advanced SQL Context Operations

402

403

```python

404

# Register DataFrames from global namespace

405

df1 = pl.DataFrame({"a": [1, 2, 3]})

406

df2 = pl.DataFrame({"b": [4, 5, 6]})

407

408

ctx = pl.SQLContext()

409

ctx.register_globals() # Automatically registers df1 and df2

410

411

# Register many frames at once

412

additional_frames = {

413

"customers": customer_df,

414

"orders": order_df,

415

"products": product_df

416

}

417

ctx.register_many(additional_frames)

418

419

# Complex multi-table query

420

result = ctx.execute("""

421

SELECT

422

c.customer_name,

423

p.product_name,

424

o.quantity,

425

o.price * o.quantity as total_amount,

426

RANK() OVER (PARTITION BY c.customer_id ORDER BY o.price * o.quantity DESC) as purchase_rank

427

FROM customers c

428

JOIN orders o ON c.customer_id = o.customer_id

429

JOIN products p ON o.product_id = p.product_id

430

WHERE o.order_date >= '2023-01-01'

431

AND p.category = 'Electronics'

432

ORDER BY c.customer_name, purchase_rank

433

""")

434

435

# Get list of registered tables

436

tables_list = ctx.tables()

437

print(f"Registered tables: {tables_list}")

438

```

439

440

## SQL and Polars Integration Benefits

441

442

### Performance Optimization

443

- **Predicate Pushdown**: WHERE conditions pushed to scan level

444

- **Projection Pushdown**: Only required columns read from storage

445

- **Query Optimization**: Polars optimizer works with SQL queries

446

- **Lazy Evaluation**: SQL queries on LazyFrames remain lazy

447

448

### Type Safety

449

- **Schema Inference**: Automatic type inference for SQL results

450

- **Type Preservation**: Polars types maintained through SQL operations

451

- **Error Handling**: Clear error messages for type mismatches

452

453

### Interoperability

454

- **Mixed Syntax**: Combine SQL with Polars expressions

455

- **DataFrame Methods**: SQL results work with all DataFrame methods

456

- **Streaming**: SQL queries work with streaming operations

457

- **Memory Efficiency**: Zero-copy operations where possible

458

459

## Limitations and Considerations

460

461

### SQL Feature Limitations

462

- Some advanced SQL features may not be supported

463

- Complex recursive queries not available

464

- Database-specific functions not included

465

- DDL operations (CREATE, DROP) not supported

466

467

### Performance Considerations

468

- SQL parsing adds slight overhead compared to native Polars

469

- Complex queries may benefit from native Polars expressions

470

- String-based queries lack compile-time type checking

471

- Very large queries may hit parser limits

472

473

### Best Practices

474

- Use SQL for complex analytical queries

475

- Prefer native Polars for simple transformations

476

- Test SQL query performance against native equivalents

477

- Use parameterized approaches for dynamic queries