
tessl/pypi-polars-lts-cpu

Blazingly fast DataFrame library for legacy CPUs without AVX2 support


SQL Functionality

Native SQL support lets you query DataFrames and LazyFrames using familiar SQL syntax, fully integrated with the polars ecosystem. Execute SQL queries directly on your data structures or as part of broader data pipelines.

Capabilities

SQL Context

Manage and execute SQL queries across multiple named DataFrames and LazyFrames.

class SQLContext:
    def __init__(self, frames: dict[str, DataFrame | LazyFrame] | None = None):
        """
        Create SQL context for managing multiple data sources.
        
        Parameters:
        - frames: Dictionary of {name: DataFrame/LazyFrame} to register
        """

    def register(self, name: str, frame: DataFrame | LazyFrame) -> SQLContext:
        """
        Register DataFrame or LazyFrame with a name for SQL queries.
        
        Parameters:
        - name: Table name to use in SQL queries
        - frame: DataFrame or LazyFrame to register
        
        Returns:
        - SQLContext: Self for method chaining
        """

    def register_globals(self, n: int = 1) -> SQLContext:
        """
        Register DataFrames/LazyFrames from global namespace.
        
        Parameters:
        - n: Number of stack frames to look back
        
        Returns:
        - SQLContext: Self for method chaining
        """

    def register_many(self, frames: dict[str, DataFrame | LazyFrame]) -> SQLContext:
        """
        Register multiple frames at once.
        
        Parameters:
        - frames: Dictionary of {name: frame} pairs
        
        Returns:
        - SQLContext: Self for method chaining
        """

    def unregister(self, name: str) -> SQLContext:
        """
        Unregister a table from the context.
        
        Parameters:
        - name: Table name to unregister
        
        Returns:
        - SQLContext: Self for method chaining
        """

    def execute(
        self,
        query: str,
        *,
        eager: bool = False
    ) -> DataFrame | LazyFrame:
        """
        Execute SQL query and return result.
        
        Parameters:
        - query: SQL query string
        - eager: Return DataFrame (True) or LazyFrame (False)
        
        Returns:
        - DataFrame | LazyFrame: Query result
        """

    def tables(self) -> list[str]:
        """
        Get list of registered table names.
        
        Returns:
        - list[str]: Registered table names
        """

SQL Query Function

Execute SQL queries directly on DataFrames and LazyFrames.

def sql(
    query: str,
    *,
    eager: bool = True,
    **frames: DataFrame | LazyFrame
) -> DataFrame | LazyFrame:
    """
    Execute SQL query on provided frames.
    
    Parameters:
    - query: SQL query string
    - eager: Return DataFrame (True) or LazyFrame (False)
    - frames: Keyword arguments mapping table names to DataFrames/LazyFrames
    
    Returns:
    - DataFrame | LazyFrame: Query result
    """

DataFrame/LazyFrame SQL Methods

Execute SQL queries directly on individual frames.

# Available on DataFrame and LazyFrame classes
def sql(self, query: str, *, table_name: str = "self") -> DataFrame | LazyFrame:
    """
    Execute SQL query with this frame as the primary table.
    
    Parameters:
    - query: SQL query string (can reference table_name)
    - table_name: Name to use for this frame in the query
    
    Returns:
    - DataFrame | LazyFrame: Query result (same type as caller)
    """

SQL Expression Function

Create expressions from SQL fragments for use in select, filter, etc.

def sql_expr(sql: str) -> Expr:
    """
    Create expression from SQL fragment.
    
    Parameters:
    - sql: SQL expression string
    
    Returns:
    - Expr: Expression from SQL fragment
    """

Supported SQL Features

Query Operations

  • SELECT: Column selection, aliases, expressions
  • FROM: Table references, subqueries
  • WHERE: Filtering conditions
  • GROUP BY: Grouping operations
  • HAVING: Group filtering
  • ORDER BY: Sorting
  • LIMIT/OFFSET: Result limiting
  • WITH: Common Table Expressions (CTEs)

Set Operations

  • UNION/UNION ALL: Set operations
  • INTERSECT: Set intersection
  • EXCEPT: Set difference

Joins

  • INNER JOIN: Inner joins
  • LEFT/RIGHT JOIN: Outer joins
  • FULL OUTER JOIN: Full outer joins
  • CROSS JOIN: Cartesian product

Functions

  • Aggregate: SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE
  • String: UPPER, LOWER, LENGTH, SUBSTRING, CONCAT
  • Math: ABS, ROUND, CEIL, FLOOR, SQRT, POW
  • Date/Time: EXTRACT, DATE_TRUNC, NOW, CURRENT_DATE
  • Conditional: CASE WHEN, COALESCE, NULLIF
  • Window: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

Data Types

  • Numeric: INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL
  • Text: VARCHAR, TEXT, CHAR
  • Temporal: DATE, TIME, TIMESTAMP
  • Boolean: BOOLEAN
  • Complex: ARRAY, STRUCT

Usage Examples

Basic SQL Queries

import polars as pl

# Create sample data
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 32],
    "salary": [50000, 60000, 70000, 55000, 65000],
    "department": ["IT", "HR", "IT", "Finance", "IT"]
})

# Simple SELECT query
result = df.sql("SELECT name, age, salary FROM self WHERE age > 30")
print(result)

# Aggregation query
agg_result = df.sql("""
    SELECT 
        department,
        COUNT(*) as employee_count,
        AVG(salary) as avg_salary,
        MAX(age) as max_age
    FROM self 
    GROUP BY department
    ORDER BY avg_salary DESC
""")

SQL Context Usage

# Create multiple DataFrames
employees = pl.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "dept_id": [1, 2, 1, 3]
})

departments = pl.DataFrame({
    "dept_id": [1, 2, 3],
    "dept_name": ["IT", "HR", "Finance"],
    "budget": [100000, 50000, 75000]
})

# Create SQL context and register tables
ctx = pl.SQLContext()
ctx.register("employees", employees)
ctx.register("departments", departments)

# Execute join query
result = ctx.execute("""
    SELECT 
        e.name,
        d.dept_name,
        d.budget
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY d.budget DESC
""")

Using sql() Function

# Execute SQL with keyword arguments
result = pl.sql(
    """
    SELECT 
        e.name,
        d.dept_name,
        e.salary / d.budget * 100 as salary_percentage
    FROM emp e
    JOIN dept d ON e.dept_id = d.dept_id
    """,
    emp=employees,
    dept=departments
)

Complex Queries with CTEs

sales_data = pl.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3"],
    "revenue": [1000, 1500, 1200, 800, 1800, 1100]
})

result = sales_data.sql("""
    WITH quarterly_totals AS (
        SELECT 
            quarter,
            SUM(revenue) as total_revenue,
            COUNT(*) as product_count
        FROM self
        GROUP BY quarter
    ),
    product_performance AS (
        SELECT 
            product,
            SUM(revenue) as total_product_revenue,
            AVG(revenue) as avg_product_revenue
        FROM self
        GROUP BY product
    )
    SELECT 
        qt.quarter,
        qt.total_revenue,
        pp.product,
        pp.avg_product_revenue
    FROM quarterly_totals qt
    CROSS JOIN product_performance pp
    WHERE pp.avg_product_revenue > 1000
    ORDER BY qt.quarter, pp.avg_product_revenue DESC
""")

Window Functions

from datetime import date

time_series = pl.DataFrame({
    "date": pl.date_range(date(2023, 1, 1), date(2023, 1, 10), "1d", eager=True),
    "value": [100, 105, 102, 108, 110, 107, 112, 115, 109, 118]
})

result = time_series.sql("""
    SELECT 
        date,
        value,
        LAG(value, 1) OVER (ORDER BY date) as prev_value,
        value - LAG(value, 1) OVER (ORDER BY date) as change,
        AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3,
        ROW_NUMBER() OVER (ORDER BY value DESC) as value_rank
    FROM self
    ORDER BY date
""")

Subqueries and Set Operations

large_dataset = pl.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "value": [10, 20, 15, 25, 30, 12, 28, 22],
    "region": ["North", "South", "North", "West", "South", "West", "North", "West"]
})

result = large_dataset.sql("""
    SELECT category, region, value
    FROM self
    WHERE value > (
        SELECT AVG(value) 
        FROM self s2 
        WHERE s2.category = self.category
    )
    
    UNION ALL
    
    SELECT category, region, value
    FROM self
    WHERE region = 'North' AND value > 20
    
    ORDER BY category, value DESC
""")

SQL Expressions in Polars Operations

# Use SQL expressions within regular Polars operations
df_with_sql_expr = df.select([
    pl.col("name"),
    pl.sql_expr("age * 12").alias("age_in_months"),
    pl.sql_expr("CASE WHEN salary > 60000 THEN 'High' ELSE 'Low' END").alias("salary_category"),
    pl.sql_expr("UPPER(department)").alias("dept_upper")
])

# Filter using SQL expressions
filtered_df = df.filter(
    pl.sql_expr("age BETWEEN 25 AND 35 AND department IN ('IT', 'Finance')")
)

LazyFrame SQL Integration

# Create LazyFrame with SQL
lazy_df = pl.scan_csv("large_dataset.csv")

# Execute SQL on LazyFrame (returns LazyFrame)
lazy_result = lazy_df.sql("""
    SELECT 
        category,
        COUNT(*) as count,
        AVG(value) as avg_value,
        SUM(CASE WHEN value > 100 THEN 1 ELSE 0 END) as high_value_count
    FROM self
    WHERE date >= '2023-01-01'
    GROUP BY category
    HAVING avg_value > 50
    ORDER BY avg_value DESC
""")

# Collect the lazy result
final_result = lazy_result.collect()

Advanced SQL Context Operations

# Register DataFrames from global namespace
df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})

ctx = pl.SQLContext()
ctx.register_globals()  # Automatically registers df1 and df2

# Register many frames at once
additional_frames = {
    "customers": customer_df,
    "orders": order_df,
    "products": product_df
}
ctx.register_many(additional_frames)

# Complex multi-table query
result = ctx.execute("""
    SELECT 
        c.customer_name,
        p.product_name,
        o.quantity,
        o.price * o.quantity as total_amount,
        RANK() OVER (PARTITION BY c.customer_id ORDER BY o.price * o.quantity DESC) as purchase_rank
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN products p ON o.product_id = p.product_id
    WHERE o.order_date >= '2023-01-01'
      AND p.category = 'Electronics'
    ORDER BY c.customer_name, purchase_rank
""")

# Get list of registered tables
tables_list = ctx.tables()
print(f"Registered tables: {tables_list}")

SQL and Polars Integration Benefits

Performance Optimization

  • Predicate Pushdown: WHERE conditions pushed to scan level
  • Projection Pushdown: Only required columns read from storage
  • Query Optimization: Polars optimizer works with SQL queries
  • Lazy Evaluation: SQL queries on LazyFrames remain lazy

Type Safety

  • Schema Inference: Automatic type inference for SQL results
  • Type Preservation: Polars types maintained through SQL operations
  • Error Handling: Clear error messages for type mismatches

Interoperability

  • Mixed Syntax: Combine SQL with Polars expressions
  • DataFrame Methods: SQL results work with all DataFrame methods
  • Streaming: SQL queries work with streaming operations
  • Memory Efficiency: Zero-copy operations where possible

Limitations and Considerations

SQL Feature Limitations

  • Some advanced SQL features may not be supported
  • Complex recursive queries not available
  • Database-specific functions not included
  • DDL operations (CREATE, DROP) not supported

Performance Considerations

  • SQL parsing adds slight overhead compared to native Polars
  • Complex queries may benefit from native Polars expressions
  • String-based queries lack compile-time type checking
  • Very large queries may hit parser limits

Best Practices

  • Use SQL for complex analytical queries
  • Prefer native Polars for simple transformations
  • Test SQL query performance against native equivalents
  • Use parameterized approaches for dynamic queries

Install with Tessl CLI

npx tessl i tessl/pypi-polars-lts-cpu

docs

  • configuration.md
  • core-classes.md
  • data-types.md
  • expressions.md
  • functions.md
  • index.md
  • io-operations.md
  • sql-functionality.md
  • tile.json