tessl/pypi-connectorx

Load data from databases to dataframes, the fastest way.

Metadata Retrieval

Retrieve schema information and metadata from SQL queries without loading the full dataset. This functionality is useful for understanding query structure, validating queries, and inspecting data types before performing full data loads.

Capabilities

Query Metadata Extraction

Get column names, data types, and other metadata information from a SQL query result set.

def get_meta(
    conn: str | ConnectionUrl,
    query: str,
    protocol: Protocol | None = None,
) -> pd.DataFrame:
    """
    Get metadata (header) of the given query result set.

    Parameters:
    - conn: Database connection string or ConnectionUrl
    - query: SQL query to analyze
    - protocol: Backend-specific transfer protocol (optional)

    Returns:
    pandas DataFrame containing schema metadata
    """

Usage Examples:

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Get metadata for a simple query
query = "SELECT customer_id, order_date, total_amount FROM orders"
meta_df = cx.get_meta(postgres_url, query)
print(meta_df)

# Get metadata for complex queries with joins
complex_query = """
SELECT 
    c.customer_name,
    o.order_date,
    oi.product_id,
    oi.quantity * oi.unit_price as total_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id  
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= '2023-01-01'
"""
meta_df = cx.get_meta(postgres_url, complex_query)

# Specify protocol for optimization
meta_df = cx.get_meta(postgres_url, query, protocol="cursor")

Metadata Information

The returned DataFrame contains schema information about the query result:

Column Information

  • Column names: Names of all columns in the result set
  • Data types: Database-specific and Python-equivalent data types
  • Nullability: Whether columns can contain NULL values
  • Precision/Scale: For decimal and numeric types
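Concretely, `get_meta` returns a zero-row pandas DataFrame whose columns and dtypes carry the schema. The sketch below constructs such a frame by hand to stand in for a real `get_meta` result (the column names and dtypes are hypothetical) and shows the usual inspection calls:

```python
import pandas as pd

# Hand-built stand-in for cx.get_meta(conn, "SELECT customer_id, order_date, total_amount FROM orders"):
# a zero-row DataFrame whose columns and dtypes describe the result set.
meta_df = pd.DataFrame({
    "customer_id": pd.Series(dtype="int64"),
    "order_date": pd.Series(dtype="datetime64[ns]"),
    "total_amount": pd.Series(dtype="float64"),
})

# Column names of the result set
print(meta_df.columns.tolist())  # ['customer_id', 'order_date', 'total_amount']

# Python-side dtype for each column
print(meta_df.dtypes.to_dict())

# No data rows were fetched
assert len(meta_df) == 0
```

Because the frame has no rows, the standard pandas schema accessors (`columns`, `dtypes`) work without any data having been transferred.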

Query Validation

Because the database must parse and plan the query in order to return its header, metadata retrieval also serves as a lightweight form of query validation:

  • Syntax errors: Invalid SQL will raise errors during metadata extraction
  • Missing tables/columns: References to non-existent objects are caught
  • Type conflicts: Join and comparison type mismatches are identified
  • Permission issues: Access control violations are detected
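The failure modes above surface as soon as the database prepares the query, before any rows are fetched. As an illustrative sketch of the same idea using the standard-library `sqlite3` module (not ConnectorX itself), a `LIMIT 0` wrapper forces parsing and planning without transferring data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total_amount REAL)")

def validate(query):
    """Return an error message if the query is invalid, else None."""
    try:
        # LIMIT 0 forces the database to parse and plan the query
        # without fetching data, analogous to metadata extraction.
        conn.execute(f"SELECT * FROM ({query}) LIMIT 0")
        return None
    except sqlite3.Error as e:
        return str(e)

assert validate("SELECT customer_id FROM orders") is None            # valid query
assert validate("SELEC customer_id FROM orders") is not None         # syntax error
assert validate("SELECT customer_id FROM no_such_table") is not None # missing table
assert validate("SELECT no_such_col FROM orders") is not None        # missing column
```

With ConnectorX, `cx.get_meta(conn, query)` plays the role of `validate` here, raising on the same classes of error.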

Implementation Details

Query Execution

ConnectorX typically executes a LIMIT 1 version of your query to extract metadata:

-- Your query
SELECT customer_id, order_date, total_amount FROM orders

-- Executed for metadata (conceptually)
SELECT customer_id, order_date, total_amount FROM orders LIMIT 1
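The same technique can be reproduced with a plain database driver: execute a `LIMIT 1` variant of the query and read the cursor's column description instead of fetching rows. A minimal sketch with the standard-library `sqlite3` module (illustrative only; ConnectorX performs this step natively in Rust):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL)"
)

query = "SELECT customer_id, order_date, total_amount FROM orders"

# Execute a LIMIT 1 variant purely to obtain the result-set header.
cur = conn.execute(f"{query} LIMIT 1")

# cursor.description holds one 7-tuple per column; item 0 is the column name.
column_names = [desc[0] for desc in cur.description]
print(column_names)  # ['customer_id', 'order_date', 'total_amount']
```

Note that `cursor.description` is populated even when the table is empty, which is exactly why the metadata round trip is cheap regardless of result-set size.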

Protocol Support

Different protocols may affect metadata extraction:

  • binary: Fastest metadata extraction for most databases
  • cursor: May be required for some server configurations
  • text: Alternative for databases with binary protocol issues

Use Cases

Pre-flight Query Validation

Validate queries before executing expensive operations:

try:
    meta = cx.get_meta(conn, expensive_query)
    print(f"Query will return {len(meta.columns)} columns")
    
    # Proceed with full query
    df = cx.read_sql(conn, expensive_query, partition_num=10)
except Exception as e:
    print(f"Query validation failed: {e}")

Schema Discovery

Explore database schemas and query results:

# Discover table structure
table_meta = cx.get_meta(conn, "SELECT * FROM unknown_table")
print("Available columns:", table_meta.columns.tolist())
print("Data types:", table_meta.dtypes.to_dict())

# Analyze complex query results
join_meta = cx.get_meta(conn, complex_join_query)
for col in join_meta.columns:
    print(f"{col}: {join_meta[col].dtype}")

ETL Pipeline Planning

Plan data transformations based on source metadata:

source_meta = cx.get_meta(source_conn, extraction_query)

# Plan transformations based on source types
for col_name, col_type in source_meta.dtypes.items():
    if col_type == 'object':  # String columns
        print(f"Plan text processing for {col_name}")
    elif 'datetime' in str(col_type):  # Date columns
        print(f"Plan date transformation for {col_name}")

Dynamic Query Building

Build queries dynamically based on available columns:

# Get available columns
meta = cx.get_meta(conn, "SELECT * FROM dynamic_table")
available_cols = meta.columns.tolist()

# Build query based on available columns
required_cols = ['id', 'name', 'created_date']
existing_cols = [col for col in required_cols if col in available_cols]

if existing_cols:
    dynamic_query = f"SELECT {', '.join(existing_cols)} FROM dynamic_table"
    df = cx.read_sql(conn, dynamic_query)

Error Handling

Common errors when retrieving metadata:

  • Connection errors: Database unavailable or authentication failure
  • Permission errors: Insufficient privileges to execute query
  • Syntax errors: Invalid SQL syntax
  • Missing objects: Referenced tables/columns don't exist
  • Protocol errors: Unsupported protocol for specific database

try:
    meta = cx.get_meta(conn, query)
except Exception as e:
    if "permission denied" in str(e).lower():
        print("Check database permissions")
    elif "does not exist" in str(e).lower():
        print("Referenced table or column not found")
    else:
        print(f"Metadata extraction failed: {e}")

Install with Tessl CLI

npx tessl i tessl/pypi-connectorx
