Tessl Tile for pypi/connectorx@0.4.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

connection-management.md data-loading.md federated-queries.md index.md metadata-retrieval.md query-partitioning.md

federated-queries.mddocs/

0
# Federated Queries
1

2
Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting. Federated queries enable ConnectorX to join tables from different database sources, providing a unified query interface across heterogeneous data sources.
3

4
## Capabilities
5

6
### Cross-Database Query Execution
7

8
Execute SQL queries that span multiple databases using a single query statement.
9

10
```python { .api }
11
def read_sql(
12
    conn: dict[str, str] | dict[str, ConnectionUrl],
13
    query: str,
14
    *,
15
    return_type: Literal["pandas", "polars", "arrow"] = "pandas",
16
    strategy: str | None = None,
17
    **kwargs
18
) -> pd.DataFrame | pl.DataFrame | pa.Table:
19
    """
20
    Execute federated query across multiple databases.
21

22
    Parameters:
23
    - conn: Dictionary mapping database aliases to connection strings
24
    - query: SQL query referencing tables with database aliases (db_alias.table_name)
25
    - return_type: Output format ("pandas", "polars", "arrow")
26
    - strategy: Query rewriting strategy for join pushdown optimization
27

28
    Returns:
29
    DataFrame in specified format with joined data from multiple sources
30
    
31
    Note: Federated queries do not support partitioning or protocol specification
32
    """
33
```
34

35
## Usage Examples
36

37
### Basic Federated Query
38

39
```python
40
import connectorx as cx
41

42
# Define multiple database connections
43
connections = {
44
    "db1": "postgresql://user1:pass1@server1:5432/database1",
45
    "db2": "postgresql://user2:pass2@server2:5432/database2"
46
}
47

48
# Query across databases using aliases
49
federated_query = """
50
SELECT 
51
    n.n_name as nation_name,
52
    r.r_name as region_name,
53
    n.n_comment
54
FROM db1.nation n
55
JOIN db2.region r ON n.n_regionkey = r.r_regionkey
56
"""
57

58
df = cx.read_sql(connections, federated_query)
59
```
60

61
### Multi-Database Analytics
62

63
```python
64
# Connect to different database types
65
connections = {
66
    "sales_db": "postgresql://user:pass@sales-server:5432/sales",
67
    "customer_db": "mysql://user:pass@customer-server:3306/customers", 
68
    "inventory_db": "mssql://user:pass@inventory-server:1433/inventory"
69
}
70

71
# Complex federated analytics query
72
analytics_query = """
73
SELECT 
74
    c.customer_name,
75
    c.customer_segment,
76
    s.order_date,
77
    s.total_amount,
78
    i.product_name,
79
    i.category
80
FROM customer_db.customers c
81
JOIN sales_db.orders s ON c.customer_id = s.customer_id
82
JOIN inventory_db.products i ON s.product_id = i.product_id
83
WHERE s.order_date >= '2023-01-01'
84
  AND c.customer_segment = 'Enterprise'
85
"""
86

87
result_df = cx.read_sql(connections, analytics_query)
88
```
89

90
### Using ConnectionUrl Objects
91

92
```python
93
from connectorx import ConnectionUrl
94

95
# Build connections with ConnectionUrl for type safety
96
db1_conn = ConnectionUrl(
97
    backend="postgresql",
98
    username="analytics_user",
99
    password="secure_pass",
100
    server="analytics.company.com",
101
    port=5432,
102
    database="warehouse"
103
)
104

105
db2_conn = ConnectionUrl(
106
    backend="mysql", 
107
    username="reporting_user",
108
    password="report_pass",
109
    server="mysql.company.com",
110
    port=3306,
111
    database="reporting"
112
)
113

114
connections = {
115
    "warehouse": db1_conn,
116
    "reports": db2_conn
117
}
118

119
query = """
120
SELECT 
121
    w.fact_id,
122
    w.metric_value, 
123
    r.report_name,
124
    r.created_date
125
FROM warehouse.fact_table w
126
JOIN reports.report_metadata r ON w.report_id = r.report_id
127
"""
128

129
df = cx.read_sql(connections, query)
130
```
131

132
## Query Syntax
133

134
### Database Alias Referencing
135

136
Reference tables using the format: `database_alias.table_name`
137

138
```sql
139
-- Correct federated query syntax
140
SELECT t1.col1, t2.col2 
141
FROM db1.table1 t1
142
JOIN db2.table2 t2 ON t1.id = t2.foreign_id
143

144
-- Standard table aliases still work
145
SELECT n.nation_name, r.region_name
146
FROM db1.nation n  
147
JOIN db2.region r ON n.regionkey = r.regionkey
148
```
149

150
### Supported SQL Features
151

152
**Supported Operations:**
153
- Inner and outer joins across databases
154
- WHERE clauses with cross-database conditions
155
- Aggregations (COUNT, SUM, AVG, etc.)
156
- ORDER BY and GROUP BY clauses
157
- Subqueries within individual databases
158

159
**Current Limitations:**
160
- Subqueries spanning multiple databases
161
- Complex window functions across databases
162
- Database-specific functions may not work across all sources
163
- Transaction semantics are not guaranteed across databases
164

165
## Query Optimization
166

167
### Join Pushdown Strategy
168

169
ConnectorX automatically optimizes federated queries by pushing joins down to individual databases when possible.
170

171
```python
172
# Specify join pushdown strategy
173
df = cx.read_sql(
174
    connections,
175
    federated_query,
176
    strategy="push_down_joins"  # or other optimization strategies
177
)
178
```
179

180
### Automatic Optimization
181

182
By default, ConnectorX:
183

184
1. **Analyzes the query** to identify join patterns
185
2. **Pushes filtering** to individual databases to reduce data transfer
186
3. **Optimizes join order** based on estimated result sizes
187
4. **Minimizes network traffic** between databases
188

189
### Performance Considerations
190

191
**Best Practices:**
192
- Apply filters early to reduce data transfer
193
- Use selective joins rather than cross-products
194
- Consider data locality when designing federated queries
195
- Index join columns in source databases
196

197
**Example Optimized Query:**
198
```sql
199
-- Good: Filters applied before join
200
SELECT c.name, o.total
201
FROM customer_db.customers c
202
JOIN sales_db.orders o ON c.id = o.customer_id  
203
WHERE c.active = true           -- Filter applied at source
204
  AND o.order_date >= '2023-01-01'  -- Filter applied at source
205

206
-- Less optimal: Post-join filtering
207
SELECT c.name, o.total
208
FROM customer_db.customers c
209
JOIN sales_db.orders o ON c.id = o.customer_id
210
WHERE c.active = true AND o.order_date >= '2023-01-01'
211
```
212

213
## Return Type Support
214

215
Federated queries support multiple output formats:
216

217
```python
218
# pandas DataFrame (default)
219
df = cx.read_sql(connections, query)
220

221
# PyArrow Table
222
arrow_table = cx.read_sql(connections, query, return_type="arrow")
223

224
# Polars DataFrame  
225
polars_df = cx.read_sql(connections, query, return_type="polars")
226
```
227

228
## Error Handling
229

230
Common federated query errors:
231

232
- **Connection errors**: One or more databases unavailable
233
- **Permission errors**: Insufficient access to referenced tables
234
- **Schema mismatches**: Join columns with incompatible types
235
- **Query complexity**: Unsupported cross-database operations
236

237
```python
238
try:
239
    df = cx.read_sql(connections, federated_query)
240
except Exception as e:
241
    if "connection" in str(e).lower():
242
        print("Check database connectivity")
243
    elif "permission" in str(e).lower():
244
        print("Verify database permissions")  
245
    elif "column" in str(e).lower():
246
        print("Check column names and types")
247
    else:
248
        print(f"Federated query failed: {e}")
249
```
250

251
## Configuration Requirements
252

253
### Database Setup
254

255
For optimal federated query performance:
256

257
1. **Network connectivity**: Ensure databases can be reached from ConnectorX
258
2. **User permissions**: Grant SELECT access to required tables
259
3. **Index optimization**: Index join columns and filter columns
260
4. **Connection pooling**: Configure appropriate connection limits
261

262
### JAR Dependencies
263

264
ConnectorX uses a federated query rewriter JAR file:
265
- Automatically configured in most installations
266
- Located at `CX_REWRITER_PATH` environment variable
267
- Required for advanced join pushdown optimizations
268

269
## Limitations
270

271
### Current Restrictions
272

273
- **No partitioning support**: Federated queries cannot use `partition_on`
274
- **No protocol specification**: Protocol is automatically selected
275
- **Limited to SPJA queries**: Complex operations may not be supported
276
- **No transaction guarantees**: Each database operates independently
277

278
### Workarounds
279

280
For unsupported features:
281
- Use separate queries and join results in Python
282
- Materialize intermediate results in staging tables
283
- Consider data warehouse solutions for complex analytics
284

285
```python
286
# Workaround: Manual federation for complex cases
287
db1_data = cx.read_sql(conn1, "SELECT * FROM table1 WHERE condition")
288
db2_data = cx.read_sql(conn2, "SELECT * FROM table2 WHERE condition")
289

290
# Join in pandas
291
import pandas as pd
292
result = pd.merge(db1_data, db2_data, on='common_column')
293
```

Version

Tile

Files

federated-queries.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

federated-queries.mddocs/