0
# Federated Queries
1
2
Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting. Federated queries enable ConnectorX to join tables from different database sources, providing a unified query interface across heterogeneous data sources.
3
4
## Capabilities
5
6
### Cross-Database Query Execution
7
8
Execute SQL queries that span multiple databases using a single query statement.
9
10
```python { .api }
11
def read_sql(
12
conn: dict[str, str] | dict[str, ConnectionUrl],
13
query: str,
14
*,
15
return_type: Literal["pandas", "polars", "arrow"] = "pandas",
16
strategy: str | None = None,
17
**kwargs
18
) -> pd.DataFrame | pl.DataFrame | pa.Table:
19
"""
20
Execute federated query across multiple databases.
21
22
Parameters:
23
- conn: Dictionary mapping database aliases to connection strings
24
- query: SQL query referencing tables with database aliases (db_alias.table_name)
25
- return_type: Output format ("pandas", "polars", "arrow")
26
- strategy: Query rewriting strategy for join pushdown optimization
27
28
Returns:
29
DataFrame in specified format with joined data from multiple sources
30
31
Note: Federated queries do not support partitioning or protocol specification
32
"""
33
```
34
35
## Usage Examples
36
37
### Basic Federated Query
38
39
```python
40
import connectorx as cx
41
42
# Define multiple database connections
43
connections = {
44
"db1": "postgresql://user1:pass1@server1:5432/database1",
45
"db2": "postgresql://user2:pass2@server2:5432/database2"
46
}
47
48
# Query across databases using aliases
49
federated_query = """
50
SELECT
51
n.n_name as nation_name,
52
r.r_name as region_name,
53
n.n_comment
54
FROM db1.nation n
55
JOIN db2.region r ON n.n_regionkey = r.r_regionkey
56
"""
57
58
df = cx.read_sql(connections, federated_query)
59
```
60
61
### Multi-Database Analytics
62
63
```python
64
# Connect to different database types
65
connections = {
66
"sales_db": "postgresql://user:pass@sales-server:5432/sales",
67
"customer_db": "mysql://user:pass@customer-server:3306/customers",
68
"inventory_db": "mssql://user:pass@inventory-server:1433/inventory"
69
}
70
71
# Complex federated analytics query
72
analytics_query = """
73
SELECT
74
c.customer_name,
75
c.customer_segment,
76
s.order_date,
77
s.total_amount,
78
i.product_name,
79
i.category
80
FROM customer_db.customers c
81
JOIN sales_db.orders s ON c.customer_id = s.customer_id
82
JOIN inventory_db.products i ON s.product_id = i.product_id
83
WHERE s.order_date >= '2023-01-01'
84
AND c.customer_segment = 'Enterprise'
85
"""
86
87
result_df = cx.read_sql(connections, analytics_query)
88
```
89
90
### Using ConnectionUrl Objects
91
92
```python
93
from connectorx import ConnectionUrl
94
95
# Build connections with ConnectionUrl for type safety
96
db1_conn = ConnectionUrl(
97
backend="postgresql",
98
username="analytics_user",
99
password="secure_pass",
100
server="analytics.company.com",
101
port=5432,
102
database="warehouse"
103
)
104
105
db2_conn = ConnectionUrl(
106
backend="mysql",
107
username="reporting_user",
108
password="report_pass",
109
server="mysql.company.com",
110
port=3306,
111
database="reporting"
112
)
113
114
connections = {
115
"warehouse": db1_conn,
116
"reports": db2_conn
117
}
118
119
query = """
120
SELECT
121
w.fact_id,
122
w.metric_value,
123
r.report_name,
124
r.created_date
125
FROM warehouse.fact_table w
126
JOIN reports.report_metadata r ON w.report_id = r.report_id
127
"""
128
129
df = cx.read_sql(connections, query)
130
```
131
132
## Query Syntax
133
134
### Database Alias Referencing
135
136
Reference tables using the format: `database_alias.table_name`
137
138
```sql
139
-- Correct federated query syntax
140
SELECT t1.col1, t2.col2
141
FROM db1.table1 t1
142
JOIN db2.table2 t2 ON t1.id = t2.foreign_id
143
144
-- Standard table aliases still work
145
SELECT n.nation_name, r.region_name
146
FROM db1.nation n
147
JOIN db2.region r ON n.regionkey = r.regionkey
148
```
149
150
### Supported SQL Features
151
152
**Supported Operations:**
153
- Inner and outer joins across databases
154
- WHERE clauses with cross-database conditions
155
- Aggregations (COUNT, SUM, AVG, etc.)
156
- ORDER BY and GROUP BY clauses
157
- Subqueries within individual databases
158
159
**Current Limitations:**
160
- Subqueries spanning multiple databases
161
- Complex window functions across databases
162
- Database-specific functions may not work across all sources
163
- Transaction semantics are not guaranteed across databases
164
165
## Query Optimization
166
167
### Join Pushdown Strategy
168
169
ConnectorX automatically optimizes federated queries by pushing joins down to individual databases when possible.
170
171
```python
172
# Specify join pushdown strategy
173
df = cx.read_sql(
174
connections,
175
federated_query,
176
strategy="push_down_joins" # or other optimization strategies
177
)
178
```
179
180
### Automatic Optimization
181
182
By default, ConnectorX:
183
184
1. **Analyzes the query** to identify join patterns
185
2. **Pushes filtering** to individual databases to reduce data transfer
186
3. **Optimizes join order** based on estimated result sizes
187
4. **Minimizes network traffic** between databases
188
189
### Performance Considerations
190
191
**Best Practices:**
192
- Apply filters early to reduce data transfer
193
- Use selective joins rather than cross-products
194
- Consider data locality when designing federated queries
195
- Index join columns in source databases
196
197
**Example Optimized Query:**
198
```sql
199
-- Good: Filters applied before join
200
SELECT c.name, o.total
201
FROM customer_db.customers c
202
JOIN sales_db.orders o ON c.id = o.customer_id
203
WHERE c.active = true -- Filter applied at source
204
AND o.order_date >= '2023-01-01' -- Filter applied at source
205
206
-- Less optimal: Post-join filtering
207
SELECT c.name, o.total
208
FROM customer_db.customers c
209
JOIN sales_db.orders o ON c.id = o.customer_id
210
WHERE c.active = true AND o.order_date >= '2023-01-01'
211
```
212
213
## Return Type Support
214
215
Federated queries support multiple output formats:
216
217
```python
218
# pandas DataFrame (default)
219
df = cx.read_sql(connections, query)
220
221
# PyArrow Table
222
arrow_table = cx.read_sql(connections, query, return_type="arrow")
223
224
# Polars DataFrame
225
polars_df = cx.read_sql(connections, query, return_type="polars")
226
```
227
228
## Error Handling
229
230
Common federated query errors:
231
232
- **Connection errors**: One or more databases unavailable
233
- **Permission errors**: Insufficient access to referenced tables
234
- **Schema mismatches**: Join columns with incompatible types
235
- **Query complexity**: Unsupported cross-database operations
236
237
```python
238
try:
239
df = cx.read_sql(connections, federated_query)
240
except Exception as e:
241
if "connection" in str(e).lower():
242
print("Check database connectivity")
243
elif "permission" in str(e).lower():
244
print("Verify database permissions")
245
elif "column" in str(e).lower():
246
print("Check column names and types")
247
else:
248
print(f"Federated query failed: {e}")
249
```
250
251
## Configuration Requirements
252
253
### Database Setup
254
255
For optimal federated query performance:
256
257
1. **Network connectivity**: Ensure databases can be reached from ConnectorX
258
2. **User permissions**: Grant SELECT access to required tables
259
3. **Index optimization**: Index join columns and filter columns
260
4. **Connection pooling**: Configure appropriate connection limits
261
262
### JAR Dependencies
263
264
ConnectorX uses a federated query rewriter JAR file:
265
- Automatically configured in most installations
266
- Located at `CX_REWRITER_PATH` environment variable
267
- Required for advanced join pushdown optimizations
268
269
## Limitations
270
271
### Current Restrictions
272
273
- **No partitioning support**: Federated queries cannot use `partition_on`
274
- **No protocol specification**: Protocol is automatically selected
275
- **Limited to SPJA queries**: Complex operations may not be supported
276
- **No transaction guarantees**: Each database operates independently
277
278
### Workarounds
279
280
For unsupported features:
281
- Use separate queries and join results in Python
282
- Materialize intermediate results in staging tables
283
- Consider data warehouse solutions for complex analytics
284
285
```python
286
# Workaround: Manual federation for complex cases
287
db1_data = cx.read_sql(conn1, "SELECT * FROM table1 WHERE condition")
288
db2_data = cx.read_sql(conn2, "SELECT * FROM table2 WHERE condition")
289
290
# Join in pandas
291
import pandas as pd
292
result = pd.merge(db1_data, db2_data, on='common_column')
293
```