or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

connection-management.mddata-loading.mdfederated-queries.mdindex.mdmetadata-retrieval.mdquery-partitioning.md

federated-queries.mddocs/

0

# Federated Queries

1

2

Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting. Federated queries enable ConnectorX to join tables from different database sources, providing a unified query interface across heterogeneous data sources.

3

4

## Capabilities

5

6

### Cross-Database Query Execution

7

8

Execute SQL queries that span multiple databases using a single query statement.

9

10

```python { .api }

11

def read_sql(

12

conn: dict[str, str] | dict[str, ConnectionUrl],

13

query: str,

14

*,

15

return_type: Literal["pandas", "polars", "arrow"] = "pandas",

16

strategy: str | None = None,

17

**kwargs

18

) -> pd.DataFrame | pl.DataFrame | pa.Table:

19

"""

20

Execute federated query across multiple databases.

21

22

Parameters:

23

- conn: Dictionary mapping database aliases to connection strings

24

- query: SQL query referencing tables with database aliases (db_alias.table_name)

25

- return_type: Output format ("pandas", "polars", "arrow")

26

- strategy: Query rewriting strategy for join pushdown optimization

27

28

Returns:

29

DataFrame in specified format with joined data from multiple sources

30

31

Note: Federated queries do not support partitioning or protocol specification

32

"""

33

```

34

35

## Usage Examples

36

37

### Basic Federated Query

38

39

```python

40

import connectorx as cx

41

42

# Define multiple database connections

43

connections = {

44

"db1": "postgresql://user1:pass1@server1:5432/database1",

45

"db2": "postgresql://user2:pass2@server2:5432/database2"

46

}

47

48

# Query across databases using aliases

49

federated_query = """

50

SELECT

51

n.n_name as nation_name,

52

r.r_name as region_name,

53

n.n_comment

54

FROM db1.nation n

55

JOIN db2.region r ON n.n_regionkey = r.r_regionkey

56

"""

57

58

df = cx.read_sql(connections, federated_query)

59

```

60

61

### Multi-Database Analytics

62

63

```python

64

# Connect to different database types

65

connections = {

66

"sales_db": "postgresql://user:pass@sales-server:5432/sales",

67

"customer_db": "mysql://user:pass@customer-server:3306/customers",

68

"inventory_db": "mssql://user:pass@inventory-server:1433/inventory"

69

}

70

71

# Complex federated analytics query

72

analytics_query = """

73

SELECT

74

c.customer_name,

75

c.customer_segment,

76

s.order_date,

77

s.total_amount,

78

i.product_name,

79

i.category

80

FROM customer_db.customers c

81

JOIN sales_db.orders s ON c.customer_id = s.customer_id

82

JOIN inventory_db.products i ON s.product_id = i.product_id

83

WHERE s.order_date >= '2023-01-01'

84

AND c.customer_segment = 'Enterprise'

85

"""

86

87

result_df = cx.read_sql(connections, analytics_query)

88

```

89

90

### Using ConnectionUrl Objects

91

92

```python

93

from connectorx import ConnectionUrl

94

95

# Build connections with ConnectionUrl for type safety

96

db1_conn = ConnectionUrl(

97

backend="postgresql",

98

username="analytics_user",

99

password="secure_pass",

100

server="analytics.company.com",

101

port=5432,

102

database="warehouse"

103

)

104

105

db2_conn = ConnectionUrl(

106

backend="mysql",

107

username="reporting_user",

108

password="report_pass",

109

server="mysql.company.com",

110

port=3306,

111

database="reporting"

112

)

113

114

connections = {

115

"warehouse": db1_conn,

116

"reports": db2_conn

117

}

118

119

query = """

120

SELECT

121

w.fact_id,

122

w.metric_value,

123

r.report_name,

124

r.created_date

125

FROM warehouse.fact_table w

126

JOIN reports.report_metadata r ON w.report_id = r.report_id

127

"""

128

129

df = cx.read_sql(connections, query)

130

```

131

132

## Query Syntax

133

134

### Database Alias Referencing

135

136

Reference tables using the format: `database_alias.table_name`

137

138

```sql

139

-- Correct federated query syntax

140

SELECT t1.col1, t2.col2

141

FROM db1.table1 t1

142

JOIN db2.table2 t2 ON t1.id = t2.foreign_id

143

144

-- Standard table aliases still work

145

SELECT n.nation_name, r.region_name

146

FROM db1.nation n

147

JOIN db2.region r ON n.regionkey = r.regionkey

148

```

149

150

### Supported SQL Features

151

152

**Supported Operations:**

153

- Inner and outer joins across databases

154

- WHERE clauses with cross-database conditions

155

- Aggregations (COUNT, SUM, AVG, etc.)

156

- ORDER BY and GROUP BY clauses

157

- Subqueries within individual databases

158

159

**Current Limitations:**

160

- Subqueries spanning multiple databases

161

- Complex window functions across databases

162

- Database-specific functions may not work across all sources

163

- Transaction semantics are not guaranteed across databases

164

165

## Query Optimization

166

167

### Join Pushdown Strategy

168

169

ConnectorX automatically optimizes federated queries by pushing joins down to individual databases when possible.

170

171

```python

172

# Specify join pushdown strategy

173

df = cx.read_sql(

174

connections,

175

federated_query,

176

strategy="push_down_joins" # or other optimization strategies

177

)

178

```

179

180

### Automatic Optimization

181

182

By default, ConnectorX:

183

184

1. **Analyzes the query** to identify join patterns

185

2. **Pushes filtering** to individual databases to reduce data transfer

186

3. **Optimizes join order** based on estimated result sizes

187

4. **Minimizes network traffic** between databases

188

189

### Performance Considerations

190

191

**Best Practices:**

192

- Apply filters early to reduce data transfer

193

- Use selective joins rather than cross-products

194

- Consider data locality when designing federated queries

195

- Index join columns in source databases

196

197

**Example Optimized Query:**

198

```sql

199

-- Good: Filters applied before join

200

SELECT c.name, o.total

201

FROM customer_db.customers c

202

JOIN sales_db.orders o ON c.id = o.customer_id

203

WHERE c.active = true -- Filter applied at source

204

AND o.order_date >= '2023-01-01' -- Filter applied at source

205

206

-- Less optimal: Post-join filtering

207

SELECT c.name, o.total

208

FROM customer_db.customers c

209

JOIN sales_db.orders o ON c.id = o.customer_id

210

WHERE c.active = true AND o.order_date >= '2023-01-01'

211

```

212

213

## Return Type Support

214

215

Federated queries support multiple output formats:

216

217

```python

218

# pandas DataFrame (default)

219

df = cx.read_sql(connections, query)

220

221

# PyArrow Table

222

arrow_table = cx.read_sql(connections, query, return_type="arrow")

223

224

# Polars DataFrame

225

polars_df = cx.read_sql(connections, query, return_type="polars")

226

```

227

228

## Error Handling

229

230

Common federated query errors:

231

232

- **Connection errors**: One or more databases unavailable

233

- **Permission errors**: Insufficient access to referenced tables

234

- **Schema mismatches**: Join columns with incompatible types

235

- **Query complexity**: Unsupported cross-database operations

236

237

```python

238

try:

239

df = cx.read_sql(connections, federated_query)

240

except Exception as e:

241

if "connection" in str(e).lower():

242

print("Check database connectivity")

243

elif "permission" in str(e).lower():

244

print("Verify database permissions")

245

elif "column" in str(e).lower():

246

print("Check column names and types")

247

else:

248

print(f"Federated query failed: {e}")

249

```

250

251

## Configuration Requirements

252

253

### Database Setup

254

255

For optimal federated query performance:

256

257

1. **Network connectivity**: Ensure databases can be reached from ConnectorX

258

2. **User permissions**: Grant SELECT access to required tables

259

3. **Index optimization**: Index join columns and filter columns

260

4. **Connection pooling**: Configure appropriate connection limits

261

262

### JAR Dependencies

263

264

ConnectorX uses a federated query rewriter JAR file:

265

- Automatically configured in most installations

266

- Located at `CX_REWRITER_PATH` environment variable

267

- Required for advanced join pushdown optimizations

268

269

## Limitations

270

271

### Current Restrictions

272

273

- **No partitioning support**: Federated queries cannot use `partition_on`

274

- **No protocol specification**: Protocol is automatically selected

275

- **Limited to SPJA queries**: Complex operations may not be supported

276

- **No transaction guarantees**: Each database operates independently

277

278

### Workarounds

279

280

For unsupported features:

281

- Use separate queries and join results in Python

282

- Materialize intermediate results in staging tables

283

- Consider data warehouse solutions for complex analytics

284

285

```python

286

# Workaround: Manual federation for complex cases

287

db1_data = cx.read_sql(conn1, "SELECT * FROM table1 WHERE condition")

288

db2_data = cx.read_sql(conn2, "SELECT * FROM table2 WHERE condition")

289

290

# Join in pandas

291

import pandas as pd

292

result = pd.merge(db1_data, db2_data, on='common_column')

293

```