Load data from databases to dataframes, the fastest way.

`npx @tessl/cli install tessl/pypi-connectorx@0.4.00`

# ConnectorX

ConnectorX is a high-performance data loading library that transfers data from a wide range of databases directly into Python dataframes. Built in Rust with Python bindings, it follows a zero-copy principle to achieve significant performance improvements: up to 21x faster than traditional solutions while using 3x less memory. The library supports multiple database sources and dataframe formats, and provides automatic parallelization through partition-based loading as well as federated queries across multiple databases.

## Package Information

- **Package Name**: connectorx
- **Language**: Python
- **Installation**: `pip install connectorx`

## Core Imports
```python
import connectorx as cx
```

Common individual imports:

```python
from connectorx import read_sql, get_meta, partition_sql, ConnectionUrl
```

All available imports:

```python
from connectorx import (
    read_sql,
    read_sql_pandas,
    get_meta,
    partition_sql,
    ConnectionUrl,
    remove_ending_semicolon,
    try_import_module,
    Protocol,
    __version__
)
```
## Basic Usage

```python
import connectorx as cx

# Basic data loading from PostgreSQL
postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"
df = cx.read_sql(postgres_url, query)

# Parallel loading with partitioning
df_parallel = cx.read_sql(
    postgres_url,
    query,
    partition_on="l_orderkey",
    partition_num=10
)

# Load to different dataframe formats
arrow_table = cx.read_sql(postgres_url, query, return_type="arrow")
polars_df = cx.read_sql(postgres_url, query, return_type="polars")
```
## Architecture

ConnectorX follows a zero-copy architecture built on Rust's performance characteristics:

- **Rust Core**: High-performance data transfer engine with zero-copy principles
- **Python Bindings**: Seamless integration with Python dataframe libraries
- **Parallel Processing**: Automatic query partitioning for concurrent data loading
- **Protocol Optimization**: Backend-specific protocol selection for optimal performance
- **Memory Efficiency**: Direct data transfer from database to target dataframe format

## Capabilities

### Data Loading

Primary functionality for executing SQL queries and loading data into various dataframe formats. Supports single-threaded and parallel execution with automatic partitioning.

```python { .api }
def read_sql(
    conn: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    query: list[str] | str,
    *,
    return_type: Literal["pandas", "polars", "arrow", "modin", "dask", "arrow_stream"] = "pandas",
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    index_col: str | None = None,
    strategy: str | None = None,
    pre_execution_query: list[str] | str | None = None,
    batch_size: int = 10000,
    **kwargs
) -> pd.DataFrame | mpd.DataFrame | dd.DataFrame | pl.DataFrame | pa.Table | pa.RecordBatchReader

def read_sql_pandas(
    sql: list[str] | str,
    con: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    index_col: str | None = None,
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    pre_execution_queries: list[str] | str | None = None,
) -> pd.DataFrame
```
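The snippet below is a minimal usage sketch based on the signatures above: it streams a large result as Arrow record batches via `return_type="arrow_stream"` and `batch_size`, then shows the pandas-style wrapper `read_sql_pandas`. The connection string and `lineitem` table are placeholders.

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Stream the result as Arrow record batches instead of materializing it all at once;
# per the signature above this returns a pyarrow.RecordBatchReader.
reader = cx.read_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    return_type="arrow_stream",
    batch_size=50000,
)
for batch in reader:  # each item is a pyarrow.RecordBatch
    print(batch.num_rows)

# read_sql_pandas mirrors the pandas.read_sql argument order and always returns a pandas DataFrame
df = cx.read_sql_pandas("SELECT * FROM lineitem", postgres_url)
```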
[Data Loading](./data-loading.md)

### Query Partitioning

Functionality for partitioning SQL queries to enable parallel data loading across multiple threads.

```python { .api }
def partition_sql(
    conn: str | ConnectionUrl,
    query: str,
    partition_on: str,
    partition_num: int,
    partition_range: tuple[int, int] | None = None,
) -> list[str]
```
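As a sketch of how partitioning is typically combined with loading (placeholder connection and table), `partition_sql` turns one query into range-partitioned queries that `read_sql` can then execute in parallel:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Split the query into 4 range partitions on the l_orderkey column
queries = cx.partition_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_num=4,
)
print(queries)  # four SQL strings, each covering a slice of l_orderkey

# read_sql accepts a list of queries and loads the partitions concurrently
df = cx.read_sql(postgres_url, queries)
```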
[Query Partitioning](./query-partitioning.md)

### Metadata Retrieval

Retrieve schema information and metadata from SQL queries without loading the full dataset.

```python { .api }
def get_meta(
    conn: str | ConnectionUrl,
    query: str,
    protocol: Protocol | None = None,
) -> pd.DataFrame
```
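A minimal sketch of schema inspection, assuming a placeholder `lineitem` table; the returned pandas DataFrame exposes the column names and dtypes of the query result:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Retrieve only the schema of the query result, not the full dataset
meta = cx.get_meta(postgres_url, "SELECT l_orderkey, l_quantity, l_shipdate FROM lineitem")
print(meta.dtypes)  # column names and inferred dtypes
```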
[Metadata Retrieval](./metadata-retrieval.md)

### Connection Management

Helper utilities for building and managing database connection strings across different database backends.

```python { .api }
class ConnectionUrl(Generic[_BackendT], str):
    # For SQLite databases
    def __new__(
        cls,
        *,
        backend: Literal["sqlite"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["sqlite"]]: ...

    # For BigQuery databases
    def __new__(
        cls,
        *,
        backend: Literal["bigquery"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["bigquery"]]: ...

    # For server-based databases
    def __new__(
        cls,
        *,
        backend: _ServerBackendT,
        username: str,
        password: str = "",
        server: str,
        port: int,
        database: str = "",
        database_options: dict[str, str] | None = None,
    ) -> ConnectionUrl[_ServerBackendT]: ...

    # For raw connection strings
    def __new__(
        cls,
        raw_connection: str,
    ) -> ConnectionUrl: ...
```
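A short sketch of building connection URLs with the constructors above (credentials, hosts, and paths are placeholders); because `ConnectionUrl` subclasses `str`, the result can be passed anywhere a connection string is accepted:

```python
import connectorx as cx
from connectorx import ConnectionUrl

# Server-based backend: assemble a PostgreSQL URL from its parts
pg_url = ConnectionUrl(
    backend="postgresql",
    username="username",
    password="password",
    server="server",
    port=5432,
    database="database",
)

# File-based backend: point at a SQLite database file
sqlite_url = ConnectionUrl(backend="sqlite", db_path="/path/to/db.sqlite")

df = cx.read_sql(pg_url, "SELECT * FROM lineitem")
```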
[Connection Management](./connection-management.md)

### Federated Queries

Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting.

```python { .api }
def read_sql(
    conn: dict[str, str] | dict[str, ConnectionUrl],
    query: str,
    *,
    strategy: str | None = None,
    **kwargs
) -> pd.DataFrame | pl.DataFrame | pa.Table
```
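A sketch of the dict-of-connections form above: each key names a database, and the query refers to tables through those names. The connection strings, aliases, and tables are placeholders.

```python
import connectorx as cx

# Map database names to connection strings
db_map = {
    "db1": "postgresql://username:password@server1:5432/tpch",
    "db2": "postgresql://username:password@server2:5432/tpch",
}

# Join tables that live in two different databases in a single statement
df = cx.read_sql(
    db_map,
    "SELECT n.n_name, r.r_name "
    "FROM db1.nation AS n JOIN db2.region AS r ON n.n_regionkey = r.r_regionkey",
)
```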
[Federated Queries](./federated-queries.md)

### Utility Functions

Helper functions for SQL query processing and module management.

```python { .api }
def remove_ending_semicolon(query: str) -> str:
    """Remove trailing semicolon from SQL query if present."""

def try_import_module(name: str) -> Any:
    """Import a module with helpful error message if not found."""

def rewrite_conn(
    conn: str | ConnectionUrl,
    protocol: Protocol | None = None
) -> tuple[str, Protocol]:
    """Rewrite connection string for backend compatibility."""

__version__: str  # Package version string
```
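A brief sketch of the helpers above; the exact message raised by `try_import_module` for a missing module is not shown here.

```python
from connectorx import __version__, remove_ending_semicolon, try_import_module

print(__version__)  # installed connectorx version

# Normalize a query before further processing
print(remove_ending_semicolon("SELECT * FROM lineitem;"))  # "SELECT * FROM lineitem"

# Import an optional dependency, failing with a helpful error if it is absent
pd = try_import_module("pandas")
```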
## Types

```python { .api }
Protocol = Literal["csv", "binary", "cursor", "simple", "text"]

# Type variables for connection URL backends
_BackendT = TypeVar("_BackendT")
_ServerBackendT = TypeVar(
    "_ServerBackendT",
    bound=Literal[
        "redshift",
        "clickhouse",
        "postgres",
        "postgresql",
        "mysql",
        "mssql",
        "oracle",
        "duckdb",
    ],
)

# Internal types from Rust bindings
_DataframeInfos = dict[str, Any]  # Pandas DataFrame reconstruction info
_ArrowInfos = tuple[list[str], list[Any]]  # Arrow table reconstruction info

# Type checking imports (only available when TYPE_CHECKING is True)
if TYPE_CHECKING:
    import pandas as pd
    import polars as pl
    import modin.pandas as mpd
    import dask.dataframe as dd
    import pyarrow as pa
```
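As a sketch of how the `Protocol` literals are used (placeholder connection and table), the `protocol` argument of `read_sql` overrides the backend-specific default:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"

# Force the binary wire protocol instead of relying on the default chosen for the backend
df = cx.read_sql(postgres_url, "SELECT * FROM lineitem", protocol="binary")
```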
## Supported Databases

- **PostgreSQL** (`postgresql://`)
- **MySQL** (`mysql://`)
- **SQLite** (`sqlite://`)
- **Microsoft SQL Server** (`mssql://`)
- **Oracle** (`oracle://`)
- **BigQuery** (`bigquery://`)
- **Redshift** (`redshift://`)
- **ClickHouse** (`clickhouse://`)
- **Trino/Presto**

## Supported Dataframe Libraries

- **pandas** (default)
- **PyArrow** (arrow tables and record batch readers)
- **Polars**
- **Modin** (distributed pandas)
- **Dask** (distributed computing)