
# ConnectorX

ConnectorX is a high-performance data loading library that enables users to efficiently transfer data from various databases directly into Python dataframes. Built in Rust with Python bindings, it follows a zero-copy principle to achieve significant performance improvements - up to 21x faster than traditional solutions while using 3x less memory. The library supports multiple database sources and dataframe formats with features like automatic parallelization through partition-based loading and federated queries across multiple databases.

## Package Information

- **Package Name**: connectorx
- **Language**: Python
- **Installation**: `pip install connectorx`

## Core Imports

```python
import connectorx as cx
```

Common individual imports:

```python
from connectorx import read_sql, get_meta, partition_sql, ConnectionUrl
```

All available imports:

```python
from connectorx import (
    read_sql,
    read_sql_pandas,
    get_meta,
    partition_sql,
    ConnectionUrl,
    remove_ending_semicolon,
    try_import_module,
    Protocol,
    __version__,
)
```

## Basic Usage

```python
import connectorx as cx

# Basic data loading from PostgreSQL
postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"
df = cx.read_sql(postgres_url, query)

# Parallel loading with partitioning
df_parallel = cx.read_sql(
    postgres_url,
    query,
    partition_on="l_orderkey",
    partition_num=10,
)

# Load to different dataframe formats
arrow_table = cx.read_sql(postgres_url, query, return_type="arrow")
polars_df = cx.read_sql(postgres_url, query, return_type="polars")
```

## Architecture

ConnectorX follows a zero-copy architecture built on Rust's performance characteristics:

- **Rust Core**: High-performance data transfer engine with zero-copy principles
- **Python Bindings**: Seamless integration with Python dataframe libraries
- **Parallel Processing**: Automatic query partitioning for concurrent data loading
- **Protocol Optimization**: Backend-specific protocol selection for optimal performance
- **Memory Efficiency**: Direct data transfer from database to target dataframe format
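
Protocol selection and partitioning are controlled directly through `read_sql` parameters. A minimal sketch of overriding the wire protocol and splitting a scan (connection string, table, and column are placeholders):

```python
import connectorx as cx

# Placeholder credentials; substitute a real PostgreSQL URL.
postgres_url = "postgresql://username:password@server:port/database"

# Request the binary wire protocol and split the scan into 4 partitions;
# each partition is fetched concurrently and written straight into the
# destination dataframe buffers.
df = cx.read_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    protocol="binary",
    partition_on="l_orderkey",
    partition_num=4,
)
```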

## Capabilities

### Data Loading

Primary functionality for executing SQL queries and loading data into various dataframe formats. Supports single-threaded and parallel execution with automatic partitioning.

```python { .api }
def read_sql(
    conn: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    query: list[str] | str,
    *,
    return_type: Literal["pandas", "polars", "arrow", "modin", "dask", "arrow_stream"] = "pandas",
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    index_col: str | None = None,
    strategy: str | None = None,
    pre_execution_query: list[str] | str | None = None,
    batch_size: int = 10000,
    **kwargs
) -> pd.DataFrame | mpd.DataFrame | dd.DataFrame | pl.DataFrame | pa.Table | pa.RecordBatchReader

def read_sql_pandas(
    sql: list[str] | str,
    con: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    index_col: str | None = None,
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    pre_execution_queries: list[str] | str | None = None,
) -> pd.DataFrame
```
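
`read_sql_pandas` takes the query first and the connection second, matching the `pandas.read_sql` argument order. A minimal sketch (connection string and table are placeholders):

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Same loading engine as read_sql, but with the (sql, con) argument order
# used by pandas.read_sql; the result is indexed on the given column.
df = cx.read_sql_pandas(
    "SELECT l_orderkey, l_quantity FROM lineitem",
    postgres_url,
    index_col="l_orderkey",
)
```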

[Data Loading](./data-loading.md)

### Query Partitioning

Functionality for partitioning SQL queries to enable parallel data loading across multiple threads.

```python { .api }
def partition_sql(
    conn: str | ConnectionUrl,
    query: str,
    partition_on: str,
    partition_num: int,
    partition_range: tuple[int, int] | None = None,
) -> list[str]
```
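
`partition_sql` returns the per-partition queries that `read_sql` runs internally when given the same partitioning arguments. A sketch with a placeholder connection string, table, and range:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Split the scan into 4 range partitions on l_orderkey; each element of
# `queries` is a standalone SQL statement covering one slice of the range.
queries = cx.partition_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_num=4,
    partition_range=(1, 6000000),  # optional; determined automatically when omitted
)

# The list can be handed back to read_sql to load the partitions in parallel.
df = cx.read_sql(postgres_url, queries)
```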

[Query Partitioning](./query-partitioning.md)

### Metadata Retrieval

Retrieve schema information and metadata from SQL queries without loading the full dataset.

```python { .api }
def get_meta(
    conn: str | ConnectionUrl,
    query: str,
    protocol: Protocol | None = None,
) -> pd.DataFrame
```
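
A sketch of inspecting a query's schema before committing to a full load, assuming the returned frame exposes the result columns and their dtypes (connection string is a placeholder):

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Only schema information is fetched; no row data is transferred.
meta = cx.get_meta(postgres_url, "SELECT l_orderkey, l_extendedprice FROM lineitem")
print(meta.dtypes)  # column names and inferred pandas dtypes
```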

[Metadata Retrieval](./metadata-retrieval.md)

### Connection Management

Helper utilities for building and managing database connection strings across different database backends.

```python { .api }
class ConnectionUrl(Generic[_BackendT], str):
    # For SQLite databases
    def __new__(
        cls,
        *,
        backend: Literal["sqlite"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["sqlite"]]: ...

    # For BigQuery databases
    def __new__(
        cls,
        *,
        backend: Literal["bigquery"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["bigquery"]]: ...

    # For server-based databases
    def __new__(
        cls,
        *,
        backend: _ServerBackendT,
        username: str,
        password: str = "",
        server: str,
        port: int,
        database: str = "",
        database_options: dict[str, str] | None = None,
    ) -> ConnectionUrl[_ServerBackendT]: ...

    # For raw connection strings
    def __new__(
        cls,
        raw_connection: str,
    ) -> ConnectionUrl: ...
```
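
A sketch of building connection URLs for a server-based backend and for SQLite (credentials, host, and paths are placeholders):

```python
from connectorx import ConnectionUrl, read_sql

# Server-based backend: the keyword arguments are assembled into a
# "postgresql://user:password@host:port/db" style connection string.
pg_url = ConnectionUrl(
    backend="postgresql",
    username="username",      # placeholder credentials
    password="password",
    server="localhost",
    port=5432,
    database="tpch",
)

# File-based backend: only a path is needed.
sqlite_url = ConnectionUrl(backend="sqlite", db_path="/tmp/test.db")

# ConnectionUrl subclasses str, so it works anywhere a plain connection
# string is accepted.
df = read_sql(pg_url, "SELECT 1")
```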

[Connection Management](./connection-management.md)

### Federated Queries

Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting.

```python { .api }
def read_sql(
    conn: dict[str, str] | dict[str, ConnectionUrl],
    query: str,
    *,
    strategy: str | None = None,
    **kwargs
) -> pd.DataFrame | pl.DataFrame | pa.Table
```
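
A sketch of a federated query: connections are passed as a named mapping and tables are referenced as `name.table` inside one SQL statement (names, credentials, and tables below are placeholders):

```python
import connectorx as cx

# Placeholder connection strings for two independent databases.
db_map = {
    "db1": "postgresql://username:password@server1:5432/tpch",
    "db2": "postgresql://username:password@server2:5432/tpch",
}

# The join spans both sources; ConnectorX rewrites the query so each part
# executes on the database that owns the data.
df = cx.read_sql(
    db_map,
    "SELECT n.n_name, r.r_name "
    "FROM db1.nation n JOIN db2.region r ON n.n_regionkey = r.r_regionkey",
)
```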

[Federated Queries](./federated-queries.md)

### Utility Functions

Helper functions for SQL query processing and module management.

```python { .api }
def remove_ending_semicolon(query: str) -> str:
    """Remove trailing semicolon from SQL query if present."""

def try_import_module(name: str) -> Any:
    """Import a module with helpful error message if not found."""

def rewrite_conn(
    conn: str | ConnectionUrl,
    protocol: Protocol | None = None
) -> tuple[str, Protocol]:
    """Rewrite connection string for backend compatibility."""

__version__: str  # Package version string
```
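
A quick sketch of the helpers (the query and module name below are arbitrary examples):

```python
from connectorx import remove_ending_semicolon, try_import_module, __version__

# Normalize a query before partitioning or rewriting it.
clean = remove_ending_semicolon("SELECT * FROM lineitem;")  # -> "SELECT * FROM lineitem"

# Import an optional dependency, with a descriptive error if it is missing.
pa = try_import_module("pyarrow")

print(__version__)  # installed ConnectorX version
```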

## Types

```python { .api }
Protocol = Literal["csv", "binary", "cursor", "simple", "text"]

# Type variables for connection URL backends
_BackendT = TypeVar("_BackendT")
_ServerBackendT = TypeVar(
    "_ServerBackendT",
    bound=Literal[
        "redshift",
        "clickhouse",
        "postgres",
        "postgresql",
        "mysql",
        "mssql",
        "oracle",
        "duckdb",
    ],
)

# Internal types from Rust bindings
_DataframeInfos = dict[str, Any]  # Pandas DataFrame reconstruction info
_ArrowInfos = tuple[list[str], list[Any]]  # Arrow table reconstruction info

# Type checking imports (only available when TYPE_CHECKING is True)
if TYPE_CHECKING:
    import pandas as pd
    import polars as pl
    import modin.pandas as mpd
    import dask.dataframe as dd
    import pyarrow as pa
```

## Supported Databases

- **PostgreSQL** (`postgresql://`)
- **MySQL** (`mysql://`)
- **SQLite** (`sqlite://`)
- **Microsoft SQL Server** (`mssql://`)
- **Oracle** (`oracle://`)
- **BigQuery** (`bigquery://`)
- **Redshift** (`redshift://`)
- **ClickHouse** (`clickhouse://`)
- **Trino/Presto**

## Supported Dataframe Libraries

- **pandas** (default)
- **PyArrow** (arrow tables and record batch readers)
- **Polars**
- **Modin** (distributed pandas)
- **Dask** (distributed computing)
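
Each destination is selected through `read_sql`'s `return_type` parameter; the corresponding dataframe library must be installed. A sketch with a placeholder connection string:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder
query = "SELECT * FROM lineitem"

pandas_df = cx.read_sql(postgres_url, query)                        # default: pandas
modin_df = cx.read_sql(postgres_url, query, return_type="modin")
dask_df = cx.read_sql(postgres_url, query, return_type="dask")

# "arrow_stream" returns a pyarrow.RecordBatchReader for incremental consumption.
reader = cx.read_sql(postgres_url, query, return_type="arrow_stream", batch_size=10000)
for batch in reader:
    ...  # process each pyarrow.RecordBatch as it arrives
```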