
# ConnectorX

ConnectorX is a high-performance data loading library that enables users to efficiently transfer data from various databases directly into Python dataframes. Built in Rust with Python bindings, it follows a zero-copy principle to achieve significant performance improvements - up to 21x faster than traditional solutions while using 3x less memory. The library supports multiple database sources and dataframe formats with features like automatic parallelization through partition-based loading and federated queries across multiple databases.

## Package Information

- **Package Name**: connectorx
- **Language**: Python
- **Installation**: `pip install connectorx`

## Core Imports

```python
import connectorx as cx
```

Common individual imports:

```python
from connectorx import read_sql, get_meta, partition_sql, ConnectionUrl
```

All available imports:

```python
from connectorx import (
    read_sql,
    read_sql_pandas,
    get_meta,
    partition_sql,
    ConnectionUrl,
    remove_ending_semicolon,
    try_import_module,
    Protocol,
    __version__,
)
```

## Basic Usage

```python
import connectorx as cx

# Basic data loading from PostgreSQL
postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"
df = cx.read_sql(postgres_url, query)

# Parallel loading with partitioning
df_parallel = cx.read_sql(
    postgres_url,
    query,
    partition_on="l_orderkey",
    partition_num=10,
)

# Load to different dataframe formats
arrow_table = cx.read_sql(postgres_url, query, return_type="arrow")
polars_df = cx.read_sql(postgres_url, query, return_type="polars")
```

## Architecture

ConnectorX follows a zero-copy architecture built on Rust's performance characteristics:

- **Rust Core**: High-performance data transfer engine with zero-copy principles
- **Python Bindings**: Seamless integration with Python dataframe libraries
- **Parallel Processing**: Automatic query partitioning for concurrent data loading
- **Protocol Optimization**: Backend-specific protocol selection for optimal performance
- **Memory Efficiency**: Direct data transfer from database to target dataframe format
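
Protocol selection and partitioning are controlled directly through `read_sql` parameters. A minimal sketch of overriding the wire protocol and splitting a scan (connection string, table, and column are placeholders):

```python
import connectorx as cx

# Placeholder credentials; substitute a real PostgreSQL URL.
postgres_url = "postgresql://username:password@server:port/database"

# Request the binary wire protocol and split the scan into 4 partitions;
# each partition is fetched concurrently and written straight into the
# destination dataframe buffers.
df = cx.read_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    protocol="binary",
    partition_on="l_orderkey",
    partition_num=4,
)
```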

## Capabilities

### Data Loading

Primary functionality for executing SQL queries and loading data into various dataframe formats. Supports single-threaded and parallel execution with automatic partitioning.

```python { .api }
def read_sql(
    conn: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    query: list[str] | str,
    *,
    return_type: Literal["pandas", "polars", "arrow", "modin", "dask", "arrow_stream"] = "pandas",
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    index_col: str | None = None,
    strategy: str | None = None,
    pre_execution_query: list[str] | str | None = None,
    batch_size: int = 10000,
    **kwargs
) -> pd.DataFrame | mpd.DataFrame | dd.DataFrame | pl.DataFrame | pa.Table | pa.RecordBatchReader

def read_sql_pandas(
    sql: list[str] | str,
    con: str | ConnectionUrl | dict[str, str] | dict[str, ConnectionUrl],
    index_col: str | None = None,
    protocol: Protocol | None = None,
    partition_on: str | None = None,
    partition_range: tuple[int, int] | None = None,
    partition_num: int | None = None,
    pre_execution_queries: list[str] | str | None = None,
) -> pd.DataFrame
```
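
`read_sql_pandas` takes the query first and the connection second, matching the `pandas.read_sql` argument order. A minimal sketch (connection string and table are placeholders):

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Same loading engine as read_sql, but with the (sql, con) argument order
# used by pandas.read_sql; the result is indexed on the given column.
df = cx.read_sql_pandas(
    "SELECT l_orderkey, l_quantity FROM lineitem",
    postgres_url,
    index_col="l_orderkey",
)
```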

[Data Loading](./data-loading.md)

### Query Partitioning

Functionality for partitioning SQL queries to enable parallel data loading across multiple threads.

```python { .api }
def partition_sql(
    conn: str | ConnectionUrl,
    query: str,
    partition_on: str,
    partition_num: int,
    partition_range: tuple[int, int] | None = None,
) -> list[str]
```
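
`partition_sql` returns the per-partition queries that `read_sql` runs internally when given the same partitioning arguments. A sketch with a placeholder connection string, table, and range:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Split the scan into 4 range partitions on l_orderkey; each element of
# `queries` is a standalone SQL statement covering one slice of the range.
queries = cx.partition_sql(
    postgres_url,
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_num=4,
    partition_range=(1, 6000000),  # optional; determined automatically when omitted
)

# The list can be handed back to read_sql to load the partitions in parallel.
df = cx.read_sql(postgres_url, queries)
```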

[Query Partitioning](./query-partitioning.md)

### Metadata Retrieval

Retrieve schema information and metadata from SQL queries without loading the full dataset.

```python { .api }
def get_meta(
    conn: str | ConnectionUrl,
    query: str,
    protocol: Protocol | None = None,
) -> pd.DataFrame
```
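
A sketch of inspecting a query's schema before committing to a full load, assuming the returned frame exposes the result columns and their dtypes (connection string is a placeholder):

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder

# Only schema information is fetched; no row data is transferred.
meta = cx.get_meta(postgres_url, "SELECT l_orderkey, l_extendedprice FROM lineitem")
print(meta.dtypes)  # column names and inferred pandas dtypes
```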

[Metadata Retrieval](./metadata-retrieval.md)

### Connection Management

Helper utilities for building and managing database connection strings across different database backends.

```python { .api }
class ConnectionUrl(Generic[_BackendT], str):
    # For SQLite databases
    def __new__(
        cls,
        *,
        backend: Literal["sqlite"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["sqlite"]]: ...

    # For BigQuery databases
    def __new__(
        cls,
        *,
        backend: Literal["bigquery"],
        db_path: str | Path,
    ) -> ConnectionUrl[Literal["bigquery"]]: ...

    # For server-based databases
    def __new__(
        cls,
        *,
        backend: _ServerBackendT,
        username: str,
        password: str = "",
        server: str,
        port: int,
        database: str = "",
        database_options: dict[str, str] | None = None,
    ) -> ConnectionUrl[_ServerBackendT]: ...

    # For raw connection strings
    def __new__(
        cls,
        raw_connection: str,
    ) -> ConnectionUrl: ...
```
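
A sketch of building connection URLs for a server-based backend and for SQLite (credentials, host, and paths are placeholders):

```python
from connectorx import ConnectionUrl, read_sql

# Server-based backend: the keyword arguments are assembled into a
# "postgresql://user:password@host:port/db" style connection string.
pg_url = ConnectionUrl(
    backend="postgresql",
    username="username",      # placeholder credentials
    password="password",
    server="localhost",
    port=5432,
    database="tpch",
)

# File-based backend: only a path is needed.
sqlite_url = ConnectionUrl(backend="sqlite", db_path="/tmp/test.db")

# ConnectionUrl subclasses str, so it works anywhere a plain connection
# string is accepted.
df = read_sql(pg_url, "SELECT 1")
```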

[Connection Management](./connection-management.md)

### Federated Queries

Execute queries across multiple databases in a single statement, with automatic join optimization and query rewriting.

```python { .api }
def read_sql(
    conn: dict[str, str] | dict[str, ConnectionUrl],
    query: str,
    *,
    strategy: str | None = None,
    **kwargs
) -> pd.DataFrame | pl.DataFrame | pa.Table
```
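
A sketch of a federated query: connections are passed as a named mapping and tables are referenced as `name.table` inside one SQL statement (names, credentials, and tables below are placeholders):

```python
import connectorx as cx

# Placeholder connection strings for two independent databases.
db_map = {
    "db1": "postgresql://username:password@server1:5432/tpch",
    "db2": "postgresql://username:password@server2:5432/tpch",
}

# The join spans both sources; ConnectorX rewrites the query so each part
# executes on the database that owns the data.
df = cx.read_sql(
    db_map,
    "SELECT n.n_name, r.r_name "
    "FROM db1.nation n JOIN db2.region r ON n.n_regionkey = r.r_regionkey",
)
```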

[Federated Queries](./federated-queries.md)

### Utility Functions

Helper functions for SQL query processing and module management.

```python { .api }
def remove_ending_semicolon(query: str) -> str:
    """Remove trailing semicolon from SQL query if present."""

def try_import_module(name: str) -> Any:
    """Import a module with helpful error message if not found."""

def rewrite_conn(
    conn: str | ConnectionUrl,
    protocol: Protocol | None = None
) -> tuple[str, Protocol]:
    """Rewrite connection string for backend compatibility."""

__version__: str  # Package version string
```
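
A quick sketch of the helpers (the query and module name below are arbitrary examples):

```python
from connectorx import remove_ending_semicolon, try_import_module, __version__

# Normalize a query before partitioning or rewriting it.
clean = remove_ending_semicolon("SELECT * FROM lineitem;")  # -> "SELECT * FROM lineitem"

# Import an optional dependency, with a descriptive error if it is missing.
pa = try_import_module("pyarrow")

print(__version__)  # installed ConnectorX version
```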

## Types

```python { .api }
Protocol = Literal["csv", "binary", "cursor", "simple", "text"]

# Type variables for connection URL backends
_BackendT = TypeVar("_BackendT")
_ServerBackendT = TypeVar(
    "_ServerBackendT",
    bound=Literal[
        "redshift",
        "clickhouse",
        "postgres",
        "postgresql",
        "mysql",
        "mssql",
        "oracle",
        "duckdb",
    ],
)

# Internal types from Rust bindings
_DataframeInfos = dict[str, Any]  # Pandas DataFrame reconstruction info
_ArrowInfos = tuple[list[str], list[Any]]  # Arrow table reconstruction info

# Type checking imports (only available when TYPE_CHECKING is True)
if TYPE_CHECKING:
    import pandas as pd
    import polars as pl
    import modin.pandas as mpd
    import dask.dataframe as dd
    import pyarrow as pa
```

## Supported Databases

- **PostgreSQL** (`postgresql://`)
- **MySQL** (`mysql://`)
- **SQLite** (`sqlite://`)
- **Microsoft SQL Server** (`mssql://`)
- **Oracle** (`oracle://`)
- **BigQuery** (`bigquery://`)
- **Redshift** (`redshift://`)
- **ClickHouse** (`clickhouse://`)
- **Trino/Presto**

## Supported Dataframe Libraries

- **pandas** (default)
- **PyArrow** (arrow tables and record batch readers)
- **Polars**
- **Modin** (distributed pandas)
- **Dask** (distributed computing)
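
Each destination is selected through `read_sql`'s `return_type` parameter; the corresponding dataframe library must be installed. A sketch with a placeholder connection string:

```python
import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"  # placeholder
query = "SELECT * FROM lineitem"

pandas_df = cx.read_sql(postgres_url, query)                        # default: pandas
modin_df = cx.read_sql(postgres_url, query, return_type="modin")
dask_df = cx.read_sql(postgres_url, query, return_type="dask")

# "arrow_stream" returns a pyarrow.RecordBatchReader for incremental consumption.
reader = cx.read_sql(postgres_url, query, return_type="arrow_stream", batch_size=10000)
for batch in reader:
    ...  # process each pyarrow.RecordBatch as it arrives
```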