or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

dataframe.mddbapi.mdindex.mdquery-functions.mdsessions.mdudf.mdutils.md

udf.mddocs/

0

# User-Defined Functions

1

2

Python UDF (User Defined Function) support enabling custom Python functions to be used within SQL queries. UDFs are registered using decorators and executed as external processes with automatic type handling and configuration management.

3

4

## Capabilities

5

6

### UDF Decorator

7

8

Register Python functions as SQL UDFs with automatic configuration generation.

9

10

```python { .api }

11

def chdb_udf(return_type: str = "String"):

12

"""

13

Decorator for registering Python functions as chDB UDFs.

14

15

Parameters:

16

- return_type: SQL return type ("String", "Int32", "Float64", etc.)

17

Must be valid ClickHouse data type

18

19

Returns:

20

Function decorator that registers the function as UDF

21

22

Notes:

23

- Functions must be stateless (no UDAFs supported)

24

- All input parameters treated as strings (TabSeparated format)

25

- Function called once per input row

26

- Must import all required modules within function

27

- Uses same Python interpreter as calling script

28

"""

29

```

30

31

### UDF Generation Function

32

33

Programmatically generate UDF configurations and scripts.

34

35

```python { .api }

36

def generate_udf(func_name: str, args: list, return_type: str, udf_body: str):

37

"""

38

Generate UDF configuration and Python script files.

39

40

Parameters:

41

- func_name: Name of the UDF function

42

- args: List of argument names

43

- return_type: SQL return type string

44

- udf_body: Python function implementation code

45

46

Side Effects:

47

- Creates {func_name}.py executable script in UDF path

48

- Updates/creates udf_config.xml with function registration

49

- Sets up automatic cleanup on process exit

50

"""

51

```

52

53

## Usage Examples

54

55

### Basic UDF Definition

56

57

```python

58

from chdb.udf import chdb_udf

59

from chdb import query

60

61

# Simple arithmetic UDF

62

@chdb_udf()

63

def add_numbers(a, b):

64

return str(int(a) + int(b))

65

66

# Use in SQL query

67

result = query("SELECT add_numbers('10', '20') as sum_result")

68

print(result) # Returns: 30

69

```

70

71

### UDF with Different Return Types

72

73

```python

74

from chdb.udf import chdb_udf

75

from chdb import query

76

77

# UDF returning integer

78

@chdb_udf(return_type="Int32")

79

def multiply_int(x, y):

80

return int(x) * int(y)

81

82

# UDF returning float

83

@chdb_udf(return_type="Float64")

84

def calculate_average(a, b, c):

85

values = [float(a), float(b), float(c)]

86

return sum(values) / len(values)

87

88

# UDF returning string (default)

89

@chdb_udf()

90

def format_name(first, last):

91

return f"{first.title()} {last.upper()}"

92

93

# Use all UDFs in queries

94

result1 = query("SELECT multiply_int('5', '7') as product")

95

result2 = query("SELECT calculate_average('10.5', '20.3', '15.7') as avg")

96

result3 = query("SELECT format_name('john', 'doe') as formatted_name")

97

98

print("Product:", result1)

99

print("Average:", result2)

100

print("Formatted:", result3)

101

```

102

103

### UDF with Complex Logic

104

105

```python

106

from chdb.udf import chdb_udf

107

from chdb import query

108

109

@chdb_udf()

110

def validate_email(email):

111

import re

112

113

# Email validation pattern

114

pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

115

116

if re.match(pattern, email):

117

return "valid"

118

else:

119

return "invalid"

120

121

@chdb_udf()

122

def extract_domain(email):

123

import re

124

125

match = re.search(r'@([a-zA-Z0-9.-]+)', email)

126

if match:

127

return match.group(1)

128

else:

129

return "unknown"

130

131

# Use UDFs with data processing

132

result = query("""

133

SELECT

134

email,

135

validate_email(email) as is_valid,

136

extract_domain(email) as domain

137

FROM (

138

SELECT 'user@example.com' as email

139

UNION ALL

140

SELECT 'invalid.email'

141

UNION ALL

142

SELECT 'admin@company.org'

143

)

144

""", "Pretty")

145

146

print(result)

147

```

148

149

### UDF with JSON Processing

150

151

```python

152

from chdb.udf import chdb_udf

153

from chdb import query

154

155

@chdb_udf()

156

def parse_json_field(json_str, field_name):

157

import json

158

159

try:

160

data = json.loads(json_str)

161

return str(data.get(field_name, ""))

162

except:

163

return "error"

164

165

@chdb_udf(return_type="Int32")

166

def count_json_keys(json_str):

167

import json

168

169

try:

170

data = json.loads(json_str)

171

if isinstance(data, dict):

172

return len(data)

173

else:

174

return 0

175

except:

176

return -1

177

178

# Query with JSON UDFs

179

result = query("""

180

SELECT

181

json_data,

182

parse_json_field(json_data, 'name') as name,

183

parse_json_field(json_data, 'age') as age,

184

count_json_keys(json_data) as key_count

185

FROM (

186

SELECT '{"name": "Alice", "age": 30, "city": "NYC"}' as json_data

187

UNION ALL

188

SELECT '{"name": "Bob", "age": 25}'

189

UNION ALL

190

SELECT 'invalid json'

191

)

192

""", "Pretty")

193

194

print(result)

195

```

196

197

### UDF with Data Transformations

198

199

```python

200

from chdb.udf import chdb_udf

201

from chdb import query

202

203

@chdb_udf()

204

def normalize_phone(phone):

205

import re

206

207

# Remove all non-digits

208

digits = re.sub(r'\D', '', phone)

209

210

# Format as (XXX) XXX-XXXX if US number

211

if len(digits) == 10:

212

return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

213

elif len(digits) == 11 and digits[0] == '1':

214

return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"

215

else:

216

return "invalid"

217

218

@chdb_udf()

219

def categorize_age(age_str):

220

try:

221

age = int(age_str)

222

if age < 18:

223

return "minor"

224

elif age < 65:

225

return "adult"

226

else:

227

return "senior"

228

except:

229

return "unknown"

230

231

# Apply transformations to dataset

232

result = query("""

233

SELECT

234

name,

235

normalize_phone(phone) as formatted_phone,

236

categorize_age(age) as age_category

237

FROM (

238

SELECT 'Alice' as name, '555-123-4567' as phone, '32' as age

239

UNION ALL

240

SELECT 'Bob', '(555) 987 6543', '17'

241

UNION ALL

242

SELECT 'Charlie', '15559876543', '67'

243

)

244

""", "Pretty")

245

246

print(result)

247

```

248

249

### UDF with External Libraries

250

251

```python

252

from chdb.udf import chdb_udf

253

from chdb import query

254

255

@chdb_udf(return_type="Float64")

256

def calculate_distance(lat1, lon1, lat2, lon2):

257

import math

258

259

# Convert to radians

260

lat1, lon1, lat2, lon2 = map(math.radians, [float(lat1), float(lon1), float(lat2), float(lon2)])

261

262

# Haversine formula

263

dlat = lat2 - lat1

264

dlon = lon2 - lon1

265

a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2

266

c = 2 * math.asin(math.sqrt(a))

267

268

# Earth radius in kilometers

269

r = 6371

270

271

return c * r

272

273

@chdb_udf()

274

def hash_string(input_str):

275

import hashlib

276

277

return hashlib.md5(input_str.encode()).hexdigest()

278

279

# Use UDFs with geographic and crypto functions

280

result = query("""

281

SELECT

282

city1,

283

city2,

284

calculate_distance(lat1, lon1, lat2, lon2) as distance_km,

285

hash_string(CONCAT(city1, '-', city2)) as route_hash

286

FROM (

287

SELECT

288

'New York' as city1, '40.7128' as lat1, '-74.0060' as lon1,

289

'Los Angeles' as city2, '34.0522' as lat2, '-118.2437' as lon2

290

UNION ALL

291

SELECT

292

'Chicago', '41.8781', '-87.6298',

293

'Houston', '29.7604', '-95.3698'

294

)

295

""", "Pretty")

296

297

print(result)

298

```

299

300

### Error Handling with UDFs

301

302

```python

303

from chdb.udf import chdb_udf

304

from chdb import query, ChdbError

305

306

@chdb_udf()

307

def safe_divide(a, b):

308

try:

309

num = float(a)

310

den = float(b)

311

312

if den == 0:

313

return "division_by_zero"

314

315

return str(num / den)

316

except ValueError:

317

return "invalid_input"

318

except Exception:

319

return "error"

320

321

try:

322

result = query("""

323

SELECT

324

a, b,

325

safe_divide(a, b) as result

326

FROM (

327

SELECT '10' as a, '2' as b

328

UNION ALL

329

SELECT '15', '0'

330

UNION ALL

331

SELECT 'abc', '5'

332

)

333

""", "Pretty")

334

335

print(result)

336

337

except ChdbError as e:

338

print(f"Query with UDF failed: {e}")

339

```

340

341

### UDF Path Management

342

343

```python

344

from chdb.udf import chdb_udf

345

from chdb import query

346

import tempfile

347

import os

348

349

# Custom UDF path

350

custom_udf_path = tempfile.mkdtemp()

351

352

@chdb_udf()

353

def custom_function(x):

354

return f"processed_{x}"

355

356

# Use UDF with custom path

357

result = query(

358

"SELECT custom_function('test') as output",

359

udf_path=custom_udf_path

360

)

361

362

print(result)

363

364

# Cleanup custom UDF path when done

365

import shutil

366

shutil.rmtree(custom_udf_path)

367

```

368

369

## Important UDF Guidelines

370

371

### Function Requirements

372

- **Stateless**: Functions must be stateless (no global state)

373

- **String inputs**: All parameters received as strings

374

- **Self-contained**: Import all required modules within the function

375

- **Error handling**: Handle exceptions gracefully to avoid query failures

376

377

### Performance Considerations

378

- **Process overhead**: Each UDF call spawns a Python process

379

- **Serialization**: Data converted between TabSeparated format

380

- **Best for**: Complex logic that can't be expressed in SQL

381

- **Avoid for**: Simple operations that SQL can handle efficiently

382

383

### Type System

384

- **Input types**: Always strings (converted from TabSeparated)

385

- **Return types**: Must match declared return_type

386

- **Valid return types**: Any ClickHouse data type

387

- **Type conversion**: Automatic conversion between Python and SQL types