0
# User-Defined Functions
1
2
Python UDF (User Defined Function) support enabling custom Python functions to be used within SQL queries. UDFs are registered using decorators and executed as external processes with automatic type handling and configuration management.
3
4
## Capabilities
5
6
### UDF Decorator
7
8
Register Python functions as SQL UDFs with automatic configuration generation.
9
10
```python { .api }
11
def chdb_udf(return_type: str = "String"):
12
"""
13
Decorator for registering Python functions as chDB UDFs.
14
15
Parameters:
16
- return_type: SQL return type ("String", "Int32", "Float64", etc.)
17
Must be valid ClickHouse data type
18
19
Returns:
20
Function decorator that registers the function as UDF
21
22
Notes:
23
- Functions must be stateless (no UDAFs supported)
24
- All input parameters treated as strings (TabSeparated format)
25
- Function called once per input row
26
- Must import all required modules within function
27
- Uses same Python interpreter as calling script
28
"""
29
```
30
31
### UDF Generation Function
32
33
Programmatically generate UDF configurations and scripts.
34
35
```python { .api }
36
def generate_udf(func_name: str, args: list, return_type: str, udf_body: str):
37
"""
38
Generate UDF configuration and Python script files.
39
40
Parameters:
41
- func_name: Name of the UDF function
42
- args: List of argument names
43
- return_type: SQL return type string
44
- udf_body: Python function implementation code
45
46
Side Effects:
47
- Creates {func_name}.py executable script in UDF path
48
- Updates/creates udf_config.xml with function registration
49
- Sets up automatic cleanup on process exit
50
"""
51
```
52
53
## Usage Examples
54
55
### Basic UDF Definition
56
57
```python
58
from chdb.udf import chdb_udf
59
from chdb import query
60
61
# Simple arithmetic UDF
62
@chdb_udf()
63
def add_numbers(a, b):
64
return str(int(a) + int(b))
65
66
# Use in SQL query
67
result = query("SELECT add_numbers('10', '20') as sum_result")
68
print(result) # Returns: 30
69
```
70
71
### UDF with Different Return Types
72
73
```python
74
from chdb.udf import chdb_udf
75
from chdb import query
76
77
# UDF returning integer
78
@chdb_udf(return_type="Int32")
79
def multiply_int(x, y):
80
return int(x) * int(y)
81
82
# UDF returning float
83
@chdb_udf(return_type="Float64")
84
def calculate_average(a, b, c):
85
values = [float(a), float(b), float(c)]
86
return sum(values) / len(values)
87
88
# UDF returning string (default)
89
@chdb_udf()
90
def format_name(first, last):
91
return f"{first.title()} {last.upper()}"
92
93
# Use all UDFs in queries
94
result1 = query("SELECT multiply_int('5', '7') as product")
95
result2 = query("SELECT calculate_average('10.5', '20.3', '15.7') as avg")
96
result3 = query("SELECT format_name('john', 'doe') as formatted_name")
97
98
print("Product:", result1)
99
print("Average:", result2)
100
print("Formatted:", result3)
101
```
102
103
### UDF with Complex Logic
104
105
```python
106
from chdb.udf import chdb_udf
107
from chdb import query
108
109
@chdb_udf()
110
def validate_email(email):
111
import re
112
113
# Email validation pattern
114
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
115
116
if re.match(pattern, email):
117
return "valid"
118
else:
119
return "invalid"
120
121
@chdb_udf()
122
def extract_domain(email):
123
import re
124
125
match = re.search(r'@([a-zA-Z0-9.-]+)', email)
126
if match:
127
return match.group(1)
128
else:
129
return "unknown"
130
131
# Use UDFs with data processing
132
result = query("""
133
SELECT
134
email,
135
validate_email(email) as is_valid,
136
extract_domain(email) as domain
137
FROM (
138
SELECT 'user@example.com' as email
139
UNION ALL
140
SELECT 'invalid.email'
141
UNION ALL
142
SELECT 'admin@company.org'
143
)
144
""", "Pretty")
145
146
print(result)
147
```
148
149
### UDF with JSON Processing
150
151
```python
152
from chdb.udf import chdb_udf
153
from chdb import query
154
155
@chdb_udf()
156
def parse_json_field(json_str, field_name):
157
import json
158
159
try:
160
data = json.loads(json_str)
161
return str(data.get(field_name, ""))
162
except:
163
return "error"
164
165
@chdb_udf(return_type="Int32")
166
def count_json_keys(json_str):
167
import json
168
169
try:
170
data = json.loads(json_str)
171
if isinstance(data, dict):
172
return len(data)
173
else:
174
return 0
175
except:
176
return -1
177
178
# Query with JSON UDFs
179
result = query("""
180
SELECT
181
json_data,
182
parse_json_field(json_data, 'name') as name,
183
parse_json_field(json_data, 'age') as age,
184
count_json_keys(json_data) as key_count
185
FROM (
186
SELECT '{"name": "Alice", "age": 30, "city": "NYC"}' as json_data
187
UNION ALL
188
SELECT '{"name": "Bob", "age": 25}'
189
UNION ALL
190
SELECT 'invalid json'
191
)
192
""", "Pretty")
193
194
print(result)
195
```
196
197
### UDF with Data Transformations
198
199
```python
200
from chdb.udf import chdb_udf
201
from chdb import query
202
203
@chdb_udf()
204
def normalize_phone(phone):
205
import re
206
207
# Remove all non-digits
208
digits = re.sub(r'\D', '', phone)
209
210
# Format as (XXX) XXX-XXXX if US number
211
if len(digits) == 10:
212
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
213
elif len(digits) == 11 and digits[0] == '1':
214
return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
215
else:
216
return "invalid"
217
218
@chdb_udf()
219
def categorize_age(age_str):
220
try:
221
age = int(age_str)
222
if age < 18:
223
return "minor"
224
elif age < 65:
225
return "adult"
226
else:
227
return "senior"
228
except:
229
return "unknown"
230
231
# Apply transformations to dataset
232
result = query("""
233
SELECT
234
name,
235
normalize_phone(phone) as formatted_phone,
236
categorize_age(age) as age_category
237
FROM (
238
SELECT 'Alice' as name, '555-123-4567' as phone, '32' as age
239
UNION ALL
240
SELECT 'Bob', '(555) 987 6543', '17'
241
UNION ALL
242
SELECT 'Charlie', '15559876543', '67'
243
)
244
""", "Pretty")
245
246
print(result)
247
```
248
249
### UDF with External Libraries
250
251
```python
252
from chdb.udf import chdb_udf
253
from chdb import query
254
255
@chdb_udf(return_type="Float64")
256
def calculate_distance(lat1, lon1, lat2, lon2):
257
import math
258
259
# Convert to radians
260
lat1, lon1, lat2, lon2 = map(math.radians, [float(lat1), float(lon1), float(lat2), float(lon2)])
261
262
# Haversine formula
263
dlat = lat2 - lat1
264
dlon = lon2 - lon1
265
a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
266
c = 2 * math.asin(math.sqrt(a))
267
268
# Earth radius in kilometers
269
r = 6371
270
271
return c * r
272
273
@chdb_udf()
274
def hash_string(input_str):
275
import hashlib
276
277
return hashlib.md5(input_str.encode()).hexdigest()
278
279
# Use UDFs with geographic and crypto functions
280
result = query("""
281
SELECT
282
city1,
283
city2,
284
calculate_distance(lat1, lon1, lat2, lon2) as distance_km,
285
hash_string(CONCAT(city1, '-', city2)) as route_hash
286
FROM (
287
SELECT
288
'New York' as city1, '40.7128' as lat1, '-74.0060' as lon1,
289
'Los Angeles' as city2, '34.0522' as lat2, '-118.2437' as lon2
290
UNION ALL
291
SELECT
292
'Chicago', '41.8781', '-87.6298',
293
'Houston', '29.7604', '-95.3698'
294
)
295
""", "Pretty")
296
297
print(result)
298
```
299
300
### Error Handling with UDFs
301
302
```python
303
from chdb.udf import chdb_udf
304
from chdb import query, ChdbError
305
306
@chdb_udf()
307
def safe_divide(a, b):
308
try:
309
num = float(a)
310
den = float(b)
311
312
if den == 0:
313
return "division_by_zero"
314
315
return str(num / den)
316
except ValueError:
317
return "invalid_input"
318
except Exception:
319
return "error"
320
321
try:
322
result = query("""
323
SELECT
324
a, b,
325
safe_divide(a, b) as result
326
FROM (
327
SELECT '10' as a, '2' as b
328
UNION ALL
329
SELECT '15', '0'
330
UNION ALL
331
SELECT 'abc', '5'
332
)
333
""", "Pretty")
334
335
print(result)
336
337
except ChdbError as e:
338
print(f"Query with UDF failed: {e}")
339
```
340
341
### UDF Path Management
342
343
```python
344
from chdb.udf import chdb_udf
345
from chdb import query
346
import tempfile
347
import os
348
349
# Custom UDF path
350
custom_udf_path = tempfile.mkdtemp()
351
352
@chdb_udf()
353
def custom_function(x):
354
return f"processed_{x}"
355
356
# Use UDF with custom path
357
result = query(
358
"SELECT custom_function('test') as output",
359
udf_path=custom_udf_path
360
)
361
362
print(result)
363
364
# Cleanup custom UDF path when done
365
import shutil
366
shutil.rmtree(custom_udf_path)
367
```
368
369
## Important UDF Guidelines
370
371
### Function Requirements
372
- **Stateless**: Functions must be stateless (no global state)
373
- **String inputs**: All parameters received as strings
374
- **Self-contained**: Import all required modules within the function
375
- **Error handling**: Handle exceptions gracefully to avoid query failures
376
377
### Performance Considerations
378
- **Process overhead**: Each UDF call spawns a Python process
379
- **Serialization**: Data converted between TabSeparated format
380
- **Best for**: Complex logic that can't be expressed in SQL
381
- **Avoid for**: Simple operations that SQL can handle efficiently
382
383
### Type System
384
- **Input types**: Always strings (converted from TabSeparated)
385
- **Return types**: Must match declared return_type
386
- **Valid return types**: Any ClickHouse data type
387
- **Type conversion**: Automatic conversion between Python and SQL types