Tessl Tile for pypi/chdb@2.1.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

dataframe.md dbapi.md index.md query-functions.md sessions.md udf.md utils.md

udf.mddocs/

0
# User-Defined Functions
1

2
Python UDF (User Defined Function) support enabling custom Python functions to be used within SQL queries. UDFs are registered using decorators and executed as external processes with automatic type handling and configuration management.
3

4
## Capabilities
5

6
### UDF Decorator
7

8
Register Python functions as SQL UDFs with automatic configuration generation.
9

10
```python { .api }
11
def chdb_udf(return_type: str = "String"):
12
    """
13
    Decorator for registering Python functions as chDB UDFs.
14
    
15
    Parameters:
16
    - return_type: SQL return type ("String", "Int32", "Float64", etc.)
17
                   Must be valid ClickHouse data type
18
    
19
    Returns:
20
    Function decorator that registers the function as UDF
21
    
22
    Notes:
23
    - Functions must be stateless (no UDAFs supported)
24
    - All input parameters treated as strings (TabSeparated format)
25
    - Function called once per input row
26
    - Must import all required modules within function
27
    - Uses same Python interpreter as calling script
28
    """
29
```
30

31
### UDF Generation Function
32

33
Programmatically generate UDF configurations and scripts.
34

35
```python { .api }
36
def generate_udf(func_name: str, args: list, return_type: str, udf_body: str):
37
    """
38
    Generate UDF configuration and Python script files.
39
    
40
    Parameters:
41
    - func_name: Name of the UDF function
42
    - args: List of argument names
43
    - return_type: SQL return type string
44
    - udf_body: Python function implementation code
45
    
46
    Side Effects:
47
    - Creates {func_name}.py executable script in UDF path
48
    - Updates/creates udf_config.xml with function registration
49
    - Sets up automatic cleanup on process exit
50
    """
51
```
52

53
## Usage Examples
54

55
### Basic UDF Definition
56

57
```python
58
from chdb.udf import chdb_udf
59
from chdb import query
60

61
# Simple arithmetic UDF
62
@chdb_udf()
63
def add_numbers(a, b):
64
    return str(int(a) + int(b))
65

66
# Use in SQL query
67
result = query("SELECT add_numbers('10', '20') as sum_result")
68
print(result)  # Returns: 30
69
```
70

71
### UDF with Different Return Types
72

73
```python
74
from chdb.udf import chdb_udf
75
from chdb import query
76

77
# UDF returning integer
78
@chdb_udf(return_type="Int32")
79
def multiply_int(x, y):
80
    return int(x) * int(y)
81

82
# UDF returning float
83
@chdb_udf(return_type="Float64")
84
def calculate_average(a, b, c):
85
    values = [float(a), float(b), float(c)]
86
    return sum(values) / len(values)
87

88
# UDF returning string (default)
89
@chdb_udf()
90
def format_name(first, last):
91
    return f"{first.title()} {last.upper()}"
92

93
# Use all UDFs in queries
94
result1 = query("SELECT multiply_int('5', '7') as product")
95
result2 = query("SELECT calculate_average('10.5', '20.3', '15.7') as avg")
96
result3 = query("SELECT format_name('john', 'doe') as formatted_name")
97

98
print("Product:", result1)
99
print("Average:", result2)
100
print("Formatted:", result3)
101
```
102

103
### UDF with Complex Logic
104

105
```python
106
from chdb.udf import chdb_udf
107
from chdb import query
108

109
@chdb_udf()
110
def validate_email(email):
111
    import re
112
    
113
    # Email validation pattern
114
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
115
    
116
    if re.match(pattern, email):
117
        return "valid"
118
    else:
119
        return "invalid"
120

121
@chdb_udf()
122
def extract_domain(email):
123
    import re
124
    
125
    match = re.search(r'@([a-zA-Z0-9.-]+)', email)
126
    if match:
127
        return match.group(1)
128
    else:
129
        return "unknown"
130

131
# Use UDFs with data processing
132
result = query("""
133
    SELECT 
134
        email,
135
        validate_email(email) as is_valid,
136
        extract_domain(email) as domain
137
    FROM (
138
        SELECT 'user@example.com' as email
139
        UNION ALL
140
        SELECT 'invalid.email'
141
        UNION ALL
142
        SELECT 'admin@company.org'
143
    )
144
""", "Pretty")
145

146
print(result)
147
```
148

149
### UDF with JSON Processing
150

151
```python
152
from chdb.udf import chdb_udf
153
from chdb import query
154

155
@chdb_udf()
156
def parse_json_field(json_str, field_name):
157
    import json
158
    
159
    try:
160
        data = json.loads(json_str)
161
        return str(data.get(field_name, ""))
162
    except:
163
        return "error"
164

165
@chdb_udf(return_type="Int32")
166
def count_json_keys(json_str):
167
    import json
168
    
169
    try:
170
        data = json.loads(json_str)
171
        if isinstance(data, dict):
172
            return len(data)
173
        else:
174
            return 0
175
    except:
176
        return -1
177

178
# Query with JSON UDFs
179
result = query("""
180
    SELECT 
181
        json_data,
182
        parse_json_field(json_data, 'name') as name,
183
        parse_json_field(json_data, 'age') as age,
184
        count_json_keys(json_data) as key_count
185
    FROM (
186
        SELECT '{"name": "Alice", "age": 30, "city": "NYC"}' as json_data
187
        UNION ALL
188
        SELECT '{"name": "Bob", "age": 25}'
189
        UNION ALL
190
        SELECT 'invalid json'
191
    )
192
""", "Pretty")
193

194
print(result)
195
```
196

197
### UDF with Data Transformations
198

199
```python
200
from chdb.udf import chdb_udf
201
from chdb import query
202

203
@chdb_udf()
204
def normalize_phone(phone):
205
    import re
206
    
207
    # Remove all non-digits
208
    digits = re.sub(r'\D', '', phone)
209
    
210
    # Format as (XXX) XXX-XXXX if US number
211
    if len(digits) == 10:
212
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
213
    elif len(digits) == 11 and digits[0] == '1':
214
        return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
215
    else:
216
        return "invalid"
217

218
@chdb_udf()
219
def categorize_age(age_str):
220
    try:
221
        age = int(age_str)
222
        if age < 18:
223
            return "minor"
224
        elif age < 65:
225
            return "adult"
226
        else:
227
            return "senior"
228
    except:
229
        return "unknown"
230

231
# Apply transformations to dataset
232
result = query("""
233
    SELECT 
234
        name,
235
        normalize_phone(phone) as formatted_phone,
236
        categorize_age(age) as age_category
237
    FROM (
238
        SELECT 'Alice' as name, '555-123-4567' as phone, '32' as age
239
        UNION ALL
240
        SELECT 'Bob', '(555) 987 6543', '17'
241
        UNION ALL
242
        SELECT 'Charlie', '15559876543', '67'
243
    )
244
""", "Pretty")
245

246
print(result)
247
```
248

249
### UDF with External Libraries
250

251
```python
252
from chdb.udf import chdb_udf
253
from chdb import query
254

255
@chdb_udf(return_type="Float64")
256
def calculate_distance(lat1, lon1, lat2, lon2):
257
    import math
258
    
259
    # Convert to radians
260
    lat1, lon1, lat2, lon2 = map(math.radians, [float(lat1), float(lon1), float(lat2), float(lon2)])
261
    
262
    # Haversine formula
263
    dlat = lat2 - lat1
264
    dlon = lon2 - lon1
265
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
266
    c = 2 * math.asin(math.sqrt(a))
267
    
268
    # Earth radius in kilometers
269
    r = 6371
270
    
271
    return c * r
272

273
@chdb_udf()
274
def hash_string(input_str):
275
    import hashlib
276
    
277
    return hashlib.md5(input_str.encode()).hexdigest()
278

279
# Use UDFs with geographic and crypto functions
280
result = query("""
281
    SELECT 
282
        city1,
283
        city2,
284
        calculate_distance(lat1, lon1, lat2, lon2) as distance_km,
285
        hash_string(CONCAT(city1, '-', city2)) as route_hash
286
    FROM (
287
        SELECT 
288
            'New York' as city1, '40.7128' as lat1, '-74.0060' as lon1,
289
            'Los Angeles' as city2, '34.0522' as lat2, '-118.2437' as lon2
290
        UNION ALL
291
        SELECT 
292
            'Chicago', '41.8781', '-87.6298',
293
            'Houston', '29.7604', '-95.3698'
294
    )
295
""", "Pretty")
296

297
print(result)
298
```
299

300
### Error Handling with UDFs
301

302
```python
303
from chdb.udf import chdb_udf
304
from chdb import query, ChdbError
305

306
@chdb_udf()
307
def safe_divide(a, b):
308
    try:
309
        num = float(a)
310
        den = float(b)
311
        
312
        if den == 0:
313
            return "division_by_zero"
314
        
315
        return str(num / den)
316
    except ValueError:
317
        return "invalid_input"
318
    except Exception:
319
        return "error"
320

321
try:
322
    result = query("""
323
        SELECT 
324
            a, b,
325
            safe_divide(a, b) as result
326
        FROM (
327
            SELECT '10' as a, '2' as b
328
            UNION ALL
329
            SELECT '15', '0'
330
            UNION ALL
331
            SELECT 'abc', '5'
332
        )
333
    """, "Pretty")
334
    
335
    print(result)
336
    
337
except ChdbError as e:
338
    print(f"Query with UDF failed: {e}")
339
```
340

341
### UDF Path Management
342

343
```python
344
from chdb.udf import chdb_udf
345
from chdb import query
346
import tempfile
347
import os
348

349
# Custom UDF path
350
custom_udf_path = tempfile.mkdtemp()
351

352
@chdb_udf()
353
def custom_function(x):
354
    return f"processed_{x}"
355

356
# Use UDF with custom path
357
result = query(
358
    "SELECT custom_function('test') as output",
359
    udf_path=custom_udf_path
360
)
361

362
print(result)
363

364
# Cleanup custom UDF path when done
365
import shutil
366
shutil.rmtree(custom_udf_path)
367
```
368

369
## Important UDF Guidelines
370

371
### Function Requirements
372
- **Stateless**: Functions must be stateless (no global state)
373
- **String inputs**: All parameters received as strings
374
- **Self-contained**: Import all required modules within the function
375
- **Error handling**: Handle exceptions gracefully to avoid query failures
376

377
### Performance Considerations
378
- **Process overhead**: Each UDF call spawns a Python process
379
- **Serialization**: Data converted between TabSeparated format
380
- **Best for**: Complex logic that can't be expressed in SQL
381
- **Avoid for**: Simple operations that SQL can handle efficiently
382

383
### Type System
384
- **Input types**: Always strings (converted from TabSeparated)
385
- **Return types**: Must match declared return_type
386
- **Valid return types**: Any ClickHouse data type
387
- **Type conversion**: Automatic conversion between Python and SQL types

Version

Tile

Files

udf.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

udf.mddocs/