0
# Apache Spark SQL - Functions
1
2
Apache Spark SQL provides a comprehensive library of **764 built-in functions** for data manipulation across all major functional categories. These functions enable complex data transformations, analytics, and processing operations within the distributed SQL query engine.
3
4
## Capabilities
5
6
### Core Column and Expression Functions (8 functions)
7
- Create column references and literal values for DataFrame operations
8
- Parse and evaluate SQL expressions within DataFrame transformations
9
- Handle broadcast hints for optimized distributed joins
10
- Support for strongly-typed literal creation with compile-time type safety
11
12
### Aggregate and Statistical Functions (123 functions)
13
- Standard aggregations: count, sum, average, min, max with null handling
14
- Statistical functions: correlation, covariance, standard deviation, variance, skewness, kurtosis
15
- Approximate functions: approx_count_distinct, approx_percentile with configurable precision
16
- Collection aggregations: collect_list, collect_set, first, last with ordering control
17
- Advanced aggregations: regr_* (regression), percentile_approx, histogram_numeric
18
19
### String and Text Processing Functions (105 functions)
20
- String manipulation: concatenation, substring, trimming, case conversion
21
- Regular expressions: regexp_replace, regexp_extract, regexp_like with full pattern support
22
- String formatting: format_string, format_number, printf with locale-aware options
23
- Encoding/decoding: base64, encode, decode, url_encode, url_decode
24
- Text analysis: soundex, levenshtein distance, string similarity functions
25
- Unicode operations: character encoding transformations, byte manipulation
26
27
### Date, Time, and Temporal Functions (113 functions)
28
- Current time functions: current_date, current_timestamp, now, localtimestamp
29
- Date extraction: year, month, day, hour, minute, second, quarter, week functions
30
- Date arithmetic: date_add, date_sub, datediff, months_between, add_months
31
- Formatting and parsing: date_format, unix_timestamp, from_unixtime, to_date, to_timestamp
32
- Timezone operations: from_utc_timestamp, to_utc_timestamp with timezone conversion
33
- Date truncation: date_trunc, trunc for period-based grouping
34
- Interval arithmetic: interval operations for complex temporal calculations
35
36
### Mathematical and Numeric Functions (131 functions)
37
- Basic arithmetic: abs, ceil, floor, round, rint with precision control
38
- Trigonometric: sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh
39
- Logarithmic and exponential: log, log10, log2, ln, exp, exp2, expm1, log1p
40
- Power and root: pow, power, sqrt, cbrt with arbitrary precision
41
- Statistical distributions: normal, binomial, poisson distribution functions
42
- Number theory: gcd, lcm, factorial, sign, signum
43
- Bitwise operations: bit_and, bit_or, bit_xor, bit_not, bit_count
44
- Number formatting: hex, unhex, conv for base conversion
45
46
### Array, Map, and Struct Functions (66 functions total)
47
- **Array functions (32)**: array_*, sort_array, reverse, array_contains, array_position
48
- **Map functions (11)**: map_*, map_keys, map_values, map_entries, map_from_entries
49
- **Struct functions (3)**: struct creation and field access operations
50
- **Collection functions (23)**: explode, explode_outer, posexplode, flatten
51
- Complex nested data manipulation: array_zip, array_union, array_intersect
52
- Element access: element_at, array_remove, array_distinct operations
53
54
### Conditional and Control Flow Functions (9 functions)
55
- Conditional logic: when, otherwise, case expressions with multiple branches
56
- Null handling: coalesce, nvl, nvl2, isnull, isnotnull with comprehensive null safety
57
- Value comparison: greatest, least across multiple columns
58
- Branching: if, decode for data-driven conditional processing
59
60
### Window and Analytical Functions (18 functions)
61
- Ranking: row_number, rank, dense_rank, percent_rank, ntile
62
- Value access: lag, lead, first_value, last_value with offset and default handling
63
- Cumulative operations: cume_dist for cumulative distribution calculations
64
- Percentile functions: percentile_cont, percentile_disc for statistical analysis
65
66
### Advanced Specialized Functions
67
- **Hash functions (7)**: md5, sha1, sha2, crc32, hash, xxhash64, murmur3_hash
68
- **JSON functions (20)**: json_extract, json_parse, json_array_length, to_json, from_json
69
- **XML functions (19)**: xpath, xpath_string, xpath_boolean, xpath_number functions
70
- **CSV functions (7)**: from_csv, to_csv, schema_of_csv with delimiter handling
71
- **URL functions (7)**: parse_url, url_decode, url_encode with component extraction
72
- **Bitwise functions (11)**: Complete bitwise operation suite
73
- **Generator functions (7)**: explode, posexplode, inline, stack for data expansion
74
- **VARIANT functions (10)**: variant_get, variant_explode for semi-structured data
75
- **Partition transform functions (12)**: years, months, days, hours, bucket transformations
76
- **UDF functions (27)**: User-defined function registration and management utilities
77
78
## API Reference
79
80
### Core Functions
81
```scala { .api }
82
object functions {
83
// Column creation and literals
84
def col(colName: String): Column
85
def column(colName: String): Column
86
def lit(literal: Any): Column
87
def typedLit[T : TypeTag](literal: T): Column
88
def expr(expr: String): Column
89
90
// Special values
91
def broadcast[T](df: Dataset[T]): Dataset[T]
92
def monotonically_increasing_id(): Column
93
def spark_partition_id(): Column
94
def input_file_name(): Column
95
}
96
```
97
98
### Aggregate Functions
99
```scala { .api }
100
// Basic aggregations
101
def count(e: Column): Column
102
def count(columnName: String): Column
103
def countDistinct(expr: Column, exprs: Column*): Column
104
def countDistinct(columnName: String, columnNames: String*): Column
105
def approx_count_distinct(e: Column): Column
106
def approx_count_distinct(e: Column, rsd: Double): Column
107
def approx_count_distinct(columnName: String): Column
108
def approx_count_distinct(columnName: String, rsd: Double): Column
109
110
def sum(e: Column): Column
111
def sum(columnName: String): Column
112
def sumDistinct(e: Column): Column
113
def sumDistinct(columnName: String): Column
114
def avg(e: Column): Column
115
def avg(columnName: String): Column
116
def mean(e: Column): Column
117
def mean(columnName: String): Column
118
def min(e: Column): Column
119
def min(columnName: String): Column
120
def max(e: Column): Column
121
def max(columnName: String): Column
122
123
// Statistical aggregations
124
def stddev(e: Column): Column
125
def stddev(columnName: String): Column
126
def stddev_pop(e: Column): Column
127
def stddev_pop(columnName: String): Column
128
def stddev_samp(e: Column): Column
129
def stddev_samp(columnName: String): Column
130
def variance(e: Column): Column
131
def variance(columnName: String): Column
132
def var_pop(e: Column): Column
133
def var_pop(columnName: String): Column
134
def var_samp(e: Column): Column
135
def var_samp(columnName: String): Column
136
137
def corr(column1: Column, column2: Column): Column
138
def corr(columnName1: String, columnName2: String): Column
139
def covar_pop(column1: Column, column2: Column): Column
140
def covar_pop(columnName1: String, columnName2: String): Column
141
def covar_samp(column1: Column, column2: Column): Column
142
def covar_samp(columnName1: String, columnName2: String): Column
143
144
def skewness(e: Column): Column
145
def skewness(columnName: String): Column
146
def kurtosis(e: Column): Column
147
def kurtosis(columnName: String): Column
148
149
// Collection aggregations
150
def collect_list(e: Column): Column
151
def collect_list(columnName: String): Column
152
def collect_set(e: Column): Column
153
def collect_set(columnName: String): Column
154
def first(e: Column): Column
155
def first(e: Column, ignoreNulls: Boolean): Column
156
def first(columnName: String): Column
157
def first(columnName: String, ignoreNulls: Boolean): Column
158
def last(e: Column): Column
159
def last(e: Column, ignoreNulls: Boolean): Column
160
def last(columnName: String): Column
161
def last(columnName: String, ignoreNulls: Boolean): Column
162
163
// Approximate aggregations
164
def approx_percentile(e: Column, percentage: Column): Column
165
def approx_percentile(e: Column, percentage: Column, accuracy: Column): Column
166
def approx_percentile(columnName: String, percentage: Double): Column
167
def approx_percentile(columnName: String, percentage: Double, accuracy: Double): Column
168
169
// Regression functions
170
def regr_avgx(y: Column, x: Column): Column
171
def regr_avgy(y: Column, x: Column): Column
172
def regr_count(y: Column, x: Column): Column
173
def regr_intercept(y: Column, x: Column): Column
174
def regr_r2(y: Column, x: Column): Column
175
def regr_slope(y: Column, x: Column): Column
176
def regr_sxx(y: Column, x: Column): Column
177
def regr_sxy(y: Column, x: Column): Column
178
def regr_syy(y: Column, x: Column): Column
179
180
// Percentile functions
181
def percentile_approx(col: Column, percentage: Column): Column
182
def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column
183
def median(col: Column): Column
184
```
185
186
### Mathematical Functions
187
```scala { .api }
188
// Basic arithmetic and rounding
189
def abs(e: Column): Column
190
def ceil(e: Column): Column
191
def ceiling(e: Column): Column
192
def floor(e: Column): Column
193
def round(e: Column): Column
194
def round(e: Column, scale: Int): Column
195
def rint(e: Column): Column
196
def sign(e: Column): Column
197
def signum(e: Column): Column
198
199
// Trigonometric functions
200
def sin(e: Column): Column
201
def cos(e: Column): Column
202
def tan(e: Column): Column
203
def asin(e: Column): Column
204
def acos(e: Column): Column
205
def atan(e: Column): Column
206
def atan2(y: Column, x: Column): Column
207
def sinh(e: Column): Column
208
def cosh(e: Column): Column
209
def tanh(e: Column): Column
210
def asinh(e: Column): Column
211
def acosh(e: Column): Column
212
def atanh(e: Column): Column
213
214
// Logarithmic and exponential
215
def exp(e: Column): Column
216
def exp2(e: Column): Column
217
def expm1(e: Column): Column
218
def log(e: Column): Column
219
def log(base: Double, e: Column): Column
220
def log10(e: Column): Column
221
def log2(e: Column): Column
222
def log1p(e: Column): Column
223
def ln(e: Column): Column
224
225
// Power and root functions
226
def pow(l: Column, r: Column): Column
227
def power(l: Column, r: Column): Column
228
def sqrt(e: Column): Column
229
def cbrt(e: Column): Column
230
231
// Random number generation
232
def rand(): Column
233
def rand(seed: Long): Column
234
def randn(): Column
235
def randn(seed: Long): Column
236
237
// Number conversion and formatting
238
def hex(column: Column): Column
239
def unhex(column: Column): Column
240
def conv(num: Column, fromBase: Int, toBase: Int): Column
241
def bin(e: Column): Column
242
243
// Advanced mathematical functions
244
def factorial(e: Column): Column
245
def degrees(e: Column): Column
246
def radians(e: Column): Column
247
def gcd(left: Column, right: Column): Column
248
def lcm(left: Column, right: Column): Column
249
def width_bucket(v: Column, min: Column, max: Column, numBucket: Column): Column
250
```
251
252
### Date and Time Functions
253
```scala { .api }
254
// Current date/time functions
255
def current_date(): Column
256
def current_timestamp(): Column
257
def now(): Column
258
def localtimestamp(): Column
259
def current_timezone(): Column
260
261
// Date/time extraction functions
262
def year(e: Column): Column
263
def quarter(e: Column): Column
264
def month(e: Column): Column
265
def weekofyear(e: Column): Column
266
def weekday(e: Column): Column
267
def dayofyear(e: Column): Column
268
def dayofmonth(e: Column): Column
269
def dayofweek(e: Column): Column
270
def hour(e: Column): Column
271
def minute(e: Column): Column
272
def second(e: Column): Column
273
def extract(field: Column, source: Column): Column
274
275
// Date arithmetic and manipulation
276
def date_add(start: Column, days: Int): Column
277
def date_add(start: Column, days: Column): Column
278
def date_sub(start: Column, days: Int): Column
279
def date_sub(start: Column, days: Column): Column
280
def datediff(end: Column, start: Column): Column
281
def months_between(end: Column, start: Column): Column
282
def months_between(end: Column, start: Column, roundOff: Boolean): Column
283
def add_months(startDate: Column, numMonths: Int): Column
284
def add_months(startDate: Column, numMonths: Column): Column
285
286
// Date formatting and parsing
287
def date_format(dateExpr: Column, format: String): Column
288
def from_unixtime(ut: Column): Column
289
def from_unixtime(ut: Column, f: String): Column
290
def unix_timestamp(): Column
291
def unix_timestamp(s: Column): Column
292
def unix_timestamp(s: Column, p: String): Column
293
def to_date(e: Column): Column
294
def to_date(e: Column, fmt: String): Column
295
def to_timestamp(s: Column): Column
296
def to_timestamp(s: Column, fmt: String): Column
297
def make_date(year: Column, month: Column, day: Column): Column
298
def make_timestamp(
299
year: Column, month: Column, day: Column,
300
hour: Column, min: Column, sec: Column
301
): Column
302
def make_timestamp(
303
year: Column, month: Column, day: Column,
304
hour: Column, min: Column, sec: Column, timezone: Column
305
): Column
306
307
// Date truncation and boundaries
308
def date_trunc(format: String, timestamp: Column): Column
309
def trunc(date: Column, format: String): Column
310
def last_day(e: Column): Column
311
def next_day(date: Column, dayOfWeek: String): Column
312
313
// Timezone operations
314
def from_utc_timestamp(ts: Column, tz: String): Column
315
def from_utc_timestamp(ts: Column, tz: Column): Column
316
def to_utc_timestamp(ts: Column, tz: String): Column
317
def to_utc_timestamp(ts: Column, tz: Column): Column
318
319
// Date validation and testing
320
def isnan(e: Column): Column
321
def isnull(e: Column): Column
322
```
323
324
### Array Functions
325
```scala { .api }
326
// Array creation and manipulation
327
def array(cols: Column*): Column
328
def array_contains(column: Column, value: Any): Column
329
def array_distinct(e: Column): Column
330
def array_except(col1: Column, col2: Column): Column
331
def array_intersect(col1: Column, col2: Column): Column
332
def array_join(column: Column, delimiter: String): Column
333
def array_join(column: Column, delimiter: String, nullReplacement: String): Column
334
def array_max(e: Column): Column
335
def array_min(e: Column): Column
336
def array_position(column: Column, value: Any): Column
337
def array_remove(column: Column, element: Any): Column
338
def array_repeat(left: Column, right: Column): Column
339
def array_size(e: Column): Column
340
def array_sort(e: Column): Column
341
def array_union(col1: Column, col2: Column): Column
342
def arrays_overlap(a1: Column, a2: Column): Column
343
def arrays_zip(cols: Column*): Column
344
345
// Array aggregation and transformation
346
def flatten(e: Column): Column
347
def reverse(e: Column): Column
348
def shuffle(e: Column): Column
349
def slice(x: Column, start: Int, length: Int): Column
350
def sort_array(e: Column): Column
351
def sort_array(e: Column, asc: Boolean): Column
352
353
// Array element access
354
def element_at(column: Column, extraction: Any): Column
355
def get_json_object(e: Column, path: String): Column
356
357
// Array generation
358
def sequence(start: Column, stop: Column): Column
359
def sequence(start: Column, stop: Column, step: Column): Column
360
```
361
362
### Map Functions
363
```scala { .api }
364
// Map creation and manipulation
365
def map(cols: Column*): Column
366
def map_contains_key(column: Column, key: Any): Column
367
def map_entries(e: Column): Column
368
def map_from_arrays(keys: Column, values: Column): Column
369
def map_from_entries(e: Column): Column
370
def map_keys(e: Column): Column
371
def map_values(e: Column): Column
372
def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column
373
374
// Map transformations
375
def transform_keys(expr: Column, f: (Column, Column) => Column): Column
376
def transform_values(expr: Column, f: (Column, Column) => Column): Column
377
```
378
379
### JSON Functions
380
```scala { .api }
381
// JSON parsing and manipulation
382
def from_json(e: Column, schema: DataType): Column
383
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column
384
def from_json(e: Column, schema: String): Column
385
def from_json(e: Column, schema: String, options: Map[String, String]): Column
386
def to_json(e: Column): Column
387
def to_json(e: Column, options: Map[String, String]): Column
388
389
def json_array_length(jsonArray: Column): Column
390
def json_object_keys(json: Column): Column
391
def json_tuple(json: Column, fields: String*): Column
392
def get_json_object(e: Column, path: String): Column
393
def json_extract(jsonStr: Column, path: Column): Column
394
395
// Schema inference
396
def schema_of_json(json: Column): Column
397
def schema_of_json(json: String): Column
398
```
399
400
### Window Functions
401
```scala { .api }
402
// Ranking functions
403
def row_number(): Column
404
def rank(): Column
405
def dense_rank(): Column
406
def percent_rank(): Column
407
def ntile(n: Int): Column
408
def cume_dist(): Column
409
410
// Value access functions
411
def lag(e: Column, offset: Int): Column
412
def lag(e: Column, offset: Int, defaultValue: Any): Column
413
def lead(e: Column, offset: Int): Column
414
def lead(e: Column, offset: Int, defaultValue: Any): Column
415
def first_value(e: Column): Column
416
def first_value(e: Column, ignoreNulls: Boolean): Column
417
def last_value(e: Column): Column
418
def last_value(e: Column, ignoreNulls: Boolean): Column
419
def nth_value(e: Column, offset: Int): Column
420
def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column
421
```
422
423
### Hash Functions
424
```scala { .api }
425
// Cryptographic hash functions
426
def md5(e: Column): Column
427
def sha1(e: Column): Column
428
def sha2(e: Column, numBits: Int): Column
429
def hash(cols: Column*): Column
430
def xxhash64(cols: Column*): Column
431
def crc32(e: Column): Column
432
def murmur3_hash(cols: Column*): Column
433
```
434
435
### Conditional Functions
436
```scala { .api }
437
// Conditional logic
438
def when(condition: Column, value: Any): Column
439
def coalesce(e: Column*): Column
440
def isnull(e: Column): Column
441
def isnan(e: Column): Column
442
def isnotnull(e: Column): Column
443
def nvl(col1: Column, col2: Column): Column
444
def nvl2(col1: Column, col2: Column, col3: Column): Column
445
def nullif(col1: Column, col2: Column): Column
446
def if(condition: Column, trueValue: Column, falseValue: Column): Column
447
def greatest(exprs: Column*): Column
448
def least(exprs: Column*): Column
449
```
450
451
### String Functions
452
```scala { .api }
453
// String operations
454
def length(e: Column): Column
455
def length(columnName: String): Column
456
def char_length(e: Column): Column
457
def character_length(e: Column): Column
458
def bit_length(e: Column): Column
459
def octet_length(e: Column): Column
460
461
def upper(e: Column): Column
462
def upper(columnName: String): Column
463
def lower(e: Column): Column
464
def lower(columnName: String): Column
465
def initcap(e: Column): Column
466
def initcap(columnName: String): Column
467
468
def trim(e: Column): Column
469
def trim(columnName: String): Column
470
def trim(e: Column, trimString: String): Column
471
def ltrim(e: Column): Column
472
def ltrim(columnName: String): Column
473
def ltrim(e: Column, trimString: String): Column
474
def rtrim(e: Column): Column
475
def rtrim(columnName: String): Column
476
def rtrim(e: Column, trimString: String): Column
477
478
// String concatenation
479
def concat(exprs: Column*): Column
480
def concat_ws(sep: String, exprs: Column*): Column
481
482
// Substring operations
483
def substring(str: Column, pos: Int, len: Int): Column
484
def substring(columnName: String, pos: Int, len: Int): Column
485
def substr(str: Column, pos: Column, len: Column): Column
486
def left(str: Column, len: Column): Column
487
def right(str: Column, len: Column): Column
488
489
// String searching and matching
490
def instr(str: Column, substring: String): Column
491
def locate(substr: String, str: Column): Column
492
def locate(substr: String, str: Column, pos: Int): Column
493
def find_in_set(str: Column, strArray: Column): Column
494
495
// Regular expressions
496
def regexp_replace(e: Column, pattern: String, replacement: String): Column
497
def regexp_replace(e: Column, pattern: Column, replacement: Column): Column
498
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
499
def regexp_extract_all(e: Column, exp: String): Column
500
def regexp_extract_all(e: Column, exp: String, groupIdx: Int): Column
501
def rlike(str: Column, regexp: String): Column
502
def regexp_like(str: Column, regexp: String): Column
503
504
// String splitting and parsing
505
def split(str: Column, pattern: String): Column
506
def split(str: Column, pattern: String, limit: Int): Column
507
508
// String formatting
509
def format_string(format: String, arguments: Column*): Column
510
def format_number(x: Column, d: Int): Column
511
def printf(format: Column, arguments: Column*): Column
512
513
// String padding and alignment
514
def lpad(str: Column, len: Int, pad: String): Column
515
def rpad(str: Column, len: Int, pad: String): Column
516
517
// String translation and encoding
518
def translate(src: Column, matchingString: String, replaceString: String): Column
519
def encode(value: Column, charset: String): Column
520
def decode(value: Column, charset: String): Column
521
def base64(e: Column): Column
522
def unbase64(e: Column): Column
523
524
// String comparison and testing
525
def soundex(e: Column): Column
526
def levenshtein(l: Column, r: Column): Column
527
def levenshtein(l: Column, r: Column, threshold: Int): Column
528
```
529
530
### Date and Time Functions
531
```scala { .api }
532
// Current date/time functions
533
def current_date(): Column
534
def current_timestamp(): Column
535
def now(): Column
536
def localtimestamp(): Column
537
def current_timezone(): Column
538
539
// Date/time extraction
540
def year(e: Column): Column
541
def year(columnName: String): Column
542
def quarter(e: Column): Column
543
def quarter(columnName: String): Column
544
def month(e: Column): Column
545
def month(columnName: String): Column
546
def dayofyear(e: Column): Column
547
def dayofyear(columnName: String): Column
548
def dayofmonth(e: Column): Column
549
def dayofmonth(columnName: String): Column
550
def dayofweek(e: Column): Column
551
def dayofweek(columnName: String): Column
552
def weekofyear(e: Column): Column
553
def weekofyear(columnName: String): Column
554
555
def hour(e: Column): Column
556
def hour(columnName: String): Column
557
def minute(e: Column): Column
558
def minute(columnName: String): Column
559
def second(e: Column): Column
560
def second(columnName: String): Column
561
562
// Date arithmetic
563
def date_add(start: Column, days: Int): Column
564
def date_add(start: Column, days: Column): Column
565
def date_add(columnName: String, days: Int): Column
566
def date_sub(start: Column, days: Int): Column
567
def date_sub(start: Column, days: Column): Column
568
def date_sub(columnName: String, days: Int): Column
569
def datediff(end: Column, start: Column): Column
570
def datediff(end: String, start: String): Column
571
def months_between(end: Column, start: Column): Column
572
def months_between(end: Column, start: Column, roundOff: Boolean): Column
573
def add_months(startDate: Column, numMonths: Int): Column
574
def add_months(startDate: Column, numMonths: Column): Column
575
576
// Date formatting and parsing
577
def date_format(dateExpr: Column, format: String): Column
578
def date_format(columnName: String, format: String): Column
579
def from_unixtime(ut: Column): Column
580
def from_unixtime(ut: Column, f: String): Column
581
def from_unixtime(columnName: String): Column
582
def from_unixtime(columnName: String, format: String): Column
583
def unix_timestamp(): Column
584
def unix_timestamp(s: Column): Column
585
def unix_timestamp(s: Column, p: String): Column
586
def unix_timestamp(columnName: String): Column
587
def unix_timestamp(columnName: String, pattern: String): Column
588
589
def to_date(e: Column): Column
590
def to_date(e: Column, fmt: String): Column
591
def to_timestamp(s: Column): Column
592
def to_timestamp(s: Column, fmt: String): Column
593
594
// Date truncation and rounding
595
def date_trunc(format: String, timestamp: Column): Column
596
def trunc(date: Column, format: String): Column
597
598
// Timezone operations
599
def from_utc_timestamp(ts: Column, tz: String): Column
600
def from_utc_timestamp(ts: Column, tz: Column): Column
601
def to_utc_timestamp(ts: Column, tz: String): Column
602
def to_utc_timestamp(ts: Column, tz: Column): Column
603
604
// Date sequence generation
605
def sequence(start: Column, stop: Column, step: Column): Column
606
def date_sequence(start: Column, stop: Column, step: Column): Column
607
608
// Last day operations
609
def last_day(e: Column): Column
610
def last_day(columnName: String): Column
611
def next_day(date: Column, dayOfWeek: String): Column
612
```
613
614
### Mathematical Functions
615
```scala { .api }
616
// Basic arithmetic
617
def abs(e: Column): Column
618
def abs(columnName: String): Column
619
def negate(e: Column): Column
620
def sqrt(e: Column): Column
621
def sqrt(columnName: String): Column
622
def cbrt(e: Column): Column
623
def pow(l: Column, r: Column): Column
624
def pow(l: Column, r: Double): Column
625
def pow(columnName: String, p: Double): Column
626
def power(l: Column, r: Column): Column
627
628
// Rounding functions
629
def round(e: Column): Column
630
def round(e: Column, scale: Int): Column
631
def round(columnName: String): Column
632
def round(columnName: String, scale: Int): Column
633
def bround(e: Column): Column
634
def bround(e: Column, scale: Int): Column
635
def ceil(e: Column): Column
636
def ceil(columnName: String): Column
637
def ceiling(e: Column): Column
638
def floor(e: Column): Column
639
def floor(columnName: String): Column
640
def rint(e: Column): Column
641
def rint(columnName: String): Column
642
643
// Trigonometric functions
644
def sin(e: Column): Column
645
def sin(columnName: String): Column
646
def cos(e: Column): Column
647
def cos(columnName: String): Column
648
def tan(e: Column): Column
649
def tan(columnName: String): Column
650
def asin(e: Column): Column
651
def asin(columnName: String): Column
652
def acos(e: Column): Column
653
def acos(columnName: String): Column
654
def atan(e: Column): Column
655
def atan(columnName: String): Column
656
def atan2(l: Column, r: Column): Column
657
def atan2(l: Column, r: Double): Column
658
def atan2(columnName1: String, columnName2: String): Column
659
def sinh(e: Column): Column
660
def cosh(e: Column): Column
661
def tanh(e: Column): Column
662
663
// Logarithmic and exponential
664
def log(e: Column): Column
665
def log(columnName: String): Column
666
def log(base: Double, e: Column): Column
667
def log(base: Double, columnName: String): Column
668
def log10(e: Column): Column
669
def log10(columnName: String): Column
670
def log1p(e: Column): Column
671
def log1p(columnName: String): Column
672
def log2(e: Column): Column
673
def exp(e: Column): Column
674
def exp(columnName: String): Column
675
def expm1(e: Column): Column
676
def expm1(columnName: String): Column
677
678
// Angle conversion
679
def degrees(e: Column): Column
680
def degrees(columnName: String): Column
681
def radians(e: Column): Column
682
def radians(columnName: String): Column
683
def toDegrees(e: Column): Column
684
def toRadians(e: Column): Column
685
686
// Special mathematical functions
687
def factorial(e: Column): Column
688
def signum(e: Column): Column
689
def signum(columnName: String): Column
690
691
// Bitwise operations
692
def bitwiseNOT(e: Column): Column
693
def shiftLeft(e: Column, numBits: Int): Column
694
def shiftRight(e: Column, numBits: Int): Column
695
def shiftRightUnsigned(e: Column, numBits: Int): Column
696
697
// Random functions
698
def rand(): Column
699
def rand(seed: Long): Column
700
def randn(): Column
701
def randn(seed: Long): Column
702
```
703
704
### Array Functions
705
```scala { .api }
706
// Array creation
707
def array(cols: Column*): Column
708
def array_repeat(e: Column, count: Int): Column
709
def array_repeat(left: Column, right: Column): Column
710
711
// Array access and properties
712
def size(e: Column): Column
713
def array_min(e: Column): Column
714
def array_max(e: Column): Column
715
def array_position(column: Column, value: Any): Column
716
def element_at(column: Column, extraction: Any): Column
717
718
// Array testing
719
def array_contains(column: Column, value: Any): Column
720
def arrays_overlap(a1: Column, a2: Column): Column
721
722
// Array transformations
723
def array_distinct(e: Column): Column
724
def array_remove(column: Column, element: Column): Column
725
def array_sort(e: Column): Column
726
def array_union(col1: Column, col2: Column): Column
727
def array_intersect(col1: Column, col2: Column): Column
728
def array_except(col1: Column, col2: Column): Column
729
def array_join(column: Column, delimiter: String): Column
730
def array_join(column: Column, delimiter: String, nullReplacement: String): Column
731
def reverse(e: Column): Column
732
def shuffle(e: Column): Column
733
def slice(x: Column, start: Int, length: Int): Column
734
def slice(x: Column, start: Column, length: Column): Column
735
def sort_array(e: Column): Column
736
def sort_array(e: Column, asc: Boolean): Column
737
738
// Array aggregation
739
def array_agg(e: Column): Column
740
def flatten(e: Column): Column
741
def sequence(start: Column, stop: Column): Column
742
def sequence(start: Column, stop: Column, step: Column): Column
743
744
// Array explosion
745
def explode(e: Column): Column
746
def explode_outer(e: Column): Column
747
def posexplode(e: Column): Column
748
def posexplode_outer(e: Column): Column
749
750
// Array generation
751
def array_zip(e: Column*): Column
752
def arrays_zip(e: Column*): Column
753
```
754
755
### Map Functions
756
```scala { .api }
757
// Map creation
758
def map(cols: Column*): Column
759
def map_from_arrays(keys: Column, values: Column): Column
760
def map_from_entries(e: Column): Column
761
762
// Map access
763
def map_keys(e: Column): Column
764
def map_values(e: Column): Column
765
def map_entries(e: Column): Column
766
def map_concat(cols: Column*): Column
767
768
// Map filtering and transformation
769
def map_filter(expr: Column, f: (Column, Column) => Column): Column
770
def transform_keys(expr: Column, f: (Column, Column) => Column): Column
771
def transform_values(expr: Column, f: (Column, Column) => Column): Column
772
773
// Map explosion
774
def explode(e: Column): Column
775
def explode_outer(e: Column): Column
776
def posexplode(e: Column): Column
777
def posexplode_outer(e: Column): Column
778
```
779
780
### Struct Functions
781
```scala { .api }
782
// Struct creation
783
def struct(cols: Column*): Column
784
def named_struct(cols: Column*): Column
785
786
// JSON operations
787
def from_json(e: Column, schema: StructType): Column
788
def from_json(e: Column, schema: String): Column
789
def from_json(e: Column, schema: Column): Column
790
def from_json(e: Column, schema: DataType): Column
791
def from_json(e: Column, schema: StructType, options: Map[String, String]): Column
792
def to_json(e: Column): Column
793
def to_json(e: Column, options: Map[String, String]): Column
794
def json_object_keys(json: Column): Column
795
def json_array_length(jsonArray: Column): Column
796
def json_tuple(json: Column, fields: String*): Column
797
def get_json_object(e: Column, path: String): Column
798
799
// Schema operations
800
def schema_of_json(json: Column): Column
801
def schema_of_json(json: String): Column
802
def schema_of_csv(csv: Column): Column
803
def schema_of_csv(csv: String): Column
804
```
805
806
### Conditional Functions
807
```scala { .api }
808
// Conditional expressions
809
def when(condition: Column, value: Any): Column
810
def coalesce(e: Column*): Column
811
def greatest(exprs: Column*): Column
812
def greatest(columnName: String, columnNames: String*): Column
813
def least(exprs: Column*): Column
814
def least(columnName: String, columnNames: String*): Column
815
816
// Null handling
817
def isnull(e: Column): Column
818
def isnan(e: Column): Column
819
def nanvl(col1: Column, col2: Column): Column
820
def nvl(col1: Column, col2: Column): Column
821
def nvl2(col1: Column, col2: Column, col3: Column): Column
822
def ifnull(col1: Column, col2: Column): Column
823
def nullif(col1: Column, col2: Column): Column
824
```
825
826
### Window Functions
827
```scala { .api }
828
// Ranking functions
829
def row_number(): Column
830
def rank(): Column
831
def dense_rank(): Column
832
def percent_rank(): Column
833
def ntile(n: Int): Column
834
def cume_dist(): Column
835
836
// Offset functions
837
def lag(e: Column, offset: Int): Column
838
def lag(e: Column, offset: Int, defaultValue: Any): Column
839
def lag(columnName: String, offset: Int): Column
840
def lag(columnName: String, offset: Int, defaultValue: Any): Column
841
def lead(e: Column, offset: Int): Column
842
def lead(e: Column, offset: Int, defaultValue: Any): Column
843
def lead(columnName: String, offset: Int): Column
844
def lead(columnName: String, offset: Int, defaultValue: Any): Column
845
846
// Value functions
847
def first_value(e: Column): Column
848
def first_value(e: Column, ignoreNulls: Boolean): Column
849
def last_value(e: Column): Column
850
def last_value(e: Column, ignoreNulls: Boolean): Column
851
def nth_value(e: Column, n: Int): Column
852
def nth_value(e: Column, n: Int, ignoreNulls: Boolean): Column
853
```
854
855
### Type Conversion Functions
856
```scala { .api }
857
// Type casting
858
def cast(e: Column, dataType: DataType): Column
859
def cast(columnName: String, dataType: String): Column
860
861
// Binary operations
862
def bin(e: Column): Column
863
def hex(column: Column): Column
864
def unhex(column: Column): Column
865
def md5(e: Column): Column
866
def sha1(e: Column): Column
867
def sha2(e: Column, numBits: Int): Column
868
def hash(cols: Column*): Column
869
def xxhash64(cols: Column*): Column
870
def crc32(e: Column): Column
871
```
872
873
## Usage Examples
874
875
### Basic Function Usage
876
```scala
877
import org.apache.spark.sql.functions._
878
import org.apache.spark.sql.SparkSession
879
880
val spark = SparkSession.builder().appName("Functions Demo").getOrCreate()
881
import spark.implicits._
882
883
// Sample data
884
val df = Seq(
885
("Alice", 25, 50000.0, "2023-01-15"),
886
("Bob", 30, 60000.0, "2023-02-20"),
887
("Charlie", 35, 70000.0, "2023-03-10")
888
).toDF("name", "age", "salary", "hire_date")
889
890
// String functions
891
val stringOps = df.select(
892
upper($"name").as("upper_name"),
893
length($"name").as("name_length"),
894
concat($"name", lit(" - "), $"age").as("name_age"),
895
regexp_replace($"name", "a", "@").as("replaced_name")
896
)
897
898
// Mathematical functions
899
val mathOps = df.select(
900
$"name",
901
abs($"age" - 30).as("age_diff_from_30"),
902
round($"salary" / 1000, 2).as("salary_in_k"),
903
pow($"age", 2).as("age_squared"),
904
sqrt($"salary").as("sqrt_salary")
905
)
906
907
// Date functions
908
val dateOps = df.select(
909
$"name",
910
to_date($"hire_date").as("hire_date_parsed"),
911
year(to_date($"hire_date")).as("hire_year"),
912
months_between(current_date(), to_date($"hire_date")).as("months_employed"),
913
date_add(to_date($"hire_date"), 365).as("one_year_later")
914
)
915
```
916
917
### Aggregate Functions
918
```scala
919
// Basic aggregations
920
val basicStats = df.agg(
921
count($"name").as("total_employees"),
922
avg($"salary").as("avg_salary"),
923
min($"age").as("min_age"),
924
max($"salary").as("max_salary"),
925
stddev($"salary").as("salary_stddev")
926
)
927
928
// Grouped aggregations
929
val groupedStats = df.groupBy($"age" > 30)
930
.agg(
931
count($"name").as("count"),
932
sum($"salary").as("total_salary"),
933
avg($"salary").as("avg_salary"),
934
collect_list($"name").as("names"),
935
first($"name").as("first_name"),
936
last($"name").as("last_name")
937
)
938
939
// Statistical functions
940
val correlationStats = df.select(
941
corr($"age", $"salary").as("age_salary_correlation"),
942
covar_samp($"age", $"salary").as("age_salary_covariance")
943
)
944
```
945
946
### Array and Map Functions
947
```scala
948
// Sample data with arrays
949
val arrayData = Seq(
950
("Alice", Array("reading", "swimming", "cooking")),
951
("Bob", Array("hiking", "reading", "gaming")),
952
("Charlie", Array("cooking", "gaming"))
953
).toDF("name", "hobbies")
954
955
// Array operations
956
val arrayOps = arrayData.select(
957
$"name",
958
$"hobbies",
959
size($"hobbies").as("hobby_count"),
960
array_contains($"hobbies", "reading").as("likes_reading"),
961
array_sort($"hobbies").as("sorted_hobbies"),
962
array_distinct($"hobbies").as("unique_hobbies")
963
)
964
965
// Explode arrays
966
val exploded = arrayData.select(
967
$"name",
968
explode($"hobbies").as("hobby")
969
)
970
971
// Map operations
972
val mapData = Seq(
973
("Alice", Map("skill1" -> "Java", "skill2" -> "Scala")),
974
("Bob", Map("skill1" -> "Python", "skill2" -> "SQL"))
975
).toDF("name", "skills")
976
977
val mapOps = mapData.select(
978
$"name",
979
map_keys($"skills").as("skill_names"),
980
map_values($"skills").as("skill_values"),
981
$"skills".getItem("skill1").as("primary_skill")
982
)
983
```
984
985
### Window Functions
986
```scala
987
import org.apache.spark.sql.expressions.Window
988
989
// Window specifications
990
val windowSpec = Window.partitionBy($"age" > 30).orderBy($"salary".desc)
991
val unboundedWindow = Window.partitionBy($"age" > 30)
992
.orderBy($"salary".desc)
993
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
994
995
// Window function usage
996
val windowedData = df.select(
997
$"name",
998
$"age",
999
$"salary",
1000
row_number().over(windowSpec).as("salary_rank"),
1001
rank().over(windowSpec).as("salary_rank_with_ties"),
1002
dense_rank().over(windowSpec).as("salary_dense_rank"),
1003
percent_rank().over(windowSpec).as("salary_percentile"),
1004
ntile(3).over(windowSpec).as("salary_tertile"),
1005
lag($"salary", 1).over(windowSpec).as("previous_salary"),
1006
lead($"salary", 1).over(windowSpec).as("next_salary"),
1007
sum($"salary").over(unboundedWindow).as("running_total_salary"),
1008
avg($"salary").over(unboundedWindow).as("running_avg_salary")
1009
)
1010
```
1011
1012
### Conditional and Null Handling Functions
1013
```scala
1014
// Sample data with nulls
1015
val dataWithNulls = Seq(
1016
("Alice", Some(25), Some(50000.0)),
1017
("Bob", None, Some(60000.0)),
1018
("Charlie", Some(35), None)
1019
).toDF("name", "age", "salary")
1020
1021
// Conditional operations
1022
val conditionalOps = dataWithNulls.select(
1023
$"name",
1024
when($"age".isNull, "Unknown").otherwise($"age").as("age_with_default"),
1025
when($"salary" > 55000, "High")
1026
.when($"salary" > 45000, "Medium")
1027
.otherwise("Low").as("salary_category"),
1028
coalesce($"salary", lit(40000.0)).as("salary_with_default"),
1029
nvl($"age", lit(0)).as("age_nvl"),
1030
nvl2($"salary", "Has Salary", "No Salary").as("salary_status"),
1031
greatest($"age", lit(30)).as("age_or_30"),
1032
least($"age", lit(30)).as("min_age_30")
1033
)
1034
1035
// Null checking
1036
val nullChecks = dataWithNulls.select(
1037
$"name",
1038
isnull($"age").as("age_is_null"),
1039
$"age".isNull.as("age_is_null_method"),
1040
$"salary".isNotNull.as("salary_not_null")
1041
)
1042
```
1043
1044
### Complex Data Transformations
1045
```scala
1046
// JSON operations
1047
val jsonData = Seq(
1048
"""{"name": "Alice", "details": {"age": 25, "city": "Seattle"}}""",
1049
"""{"name": "Bob", "details": {"age": 30, "city": "Portland"}}"""
1050
).toDF("json_str")
1051
1052
val jsonSchema = new StructType()
1053
.add("name", StringType)
1054
.add("details", new StructType()
1055
.add("age", IntegerType)
1056
.add("city", StringType))
1057
1058
val parsedJson = jsonData.select(
1059
from_json($"json_str", jsonSchema).as("data")
1060
).select(
1061
$"data.name",
1062
$"data.details.age",
1063
$"data.details.city"
1064
)
1065
1066
// Complex transformations with multiple functions
1067
val complexTransforms = df.select(
1068
$"name",
1069
// String manipulation
1070
initcap($"name").as("proper_name"),
1071
lpad($"name", 10, "*").as("padded_name"),
1072
// Mathematical computations
1073
round($"salary" / 12, 0).as("monthly_salary"),
1074
// Conditional logic with math
1075
when($"age" < 30, $"salary" * 0.1)
1076
.when($"age" < 40, $"salary" * 0.15)
1077
.otherwise($"salary" * 0.2).as("bonus"),
1078
// Date calculations
1079
datediff(current_date(), to_date($"hire_date")).as("days_employed"),
1080
// Hash functions for data integrity
1081
md5(concat($"name", $"age", $"salary")).as("record_hash")
1082
)
1083
```
1084
1085
### Custom Function Combinations
1086
```scala
1087
// Combining multiple functions for complex logic
1088
val advancedTransforms = df.select(
1089
$"name",
1090
$"age",
1091
$"salary",
1092
// Performance rating based on multiple criteria
1093
when($"age" > 32 && $"salary" > 65000, "Senior High Performer")
1094
.when($"age" > 28 && $"salary" > 55000, "Mid-Level Performer")
1095
.when($"salary" > 50000, "Junior High Performer")
1096
.otherwise("Developing").as("performance_category"),
1097
1098
// Salary percentile within age group
1099
percent_rank().over(Window.partitionBy(
1100
when($"age" < 30, "Young")
1101
.when($"age" < 40, "Mid")
1102
.otherwise("Senior")
1103
).orderBy($"salary")).as("salary_percentile_in_age_group"),
1104
1105
// Complex string formatting
1106
format_string(
1107
"Employee %s (age %d) earns $%.2f annually",
1108
$"name", $"age", $"salary"
1109
).as("employee_summary"),
1110
1111
// Encoded identifier
1112
base64(
1113
concat($"name", lit("_"), $"age", lit("_"),
1114
date_format(current_date(), "yyyyMM"))
1115
).as("encoded_id")
1116
)
1117
```