or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

catalog.mddata-sources.mddata-types.mddataframe-dataset.mdindex.mdsession-management.mdsql-functions.mdstreaming.mdudfs.md

sql-functions.mddocs/

0

# Apache Spark SQL - Functions

1

2

Apache Spark SQL provides a comprehensive library of **764 built-in functions** for data manipulation across all major functional categories. These functions enable complex data transformations, analytics, and processing operations within the distributed SQL query engine.

3

4

## Capabilities

5

6

### Core Column and Expression Functions (8 functions)

7

- Create column references and literal values for DataFrame operations

8

- Parse and evaluate SQL expressions within DataFrame transformations

9

- Handle broadcast hints for optimized distributed joins

10

- Support for strongly-typed literal creation with compile-time type safety

11

12

### Aggregate and Statistical Functions (123 functions)

13

- Standard aggregations: count, sum, average, min, max with null handling

14

- Statistical functions: correlation, covariance, standard deviation, variance, skewness, kurtosis

15

- Approximate functions: approx_count_distinct, approx_percentile with configurable precision

16

- Collection aggregations: collect_list, collect_set, first, last with ordering control

17

- Advanced aggregations: regr_* (regression), percentile_approx, histogram_numeric

18

19

### String and Text Processing Functions (105 functions)

20

- String manipulation: concatenation, substring, trimming, case conversion

21

- Regular expressions: regexp_replace, regexp_extract, regexp_like with full pattern support

22

- String formatting: format_string, format_number, printf with locale-aware options

23

- Encoding/decoding: base64, encode, decode, url_encode, url_decode

24

- Text analysis: soundex, levenshtein distance, string similarity functions

25

- Unicode operations: character encoding transformations, byte manipulation

26

27

### Date, Time, and Temporal Functions (113 functions)

28

- Current time functions: current_date, current_timestamp, now, localtimestamp

29

- Date extraction: year, month, day, hour, minute, second, quarter, week functions

30

- Date arithmetic: date_add, date_sub, datediff, months_between, add_months

31

- Formatting and parsing: date_format, unix_timestamp, from_unixtime, to_date, to_timestamp

32

- Timezone operations: from_utc_timestamp, to_utc_timestamp with timezone conversion

33

- Date truncation: date_trunc, trunc for period-based grouping

34

- Interval arithmetic: interval operations for complex temporal calculations

35

36

### Mathematical and Numeric Functions (131 functions)

37

- Basic arithmetic: abs, ceil, floor, round, rint with precision control

38

- Trigonometric: sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh

39

- Logarithmic and exponential: log, log10, log2, ln, exp, exp2, expm1, log1p

40

- Power and root: pow, power, sqrt, cbrt with arbitrary precision

41

- Statistical distributions: normal, binomial, poisson distribution functions

42

- Number theory: gcd, lcm, factorial, sign, signum

43

- Bitwise operations: bit_and, bit_or, bit_xor, bit_not, bit_count

44

- Number formatting: hex, unhex, conv for base conversion

45

46

### Array, Map, and Struct Functions (66 functions total)

47

- **Array functions (32)**: array_*, sort_array, reverse, array_contains, array_position

48

- **Map functions (11)**: map_*, map_keys, map_values, map_entries, map_from_entries

49

- **Struct functions (3)**: struct creation and field access operations

50

- **Collection functions (23)**: explode, explode_outer, posexplode, flatten

51

- Complex nested data manipulation: array_zip, array_union, array_intersect

52

- Element access: element_at, array_remove, array_distinct operations

53

54

### Conditional and Control Flow Functions (9 functions)

55

- Conditional logic: when, otherwise, case expressions with multiple branches

56

- Null handling: coalesce, nvl, nvl2, isnull, isnotnull with comprehensive null safety

57

- Value comparison: greatest, least across multiple columns

58

- Branching: if, decode for data-driven conditional processing

59

60

### Window and Analytical Functions (18 functions)

61

- Ranking: row_number, rank, dense_rank, percent_rank, ntile

62

- Value access: lag, lead, first_value, last_value with offset and default handling

63

- Cumulative operations: cume_dist for cumulative distribution calculations

64

- Percentile functions: percentile_cont, percentile_disc for statistical analysis

65

66

### Advanced Specialized Functions

67

- **Hash functions (7)**: md5, sha1, sha2, crc32, hash, xxhash64, murmur3_hash

68

- **JSON functions (20)**: json_extract, json_parse, json_array_length, to_json, from_json

69

- **XML functions (19)**: xpath, xpath_string, xpath_boolean, xpath_number functions

70

- **CSV functions (7)**: from_csv, to_csv, schema_of_csv with delimiter handling

71

- **URL functions (7)**: parse_url, url_decode, url_encode with component extraction

72

- **Bitwise functions (11)**: Complete bitwise operation suite

73

- **Generator functions (7)**: explode, posexplode, inline, stack for data expansion

74

- **VARIANT functions (10)**: variant_get, variant_explode for semi-structured data

75

- **Partition transform functions (12)**: years, months, days, hours, bucket transformations

76

- **UDF functions (27)**: User-defined function registration and management utilities

77

78

## API Reference

79

80

### Core Functions

81

```scala { .api }

82

object functions {

83

// Column creation and literals

84

def col(colName: String): Column

85

def column(colName: String): Column

86

def lit(literal: Any): Column

87

def typedLit[T : TypeTag](literal: T): Column

88

def expr(expr: String): Column

89

90

// Special values

91

def broadcast[T](df: Dataset[T]): Dataset[T]

92

def monotonically_increasing_id(): Column

93

def spark_partition_id(): Column

94

def input_file_name(): Column

95

}

96

```

97

98

### Aggregate Functions

99

```scala { .api }

100

// Basic aggregations

101

def count(e: Column): Column

102

def count(columnName: String): Column

103

def countDistinct(expr: Column, exprs: Column*): Column

104

def countDistinct(columnName: String, columnNames: String*): Column

105

def approx_count_distinct(e: Column): Column

106

def approx_count_distinct(e: Column, rsd: Double): Column

107

def approx_count_distinct(columnName: String): Column

108

def approx_count_distinct(columnName: String, rsd: Double): Column

109

110

def sum(e: Column): Column

111

def sum(columnName: String): Column

112

def sumDistinct(e: Column): Column

113

def sumDistinct(columnName: String): Column

114

def avg(e: Column): Column

115

def avg(columnName: String): Column

116

def mean(e: Column): Column

117

def mean(columnName: String): Column

118

def min(e: Column): Column

119

def min(columnName: String): Column

120

def max(e: Column): Column

121

def max(columnName: String): Column

122

123

// Statistical aggregations

124

def stddev(e: Column): Column

125

def stddev(columnName: String): Column

126

def stddev_pop(e: Column): Column

127

def stddev_pop(columnName: String): Column

128

def stddev_samp(e: Column): Column

129

def stddev_samp(columnName: String): Column

130

def variance(e: Column): Column

131

def variance(columnName: String): Column

132

def var_pop(e: Column): Column

133

def var_pop(columnName: String): Column

134

def var_samp(e: Column): Column

135

def var_samp(columnName: String): Column

136

137

def corr(column1: Column, column2: Column): Column

138

def corr(columnName1: String, columnName2: String): Column

139

def covar_pop(column1: Column, column2: Column): Column

140

def covar_pop(columnName1: String, columnName2: String): Column

141

def covar_samp(column1: Column, column2: Column): Column

142

def covar_samp(columnName1: String, columnName2: String): Column

143

144

def skewness(e: Column): Column

145

def skewness(columnName: String): Column

146

def kurtosis(e: Column): Column

147

def kurtosis(columnName: String): Column

148

149

// Collection aggregations

150

def collect_list(e: Column): Column

151

def collect_list(columnName: String): Column

152

def collect_set(e: Column): Column

153

def collect_set(columnName: String): Column

154

def first(e: Column): Column

155

def first(e: Column, ignoreNulls: Boolean): Column

156

def first(columnName: String): Column

157

def first(columnName: String, ignoreNulls: Boolean): Column

158

def last(e: Column): Column

159

def last(e: Column, ignoreNulls: Boolean): Column

160

def last(columnName: String): Column

161

def last(columnName: String, ignoreNulls: Boolean): Column

162

163

// Approximate aggregations

164

def approx_percentile(e: Column, percentage: Column): Column

165

def approx_percentile(e: Column, percentage: Column, accuracy: Column): Column

166

def approx_percentile(columnName: String, percentage: Double): Column

167

def approx_percentile(columnName: String, percentage: Double, accuracy: Double): Column

168

169

// Regression functions

170

def regr_avgx(y: Column, x: Column): Column

171

def regr_avgy(y: Column, x: Column): Column

172

def regr_count(y: Column, x: Column): Column

173

def regr_intercept(y: Column, x: Column): Column

174

def regr_r2(y: Column, x: Column): Column

175

def regr_slope(y: Column, x: Column): Column

176

def regr_sxx(y: Column, x: Column): Column

177

def regr_sxy(y: Column, x: Column): Column

178

def regr_syy(y: Column, x: Column): Column

179

180

// Percentile functions

181

def percentile_approx(col: Column, percentage: Column): Column

182

def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column

183

def median(col: Column): Column

184

```

185

186

### Mathematical Functions

187

```scala { .api }

188

// Basic arithmetic and rounding

189

def abs(e: Column): Column

190

def ceil(e: Column): Column

191

def ceiling(e: Column): Column

192

def floor(e: Column): Column

193

def round(e: Column): Column

194

def round(e: Column, scale: Int): Column

195

def rint(e: Column): Column

196

def sign(e: Column): Column

197

def signum(e: Column): Column

198

199

// Trigonometric functions

200

def sin(e: Column): Column

201

def cos(e: Column): Column

202

def tan(e: Column): Column

203

def asin(e: Column): Column

204

def acos(e: Column): Column

205

def atan(e: Column): Column

206

def atan2(y: Column, x: Column): Column

207

def sinh(e: Column): Column

208

def cosh(e: Column): Column

209

def tanh(e: Column): Column

210

def asinh(e: Column): Column

211

def acosh(e: Column): Column

212

def atanh(e: Column): Column

213

214

// Logarithmic and exponential

215

def exp(e: Column): Column

216

def exp2(e: Column): Column

217

def expm1(e: Column): Column

218

def log(e: Column): Column

219

def log(base: Double, e: Column): Column

220

def log10(e: Column): Column

221

def log2(e: Column): Column

222

def log1p(e: Column): Column

223

def ln(e: Column): Column

224

225

// Power and root functions

226

def pow(l: Column, r: Column): Column

227

def power(l: Column, r: Column): Column

228

def sqrt(e: Column): Column

229

def cbrt(e: Column): Column

230

231

// Random number generation

232

def rand(): Column

233

def rand(seed: Long): Column

234

def randn(): Column

235

def randn(seed: Long): Column

236

237

// Number conversion and formatting

238

def hex(column: Column): Column

239

def unhex(column: Column): Column

240

def conv(num: Column, fromBase: Int, toBase: Int): Column

241

def bin(e: Column): Column

242

243

// Advanced mathematical functions

244

def factorial(e: Column): Column

245

def degrees(e: Column): Column

246

def radians(e: Column): Column

247

def gcd(left: Column, right: Column): Column

248

def lcm(left: Column, right: Column): Column

249

def width_bucket(v: Column, min: Column, max: Column, numBucket: Column): Column

250

```

251

252

### Date and Time Functions

253

```scala { .api }

254

// Current date/time functions

255

def current_date(): Column

256

def current_timestamp(): Column

257

def now(): Column

258

def localtimestamp(): Column

259

def current_timezone(): Column

260

261

// Date/time extraction functions

262

def year(e: Column): Column

263

def quarter(e: Column): Column

264

def month(e: Column): Column

265

def weekofyear(e: Column): Column

266

def weekday(e: Column): Column

267

def dayofyear(e: Column): Column

268

def dayofmonth(e: Column): Column

269

def dayofweek(e: Column): Column

270

def hour(e: Column): Column

271

def minute(e: Column): Column

272

def second(e: Column): Column

273

def extract(field: Column, source: Column): Column

274

275

// Date arithmetic and manipulation

276

def date_add(start: Column, days: Int): Column

277

def date_add(start: Column, days: Column): Column

278

def date_sub(start: Column, days: Int): Column

279

def date_sub(start: Column, days: Column): Column

280

def datediff(end: Column, start: Column): Column

281

def months_between(end: Column, start: Column): Column

282

def months_between(end: Column, start: Column, roundOff: Boolean): Column

283

def add_months(startDate: Column, numMonths: Int): Column

284

def add_months(startDate: Column, numMonths: Column): Column

285

286

// Date formatting and parsing

287

def date_format(dateExpr: Column, format: String): Column

288

def from_unixtime(ut: Column): Column

289

def from_unixtime(ut: Column, f: String): Column

290

def unix_timestamp(): Column

291

def unix_timestamp(s: Column): Column

292

def unix_timestamp(s: Column, p: String): Column

293

def to_date(e: Column): Column

294

def to_date(e: Column, fmt: String): Column

295

def to_timestamp(s: Column): Column

296

def to_timestamp(s: Column, fmt: String): Column

297

def make_date(year: Column, month: Column, day: Column): Column

298

def make_timestamp(

299

year: Column, month: Column, day: Column,

300

hour: Column, min: Column, sec: Column

301

): Column

302

def make_timestamp(

303

year: Column, month: Column, day: Column,

304

hour: Column, min: Column, sec: Column, timezone: Column

305

): Column

306

307

// Date truncation and boundaries

308

def date_trunc(format: String, timestamp: Column): Column

309

def trunc(date: Column, format: String): Column

310

def last_day(e: Column): Column

311

def next_day(date: Column, dayOfWeek: String): Column

312

313

// Timezone operations

314

def from_utc_timestamp(ts: Column, tz: String): Column

315

def from_utc_timestamp(ts: Column, tz: Column): Column

316

def to_utc_timestamp(ts: Column, tz: String): Column

317

def to_utc_timestamp(ts: Column, tz: Column): Column

318

319

// Date validation and testing

320

def isnan(e: Column): Column

321

def isnull(e: Column): Column

322

```

323

324

### Array Functions

325

```scala { .api }

326

// Array creation and manipulation

327

def array(cols: Column*): Column

328

def array_contains(column: Column, value: Any): Column

329

def array_distinct(e: Column): Column

330

def array_except(col1: Column, col2: Column): Column

331

def array_intersect(col1: Column, col2: Column): Column

332

def array_join(column: Column, delimiter: String): Column

333

def array_join(column: Column, delimiter: String, nullReplacement: String): Column

334

def array_max(e: Column): Column

335

def array_min(e: Column): Column

336

def array_position(column: Column, value: Any): Column

337

def array_remove(column: Column, element: Any): Column

338

def array_repeat(left: Column, right: Column): Column

339

def array_size(e: Column): Column

340

def array_sort(e: Column): Column

341

def array_union(col1: Column, col2: Column): Column

342

def arrays_overlap(a1: Column, a2: Column): Column

343

def arrays_zip(cols: Column*): Column

344

345

// Array aggregation and transformation

346

def flatten(e: Column): Column

347

def reverse(e: Column): Column

348

def shuffle(e: Column): Column

349

def slice(x: Column, start: Int, length: Int): Column

350

def sort_array(e: Column): Column

351

def sort_array(e: Column, asc: Boolean): Column

352

353

// Array element access

354

def element_at(column: Column, extraction: Any): Column

355

def get_json_object(e: Column, path: String): Column

356

357

// Array generation

358

def sequence(start: Column, stop: Column): Column

359

def sequence(start: Column, stop: Column, step: Column): Column

360

```

361

362

### Map Functions

363

```scala { .api }

364

// Map creation and manipulation

365

def map(cols: Column*): Column

366

def map_contains_key(column: Column, key: Any): Column

367

def map_entries(e: Column): Column

368

def map_from_arrays(keys: Column, values: Column): Column

369

def map_from_entries(e: Column): Column

370

def map_keys(e: Column): Column

371

def map_values(e: Column): Column

372

def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column

373

374

// Map transformations

375

def transform_keys(expr: Column, f: (Column, Column) => Column): Column

376

def transform_values(expr: Column, f: (Column, Column) => Column): Column

377

```

378

379

### JSON Functions

380

```scala { .api }

381

// JSON parsing and manipulation

382

def from_json(e: Column, schema: DataType): Column

383

def from_json(e: Column, schema: DataType, options: Map[String, String]): Column

384

def from_json(e: Column, schema: String): Column

385

def from_json(e: Column, schema: String, options: Map[String, String]): Column

386

def to_json(e: Column): Column

387

def to_json(e: Column, options: Map[String, String]): Column

388

389

def json_array_length(jsonArray: Column): Column

390

def json_object_keys(json: Column): Column

391

def json_tuple(json: Column, fields: String*): Column

392

def get_json_object(e: Column, path: String): Column

393

def json_extract(jsonStr: Column, path: Column): Column

394

395

// Schema inference

396

def schema_of_json(json: Column): Column

397

def schema_of_json(json: String): Column

398

```

399

400

### Window Functions

401

```scala { .api }

402

// Ranking functions

403

def row_number(): Column

404

def rank(): Column

405

def dense_rank(): Column

406

def percent_rank(): Column

407

def ntile(n: Int): Column

408

def cume_dist(): Column

409

410

// Value access functions

411

def lag(e: Column, offset: Int): Column

412

def lag(e: Column, offset: Int, defaultValue: Any): Column

413

def lead(e: Column, offset: Int): Column

414

def lead(e: Column, offset: Int, defaultValue: Any): Column

415

def first_value(e: Column): Column

416

def first_value(e: Column, ignoreNulls: Boolean): Column

417

def last_value(e: Column): Column

418

def last_value(e: Column, ignoreNulls: Boolean): Column

419

def nth_value(e: Column, offset: Int): Column

420

def nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column

421

```

422

423

### Hash Functions

424

```scala { .api }

425

// Cryptographic hash functions

426

def md5(e: Column): Column

427

def sha1(e: Column): Column

428

def sha2(e: Column, numBits: Int): Column

429

def hash(cols: Column*): Column

430

def xxhash64(cols: Column*): Column

431

def crc32(e: Column): Column

432

def murmur3_hash(cols: Column*): Column

433

```

434

435

### Conditional Functions

436

```scala { .api }

437

// Conditional logic

438

def when(condition: Column, value: Any): Column

439

def coalesce(e: Column*): Column

440

def isnull(e: Column): Column

441

def isnan(e: Column): Column

442

def isnotnull(e: Column): Column

443

def nvl(col1: Column, col2: Column): Column

444

def nvl2(col1: Column, col2: Column, col3: Column): Column

445

def nullif(col1: Column, col2: Column): Column

446

def if(condition: Column, trueValue: Column, falseValue: Column): Column

447

def greatest(exprs: Column*): Column

448

def least(exprs: Column*): Column

449

```

450

451

### String Functions

452

```scala { .api }

453

// String operations

454

def length(e: Column): Column

455

def length(columnName: String): Column

456

def char_length(e: Column): Column

457

def character_length(e: Column): Column

458

def bit_length(e: Column): Column

459

def octet_length(e: Column): Column

460

461

def upper(e: Column): Column

462

def upper(columnName: String): Column

463

def lower(e: Column): Column

464

def lower(columnName: String): Column

465

def initcap(e: Column): Column

466

def initcap(columnName: String): Column

467

468

def trim(e: Column): Column

469

def trim(columnName: String): Column

470

def trim(e: Column, trimString: String): Column

471

def ltrim(e: Column): Column

472

def ltrim(columnName: String): Column

473

def ltrim(e: Column, trimString: String): Column

474

def rtrim(e: Column): Column

475

def rtrim(columnName: String): Column

476

def rtrim(e: Column, trimString: String): Column

477

478

// String concatenation

479

def concat(exprs: Column*): Column

480

def concat_ws(sep: String, exprs: Column*): Column

481

482

// Substring operations

483

def substring(str: Column, pos: Int, len: Int): Column

484

def substring(columnName: String, pos: Int, len: Int): Column

485

def substr(str: Column, pos: Column, len: Column): Column

486

def left(str: Column, len: Column): Column

487

def right(str: Column, len: Column): Column

488

489

// String searching and matching

490

def instr(str: Column, substring: String): Column

491

def locate(substr: String, str: Column): Column

492

def locate(substr: String, str: Column, pos: Int): Column

493

def find_in_set(str: Column, strArray: Column): Column

494

495

// Regular expressions

496

def regexp_replace(e: Column, pattern: String, replacement: String): Column

497

def regexp_replace(e: Column, pattern: Column, replacement: Column): Column

498

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

499

def regexp_extract_all(e: Column, exp: String): Column

500

def regexp_extract_all(e: Column, exp: String, groupIdx: Int): Column

501

def rlike(str: Column, regexp: String): Column

502

def regexp_like(str: Column, regexp: String): Column

503

504

// String splitting and parsing

505

def split(str: Column, pattern: String): Column

506

def split(str: Column, pattern: String, limit: Int): Column

507

508

// String formatting

509

def format_string(format: String, arguments: Column*): Column

510

def format_number(x: Column, d: Int): Column

511

def printf(format: Column, arguments: Column*): Column

512

513

// String padding and alignment

514

def lpad(str: Column, len: Int, pad: String): Column

515

def rpad(str: Column, len: Int, pad: String): Column

516

517

// String translation and encoding

518

def translate(src: Column, matchingString: String, replaceString: String): Column

519

def encode(value: Column, charset: String): Column

520

def decode(value: Column, charset: String): Column

521

def base64(e: Column): Column

522

def unbase64(e: Column): Column

523

524

// String comparison and testing

525

def soundex(e: Column): Column

526

def levenshtein(l: Column, r: Column): Column

527

def levenshtein(l: Column, r: Column, threshold: Int): Column

528

```

529

530

### Date and Time Functions

531

```scala { .api }

532

// Current date/time functions

533

def current_date(): Column

534

def current_timestamp(): Column

535

def now(): Column

536

def localtimestamp(): Column

537

def current_timezone(): Column

538

539

// Date/time extraction

540

def year(e: Column): Column

541

def year(columnName: String): Column

542

def quarter(e: Column): Column

543

def quarter(columnName: String): Column

544

def month(e: Column): Column

545

def month(columnName: String): Column

546

def dayofyear(e: Column): Column

547

def dayofyear(columnName: String): Column

548

def dayofmonth(e: Column): Column

549

def dayofmonth(columnName: String): Column

550

def dayofweek(e: Column): Column

551

def dayofweek(columnName: String): Column

552

def weekofyear(e: Column): Column

553

def weekofyear(columnName: String): Column

554

555

def hour(e: Column): Column

556

def hour(columnName: String): Column

557

def minute(e: Column): Column

558

def minute(columnName: String): Column

559

def second(e: Column): Column

560

def second(columnName: String): Column

561

562

// Date arithmetic

563

def date_add(start: Column, days: Int): Column

564

def date_add(start: Column, days: Column): Column

565

def date_add(columnName: String, days: Int): Column

566

def date_sub(start: Column, days: Int): Column

567

def date_sub(start: Column, days: Column): Column

568

def date_sub(columnName: String, days: Int): Column

569

def datediff(end: Column, start: Column): Column

570

def datediff(end: String, start: String): Column

571

def months_between(end: Column, start: Column): Column

572

def months_between(end: Column, start: Column, roundOff: Boolean): Column

573

def add_months(startDate: Column, numMonths: Int): Column

574

def add_months(startDate: Column, numMonths: Column): Column

575

576

// Date formatting and parsing

577

def date_format(dateExpr: Column, format: String): Column

578

def date_format(columnName: String, format: String): Column

579

def from_unixtime(ut: Column): Column

580

def from_unixtime(ut: Column, f: String): Column

581

def from_unixtime(columnName: String): Column

582

def from_unixtime(columnName: String, format: String): Column

583

def unix_timestamp(): Column

584

def unix_timestamp(s: Column): Column

585

def unix_timestamp(s: Column, p: String): Column

586

def unix_timestamp(columnName: String): Column

587

def unix_timestamp(columnName: String, pattern: String): Column

588

589

def to_date(e: Column): Column

590

def to_date(e: Column, fmt: String): Column

591

def to_timestamp(s: Column): Column

592

def to_timestamp(s: Column, fmt: String): Column

593

594

// Date truncation and rounding

595

def date_trunc(format: String, timestamp: Column): Column

596

def trunc(date: Column, format: String): Column

597

598

// Timezone operations

599

def from_utc_timestamp(ts: Column, tz: String): Column

600

def from_utc_timestamp(ts: Column, tz: Column): Column

601

def to_utc_timestamp(ts: Column, tz: String): Column

602

def to_utc_timestamp(ts: Column, tz: Column): Column

603

604

// Date sequence generation

605

def sequence(start: Column, stop: Column, step: Column): Column

606

def date_sequence(start: Column, stop: Column, step: Column): Column

607

608

// Last day operations

609

def last_day(e: Column): Column

610

def last_day(columnName: String): Column

611

def next_day(date: Column, dayOfWeek: String): Column

612

```

613

614

### Mathematical Functions

615

```scala { .api }

616

// Basic arithmetic

617

def abs(e: Column): Column

618

def abs(columnName: String): Column

619

def negate(e: Column): Column

620

def sqrt(e: Column): Column

621

def sqrt(columnName: String): Column

622

def cbrt(e: Column): Column

623

def pow(l: Column, r: Column): Column

624

def pow(l: Column, r: Double): Column

625

def pow(columnName: String, p: Double): Column

626

def power(l: Column, r: Column): Column

627

628

// Rounding functions

629

def round(e: Column): Column

630

def round(e: Column, scale: Int): Column

631

def round(columnName: String): Column

632

def round(columnName: String, scale: Int): Column

633

def bround(e: Column): Column

634

def bround(e: Column, scale: Int): Column

635

def ceil(e: Column): Column

636

def ceil(columnName: String): Column

637

def ceiling(e: Column): Column

638

def floor(e: Column): Column

639

def floor(columnName: String): Column

640

def rint(e: Column): Column

641

def rint(columnName: String): Column

642

643

// Trigonometric functions

644

def sin(e: Column): Column

645

def sin(columnName: String): Column

646

def cos(e: Column): Column

647

def cos(columnName: String): Column

648

def tan(e: Column): Column

649

def tan(columnName: String): Column

650

def asin(e: Column): Column

651

def asin(columnName: String): Column

652

def acos(e: Column): Column

653

def acos(columnName: String): Column

654

def atan(e: Column): Column

655

def atan(columnName: String): Column

656

def atan2(l: Column, r: Column): Column

657

def atan2(l: Column, r: Double): Column

658

def atan2(columnName1: String, columnName2: String): Column

659

def sinh(e: Column): Column

660

def cosh(e: Column): Column

661

def tanh(e: Column): Column

662

663

// Logarithmic and exponential

664

def log(e: Column): Column

665

def log(columnName: String): Column

666

def log(base: Double, e: Column): Column

667

def log(base: Double, columnName: String): Column

668

def log10(e: Column): Column

669

def log10(columnName: String): Column

670

def log1p(e: Column): Column

671

def log1p(columnName: String): Column

672

def log2(e: Column): Column

673

def exp(e: Column): Column

674

def exp(columnName: String): Column

675

def expm1(e: Column): Column

676

def expm1(columnName: String): Column

677

678

// Angle conversion

679

def degrees(e: Column): Column

680

def degrees(columnName: String): Column

681

def radians(e: Column): Column

682

def radians(columnName: String): Column

683

def toDegrees(e: Column): Column

684

def toRadians(e: Column): Column

685

686

// Special mathematical functions

687

def factorial(e: Column): Column

688

def signum(e: Column): Column

689

def signum(columnName: String): Column

690

691

// Bitwise operations

692

def bitwiseNOT(e: Column): Column

693

def shiftLeft(e: Column, numBits: Int): Column

694

def shiftRight(e: Column, numBits: Int): Column

695

def shiftRightUnsigned(e: Column, numBits: Int): Column

696

697

// Random functions

698

def rand(): Column

699

def rand(seed: Long): Column

700

def randn(): Column

701

def randn(seed: Long): Column

702

```

703

704

### Array Functions

705

```scala { .api }

706

// Array creation

707

def array(cols: Column*): Column

708

def array_repeat(e: Column, count: Int): Column

709

def array_repeat(left: Column, right: Column): Column

710

711

// Array access and properties

712

def size(e: Column): Column

713

def array_min(e: Column): Column

714

def array_max(e: Column): Column

715

def array_position(column: Column, value: Any): Column

716

def element_at(column: Column, extraction: Any): Column

717

718

// Array testing

719

def array_contains(column: Column, value: Any): Column

720

def arrays_overlap(a1: Column, a2: Column): Column

721

722

// Array transformations

723

def array_distinct(e: Column): Column

724

def array_remove(column: Column, element: Column): Column

725

def array_sort(e: Column): Column

726

def array_union(col1: Column, col2: Column): Column

727

def array_intersect(col1: Column, col2: Column): Column

728

def array_except(col1: Column, col2: Column): Column

729

def array_join(column: Column, delimiter: String): Column

730

def array_join(column: Column, delimiter: String, nullReplacement: String): Column

731

def reverse(e: Column): Column

732

def shuffle(e: Column): Column

733

def slice(x: Column, start: Int, length: Int): Column

734

def slice(x: Column, start: Column, length: Column): Column

735

def sort_array(e: Column): Column

736

def sort_array(e: Column, asc: Boolean): Column

737

738

// Array aggregation

739

def array_agg(e: Column): Column

740

def flatten(e: Column): Column

741

def sequence(start: Column, stop: Column): Column

742

def sequence(start: Column, stop: Column, step: Column): Column

743

744

// Array explosion

745

def explode(e: Column): Column

746

def explode_outer(e: Column): Column

747

def posexplode(e: Column): Column

748

def posexplode_outer(e: Column): Column

749

750

// Array generation

751

def array_zip(e: Column*): Column

752

def arrays_zip(e: Column*): Column

753

```

754

755

### Map Functions

756

```scala { .api }

757

// Map creation

758

def map(cols: Column*): Column

759

def map_from_arrays(keys: Column, values: Column): Column

760

def map_from_entries(e: Column): Column

761

762

// Map access

763

def map_keys(e: Column): Column

764

def map_values(e: Column): Column

765

def map_entries(e: Column): Column

766

def map_concat(cols: Column*): Column

767

768

// Map filtering and transformation

769

def map_filter(expr: Column, f: (Column, Column) => Column): Column

770

def transform_keys(expr: Column, f: (Column, Column) => Column): Column

771

def transform_values(expr: Column, f: (Column, Column) => Column): Column

772

773

// Map explosion

774

def explode(e: Column): Column

775

def explode_outer(e: Column): Column

776

def posexplode(e: Column): Column

777

def posexplode_outer(e: Column): Column

778

```

779

780

### Struct Functions

781

```scala { .api }

782

// Struct creation

783

def struct(cols: Column*): Column

784

def named_struct(cols: Column*): Column

785

786

// JSON operations

787

def from_json(e: Column, schema: StructType): Column

788

def from_json(e: Column, schema: String): Column

789

def from_json(e: Column, schema: Column): Column

790

def from_json(e: Column, schema: DataType): Column

791

def from_json(e: Column, schema: StructType, options: Map[String, String]): Column

792

def to_json(e: Column): Column

793

def to_json(e: Column, options: Map[String, String]): Column

794

def json_object_keys(json: Column): Column

795

def json_array_length(jsonArray: Column): Column

796

def json_tuple(json: Column, fields: String*): Column

797

def get_json_object(e: Column, path: String): Column

798

799

// Schema operations

800

def schema_of_json(json: Column): Column

801

def schema_of_json(json: String): Column

802

def schema_of_csv(csv: Column): Column

803

def schema_of_csv(csv: String): Column

804

```

805

806

### Conditional Functions

807

```scala { .api }

808

// Conditional expressions

809

def when(condition: Column, value: Any): Column

810

def coalesce(e: Column*): Column

811

def greatest(exprs: Column*): Column

812

def greatest(columnName: String, columnNames: String*): Column

813

def least(exprs: Column*): Column

814

def least(columnName: String, columnNames: String*): Column

815

816

// Null handling

817

def isnull(e: Column): Column

818

def isnan(e: Column): Column

819

def nanvl(col1: Column, col2: Column): Column

820

def nvl(col1: Column, col2: Column): Column

821

def nvl2(col1: Column, col2: Column, col3: Column): Column

822

def ifnull(col1: Column, col2: Column): Column

823

def nullif(col1: Column, col2: Column): Column

824

```

825

826

### Window Functions

827

```scala { .api }

828

// Ranking functions

829

def row_number(): Column

830

def rank(): Column

831

def dense_rank(): Column

832

def percent_rank(): Column

833

def ntile(n: Int): Column

834

def cume_dist(): Column

835

836

// Offset functions

837

def lag(e: Column, offset: Int): Column

838

def lag(e: Column, offset: Int, defaultValue: Any): Column

839

def lag(columnName: String, offset: Int): Column

840

def lag(columnName: String, offset: Int, defaultValue: Any): Column

841

def lead(e: Column, offset: Int): Column

842

def lead(e: Column, offset: Int, defaultValue: Any): Column

843

def lead(columnName: String, offset: Int): Column

844

def lead(columnName: String, offset: Int, defaultValue: Any): Column

845

846

// Value functions

847

def first_value(e: Column): Column

848

def first_value(e: Column, ignoreNulls: Boolean): Column

849

def last_value(e: Column): Column

850

def last_value(e: Column, ignoreNulls: Boolean): Column

851

def nth_value(e: Column, n: Int): Column

852

def nth_value(e: Column, n: Int, ignoreNulls: Boolean): Column

853

```

854

855

### Type Conversion Functions

856

```scala { .api }

857

// Type casting

858

def cast(e: Column, dataType: DataType): Column

859

def cast(columnName: String, dataType: String): Column

860

861

// Binary operations

862

def bin(e: Column): Column

863

def hex(column: Column): Column

864

def unhex(column: Column): Column

865

def md5(e: Column): Column

866

def sha1(e: Column): Column

867

def sha2(e: Column, numBits: Int): Column

868

def hash(cols: Column*): Column

869

def xxhash64(cols: Column*): Column

870

def crc32(e: Column): Column

871

```

872

873

## Usage Examples

874

875

### Basic Function Usage

876

```scala

877

import org.apache.spark.sql.functions._

878

import org.apache.spark.sql.SparkSession

879

880

val spark = SparkSession.builder().appName("Functions Demo").getOrCreate()

881

import spark.implicits._

882

883

// Sample data

884

val df = Seq(

885

("Alice", 25, 50000.0, "2023-01-15"),

886

("Bob", 30, 60000.0, "2023-02-20"),

887

("Charlie", 35, 70000.0, "2023-03-10")

888

).toDF("name", "age", "salary", "hire_date")

889

890

// String functions

891

val stringOps = df.select(

892

upper($"name").as("upper_name"),

893

length($"name").as("name_length"),

894

concat($"name", lit(" - "), $"age").as("name_age"),

895

regexp_replace($"name", "a", "@").as("replaced_name")

896

)

897

898

// Mathematical functions

899

val mathOps = df.select(

900

$"name",

901

abs($"age" - 30).as("age_diff_from_30"),

902

round($"salary" / 1000, 2).as("salary_in_k"),

903

pow($"age", 2).as("age_squared"),

904

sqrt($"salary").as("sqrt_salary")

905

)

906

907

// Date functions

908

val dateOps = df.select(

909

$"name",

910

to_date($"hire_date").as("hire_date_parsed"),

911

year(to_date($"hire_date")).as("hire_year"),

912

months_between(current_date(), to_date($"hire_date")).as("months_employed"),

913

date_add(to_date($"hire_date"), 365).as("one_year_later")

914

)

915

```

916

917

### Aggregate Functions

918

```scala

919

// Basic aggregations

920

val basicStats = df.agg(

921

count($"name").as("total_employees"),

922

avg($"salary").as("avg_salary"),

923

min($"age").as("min_age"),

924

max($"salary").as("max_salary"),

925

stddev($"salary").as("salary_stddev")

926

)

927

928

// Grouped aggregations

929

val groupedStats = df.groupBy($"age" > 30)

930

.agg(

931

count($"name").as("count"),

932

sum($"salary").as("total_salary"),

933

avg($"salary").as("avg_salary"),

934

collect_list($"name").as("names"),

935

first($"name").as("first_name"),

936

last($"name").as("last_name")

937

)

938

939

// Statistical functions

940

val correlationStats = df.select(

941

corr($"age", $"salary").as("age_salary_correlation"),

942

covar_samp($"age", $"salary").as("age_salary_covariance")

943

)

944

```

945

946

### Array and Map Functions

947

```scala

948

// Sample data with arrays

949

val arrayData = Seq(

950

("Alice", Array("reading", "swimming", "cooking")),

951

("Bob", Array("hiking", "reading", "gaming")),

952

("Charlie", Array("cooking", "gaming"))

953

).toDF("name", "hobbies")

954

955

// Array operations

956

val arrayOps = arrayData.select(

957

$"name",

958

$"hobbies",

959

size($"hobbies").as("hobby_count"),

960

array_contains($"hobbies", "reading").as("likes_reading"),

961

array_sort($"hobbies").as("sorted_hobbies"),

962

array_distinct($"hobbies").as("unique_hobbies")

963

)

964

965

// Explode arrays

966

val exploded = arrayData.select(

967

$"name",

968

explode($"hobbies").as("hobby")

969

)

970

971

// Map operations

972

val mapData = Seq(

973

("Alice", Map("skill1" -> "Java", "skill2" -> "Scala")),

974

("Bob", Map("skill1" -> "Python", "skill2" -> "SQL"))

975

).toDF("name", "skills")

976

977

val mapOps = mapData.select(

978

$"name",

979

map_keys($"skills").as("skill_names"),

980

map_values($"skills").as("skill_values"),

981

$"skills".getItem("skill1").as("primary_skill")

982

)

983

```

984

985

### Window Functions

986

```scala

987

import org.apache.spark.sql.expressions.Window

988

989

// Window specifications

990

val windowSpec = Window.partitionBy($"age" > 30).orderBy($"salary".desc)

991

val unboundedWindow = Window.partitionBy($"age" > 30)

992

.orderBy($"salary".desc)

993

.rowsBetween(Window.unboundedPreceding, Window.currentRow)

994

995

// Window function usage

996

val windowedData = df.select(

997

$"name",

998

$"age",

999

$"salary",

1000

row_number().over(windowSpec).as("salary_rank"),

1001

rank().over(windowSpec).as("salary_rank_with_ties"),

1002

dense_rank().over(windowSpec).as("salary_dense_rank"),

1003

percent_rank().over(windowSpec).as("salary_percentile"),

1004

ntile(3).over(windowSpec).as("salary_tertile"),

1005

lag($"salary", 1).over(windowSpec).as("previous_salary"),

1006

lead($"salary", 1).over(windowSpec).as("next_salary"),

1007

sum($"salary").over(unboundedWindow).as("running_total_salary"),

1008

avg($"salary").over(unboundedWindow).as("running_avg_salary")

1009

)

1010

```

1011

1012

### Conditional and Null Handling Functions

1013

```scala

1014

// Sample data with nulls

1015

val dataWithNulls = Seq(

1016

("Alice", Some(25), Some(50000.0)),

1017

("Bob", None, Some(60000.0)),

1018

("Charlie", Some(35), None)

1019

).toDF("name", "age", "salary")

1020

1021

// Conditional operations

1022

val conditionalOps = dataWithNulls.select(

1023

$"name",

1024

when($"age".isNull, "Unknown").otherwise($"age").as("age_with_default"),

1025

when($"salary" > 55000, "High")

1026

.when($"salary" > 45000, "Medium")

1027

.otherwise("Low").as("salary_category"),

1028

coalesce($"salary", lit(40000.0)).as("salary_with_default"),

1029

nvl($"age", lit(0)).as("age_nvl"),

1030

nvl2($"salary", "Has Salary", "No Salary").as("salary_status"),

1031

greatest($"age", lit(30)).as("age_or_30"),

1032

least($"age", lit(30)).as("min_age_30")

1033

)

1034

1035

// Null checking

1036

val nullChecks = dataWithNulls.select(

1037

$"name",

1038

isnull($"age").as("age_is_null"),

1039

$"age".isNull.as("age_is_null_method"),

1040

$"salary".isNotNull.as("salary_not_null")

1041

)

1042

```

1043

1044

### Complex Data Transformations

1045

```scala

1046

// JSON operations

1047

val jsonData = Seq(

1048

"""{"name": "Alice", "details": {"age": 25, "city": "Seattle"}}""",

1049

"""{"name": "Bob", "details": {"age": 30, "city": "Portland"}}"""

1050

).toDF("json_str")

1051

1052

val jsonSchema = new StructType()

1053

.add("name", StringType)

1054

.add("details", new StructType()

1055

.add("age", IntegerType)

1056

.add("city", StringType))

1057

1058

val parsedJson = jsonData.select(

1059

from_json($"json_str", jsonSchema).as("data")

1060

).select(

1061

$"data.name",

1062

$"data.details.age",

1063

$"data.details.city"

1064

)

1065

1066

// Complex transformations with multiple functions

1067

val complexTransforms = df.select(

1068

$"name",

1069

// String manipulation

1070

initcap($"name").as("proper_name"),

1071

lpad($"name", 10, "*").as("padded_name"),

1072

// Mathematical computations

1073

round($"salary" / 12, 0).as("monthly_salary"),

1074

// Conditional logic with math

1075

when($"age" < 30, $"salary" * 0.1)

1076

.when($"age" < 40, $"salary" * 0.15)

1077

.otherwise($"salary" * 0.2).as("bonus"),

1078

// Date calculations

1079

datediff(current_date(), to_date($"hire_date")).as("days_employed"),

1080

// Hash functions for data integrity

1081

md5(concat($"name", $"age", $"salary")).as("record_hash")

1082

)

1083

```

1084

1085

### Custom Function Combinations

1086

```scala

1087

// Combining multiple functions for complex logic

1088

val advancedTransforms = df.select(

1089

$"name",

1090

$"age",

1091

$"salary",

1092

// Performance rating based on multiple criteria

1093

when($"age" > 32 && $"salary" > 65000, "Senior High Performer")

1094

.when($"age" > 28 && $"salary" > 55000, "Mid-Level Performer")

1095

.when($"salary" > 50000, "Junior High Performer")

1096

.otherwise("Developing").as("performance_category"),

1097

1098

// Salary percentile within age group

1099

percent_rank().over(Window.partitionBy(

1100

when($"age" < 30, "Young")

1101

.when($"age" < 40, "Mid")

1102

.otherwise("Senior")

1103

).orderBy($"salary")).as("salary_percentile_in_age_group"),

1104

1105

// Complex string formatting

1106

format_string(

1107

"Employee %s (age %d) earns $%.2f annually",

1108

$"name", $"age", $"salary"

1109

).as("employee_summary"),

1110

1111

// Encoded identifier

1112

base64(

1113

concat($"name", lit("_"), $"age", lit("_"),

1114

date_format(current_date(), "yyyyMM"))

1115

).as("encoded_id")

1116

)

1117

```