
# Utilities

Utility classes and objects for working with Spark SQL data types, conversions, and integrations.

## Capabilities

### Arrow Format Integration

Utilities for converting between Spark SQL types and Apache Arrow format for high-performance columnar data exchange.

```scala { .api }
/**
 * Utilities for converting between Spark SQL and Apache Arrow formats
 */
object ArrowUtils {
  /** Root allocator for Arrow memory management */
  val rootAllocator: org.apache.arrow.memory.RootAllocator

  /** Convert Spark DataType to Arrow ArrowType */
  def toArrowType(
    dt: DataType,
    timeZoneId: String,
    largeVarTypes: Boolean = false
  ): org.apache.arrow.vector.types.pojo.ArrowType

  /** Convert Arrow ArrowType to Spark DataType */
  def fromArrowType(dt: org.apache.arrow.vector.types.pojo.ArrowType): DataType

  /** Convert Spark field to Arrow Field */
  def toArrowField(
    name: String,
    dt: DataType,
    nullable: Boolean,
    timeZoneId: String,
    largeVarTypes: Boolean = false
  ): org.apache.arrow.vector.types.pojo.Field

  /** Convert Arrow Field to Spark DataType */
  def fromArrowField(field: org.apache.arrow.vector.types.pojo.Field): DataType

  /** Convert Spark StructType to Arrow Schema */
  def toArrowSchema(
    schema: StructType,
    timeZoneId: String,
    errorOnDuplicatedFieldNames: Boolean,
    largeVarTypes: Boolean = false
  ): org.apache.arrow.vector.types.pojo.Schema

  /** Convert Arrow Schema to Spark StructType */
  def fromArrowSchema(schema: org.apache.arrow.vector.types.pojo.Schema): StructType
}
```
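
The field-level helpers `toArrowField` and `fromArrowField` work like the type-level conversions but additionally carry a name and nullability. A minimal round-trip sketch, assuming the signatures declared above (the field name `"scores"` is just an illustration):

```scala
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.types._

// Build an Arrow Field for a nullable array-of-long column
val arrowField = ArrowUtils.toArrowField(
  "scores",
  ArrayType(LongType, containsNull = false),
  nullable = true,
  timeZoneId = null
)

// Round-trip back to a Spark DataType; the name and nullability live on
// the Field itself, so only the element data type comes back
val sparkType = ArrowUtils.fromArrowField(arrowField)
// Expected to equal ArrayType(LongType, containsNull = false)
```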

## Usage Examples

**Converting Spark types to Arrow:**

```scala
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.types._

// Convert basic types
val sparkIntType = IntegerType
val arrowIntType = ArrowUtils.toArrowType(sparkIntType, null)

val sparkStringType = StringType
val arrowStringType = ArrowUtils.toArrowType(sparkStringType, null)

// Convert timestamp (requires a time zone)
val sparkTimestampType = TimestampType
val arrowTimestampType = ArrowUtils.toArrowType(sparkTimestampType, "UTC")

// Convert a complex schema
val sparkSchema = StructType(Array(
  StructField("id", LongType, false),
  StructField("name", StringType, false),
  StructField("scores", ArrayType(DoubleType, false), true),
  StructField("metadata", MapType(StringType, StringType, true), true)
))

val arrowSchema = ArrowUtils.toArrowSchema(
  sparkSchema,
  timeZoneId = "UTC",
  errorOnDuplicatedFieldNames = true
)
```

**Converting Arrow types back to Spark:**

```scala
// Convert Arrow types back to Spark
val convertedDataType = ArrowUtils.fromArrowType(arrowIntType)
println(s"Converted back: $convertedDataType") // IntegerType

// Convert an Arrow schema back to Spark
val convertedSchema = ArrowUtils.fromArrowSchema(arrowSchema)
println(s"Schema fields: ${convertedSchema.fieldNames.mkString(", ")}")
```

**Working with large variable types:**

```scala
// Use large variable types for very large strings/binary data
val largeStringType = ArrowUtils.toArrowType(
  StringType,
  timeZoneId = null,
  largeVarTypes = true
)

val largeBinaryType = ArrowUtils.toArrowType(
  BinaryType,
  timeZoneId = null,
  largeVarTypes = true
)
```

**Type conversion mappings:**

The following conversions are supported:

**Spark to Arrow:**

- `BooleanType` → `ArrowType.Bool`
- `ByteType` → `ArrowType.Int(8, signed=true)`
- `ShortType` → `ArrowType.Int(16, signed=true)`
- `IntegerType` → `ArrowType.Int(32, signed=true)`
- `LongType` → `ArrowType.Int(64, signed=true)`
- `FloatType` → `ArrowType.FloatingPoint(SINGLE)`
- `DoubleType` → `ArrowType.FloatingPoint(DOUBLE)`
- `StringType` → `ArrowType.Utf8` or `ArrowType.LargeUtf8`
- `BinaryType` → `ArrowType.Binary` or `ArrowType.LargeBinary`
- `DecimalType(p,s)` → `ArrowType.Decimal(p,s)`
- `DateType` → `ArrowType.Date(DAY)`
- `TimestampType` → `ArrowType.Timestamp(MICROSECOND, timezone)`
- `TimestampNTZType` → `ArrowType.Timestamp(MICROSECOND, null)`
- `ArrayType` → `ArrowType.List`
- `MapType` → `ArrowType.Map`
- `StructType` → `ArrowType.Struct`
- `YearMonthIntervalType` → `ArrowType.Interval(YEAR_MONTH)`
- `DayTimeIntervalType` → `ArrowType.Duration(MICROSECOND)`
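
A mapping from the table above can be sanity-checked directly, since the Arrow Java `ArrowType` pojo classes implement value equality (a sketch, assuming the `toArrowType`/`fromArrowType` signatures above):

```scala
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.types._

// LongType should map to a 64-bit signed Arrow integer
val longArrow = ArrowUtils.toArrowType(LongType, null)
assert(longArrow == new ArrowType.Int(64, true))

// And the reverse direction recovers the Spark type
assert(ArrowUtils.fromArrowType(longArrow) == LongType)
```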

**Error handling:**

```scala
import org.apache.spark.sql.types.CalendarIntervalType

try {
  // This will throw an exception - CalendarIntervalType is not supported
  ArrowUtils.toArrowType(CalendarIntervalType, null)
} catch {
  case e: Exception =>
    println(s"Unsupported type: ${e.getMessage}")
}

try {
  // This will throw - TimestampType requires a time zone
  ArrowUtils.toArrowType(TimestampType, null)
} catch {
  case e: IllegalStateException =>
    println("TimestampType requires timeZoneId")
}
```
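
The `errorOnDuplicatedFieldNames` flag on `toArrowSchema` is another error path worth noting. A sketch of how it is expected to behave when enabled — a schema with repeated field names should be rejected rather than silently deduplicated (the exact exception type may vary by Spark version, so we catch `Exception`):

```scala
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.types._

// Two fields with the same name
val dupSchema = StructType(Array(
  StructField("a", IntegerType, false),
  StructField("a", StringType, false)
))

try {
  ArrowUtils.toArrowSchema(
    dupSchema,
    timeZoneId = "UTC",
    errorOnDuplicatedFieldNames = true
  )
} catch {
  case e: Exception =>
    println(s"Duplicated field names: ${e.getMessage}")
}
```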