# Utilities

Utility classes and objects for working with Spark SQL data types, conversions, and integrations.

## Capabilities

### Arrow Format Integration

Utilities for converting between Spark SQL types and Apache Arrow format for high-performance columnar data exchange.

```scala { .api }
/**
 * Utilities for converting between Spark SQL and Apache Arrow formats
 */
object ArrowUtils {
  /** Root allocator for Arrow memory management */
  val rootAllocator: org.apache.arrow.memory.RootAllocator

  /** Convert a Spark DataType to an Arrow ArrowType */
  def toArrowType(
      dt: DataType,
      timeZoneId: String,
      largeVarTypes: Boolean = false): org.apache.arrow.vector.types.pojo.ArrowType

  /** Convert an Arrow ArrowType to a Spark DataType */
  def fromArrowType(dt: org.apache.arrow.vector.types.pojo.ArrowType): DataType

  /** Convert a Spark field to an Arrow Field */
  def toArrowField(
      name: String,
      dt: DataType,
      nullable: Boolean,
      timeZoneId: String,
      largeVarTypes: Boolean = false): org.apache.arrow.vector.types.pojo.Field

  /** Convert an Arrow Field to a Spark DataType */
  def fromArrowField(field: org.apache.arrow.vector.types.pojo.Field): DataType

  /** Convert a Spark StructType to an Arrow Schema */
  def toArrowSchema(
      schema: StructType,
      timeZoneId: String,
      errorOnDuplicatedFieldNames: Boolean,
      largeVarTypes: Boolean = false): org.apache.arrow.vector.types.pojo.Schema

  /** Convert an Arrow Schema to a Spark StructType */
  def fromArrowSchema(schema: org.apache.arrow.vector.types.pojo.Schema): StructType
}
```

## Usage Examples

**Converting Spark types to Arrow:**

```scala
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.sql.types._

// Convert basic types
val arrowIntType = ArrowUtils.toArrowType(IntegerType, null)
val arrowStringType = ArrowUtils.toArrowType(StringType, null)

// Convert a timestamp (requires a timezone ID)
val arrowTimestampType = ArrowUtils.toArrowType(TimestampType, "UTC")

// Convert a complex schema
val sparkSchema = StructType(Array(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("scores", ArrayType(DoubleType, containsNull = false), nullable = true),
  StructField("metadata", MapType(StringType, StringType, valueContainsNull = true), nullable = true)
))

val arrowSchema = ArrowUtils.toArrowSchema(
  sparkSchema,
  timeZoneId = "UTC",
  errorOnDuplicatedFieldNames = true
)
```

**Converting Arrow types back to Spark:**

```scala
// Convert Arrow types back to Spark (uses arrowIntType from the previous example)
val convertedDataType = ArrowUtils.fromArrowType(arrowIntType)
println(s"Converted back: $convertedDataType") // IntegerType

// Convert an Arrow schema back to a Spark StructType
val convertedSchema = ArrowUtils.fromArrowSchema(arrowSchema)
println(s"Schema fields: ${convertedSchema.fieldNames.mkString(", ")}")
```
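
The field-level helpers work the same way. A minimal sketch of converting a single Spark field to an Arrow `Field` and recovering its type, assuming the imports from the first example:

```scala
// Convert a single Spark field to an Arrow Field
val arrowField = ArrowUtils.toArrowField(
  name = "scores",
  dt = ArrayType(DoubleType, containsNull = false),
  nullable = true,
  timeZoneId = "UTC"
)

// Recover the Spark DataType from the Arrow Field
val recoveredType = ArrowUtils.fromArrowField(arrowField)
```

Note that `fromArrowField` returns only the `DataType`; the field's name and nullability live on the Arrow `Field` itself.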

**Working with large variable types:**

```scala
// Use large variable types (64-bit offsets) for very large string/binary data
val largeStringType = ArrowUtils.toArrowType(
  StringType,
  timeZoneId = null,
  largeVarTypes = true
)

val largeBinaryType = ArrowUtils.toArrowType(
  BinaryType,
  timeZoneId = null,
  largeVarTypes = true
)
```
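
Since `fromArrowType` maps an Arrow type back to a single Spark type, the regular and large variants are expected to round-trip to the same Spark type. A short sketch under that assumption:

```scala
// Utf8 and LargeUtf8 should both map back to StringType
val regularString = ArrowUtils.toArrowType(StringType, null)
val largeString = ArrowUtils.toArrowType(StringType, null, largeVarTypes = true)

assert(ArrowUtils.fromArrowType(regularString) == StringType)
assert(ArrowUtils.fromArrowType(largeString) == StringType)
```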

**Type conversion mappings:**

The following conversions are supported:

**Spark to Arrow:**
- `BooleanType` → `ArrowType.Bool`
- `ByteType` → `ArrowType.Int(8, signed=true)`
- `ShortType` → `ArrowType.Int(16, signed=true)`
- `IntegerType` → `ArrowType.Int(32, signed=true)`
- `LongType` → `ArrowType.Int(64, signed=true)`
- `FloatType` → `ArrowType.FloatingPoint(SINGLE)`
- `DoubleType` → `ArrowType.FloatingPoint(DOUBLE)`
- `StringType` → `ArrowType.Utf8` (or `ArrowType.LargeUtf8` with `largeVarTypes`)
- `BinaryType` → `ArrowType.Binary` (or `ArrowType.LargeBinary` with `largeVarTypes`)
- `DecimalType(p,s)` → `ArrowType.Decimal(p,s)`
- `DateType` → `ArrowType.Date(DAY)`
- `TimestampType` → `ArrowType.Timestamp(MICROSECOND, timezone)`
- `TimestampNTZType` → `ArrowType.Timestamp(MICROSECOND, null)`
- `ArrayType` → `ArrowType.List`
- `MapType` → `ArrowType.Map`
- `StructType` → `ArrowType.Struct`
- `YearMonthIntervalType` → `ArrowType.Interval(YEAR_MONTH)`
- `DayTimeIntervalType` → `ArrowType.Duration(MICROSECOND)`
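
The mappings above can be spot-checked by round-tripping a type through Arrow and back; a minimal sketch, assuming the imports from the earlier examples:

```scala
// Round-trip a decimal and a date through Arrow
val arrowDecimal = ArrowUtils.toArrowType(DecimalType(10, 2), null)
assert(ArrowUtils.fromArrowType(arrowDecimal) == DecimalType(10, 2))

val arrowDate = ArrowUtils.toArrowType(DateType, null)
assert(ArrowUtils.fromArrowType(arrowDate) == DateType)
```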

**Error handling:**

```scala
import org.apache.spark.sql.types.CalendarIntervalType

try {
  // Throws an exception: CalendarIntervalType has no Arrow equivalent
  ArrowUtils.toArrowType(CalendarIntervalType, null)
} catch {
  case e: Exception =>
    println(s"Unsupported type: ${e.getMessage}")
}

try {
  // Throws an exception: TimestampType requires a timezone ID
  ArrowUtils.toArrowType(TimestampType, null)
} catch {
  case e: Exception =>
    println("TimestampType requires timeZoneId")
}
```