# Apache Spark SQL API

The Apache Spark SQL API provides the core SQL data types, row representations, and foundational interfaces for Spark SQL operations. It serves as the foundation for DataFrame and Dataset operations, SQL query execution, and Structured Streaming in Apache Spark's distributed computing framework.

## Package Information

- **Package Name**: org.apache.spark:spark-sql-api_2.12
- **Package Type**: maven
- **Language**: Scala
- **Installation**: `maven: org.apache.spark:spark-sql-api_2.12:3.5.6`

## Core Imports

```scala
// Core row and type imports
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Streaming state management
import org.apache.spark.sql.streaming.GroupState

// Encoding support
import org.apache.spark.sql.Encoder

// Error handling
import org.apache.spark.sql.AnalysisException
```

## Basic Usage

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create a schema for structured data
val schema = StructType(Array(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("salary", DecimalType(10, 2), nullable = true)
))

// Create rows of data
val row1 = Row("Alice", 25, BigDecimal("55000.00"))
val row2 = Row.fromSeq(Seq("Bob", 30, BigDecimal("65000.00")))

// Access row data positionally (name-based getAs[T]("field") works only on
// rows that carry a schema, such as rows returned by Spark queries; rows
// built with Row(...) have no schema attached)
val name: String = row1.getString(0)
val age: Int = row1.getInt(1)
val hasNullSalary: Boolean = row1.isNullAt(2)

// Work with complex data types
val arrayType = ArrayType(StringType, containsNull = true)
val mapType = MapType(StringType, IntegerType, valueContainsNull = false)
val nestedSchema = StructType(Array(
  StructField("addresses", arrayType, nullable = true),
  StructField("scores", mapType, nullable = false)
))
```

## Architecture

The Spark SQL API is built around several key components:

- **Type System**: Comprehensive data type hierarchy supporting primitives, collections, and user-defined types
- **Row Interface**: Structured data representation with type-safe access methods
- **Schema Management**: Dynamic schema creation, validation, and evolution support
- **Streaming State**: Stateful operations for complex streaming analytics
- **Encoding Framework**: Type-safe conversion between JVM objects and Spark SQL representations
- **Error Handling**: Structured exception hierarchy with detailed error reporting

## Capabilities

### Core Data Types

Comprehensive type system including primitives, collections, and complex nested structures. Essential for defining schemas and working with structured data.

```scala { .api }
// Base type hierarchy
abstract class DataType extends AbstractDataType
abstract class AbstractDataType

// Primitive types (in Spark, each case object extends a same-named abstract
// class, so the type and its singleton instance share one name)
case object StringType extends StringType
case object IntegerType extends IntegerType
case object LongType extends LongType
case object DoubleType extends DoubleType
case object BooleanType extends BooleanType

// Complex types
case class DecimalType(precision: Int, scale: Int) extends FractionalType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType
case class StructType(fields: Array[StructField]) extends DataType
```

[Data Types](./data-types.md)

### Row Operations

Structured data representation and manipulation with type-safe access methods for distributed data processing.

```scala { .api }
trait Row extends Serializable {
  def length: Int
  def apply(i: Int): Any
  def get(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getAs[T](i: Int): T
  def getAs[T](fieldName: String): T
  def getString(i: Int): String
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getBoolean(i: Int): Boolean
}

object Row {
  def apply(values: Any*): Row
  def fromSeq(values: Seq[Any]): Row
  def fromTuple(tuple: Product): Row
}
```

[Row Operations](./row-operations.md)
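
The accessor pattern above can be illustrated with a small self-contained sketch. `ToyRow` is a hypothetical stand-in, not Spark's implementation; it shows how name-based access resolves a field name to an index and then delegates to positional access.

```scala
// ToyRow is a hypothetical, simplified model of the Row accessors above --
// not Spark's implementation. Name-based access resolves the field name to
// an index, then delegates to positional access.
case class ToyRow(fieldNames: Seq[String], values: Seq[Any]) {
  def length: Int = values.length
  def get(i: Int): Any = values(i)
  def isNullAt(i: Int): Boolean = values(i) == null
  def getAs[T](i: Int): T = values(i).asInstanceOf[T]
  def getAs[T](fieldName: String): T = getAs[T](fieldNames.indexOf(fieldName))
}

val toyRow = ToyRow(Seq("name", "age", "salary"), Seq("Alice", 25, null))
```

In Spark itself, name-based `getAs` works only on rows that carry a schema: rows produced by queries do, while rows built directly with `Row(...)` do not.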

### Streaming State Management

Stateful operations for complex streaming analytics with timeout support and watermark handling.

```scala { .api }
trait GroupState[S] extends LogicalGroupState[S] {
  def exists: Boolean
  def get: S
  def getOption: Option[S]
  def update(newState: S): Unit
  def remove(): Unit
  def hasTimedOut: Boolean
  def setTimeoutDuration(durationMs: Long): Unit
  def setTimeoutTimestamp(timestampMs: Long): Unit
  def getCurrentWatermarkMs(): Long
  def getCurrentProcessingTimeMs(): Long
}
```

[Streaming Operations](./streaming-operations.md)
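
To clarify the state lifecycle, here is a toy in-memory sketch. `ToyGroupState` and `updateRunningCount` are hypothetical names, not part of the API; in real code Spark constructs the `GroupState` instance itself and passes it to the update function you supply to `mapGroupsWithState`.

```scala
// Hypothetical in-memory stand-in for GroupState, illustrating only the
// exists/get/update/remove lifecycle (no timeouts or watermarks).
class ToyGroupState[S] {
  private var state: Option[S] = None
  def exists: Boolean = state.isDefined
  def get: S = state.getOrElse(throw new NoSuchElementException("state not set"))
  def getOption: Option[S] = state
  def update(newState: S): Unit = { state = Some(newState) }
  def remove(): Unit = { state = None }
}

// A typical update-function shape: fold new events into the running state.
def updateRunningCount(events: Iterator[String], state: ToyGroupState[Long]): Long = {
  val count = state.getOption.getOrElse(0L) + events.size
  state.update(count)
  count
}
```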

### Encoding Framework

Type-safe conversion between JVM objects and Spark SQL representations for distributed serialization.

```scala { .api }
trait Encoder[T] extends Serializable {
  def schema: StructType
  def clsTag: ClassTag[T]
}

trait AgnosticEncoder[T] extends Encoder[T] {
  def isPrimitive: Boolean
  def nullable: Boolean
  def dataType: DataType
}
```

[Encoders](./encoders.md)
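
A simplified sketch of the encoder contract: the schema describes the columnar shape while the `ClassTag` identifies the JVM class to reconstruct. `ToyEncoder` and `ToyField` are hypothetical stand-ins for `Encoder` and `StructField`, kept dependency-free for illustration.

```scala
import scala.reflect.ClassTag

// ToyField stands in for StructField; ToyEncoder mirrors the Encoder trait
// above. Both are illustrative only, not Spark's types.
case class ToyField(name: String, dataType: String, nullable: Boolean)

trait ToyEncoder[T] extends Serializable {
  def schema: Seq[ToyField]
  def clsTag: ClassTag[T]
}

case class Person(name: String, age: Int)

// An encoder instance for Person, pairing its field layout with its class.
val personEncoder: ToyEncoder[Person] = new ToyEncoder[Person] {
  def schema: Seq[ToyField] = Seq(
    ToyField("name", "string", nullable = true),
    ToyField("age", "int", nullable = false)
  )
  def clsTag: ClassTag[Person] = implicitly[ClassTag[Person]]
}
```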

### Analysis and Error Handling

Structured exception handling with detailed error information for query analysis and execution.

```scala { .api }
class AnalysisException(
    message: String,
    line: Option[Int] = None,
    startPosition: Option[Int] = None,
    errorClass: Option[String] = None,
    messageParameters: Map[String, String] = Map.empty,
    context: Array[QueryContext] = Array.empty
) extends Exception with SparkThrowable {
  def withPosition(origin: Origin): AnalysisException
  def getSimpleMessage: String
}
```

[Error Handling](./error-handling.md)
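
The interplay of `errorClass` and `messageParameters` can be sketched as a small message-composition function. `formatError` is hypothetical (not Spark's actual error formatter); it only illustrates how named parameters fill a message template keyed by an error class.

```scala
// Hypothetical sketch of error-class message composition, mirroring the
// errorClass/messageParameters fields above -- not Spark's implementation.
def formatError(errorClass: String,
                params: Map[String, String],
                template: String): String = {
  // Substitute each <key> placeholder with its parameter value.
  val filled = params.foldLeft(template) { case (msg, (key, value)) =>
    msg.replace(s"<$key>", value)
  }
  s"[$errorClass] $filled"
}

val message = formatError(
  "UNRESOLVED_COLUMN",
  Map("objectName" -> "`user_name`"),
  "A column or function parameter with name <objectName> cannot be resolved."
)
```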

### Utility Functions

Helper utilities for data type conversions and integrations with external systems.

```scala { .api }
object ArrowUtils {
  def toArrowType(dt: DataType, timeZoneId: String, largeVarTypes: Boolean = false): ArrowType
  def fromArrowType(dt: ArrowType): DataType
  def toArrowSchema(schema: StructType, timeZoneId: String, errorOnDuplicatedFieldNames: Boolean, largeVarTypes: Boolean = false): Schema
  def fromArrowSchema(schema: Schema): StructType
}
```

[Utilities](./utilities.md)
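
The kind of mapping `toArrowType` performs can be sketched over type names rather than real `DataType`/`ArrowType` instances (which would require the spark-sql-api and Arrow dependencies). `toArrowTypeName` is a hypothetical helper; the pairings reflect Arrow's corresponding type names.

```scala
// Illustrative only: a name-level sketch of the Spark-to-Arrow type mapping,
// not the real ArrowUtils (which returns ArrowType instances).
def toArrowTypeName(sparkType: String): String = sparkType match {
  case "BooleanType" => "Bool"
  case "IntegerType" => "Int(32, signed)"
  case "LongType"    => "Int(64, signed)"
  case "DoubleType"  => "FloatingPoint(DOUBLE)"
  case "StringType"  => "Utf8"
  case other =>
    // Types outside this toy table are rejected, mirroring how unsupported
    // conversions surface as exceptions.
    throw new UnsupportedOperationException(s"Unsupported type: $other")
}
```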