Apache Spark Hive integration module that provides support for Hive tables, queries, and SerDes
npx @tessl/cli install tessl/maven-org-apache-spark--spark-hive_2-11@2.4.8
# Apache Spark Hive Integration
Apache Spark Hive integration module that provides comprehensive support for accessing and manipulating Hive tables, executing HiveQL queries, and leveraging Hive SerDes. The module bridges Apache Spark SQL and Apache Hive, letting Spark applications work seamlessly with existing Hive infrastructure, metastores, and data formats while remaining fully compatible with Hive features such as UDFs, partitioning, and complex data types.
## Package Information

- **Package Name**: org.apache.spark:spark-hive_2.11
- **Package Type**: Maven
- **Language**: Scala
- **Installation**: Add to your Maven dependencies or include it on the Spark classpath
- **Version**: 2.4.8

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.4.8</version>
</dependency>
```

## Core Imports

For Scala applications:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext // Deprecated - use SparkSession
```

For enabling Hive support (modern approach):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive Integration App")
  .enableHiveSupport()
  .getOrCreate()
```

## Basic Usage
41
42
```scala
43
import org.apache.spark.sql.SparkSession
44
45
// Create SparkSession with Hive support
46
val spark = SparkSession.builder()
47
.appName("Hive Integration Example")
48
.enableHiveSupport()
49
.getOrCreate()
50
51
// Execute HiveQL queries
52
spark.sql("CREATE TABLE IF NOT EXISTS users (id INT, name STRING, age INT)")
53
spark.sql("INSERT INTO users VALUES (1, 'Alice', 25), (2, 'Bob', 30)")
54
55
// Query Hive tables
56
val result = spark.sql("SELECT * FROM users WHERE age > 25")
57
result.show()
58
59
// Access Hive metastore
60
spark.catalog.listTables().show()
61
spark.catalog.listDatabases().show()
62
63
// Work with Hive partitioned tables
64
spark.sql("""
65
CREATE TABLE IF NOT EXISTS partitioned_sales (
66
product STRING,
67
amount DOUBLE
68
) PARTITIONED BY (year INT, month INT)
69
""")
70
71
// Load data into partitioned table
72
spark.sql("INSERT INTO partitioned_sales PARTITION(year=2023, month=12) VALUES ('laptop', 999.99)")
73
```
74
75
## Architecture

The Spark Hive integration is built around several key components:

- **HiveExternalCatalog**: Persistent catalog implementation that uses the Hive metastore for database, table, and partition metadata management
- **HiveClient Interface**: Low-level interface for direct Hive metastore operations and HiveQL execution
- **Session Integration**: HiveSessionCatalog and HiveSessionStateBuilder for Hive-aware Spark SQL sessions
- **UDF Support**: Comprehensive wrappers for Hive UDFs, UDAFs, and UDTFs with automatic type conversion
- **Data Format Integration**: Native support for Hive SerDes, ORC files, and table format conversion
- **Query Planning**: Hive-specific optimization strategies and execution operators

## Capabilities

### Session Management

Core functionality for creating and managing Spark sessions with Hive integration, including legacy HiveContext support and modern SparkSession configuration.

```scala { .api }
// Modern approach (recommended)
def enableHiveSupport(): SparkSession.Builder

// Legacy approach (deprecated since 2.0.0)
@deprecated("Use SparkSession.builder.enableHiveSupport instead", "2.0.0")
class HiveContext(sc: SparkContext) extends SQLContext
```

[Session Management](./session-management.md)

### Hive Metastore Operations

Direct access to the Hive metastore for programmatic database, table, partition, and function management through the HiveClient interface.

```scala { .api }
trait HiveClient {
  def listDatabases(pattern: String): Seq[String]
  def getDatabase(name: String): CatalogDatabase
  def listTables(dbName: String): Seq[String]
  def getTable(dbName: String, tableName: String): CatalogTable
  def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit
  def getPartitions(catalogTable: CatalogTable, partialSpec: Option[TablePartitionSpec] = None): Seq[CatalogTablePartition]
  def runSqlHive(sql: String): Seq[String]
}
```

[Metastore Operations](./metastore-operations.md)
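
HiveClient is an internal interface; applications typically reach the same metadata through the public `spark.catalog` API, which is backed by the Hive metastore when Hive support is enabled. A minimal sketch (the `default.users` table is illustrative and assumed to exist):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Metastore Browsing")
  .enableHiveSupport()
  .getOrCreate()

// Public catalog API, backed by HiveExternalCatalog when Hive support is on
spark.catalog.listDatabases().show()
spark.catalog.listTables("default").show()

// Metadata for a single table (assumes default.users exists)
val table = spark.catalog.getTable("default", "users")
println(s"${table.name}: ${table.tableType}")
```
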

### Hive UDF Integration

Comprehensive support for Hive user-defined functions, including simple UDFs, generic UDFs, table-generating functions (UDTFs), and aggregate functions (UDAFs).

```scala { .api }
case class HiveSimpleUDF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends Expression
case class HiveGenericUDF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends Expression
case class HiveGenericUDTF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends Generator
case class HiveUDAFFunction(funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends TypedImperativeAggregate[Any]
```

[UDF Integration](./udf-integration.md)
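
These wrappers are applied automatically when a Hive UDF is registered through HiveQL. A hedged sketch of registering and calling one; the jar path and class name are placeholders for your own UDF implementation, and the `users` table is assumed to exist:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive UDF Example")
  .enableHiveSupport()
  .getOrCreate()

// Register a Hive UDF from a jar (placeholder path and class name)
spark.sql("""
  CREATE TEMPORARY FUNCTION my_upper
  AS 'com.example.hive.udf.MyUpper'
  USING JAR '/path/to/my-udfs.jar'
""")

// Once registered, the function is usable in SQL like any built-in
spark.sql("SELECT my_upper(name) FROM users").show()
```
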

### Configuration and Utilities

Configuration options, utilities, and constants for customizing Hive integration behavior, including metastore settings, file format conversion, and compatibility options.

```scala { .api }
object HiveUtils {
  val builtinHiveVersion: String = "1.2.1"
  val HIVE_METASTORE_VERSION: ConfigEntry[String]
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  def newTemporaryConfiguration(useInMemoryDerby: Boolean): Map[String, String]
}
```

[Configuration](./configuration.md)
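
These entries correspond to the `spark.sql.hive.*` configuration keys, which can be set when building the session. A sketch; the metastore version shown matches the built-in Hive client, and you should match it to your deployment:

```scala
import org.apache.spark.sql.SparkSession

// Point Spark at a specific metastore version and control format conversion
val spark = SparkSession.builder()
  .appName("Hive Config Example")
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "builtin")
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .enableHiveSupport()
  .getOrCreate()
```
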

### Data Type Conversion

Utilities for converting between Hive and Catalyst data types, handling ObjectInspectors, and managing SerDe operations.

```scala { .api }
trait HiveInspectors {
  def javaTypeToDataType(clz: Type): DataType
  def toInspector(dataType: DataType): ObjectInspector
  def inspectorToDataType(inspector: ObjectInspector): DataType
  def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any
  def unwrapperFor(objectInspector: ObjectInspector): Any => Any
}
```

[Data Type Conversion](./data-type-conversion.md)

### File Format Support

Native support for Hive file formats, including traditional Hive tables and optimized ORC files with Hive compatibility.

```scala { .api }
class HiveFileFormat extends FileFormat with DataSourceRegister {
  override def shortName(): String = "hive"
}

class OrcFileFormat extends FileFormat with DataSourceRegister {
  override def shortName(): String = "orc"
}
```

[File Formats](./file-formats.md)
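
The registered short names mean the formats can be used directly with the DataFrame reader/writer API. A minimal sketch using the `"orc"` short name; the output path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ORC Example")
  .enableHiveSupport()
  .getOrCreate()

// Write a DataFrame as ORC via the short name registered by OrcFileFormat
val df = spark.range(10).toDF("id")
df.write.mode("overwrite").format("orc").save("/tmp/orc_demo")

// Read it back
spark.read.format("orc").load("/tmp/orc_demo").show()
```
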

## Types

### Core Catalog Types

```scala { .api }
// From Spark SQL Catalyst - used throughout the Hive integration
case class CatalogDatabase(
  name: String,
  description: String,
  locationUri: String,
  properties: Map[String, String]
)

case class CatalogTable(
  identifier: TableIdentifier,
  tableType: CatalogTableType,
  storage: CatalogStorageFormat,
  schema: StructType,
  partitionColumnNames: Seq[String] = Seq.empty,
  properties: Map[String, String] = Map.empty
)

case class CatalogTablePartition(
  spec: TablePartitionSpec,
  storage: CatalogStorageFormat,
  parameters: Map[String, String] = Map.empty
)

case class CatalogFunction(
  identifier: FunctionIdentifier,
  className: String,
  resources: Seq[FunctionResource]
)
```

### Hive-Specific Types

```scala { .api }
// Hive version support
abstract class HiveVersion(
  val fullVersion: String,
  val extraDeps: Seq[String] = Nil,
  val exclusions: Seq[String] = Nil
)

// Configuration for Hive data sources
class HiveOptions(parameters: Map[String, String]) {
  val fileFormat: Option[String]
  val inputFormat: Option[String]
  val outputFormat: Option[String]
  val serde: Option[String]
  def serdeProperties: Map[String, String]
}
```