Apache Spark Hive integration module that provides compatibility with Apache Hive for Spark SQL operations
npx @tessl/cli install tessl/maven-org-apache-spark--spark-hive@3.5.6
# Apache Spark Hive Integration

The Apache Spark Hive module provides comprehensive integration between Apache Spark and Apache Hive, enabling Spark SQL to interact seamlessly with Hive tables, the Hive metastore, and Hive storage formats. It serves as the bridge between Spark SQL and the Hive ecosystem for backward compatibility and hybrid environments.

## Package Information

- **Package Name**: spark-hive_2.13
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Maven dependency `org.apache.spark:spark-hive_2.13:3.5.6`

## Core Imports

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.hive.client.HiveClient
import org.apache.spark.sql.hive.HiveExternalCatalog
import org.apache.spark.sql.catalyst.catalog._
```

For UDF support:

```scala
import org.apache.spark.sql.hive.HiveSimpleUDF
import org.apache.spark.sql.hive.HiveGenericUDF
import org.apache.spark.sql.hive.HiveUDAFFunction
```

## Basic Usage

The primary way to enable Hive support in Spark is through the SparkSession builder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive Integration Example")
  .enableHiveSupport() // sets spark.sql.catalogImplementation=hive
  .getOrCreate()

// Access Hive tables
val df = spark.sql("SELECT * FROM hive_table")
df.show()

// Create a Hive table
spark.sql("""
  CREATE TABLE user_data (
    id INT,
    name STRING,
    age INT
  ) USING HIVE
""")
```

Note that `enableHiveSupport()` already sets `spark.sql.catalogImplementation` to `hive`, so the config does not need to be set explicitly.

## Architecture

The Spark Hive module is organized around several key architectural components:

- **Session Integration**: `HiveSessionStateBuilder` and `HiveSessionCatalog` provide Hive-enabled session state
- **Metastore Access**: `HiveExternalCatalog` and `HiveClient` interface with the Hive metastore for metadata operations
- **Query Planning**: `HiveStrategies` and analysis rules convert Hive operations to Spark physical plans
- **File Format Support**: `OrcFileFormat` and related classes handle Hive file formats
- **Execution Layer**: Specialized execution plans such as `HiveTableScanExec` for Hive table operations
- **Security**: `HiveDelegationTokenProvider` handles authentication in secure clusters

## Capabilities

### Session and Configuration

Core session management and configuration for enabling Hive support in Spark SQL sessions.

```scala { .api }
object SparkSession {
  def builder(): Builder
}

class Builder {
  def enableHiveSupport(): Builder
  def config(key: String, value: String): Builder
  def getOrCreate(): SparkSession
}
```

[Session and Configuration](./session-configuration.md)
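
As a quick sanity check, a session built this way can confirm which catalog implementation is active. A minimal sketch, assuming a local-mode session for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session and verify the active catalog implementation.
val spark = SparkSession.builder()
  .appName("Session Check")
  .master("local[*]") // local mode, for illustration only
  .enableHiveSupport()
  .getOrCreate()

// Expected to be "hive" when Hive support is enabled
println(spark.conf.get("spark.sql.catalogImplementation"))
println(spark.catalog.currentDatabase) // typically "default"
```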

### Hive Metastore Integration

Integration with the Hive metastore for table metadata, database operations, and catalog management.

```scala { .api }
// Configuration constants for Hive metastore integration
object HiveUtils {
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val CONVERT_INSERTING_PARTITIONED_TABLE: ConfigEntry[Boolean]
}
```

[Metastore Integration](./metastore-integration.md)
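
The `ConfigEntry` constants above back runtime SQL configuration keys. A sketch of toggling them, assuming an existing Hive-enabled session `spark` (the table name is illustrative):

```scala
// Disabling conversion forces Spark to read metastore Parquet/ORC tables
// through Hive SerDes instead of its native readers -- useful when SerDe
// behavior must match Hive exactly.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

// Subsequent reads of Hive Parquet/ORC tables now use the Hive SerDe path
val df = spark.sql("SELECT * FROM hive_parquet_table")
```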

### File Format Support

Support for Hive-compatible file formats, particularly ORC files with Hive metadata integration.

```scala { .api }
class OrcFileFormat extends FileFormat {
  def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]
  ): Option[StructType]

  def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType
  ): OutputWriterFactory
}
```

[File Format Support](./file-formats.md)
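
In practice this format is exercised through the DataFrame writer or Hive DDL rather than directly. A sketch, assuming a Hive-enabled session `spark` (table names are illustrative):

```scala
// Write a DataFrame as a Hive-compatible ORC table, then read it back
val data = spark.range(0, 100).selectExpr("id", "id * 2 AS doubled")
data.write.format("orc").mode("overwrite").saveAsTable("orc_example")

// Equivalently, declare the storage format with Hive DDL syntax
spark.sql("CREATE TABLE IF NOT EXISTS orc_sql_example (id BIGINT) STORED AS ORC")

val readBack = spark.table("orc_example")
readBack.show(5)
```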

### Query Execution

Specialized execution plans and strategies for Hive table operations and query processing.

```scala { .api }
case class HiveTableScanExec(
  requestedAttributes: Seq[Attribute],
  relation: HiveTableRelation,
  partitionPruningPred: Seq[Expression]
)(@transient private val sparkSession: SparkSession) extends LeafExecNode
```

[Query Execution](./query-execution.md)
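
These nodes can be observed in query plans. A sketch, assuming a Hive-enabled session `spark` and an illustrative partitioned table: a scan of a non-converted Hive table should surface as a `HiveTableScan` node, with partition filters feeding `partitionPruningPred`:

```scala
// Inspect the plans for a filtered scan of a partitioned Hive table
val plan = spark.sql("SELECT * FROM hive_table WHERE part_col = '2024'")
plan.explain(true) // extended output: parsed, analyzed, optimized, and physical plans
```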

### Hive Client Interface

Direct interface to the Hive metastore, providing database, table, partition, and function operations.

```scala { .api }
private[hive] trait HiveClient {
  def version: HiveVersion
  def getDatabase(name: String): CatalogDatabase
  def listDatabases(pattern: String): Seq[String]
  def createDatabase(database: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def dropDatabase(name: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit
  def getTable(dbName: String, tableName: String): CatalogTable
  def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit
  def dropTable(dbName: String, tableName: String, ignoreIfNotExists: Boolean, purge: Boolean): Unit
}
```

[Hive Client Interface](./hive-client.md)
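
Because `HiveClient` is `private[hive]`, application code typically reaches the same metastore operations through the session's external catalog instead. A sketch, assuming a Hive-enabled session `spark` (database and table names are illustrative):

```scala
// The external catalog delegates to HiveExternalCatalog (and thus HiveClient)
// when Hive support is enabled.
val catalog = spark.sharedState.externalCatalog

val databases: Seq[String] = catalog.listDatabases()
val table = catalog.getTable("default", "user_data")
println(table.schema.treeString)
```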

### Configuration Utilities

Comprehensive configuration constants and utilities that control Hive integration behavior and the metastore connection.

```scala { .api }
object HiveUtils {
  val HIVE_METASTORE_VERSION: ConfigEntry[String]
  val HIVE_METASTORE_JARS: ConfigEntry[String]
  val HIVE_METASTORE_JARS_PATH: ConfigEntry[Seq[String]]
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val HIVE_THRIFT_SERVER_ASYNC: ConfigEntry[Boolean]
}
```

[Configuration Utilities](./configuration-utilities.md)
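
The metastore version and jar settings are static configs, so they must be supplied before the session is created. A sketch, where the version string and jar path are illustrative values for your environment:

```scala
import org.apache.spark.sql.SparkSession

// Connect to an external metastore of a specific Hive version, loading the
// matching client jars from a local path ("path" mode defers to jars.path).
val spark = SparkSession.builder()
  .appName("Custom Metastore")
  .config("spark.sql.hive.metastore.version", "2.3.9")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path", "/opt/hive/lib/*")
  .enableHiveSupport()
  .getOrCreate()
```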

### Hive UDF Support

Integration support for Hive User Defined Functions (UDFs), User Defined Aggregate Functions (UDAFs), and User Defined Table Functions (UDTFs).

```scala { .api }
case class HiveSimpleUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveGenericUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveUDAFFunction(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression],
  isDistinct: Boolean
) extends AggregateFunction
```

[Hive UDF Support](./udf-support.md)
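
These expression classes are created internally; Hive UDFs are usually registered through SQL, and Spark wraps the Hive class in `HiveSimpleUDF` or `HiveGenericUDF` automatically. A sketch, assuming a Hive-enabled session `spark`; the jar path and UDF class name are hypothetical:

```scala
// Make the UDF jar visible to the session, then register the function
spark.sql("ADD JAR /path/to/udfs.jar")                                       // hypothetical path
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpperUDF'") // hypothetical class

// The function is now callable like any built-in
spark.sql("SELECT my_upper(name) FROM user_data").show()
```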

## Types

### Type Aliases and Basic Types

```scala { .api }
// Common type aliases used throughout the API
type TablePartitionSpec = Map[String, String]
type HiveTable = org.apache.hadoop.hive.ql.metadata.Table

// Version information for Hive compatibility
case class HiveVersion(
  fullVersion: String,
  majorVersion: Int,
  minorVersion: Int
) {
  def supportsFeature(feature: String): Boolean
}
```

### Core Configuration Types

```scala { .api }
class HiveOptions(parameters: Map[String, String]) {
  def fileFormat: Option[String]
  def inputFormat: Option[String]
  def outputFormat: Option[String]
  def serde: Option[String]
  def serdeProperties: Map[String, String]
}
```

### Table and Relation Types

```scala { .api }
case class HiveTableRelation(
  tableMeta: CatalogTable,
  dataCols: Seq[AttributeReference],
  partitionCols: Seq[AttributeReference],
  tableStats: Option[Statistics],
  prunedPartitions: Option[Seq[CatalogTablePartition]]
) extends LeafNode
```

## Error Handling

The module provides specific exceptions for Hive integration errors:

- **AnalysisException**: Thrown for schema and table analysis errors
- **IllegalArgumentException**: Thrown for invalid file format configurations
- **IOException**: Thrown for HDFS and file system access errors during table statistics collection
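
A sketch of handling the most common case, a missing or unresolvable table, assuming a Hive-enabled session `spark` (the table name is illustrative):

```scala
import org.apache.spark.sql.AnalysisException

// AnalysisException surfaces at analysis time, before any job is launched
try {
  spark.sql("SELECT * FROM no_such_table").show()
} catch {
  case e: AnalysisException =>
    println(s"Analysis failed: ${e.getMessage}")
}
```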

## Security Considerations

- **Delegation Tokens**: Automatic handling of Hive metastore delegation tokens in secure clusters
- **Authentication**: Integration with Hadoop security for authenticated access to the Hive metastore
- **Authorization**: Respects Hive table-level permissions and security policies