tessl/maven-org-apache-spark--spark-hive

Apache Spark Hive integration module that provides compatibility with Apache Hive for Spark SQL operations

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: mavenpkg:maven/org.apache.spark/spark-hive_2.13@3.5.x

To install, run:

```
npx @tessl/cli install tessl/maven-org-apache-spark--spark-hive@3.5.0
```

# Apache Spark Hive Integration

The Apache Spark Hive module provides comprehensive integration between Apache Spark and Apache Hive, enabling Spark SQL to interact seamlessly with Hive tables, the Hive metastore, and Hive storage formats. It serves as the bridge between Spark SQL and the Hive ecosystem for backward compatibility and hybrid environments.

## Package Information

- **Package Name**: spark-hive_2.13
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Maven dependency `org.apache.spark:spark-hive_2.13:3.5.6`
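
In an sbt build, an equivalent dependency declaration is sketched below; the version should match your Spark deployment, and the dependency is often scoped `provided` when jobs are submitted to a cluster that already ships Spark:

```scala
// build.sbt — mirrors the Maven coordinates above; %% appends the Scala suffix (_2.13)
libraryDependencies += "org.apache.spark" %% "spark-hive" % "3.5.6" % "provided"
```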

## Core Imports

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.hive.client.HiveClient
import org.apache.spark.sql.hive.HiveExternalCatalog
import org.apache.spark.sql.catalyst.catalog._
```

For UDF support:

```scala
import org.apache.spark.sql.hive.HiveSimpleUDF
import org.apache.spark.sql.hive.HiveGenericUDF
import org.apache.spark.sql.hive.HiveUDAFFunction
```

## Basic Usage

The primary way to enable Hive support in Spark is through `SparkSession`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive Integration Example")
  .config("spark.sql.catalogImplementation", "hive")
  .enableHiveSupport()
  .getOrCreate()

// Access Hive tables
val df = spark.sql("SELECT * FROM hive_table")
df.show()

// Create a Hive table
spark.sql("""
  CREATE TABLE user_data (
    id INT,
    name STRING,
    age INT
  ) USING HIVE
""")
```

## Architecture

The Spark Hive module is organized around several key architectural components:

- **Session Integration**: `HiveSessionStateBuilder` and `HiveSessionCatalog` provide Hive-enabled session state
- **Metastore Access**: `HiveExternalCatalog` and `HiveClient` interface with the Hive metastore for metadata operations
- **Query Planning**: `HiveStrategies` and analysis rules convert Hive operations to Spark physical plans
- **File Format Support**: `OrcFileFormat` and related classes handle Hive file formats
- **Execution Layer**: Specialized execution plans like `HiveTableScanExec` for Hive table operations
- **Security**: `HiveDelegationTokenProvider` handles authentication in secure clusters

## Capabilities

### Session and Configuration

Core session management and configuration for enabling Hive support in Spark SQL sessions.

```scala { .api }
object SparkSession {
  def builder(): Builder
}

class Builder {
  def enableHiveSupport(): Builder
  def config(key: String, value: String): Builder
  def getOrCreate(): SparkSession
}
```

[Session and Configuration](./session-configuration.md)

### Hive Metastore Integration

Integration with the Hive metastore for table metadata, database operations, and catalog management.

```scala { .api }
// Configuration constants for Hive metastore integration
object HiveUtils {
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val CONVERT_INSERTING_PARTITIONED_TABLE: ConfigEntry[Boolean]
}
```

[Metastore Integration](./metastore-integration.md)
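
In practice these constants are toggled through their SQL configuration keys (`spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc`); a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// Read Parquet-backed Hive tables through the Hive SerDe instead of
// Spark's native Parquet reader, while keeping native ORC conversion on
val spark = SparkSession.builder()
  .appName("Metastore Conversion Example")
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()
```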

### File Format Support

Support for Hive-compatible file formats, particularly ORC files with Hive metadata integration.

```scala { .api }
class OrcFileFormat extends FileFormat {
  def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]
  ): Option[StructType]

  def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType
  ): OutputWriterFactory
}
```

[File Format Support](./file-formats.md)
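
In everyday use the ORC support is reached through the standard DataFrame reader/writer rather than `OrcFileFormat` directly; a sketch (paths and table names are placeholders, and an active `spark` session with a DataFrame `df` is assumed):

```scala
// Write a DataFrame as ORC files, then read them back
df.write.format("orc").mode("overwrite").save("/tmp/orc_output")
val orcDf = spark.read.format("orc").load("/tmp/orc_output")

// Or create a Hive table stored as ORC
spark.sql("CREATE TABLE orc_table (id INT, name STRING) STORED AS ORC")
```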

### Query Execution

Specialized execution plans and strategies for Hive table operations and query processing.

```scala { .api }
case class HiveTableScanExec(
  requestedAttributes: Seq[Attribute],
  relation: HiveTableRelation,
  partitionPruningPred: Seq[Expression]
)(@transient private val sparkSession: SparkSession) extends LeafExecNode
```

[Query Execution](./query-execution.md)
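
Whether a query is served by `HiveTableScanExec` can be checked in the physical plan; a sketch assuming a Hive SerDe table named `hive_table` exists:

```scala
// The physical plan for a Hive SerDe table scan shows a "Scan hive" node
spark.sql("SELECT * FROM hive_table").explain()
```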

### Hive Client Interface

Direct interface to the Hive metastore, providing database, table, partition, and function operations.

```scala { .api }
private[hive] trait HiveClient {
  def version: HiveVersion
  def getDatabase(name: String): CatalogDatabase
  def listDatabases(pattern: String): Seq[String]
  def createDatabase(database: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def dropDatabase(name: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit
  def getTable(dbName: String, tableName: String): CatalogTable
  def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit
  def dropTable(dbName: String, tableName: String, ignoreIfNotExists: Boolean, purge: Boolean): Unit
}
```

[Hive Client Interface](./hive-client.md)
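
Because `HiveClient` is `private[hive]`, applications typically reach the same metadata through the public catalog API, which is backed by `HiveExternalCatalog` when Hive support is enabled; a sketch (assumes an active `spark` session):

```scala
// List databases and tables registered in the Hive metastore
spark.catalog.listDatabases().show()
spark.catalog.listTables("default").show()

// Inspect a table's schema through SQL
spark.sql("DESCRIBE TABLE default.user_data").show()
```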

### Configuration Utilities

Comprehensive configuration constants and utilities for Hive integration behavior and metastore connection.

```scala { .api }
object HiveUtils {
  val HIVE_METASTORE_VERSION: ConfigEntry[String]
  val HIVE_METASTORE_JARS: ConfigEntry[String]
  val HIVE_METASTORE_JARS_PATH: ConfigEntry[Seq[String]]
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val HIVE_THRIFT_SERVER_ASYNC: ConfigEntry[Boolean]
}
```

[Configuration Utilities](./configuration-utilities.md)
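
These constants map to SQL configuration keys set at session construction (`spark.sql.hive.metastore.version`, `spark.sql.hive.metastore.jars`); the sketch below pins the metastore client version, with the version string purely illustrative and dependent on your metastore deployment:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  // Target a specific Hive metastore version; "builtin" uses the Hive
  // client jars bundled with Spark (other modes can point at custom jars)
  .config("spark.sql.hive.metastore.version", "2.3.9")
  .config("spark.sql.hive.metastore.jars", "builtin")
  .getOrCreate()
```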

### Hive UDF Support

Integration support for Hive User Defined Functions (UDFs), User Defined Aggregate Functions (UDAFs), and User Defined Table Functions (UDTFs).

```scala { .api }
case class HiveSimpleUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveGenericUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveUDAFFunction(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression],
  isDistinct: Boolean
) extends AggregateFunction
```

[Hive UDF Support](./udf-support.md)
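
Existing Hive UDFs are usually registered through SQL rather than by constructing these classes directly; a sketch using a hypothetical UDF class `com.example.MyUpper` that must be on the classpath:

```scala
// Register a Hive UDF by class name (class name is hypothetical)
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpper'")

// Use it like any built-in SQL function
spark.sql("SELECT my_upper(name) FROM user_data").show()
```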

## Types

### Type Aliases and Basic Types

```scala { .api }
// Common type aliases used throughout the API
type TablePartitionSpec = Map[String, String]
type HiveTable = org.apache.hadoop.hive.ql.metadata.Table

// Version information for Hive compatibility
case class HiveVersion(
  fullVersion: String,
  majorVersion: Int,
  minorVersion: Int
) {
  def supportsFeature(feature: String): Boolean
}
```

### Core Configuration Types

```scala { .api }
class HiveOptions(parameters: Map[String, String]) {
  def fileFormat: Option[String]
  def inputFormat: Option[String]
  def outputFormat: Option[String]
  def serde: Option[String]
  def serdeProperties: Map[String, String]
}
```

### Table and Relation Types

```scala { .api }
case class HiveTableRelation(
  tableMeta: CatalogTable,
  dataCols: Seq[AttributeReference],
  partitionCols: Seq[AttributeReference],
  tableStats: Option[Statistics],
  prunedPartitions: Option[Seq[CatalogTablePartition]]
) extends LeafNode
```

## Error Handling

The module provides specific exceptions for Hive integration errors:

- **AnalysisException**: Thrown for schema and table analysis errors
- **IllegalArgumentException**: Thrown for invalid file format configurations
- **IOException**: Thrown for HDFS and file system access errors during table statistics
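
A minimal sketch of handling the most common case (assumes an active `spark` session):

```scala
import org.apache.spark.sql.AnalysisException

try {
  spark.sql("SELECT * FROM nonexistent_table").show()
} catch {
  case e: AnalysisException =>
    // Missing tables and schema mismatches surface as AnalysisException
    println(s"Analysis failed: ${e.getMessage}")
}
```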

## Security Considerations

- **Delegation Tokens**: Automatic handling of Hive metastore delegation tokens in secure clusters
- **Authentication**: Integration with Hadoop security for authenticated access to the Hive metastore
- **Authorization**: Respects Hive table-level permissions and security policies