tessl/maven-org-apache-spark--spark-hive

Apache Spark Hive integration module that provides compatibility with Apache Hive for Spark SQL operations

Describes: pkg:maven/org.apache.spark/spark-hive_2.13@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-hive@3.5.0


Apache Spark Hive Integration

The Apache Spark Hive module provides comprehensive integration between Apache Spark and Apache Hive, enabling Spark SQL to interact seamlessly with Hive tables, the Hive metastore, and Hive storage formats. It serves as the bridge between Spark SQL and the Hive ecosystem for backward compatibility and for hybrid environments.

Package Information

  • Package Name: spark-hive_2.13
  • Package Type: maven
  • Language: Scala
  • Installation: Maven dependency org.apache.spark:spark-hive_2.13:3.5.6
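
A minimal sketch of declaring the same dependency in an sbt build (the %% operator appends the Scala binary version, here 2.13):

// sbt build definition
libraryDependencies += "org.apache.spark" %% "spark-hive" % "3.5.6"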

Core Imports

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.hive.client.HiveClient
import org.apache.spark.sql.hive.HiveExternalCatalog
import org.apache.spark.sql.catalyst.catalog._

For UDF support:

import org.apache.spark.sql.hive.HiveSimpleUDF
import org.apache.spark.sql.hive.HiveGenericUDF
import org.apache.spark.sql.hive.HiveUDAFFunction

Basic Usage

The primary way to enable Hive support in Spark is through SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive Integration Example")
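  // Note: enableHiveSupport() below also sets spark.sql.catalogImplementation
  // to "hive", so this explicit config is redundant but harmless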
  .config("spark.sql.catalogImplementation", "hive")
  .enableHiveSupport()
  .getOrCreate()

// Access Hive tables
val df = spark.sql("SELECT * FROM hive_table")
df.show()

// Create Hive table
spark.sql("""
  CREATE TABLE user_data (
    id INT,
    name STRING,
    age INT
  ) USING HIVE
""")

Architecture

The Spark Hive module is organized around several key architectural components:

  • Session Integration: HiveSessionStateBuilder and HiveSessionCatalog provide Hive-enabled session state
  • Metastore Access: HiveExternalCatalog and HiveClient interface with Hive metastore for metadata operations
  • Query Planning: HiveStrategies and analysis rules convert Hive operations to Spark physical plans
  • File Format Support: OrcFileFormat and related classes handle Hive file formats
  • Execution Layer: Specialized execution plans like HiveTableScanExec for Hive table operations
  • Security: HiveDelegationTokenProvider handles authentication in secure clusters
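
A minimal sketch of how these pieces surface in a Hive-enabled session (it assumes only that Hive support is available on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Architecture Check")
  .enableHiveSupport()
  .getOrCreate()

// enableHiveSupport() switches the catalog implementation, which wires in
// HiveSessionStateBuilder, HiveExternalCatalog, and the underlying HiveClient
assert(spark.conf.get("spark.sql.catalogImplementation") == "hive")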

Capabilities

Session and Configuration

Core session management and configuration for enabling Hive support in Spark SQL sessions.

object SparkSession {
  def builder(): Builder
}

class Builder {
  def enableHiveSupport(): Builder
  def config(key: String, value: String): Builder
  def getOrCreate(): SparkSession
}

See session-configuration.md for details.

Hive Metastore Integration

Integration with the Hive metastore for table metadata, database operations, and catalog management.

// Configuration constants for Hive metastore integration
object HiveUtils {
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val CONVERT_INSERTING_PARTITIONED_TABLE: ConfigEntry[Boolean]
}
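
A minimal sketch of toggling these conversions through their configuration keys (spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertMetastoreOrc back the constants above):

import org.apache.spark.sql.SparkSession

// disable the conversions so metastore Parquet/ORC tables are read
// through Hive SerDes instead of Spark's built-in readers
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .getOrCreate()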

See metastore-integration.md for details.

File Format Support

Support for Hive-compatible file formats, particularly ORC files with Hive metadata integration.

class OrcFileFormat extends FileFormat {
  def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]
  ): Option[StructType]
  
  def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType
  ): OutputWriterFactory
}
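
In practice this format is engaged implicitly when a Hive table is stored as ORC; a minimal sketch (the table name is illustrative):

// reads and writes for this table go through OrcFileFormat
spark.sql("CREATE TABLE events (id INT, payload STRING) STORED AS ORC")
spark.sql("INSERT INTO events VALUES (1, 'hello')")
spark.table("events").show()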

See file-formats.md for details.

Query Execution

Specialized execution plans and strategies for Hive table operations and query processing.

case class HiveTableScanExec(
  requestedAttributes: Seq[Attribute],
  relation: HiveTableRelation,
  partitionPruningPred: Seq[Expression]
)(@transient private val sparkSession: SparkSession) extends LeafExecNode
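
A minimal sketch of observing this operator (it assumes hive_table exists in the metastore): scans over Hive-format tables appear as HiveTableScan nodes in the physical plan.

// print the physical plan; a Hive-format table scan shows a HiveTableScan node
spark.sql("SELECT * FROM hive_table WHERE id > 10").explain()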

See query-execution.md for details.

Hive Client Interface

Direct interface for interacting with the Hive metastore, providing database, table, partition, and function operations.

private[hive] trait HiveClient {
  def version: HiveVersion
  def getDatabase(name: String): CatalogDatabase
  def listDatabases(pattern: String): Seq[String]
  def createDatabase(database: CatalogDatabase, ignoreIfExists: Boolean): Unit
  def dropDatabase(name: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit
  def getTable(dbName: String, tableName: String): CatalogTable
  def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit
  def dropTable(dbName: String, tableName: String, ignoreIfNotExists: Boolean, purge: Boolean): Unit
}
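
HiveClient is private[hive], so user code reaches it indirectly; a minimal sketch of operations that HiveExternalCatalog forwards to the client:

// CREATE DATABASE is routed through HiveExternalCatalog to HiveClient.createDatabase
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

// listing databases ultimately calls HiveClient.listDatabases
spark.catalog.listDatabases().show()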

See hive-client.md for details.

Configuration Utilities

Comprehensive configuration constants and utilities controlling Hive integration behavior and the metastore connection.

object HiveUtils {
  val HIVE_METASTORE_VERSION: ConfigEntry[String]
  val HIVE_METASTORE_JARS: ConfigEntry[String]
  val HIVE_METASTORE_JARS_PATH: ConfigEntry[Seq[String]]
  val CONVERT_METASTORE_PARQUET: ConfigEntry[Boolean]
  val CONVERT_METASTORE_ORC: ConfigEntry[Boolean]
  val HIVE_THRIFT_SERVER_ASYNC: ConfigEntry[Boolean]
}
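
A minimal sketch of pointing a session at a specific metastore version (spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars are the keys behind the constants above; the version shown is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  // use the built-in client jars against a Hive 2.3.9 metastore
  .config("spark.sql.hive.metastore.version", "2.3.9")
  .config("spark.sql.hive.metastore.jars", "builtin")
  .getOrCreate()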

See configuration-utilities.md for details.

Hive UDF Support

Integration support for Hive User Defined Functions (UDFs), User Defined Aggregate Functions (UDAFs), and User Defined Table Functions (UDTFs).

case class HiveSimpleUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveGenericUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression

case class HiveUDAFFunction(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression],
  isDistinct: Boolean
) extends AggregateFunction
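
These expression classes back Hive UDFs registered through SQL; a minimal sketch (com.example.hive.MyUpperUDF is a hypothetical UDF class that must be on the classpath):

// Spark wraps the registered function in HiveSimpleUDF or HiveGenericUDF,
// depending on whether the class extends UDF or GenericUDF
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpperUDF'")
spark.sql("SELECT my_upper(name) FROM user_data").show()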

See udf-support.md for details.

Types

Type Aliases and Basic Types

// Common type aliases used throughout the API
type TablePartitionSpec = Map[String, String]
type HiveTable = org.apache.hadoop.hive.ql.metadata.Table

// Version information for Hive compatibility
case class HiveVersion(
  fullVersion: String,
  majorVersion: Int,
  minorVersion: Int
) {
  def supportsFeature(feature: String): Boolean
}
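
A minimal sketch of the partition-spec alias in use (imported from Spark's CatalogTypes object):

import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec

// a partition spec maps partition column names to their string values
val spec: TablePartitionSpec = Map("year" -> "2024", "month" -> "06")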

Core Configuration Types

class HiveOptions(parameters: Map[String, String]) {
  def fileFormat: Option[String]
  def inputFormat: Option[String]
  def outputFormat: Option[String]
  def serde: Option[String]
  def serdeProperties: Map[String, String]
}
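
These options are supplied through the OPTIONS clause of CREATE TABLE ... USING HIVE; a minimal sketch:

// fileFormat, inputFormat, outputFormat, and serde are the option keys
// parsed by HiveOptions
spark.sql("""
  CREATE TABLE logs (ts TIMESTAMP, msg STRING)
  USING HIVE
  OPTIONS (fileFormat 'orc')
""")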

Table and Relation Types

case class HiveTableRelation(
  tableMeta: CatalogTable,
  dataCols: Seq[AttributeReference],
  partitionCols: Seq[AttributeReference],
  tableStats: Option[Statistics],
  prunedPartitions: Option[Seq[CatalogTablePartition]]
) extends LeafNode
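
A minimal sketch of where this relation appears (it assumes hive_table exists): the analyzed logical plan of a Hive table read contains a HiveTableRelation node.

// the analyzed plan prints a HiveTableRelation node for Hive-format tables
val plan = spark.table("hive_table").queryExecution.analyzed
println(plan)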

Error Handling

The module provides specific exceptions for Hive integration errors:

  • AnalysisException: Thrown for schema and table analysis errors
  • IllegalArgumentException: Thrown for invalid file format configurations
  • IOException: Thrown for HDFS and file system access errors during table statistics
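
A minimal sketch of handling the most common case, a query against a table that does not exist:

import org.apache.spark.sql.AnalysisException

try {
  spark.sql("SELECT * FROM no_such_table").show()
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}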

Security Considerations

  • Delegation Tokens: Automatic handling of Hive metastore delegation tokens in secure clusters
  • Authentication: Integration with Hadoop security for authenticated access to Hive metastore
  • Authorization: Respects Hive table-level permissions and security policies