Apache Spark Hive Integration

The Apache Spark SQL Hive integration module provides comprehensive compatibility with Apache Hive, enabling Spark applications to query Hive tables, metadata, and data formats seamlessly. The module implements HiveContext as an extension of SQLContext, adding full metastore operations, HiveQL parsing, SerDe support, and integration with the wider Hive ecosystem.

Package Information

  • Package Name: org.apache.spark:spark-hive_2.11
  • Package Type: Maven
  • Language: Scala
  • Version: 1.6.3
  • Installation: Add to your Maven or sbt dependencies, or include the jar on the Spark classpath (see the sbt sketch below)
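
For sbt builds, the dependency can be declared as follows. This is a minimal sketch: the exact version and the provided scope are assumptions that depend on your Spark deployment.

// Requires scalaVersion := "2.11.x" so that %% resolves to spark-hive_2.11
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.6.3" % "provided"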

Core Imports

// Main entry point extending SQLContext
import org.apache.spark.sql.hive.HiveContext
// Metastore client abstractions
import org.apache.spark.sql.hive.client._
// Hive-specific physical operators
import org.apache.spark.sql.hive.execution._
// HiveQL parser and ObjectInspector conversions
import org.apache.spark.sql.hive.HiveQl
import org.apache.spark.sql.hive.HiveInspectors

// For Java applications
import org.apache.spark.api.java.JavaSparkContext

// Client types
import org.apache.spark.sql.hive.client.{HiveTable, HiveColumn, HiveDatabase}
import org.apache.spark.sql.hive.client.{ExternalTable, ManagedTable, VirtualView, IndexTable}

// Function support
import org.apache.spark.sql.hive.HiveFunctionRegistry

Basic Usage

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Create Spark context and HiveContext
val conf = new SparkConf().setAppName("HiveApp")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Execute HiveQL queries
val result = hiveContext.sql("SELECT * FROM my_hive_table")
result.show()

// Access Hive metastore
hiveContext.refreshTable("my_database.my_table")

// Register the result of a Hive query as a temporary table (Spark 1.6 API;
// CREATE TEMPORARY VIEW syntax is not available until Spark 2.0)
hiveContext.sql("SELECT * FROM hive_table WHERE year = 2023").registerTempTable("temp_view")

Architecture

The Spark Hive integration is built around several key components:

  • HiveContext: Main entry point extending SQLContext with Hive capabilities
  • Client Layer: Abstracted Hive client interface supporting multiple Hive versions (0.12.0 to 1.2.1)
  • Execution Layer: Hive-specific physical operators for table scans, insertions, and transformations
  • Parser: HiveQL to Catalyst logical plan translation with full HiveQL syntax support
  • Type System: Bidirectional type conversion between Catalyst and Hive ObjectInspectors
  • Metastore Integration: Complete CRUD operations for databases, tables, partitions, and metadata
  • SerDe Support: Integration with Hive SerDe libraries for various data formats
  • ORC Integration: Native ORC file format support with predicate pushdown optimizations

Capabilities

HiveContext - Main Integration Interface

Primary entry point providing full Hive integration capabilities including context management, configuration, and session handling.

class HiveContext(sc: SparkContext) extends SQLContext

// Constructors
def this(sc: SparkContext): HiveContext
def this(sc: JavaSparkContext): HiveContext

// Core methods
def newSession(): HiveContext
def refreshTable(tableName: String): Unit
def analyze(tableName: String): Unit
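
A short usage sketch of these methods; the table name is illustrative.

// New session with isolated configuration and temporary tables,
// sharing the same SparkContext, metastore client, and cached data
val session = hiveContext.newSession()

// Invalidate cached metadata after the table changed outside Spark
hiveContext.refreshTable("sales")

// Collect table statistics for the optimizer (e.g., broadcast join sizing)
hiveContext.analyze("sales")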


Hive Client Interface

Abstracted interface for interacting with different versions of Hive metastore, providing database and table operations with version compatibility.

trait ClientInterface {
  def version: HiveVersion
  def currentDatabase: String
  def getTable(dbName: String, tableName: String): HiveTable
  def createTable(table: HiveTable): Unit
  def runSqlHive(sql: String): Seq[String]
}

case class HiveTable(
  specifiedDatabase: Option[String],
  name: String,
  schema: Seq[HiveColumn],
  partitionColumns: Seq[HiveColumn],
  properties: Map[String, String],
  tableType: TableType
)
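
A sketch of building a table descriptor against the signatures above. The field values are hypothetical, and obtaining a ClientInterface directly is internal API in 1.6 (HiveContext creates and owns the client).

// Hypothetical descriptor for a partitioned managed table
val table = HiveTable(
  specifiedDatabase = Some("analytics"),
  name = "events",
  schema = Seq(HiveColumn("id", "bigint", "event identifier")),
  partitionColumns = Seq(HiveColumn("dt", "string", "partition date")),
  properties = Map.empty,
  tableType = ManagedTable
)
// client.createTable(table)  // client: ClientInterface, held internally by HiveContext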


HiveQL Parser

Converts HiveQL syntax to Catalyst logical plans, supporting full HiveQL language features including DDL, DML, and complex expressions.

object HiveQl {
  def parseSql(sql: String): LogicalPlan
}

case class CreateTableAsSelect(
  tableDesc: HiveTable,
  child: LogicalPlan,
  allowExisting: Boolean
) extends LogicalPlan
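
For illustration, parsing a statement yields an unresolved Catalyst plan; no execution happens here (HiveQl is an internal object in 1.6).

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Translate HiveQL text into a logical plan and inspect its structure
val plan: LogicalPlan = HiveQl.parseSql("SELECT key, count(*) FROM src GROUP BY key")
println(plan.treeString)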


Execution Engine

Physical operators for Hive-specific operations including table scans, data insertion, script transformations, and native command execution.

case class HiveTableScan(
  attributes: Seq[Attribute],
  relation: MetastoreRelation,
  partitionPruningPred: Option[Expression]
) extends LeafNode

case class InsertIntoHiveTable(
  table: MetastoreRelation,
  partition: Map[String, Option[String]],
  child: SparkPlan,
  overwrite: Boolean
) extends UnaryNode
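
These operators show up in a query's physical plan and can be inspected with explain; the table and partition values are illustrative.

// A query over a partitioned Hive table plans to a HiveTableScan node;
// the dt predicate feeds partition pruning where possible
hiveContext.sql("SELECT * FROM src WHERE dt = '2023-01-01'").explain()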


ORC File Format Support

Native ORC file format integration with predicate pushdown, column pruning, and schema evolution support.

class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister {
  def shortName(): String = "orc"
}

case class OrcRelation(
  location: String,
  parameters: Map[String, String]
)(sqlContext: SQLContext) extends HadoopFsRelation
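
A sketch of reading and writing ORC through the data source API; the paths are illustrative. Note that ORC predicate pushdown is disabled by default in Spark 1.6.

// Enable ORC predicate pushdown (spark.sql.orc.filterPushdown defaults to false)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// The provider is registered under the "orc" short name
val events = hiveContext.read.format("orc").load("/data/events")
events.filter(events("year") === 2023).write.format("orc").save("/data/events_2023")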


Type System and Inspectors

Bidirectional type conversion utilities between Spark SQL Catalyst types and Hive ObjectInspectors for seamless data exchange.

trait HiveInspectors {
  def toInspector(dataType: DataType): ObjectInspector
  def unwrap(data: Any, oi: ObjectInspector): Any
  def wrap(a: Any, oi: ObjectInspector): AnyRef
}
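
Conceptually, the round trip looks like the sketch below. HiveInspectors is package-private in 1.6, so this illustrates the data flow inside the hive module rather than a public API.

// Within org.apache.spark.sql.hive, a component mixing in HiveInspectors can:
// val oi = toInspector(StringType)           // Catalyst type -> Hive ObjectInspector
// val hiveValue = wrap(value, oi)            // Catalyst value -> Hive representation
// val catalystValue = unwrap(hiveValue, oi)  // Hive representation -> Catalyst value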


UDF and Function Support

Support for User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and User-Defined Table Functions (UDTFs) with comprehensive Hive compatibility.

class HiveFunctionRegistry(underlying: FunctionRegistry, executionHive: ClientWrapper) 
  extends FunctionRegistry with HiveInspectors {
  
  def getFunctionInfo(name: String): FunctionInfo
  def lookupFunction(name: String, children: Seq[Expression]): Expression
}
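
In practice, functions registered through HiveQL resolve via this registry; the UDF class name below is hypothetical.

// Register a Hive UDF from a jar on the classpath, then call it from SQL
hiveContext.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpperUDF'")
hiveContext.sql("SELECT my_upper(name) FROM people").show()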


Configuration

Key configuration properties for Hive integration:

// Metastore configuration
val HIVE_METASTORE_VERSION: SQLConfEntry[String]
val HIVE_METASTORE_JARS: SQLConfEntry[String]

// Format conversion settings
val CONVERT_METASTORE_PARQUET: SQLConfEntry[Boolean]
val CONVERT_CTAS: SQLConfEntry[Boolean]

// Class loading configuration
val HIVE_METASTORE_SHARED_PREFIXES: SQLConfEntry[Seq[String]]
val HIVE_METASTORE_BARRIER_PREFIXES: SQLConfEntry[Seq[String]]

Key Settings

  • spark.sql.hive.metastore.version - Hive metastore version (default: "1.2.1")
  • spark.sql.hive.metastore.jars - Hive JAR location ("builtin", "maven", or path)
  • spark.sql.hive.convertMetastoreParquet - Convert Parquet tables (default: true)
  • spark.sql.hive.convertCTAS - Convert CTAS statements (default: false)
  • spark.sql.hive.thriftServer.async - Async thrift server (default: true)
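
For example, to target an external Hive 0.13 metastore (a sketch; the version and paths are illustrative):

import org.apache.spark.SparkConf

// metastore.jars accepts "builtin", "maven", or a standard classpath string
val conf = new SparkConf()
  .setAppName("HiveApp")
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "/opt/hive-0.13.1/lib/*:/opt/hadoop/lib/*")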

Types

Core Data Model Types

case class HiveDatabase(
  name: String,
  location: String
)

case class HiveColumn(
  name: String,
  hiveType: String,
  comment: String
)

case class HiveStorageDescriptor(
  location: String,
  inputFormat: String,
  outputFormat: String,
  serde: String,
  serdeProperties: Map[String, String]
)

case class HivePartition(
  values: Seq[String],
  storage: HiveStorageDescriptor
)

abstract class TableType { val name: String }
case object ExternalTable extends TableType { val name = "EXTERNAL_TABLE" }
case object ManagedTable extends TableType { val name = "MANAGED_TABLE" }
case object VirtualView extends TableType { val name = "VIRTUAL_VIEW" }
case object IndexTable extends TableType { val name = "INDEX_TABLE" }
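
Table types can be matched to branch on storage semantics, for example:

// Sketch: describing metastore entries by table type
def describe(t: TableType): String = t match {
  case ManagedTable  => "warehouse-managed data; dropped with the table"
  case ExternalTable => "data outside the warehouse; only metadata is dropped"
  case VirtualView   => "logical view with no backing storage"
  case other         => other.name
}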

Version Support Types

abstract class HiveVersion(
  fullVersion: String,
  extraDeps: Seq[String],
  exclusions: Seq[String]
)

// Supported versions: v12, v13, v14, v1_0, v1_1, v1_2

Command Types

case class AnalyzeTable(tableName: String) extends RunnableCommand
case class DropTable(tableName: String, ifExists: Boolean) extends RunnableCommand
case class AddJar(path: String) extends RunnableCommand
case class CreateMetastoreDataSource(...) extends RunnableCommand
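
These commands are produced for statements executed through sql(); the jar path is illustrative.

// Each statement below plans to one of the commands above
hiveContext.sql("ANALYZE TABLE sales COMPUTE STATISTICS noscan")  // AnalyzeTable
hiveContext.sql("DROP TABLE IF EXISTS staging")                   // DropTable
hiveContext.sql("ADD JAR /opt/jars/custom-serde.jar")             // AddJar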