tessl/maven-org-apache-spark--spark-hive

tessl install tessl/maven-org-apache-spark--spark-hive@1.6.0

Apache Spark SQL Hive integration module providing HiveContext, metastore operations, HiveQL parsing, and Hive data format compatibility

UDF and Function Support

Apache Spark Hive integration provides comprehensive support for User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and User-Defined Table Functions (UDTFs). This includes both Hive-native functions and Spark SQL function integration.

Required Imports

import org.apache.spark.sql.hive.HiveFunctionRegistry
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.hadoop.hive.ql.exec.FunctionInfo
import org.apache.hadoop.hive.ql.udf.generic._

Function Registry

HiveFunctionRegistry

class HiveFunctionRegistry(
  underlying: FunctionRegistry, 
  executionHive: ClientWrapper
) extends FunctionRegistry with HiveInspectors {
  
  def getFunctionInfo(name: String): FunctionInfo
  def lookupFunction(name: String, children: Seq[Expression]): Expression
  def registerFunction(name: FunctionIdentifier, builder: Seq[Expression] => Expression): Unit
}

HiveFunctionRegistry - Registry for both Hive and Spark SQL functions

getFunctionInfo - Retrieves Hive function metadata and signature information

lookupFunction - Resolves function name to executable expression with type checking

registerFunction - Registers new function with given name and expression builder

Usage Examples:

val hiveContext = new HiveContext(sc)
// Note: functionRegistry is internal (protected[sql]) in Spark 1.6;
// shown here to illustrate the registry API
val registry = hiveContext.functionRegistry

// Get function information
val funcInfo = registry.getFunctionInfo("concat")
println(s"Function class: ${funcInfo.getFunctionClass}")

// Look up function with arguments
val concatExpr = registry.lookupFunction("concat", Seq(stringLit1, stringLit2))

// Register custom function
registry.registerFunction(
  FunctionIdentifier("my_custom_func"),
  (args: Seq[Expression]) => new MyCustomFunction(args)
)

Built-in Function Support

Hive Built-in Functions

The integration supports all standard Hive built-in functions:

String Functions:

  • concat, concat_ws, substr, substring, length, trim, ltrim, rtrim
  • upper, lower, reverse, split, regexp_replace, regexp_extract
  • like, rlike, instr, locate, ascii, base64, unbase64

Mathematical Functions:

  • abs, acos, asin, atan, atan2, ceil, ceiling, cos, cosh
  • exp, floor, ln, log, log10, log2, negative, pi, positive
  • pow, power, rand, round, sign, sin, sinh, sqrt, tan, tanh

Date/Time Functions:

  • year, month, day, hour, minute, second, weekofyear
  • date_add, date_sub, datediff, date_format, from_unixtime, unix_timestamp
  • to_date, trunc, months_between, add_months, last_day, next_day

Aggregate Functions:

  • count, sum, avg, min, max, variance, var_pop, var_samp
  • stddev, stddev_pop, stddev_samp, collect_list, collect_set
  • percentile, percentile_approx, histogram_numeric

Usage Examples:

// Use built-in functions in SQL
val result = hiveContext.sql("""
  SELECT 
    concat(first_name, ' ', last_name) as full_name,
    year(birth_date) as birth_year,
    round(salary * 1.1, 2) as adjusted_salary
  FROM employees
  WHERE length(first_name) > 3
""")

// Use built-in functions in DataFrame API
import org.apache.spark.sql.functions._
val df = hiveContext.table("employees")
  .select(
    concat(col("first_name"), lit(" "), col("last_name")).as("full_name"),
    year(col("birth_date")).as("birth_year")
  )

Custom UDF Integration

Hive UDF Classes

// Hive function base classes recognized by the integration:
org.apache.hadoop.hive.ql.exec.UDF                   // simple UDFs (reflection-based)
org.apache.hadoop.hive.ql.udf.generic.GenericUDF     // type-aware UDFs using ObjectInspectors
org.apache.hadoop.hive.ql.udf.generic.GenericUDTF    // table-generating functions
org.apache.hadoop.hive.ql.exec.UDAF                  // legacy aggregates (bridged)
org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver // modern aggregates

// Aggregation logic for a resolver lives in its evaluator:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator
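
For reference, a simple UDF is just a class with one or more evaluate methods that Hive resolves by reflection. A minimal hypothetical sketch, matching the MyStringUDF class registered in the examples below:

package com.company.hive

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hypothetical simple UDF: upper-cases its input. Hive maps arguments
// to evaluate() by reflection, so no ObjectInspector handling is needed.
class MyStringUDF extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null else new Text(input.toString.toUpperCase)
  }
}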

Usage Examples:

// Register Hive UDF class
hiveContext.sql("ADD JAR /path/to/my-udfs.jar")
hiveContext.sql("""
  CREATE TEMPORARY FUNCTION my_string_func 
  AS 'com.company.hive.MyStringUDF'
""")

// Use registered UDF
val result = hiveContext.sql("""
  SELECT customer_id, my_string_func(customer_name) as processed_name
  FROM customers
""")

// Register UDAF
hiveContext.sql("""
  CREATE TEMPORARY FUNCTION my_aggregate_func
  AS 'com.company.hive.MyAggregateUDAF'
""")

val aggregateResult = hiveContext.sql("""
  SELECT 
    department,
    my_aggregate_func(salary) as custom_aggregate
  FROM employees
  GROUP BY department
""")

Function Wrapper Classes

HiveGenericUDF

case class HiveGenericUDF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Expression with HiveInspectors with CodegenFallback

HiveGenericUDF - Wrapper for Hive GenericUDF implementations

Usage in query planning:

// This is typically handled internally by the query planner
// when Hive UDFs are encountered in SQL queries
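
For context on what the wrapper adapts, a minimal hypothetical GenericUDF implementation might look like this (illustrative names; initialize declares the return type, evaluate runs once per row):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical GenericUDF that trims a string argument
class MyTrimUDF extends GenericUDF {
  // Validate argument count and declare the return type
  override def initialize(args: Array[ObjectInspector]): ObjectInspector = {
    if (args.length != 1) {
      throw new UDFArgumentException("my_trim expects exactly one argument")
    }
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  // Arguments are evaluated lazily via DeferredObject
  override def evaluate(args: Array[DeferredObject]): AnyRef = {
    val value = args(0).get()
    if (value == null) null else value.toString.trim
  }

  override def getDisplayString(children: Array[String]): String =
    s"my_trim(${children.mkString(", ")})"
}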

HiveGenericUDTF

case class HiveGenericUDTF(
  name: String,
  funcWrapper: HiveFunctionWrapper,
  children: Seq[Expression]
) extends Generator with HiveInspectors with CodegenFallback

HiveGenericUDTF - Wrapper for Hive table-generating functions

Usage Examples:

// UDTF usage with LATERAL VIEW
val result = hiveContext.sql("""
  SELECT 
    customer_id,
    tag
  FROM customers
  LATERAL VIEW explode(tags) exploded_table AS tag
""")

// Custom UDTF
hiveContext.sql("""
  CREATE TEMPORARY FUNCTION my_explode_func
  AS 'com.company.hive.MyExplodeUDTF'
""")

val customResult = hiveContext.sql("""
  SELECT 
    id,
    value
  FROM input_table
  LATERAL VIEW my_explode_func(data_column) t AS value
""")

Function Registration and Management

Dynamic Function Registration

// Conceptual operations; in practice these are issued as SQL
// statements (see the examples below)

// Add JAR containing UDFs
def addJar(path: String): Unit

// Create temporary function
def createFunction(
  name: String,
  className: String,
  resources: Seq[String] = Seq.empty
): Unit

// Drop temporary function
def dropFunction(name: String): Unit

Usage Examples:

// Add UDF JAR to classpath
hiveContext.sql("ADD JAR hdfs://namenode:port/path/to/udf.jar")

// Create temporary function
hiveContext.sql("""
  CREATE TEMPORARY FUNCTION advanced_analytics
  AS 'com.analytics.AdvancedAnalyticsUDF'
  USING JAR 'hdfs://namenode:port/path/to/analytics.jar'
""")

// Use in queries
val insights = hiveContext.sql("""
  SELECT 
    customer_segment,
    advanced_analytics(purchase_history, demographic_data) as insights
  FROM customer_data
  GROUP BY customer_segment
""")

// Clean up function
hiveContext.sql("DROP TEMPORARY FUNCTION advanced_analytics")

Function Inspection

// List available functions
def listFunctions(): Array[String]
def listFunctions(pattern: String): Array[String]

// Describe function
def describeFunction(name: String): String
def describeFunction(name: String, extended: Boolean): String

Usage Examples:

// List all functions
val allFunctions = hiveContext.sql("SHOW FUNCTIONS").collect()

// List functions matching a pattern (interpreted as a regular expression)
val stringFunctions = hiveContext.sql("SHOW FUNCTIONS 'str.*'").collect()

// Get function description
val description = hiveContext.sql("DESCRIBE FUNCTION concat").collect()
val extendedDesc = hiveContext.sql("DESCRIBE FUNCTION EXTENDED concat").collect()

Error Handling and Troubleshooting

Common UDF-related issues:

ClassNotFoundException - UDF JAR not in classpath

// Solution: Add JAR before creating function
hiveContext.sql("ADD JAR /path/to/udf.jar")

UDFArgumentException - Incorrect argument types or count

// Check function signature and argument types
// Verify UDF implementation matches expected signature

SerializationException - UDF not serializable

// Ensure UDF classes implement Serializable if needed
// Use transient fields for non-serializable components
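
As an illustration of the last point, a hedged sketch of the @transient pattern (hypothetical class; the formatter is rebuilt lazily after deserialization rather than shipped with the UDF):

import java.text.SimpleDateFormat

import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical UDF holding helper state that should not be serialized:
// @transient + lazy val recreates the field on each executor as needed
class MyDateFormatUDF extends UDF {
  @transient private lazy val formatter = new SimpleDateFormat("yyyy-MM-dd")

  def evaluate(timestampMillis: java.lang.Long): String = {
    if (timestampMillis == null) null
    else formatter.format(new java.util.Date(timestampMillis))
  }
}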

Performance Considerations

  • Native Functions: Prefer built-in Spark SQL functions over Hive UDFs when available
  • GenericUDF vs UDF: GenericUDF provides better type safety and performance
  • UDAF Performance: Custom UDAFs may have significant overhead compared to built-in aggregates
  • JAR Loading: Loading large UDF JARs can impact cluster startup time
  • Serialization: Complex UDF objects may cause serialization overhead

Optimization Tips:

  • Use vectorized built-in functions when possible
  • Minimize object creation in UDF evaluate methods
  • Cache expensive computations in UDF initialization
  • Consider using Spark SQL native functions instead of Hive UDFs for better performance
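
To make the first and last tips concrete, a brief comparison, assuming my_string_func is the hypothetical upper-casing UDF from the earlier sketch:

import org.apache.spark.sql.functions._

// Hive UDF call: routed through the Hive wrapper and ObjectInspector conversions
val viaHiveUdf = hiveContext.sql(
  "SELECT my_string_func(customer_name) FROM customers")

// Equivalent built-in function: avoids the Hive wrapper path entirely
val viaNative = hiveContext.table("customers")
  .select(upper(col("customer_name")))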