or run

tessl search

Version

Workspace: tessl
Visibility: Public
Created: 4 months ago
Last updated: 4 months ago
Describes: pkg:maven/org.apache.spark/spark-hive_2.11@1.6.x

docs

client-interface.md execution-engine.md hive-context.md hiveql-parser.md index.md orc-support.md type-system.md udf-support.md

tile.json

tessl/maven-org-apache-spark--spark-hive

tessl install tessl/maven-org-apache-spark--spark-hive@1.6.0

Apache Spark SQL Hive integration module providing HiveContext, metastore operations, HiveQL parsing, and Hive data format compatibility

HiveQL Parser

The HiveQL Parser provides comprehensive translation from HiveQL syntax to Spark SQL Catalyst logical plans. It supports the full HiveQL language specification including DDL, DML, complex expressions, window functions, and Hive-specific extensions.

Required Imports

import org.apache.spark.sql.hive.HiveQl
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.hadoop.hive.ql.parse.ASTNode

Core Parser

HiveQl Object

object HiveQl {
  def parseSql(sql: String): LogicalPlan
  def createPlan(sql: String): LogicalPlan
  def parseDdl(ddl: String): Seq[Attribute]
  def getAst(sql: String): ASTNode
}

parseSql - Converts HiveQL string to Catalyst LogicalPlan representation

createPlan - Alternative method for creating logical plans from SQL

parseDdl - Parses DDL statements and returns column attributes

getAst - Returns the raw Hive AST (Abstract Syntax Tree) for the SQL

The parser handles:

Data Definition Language (DDL): CREATE, ALTER, DROP statements
Data Manipulation Language (DML): SELECT, INSERT, UPDATE, DELETE
Complex expressions: CASE, window functions, UDFs
Hive-specific syntax: LATERAL VIEW, DISTRIBUTE BY, SORT BY
SerDe specifications and storage formats

Usage Examples:

import org.apache.spark.sql.hive.HiveQl

// Parse SELECT query
val selectPlan = HiveQl.parseSql("""
  SELECT customer_id, sum(amount) as total
  FROM sales 
  WHERE year = 2023 
  GROUP BY customer_id
  HAVING total > 1000
""")

// Parse CREATE TABLE with SerDe
val createTablePlan = HiveQl.parseSql("""
  CREATE TABLE staging.orders (
    order_id bigint,
    customer_id string,
    amount decimal(10,2)
  )
  PARTITIONED BY (year int, month int)
  STORED AS PARQUET
  LOCATION '/warehouse/staging/orders'
""")

// Parse complex query with window functions
val windowPlan = HiveQl.parseSql("""
  SELECT 
    product_id,
    sale_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_date) as rank,
    LAG(amount, 1) OVER (PARTITION BY product_id ORDER BY sale_date) as prev_amount
  FROM sales_data
""")

DDL Command Support

Table Creation

case class CreateTableAsSelect(
  tableDesc: HiveTable,
  child: LogicalPlan,
  allowExisting: Boolean
) extends LogicalPlan

CreateTableAsSelect - Represents CREATE TABLE AS SELECT operations

Usage Example:

val ctasPlan = HiveQl.parseSql("""
  CREATE TABLE analytics.customer_summary AS
  SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_spent
  FROM orders
  GROUP BY customer_id
""")

View Creation

case class CreateViewAsSelect(
  tableDesc: HiveTable,
  child: LogicalPlan,
  allowExisting: Boolean,
  replace: Boolean,
  sql: String
) extends LogicalPlan

CreateViewAsSelect - Represents CREATE VIEW operations with original SQL text

Usage Example:

val viewPlan = HiveQl.parseSql("""
  CREATE OR REPLACE VIEW analytics.monthly_sales AS
  SELECT 
    YEAR(order_date) as year,
    MONTH(order_date) as month,
    SUM(amount) as total_sales
  FROM orders
  GROUP BY YEAR(order_date), MONTH(order_date)
""")

Expression Parsing

Supported Expression Types

The parser supports all HiveQL expression types:

Arithmetic Operations:

Basic: +, -, *, /, %
Functions: ABS, ROUND, CEILING, FLOOR

String Operations:

Functions: CONCAT, SUBSTR, LENGTH, TRIM
Pattern matching: LIKE, RLIKE, REGEXP

Date/Time Operations:

Functions: YEAR, MONTH, DAY, DATE_ADD, DATEDIFF
Formatting: DATE_FORMAT, UNIX_TIMESTAMP

Conditional Logic:

CASE WHEN ... THEN ... ELSE ... END
IF(condition, true_value, false_value)
COALESCE, NULLIF

Aggregate Functions:

Standard: COUNT, SUM, AVG, MIN, MAX
Advanced: COLLECT_LIST, COLLECT_SET
Statistical: STDDEV, VARIANCE, PERCENTILE

Window Functions:

Ranking: ROW_NUMBER, RANK, DENSE_RANK
Offset: LAG, LEAD, FIRST_VALUE, LAST_VALUE
Aggregate: All aggregate functions with OVER clause

Usage Examples:

// Complex CASE expression
val caseExpr = HiveQl.parseSql("""
  SELECT 
    customer_id,
    CASE 
      WHEN total_orders > 50 THEN 'Premium'
      WHEN total_orders > 10 THEN 'Regular'
      ELSE 'Basic'
    END as customer_tier
  FROM customer_stats
""")

// Window function with frame specification
val windowFrame = HiveQl.parseSql("""
  SELECT 
    sale_date,
    amount,
    SUM(amount) OVER (
      ORDER BY sale_date 
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total
  FROM daily_sales
""")

Hive-Specific Features

LATERAL VIEW

val lateralViewPlan = HiveQl.parseSql("""
  SELECT 
    customer_id,
    tag
  FROM customers
  LATERAL VIEW explode(tags) exploded_table AS tag
  WHERE tag LIKE 'premium%'
""")

DISTRIBUTE BY and SORT BY

val distributePlan = HiveQl.parseSql("""
  SELECT customer_id, order_date, amount
  FROM orders
  DISTRIBUTE BY customer_id
  SORT BY order_date
""")

Custom SerDes and Storage Formats

val customSerDePlan = HiveQl.parseSql("""
  CREATE TABLE json_data (
    id bigint,
    data string
  )
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/data/json_files'
""")

Parser Constants

SQL Keywords and Operators

// Aggregate functions
val COUNT: String
val SUM: String
val AVG: String
val MIN: String
val MAX: String

// Logical operators
val AND: String
val OR: String
val NOT: String

// Comparison operators
val LIKE: String
val RLIKE: String
val REGEXP: String
val IN: String

// Conditional expressions
val CASE: String
val WHEN: String
val THEN: String
val ELSE: String

// Window function frame boundaries
val PRECEDING: String
val FOLLOWING: String
val CURRENT: String
val UNBOUNDED: String

// Data types
val STRING: String
val BIGINT: String
val DOUBLE: String
val BOOLEAN: String
val TIMESTAMP: String
val DECIMAL: String

Advanced Parsing Features

User-Defined Functions (UDFs)

The parser supports all UDF types:

// Built-in UDFs
val builtinUDF = HiveQl.parseSql("""
  SELECT regexp_replace(name, '[^a-zA-Z]', '') as clean_name
  FROM customers
""")

// Custom UDFs (assuming registered)
val customUDF = HiveQl.parseSql("""
  SELECT my_custom_function(data, param1, param2) as result
  FROM input_table
""")

// UDAFs (User-Defined Aggregate Functions)
val udafPlan = HiveQl.parseSql("""
  SELECT 
    category,
    my_custom_agg(value) as custom_aggregate
  FROM data_table
  GROUP BY category
""")

Complex Data Types

// Array operations
val arrayPlan = HiveQl.parseSql("""
  SELECT 
    array_col[0] as first_element,
    size(array_col) as array_length
  FROM table_with_arrays
""")

// Map operations  
val mapPlan = HiveQl.parseSql("""
  SELECT 
    map_col['key1'] as value1,
    map_keys(map_col) as all_keys
  FROM table_with_maps
""")

// Struct operations
val structPlan = HiveQl.parseSql("""
  SELECT 
    struct_col.field1,
    struct_col.field2
  FROM table_with_structs
""")

Partitioning and Bucketing

// Dynamic partitioning
val dynamicPartition = HiveQl.parseSql("""
  INSERT INTO TABLE partitioned_table
  PARTITION (year, month)
  SELECT data, year_col, month_col
  FROM source_table
""")

// Bucketed tables
val bucketedTable = HiveQl.parseSql("""
  CREATE TABLE bucketed_data (
    id bigint,
    name string,
    category string
  )
  CLUSTERED BY (id) INTO 32 BUCKETS
  STORED AS ORC
""")

Parser Integration

HiveQLDialect

private[hive] class HiveQLDialect(sqlContext: HiveContext) extends ParserDialect {
  override def parse(sqlText: String): LogicalPlan = {
    sqlContext.executionHive.withHiveState {
      HiveQl.parseSql(sqlText)
    }
  }
}

The HiveQLDialect integrates the parser with HiveContext, ensuring proper Hive state management during parsing.

Error Handling

The parser handles various error conditions:

Syntax Errors: Invalid HiveQL syntax
Semantic Errors: References to non-existent tables or columns
Type Errors: Incompatible data type operations
Function Errors: Unknown or incorrectly used functions

Example Error Handling:

try {
  val plan = HiveQl.parseSql("SELECT * FROM non_existent_table")
} catch {
  case e: ParseException => 
    println(s"Syntax error: ${e.getMessage}")
  case e: AnalysisException => 
    println(s"Semantic error: ${e.getMessage}")
}

Performance Considerations

Parser caches frequently used patterns and expressions
Complex queries with many subqueries may have longer parse times
Window functions and complex expressions add parsing overhead
Large DDL statements (many columns/partitions) require more memory

Optimization Tips:

Use simple column references when possible
Avoid deeply nested subqueries
Pre-validate table and column existence before parsing
Cache parsed plans for repeated queries