Apache Spark SQL connector for reading and writing Avro data with comprehensive serialization capabilities
Apache Spark SQL connector for reading and writing Avro data. The spark-avro library provides serialization, deserialization, and schema conversion for working with the Apache Avro format in Spark SQL applications. It targets big data workloads where Avro's schema evolution and compact binary serialization are essential.
Maven coordinates: org.apache.spark:spark-avro_2.11:2.4.8
Spark package flag for spark-submit or spark-shell:
--packages org.apache.spark:spark-avro_2.11:2.4.8
Maven Dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.8</version>
</dependency>
Note: The spark-avro module is external and is not included in spark-submit or spark-shell by default.
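For example, the module can be pulled in at launch time (a sketch assuming a standard Spark distribution layout):
./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.8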
For Avro functions:
import org.apache.spark.sql.avro._
For DataFrame operations:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.col
Complete example:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.col
// Initialize SparkSession
val spark = SparkSession.builder()
.appName("Spark Avro Example")
.getOrCreate()
import spark.implicits._
// Read from Avro files
val df = spark.read
.format("avro")
.load("path/to/avro/files")
// Write to Avro files
df.write
.format("avro")
.mode("overwrite")
.save("path/to/output")
// Convert binary Avro data in columns
val avroSchema = """{"type": "record", "name": "User", "fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "int"}
]}"""
val transformedDF = df.select(
from_avro(col("avroData"), avroSchema).as("userData"),
to_avro(col("structData")).as("binaryAvro")
)
Spark Avro is built around several key components:
- AvroFileFormat implements Spark's FileFormat and DataSourceRegister interfaces
- SchemaConverters provides bidirectional mapping between Avro and Spark SQL schemas
- AvroSerializer and AvroDeserializer handle data conversion between Catalyst and Avro formats
- from_avro and to_avro expressions for working with binary Avro data in DataFrame columns
- AvroOptions provides comprehensive configuration for read/write operations
Core functions for working with Avro binary data in DataFrame columns, enabling conversion between Catalyst internal format and Avro binary format within SQL queries and DataFrame operations.
def from_avro(data: Column, jsonFormatSchema: String): Column
def to_avro(data: Column): Column
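A minimal roundtrip sketch of these two functions, reusing the User schema JSON from the quick-start example above (the column names and literal values are illustrative, not part of the library API):
import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, lit, struct}
// Encode a non-nullable struct column into Avro binary;
// to_avro derives the writer schema from the column's Catalyst type
val packed = spark.range(1).select(
to_avro(struct(lit("Alice").as("name"), lit(30).as("age"))).as("avroData")
)
// Decode the binary back into a struct using the JSON schema defined earlier
val unpacked = packed.select(from_avro(col("avroData"), avroSchema).as("user"))
unpacked.select("user.name", "user.age").show()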
Comprehensive file-based data source functionality for reading from and writing to Avro files using Spark's standard DataFrame read/write API, with automatic schema inference and configurable options.
// Reading
spark.read.format("avro").option(key, value).load(path)
// Writing
df.write.format("avro").mode(saveMode).option(key, value).save(path)Utilities for converting between Apache Avro schemas and Spark SQL schemas, supporting complex nested types, logical types, and schema evolution patterns.
object SchemaConverters {
def toSqlType(avroSchema: Schema): SchemaType
def toAvroType(catalystType: DataType, nullable: Boolean = false,
recordName: String = "topLevelRecord", nameSpace: String = ""): Schema
}
case class SchemaType(dataType: DataType, nullable: Boolean)
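As a brief sketch of converting in both directions (parsing the same User schema JSON as above; variable names are illustrative):
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
// Avro -> Spark SQL
val parsedAvro = new Schema.Parser().parse(avroSchema)
val sqlSchema = SchemaConverters.toSqlType(parsedAvro)
// sqlSchema.dataType is struct<name:string,age:int>, sqlSchema.nullable is false
// Spark SQL -> Avro: rebuild an Avro record schema from the Catalyst type
val avroType = SchemaConverters.toAvroType(
sqlSchema.dataType, nullable = false, recordName = "User")
println(avroType.toString(true))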
Comprehensive configuration system for customizing Avro read and write operations, including schema specification, compression settings, and file handling options.
class AvroOptions(parameters: Map[String, String], conf: Configuration) {
val schema: Option[String]
val recordName: String
val recordNamespace: String
val ignoreExtension: Boolean
val compression: String
}
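A sketch of wiring these options through the DataFrame reader and writer (the schema field above is set via the "avroSchema" option key; paths are placeholders):
// Read: supply an explicit Avro schema and pick up files without the .avro extension
val users = spark.read
.format("avro")
.option("avroSchema", avroSchema)
.option("ignoreExtension", "true")
.load("path/to/avro/files")
// Write: custom record name/namespace and deflate compression
users.write
.format("avro")
.option("recordName", "User")
.option("recordNamespace", "com.example")
.option("compression", "deflate") // also: uncompressed, snappy, bzip2, xz
.mode("overwrite")
.save("path/to/output")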
// Core Spark SQL types
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
// Avro schema types
import org.apache.avro.Schema
// Configuration and options
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
// Exception types
case class IncompatibleSchemaException(msg: String) extends Exception(msg)