Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL with schema evolution capabilities.
```bash
npx @tessl/cli install tessl/maven-org-apache-spark--avro@3.5.0
```

The Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL. This library enables seamless conversion between Spark's Catalyst data types and Avro format, supporting schema evolution, complex nested data structures, and efficient serialization for large-scale data processing.
For Maven:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.13</artifactId>
    <version>3.5.6</version>
</dependency>
```

For SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.5.6"
```

Core imports:

```scala
// For reading/writing Avro files as DataFrames
import org.apache.spark.sql.SparkSession
// For binary Avro data conversion functions
import org.apache.spark.sql.avro.functions._
// For schema conversion utilities
import org.apache.spark.sql.avro.SchemaConverters
// For DataFrames and Column operations
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions._
```

```scala
val spark = SparkSession.builder()
  .appName("Avro Example")
  .getOrCreate()
// Read Avro files into DataFrame
val df = spark.read
  .format("avro")
  .load("path/to/avro/files")

df.show()
```

```scala
// Write DataFrame to Avro format
df.write
  .format("avro")
  .option("compression", "snappy")
  .save("path/to/output")
```

```scala
import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions._

val avroSchema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

// Convert binary Avro data to Catalyst format
val decodedDF = df.select(
  from_avro(col("avro_data"), avroSchema).as("decoded")
)

// Convert Catalyst data to binary Avro format
val encodedDF = df.select(
  to_avro(struct(col("id"), col("name"))).as("avro_data")
)
```

The Spark Avro connector consists of several key components:
Read and write Avro files with automatic schema inference and configurable options.
```scala
// Key APIs for file operations
def format(source: String): DataFrameReader // Use "avro" as source
def option(key: String, value: String): DataFrameReader
def load(path: String*): DataFrame
def save(path: String): Unit
```
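As a sketch of how these pieces compose, the write below layers several documented options onto one save; the record name, namespace, partition column, and paths are illustrative assumptions, not values the connector requires.

```scala
// Hypothetical write combining common Avro options; assumes df from the
// quick-start example and that it has a "year" column for partitioning.
df.write
  .format("avro")
  .option("compression", "deflate")          // also: snappy, bzip2, xz, zstandard
  .option("recordName", "UserRecord")        // top-level record name in the output schema
  .option("recordNamespace", "com.example")  // namespace in the output schema
  .partitionBy("year")                       // assumes df has a "year" column
  .save("path/to/partitioned/output")
```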
Convert between binary Avro data and Spark DataFrame columns using built-in functions.

```scala
// Key APIs for binary conversion
def from_avro(data: Column, jsonFormatSchema: String): Column
def from_avro(data: Column, jsonFormatSchema: String, options: java.util.Map[String, String]): Column
def to_avro(data: Column): Column
def to_avro(data: Column, jsonFormatSchema: String): Column
```

Binary Conversion Documentation
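The options overload controls how malformed binary records are handled. A minimal sketch, assuming the df and avroSchema values from the quick-start example and a binary avro_data column:

```scala
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// "mode" -> "PERMISSIVE" yields null for corrupt records instead of
// failing the query (the default mode is FAILFAST).
val tolerantDF = df.select(
  from_avro(col("avro_data"), avroSchema, Map("mode" -> "PERMISSIVE").asJava)
    .as("decoded")
)
```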
Convert between Avro schemas and Spark SQL schemas for interoperability.
```scala
// Key APIs for schema conversion
def toSqlType(avroSchema: Schema): SchemaType
def toSqlType(avroSchema: Schema, options: Map[String, String]): SchemaType
def toAvroType(catalystType: DataType, nullable: Boolean, recordName: String, nameSpace: String): Schema
case class SchemaType(dataType: DataType, nullable: Boolean)
```

Schema Conversion Documentation
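A minimal round-trip sketch of these converters; the record name and namespace passed to toAvroType are illustrative:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// Avro -> Spark SQL: parse a JSON schema, then inspect the Catalyst type.
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": ["null", "string"], "default": null}
    |]}""".stripMargin
)
val schemaType = SchemaConverters.toSqlType(avroSchema)
println(schemaType.dataType)  // StructType with id (non-nullable) and name (nullable)

// Spark SQL -> Avro: nullable Catalyst fields become ["null", ...] unions.
val catalystType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val generated = SchemaConverters.toAvroType(
  catalystType, nullable = false, recordName = "User", nameSpace = "com.example"
)
println(generated.toString(true))
```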
Comprehensive configuration options for reading, writing, and data conversion operations.
```scala
// Example configuration usage
val df = spark.read
  .format("avro")
  .option("avroSchema", customSchemaJson)
  .option("mode", "PERMISSIVE")
  .option("positionalFieldMatching", "true")
  .load("path/to/files")
```

Configuration Options Documentation
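The example above covers the read path; write behavior can also be tuned, both per write and through session-level SQL configs. A sketch, assuming the customSchemaJson string from the read example:

```scala
// Session-level codec settings, applied when the codec is "deflate".
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

// Per-write option: avroSchema enforces an explicit writer schema.
df.write
  .format("avro")
  .option("avroSchema", customSchemaJson)
  .save("path/to/configured/output")
```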
Support for all Spark SQL data types including complex nested structures, logical types, and custom types.
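To make that concrete, here is a sketch that round-trips several of those types through Avro; the column names and output path are illustrative assumptions:

```scala
import org.apache.spark.sql.functions._

// Columns exercising nested structures and Avro logical types.
val typedDF = spark.range(3).select(
  col("id"),
  lit(BigDecimal("12.34")).cast("decimal(10,2)").as("price"),  // decimal logical type
  current_timestamp().as("updated_at"),                        // timestamp-micros
  array(lit(1), lit(2)).as("scores"),                          // Avro array
  map(lit("k"), lit("v")).as("attrs"),                         // Avro map
  struct(lit("inner").as("s")).as("nested")                    // nested record
)

typedDF.write.format("avro").mode("overwrite").save("path/to/typed/output")

// Reading back preserves the Catalyst schema, including the nested struct.
spark.read.format("avro").load("path/to/typed/output").printSchema()
```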