tessl/maven-org-apache-spark--spark-avro_2-13

Avro data source for Apache Spark SQL that provides functionality to read from and write to Avro files with DataFrames and Datasets

Describes: pkg:maven/org.apache.spark/spark-avro_2.13@4.0.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-avro_2-13@4.0.0


Apache Spark Avro

Apache Spark Avro provides comprehensive support for reading from and writing to Apache Avro files in Spark SQL. It enables seamless conversion between Avro binary format and Spark DataFrames/Datasets, with built-in functions for data transformation and robust schema handling.

Package Information

  • Package Name: spark-avro_2.13
  • Package Type: maven
  • Language: Scala
  • Installation: Add to your Maven POM or SBT build file:

Maven:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-avro_2.13</artifactId>
  <version>4.0.0</version>
</dependency>

SBT:

libraryDependencies += "org.apache.spark" %% "spark-avro" % "4.0.0"

Core Imports

import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions.col

For deprecated functions (not recommended):

import org.apache.spark.sql.avro.{from_avro, to_avro}

Schema conversion utilities:

import org.apache.spark.sql.avro.SchemaConverters

Python API:

from pyspark.sql.avro.functions import from_avro, to_avro

Basic Usage

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions._

val spark = SparkSession.builder()
  .appName("AvroExample")
  .getOrCreate()

// Reading Avro files
val df = spark.read.format("avro").load("path/to/avro/files")

// Writing Avro files
df.write.format("avro").save("path/to/output")

// Converting binary Avro data to DataFrame columns
val avroSchema = """{"type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}"""
val decodedDf = df.select(from_avro(col("avro_data"), avroSchema).as("user"))

// Converting DataFrame columns to binary Avro
val encodedDf = df.select(to_avro(col("user_struct")).as("avro_data"))

// Getting schema information via SQL (string interpolation avoids error-prone concatenation)
val schemaDf = spark.sql(s"SELECT schema_of_avro('$avroSchema') AS avro_schema")

Python Example:

from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro, to_avro
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("AvroExample").getOrCreate()

# Reading Avro files
df = spark.read.format("avro").load("path/to/avro/files")

# Writing Avro files
df.write.format("avro").save("path/to/output")

# Converting binary Avro data to DataFrame columns
avro_schema = '{"type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}'
decoded_df = df.select(from_avro(col("avro_data"), avro_schema).alias("user"))

# Converting DataFrame columns to binary Avro
encoded_df = df.select(to_avro(col("user_struct")).alias("avro_data"))
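Inline schema strings like the one above are easy to mistype because of the nested quoting. One convenient pattern is to build the jsonFormatSchema argument with Python's json module instead of writing the JSON by hand (the variable names here are illustrative, not part of the PySpark API):

```python
import json

# Build the Avro schema as a Python dict, then serialize it once.
# This guarantees valid JSON and avoids manual quote-escaping.
user_schema = json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

# user_schema can now be passed wherever a jsonFormatSchema string is
# expected, e.g. from_avro(col("avro_data"), user_schema)
```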

Architecture

Apache Spark Avro is built around several key components:

  • Function API: High-level functions (from_avro, to_avro, schema_of_avro) for data conversion and schema operations
  • DataSource V2: Native Spark integration for reading/writing Avro files with predicate pushdown and column pruning
  • Schema Conversion: Bidirectional conversion between Avro schemas and Spark SQL data types
  • Expression System: Internal Catalyst expressions that power the public functions
  • Compression Support: Multiple compression codecs (Snappy, Deflate, XZ, ZStandard, etc.)
  • Configuration Options: Extensive options for controlling parsing behavior, schema handling, and optimization

Capabilities

Avro Functions

Core functions for converting between binary Avro data and Spark SQL columns. These functions handle schema-based serialization and deserialization with comprehensive type mapping.

Scala:

def from_avro(data: Column, jsonFormatSchema: String): Column
def from_avro(data: Column, jsonFormatSchema: String, options: java.util.Map[String, String]): Column
def to_avro(data: Column): Column
def to_avro(data: Column, jsonFormatSchema: String): Column
def schema_of_avro(jsonFormatSchema: String): Column
def schema_of_avro(jsonFormatSchema: String, options: java.util.Map[String, String]): Column

Python:

def from_avro(data: ColumnOrName, jsonFormatSchema: str, options: Optional[Dict[str, str]] = None) -> Column
def to_avro(data: ColumnOrName, jsonFormatSchema: str = "") -> Column
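To make concrete what from_avro and to_avro do at the byte level, here is a hand-rolled round trip for the User record from the examples above, using Avro's zigzag-varint binary encoding. This is an illustrative sketch of the wire format, not the library's implementation; the helper names are invented for this example:

```python
def _zigzag_encode(n: int) -> bytes:
    """Encode an integer as an Avro zigzag varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)  # high bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def _zigzag_decode(buf: bytes, pos: int):
    """Decode a zigzag varint starting at pos; return (value, new_pos)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def encode_user(name: str, age: int) -> bytes:
    """Avro binary for record {name: string, age: int}: fields in schema order.
    A string is its byte length (zigzag varint) followed by UTF-8 bytes."""
    raw = name.encode("utf-8")
    return _zigzag_encode(len(raw)) + raw + _zigzag_encode(age)

def decode_user(buf: bytes) -> dict:
    """Inverse of encode_user: what from_avro does for this schema."""
    n, pos = _zigzag_decode(buf, 0)
    name = buf[pos:pos + n].decode("utf-8")
    pos += n
    age, _ = _zigzag_decode(buf, pos)
    return {"name": name, "age": age}
```

Note that to_avro produces exactly this kind of raw record encoding per row; the Avro container-file framing (header, sync markers) is handled separately by the file data source.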


File DataSource

Native Spark DataSource V2 implementation for reading and writing Avro files with optimized performance and advanced features like schema inference and predicate pushdown.

// Reading
spark.read.format("avro").load(path)
spark.read.format("avro").option("avroSchema", schema).load(path)

// Writing  
df.write.format("avro").save(path)
df.write.format("avro").option("compression", "snappy").save(path)


Schema Conversion

Developer API for converting between Avro schemas and Spark SQL data types, supporting complex nested structures and logical types.

object SchemaConverters {
  def toSqlType(avroSchema: org.apache.avro.Schema): SchemaType
  def toSqlType(avroSchema: org.apache.avro.Schema, useStableIdForUnionType: Boolean, 
                stableIdPrefixForUnionType: String, recursiveFieldMaxDepth: Int): SchemaType
  def toAvroType(catalystType: DataType, nullable: Boolean, recordName: String, nameSpace: String): org.apache.avro.Schema
}

case class SchemaType(dataType: DataType, nullable: Boolean)
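The primitive-type correspondences that toSqlType applies can be summarized in a small lookup table. The sketch below is a reference table in Python, not the converter itself; complex types (record, array, map, union) and logical types need the real API:

```python
# Standard Avro primitive -> Spark SQL type mapping, as performed by
# SchemaConverters.toSqlType. Values are Spark DataType class names.
AVRO_TO_SPARK_SQL = {
    "boolean": "BooleanType",
    "int": "IntegerType",
    "long": "LongType",
    "float": "FloatType",
    "double": "DoubleType",
    "bytes": "BinaryType",
    "string": "StringType",
    # Complex types map structurally: record -> StructType,
    # array -> ArrayType, map -> MapType (string keys only in Avro).
}

def to_sql_type_name(avro_type: str) -> str:
    """Look up the Spark SQL type name for an Avro primitive type."""
    return AVRO_TO_SPARK_SQL[avro_type]
```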


Configuration Options

Comprehensive configuration options for controlling Avro processing behavior, including compression, schema handling, and parsing modes.

// Common options
Map(
  "compression" -> "snappy",
  "avroSchema" -> jsonSchema,
  "mode" -> "PERMISSIVE",
  "ignoreExtension" -> "true"
)


Types

// Schema conversion result
case class SchemaType(dataType: DataType, nullable: Boolean)
// Compression codec enumeration
public enum AvroCompressionCodec {
    UNCOMPRESSED("null", false),
    DEFLATE("deflate", true), 
    SNAPPY("snappy", false),
    BZIP2("bzip2", false),
    XZ("xz", true),
    ZSTANDARD("zstandard", true);
    
    public String getCodecName();
    public boolean getSupportCompressionLevel();
    public String lowerCaseName();
    public static AvroCompressionCodec fromString(String s);
}
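The enum's two flags can be mirrored as a plain lookup, which is handy when validating a user-supplied "compression" option before handing it to Spark. This is an illustrative Python sketch of the enum's data, with a lookup that mimics fromString's case-insensitive behavior (the function name is invented for this example):

```python
# Codec name -> whether a compression level can be configured,
# per the AvroCompressionCodec flags above.
SUPPORTS_COMPRESSION_LEVEL = {
    "uncompressed": False,
    "deflate": True,
    "snappy": False,
    "bzip2": False,
    "xz": True,
    "zstandard": True,
}

def lookup_codec(name: str) -> str:
    """Case-insensitive codec lookup, loosely mirroring
    AvroCompressionCodec.fromString; raises on unknown codecs."""
    key = name.lower()
    if key not in SUPPORTS_COMPRESSION_LEVEL:
        raise ValueError(f"Unknown Avro compression codec: {name}")
    return key
```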