Apache Spark Avro Data Source

The Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL. This library enables seamless conversion between Spark's Catalyst data types and Avro format, supporting schema evolution, complex nested data structures, and efficient serialization for large-scale data processing.

Package Information

  • Name: spark-avro_2.13
  • Type: Scala library
  • Language: Scala (compatible with Scala 2.12 and 2.13)
  • Platform: JVM
  • Latest Version: 3.5.6
  • Language Bindings: Also available in Python (PySpark), Java, and R

Installation

For Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.13</artifactId>
    <version>3.5.6</version>
</dependency>

For SBT:

libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.5.6"
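
Because spark-avro is an external module, it can also be supplied at launch time instead of being bundled into the application; for example, starting a shell with the package pulled from Maven:

./bin/spark-shell --packages org.apache.spark:spark-avro_2.13:3.5.6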

Core Imports

// For reading/writing Avro files as DataFrames
import org.apache.spark.sql.SparkSession

// For binary Avro data conversion functions
import org.apache.spark.sql.avro.functions._

// For schema conversion utilities
import org.apache.spark.sql.avro.SchemaConverters

// For DataFrames and Column operations
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions._

Basic Usage

Reading Avro Files

val spark = SparkSession.builder()
  .appName("Avro Example")
  .getOrCreate()

// Read Avro files into DataFrame
val df = spark.read
  .format("avro")
  .load("path/to/avro/files")

df.show()

Writing Avro Files

// Write DataFrame to Avro format
df.write
  .format("avro")
  .option("compression", "snappy")
  .save("path/to/output")

Converting Binary Avro Data

import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions.{col, struct}

val avroSchema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

// Convert binary Avro data to Catalyst format
val decodedDF = df.select(
  from_avro(col("avro_data"), avroSchema).as("decoded")
)

// Convert Catalyst data to binary Avro format
val encodedDF = df.select(
  to_avro(struct(col("id"), col("name"))).as("avro_data")
)

Architecture

The Spark Avro connector consists of several key components:

  • Data Source Integration: FileFormat (V1) and DataSourceV2 implementations for seamless Spark SQL integration
  • Schema Conversion: Bidirectional conversion between Avro schemas and Spark SQL schemas
  • Binary Conversion Functions: High-level functions for converting between binary Avro data and Catalyst values
  • Serialization/Deserialization: Efficient conversion between Avro GenericRecord and Spark InternalRow formats

Capabilities

File Operations

Read and write Avro files with automatic schema inference and configurable options.

// Key APIs for file operations
def format(source: String): DataFrameReader  // On DataFrameReader; use "avro" as the source
def option(key: String, value: String): DataFrameReader  // Also available on DataFrameWriter
def load(path: String*): DataFrame  // DataFrameReader
def save(path: String): Unit  // DataFrameWriter
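
For example, a partitioned write combines the writer API with Avro-specific options; a sketch in which the partition column "year" and the paths are illustrative:

// Write one Avro directory per year; "deflate" is one of the supported codecs
df.write
  .format("avro")
  .partitionBy("year")
  .option("compression", "deflate")
  .save("path/to/partitioned/output")

// Read it back; partition discovery works as with other file sources
val events = spark.read.format("avro").load("path/to/partitioned/output")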

File Operations Documentation

Binary Data Conversion

Convert between binary Avro data and Spark DataFrame columns using built-in functions.

// Key APIs for binary conversion
def from_avro(data: Column, jsonFormatSchema: String): Column
def from_avro(data: Column, jsonFormatSchema: String, options: java.util.Map[String, String]): Column
def to_avro(data: Column): Column
def to_avro(data: Column, jsonFormatSchema: String): Column
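
The options overload of from_avro accepts a parse mode (PERMISSIVE or FAILFAST); a minimal sketch reusing the avroSchema string defined earlier:

import scala.jdk.CollectionConverters._
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// With PERMISSIVE, malformed records yield null instead of failing the query
val parseOptions = Map("mode" -> "PERMISSIVE").asJava

val tolerantDF = df.select(
  from_avro(col("avro_data"), avroSchema, parseOptions).as("decoded")
)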

Binary Conversion Documentation

Schema Conversion

Convert between Avro schemas and Spark SQL schemas for interoperability.

// Key APIs for schema conversion
def toSqlType(avroSchema: Schema): SchemaType
def toSqlType(avroSchema: Schema, options: Map[String, String]): SchemaType
def toAvroType(catalystType: DataType, nullable: Boolean, recordName: String, nameSpace: String): Schema

case class SchemaType(dataType: DataType, nullable: Boolean)
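
A round-trip sketch using these utilities, reusing the User schema string from the binary-conversion example (the record name and namespace passed to toAvroType are illustrative):

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

val parsedAvro = new Schema.Parser().parse(avroSchema)

// Avro -> Spark SQL: a struct with fields id (long) and name (string)
val sqlSchema = SchemaConverters.toSqlType(parsedAvro)
println(sqlSchema.dataType)

// Spark SQL -> Avro: rebuild an Avro record schema from the Catalyst type
val rebuilt = SchemaConverters.toAvroType(
  sqlSchema.dataType,
  nullable = false,
  recordName = "User",
  nameSpace = "com.example"
)
println(rebuilt.toString(true))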

Schema Conversion Documentation

Configuration Options

Comprehensive configuration options for reading, writing, and data conversion operations.

// Example configuration usage
val df = spark.read
  .format("avro")
  .option("avroSchema", customSchemaJson)
  .option("mode", "PERMISSIVE")
  .option("positionalFieldMatching", "true")
  .load("path/to/files")

Configuration Options Documentation

Data Type Support

Support for all Spark SQL data types including complex nested structures, logical types, and custom types.
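
For instance, Catalyst decimals, timestamps, and nested structures map onto Avro logical and complex types automatically on write; a brief sketch in which the column names and values are illustrative and spark is the session from the earlier examples:

import java.sql.Timestamp
import spark.implicits._

// DecimalType becomes the Avro decimal logical type, TimestampType becomes
// timestamp-micros, and the nested tuple becomes a nested Avro record
val typed = Seq(
  (BigDecimal("19.99"), Timestamp.valueOf("2024-01-01 00:00:00"), ("a", Seq(1, 2, 3)))
).toDF("price", "ts", "nested")

typed.write.format("avro").save("path/to/typed-output")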

Data Type Support Documentation