# Apache Spark Avro Data Source

The Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL. This library enables seamless conversion between Spark's Catalyst data types and Avro format, supporting schema evolution, complex nested data structures, and efficient serialization for large-scale data processing.

## Package Information

- **Name**: spark-avro_2.13
- **Type**: Scala library
- **Language**: Scala (compatible with Scala 2.12 and 2.13)
- **Platform**: JVM
- **Latest Version**: 3.5.6
- **Language Bindings**: Also available in Python (PySpark), Java, and R

### Installation

For Maven:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-avro_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

For SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.5.6"
```
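
Alternatively, the package can be pulled in at launch time with Spark's `--packages` flag, which resolves the artifact from Maven Central (`my_app.jar` below is a placeholder for your application jar):

```bash
# Fetch spark-avro and its transitive dependencies at submission time
spark-submit --packages org.apache.spark:spark-avro_2.13:3.5.6 my_app.jar
```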

## Core Imports

```scala
// For reading/writing Avro files as DataFrames
import org.apache.spark.sql.SparkSession

// For binary Avro data conversion functions
import org.apache.spark.sql.avro.functions._

// For schema conversion utilities
import org.apache.spark.sql.avro.SchemaConverters

// For DataFrame and Column operations
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
```

## Basic Usage

### Reading Avro Files

```scala
val spark = SparkSession.builder()
  .appName("Avro Example")
  .getOrCreate()

// Read Avro files into a DataFrame
val df = spark.read
  .format("avro")
  .load("path/to/avro/files")

df.show()
```
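
The reader also honors Spark's generic file-source options. As a small sketch (the directory path is assumed), `pathGlobFilter` restricts the read to files whose names match a glob pattern:

```scala
// Only read files in the directory whose names match "*.avro"
val filtered = spark.read
  .format("avro")
  .option("pathGlobFilter", "*.avro")
  .load("path/to/avro/dir")
```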

### Writing Avro Files

```scala
// Write DataFrame to Avro format
df.write
  .format("avro")
  .option("compression", "snappy")
  .save("path/to/output")
```
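
Like other file-based sources, Avro output can be laid out in partition directories with the standard `partitionBy` call; the `year` column below is an assumed field of the example DataFrame:

```scala
// Write one subdirectory per distinct value of the assumed `year` column
df.write
  .format("avro")
  .partitionBy("year")
  .save("path/to/partitioned/output")
```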

### Converting Binary Avro Data

```scala
import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions.{col, struct}

val avroSchema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

// Convert binary Avro data to Catalyst format
val decodedDF = df.select(
  from_avro(col("avro_data"), avroSchema).as("decoded")
)

// Convert Catalyst data to binary Avro format
val encodedDF = df.select(
  to_avro(struct(col("id"), col("name"))).as("avro_data")
)
```

## Architecture

The Spark Avro connector consists of several key components:

- **Data Source Integration**: FileFormat (V1) and DataSourceV2 implementations for seamless Spark SQL integration
- **Schema Conversion**: Bidirectional conversion between Avro schemas and Spark SQL schemas
- **Binary Conversion Functions**: High-level functions for converting between binary Avro data and Catalyst values
- **Serialization/Deserialization**: Efficient conversion between Avro `GenericRecord` and Spark `InternalRow` formats

## Capabilities

### File Operations

Read and write Avro files with automatic schema inference and configurable options.

```scala
// Key APIs for file operations
def format(source: String): DataFrameReader // Use "avro" as the source
def option(key: String, value: String): DataFrameReader
def load(paths: String*): DataFrame
def save(path: String): Unit
```

[File Operations Documentation](./file-operations.md)

### Binary Data Conversion

Convert between binary Avro data and Spark DataFrame columns using built-in functions.

```scala
// Key APIs for binary conversion
def from_avro(data: Column, jsonFormatSchema: String): Column
def from_avro(data: Column, jsonFormatSchema: String, options: java.util.Map[String, String]): Column
def to_avro(data: Column): Column
def to_avro(data: Column, jsonFormatSchema: String): Column
```

[Binary Conversion Documentation](./binary-conversion.md)
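
The `options` overload of `from_avro` controls parse behavior. In the sketch below (reusing the `avro_data` column and `avroSchema` string from the basic usage example), setting `"mode"` to `"FAILFAST"` makes malformed records fail the query instead of decoding to null, which is the default `PERMISSIVE` behavior:

```scala
import java.util.Collections

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// FAILFAST: throw on malformed Avro records rather than returning null
val strictDF = df.select(
  from_avro(
    col("avro_data"),
    avroSchema,
    Collections.singletonMap("mode", "FAILFAST")
  ).as("decoded")
)
```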

### Schema Conversion

Convert between Avro schemas and Spark SQL schemas for interoperability.

```scala
// Key APIs for schema conversion
def toSqlType(avroSchema: Schema): SchemaType
def toSqlType(avroSchema: Schema, options: Map[String, String]): SchemaType
def toAvroType(catalystType: DataType, nullable: Boolean, recordName: String, nameSpace: String): Schema

case class SchemaType(dataType: DataType, nullable: Boolean)
```

[Schema Conversion Documentation](./schema-conversion.md)
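
As a sketch of these utilities (the `User` record is an illustrative schema, not part of the API), the snippet below converts a parsed Avro schema to a Spark SQL type and builds an Avro record schema back from a Catalyst `StructType`:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Avro -> Spark SQL: yields a SchemaType wrapping a StructType(id: long, name: string)
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"}
    |]}""".stripMargin
)
val sqlType = SchemaConverters.toSqlType(avroSchema)

// Spark SQL -> Avro: builds a record schema named "User"
val catalystType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = false)
))
val avroRecord = SchemaConverters.toAvroType(catalystType, nullable = false, recordName = "User")
```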

### Configuration Options

Comprehensive configuration options for reading, writing, and data conversion operations.

```scala
// Example configuration usage
val df = spark.read
  .format("avro")
  .option("avroSchema", customSchemaJson)
  .option("mode", "PERMISSIVE")
  .option("positionalFieldMatching", "true")
  .load("path/to/files")
```

[Configuration Options Documentation](./configuration.md)

### Data Type Support

Support for all Spark SQL data types including complex nested structures, logical types, and custom types.

[Data Type Support Documentation](./data-types.md)
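
As an illustrative sketch (column names are assumed), nested records, arrays, and maps round-trip through the Avro format without extra configuration:

```scala
import org.apache.spark.sql.functions.{array, col, lit, map, struct}

// Assemble nested structures from an assumed flat DataFrame df(id, name)
val nested = df.select(
  struct(col("id"), col("name")).as("user"),       // written as an Avro record
  array(lit(1), lit(2), lit(3)).as("scores"),      // written as an Avro array
  map(lit("lang"), lit("scala")).as("attributes")  // written as an Avro map
)

nested.write.format("avro").save("path/to/nested/output")
```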