tessl/maven-org-apache-spark--avro

Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL with schema evolution capabilities.

Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.apache.spark/spark-avro_2.13@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--avro@3.5.0

# Apache Spark Avro Data Source

The Apache Spark Avro Data Source provides built-in support for reading and writing Apache Avro data in Apache Spark SQL. This library enables seamless conversion between Spark's Catalyst data types and Avro format, supporting schema evolution, complex nested data structures, and efficient serialization for large-scale data processing.

## Package Information

- **Name**: spark-avro_2.13
- **Type**: Scala library
- **Language**: Scala (compatible with Scala 2.12 and 2.13)
- **Platform**: JVM
- **Latest Version**: 3.5.6
- **Language Bindings**: Also available in Python (PySpark), Java, and R

### Installation

For Maven:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-avro_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

For SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.5.6"
```
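Note that `spark-avro` is an external module and is not on the Spark classpath by default; as an alternative to declaring a build dependency, it can be pulled in at launch time with `--packages` (the coordinates shown assume the Scala 2.13 build at version 3.5.6, and the application jar name is illustrative):

```shell
# Resolve spark-avro from Maven Central when the job is submitted
spark-submit --packages org.apache.spark:spark-avro_2.13:3.5.6 my-app.jar
```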

## Core Imports

```scala
// For reading/writing Avro files as DataFrames
import org.apache.spark.sql.SparkSession

// For binary Avro data conversion functions
import org.apache.spark.sql.avro.functions._

// For schema conversion utilities
import org.apache.spark.sql.avro.SchemaConverters

// For DataFrames and Column operations
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions._
```

## Basic Usage

### Reading Avro Files

```scala
val spark = SparkSession.builder()
  .appName("Avro Example")
  .getOrCreate()

// Read Avro files into DataFrame
val df = spark.read
  .format("avro")
  .load("path/to/avro/files")

df.show()
```

### Writing Avro Files

```scala
// Write DataFrame to Avro format
df.write
  .format("avro")
  .option("compression", "snappy")
  .save("path/to/output")
```

### Converting Binary Avro Data

```scala
import org.apache.spark.sql.avro.functions._

val avroSchema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

// Convert binary Avro data to Catalyst format
val decodedDF = df.select(
  from_avro(col("avro_data"), avroSchema).as("decoded")
)

// Convert Catalyst data to binary Avro format
val encodedDF = df.select(
  to_avro(struct(col("id"), col("name"))).as("avro_data")
)
```

## Architecture

The Spark Avro connector consists of several key components:

- **Data Source Integration**: FileFormat (V1) and DataSourceV2 implementations for seamless Spark SQL integration
- **Schema Conversion**: Bidirectional conversion between Avro schemas and Spark SQL schemas
- **Binary Conversion Functions**: High-level functions for converting between binary Avro data and Catalyst values
- **Serialization/Deserialization**: Efficient conversion between Avro GenericRecord and Spark InternalRow formats
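To see how these components compose, here is a minimal round-trip sketch that encodes a struct column to Avro binary and decodes it back; the local SparkSession, the sample data, and the `User` record name are illustrative, not part of the library's API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder()
  .master("local[1]") // local master, for illustration only
  .appName("Avro Round Trip")
  .getOrCreate()
import spark.implicits._

// Small example frame standing in for real data
val df = Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")

// Avro schema matching the struct built below
val userSchema =
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"}
    |]}""".stripMargin

// Encode the (id, name) struct to Avro binary, then decode it back
val roundTripped = df
  .select(to_avro(struct(col("id"), col("name"))).as("bin"))
  .select(from_avro(col("bin"), userSchema).as("user"))

roundTripped.select("user.id", "user.name").show()
```

This pattern is common when Avro payloads arrive from streaming systems such as Kafka rather than from files on disk.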

## Capabilities

### File Operations

Read and write Avro files with automatic schema inference and configurable options.

```scala
// Key APIs for file operations
def format(source: String): DataFrameReader // Use "avro" as source
def option(key: String, value: String): DataFrameReader
def load(path: String*): DataFrame
def save(path: String): Unit
```

[File Operations Documentation](./file-operations.md)

### Binary Data Conversion

Convert between binary Avro data and Spark DataFrame columns using built-in functions.

```scala
// Key APIs for binary conversion
def from_avro(data: Column, jsonFormatSchema: String): Column
def from_avro(data: Column, jsonFormatSchema: String, options: java.util.Map[String, String]): Column
def to_avro(data: Column): Column
def to_avro(data: Column, jsonFormatSchema: String): Column
```

[Binary Conversion Documentation](./binary-conversion.md)

### Schema Conversion

Convert between Avro schemas and Spark SQL schemas for interoperability.

```scala
// Key APIs for schema conversion
def toSqlType(avroSchema: Schema): SchemaType
def toSqlType(avroSchema: Schema, options: Map[String, String]): SchemaType
def toAvroType(catalystType: DataType, nullable: Boolean, recordName: String, nameSpace: String): Schema

case class SchemaType(dataType: DataType, nullable: Boolean)
```

[Schema Conversion Documentation](./schema-conversion.md)
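These utilities can be exercised directly; the following is a minimal sketch, assuming the Avro Java library (a transitive dependency of spark-avro) for parsing the schema, with an illustrative `User` record:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{LongType, StructType}

// Parse an Avro record schema; "User" and its fields are illustrative
val avroSchema = new Schema.Parser().parse(
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": "string"}
    |]}""".stripMargin)

// Convert to a Spark SQL schema
val sqlType = SchemaConverters.toSqlType(avroSchema)
val sparkStruct = sqlType.dataType.asInstanceOf[StructType]

// Round-trip back to an Avro schema
val backToAvro = SchemaConverters.toAvroType(
  sqlType.dataType, nullable = false, recordName = "User", nameSpace = "")
```

Non-nullable Avro types such as the plain `long` and `string` above map to non-nullable Spark fields; wrapping a type in a union with `null` is what produces a nullable field.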

### Configuration Options

Comprehensive configuration options for reading, writing, and data conversion operations.

```scala
// Example configuration usage
val df = spark.read
  .format("avro")
  .option("avroSchema", customSchemaJson)
  .option("mode", "PERMISSIVE")
  .option("positionalFieldMatching", "true")
  .load("path/to/files")
```

[Configuration Options Documentation](./configuration.md)
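Options apply on the write path as well; for example, `recordName` and `recordNamespace` set the name and namespace of the top-level Avro record, and `compression` selects the codec. A sketch, assuming a DataFrame `df` as in the earlier examples (the name and namespace values are illustrative):

```scala
// Write-side options: record naming plus codec selection
df.write
  .format("avro")
  .option("recordName", "User")
  .option("recordNamespace", "com.example")
  .option("compression", "deflate")
  .save("path/to/output")
```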

### Data Type Support

Support for all Spark SQL data types, including complex nested structures, logical types, and custom types.

[Data Type Support Documentation](./data-types.md)