tessl/maven-org-apache-spark--spark-catalyst-2-13

Catalyst is Spark's library for manipulating relational query plans and expressions

Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.apache.spark/spark-catalyst_2.13@4.0.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-catalyst-2-13@4.0.0

# Apache Spark Catalyst

Catalyst is Apache Spark's foundational library for relational query planning and optimization. It provides a comprehensive framework for manipulating SQL query plans through a tree-based representation that supports rule-based optimization, cost-based optimization, and code generation.

## Package Information

- **Package Name**: spark-catalyst_2.13
- **Package Type**: Maven
- **Language**: Scala
- **Maven Coordinates**: `org.apache.spark:spark-catalyst_2.13:4.0.0`
- **Installation**: Add the dependency to your `pom.xml` or `build.sbt`

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-catalyst_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

```scala
libraryDependencies += "org.apache.spark" %% "spark-catalyst" % "4.0.0"
```


## Core Imports

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.connector.catalog._
import org.apache.spark.sql.catalyst.trees._
```


## Basic Usage

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._

// Create data types
val stringType = StringType
val intType = IntegerType
val structType = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)
))

// Create expressions
val nameCol = AttributeReference("name", StringType, nullable = false)()
val ageCol = AttributeReference("age", IntegerType, nullable = true)()
val filterExpr = GreaterThan(ageCol, Literal(18))

// Work with logical plans
import org.apache.spark.sql.catalyst.plans.logical._
val plan = Filter(filterExpr, LocalRelation(nameCol, ageCol))
```


## Architecture

Catalyst is built around several foundational components:

- **Type System**: Comprehensive data type definitions supporting all SQL types, including complex nested structures
- **Expression Framework**: Tree-based representation of all SQL expressions, functions, and operations, with runtime code generation
- **Query Plans**: Logical and physical plan representations for query optimization and execution
- **Tree Infrastructure**: Generic tree node framework enabling transformations and optimizations across all plan types
- **Data Source Connectors**: V2 connector APIs providing standardized interfaces for external data sources
- **Analysis & Optimization**: Rule-based frameworks for query analysis, resolution, and optimization


## Capabilities

### Data Type System

Core type system providing all SQL data types with full type safety and JSON serialization support. Essential for schema definition and type checking.

```scala { .api }
abstract class DataType extends AbstractDataType {
  def defaultSize: Int
  def typeName: String
  def json: String
  def simpleString: String
  def sql: String
}

case class StructType(fields: Array[StructField]) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Metadata)
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType
```


[Data Types](./data-types.md)
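As a quick illustration of the JSON serialization support mentioned above, a schema can be round-tripped through its JSON form. A small sketch (assumes spark-catalyst is on the classpath):

```scala
import org.apache.spark.sql.types._

// Define a nested schema
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = true)
))

// Serialize to JSON and parse it back into an equal DataType
val json = schema.json
val restored = DataType.fromJson(json)
assert(restored == schema) // the round trip preserves the schema
```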

### Expression System

Tree-based expression framework representing all SQL operations, functions, and computations, with support for code generation and optimization.

```scala { .api }
abstract class Expression extends TreeNode[Expression] {
  def dataType: DataType
  def nullable: Boolean
  def eval(input: InternalRow): Any
  def children: Seq[Expression]
}

abstract class BinaryExpression extends Expression {
  def left: Expression
  def right: Expression
}

abstract class UnaryExpression extends Expression {
  def child: Expression
}
```


[Expressions](./expressions.md)
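Expressions built from literals can be evaluated interpretively without a running Spark session, which makes the `eval` contract above easy to try out. A sketch (assumes spark-catalyst on the classpath):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.BooleanType

// Build a comparison over literal values
val expr = GreaterThan(Literal(5), Literal(3))

// Literals ignore the input row, so a null row suffices here
val result = expr.eval(null)

// Every expression reports its result type
val dt = expr.dataType // comparisons yield BooleanType
```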

### Query Planning

Logical and physical query plan representations enabling sophisticated query optimization, including predicate pushdown, join reordering, and cost-based optimization.

```scala { .api }
abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  def output: Seq[Attribute]
  def children: Seq[LogicalPlan]
  def resolved: Boolean
}

case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extends UnaryNode
case class Join(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression]) extends BinaryNode
```


[Query Plans](./query-plans.md)
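Optimizer rules are ordinary `transform` calls over these plan nodes. A hand-rolled example rule (a sketch for illustration, not Spark's actual optimizer code) that splices out an always-true `Filter`:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._

val id = AttributeReference("id", IntegerType, nullable = false)()
val relation = LocalRelation(id)

// Filter(true, child) never removes rows, so a rule can drop it
val plan: LogicalPlan = Project(Seq(id), Filter(Literal(true), relation))

val optimized = plan.transform {
  case Filter(Literal(true, BooleanType), child) => child
}
```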

### Data Source Connectors

V2 connector APIs providing standardized interfaces for integrating external data sources, with full support for predicate pushdown, column pruning, and streaming.

```scala { .api }
trait TableCatalog {
  def loadTable(ident: Identifier): Table
  def createTable(ident: Identifier, schema: StructType, partitions: Array[Transform], properties: Map[String, String]): Table
  def alterTable(ident: Identifier, changes: TableChange*): Table
  def dropTable(ident: Identifier): Boolean
}

trait Table {
  def name(): String
  def schema(): StructType
  def partitioning(): Array[Transform]
  def properties(): Map[String, String]
}
```


[Data Source Connectors](./connectors.md)
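Implementing the `Table` side of the V2 API can be as small as a name, a schema, and a set of capabilities. A minimal read-nothing sketch (the class name is hypothetical, and this assumes the Spark 4.0 connector interfaces):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types._

// Hypothetical minimal Table; a real source would also declare
// capabilities such as BATCH_READ and provide a ScanBuilder
class StaticTable extends Table {
  override def name(): String = "static_table"

  override def schema(): StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  override def capabilities(): util.Set[TableCapability] =
    util.Collections.emptySet[TableCapability]
}
```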

### Tree Infrastructure

Generic tree node framework providing transformation and traversal capabilities, used throughout Catalyst for plans, expressions, and other tree structures.

```scala { .api }
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  def children: Seq[BaseType]
  def transform(rule: PartialFunction[BaseType, BaseType]): BaseType
  def foreach(f: BaseType => Unit): Unit
  def collect[B](pf: PartialFunction[BaseType, B]): Seq[B]
  def find(f: BaseType => Boolean): Option[BaseType]
}
```


The tree infrastructure enables powerful pattern matching and transformation capabilities essential for query optimization and analysis.
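For instance, `collect` and `transform` walk an expression tree exactly as they walk a plan tree. A short sketch (assumes spark-catalyst on the classpath):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

val age = AttributeReference("age", IntegerType, nullable = true)()
val expr: Expression = And(GreaterThan(age, Literal(18)), LessThan(age, Literal(65)))

// collect: gather every literal value, visited in pre-order
val literalValues = expr.collect { case l: Literal => l.value }

// transform: rewrite matching nodes top-down across the whole tree
val widened = expr.transform {
  case Literal(65, IntegerType) => Literal(70)
}
```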