Catalyst is Spark's library for manipulating relational query plans and expressions
npx @tessl/cli install tessl/maven-org-apache-spark--spark-catalyst-2-13@4.0.00
# Apache Spark Catalyst
1
2
Catalyst is Apache Spark's foundational library for relational query planning and optimization. It provides a comprehensive framework for manipulating and optimizing SQL query plans through a tree-based representation system that supports rule-based optimization, cost-based optimization, and code generation.
3
4
## Package Information
5
6
- **Package Name**: spark-catalyst_2.13
7
- **Package Type**: maven
8
- **Language**: Scala
9
- **Maven Coordinates**: `org.apache.spark:spark-catalyst_2.13:4.0.0`
10
- **Installation**: Add to your `pom.xml` or `build.sbt`
11
12
```xml
13
<dependency>
14
<groupId>org.apache.spark</groupId>
15
<artifactId>spark-catalyst_2.13</artifactId>
16
<version>4.0.0</version>
17
</dependency>
18
```
19
20
```scala
21
libraryDependencies += "org.apache.spark" %% "spark-catalyst" % "4.0.0"
22
```
23
24
## Core Imports
25
26
```scala
27
import org.apache.spark.sql.types._
28
import org.apache.spark.sql.catalyst.expressions._
29
import org.apache.spark.sql.catalyst.plans.logical._
30
import org.apache.spark.sql.connector.catalog._
31
import org.apache.spark.sql.catalyst.trees._
32
```
33
34
## Basic Usage
35
36
```scala
37
import org.apache.spark.sql.types._
38
import org.apache.spark.sql.catalyst.expressions._
39
40
// Create data types
41
val stringType = StringType
42
val intType = IntegerType
43
val structType = StructType(Seq(
44
StructField("name", StringType, nullable = false),
45
StructField("age", IntegerType, nullable = true)
46
))
47
48
// Create expressions
49
val nameCol = AttributeReference("name", StringType, nullable = false)()
50
val ageCol = AttributeReference("age", IntegerType, nullable = true)()
51
val filterExpr = GreaterThan(ageCol, Literal(18))
52
53
// Work with logical plans
54
import org.apache.spark.sql.catalyst.plans.logical._
55
val plan = Filter(filterExpr, LocalRelation(nameCol, ageCol))
56
```
57
58
## Architecture
59
60
Catalyst is built around several foundational components:
61
62
- **Type System**: Comprehensive data type definitions supporting all SQL types including complex nested structures
63
- **Expression Framework**: Tree-based representation for all SQL expressions, functions, and operations with compile-time code generation
64
- **Query Plans**: Logical and physical plan representations for query optimization and execution
65
- **Tree Infrastructure**: Generic tree node framework enabling transformations and optimizations across all plan types
66
- **Data Source Connectors**: V2 connector APIs providing standardized interfaces for external data sources
67
- **Analysis & Optimization**: Rule-based frameworks for query analysis, resolution, and optimization
68
69
## Capabilities
70
71
### Data Type System
72
73
Core type system providing all SQL data types with full type safety and JSON serialization support. Essential for schema definition and type checking.
74
75
```scala { .api }
76
abstract class DataType extends AbstractDataType {
77
def defaultSize: Int
78
def typeName: String
79
def json: String
80
def simpleString: String
81
def sql: String
82
}
83
84
case class StructType(fields: Array[StructField]) extends DataType
85
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Metadata)
86
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
87
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType
88
```
89
90
[Data Types](./data-types.md)
91
92
### Expression System
93
94
Tree-based expression framework for representing all SQL operations, functions, and computations with support for code generation and optimization.
95
96
```scala { .api }
97
abstract class Expression extends TreeNode[Expression] {
98
def dataType: DataType
99
def nullable: Boolean
100
def eval(input: InternalRow): Any
101
def children: Seq[Expression]
102
}
103
104
abstract class BinaryExpression extends Expression {
105
def left: Expression
106
def right: Expression
107
}
108
109
abstract class UnaryExpression extends Expression {
110
def child: Expression
111
}
112
```
113
114
[Expressions](./expressions.md)
115
116
### Query Planning
117
118
Logical and physical query plan representations enabling sophisticated query optimization including predicate pushdown, join reordering, and cost-based optimization.
119
120
```scala { .api }
121
abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
122
def output: Seq[Attribute]
123
def children: Seq[LogicalPlan]
124
def resolved: Boolean
125
}
126
127
case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode
128
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan) extends UnaryNode
129
case class Join(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression]) extends BinaryNode
130
```
131
132
[Query Plans](./query-plans.md)
133
134
### Data Source Connectors
135
136
V2 connector APIs providing standardized interfaces for integrating external data sources with full support for predicate pushdown, column pruning, and streaming.
137
138
```scala { .api }
139
trait TableCatalog {
140
def loadTable(ident: Identifier): Table
141
def createTable(ident: Identifier, schema: StructType, partitions: Array[Transform], properties: Map[String, String]): Table
142
def alterTable(ident: Identifier, changes: TableChange*): Table
143
def dropTable(ident: Identifier): Boolean
144
}
145
146
trait Table {
147
def name(): String
148
def schema(): StructType
149
def partitioning(): Array[Transform]
150
def properties(): Map[String, String]
151
}
152
```
153
154
[Data Source Connectors](./connectors.md)
155
156
### Tree Infrastructure
157
158
Generic tree node framework providing transformation and traversal capabilities used throughout Catalyst for plans, expressions, and other tree structures.
159
160
```scala { .api }
161
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
162
def children: Seq[BaseType]
163
def transform(rule: PartialFunction[BaseType, BaseType]): BaseType
164
def foreach(f: BaseType => Unit): Unit
165
def collect[B](pf: PartialFunction[BaseType, B]): Seq[B]
166
def find(f: BaseType => Boolean): Option[BaseType]
167
}
168
```
169
170
The tree infrastructure enables powerful pattern matching and transformation capabilities essential for query optimization and analysis.