Catalyst query optimization framework and expression evaluation engine for Apache Spark SQL
npx @tessl/cli install tessl/maven-org-apache-spark--spark-catalyst@2.2.00
# Spark Catalyst
1
2
Catalyst is Apache Spark's query optimization framework and expression evaluation engine for Spark SQL. It provides a comprehensive system for manipulating relational query plans, transforming SQL queries and DataFrame operations into efficient execution plans through rule-based and cost-based optimization techniques.
3
4
## Package Information
5
6
- **Package Name**: spark-catalyst_2.11
7
- **Package Type**: maven
8
- **Language**: Scala
9
- **Version**: 2.2.3
10
- **Maven Coordinates**: `org.apache.spark:spark-catalyst_2.11:2.2.3`
11
12
## Core Imports
13
14
```scala
15
import org.apache.spark.sql.Row
16
import org.apache.spark.sql.types._
17
import org.apache.spark.sql.catalyst.expressions._
18
import org.apache.spark.sql.catalyst.plans.logical._
19
import org.apache.spark.sql.catalyst.trees._
20
```
21
22
## Basic Usage
23
24
```scala
25
import org.apache.spark.sql.Row
26
import org.apache.spark.sql.types._
27
import org.apache.spark.sql.catalyst.expressions._
28
29
// Create a data type
30
val stringType = StringType
31
val intType = IntegerType
32
33
// Create a row
34
val row = Row("Alice", 25, true)
35
36
// Access row data
37
val name: String = row.getString(0)
38
val age: Int = row.getInt(1)
39
val isActive: Boolean = row.getBoolean(2)
40
41
// Create expressions
42
val col = UnresolvedAttribute("name")
43
val literal = Literal("Alice")
44
val equals = EqualTo(col, literal)
45
```
46
47
## Architecture
48
49
Catalyst is built around several key components:
50
51
- **Tree Framework**: All query components (expressions, plans) are represented as trees that can be transformed using rules
52
- **Expression System**: Rich set of expressions for computations, predicates, and data access with code generation support
53
- **Query Plans**: Logical and physical plan representations for query execution
54
- **Analysis Framework**: Rule-based system for resolving references and type checking
55
- **Optimization Engine**: Cost-based and rule-based optimizations for query performance
56
- **Code Generation**: Dynamic code generation for high-performance expression evaluation
57
58
## Capabilities
59
60
### Data Types and Structures
61
62
Core data type system including primitive types, complex types (arrays, maps, structs), and Row interface for data access.
63
64
```scala { .api }
65
trait Row {
66
def apply(i: Int): Any
67
def get(i: Int): Any
68
def isNullAt(i: Int): Boolean
69
def getInt(i: Int): Int
70
def getString(i: Int): String
71
def getBoolean(i: Int): Boolean
72
// ... additional primitive accessors
73
}
74
75
abstract class DataType {
76
def typeName: String
77
def json: String
78
def prettyJson: String
79
def simpleString: String
80
}
81
```
82
83
[Data Types and Structures](./data-types.md)
84
85
### Expression System
86
87
Comprehensive expression framework for computations, predicates, aggregations, and data transformations with built-in code generation.
88
89
```scala { .api }
90
abstract class Expression extends TreeNode[Expression] {
91
def dataType: DataType
92
def nullable: Boolean
93
def eval(input: InternalRow): Any
94
def genCode(ctx: CodegenContext): ExprCode
95
def prettyName: String
96
}
97
98
abstract class UnaryExpression extends Expression {
99
def child: Expression
100
}
101
102
abstract class BinaryExpression extends Expression {
103
def left: Expression
104
def right: Expression
105
}
106
```
107
108
[Expression System](./expressions.md)
109
110
### Query Plans
111
112
Logical and physical query plan representations with tree-based transformations and optimization support.
113
114
```scala { .api }
115
abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
116
def output: Seq[Attribute]
117
def references: AttributeSet
118
def inputSet: AttributeSet
119
def resolved: Boolean
120
}
121
122
abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanType] {
123
def output: Seq[Attribute]
124
def outputSet: AttributeSet
125
def references: AttributeSet
126
}
127
```
128
129
[Query Plans](./query-plans.md)
130
131
### Analysis Framework
132
133
Rule-based analysis system for resolving unresolved references, type checking, and semantic validation.
134
135
```scala { .api }
136
class Analyzer(catalog: SessionCatalog, conf: SQLConf, maxIterations: Int)
137
extends RuleExecutor[LogicalPlan] {
138
def execute(plan: LogicalPlan): LogicalPlan
139
}
140
141
abstract class Rule[TreeType <: TreeNode[TreeType]] {
142
def ruleName: String
143
def apply(plan: TreeType): TreeType
144
}
145
```
146
147
[Analysis Framework](./analysis.md)
148
149
### Optimization
150
151
Query optimization engine with rule-based and cost-based optimization techniques.
152
153
```scala { .api }
154
abstract class Optimizer extends RuleExecutor[LogicalPlan] {
155
def batches: Seq[Batch]
156
}
157
158
case class Batch(name: String, strategy: Strategy, rules: Rule[LogicalPlan]*)
159
160
abstract class Strategy {
161
def maxIterations: Int
162
}
163
```
164
165
[Optimization](./optimization.md)
166
167
### SQL Parsing
168
169
SQL parsing interfaces and abstract syntax tree representations.
170
171
```scala { .api }
172
abstract class ParserInterface {
173
def parsePlan(sqlText: String): LogicalPlan
174
def parseExpression(sqlText: String): Expression
175
def parseDataType(sqlText: String): DataType
176
}
177
178
trait AstBuilder extends SqlBaseBaseVisitor[AnyRef] {
179
def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan
180
def visitSingleExpression(ctx: SingleExpressionContext): Expression
181
}
182
```
183
184
[SQL Parsing](./parsing.md)
185
186
### Code Generation
187
188
Framework for generating efficient Java code for expression evaluation and query execution.
189
190
```scala { .api }
191
class CodegenContext {
192
def freshName(name: String): String
193
def addReferenceObj(objName: String, obj: Any, className: String = null): String
194
def addMutableState(javaType: String, variableName: String, initFunc: String = ""): String
195
}
196
197
case class ExprCode(code: String, isNull: String, value: String)
198
199
trait CodeGenerator[InType <: AnyRef, OutType <: AnyRef] {
200
def generate(expressions: InType): OutType
201
}
202
```
203
204
[Code Generation](./code-generation.md)
205
206
### Utilities
207
208
Utility classes for date/time operations, string manipulation, and other common operations.
209
210
```scala { .api }
211
object DateTimeUtils {
212
def stringToTimestamp(s: UTF8String): Option[Long]
213
def timestampToString(us: Long): String
214
def dateToString(days: Int): String
215
def stringToDate(s: UTF8String): Option[Int]
216
}
217
218
object UTF8String {
219
def fromString(str: String): UTF8String
220
def fromBytes(bytes: Array[Byte]): UTF8String
221
}
222
```
223
224
[Utilities](./utilities.md)