Catalyst is a library for representing and manipulating relational query plans; it serves as the foundation of Spark SQL's query optimizer and execution engine
npx @tessl/cli install tessl/maven-org-apache-spark--spark-catalyst_2-10@1.6.3
# Spark Catalyst
Spark Catalyst is an extensible query optimizer and execution planning framework that serves as the foundation of Apache Spark's SQL engine. It provides a comprehensive set of tools for representing, manipulating, and optimizing relational query plans through a tree-based representation system that supports complex transformations including predicate pushdown, constant folding, and projection pruning.
## Package Information

- **Package Name**: org.apache.spark:spark-catalyst_2.10
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Add the dependency to your Maven pom.xml:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-catalyst_2.10</artifactId>
  <version>1.6.3</version>
</dependency>
```

## Core Imports

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.trees._
import org.apache.spark.sql.types._
```

## Basic Usage

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._

// Define a simple relation schema
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Work with expressions
val idRef = AttributeReference("id", IntegerType, nullable = false)()
val nameRef = AttributeReference("name", StringType, nullable = true)()
val literal = Literal(42, IntegerType)

// Create a simple filter expression: id = 42
val filterExpr = EqualTo(idRef, literal)

// Create a Row and read its fields back
val row = Row(1, "Alice")
val name = row.getString(1)
val id = row.getInt(0)
```

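Expressions built this way can also be evaluated directly. A minimal sketch: a tree containing only literals is foldable and ignores its input row, so passing `null` as the input is safe here.

```scala
import org.apache.spark.sql.catalyst.expressions._

// A comparison over two literals; no attributes are referenced,
// so evaluation needs no input row.
val eq = EqualTo(Literal(42), Literal(42))

val result = eq.eval(null) // evaluates both children and compares them
val folded = eq.foldable   // true: the whole tree is constant
```

Expressions that reference attributes (like `filterExpr` above) instead need an `InternalRow` carrying the bound input values.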
## Architecture

Spark Catalyst is built around several key architectural components:

- **Tree-based Representation**: All plans and expressions extend TreeNode, providing uniform tree traversal and transformation capabilities
- **Type System**: Rich type system supporting SQL types including primitive, complex, and user-defined types
- **Expression Framework**: Extensible expression evaluation system with code generation support for high performance
- **Analysis Phase**: Resolution of unresolved references, type checking, and semantic validation
- **Optimization Phase**: Rule-based optimization framework with built-in optimizations like predicate pushdown and constant folding
- **Code Generation**: Whole-stage code generation for high-performance query execution

## Capabilities
### Row Data Interface

Core interface for working with structured row data, providing both generic and type-safe access methods.

```scala { .api }
trait Row extends Serializable {
  def size: Int
  def length: Int
  def schema: StructType
  def apply(i: Int): Any
  def get(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getBoolean(i: Int): Boolean
  def getByte(i: Int): Byte
  def getShort(i: Int): Short
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getFloat(i: Int): Float
  def getDouble(i: Int): Double
  def getString(i: Int): String
  def getDecimal(i: Int): java.math.BigDecimal
  def getDate(i: Int): java.sql.Date
  def getTimestamp(i: Int): java.sql.Timestamp
  def getAs[T](i: Int): T
  def getAs[T](fieldName: String): T
  def copy(): Row
}

object Row {
  def apply(values: Any*): Row
  def fromSeq(values: Seq[Any]): Row
  def fromTuple(tuple: Product): Row
  def merge(rows: Row*): Row
  val empty: Row
}
```

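A short sketch of these accessors in use. Note that rows built with `Row(...)` carry values but no schema, so the positional accessors apply; `getAs[T](fieldName)` needs a schema-bearing row.

```scala
import org.apache.spark.sql.Row

val row = Row(1, "Alice", null)

val id = row.getInt(0)          // primitive accessor
val name = row.getAs[String](1) // type-parameterized accessor
val missing = row.isNullAt(2)   // check before using a primitive accessor

val same = Row.fromSeq(Seq(1, "Alice", null)) // same contents from a Seq
```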
[Row Operations](./row-operations.md)
### Tree Framework

Foundation for all Catalyst data structures, providing tree traversal, transformation, and manipulation capabilities.

```scala { .api }
abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
  def children: Seq[BaseType]
  def fastEquals(other: TreeNode[_]): Boolean
  // Tree traversal and transformation methods
}

case class Origin(
  line: Option[Int] = None,
  startPosition: Option[Int] = None
)

object CurrentOrigin {
  def get: Origin
  def set(o: Origin): Unit
  def reset(): Unit
  def withOrigin[A](o: Origin)(f: => A): A
}
```

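The transformation methods summarized above can be sketched with a hand-rolled constant-folding pass over an expression tree — a simplified version of what the optimizer's ConstantFolding rule does more generally.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

// A nested arithmetic tree: 1 + (2 + 3)
val tree: Expression = Add(Literal(1), Add(Literal(2), Literal(3)))

// transformUp visits children before parents, so inner Adds fold first
// and the outer Add then sees two literals.
def fold(e: Expression): Expression = e transformUp {
  case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
    Literal(a + b)
}

val folded = fold(tree) // collapses to a single literal
```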
[Tree Operations](./tree-operations.md)
### Type System

Comprehensive type system supporting all SQL types including primitives, complex nested types, and user-defined types.

```scala { .api }
abstract class DataType extends AbstractDataType {
  def defaultSize: Int
  def typeName: String
  def json: String
  def prettyJson: String
  def simpleString: String
  def sameType(other: DataType): Boolean
}

// Primitive types (objects)
object BooleanType extends DataType
object ByteType extends DataType
object ShortType extends DataType
object IntegerType extends DataType
object LongType extends DataType
object FloatType extends DataType
object DoubleType extends DataType
object StringType extends DataType
object BinaryType extends DataType
object DateType extends DataType
object TimestampType extends DataType
```

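One property worth illustrating: a DataType serializes to JSON via `json` and can be reconstructed from it, which is how Spark persists schema information. A minimal round-trip sketch:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("tags", ArrayType(StringType, containsNull = true), nullable = true)
))

val json = schema.json                 // compact JSON representation
val restored = DataType.fromJson(json) // rebuilds an equal DataType
```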
[Type System](./types.md)
### Expression System

Extensible expression evaluation framework with support for code generation and complex expression trees.

```scala { .api }
abstract class Expression extends TreeNode[Expression] {
  def foldable: Boolean
  def deterministic: Boolean
  def nullable: Boolean
  def references: AttributeSet
  def eval(input: InternalRow): Any
  def dataType: DataType
}

// Expression hierarchy
trait LeafExpression extends Expression
trait UnaryExpression extends Expression
trait BinaryExpression extends Expression
trait BinaryOperator extends BinaryExpression

// Key expression types
case class Literal(value: Any, dataType: DataType) extends LeafExpression
case class AttributeReference(name: String, dataType: DataType, nullable: Boolean, metadata: Metadata = Metadata.empty) extends Attribute
case class Cast(child: Expression, dataType: DataType) extends UnaryExpression
```

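A brief sketch of the structural properties above, using a `Cast` over a literal:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

// Cast wraps a child expression and declares a target type.
val cast = Cast(Literal("123"), IntegerType)

val dt = cast.dataType      // the target type, IntegerType
val refs = cast.references  // empty: no attributes in the subtree
val canFold = cast.foldable // true: the child is a literal
```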
[Expressions](./expressions.md)
### Query Planning

Logical and physical query plan representations with transformation and optimization capabilities.

```scala { .api }
abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType] {
  def output: Seq[Attribute]
  def outputSet: AttributeSet
  def references: AttributeSet
  def inputSet: AttributeSet
  def transformExpressions(rule: PartialFunction[Expression, Expression]): this.type
}

abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  def resolved: Boolean
  def resolveOperators(rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan
  def resolveExpressions(r: PartialFunction[Expression, Expression]): LogicalPlan
}
```

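A small plan can be assembled by hand. This sketch builds the equivalent of `SELECT id FROM rel WHERE id = 1` from `LocalRelation`, `Filter`, and `Project` (all in the logical-plan package):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._

val id = AttributeReference("id", IntegerType, nullable = false)()
val name = AttributeReference("name", StringType, nullable = true)()

val rel = LocalRelation(id, name)                   // leaf relation with two columns
val filtered = Filter(EqualTo(id, Literal(1)), rel) // WHERE id = 1
val plan = Project(Seq(id), filtered)               // SELECT id

val out = plan.output // what the plan produces: just `id`
```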
[Query Planning](./query-planning.md)
### Analysis Framework

Semantic analysis system for resolving unresolved references, type checking, and plan validation.

```scala { .api }
class Analyzer(catalog: Catalog, registry: FunctionRegistry, conf: CatalystConf) extends RuleExecutor[LogicalPlan] {
  def execute(plan: LogicalPlan): LogicalPlan
}

trait Catalog {
  def lookupRelation(name: Seq[String]): LogicalPlan
  def functionExists(name: String): Boolean
  def lookupFunction(name: String, children: Seq[Expression]): Expression
}

trait FunctionRegistry {
  def registerFunction(name: String, info: ExpressionInfo, builder: Seq[Expression] => Expression): Unit
  def lookupFunction(name: String, children: Seq[Expression]): Expression
  def functionExists(name: String): Boolean
}
```

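A sketch of wiring these pieces together, assuming the in-memory `SimpleCatalog` and `SimpleCatalystConf` implementations that ship with Catalyst 1.6 (exact constructor shapes vary slightly across versions):

```scala
import org.apache.spark.sql.catalyst.{SimpleCatalystConf, TableIdentifier}
import org.apache.spark.sql.catalyst.analysis._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.types._

// Build an analyzer from the in-memory catalog and the built-in functions.
val conf = SimpleCatalystConf(caseSensitiveAnalysis = true)
val catalog = new SimpleCatalog(conf)
val analyzer = new Analyzer(catalog, FunctionRegistry.builtin, conf)

// Register a relation, then resolve a plan that references it by name.
val id = AttributeReference("id", IntegerType, nullable = false)()
catalog.registerTable(TableIdentifier("t"), LocalRelation(id))

val resolved = analyzer.execute(
  Project(Seq(UnresolvedAttribute("id")), UnresolvedRelation(TableIdentifier("t"))))
```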
[Analysis](./analysis.md)
### Optimization Framework

Rule-based optimization system with built-in optimizations for query plan improvement.

```scala { .api }
object Optimizer extends RuleExecutor[LogicalPlan] {
  // Key optimization rules available as objects
  object ConstantFolding extends Rule[LogicalPlan]
  object BooleanSimplification extends Rule[LogicalPlan]
  object ColumnPruning extends Rule[LogicalPlan]
  object FilterPushdown extends Rule[LogicalPlan]
  object ProjectCollapsing extends Rule[LogicalPlan]
}

abstract class Rule[TreeType <: TreeNode[TreeType]] {
  def apply(plan: TreeType): TreeType
}

abstract class RuleExecutor[TreeType <: TreeNode[TreeType]] {
  def execute(plan: TreeType): TreeType
  def batches: Seq[Batch]
}
```

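Custom rules follow the same pattern as the built-ins: extend `Rule[LogicalPlan]` and pattern-match inside a `transform`. A sketch of a rule (the name `RemoveTrueFilters` is illustrative, not part of the library) that drops filters whose condition is the literal `true`:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types._

// A filter on a constant-true condition is a no-op; replace it
// with its child wherever it appears in the plan.
object RemoveTrueFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

val id = AttributeReference("id", IntegerType, nullable = false)()
val rel = LocalRelation(id)
val simplified = RemoveTrueFilters(Filter(Literal(true), rel)) // back to rel
```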
[Optimization](./optimization.md)
## Types

```scala { .api }
// Complex data types
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType
case class StructType(fields: Array[StructField]) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Metadata)

// Decimal types
case class DecimalType(precision: Int, scale: Int) extends DataType
case class Decimal(value: java.math.BigDecimal) extends Ordered[Decimal]

// Analysis types
type Resolver = (String, String) => Boolean
case class TypeCheckResult(isSuccess: Boolean, errorMessage: Option[String])

// Configuration
trait CatalystConf {
  def caseSensitiveAnalysis: Boolean
  def orderByOrdinal: Boolean
  def groupByOrdinal: Boolean
}

// Table identification
case class TableIdentifier(table: String, database: Option[String])

// Exceptions
class AnalysisException(message: String, line: Option[Int], startPosition: Option[Int]) extends Exception
```
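
The complex types above compose freely. A short sketch of a nested schema — an array of point structs stored in a map keyed by name:

```scala
import org.apache.spark.sql.types._

// struct<x: double, y: double>
val point = StructType(Seq(
  StructField("x", DoubleType, nullable = false),
  StructField("y", DoubleType, nullable = false)
))

val points = ArrayType(point, containsNull = false)          // array<struct<...>>
val byName = MapType(StringType, points, valueContainsNull = true)

val typeName = byName.typeName // "map"
```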