tessl/maven-org-apache-spark--spark-sql_2-11

Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.apache.spark/spark-sql_2.11@2.4.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-sql_2-11@2.4.0

# Apache Spark SQL

Apache Spark SQL is a distributed SQL query engine that provides DataFrame and Dataset APIs for working with structured data. It enables users to run SQL queries against structured data sources such as JSON, Parquet, and JDBC databases, while providing seamless integration with Spark's core RDD API. The module includes the Catalyst optimizer framework for logical query planning and optimization, support for both SQL and DataFrame/Dataset APIs, integration with Hive for metadata and SerDes, and the ability to cache data in memory for faster query performance.

## Package Information

- **Package Name**: spark-sql_2.11
- **Package Type**: maven
- **Language**: Scala (with Java interoperability)
- **Installation**:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.4.8</version>
</dependency>
```

## Core Imports

```scala
import org.apache.spark.sql.{SparkSession, DataFrame, Dataset, Column, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
```

For Java:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;
```

## Basic Usage

```scala
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions._

// Create SparkSession
val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .master("local[*]")
  .getOrCreate()

// Load data
val df = spark.read
  .format("json")
  .option("multiline", "true")
  .load("path/to/data.json")

// Transform data
val result = df
  .select(col("name"), col("age").cast("int"))
  .filter(col("age") > 18)
  .groupBy(col("name"))
  .agg(avg(col("age")).alias("avg_age"))

// Execute and show results
result.show()

// SQL queries
df.createOrReplaceTempView("people")
val sqlResult = spark.sql(
  "SELECT name, AVG(age) AS avg_age FROM people WHERE age > 18 GROUP BY name")
sqlResult.show()

spark.stop()
```

## Architecture

Apache Spark SQL is built around several key components:

- **SparkSession**: The unified entry point for all Spark SQL functionality, replacing SQLContext and HiveContext
- **DataFrame/Dataset API**: Distributed collections with schema information and type safety (Dataset[T])
- **Catalyst Optimizer**: Query optimizer that applies rule-based and cost-based optimizations
- **Tungsten Execution Engine**: Code generation and memory management for improved performance
- **Data Source API**: Pluggable interface for reading from and writing to various data formats
- **Catalog API**: Interface for managing databases, tables, functions, and cached data

## Capabilities

### Session Management

SparkSession serves as the unified entry point for all Spark SQL operations, providing access to DataFrames, SQL execution, and configuration management.

```scala { .api }
object SparkSession {
  def builder(): SparkSession.Builder
}

class SparkSession {
  def sql(sqlText: String): DataFrame
  def read: DataFrameReader
  def readStream: DataStreamReader
  def catalog: Catalog
  def conf: RuntimeConfig
  def table(tableName: String): DataFrame
  def stop(): Unit
}
```

[Session Management](./session-management.md)
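
As a quick sketch of the API above: a session is configured at build time and then inspected or adjusted through `conf`. The app name and config value here are illustrative; `spark.sql.shuffle.partitions` is a standard Spark SQL setting. Requires a Spark 2.4.x runtime.

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a session; config values set here land in spark.conf
val spark = SparkSession.builder()
  .appName("session-demo")  // illustrative name
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

// Runtime configuration can be read and changed after startup
println(spark.conf.get("spark.sql.shuffle.partitions"))  // "8"
spark.conf.set("spark.sql.shuffle.partitions", "4")

// Look up a registered table through the session
spark.range(5).createOrReplaceTempView("ids")
spark.table("ids").show()

spark.stop()
```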

### DataFrame and Dataset Operations

Core distributed data structures with schema information and type safety, supporting both typed (Dataset[T]) and untyped (DataFrame) operations.

```scala { .api }
class Dataset[T] {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def join(right: Dataset[_]): DataFrame
  def union(other: Dataset[T]): Dataset[T]
  def count(): Long
  def collect(): Array[T]
  def show(): Unit
}

type DataFrame = Dataset[Row]
```

[DataFrame and Dataset Operations](./dataframe-dataset.md)
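
A brief sketch of typed versus untyped use of the same API. The `Person` case class and sample rows are made up for illustration; Spark's encoder derivation expects the case class to be defined at the top level (as it is in spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(name: String, age: Int)  // hypothetical schema

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // brings encoders for case classes into scope

// Typed Dataset[Person]: filter preserves the element type
val people = Seq(Person("Ana", 34), Person("Bo", 17)).toDS()
val adults = people.filter(col("age") >= 18)  // still Dataset[Person]

// Untyped: select drops down to DataFrame (Dataset[Row])
val names = adults.select(col("name"))

println(adults.count())  // 1
```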

### Data Input and Output

Comprehensive I/O capabilities for reading from and writing to various data sources including files, databases, and streaming sources.

```scala { .api }
class DataFrameReader {
  def format(source: String): DataFrameReader
  def schema(schema: StructType): DataFrameReader
  def option(key: String, value: String): DataFrameReader
  def load(): DataFrame
  def json(path: String): DataFrame
  def parquet(path: String): DataFrame
  def jdbc(url: String, table: String, properties: Properties): DataFrame
}

class DataFrameWriter[T] {
  def mode(saveMode: SaveMode): DataFrameWriter[T]
  def format(source: String): DataFrameWriter[T]
  def option(key: String, value: String): DataFrameWriter[T]
  def save(): Unit
  def saveAsTable(tableName: String): Unit
}
```

[Data Input and Output](./data-io.md)
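
A minimal read-then-write round trip using the builders above. The file paths are hypothetical; `SaveMode.Overwrite` controls what happens when the target already exists.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Read: format and options are accumulated, then resolved at load()
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/input.csv")  // hypothetical path

// Write: mode() decides behavior for an existing target path
df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save("data/output.parquet")  // hypothetical path
```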

### SQL Functions and Expressions

Extensive library of built-in SQL functions for data manipulation, aggregation, string processing, date/time operations, and mathematical calculations.

```scala { .api }
// Column operations
class Column {
  def +(other: Any): Column
  def ===(other: Any): Column
  def isNull: Column
  def cast(to: DataType): Column
  def alias(alias: String): Column
}

// Built-in functions (from the functions object)
def col(colName: String): Column
def lit(literal: Any): Column
def when(condition: Column, value: Any): Column
def sum(e: Column): Column
def avg(e: Column): Column
def count(e: Column): Column
def max(e: Column): Column
def min(e: Column): Column
```

[SQL Functions and Expressions](./functions-expressions.md)
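
A sketch combining these column expressions. The data is made up, and `otherwise` is a Column method not shown in the listing above (it completes a `when` conditional).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Ana", "34"), ("Bo", null)).toDF("name", "age")  // made-up data

df.select(
    col("name"),
    col("age").cast(IntegerType).alias("age_int"),
    // when/otherwise builds a conditional expression over columns
    when(col("age").isNull, lit("unknown")).otherwise(lit("known")).alias("age_status"),
    (col("age").cast(IntegerType) + 1).alias("next_age")
  )
  .show()
```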

### Aggregations and Grouping

Powerful aggregation capabilities with both untyped DataFrame aggregations and type-safe Dataset aggregations for complex analytical operations.

```scala { .api }
class RelationalGroupedDataset {
  def agg(expr: Column, exprs: Column*): DataFrame
  def count(): DataFrame
  def sum(colNames: String*): DataFrame
  def avg(colNames: String*): DataFrame
  def pivot(pivotColumn: String): RelationalGroupedDataset
}

class KeyValueGroupedDataset[K, V] {
  def agg[U1](column: TypedColumn[V, U1]): Dataset[(K, U1)]
  def count(): Dataset[(K, Long)]
  def mapGroups[U](f: (K, Iterator[V]) => U): Dataset[U]
  def reduceGroups(f: (V, V) => V): Dataset[(K, V)]
}
```

[Aggregations and Grouping](./aggregations.md)
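
A sketch of both grouping styles on made-up sales data. `groupBy` yields the untyped `RelationalGroupedDataset` above; `groupByKey` (a Dataset method not shown in the listing) yields the typed `KeyValueGroupedDataset`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(  // made-up data
  ("2019", "east", 100), ("2019", "west", 80), ("2020", "east", 120)
).toDF("year", "region", "amount")

// Untyped aggregation over a RelationalGroupedDataset
sales.groupBy(col("year"))
  .agg(sum(col("amount")).alias("total"), avg(col("amount")).alias("mean"))
  .show()

// pivot turns distinct region values into columns before aggregating
sales.groupBy(col("year")).pivot("region").sum("amount").show()

// Typed grouping: groupByKey produces a KeyValueGroupedDataset
sales.as[(String, String, Int)]
  .groupByKey(_._1)   // key on year
  .count()            // Dataset[(String, Long)]
  .show()
```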

### Streaming Queries

Structured streaming capabilities for processing continuous data streams with the same DataFrame/Dataset APIs used for batch processing.

```scala { .api }
class DataStreamWriter[T] {
  def outputMode(outputMode: OutputMode): DataStreamWriter[T]
  def trigger(trigger: Trigger): DataStreamWriter[T]
  def start(): StreamingQuery
  def foreach(writer: ForeachWriter[T]): DataStreamWriter[T]
}

class StreamingQuery {
  def isActive: Boolean
  def awaitTermination(): Unit
  def stop(): Unit
  def status: StreamingQueryStatus
}
```

[Streaming Queries](./streaming.md)
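
A minimal streaming sketch using the built-in `rate` source and `console` sink, which need no external systems. The string overload of `outputMode` and the timeout overload of `awaitTermination` are not shown in the listing above but exist on these classes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The "rate" source emits (timestamp, value) rows at a fixed rate
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Write to the console sink, producing a micro-batch every 2 seconds
val query = stream.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

query.awaitTermination(10000)  // run for ~10s, then stop
query.stop()
spark.stop()
```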

### Catalog and Metadata Management

Comprehensive metadata management for databases, tables, functions, and cached data with full programmatic access to the Spark catalog.

```scala { .api }
class Catalog {
  def currentDatabase: String
  def listDatabases(): Dataset[Database]
  def listTables(): Dataset[Table]
  def listColumns(tableName: String): Dataset[Column]  // org.apache.spark.sql.catalog.Column
  def listFunctions(): Dataset[Function]
  def cacheTable(tableName: String): Unit
  def isCached(tableName: String): Boolean
}
```

[Catalog and Metadata Management](./catalog.md)
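
A short sketch of inspecting and caching through the catalog. The temp view is created only so there is something to list; `uncacheTable` is a Catalog method not shown in the listing above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.range(10).createOrReplaceTempView("ids")

println(spark.catalog.currentDatabase)  // typically "default"
spark.catalog.listTables().show()       // includes the "ids" temp view

// Cache the view's data in memory and verify
spark.catalog.cacheTable("ids")
println(spark.catalog.isCached("ids"))  // true
spark.catalog.uncacheTable("ids")
```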

## Types

```scala { .api }
// Core data types
abstract class DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object DoubleType extends DataType
case object BooleanType extends DataType
case object DateType extends DataType
case object TimestampType extends DataType

// Complex types
case class StructType(fields: Array[StructField]) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean)
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) extends DataType

// Row representation
class Row {
  def get(i: Int): Any
  def getString(i: Int): String
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getBoolean(i: Int): Boolean
}

// Save modes (SaveMode is defined as a Java enum in org.apache.spark.sql)
// Values: SaveMode.Append, SaveMode.Overwrite, SaveMode.ErrorIfExists, SaveMode.Ignore
```
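
These types come together when building a schema explicitly. A sketch with made-up fields and rows:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Build an explicit schema from the types above
val schema = StructType(Array(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)
))

// Create a DataFrame from Rows plus the schema
val rows = java.util.Arrays.asList(Row("Ana", 34), Row("Bo", null))
val df = spark.createDataFrame(rows, schema)

df.printSchema()
// Read typed fields back out of collected Rows
df.collect().foreach { r => println(r.getString(0)) }
```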