Interactive Scala shell (REPL) component for Apache Spark providing real-time data processing and exploratory data analysis capabilities
npx @tessl/cli install tessl/maven-org-apache-spark--spark-repl_2-12@3.5.6
# Apache Spark REPL
Apache Spark REPL provides an interactive Scala shell for Apache Spark, enabling developers to interactively explore data and execute Spark computations in a command-line environment. It integrates seamlessly with Spark's core functionality to provide real-time data processing capabilities and serves as both a learning tool and development environment for Spark applications.

## Package Information

- **Package Name**: org.apache.spark:spark-repl_2.12
- **Package Type**: Maven
- **Language**: Scala 2.12
- **Installation**: Include as a Maven dependency or use via the `spark-shell` command
- **Version**: 3.5.6
## Maven Dependency

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-repl_2.12</artifactId>
  <version>3.5.6</version>
</dependency>
```
## Core Imports

```scala
import org.apache.spark.repl.{Main, SparkILoop, Signaling}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.JPrintWriter
import java.io.BufferedReader
```
## Basic Usage

### Starting the Interactive Shell

From the command line, the REPL is typically launched via the `spark-shell` script:

```shell
$ spark-shell
```

It can also be started programmatically:

```scala
import org.apache.spark.repl.Main

object MyApp {
  def main(args: Array[String]): Unit = {
    Main.main(args)
  }
}
```
### Programmatic Code Execution

```scala
import org.apache.spark.repl.SparkILoop

// Execute code in the REPL and capture the output
val result = SparkILoop.run("""
  val rdd = sc.parallelize(1 to 100)
  val sum = rdd.sum()
  println(s"Sum: $sum")
""")

// Execute multiple code blocks
val lines = List(
  "val data = 1 to 1000",
  "val rdd = sc.parallelize(data)",
  "val squares = rdd.map(x => x * x)",
  "squares.take(10)"
)
val output = SparkILoop.run(lines)
```
## Architecture

The Spark REPL is built around several key components:

- **Main Entry Point**: `Main` object handles application startup, SparkSession creation, and REPL lifecycle management
- **Interactive Loop**: `SparkILoop` class extends Scala's standard REPL with Spark-specific functionality and initialization commands
- **Session Management**: Automatic SparkSession and SparkContext setup with proper configuration for interactive use
- **Signal Handling**: Graceful job cancellation via Ctrl+C interrupt handling
- **Class Loading**: Dynamic compilation and loading of user code with proper Spark integration
## Capabilities

### REPL Session Management

Core functionality for starting, configuring, and managing interactive Spark shell sessions. Handles SparkSession creation, configuration, and lifecycle management.
```scala { .api }
object Main extends Logging {
  val conf: SparkConf
  val outputDir: File
  var sparkContext: SparkContext
  var sparkSession: SparkSession
  var interp: SparkILoop

  def main(args: Array[String]): Unit
  def createSparkSession(): SparkSession
  private[repl] def doMain(args: Array[String], _interp: SparkILoop): Unit
}
```

[Session Management](./session-management.md)
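Outside of `spark-shell`, `Main.createSparkSession()` can be used to bootstrap the same session the REPL would create. A minimal sketch, assuming the Spark jars are on the classpath; the `local[*]` master and app name are illustrative choices, not defaults:

```scala
import org.apache.spark.repl.Main

object SessionBootstrap {
  def main(args: Array[String]): Unit = {
    // Main.conf is the SparkConf consulted during session creation;
    // configure it before createSparkSession() is called
    Main.conf.setIfMissing("spark.master", "local[*]")
    Main.conf.setIfMissing("spark.app.name", "repl-session-demo")

    // Builds the SparkSession (with Hive support when the classes are present)
    val spark = Main.createSparkSession()
    println(s"Spark version: ${spark.version}")
    spark.stop()
  }
}
```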
### Interactive Shell Interface

Interactive shell implementation providing Spark-specific REPL functionality with automatic context initialization and enhanced command support.

```scala { .api }
class SparkILoop(in0: Option[BufferedReader], out: JPrintWriter) extends ILoop(in0, out) {
  def this(in0: BufferedReader, out: JPrintWriter)
  def this()

  val initializationCommands: Seq[String]
  def initializeSpark(): Unit
  def printWelcome(): Unit
  def resetCommand(line: String): Unit
  def replay(): Unit
  def process(settings: Settings): Boolean
  def commands: List[LoopCommand]
}

object SparkILoop {
  def run(code: String, sets: Settings = new Settings): String
  def run(lines: List[String]): String
}
```

[Interactive Shell](./interactive-shell.md)
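Beyond the `run` helpers, `SparkILoop` can be embedded directly, the same way Scala's standard `ILoop` is. A hedged sketch for the Scala 2.12 build; enabling `usejavacp` is a common requirement when launching the REPL from an application's own classpath:

```scala
import scala.tools.nsc.Settings
import org.apache.spark.repl.SparkILoop

object EmbeddedShell {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    settings.usejavacp.value = true // reuse the launching application's classpath

    val loop = new SparkILoop() // no-arg constructor: reads stdin, writes stdout
    loop.process(settings)      // blocks until the user quits the shell
  }
}
```

On Scala 2.13 builds the entry point is `run(settings)` rather than `process(settings)`, as noted under Platform Considerations.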
### Signal Handling

Interrupt and job cancellation functionality for graceful handling of Ctrl+C and job termination in interactive sessions.

```scala { .api }
object Signaling extends Logging {
  def cancelOnInterrupt(): Unit
}
```

[Signal Handling](./signaling.md)
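`cancelOnInterrupt()` is registered once, after a SparkContext exists, so that Ctrl+C cancels running jobs instead of killing the shell. A sketch; the surrounding session setup is illustrative:

```scala
import org.apache.spark.repl.{Main, Signaling}

object InterruptibleSession {
  def main(args: Array[String]): Unit = {
    val spark = Main.createSparkSession()

    // Register the SIGINT handler: Ctrl+C now cancels all running jobs
    // on the active SparkContext rather than terminating the JVM.
    Signaling.cancelOnInterrupt()

    // long-running jobs started from here on can be interrupted safely
    spark.stop()
  }
}
```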
142
143
### Global Variables and Context
144
145
When the REPL starts, several key variables are automatically available:
146
147
```scala { .api }
148
// Available in REPL session after initialization
149
@transient val spark: SparkSession // The active SparkSession
150
@transient val sc: SparkContext // The SparkContext from the session
151
152
// Standard imports are automatically available:
153
import org.apache.spark.SparkContext._
154
import spark.implicits._
155
import spark.sql
156
import org.apache.spark.sql.functions._
157
```
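With those bindings in scope, a fresh session can run DataFrame and RDD code with no setup. The snippet below is the kind of thing one might type at the prompt; it only works inside a running `spark-shell`, since `spark` and `sc` exist only there:

```scala
// DataFrame API via the pre-built SparkSession
val df = spark.range(1, 6).toDF("n")
df.agg(sum($"n")).show() // $-syntax comes from spark.implicits._

// RDD API via the pre-built SparkContext
val total = sc.parallelize(1 to 10).reduce(_ + _)

// spark.sql is imported, so queries can be issued directly
df.createOrReplaceTempView("numbers")
sql("SELECT max(n) FROM numbers").show()
```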
158
159
## Error Handling
160
161
The REPL provides robust error handling for common scenarios:
162
163
- **Initialization Failures**: Graceful handling of SparkSession creation errors
164
- **Job Cancellation**: Ctrl+C handling for running jobs with user-friendly messaging
165
- **Compilation Errors**: Clear reporting of Scala compilation issues
166
- **Runtime Exceptions**: Proper exception handling and reporting within the REPL context
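When driving the REPL programmatically, compilation failures do not throw on the caller's side; the error report is part of the captured transcript. A sketch, assuming the `SparkILoop.run` helper shown earlier:

```scala
import org.apache.spark.repl.SparkILoop

// The second statement fails to compile; the transcript returned by run()
// typically contains the compiler's error message rather than an exception
// propagating to the caller.
val transcript = SparkILoop.run(List(
  "val ok = 1 + 1",
  "val bad = undefinedName"
))
println(transcript) // expect a "not found: value undefinedName" report
```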
167
168
## Platform Considerations
169
170
### Scala Version Compatibility
171
172
The REPL supports multiple Scala versions with version-specific implementations:
173
174
- **Scala 2.12**: Uses `process()` method for REPL execution
175
- **Scala 2.13**: Uses `run()` method (API change in Scala compiler)
176
177
### Environment Integration
178
179
- **SPARK_HOME**: Automatically detected and configured via `System.getenv("SPARK_HOME")`
180
- **SPARK_EXECUTOR_URI**: Custom executor URI configuration via environment variable
181
- **Classpath Management**: Dynamic JAR loading with file:// URL scheme normalization
182
- **Class Output**: Temporary directory creation with `spark.repl.classdir` configuration
183
- **Web UI**: Automatic display of Spark Web UI URL with reverse proxy support
184
- **Hive Support**: Conditional enablement based on `SparkSession.hiveClassesArePresent`
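The classpath normalization mentioned above can be pictured with a small helper. `ReplEnv` and `normalizeJarPath` are hypothetical names for illustration only, not part of the spark-repl API:

```scala
import java.nio.file.Paths

object ReplEnv {
  // Environment detection in the style of System.getenv("SPARK_HOME")
  def sparkHome: Option[String] = sys.env.get("SPARK_HOME")

  // Bare local paths become file:// URLs; strings that already carry a
  // scheme (file://, hdfs://, ...) pass through untouched.
  def normalizeJarPath(path: String): String =
    if (path.contains("://")) path
    else Paths.get(path).toAbsolutePath.toUri.toString
}

object NormalizeDemo extends App {
  println(ReplEnv.normalizeJarPath("hdfs://nn:8020/jars/app.jar")) // unchanged
  println(ReplEnv.normalizeJarPath("/tmp/app.jar"))                // file:///tmp/app.jar
}
```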