Interactive Scala shell (REPL) component for Apache Spark providing real-time data processing capabilities and exploratory data analysis
```bash
npx @tessl/cli install tessl/maven-org-apache-spark--spark-repl_2-12@3.5.0
```

Apache Spark REPL provides an interactive Scala shell for Apache Spark, enabling developers to interactively explore data and execute Spark computations in a command-line environment. It integrates with Spark's core functionality to provide real-time data processing capabilities and serves as both a learning tool and a development environment for Spark applications.
It can be launched with the `spark-shell` command, or added as a Maven dependency:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-repl_2.12</artifactId>
  <version>3.5.6</version>
</dependency>
```

```scala
import org.apache.spark.repl.{Main, SparkILoop, Signaling}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.JPrintWriter
import java.io.BufferedReader
```

```scala
// Command-line usage (typical)
$ spark-shell
// Programmatic startup
import org.apache.spark.repl.Main
object MyApp {
  def main(args: Array[String]): Unit = {
    Main.main(args)
  }
}
```

```scala
import org.apache.spark.repl.SparkILoop

// Execute code in REPL and capture output
val result = SparkILoop.run("""
val rdd = sc.parallelize(1 to 100)
val sum = rdd.sum()
println(s"Sum: $sum")
""")
// Execute multiple code blocks
val lines = List(
  "val data = 1 to 1000",
  "val rdd = sc.parallelize(data)",
  "val squares = rdd.map(x => x * x)",
  "squares.take(10)"
)
val output = SparkILoop.run(lines)
```

The Spark REPL is built around several key components:
- `Main` object: handles application startup, SparkSession creation, and REPL lifecycle management
- `SparkILoop` class: extends Scala's standard REPL with Spark-specific functionality and initialization commands

The `Main` object provides the core functionality for starting, configuring, and managing interactive Spark shell sessions, handling SparkSession creation, configuration, and lifecycle management:

```scala
object Main extends Logging {
  val conf: SparkConf
  val outputDir: File
  var sparkContext: SparkContext
  var sparkSession: SparkSession
  var interp: SparkILoop
  def main(args: Array[String]): Unit
  def createSparkSession(): SparkSession
  private[repl] def doMain(args: Array[String], _interp: SparkILoop): Unit
}
```
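For illustration, a minimal sketch of embedding the REPL in another application. It assumes that settings placed on `Main.conf` before startup are picked up when the session's SparkSession is created; `EmbeddedShell` and the property values are hypothetical:

```scala
import org.apache.spark.repl.Main

object EmbeddedShell {
  def main(args: Array[String]): Unit = {
    // Assumption: Main.conf is the SparkConf consulted during startup.
    Main.conf.setIfMissing("spark.master", "local[*]")
    Main.conf.setIfMissing("spark.app.name", "embedded-spark-shell")
    Main.main(args) // blocks until the interactive session exits
  }
}
```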
`SparkILoop` is the interactive shell implementation, providing Spark-specific REPL functionality with automatic context initialization and enhanced command support:

```scala
class SparkILoop(in0: Option[BufferedReader], out: JPrintWriter) extends ILoop(in0, out) {
  def this(in0: BufferedReader, out: JPrintWriter)
  def this()
  val initializationCommands: Seq[String]
  def initializeSpark(): Unit
  def printWelcome(): Unit
  def resetCommand(line: String): Unit
  def replay(): Unit
  def process(settings: Settings): Boolean
  def commands: List[LoopCommand]
}
object SparkILoop {
  def run(code: String, sets: Settings = new Settings): String
  def run(lines: List[String]): String
}
```
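As a sketch of driving the loop directly rather than through `SparkILoop.run`, the usual `ILoop` pattern of passing compiler `Settings` to `process` should apply; `usejavacp` is a stock `scala.tools.nsc` setting, not something defined by this module:

```scala
import scala.tools.nsc.Settings
import org.apache.spark.repl.SparkILoop

val settings = new Settings
settings.usejavacp.value = true // expose the JVM classpath to the interpreter

val repl = new SparkILoop()     // no-arg constructor: reads stdin, writes stdout
val finishedCleanly: Boolean = repl.process(settings)
```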
`Signaling` provides interrupt and job cancellation functionality for graceful handling of Ctrl+C and job termination in interactive sessions:

```scala
object Signaling extends Logging {
  def cancelOnInterrupt(): Unit
}
```
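A brief sketch of where this hook could be invoked when wiring up a custom interactive session; the exact effect of an interrupt (cancelling active jobs rather than killing the JVM) is stated here as an assumption:

```scala
import org.apache.spark.repl.Signaling

// Register a Ctrl+C (SIGINT) handler for the interactive session.
// Assumption: an interrupt then cancels running Spark jobs instead of terminating the shell.
Signaling.cancelOnInterrupt()
```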
When the REPL starts, several key variables are automatically available:

```scala
// Available in the REPL session after initialization
@transient val spark: SparkSession // The active SparkSession
@transient val sc: SparkContext // The SparkContext from the session
// Standard imports are automatically available:
import org.apache.spark.SparkContext._
import spark.implicits._
import spark.sql
import org.apache.spark.sql.functions._
```
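For illustration, a short session that uses these bindings directly; the data and the `Point` case class are invented for the example:

```scala
// DataFrame built from the auto-created SparkSession; col comes from sql.functions._
val df = spark.range(1, 6).toDF("n")
df.select(col("n") * 2).show()

// Dataset created through spark.implicits._
case class Point(x: Int, y: Int)
Seq(Point(1, 2), Point(3, 4)).toDS().show()
```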
The REPL provides robust error handling for common scenarios, and it supports multiple Scala versions with version-specific implementations. Version- and environment-dependent touchpoints include:

- the `process()` method for REPL execution vs. the `run()` method (an API change in the Scala compiler)
- `System.getenv("SPARK_HOME")`
- the `spark.repl.classdir` configuration (see the sketch below)
- `SparkSession.hiveClassesArePresent`
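As an illustrative sketch of those touchpoints (the property and environment variable are real Spark settings, but the paths and values are hypothetical):

```scala
import org.apache.spark.SparkConf

// The directory where the REPL writes class files for code compiled at the prompt
// can be steered via spark.repl.classdir; otherwise a temporary directory is used
// (assumed default behavior).
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("repl-classdir-example")
  .set("spark.repl.classdir", "/tmp/spark-repl-classes") // example path, assumed writable

// SPARK_HOME is read from the environment, equivalent to System.getenv("SPARK_HOME").
val sparkHome: Option[String] = sys.env.get("SPARK_HOME")
```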