tessl/maven-org-apache-spark--spark-repl_2-12

Interactive Scala shell (REPL) component for Apache Spark providing real-time data processing capabilities and exploratory data analysis

Workspace: tessl
Visibility: Public
Describes: Maven package pkg:maven/org.apache.spark/spark-repl_2.12@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-repl_2-12@3.5.0


Apache Spark REPL

Apache Spark REPL provides an interactive Scala shell for Apache Spark, enabling developers to interactively explore data and execute Spark computations in a command-line environment. It integrates seamlessly with Spark's core functionality to provide real-time data processing capabilities and serves as both a learning tool and development environment for Spark applications.

Package Information

  • Package Name: org.apache.spark:spark-repl_2.12
  • Package Type: Maven
  • Language: Scala 2.12
  • Installation: Include as Maven dependency or use via spark-shell command
  • Version: 3.5.6

Maven Dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-repl_2.12</artifactId>
    <version>3.5.6</version>
</dependency>
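
For sbt-based builds, the same artifact can be declared with the %% operator, which appends the project's Scala binary version (so this resolves to spark-repl_2.12 when scalaVersion is 2.12.x):

// build.sbt -- equivalent of the Maven dependency above
libraryDependencies += "org.apache.spark" %% "spark-repl" % "3.5.6"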

Core Imports

import org.apache.spark.repl.{Main, SparkILoop, Signaling}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.JPrintWriter
import java.io.BufferedReader

Basic Usage

Starting the Interactive Shell

// Command-line usage (typical)
$ spark-shell

// Programmatic startup
import org.apache.spark.repl.Main

object MyApp {
  def main(args: Array[String]): Unit = {
    Main.main(args)
  }
}

Programmatic Code Execution

import org.apache.spark.repl.SparkILoop

// Execute code in REPL and capture output
val result = SparkILoop.run("""
val rdd = sc.parallelize(1 to 100)
val sum = rdd.sum()
println(s"Sum: $sum")
""")

// Execute multiple code blocks
val lines = List(
  "val data = 1 to 1000",
  "val rdd = sc.parallelize(data)",
  "val squares = rdd.map(x => x * x)",
  "squares.take(10)"
)
val output = SparkILoop.run(lines)

Architecture

The Spark REPL is built around several key components:

  • Main Entry Point: Main object handles application startup, SparkSession creation, and REPL lifecycle management
  • Interactive Loop: SparkILoop class extends Scala's standard REPL with Spark-specific functionality and initialization commands
  • Session Management: Automatic SparkSession and SparkContext setup with proper configuration for interactive use
  • Signal Handling: Graceful job cancellation via Ctrl+C interrupt handling
  • Class Loading: Dynamic compilation and loading of user code with proper Spark integration
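
A minimal sketch of how these components fit together when embedding the shell in another application, assuming the Scala 2.12 process(settings) entry point and the Scala compiler's GenericRunnerSettings:

import org.apache.spark.repl.SparkILoop
import scala.tools.nsc.GenericRunnerSettings

object EmbeddedSparkShell {
  def main(args: Array[String]): Unit = {
    // Compiler settings for the interpreter; usejavacp lets the REPL
    // compile user input against the host application's classpath
    val settings = new GenericRunnerSettings(msg => sys.error(msg))
    settings.usejavacp.value = true

    // SparkILoop runs its Spark-specific initialization commands on startup
    // and then blocks in the interactive loop until the user exits
    val interp = new SparkILoop()
    interp.process(settings)
  }
}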

Capabilities

REPL Session Management

Core functionality for starting, configuring, and managing interactive Spark shell sessions. Handles SparkSession creation, configuration, and lifecycle management.

object Main extends Logging {
  val conf: SparkConf
  val outputDir: File
  var sparkContext: SparkContext
  var sparkSession: SparkSession
  var interp: SparkILoop
  
  def main(args: Array[String]): Unit
  def createSparkSession(): SparkSession
  private[repl] def doMain(args: Array[String], _interp: SparkILoop): Unit
}
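
A small sketch of driving Main programmatically; it assumes the fields above are accessible before startup, and the spark.master value is purely illustrative:

import org.apache.spark.repl.Main

// Adjust the REPL's SparkConf before the shell starts (illustrative setting)
Main.conf.setIfMissing("spark.master", "local[*]")

// Blocks in the interactive loop until the user exits; Main creates the
// SparkSession (Main.sparkSession) and SparkContext (Main.sparkContext)
Main.main(Array.empty[String])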

See docs/session-management.md for details.

Interactive Shell Interface

Interactive shell implementation providing Spark-specific REPL functionality with automatic context initialization and enhanced command support.

class SparkILoop(in0: Option[BufferedReader], out: JPrintWriter) extends ILoop(in0, out) {
  def this(in0: BufferedReader, out: JPrintWriter)
  def this()
  
  val initializationCommands: Seq[String]
  def initializeSpark(): Unit
  def printWelcome(): Unit
  def resetCommand(line: String): Unit
  def replay(): Unit
  def process(settings: Settings): Boolean
  def commands: List[LoopCommand]
}

object SparkILoop {
  def run(code: String, sets: Settings = new Settings): String
  def run(lines: List[String]): String
}
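
The run helpers in the companion object are convenient for tests and tooling. A sketch passing explicit compiler Settings via the default-argument overload shown above; the spark.master system property is set here only so the REPL can initialize a local session:

import org.apache.spark.repl.SparkILoop
import scala.tools.nsc.Settings

// The REPL creates a SparkSession on startup, so a master must be resolvable
System.setProperty("spark.master", "local[*]")

// Explicit compiler settings; usejavacp points the interpreter at the caller's classpath
val settings = new Settings
settings.usejavacp.value = true

// run() captures everything the REPL prints and returns it as a String
val transcript = SparkILoop.run("println(1 + 1)", settings)
println(transcript)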

See docs/interactive-shell.md for details.

Signal Handling

Interrupt and job cancellation functionality for graceful handling of Ctrl+C and job termination in interactive sessions.

object Signaling extends Logging {
  def cancelOnInterrupt(): Unit
}
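
In an embedded setup the handler can be registered explicitly; a one-line sketch:

import org.apache.spark.repl.Signaling

// Installs an interrupt (Ctrl+C) handler that cancels running Spark jobs
// on the active SparkContext instead of terminating the shell
Signaling.cancelOnInterrupt()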

See docs/signaling.md for details.

Global Variables and Context

When the REPL starts, several key variables are automatically available:

// Available in REPL session after initialization
@transient val spark: SparkSession  // The active SparkSession
@transient val sc: SparkContext     // The SparkContext from the session

// Standard imports are automatically available:
import org.apache.spark.SparkContext._
import spark.implicits._
import spark.sql
import org.apache.spark.sql.functions._
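
For example, a short interactive session can use these variables and imports directly (illustrative transcript):

scala> val df = spark.range(1, 11).toDF("n")   // pre-defined SparkSession
scala> df.filter($"n" % 2 === 0).count()       // $-interpolator from spark.implicits._
scala> sc.parallelize(1 to 100).sum()          // pre-defined SparkContext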

Error Handling

The REPL provides robust error handling for common scenarios:

  • Initialization Failures: Graceful handling of SparkSession creation errors
  • Job Cancellation: Ctrl+C handling for running jobs with user-friendly messaging
  • Compilation Errors: Clear reporting of Scala compilation issues
  • Runtime Exceptions: Proper exception handling and reporting within the REPL context
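
When the REPL is driven programmatically through SparkILoop.run, compilation errors appear in the captured output rather than as thrown exceptions; a hedged sketch of checking for them (the error-marker string match is a simplification):

import org.apache.spark.repl.SparkILoop

// The REPL prints compile errors to its console output, which run() returns as a String
val transcript = SparkILoop.run("val n: Int = true")
if (transcript.contains("error:")) {
  println("REPL reported a compilation error")
}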

Platform Considerations

Scala Version Compatibility

The REPL supports multiple Scala versions with version-specific implementations:

  • Scala 2.12: Uses process() method for REPL execution
  • Scala 2.13: Uses run() method (API change in Scala compiler)

Environment Integration

  • SPARK_HOME: Automatically detected and configured via System.getenv("SPARK_HOME")
  • SPARK_EXECUTOR_URI: Custom executor URI configuration via environment variable
  • Classpath Management: Dynamic JAR loading with file:// URL scheme normalization
  • Class Output: Temporary directory creation with spark.repl.classdir configuration
  • Web UI: Automatic display of Spark Web UI URL with reverse proxy support
  • Hive Support: Conditional enablement based on SparkSession.hiveClassesArePresent
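
A sketch exercising some of these settings when building a session outside the shell; the property names come from the list above, while the concrete values are illustrative:

import org.apache.spark.sql.SparkSession
import scala.util.Try

val builder = SparkSession.builder()
  .master("local[*]")
  .config("spark.repl.classdir", "/tmp/spark-repl-classes") // directory for REPL-compiled classes
  .config("spark.ui.reverseProxy", "true")                  // reverse-proxy-aware Web UI URL

// Enable Hive support only when the Hive classes are on the classpath,
// mirroring the REPL's conditional behaviour; enableHiveSupport() throws otherwise
val spark = Try(builder.enableHiveSupport()).getOrElse(builder).getOrCreate()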