tessl/maven-org-apache-spark--spark-connect_2-13

A decoupled client-server architecture component for Apache Spark that enables remote connectivity to Spark clusters using the DataFrame API and gRPC protocol.

Describes: pkg:maven/org.apache.spark/spark-connect_2.13@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-connect_2-13@3.5.0


Apache Spark Connect Server

Apache Spark Connect Server provides a decoupled client-server architecture that enables remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. The server is a gRPC service that receives requests from Spark Connect clients and executes them on the Spark cluster, allowing Spark to be used from modern data applications, IDEs, notebooks, and a range of programming languages.

Package Information

  • Package Name: spark-connect_2.13
  • Package Type: maven
  • Language: Scala
  • Group ID: org.apache.spark
  • Installation: Add the dependency to your build.sbt or pom.xml

Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect_2.13</artifactId>
    <version>3.5.6</version>
</dependency>

SBT:

// %% appends the Scala binary version, resolving to spark-connect_2.13
libraryDependencies += "org.apache.spark" %% "spark-connect" % "3.5.6"

Core Imports

import org.apache.spark.sql.connect.service.{SparkConnectService, SparkConnectServer}
import org.apache.spark.sql.connect.plugin.{RelationPlugin, ExpressionPlugin, CommandPlugin}
import org.apache.spark.sql.connect.planner.SparkConnectPlanner
import org.apache.spark.sql.connect.config.Connect

Basic Usage

Starting the Server

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.service.SparkConnectService

// Start Spark session
val session = SparkSession.builder.getOrCreate()

// Start Connect server
SparkConnectService.start(session.sparkContext)

// Server is now listening on configured port (default: 15002)

Standalone Server Application

import org.apache.spark.sql.connect.service.SparkConnectServer

// Run the standalone server entry point (starts a Spark session and
// blocks serving client requests until terminated)
SparkConnectServer.main(Array.empty)

Architecture

The Spark Connect Server architecture consists of several key layers:

  • gRPC Service Layer: Handles client requests via protocol buffer definitions
  • Request Handlers: Process different types of requests (execute, analyze, artifacts, etc.)
  • Planning Layer: Converts protocol buffer plans to Catalyst logical plans
  • Plugin System: Extensible architecture for custom functionality
  • Session Management: Manages client sessions and execution state
  • Artifact Management: Handles JAR uploads and dynamic class loading
  • Monitoring & UI: Web interface for server monitoring and debugging

Capabilities

Server Management and Configuration

Core server functionality including startup, configuration, and lifecycle management.

object SparkConnectService {
  def start(sc: SparkContext): Unit
  def stop(timeout: Option[Long] = None, unit: Option[TimeUnit] = None): Unit
}

object SparkConnectServer {
  def main(args: Array[String]): Unit
}

Server Management
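A minimal lifecycle sketch based on the signatures above: start the service against an existing SparkContext, then stop it with a drain timeout. The port and timeout values here are illustrative, not requirements.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.service.SparkConnectService

object ConnectServerLifecycle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("connect-server-demo")
      .getOrCreate()

    // Start the gRPC endpoint; it listens on the configured port
    // (default 15002) until stopped.
    SparkConnectService.start(spark.sparkContext)

    // ... serve clients ...

    // Shut down, waiting up to 30 seconds for in-flight work to drain.
    SparkConnectService.stop(Some(30L), Some(TimeUnit.SECONDS))
    spark.stop()
  }
}
```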

Plugin System

Extensible plugin architecture for custom relations, expressions, and commands.

trait RelationPlugin {
  def transform(relation: com.google.protobuf.Any, planner: SparkConnectPlanner): Option[LogicalPlan]
}

trait ExpressionPlugin {
  def transform(expression: com.google.protobuf.Any, planner: SparkConnectPlanner): Option[Expression]
}

trait CommandPlugin {
  def process(command: com.google.protobuf.Any, planner: SparkConnectPlanner): Option[Unit]
}

Plugin System
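As a sketch of the extension contract above, a RelationPlugin inspects the incoming protobuf Any, claims it if it recognizes the message type, and maps it to a Catalyst plan. `MyRangeProto` is a hypothetical client-defined protobuf message, and the mapping to Catalyst's built-in `Range` plan is purely illustrative.

```scala
import com.google.protobuf.{Any => ProtoAny}

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Range}
import org.apache.spark.sql.connect.planner.SparkConnectPlanner
import org.apache.spark.sql.connect.plugin.RelationPlugin

// Hypothetical plugin: recognizes a custom protobuf message and rewrites
// it into a built-in Range plan. Returning None signals "not mine", so
// other plugins or the default planner can handle the relation.
class RangeRelationPlugin extends RelationPlugin {
  override def transform(
      relation: ProtoAny,
      planner: SparkConnectPlanner): Option[LogicalPlan] = {
    if (relation.is(classOf[MyRangeProto])) { // MyRangeProto is a placeholder
      val msg = relation.unpack(classOf[MyRangeProto])
      Some(Range(msg.getStart, msg.getEnd, step = 1L, numSlices = None))
    } else {
      None
    }
  }
}
```

Plugins are registered via server configuration; in Spark 3.5 the relevant keys are of the form spark.connect.extensions.relation.classes (and the analogous expression/command keys), each taking a comma-separated list of plugin class names.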

Plan Processing and Execution

Convert protocol buffer plans to Catalyst plans and manage execution lifecycle.

class SparkConnectPlanner(sessionHolder: SessionHolder) {
  def transformRelation(rel: proto.Relation): LogicalPlan
  def transformExpression(exp: proto.Expression): Expression
  def process(command: proto.Command, responseObserver: StreamObserver[ExecutePlanResponse], executeHolder: ExecuteHolder): Unit
}

Plan Processing

Session and State Management

Manage client sessions, execution state, and concurrent operations.

object SparkConnectService {
  def getOrCreateIsolatedSession(userId: String, sessionId: String): SessionHolder
  def getIsolatedSession(userId: String, sessionId: String): SessionHolder
  def listActiveExecutions: Either[Long, Seq[ExecuteInfo]]
}

Session Management
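A short sketch of the session API above. Sessions are keyed by (userId, sessionId); the `session` field on SessionHolder and the meaning of the Left branch (a timestamp when nothing is running) are assumptions to verify against your Spark version.

```scala
import org.apache.spark.sql.connect.service.SparkConnectService

// Get (or lazily create) an isolated session for this user/session pair.
val holder = SparkConnectService.getOrCreateIsolatedSession("alice", "session-1")

// The underlying SparkSession, isolated from other clients' sessions.
val spark = holder.session

// Inspect currently running executions.
SparkConnectService.listActiveExecutions match {
  case Right(executions) =>
    executions.foreach(info => println(s"active execution: $info"))
  case Left(lastActivity) =>
    println(s"no active executions (last activity marker: $lastActivity)")
}
```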

Artifact Management

Handle JAR uploads, file management, and dynamic class loading for user code.

class SparkConnectArtifactManager(sessionHolder: SessionHolder) {
  def getSparkConnectAddedJars: Seq[URL]
  def getSparkConnectPythonIncludes: Seq[String]
  def classloader: ClassLoader
}

Artifact Management
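A sketch of consuming client-uploaded artifacts through the manager above. That the manager hangs off a SessionHolder as `artifactManager`, and the `com.example.MyUdf` class name, are assumptions for illustration.

```scala
// Assuming a SessionHolder obtained as shown under Session Management.
val artifacts = sessionHolder.artifactManager

// JARs the connected client has uploaded for this session.
val jars: Seq[java.net.URL] = artifacts.getSparkConnectAddedJars

// Class loader that resolves classes from those uploaded JARs,
// enabling dynamic loading of user code (e.g. UDF implementations).
val loader: ClassLoader = artifacts.classloader
val udfClass = loader.loadClass("com.example.MyUdf") // hypothetical client class
```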

Monitoring and Web UI

Web interface components for server monitoring, session tracking, and debugging.

class SparkConnectServerTab(sparkContext: SparkContext, store: SparkConnectServerAppStatusStore, appName: String) {
  def detach(): Unit
  def displayOrder: Int
}

Monitoring and UI

Configuration System

Comprehensive configuration options for server behavior, security, and performance tuning.

object Connect {
  val CONNECT_GRPC_BINDING_PORT: ConfigEntry[Int]
  val CONNECT_GRPC_MAX_INBOUND_MESSAGE_SIZE: ConfigEntry[Int]
  val CONNECT_EXECUTE_REATTACHABLE_ENABLED: ConfigEntry[Boolean]
  // ... additional configuration entries
}

Configuration
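The config entries above are set through ordinary Spark configuration. The string keys below are the ones these entries are bound to in Spark 3.5 (e.g. spark.connect.grpc.binding.port); verify them against your Spark version before relying on them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("configured-connect-server")
  // Port the gRPC service binds to (default 15002).
  .config("spark.connect.grpc.binding.port", "15002")
  // Cap on inbound gRPC message size, here 128 MiB.
  .config("spark.connect.grpc.maxInboundMessageSize", (128 * 1024 * 1024).toString)
  // Allow clients to reattach to long-running executions.
  .config("spark.connect.execute.reattachable.enabled", "true")
  .getOrCreate()
```

The same keys can equally be passed via --conf on spark-submit or set in spark-defaults.conf.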

Error Handling

The server uses centralized error handling through the ErrorUtils object, which converts Spark exceptions to appropriate gRPC status codes and error messages for client consumption.

Security Considerations

  • Configure authentication and authorization via Spark security settings
  • Use TLS for encrypted communication between clients and server
  • Implement custom interceptors for request validation and logging
  • Manage artifacts securely with proper sandboxing and validation

Performance and Scalability

  • Supports concurrent client sessions with isolated execution contexts
  • Implements reattachable executions for fault tolerance
  • Provides streaming responses for large result sets
  • Configurable resource limits and timeouts
  • Efficient artifact caching and class loading