tessl/maven-org-apache-spark--spark-kubernetes-2-12

Apache Spark Kubernetes resource manager that enables running Spark applications on Kubernetes clusters

Apache Spark Kubernetes Resource Manager

A comprehensive Kubernetes resource manager for Apache Spark that enables running Spark applications natively on Kubernetes clusters with full integration of Kubernetes features and APIs.

Package Information

Group ID: org.apache.spark
Artifact ID: spark-kubernetes_2.12
Version: 3.0.1

Maven Dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-kubernetes_2.12</artifactId>
  <version>3.0.1</version>
</dependency>

SBT Dependency:

libraryDependencies += "org.apache.spark" %% "spark-kubernetes" % "3.0.1"  // %% appends the project's Scala suffix; requires scalaVersion 2.12.x

Core Imports

import org.apache.spark.deploy.k8s._
import org.apache.spark.deploy.k8s.submit._
import org.apache.spark.deploy.k8s.features._
import org.apache.spark.scheduler.cluster.k8s._

Overview

The Apache Spark Kubernetes resource manager provides native integration between Apache Spark and Kubernetes, enabling seamless deployment and execution of Spark applications in containerized environments. It implements a complete cluster manager that schedules Spark driver and executor pods, manages dynamic resource allocation, and provides fault tolerance through Kubernetes-native capabilities.

Key Features

  • Native Kubernetes Integration: Full support for Kubernetes APIs, ConfigMaps, Secrets, and persistent volumes
  • Dynamic Resource Management: Automatic scaling of executor pods based on workload demands
  • Pod Lifecycle Management: Complete monitoring and management of driver and executor pod lifecycles
  • Configuration Flexibility: Extensive configuration options through Spark and Kubernetes parameters
  • Feature-Based Architecture: Modular design using feature steps for pod customization
  • Multi-Language Support: Support for Java, Scala, Python, and R applications

Architecture Overview

The Kubernetes resource manager follows a layered architecture:

  1. Cluster Management Layer: Implements Spark's external cluster manager interface
  2. Application Submission Layer: Handles spark-submit integration and client operations
  3. Configuration Layer: Manages Kubernetes-specific configuration and constants
  4. Pod Management Layer: Handles executor pod states, snapshots, and lifecycle operations
  5. Feature System Layer: Provides extensible pod configuration through feature steps
  6. Utilities Layer: Common utilities for Kubernetes operations and client management

Core Entry Points

Cluster Management { .api }

Primary entry point for Kubernetes cluster operations:

class KubernetesClusterManager extends ExternalClusterManager {
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
  def createSchedulerBackend(
    sc: SparkContext, 
    masterURL: String, 
    scheduler: TaskScheduler
  ): SchedulerBackend
  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
}

Usage: Automatically registered with Spark when using k8s:// master URLs. Handles cluster manager lifecycle and component creation.
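
The URL check behind that registration can be pictured with a tiny standalone sketch (simplified; the real logic lives inside KubernetesClusterManager.canCreate):

```scala
// Sketch: Spark asks each registered ExternalClusterManager whether it
// handles a given master URL; the Kubernetes manager claims k8s:// URLs.
def canCreate(masterURL: String): Boolean =
  masterURL.startsWith("k8s://")

val handled = canCreate("k8s://https://kubernetes.example.com:443")
```

When `canCreate` returns true, Spark delegates scheduler and backend creation for that application to this manager.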

Application Submission { .api }

Main entry point for application submission:

class KubernetesClientApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}

Usage: Invoked by spark-submit when using Kubernetes cluster mode. Manages the complete application submission workflow.

Core Configuration Types

KubernetesConf Hierarchy { .api }

Base configuration class for Kubernetes operations:

abstract class KubernetesConf(
  val sparkConf: SparkConf,
  val appId: String,
  val resourceNamePrefix: String,
  val appName: String,
  val namespace: String,
  val labels: Map[String, String],
  val environment: Map[String, String],
  val annotations: Map[String, String],
  val secretEnvNamesToKeyRefs: Map[String, String],
  val secretNamesToMountPaths: Map[String, String],
  val volumes: Seq[KubernetesVolumeSpec],
  val imagePullPolicy: String,
  val nodeSelector: Map[String, String]
)

class KubernetesDriverConf extends KubernetesConf {
  def serviceAnnotations: Map[String, String]
}

class KubernetesExecutorConf extends KubernetesConf
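
The resourceNamePrefix field exists because Kubernetes object names must be valid DNS-1123 labels, while a Spark appName can be arbitrary text. An illustrative sketch (not Spark's exact code) of that sanitization:

```scala
// Illustrative only: lowercase the app name and keep only characters
// legal in a DNS label, so it can safely prefix pod/service names.
def resourceNamePrefix(appName: String): String =
  appName.toLowerCase
    .replaceAll("[^a-z0-9-]", "-") // replace illegal characters
    .replaceAll("-+", "-")         // collapse runs of dashes
    .stripPrefix("-")
    .stripSuffix("-")
```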

Basic Usage Examples

Submitting a Spark Application

# Using spark-submit with Kubernetes cluster mode
spark-submit \
  --master k8s://https://kubernetes.example.com:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples.jar
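
Submissions usually also set a service account (for RBAC), a pull policy, and executor sizing. A spark-defaults-style fragment with placeholder values:

```properties
spark.kubernetes.authenticate.driver.serviceAccountName  spark
spark.kubernetes.container.image.pullPolicy              IfNotPresent
spark.executor.instances                                 4
spark.kubernetes.executor.request.cores                  1
```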

Programmatic Configuration

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// The ConfigEntry constants in org.apache.spark.deploy.k8s.Config are
// internal (private[spark]), so applications set the string keys directly.
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.kubernetes.namespace", "spark-apps")
  .set("spark.kubernetes.driver.limit.cores", "2")
  .set("spark.executor.instances", "4")

val sc = new SparkContext(conf)

Capability Areas

This library provides comprehensive Kubernetes integration through several key capability areas. Each area has its own detailed documentation:

Cluster Management

  • Core Components: KubernetesClusterManager, KubernetesClusterSchedulerBackend
  • Capabilities: Cluster lifecycle management, task scheduling, resource allocation
  • Integration: Seamless integration with Spark's cluster manager interface

Application Submission

  • Core Components: KubernetesClientApplication, Client, ClientArguments
  • Capabilities: Application submission workflow, argument parsing, status monitoring
  • Integration: Full spark-submit compatibility with Kubernetes-specific features

Configuration Management

  • Core Components: Config object, Constants, KubernetesConf hierarchy
  • Capabilities: Centralized configuration, validation, type-safe properties
  • Integration: Extends Spark's configuration system with Kubernetes-specific options

Pod Management

  • Core Components: ExecutorPodsSnapshot, ExecutorPodState hierarchy, lifecycle managers
  • Capabilities: Pod state tracking, lifecycle management, snapshot-based monitoring
  • Integration: Real-time monitoring and management of executor pods

Feature Steps System

  • Core Components: KubernetesFeatureConfigStep implementations
  • Capabilities: Modular pod configuration, extensible architecture, reusable components
  • Integration: Pluggable system for customizing driver and executor pods
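
The shape of this pattern can be shown with a simplified, self-contained model (hypothetical types, not Spark's real classes): each step takes an immutable pod description and returns an amended copy, so steps compose with a fold.

```scala
// Miniature model of the feature-step pattern.
case class Pod(labels: Map[String, String], volumes: List[String])

trait FeatureStep {
  def configurePod(pod: Pod): Pod
}

// Each step contributes one concern and leaves the rest of the pod alone.
class LabelStep(key: String, value: String) extends FeatureStep {
  def configurePod(pod: Pod): Pod = pod.copy(labels = pod.labels + (key -> value))
}

class VolumeStep(name: String) extends FeatureStep {
  def configurePod(pod: Pod): Pod = pod.copy(volumes = name :: pod.volumes)
}

val steps: List[FeatureStep] =
  List(new LabelStep("spark-role", "driver"), new VolumeStep("spark-conf"))

// Steps compose: fold the initial pod through every configured step.
val configured = steps.foldLeft(Pod(Map.empty, Nil))((pod, step) => step.configurePod(pod))
```

Spark's real KubernetesFeatureConfigStep works analogously over SparkPod, with extra hooks for additional Kubernetes resources and system properties.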

Utilities and Helpers

  • Core Components: KubernetesUtils, SparkKubernetesClientFactory, volume utilities
  • Capabilities: Common operations, client management, volume handling
  • Integration: Supporting utilities used throughout the Kubernetes integration

Integration Patterns

Cluster Manager Registration

The Kubernetes cluster manager automatically registers with Spark's cluster manager registry:

import org.apache.spark.sql.SparkSession

// Automatic registration for k8s:// URLs
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("k8s://https://my-cluster:443")
  .getOrCreate()

Feature-Based Pod Configuration

The feature step system allows modular configuration of pods:

// Feature steps are automatically applied based on configuration
val steps: Seq[KubernetesFeatureConfigStep] = Seq(
  new BasicDriverFeatureStep(conf),
  new DriverServiceFeatureStep(conf),
  new MountVolumesFeatureStep(conf)
)

Snapshot-Based Monitoring

Executor pods are monitored through a snapshot-based system:

// Automatic snapshot updates via Kubernetes API
val snapshot: ExecutorPodsSnapshot = snapshotStore.currentSnapshot
val runningExecutors = snapshot.executorPods.values.collect {
  case PodRunning(pod) => pod
}
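
The idea behind the snapshot model can be reduced to a small self-contained sketch (hypothetical names, not Spark's real types): pod states form a sealed hierarchy, and a snapshot is an immutable map keyed by executor ID.

```scala
// Miniature model of snapshot-based executor tracking.
sealed trait ExecState
case class Running(podName: String) extends ExecState
case class Failed(podName: String) extends ExecState

// An immutable view of cluster state at one point in time.
val snapshot: Map[Long, ExecState] = Map(
  1L -> Running("exec-1"),
  2L -> Failed("exec-2"),
  3L -> Running("exec-3")
)

// Pattern-match over states to select the running pods, as the
// scheduler backend does when reconciling desired vs. actual executors.
val runningPods = snapshot.values.collect { case Running(name) => name }.toList.sorted
```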

Thread Safety and Concurrency

The library is designed for concurrent use in Spark's multi-threaded environment:

  • Immutable Data Structures: Core data types like ExecutorPodsSnapshot and SparkPod are immutable
  • Thread-Safe Operations: Client factories and utilities are thread-safe
  • Concurrent Monitoring: Snapshot sources handle concurrent pod state updates
  • Atomic Updates: Configuration and state changes use atomic operations

Error Handling and Fault Tolerance

Comprehensive error handling and fault tolerance mechanisms:

  • Pod Failure Recovery: Automatic restart of failed executor pods
  • Network Resilience: Robust handling of Kubernetes API connectivity issues
  • Configuration Validation: Extensive validation of Kubernetes-specific configuration
  • Graceful Degradation: Fallback mechanisms for non-critical features

Performance Considerations

  • Batch Operations: Pod operations are batched for efficiency
  • Watch vs Polling: Configurable snapshot sources for optimal performance
  • Resource Limits: Proper CPU and memory limit configuration
  • Image Pull Optimization: Configurable image pull policies
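
The batching and polling behavior above is tunable. The relevant 3.0.x properties, shown with their default values:

```properties
spark.kubernetes.allocation.batch.size            5
spark.kubernetes.allocation.batch.delay           1s
spark.kubernetes.executor.apiPollingInterval      30s
spark.kubernetes.executor.eventProcessingInterval 1s
```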

Getting Started

  1. Setup Kubernetes: Ensure you have a running Kubernetes cluster with appropriate RBAC permissions
  2. Configure Spark: Set Kubernetes-specific configuration properties
  3. Build Container Image: Create a container image with your Spark application
  4. Submit Application: Use spark-submit with Kubernetes master URL
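
One way to handle step 3 is the docker-image-tool.sh script shipped in the Spark distribution; the registry name below is a placeholder:

```
# Run from the root of an unpacked Spark 3.0.1 distribution.
# Builds the JVM image (add -p to also build the PySpark image), then pushes it.
./bin/docker-image-tool.sh -r registry.example.com/spark -t 3.0.1 build
./bin/docker-image-tool.sh -r registry.example.com/spark -t 3.0.1 push
```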

For detailed implementation guidance, see the documentation for each capability area described above.
