
tessl/maven-org-apache-spark--spark-kubernetes-2-12

Apache Spark Kubernetes resource manager that enables running Spark applications on Kubernetes clusters

Workspace: tessl
Visibility: Public
Describes: pkg:maven/org.apache.spark/spark-kubernetes_2.12@3.0.x (Maven)

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-kubernetes-2-12@3.0.0


Apache Spark Kubernetes Resource Manager

A comprehensive Kubernetes resource manager for Apache Spark that runs Spark applications natively on Kubernetes clusters, integrating directly with the Kubernetes API and resources such as ConfigMaps, Secrets, and persistent volumes.

Package Information

Group ID: org.apache.spark
Artifact ID: spark-kubernetes_2.12
Version: 3.0.1

Maven Dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-kubernetes_2.12</artifactId>
  <version>3.0.1</version>
</dependency>

SBT Dependency:

libraryDependencies += "org.apache.spark" %% "spark-kubernetes" % "3.0.1"

Core Imports

import org.apache.spark.deploy.k8s._
import org.apache.spark.deploy.k8s.submit._
import org.apache.spark.deploy.k8s.features._
import org.apache.spark.scheduler.cluster.k8s._

Overview

The Apache Spark Kubernetes resource manager provides native integration between Apache Spark and Kubernetes, enabling seamless deployment and execution of Spark applications in containerized environments. It implements a complete cluster manager that schedules Spark driver and executor pods, manages dynamic resource allocation, and provides fault tolerance through Kubernetes-native capabilities.

Key Features

  • Native Kubernetes Integration: Full support for Kubernetes APIs, ConfigMaps, Secrets, and persistent volumes
  • Dynamic Resource Management: Automatic scaling of executor pods based on workload demands
  • Pod Lifecycle Management: Complete monitoring and management of driver and executor pod lifecycles
  • Configuration Flexibility: Extensive configuration options through Spark and Kubernetes parameters
  • Feature-Based Architecture: Modular design using feature steps for pod customization
  • Multi-Language Support: Support for Java, Scala, Python, and R applications

Architecture Overview

The Kubernetes resource manager follows a layered architecture:

  1. Cluster Management Layer: Implements Spark's external cluster manager interface
  2. Application Submission Layer: Handles spark-submit integration and client operations
  3. Configuration Layer: Manages Kubernetes-specific configuration and constants
  4. Pod Management Layer: Handles executor pod states, snapshots, and lifecycle operations
  5. Feature System Layer: Provides extensible pod configuration through feature steps
  6. Utilities Layer: Common utilities for Kubernetes operations and client management

Core Entry Points

Cluster Management { .api }

Primary entry point for Kubernetes cluster operations:

class KubernetesClusterManager extends ExternalClusterManager {
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
  def createSchedulerBackend(
    sc: SparkContext, 
    masterURL: String, 
    scheduler: TaskScheduler
  ): SchedulerBackend
  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
}

Usage: Automatically registered with Spark when using k8s:// master URLs. Handles cluster manager lifecycle and component creation.
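
The routing decision is based purely on the master URL scheme. A minimal sketch that mirrors the canCreate contract (a standalone helper for illustration, not the internal implementation):

// Spark selects the Kubernetes cluster manager whenever the master URL uses the
// k8s:// scheme; this helper mirrors the canCreate contract described above.
def isKubernetesMaster(masterURL: String): Boolean = masterURL.startsWith("k8s://")

isKubernetesMaster("k8s://https://kubernetes.example.com:443")  // true
isKubernetesMaster("yarn")                                      // false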

Application Submission { .api }

Main entry point for application submission:

class KubernetesClientApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}

Usage: Invoked by spark-submit when using Kubernetes cluster mode. Manages the complete application submission workflow.
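
In cluster mode, spark-submit creates and starts this class itself. The sketch below shows an equivalent direct invocation; the --primary-java-resource and --main-class flags are assumptions about the internal argument format rather than a documented public API:

import org.apache.spark.SparkConf
import org.apache.spark.deploy.k8s.submit.KubernetesClientApplication

// Illustrative only: spark-submit normally performs this step internally.
val submitConf = new SparkConf()
  .set("spark.master", "k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "spark:latest")

new KubernetesClientApplication().start(
  Array(
    "--primary-java-resource", "local:///opt/spark/examples/jars/spark-examples.jar",
    "--main-class", "org.apache.spark.examples.SparkPi"),
  submitConf)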

Core Configuration Types

KubernetesConf Hierarchy { .api }

Base configuration class for Kubernetes operations:

abstract class KubernetesConf(
  val sparkConf: SparkConf,
  val appId: String,
  val resourceNamePrefix: String,
  val appName: String,
  val namespace: String,
  val labels: Map[String, String],
  val environment: Map[String, String],
  val annotations: Map[String, String],
  val secretEnvNamesToKeyRefs: Map[String, String],
  val secretNamesToMountPaths: Map[String, String],
  val volumes: Seq[KubernetesVolumeSpec],
  val imagePullPolicy: String,
  val nodeSelector: Map[String, String]
)

class KubernetesDriverConf extends KubernetesConf {
  def serviceAnnotations: Map[String, String]
}

class KubernetesExecutorConf extends KubernetesConf
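
Driver and executor configurations are typically produced by factory methods on the KubernetesConf companion object rather than constructed directly. A hedged sketch, assuming the Spark 3.0 factory signature:

import org.apache.spark.SparkConf
import org.apache.spark.deploy.k8s.KubernetesConf
import org.apache.spark.deploy.k8s.submit.JavaMainAppResource

// Sketch: building a driver-side configuration. The createDriverConf signature
// shown here is an assumption based on Spark 3.0 internals and may differ
// between releases.
val sparkConf = new SparkConf()
  .set("spark.kubernetes.namespace", "spark-apps")
  .set("spark.kubernetes.container.image", "my-spark:latest")

val driverConf = KubernetesConf.createDriverConf(
  sparkConf,
  "spark-app-001",
  JavaMainAppResource(Some("local:///opt/spark/app.jar")),
  "com.example.Main",
  Array("--input", "/data"))

// Derived values such as the namespace and resource labels are exposed on the conf
println(driverConf.namespace)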

Basic Usage Examples

Submitting a Spark Application

# Using spark-submit with Kubernetes cluster mode
spark-submit \
  --master k8s://https://kubernetes.example.com:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples.jar

Programmatic Configuration

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Kubernetes-specific options are ordinary Spark properties; setting them by key
// keeps the example compilable from user code.
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.kubernetes.namespace", "spark-apps")
  .set("spark.kubernetes.driver.limit.cores", "2")
  .set("spark.executor.instances", "4")

val sc = new SparkContext(conf)

Capability Areas

This library provides comprehensive Kubernetes integration through several key capability areas. Each area has its own detailed documentation:

Cluster Management

  • Core Components: KubernetesClusterManager, KubernetesClusterSchedulerBackend
  • Capabilities: Cluster lifecycle management, task scheduling, resource allocation
  • Integration: Seamless integration with Spark's cluster manager interface

Application Submission

  • Core Components: KubernetesClientApplication, Client, ClientArguments
  • Capabilities: Application submission workflow, argument parsing, status monitoring
  • Integration: Full spark-submit compatibility with Kubernetes-specific features

Configuration Management

  • Core Components: Config object, Constants, KubernetesConf hierarchy
  • Capabilities: Centralized configuration, validation, type-safe properties
  • Integration: Extends Spark's configuration system with Kubernetes-specific options

Pod Management

  • Core Components: ExecutorPodsSnapshot, ExecutorPodState hierarchy, lifecycle managers
  • Capabilities: Pod state tracking, lifecycle management, snapshot-based monitoring (see the state-matching sketch after this list)
  • Integration: Real-time monitoring and management of executor pods
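
The ExecutorPodState hierarchy is a small family of case classes wrapping the underlying fabric8 Pod. A sketch of matching on it, assuming the Spark 3.0 state names:

import org.apache.spark.scheduler.cluster.k8s._

// Sketch: classify an executor pod by its tracked state. The case-class names
// reflect the Spark 3.0 ExecutorPodState hierarchy.
def describe(state: ExecutorPodState): String = state match {
  case PodRunning(_)   => "running"
  case PodPending(_)   => "pending"
  case PodSucceeded(_) => "completed"
  case PodFailed(_)    => "failed"
  case PodDeleted(_)   => "deleted"
  case _               => "unknown"
}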

Feature Steps System

  • Core Components: KubernetesFeatureConfigStep implementations
  • Capabilities: Modular pod configuration, extensible architecture, reusable components
  • Integration: Pluggable system for customizing driver and executor pods

Utilities and Helpers

  • Core Components: KubernetesUtils, SparkKubernetesClientFactory, volume utilities
  • Capabilities: Common operations, client management, volume handling (a small usage sketch follows this list)
  • Integration: Supporting utilities used throughout the Kubernetes integration
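
As one example, KubernetesUtils includes a helper for turning a k8s:// master URL into the API-server address. A small sketch, assuming the Spark 3.0 helper name:

import org.apache.spark.deploy.k8s.KubernetesUtils

// Sketch: strip the k8s:// scheme so the remainder can be used as the
// Kubernetes API server address.
val apiServer = KubernetesUtils.parseMasterUrl("k8s://https://kubernetes.example.com:443")
// apiServer == "https://kubernetes.example.com:443"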

Integration Patterns

Cluster Manager Registration

The Kubernetes cluster manager automatically registers with Spark's cluster manager registry:

import org.apache.spark.sql.SparkSession

// Automatic registration for k8s:// URLs
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("k8s://https://my-cluster:443")
  .getOrCreate()

Feature-Based Pod Configuration

The feature step system allows modular configuration of pods:

// Feature steps are applied automatically based on configuration;
// `conf` here is the KubernetesDriverConf built for the application
import org.apache.spark.deploy.k8s.features._

val steps: Seq[KubernetesFeatureConfigStep] = Seq(
  new BasicDriverFeatureStep(conf),
  new DriverServiceFeatureStep(conf),
  new MountVolumesFeatureStep(conf)
)
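
Each step implements KubernetesFeatureConfigStep, whose central method transforms a SparkPod (the pod plus its Spark container). A hedged sketch of a custom step that adds a label; in 3.0 the built-in steps are internal, so this illustrates the shape of the interface rather than a supported extension point:

import io.fabric8.kubernetes.api.model.PodBuilder
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Illustrative step: attach a fixed label to every pod it configures.
class TeamLabelFeatureStep extends KubernetesFeatureConfigStep {
  override def configurePod(pod: SparkPod): SparkPod = {
    val labeled = new PodBuilder(pod.pod)
      .editOrNewMetadata()
        .addToLabels("team", "data-platform")
      .endMetadata()
      .build()
    SparkPod(labeled, pod.container)
  }
}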

Snapshot-Based Monitoring

Executor pods are monitored through a snapshot-based system:

// Snapshots are pushed to subscribers in batches as pod updates arrive from the
// Kubernetes API (via watch events or polling)
snapshotStore.addSubscriber(1000) { snapshots =>
  snapshots.lastOption.foreach { snapshot =>
    val runningExecutors = snapshot.executorPods.values.collect {
      case PodRunning(pod) => pod
    }
  }
}

Thread Safety and Concurrency

The library is designed for concurrent use in Spark's multi-threaded environment:

  • Immutable Data Structures: Core data types like ExecutorPodsSnapshot and SparkPod are immutable
  • Thread-Safe Operations: Client factories and utilities are thread-safe
  • Concurrent Monitoring: Snapshot sources handle concurrent pod state updates
  • Atomic Updates: Configuration and state changes use atomic operations

Error Handling and Fault Tolerance

Comprehensive error handling and fault tolerance mechanisms:

  • Pod Failure Recovery: Automatic restart of failed executor pods
  • Network Resilience: Robust handling of Kubernetes API connectivity issues
  • Configuration Validation: Extensive validation of Kubernetes-specific configuration
  • Graceful Degradation: Fallback mechanisms for non-critical features

Performance Considerations

  • Batch Operations: Pod operations are batched for efficiency
  • Watch vs Polling: Configurable snapshot sources for optimal performance (see the tuning sketch after this list)
  • Resource Limits: Proper CPU and memory limit configuration
  • Image Pull Optimization: Configurable image pull policies
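
The batching and snapshot-source behavior above is controlled through ordinary Spark properties. A short tuning sketch; the property names are taken from the Spark 3.0 Kubernetes configuration, and defaults may vary by release:

import org.apache.spark.SparkConf

// Sketch: tune pod allocation batching and the polling-based snapshot source.
val tuned = new SparkConf()
  .set("spark.kubernetes.allocation.batch.size", "10")        // executors requested per allocation round
  .set("spark.kubernetes.allocation.batch.delay", "1s")       // wait between allocation rounds
  .set("spark.kubernetes.executor.apiPollingInterval", "30s") // how often the polling source queries the API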

Getting Started

  1. Set up Kubernetes: Ensure you have a running Kubernetes cluster with appropriate RBAC permissions
  2. Configure Spark: Set Kubernetes-specific configuration properties
  3. Build Container Image: Create a container image with your Spark application
  4. Submit Application: Use spark-submit with a Kubernetes master URL

For detailed implementation guidance, see the specific capability documentation linked above.