tessl/maven-org-apache-spark--yarn-parent-2-10

YARN integration support for Apache Spark cluster computing, enabling Spark applications to run on Hadoop YARN clusters

docs/scheduler-backends.md

Scheduler Backends

Scheduler backend implementations for integrating Spark's TaskScheduler with YARN resource management. These components provide the bridge between Spark's task scheduling system and YARN's resource allocation, supporting both client and cluster deployment modes.

Capabilities

YarnClientSchedulerBackend

Scheduler backend for YARN client mode, where the Spark driver runs on the client machine and the ApplicationMaster manages only executors.

/**
 * Scheduler backend for YARN client mode
 * Manages executor lifecycle and resource requests through YARN ApplicationMaster
 */
private[spark] class YarnClientSchedulerBackend(
  scheduler: TaskSchedulerImpl, 
  sc: SparkContext
) extends YarnSchedulerBackend {
  
  /**
   * Start the backend and submit application to YARN
   * Initiates ApplicationMaster and begins executor allocation
   */
  def start(): Unit
  
  // Application monitoring and lifecycle management
  // Resource request handling through ApplicationMaster
  // Executor status tracking and failure handling
}

Usage Examples:

import org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend
import org.apache.spark.scheduler.TaskSchedulerImpl
import org.apache.spark.SparkContext

// Typically created by Spark runtime in client mode
val sc = new SparkContext(sparkConf)
val taskScheduler = new TaskSchedulerImpl(sc)
val backend = new YarnClientSchedulerBackend(taskScheduler, sc)

// Backend lifecycle managed by Spark runtime
backend.start()

YarnClusterSchedulerBackend

Scheduler backend for YARN cluster mode, where both the driver and executors run on the YARN cluster.

/**
 * Scheduler backend for YARN cluster mode  
 * Manages executors when driver runs within ApplicationMaster
 */
private[spark] class YarnClusterSchedulerBackend extends YarnSchedulerBackend {
  // Cluster-specific resource management
  // Direct integration with ApplicationMaster
  // Optimized executor allocation for cluster deployment
}

Usage Examples:

import org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend

// Created by Spark runtime in cluster mode
// Driver runs inside ApplicationMaster on YARN cluster
val backend = new YarnClusterSchedulerBackend()

YarnSchedulerBackend (Base Class)

Abstract base class providing common YARN scheduler backend functionality.

/**
 * Base class for YARN scheduler backend implementations
 * Provides common functionality for both client and cluster modes
 */
private[spark] abstract class YarnSchedulerBackend 
    extends CoarseGrainedSchedulerBackend {
  
  // Common YARN resource management operations
  // Executor container lifecycle management  
  // Resource request and allocation handling
  // Integration with YARN ApplicationMaster
}

Task Schedulers

Task scheduler implementations optimized for YARN deployment modes.

/**
 * Task scheduler for YARN client mode
 * Optimized for client-side driver with remote executors
 */
private[spark] class YarnClientClusterScheduler(sc: SparkContext) 
    extends TaskSchedulerImpl {
  
  // Client-specific scheduling optimizations
  // Remote executor communication handling
  // Task distribution strategies for client mode
}

/**
 * Task scheduler for YARN cluster mode
 * Optimized for driver and executors running on same cluster  
 */
private[spark] class YarnClusterScheduler(sc: SparkContext) 
    extends TaskSchedulerImpl {
  
  // Cluster-specific scheduling optimizations
  // Locality-aware task placement
  // Efficient intra-cluster communication
}

Usage Examples:

import org.apache.spark.scheduler.cluster.{YarnClientClusterScheduler, YarnClusterScheduler}
import org.apache.spark.SparkContext

// Client mode scheduler
val sc = new SparkContext(sparkConf)
val clientScheduler = new YarnClientClusterScheduler(sc)

// Cluster mode scheduler  
val clusterScheduler = new YarnClusterScheduler(sc)

Scheduler Backend Architecture

Client Mode Architecture

// Client Mode Flow:
// SparkContext (Client Machine)
//     ↓
// YarnClientSchedulerBackend  
//     ↓
// YarnClientClusterScheduler
//     ↓  
// ApplicationMaster (YARN Cluster) - ExecutorLauncher mode
//     ↓
// Executor Containers (YARN Nodes)

In client mode:

  1. Driver runs on client machine
  2. YarnClientSchedulerBackend submits ApplicationMaster to YARN
  3. ApplicationMaster runs in ExecutorLauncher mode (manages only executors)
  4. Tasks are scheduled from client to remote executors

Cluster Mode Architecture

// Cluster Mode Flow:
// ApplicationMaster (YARN Cluster) - Driver mode
//     ↓
// YarnClusterSchedulerBackend
//     ↓
// YarnClusterScheduler  
//     ↓
// Executor Containers (YARN Nodes)

In cluster mode:

  1. Driver runs inside ApplicationMaster on YARN cluster
  2. YarnClusterSchedulerBackend manages local executor allocation
  3. Optimized for locality and reduced network overhead
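The two architectures above are selected at submit time. With the Spark 1.x launcher the deployment mode is chosen via the master URL; the application class and jar names below are placeholders:

```
# Client mode: driver stays on the submitting machine (Spark 1.x syntax)
spark-submit --master yarn-client --class com.example.MyApp myapp.jar

# Cluster mode: driver runs inside the ApplicationMaster on the cluster
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar
```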

Resource Management Integration

Executor Allocation

// Scheduler backends integrate with YARN resource management
abstract class YarnSchedulerBackend {
  // Request executor containers from YARN
  protected def requestExecutors(numExecutors: Int): Unit
  
  // Handle executor container allocation from ResourceManager
  protected def onExecutorsAdded(executorIds: Seq[String]): Unit
  
  // Handle executor failures and cleanup
  protected def onExecutorRemoved(executorId: String, reason: String): Unit
}
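The bookkeeping implied by these callbacks can be sketched with a toy tracker. This is illustrative only, not Spark's actual implementation: the class and method names mirror the abstract API above, and the "replacement request" policy is an assumption about typical backend behavior.

```scala
import scala.collection.mutable

// Toy model of the executor bookkeeping a YARN scheduler backend performs.
// Names and policy are illustrative, not part of Spark's real API.
class ExecutorTracker {
  private val live = mutable.Set.empty[String]
  private var requested = 0

  // Ask YARN for `n` more containers (here we only record the request).
  def requestExecutors(n: Int): Unit = requested += n

  // The ResourceManager granted containers and the executors registered.
  def onExecutorsAdded(executorIds: Seq[String]): Unit = {
    live ++= executorIds
    requested = math.max(0, requested - executorIds.size)
  }

  // An executor was lost: drop it and request a replacement container.
  def onExecutorRemoved(executorId: String, reason: String): Unit = {
    if (live.remove(executorId)) requestExecutors(1)
  }

  def liveCount: Int = live.size
  def pendingRequests: Int = requested
}
```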

Dynamic Allocation

Integration with Spark's dynamic allocation feature:

// Configuration for dynamic executor allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1  
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.initialExecutors=2

// YarnSchedulerBackend handles dynamic scaling
// Requests/releases executors based on workload
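The scaling decision can be sketched as a clamped target calculation. Spark's real policy also factors in backlog timeouts and executor idle timeouts; this sketch only shows how the min/max bounds from the properties above constrain the demand-driven target:

```scala
// Illustrative dynamic-allocation target: enough executors to cover the
// pending tasks, clamped to the configured min/max. A sketch, not
// Spark's actual ExecutorAllocationManager logic.
def targetExecutors(pendingTasks: Int, tasksPerExecutor: Int,
                    minExecutors: Int, maxExecutors: Int): Int = {
  val needed = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
  math.min(maxExecutors, math.max(minExecutors, needed))
}
```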

Configuration Options

Client Mode Configuration

// Key configuration properties for client mode
spark.yarn.am.memory=512m              // ApplicationMaster memory
spark.yarn.am.cores=1                  // ApplicationMaster cores
spark.yarn.am.waitTime=100s            // Max wait for SparkContext
spark.yarn.client.executor.graceTime=5s // Executor shutdown grace period

Cluster Mode Configuration

// Key configuration properties for cluster mode  
spark.driver.memory=1g                 // Driver memory (runs in AM)
spark.driver.cores=1                   // Driver cores
spark.yarn.driver.memoryFraction=0.1   // Driver memory fraction
spark.yarn.am.extraJavaOptions         // JVM options for AM/driver

Common Configuration

// Configuration affecting both modes
spark.executor.memory=1g               // Executor memory
spark.executor.cores=1                 // Executor cores
spark.executor.instances=2             // Number of executors
spark.yarn.queue=default               // YARN queue
spark.yarn.priority=0                  // Application priority

Fault Tolerance

Executor Failure Handling

// Scheduler backends provide fault tolerance mechanisms
class YarnSchedulerBackend {
  // Detect executor failures through heartbeat monitoring
  // Automatically request replacement executors from YARN
  // Reschedule failed tasks on healthy executors
  // Blacklist problematic nodes after repeated failures
}
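The node-blacklisting behavior mentioned above can be sketched as a simple failure counter per node. The threshold-based rule is an assumption about the general technique; the real backend's criteria are more involved:

```scala
import scala.collection.mutable

// Toy node blacklist: after `maxFailures` executor failures on a node,
// stop placing new executors there. Illustrative only.
class NodeBlacklist(maxFailures: Int) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Record one executor failure attributed to this node.
  def recordFailure(node: String): Unit = failures(node) += 1

  // A node is blacklisted once it reaches the failure threshold.
  def isBlacklisted(node: String): Boolean = failures(node) >= maxFailures
}
```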

ApplicationMaster Failure Recovery

  • Client Mode: if the ApplicationMaster (ExecutorLauncher) fails, the driver on the client survives but loses its executors, and the application must be resubmitted
  • Cluster Mode: a driver failure fails the current application attempt; YARN may relaunch the ApplicationMaster (and with it the driver) up to the configured maximum application attempts
  • Both modes support checkpoint-based recovery for stateful applications

Performance Optimization

Locality Optimization

// Scheduler backends optimize for data locality
class YarnClusterScheduler {
  // Prefer scheduling tasks on nodes with cached data
  // Consider HDFS block locations for task placement
  // Balance between locality and resource availability
}
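The locality preference described above can be sketched as a ranking over the standard locality levels. The level names mirror Spark's TaskLocality values; the host-selection function itself is a simplified illustration, not the scheduler's real algorithm:

```scala
// Preference order: closer data wins. Lower rank = better locality.
val localityRank: Map[String, Int] =
  Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY").zipWithIndex.toMap

// Given candidate hosts and the best locality level each can offer a task,
// pick the host with the best (lowest-ranked) locality. Sketch only.
def bestHost(offers: Map[String, String]): Option[String] =
  if (offers.isEmpty) None
  else Some(offers.minBy { case (_, level) => localityRank(level) }._1)
```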

Resource Utilization

// Efficient resource utilization strategies
class YarnSchedulerBackend {
  // Container sharing between multiple executors (when supported)
  // Optimal container size calculation based on workload
  // Preemption handling for shared cluster environments
}
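Container sizing follows from the fact that a YARN container must hold the executor heap plus off-heap overhead. The factor and minimum below follow Spark 1.x defaults for spark.yarn.executor.memoryOverhead (max of 384 MB and 10% of executor memory), stated here as an assumption about this version:

```scala
// Sketch of YARN container sizing: executor heap plus memory overhead.
// Assumes Spark 1.x defaults: overhead = max(384, 0.10 * executorMemory).
def containerMemoryMB(executorMemoryMB: Int): Int = {
  val overhead = math.max(384, (0.10 * executorMemoryMB).toInt)
  executorMemoryMB + overhead
}
```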

Monitoring and Metrics

The scheduler backends integrate with Spark's metrics system:

  • Executor allocation/deallocation events
  • Task scheduling latency metrics
  • Resource utilization tracking
  • YARN application progress reporting
  • Integration with Spark History Server
