Files

  • docs/: application-management.md, configuration.md, index.md, resource-management.md, scheduler-integration.md
  • tile.json

tessl/maven-org-apache-spark--spark-yarn-2-12

Apache Spark YARN resource manager integration module that enables Spark applications to run on YARN clusters

Workspace: tessl
Visibility: Public
Describes: pkg:maven/org.apache.spark/spark-yarn_2.12@3.5.x (maven)

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-yarn-2-12@3.5.0


Apache Spark YARN Resource Manager

Apache Spark YARN Resource Manager provides integration between Apache Spark and YARN (Yet Another Resource Negotiator) for running Spark applications on Hadoop clusters. This module enables Spark to leverage YARN's resource management and scheduling capabilities, supporting both client and cluster deployment modes with comprehensive resource allocation, security, and monitoring features.

Package Information

  • Package Name: org.apache.spark:spark-yarn_2.12
  • Package Type: maven
  • Language: Scala
  • Installation: Add the dependency to pom.xml or include in the Spark distribution (see the sbt sketch below)
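
For reference, a minimal sbt dependency declaration; Maven users would add the equivalent <dependency> block to pom.xml. The version shown is illustrative rather than pinned by this tile.

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "3.5.0"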

Core Imports

import org.apache.spark.deploy.yarn.{Client, ApplicationMaster}
import org.apache.spark.scheduler.cluster.{YarnClusterManager, YarnSchedulerBackend}
import org.apache.spark.SparkConf

Basic Usage

import org.apache.spark.{SparkConf, SparkContext}

// Configure Spark for YARN
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("yarn")
  .set("spark.yarn.queue", "default")
  .set("spark.yarn.am.memory", "1g")
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")

// Create SparkContext - YARN integration is handled automatically
val sc = new SparkContext(conf)

// Your Spark application code here
val rdd = sc.parallelize(1 to 100)
val result = rdd.map(_ * 2).collect()

sc.stop()

Architecture

The Apache Spark YARN integration consists of several key components:

  • Application Management: Client for submitting applications, ApplicationMaster for managing application lifecycle
  • Scheduler Integration: YarnClusterManager for cluster management, scheduler backends for resource requests
  • Resource Management: YarnAllocator for container allocation, placement strategies for optimal resource utilization
  • Executor Integration: YARN-specific executor backend with container management
  • Configuration System: Comprehensive YARN-specific configuration options
  • Security Integration: Delegation token management and Kerberos authentication support

Capabilities

Application Management

Core components for submitting and managing Spark applications on YARN clusters. Handles application submission, monitoring, and lifecycle management.

class Client(
  args: ClientArguments,
  sparkConf: SparkConf,
  rpcEnv: RpcEnv
)

class ApplicationMaster(
  args: ApplicationMasterArguments,
  sparkConf: SparkConf,
  yarnConf: YarnConfiguration
)
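
Client and ApplicationMaster are normally driven by Spark's submission machinery rather than instantiated directly (constructing Client requires an internal RpcEnv). A minimal sketch of programmatic submission that exercises this path uses SparkLauncher from the separate spark-launcher module; the jar path and main class below are placeholders.

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Submit an application to YARN in cluster mode; internally this drives
// Client (submission) and ApplicationMaster (lifecycle management).
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/my-spark-app.jar")   // placeholder application jar
  .setMainClass("com.example.MySparkApp")        // placeholder main class
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setConf("spark.yarn.queue", "default")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
  .startApplication()

// Block until the application reaches a terminal state.
while (!handle.getState.isFinal) Thread.sleep(1000)
println(s"appId=${handle.getAppId}, finalState=${handle.getState}")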

See docs/application-management.md for details.

Scheduler Integration

Integration components that connect Spark's task scheduling system with YARN's resource management. Provides cluster manager and scheduler backends for both client and cluster modes.

class YarnClusterManager extends ExternalClusterManager

abstract class YarnSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends CoarseGrainedSchedulerBackend

class YarnClientSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends YarnSchedulerBackend

class YarnClusterSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends YarnSchedulerBackend
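
The backend in use is selected by the deploy mode, not by application code. A minimal sketch, assuming a reachable YARN cluster (HADOOP_CONF_DIR set): in client mode the driver process hosts YarnClientSchedulerBackend, while cluster mode (launched via spark-submit) uses YarnClusterSchedulerBackend.

import org.apache.spark.{SparkConf, SparkContext}

// Client mode: the driver runs locally and YarnClientSchedulerBackend
// negotiates executors with the ResourceManager on its behalf.
val conf = new SparkConf()
  .setAppName("SchedulerBackendDemo")
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")

val sc = new SparkContext(conf)
println(sc.master)      // "yarn"
println(sc.deployMode)  // "client"; cluster mode is selected via spark-submit --deploy-mode cluster
sc.stop()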

See docs/scheduler-integration.md for details.

Resource Management

Components responsible for allocating and managing YARN containers for Spark executors. Includes allocation strategies, placement policies, and resource request management.

class YarnAllocator

class YarnRMClient

object ResourceRequestHelper

class LocalityPreferredContainerPlacementStrategy
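
YarnAllocator and the placement strategy are internal; applications influence them through configuration. A sketch of the settings that drive container requests (values are illustrative):

import org.apache.spark.SparkConf

// Container requests issued by YarnAllocator are derived from these settings.
val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("AllocatorConfigDemo")
  .set("spark.executor.instances", "4")          // static: request 4 executor containers
  .set("spark.executor.memory", "2g")            // heap per container
  .set("spark.executor.memoryOverhead", "512m")  // off-heap overhead added to the container size
  .set("spark.executor.cores", "2")              // vcores per container
  // Or let the allocator scale the container count with load
  // (requires an external shuffle service or shuffle tracking):
  // .set("spark.dynamicAllocation.enabled", "true")
  // .set("spark.dynamicAllocation.minExecutors", "1")
  // .set("spark.dynamicAllocation.maxExecutors", "10")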

See docs/resource-management.md for details.

Configuration System

Comprehensive configuration system for YARN-specific settings including resource allocation, security, and deployment options.

package object config {
  val APPLICATION_TAGS: ConfigEntry[Set[String]]
  val QUEUE_NAME: ConfigEntry[String]
  val AM_MEMORY: ConfigEntry[Long]
  val AM_CORES: ConfigEntry[Int]
  val EXECUTOR_NODE_LABEL_EXPRESSION: OptionalConfigEntry[String]
  // ... and many more configuration options
}

class ClientArguments(args: Array[String])
class ApplicationMasterArguments(args: Array[String])
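
The ConfigEntry constants above back user-facing spark.yarn.* keys. A short sketch of the corresponding string keys; values are illustrative, and the full list is in docs/configuration.md.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.queue", "analytics")                   // QUEUE_NAME
  .set("spark.yarn.am.memory", "1g")                      // AM_MEMORY (client-mode ApplicationMaster)
  .set("spark.yarn.am.cores", "2")                        // AM_CORES
  .set("spark.yarn.tags", "nightly,etl")                  // APPLICATION_TAGS
  .set("spark.yarn.executor.nodeLabelExpression", "gpu")  // EXECUTOR_NODE_LABEL_EXPRESSION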

See docs/configuration.md for details.

Types

Core Application Types

case class YarnAppReport(
  appState: YarnApplicationState,
  finalState: FinalApplicationStatus,
  diagnostics: Option[String]
)

class YarnClusterApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}
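
A hypothetical helper that pattern-matches on YarnAppReport using the fields declared above; the state and status enums come from org.apache.hadoop.yarn.api.records. These report types are internal to Spark, so the sketch is illustrative.

import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, YarnApplicationState}
import org.apache.spark.deploy.yarn.YarnAppReport

// Summarize an application report produced by the client-side monitoring loop.
def describe(report: YarnAppReport): String = report match {
  case YarnAppReport(YarnApplicationState.FINISHED, FinalApplicationStatus.SUCCEEDED, _) =>
    "application finished successfully"
  case YarnAppReport(state, finalStatus, diagnostics) =>
    s"state=$state, finalStatus=$finalStatus, diagnostics=${diagnostics.getOrElse("none")}"
}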

Scheduler Types

class YarnScheduler(sc: SparkContext) extends TaskSchedulerImpl
class YarnClusterScheduler(sc: SparkContext) extends YarnScheduler

Executor Types

class YarnCoarseGrainedExecutorBackend extends CoarseGrainedExecutorBackend {
  def getUserClassPath: Seq[URL]
  def extractLogUrls: Map[String, String]
  def extractAttributes: Map[String, String]
}

class ExecutorRunnable {
  def run(): Unit
  def launchContextDebugInfo(): String
}

Entry Points

Primary Integration Points

  • yarn-client mode: the driver runs on the local machine while executors run in YARN containers
  • yarn-cluster mode: both the driver and the executors run on the YARN cluster
  • Programmatic submission: Use Client class for custom application submission
  • SparkSubmit integration: Transparent integration when using --master yarn

Main Classes

  • ApplicationMaster.main() - Entry point for cluster mode ApplicationMaster
  • YarnCoarseGrainedExecutorBackend.main() - Entry point for executor processes
  • YarnClusterApplication.start() - Entry point for programmatic cluster mode submission
  • ExecutorLauncher.main() - Entry point for client mode executor launcher