Apache Spark YARN resource manager integration module that enables Spark applications to run on YARN clusters
npx @tessl/cli install tessl/maven-org-apache-spark--spark-yarn-2-12@3.5.00
# Apache Spark YARN Resource Manager

The Apache Spark YARN module integrates Apache Spark with YARN (Yet Another Resource Negotiator) so that Spark applications can run on Hadoop clusters. It lets Spark use YARN's resource management and scheduling, and it supports both client and cluster deploy modes along with resource allocation, security, and monitoring features.

## Package Information

- **Package Name**: org.apache.spark:spark-yarn_2.12
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Add the dependency to `pom.xml`, or use a Spark distribution that already includes it

## Core Imports

```scala
import org.apache.spark.deploy.yarn.{Client, ApplicationMaster}
import org.apache.spark.scheduler.cluster.{YarnClusterManager, YarnSchedulerBackend}
import org.apache.spark.SparkConf
```

## Basic Usage

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure Spark for YARN
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("yarn")
  .set("spark.yarn.queue", "default")
  .set("spark.yarn.am.memory", "1g")
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")

// Create SparkContext - YARN integration is handled automatically
val sc = new SparkContext(conf)

// Your Spark application code here
val rdd = sc.parallelize(1 to 100)
val result = rdd.map(_ * 2).collect()

sc.stop()
```

## Architecture

The Apache Spark YARN integration consists of several key components:

- **Application Management**: `Client` for submitting applications, `ApplicationMaster` for managing the application lifecycle
- **Scheduler Integration**: `YarnClusterManager` for cluster management, scheduler backends for resource requests
- **Resource Management**: `YarnAllocator` for container allocation, placement strategies for choosing where containers run
- **Executor Integration**: YARN-specific executor backend with container management
- **Configuration System**: YARN-specific configuration options
- **Security Integration**: Delegation token management and Kerberos authentication support

## Capabilities

### Application Management

Core components for submitting Spark applications to YARN clusters and for monitoring and managing their lifecycle.

```scala { .api }
class Client(
  args: ClientArguments,
  sparkConf: SparkConf,
  rpcEnv: RpcEnv
)

class ApplicationMaster(
  args: ApplicationMasterArguments,
  sparkConf: SparkConf,
  yarnConf: YarnConfiguration
)
```

[Application Management](./application-management.md)

### Scheduler Integration

Components that connect Spark's task scheduling system with YARN's resource management. The module provides a cluster manager plus one scheduler backend for client mode and one for cluster mode.

```scala { .api }
class YarnClusterManager extends ExternalClusterManager

abstract class YarnSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends CoarseGrainedSchedulerBackend

class YarnClientSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends YarnSchedulerBackend

class YarnClusterSchedulerBackend(
  scheduler: TaskSchedulerImpl,
  sc: SparkContext
) extends YarnSchedulerBackend
```

[Scheduler Integration](./scheduler-integration.md)

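As a rough illustration of how the choice between the two backends follows the deploy mode, the sketch below models the dispatch in plain Scala. `BackendChoice`, `canCreate`, and `chooseBackend` are hypothetical names written for this example and are not the module's code; the sketch assumes that only the `yarn` master URL is handled and that the deploy mode is either `client` or `cluster`.

```scala
// Hypothetical, simplified model of picking a scheduler backend by deploy mode.
// It has no Spark dependency and only reuses the class names listed above.
object BackendChoice {
  // Assumption: the YARN cluster manager only handles the "yarn" master URL.
  def canCreate(masterUrl: String): Boolean = masterUrl == "yarn"

  // Returns the name of the backend class that would serve the deploy mode,
  // or an error message when the combination is not supported.
  def chooseBackend(masterUrl: String, deployMode: String): Either[String, String] =
    if (!canCreate(masterUrl)) Left(s"Unsupported master URL: $masterUrl")
    else deployMode match {
      case "client"  => Right("YarnClientSchedulerBackend")
      case "cluster" => Right("YarnClusterSchedulerBackend")
      case other     => Left(s"Unknown deploy mode '$other' for YARN")
    }
}
```

For example, `BackendChoice.chooseBackend("yarn", "cluster")` yields `Right("YarnClusterSchedulerBackend")`, while any master URL other than `yarn` yields a `Left` with an error message.
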
### Resource Management

Components that allocate and manage YARN containers for Spark executors, including allocation strategies, placement policies, and resource request management.

```scala { .api }
class YarnAllocator

class YarnRMClient

object ResourceRequestHelper

class LocalityPreferredContainerPlacementStrategy
```

[Resource Management](./resource-management.md)

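To make container sizing concrete, here is a minimal sketch of how the memory requested for one executor container could be estimated. It assumes Spark's documented default memory overhead (10% of executor memory, with a 384 MiB floor) and ignores off-heap and PySpark memory. `ContainerSizing` and its members are hypothetical names for this example and are not part of `YarnAllocator`.

```scala
// Hypothetical helper: estimates the memory requested for one executor container.
// Assumes the default overhead factor (0.10) and minimum overhead (384 MiB).
object ContainerSizing {
  val OverheadFactor: Double = 0.10
  val MinOverheadMiB: Long = 384L

  // Overhead is the larger of (factor * executor memory) and the minimum.
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max((executorMemoryMiB * OverheadFactor).toLong, MinOverheadMiB)

  // Total container request = executor heap + overhead.
  def containerMemoryMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)
}
```

With `spark.executor.memory` set to `2g`, `ContainerSizing.containerMemoryMiB(2048)` gives 2432 MiB, because the 384 MiB floor exceeds 10% of 2048. YARN may then round the request up according to its own scheduler allocation settings.
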
### Configuration System

YARN-specific configuration entries covering resource allocation, security, and deployment options.

```scala { .api }
package object config {
  val APPLICATION_TAGS: OptionalConfigEntry[Seq[String]]
  val QUEUE_NAME: ConfigEntry[String]
  val AM_MEMORY: ConfigEntry[Long]
  val AM_CORES: ConfigEntry[Int]
  val EXECUTOR_NODE_LABEL_EXPRESSION: OptionalConfigEntry[String]
  // ... and many more configuration options
}

class ClientArguments(args: Array[String])
class ApplicationMasterArguments(args: Array[String])
```

[Configuration System](./configuration.md)

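From application code these entries are normally set through their string keys on `SparkConf`. The sketch below shows the keys that correspond to the entries listed above; the object name `YarnConfExample` and the queue, tag, and node-label values are made up for illustration.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object; the values are placeholders, not recommendations.
object YarnConfExample {
  // loadDefaults = false keeps the example independent of JVM system properties.
  def build(): SparkConf =
    new SparkConf(false)
      .setMaster("yarn")
      .setAppName("ConfiguredApp")
      .set("spark.yarn.queue", "analytics")                            // QUEUE_NAME
      .set("spark.yarn.tags", "etl,nightly")                           // APPLICATION_TAGS
      .set("spark.yarn.am.memory", "1g")                               // AM_MEMORY
      .set("spark.yarn.am.cores", "2")                                 // AM_CORES
      .set("spark.yarn.executor.nodeLabelExpression", "highmem")       // EXECUTOR_NODE_LABEL_EXPRESSION
}
```

Note that `spark.yarn.am.memory` and `spark.yarn.am.cores` apply to the ApplicationMaster in client mode; in cluster mode the driver runs inside the ApplicationMaster and is sized with `spark.driver.memory` and `spark.driver.cores` instead.
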
## Types

### Core Application Types

```scala { .api }
case class YarnAppReport(
  appState: YarnApplicationState,
  finalState: FinalApplicationStatus,
  diagnostics: Option[String]
)

class YarnClusterApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}
```

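The three fields of `YarnAppReport` correspond to values that YARN itself reports for an application. As a hedged sketch, the same information can be read with Hadoop's public `YarnClient` API; `AppStatusCheck`, `Status`, `isTerminal`, and `fetch` are hypothetical names for this example, and `fetch` assumes a reachable ResourceManager configured through `yarn-site.xml` on the classpath.

```scala
import org.apache.hadoop.yarn.api.records.{ApplicationId, FinalApplicationStatus, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object AppStatusCheck {
  // Hypothetical local mirror of the three fields listed in YarnAppReport.
  final case class Status(
    appState: YarnApplicationState,
    finalState: FinalApplicationStatus,
    diagnostics: Option[String]
  )

  // An application is done once YARN reports one of these states.
  def isTerminal(state: YarnApplicationState): Boolean =
    state == YarnApplicationState.FINISHED ||
      state == YarnApplicationState.FAILED ||
      state == YarnApplicationState.KILLED

  // Fetches the current report for an application ID string
  // of the form "application_<clusterTimestamp>_<sequence>".
  def fetch(appIdString: String): Status = {
    val client = YarnClient.createYarnClient()
    client.init(new YarnConfiguration())
    client.start()
    try {
      val report = client.getApplicationReport(ApplicationId.fromString(appIdString))
      Status(
        report.getYarnApplicationState,
        report.getFinalApplicationStatus,
        Option(report.getDiagnostics).filter(_.nonEmpty)
      )
    } finally client.stop()
  }
}
```

A caller could poll `fetch` until `isTerminal` returns true for the reported state and then inspect `finalState` and `diagnostics`.
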
### Scheduler Types

```scala { .api }
class YarnScheduler(sc: SparkContext) extends TaskSchedulerImpl
class YarnClusterScheduler(sc: SparkContext) extends YarnScheduler
```

### Executor Types

```scala { .api }
class YarnCoarseGrainedExecutorBackend extends CoarseGrainedExecutorBackend {
  def getUserClassPath: Seq[URL]
  def extractLogUrls: Map[String, String]
  def extractAttributes: Map[String, String]
}

class ExecutorRunnable {
  def run(): Unit
  def launchContextDebugInfo(): String
}
```

## Entry Points

### Primary Integration Points

- **Client mode** (`--master yarn --deploy-mode client`): The driver runs on the submitting machine; executors run in YARN containers
- **Cluster mode** (`--master yarn --deploy-mode cluster`): The driver runs inside the YARN ApplicationMaster; executors run in YARN containers
- **Programmatic submission**: Use the `Client` class for custom application submission
- **SparkSubmit integration**: Transparent integration when using `--master yarn`

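For programmatic submission from user code, one publicly documented route is `SparkLauncher` from the separate `spark-launcher` module, which drives `spark-submit` under the hood. The sketch below uses it as an alternative to calling `Client` directly; `SubmitToYarn` is a hypothetical name, the jar path and main class are supplied by the caller, and a local Spark installation (for example via `SPARK_HOME`) is assumed.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Hypothetical wrapper around SparkLauncher for YARN cluster-mode submission.
object SubmitToYarn {
  // Builds a launcher without starting anything; jarPath and mainClass are caller-supplied.
  def launcher(jarPath: String, mainClass: String): SparkLauncher =
    new SparkLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource(jarPath)
      .setMainClass(mainClass)
      .setConf("spark.yarn.queue", "default")
      .setConf("spark.executor.memory", "2g")

  // Starts the application as a child spark-submit process and
  // returns a handle that can be polled for its state.
  def submit(jarPath: String, mainClass: String): SparkAppHandle =
    launcher(jarPath, mainClass).startApplication()
}
```

The returned `SparkAppHandle` exposes `getState` and `getAppId`, so a caller can wait for a final state or look up the YARN application ID once it is known.
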
### Main Classes

- `ApplicationMaster.main()` - Entry point for the cluster-mode ApplicationMaster
- `YarnCoarseGrainedExecutorBackend.main()` - Entry point for executor processes
- `YarnClusterApplication.start()` - Entry point for programmatic cluster-mode submission
- `ExecutorLauncher.main()` - Entry point for the client-mode ApplicationMaster (executor launcher)