Spec Registry

Help your agents use open-source better.

Find usage specs for your project’s dependencies


maven-apache-spark

Description
Lightning-fast unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R
Author
tessl
Last updated

How to use

npx @tessl/cli registry install tessl/maven-apache-spark@1.0.0

index.md docs/

# Apache Spark

Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.

## Package Information

**Maven Coordinates:**
```xml
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
```

**Scala Version:** 2.10.x
**Java Version:** Java 6+
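
If the project is built with sbt rather than Maven, the same artifact can be declared as in the sketch below (assuming a standard sbt build; `%%` appends the Scala binary suffix, resolving to `spark-core_2.10`):

```scala
// build.sbt (sketch): %% selects the artifact matching the project's Scala version
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
```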
## Core Imports

**Scala:**
```scala { .api }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // for implicit conversions
```

**Java:**
```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
```

**Python:**
```python { .api }
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark import StorageLevel, SparkFiles
```
## Basic Usage

### Creating a SparkContext

```scala { .api }
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]") // Use all available cores locally

val sc = new SparkContext(conf)

// Remember to stop the context when done
sc.stop()
```

### Simple RDD Operations

**Scala:**
```scala { .api }
// Create an RDD from a collection
val data = Array(1, 2, 3, 4, 5)
val distData: RDD[Int] = sc.parallelize(data)

// Transform and action
val doubled = distData.map(_ * 2)
val result = doubled.collect() // Returns Array(2, 4, 6, 8, 10)
```

**Python:**
```python { .api }
# Create an RDD from a collection
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# Transform and action
doubled = dist_data.map(lambda x: x * 2)
result = doubled.collect() # Returns [2, 4, 6, 8, 10]
```
## Architecture

Spark's core components work together to provide a unified analytics platform:

### SparkContext
The main entry point that coordinates distributed processing across a cluster. It creates RDDs, manages shared variables (broadcast variables and accumulators), and controls job execution.
### RDD (Resilient Distributed Dataset)
The fundamental data abstraction: an immutable, fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations (illustrated in the sketch after this list):
- **Transformations**: Create new RDDs (lazy evaluation)
- **Actions**: Return values or save data (trigger computation)
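
The difference is easiest to see in a short example (a sketch; the input path is hypothetical): transformations only extend the RDD's lineage, and no work happens until an action runs.

```scala
val lines = sc.textFile("hdfs://path/to/input.txt") // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))      // transformation: just records lineage
val numErrors = errors.count()                      // action: triggers the actual computation
```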
### Cluster Manager
Spark can run on various cluster managers, selected through the master URL (see the sketch after this list):
- Standalone cluster manager
- Apache Mesos
- Hadoop YARN
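
The master URL is what picks the cluster manager at runtime. The calls below are a sketch of the common Spark 1.x forms; they are alternatives (each overwrites the previous value), and the host names and ports are placeholders.

```scala
val conf = new SparkConf().setAppName("My Spark Application")

conf.setMaster("local[*]")                 // local mode: one worker thread per core
conf.setMaster("spark://master-host:7077") // standalone cluster manager
conf.setMaster("mesos://mesos-host:5050")  // Apache Mesos
conf.setMaster("yarn-client")              // Hadoop YARN, driver running in the client (Spark 1.x syntax)
```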
## Core Capabilities

### [RDD Operations](./core-rdd.md)

Essential transformations and actions for data processing:

```scala { .api }
// Transformations (lazy)
rdd.map(f)       // Apply function to each element
rdd.filter(f)    // Keep elements matching predicate
rdd.flatMap(f)   // Apply function and flatten results
rdd.distinct()   // Remove duplicates

// Actions (eager)
rdd.collect()    // Return all elements as array
rdd.count()      // Count number of elements
rdd.reduce(f)    // Reduce using associative function
rdd.take(n)      // Return first n elements
```
### [SparkContext API](./spark-context.md)

Comprehensive cluster management and RDD creation:

```scala { .api }
class SparkContext(conf: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T]): RDD[T]
  def textFile(path: String): RDD[String]
  def hadoopFile[K, V](path: String, ...): RDD[(K, V)]

  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def accumulator[T](initialValue: T): Accumulator[T]

  // Job Control
  def runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def stop(): Unit
}
```
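
A brief usage sketch of the two kinds of shared variables listed above; the lookup table and the small RDD are invented for illustration:

```scala
// Broadcast: ship a read-only value to every executor once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val keys = sc.parallelize(Seq("a", "b", "c"))
val resolved = keys.map(k => lookup.value.getOrElse(k, 0))

// Accumulator: tasks add to it, only the driver reads the final value
val misses = sc.accumulator(0)
keys.foreach(k => if (!lookup.value.contains(k)) misses += 1)
println(misses.value) // 1
```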
### [Key-Value Operations](./key-value-operations.md)

Powerful operations on RDDs of (key, value) pairs:

```scala { .api }
// Import for PairRDDFunctions
import org.apache.spark.SparkContext._

val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Key-value transformations
pairRDD.reduceByKey(_ + _)   // Combine values by key
pairRDD.groupByKey()         // Group values by key
pairRDD.mapValues(_ * 2)     // Transform values, preserve keys
pairRDD.join(otherPairRDD)   // Inner join on keys
```
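
The `join` line above assumes a second pair RDD; a minimal sketch showing how one could be defined and what shape the result takes:

```scala
val otherPairRDD: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("c", "z")))

// Values from both sides are paired up under their common key
val joined: RDD[(String, (Int, String))] = pairRDD.join(otherPairRDD)
// joined.collect() => Array(("a", (1, "x"))), since only keys present in both RDDs survive
```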
### [Data Sources](./data-sources.md)

Read and write data from various sources:

```scala { .api }
// Reading data
sc.textFile("hdfs://path/to/file")
sc.sequenceFile[K, V]("path")
sc.objectFile[T]("path")
sc.hadoopFile[K, V]("path", inputFormat, keyClass, valueClass)

// Writing data
rdd.saveAsTextFile("path")
rdd.saveAsObjectFile("path")
pairRDD.saveAsSequenceFile("path")
```
### [Caching & Persistence](./caching-persistence.md)

Optimize performance by caching frequently accessed RDDs:

```scala { .api }
import org.apache.spark.storage.StorageLevel

rdd.cache()                                    // Cache in memory
rdd.persist(StorageLevel.MEMORY_ONLY)          // Explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // Memory + disk with serialization
rdd.unpersist()                                // Remove from cache
```
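
Caching pays off when the same RDD feeds more than one action, as in this sketch (the path is hypothetical); without `cache()`, the file would be read again for the second count.

```scala
val logs = sc.textFile("hdfs://path/to/access.log").cache()

val errorCount = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warnCount = logs.filter(_.contains("WARN")).count()   // second action reads from memory
```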
### [Streaming](./streaming.md)

Process live data streams in micro-batches:

```scala { .api }
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations such as reduceByKey

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("hostname", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.print()
ssc.start()
ssc.awaitTermination()
```
### [Machine Learning (MLlib)](./mllib.md)

Scalable machine learning algorithms:

```scala { .api }
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// trainingData (a Seq[LabeledPoint]) and testData (an RDD[LabeledPoint]) are supplied by the caller
val data: RDD[LabeledPoint] = sc.parallelize(trainingData)
val model = LogisticRegressionWithSGD.train(data, numIterations = 100)
val predictions = model.predict(testData.map(_.features))
```
### [Graph Processing (GraphX)](./graphx.md)

Large-scale graph analytics:

```scala { .api }
import org.apache.spark.graphx._

// Create graph from edge list
val edges: RDD[Edge[Double]] = sc.parallelize(edgeArray)
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
```
### [SQL](./sql.md)

Structured data processing with SQL:

```scala { .api }
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("path/to/people.json")

df.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
```
### [Python API (PySpark)](./python-api.md)

Python interface for Spark with Pythonic APIs:

```python { .api }
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Python App").setMaster("local[*]")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
squared = rdd.map(lambda x: x * x)
result = squared.collect()
```
### [Java API](./java-api.md)

Java-friendly wrappers with proper type safety:

```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> lines = jsc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(String::length); // Java 8 method reference; on Java 6/7 use an anonymous Function<String, Integer>
```
## Performance Considerations

- **Caching**: Use `cache()` or `persist()` for RDDs accessed multiple times
- **Partitioning**: Control data partitioning for better performance in key-based operations
- **Serialization**: Use the Kryo serializer for better performance (see the sketch after this list)
- **Memory Management**: Configure executor memory and storage levels appropriately
- **Shuffle Operations**: Minimize expensive shuffle operations like `groupByKey()`
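
A short configuration and partitioning sketch tying these points together; the settings, path, and partition count are illustrative rather than tuned recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._ // PairRDDFunctions (partitionBy, reduceByKey)

// Serialization: switch to Kryo
val conf = new SparkConf()
  .setAppName("Tuned Application")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)

// Partitioning: pre-partition a pair RDD so later key-based steps reuse the layout
val pairs = sc.textFile("hdfs://path/to/pairs")
  .map(line => (line.split("\t")(0), 1))
  .partitionBy(new HashPartitioner(8))
  .cache() // Caching: this RDD is reused by later actions

// Shuffle: prefer reduceByKey (map-side combine) over groupByKey for aggregation
val counts = pairs.reduceByKey(_ + _)
```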
## Common Patterns

**Word Count:**
```scala { .api }
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```

**Log Analysis:**
```scala { .api }
val logFile = sc.textFile("access.log")
val errors = logFile.filter(_.contains("ERROR"))
val errorsByHost = errors
  .map(line => (extractHost(line), 1)) // extractHost: a user-defined helper that parses the host from a log line
  .reduceByKey(_ + _)
```