**maven-apache-spark**

- Describes: maven/org.apache.spark/spark-parent
- Description: Lightning-fast unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R
- Author: tessl
- Spec files: index.md, docs/
# Apache Spark

Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.

## Package Information

**Maven Coordinates:**

```xml
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
```

**Scala Version:** 2.10.x
**Java Version:** Java 6+

## Core Imports

**Scala:**

```scala { .api }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // for implicit conversions
```

**Java:**

```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
```

**Python:**

```python { .api }
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark import StorageLevel, SparkFiles
```

## Basic Usage

### Creating a SparkContext

```scala { .api }
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]") // Use all available cores locally

val sc = new SparkContext(conf)

// Remember to stop the context when done
sc.stop()
```

### Simple RDD Operations

**Scala:**

```scala { .api }
// Create an RDD from a collection
val data = Array(1, 2, 3, 4, 5)
val distData: RDD[Int] = sc.parallelize(data)

// Transform and action
val doubled = distData.map(_ * 2)
val result = doubled.collect() // Returns Array(2, 4, 6, 8, 10)
```

**Python:**

```python { .api }
# Create an RDD from a collection
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# Transform and action
doubled = dist_data.map(lambda x: x * 2)
result = doubled.collect()  # Returns [2, 4, 6, 8, 10]
```

## Architecture

Spark's core components work together to provide a unified analytics platform:

### SparkContext

The main entry point that coordinates distributed processing across a cluster. It creates RDDs, manages shared variables (broadcast variables and accumulators), and controls job execution.

### RDD (Resilient Distributed Dataset)

The fundamental data abstraction: an immutable, fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations:

- **Transformations**: Create new RDDs (lazy evaluation)
- **Actions**: Return values or save data (trigger computation)

### Cluster Manager

Spark can run on various cluster managers including:

- Standalone cluster manager
- Apache Mesos
- Hadoop YARN
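To make the cluster-manager choice concrete, here is a minimal sketch of how the deployment mode is selected through the master URL passed to `SparkConf.setMaster` (or `spark-submit --master`). The host names and ports are placeholders, not values taken from this spec:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Master URL formats for the deployment modes listed above (placeholders only):
//   "local[*]"                     - run locally, one worker thread per core
//   "spark://master-host:7077"     - standalone cluster manager
//   "mesos://mesos-host:5050"      - Apache Mesos
//   "yarn-client" / "yarn-cluster" - Hadoop YARN (Spark 1.x syntax)
val conf = new SparkConf()
  .setAppName("Cluster Manager Example")
  .setMaster("spark://master-host:7077") // e.g. submit to a standalone cluster

val sc = new SparkContext(conf)
```

In practice the master URL is usually supplied at submission time rather than hard-coded in the application.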
## Core Capabilities

### [RDD Operations](./core-rdd.md)

Essential transformations and actions for data processing:

```scala { .api }
// Transformations (lazy)
rdd.map(f) // Apply function to each element
rdd.filter(f) // Keep elements matching predicate
rdd.flatMap(f) // Apply function and flatten results
rdd.distinct() // Remove duplicates

// Actions (eager)
rdd.collect() // Return all elements as array
rdd.count() // Count number of elements
rdd.reduce(f) // Reduce using associative function
rdd.take(n) // Return first n elements
```

### [SparkContext API](./spark-context.md)

Comprehensive cluster management and RDD creation:

```scala { .api }
class SparkContext(conf: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T]): RDD[T]
  def textFile(path: String): RDD[String]
  def hadoopFile[K, V](path: String, ...): RDD[(K, V)]

  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def accumulator[T](initialValue: T): Accumulator[T]

  // Job Control
  def runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def stop(): Unit
}
```

### [Key-Value Operations](./key-value-operations.md)

Powerful operations on RDDs of (key, value) pairs:

```scala { .api }
// Import for PairRDDFunctions
import org.apache.spark.SparkContext._

val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Key-value transformations
pairRDD.reduceByKey(_ + _) // Combine values by key
pairRDD.groupByKey() // Group values by key
pairRDD.mapValues(_ * 2) // Transform values, preserve keys
pairRDD.join(otherPairRDD) // Inner join on keys
```

### [Data Sources](./data-sources.md)

Read and write data from various sources:

```scala { .api }
// Reading data
sc.textFile("hdfs://path/to/file")
sc.sequenceFile[K, V]("path")
sc.objectFile[T]("path")
sc.hadoopFile[K, V]("path", inputFormat, keyClass, valueClass)

// Writing data
rdd.saveAsTextFile("path")
rdd.saveAsObjectFile("path")
pairRDD.saveAsSequenceFile("path")
```

### [Caching & Persistence](./caching-persistence.md)

Optimize performance by caching frequently accessed RDDs:

```scala { .api }
import org.apache.spark.storage.StorageLevel

rdd.cache() // Cache in memory
rdd.persist(StorageLevel.MEMORY_ONLY) // Explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // Memory + disk with serialization
rdd.unpersist() // Remove from cache
```

### [Streaming](./streaming.md)

Process live data streams in micro-batches:

```scala { .api }
import org.apache.spark.streaming.{StreamingContext, Seconds}

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("hostname", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.print()
ssc.start()
ssc.awaitTermination()
```

### [Machine Learning (MLlib)](./mllib.md)

Scalable machine learning algorithms:

```scala { .api }
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data: RDD[LabeledPoint] = sc.parallelize(trainingData)
val model = LogisticRegressionWithSGD.train(data, numIterations = 100)
val predictions = model.predict(testData.map(_.features))
```
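Continuing the snippet above, a minimal sketch of how the trained model's error might be checked; the evaluation step is an assumption added here for illustration, with `data` and `model` referring to the values defined in the preceding block:

```scala
// Pair each point's label with the model's prediction for the same point,
// then compute the fraction of misclassified points (training error)
val labelsAndPredictions = data.map(point => (point.label, model.predict(point.features)))
val trainingError = labelsAndPredictions.filter { case (label, prediction) =>
  label != prediction
}.count().toDouble / data.count()
```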
### [Graph Processing (GraphX)](./graphx.md)

Large-scale graph analytics:

```scala { .api }
import org.apache.spark.graphx._

// Create graph from edge list
val edges: RDD[Edge[Double]] = sc.parallelize(edgeArray)
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
```

### [SQL](./sql.md)

Structured data processing with SQL:

```scala { .api }
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("path/to/people.json")

df.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
```

### [Python API (PySpark)](./python-api.md)

Python interface for Spark with Pythonic APIs:

```python { .api }
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Python App").setMaster("local[*]")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
squared = rdd.map(lambda x: x * x)
result = squared.collect()
```

### [Java API](./java-api.md)

Java-friendly wrappers with proper type safety:

```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> lines = jsc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(String::length);
```

## Performance Considerations

- **Caching**: Use `cache()` or `persist()` for RDDs accessed multiple times
- **Partitioning**: Control data partitioning for better performance in key-based operations
- **Serialization**: Use Kryo serializer for better performance
- **Memory Management**: Configure executor memory and storage levels appropriately
- **Shuffle Operations**: Minimize expensive shuffle operations like `groupByKey()`

## Common Patterns

**Word Count:**

```scala { .api }
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```

**Log Analysis:**

```scala { .api }
val logFile = sc.textFile("access.log")
val errors = logFile.filter(_.contains("ERROR"))
val errorsByHost = errors
  .map(line => (extractHost(line), 1))
  .reduceByKey(_ + _)
```
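The log-analysis pattern calls `extractHost`, which is not defined in this spec. A minimal sketch of one possible (hypothetical) helper, assuming the client host is the first whitespace-separated field of each log line, as in Common Log Format:

```scala
// Hypothetical helper, not part of the Spark API: pulls the host out of a log
// line by taking the first whitespace-separated field (Common Log Format style)
def extractHost(line: String): String =
  line.split(" ").headOption.getOrElse("unknown")
```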