**maven-apache-spark**

- Describes: maven/org.apache.spark/spark-parent
- Description: Lightning-fast unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R
- Author: tessl
- Spec files: index.md, docs/
# Apache Spark

Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.

## Package Information

**Maven Coordinates:**

```xml
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
```

**Scala Version:** 2.10.x
**Java Version:** Java 6+

## Core Imports

**Scala:**

```scala { .api }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // for implicit conversions
```

**Java:**

```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
```

**Python:**

```python { .api }
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark import StorageLevel, SparkFiles
```

## Basic Usage

### Creating a SparkContext

```scala { .api }
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]") // Use all available cores locally

val sc = new SparkContext(conf)

// Remember to stop the context when done
sc.stop()
```

### Simple RDD Operations

**Scala:**

```scala { .api }
// Create an RDD from a collection
val data = Array(1, 2, 3, 4, 5)
val distData: RDD[Int] = sc.parallelize(data)

// Transform and action
val doubled = distData.map(_ * 2)
val result = doubled.collect() // Returns Array(2, 4, 6, 8, 10)
```

**Python:**

```python { .api }
# Create an RDD from a collection
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# Transform and action
doubled = dist_data.map(lambda x: x * 2)
result = doubled.collect()  # Returns [2, 4, 6, 8, 10]
```

## Architecture

Spark's core components work together to provide a unified analytics platform:

### SparkContext

The main entry point that coordinates distributed processing across a cluster. It creates RDDs, manages shared variables (broadcast variables and accumulators), and controls job execution.

### RDD (Resilient Distributed Dataset)

The fundamental data abstraction: an immutable, fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations:

- **Transformations**: Create new RDDs (lazy evaluation)
- **Actions**: Return values or save data (trigger computation)

### Cluster Manager

Spark can run on various cluster managers including:

- Standalone cluster manager
- Apache Mesos
- Hadoop YARN
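To make the cluster-manager choice concrete, here is a minimal sketch of how the deployment mode is selected through the master URL passed to `SparkConf.setMaster` (or `spark-submit --master`). The host names and ports are placeholders, not values taken from this spec:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Master URL formats for the deployment modes listed above (placeholders only):
//   "local[*]"                     - run locally, one worker thread per core
//   "spark://master-host:7077"     - standalone cluster manager
//   "mesos://mesos-host:5050"      - Apache Mesos
//   "yarn-client" / "yarn-cluster" - Hadoop YARN (Spark 1.x syntax)
val conf = new SparkConf()
  .setAppName("Cluster Manager Example")
  .setMaster("spark://master-host:7077") // e.g. submit to a standalone cluster

val sc = new SparkContext(conf)
```

In practice the master URL is usually supplied at submission time rather than hard-coded in the application.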
## Core Capabilities

### [RDD Operations](./core-rdd.md)

Essential transformations and actions for data processing:

```scala { .api }
// Transformations (lazy)
rdd.map(f) // Apply function to each element
rdd.filter(f) // Keep elements matching predicate
rdd.flatMap(f) // Apply function and flatten results
rdd.distinct() // Remove duplicates

// Actions (eager)
rdd.collect() // Return all elements as array
rdd.count() // Count number of elements
rdd.reduce(f) // Reduce using associative function
rdd.take(n) // Return first n elements
```

### [SparkContext API](./spark-context.md)

Comprehensive cluster management and RDD creation:

```scala { .api }
class SparkContext(conf: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T]): RDD[T]
  def textFile(path: String): RDD[String]
  def hadoopFile[K, V](path: String, ...): RDD[(K, V)]

  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def accumulator[T](initialValue: T): Accumulator[T]

  // Job Control
  def runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def stop(): Unit
}
```

### [Key-Value Operations](./key-value-operations.md)

Powerful operations on RDDs of (key, value) pairs:

```scala { .api }
// Import for PairRDDFunctions
import org.apache.spark.SparkContext._

val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Key-value transformations
pairRDD.reduceByKey(_ + _) // Combine values by key
pairRDD.groupByKey() // Group values by key
pairRDD.mapValues(_ * 2) // Transform values, preserve keys
pairRDD.join(otherPairRDD) // Inner join on keys
```

### [Data Sources](./data-sources.md)

Read and write data from various sources:

```scala { .api }
// Reading data
sc.textFile("hdfs://path/to/file")
sc.sequenceFile[K, V]("path")
sc.objectFile[T]("path")
sc.hadoopFile[K, V]("path", inputFormat, keyClass, valueClass)

// Writing data
rdd.saveAsTextFile("path")
rdd.saveAsObjectFile("path")
pairRDD.saveAsSequenceFile("path")
```

### [Caching & Persistence](./caching-persistence.md)

Optimize performance by caching frequently accessed RDDs:

```scala { .api }
import org.apache.spark.storage.StorageLevel

rdd.cache() // Cache in memory
rdd.persist(StorageLevel.MEMORY_ONLY) // Explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // Memory + disk with serialization
rdd.unpersist() // Remove from cache
```

### [Streaming](./streaming.md)

Process live data streams in micro-batches:

```scala { .api }
import org.apache.spark.streaming.{StreamingContext, Seconds}

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("hostname", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.print()
ssc.start()
ssc.awaitTermination()
```

### [Machine Learning (MLlib)](./mllib.md)

Scalable machine learning algorithms:

```scala { .api }
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data: RDD[LabeledPoint] = sc.parallelize(trainingData)
val model = LogisticRegressionWithSGD.train(data, numIterations = 100)
val predictions = model.predict(testData.map(_.features))
```
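Continuing the snippet above, a minimal sketch of how the trained model's error might be checked; the evaluation step is an assumption added here for illustration, with `data` and `model` referring to the values defined in the preceding block:

```scala
// Pair each point's label with the model's prediction for the same point,
// then compute the fraction of misclassified points (training error)
val labelsAndPredictions = data.map(point => (point.label, model.predict(point.features)))
val trainingError = labelsAndPredictions.filter { case (label, prediction) =>
  label != prediction
}.count().toDouble / data.count()
```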
### [Graph Processing (GraphX)](./graphx.md)

Large-scale graph analytics:

```scala { .api }
import org.apache.spark.graphx._

// Create graph from edge list
val edges: RDD[Edge[Double]] = sc.parallelize(edgeArray)
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
```

### [SQL](./sql.md)

Structured data processing with SQL:

```scala { .api }
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("path/to/people.json")

df.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
```

### [Python API (PySpark)](./python-api.md)

Python interface for Spark with Pythonic APIs:

```python { .api }
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Python App").setMaster("local[*]")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
squared = rdd.map(lambda x: x * x)
result = squared.collect()
```

### [Java API](./java-api.md)

Java-friendly wrappers with proper type safety:

```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> lines = jsc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(String::length);
```

## Performance Considerations

- **Caching**: Use `cache()` or `persist()` for RDDs accessed multiple times
- **Partitioning**: Control data partitioning for better performance in key-based operations
- **Serialization**: Use Kryo serializer for better performance
- **Memory Management**: Configure executor memory and storage levels appropriately
- **Shuffle Operations**: Minimize expensive shuffle operations like `groupByKey()`

## Common Patterns

**Word Count:**

```scala { .api }
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```

**Log Analysis:**

```scala { .api }
val logFile = sc.textFile("access.log")
val errors = logFile.filter(_.contains("ERROR"))
val errorsByHost = errors
  .map(line => (extractHost(line), 1))
  .reduceByKey(_ + _)
```
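The log-analysis pattern calls `extractHost`, which is not defined in this spec. A minimal sketch of one possible (hypothetical) helper, assuming the client host is the first whitespace-separated field of each log line, as in Common Log Format:

```scala
// Hypothetical helper, not part of the Spark API: pulls the host out of a log
// line by taking the first whitespace-separated field (Common Log Format style)
def extractHost(line: String): String =
  line.split(" ").headOption.getOrElse("unknown")
```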