Apache Spark

Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.

Package Information

Maven Coordinates:

<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>

Scala Version: 2.10.x
Java Version: Java 6+
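
The coordinates above cover only the core module; in Spark 1.0.x the SQL, streaming, MLlib, and GraphX APIs shown later ship as separate artifacts (spark-sql_2.10, spark-streaming_2.10, spark-mllib_2.10, spark-graphx_2.10). As a rough sketch, the equivalent sbt dependency for the core module would be:

// build.sbt (sbt's %% appends the _2.10 suffix for a Scala 2.10 project)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"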

Core Imports

Scala:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // for implicit conversions

Java:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;

Python:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark import StorageLevel, SparkFiles

Basic Usage

Creating a SparkContext

import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")  // Use all available cores locally

val sc = new SparkContext(conf)

// Remember to stop the context when done
sc.stop()

Simple RDD Operations

Scala:

// Create an RDD from a collection
val data = Array(1, 2, 3, 4, 5)
val distData: RDD[Int] = sc.parallelize(data)

// Transform and action
val doubled = distData.map(_ * 2)
val result = doubled.collect()  // Returns Array(2, 4, 6, 8, 10)

Python:

# Create an RDD from a collection
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# Transform and action
doubled = dist_data.map(lambda x: x * 2)
result = doubled.collect()  # Returns [2, 4, 6, 8, 10]

Architecture

Spark's core components work together to provide a unified analytics platform:

SparkContext

The main entry point that coordinates distributed processing across a cluster. It creates RDDs, manages shared variables (broadcast variables and accumulators), and controls job execution.

RDD (Resilient Distributed Dataset)

The fundamental data abstraction - an immutable, fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations:

  • Transformations: Create new RDDs (lazy evaluation)
  • Actions: Return values or save data (trigger computation); see the sketch below
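
A minimal sketch of this split, assuming a small local RDD: the transformations below only record lineage, and nothing executes until the action runs.

val numbers = sc.parallelize(1 to 100)

// Transformations: lazily build the lineage graph
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Action: triggers the actual computation and returns a result to the driver
val total = squares.count()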

Cluster Manager

Spark can run on various cluster managers, selected through the master URL (see the sketch after this list), including:

  • Standalone cluster manager
  • Apache Mesos
  • Hadoop YARN
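
A sketch of how the choice of cluster manager maps onto the master URL passed to SparkConf (host names and ports here are placeholders; the YARN mode also assumes a configured Hadoop environment):

val conf = new SparkConf().setAppName("My App")

conf.setMaster("local[4]")                  // local mode with 4 worker threads
conf.setMaster("spark://master-host:7077")  // standalone cluster manager
conf.setMaster("mesos://mesos-host:5050")   // Apache Mesos
conf.setMaster("yarn-client")               // Hadoop YARN (driver runs on the client)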

Core Capabilities

RDD Operations

Essential transformations and actions for data processing:

// Transformations (lazy)
rdd.map(f)           // Apply function to each element
rdd.filter(f)        // Keep elements matching predicate  
rdd.flatMap(f)       // Apply function and flatten results
rdd.distinct()       // Remove duplicates

// Actions (eager)
rdd.collect()        // Return all elements as array
rdd.count()          // Count number of elements
rdd.reduce(f)        // Reduce using associative function
rdd.take(n)          // Return first n elements

SparkContext API

Comprehensive cluster management and RDD creation:

class SparkContext(conf: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T]): RDD[T]
  def textFile(path: String): RDD[String]
  def hadoopFile[K, V](path: String, ...): RDD[(K, V)]
  
  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def accumulator[T](initialValue: T): Accumulator[T]
  
  // Job Control
  def runJob[T, U](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def stop(): Unit
}
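
A brief usage sketch for the shared-variable methods above (the lookup map and counter are illustrative; the SparkContext._ import supplies the implicit AccumulatorParam):

import org.apache.spark.SparkContext._

// Broadcast: ship a read-only value to every executor once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val ids = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))

// Accumulator: tasks add to it, the driver reads the aggregated value
val negatives = sc.accumulator(0)
sc.parallelize(Seq(1, -2, 3)).foreach(x => if (x < 0) negatives += 1)
println(negatives.value)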

Key-Value Operations

Powerful operations on RDDs of (key, value) pairs:

// Import for PairRDDFunctions
import org.apache.spark.SparkContext._

val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))
val otherPairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 3), ("c", 4)))

// Key-value transformations
pairRDD.reduceByKey(_ + _)        // Combine values by key
pairRDD.groupByKey()              // Group values by key
pairRDD.mapValues(_ * 2)          // Transform values, preserve keys
pairRDD.join(otherPairRDD)        // Inner join on keys

Data Sources

Read and write data from various sources:

// Reading data
sc.textFile("hdfs://path/to/file")
sc.sequenceFile[K, V]("path")
sc.objectFile[T]("path") 
sc.hadoopFile[K, V]("path", inputFormat, keyClass, valueClass)

// Writing data
rdd.saveAsTextFile("path")
rdd.saveAsObjectFile("path")
pairRDD.saveAsSequenceFile("path")

Caching & Persistence

Optimize performance by caching frequently accessed RDDs:

import org.apache.spark.storage.StorageLevel

rdd.cache()                           // Cache in memory
rdd.persist(StorageLevel.MEMORY_ONLY) // Explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // Memory + disk with serialization
rdd.unpersist()                       // Remove from cache

Streaming

Process live data streams in micro-batches:

import org.apache.spark.streaming.{StreamingContext, Seconds}

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("hostname", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.print()
ssc.start()
ssc.awaitTermination()

Machine Learning (MLlib)

Scalable machine learning algorithms:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// trainingData (a Seq[LabeledPoint]) and testData (an RDD[LabeledPoint]) are prepared elsewhere
val data: RDD[LabeledPoint] = sc.parallelize(trainingData)
val model = LogisticRegressionWithSGD.train(data, 100)  // 100 iterations
val predictions = model.predict(testData.map(_.features))

Graph Processing (GraphX)

Large-scale graph analytics:

import org.apache.spark.graphx._

// Create a graph from a small in-memory edge list (sample values)
val edgeArray = Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(3L, 1L, 3.0))
val edges: RDD[Edge[Double]] = sc.parallelize(edgeArray)
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices

SQL

Structured data processing with SQL:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion for RDDs of case classes

// In 1.0.x, structured data comes from case classes registered as a table
case class Person(name: String, age: Int)
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

Python API (PySpark)

Python interface for Spark with Pythonic APIs:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("My Python App").setMaster("local[*]")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
squared = rdd.map(lambda x: x * x)
result = squared.collect()

Java API

Java-friendly wrappers with proper type safety:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> lines = jsc.textFile("data.txt");
// Java 6-compatible anonymous Function (on Java 8 a method reference such as String::length also works)
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});

Performance Considerations

  • Caching: Use cache() or persist() for RDDs accessed multiple times
  • Partitioning: Control data partitioning for better performance in key-based operations
  • Serialization: Use the Kryo serializer for better performance (see the configuration sketch after this list)
  • Memory Management: Configure executor memory and storage levels appropriately
  • Shuffle Operations: Minimize expensive shuffle operations like groupByKey()
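
A sketch of two of these points, assuming illustrative class names and partition counts: enabling Kryo through SparkConf, and pre-partitioning a pair RDD so key-based operations shuffle less data.

import org.apache.spark.SparkContext._  // pair RDD operations
import org.apache.spark.HashPartitioner

// Register the Kryo serializer via configuration
val conf = new SparkConf()
  .setAppName("Tuned App")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Pre-partition and cache a pair RDD so later key-based operations reuse the layout
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(8)).persist()
val sums = partitioned.reduceByKey(_ + _)  // prefer reduceByKey over groupByKey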

Common Patterns

Word Count:

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Log Analysis:

val logFile = sc.textFile("access.log")
val errors = logFile.filter(_.contains("ERROR"))
// extractHost is a user-defined helper that pulls the host name out of a log line
val errorsByHost = errors
  .map(line => (extractHost(line), 1))
  .reduceByKey(_ + _)