tessl/maven-org-apache-spark--spark-streaming-kafka-assembly-2-10

A shaded JAR assembly containing all dependencies needed for Kafka integration with Spark Streaming

Apache Spark Streaming Kafka Assembly

Apache Spark Streaming Kafka Assembly provides seamless integration between Apache Kafka and Spark Streaming, enabling real-time data processing from Kafka topics. This shaded JAR assembly bundles all necessary dependencies to avoid version conflicts on the classpath, and it supports both the receiver-based and the direct (no-receiver) streaming approaches, the latter enabling exactly-once processing semantics.

Package Information

  • Package Name: spark-streaming-kafka-assembly_2.10
  • Package Type: maven
  • Language: Scala (with Java API)
  • Installation: spark-streaming-kafka-assembly_2.10-1.6.3.jar (add to classpath)
  • Maven Coordinates: org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.3
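
The assembly JAR is usually handed straight to spark-submit via --jars, as the installation note above suggests. If you prefer to resolve it from Maven Central instead, a minimal sbt sketch would be the following (the plain % is deliberate, since the _2.10 Scala suffix is already spelled out in the artifact name):

// build.sbt -- resolve the assembly from Maven Central
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.3"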

Core Imports

Scala

import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.{StringDecoder, DefaultDecoder}

Java

import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import org.apache.spark.streaming.kafka.Broker;
import org.apache.spark.streaming.kafka.HasOffsetRanges;

Basic Usage

Direct Streaming (Recommended)

import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "localhost:9092", // Kafka brokers, not Zookeeper
  "auto.offset.reset" -> "largest"            // start from the latest offsets
)
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics
)

stream.foreachRDD { rdd =>
  // The cast succeeds only on RDDs produced directly by the Kafka stream,
  // before any transformation (e.g., map) changes the RDD type
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Process the data
  rdd.foreach(println)
}

Receiver-based Streaming (Legacy)

import org.apache.spark.storage.StorageLevel

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "localhost:2181",
  "group.id" -> "my-consumer-group"
)
val topicMap = Map("my-topic" -> 1) // topic -> number of consumer threads

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER_2
)
stream.print()

Architecture

The Spark Streaming Kafka integration is built around several key components:

  • KafkaUtils: Central factory object providing static methods for creating streams and RDDs
  • Direct Streaming: No-receiver approach that queries Kafka brokers directly, enabling exactly-once semantics
  • Receiver-based Streaming: Traditional approach using long-running receivers (legacy)
  • Offset Management: Manual control over Kafka offsets for exactly-once processing guarantees
  • Type Safety: Full support for custom key/value types with pluggable serializer/deserializer framework

Capabilities

Direct Stream Creation

Creates input streams that pull messages directly from Kafka brokers without receivers, giving manual offset control and enabling exactly-once semantics when outputs are idempotent or transactional.

def createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V]](
  ssc: StreamingContext,
  kafkaParams: Map[String, String],
  topics: Set[String]
): InputDStream[(K, V)]

def createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
  ssc: StreamingContext,
  kafkaParams: Map[String, String],  
  fromOffsets: Map[TopicAndPartition, Long],
  messageHandler: MessageAndMetadata[K, V] => R
): InputDStream[R]
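
As a sketch of the second overload, the snippet below resumes from explicitly stored offsets and uses a message handler to keep partition and offset metadata alongside each value. The streamingContext and kafkaParams values are assumed from the basic usage example, and the offset numbers are purely illustrative:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka._

// Illustrative starting offsets for two partitions of my-topic
val fromOffsets = Map(
  TopicAndPartition("my-topic", 0) -> 100L,
  TopicAndPartition("my-topic", 1) -> 250L
)

// Keep each value together with the partition and offset it came from
val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
  (mmd.partition, mmd.offset, mmd.message())

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (Int, Long, String)](
  streamingContext, kafkaParams, fromOffsets, messageHandler
)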

Receiver-based Stream Creation

Creates input streams using long-running receivers that pull messages from Kafka brokers through Zookeeper coordination (legacy approach).

def createStream[K, V, U <: Decoder[K], T <: Decoder[V]](
  ssc: StreamingContext,
  kafkaParams: Map[String, String],
  topics: Map[String, Int],
  storageLevel: StorageLevel
): ReceiverInputDStream[(K, V)]
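
Each receiver-based stream occupies one long-running receiver, so a common pattern for higher throughput is to create several streams and union them. A minimal sketch, reusing the streamingContext and the Zookeeper-style kafkaParams from the receiver example above:

import org.apache.spark.storage.StorageLevel
import kafka.serializer.StringDecoder

// One receiver per stream; union them so partitions are consumed in parallel
val numReceivers = 3
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    streamingContext, kafkaParams, Map("my-topic" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER_2
  )
}
val unifiedStream = streamingContext.union(streams)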

Batch RDD Creation

Creates RDDs from Kafka using specific offset ranges for batch processing scenarios.

def createRDD[K, V, KD <: Decoder[K], VD <: Decoder[V]](
  sc: SparkContext,
  kafkaParams: Map[String, String],
  offsetRanges: Array[OffsetRange]
): RDD[(K, V)]

def createRDD[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
  sc: SparkContext,
  kafkaParams: Map[String, String],
  offsetRanges: Array[OffsetRange],
  leaders: Map[TopicAndPartition, Broker],
  messageHandler: MessageAndMetadata[K, V] => R
): RDD[R]
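
For instance, a batch job can replay a fixed slice of each partition. In this sketch, sparkContext and the broker-style kafkaParams (with metadata.broker.list set) are assumptions, and the offset ranges are illustrative:

import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

// Each OffsetRange pins down exactly which messages the RDD will contain
val offsetRanges = Array(
  OffsetRange("my-topic", partition = 0, fromOffset = 0L, untilOffset = 100L),
  OffsetRange("my-topic", partition = 1, fromOffset = 0L, untilOffset = 100L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sparkContext, kafkaParams, offsetRanges
)
println(s"Read ${rdd.count()} messages")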

Java API

Complete Java API with type-safe wrappers for all Scala functionality, supporting both streaming and batch processing scenarios.

public static <K, V, KD extends Decoder<K>, VD extends Decoder<V>> 
  JavaPairInputDStream<K, V> createDirectStream(
    JavaStreamingContext jssc,
    Class<K> keyClass,
    Class<V> valueClass,
    Class<KD> keyDecoderClass,
    Class<VD> valueDecoderClass,
    Map<String, String> kafkaParams,
    Set<String> topics
  )

Offset Management

Utilities for managing Kafka offsets, including offset range representation and cluster interaction helpers.

final class OffsetRange(
  val topic: String,
  val partition: Int,
  val fromOffset: Long,
  val untilOffset: Long
)

trait HasOffsetRanges {
  def offsetRanges: Array[OffsetRange]
}
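
A common exactly-once pattern is to read the ranges on the driver, process each partition, and then persist the untilOffset that the partition covered. Where the offsets are stored (a database, ZooKeeper, etc.) is left abstract in this sketch, which assumes the direct stream from the basic usage example:

stream.foreachRDD { rdd =>
  // Read the ranges on the driver, before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // Spark partition indices line up with the offset-range array
    val range = offsetRanges(org.apache.spark.TaskContext.get.partitionId)
    iter.foreach(record => ()) // process the partition's records here
    // then persist range.topic, range.partition, range.untilOffset
  }
}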

Types

Common Type Parameters

// K - Type of Kafka message key
// V - Type of Kafka message value  
// KD - Type of Kafka message key decoder (extends kafka.serializer.Decoder[K])
// VD - Type of Kafka message value decoder (extends kafka.serializer.Decoder[V])
// R - Type returned by message handler function

Kafka Dependencies

// From Kafka library
kafka.common.TopicAndPartition
kafka.message.MessageAndMetadata[K, V] 
kafka.serializer.Decoder[T]
kafka.serializer.StringDecoder
kafka.serializer.DefaultDecoder
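
Custom key or value types plug in through kafka.serializer.Decoder. A hypothetical decoder for 4-byte big-endian integers might look like the sketch below; the single VerifiableProperties constructor parameter matters, because Kafka instantiates decoder classes reflectively with that argument:

import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// Hypothetical decoder: interprets each message payload as one big-endian Int
class IntDecoder(props: VerifiableProperties = null) extends Decoder[Int] {
  override def fromBytes(bytes: Array[Byte]): Int =
    java.nio.ByteBuffer.wrap(bytes).getInt
}

Such a class can then be supplied as a decoder type parameter, e.g. KafkaUtils.createDirectStream[String, Int, StringDecoder, IntDecoder](...).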

Spark Dependencies

// Core Spark classes
org.apache.spark.streaming.StreamingContext
org.apache.spark.SparkContext
org.apache.spark.rdd.RDD[T] 
org.apache.spark.streaming.dstream.{InputDStream, ReceiverInputDStream}
org.apache.spark.storage.StorageLevel

// Spark Streaming Kafka classes (from this package)
org.apache.spark.streaming.kafka.OffsetRange
org.apache.spark.streaming.kafka.Broker
org.apache.spark.streaming.kafka.HasOffsetRanges

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-spark--spark-streaming-kafka-assembly-2-10
Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.apache.spark/spark-streaming-kafka-assembly_2.10@1.6.x