0
# Apache Spark MLlib
1
2
Apache Spark MLlib is a comprehensive machine learning library designed for large-scale data processing and analysis. It provides a unified API for implementing machine learning algorithms including classification, regression, clustering, and collaborative filtering for recommendation systems. MLlib supports end-to-end machine learning workflows through its Pipeline API and is built on Spark's distributed computing framework for scalable production deployments.
3
4
## Package Information
5
6
- **Package Name**: spark-mllib_2.13
7
- **Package Type**: maven
8
- **Language**: Scala
9
- **Installation**: Add to your `build.sbt`: `libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.5.6"`
10
- **Maven**: `<dependency><groupId>org.apache.spark</groupId><artifactId>spark-mllib_2.13</artifactId><version>3.5.6</version></dependency>`
11
12
## Core Imports
13
14
```scala
15
import org.apache.spark.ml._
16
import org.apache.spark.ml.classification._
17
import org.apache.spark.ml.regression._
18
import org.apache.spark.ml.clustering._
19
import org.apache.spark.ml.feature._
20
import org.apache.spark.ml.evaluation._
21
import org.apache.spark.ml.tuning._
22
import org.apache.spark.ml.linalg._
23
```
24
25
## Basic Usage
26
27
```scala
28
import org.apache.spark.sql.SparkSession
29
import org.apache.spark.ml.classification.LogisticRegression
30
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
31
import org.apache.spark.ml.Pipeline
32
33
// Initialize Spark session
34
val spark = SparkSession.builder()
35
.appName("MLlib Example")
36
.master("local[*]")
37
.getOrCreate()
38
39
import spark.implicits._
40
41
// Load data
42
val data = spark.read
43
.option("header", "true")
44
.option("inferSchema", "true")
45
.csv("path/to/data.csv")
46
47
// Feature preparation
48
val assembler = new VectorAssembler()
49
.setInputCols(Array("feature1", "feature2", "feature3"))
50
.setOutputCol("features")
51
52
val indexer = new StringIndexer()
53
.setInputCol("label")
54
.setOutputCol("labelIndex")
55
56
// Create classifier
57
val lr = new LogisticRegression()
58
.setLabelCol("labelIndex")
59
.setFeaturesCol("features")
60
61
// Create pipeline
62
val pipeline = new Pipeline()
63
.setStages(Array(indexer, assembler, lr))
64
65
// Fit model
66
val model = pipeline.fit(data)
67
68
// Make predictions
69
val predictions = model.transform(data)
70
predictions.select("prediction", "labelIndex", "features").show()
71
```
72
73
## Architecture
74
75
Apache Spark MLlib is built around several key abstractions:
76
77
- **DataFrame-based API (org.apache.spark.ml)**: Modern, unified API using DataFrames with strongly-typed interfaces
78
- **Pipeline Framework**: Composable machine learning workflows using `Estimator`, `Transformer`, and `Model` abstractions
79
- **Parameter System**: Type-safe parameter handling with validation and grid search support
80
- **Persistence**: Built-in model saving and loading capabilities
81
- **Distributed Computing**: Leverages Spark's distributed processing for scalability across clusters
82
- **Linear Algebra**: Optimized vector and matrix operations with sparse/dense representations
83
84
The design follows these key patterns:
85
86
1. **Estimator Pattern**: Algorithms implement `Estimator[M]` with a `fit()` method that returns a `Model[M]`
87
2. **Transformer Pattern**: All models and feature transformers implement `Transformer` with a `transform()` method
88
3. **Pipeline Pattern**: Multiple stages can be chained together in a `Pipeline` for complex workflows
89
4. **Parameter Pattern**: All components use typed parameters with automatic validation
90
91
## Capabilities
92
93
### Classification Algorithms
94
95
Supervised learning algorithms for predicting categorical labels including logistic regression, decision trees, random forests, gradient-boosted trees, and neural networks.
96
97
```scala { .api }
98
// Core classification interface
99
abstract class Classifier[
100
FeaturesType,
101
Learner <: Classifier[FeaturesType, Learner, M],
102
M <: ClassificationModel[FeaturesType, M]
103
] extends Predictor[FeaturesType, Learner, M]
104
105
// Main algorithms
106
class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]
107
class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel]
108
class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
109
class GBTClassifier extends Classifier[Vector, GBTClassifier, GBTClassificationModel]
110
```
111
112
[Classification Algorithms](./classification.md)
113
114
### Regression Algorithms
115
116
Supervised learning algorithms for predicting continuous values including linear regression, decision trees, random forests, and generalized linear models.
117
118
```scala { .api }
119
// Core regression interface
120
abstract class Regressor[
121
FeaturesType,
122
Learner <: Regressor[FeaturesType, Learner, M],
123
M <: RegressionModel[FeaturesType, M]
124
] extends Predictor[FeaturesType, Learner, M]
125
126
// Main algorithms
127
class LinearRegression extends Regressor[Vector, LinearRegression, LinearRegressionModel]
128
class DecisionTreeRegressor extends Regressor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]
129
class RandomForestRegressor extends Regressor[Vector, RandomForestRegressor, RandomForestRegressionModel]
130
class GBTRegressor extends Regressor[Vector, GBTRegressor, GBTRegressionModel]
131
```
132
133
[Regression Algorithms](./regression.md)
134
135
### Clustering Algorithms
136
137
Unsupervised learning algorithms for discovering patterns and groupings in data including K-means, Gaussian mixture models, and topic modeling.
138
139
```scala { .api }
140
// Main clustering algorithms
141
class KMeans extends Estimator[KMeansModel] with KMeansParams
142
class GaussianMixture extends Estimator[GaussianMixtureModel] with GaussianMixtureParams
143
class BisectingKMeans extends Estimator[BisectingKMeansModel] with BisectingKMeansParams
144
class LDA extends Estimator[LDAModel] with LDAParams
145
```
146
147
[Clustering Algorithms](./clustering.md)
148
149
### Feature Processing
150
151
Comprehensive feature extraction, transformation, selection, and engineering capabilities including text processing, scaling, dimensionality reduction, and categorical encoding.
152
153
```scala { .api }
154
// Feature transformation base classes
155
abstract class Transformer extends PipelineStage
156
abstract class Estimator[M <: Model[M]] extends PipelineStage
157
158
// Key feature transformers
159
class VectorAssembler extends Transformer
160
class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams
161
class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerParams
162
class OneHotEncoder extends Estimator[OneHotEncoderModel] with OneHotEncoderParams
163
class PCA extends Estimator[PCAModel] with PCAParams
164
```
165
166
[Feature Processing](./feature-processing.md)
167
168
### Model Evaluation
169
170
Tools for assessing model performance including classification metrics, regression metrics, clustering evaluation, and cross-validation.
171
172
```scala { .api }
173
// Evaluation base class
174
abstract class Evaluator extends Params
175
176
// Main evaluators
177
class BinaryClassificationEvaluator extends Evaluator with HasLabelCol with HasFeaturesCol
178
class MulticlassClassificationEvaluator extends Evaluator with HasLabelCol with HasPredictionCol
179
class RegressionEvaluator extends Evaluator with HasLabelCol with HasPredictionCol
180
class ClusteringEvaluator extends Evaluator with HasFeaturesCol with HasPredictionCol
181
```
182
183
[Model Evaluation](./evaluation.md)
184
185
### Pipeline and Hyperparameter Tuning
186
187
Framework for building complex ML workflows and automated hyperparameter optimization including pipelines, cross-validation, and parameter grids.
188
189
```scala { .api }
190
// Pipeline framework
191
class Pipeline extends Estimator[PipelineModel]
192
class PipelineModel extends Model[PipelineModel] with Transformer
193
194
// Hyperparameter tuning
195
class CrossValidator extends Estimator[CrossValidatorModel]
196
class TrainValidationSplit extends Estimator[TrainValidationSplitModel]
197
class ParamGridBuilder
198
```
199
200
[Pipeline and Tuning](./pipeline.md)
201
202
### Linear Algebra
203
204
High-performance vector and matrix operations with support for dense and sparse representations, distributed matrices, and common linear algebra operations.
205
206
```scala { .api }
207
// Core linear algebra types
208
sealed trait Vector extends Serializable
209
class DenseVector(val values: Array[Double]) extends Vector
210
class SparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vector
211
212
sealed trait Matrix extends Serializable
213
class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix
214
class SparseMatrix(val numRows: Int, val numCols: Int, val colPtrs: Array[Int], val rowIndices: Array[Int], val values: Array[Double]) extends Matrix
215
216
// Factory objects
217
object Vectors
218
object Matrices
219
```
220
221
[Linear Algebra](./linear-algebra.md)
222
223
### Collaborative Filtering
224
225
Matrix factorization-based recommendation systems using Alternating Least Squares for building recommendation engines from user-item interaction data.
226
227
```scala { .api }
228
class ALS extends Estimator[ALSModel] with ALSParams
229
class ALSModel extends Model[ALSModel] with ALSModelParams {
230
def recommendForAllUsers(numItems: Int): DataFrame
231
def recommendForAllItems(numUsers: Int): DataFrame
232
def userFactors: DataFrame
233
def itemFactors: DataFrame
234
}
235
```
236
237
[Collaborative Filtering](./recommendation.md)
238
239
### Frequent Pattern Mining
240
241
Data mining algorithms for discovering frequent patterns, itemsets, and sequential patterns including FP-Growth for market basket analysis and PrefixSpan for sequential pattern discovery.
242
243
```scala { .api }
244
// FP-Growth for frequent itemset mining
245
class FPGrowth extends Estimator[FPGrowthModel] with FPGrowthParams
246
class FPGrowthModel extends Model[FPGrowthModel] with FPGrowthParams {
247
def transform(dataset: Dataset[_]): DataFrame
248
def freqItemsets: DataFrame
249
def associationRules: DataFrame
250
}
251
252
// PrefixSpan for sequential pattern mining
253
class PrefixSpan extends Estimator[PrefixSpanModel] with PrefixSpanParams
254
class PrefixSpanModel extends Model[PrefixSpanModel] with PrefixSpanParams {
255
def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
256
}
257
```
258
259
[Frequent Pattern Mining](./frequent-pattern-mining.md)
260
261
### Statistical Functions
262
263
Statistical analysis and hypothesis testing including correlation analysis, summary statistics, Chi-square tests, ANOVA, and kernel density estimation.
264
265
```scala { .api }
266
object Statistics {
267
def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix
268
def chiSquareTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
269
def anovaTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
270
}
271
272
object Summarizer {
273
def metrics(metrics: String*): SummaryBuilder
274
def mean(col: Column): Column
275
def variance(col: Column): Column
276
}
277
```
278
279
[Statistical Functions](./statistics.md)
280
281
## Core Types
282
283
```scala { .api }
284
// Pipeline stage base types
285
trait PipelineStage extends Params with Logging
286
287
trait Estimator[M <: Model[M]] extends PipelineStage {
288
def fit(dataset: Dataset[_]): M
289
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]
290
}
291
292
trait Transformer extends PipelineStage {
293
def transform(dataset: Dataset[_]): DataFrame
294
def transformSchema(schema: StructType): StructType
295
}
296
297
trait Model[M <: Model[M]] extends Transformer {
298
def parent: Estimator[M]
299
def hasParent: Boolean
300
}
301
302
// Parameter system
303
trait Params {
304
def copy(extra: ParamMap): this.type
305
def extractParamMap(): ParamMap
306
def explainParams(): String
307
}
308
309
case class ParamMap(map: mutable.Map[Param[Any], Any])
310
class Param[T](val parent: String, val name: String)
311
case class ParamPair[T](param: Param[T], value: T)
312
313
// Common parameter traits
314
trait HasInputCol extends Params
315
trait HasOutputCol extends Params
316
trait HasLabelCol extends Params
317
trait HasFeaturesCol extends Params
318
trait HasPredictionCol extends Params
319
```