npx @tessl/cli install tessl/maven-org-apache-spark--spark-mllib_2-13@4.0.0
# Apache Spark MLlib

Apache Spark's scalable machine learning library providing comprehensive ML algorithms and utilities for large-scale data processing. MLlib delivers high-performance distributed machine learning that scales from single machines to large clusters.

## Package Information

- **Package Name**: spark-mllib_2.13
- **Package Type**: Maven
- **Language**: Scala (compatible with Java)
- **Installation**:
  - **Maven**:
    ```xml
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.13</artifactId>
      <version>4.0.0</version>
    </dependency>
    ```
  - **SBT**: `libraryDependencies += "org.apache.spark" %% "spark-mllib" % "4.0.0"`
  - **Gradle**: `implementation 'org.apache.spark:spark-mllib_2.13:4.0.0'`

## Core Imports

```scala
// Modern DataFrame-based API (recommended)
import org.apache.spark.ml._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.fpm._

// Legacy RDD-based API (maintained for compatibility)
import org.apache.spark.mllib.classification._
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.clustering._
```

## Basic Usage

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler

// Initialize Spark session
val spark = SparkSession.builder()
  .appName("MLlib Example")
  .master("local[*]")
  .getOrCreate()

// Load data (path is relative to the Spark distribution root).
// The libsvm source yields a `label` column and a `features` vector column,
// so no VectorAssembler is needed here.
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

// Prepare features: scale each feature to unit standard deviation
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Create classifier that reads the scaled features
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setMaxIter(20)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(scaler, lr))

// Train model
val model = pipeline.fit(data)

// Make predictions
val predictions = model.transform(data)
predictions.show()
```

## Architecture

MLlib exposes two APIs on top of Spark's distributed computing engine, along with shared building blocks:

- **DataFrame-based API (org.apache.spark.ml)**: The recommended API, built around Spark DataFrames, with ML pipelines, a uniform parameter system, and column metadata support
- **RDD-based API (org.apache.spark.mllib)**: Legacy API maintained for backward compatibility; it operates directly on RDDs
- **Pipeline Architecture**: The Estimator/Transformer pattern, which enables composable ML workflows with shared parameter handling and model persistence
- **Linear Algebra**: Local dense and sparse vectors and matrices, plus distributed matrices in the RDD-based API
- **Scalable Algorithms**: Algorithms are designed to scale horizontally across cluster nodes by operating on partitioned data

## Capabilities

### Classification Algorithms

Supervised learning algorithms for predicting categorical outcomes, covering binary and multiclass classification with per-class probability outputs.

```scala { .api }
// Logistic Regression
class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]
class LogisticRegressionModel extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]

// Decision Trees
class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel]
class DecisionTreeClassificationModel extends ProbabilisticClassificationModel[Vector, DecisionTreeClassificationModel]

// Random Forest
class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
class RandomForestClassificationModel extends ProbabilisticClassificationModel[Vector, RandomForestClassificationModel]
```

[Classification](./classification.md)

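As a rough illustration of the classifier API listed above, the following sketch trains a `RandomForestClassifier` on a tiny in-memory DataFrame and reads back the `probability` and `prediction` columns. The toy data, the application name, and the local master are assumptions made for the example, not part of the package documentation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .appName("ClassificationSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical toy data: label 0.0 near the origin, label 1.0 near (1, 1)
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 0.1)),
  (0.0, Vectors.dense(0.2, 0.0)),
  (0.0, Vectors.dense(0.1, 0.2)),
  (1.0, Vectors.dense(1.0, 0.9)),
  (1.0, Vectors.dense(0.9, 1.1)),
  (1.0, Vectors.dense(1.1, 1.0))
)).toDF("label", "features")

// The estimator reads the default "label" and "features" columns
val rf = new RandomForestClassifier()
  .setNumTrees(10)
  .setSeed(42L)

// fit() returns a model; transform() appends rawPrediction, probability, prediction
val model = rf.fit(training)
val scored = model.transform(training)
scored.select("label", "probability", "prediction").show(false)
```

Swapping in `LogisticRegression` or `DecisionTreeClassifier` should need no other changes, since all three read the same default input columns.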
### Regression Algorithms

Supervised learning algorithms for predicting continuous numerical values, including linear models, tree-based methods, and survival analysis. Linear regression training summaries also expose residuals for diagnostics.

```scala { .api }
// Linear Regression
class LinearRegression extends Regressor[Vector, LinearRegression, LinearRegressionModel]
class LinearRegressionModel extends RegressionModel[Vector, LinearRegressionModel]

// Decision Tree Regression
class DecisionTreeRegressor extends Regressor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]
class DecisionTreeRegressionModel extends RegressionModel[Vector, DecisionTreeRegressionModel]

// Random Forest Regression
class RandomForestRegressor extends Regressor[Vector, RandomForestRegressor, RandomForestRegressionModel]
class RandomForestRegressionModel extends RegressionModel[Vector, RandomForestRegressionModel]
```

[Regression](./regression.md)

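A minimal sketch of the regression API, assuming hypothetical points that lie exactly on the line y = 2x + 1. With no regularization, the fitted coefficient and intercept should come out close to 2 and 1; the session settings and data are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .appName("RegressionSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical data generated from y = 2x + 1
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0)),
  (3.0, Vectors.dense(1.0)),
  (5.0, Vectors.dense(2.0)),
  (7.0, Vectors.dense(3.0))
)).toDF("label", "features")

// Default parameters: no regularization, intercept fitted
val model = new LinearRegression().fit(training)

println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

// The training summary carries fit diagnostics such as RMSE and residuals
println(s"RMSE=${model.summary.rootMeanSquaredError}")
model.summary.residuals.show()
```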
### Clustering Algorithms

Unsupervised learning algorithms for discovering hidden patterns and grouping similar data points, including partitioning, hierarchical, and probabilistic clustering methods.

```scala { .api }
// K-Means Clustering
class KMeans extends Estimator[KMeansModel] with KMeansParams
class KMeansModel extends Model[KMeansModel] with KMeansParams

// Gaussian Mixture Model
class GaussianMixture extends Estimator[GaussianMixtureModel] with GaussianMixtureParams
class GaussianMixtureModel extends Model[GaussianMixtureModel] with GaussianMixtureParams

// Latent Dirichlet Allocation
class LDA extends Estimator[LDAModel] with LDAParams
abstract class LDAModel extends Model[LDAModel] with LDAParams
```

[Clustering](./clustering.md)

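The sketch below shows one way the clustering estimators might be used: `KMeans` is fitted on two well-separated groups of hypothetical points, and the resulting model assigns a cluster index to each row. The data, seed, and session settings are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .appName("ClusteringSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical points: one group near (0, 0), another near (9, 9)
val points = spark.createDataFrame(Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(0.1, 0.2),
  Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.0),
  Vectors.dense(9.1, 8.9),
  Vectors.dense(8.9, 9.2)
).map(Tuple1.apply)).toDF("features")

// No label column is needed; the estimator only reads "features"
val model = new KMeans()
  .setK(2)
  .setSeed(1L)
  .fit(points)

// transform() appends a "prediction" column holding the cluster index
val assigned = model.transform(points)
assigned.show(false)

model.clusterCenters.foreach(println)
```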
### Feature Engineering

Comprehensive data preprocessing and feature transformation utilities for preparing raw data for machine learning algorithms, including text processing, categorical encoding, and numerical scaling.

```scala { .api }
// Vector Assembly and Manipulation
class VectorAssembler extends Transformer
class VectorSlicer extends Transformer
class VectorIndexer extends Estimator[VectorIndexerModel]

// Scaling and Normalization
class StandardScaler extends Estimator[StandardScalerModel]
class MinMaxScaler extends Estimator[MinMaxScalerModel]
class Normalizer extends Transformer

// Categorical Features
class StringIndexer extends Estimator[StringIndexerModel]
class OneHotEncoder extends Estimator[OneHotEncoderModel]
class IndexToString extends Transformer
```

[Feature Engineering](./feature-engineering.md)

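To make the categorical-encoding path concrete, here is a small sketch that chains `StringIndexer`, `OneHotEncoder`, and `VectorAssembler` in a pipeline. The column names (`color`, `size`, and the derived columns) and the rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val spark = SparkSession.builder()
  .appName("FeatureSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical raw data: one string column, one numeric column
val raw = Seq(("red", 1.0), ("blue", 2.0), ("red", 3.0)).toDF("color", "size")

// Map each distinct string to a numeric index
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")

// Turn the index into a sparse indicator vector
val encoder = new OneHotEncoder()
  .setInputCols(Array("colorIndex"))
  .setOutputCols(Array("colorVec"))

// Concatenate the encoded and numeric columns into one feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "size"))
  .setOutputCol("features")

val prepared = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))
  .fit(raw)
  .transform(raw)

prepared.select("color", "size", "features").show(false)
```

`StringIndexer` and `OneHotEncoder` are estimators, so they learn their mappings during `fit`; `VectorAssembler` is a plain transformer and needs no fitting.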
### Model Evaluation and Selection

Comprehensive model evaluation metrics and automated hyperparameter tuning capabilities for assessing model performance and optimizing ML pipelines.

```scala { .api }
// Evaluators
abstract class Evaluator extends Params
class BinaryClassificationEvaluator extends Evaluator
class MulticlassClassificationEvaluator extends Evaluator
class RegressionEvaluator extends Evaluator

// Model Selection
class CrossValidator extends Estimator[CrossValidatorModel]
class TrainValidationSplit extends Estimator[TrainValidationSplitModel]
class ParamGridBuilder
```

[Evaluation and Tuning](./evaluation-tuning.md)

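The tuning classes above are typically combined as in the following sketch, which cross-validates a logistic regression over two candidate `regParam` values. The synthetic data, grid values, fold count, and seed are assumptions chosen to keep the example small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val spark = SparkSession.builder()
  .appName("TuningSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical data: label is 1.0 when the first feature exceeds 0.5
val data = spark.createDataFrame((0 until 40).map { i =>
  val x = i / 40.0
  (if (x > 0.5) 1.0 else 0.0, Vectors.dense(x, 1.0 - x))
}).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10)

// Two candidate settings for the regularization strength
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator()) // area under ROC by default
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setSeed(42L)

// Fits every grid point on every fold, then refits the best setting on all data
val cvModel = cv.fit(data)

// One averaged metric per grid point, in grid order
println(cvModel.avgMetrics.mkString(", "))
```

`TrainValidationSplit` accepts the same estimator, evaluator, and grid, and evaluates each setting once on a single split instead of once per fold.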
### Recommendation Systems

Collaborative filtering algorithms for building recommendation engines, including matrix factorization techniques optimized for large-scale user-item interaction datasets.

```scala { .api }
// Alternating Least Squares
class ALS extends Estimator[ALSModel] with ALSParams
class ALSModel extends Model[ALSModel] with ALSModelParams
```

[Recommendation](./recommendation.md)

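A minimal sketch of ALS-based collaborative filtering follows. The column names (`userId`, `itemId`, `rating`), the ratings themselves, and the hyperparameters are hypothetical and far smaller than any realistic workload.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

val spark = SparkSession.builder()
  .appName("RecommendationSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical explicit ratings: (user, item, rating)
val ratings = Seq(
  (0, 0, 4.0f), (0, 1, 2.0f),
  (1, 0, 5.0f), (1, 2, 3.0f),
  (2, 1, 1.0f), (2, 2, 4.0f)
).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(2)
  .setMaxIter(5)
  .setRegParam(0.1)
  .setColdStartStrategy("drop") // drop rows with NaN predictions for unseen ids
  .setSeed(7L)

val model = als.fit(ratings)

// Top 2 items per user, returned as an array of (itemId, rating) structs
val top = model.recommendForAllUsers(2)
top.show(false)
```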
### Pipeline Components

Core abstractions and utilities for building composable machine learning workflows with automated parameter management, model persistence, and metadata handling.

```scala { .api }
// Core Pipeline Classes
abstract class Estimator[M <: Model[M]] extends PipelineStage
abstract class Transformer extends PipelineStage
abstract class Model[M <: Model[M]] extends Transformer
class Pipeline extends Estimator[PipelineModel]
class PipelineModel extends Model[PipelineModel]

// Parameter System
trait Params
class Param[T]
class ParamMap
```

[Pipeline Components](./pipeline-components.md)

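The persistence and parameter handling mentioned above can be sketched as follows: a fitted `PipelineModel` is written to disk and loaded back, and a stage's effective parameters are inspected through a `ParamMap`. The temporary directory, toy data, and parameter values are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder()
  .appName("PipelineSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()

// Hypothetical two-class toy data
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 0.1)),
  (0.0, Vectors.dense(0.2, 0.0)),
  (1.0, Vectors.dense(1.0, 0.9)),
  (1.0, Vectors.dense(0.9, 1.1))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(5).setRegParam(0.01)

// extractParamMap() returns defaults merged with explicitly set values
val params = lr.extractParamMap()
println(s"maxIter=${params(lr.maxIter)} regParam=${params(lr.regParam)}")

// An Estimator's fit() yields a Model, which is itself a Transformer
val model: PipelineModel = new Pipeline().setStages(Array(lr)).fit(training)

// Persist the fitted pipeline to a temporary directory and load it back
val path = Files.createTempDirectory("pipeline-model").toString
model.write.overwrite().save(path)
val reloaded = PipelineModel.load(path)

reloaded.transform(training).select("label", "prediction").show()
```

`overwrite()` is used here because the temporary directory already exists; without it, `save` refuses to write to an existing path.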
### Linear Algebra

Local vector and matrix types, in dense and sparse forms, used as column values by the DataFrame-based API. Distributed matrices for computations that span cluster nodes are provided by the RDD-based API (org.apache.spark.mllib.linalg.distributed).

```scala { .api }
// Vector Types
sealed trait Vector extends Serializable
class DenseVector extends Vector
class SparseVector extends Vector
object Vectors

// Matrix Types
sealed trait Matrix extends Serializable
class DenseMatrix extends Matrix
class SparseMatrix extends Matrix
object Matrices
```

[Linear Algebra](./linear-algebra.md)

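The vector and matrix factories can be exercised without a Spark session, as in this short sketch. The values are arbitrary; note that `Matrices.dense` takes its entries in column-major order.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Dense vector: every entry is stored explicitly
val dense = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector of size 3 with non-zero entries at indices 0 and 2
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both representations hold the same values
val sameValues = dense.toArray.sameElements(sparse.toArray)

// Squared Euclidean distance from the origin: 1 + 0 + 9
val dist = Vectors.sqdist(dense, Vectors.dense(0.0, 0.0, 0.0))

// 2x2 matrix [[1, 2], [3, 4]], given column by column
val m = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0))

// Matrix-vector product with [1, 1] sums each row: [3, 7]
val product = m.multiply(Vectors.dense(1.0, 1.0))

println(s"sameValues=$sameValues dist=$dist product=$product")
```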
### Frequent Pattern Mining

Algorithms for discovering frequent patterns, association rules, and sequences in large datasets, commonly used for market basket analysis and recommendation systems.

```scala { .api }
// FP-Growth Algorithm
class FPGrowth extends Estimator[FPGrowthModel] with FPGrowthParams
class FPGrowthModel extends Model[FPGrowthModel] with FPGrowthParams

// PrefixSpan Algorithm (not an Estimator: there is no fit() and no model class)
final class PrefixSpan extends Params
// def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
```

Note: Frequent Pattern Mining capabilities are included in the core MLlib package.

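The two algorithms are called differently, which the following sketch tries to show: `FPGrowth` follows the usual fit-then-query flow, while `PrefixSpan` returns its patterns directly from `findFrequentSequentialPatterns`. The baskets, sequences, and thresholds are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.fpm.{FPGrowth, PrefixSpan}

val spark = SparkSession.builder()
  .appName("FpmSketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical baskets: each row is an array of distinct items
val baskets = Seq("1 2 5", "1 2 3 5", "1 2").map(_.split(" ")).toDF("items")

val fpModel = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(baskets)

fpModel.freqItemsets.show()     // itemsets meeting the support threshold
fpModel.associationRules.show() // rules meeting the confidence threshold

// Hypothetical sequences: each row is an ordered list of itemsets
val sequences = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6))
).toDF("sequence")

// PrefixSpan has no fit(); it computes the patterns in one call
val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .findFrequentSequentialPatterns(sequences)

patterns.show(false)
```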
## Types

```scala { .api }
// Core ML Types
import org.apache.spark.ml.linalg.{Vector, DenseVector, SparseVector, Matrix, DenseMatrix, SparseMatrix}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Pipeline Parameter Types
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.{Identifiable, MLWritable, MLReadable}

// Algorithm-Specific Types
import org.apache.spark.ml.classification.{ClassificationModel, Classifier, ProbabilisticClassifier}
import org.apache.spark.ml.regression.{RegressionModel, Regressor}
import org.apache.spark.ml.clustering.{KMeansModel, GaussianMixtureModel, LDAModel}
import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel, PrefixSpan}

// Parameter Traits (internal to Spark; listed for orientation only)
trait PredictorParams extends Params                          // labelCol, featuresCol, predictionCol
trait ClassifierParams extends PredictorParams                // adds rawPredictionCol
trait ProbabilisticClassifierParams extends ClassifierParams  // adds probabilityCol, thresholds
trait LogisticRegressionParams extends ProbabilisticClassifierParams
```