Tessl Tile for maven/org.apache.spark/spark-mllib_2.13@3.5.0

# Apache Spark MLlib

Apache Spark MLlib is a comprehensive machine learning library designed for large-scale data processing and analysis. It provides a unified API for implementing machine learning algorithms including classification, regression, clustering, and collaborative filtering for recommendation systems. MLlib supports end-to-end machine learning workflows through its Pipeline API and is built on Spark's distributed computing framework for scalable production deployments.

## Package Information

- **Package Name**: spark-mllib_2.13
- **Package Type**: maven
- **Language**: Scala
- **Installation**: Add to your `build.sbt`: `libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.5.6"`
- **Maven**: `<dependency><groupId>org.apache.spark</groupId><artifactId>spark-mllib_2.13</artifactId><version>3.5.6</version></dependency>`

## Core Imports

```scala
import org.apache.spark.ml._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.linalg._
```

## Basic Usage

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.Pipeline

// Initialize Spark session
val spark = SparkSession.builder()
  .appName("MLlib Example")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Load data
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/data.csv")

// Feature preparation
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")

// Create classifier
val lr = new LogisticRegression()
  .setLabelCol("labelIndex")
  .setFeaturesCol("features")

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(indexer, assembler, lr))

// Fit model
val model = pipeline.fit(data)

// Make predictions
val predictions = model.transform(data)
predictions.select("prediction", "labelIndex", "features").show()
```

## Architecture

Apache Spark MLlib is built around several key abstractions:

- **DataFrame-based API (org.apache.spark.ml)**: Modern, unified API in which estimators and transformers read and write DataFrame columns
- **Pipeline Framework**: Composable machine learning workflows using `Estimator`, `Transformer`, and `Model` abstractions
- **Parameter System**: Type-safe parameter handling with validation and grid search support
- **Persistence**: Built-in model saving and loading capabilities
- **Distributed Computing**: Leverages Spark's distributed processing for scalability across clusters
- **Linear Algebra**: Optimized vector and matrix operations with sparse/dense representations

The design follows these key patterns:

1. **Estimator Pattern**: Algorithms implement `Estimator[M]` with a `fit()` method that returns a fitted model of type `M`
2. **Transformer Pattern**: All models and feature transformers implement `Transformer` with a `transform()` method
3. **Pipeline Pattern**: Multiple stages can be chained together in a `Pipeline` for complex workflows
4. **Parameter Pattern**: All components use typed parameters with automatic validation
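The estimator and transformer patterns can be sketched with a single feature stage. This is a minimal illustration written as a script (for example for `spark-shell`), assuming a local Spark session; the toy data and the column names `category` and `categoryIndex` are made up for the example.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EstimatorTransformerSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy data; the column name is illustrative.
val df = Seq("a", "b", "a", "c", "a", "b").toDF("category")

// Estimator: holds configuration only, nothing has been learned yet.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit() learns the label-to-index mapping and returns a Model.
val indexerModel: StringIndexerModel = indexer.fit(df)

// The Model is a Transformer: transform() appends the output column.
val indexed = indexerModel.transform(df)
indexed.show()

// Parameter pattern: every stage can describe its own typed parameters.
println(indexer.explainParams())
```

With the default ordering, `StringIndexer` assigns the lowest index to the most frequent label, so `a` receives index 0 here.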

## Capabilities

### Classification Algorithms

Supervised learning algorithms for predicting categorical labels including logistic regression, decision trees, random forests, gradient-boosted trees, and neural networks.

```scala { .api }
// Core classification interface
abstract class Classifier[
  FeaturesType,
  Learner <: Classifier[FeaturesType, Learner, M],
  M <: ClassificationModel[FeaturesType, M]
] extends Predictor[FeaturesType, Learner, M]

// Main algorithms
class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]
class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel]
class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
class GBTClassifier extends ProbabilisticClassifier[Vector, GBTClassifier, GBTClassificationModel]
```

[Classification Algorithms](./classification.md)

### Regression Algorithms

Supervised learning algorithms for predicting continuous values including linear regression, decision trees, random forests, and generalized linear models.

```scala { .api }
// Core regression interface
abstract class Regressor[
  FeaturesType,
  Learner <: Regressor[FeaturesType, Learner, M],
  M <: RegressionModel[FeaturesType, M]
] extends Predictor[FeaturesType, Learner, M]

// Main algorithms
class LinearRegression extends Regressor[Vector, LinearRegression, LinearRegressionModel]
class DecisionTreeRegressor extends Regressor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]
class RandomForestRegressor extends Regressor[Vector, RandomForestRegressor, RandomForestRegressionModel]
class GBTRegressor extends Regressor[Vector, GBTRegressor, GBTRegressionModel]
```
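As a rough usage sketch, a `LinearRegression` can be fitted on a handful of in-memory rows. The data below is invented so that the label equals `2 * x + 1`; the session settings and variable names are assumptions of this example.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LinearRegressionSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy training set following label = 2 * x + 1 (illustrative values).
val training = Seq(
  (1.0, Vectors.dense(0.0)),
  (3.0, Vectors.dense(1.0)),
  (5.0, Vectors.dense(2.0)),
  (7.0, Vectors.dense(3.0))
).toDF("label", "features")

// Default parameters: no regularization, columns "label" and "features".
val linReg = new LinearRegression()
val linRegModel = linReg.fit(training)

println(s"coefficients=${linRegModel.coefficients} intercept=${linRegModel.intercept}")

// transform() adds a "prediction" column next to the inputs.
linRegModel.transform(training).show()
```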

[Regression Algorithms](./regression.md)

### Clustering Algorithms

Unsupervised learning algorithms for discovering patterns and groupings in data including K-means, Gaussian mixture models, and topic modeling.

```scala { .api }
// Main clustering algorithms
class KMeans extends Estimator[KMeansModel] with KMeansParams
class GaussianMixture extends Estimator[GaussianMixtureModel] with GaussianMixtureParams
class BisectingKMeans extends Estimator[BisectingKMeansModel] with BisectingKMeansParams
class LDA extends Estimator[LDAModel] with LDAParams
```
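A small K-means run might look like the following sketch. The six points are made up to form two well-separated groups, and the seed and session configuration are arbitrary choices for the example.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KMeansSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two obvious groups, near (0, 0) and near (9, 9); values are illustrative.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1), Vectors.dense(9.2, 9.0)
).map(Tuple1.apply).toDF("features")

// KMeans is an Estimator; fit() returns a KMeansModel.
val kmeans = new KMeans().setK(2).setSeed(1L)
val kmeansModel = kmeans.fit(points)

// The model assigns each row a cluster id in the "prediction" column.
val assigned = kmeansModel.transform(points)
assigned.show()
kmeansModel.clusterCenters.foreach(println)
```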

[Clustering Algorithms](./clustering.md)

### Feature Processing

Comprehensive feature extraction, transformation, selection, and engineering capabilities including text processing, scaling, dimensionality reduction, and categorical encoding.

```scala { .api }
// Feature transformation base classes
abstract class Transformer extends PipelineStage
abstract class Estimator[M <: Model[M]] extends PipelineStage

// Key feature transformers
class VectorAssembler extends Transformer
class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams
class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerParams
class OneHotEncoder extends Estimator[OneHotEncoderModel] with OneHotEncoderParams
class PCA extends Estimator[PCAModel] with PCAParams
```
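The difference between a plain transformer and an estimator-backed one shows up when the two are combined. In this sketch `VectorAssembler` is applied directly, while `StandardScaler` has to be fitted first; the column names `x`, `y`, and `scaled` and the data are hypothetical.

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FeatureProcessingSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two numeric columns on very different scales (illustrative values).
val raw = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("x", "y")

// VectorAssembler is a pure Transformer: no fitting step is needed.
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
val assembled = assembler.transform(raw)

// StandardScaler is an Estimator: fit() computes per-feature mean and std.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)
  .setWithStd(true)
val scalerModel = scaler.fit(assembled)

// The fitted model rescales each feature to zero mean and unit std.
val scaled = scalerModel.transform(assembled)
scaled.select("features", "scaled").show(false)
```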

[Feature Processing](./feature-processing.md)

### Model Evaluation

Tools for assessing model performance including classification metrics, regression metrics, clustering evaluation, and cross-validation.

```scala { .api }
// Evaluation base class
abstract class Evaluator extends Params {
  def evaluate(dataset: Dataset[_]): Double
  def isLargerBetter: Boolean
}

// Main evaluators
class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol
class MulticlassClassificationEvaluator extends Evaluator with HasLabelCol with HasPredictionCol
class RegressionEvaluator extends Evaluator with HasLabelCol with HasPredictionCol
class ClusteringEvaluator extends Evaluator with HasFeaturesCol with HasPredictionCol
```
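Evaluators only need a DataFrame with the expected columns, so they can be tried on hand-written scores. The sketch below assumes the default column names (`label`, `rawPrediction`, `prediction`) and uses invented values; `rawPrediction` is given as a plain double score, which `BinaryClassificationEvaluator` accepts alongside vector columns.

```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EvaluatorSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// (true label, score for class 1, hard prediction); values are illustrative.
val scored = Seq(
  (0.0, 0.1, 0.0),
  (0.0, 0.4, 0.0),
  (1.0, 0.35, 0.0),
  (1.0, 0.8, 1.0)
).toDF("label", "rawPrediction", "prediction")

// Area under the ROC curve, computed from the score column.
val auc = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .evaluate(scored)

// Accuracy, computed from the hard predictions.
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(scored)

println(s"AUC=$auc accuracy=$accuracy")
```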

[Model Evaluation](./evaluation.md)

### Pipeline and Hyperparameter Tuning

Framework for building complex ML workflows and automated hyperparameter optimization including pipelines, cross-validation, and parameter grids.

```scala { .api }
// Pipeline framework
class Pipeline extends Estimator[PipelineModel]
class PipelineModel extends Model[PipelineModel] with MLWritable

// Hyperparameter tuning
class CrossValidator extends Estimator[CrossValidatorModel]
class TrainValidationSplit extends Estimator[TrainValidationSplitModel]
class ParamGridBuilder
```
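A hedged sketch of grid search with cross-validation follows. The synthetic data, the two candidate `regParam` values, the fold count, and the seed are arbitrary choices for illustration rather than recommended settings.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CrossValidatorSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Synthetic binary data: label is 1.0 when the single feature is >= 20.
val data = (0 until 40).map { i =>
  (if (i >= 20) 1.0 else 0.0, Vectors.dense(i.toDouble))
}.toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(20)

// One ParamMap per candidate value of regParam.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setSeed(42L)

// fit() trains every candidate on every fold, then refits the best one.
val cvModel = cv.fit(data)
println(cvModel.avgMetrics.mkString(", "))
println(cvModel.bestModel.explainParams())
```

`avgMetrics` holds one averaged score per entry of the grid, in the same order as `grid`.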

[Pipeline and Tuning](./pipeline.md)

### Linear Algebra

High-performance vector and matrix operations with support for dense and sparse representations, distributed matrices, and common linear algebra operations.

```scala { .api }
// Core linear algebra types
sealed trait Vector extends Serializable
class DenseVector(val values: Array[Double]) extends Vector
class SparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vector

sealed trait Matrix extends Serializable
class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix
class SparseMatrix(val numRows: Int, val numCols: Int, val colPtrs: Array[Int], val rowIndices: Array[Int], val values: Array[Double]) extends Matrix

// Factory objects
object Vectors
object Matrices
```
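The local vector and matrix types can be used without a Spark session. The following sketch builds the same vector in dense and sparse form and multiplies it by a small matrix; the values are illustrative.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// The same 3-element vector in dense and sparse form.
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both representations hold identical values.
val distance = Vectors.sqdist(dense, sparse)
val dotProduct = dense.dot(sparse)

// Matrices.dense takes values in column-major order:
// this is the 2 x 3 matrix [[1, 3, 5], [2, 4, 6]].
val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// Matrix-vector product; the result is a DenseVector of length 2.
val product = m.multiply(dense)

println(s"sqdist=$distance dot=$dotProduct product=$product")
```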

[Linear Algebra](./linear-algebra.md)

### Collaborative Filtering

Matrix factorization-based recommendation systems using Alternating Least Squares for building recommendation engines from user-item interaction data.

```scala { .api }
class ALS extends Estimator[ALSModel] with ALSParams
class ALSModel extends Model[ALSModel] with ALSModelParams {
  def recommendForAllUsers(numItems: Int): DataFrame
  def recommendForAllItems(numUsers: Int): DataFrame
  def userFactors: DataFrame
  def itemFactors: DataFrame
}
```
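As an illustration, ALS can be trained on a tiny explicit-rating table. The column names `userId`, `itemId`, and `rating`, the ratings themselves, and the hyperparameters are all assumptions chosen to keep the example small.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ALSSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// (user, item, rating) triples; values are illustrative.
val ratings = Seq(
  (0, 0, 4.0f), (0, 1, 2.0f),
  (1, 0, 5.0f), (1, 2, 3.0f),
  (2, 1, 1.0f), (2, 2, 4.0f)
).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(2)
  .setMaxIter(5)
  .setRegParam(0.1)
  .setSeed(7L)
  .setColdStartStrategy("drop") // drop rows whose prediction would be NaN

val alsModel = als.fit(ratings)

// Top-2 items per user: one row per user with an array of recommendations.
val topItems = alsModel.recommendForAllUsers(2)
topItems.show(false)
```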

[Collaborative Filtering](./recommendation.md)

### Frequent Pattern Mining

Data mining algorithms for discovering frequent patterns, itemsets, and sequential patterns including FP-Growth for market basket analysis and PrefixSpan for sequential pattern discovery.

```scala { .api }
// FP-Growth for frequent itemset mining
class FPGrowth extends Estimator[FPGrowthModel] with FPGrowthParams
class FPGrowthModel extends Model[FPGrowthModel] with FPGrowthParams {
  def transform(dataset: Dataset[_]): DataFrame
  def freqItemsets: DataFrame
  def associationRules: DataFrame
}

// PrefixSpan for sequential pattern mining
// (not an Estimator: it has no fit() and produces no model)
final class PrefixSpan extends Params {
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}
```
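A market-basket run with FP-Growth could be sketched as follows. The four baskets and the support and confidence thresholds are made up for the example.

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FPGrowthSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Each row is one transaction; items within a row must be unique.
val baskets = Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("bread", "butter"),
  Array("milk")
).toDF("items")

val fpGrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)    // itemset must appear in at least half the baskets
  .setMinConfidence(0.6)

val fpModel = fpGrowth.fit(baskets)

// Itemsets that meet the support threshold, with their frequencies.
val itemsets = fpModel.freqItemsets
itemsets.show(false)

// Rules derived from those itemsets that meet the confidence threshold.
fpModel.associationRules.show(false)

// transform() suggests consequents for each basket based on the rules.
fpModel.transform(baskets).show(false)
```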

[Frequent Pattern Mining](./frequent-pattern-mining.md)

### Statistical Functions

Statistical analysis and hypothesis testing including correlation analysis, summary statistics, Chi-square tests, ANOVA, and kernel density estimation.

```scala { .api }
// DataFrame-based statistics live in org.apache.spark.ml.stat
object Correlation {
  def corr(dataset: Dataset[_], column: String, method: String): DataFrame
  def corr(dataset: Dataset[_], column: String): DataFrame
}

object ChiSquareTest {
  def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame
}

object Summarizer {
  def metrics(metrics: String*): SummaryBuilder
  def mean(col: Column): Column
  def variance(col: Column): Column
}
```
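The sketch below computes a Pearson correlation matrix and a column-wise mean over a vector column using `Correlation` and `Summarizer` from `org.apache.spark.ml.stat`. The data is invented so that the second feature is exactly twice the first.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.ml.stat.{Correlation, Summarizer}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("StatisticsSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Second feature is exactly twice the first (illustrative values).
val observations = Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.0),
  Vectors.dense(3.0, 6.0)
).map(Tuple1.apply).toDF("features")

// corr() returns a one-row DataFrame holding the correlation matrix.
val Row(pearson: Matrix) = Correlation.corr(observations, "features").head()
println(pearson)

// Summarizer functions are aggregate Column expressions over a vector column.
val featureMean = observations
  .select(Summarizer.mean($"features").as("mean"))
  .head()
  .getAs[Vector]("mean")
println(featureMean)
```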

[Statistical Functions](./statistics.md)

## Core Types

```scala { .api }
// Pipeline stage base types
abstract class PipelineStage extends Params with Logging {
  def transformSchema(schema: StructType): StructType
}

abstract class Estimator[M <: Model[M]] extends PipelineStage {
  def fit(dataset: Dataset[_]): M
  def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]
}

abstract class Transformer extends PipelineStage {
  def transform(dataset: Dataset[_]): DataFrame
}

abstract class Model[M <: Model[M]] extends Transformer {
  var parent: Estimator[M]
  def hasParent: Boolean
}

// Parameter system
trait Params {
  def copy(extra: ParamMap): Params
  def extractParamMap(): ParamMap
  def explainParams(): String
}

// Built with ParamMap(pairs: ParamPair[_]*) or ParamMap.empty
final class ParamMap extends Serializable
class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
case class ParamPair[T](param: Param[T], value: T)

// Common parameter traits
trait HasInputCol extends Params
trait HasOutputCol extends Params
trait HasLabelCol extends Params
trait HasFeaturesCol extends Params
trait HasPredictionCol extends Params
```
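To show how the parameter types fit together, this sketch overrides parameters through a `ParamMap` instead of setters. It uses `LogisticRegression` only as a convenient `Params` implementation; the chosen values are arbitrary, and no Spark session is needed.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Setters store values on the instance itself.
val baseLr = new LogisticRegression().setMaxIter(10)

// Param's `->` builds a ParamPair; ParamMap collects the pairs.
val overrides = ParamMap(baseLr.maxIter -> 30, baseLr.regParam -> 0.1)

// copy() returns a new instance with the extra values applied;
// the original estimator is left unchanged.
val tunedLr = baseLr.copy(overrides)

println(s"base maxIter=${baseLr.getMaxIter}, tuned maxIter=${tunedLr.getMaxIter}")

// extractParamMap() merges defaults with explicitly set values.
println(tunedLr.extractParamMap())
```

The same `ParamMap` values are what `ParamGridBuilder` produces, which is why a grid can be passed straight to `CrossValidator`.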
