or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

classification.mdclustering.mdevaluation.mdfeature-processing.mdfrequent-pattern-mining.mdindex.mdlinear-algebra.mdpipeline.mdrecommendation.mdregression.mdstatistics.md

index.mddocs/

0

# Apache Spark MLlib

1

2

Apache Spark MLlib is a comprehensive machine learning library designed for large-scale data processing and analysis. It provides a unified API for implementing machine learning algorithms including classification, regression, clustering, and collaborative filtering for recommendation systems. MLlib supports end-to-end machine learning workflows through its Pipeline API and is built on Spark's distributed computing framework for scalable production deployments.

3

4

## Package Information

5

6

- **Package Name**: spark-mllib_2.13

7

- **Package Type**: maven

8

- **Language**: Scala

9

- **Installation**: Add to your `build.sbt`: `libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.5.6"`

10

- **Maven**: `<dependency><groupId>org.apache.spark</groupId><artifactId>spark-mllib_2.13</artifactId><version>3.5.6</version></dependency>`

11

12

## Core Imports

13

14

```scala

15

import org.apache.spark.ml._

16

import org.apache.spark.ml.classification._

17

import org.apache.spark.ml.regression._

18

import org.apache.spark.ml.clustering._

19

import org.apache.spark.ml.feature._

20

import org.apache.spark.ml.evaluation._

21

import org.apache.spark.ml.tuning._

22

import org.apache.spark.ml.linalg._

23

```

24

25

## Basic Usage

26

27

```scala

28

import org.apache.spark.sql.SparkSession

29

import org.apache.spark.ml.classification.LogisticRegression

30

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

31

import org.apache.spark.ml.Pipeline

32

33

// Initialize Spark session

34

val spark = SparkSession.builder()

35

.appName("MLlib Example")

36

.master("local[*]")

37

.getOrCreate()

38

39

import spark.implicits._

40

41

// Load data

42

val data = spark.read

43

.option("header", "true")

44

.option("inferSchema", "true")

45

.csv("path/to/data.csv")

46

47

// Feature preparation

48

val assembler = new VectorAssembler()

49

.setInputCols(Array("feature1", "feature2", "feature3"))

50

.setOutputCol("features")

51

52

val indexer = new StringIndexer()

53

.setInputCol("label")

54

.setOutputCol("labelIndex")

55

56

// Create classifier

57

val lr = new LogisticRegression()

58

.setLabelCol("labelIndex")

59

.setFeaturesCol("features")

60

61

// Create pipeline

62

val pipeline = new Pipeline()

63

.setStages(Array(indexer, assembler, lr))

64

65

// Fit model

66

val model = pipeline.fit(data)

67

68

// Make predictions

69

val predictions = model.transform(data)

70

predictions.select("prediction", "labelIndex", "features").show()

71

```

72

73

## Architecture

74

75

Apache Spark MLlib is built around several key abstractions:

76

77

- **DataFrame-based API (org.apache.spark.ml)**: Modern, unified API using DataFrames with strongly-typed interfaces

78

- **Pipeline Framework**: Composable machine learning workflows using `Estimator`, `Transformer`, and `Model` abstractions

79

- **Parameter System**: Type-safe parameter handling with validation and grid search support

80

- **Persistence**: Built-in model saving and loading capabilities

81

- **Distributed Computing**: Leverages Spark's distributed processing for scalability across clusters

82

- **Linear Algebra**: Optimized vector and matrix operations with sparse/dense representations

83

84

The design follows these key patterns:

85

86

1. **Estimator Pattern**: Algorithms implement `Estimator[M]` with a `fit()` method that returns a `Model[M]`

87

2. **Transformer Pattern**: All models and feature transformers implement `Transformer` with a `transform()` method

88

3. **Pipeline Pattern**: Multiple stages can be chained together in a `Pipeline` for complex workflows

89

4. **Parameter Pattern**: All components use typed parameters with automatic validation

90

91

## Capabilities

92

93

### Classification Algorithms

94

95

Supervised learning algorithms for predicting categorical labels including logistic regression, decision trees, random forests, gradient-boosted trees, and neural networks.

96

97

```scala { .api }

98

// Core classification interface

99

abstract class Classifier[

100

FeaturesType,

101

Learner <: Classifier[FeaturesType, Learner, M],

102

M <: ClassificationModel[FeaturesType, M]

103

] extends Predictor[FeaturesType, Learner, M]

104

105

// Main algorithms

106

class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]

107

class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel]

108

class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]

109

class GBTClassifier extends Classifier[Vector, GBTClassifier, GBTClassificationModel]

110

```

111

112

[Classification Algorithms](./classification.md)

113

114

### Regression Algorithms

115

116

Supervised learning algorithms for predicting continuous values including linear regression, decision trees, random forests, and generalized linear models.

117

118

```scala { .api }

119

// Core regression interface

120

abstract class Regressor[

121

FeaturesType,

122

Learner <: Regressor[FeaturesType, Learner, M],

123

M <: RegressionModel[FeaturesType, M]

124

] extends Predictor[FeaturesType, Learner, M]

125

126

// Main algorithms

127

class LinearRegression extends Regressor[Vector, LinearRegression, LinearRegressionModel]

128

class DecisionTreeRegressor extends Regressor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]

129

class RandomForestRegressor extends Regressor[Vector, RandomForestRegressor, RandomForestRegressionModel]

130

class GBTRegressor extends Regressor[Vector, GBTRegressor, GBTRegressionModel]

131

```

132

133

[Regression Algorithms](./regression.md)

134

135

### Clustering Algorithms

136

137

Unsupervised learning algorithms for discovering patterns and groupings in data including K-means, Gaussian mixture models, and topic modeling.

138

139

```scala { .api }

140

// Main clustering algorithms

141

class KMeans extends Estimator[KMeansModel] with KMeansParams

142

class GaussianMixture extends Estimator[GaussianMixtureModel] with GaussianMixtureParams

143

class BisectingKMeans extends Estimator[BisectingKMeansModel] with BisectingKMeansParams

144

class LDA extends Estimator[LDAModel] with LDAParams

145

```

146

147

[Clustering Algorithms](./clustering.md)

148

149

### Feature Processing

150

151

Comprehensive feature extraction, transformation, selection, and engineering capabilities including text processing, scaling, dimensionality reduction, and categorical encoding.

152

153

```scala { .api }

154

// Feature transformation base classes

155

abstract class Transformer extends PipelineStage

156

abstract class Estimator[M <: Model[M]] extends PipelineStage

157

158

// Key feature transformers

159

class VectorAssembler extends Transformer

160

class StandardScaler extends Estimator[StandardScalerModel] with StandardScalerParams

161

class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerParams

162

class OneHotEncoder extends Estimator[OneHotEncoderModel] with OneHotEncoderParams

163

class PCA extends Estimator[PCAModel] with PCAParams

164

```

165

166

[Feature Processing](./feature-processing.md)

167

168

### Model Evaluation

169

170

Tools for assessing model performance including classification metrics, regression metrics, clustering evaluation, and cross-validation.

171

172

```scala { .api }

173

// Evaluation base class

174

abstract class Evaluator extends Params

175

176

// Main evaluators

177

class BinaryClassificationEvaluator extends Evaluator with HasLabelCol with HasFeaturesCol

178

class MulticlassClassificationEvaluator extends Evaluator with HasLabelCol with HasPredictionCol

179

class RegressionEvaluator extends Evaluator with HasLabelCol with HasPredictionCol

180

class ClusteringEvaluator extends Evaluator with HasFeaturesCol with HasPredictionCol

181

```

182

183

[Model Evaluation](./evaluation.md)

184

185

### Pipeline and Hyperparameter Tuning

186

187

Framework for building complex ML workflows and automated hyperparameter optimization including pipelines, cross-validation, and parameter grids.

188

189

```scala { .api }

190

// Pipeline framework

191

class Pipeline extends Estimator[PipelineModel]

192

class PipelineModel extends Model[PipelineModel] with Transformer

193

194

// Hyperparameter tuning

195

class CrossValidator extends Estimator[CrossValidatorModel]

196

class TrainValidationSplit extends Estimator[TrainValidationSplitModel]

197

class ParamGridBuilder

198

```

199

200

[Pipeline and Tuning](./pipeline.md)

201

202

### Linear Algebra

203

204

High-performance vector and matrix operations with support for dense and sparse representations, distributed matrices, and common linear algebra operations.

205

206

```scala { .api }

207

// Core linear algebra types

208

sealed trait Vector extends Serializable

209

class DenseVector(val values: Array[Double]) extends Vector

210

class SparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vector

211

212

sealed trait Matrix extends Serializable

213

class DenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Matrix

214

class SparseMatrix(val numRows: Int, val numCols: Int, val colPtrs: Array[Int], val rowIndices: Array[Int], val values: Array[Double]) extends Matrix

215

216

// Factory objects

217

object Vectors

218

object Matrices

219

```

220

221

[Linear Algebra](./linear-algebra.md)

222

223

### Collaborative Filtering

224

225

Matrix factorization-based recommendation systems using Alternating Least Squares for building recommendation engines from user-item interaction data.

226

227

```scala { .api }

228

class ALS extends Estimator[ALSModel] with ALSParams

229

class ALSModel extends Model[ALSModel] with ALSModelParams {

230

def recommendForAllUsers(numItems: Int): DataFrame

231

def recommendForAllItems(numUsers: Int): DataFrame

232

def userFactors: DataFrame

233

def itemFactors: DataFrame

234

}

235

```

236

237

[Collaborative Filtering](./recommendation.md)

238

239

### Frequent Pattern Mining

240

241

Data mining algorithms for discovering frequent patterns, itemsets, and sequential patterns including FP-Growth for market basket analysis and PrefixSpan for sequential pattern discovery.

242

243

```scala { .api }

244

// FP-Growth for frequent itemset mining

245

class FPGrowth extends Estimator[FPGrowthModel] with FPGrowthParams

246

class FPGrowthModel extends Model[FPGrowthModel] with FPGrowthParams {

247

def transform(dataset: Dataset[_]): DataFrame

248

def freqItemsets: DataFrame

249

def associationRules: DataFrame

250

}

251

252

// PrefixSpan for sequential pattern mining

253

class PrefixSpan extends Estimator[PrefixSpanModel] with PrefixSpanParams

254

class PrefixSpanModel extends Model[PrefixSpanModel] with PrefixSpanParams {

255

def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame

256

}

257

```

258

259

[Frequent Pattern Mining](./frequent-pattern-mining.md)

260

261

### Statistical Functions

262

263

Statistical analysis and hypothesis testing including correlation analysis, summary statistics, Chi-square tests, ANOVA, and kernel density estimation.

264

265

```scala { .api }

266

object Statistics {

267

def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix

268

def chiSquareTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

269

def anovaTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

270

}

271

272

object Summarizer {

273

def metrics(metrics: String*): SummaryBuilder

274

def mean(col: Column): Column

275

def variance(col: Column): Column

276

}

277

```

278

279

[Statistical Functions](./statistics.md)

280

281

## Core Types

282

283

```scala { .api }

284

// Pipeline stage base types

285

trait PipelineStage extends Params with Logging

286

287

trait Estimator[M <: Model[M]] extends PipelineStage {

288

def fit(dataset: Dataset[_]): M

289

def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]

290

}

291

292

trait Transformer extends PipelineStage {

293

def transform(dataset: Dataset[_]): DataFrame

294

def transformSchema(schema: StructType): StructType

295

}

296

297

trait Model[M <: Model[M]] extends Transformer {

298

def parent: Estimator[M]

299

def hasParent: Boolean

300

}

301

302

// Parameter system

303

trait Params {

304

def copy(extra: ParamMap): this.type

305

def extractParamMap(): ParamMap

306

def explainParams(): String

307

}

308

309

case class ParamMap(map: mutable.Map[Param[Any], Any])

310

class Param[T](val parent: String, val name: String)

311

case class ParamPair[T](param: Param[T], value: T)

312

313

// Common parameter traits

314

trait HasInputCol extends Params

315

trait HasOutputCol extends Params

316

trait HasLabelCol extends Params

317

trait HasFeaturesCol extends Params

318

trait HasPredictionCol extends Params

319

```