tessl/maven-org-apache-spark--spark-mllib_2-13

Apache Spark's scalable machine learning library providing comprehensive ML algorithms and utilities for large-scale data processing

Workspace: tessl
Visibility: Public
Describes: mavenpkg:maven/org.apache.spark/spark-mllib_2.13@4.0.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-mllib_2-13@4.0.0

# Apache Spark MLlib

MLlib is Apache Spark's scalable machine learning library. It provides ML algorithms and utilities for large-scale data processing, and the same code runs on a single machine or across a large cluster.

## Package Information

- **Package Name**: spark-mllib_2.13
- **Package Type**: Maven
- **Language**: Scala (compatible with Java)
- **Installation**:
  - **Maven**:
    ```xml
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.13</artifactId>
      <version>4.0.0</version>
    </dependency>
    ```
  - **SBT**: `libraryDependencies += "org.apache.spark" %% "spark-mllib" % "4.0.0"`
  - **Gradle**: `implementation 'org.apache.spark:spark-mllib_2.13:4.0.0'`

## Core Imports

```scala
// Modern DataFrame-based API (recommended)
import org.apache.spark.ml._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.fpm._

// Legacy RDD-based API (maintained for compatibility)
import org.apache.spark.mllib.classification._
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.clustering._
```

## Basic Usage

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.Pipeline

// Initialize Spark session
val spark = SparkSession.builder()
  .appName("MLlib Example")
  .master("local[*]")
  .getOrCreate()

// Load data. The libsvm source already yields a numeric "label" column and
// a vector "features" column, so no VectorAssembler is needed here.
// The path is relative to the root of a Spark distribution.
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

// Prepare features: scale each feature to unit standard deviation
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Create classifier that reads the scaled features
val lr = new LogisticRegression()
  .setFeaturesCol("scaledFeatures")
  .setMaxIter(20)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(scaler, lr))

// Train model
val model = pipeline.fit(data)

// Make predictions
val predictions = model.transform(data)
predictions.show()
```

## Architecture

MLlib provides two APIs built on Spark's distributed computing engine:

- **DataFrame-based API (org.apache.spark.ml)**: The recommended API, built around Spark DataFrames, with composable ML pipelines, a uniform parameter system, and column metadata support
- **RDD-based API (org.apache.spark.mllib)**: Legacy API in maintenance mode, kept for backward compatibility and operating directly on RDDs

Key building blocks:

- **Pipeline Architecture**: Estimator/Transformer pattern. An Estimator's `fit()` produces a Model, which is itself a Transformer, so stages compose into ML workflows with shared parameter management and model persistence
- **Linear Algebra**: Local dense and sparse vectors and matrices (org.apache.spark.ml.linalg); distributed matrices are available in the RDD-based API (org.apache.spark.mllib.linalg.distributed)
- **Scalable Algorithms**: Algorithms are designed to scale horizontally across cluster nodes
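
The Estimator/Transformer split described above can be sketched with a single feature scaler. This is a minimal illustration rather than a canonical recipe: the dataset values and the column names `features` and `scaled` are made up for the example.

```scala
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EstimatorVsTransformer")
  .master("local[*]")
  .getOrCreate()

// Tiny in-memory dataset (hypothetical values) with one vector column
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, 10.0)),
  Tuple1(Vectors.dense(2.0, 20.0)),
  Tuple1(Vectors.dense(3.0, 30.0))
)).toDF("features")

// An Estimator only holds configuration; fit() learns state from the data
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)

// The fitted Model is a Transformer: transform() appends the output column
val scalerModel: StandardScalerModel = scaler.fit(df)
val scaled = scalerModel.transform(df)
scaled.show(false)
```

The same `fit` then `transform` sequence is what a `Pipeline` runs for each of its stages in order.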

## Capabilities

### Classification Algorithms

Supervised learning algorithms for predicting categorical outcomes, covering binary and multiclass classification with probabilistic predictions and model evaluation.

```scala { .api }
// Logistic Regression
class LogisticRegression extends ProbabilisticClassifier[Vector, LogisticRegression, LogisticRegressionModel]
class LogisticRegressionModel extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]

// Decision Trees
class DecisionTreeClassifier extends ProbabilisticClassifier[Vector, DecisionTreeClassifier, DecisionTreeClassificationModel]
class DecisionTreeClassificationModel extends ProbabilisticClassificationModel[Vector, DecisionTreeClassificationModel]

// Random Forest
class RandomForestClassifier extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
class RandomForestClassificationModel extends ProbabilisticClassificationModel[Vector, RandomForestClassificationModel]
```

[Classification](./classification.md)
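
As a rough sketch of how these classifiers are used, the snippet below fits a `LogisticRegression` on a four-row in-memory dataset and prints the class probabilities. The data values are illustrative only, and the column names rely on the `label` and `features` defaults.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClassificationSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical training rows: (label, features)
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// fit() returns a LogisticRegressionModel holding the learned coefficients
val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients}, intercept: ${lrModel.intercept}")

// transform() adds rawPrediction, probability, and prediction columns
lrModel.transform(training)
  .select("features", "label", "probability", "prediction")
  .show(false)
```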

### Regression Algorithms

Supervised learning algorithms for predicting continuous numerical values, including linear models, tree-based methods, and survival analysis, with residual analysis for linear models.

```scala { .api }
// Linear Regression
class LinearRegression extends Regressor[Vector, LinearRegression, LinearRegressionModel]
class LinearRegressionModel extends RegressionModel[Vector, LinearRegressionModel]

// Decision Tree Regression
class DecisionTreeRegressor extends Regressor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]
class DecisionTreeRegressionModel extends RegressionModel[Vector, DecisionTreeRegressionModel]

// Random Forest Regression
class RandomForestRegressor extends Regressor[Vector, RandomForestRegressor, RandomForestRegressionModel]
class RandomForestRegressionModel extends RegressionModel[Vector, RandomForestRegressionModel]
```

[Regression](./regression.md)
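
A small sketch of the regression workflow, assuming a made-up dataset that roughly follows y = 2x + 1. It fits a `LinearRegression` and reads fit statistics from the training summary.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RegressionSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical noisy samples of y = 2x + 1: (label, features)
val training = spark.createDataFrame(Seq(
  (1.1, Vectors.dense(0.0)),
  (2.9, Vectors.dense(1.0)),
  (5.2, Vectors.dense(2.0)),
  (6.8, Vectors.dense(3.0)),
  (9.1, Vectors.dense(4.0))
)).toDF("label", "features")

val lr = new LinearRegression().setMaxIter(50)
val model = lr.fit(training)

// The training summary exposes residuals and goodness-of-fit metrics
val summary = model.summary
println(s"Slope: ${model.coefficients(0)}, intercept: ${model.intercept}")
println(s"RMSE: ${summary.rootMeanSquaredError}, r2: ${summary.r2}")
summary.residuals.show()
```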

### Clustering Algorithms

Unsupervised learning algorithms for discovering hidden patterns and grouping similar data points, including partitioning, hierarchical, and probabilistic clustering methods.

```scala { .api }
// K-Means Clustering
class KMeans extends Estimator[KMeansModel] with KMeansParams
class KMeansModel extends Model[KMeansModel] with KMeansParams

// Gaussian Mixture Model
class GaussianMixture extends Estimator[GaussianMixtureModel] with GaussianMixtureParams
class GaussianMixtureModel extends Model[GaussianMixtureModel] with GaussianMixtureParams

// Latent Dirichlet Allocation
class LDA extends Estimator[LDAModel] with LDAParams
abstract class LDAModel extends Model[LDAModel] with LDAParams
```

[Clustering](./clustering.md)
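
The following sketch clusters six illustrative points into two groups with `KMeans`. The points and the seed are arbitrary choices for the example; unlike the supervised estimators, no `label` column is required.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusteringSketch")
  .master("local[*]")
  .getOrCreate()

// Two well-separated groups of hypothetical points
val points = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(0.1, 0.1)),
  Tuple1(Vectors.dense(0.2, 0.2)),
  Tuple1(Vectors.dense(9.0, 9.0)),
  Tuple1(Vectors.dense(9.1, 9.1)),
  Tuple1(Vectors.dense(9.2, 9.2))
)).toDF("features")

// Fix the seed so the initialization is reproducible
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(points)

// Learned centroids, then the cluster index assigned to each row
model.clusterCenters.foreach(println)
val assigned = model.transform(points)
assigned.show(false)
```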

### Feature Engineering

Data preprocessing and feature transformation utilities for preparing raw data for machine learning algorithms, including text processing, categorical encoding, and numerical scaling.

```scala { .api }
// Vector Assembly and Manipulation
class VectorAssembler extends Transformer
class VectorSlicer extends Transformer
class VectorIndexer extends Estimator[VectorIndexerModel]

// Scaling and Normalization
class StandardScaler extends Estimator[StandardScalerModel]
class MinMaxScaler extends Estimator[MinMaxScalerModel]
class Normalizer extends Transformer

// Categorical Features
class StringIndexer extends Estimator[StringIndexerModel]
class OneHotEncoder extends Estimator[OneHotEncoderModel]
class IndexToString extends Transformer
```

[Feature Engineering](./feature-engineering.md)
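
To show how these transformers chain together, here is a sketch that turns a string column and a numeric column into a single feature vector. The `color` and `size` columns and their values are invented for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FeatureSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical raw data: one categorical and one numeric column
val raw = spark.createDataFrame(Seq(
  ("red", 1.0), ("blue", 2.5), ("red", 0.5), ("green", 3.0)
)).toDF("color", "size")

// String category -> numeric index
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")

// Numeric index -> sparse one-hot vector (last category dropped by default)
val encoder = new OneHotEncoder()
  .setInputCols(Array("colorIndex"))
  .setOutputCols(Array("colorVec"))

// Concatenate the encoded category and the numeric column
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "size"))
  .setOutputCol("features")

val prepared = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))
  .fit(raw)
  .transform(raw)
prepared.select("color", "size", "features").show(false)
```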

### Model Evaluation and Selection

Model evaluation metrics and automated hyperparameter tuning for assessing model performance and optimizing ML pipelines.

```scala { .api }
// Evaluators
abstract class Evaluator extends Params
class BinaryClassificationEvaluator extends Evaluator
class MulticlassClassificationEvaluator extends Evaluator
class RegressionEvaluator extends Evaluator

// Model Selection
class CrossValidator extends Estimator[CrossValidatorModel]
class TrainValidationSplit extends Estimator[TrainValidationSplitModel]
class ParamGridBuilder
```

[Evaluation and Tuning](./evaluation-tuning.md)
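
A sketch of grid search with cross-validation, under the assumption of a synthetic, well-separated binary dataset generated in the snippet itself. The grid values and fold count are arbitrary.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TuningSketch")
  .master("local[*]")
  .getOrCreate()

// Synthetic data: class 1 is shifted by +3 on the first feature
val data = spark.createDataFrame((0 until 60).map { i =>
  val label = if (i % 2 == 0) 0.0 else 1.0
  (label, Vectors.dense(label * 3.0 + (i % 5) * 0.1, (i % 7) * 0.1))
}).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10)

// 2 x 2 grid of candidate parameter settings
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

// The evaluator defaults to area under the ROC curve
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setSeed(42L)

val cvModel = cv.fit(data)

// One averaged metric per grid entry, in grid order
cvModel.avgMetrics.zip(grid).foreach { case (metric, params) =>
  println(s"$metric <- $params")
}
```

`cvModel.bestModel` holds the model refit on the full dataset with the best-scoring parameter map.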

### Recommendation Systems

Collaborative filtering algorithms for building recommendation engines, based on matrix factorization and designed for large-scale user-item interaction datasets.

```scala { .api }
// Alternating Least Squares
class ALS extends Estimator[ALSModel] with ALSParams
class ALSModel extends Model[ALSModel] with ALSModelParams
```

[Recommendation](./recommendation.md)
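
The snippet below sketches ALS on a handful of made-up ratings. The column names `userId`, `itemId`, and `rating` are choices for this example and are wired in through the setters; rank, iterations, and seed are arbitrary.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RecommendationSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical explicit ratings: (userId, itemId, rating)
val ratings = spark.createDataFrame(Seq(
  (0, 0, 4.0f), (0, 1, 2.0f),
  (1, 1, 3.0f), (1, 2, 5.0f),
  (2, 0, 1.0f), (2, 2, 4.0f)
)).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(4)
  .setMaxIter(5)
  .setRegParam(0.1)
  .setColdStartStrategy("drop") // drop NaN predictions for unseen ids
  .setSeed(7L)

val model = als.fit(ratings)

// Top-2 item recommendations for every user seen in training
val top2 = model.recommendForAllUsers(2)
top2.show(false)
```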

### Pipeline Components

Core abstractions and utilities for building composable machine learning workflows with automated parameter management, model persistence, and metadata handling.

```scala { .api }
// Core Pipeline Classes
abstract class Estimator[M <: Model[M]] extends PipelineStage
abstract class Transformer extends PipelineStage
abstract class Model[M <: Model[M]] extends Transformer
class Pipeline extends Estimator[PipelineModel]
class PipelineModel extends Model[PipelineModel]

// Parameter System
trait Params
class Param[T]
class ParamMap
```

[Pipeline Components](./pipeline-components.md)
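
The parameter system can be explored without any data. As a small sketch, the snippet below inspects a parameter, builds a `ParamMap` of overrides, and applies it to a copy of the estimator; the specific values are arbitrary.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Setters store values on the stage itself
val lr = new LogisticRegression().setMaxIter(10)

// Each Param carries its own name, documentation, and default
println(lr.explainParam(lr.maxIter))

// A ParamMap groups overrides that can be applied later
val overrides = ParamMap(lr.maxIter -> 30, lr.regParam -> 0.1)

// copy() returns a new estimator with the overrides applied;
// the original instance is left unchanged
val tuned = lr.copy(overrides)
println(s"original maxIter = ${lr.getMaxIter}, tuned maxIter = ${tuned.getMaxIter}")
```

A `ParamMap` can also be passed directly to `fit`, as in `lr.fit(dataset, overrides)`, which is how `CrossValidator` evaluates each grid entry.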

### Linear Algebra

Local vector and matrix types (org.apache.spark.ml.linalg) used to hold features and model parameters in the DataFrame-based API. Distributed matrices for computations across cluster nodes live in the RDD-based API under org.apache.spark.mllib.linalg.distributed.

```scala { .api }
// Vector Types
sealed trait Vector
class DenseVector extends Vector
class SparseVector extends Vector
object Vectors

// Matrix Types
sealed trait Matrix
class DenseMatrix extends Matrix
class SparseMatrix extends Matrix
object Matrices
```

[Linear Algebra](./linear-algebra.md)
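
A brief sketch of the local types, using the `Vectors` and `Matrices` factory objects. The numbers are arbitrary, and no SparkSession is needed because these are plain in-memory structures.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// The same three-element vector in dense and sparse form
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Equality compares contents, not the storage format
println(dense == sparse)

// Matrices are column-major: this is the 2x3 matrix with rows [1 2 3] and [4 5 6]
val m = Matrices.dense(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0))

// Matrix-vector product: [1*1 + 2*0 + 3*3, 4*1 + 5*0 + 6*3]
val product = m.multiply(dense)
println(product)

// Squared Euclidean distance between two vectors
println(Vectors.sqdist(dense, sparse))
```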

### Frequent Pattern Mining

Algorithms for discovering frequent itemsets, association rules, and sequential patterns in large datasets, commonly used for market basket analysis and recommendation systems.

```scala { .api }
// FP-Growth Algorithm
class FPGrowth extends Estimator[FPGrowthModel] with FPGrowthParams
class FPGrowthModel extends Model[FPGrowthModel] with FPGrowthParams

// PrefixSpan Algorithm (not an Estimator: call findFrequentSequentialPatterns directly)
final class PrefixSpan extends Params
```

Note: Frequent Pattern Mining capabilities are included in the core MLlib package.
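
The two algorithms are invoked differently, which the sketch below tries to make visible: `FPGrowth` is fit like any other estimator, while `PrefixSpan` returns its patterns from a single method call. The baskets and sequences are small made-up examples.

```scala
import org.apache.spark.ml.fpm.{FPGrowth, PrefixSpan}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FpmSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical baskets: each row is an array of distinct items
val baskets = Seq("1 2 5", "1 2 3 5", "1 2")
  .map(_.split(" "))
  .toDF("items")

val fpModel = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(baskets)

fpModel.freqItemsets.show()     // itemsets meeting the support threshold
fpModel.associationRules.show() // rules meeting the confidence threshold

// Hypothetical sequences: each row is an ordered list of itemsets
val sequences = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6))
).toDF("sequence")

// PrefixSpan has no fit(): the patterns come back as a DataFrame
val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .findFrequentSequentialPatterns(sequences)
patterns.show(false)
```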

## Types

```scala { .api }
// Core ML Types
import org.apache.spark.ml.linalg.{Vector, DenseVector, SparseVector, Matrix, DenseMatrix, SparseMatrix}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Pipeline Parameter Types
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.{Identifiable, MLWritable, MLReadable}

// Algorithm-Specific Types
import org.apache.spark.ml.classification.{ClassificationModel, Classifier, ProbabilisticClassifier}
import org.apache.spark.ml.regression.{RegressionModel, Regressor}
import org.apache.spark.ml.clustering.{KMeansModel, GaussianMixtureModel, LDAModel}
import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel, PrefixSpan}

// Parameter Traits (internal to Spark, listed for orientation only)
trait PredictorParams extends Params
trait ClassifierParams extends PredictorParams
trait ProbabilisticClassifierParams extends ClassifierParams
trait LogisticRegressionParams extends ProbabilisticClassifierParams
```