Spec Registry

Help your agents use open-source better.

Find usage specs for your project’s dependencies


maven-apache-spark

Description
Lightning-fast unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R
Author
tessl
Last updated

How to use

npx @tessl/cli registry install tessl/maven-apache-spark@1.0.0

docs/mllib.md

# Machine Learning Library (MLlib)

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.

## Core Data Types

### Vector

Mathematical vector for representing features:

```scala { .api }
package org.apache.spark.mllib.linalg

abstract class Vector extends Serializable {
  def size: Int
  def toArray: Array[Double]
  def apply(i: Int): Double
}
```

**Vectors Factory**:
```scala { .api }
object Vectors {
  def dense(values: Array[Double]): Vector
  def dense(firstValue: Double, otherValues: Double*): Vector
  def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector
  def sparse(size: Int, elements: Seq[(Int, Double)]): Vector
  def zeros(size: Int): Vector
}
```

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector
val denseVec = Vectors.dense(1.0, 2.0, 3.0, 0.0, 0.0)

// Sparse vector (size=5, indices=[0,2], values=[1.0,3.0])
val sparseVec = Vectors.sparse(5, Array(0, 2), Array(1.0, 3.0))

// Alternative sparse creation
val sparseVec2 = Vectors.sparse(5, Seq((0, 1.0), (2, 3.0)))

// Vector operations
val size = denseVec.size     // 5
val element = denseVec(2)    // 3.0
val array = denseVec.toArray // Array(1.0, 2.0, 3.0, 0.0, 0.0)
```

### Matrix

Mathematical matrix for linear algebra operations:

```scala { .api }
abstract class Matrix extends Serializable {
  def numRows: Int
  def numCols: Int
  def toArray: Array[Double]
  def apply(i: Int, j: Int): Double
}
```

**Matrices Factory**:
```scala { .api }
object Matrices {
  def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix
  def sparse(numRows: Int, numCols: Int, colPtrs: Array[Int], rowIndices: Array[Int], values: Array[Double]): Matrix
  def eye(n: Int): Matrix
  def zeros(numRows: Int, numCols: Int): Matrix
}
```

```scala
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Dense matrix (2x3, column-major order)
val denseMatrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// Matrix:
// 1.0  3.0  5.0
// 2.0  4.0  6.0

// Identity matrix
val identity = Matrices.eye(3)

// Zero matrix
val zeros = Matrices.zeros(2, 3)
```

### LabeledPoint

Data point with label and features for supervised learning:

```scala { .api }
case class LabeledPoint(label: Double, features: Vector) {
  override def toString: String = s"($label,$features)"
}
```

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Create labeled points
val positive = LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0))
val negative = LabeledPoint(0.0, Vectors.dense(-1.0, -2.0, -3.0))

// For regression
val regressionPoint = LabeledPoint(3.5, Vectors.dense(1.0, 2.0))

// Create training data
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))
))
```
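
Labeled training data is often stored in LIBSVM format. MLlib's `MLUtils` helper (an assumption here, as it is not part of the API excerpts above) can load such a file directly into an `RDD[LabeledPoint]`; a minimal sketch, with the file path as a placeholder:

```scala
import org.apache.spark.mllib.util.MLUtils

// Load an RDD[LabeledPoint] from a LIBSVM-format text file
// (the path below is a placeholder for your own dataset)
val libsvmData = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
```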

### Rating

User-item rating for collaborative filtering:

```scala { .api }
case class Rating(user: Int, product: Int, rating: Double)
```

```scala
import org.apache.spark.mllib.recommendation.Rating

// Create ratings
val ratings = sc.parallelize(Seq(
  Rating(1, 101, 5.0),
  Rating(1, 102, 3.0),
  Rating(2, 101, 4.0),
  Rating(2, 103, 2.0)
))
```

## Classification

### Logistic Regression

**LogisticRegressionWithSGD**: Train logistic regression using stochastic gradient descent
```scala { .api }
object LogisticRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LogisticRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LogisticRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LogisticRegressionModel
}

class LogisticRegressionModel(override val weights: Vector, override val intercept: Double) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
  def clearThreshold(): LogisticRegressionModel
  def setThreshold(threshold: Double): LogisticRegressionModel
}
```

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, LogisticRegressionModel}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Prepare training data
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0)),
  LabeledPoint(1.0, Vectors.dense(1.5, 1.8)),
  LabeledPoint(0.0, Vectors.dense(-1.5, -1.8))
))

// Train model
val model = LogisticRegressionWithSGD.train(trainingData, numIterations = 100)

// Make predictions
val testData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(-1.0, -1.0)
))

val predictions = model.predict(testData)
predictions.collect() // Array(1.0, 0.0)

// Single prediction
val singlePrediction = model.predict(Vectors.dense(0.5, 0.5))

// Set classification threshold
val calibratedModel = model.setThreshold(0.3)
```
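
SGD-based trainers are sensitive to feature scale, so standardizing features usually helps convergence. A sketch using MLlib's `StandardScaler` (an assumption here, since it is not covered by the API excerpts in this document) on the training data above:

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Fit a scaler (zero mean, unit variance) on the feature vectors
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(trainingData.map(_.features))

// Standardize the features, then train as before
val scaledData = trainingData.map { point =>
  LabeledPoint(point.label, scaler.transform(point.features))
}

val scaledModel = LogisticRegressionWithSGD.train(scaledData, numIterations = 100)
```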

### Support Vector Machines

**SVMWithSGD**: Train SVM using stochastic gradient descent
```scala { .api }
object SVMWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): SVMModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): SVMModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double, miniBatchFraction: Double): SVMModel
}

class SVMModel(override val weights: Vector, override val intercept: Double) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.classification.{SVMWithSGD, SVMModel}

// Train SVM model
val svmModel = SVMWithSGD.train(
  input = trainingData,
  numIterations = 100,
  stepSize = 1.0,
  regParam = 0.01
)

// Make predictions
val svmPredictions = svmModel.predict(testData)
```
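
Like `LogisticRegressionModel`, MLlib's `SVMModel` also supports `clearThreshold()` (not listed in the excerpt above), so that `predict` returns raw margins rather than 0/1 labels, which is what ranking-based evaluation expects. A sketch, assuming a held-out `testPoints: RDD[LabeledPoint]`:

```scala
// Switch the model to output raw decision values (margins)
svmModel.clearThreshold()

// Pair raw scores with true labels, e.g. for BinaryClassificationMetrics
// (testPoints is an assumed held-out RDD[LabeledPoint])
val scoreAndLabel = testPoints.map { point =>
  (svmModel.predict(point.features), point.label)
}
```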

### Naive Bayes

**NaiveBayes**: Train multinomial Naive Bayes classifier
```scala { .api }
object NaiveBayes {
  def train(input: RDD[LabeledPoint], lambda: Double = 1.0): NaiveBayesModel
}

class NaiveBayesModel(val labels: Array[Double], val pi: Array[Double], val theta: Array[Array[Double]]) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

// Training data for text classification (bag of words)
val textData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 1.0)), // spam
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)), // ham
  LabeledPoint(0.0, Vectors.dense(1.0, 1.0, 0.0)), // spam
  LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 1.0))  // ham
))

// Train Naive Bayes model
val nbModel = NaiveBayes.train(textData, lambda = 1.0)

// Make predictions (feature vectors must have the same dimension as the training data)
val textPredictions = nbModel.predict(textData.map(_.features))
```
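
The trained model also exposes its parameters directly (`labels`, `pi`, and `theta`, per the API above); in MLlib these are stored as log probabilities. A quick inspection sketch:

```scala
// Class labels paired with their (log) priors
nbModel.labels.zip(nbModel.pi).foreach { case (label, logPrior) =>
  println(s"Class $label: log prior = $logPrior")
}

// theta(i)(j) holds the log conditional probability of feature j given class i
val logConditionals = nbModel.theta
```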

## Regression

### Linear Regression

**LinearRegressionWithSGD**: Train linear regression using SGD
```scala { .api }
object LinearRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LinearRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LinearRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LinearRegressionModel
}

class LinearRegressionModel(override val weights: Vector, override val intercept: Double) extends GeneralizedLinearModel with RegressionModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel, LabeledPoint}

// Regression training data
val regressionData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0)),
  LabeledPoint(3.0, Vectors.dense(3.0)),
  LabeledPoint(4.0, Vectors.dense(4.0))
))

// Train linear regression
val lrModel = LinearRegressionWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01
)

// Make predictions
val regressionTestData = sc.parallelize(Seq(
  Vectors.dense(1.5),
  Vectors.dense(2.5)
))

val regressionPredictions = lrModel.predict(regressionTestData)
regressionPredictions.collect() // Array(~1.5, ~2.5)
```
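
The learned coefficients are exposed on the model (`weights` and `intercept` in the API above), and a quick training-set error check can be computed from the predictions; a short sketch using the data above:

```scala
// Inspect the learned parameters
println(s"Weights: ${lrModel.weights}, intercept: ${lrModel.intercept}")

// Mean squared error on the training data
val trainingMSE = regressionData.map { point =>
  val err = lrModel.predict(point.features) - point.label
  err * err
}.mean()
println(s"Training MSE: $trainingMSE")
```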

### Ridge Regression

**RidgeRegressionWithSGD**: Ridge regression with L2 regularization
```scala { .api }
object RidgeRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): RidgeRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): RidgeRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): RidgeRegressionModel
}
```

### Lasso Regression

**LassoWithSGD**: Lasso regression with L1 regularization
```scala { .api }
object LassoWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LassoModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LassoModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): LassoModel
}
```

```scala
import org.apache.spark.mllib.regression.{RidgeRegressionWithSGD, LassoWithSGD}

// Ridge regression with L2 regularization
val ridgeModel = RidgeRegressionWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01,
  regParam = 0.1
)

// Lasso regression with L1 regularization
val lassoModel = LassoWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01,
  regParam = 0.1
)
```

## Clustering

### K-Means

**KMeans**: K-means clustering algorithm
```scala { .api }
class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double) extends Serializable {
  def setK(k: Int): KMeans
  def setMaxIterations(maxIterations: Int): KMeans
  def setRuns(runs: Int): KMeans
  def setInitializationMode(initializationMode: String): KMeans
  def setInitializationSteps(initializationSteps: Int): KMeans
  def setEpsilon(epsilon: Double): KMeans
  def run(data: RDD[Vector]): KMeansModel
}

object KMeans {
  def train(data: RDD[Vector], k: Int, maxIterations: Int): KMeansModel
  def train(data: RDD[Vector], k: Int, maxIterations: Int, runs: Int): KMeansModel
  def train(data: RDD[Vector], k: Int, maxIterations: Int, runs: Int, initializationMode: String): KMeansModel
}

class KMeansModel(val clusterCenters: Array[Vector]) extends Serializable {
  def predict(point: Vector): Int
  def predict(points: RDD[Vector]): RDD[Int]
  def computeCost(data: RDD[Vector]): Double
}
```

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Prepare clustering data
val clusteringData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 1.0),
  Vectors.dense(9.0, 8.0),
  Vectors.dense(8.0, 9.0),
  Vectors.dense(9.0, 9.0)
))

// Train K-means model
val kmeansModel = KMeans.train(
  data = clusteringData,
  k = 2,              // Number of clusters
  maxIterations = 20
)

// Get cluster centers
val centers = kmeansModel.clusterCenters
centers.foreach(println)

// Make predictions
val clusterPredictions = kmeansModel.predict(clusteringData)
clusterPredictions.collect() // Array(0, 0, 0, 1, 1, 1)

// Compute cost (sum of squared distances to centroids)
val cost = kmeansModel.computeCost(clusteringData)

// Advanced K-means with custom parameters
val advancedKMeans = new KMeans()
  .setK(3)
  .setMaxIterations(50)
  .setRuns(10) // Multiple runs for better results
  .setInitializationMode("k-means||")
  .setEpsilon(1e-4)

val advancedModel = advancedKMeans.run(clusteringData)
```
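
`computeCost` (the within-cluster sum of squared errors) is commonly used to compare different values of `k`; a small sketch over the data above:

```scala
// Compare clustering cost (WSSSE) for several values of k
(2 to 4).foreach { k =>
  val model = KMeans.train(clusteringData, k, 20)
  val wssse = model.computeCost(clusteringData)
  println(s"k = $k, WSSSE = $wssse")
}
```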

## Collaborative Filtering

### Alternating Least Squares (ALS)

**ALS**: Matrix factorization for collaborative filtering
```scala { .api }
object ALS {
  def train(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
  def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double): MatrixFactorizationModel
  def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int): MatrixFactorizationModel
  def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
  def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double): MatrixFactorizationModel
}

class MatrixFactorizationModel(val rank: Int, val userFeatures: RDD[(Int, Array[Double])], val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
  def predict(user: Int, product: Int): Double
  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]
  def recommendProducts(user: Int, num: Int): Array[Rating]
  def recommendUsers(product: Int, num: Int): Array[Rating]
}
```

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Create ratings data
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0),
  Rating(1, 2, 1.0),
  Rating(1, 3, 5.0),
  Rating(2, 1, 1.0),
  Rating(2, 2, 5.0),
  Rating(2, 3, 1.0),
  Rating(3, 1, 5.0),
  Rating(3, 2, 1.0),
  Rating(3, 3, 5.0)
))

// Train collaborative filtering model
val alsModel = ALS.train(
  ratings = ratings,
  rank = 10,       // Number of latent factors
  iterations = 10, // Number of iterations
  lambda = 0.01    // Regularization parameter
)

// Predict rating for user-item pair
val prediction = alsModel.predict(1, 2)

// Predict ratings for multiple user-item pairs
val userProducts = sc.parallelize(Seq((1, 1), (2, 2), (3, 3)))
val predictions = alsModel.predict(userProducts)

// Recommend products for a user
val recommendations = alsModel.recommendProducts(1, 5)
recommendations.foreach { rating =>
  println(s"Product ${rating.product}: ${rating.rating}")
}

// Recommend users for a product
val userRecommendations = alsModel.recommendUsers(1, 3)

// For implicit feedback data
val implicitModel = ALS.trainImplicit(
  ratings = ratings,
  rank = 10,
  iterations = 10,
  lambda = 0.01,
  alpha = 0.1 // Confidence parameter
)
```
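
A common sanity check for an ALS model is to predict the known ratings back and compute the mean squared error; a sketch using the `ratings` RDD above:

```scala
// Predict the training ratings and compare against the originals
val userProductPairs = ratings.map(r => (r.user, r.product))

val predicted = alsModel.predict(userProductPairs)
  .map(r => ((r.user, r.product), r.rating))

val actual = ratings.map(r => ((r.user, r.product), r.rating))

val mse = actual.join(predicted).map { case (_, (observedRating, estimatedRating)) =>
  val err = observedRating - estimatedRating
  err * err
}.mean()

println(s"Mean squared error: $mse")
```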

## Statistics

### Summary Statistics

**Statistics**: Statistical functions for RDDs
```scala { .api }
object Statistics {
  def colStats(rdd: RDD[Vector]): MultivariateStatisticalSummary
  def corr(x: RDD[Double], y: RDD[Double], method: String = "pearson"): Double
  def corr(X: RDD[Vector], method: String = "pearson"): Matrix
  def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult
  def chiSqTest(observed: Matrix): ChiSqTestResult
  def chiSqTest(observed: RDD[LabeledPoint]): Array[ChiSqTestResult]
}

trait MultivariateStatisticalSummary {
  def mean: Vector
  def variance: Vector
  def count: Long
  def numNonzeros: Vector
  def max: Vector
  def min: Vector
}
```

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.{Vector, Vectors, Matrix}

// Sample data for statistics
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// Compute column statistics
val summary = Statistics.colStats(observations)
println(s"Mean: ${summary.mean}")         // [4.0, 5.0, 6.0]
println(s"Variance: ${summary.variance}") // [9.0, 9.0, 9.0]
println(s"Count: ${summary.count}")       // 3
println(s"Max: ${summary.max}")           // [7.0, 8.0, 9.0]
println(s"Min: ${summary.min}")           // [1.0, 2.0, 3.0]

// Correlation between two RDD[Double]
val x = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0))
val correlation = Statistics.corr(x, y, "pearson") // 1.0 (perfect positive correlation)

// Correlation matrix for RDD[Vector]
val correlationMatrix = Statistics.corr(observations, "pearson")

// Chi-squared test
val observed = Vectors.dense(1.0, 2.0, 3.0)
val expected = Vectors.dense(1.5, 1.5, 3.0)
val chiSqResult = Statistics.chiSqTest(observed, expected)

println(s"Chi-squared statistic: ${chiSqResult.statistic}")
println(s"P-value: ${chiSqResult.pValue}")
println(s"Degrees of freedom: ${chiSqResult.degreesOfFreedom}")
```
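
The `chiSqTest` overload that takes an `RDD[LabeledPoint]` (shown in the API above) runs an independence test of every feature against the label, which is useful for feature screening; a sketch reusing `textData` from the Naive Bayes example (features must be categorical or non-negative counts):

```scala
// Pearson independence test of each feature against the label
val featureTests = Statistics.chiSqTest(textData)
featureTests.zipWithIndex.foreach { case (result, i) =>
  println(s"Feature $i: p-value = ${result.pValue}")
}
```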

## Model Evaluation

### Binary Classification Metrics

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Predictions and labels (prediction score, true label)
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.7, 1.0),
  (0.6, 0.0), (0.5, 1.0), (0.4, 0.0),
  (0.3, 0.0), (0.2, 0.0), (0.1, 0.0)
))

val binaryMetrics = new BinaryClassificationMetrics(scoreAndLabels)

// Area under ROC curve
val areaUnderROC = binaryMetrics.areaUnderROC()
println(s"Area under ROC: $areaUnderROC")

// Area under Precision-Recall curve
val areaUnderPR = binaryMetrics.areaUnderPR()
println(s"Area under PR: $areaUnderPR")
```
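
`BinaryClassificationMetrics` also exposes the underlying curves, such as precision and recall at each score threshold and the ROC curve points; a short sketch:

```scala
// Precision and recall at each score threshold
binaryMetrics.precisionByThreshold().collect().foreach { case (t, p) =>
  println(s"Threshold: $t, precision: $p")
}
binaryMetrics.recallByThreshold().collect().foreach { case (t, r) =>
  println(s"Threshold: $t, recall: $r")
}

// (false positive rate, true positive rate) points of the ROC curve
val rocPoints = binaryMetrics.roc().collect()
```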

### Multi-class Classification Metrics

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Predictions and labels (predicted class, true class)
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
  (0.0, 0.0), (1.0, 2.0), (2.0, 1.0)
))

val multiMetrics = new MulticlassMetrics(predictionAndLabels)

// Overall statistics
val accuracy = multiMetrics.accuracy
val weightedPrecision = multiMetrics.weightedPrecision
val weightedRecall = multiMetrics.weightedRecall
val weightedFMeasure = multiMetrics.weightedFMeasure

// Per-class metrics
val labels = multiMetrics.labels
labels.foreach { label =>
  println(s"Class $label precision: ${multiMetrics.precision(label)}")
  println(s"Class $label recall: ${multiMetrics.recall(label)}")
  println(s"Class $label F1-score: ${multiMetrics.fMeasure(label)}")
}

// Confusion matrix
val confusionMatrix = multiMetrics.confusionMatrix
println(s"Confusion matrix:\n$confusionMatrix")
```

## Pipeline Example

Complete machine learning pipeline:

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// 1. Load and prepare data
val rawData = sc.textFile("data.csv")
val parsedData = rawData.map { line =>
  val parts = line.split(',')
  val label = parts(0).toDouble
  val features = Vectors.dense(parts.tail.map(_.toDouble))
  LabeledPoint(label, features)
}

// 2. Split data
val Array(training, test) = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
training.cache()

// 3. Train model
val model = LogisticRegressionWithSGD.train(training, numIterations = 100)

// 4. Make predictions
val predictionAndLabel = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// 5. Evaluate model
val metrics = new BinaryClassificationMetrics(predictionAndLabel)
val auROC = metrics.areaUnderROC()

println(s"Area under ROC: $auROC")

// 6. Save model
model.save(sc, "myModel")

// 7. Load model later
val loadedModel = LogisticRegressionModel.load(sc, "myModel")
```

This guide covers the essential machine learning capabilities in Spark's MLlib for building scalable ML applications.