# Machine Learning Library (MLlib)

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.

## Core Data Types

### Vector

Mathematical vector for representing features:

```scala { .api }
package org.apache.spark.mllib.linalg

abstract class Vector extends Serializable {
  def size: Int
  def toArray: Array[Double]
  def apply(i: Int): Double
}
```

**Vectors Factory**:

```scala { .api }
object Vectors {
  def dense(values: Array[Double]): Vector
  def dense(firstValue: Double, otherValues: Double*): Vector
  def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector
  def sparse(size: Int, elements: Seq[(Int, Double)]): Vector
  def zeros(size: Int): Vector
}
```

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector
val denseVec = Vectors.dense(1.0, 2.0, 3.0, 0.0, 0.0)

// Sparse vector (size=5, indices=[0,2], values=[1.0,3.0])
val sparseVec = Vectors.sparse(5, Array(0, 2), Array(1.0, 3.0))

// Alternative sparse creation
val sparseVec2 = Vectors.sparse(5, Seq((0, 1.0), (2, 3.0)))

// Vector operations
val size = denseVec.size     // 5
val element = denseVec(2)    // 3.0
val array = denseVec.toArray // Array(1.0, 2.0, 3.0, 0.0, 0.0)
```

### Matrix

Mathematical matrix for linear algebra operations:

```scala { .api }
abstract class Matrix extends Serializable {
  def numRows: Int
  def numCols: Int
  def toArray: Array[Double]
  def apply(i: Int, j: Int): Double
}
```

**Matrices Factory**:

```scala { .api }
object Matrices {
  def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix
  def sparse(numRows: Int, numCols: Int, colPtrs: Array[Int], rowIndices: Array[Int], values: Array[Double]): Matrix
  def eye(n: Int): Matrix
  def zeros(numRows: Int, numCols: Int): Matrix
}
```

```scala
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Dense matrix (2x3, column-major order)
val denseMatrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// Matrix:
// 1.0  3.0  5.0
// 2.0  4.0  6.0

// Identity matrix
val identity = Matrices.eye(3)

// Zero matrix
val zeros = Matrices.zeros(2, 3)
```

### LabeledPoint

Data point with label and features for supervised learning:

```scala { .api }
case class LabeledPoint(label: Double, features: Vector) {
  override def toString: String = s"($label,$features)"
}
```

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Create labeled points
val positive = LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 3.0))
val negative = LabeledPoint(0.0, Vectors.dense(-1.0, -2.0, -3.0))

// For regression
val regressionPoint = LabeledPoint(3.5, Vectors.dense(1.0, 2.0))

// Create training data
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))
))
```

### Rating

User-item rating for collaborative filtering:

```scala { .api }
case class Rating(user: Int, product: Int, rating: Double)
```

```scala
import org.apache.spark.mllib.recommendation.Rating

// Create ratings
val ratings = sc.parallelize(Seq(
  Rating(1, 101, 5.0),
  Rating(1, 102, 3.0),
  Rating(2, 101, 4.0),
  Rating(2, 103, 2.0)
))
```
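In practice, `LabeledPoint`s are rarely built inline; MLlib can read them from LIBSVM-format text via `MLUtils.loadLibSVMFile`. A minimal sketch — the file path here is a hypothetical placeholder:

```scala
import org.apache.spark.mllib.util.MLUtils

// Each line: <label> <index1>:<value1> <index2>:<value2> ...
// "data/sample_libsvm_data.txt" is a placeholder path for illustration
val loadedData = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Labels arrive as Double, features as sparse vectors sized to the max feature index
loadedData.take(1).foreach(println)
```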
## Classification

### Logistic Regression

**LogisticRegressionWithSGD**: Train logistic regression using stochastic gradient descent

```scala { .api }
object LogisticRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LogisticRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LogisticRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LogisticRegressionModel
}

class LogisticRegressionModel(override val weights: Vector, override val intercept: Double) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
  def clearThreshold(): LogisticRegressionModel
  def setThreshold(threshold: Double): LogisticRegressionModel
}
```

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, LogisticRegressionModel}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Prepare training data
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0)),
  LabeledPoint(1.0, Vectors.dense(1.5, 1.8)),
  LabeledPoint(0.0, Vectors.dense(-1.5, -1.8))
))

// Train model
val model = LogisticRegressionWithSGD.train(trainingData, numIterations = 100)

// Make predictions
val testData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(-1.0, -1.0)
))

val predictions = model.predict(testData)
predictions.collect() // Array(1.0, 0.0)

// Single prediction
val singlePrediction = model.predict(Vectors.dense(0.5, 0.5))

// Set classification threshold
val calibratedModel = model.setThreshold(0.3)
```

### Support Vector Machines

**SVMWithSGD**: Train SVM using stochastic gradient descent

```scala { .api }
object SVMWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): SVMModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): SVMModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double, miniBatchFraction: Double): SVMModel
}

class SVMModel(override val weights: Vector, override val intercept: Double) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.classification.{SVMWithSGD, SVMModel}

// Train SVM model
val svmModel = SVMWithSGD.train(
  input = trainingData,
  numIterations = 100,
  stepSize = 1.0,
  regParam = 0.01
)

// Make predictions
val svmPredictions = svmModel.predict(testData)
```

### Naive Bayes

**NaiveBayes**: Train multinomial Naive Bayes classifier

```scala { .api }
object NaiveBayes {
  def train(input: RDD[LabeledPoint], lambda: Double = 1.0): NaiveBayesModel
}

class NaiveBayesModel(val labels: Array[Double], val pi: Array[Double], val theta: Array[Array[Double]]) extends ClassificationModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

// Training data for text classification (bag of words)
val textData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 1.0)), // spam
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)), // ham
  LabeledPoint(0.0, Vectors.dense(1.0, 1.0, 0.0)), // spam
  LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 1.0))  // ham
))

// Train Naive Bayes model
val nbModel = NaiveBayes.train(textData, lambda = 1.0)

// Make predictions
val textPredictions = nbModel.predict(testData)
```
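All of these classifiers return hard label predictions, so a quick sanity check is training accuracy. A minimal sketch, reusing the logistic regression `model` and `trainingData` defined above (training accuracy is optimistic; prefer a held-out split, as in the pipeline example at the end, for real evaluation):

```scala
// Predicted vs. true label for every training point
val labelAndPred = trainingData.map { point =>
  (point.label, model.predict(point.features))
}

// Fraction of points classified correctly
val trainAccuracy =
  labelAndPred.filter { case (label, pred) => label == pred }.count().toDouble / trainingData.count()
println(s"Training accuracy: $trainAccuracy")
```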
## Regression

### Linear Regression

**LinearRegressionWithSGD**: Train linear regression using SGD

```scala { .api }
object LinearRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LinearRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LinearRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LinearRegressionModel
}

class LinearRegressionModel(override val weights: Vector, override val intercept: Double) extends GeneralizedLinearModel with RegressionModel with Serializable {
  def predict(testData: RDD[Vector]): RDD[Double]
  def predict(testData: Vector): Double
}
```

```scala
import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel, LabeledPoint}

// Regression training data
val regressionData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0)),
  LabeledPoint(3.0, Vectors.dense(3.0)),
  LabeledPoint(4.0, Vectors.dense(4.0))
))

// Train linear regression
val lrModel = LinearRegressionWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01
)

// Make predictions
val regressionTestData = sc.parallelize(Seq(
  Vectors.dense(1.5),
  Vectors.dense(2.5)
))

val regressionPredictions = lrModel.predict(regressionTestData)
regressionPredictions.collect() // approximately Array(1.5, 2.5)
```

### Ridge Regression

**RidgeRegressionWithSGD**: Ridge regression with L2 regularization

```scala { .api }
object RidgeRegressionWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): RidgeRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): RidgeRegressionModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): RidgeRegressionModel
}
```

### Lasso Regression

**LassoWithSGD**: Lasso regression with L1 regularization

```scala { .api }
object LassoWithSGD {
  def train(input: RDD[LabeledPoint], numIterations: Int): LassoModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double): LassoModel
  def train(input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, regParam: Double): LassoModel
}
```

```scala
import org.apache.spark.mllib.regression.{RidgeRegressionWithSGD, LassoWithSGD}

// Ridge regression with L2 regularization
val ridgeModel = RidgeRegressionWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01,
  regParam = 0.1
)

// Lasso regression with L1 regularization
val lassoModel = LassoWithSGD.train(
  input = regressionData,
  numIterations = 100,
  stepSize = 0.01,
  regParam = 0.1
)
```
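To quantify how well any of these regressors fits, MLlib's `RegressionMetrics` computes standard error measures from (prediction, observation) pairs. A minimal sketch, reusing `lrModel` and `regressionData` from above:

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics

// (prediction, observation) pairs over the training data
val predictionAndObservation = regressionData.map { point =>
  (lrModel.predict(point.features), point.label)
}

val regMetrics = new RegressionMetrics(predictionAndObservation)
println(s"MSE:  ${regMetrics.meanSquaredError}")
println(s"RMSE: ${regMetrics.rootMeanSquaredError}")
println(s"R^2:  ${regMetrics.r2}")
```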
## Clustering

### K-Means

**KMeans**: K-means clustering algorithm

```scala { .api }
class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double) extends Serializable {
  def setK(k: Int): KMeans
  def setMaxIterations(maxIterations: Int): KMeans
  def setRuns(runs: Int): KMeans
  def setInitializationMode(initializationMode: String): KMeans
  def setInitializationSteps(initializationSteps: Int): KMeans
  def setEpsilon(epsilon: Double): KMeans
  def run(data: RDD[Vector]): KMeansModel
}

object KMeans {
  def train(data: RDD[Vector], k: Int, maxIterations: Int): KMeansModel
  def train(data: RDD[Vector], k: Int, maxIterations: Int, runs: Int): KMeansModel
  def train(data: RDD[Vector], k: Int, maxIterations: Int, runs: Int, initializationMode: String): KMeansModel
}

class KMeansModel(val clusterCenters: Array[Vector]) extends Serializable {
  def predict(point: Vector): Int
  def predict(points: RDD[Vector]): RDD[Int]
  def computeCost(data: RDD[Vector]): Double
}
```

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Prepare clustering data
val clusteringData = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 1.0),
  Vectors.dense(9.0, 8.0),
  Vectors.dense(8.0, 9.0),
  Vectors.dense(9.0, 9.0)
))

// Train K-means model
val kmeansModel = KMeans.train(
  data = clusteringData,
  k = 2,             // Number of clusters
  maxIterations = 20
)

// Get cluster centers
val centers = kmeansModel.clusterCenters
centers.foreach(println)

// Make predictions (cluster indices are arbitrary but consistent)
val clusterPredictions = kmeansModel.predict(clusteringData)
clusterPredictions.collect() // e.g. Array(0, 0, 0, 1, 1, 1)

// Compute cost (sum of squared distances to centroids)
val cost = kmeansModel.computeCost(clusteringData)

// Advanced K-means with custom parameters
val advancedKMeans = new KMeans()
  .setK(3)
  .setMaxIterations(50)
  .setRuns(10) // Multiple runs for better results
  .setInitializationMode("k-means||")
  .setEpsilon(1e-4)

val advancedModel = advancedKMeans.run(clusteringData)
```
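Since `computeCost` returns the within-cluster sum of squared errors, it can guide the choice of `k`. A minimal elbow-style sweep, reusing `clusteringData` from above (the candidate range is illustrative):

```scala
// Lower cost means tighter clusters; look for the "elbow" where
// adding another cluster stops reducing the cost substantially
(2 to 5).foreach { k =>
  val candidate = KMeans.train(clusteringData, k, 20)
  println(s"k=$k cost=${candidate.computeCost(clusteringData)}")
}
```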
## Collaborative Filtering

### Alternating Least Squares (ALS)

**ALS**: Matrix factorization for collaborative filtering

```scala { .api }
object ALS {
  def train(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
  def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double): MatrixFactorizationModel
  def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int): MatrixFactorizationModel
  def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
  def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, alpha: Double): MatrixFactorizationModel
}

class MatrixFactorizationModel(val rank: Int, val userFeatures: RDD[(Int, Array[Double])], val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
  def predict(user: Int, product: Int): Double
  def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]
  def recommendProducts(user: Int, num: Int): Array[Rating]
  def recommendUsers(product: Int, num: Int): Array[Rating]
}
```

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Create ratings data
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0),
  Rating(1, 2, 1.0),
  Rating(1, 3, 5.0),
  Rating(2, 1, 1.0),
  Rating(2, 2, 5.0),
  Rating(2, 3, 1.0),
  Rating(3, 1, 5.0),
  Rating(3, 2, 1.0),
  Rating(3, 3, 5.0)
))

// Train collaborative filtering model
val alsModel = ALS.train(
  ratings = ratings,
  rank = 10,       // Number of latent factors
  iterations = 10, // Number of iterations
  lambda = 0.01    // Regularization parameter
)

// Predict rating for user-item pair
val prediction = alsModel.predict(1, 2)

// Predict ratings for multiple user-item pairs
val userProducts = sc.parallelize(Seq((1, 1), (2, 2), (3, 3)))
val predictions = alsModel.predict(userProducts)

// Recommend products for a user
val recommendations = alsModel.recommendProducts(1, 5)
recommendations.foreach { rating =>
  println(s"Product ${rating.product}: ${rating.rating}")
}

// Recommend users for a product
val userRecommendations = alsModel.recommendUsers(1, 3)

// For implicit feedback data
val implicitModel = ALS.trainImplicit(
  ratings = ratings,
  rank = 10,
  iterations = 10,
  lambda = 0.01,
  alpha = 0.1 // Confidence parameter
)
```
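Rating predictions can be scored like a regression problem. A minimal sketch that joins predicted ratings against the observed ones to compute mean squared error, reusing `ratings` and `alsModel` from above:

```scala
// Re-score every (user, product) pair seen in training
val userProductPairs = ratings.map { case Rating(user, product, _) => (user, product) }

val predictedRatings = alsModel.predict(userProductPairs)
  .map { case Rating(user, product, rating) => ((user, product), rating) }

val actualRatings = ratings.map { case Rating(user, product, rating) => ((user, product), rating) }

// Join on (user, product) and average the squared errors
val mse = actualRatings.join(predictedRatings)
  .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }
  .mean()

println(s"Training MSE: $mse")
```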
## Statistics

### Summary Statistics

**Statistics**: Statistical functions for RDDs

```scala { .api }
object Statistics {
  def colStats(rdd: RDD[Vector]): MultivariateStatisticalSummary
  def corr(x: RDD[Double], y: RDD[Double], method: String = "pearson"): Double
  def corr(X: RDD[Vector], method: String = "pearson"): Matrix
  def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult
  def chiSqTest(observed: Matrix): ChiSqTestResult
  def chiSqTest(observed: RDD[LabeledPoint]): Array[ChiSqTestResult]
}

trait MultivariateStatisticalSummary {
  def mean: Vector
  def variance: Vector
  def count: Long
  def numNonzeros: Vector
  def max: Vector
  def min: Vector
}
```

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.{Vector, Vectors, Matrix}

// Sample data for statistics
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// Compute column statistics
val summary = Statistics.colStats(observations)
println(s"Mean: ${summary.mean}")         // [4.0, 5.0, 6.0]
println(s"Variance: ${summary.variance}") // [9.0, 9.0, 9.0]
println(s"Count: ${summary.count}")       // 3
println(s"Max: ${summary.max}")           // [7.0, 8.0, 9.0]
println(s"Min: ${summary.min}")           // [1.0, 2.0, 3.0]

// Correlation between two RDD[Double]
val x = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0))
val correlation = Statistics.corr(x, y, "pearson") // 1.0 (perfect positive correlation)

// Correlation matrix for RDD[Vector]
val correlationMatrix = Statistics.corr(observations, "pearson")

// Chi-squared test
val observed = Vectors.dense(1.0, 2.0, 3.0)
val expected = Vectors.dense(1.5, 1.5, 3.0)
val chiSqResult = Statistics.chiSqTest(observed, expected)

println(s"Chi-squared statistic: ${chiSqResult.statistic}")
println(s"P-value: ${chiSqResult.pValue}")
println(s"Degrees of freedom: ${chiSqResult.degreesOfFreedom}")
```

## Model Evaluation

### Binary Classification Metrics

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Predictions and labels (prediction score, true label)
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.7, 1.0),
  (0.6, 0.0), (0.5, 1.0), (0.4, 0.0),
  (0.3, 0.0), (0.2, 0.0), (0.1, 0.0)
))

val binaryMetrics = new BinaryClassificationMetrics(scoreAndLabels)

// Area under ROC curve
val areaUnderROC = binaryMetrics.areaUnderROC()
println(s"Area under ROC: $areaUnderROC")

// Area under Precision-Recall curve
val areaUnderPR = binaryMetrics.areaUnderPR()
println(s"Area under PR: $areaUnderPR")
```

### Multi-class Classification Metrics

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Predictions and labels (predicted class, true class)
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
  (0.0, 0.0), (1.0, 2.0), (2.0, 1.0)
))

val multiMetrics = new MulticlassMetrics(predictionAndLabels)

// Overall statistics
val accuracy = multiMetrics.accuracy
val weightedPrecision = multiMetrics.weightedPrecision
val weightedRecall = multiMetrics.weightedRecall
val weightedFMeasure = multiMetrics.weightedFMeasure

// Per-class metrics
val labels = multiMetrics.labels
labels.foreach { label =>
  println(s"Class $label precision: ${multiMetrics.precision(label)}")
  println(s"Class $label recall: ${multiMetrics.recall(label)}")
  println(s"Class $label F1-score: ${multiMetrics.fMeasure(label)}")
}

// Confusion matrix
val confusionMatrix = multiMetrics.confusionMatrix
println(s"Confusion matrix:\n$confusionMatrix")
```

## Pipeline Example

Complete machine learning pipeline:

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// 1. Load and prepare data (label in the first CSV column, features in the rest)
val rawData = sc.textFile("data.csv")
val parsedData = rawData.map { line =>
  val parts = line.split(',')
  val label = parts(0).toDouble
  val features = Vectors.dense(parts.tail.map(_.toDouble))
  LabeledPoint(label, features)
}

// 2. Split data
val Array(training, test) = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
training.cache()

// 3. Train model
val model = LogisticRegressionWithSGD.train(training, numIterations = 100)

// 4. Make predictions
val predictionAndLabel = test.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// 5. Evaluate model
val metrics = new BinaryClassificationMetrics(predictionAndLabel)
val auROC = metrics.areaUnderROC()

println(s"Area under ROC: $auROC")

// 6. Save model
model.save(sc, "myModel")

// 7. Load model later
val loadedModel = LogisticRegressionModel.load(sc, "myModel")
```

This guide covers the essential machine learning capabilities of Spark's MLlib for building scalable ML applications.