# Statistical Functions

MLlib provides comprehensive statistical analysis and hypothesis-testing capabilities, including descriptive statistics, correlation analysis, hypothesis tests, and random data generation.

## Capabilities

### Summary Statistics

```scala { .api }
/**
 * Statistics - statistical functions for DataFrames and vectors
 * Provides statistical analysis methods for machine learning workflows
 */
object Statistics {
  def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix
  def corr(dataset: Dataset[_], columns: Seq[String]): Matrix
  def chiSquareTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def anovaTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def fTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
}

/**
 * Correlation - correlation computation for a Vector column of a DataFrame
 * Supports Pearson (default) and Spearman correlation coefficients; the
 * returned DataFrame contains a single row holding the correlation Matrix
 */
object Correlation {
  def corr(dataset: Dataset[_], column: String, method: String): DataFrame
  def corr(dataset: Dataset[_], column: String): DataFrame
}

/**
 * Summarizer - provides summary statistics for vectors
 * Computes descriptive statistics including means, variances, counts, etc.
 */
object Summarizer {
  def metrics(metrics: String*): SummaryBuilder
  def mean(col: Column): Column
  def sum(col: Column): Column
  def variance(col: Column): Column
  def std(col: Column): Column
  def count(col: Column): Column
  def numNonZeros(col: Column): Column
  def max(col: Column): Column
  def min(col: Column): Column
  def normL1(col: Column): Column
  def normL2(col: Column): Column
}

/**
 * SummaryBuilder - builder for computing multiple summary statistics
 * Enables efficient computation of multiple statistics in a single pass
 */
abstract class SummaryBuilder {
  def summary(featuresCol: Column): Column
  def summary(featuresCol: Column, weightCol: Column): Column
}
```

### Hypothesis Testing

```scala { .api }
/**
 * ChiSquareTest - Chi-square test of independence
 * Tests independence between categorical features and labels
 */
object ChiSquareTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * ANOVATest - Analysis of Variance F-test
 * Tests whether the means of groups are significantly different
 */
object ANOVATest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * FValueTest - F-value test for feature selection
 * Computes F-statistics for continuous features against categorical labels
 */
object FValueTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * UnivariateFeatureSelectionTest - base for univariate statistical tests
 * Provides a framework for feature selection using statistical tests
 */
abstract class UnivariateFeatureSelectionTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def testStatistic(feature: Vector, label: Vector): Double
  def pValue(testStatistic: Double, degreesOfFreedom: Double): Double
}
```

### Legacy RDD-based Statistics

```scala { .api }
/**
 * Statistics (RDD-based) - statistical functions for RDDs
 * Legacy API (org.apache.spark.mllib.stat.Statistics) for RDD data structures
 */
object Statistics {
  def colStats(rdd: RDD[Vector]): MultivariateStatisticalSummary
  def corr(x: RDD[Double], y: RDD[Double], method: String): Double
  def corr(x: RDD[Double], y: RDD[Double]): Double
  def corr(X: RDD[Vector], method: String): Matrix
  def corr(X: RDD[Vector]): Matrix
  def chiSqTest(observed: Vector): ChiSqTestResult
  def chiSqTest(observed: Matrix): ChiSqTestResult
  def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult]
  def kolmogorovSmirnovTest(data: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult
  def kolmogorovSmirnovTest(data: RDD[Double], cdf: Double => Double): KolmogorovSmirnovTestResult
}

/**
 * MultivariateStatisticalSummary - summary statistics for vectors
 * Provides column-wise statistical summaries for multivariate data
 */
trait MultivariateStatisticalSummary {
  def mean: Vector
  def variance: Vector
  def count: Long
  def numNonzeros: Vector
  def max: Vector
  def min: Vector
  def normL1: Vector
  def normL2: Vector
}

/**
 * ChiSqTestResult - result of a Chi-square statistical test
 * Contains the test statistic, p-value, and degrees of freedom
 */
class ChiSqTestResult {
  def pValue: Double
  def degreesOfFreedom: Int
  def statistic: Double
  def nullHypothesis: String
  def method: String
}

/**
 * KolmogorovSmirnovTestResult - result of a Kolmogorov-Smirnov test
 * Tests goodness of fit of a sample against a reference distribution
 */
class KolmogorovSmirnovTestResult {
  def pValue: Double
  def statistic: Double
  def nullHypothesis: String
}
```

## Usage Examples

### Correlation Analysis

```scala
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Compute the correlation matrix of the "features" Vector column
// (Pearson correlation by default)
val Row(pearsonMatrix: Matrix) = Correlation.corr(dataset, "features").head
println(s"Pearson correlation matrix:\n$pearsonMatrix")

// Compute correlation using Spearman rank correlation
val Row(spearmanMatrix: Matrix) = Correlation.corr(dataset, "features", "spearman").head
println(s"Spearman correlation matrix:\n$spearmanMatrix")

// To correlate a specific set of scalar columns, first assemble them into a
// single Vector column (e.g. with VectorAssembler) and correlate that column
```

### Summary Statistics

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions._

// Compute multiple statistics efficiently in a single pass
val summary = dataset.select(
  Summarizer.metrics("mean", "variance", "count", "numNonZeros", "max", "min")
    .summary(col("features"))
    .alias("summary")
).collect()

println(s"Summary statistics: ${summary(0)}")

// Compute individual statistics
val featureStats = dataset.select(
  Summarizer.mean(col("features")).alias("mean"),
  Summarizer.variance(col("features")).alias("variance"),
  Summarizer.std(col("features")).alias("std")
)

featureStats.show(false)
```

### Hypothesis Testing

```scala
import org.apache.spark.ml.stat.{ANOVATest, ChiSquareTest, FValueTest}

// Chi-square test of independence
val chiSquareResults = ChiSquareTest.test(dataset, "features", "label")
chiSquareResults.select("pValues", "degreesOfFreedom", "statistics").show(false)

// ANOVA F-test
val anovaResults = ANOVATest.test(dataset, "features", "label")
anovaResults.select("pValues", "degreesOfFreedom", "fValues").show(false)

// F-value test for feature selection
val fTestResults = FValueTest.test(dataset, "features", "label")
fTestResults.select("pValues", "degreesOfFreedom", "fValues").show(false)
```
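
The `fValues` column produced by `ANOVATest` holds ordinary one-way ANOVA F-statistics. As a plain-Scala illustration of how one such statistic is formed (no Spark involved; `AnovaDemo` and `fStatistic` are hypothetical local names, not MLlib API), the test compares between-group variance to within-group variance:

```scala
// Sketch: one-way ANOVA F-statistic for a single feature split into groups
object AnovaDemo {
  // Returns (F, between-group df, within-group df)
  def fStatistic(groups: Seq[Array[Double]]): (Double, Double, Double) = {
    val n = groups.map(_.length).sum
    val k = groups.length
    val grandMean = groups.flatten.sum / n
    // Between-group sum of squares (df = k - 1)
    val ssb = groups.map(g => g.length * math.pow(g.sum / g.length - grandMean, 2)).sum
    // Within-group sum of squares (df = n - k)
    val ssw = groups.map { g =>
      val m = g.sum / g.length
      g.map(v => math.pow(v - m, 2)).sum
    }.sum
    val f = (ssb / (k - 1)) / (ssw / (n - k))
    (f, (k - 1).toDouble, (n - k).toDouble)
  }

  def main(args: Array[String]): Unit = {
    // Two well-separated groups give a large F value
    val groups = Seq(Array(1.0, 2.0, 3.0), Array(6.0, 7.0, 8.0))
    val (f, dfB, dfW) = fStatistic(groups)
    println(s"F = $f, df = ($dfB, $dfW)")
  }
}
```

A large F (between-group variance dominating within-group variance) yields a small p-value, which is what `ANOVATest` reports per feature.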

### RDD-based Statistical Analysis

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// Create an RDD of vectors (sc is the SparkContext)
val observations = sc.parallelize(Seq(
  OldVectors.dense(1.0, 10.0, 100.0),
  OldVectors.dense(2.0, 20.0, 200.0),
  OldVectors.dense(3.0, 30.0, 300.0)
))

// Compute column-wise summary statistics
val summary = Statistics.colStats(observations)
println(s"Mean: ${summary.mean}")
println(s"Variance: ${summary.variance}")
println(s"Count: ${summary.count}")
println(s"Max: ${summary.max}")
println(s"Min: ${summary.min}")

// Compute correlation matrix
val correlationMatrix = Statistics.corr(observations, "pearson")
println(s"Correlation matrix:\n$correlationMatrix")

// Chi-square goodness-of-fit test (against a uniform distribution by default)
val observed = OldVectors.dense(1.0, 2.0, 3.0, 4.0)
val chiSqTestResult = Statistics.chiSqTest(observed)
println(s"Chi-square test p-value: ${chiSqTestResult.pValue}")
println(s"Chi-square test statistic: ${chiSqTestResult.statistic}")
```


### Feature Selection with Statistical Tests

```scala
import org.apache.spark.ml.feature.UnivariateFeatureSelector

// Select features using univariate statistical tests
// (continuous features + categorical label selects the ANOVA F-test)
val selector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("numTopFeatures")
  .setSelectionThreshold(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selectorModel = selector.fit(dataset)
val selectedData = selectorModel.transform(dataset)

// View the indices of the selected features
println(s"Selected features: ${selectorModel.selectedFeatures.mkString(", ")}")
```


### Kernel Density Estimation

```scala
import org.apache.spark.mllib.stat.KernelDensity

// Estimate a probability density function from a sample
val sample = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)

// Evaluate the estimated density at specific points
val points = Array(-1.0, 2.0, 5.0)
val densities = kd.estimate(points)
points.zip(densities).foreach { case (point, density) =>
  println(s"Density at $point: $density")
}
```
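
Under the hood, a Gaussian kernel density estimate is the average of Gaussian bumps centered at each sample point. A hand-rolled sketch of the same computation (plain Scala, no Spark; `KdeDemo` and `gaussianKde` are hypothetical local names, not the MLlib API):

```scala
// Sketch: Gaussian kernel density estimation evaluated at a query point
object KdeDemo {
  // Mean of Gaussian kernels (width = bandwidth) centered at each sample
  def gaussianKde(sample: Array[Double], bandwidth: Double)(x: Double): Double = {
    val norm = 1.0 / (bandwidth * math.sqrt(2 * math.Pi))
    sample.map { xi =>
      val z = (x - xi) / bandwidth
      norm * math.exp(-0.5 * z * z)
    }.sum / sample.length
  }

  def main(args: Array[String]): Unit = {
    val sample = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val density: Double => Double = gaussianKde(sample, 3.0)
    // Points near the bulk of the sample get higher density estimates
    Array(-1.0, 2.0, 5.0).foreach(p => println(s"density at $p: ${density(p)}"))
  }
}
```

A larger bandwidth smooths the estimate (each bump is wider), which is why even the out-of-sample point -1.0 receives nonzero density with bandwidth 3.0.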

### Streaming Statistical Tests

```scala
import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}
import org.apache.spark.streaming.dstream.DStream

// Streaming hypothesis testing (for Spark Streaming)
val streamingTest = new StreamingTest()
  .setPeacePeriod(0)      // number of initial batches to ignore
  .setWindowSize(0)       // 0 = test on all data received so far
  .setTestMethod("welch") // Welch's t-test

// Apply the test to a DStream[BinarySample] of (isExperiment, value) pairs
// val results = streamingTest.registerStream(dataStream)
```

## Key Statistical Methods

### Correlation Measures
- **Pearson**: Linear correlation (default)
- **Spearman**: Rank-based correlation that captures monotonic, non-linear relationships
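
The distinction matters for data that is monotonic but not linear. A plain-Scala sketch (no Spark; `CorrDemo`, `pearson`, `ranks`, and `spearman` are hypothetical local helpers) shows Spearman reaching 1 on a cubic relationship while Pearson stays below 1:

```scala
// Sketch: Pearson vs Spearman on a monotonic, non-linear relationship
object CorrDemo {
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    val n = x.length
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Rank positions 1..n (this sketch ignores tie correction)
  def ranks(v: Array[Double]): Array[Double] = {
    val order = v.zipWithIndex.sortBy(_._1).map(_._2)
    val r = new Array[Double](v.length)
    order.zipWithIndex.foreach { case (idx, rank) => r(idx) = rank + 1.0 }
    r
  }

  // Spearman correlation = Pearson correlation of the ranks
  def spearman(x: Array[Double], y: Array[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = x.map(v => v * v * v) // monotonic but non-linear
    println(f"Pearson:  ${pearson(x, y)}%.4f")  // well below 1
    println(f"Spearman: ${spearman(x, y)}%.4f") // ~1, ranks agree perfectly
  }
}
```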

### Hypothesis Tests
- **Chi-square**: Independence test for categorical variables
- **ANOVA F-test**: Mean comparison across groups
- **F-value test**: Feature importance for continuous features
- **Kolmogorov-Smirnov**: Distribution comparison
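
For intuition, the chi-square statistic underlying both the goodness-of-fit and independence tests is the sum of (observed - expected)^2 / expected over all cells. A hand computation mirroring the uniform-expectation default of the legacy `chiSqTest(observed)` (plain Scala; `ChiSqDemo` and `chiSqStatistic` are hypothetical local names):

```scala
// Sketch: chi-square goodness-of-fit statistic against uniform expectations
object ChiSqDemo {
  def chiSqStatistic(observed: Array[Double], expected: Array[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

  def main(args: Array[String]): Unit = {
    val observed = Array(1.0, 2.0, 3.0, 4.0)
    // With no expected vector supplied, the test assumes a uniform distribution
    val expected = Array.fill(observed.length)(observed.sum / observed.length)
    println(s"chi-square statistic: ${chiSqStatistic(observed, expected)}")
    // degrees of freedom = number of categories - 1
  }
}
```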

### Summary Statistics
- **Central tendency**: Mean (via `Summarizer.mean`); approximate median via `DataFrame.stat.approxQuantile`
- **Variability**: Variance, standard deviation, min/max
- **Shape**: Skewness, kurtosis (via the `skewness`/`kurtosis` SQL functions)
- **Count statistics**: Total count, non-zero count

### Applications
- **Exploratory Data Analysis**: Understanding data distributions and relationships
- **Feature Selection**: Identifying relevant features using statistical significance
- **Quality Assessment**: Detecting outliers and data quality issues
- **Model Validation**: Statistical tests for model assumptions