# Statistical Functions

MLlib provides comprehensive statistical analysis and hypothesis-testing capabilities, including descriptive statistics, correlation analysis, hypothesis tests, and random data generation.

## Capabilities

### Summary Statistics

```scala { .api }
/**
 * Statistics - statistical functions for DataFrames and vectors
 * Provides statistical analysis methods for machine learning workflows
 */
object Statistics {
  def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix
  def corr(dataset: Dataset[_], columns: Seq[String]): Matrix
  def chiSquareTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def anovaTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def fTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
}

/**
 * Correlation - correlation computation for a Vector column of a DataFrame
 * Supports Pearson (default) and Spearman correlation coefficients; the
 * returned DataFrame contains a single row holding the correlation Matrix
 */
object Correlation {
  def corr(dataset: Dataset[_], column: String, method: String): DataFrame
  def corr(dataset: Dataset[_], column: String): DataFrame
}

/**
 * Summarizer - provides summary statistics for vectors
 * Computes descriptive statistics including means, variances, counts, etc.
 */
object Summarizer {
  def metrics(metrics: String*): SummaryBuilder
  def mean(col: Column): Column
  def sum(col: Column): Column
  def variance(col: Column): Column
  def std(col: Column): Column
  def count(col: Column): Column
  def numNonZeros(col: Column): Column
  def max(col: Column): Column
  def min(col: Column): Column
  def normL1(col: Column): Column
  def normL2(col: Column): Column
}

/**
 * SummaryBuilder - builder for computing multiple summary statistics
 * Enables efficient computation of multiple statistics in a single pass
 */
abstract class SummaryBuilder {
  def summary(featuresCol: Column): Column
  def summary(featuresCol: Column, weightCol: Column): Column
}
```

### Hypothesis Testing

```scala { .api }
/**
 * ChiSquareTest - Chi-square test of independence
 * Tests independence between categorical features and labels
 */
object ChiSquareTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * ANOVATest - Analysis of Variance F-test
 * Tests whether the means of groups are significantly different
 */
object ANOVATest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * FValueTest - F-value test for feature selection
 * Computes F-statistics for continuous features against categorical labels
 */
object FValueTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame
}

/**
 * UnivariateFeatureSelectionTest - base for univariate statistical tests
 * Provides a framework for feature selection using statistical tests
 */
abstract class UnivariateFeatureSelectionTest {
  def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame
  def testStatistic(feature: Vector, label: Vector): Double
  def pValue(testStatistic: Double, degreesOfFreedom: Double): Double
}
```

### Legacy RDD-based Statistics

```scala { .api }
/**
 * Statistics (RDD-based) - statistical functions for RDDs
 * Legacy API (org.apache.spark.mllib.stat.Statistics) for RDD data structures
 */
object Statistics {
  def colStats(rdd: RDD[Vector]): MultivariateStatisticalSummary
  def corr(x: RDD[Double], y: RDD[Double], method: String): Double
  def corr(x: RDD[Double], y: RDD[Double]): Double
  def corr(X: RDD[Vector], method: String): Matrix
  def corr(X: RDD[Vector]): Matrix
  def chiSqTest(observed: Vector): ChiSqTestResult
  def chiSqTest(observed: Matrix): ChiSqTestResult
  def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult]
  def kolmogorovSmirnovTest(data: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult
  def kolmogorovSmirnovTest(data: RDD[Double], cdf: Double => Double): KolmogorovSmirnovTestResult
}

/**
 * MultivariateStatisticalSummary - summary statistics for vectors
 * Provides column-wise statistical summaries for multivariate data
 */
trait MultivariateStatisticalSummary {
  def mean: Vector
  def variance: Vector
  def count: Long
  def numNonzeros: Vector
  def max: Vector
  def min: Vector
  def normL1: Vector
  def normL2: Vector
}

/**
 * ChiSqTestResult - result of a Chi-square statistical test
 * Contains the test statistic, p-value, and degrees of freedom
 */
class ChiSqTestResult {
  def pValue: Double
  def degreesOfFreedom: Int
  def statistic: Double
  def nullHypothesis: String
  def method: String
}

/**
 * KolmogorovSmirnovTestResult - result of a Kolmogorov-Smirnov test
 * Tests goodness of fit of a sample against a reference distribution
 */
class KolmogorovSmirnovTestResult {
  def pValue: Double
  def statistic: Double
  def nullHypothesis: String
}
```

## Usage Examples

### Correlation Analysis

```scala
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Compute the correlation matrix of the "features" Vector column
// (Pearson correlation by default)
val Row(pearsonMatrix: Matrix) = Correlation.corr(dataset, "features").head
println(s"Pearson correlation matrix:\n$pearsonMatrix")

// Compute correlation using Spearman rank correlation
val Row(spearmanMatrix: Matrix) = Correlation.corr(dataset, "features", "spearman").head
println(s"Spearman correlation matrix:\n$spearmanMatrix")

// To correlate a specific set of scalar columns, first assemble them into a
// single Vector column (e.g. with VectorAssembler) and correlate that column
```

### Summary Statistics

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions._

// Compute multiple statistics efficiently in a single pass
val summary = dataset.select(
  Summarizer.metrics("mean", "variance", "count", "numNonZeros", "max", "min")
    .summary(col("features"))
    .alias("summary")
).collect()

println(s"Summary statistics: ${summary(0)}")

// Compute individual statistics
val featureStats = dataset.select(
  Summarizer.mean(col("features")).alias("mean"),
  Summarizer.variance(col("features")).alias("variance"),
  Summarizer.std(col("features")).alias("std")
)

featureStats.show(false)
```

### Hypothesis Testing

```scala
import org.apache.spark.ml.stat.{ANOVATest, ChiSquareTest, FValueTest}

// Chi-square test of independence
val chiSquareResults = ChiSquareTest.test(dataset, "features", "label")
chiSquareResults.select("pValues", "degreesOfFreedom", "statistics").show(false)

// ANOVA F-test
val anovaResults = ANOVATest.test(dataset, "features", "label")
anovaResults.select("pValues", "degreesOfFreedom", "fValues").show(false)

// F-value test for feature selection
val fTestResults = FValueTest.test(dataset, "features", "label")
fTestResults.select("pValues", "degreesOfFreedom", "fValues").show(false)
```
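
The `fValues` column produced by `ANOVATest` holds ordinary one-way ANOVA F-statistics. As a plain-Scala illustration of how one such statistic is formed (no Spark involved; `AnovaDemo` and `fStatistic` are hypothetical local names, not MLlib API), the test compares between-group variance to within-group variance:

```scala
// Sketch: one-way ANOVA F-statistic for a single feature split into groups
object AnovaDemo {
  // Returns (F, between-group df, within-group df)
  def fStatistic(groups: Seq[Array[Double]]): (Double, Double, Double) = {
    val n = groups.map(_.length).sum
    val k = groups.length
    val grandMean = groups.flatten.sum / n
    // Between-group sum of squares (df = k - 1)
    val ssb = groups.map(g => g.length * math.pow(g.sum / g.length - grandMean, 2)).sum
    // Within-group sum of squares (df = n - k)
    val ssw = groups.map { g =>
      val m = g.sum / g.length
      g.map(v => math.pow(v - m, 2)).sum
    }.sum
    val f = (ssb / (k - 1)) / (ssw / (n - k))
    (f, (k - 1).toDouble, (n - k).toDouble)
  }

  def main(args: Array[String]): Unit = {
    // Two well-separated groups give a large F value
    val groups = Seq(Array(1.0, 2.0, 3.0), Array(6.0, 7.0, 8.0))
    val (f, dfB, dfW) = fStatistic(groups)
    println(s"F = $f, df = ($dfB, $dfW)")
  }
}
```

A large F (between-group variance dominating within-group variance) yields a small p-value, which is what `ANOVATest` reports per feature.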

### RDD-based Statistical Analysis

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// Create an RDD of vectors (sc is the SparkContext)
val observations = sc.parallelize(Seq(
  OldVectors.dense(1.0, 10.0, 100.0),
  OldVectors.dense(2.0, 20.0, 200.0),
  OldVectors.dense(3.0, 30.0, 300.0)
))

// Compute column-wise summary statistics
val summary = Statistics.colStats(observations)
println(s"Mean: ${summary.mean}")
println(s"Variance: ${summary.variance}")
println(s"Count: ${summary.count}")
println(s"Max: ${summary.max}")
println(s"Min: ${summary.min}")

// Compute correlation matrix
val correlationMatrix = Statistics.corr(observations, "pearson")
println(s"Correlation matrix:\n$correlationMatrix")

// Chi-square goodness-of-fit test (against a uniform distribution by default)
val observed = OldVectors.dense(1.0, 2.0, 3.0, 4.0)
val chiSqTestResult = Statistics.chiSqTest(observed)
println(s"Chi-square test p-value: ${chiSqTestResult.pValue}")
println(s"Chi-square test statistic: ${chiSqTestResult.statistic}")
```


### Feature Selection with Statistical Tests

```scala
import org.apache.spark.ml.feature.UnivariateFeatureSelector

// Select features using univariate statistical tests
// (continuous features + categorical label selects the ANOVA F-test)
val selector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("numTopFeatures")
  .setSelectionThreshold(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selectorModel = selector.fit(dataset)
val selectedData = selectorModel.transform(dataset)

// View the indices of the selected features
println(s"Selected features: ${selectorModel.selectedFeatures.mkString(", ")}")
```


### Kernel Density Estimation

```scala
import org.apache.spark.mllib.stat.KernelDensity

// Estimate a probability density function from a sample
val sample = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)

// Evaluate the estimated density at specific points
val points = Array(-1.0, 2.0, 5.0)
val densities = kd.estimate(points)
points.zip(densities).foreach { case (point, density) =>
  println(s"Density at $point: $density")
}
```
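
Under the hood, a Gaussian kernel density estimate is the average of Gaussian bumps centered at each sample point. A hand-rolled sketch of the same computation (plain Scala, no Spark; `KdeDemo` and `gaussianKde` are hypothetical local names, not the MLlib API):

```scala
// Sketch: Gaussian kernel density estimation evaluated at a query point
object KdeDemo {
  // Mean of Gaussian kernels (width = bandwidth) centered at each sample
  def gaussianKde(sample: Array[Double], bandwidth: Double)(x: Double): Double = {
    val norm = 1.0 / (bandwidth * math.sqrt(2 * math.Pi))
    sample.map { xi =>
      val z = (x - xi) / bandwidth
      norm * math.exp(-0.5 * z * z)
    }.sum / sample.length
  }

  def main(args: Array[String]): Unit = {
    val sample = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val density: Double => Double = gaussianKde(sample, 3.0)
    // Points near the bulk of the sample get higher density estimates
    Array(-1.0, 2.0, 5.0).foreach(p => println(s"density at $p: ${density(p)}"))
  }
}
```

A larger bandwidth smooths the estimate (each bump is wider), which is why even the out-of-sample point -1.0 receives nonzero density with bandwidth 3.0.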

### Streaming Statistical Tests

```scala
import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}
import org.apache.spark.streaming.dstream.DStream

// Streaming hypothesis testing (for Spark Streaming)
val streamingTest = new StreamingTest()
  .setPeacePeriod(0)      // number of initial batches to ignore
  .setWindowSize(0)       // 0 = test on all data received so far
  .setTestMethod("welch") // Welch's t-test

// Apply the test to a DStream[BinarySample] of (isExperiment, value) pairs
// val results = streamingTest.registerStream(dataStream)
```

## Key Statistical Methods

### Correlation Measures
- **Pearson**: Linear correlation (default)
- **Spearman**: Rank-based correlation that captures monotonic, non-linear relationships
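
The distinction matters for data that is monotonic but not linear. A plain-Scala sketch (no Spark; `CorrDemo`, `pearson`, `ranks`, and `spearman` are hypothetical local helpers) shows Spearman reaching 1 on a cubic relationship while Pearson stays below 1:

```scala
// Sketch: Pearson vs Spearman on a monotonic, non-linear relationship
object CorrDemo {
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    val n = x.length
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Rank positions 1..n (this sketch ignores tie correction)
  def ranks(v: Array[Double]): Array[Double] = {
    val order = v.zipWithIndex.sortBy(_._1).map(_._2)
    val r = new Array[Double](v.length)
    order.zipWithIndex.foreach { case (idx, rank) => r(idx) = rank + 1.0 }
    r
  }

  // Spearman correlation = Pearson correlation of the ranks
  def spearman(x: Array[Double], y: Array[Double]): Double =
    pearson(ranks(x), ranks(y))

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = x.map(v => v * v * v) // monotonic but non-linear
    println(f"Pearson:  ${pearson(x, y)}%.4f")  // well below 1
    println(f"Spearman: ${spearman(x, y)}%.4f") // ~1, ranks agree perfectly
  }
}
```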

### Hypothesis Tests
- **Chi-square**: Independence test for categorical variables
- **ANOVA F-test**: Mean comparison across groups
- **F-value test**: Feature importance for continuous features
- **Kolmogorov-Smirnov**: Distribution comparison
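
For intuition, the chi-square statistic underlying both the goodness-of-fit and independence tests is the sum of (observed - expected)^2 / expected over all cells. A hand computation mirroring the uniform-expectation default of the legacy `chiSqTest(observed)` (plain Scala; `ChiSqDemo` and `chiSqStatistic` are hypothetical local names):

```scala
// Sketch: chi-square goodness-of-fit statistic against uniform expectations
object ChiSqDemo {
  def chiSqStatistic(observed: Array[Double], expected: Array[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

  def main(args: Array[String]): Unit = {
    val observed = Array(1.0, 2.0, 3.0, 4.0)
    // With no expected vector supplied, the test assumes a uniform distribution
    val expected = Array.fill(observed.length)(observed.sum / observed.length)
    println(s"chi-square statistic: ${chiSqStatistic(observed, expected)}")
    // degrees of freedom = number of categories - 1
  }
}
```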

### Summary Statistics
- **Central tendency**: Mean (via `Summarizer.mean`); approximate median via `DataFrame.stat.approxQuantile`
- **Variability**: Variance, standard deviation, min/max
- **Shape**: Skewness, kurtosis (via the `skewness`/`kurtosis` SQL functions)
- **Count statistics**: Total count, non-zero count

### Applications
- **Exploratory Data Analysis**: Understanding data distributions and relationships
- **Feature Selection**: Identifying relevant features using statistical significance
- **Quality Assessment**: Detecting outliers and data quality issues
- **Model Validation**: Statistical tests for model assumptions