or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

Files

docs

classification.mdclustering.mdevaluation.mdfeature-processing.mdfrequent-pattern-mining.mdindex.mdlinear-algebra.mdpipeline.mdrecommendation.mdregression.mdstatistics.md

statistics.mddocs/

0

# Statistical Functions

1

2

Statistical analysis and hypothesis testing capabilities. MLlib provides comprehensive statistical functions including descriptive statistics, correlation analysis, hypothesis testing, and random data generation.

3

4

## Capabilities

5

6

### Summary Statistics

7

8

```scala { .api }

9

/**

10

* Statistics - statistical functions for DataFrames and vectors

11

* Provides statistical analysis methods for machine learning workflows

12

*/

13

object Statistics {

14

def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix

15

def corr(dataset: Dataset[_], columns: Seq[String]): Matrix

16

def chiSquareTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

17

def anovaTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

18

def fTest(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

19

}

20

21

/**

22

* Correlation - correlation computation methods

23

* Supports Pearson and Spearman correlation coefficients

24

*/

25

object Correlation {

26

def corr(dataset: Dataset[_], columns: Seq[String], method: String): Matrix

27

def corr(x: RDD[Vector], method: String): Matrix

28

def corr(x: RDD[Vector]): Matrix

29

def corr(x: RDD[Double], y: RDD[Double], method: String): Double

30

}

31

32

/**

33

* Summarizer - provides summary statistics for vectors

34

* Computes descriptive statistics including means, variances, counts, etc.

35

*/

36

object Summarizer {

37

def metrics(metrics: String*): SummaryBuilder

38

def mean(col: Column): Column

39

def sum(col: Column): Column

40

def variance(col: Column): Column

41

def std(col: Column): Column

42

def count(col: Column): Column

43

def numNonZeros(col: Column): Column

44

def max(col: Column): Column

45

def min(col: Column): Column

46

def normL1(col: Column): Column

47

def normL2(col: Column): Column

48

}

49

50

/**

51

* SummaryBuilder - builder for computing multiple summary statistics

52

* Enables efficient computation of multiple statistics in single pass

53

*/

54

class SummaryBuilder {

55

def summary(featuresCol: Column): Column

56

def summary(featuresCol: Column, weightCol: Column): Column

57

}

58

```

59

60

### Hypothesis Testing

61

62

```scala { .api }

63

/**

64

* ChiSquareTest - Chi-square test of independence

65

* Tests independence between categorical features and labels

66

*/

67

object ChiSquareTest {

68

def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

69

def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame

70

}

71

72

/**

73

* ANOVATest - Analysis of Variance F-test

74

* Tests whether means of groups are significantly different

75

*/

76

object ANOVATest {

77

def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

78

def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame

79

}

80

81

/**

82

* FValueTest - F-value test for feature selection

83

* Computes F-statistics for continuous features against categorical labels

84

*/

85

object FValueTest {

86

def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

87

def test(dataset: Dataset[_], featuresCol: String, labelCol: String, flatten: Boolean): DataFrame

88

}

89

90

/**

91

* UnivariateFeatureSelectionTest - base for univariate statistical tests

92

* Provides framework for feature selection using statistical tests

93

*/

94

abstract class UnivariateFeatureSelectionTest {

95

def test(dataset: Dataset[_], featuresCol: String, labelCol: String): DataFrame

96

def testStatistic(feature: Vector, label: Vector): Double

97

def pValue(testStatistic: Double, degreesOfFreedom: Double): Double

98

}

99

```

100

101

### Legacy RDD-based Statistics

102

103

```scala { .api }

104

/**

105

* Statistics (RDD-based) - statistical functions for RDDs

106

* Legacy API for statistical analysis on RDD data structures

107

*/

108

object org.apache.spark.mllib.stat.Statistics {

109

def mean(rdd: RDD[Vector]): Vector

110

def colStats(rdd: RDD[Vector]): MultivariateStatisticalSummary

111

def corr(x: RDD[Double], y: RDD[Double], method: String): Double

112

def corr(x: RDD[Double], y: RDD[Double]): Double

113

def corr(X: RDD[Vector], method: String): Matrix

114

def corr(X: RDD[Vector]): Matrix

115

def chiSqTest(observed: Vector): ChiSqTestResult

116

def chiSqTest(observed: Matrix): Array[ChiSqTestResult]

117

def chiSqTest(observed: RDD[LabeledPoint]): Array[ChiSqTestResult]

118

def kolmogorovSmirnovTest(sampleX: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult

119

def kolmogorovSmirnovTest(sampleX: RDD[Double], sampleY: RDD[Double]): KolmogorovSmirnovTestResult

120

}

121

122

/**

123

* MultivariateStatisticalSummary - summary statistics for vectors

124

* Provides comprehensive statistical summaries for multivariate data

125

*/

126

trait MultivariateStatisticalSummary {

127

def mean: Vector

128

def variance: Vector

129

def count: Long

130

def numNonzeros: Vector

131

def max: Vector

132

def min: Vector

133

def normL1: Vector

134

def normL2: Vector

135

}

136

137

/**

138

* ChiSqTestResult - result of Chi-square statistical test

139

* Contains test statistic, p-value, and degrees of freedom

140

*/

141

class ChiSqTestResult {

142

def pValue: Double

143

def degreesOfFreedom: Int

144

def statistic: Double

145

def nullHypothesis: String

146

def method: String

147

}

148

149

/**

150

* KolmogorovSmirnovTestResult - result of Kolmogorov-Smirnov test

151

* Tests goodness of fit or equality of distributions

152

*/

153

class KolmogorovSmirnovTestResult {

154

def pValue: Double

155

def statistic: Double

156

def nullHypothesis: String

157

}

158

```

159

160

## Usage Examples

161

162

### Correlation Analysis

163

164

```scala

165

import org.apache.spark.ml.stat.{Correlation, Statistics}

166

import org.apache.spark.ml.linalg.Matrix

167

168

// Compute correlation matrix using Pearson correlation

169

val correlationMatrix: Matrix = Correlation.corr(dataset, "features")

170

println(s"Pearson correlation matrix:\n$correlationMatrix")

171

172

// Compute correlation using Spearman correlation

173

val spearmanMatrix: Matrix = Correlation.corr(dataset, "features", "spearman")

174

println(s"Spearman correlation matrix:\n$spearmanMatrix")

175

176

// Correlate specific columns

177

val selectedCorr = Statistics.corr(dataset, Seq("feature1", "feature2", "feature3"))

178

println(s"Selected features correlation:\n$selectedCorr")

179

```

180

181

### Summary Statistics

182

183

```scala

184

import org.apache.spark.ml.stat.Summarizer

185

import org.apache.spark.sql.functions._

186

187

// Compute multiple statistics efficiently

188

val summary = dataset.select(

189

Summarizer.metrics("mean", "variance", "count", "numNonzeros", "max", "min")

190

.summary(col("features"))

191

.alias("summary")

192

).collect()

193

194

println(s"Summary statistics: ${summary(0)}")

195

196

// Compute individual statistics

197

val meanStats = dataset.select(

198

Summarizer.mean(col("features")).alias("mean"),

199

Summarizer.variance(col("features")).alias("variance"),

200

Summarizer.std(col("features")).alias("std")

201

)

202

203

meanStats.show(false)

204

```

205

206

### Hypothesis Testing

207

208

```scala

209

import org.apache.spark.ml.stat.{ChiSquareTest, ANOVATest, FValueTest}

210

211

// Chi-square test of independence

212

val chiSquareResults = ChiSquareTest.test(dataset, "features", "label")

213

chiSquareResults.select("pValues", "degreesOfFreedom", "statistics").show(false)

214

215

// ANOVA F-test

216

val anovaResults = ANOVATest.test(dataset, "features", "label")

217

anovaResults.select("pValues", "degreesOfFreedom", "fValues").show(false)

218

219

// F-value test for feature selection

220

val fTestResults = FValueTest.test(dataset, "features", "label")

221

fTestResults.select("pValues", "degreesOfFreedom", "fValues").show(false)

222

```

223

224

### RDD-based Statistical Analysis

225

226

```scala

227

import org.apache.spark.mllib.stat.Statistics

228

import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

229

import org.apache.spark.mllib.regression.LabeledPoint

230

231

// Create RDD of vectors

232

val observations = sc.parallelize(Seq(

233

OldVectors.dense(1.0, 10.0, 100.0),

234

OldVectors.dense(2.0, 20.0, 200.0),

235

OldVectors.dense(3.0, 30.0, 300.0)

236

))

237

238

// Compute summary statistics

239

val summary = Statistics.colStats(observations)

240

println(s"Mean: ${summary.mean}")

241

println(s"Variance: ${summary.variance}")

242

println(s"Count: ${summary.count}")

243

println(s"Max: ${summary.max}")

244

println(s"Min: ${summary.min}")

245

246

// Compute correlation matrix

247

val correlationMatrix = Statistics.corr(observations, "pearson")

248

println(s"Correlation matrix:\n$correlationMatrix")

249

250

// Chi-square goodness of fit test

251

val observed = OldVectors.dense(1.0, 2.0, 3.0, 4.0)

252

val chiSqTestResult = Statistics.chiSqTest(observed)

253

println(s"Chi-square test p-value: ${chiSqTestResult.pValue}")

254

println(s"Chi-square test statistic: ${chiSqTestResult.statistic}")

255

```

256

257

### Feature Selection with Statistical Tests

258

259

```scala

260

import org.apache.spark.ml.feature.UnivariateFeatureSelector

261

262

// Use statistical tests for feature selection

263

val selector = new UnivariateFeatureSelector()

264

.setFeatureType("continuous")

265

.setLabelType("categorical")

266

.setSelectionMode("numTopFeatures")

267

.setSelectionThreshold(50)

268

.setFeaturesCol("features")

269

.setLabelCol("label")

270

.setOutputCol("selectedFeatures")

271

272

val selectorModel = selector.fit(dataset)

273

val selectedData = selectorModel.transform(dataset)

274

275

// View selected features

276

println(s"Selected features: ${selectorModel.selectedFeatures.mkString(", ")}")

277

```

278

279

### Kernel Density Estimation

280

281

```scala

282

import org.apache.spark.mllib.stat.KernelDensity

283

284

// Estimate probability density function

285

val sample = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

286

287

val kd = new KernelDensity()

288

.setSample(sample)

289

.setBandwidth(3.0)

290

291

// Evaluate density at specific points

292

val densities = kd.estimate(Array(-1.0, 2.0, 5.0))

293

densities.zip(Array(-1.0, 2.0, 5.0)).foreach { case (density, point) =>

294

println(s"Density at $point: $density")

295

}

296

```

297

298

### Streaming Statistical Tests

299

300

```scala

301

import org.apache.spark.mllib.stat.test.StreamingTest

302

import org.apache.spark.streaming.dstream.DStream

303

304

// Streaming hypothesis testing (for Spark Streaming)

305

val streamingTest = StreamingTest

306

.welchTTest()

307

.peacePeriod(0)

308

.windowSize(0)

309

310

// Apply test to streaming data

311

// streamingTest.registerStream(dataStream1)

312

// streamingTest.registerStream(dataStream2)

313

```

314

315

## Key Statistical Methods

316

317

### Correlation Measures

318

- **Pearson**: Linear correlation (default)

319

- **Spearman**: Rank-based correlation for non-linear relationships

320

321

### Hypothesis Tests

322

- **Chi-square**: Independence test for categorical variables

323

- **ANOVA F-test**: Mean comparison across groups

324

- **F-value test**: Feature importance for continuous features

325

- **Kolmogorov-Smirnov**: Distribution comparison

326

327

### Summary Statistics

328

- **Central tendency**: Mean, median, mode

329

- **Variability**: Variance, standard deviation, min/max

330

- **Shape**: Skewness, kurtosis (via RDD API)

331

- **Count statistics**: Total count, non-zero count

332

333

### Applications

334

- **Exploratory Data Analysis**: Understanding data distributions and relationships

335

- **Feature Selection**: Identifying relevant features using statistical significance

336

- **Quality Assessment**: Detecting outliers and data quality issues

337

- **Model Validation**: Statistical tests for model assumptions