# Statistical Distributions

Multivariate probability distributions built for numerical stability, with support for singular covariance matrices. Designed for robust statistical computations in machine learning applications.

## Capabilities

### Multivariate Gaussian Distribution

Implementation of the multivariate normal distribution with support for degenerate (singular) covariance matrices.

```scala { .api }
/**
 * Multivariate Gaussian (Normal) Distribution
 * Handles singular covariance matrices by computing density in reduced dimensional subspace
 *
 * Note: This class is marked as @DeveloperApi in Spark MLlib
 *
 * @param mean Mean vector of the distribution
 * @param cov Covariance matrix of the distribution (must be square and same size as mean)
 */
class MultivariateGaussian(
  val mean: Vector,
  val cov: Matrix
) extends Serializable {

  /**
   * Private constructor taking Breeze types (internal use)
   * @param mean Mean vector as Breeze DenseVector
   * @param cov Covariance matrix as Breeze DenseMatrix
   */
  private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])

  /**
   * Mean vector of the distribution
   * @return Vector containing mean values for each dimension
   */
  def mean: Vector

  /**
   * Covariance matrix of the distribution
   * @return Square matrix representing covariance structure
   */
  def cov: Matrix

  /**
   * Compute probability density function at given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Probability density value (always non-negative)
   */
  def pdf(x: Vector): Double

  /**
   * Compute log probability density function at given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5, // column 1: [1.0, 0.5]
  0.5, 1.0  // column 2: [0.5, 1.0]
))

val mvn = new MultivariateGaussian(mean, cov)

// Evaluate density at specific points
val point1 = Vectors.dense(0.0, 0.0) // at mean
val point2 = Vectors.dense(1.0, 1.0) // away from mean

val density1 = mvn.pdf(point1) // Higher density (near mean)
val density2 = mvn.pdf(point2) // Lower density (away from mean)

val logDensity1 = mvn.logpdf(point1) // More numerically stable
val logDensity2 = mvn.logpdf(point2)

println(s"Density at mean: $density1")
println(s"Density at (1,1): $density2")
println(s"Log density at mean: $logDensity1")
```

### Identity Covariance

The simplest case: an identity covariance matrix, so the dimensions are uncorrelated and each has unit variance.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 3D Gaussian with identity covariance (uncorrelated)
val mean = Vectors.dense(1.0, 2.0, 3.0)
val identityCov = DenseMatrix.eye(3)

val independentMvn = new MultivariateGaussian(mean, identityCov)

// Evaluate at mean (should give highest density)
val atMean = independentMvn.pdf(mean)
val awayFromMean = independentMvn.pdf(Vectors.dense(0.0, 0.0, 0.0))

println(s"Density at mean: $atMean")
println(s"Density at origin: $awayFromMean")
```

### Singular Covariance Matrices

Degenerate covariance matrices, in which some dimensions are linearly dependent, are handled by evaluating the density in a reduced-dimensional subspace.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Singular covariance matrix (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0, // column 1
  1.0, 1.0, 1.0, // column 2 (same as column 1)
  1.0, 1.0, 1.0  // column 3 (same as column 1)
))

// Construction succeeds despite the singular covariance
val singularMvn = new MultivariateGaussian(mean, singularCov)

// Density is computed in reduced dimensional subspace
val point = Vectors.dense(1.0, 1.0, 1.0)
val density = singularMvn.pdf(point)

println(s"Density with singular covariance: $density")
```
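
To make "density in a reduced-dimensional subspace" concrete, here is a minimal plain-Scala sketch (no Spark dependency) for the all-ones covariance above. It hard-codes that matrix's eigendecomposition (a single non-zero eigenvalue, 3, with eigenvector (1, 1, 1)/√3) and assumes the normalizing constant uses the full dimension `k` together with the pseudo-determinant. `SingularGaussianSketch` and `logPdf` are illustrative names, not Spark APIs.

```scala
object SingularGaussianSketch {
  // Eigendecomposition of the 3x3 all-ones matrix, written out by hand:
  // eigenvalues are (3, 0, 0); only the eigenvector for 3 contributes.
  val eigenvalues: Array[Double] = Array(3.0, 0.0, 0.0)
  val topEigenvector: Array[Double] = Array.fill(3)(1.0 / math.sqrt(3.0))

  /** Log density restricted to the span of the non-zero eigenvalue (assumed formula). */
  def logPdf(x: Array[Double], mean: Array[Double]): Double = {
    val k = mean.length
    // Pseudo-determinant: product of the non-zero eigenvalues, taken in log space.
    val logPseudoDet = eigenvalues.filter(_ > 0.0).map(math.log).sum
    // Project the centered point onto the eigenvector, then scale by 1 / lambda.
    val projection = x.indices.map(i => (x(i) - mean(i)) * topEigenvector(i)).sum
    val quadraticForm = projection * projection / eigenvalues(0)
    // Assumption: the 2*pi term uses the full dimension k, not the rank.
    -0.5 * (k * math.log(2.0 * math.Pi) + logPseudoDet + quadraticForm)
  }
}
```

Under these assumptions, `SingularGaussianSketch.logPdf(Array(1.0, 1.0, 1.0), Array(0.0, 0.0, 0.0))` evaluates to roughly -3.806, a density of about 0.0222.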

### Advanced Usage Patterns

**Working with Large Dimensions:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// High-dimensional Gaussian (e.g., for text analysis)
val dim = 100
val mean = Vectors.zeros(dim)
val cov = DenseMatrix.eye(dim)

val highDimMvn = new MultivariateGaussian(mean, cov)

// Use logpdf for numerical stability in high dimensions
val testPoint = Vectors.dense(Array.fill(dim)(0.1))
val logDensity = highDimMvn.logpdf(testPoint)

// Avoid pdf in high dimensions: the value shrinks rapidly as dim grows
// val density = highDimMvn.pdf(testPoint) // Tiny at dim = 100; can underflow to 0.0 as dim grows
```
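
The underflow risk can be seen without Spark. The sketch below uses the closed form for a k-dimensional standard Gaussian evaluated at its mean, where the log density is `-k/2 * log(2π)`. `UnderflowSketch` and `logDensityAtMean` are hypothetical names used only for this illustration.

```scala
object UnderflowSketch {
  /** Log density of a k-dimensional standard Gaussian evaluated at its mean. */
  def logDensityAtMean(k: Int): Double = -0.5 * k * math.log(2.0 * math.Pi)

  def main(args: Array[String]): Unit = {
    for (k <- Seq(100, 500, 1000)) {
      val logDensity = logDensityAtMean(k)
      // Exponentiating recovers the density, which shrinks toward 0.0 as k grows.
      println(s"k=$k logpdf=$logDensity pdf=${math.exp(logDensity)}")
    }
  }
}
```

For k = 1000 the log density is about -918.9, well below the point near -745 where `math.exp` returns 0.0 for a `Double`, so the density collapses to zero while the log density stays usable.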

**Batch Evaluation:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val mvn = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Evaluate multiple points
val testPoints = Array(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(1.0, 1.0)
)

val densities = testPoints.map(mvn.pdf)
val logDensities = testPoints.map(mvn.logpdf)

testPoints.zip(densities).foreach { case (point, density) =>
  println(s"Point ${point.toArray.mkString("(", ", ", ")")}: density = $density")
}
```

## Internal Implementation

### Numerical Stability

The implementation relies on several techniques for numerical stability:

1. **Eigenvalue Decomposition**: Uses an eigendecomposition of the covariance matrix instead of direct matrix inversion
2. **Pseudo-Inverse**: Handles singular covariance matrices via the Moore-Penrose pseudo-inverse
3. **Tolerance-Based Filtering**: Eigenvalues below a tolerance derived from machine epsilon are treated as zero
4. **Log-Space Computation**: `logpdf` avoids numerical underflow in high dimensions

### Tolerance Calculation

```scala
// Internal tolerance calculation (not part of public API)
val tolerance = EPSILON * maxEigenvalue * matrixDimension
```

Where:
- `EPSILON`: Machine epsilon from `Utils.EPSILON`
- `maxEigenvalue`: Maximum eigenvalue of covariance matrix
- `matrixDimension`: Size of the covariance matrix
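
The filtering step can be sketched in plain Scala as follows. This is an illustration of the formula above rather than the library's code: `ToleranceSketch`, `tolerance`, and `significant` are hypothetical names, and `Epsilon` is assumed to equal the usual double-precision machine epsilon.

```scala
object ToleranceSketch {
  // Machine epsilon for Double, found by repeated halving
  // (assumed here to match the value of Utils.EPSILON).
  val Epsilon: Double = {
    var eps = 1.0
    while (1.0 + eps / 2.0 != 1.0) eps /= 2.0
    eps
  }

  /** Threshold at or below which an eigenvalue is treated as zero. */
  def tolerance(eigenvalues: Array[Double]): Double =
    Epsilon * eigenvalues.max * eigenvalues.length

  /** Eigenvalues that survive the filter. */
  def significant(eigenvalues: Array[Double]): Array[Double] = {
    val tol = tolerance(eigenvalues)
    eigenvalues.filter(_ > tol)
  }
}
```

For eigenvalues `(1.0, 1e-15)` the tolerance is about 4.4e-16, so both values are kept. For `(3.0, 1e-17, 0.0)` it is about 2e-15, and only `3.0` survives.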

### Memory Optimization

- **Lazy Evaluation**: Expensive computations run once, on first use, and are cached in a `@transient private lazy val`
- **Efficient Storage**: Uses optimized Breeze operations internally
- **Minimal Memory Footprint**: Only stores essential computed values
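
The caching pattern named in the first bullet looks roughly like this. The class below is a hypothetical stand-in (`CachedConstants`, `logSum`, and the `evaluations` counter are not part of Spark) that only demonstrates how a `@transient private lazy val` defers and caches a computation.

```scala
class CachedConstants(val values: Array[Double]) extends Serializable {
  // Counts how often the expensive block runs (for illustration only).
  @transient var evaluations: Int = 0

  // Computed on first access and then reused. @transient keeps the cached
  // value out of the serialized form, so it is recomputed after deserialization.
  @transient private lazy val logSum: Double = {
    evaluations += 1
    values.map(math.log).sum
  }

  /** Log of the product of all values, backed by the cached sum. */
  def logProduct: Double = logSum
}
```

Calling `logProduct` twice on the same instance runs the block once, and only the constructor arguments travel with the serialized object.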

## Error Handling

The `MultivariateGaussian` constructor validates its inputs:

```scala
import org.apache.spark.ml.linalg._

// These will throw IllegalArgumentException:

val mean2D = Vectors.dense(0.0, 0.0)

// Non-square covariance matrix
val badCov1 = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
// new MultivariateGaussian(mean2D, badCov1) // throws exception

// Dimension mismatch between mean and covariance
val cov3D = DenseMatrix.eye(3)
// new MultivariateGaussian(mean2D, cov3D) // throws exception

// All-zero eigenvalues (no non-zero singular values)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(mean2D, zeroCov) // may throw IllegalArgumentException
```
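
Callers that receive shapes from untrusted input may prefer to turn the exception into a value. The sketch below does this with `scala.util.Try` around a hypothetical `validate` function that mimics the two shape checks with `require`. The function names and error messages are assumptions made for illustration, not Spark's actual code or wording.

```scala
import scala.util.{Failure, Success, Try}

object ShapeCheckSketch {
  /** Hypothetical stand-in for the constructor's shape checks. */
  def validate(meanSize: Int, covRows: Int, covCols: Int): Unit = {
    // require throws IllegalArgumentException when the condition is false.
    require(covRows == covCols, "Covariance matrix must be square")
    require(meanSize == covCols, "Mean vector length must match covariance matrix size")
  }

  /** Converts a failed check into a Left carrying the error message. */
  def check(meanSize: Int, covRows: Int, covCols: Int): Either[String, Unit] =
    Try(validate(meanSize, covRows, covCols)) match {
      case Success(_) => Right(())
      case Failure(e: IllegalArgumentException) => Left(e.getMessage)
      case Failure(other) => throw other
    }
}
```

The same `Try { new MultivariateGaussian(mean, cov) }` wrapping applies to the real constructor, since `require` failures surface as `IllegalArgumentException`.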

**Valid Cases:**

```scala
// These are all valid:

// Standard non-singular covariance
val validMvn1 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Singular but non-zero covariance
val singularCov = Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))
val validMvn2 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  singularCov
)

// Very small non-zero eigenvalues (handled gracefully)
val nearSingular = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1e-15))
val validMvn3 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  nearSingular
)
```

## Mathematical Background

The multivariate Gaussian probability density function is:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:
- `k`: Dimensionality (length of mean vector)
- `μ`: Mean vector
- `Σ`: Covariance matrix
- `|Σ|`: Determinant of covariance matrix

For numerical stability, the implementation computes:

```
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
```

It uses the eigendecomposition `Σ = U * D * U^T` to obtain both the inverse and the determinant. For a singular `Σ`, these become the pseudo-inverse and the pseudo-determinant, taken over the non-zero eigenvalues only.
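
Spelled out with eigenvalues `λ_i` on the diagonal of `D` and an orthogonal `U`, the two expensive terms of `logpdf` reduce to the standard identities below. This is a sketch of the underlying algebra, not a transcription of the implementation.

```latex
\begin{aligned}
\Sigma &= U D U^{T}, \qquad D = \operatorname{diag}(\lambda_1, \dots, \lambda_k), \\
\log|\Sigma| &= \sum_{i=1}^{k} \log \lambda_i, \\
(x-\mu)^{T}\, \Sigma^{-1} (x-\mu) &= \left\lVert D^{-1/2}\, U^{T} (x-\mu) \right\rVert^{2}.
\end{aligned}
```

When some `λ_i` fall below the tolerance, the sum runs over the remaining eigenvalues only and the corresponding entries of `D^(-1/2)` are set to zero, which gives the pseudo-determinant and pseudo-inverse forms.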

## Integration with Spark MLlib

Within Spark MLlib, `MultivariateGaussian` is the component distribution of Gaussian mixture models. Its density functions are also a building block for related tasks:

- **Gaussian Mixture Models**: Component distributions (exposed as `GaussianMixtureModel.gaussians`)
- **Naive Bayes**: Class-conditional distributions
- **Anomaly Detection**: Outlier scoring
- **Dimensionality Reduction**: Principal component analysis
- **Clustering**: Gaussian-based clustering algorithms