# Statistical Distributions

Multivariate statistical distributions for probabilistic modeling and machine learning applications. The implementation handles edge cases such as singular covariance matrices.

## Capabilities

### Multivariate Gaussian Distribution

An implementation of the multivariate normal distribution that supports singular covariance matrices through pseudo-inverse computation.

```scala { .api }
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  /** Returns density of this multivariate Gaussian at given point */
  def pdf(x: Vector): Double

  /** Returns the log-density of this multivariate Gaussian at given point */
  def logpdf(x: Vector): Double
}
```

Usage examples:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // Covariance matrix: [[1.0, 0.5],
  0.5, 1.0   //                     [0.5, 1.0]]
))

val mvGaussian = new MultivariateGaussian(mean, cov)

// Evaluate probability density
val point1 = Vectors.dense(0.0, 0.0)  // At mean
val point2 = Vectors.dense(1.0, 1.0)  // Away from mean

val density1 = mvGaussian.pdf(point1)  // Higher density at mean
val density2 = mvGaussian.pdf(point2)  // Lower density away from mean

val logDensity1 = mvGaussian.logpdf(point1)  // Log-density (numerically stable)
val logDensity2 = mvGaussian.logpdf(point2)
```
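As a quick sanity check on the example above, the density at the mean of a 2D Gaussian has the closed form `1 / (2π * sqrt(det Σ))`. The following sketch compares that value with `pdf` and confirms that `pdf` and `logpdf` agree. The names `gaussian`, `expectedAtMean`, `pdfAtMean`, and `logPdfAtMean` are illustrative only.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(1.0, 0.5, 0.5, 1.0))
val gaussian = new MultivariateGaussian(mean, cov)

// Closed form at x = mean for k = 2: 1 / (2π * sqrt(det Σ)),
// where det Σ = 1.0 * 1.0 - 0.5 * 0.5 = 0.75
val expectedAtMean = 1.0 / (2.0 * math.Pi * math.sqrt(1.0 * 1.0 - 0.5 * 0.5))

val pdfAtMean = gaussian.pdf(mean)
val logPdfAtMean = gaussian.logpdf(mean)

println(s"pdf at mean = $pdfAtMean, closed form = $expectedAtMean")
println(s"exp(logpdf) = ${math.exp(logPdfAtMean)}")
```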

### Advanced Usage

#### Singular Covariance Matrices

The implementation handles singular (non-invertible) covariance matrices by computing the pseudo-inverse and working in the reduced-dimensional subspace where the distribution is supported.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Singular covariance matrix (rank deficient)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 0.0,  // Rows 1 and 2 are identical -> rank = 2
  1.0, 1.0, 0.0,
  0.0, 0.0, 1.0
))

val mean = Vectors.dense(0.0, 0.0, 0.0)
val mvGaussian = new MultivariateGaussian(mean, singularCov)

// Still works correctly with singular covariance
val point = Vectors.dense(1.0, 1.0, 0.5)
val density = mvGaussian.pdf(point)
val logDensity = mvGaussian.logpdf(point)
```
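The result for this example can be worked out by hand. The covariance matrix has eigenvalues 2, 1, and 0, so the pseudo-determinant is 2, and the point `(1.0, 1.0, 0.5)` contributes `1.0 + 0.25` to the quadratic form. The sketch below compares that hand computation with `logpdf`. It assumes the normalizing term uses the full dimension `k = 3` rather than the rank; treat that convention as an assumption to verify against your Spark version.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 0.0,
  1.0, 1.0, 0.0,
  0.0, 0.0, 1.0
))
val gaussian = new MultivariateGaussian(Vectors.dense(0.0, 0.0, 0.0), singularCov)
val point = Vectors.dense(1.0, 1.0, 0.5)

// Hand computation: only the nonzero eigenvalues (2 and 1) take part.
val logPseudoDet = math.log(2.0) + math.log(1.0)
// Squared projection onto (1, 1, 0)/sqrt(2) is 2, divided by eigenvalue 2;
// squared projection onto (0, 0, 1) is 0.25, divided by eigenvalue 1.
val quadraticForm = 2.0 / 2.0 + 0.25 / 1.0
// Assumption: the (2π) term uses the full dimension k = 3.
val expectedLogPdf = -0.5 * (3 * math.log(2.0 * math.Pi) + logPseudoDet + quadraticForm)

val actualLogPdf = gaussian.logpdf(point)
println(s"expected = $expectedLogPdf, actual = $actualLogPdf")
```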
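
#### High-Dimensional Distributions

The same API applies to high-dimensional multivariate Gaussians. Because density values shrink rapidly as the dimension grows, prefer `logpdf` over `pdf` to avoid numerical underflow.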

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
import java.util.Random

val dim = 100
val rng = new Random(42)

// Create high-dimensional Gaussian
val mean = Vectors.dense(Array.fill(dim)(0.0))

// Build a diagonal covariance matrix (stored densely; entry (i, i) is at index i * dim + i)
val covValues = Array.fill(dim * dim)(0.0)
for (i <- 0 until dim) {
  covValues(i * dim + i) = 1.0 + rng.nextGaussian() * 0.1  // Diagonal elements
}
val cov = Matrices.dense(dim, dim, covValues)

val mvGaussian = new MultivariateGaussian(mean, cov)

// Evaluate at a random point
val testPoint = Vectors.dense(Array.fill(dim)(rng.nextGaussian()))
val density = mvGaussian.pdf(testPoint)
val logDensity = mvGaussian.logpdf(testPoint)  // Preferred for numerical stability
```
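To illustrate why the log-density is preferred, the following sketch uses a 1000-dimensional standard Gaussian. Its density at the mean is `(2π)^(-500)`, which is smaller than the smallest positive `Double`, while the log-density stays finite. The dimension and the names `highDimGaussian`, `pdfAtOrigin`, and `logPdfAtOrigin` are chosen for illustration.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val dim = 1000
val origin = Vectors.dense(Array.fill(dim)(0.0))
val highDimGaussian = new MultivariateGaussian(origin, Matrices.eye(dim))

// The true density (2π)^(-500) is far below Double's smallest positive value,
// so the Double result underflows.
val pdfAtOrigin = highDimGaussian.pdf(origin)

// The log-density, -(dim / 2) * log(2π), is an ordinary finite number.
val logPdfAtOrigin = highDimGaussian.logpdf(origin)

println(s"pdf = $pdfAtOrigin, logpdf = $logPdfAtOrigin")
```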

#### Integration with Breeze

The class also has an internal constructor that accepts Breeze vectors and matrices, but it is `private[ml]` and cannot be called from user code. To interoperate with Breeze, convert the Breeze data to MLlib types before constructing the distribution.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM}

// Breeze inputs
val breezeMean = BDV(1.0, 2.0)
val breezeCov = BDM((1.0, 0.3), (0.3, 1.0))

// Note: This constructor is private[ml], shown for completeness
// val mvGaussian = new MultivariateGaussian(breezeMean, breezeCov)

// Convert to MLlib types. Vectors.fromBreeze and Matrices.fromBreeze are
// package-private in Spark, so copy the data through arrays instead
// (both libraries store dense matrices in column-major order).
val mean = Vectors.dense(breezeMean.toArray)
val cov = Matrices.dense(breezeCov.rows, breezeCov.cols, breezeCov.toArray)
val mvGaussian = new MultivariateGaussian(mean, cov)
```

### Mathematical Background

The multivariate Gaussian distribution has the probability density function:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)ᵀ * Σ⁻¹ * (x-μ))
```
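Taking the natural logarithm of this density gives the log-density, which is the quantity `logpdf` returns. The second line sketches the singular case under assumptions not spelled out in the text: the determinant is replaced by the pseudo-determinant `|Σ|₊` (the product of the nonzero eigenvalues), the inverse by the pseudo-inverse `Σ⁺`, and `k` remains the full dimension.

```latex
\log \operatorname{pdf}(x) = -\frac{k}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)

\log \operatorname{pdf}(x) = -\frac{k}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma|_{+} - \frac{1}{2}(x-\mu)^{\top}\Sigma^{+}(x-\mu)
```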

Where:
- `k` is the dimensionality
- `μ` is the mean vector
- `Σ` is the covariance matrix
- `|Σ|` is the determinant of the covariance matrix

The implementation:
- Uses eigendecomposition for numerical stability
- Computes pseudo-determinant and pseudo-inverse for singular matrices
- Applies tolerance-based filtering of singular values
- Supports both PDF and log-PDF computation
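The steps listed above can be sketched directly with Breeze. This is a minimal illustration, not Spark's actual code: the helper `sketchLogPdf`, the tolerance formula (machine epsilon times the largest eigenvalue times the dimension), and the use of the full dimension `k` in the normalizing term are all assumptions made for the example.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, eigSym, max}

// Hypothetical helper: log-density of a possibly singular Gaussian.
def sketchLogPdf(
    x: DenseVector[Double],
    mu: DenseVector[Double],
    sigma: DenseMatrix[Double]): Double = {
  // Eigendecomposition of the symmetric covariance: sigma = u * diag(d) * u.t
  val es = eigSym(sigma)
  val d = es.eigenvalues
  val u = es.eigenvectors

  // Assumed tolerance: eigenvalues at or below it are treated as zero.
  val tol = math.ulp(1.0) * max(d) * d.length
  val kept = (0 until d.length).filter(i => d(i) > tol)

  // Log pseudo-determinant: sum of logs of the retained eigenvalues.
  val logPseudoDet = kept.map(i => math.log(d(i))).sum

  // Quadratic form with the pseudo-inverse, evaluated in the eigenbasis.
  val proj = u.t * (x - mu)
  val quadraticForm = kept.map(i => proj(i) * proj(i) / d(i)).sum

  -0.5 * (x.length * math.log(2.0 * math.Pi) + logPseudoDet + quadraticForm)
}

// Usage: a rank-1 covariance in two dimensions, evaluated at the mean.
val rankOne = DenseMatrix((1.0, 1.0), (1.0, 1.0))
val zero = DenseVector(0.0, 0.0)
val logDensityAtMean = sketchLogPdf(zero, zero, rankOne)
println(s"log-density at mean = $logDensityAtMean")
```

Working in the eigenbasis means the pseudo-inverse is never formed explicitly: each retained eigenvalue divides the squared coordinate along its eigenvector, and directions with a zero eigenvalue are skipped.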

## Error Handling

Invalid arguments are rejected with an `IllegalArgumentException`. The common cases are:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// These will throw appropriate exceptions:

// Mismatched dimensions
val mean = Vectors.dense(1.0, 2.0)
val wrongCov = Matrices.dense(3, 3, Array.fill(9)(1.0))
// val mvGaussian = new MultivariateGaussian(mean, wrongCov) // IllegalArgumentException

// Non-square covariance matrix
val nonSquareCov = Matrices.dense(2, 3, Array.fill(6)(1.0))
// val mvGaussian = new MultivariateGaussian(mean, nonSquareCov) // IllegalArgumentException

// Zero covariance matrix (all eigenvalues are zero)
val zeroCov = Matrices.zeros(2, 2)
// val mvGaussian = new MultivariateGaussian(mean, zeroCov) // IllegalArgumentException
```
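When inputs come from user data, it can be convenient to capture the failure instead of letting it propagate. The following sketch wraps construction in `scala.util.Try` for the mismatched-dimensions case. The names `attempt` and `outcome` are illustrative.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val mean = Vectors.dense(1.0, 2.0)
val wrongCov = Matrices.dense(3, 3, Array.fill(9)(1.0))

// The mean has 2 entries but the covariance is 3x3, so construction fails.
val attempt = Try(new MultivariateGaussian(mean, wrongCov))

val outcome = attempt match {
  case Success(_) => "constructed"
  case Failure(e: IllegalArgumentException) => s"rejected: ${e.getMessage}"
  case Failure(e) => s"unexpected failure: ${e.getClass.getName}"
}
println(outcome)
```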

## Type Definitions

```scala { .api }
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  require(cov.numCols == cov.numRows, "Covariance matrix must be square")
  require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")

  def pdf(x: Vector): Double
  def logpdf(x: Vector): Double
}
```