# Statistical Distributions

Multivariate statistical distributions with numerically robust implementations that handle edge cases such as singular covariance matrices. The library computes probability densities and log-densities with safeguards for numerical stability, which suits machine learning workloads.

## Capabilities

### Multivariate Gaussian Distribution

An implementation of the multivariate normal distribution. Singular covariance matrices are supported by computing a pseudo-inverse and evaluating the density in the reduced-dimensional subspace where the distribution has support.

```scala { .api }
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable {
  val mean: Vector
  val cov: Matrix
  def pdf(x: Vector): Double
  def logpdf(x: Vector): Double
}
```

**Usage examples:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = DenseMatrix.eye(2) // Identity covariance matrix
val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate probability density at a point
val x = Vectors.dense(1.0, 1.0)
val density = gaussian.pdf(x)
val logDensity = gaussian.logpdf(x)

println(s"PDF at $x: $density")
println(s"Log PDF at $x: $logDensity")

// Create distribution with custom covariance
// (Matrices.dense reads values in column-major order; this matrix is symmetric)
val customCov = Matrices.dense(2, 2, Array(
  1.0, 0.5,
  0.5, 2.0
))
val correlatedGaussian = new MultivariateGaussian(mean, customCov)

// Evaluate at multiple points
val points = Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(1.0, 1.0)
)

points.foreach { point =>
  val prob = correlatedGaussian.pdf(point)
  println(s"PDF at ${point.toArray.mkString("[", ", ", "]")}: $prob")
}
```

### Singular Covariance Matrix Handling

The implementation handles singular (non-invertible) covariance matrices by using the pseudo-inverse in place of the inverse and evaluating the density in the reduced-dimensional subspace spanned by the eigenvectors with non-zero eigenvalues.

**Working with singular covariance matrices:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a singular covariance matrix (rank deficient: all rows are identical)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,
  1.0, 1.0, 1.0,
  1.0, 1.0, 1.0
))

val mean = Vectors.dense(0.0, 0.0, 0.0)

// The MultivariateGaussian handles singular matrices automatically
val singularGaussian = new MultivariateGaussian(mean, singularCov)

// Density computation works in the reduced subspace
val x = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(x)
val logDensity = singularGaussian.logpdf(x)

println(s"PDF with singular covariance: $density")
println(s"Log PDF with singular covariance: $logDensity")
```

### Constructor Variants

The public constructor accepts any `Vector` and `Matrix` implementation, dense or sparse. A second constructor that takes Breeze types is private to the `ml` package.

```scala { .api }
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable {
  // Private constructor for Breeze types (internal use)
  private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])
}
```

**Different initialization patterns:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Standard initialization with Spark vectors/matrices
val mean1 = Vectors.dense(1.0, 2.0)
val cov1 = DenseMatrix.eye(2)
val gaussian1 = new MultivariateGaussian(mean1, cov1)

// With sparse mean vector
val sparseMean = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val cov2 = DenseMatrix.eye(3)
val gaussian2 = new MultivariateGaussian(sparseMean, cov2)

// With sparse covariance matrix
val sparseIdentity = SparseMatrix.speye(2)
val gaussian3 = new MultivariateGaussian(mean1, sparseIdentity)
```

## Mathematical Background

### Probability Density Function

The multivariate Gaussian PDF is computed as:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:
- `k` is the dimensionality (length of the mean vector)
- `μ` is the mean vector
- `Σ` is the covariance matrix
- `|Σ|` is the determinant of the covariance matrix
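
To make the formula concrete, here is a minimal sketch that evaluates it by hand for the 2D case, where the inverse and determinant have closed forms. It uses plain Scala arrays rather than the library's types, and the `Pdf2D` object and its parameter names are hypothetical, chosen only for illustration.

```scala
object Pdf2D {
  /** Density of a 2D Gaussian with covariance [[a, b], [b, d]], evaluated from the formula above. */
  def pdf(x: Array[Double], mu: Array[Double], a: Double, b: Double, d: Double): Double = {
    val det = a * d - b * b // |Σ| for a symmetric 2x2 matrix
    require(det > 0.0, "this sketch only covers an invertible covariance")
    val dx = x(0) - mu(0)
    val dy = x(1) - mu(1)
    // (x-μ)^T Σ^(-1) (x-μ), using the closed-form inverse of a 2x2 matrix
    val quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    val k = 2
    math.pow(2.0 * math.Pi, -k / 2.0) * math.pow(det, -0.5) * math.exp(-0.5 * quad)
  }
}
```

For the identity covariance and zero mean, `Pdf2D.pdf(Array(1.0, 1.0), Array(0.0, 0.0), 1.0, 0.0, 1.0)` reduces to `exp(-1) / (2π)`, roughly 0.0585, which is a handy reference value when checking results.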

### Log Probability Density Function

Working in log space avoids floating-point underflow for points far from the mean, so the log PDF is computed directly as:

```
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
```
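
The stability benefit is easiest to see far from the mean. The sketch below evaluates the log-density formula for a standard Gaussian (zero mean, identity covariance) in plain Scala and then exponentiates it. The `LogSpaceDemo` object is a hypothetical helper for illustration, not part of the library.

```scala
object LogSpaceDemo {
  /** Log-density of a standard k-dimensional Gaussian (μ = 0, Σ = I), from the formula above. */
  def logpdfStandard(x: Array[Double]): Double = {
    val k = x.length
    val quad = x.map(v => v * v).sum // (x-μ)^T Σ^(-1) (x-μ) with μ = 0, Σ = I
    -0.5 * k * math.log(2.0 * math.Pi) - 0.5 * quad // log|Σ| = 0 for the identity
  }

  val farPoint: Array[Double] = Array(40.0, 40.0)
  val logDensity: Double = logpdfStandard(farPoint) // finite, about -1601.8
  val density: Double = math.exp(logDensity)        // underflows to 0.0 in double precision
}
```

Comparing or summing log-densities keeps such points distinguishable, whereas their plain densities all collapse to zero.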

### Singular Covariance Handling

For singular covariance matrices, the implementation:

1. Computes the eigendecomposition `Σ = U * D * U^T`
2. Identifies the non-zero eigenvalues using a numerical tolerance
3. Computes the pseudo-determinant as the product of the non-zero eigenvalues
4. Uses the pseudo-inverse in the quadratic form
5. Operates in the reduced-dimensional subspace where the distribution is supported
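
The steps above can be sketched in plain Scala once an eigendecomposition is available. This is an illustration under stated assumptions, not the library's code: the eigenvalues and unit-length eigenvectors are passed in precomputed, `math.ulp(1.0)` stands in for the machine-precision constant, the normalization uses the full dimension `k` as in the formulas above, and the `SingularGaussianSketch` object is a hypothetical name.

```scala
object SingularGaussianSketch {
  // Machine epsilon for Double; an assumed stand-in for the tolerance constant
  val Eps: Double = math.ulp(1.0)

  /**
   * Log-density from a precomputed eigendecomposition Σ = U * D * U^T (step 1).
   * eigenvalues(i) pairs with eigenvectors(i), a unit-length eigenvector of Σ.
   */
  def logpdf(x: Array[Double], mean: Array[Double],
             eigenvalues: Array[Double], eigenvectors: Array[Array[Double]]): Double = {
    val k = mean.length
    // Step 2: only eigenvalues above the tolerance count as non-zero
    val tol = Eps * eigenvalues.max * k
    val kept = eigenvalues.zip(eigenvectors).filter { case (lambda, _) => lambda > tol }
    // Step 3: log of the pseudo-determinant (sum of logs of the kept eigenvalues)
    val logPseudoDet = kept.map { case (lambda, _) => math.log(lambda) }.sum
    // Steps 4-5: quadratic form with the pseudo-inverse, restricted to the kept subspace
    val delta = x.zip(mean).map { case (xi, mi) => xi - mi }
    val quad = kept.map { case (lambda, u) =>
      val proj = u.zip(delta).map { case (ui, di) => ui * di }.sum
      proj * proj / lambda
    }.sum
    -0.5 * (k * math.log(2.0 * math.Pi) + logPseudoDet) - 0.5 * quad
  }
}
```

For the 3x3 all-ones covariance, the only non-zero eigenvalue is 3 with eigenvector `(1, 1, 1) / √3`, so at `x = (1, 1, 1)` with zero mean the quadratic form is 1 and the pseudo-determinant is 3.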

## Numerical Stability Features

### Tolerance-Based Eigenvalue Filtering

The implementation decides which eigenvalues count as non-zero using a tolerance scaled by machine precision, the largest eigenvalue, and the dimensionality:

```scala
val tolerance = EPSILON * max(eigenvalues) * dimensionality
```

Eigenvalues at or below the tolerance are treated as zero, which prevents the numerical instability that inverting very small eigenvalues would cause.
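
As a rough illustration of the filter, the sketch below applies the same expression to a plain array of eigenvalues. It assumes `EPSILON` is the double-precision machine epsilon (`math.ulp(1.0)`, about 2.2e-16); the `ToleranceSketch` object and `nonZeroEigenvalues` helper are hypothetical names.

```scala
object ToleranceSketch {
  // Gap between 1.0 and the next representable Double.
  // Assumed here to play the role of EPSILON in the expression above.
  val Epsilon: Double = math.ulp(1.0)

  /** Returns the tolerance together with the eigenvalues that exceed it. */
  def nonZeroEigenvalues(eigenvalues: Array[Double]): (Double, Array[Double]) = {
    val tolerance = Epsilon * eigenvalues.max * eigenvalues.length
    (tolerance, eigenvalues.filter(_ > tolerance))
  }
}
```

With eigenvalues `Array(3.0, 4.0e-17, 0.0)` the tolerance is about 2.0e-15, so only `3.0` survives. The value `4.0e-17` is most likely rounding noise, and inverting it would otherwise contribute a factor of order 1e16 to the quadratic form.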

### Efficient Matrix Operations

The internal implementation optimizes matrix operations:

- Uses eigendecomposition rather than explicit matrix inversion, which is more numerically stable
- Computes the pseudo-inverse through eigenvalue manipulation
- Evaluates distribution-dependent constants lazily
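
### Error Handling

The constructor validates its inputs, and failures are reported as `IllegalArgumentException` with a descriptive message:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Dimension validation performed inside the constructor:
//   require(cov.numCols == cov.numRows, "Covariance matrix must be square")
//   require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")

// Singular matrix handling
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array.fill(9)(1.0))
val x = Vectors.dense(1.0, 1.0, 1.0)

try {
  val gaussian = new MultivariateGaussian(mean, singularCov)
  val density = gaussian.pdf(x)
  println(s"PDF: $density")
} catch {
  // Raised for invalid dimensions, or with the message
  // "Covariance matrix has no non-zero singular values"
  case e: IllegalArgumentException =>
    println(s"Invalid distribution parameters: ${e.getMessage}")
}
```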

## Performance Considerations

### Lazy Computation

Distribution-dependent constants are computed lazily and cached:

- Eigendecomposition is performed once, on the first `pdf` or `logpdf` call
- Matrix square roots and determinants are cached
- Intermediate computations are reused across multiple evaluations
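
Scala's `lazy val` is one way to get this compute-once behaviour. The sketch below is a deliberately small 1D stand-in, not the library's code: the `CachedConstant` class and its `computations` counter are hypothetical, and exist only to show that the constant is derived on first use and reused afterwards.

```scala
object LazyCacheSketch {
  /** Hypothetical 1D Gaussian (mean 0) with one expensive derived constant. */
  class CachedConstant(variance: Double) {
    var computations = 0 // counts how often the constant is derived

    // Evaluated on first access only; later accesses reuse the stored value
    private lazy val logNorm: Double = {
      computations += 1
      -0.5 * math.log(2.0 * math.Pi * variance)
    }

    // Log-density that reuses the cached normalization constant
    def logpdf(x: Double): Double = logNorm - 0.5 * x * x / variance
  }
}
```

Constructing the object costs nothing beyond storing `variance`; after any number of `logpdf` calls the counter reads 1. The trade-off is that the first evaluation pays the full setup cost.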

### Memory Efficiency

- Uses efficient eigendecomposition instead of full matrix operations
- Minimal memory footprint for distribution parameters
- Optimized for repeated evaluations with the same distribution

### Integration with BLAS

The implementation leverages optimized BLAS operations:

- Matrix-vector multiplications use BLAS Level 2 routines
- Eigendecomposition uses optimized linear algebra libraries
- Vector operations benefit from vectorized implementations
220
## Common Use Cases
221
222
### Density Estimation
223
224
```scala
225
val samples = Seq(
226
Vectors.dense(1.2, 0.8),
227
Vectors.dense(0.9, 1.1),
228
Vectors.dense(1.1, 0.9)
229
)
230
231
val gaussian = new MultivariateGaussian(mean, cov)
232
233
// Compute likelihood of samples
234
val likelihoods = samples.map(gaussian.pdf)
235
val logLikelihoods = samples.map(gaussian.logpdf)
236
237
// Total log-likelihood
238
val totalLogLikelihood = logLikelihoods.sum
239
```
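
The example above takes the mean and covariance as given. When they have to be estimated from the samples themselves, one common choice is the sample mean and the unbiased sample covariance. The sketch below computes both in plain Scala; the `MomentEstimates` object is a hypothetical helper, not something the library provides.

```scala
object MomentEstimates {
  /** Sample mean of equally sized rows. */
  def mean(rows: Seq[Array[Double]]): Array[Double] = {
    val k = rows.head.length
    Array.tabulate(k)(j => rows.map(_(j)).sum / rows.length)
  }

  /** Unbiased sample covariance (divides by n - 1), as nested arrays. */
  def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    require(rows.length > 1, "need at least two samples")
    val mu = mean(rows)
    val k = mu.length
    Array.tabulate(k, k) { (i, j) =>
      rows.map(r => (r(i) - mu(i)) * (r(j) - mu(j))).sum / (rows.length - 1)
    }
  }
}
```

The results can be wrapped with `Vectors.dense(mu)` and `Matrices.dense(k, k, cov.flatten)`; because a covariance matrix is symmetric, the row-major versus column-major ordering does not matter. For the three samples above, both coordinates always sum to 2.0, so the estimated covariance has rank 1, which is exactly the singular case the distribution is designed to handle.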

### Anomaly Detection

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val normalDataMean = Vectors.dense(0.0, 0.0)
val normalDataCov = DenseMatrix.eye(2)
val normalModel = new MultivariateGaussian(normalDataMean, normalDataCov)

// A point is anomalous when its log-density falls below the threshold
def isAnomalous(point: Vector, threshold: Double): Boolean = {
  val logProb = normalModel.logpdf(point)
  logProb < threshold
}

val testPoint = Vectors.dense(3.0, 3.0)
val anomalous = isAnomalous(testPoint, -5.0)
```

### Gaussian Mixture Components

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Component distributions for a mixture model
val component1 = new MultivariateGaussian(
  Vectors.dense(-1.0, -1.0),
  DenseMatrix.eye(2)
)

val component2 = new MultivariateGaussian(
  Vectors.dense(1.0, 1.0),
  Matrices.dense(2, 2, Array(2.0, 0.5, 0.5, 2.0))
)

// Evaluate mixture probability; the caller supplies one weight per component
def mixturePdf(x: Vector, weights: Array[Double]): Double = {
  require(weights.length == 2, "expected one weight per component")
  weights(0) * component1.pdf(x) + weights(1) * component2.pdf(x)
}
```
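
When the component densities are tiny, summing weighted `pdf` values can underflow to zero. A common alternative is to combine the components' `logpdf` values with the log-sum-exp trick. The sketch below shows that combination in plain Scala; `MixtureLogDensity` and `logMixture` are hypothetical names, and the weights are assumed to be strictly positive.

```scala
object MixtureLogDensity {
  /** log(sum_i w_i * exp(l_i)), computed stably from component log-densities l_i. */
  def logMixture(weights: Array[Double], componentLogPdfs: Array[Double]): Double = {
    require(weights.length == componentLogPdfs.length, "one weight per component")
    require(weights.forall(_ > 0.0), "this sketch assumes strictly positive weights")
    val terms = weights.zip(componentLogPdfs).map { case (w, l) => math.log(w) + l }
    val m = terms.max
    // Subtracting the maximum keeps every exponent <= 0, so exp cannot overflow
    // and the largest term contributes exactly exp(0) = 1
    m + math.log(terms.map(t => math.exp(t - m)).sum)
  }
}
```

With the components above, the inputs would be `Array(component1.logpdf(x), component2.logpdf(x))`. For log-densities of `-1000.0` and `-1001.0` with equal weights, the direct sum of densities is 0.0 in double precision, while `logMixture` returns a finite value near -1000.38.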