# Statistical Distributions

Multivariate statistical distributions for machine learning applications, with support for probability density functions and log-probability calculations. Includes robust handling of singular covariance matrices via pseudo-inverse techniques.

## Capabilities

### Multivariate Gaussian Distribution

Multivariate Gaussian (Normal) distribution supporting both regular and degenerate (singular) covariance matrices through pseudo-inverse computation.

```scala { .api }
/**
 * Multivariate Gaussian (Normal) distribution.
 * Handles singular covariance matrices using reduced-dimensional subspace computation.
 * @param mean mean vector of the distribution
 * @param cov covariance matrix of the distribution (must be square and the same size as mean)
 */
@DeveloperApi
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {

  /**
   * Returns the density of this multivariate Gaussian at the given point.
   * @param x point at which to evaluate the density
   * @return probability density value
   */
  def pdf(x: Vector): Double

  /**
   * Returns the log-density of this multivariate Gaussian at the given point.
   * @param x point at which to evaluate the log-density
   * @return log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // Covariance matrix: [[1.0, 0.5],
  0.5, 2.0   //                     [0.5, 2.0]]
))

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate the density at different points
val point1 = Vectors.dense(0.0, 0.0)   // At the mean
val point2 = Vectors.dense(1.0, 1.0)   // Away from the mean
val point3 = Vectors.dense(-1.0, 2.0)  // A different direction

val density1 = gaussian.pdf(point1)  // Highest density (at the mean)
val density2 = gaussian.pdf(point2)  // Lower density
val density3 = gaussian.pdf(point3)  // Even lower density

// Log-densities (more numerically stable for extreme values)
val logDensity1 = gaussian.logpdf(point1)  // log(density1)
val logDensity2 = gaussian.logpdf(point2)  // log(density2)
val logDensity3 = gaussian.logpdf(point3)  // log(density3)

println(s"PDF at mean: $density1")         // ~0.120 (1/(2π√det(cov)), det = 1.75)
println(s"PDF at (1,1): $density2")        // Lower value
println(s"Log-PDF at mean: $logDensity1")  // ~-2.118
```

### Singular Covariance Handling

The implementation robustly handles singular (non-invertible) covariance matrices using pseudo-inverse techniques based on eigenvalue decomposition.

```scala { .api }
// Internal singular-value handling (conceptual, not directly accessible)
private def calculateCovarianceConstants: (BDM[Double], Double) = {
  // 1. Eigendecomposition: cov = U * D * U^T
  // 2. Filter eigenvalues below tolerance: tol = EPSILON * max(eigenvalues) * dimension
  // 3. Compute the pseudo-determinant from the non-zero eigenvalues
  // 4. Compute the pseudo-inverse square root: D^(-1/2) for non-zero eigenvalues
  // 5. Return (D^(-1/2) * U^T, log_normalizer)
}
```
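
The steps above can be sketched numerically in plain Scala (no Spark or Breeze required). This is only an illustrative sketch, not the actual implementation: it uses a hypothetical rank-1 2×2 covariance whose eigenvalues can be computed analytically, and `eps` stands in for Spark's internal `EPSILON` constant.

```scala
// Hypothetical rank-1 covariance [[a, b], [b, c]] = [[1, 1], [1, 1]]
val (a, b, c) = (1.0, 1.0, 1.0)
val d = 2  // dimension

// Step 1: eigenvalues of a symmetric 2x2 matrix, computed analytically
val mid = (a + c) / 2.0
val off = math.sqrt(math.pow((a - c) / 2.0, 2) + b * b)
val eigs = Seq(mid + off, mid - off)  // 2.0 and 0.0

// Step 2: tolerance below which eigenvalues are treated as zero
val eps = 2.220446049250313e-16  // machine epsilon; stand-in for Spark's EPSILON
val tol = eps * eigs.max * d

// Step 3: pseudo-determinant = product of the eigenvalues above the tolerance
val kept = eigs.filter(_ > tol)
val logPseudoDet = kept.map(math.log).sum  // ln(2)

// Step 4: pseudo-inverse square-root entries: 1/sqrt(lambda) for kept eigenvalues
val invSqrt = eigs.map(l => if (l > tol) 1.0 / math.sqrt(l) else 0.0)

println(s"eigenvalues: $eigs, log pseudo-determinant: $logPseudoDet")
```

For this rank-1 matrix the eigenvalues are 2 and 0; the zero eigenvalue falls below the tolerance and is excluded, so the density is effectively computed in a one-dimensional subspace with pseudo-determinant 2.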

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Example 1: Singular covariance (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,  // All rows/columns are linearly dependent:
  1.0, 1.0, 1.0,  // rank = 1, not full rank
  1.0, 1.0, 1.0
))

val singularGaussian = new MultivariateGaussian(mean, singularCov)

// Still works! Uses the pseudo-inverse internally
val testPoint = Vectors.dense(0.5, 0.5, 0.5)
val density = singularGaussian.pdf(testPoint)        // Computed in the reduced subspace
val logDensity = singularGaussian.logpdf(testPoint)  // More stable computation

// Example 2: Nearly singular covariance
val mean2 = Vectors.dense(1.0, 2.0)
val nearlySingularCov = Matrices.dense(2, 2, Array(
  1.0, 0.9999,  // Very high correlation, nearly singular
  0.9999, 1.0
))

val nearSingularGaussian = new MultivariateGaussian(mean2, nearlySingularCov)
val point = Vectors.dense(1.1, 2.1)
val stableDensity = nearSingularGaussian.pdf(point)  // Handles the ill-conditioning

// Example 3: Diagonal covariance (well-conditioned)
val diagonalCov = Matrices.dense(2, 2, Array(
  2.0, 0.0,  // Independent dimensions
  0.0, 3.0
))

val diagonalGaussian = new MultivariateGaussian(mean2, diagonalCov)
val fastDensity = diagonalGaussian.pdf(point)  // Efficient computation
```

### Construction Patterns

Different ways to create multivariate Gaussian distributions for common use cases.

```scala { .api }
// Standard constructor
new MultivariateGaussian(mean: Vector, cov: Matrix)

// Internal Breeze constructor (not typically used directly)
private[ml] def this(mean: BDV[Double], cov: BDM[Double])
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices, DenseMatrix}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 1. Isotropic Gaussian (spherical covariance): σ² = 0.5 in every dimension
val isotropicMean = Vectors.dense(0.0, 0.0, 0.0)
val isotropicCov = Matrices.diag(Vectors.dense(0.5, 0.5, 0.5))

val isotropicGaussian = new MultivariateGaussian(isotropicMean, isotropicCov)

// 2. Diagonal Gaussian (independent dimensions with different variances)
val diagMean = Vectors.dense(1.0, -1.0, 2.0)
val diagCov = DenseMatrix.diag(Vectors.dense(1.0, 2.0, 0.5))  // Different variances

val diagonalGaussian = new MultivariateGaussian(diagMean, diagCov)

// 3. Full-covariance Gaussian (correlated dimensions)
val corrMean = Vectors.dense(0.0, 0.0)
val corrCov = Matrices.dense(2, 2, Array(
  2.0, 1.0,  // Positive correlation
  1.0, 2.0
))

val correlatedGaussian = new MultivariateGaussian(corrMean, corrCov)

// 4. From empirical data (conceptual: you'd compute these from data)
val dataMean = Vectors.dense(2.5, 1.8)     // Sample mean
val dataCov = Matrices.dense(2, 2, Array(  // Sample covariance
  1.2, 0.3,
  0.3, 0.8
))

val empiricalGaussian = new MultivariateGaussian(dataMean, dataCov)

// Evaluate the distributions at test points of matching dimension
// (Spark ML Vectors have no slice method, so use separate 3D and 2D points)
val testPoint3d = Vectors.dense(0.0, 0.0, 0.0)
val testPoint2d = Vectors.dense(0.0, 0.0)
println(s"Isotropic density: ${isotropicGaussian.pdf(testPoint3d)}")
println(s"Diagonal density: ${diagonalGaussian.pdf(testPoint3d)}")
println(s"Correlated density: ${correlatedGaussian.pdf(testPoint2d)}")
```

### Mathematical Properties

Understanding the mathematical foundation for effective usage.

```scala { .api }
// Probability density function (conceptual)
// pdf(x) = (2π)^(-k/2) * det(Σ)^(-1/2) * exp(-0.5 * (x-μ)^T * Σ^(-1) * (x-μ))
// where:
//   k      = dimension
//   μ      = mean vector
//   Σ      = covariance matrix
//   det(Σ) = determinant of the covariance matrix

// Log probability density (more numerically stable)
// logpdf(x) = -0.5 * [k*log(2π) + log(det(Σ)) + (x-μ)^T * Σ^(-1) * (x-μ)]
```
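
To make the formula concrete, here is a direct evaluation in plain Scala (no Spark) for k = 2 with identity covariance, where det(Σ) = 1 and the quadratic form reduces to the squared Euclidean distance from the mean. The `pdf`/`logpdf` helpers are local to this sketch, not the Spark API:

```scala
val k = 2
val mu = Array(1.0, 2.0)  // mean vector

// With identity covariance, (x-μ)^T Σ^(-1) (x-μ) is just the squared distance
def quad(x: Array[Double]): Double =
  x.zip(mu).map { case (xi, mi) => (xi - mi) * (xi - mi) }.sum

def pdf(x: Array[Double]): Double =
  math.pow(2.0 * math.Pi, -k / 2.0) * math.exp(-0.5 * quad(x))

def logpdf(x: Array[Double]): Double =
  -0.5 * (k * math.log(2.0 * math.Pi) + quad(x))  // log det(I) = 0

println(f"pdf at mean:    ${pdf(Array(1.0, 2.0))}%.4f")     // 1/(2π) ≈ 0.1592
println(f"logpdf at mean: ${logpdf(Array(1.0, 2.0))}%.4f")  // -ln(2π) ≈ -1.8379
```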

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Understanding the relationship between pdf and logpdf
val mean = Vectors.dense(1.0, 2.0)
val cov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // Identity covariance
val gaussian = new MultivariateGaussian(mean, cov)

val x = Vectors.dense(1.0, 2.0)  // Point at the mean

val pdf = gaussian.pdf(x)
val logpdf = gaussian.logpdf(x)

// Verify the relationship: pdf = exp(logpdf)
val pdfFromLog = math.exp(logpdf)
println(s"PDF: $pdf")
println(s"PDF from log: $pdfFromLog")
println(s"Difference: ${math.abs(pdf - pdfFromLog)}")  // Should be very small

// Maximum density is at the mean
val atMean = gaussian.pdf(mean)
val awayFromMean = gaussian.pdf(Vectors.dense(3.0, 4.0))
println(s"Density at mean: $atMean")
println(s"Density away from mean: $awayFromMean")
assert(atMean > awayFromMean)  // Density decreases away from the mean

// For identity covariance, the maximum density is 1/(2π)^(k/2)
val theoreticalMax = 1.0 / (2.0 * math.Pi)  // k = 2, so (2π)^(k/2) = 2π
println(s"Theoretical maximum density: $theoreticalMax")
println(s"Actual maximum density: $atMean")
```

## Error Handling

The MultivariateGaussian class validates its inputs and handles edge cases gracefully:

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Dimension mismatch between mean and covariance
val mean = Vectors.dense(1.0, 2.0)
val wrongSizeCov = Matrices.eye(3)
// new MultivariateGaussian(mean, wrongSizeCov)
//   => IllegalArgumentException: Mean vector length must match covariance matrix size

// Non-square covariance matrix (note: the factory expects Array[Double])
val nonSquareCov = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// new MultivariateGaussian(mean, nonSquareCov)
//   => IllegalArgumentException: Covariance matrix must be square

// Completely zero covariance (no variation)
val zeroMean = Vectors.dense(0.0, 0.0)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(zeroMean, zeroCov)
//   => IllegalArgumentException: Covariance matrix has no non-zero singular values

// Valid but challenging cases (handled gracefully)
val validMean = Vectors.dense(0.0, 0.0)
val validCov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val validGaussian = new MultivariateGaussian(validMean, validCov)

// Very small probabilities (handled via log-space)
val farPoint = Vectors.dense(10.0, 10.0)     // Very far from the mean
val smallPdf = validGaussian.pdf(farPoint)   // Very small positive number
val logPdf = validGaussian.logpdf(farPoint)  // Large negative number
println(s"Small PDF: $smallPdf")  // May be close to 0
println(s"Log PDF: $logPdf")      // More informative

// Numerical stability demonstration
val extremePoint = Vectors.dense(50.0, 50.0)
val extremeLogPdf = validGaussian.logpdf(extremePoint)  // Large negative
val extremePdf = validGaussian.pdf(extremePoint)        // Essentially 0
println(s"Extreme log PDF: $extremeLogPdf")  // Still computable
println(s"Extreme PDF: $extremePdf")         // May underflow to 0
```