# Statistical Distributions

Statistical distribution implementations for probability computations and machine learning algorithms. Supports multivariate distributions with advanced numerical-stability features.

## Capabilities

### Multivariate Gaussian Distribution

Implementation of the multivariate Gaussian (normal) distribution, with support for singular covariance matrices through pseudo-inverse computation.

```scala { .api }
/**
 * Multivariate Gaussian (normal) distribution.
 * Handles singular covariance matrices by computing the density in a
 * reduced-dimensional subspace.
 *
 * @param mean mean vector of the distribution
 * @param cov covariance matrix of the distribution
 */
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {

  /**
   * Computes the probability density function at the given point.
   * @param x point at which to evaluate the density
   * @return probability density value
   */
  def pdf(x: Vector): Double

  /**
   * Computes the log probability density function at the given point.
   * @param x point at which to evaluate the log density
   * @return log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5, // first column: [1.0, 0.5]
  0.5, 1.0  // second column: [0.5, 1.0]
))

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate the probability density
val point1 = Vectors.dense(0.0, 0.0) // at the mean
val density1 = gaussian.pdf(point1)
println(s"Density at mean: $density1")

val point2 = Vectors.dense(1.0, 1.0) // away from the mean
val density2 = gaussian.pdf(point2)
println(s"Density at (1,1): $density2")

// Evaluate the log probability density (more numerically stable)
val logDensity1 = gaussian.logpdf(point1)
val logDensity2 = gaussian.logpdf(point2)
println(s"Log density at mean: $logDensity1")
println(s"Log density at (1,1): $logDensity2")
```

### Advanced Usage Examples

#### Working with High-Dimensional Distributions

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 5-dimensional Gaussian
val dim = 5
val mean = Vectors.zeros(dim)

// Create a diagonal covariance matrix in column-major order
val covValues = Array.tabulate(dim * dim) { i =>
  val row = i % dim // column-major: the row index varies fastest
  val col = i / dim
  if (row == col) 1.0 else 0.0 // identity matrix
}
val cov = Matrices.dense(dim, dim, covValues)

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate multiple points
val points = Array(
  Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0), // origin (the mean)
  Vectors.dense(1.0, 0.0, 0.0, 0.0, 0.0), // standard basis vector in the first dimension
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0)  // all-ones vector
)

points.zipWithIndex.foreach { case (point, i) =>
  val density = gaussian.pdf(point)
  val logDensity = gaussian.logpdf(point)
  println(s"Point $i: density = $density, log density = $logDensity")
}
```

#### Working with Singular Covariance Matrices

The implementation handles singular (non-invertible) covariance matrices using a pseudo-inverse computation:

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a singular (rank-1) covariance matrix
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0, // first column
  1.0, 1.0, 1.0, // second column (identical to the first)
  1.0, 1.0, 1.0  // third column (identical to the first)
))

// Construction succeeds despite the singular covariance matrix
val singularGaussian = new MultivariateGaussian(mean, singularCov)

val testPoint = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(testPoint)
val logDensity = singularGaussian.logpdf(testPoint)

println(s"Density with singular covariance: $density")
println(s"Log density with singular covariance: $logDensity")
```

#### Correlated Multivariate Gaussian

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a correlated 3D Gaussian
val mean = Vectors.dense(1.0, 2.0, 3.0)

// Covariance matrix with off-diagonal terms
// (implied correlations: dims 1-2 ≈ 0.46, dims 1-3 ≈ 0.21, dims 2-3 ≈ 0.49)
val cov = Matrices.dense(3, 3, Array(
  2.0, 0.8, 0.3, // first column:  var = 2.0, cov(1,2) = 0.8, cov(1,3) = 0.3
  0.8, 1.5, 0.6, // second column: cov(1,2) = 0.8, var = 1.5, cov(2,3) = 0.6
  0.3, 0.6, 1.0  // third column:  cov(1,3) = 0.3, cov(2,3) = 0.6, var = 1.0
))

val correlatedGaussian = new MultivariateGaussian(mean, cov)

// Evaluate several points and compare densities
val points = Array(
  mean,                         // highest density, at the mean
  Vectors.dense(1.0, 2.0, 3.5), // close to the mean
  Vectors.dense(0.0, 0.0, 0.0), // far from the mean
  Vectors.dense(2.0, 3.0, 4.0)  // mean shifted by (1, 1, 1)
)

println("Evaluating correlated Gaussian:")
points.zipWithIndex.foreach { case (point, i) =>
  val density = correlatedGaussian.pdf(point)
  val logDensity = correlatedGaussian.logpdf(point)
  println(f"Point $i: density = $density%.6f, log density = $logDensity%.6f")
}
```

### Numerical Stability Features

The `MultivariateGaussian` implementation includes several numerical stability features:

1. **Eigenvalue Decomposition**: Uses an eigendecomposition of the symmetric covariance matrix instead of direct matrix inversion
2. **Tolerance-based Computation**: Treats singular values as zero only when they fall below a machine-precision-based tolerance
3. **Pseudo-inverse Computation**: Uses the Moore-Penrose pseudo-inverse for singular matrices
4. **Log-space Computation**: Provides a `logpdf` method for numerical stability in high dimensions

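The tolerance rule behind features 1-3 can be sketched in plain Scala. This is an illustrative sketch, not Spark's internal code: it uses the closed-form eigendecomposition of a symmetric 2×2 matrix and the all-ones (rank-1) matrix from the singular-covariance example above, hard-coding that matrix's known eigenvector.

```scala
// Illustrative sketch: tolerance-based Moore-Penrose pseudo-inverse of a
// symmetric 2x2 matrix via eigendecomposition. Eigenvalues at or below a
// machine-precision-based tolerance are treated as exactly zero.

// A singular symmetric matrix [[a, b], [b, c]] -- here the all-ones matrix
val (a, b, c) = (1.0, 1.0, 1.0)

// Closed-form eigenvalues of a symmetric 2x2 matrix
val mid = (a + c) / 2.0
val rad = math.sqrt(math.pow((a - c) / 2.0, 2) + b * b)
val eigs = Array(mid + rad, mid - rad) // 2.0 and 0.0 for the all-ones matrix

// Tolerance scaled by machine epsilon, the largest eigenvalue, and the dimension
val eps = 2.220446049250313e-16
val tol = eps * eigs.max * 2

// Invert only the eigenvalues above the tolerance; zero out the rest
val invEigs = eigs.map(l => if (l > tol) 1.0 / l else 0.0)

// Unit eigenvector for the nonzero eigenvalue of the all-ones matrix: (1, 1) / sqrt(2).
// The zero eigenvalue contributes nothing, so pinv = (1/2) * v * v^T.
val v = Array(1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
val pinv = Array.tabulate(2, 2)((i, j) => invEigs(0) * v(i) * v(j))

pinv.foreach(row => println(row.mkString(" "))) // each entry is 0.25
```

Dropping the below-tolerance eigenvalues is what restricts a rank-deficient covariance to its nonzero eigen-subspace.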
```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// The implementation automatically handles numerical issues
val mean = Vectors.dense(0.0, 0.0)
val poorlyConditioned = Matrices.dense(2, 2, Array(
  1e12, 1e12,        // very large values
  1e12, 1e12 + 1e-6  // nearly singular
))

val stableGaussian = new MultivariateGaussian(mean, poorlyConditioned)

// These evaluations remain numerically stable
val point = Vectors.dense(1e6, 1e6)
val stableDensity = stableGaussian.pdf(point)
val stableLogDensity = stableGaussian.logpdf(point) // preferred for stability
```
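To see why `logpdf` is preferred in high dimensions, consider a k-dimensional standard normal: the density at the mean is (2π)^(-k/2), which underflows double precision for large k even though its logarithm stays comfortably in range. A minimal plain-Scala illustration (no Spark required):

```scala
// pdf vs. logpdf in high dimensions: for a standard normal in k dimensions,
// pdf(mean) = (2*pi)^(-k/2), which underflows double precision for large k.
val k = 1000

// The log-space value is well within double range
val logPdfAtMean = -0.5 * k * math.log(2 * math.Pi) // ≈ -918.94

// Exponentiating underflows: the smallest positive double is ≈ 4.9e-324,
// i.e. exp(x) is 0.0 for x below about -745
val pdfAtMean = math.exp(logPdfAtMean)

println(s"logpdf at mean: $logPdfAtMean")
println(s"pdf at mean:    $pdfAtMean") // prints 0.0
```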

## Mathematical Background

The multivariate Gaussian PDF is given by:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:
- `k` is the dimensionality
- `μ` is the mean vector
- `Σ` is the covariance matrix
- `|Σ|` is the determinant of the covariance matrix

For singular covariance matrices, the implementation computes the pseudo-determinant and uses the pseudo-inverse in a reduced-dimensional subspace.
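As a quick sanity check of the formula in plain Scala (no Spark required): for a 2-D standard normal (μ = 0, Σ = I) the determinant is 1 and Σ^(-1) = I, so the density reduces to (2π)^(-k/2) · exp(-‖x‖²/2):

```scala
// Worked check of the pdf formula for a 2-D standard normal (mean 0, identity covariance)
val k = 2 // dimensionality

// At the mean, the quadratic form (x - mu)^T Sigma^{-1} (x - mu) is 0,
// so pdf(mu) = (2*pi)^(-k/2) = 1 / (2*pi)
val pdfAtMean = math.pow(2 * math.Pi, -k / 2.0)

// At x = (1, 1), the quadratic form is 1^2 + 1^2 = 2
val quadraticForm = 2.0
val pdfAtOnes = math.pow(2 * math.Pi, -k / 2.0) * math.exp(-0.5 * quadraticForm)

println(f"pdf at mean:   $pdfAtMean%.7f") // ≈ 0.1591549
println(f"pdf at (1, 1): $pdfAtOnes%.7f") // ≈ 0.0585498
```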

## Types

```scala { .api }
import org.apache.spark.ml.linalg.{Vector, Matrix}

class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  require(cov.numCols == cov.numRows, "Covariance matrix must be square")
  require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")
}
```