Statistical distributions provide the probability computations that many machine learning algorithms depend on. Spark MLlib Local includes an implementation of the multivariate Gaussian distribution that also supports singular covariance matrices.
The MultivariateGaussian class implements the multivariate normal distribution, handling singular covariance matrices automatically through pseudoinverse computation.
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  // Validates that the covariance matrix is square and matches the mean vector size
}

Usage examples:
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// 2D Gaussian distribution
val mean = Vectors.dense(1.0, 2.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // Covariance matrix in column-major order:
  0.5, 2.0   // [[1.0, 0.5], [0.5, 2.0]]
))
val gaussian = new MultivariateGaussian(mean, cov)
// 3D Gaussian with diagonal covariance
val mean3d = Vectors.dense(0.0, 0.0, 0.0)
val cov3d = Matrices.dense(3, 3, Array(
  1.0, 0.0, 0.0,
  0.0, 2.0, 0.0,
  0.0, 0.0, 1.5
))
val gaussian3d = new MultivariateGaussian(mean3d, cov3d)
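
For diagonal covariances like this one, Matrices.diag offers a more compact construction, building the matrix from its diagonal vector (equivalent to the dense form above):

// Same 3D diagonal covariance, built directly from its diagonal
val cov3dDiag = Matrices.diag(Vectors.dense(1.0, 2.0, 1.5))
val gaussian3dAlt = new MultivariateGaussian(mean3d, cov3dDiag)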
Compute probability densities and log-densities for given points.

class MultivariateGaussian {
  def pdf(x: Vector): Double
  def logpdf(x: Vector): Double
}

Usage examples:
val gaussian = new MultivariateGaussian(
  Vectors.dense(1.0, 2.0),
  Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // Identity covariance
)
// Evaluate probability density at different points
val point1 = Vectors.dense(1.0, 2.0) // At the mean
val point2 = Vectors.dense(0.0, 1.0) // Away from mean
val point3 = Vectors.dense(2.0, 3.0) // Another point
val density1 = gaussian.pdf(point1) // Highest density (at mean)
val density2 = gaussian.pdf(point2) // Lower density
val density3 = gaussian.pdf(point3) // Lower density
println(f"Density at mean: $density1%.6f")
println(f"Density at (0,1): $density2%.6f")
println(f"Density at (2,3): $density3%.6f")
// Log densities for numerical stability with small probabilities
val logDensity1 = gaussian.logpdf(point1)
val logDensity2 = gaussian.logpdf(point2)
println(f"Log density at mean: $logDensity1%.6f")
println(f"Log density at (0,1): $logDensity2%.6f")The implementation robustly handles singular (non-invertible) covariance matrices using eigendecomposition and pseudoinverse techniques.
The implementation robustly handles singular (non-invertible) covariance matrices using eigendecomposition and pseudoinverse techniques.

// Example with singular covariance matrix
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 0.0,
  1.0, 1.0, 0.0,  // Rank-deficient matrix (rank 2)
  0.0, 0.0, 1.0
))
val singularGaussian = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0, 0.0),
  singularCov
)
// Still computes valid densities in the supported subspace
val testPoint = Vectors.dense(1.0, 1.0, 0.5)
val density = singularGaussian.pdf(testPoint)
val logDensity = singularGaussian.logpdf(testPoint)
println(f"Density with singular covariance: $density%.6f")
println(f"Log density with singular covariance: $logDensity%.6f")Use Gaussian distributions to detect outliers in multivariate data.
Use Gaussian distributions to detect outliers in multivariate data.

import org.apache.spark.ml.linalg.{Vector, Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Fit Gaussian to training data (simplified example)
val trainingMean = Vectors.dense(5.0, 10.0)
val trainingCov = Matrices.dense(2, 2, Array(2.0, 0.1, 0.1, 3.0))
val model = new MultivariateGaussian(trainingMean, trainingCov)
// Score new data points
val normalPoint = Vectors.dense(5.2, 9.8) // Close to training mean
val anomalyPoint = Vectors.dense(15.0, 2.0) // Far from training mean
val normalLogProb = model.logpdf(normalPoint)
val anomalyLogProb = model.logpdf(anomalyPoint)
// Points with very low probability (i.e., a large negative log probability) are anomalies
val threshold = -10.0 // Example threshold
val isAnomalous = anomalyLogProb < threshold
println(f"Normal point log probability: $normalLogProb%.3f")
println(f"Anomaly point log probability: $anomalyLogProb%.3f")
println(s"Is anomalous: $isAnomalous")Use as components in Gaussian mixture models.
Use as components in Gaussian mixture models.

// Multiple Gaussian components for mixture model
val component1 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
)
val component2 = new MultivariateGaussian(
  Vectors.dense(5.0, 5.0),
  Matrices.dense(2, 2, Array(2.0, 0.5, 0.5, 2.0))
)
val mixingWeights = Array(0.3, 0.7) // Component weights
// Evaluate mixture density at a point
val testPoint = Vectors.dense(2.5, 2.5)
val mixtureDensity = mixingWeights(0) * component1.pdf(testPoint) +
  mixingWeights(1) * component2.pdf(testPoint)
println(f"Mixture density: $mixtureDensity%.6f")Estimate probability densities for continuous multivariate data.
Estimate probability densities for continuous multivariate data.

// Given a sample of points, fit a Gaussian model
def fitGaussian(samples: Array[Vector]): MultivariateGaussian = {
  val n = samples.length
  val d = samples(0).size
  // Compute the sample mean
  val meanArray = Array.fill(d)(0.0)
  samples.foreach { vec =>
    for (i <- 0 until d) {
      meanArray(i) += vec(i) / n
    }
  }
  val sampleMean = Vectors.dense(meanArray)
  // Compute the unbiased sample covariance (column-major order); it may be
  // singular for small or degenerate samples, which MultivariateGaussian handles
  val covArray = Array.fill(d * d)(0.0)
  samples.foreach { vec =>
    for (i <- 0 until d; j <- 0 until d) {
      covArray(j * d + i) += (vec(i) - meanArray(i)) * (vec(j) - meanArray(j)) / (n - 1)
    }
  }
  val sampleCov = Matrices.dense(d, d, covArray)
  new MultivariateGaussian(sampleMean, sampleCov)
}
// Use the fitted model
val data = Array(
  Vectors.dense(1.1, 2.0),
  Vectors.dense(0.9, 1.8),
  Vectors.dense(1.2, 2.1)
)
val fittedModel = fitGaussian(data)
val newPoint = Vectors.dense(1.0, 2.0)
val likelihood = fittedModel.pdf(newPoint)
println(f"Likelihood of new point: $likelihood%.6f")The MultivariateGaussian implementation includes several numerical stability features:
// The implementation automatically handles numerical precision:
// - Eigenvalues below EPSILON * max_eigenvalue * dimension are treated as zero
// - Prevents numerical overflow in inverse computations
// - Uses log-space computations when possible to avoid underflow
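
A rough illustration of the eigenvalue thresholding (a sketch; the exact cutoff depends on the machine epsilon used internally): a covariance matrix whose smaller eigenvalue is on the order of machine epsilon falls below the documented tolerance, so it should be treated as zero rather than producing an enormous, unstable inverse:

// Almost perfectly correlated dimensions; eigenvalues are roughly 2.0 and 1e-16
val nearSingularCov = Matrices.dense(2, 2, Array(
  1.0, 0.9999999999999999,
  0.9999999999999999, 1.0
))
val nearSingular = new MultivariateGaussian(Vectors.dense(0.0, 0.0), nearSingularCov)
// The tiny eigenvalue is expected to be zeroed out, keeping the density finite
println(nearSingular.logpdf(Vectors.dense(0.5, 0.5)))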
The constructor validates inputs and throws exceptions for invalid configurations:

import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// These will throw IllegalArgumentException:
try {
  // Non-square covariance matrix
  val badCov = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
  new MultivariateGaussian(Vectors.dense(1.0, 2.0), badCov)
} catch {
  case e: IllegalArgumentException => println("Covariance matrix must be square")
}
try {
  // Mismatched dimensions
  val mean = Vectors.dense(1.0, 2.0, 3.0)
  val cov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
  new MultivariateGaussian(mean, cov)
} catch {
  case e: IllegalArgumentException => println("Mean vector length must match covariance matrix size")
}

For matrices with no positive eigenvalues, the constructor will throw an IllegalArgumentException indicating that the covariance matrix has no non-zero singular values.
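
A sketch of that last case, assuming the behavior documented above (an all-zero covariance has no positive eigenvalues):

try {
  // All-zero covariance: no positive eigenvalues at all
  new MultivariateGaussian(Vectors.dense(0.0, 0.0), Matrices.zeros(2, 2))
} catch {
  case e: IllegalArgumentException => println("Covariance matrix has no non-zero singular values")
}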