Multivariate probability distributions with numerical stability and support for singular covariance matrices. Designed for robust statistical computations in machine learning applications.
Implementation of multivariate normal distribution with support for degenerate (singular) covariance matrices.
/**
* Multivariate Gaussian (Normal) Distribution
* Handles singular covariance matrices by computing density in reduced dimensional subspace
*
* Note: This class is marked as @DeveloperApi in Spark MLlib
*
* @param mean Mean vector of the distribution
* @param cov Covariance matrix of the distribution (must be square and same size as mean)
*/
class MultivariateGaussian(
val mean: Vector,
val cov: Matrix
) extends Serializable {
/**
* Private constructor taking Breeze types (internal use)
* @param mean Mean vector as Breeze DenseVector
* @param cov Covariance matrix as Breeze DenseMatrix
*/
private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])
/**
* Mean vector of the distribution
* @return Vector containing mean values for each dimension
*/
def mean: Vector
/**
* Covariance matrix of the distribution
* @return Square matrix representing covariance structure
*/
def cov: Matrix
/**
* Compute probability density function at given point
* @param x Point to evaluate (must have same size as mean)
* @return Probability density value (always non-negative)
*/
def pdf(x: Vector): Double
/**
* Compute log probability density function at given point
* @param x Point to evaluate (must have same size as mean)
* @return Log probability density value
*/
def logpdf(x: Vector): Double
}

Usage Examples:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
1.0, 0.5, // column 1: [1.0, 0.5]
0.5, 1.0 // column 2: [0.5, 1.0]
))
val mvn = new MultivariateGaussian(mean, cov)
// Evaluate density at specific points
val point1 = Vectors.dense(0.0, 0.0) // at mean
val point2 = Vectors.dense(1.0, 1.0) // away from mean
val density1 = mvn.pdf(point1) // Higher density (near mean)
val density2 = mvn.pdf(point2) // Lower density (away from mean)
val logDensity1 = mvn.logpdf(point1) // More numerically stable
val logDensity2 = mvn.logpdf(point2)
println(s"Density at mean: $density1")
println(s"Density at (1,1): $density2")
println(s"Log density at mean: $logDensity1")

Simple case with uncorrelated dimensions.
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// 3D Gaussian with identity covariance (uncorrelated)
val mean = Vectors.dense(1.0, 2.0, 3.0)
val identityCov = DenseMatrix.eye(3)
val independentMvn = new MultivariateGaussian(mean, identityCov)
// Evaluate at mean (should give highest density)
val atMean = independentMvn.pdf(mean)
val awayFromMean = independentMvn.pdf(Vectors.dense(0.0, 0.0, 0.0))
println(s"Density at mean: $atMean")
println(s"Density at origin: $awayFromMean")

Handling of degenerate covariance matrices where some dimensions are linearly dependent.
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Singular covariance matrix (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
1.0, 1.0, 1.0, // column 1
1.0, 1.0, 1.0, // column 2 (same as column 1)
1.0, 1.0, 1.0 // column 3 (same as column 1)
))
// This will work despite singular covariance
val singularMvn = new MultivariateGaussian(mean, singularCov)
// Density is computed in reduced dimensional subspace
val point = Vectors.dense(1.0, 1.0, 1.0)
val density = singularMvn.pdf(point)
println(s"Density with singular covariance: $density")

Working with Large Dimensions:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// High-dimensional Gaussian (e.g., for text analysis)
val dim = 100
val mean = Vectors.zeros(dim)
val cov = DenseMatrix.eye(dim)
val highDimMvn = new MultivariateGaussian(mean, cov)
// Use logpdf for numerical stability in high dimensions
val testPoint = Vectors.dense(Array.fill(dim)(0.1))
val logDensity = highDimMvn.logpdf(testPoint)
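// Sketch (an addition, not part of the original example): log densities of
// independent points add, so a joint log-likelihood can be accumulated
// without multiplying many tiny pdf values. `samplePoints` and
// `totalLogLikelihood` are hypothetical names used only for illustration.
val samplePoints = Seq(testPoint, Vectors.zeros(dim))
val totalLogLikelihood = samplePoints.map(p => highDimMvn.logpdf(p)).sum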
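// Prefer logpdf in high dimensions: pdf values become extremely small and can underflow
// val density = highDimMvn.pdf(testPoint) // Tiny but representable here; may underflow to 0.0 as dimension or distance from the mean grows

Batch Evaluation: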
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
val mvn = new MultivariateGaussian(
Vectors.dense(0.0, 0.0),
DenseMatrix.eye(2)
)
// Evaluate multiple points
val testPoints = Array(
Vectors.dense(0.0, 0.0),
Vectors.dense(1.0, 0.0),
Vectors.dense(0.0, 1.0),
Vectors.dense(1.0, 1.0)
)
val densities = testPoints.map(mvn.pdf)
val logDensities = testPoints.map(mvn.logpdf)
testPoints.zip(densities).foreach { case (point, density) =>
println(s"Point ${point.toArray.mkString("(", ", ", ")")}: density = $density")
}

The implementation uses several techniques for numerical stability:
logpdf avoids numerical underflow in high dimensions

// Internal tolerance calculation (not part of public API)
val tolerance = EPSILON * maxEigenvalue * matrixDimension

Where:
EPSILON: Machine epsilon from Utils.EPSILON
maxEigenvalue: Maximum eigenvalue of covariance matrix
matrixDimension: Size of the covariance matrix

Eigenvalues that do not exceed this tolerance are treated as zero.
Internally computed values are declared as @transient private lazy val.

The MultivariateGaussian constructor validates inputs:
// These will throw IllegalArgumentException:
// Non-square covariance matrix
val badCov1 = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
// new MultivariateGaussian(Vectors.dense(0.0, 0.0), badCov1) // throws exception
// Dimension mismatch between mean and covariance
val mean2D = Vectors.dense(0.0, 0.0)
val cov3D = DenseMatrix.eye(3)
// new MultivariateGaussian(mean2D, cov3D) // throws exception
// All-zero eigenvalues (no non-zero singular values)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(mean2D, zeroCov) // may throw IllegalArgumentException

Valid Cases:
// These are all valid:
// Standard non-singular covariance
val validMvn1 = new MultivariateGaussian(
Vectors.dense(0.0, 0.0),
DenseMatrix.eye(2)
)
// Singular but non-zero covariance
val singularCov = Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))
val validMvn2 = new MultivariateGaussian(
Vectors.dense(0.0, 0.0),
singularCov
)
// Very small non-zero eigenvalues (handled gracefully)
val nearSingular = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1e-15))
val validMvn3 = new MultivariateGaussian(
Vectors.dense(0.0, 0.0),
nearSingular
)

The multivariate Gaussian probability density function is:
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))

Where:
k: Dimensionality (length of mean vector)
μ: Mean vector
Σ: Covariance matrix
|Σ|: Determinant of covariance matrix

For numerical stability, the implementation computes:
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)

It uses the eigendecomposition Σ = U * D * U^T to compute the inverse and determinant efficiently.
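
The eigendecomposition step can be written out explicitly. What follows is a sketch of the standard derivation for a full-rank Σ, not a transcription of the Spark source. The symbols d_1, ..., d_k for the diagonal entries of D are introduced here for illustration only.

```latex
\begin{align*}
\Sigma &= U D U^{T}, \qquad D = \operatorname{diag}(d_1, \dots, d_k), \qquad U^{T} U = I \\
\log|\Sigma| &= \sum_{i=1}^{k} \log d_i \\
\Sigma^{-1} &= U D^{-1} U^{T} \\
(x-\mu)^{T} \Sigma^{-1} (x-\mu) &= \left\lVert D^{-1/2} U^{T} (x-\mu) \right\rVert^{2} \\
\log \mathrm{pdf}(x) &= -\frac{k}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{k} \log d_i - \frac{1}{2}\left\lVert D^{-1/2} U^{T} (x-\mu) \right\rVert^{2}
\end{align*}
```

Once U and D are known, the determinant reduces to a sum of logarithms and the quadratic form to a squared norm, so no explicit matrix inverse is needed per evaluation. For a singular Σ, the internal tolerance described earlier suggests that only eigenvalues above the tolerance enter the sum and the scaling, which is presumably what evaluating the density in a reduced subspace amounts to.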
MultivariateGaussian is used throughout Spark MLlib for: