Multivariate statistical distributions with robust numerical implementations that handle edge cases like singular covariance matrices. The library provides probability density function calculations with numerical stability features for machine learning applications.
Implementation of the multivariate normal distribution, with support for singular covariance matrices through pseudo-inverse calculations and reduced-dimensional subspace computations.
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable {
val mean: Vector
val cov: Matrix
def pdf(x: Vector): Double
def logpdf(x: Vector): Double
}

Usage examples:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = DenseMatrix.eye(2) // Identity covariance matrix
val gaussian = new MultivariateGaussian(mean, cov)
// Evaluate probability density at a point
val x = Vectors.dense(1.0, 1.0)
val density = gaussian.pdf(x)
val logDensity = gaussian.logpdf(x)
println(s"PDF at $x: $density")
println(s"Log PDF at $x: $logDensity")
// Create distribution with custom covariance
// Values are passed in column-major order; this symmetric matrix reads the
// same either way
val customCov = Matrices.dense(2, 2, Array(
1.0, 0.5,
0.5, 2.0
))
val correlatedGaussian = new MultivariateGaussian(mean, customCov)
// Evaluate at multiple points
val points = Seq(
Vectors.dense(0.0, 0.0),
Vectors.dense(1.0, 0.0),
Vectors.dense(0.0, 1.0),
Vectors.dense(1.0, 1.0)
)
points.foreach { point =>
val prob = correlatedGaussian.pdf(point)
println(s"PDF at ${point.toArray.mkString("[", ", ", "]")}: $prob")
}

The implementation handles singular (non-invertible) covariance matrices through pseudo-inverse computation and reduced-dimensional analysis.
Working with singular covariance matrices:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Create a singular covariance matrix (rank deficient)
val singularCov = Matrices.dense(3, 3, Array(
1.0, 1.0, 1.0,
1.0, 1.0, 1.0,
1.0, 1.0, 1.0
))
val mean = Vectors.dense(0.0, 0.0, 0.0)
// The MultivariateGaussian handles singular matrices automatically
val singularGaussian = new MultivariateGaussian(mean, singularCov)
// Density computation works in the reduced subspace
val x = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(x)
val logDensity = singularGaussian.logpdf(x)
println(s"PDF with singular covariance: $density")
println(s"Log PDF with singular covariance: $logDensity")Multiple ways to create MultivariateGaussian distributions from different input formats.
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable
// Private constructor for Breeze types (internal use)
private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])

Different initialization patterns:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Standard initialization with Spark vectors/matrices
val mean1 = Vectors.dense(1.0, 2.0)
val cov1 = DenseMatrix.eye(2)
val gaussian1 = new MultivariateGaussian(mean1, cov1)
// With sparse mean vector
val sparseMean = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val cov2 = DenseMatrix.eye(3)
val gaussian2 = new MultivariateGaussian(sparseMean, cov2)
// With sparse covariance matrix
val sparseIdentity = SparseMatrix.speye(2)
val gaussian3 = new MultivariateGaussian(mean1, sparseIdentity)

The multivariate Gaussian PDF is computed as:
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))

Where:

k is the dimensionality (the length of the mean vector)
μ is the mean vector
Σ is the covariance matrix
|Σ| is the determinant of the covariance matrix

The log PDF is computed for numerical stability:
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
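As a quick check of these formulas, the closed-form value for a 2-D standard Gaussian can be compared against logpdf (a minimal sketch; the comparison tolerance is arbitrary):

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Standard 2-D Gaussian: k = 2, |Σ| = 1, and at x = (1, 1) the quadratic
// form (x-μ)^T Σ^(-1) (x-μ) = 2, so logpdf = -log(2π) - 1 ≈ -2.8379
val g = new MultivariateGaussian(Vectors.dense(0.0, 0.0), DenseMatrix.eye(2))
val x = Vectors.dense(1.0, 1.0)
val expected = -math.log(2.0 * math.Pi) - 1.0
assert(math.abs(g.logpdf(x) - expected) < 1e-9)
assert(math.abs(g.pdf(x) - math.exp(expected)) < 1e-9)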
For singular covariance matrices, the implementation first computes the eigendecomposition of the covariance matrix:

Σ = U * D * U^T

It then substitutes the pseudo-inverse for Σ^(-1) and the pseudo-determinant (the product of the non-zero eigenvalues) for |Σ|, effectively evaluating the density in the reduced-dimensional subspace spanned by the non-degenerate eigenvectors. The implementation uses a machine precision-based tolerance to determine which eigenvalues are considered non-zero:
val tolerance = EPSILON * max(eigenvalues) * dimensionality

This prevents numerical instability from very small eigenvalues that should be treated as zero.
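A minimal Breeze sketch of this thresholding (illustrative only; math.ulp(1.0) stands in for Spark's internal EPSILON constant):

import breeze.linalg.{eigSym, max => bmax, DenseMatrix => BDM}

// Rank-1 covariance: the 3x3 all-ones matrix has eigenvalues (3, 0, 0)
val singular = BDM.ones[Double](3, 3)
val eigenvalues = eigSym(singular).eigenvalues
val tolerance = math.ulp(1.0) * bmax(eigenvalues) * eigenvalues.length
val effectiveRank = eigenvalues.toArray.count(_ > tolerance)
println(s"tolerance = $tolerance, effective rank = $effectiveRank") // rank 1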
The internal implementation optimizes matrix operations: from the eigendecomposition it precomputes the pseudo-inverse square root D^(-1/2) * U^T and the log of the normalization constant once, so each density evaluation reduces to a single matrix-vector product followed by a dot product, as sketched below.
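A condensed sketch of that pattern in Breeze (names such as GaussianSketch, rootSigmaInv, and u are illustrative, not the library's private fields):

import breeze.linalg.{diag, eigSym, max, DenseMatrix => BDM, DenseVector => BDV}

class GaussianSketch(mu: BDV[Double], sigma: BDM[Double]) {
  // Computed once, lazily, on the first density evaluation, then cached
  private lazy val constants: (BDM[Double], Double) = {
    val es = eigSym(sigma)
    val d = es.eigenvalues
    val tol = math.ulp(1.0) * max(d) * d.length
    // Pseudo-inverse square root: invert only the non-negligible eigenvalues
    val dInvSqrt = d.map(v => if (v > tol) 1.0 / math.sqrt(v) else 0.0)
    val rootSigmaInv = diag(dInvSqrt) * es.eigenvectors.t
    // Log normalization constant, built from the pseudo-determinant
    val logPseudoDet = d.toArray.filter(_ > tol).map(math.log).sum
    val u = -0.5 * (mu.length * math.log(2.0 * math.Pi) + logPseudoDet)
    (rootSigmaInv, u)
  }
  def logpdf(x: BDV[Double]): Double = {
    val v = constants._1 * (x - mu) // one matrix-vector product
    constants._2 - 0.5 * (v dot v)  // plus one dot product
  }
}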
The implementation validates inputs and provides meaningful error messages:
// Dimension validation
require(cov.numCols == cov.numRows, "Covariance matrix must be square")
require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")
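// Illustrative (hypothetical values): a 3-element mean paired with a 2x2
// covariance fails the second require() with an IllegalArgumentException
// new MultivariateGaussian(Vectors.dense(1.0, 2.0, 3.0), DenseMatrix.eye(2))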
// Singular matrix handling
try {
val gaussian = new MultivariateGaussian(mean, singularCov)
val density = gaussian.pdf(x)
} catch {
case e: IllegalArgumentException =>
println("Covariance matrix has no non-zero singular values")
}

Distribution-dependent constants are computed lazily and cached: the eigendecomposition work described above runs once, on the first call to pdf or logpdf, and is reused for every subsequent evaluation.
The implementation leverages optimized BLAS operations, which makes repeated evaluation over many points inexpensive:
val samples = Seq(
Vectors.dense(1.2, 0.8),
Vectors.dense(0.9, 1.1),
Vectors.dense(1.1, 0.9)
)
val gaussian = new MultivariateGaussian(mean, cov)
// Compute likelihood of samples
val likelihoods = samples.map(gaussian.pdf)
val logLikelihoods = samples.map(gaussian.logpdf)
// Total log-likelihood
val totalLogLikelihood = logLikelihoods.sum

Log densities are also useful for anomaly detection, flagging points whose log-probability under a fitted model falls below a threshold:

val normalDataMean = Vectors.dense(0.0, 0.0)
val normalDataCov = DenseMatrix.eye(2)
val normalModel = new MultivariateGaussian(normalDataMean, normalDataCov)
def isAnomalous(point: Vector, threshold: Double): Boolean = {
val logProb = normalModel.logpdf(point)
logProb < threshold
}
val testPoint = Vectors.dense(3.0, 3.0)
val anomalous = isAnomalous(testPoint, -5.0)

MultivariateGaussian instances can also serve as the components of a mixture model:

// Component distributions for a mixture model
val component1 = new MultivariateGaussian(
Vectors.dense(-1.0, -1.0),
DenseMatrix.eye(2)
)
val component2 = new MultivariateGaussian(
Vectors.dense(1.0, 1.0),
Matrices.dense(2, 2, Array(2.0, 0.5, 0.5, 2.0))
)
// Evaluate mixture probability (would need mixture weights)
def mixturePdf(x: Vector, weights: Array[Double]): Double = {
weights(0) * component1.pdf(x) + weights(1) * component2.pdf(x)
}
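For example, with equal (hypothetical) mixture weights:

val weights = Array(0.5, 0.5)
val mixtureDensity = mixturePdf(Vectors.dense(0.0, 0.0), weights)
println(s"Mixture PDF at origin: $mixtureDensity")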