Statistical Distributions

This module provides a numerically stable multivariate Gaussian distribution that supports singular covariance matrices. It is designed for robust statistical computations in machine learning applications.

Capabilities

Multivariate Gaussian Distribution

Implementation of multivariate normal distribution with support for degenerate (singular) covariance matrices.

/**
 * Multivariate Gaussian (Normal) Distribution
 * Handles singular covariance matrices by computing density in reduced dimensional subspace
 * 
 * Note: This class is marked as @DeveloperApi in Spark MLlib
 * 
 * @param mean Mean vector of the distribution
 * @param cov Covariance matrix of the distribution (must be square and same size as mean)
 */
class MultivariateGaussian(
  val mean: Vector,
  val cov: Matrix
) extends Serializable {
  
  /**
   * Private constructor taking Breeze types (internal use)
   * @param mean Mean vector as Breeze DenseVector
   * @param cov Covariance matrix as Breeze DenseMatrix
   */
  private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])

  /**
   * Mean vector of the distribution
   * @return Vector containing mean values for each dimension
   */
  def mean: Vector
  
  /**
   * Covariance matrix of the distribution
   * @return Square matrix representing covariance structure
   */
  def cov: Matrix
  
  /**
   * Compute probability density function at given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Probability density value (always non-negative)
   */
  def pdf(x: Vector): Double
  
  /**
   * Compute log probability density function at given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Log probability density value
   */
  def logpdf(x: Vector): Double
}

Usage Examples:

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // column 1: [1.0, 0.5]
  0.5, 1.0   // column 2: [0.5, 1.0]
))

val mvn = new MultivariateGaussian(mean, cov)

// Evaluate density at specific points
val point1 = Vectors.dense(0.0, 0.0)  // at mean
val point2 = Vectors.dense(1.0, 1.0)  // away from mean

val density1 = mvn.pdf(point1)     // Higher density (near mean)
val density2 = mvn.pdf(point2)     // Lower density (away from mean)

val logDensity1 = mvn.logpdf(point1) // More numerically stable
val logDensity2 = mvn.logpdf(point2)

println(s"Density at mean: $density1")
println(s"Density at (1,1): $density2")
println(s"Log density at mean: $logDensity1")
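
The values printed above can be sanity-checked against the closed-form density of a 2D Gaussian. The sketch below is plain Scala with no Spark dependency; `Gaussian2D` and its parameters `a`, `b`, `c` (the entries of the covariance [[a, b], [b, c]]) are illustrative names, not part of the library.

```scala
// Illustrative helper, not part of Spark: closed-form 2D Gaussian density.
object Gaussian2D {
  // Density for covariance [[a, b], [b, c]]; requires a non-singular matrix.
  def pdf(x: (Double, Double), mean: (Double, Double),
          a: Double, b: Double, c: Double): Double = {
    val det = a * c - b * b
    require(det > 0.0, "closed form needs a non-singular covariance")
    val dx = x._1 - mean._1
    val dy = x._2 - mean._2
    // (x-μ)^T Σ^(-1) (x-μ), using Σ^(-1) = 1/det * [[c, -b], [-b, a]]
    val quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    math.exp(-0.5 * quad) / (2.0 * math.Pi * math.sqrt(det))
  }
}
```

For the covariance used above (a = 1.0, b = 0.5, c = 1.0), this gives roughly 0.1838 at the mean and roughly 0.0944 at (1, 1), which is what `mvn.pdf` should report up to floating-point rounding.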

Identity Covariance

Simple case with uncorrelated dimensions.

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 3D Gaussian with identity covariance (uncorrelated)
val mean = Vectors.dense(1.0, 2.0, 3.0)
val identityCov = DenseMatrix.eye(3)

val independentMvn = new MultivariateGaussian(mean, identityCov)

// Evaluate at mean (should give highest density)
val atMean = independentMvn.pdf(mean)
val awayFromMean = independentMvn.pdf(Vectors.dense(0.0, 0.0, 0.0))

println(s"Density at mean: $atMean")
println(s"Density at origin: $awayFromMean")

Singular Covariance Matrices

A degenerate covariance matrix, in which some dimensions are linearly dependent, is handled by evaluating the density in the reduced-dimensional subspace.

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Singular covariance matrix (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,  // column 1
  1.0, 1.0, 1.0,  // column 2 (same as column 1)
  1.0, 1.0, 1.0   // column 3 (same as column 1)
))

// This will work despite singular covariance
val singularMvn = new MultivariateGaussian(mean, singularCov)

// Density is computed in reduced dimensional subspace
val point = Vectors.dense(1.0, 1.0, 1.0)
val density = singularMvn.pdf(point)

println(s"Density with singular covariance: $density")
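
To make the reduced-subspace idea concrete, here is a minimal sketch for the rank-one case, written in plain Scala without Spark. It assumes the covariance has the form lambda * v * v^T with v a unit vector, and that the formula from Mathematical Background is applied with the pseudo-determinant (lambda) and pseudo-inverse (v * v^T / lambda) while keeping the full dimension k in the (2π) term. `RankOneSketch` is a hypothetical name, not library code.

```scala
// Illustrative only: log density for a rank-one covariance lambda * v * v^T.
object RankOneSketch {
  def logPdf(x: Array[Double], mean: Array[Double],
             v: Array[Double], lambda: Double): Double = {
    val k = x.length
    // Projection of (x - μ) onto the single non-degenerate direction v
    val proj = x.indices.map(i => (x(i) - mean(i)) * v(i)).sum
    -0.5 * k * math.log(2.0 * math.Pi) -  // normalisation over k dimensions
      0.5 * math.log(lambda) -            // log pseudo-determinant
      0.5 * proj * proj / lambda          // quadratic form via pseudo-inverse
  }
}
```

The all-ones 3x3 matrix above equals 3 * v * v^T with v = (1, 1, 1) / sqrt(3), so under these assumptions the log density at (1, 1, 1) is -1.5 * log(2π) - 0.5 * log(3) - 0.5, about -3.806. Note that under this formula only the component of x - μ along v contributes; the component orthogonal to v is ignored.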

Advanced Usage Patterns

Working with Large Dimensions:

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// High-dimensional Gaussian (e.g., for text analysis)
val dim = 100
val mean = Vectors.zeros(dim)
val cov = DenseMatrix.eye(dim)

val highDimMvn = new MultivariateGaussian(mean, cov)

// Use logpdf for numerical stability in high dimensions
val testPoint = Vectors.dense(Array.fill(dim)(0.1))
val logDensity = highDimMvn.logpdf(testPoint)

// Avoid pdf in very high dimensions: the density underflows to 0.0
// once the log density falls below about -745 (the limit of Double)
// val density = highDimMvn.pdf(testPoint)
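
The underflow behaviour can be reproduced without Spark. The sketch below is plain Scala and assumes only that a density is the exponential of its log density; `LogSpaceDemo` and its members are illustrative names.

```scala
// Illustrative only: plain Scala, no Spark dependency.
object LogSpaceDemo {
  // Log density of a standard normal in `dim` dimensions at a point whose
  // coordinates all equal `c`: -dim/2 * log(2π) - 1/2 * dim * c^2.
  def standardNormalLogPdf(dim: Int, c: Double): Double =
    -0.5 * dim * math.log(2.0 * math.Pi) - 0.5 * dim * c * c

  // Converting back to a density loses every value smaller than the
  // smallest positive Double (about 4.9e-324).
  def toDensity(logPdf: Double): Double = math.exp(logPdf)
}
```

With dim = 100 and c = 0.1 the log density is about -92.4 and the density is still representable. With dim = 1000 the log density is about -923.9 and the density evaluates to exactly 0.0, while the log density remains usable for comparisons.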

Batch Evaluation:

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val mvn = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Evaluate multiple points
val testPoints = Array(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(1.0, 1.0)
)

val densities = testPoints.map(mvn.pdf)
val logDensities = testPoints.map(mvn.logpdf)

testPoints.zip(densities).foreach { case (point, density) =>
  println(s"Point ${point.toArray.mkString("(", ", ", ")")}: density = $density")
}

Internal Implementation

Numerical Stability

The implementation uses several techniques for numerical stability:

  1. Eigenvalue Decomposition: Uses eigendecomposition instead of direct matrix inversion
  2. Pseudo-Inverse: Handles singular covariance matrices via Moore-Penrose pseudo-inverse
  3. Tolerance-Based Filtering: Eigenvalues that do not exceed a tolerance derived from machine epsilon are treated as zero
  4. Log-Space Computation: logpdf avoids numerical underflow in high dimensions

Tolerance Calculation

// Internal tolerance calculation (not part of public API)
val tolerance = EPSILON * maxEigenvalue * matrixDimension

Where:

  • EPSILON: Machine epsilon from Utils.EPSILON
  • maxEigenvalue: Maximum eigenvalue of covariance matrix
  • matrixDimension: Size of the covariance matrix
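
As a rough illustration of how this tolerance separates zero from non-zero eigenvalues, the following plain-Scala sketch applies the formula to a list of eigenvalues. It assumes that EPSILON equals the Double machine epsilon (`math.ulp(1.0)`, about 2.22e-16) and that an eigenvalue counts as non-zero only when it strictly exceeds the tolerance; `ToleranceSketch` is a hypothetical name.

```scala
// Illustrative only: tolerance-based eigenvalue filtering in plain Scala.
object ToleranceSketch {
  // Assumed stand-in for Utils.EPSILON: gap between 1.0 and the next Double
  val Epsilon: Double = math.ulp(1.0)

  // tolerance = EPSILON * maxEigenvalue * matrixDimension
  def tolerance(eigenvalues: Seq[Double]): Double =
    Epsilon * eigenvalues.max * eigenvalues.length

  // Keep only eigenvalues that exceed the tolerance
  def nonZero(eigenvalues: Seq[Double]): Seq[Double] = {
    val tol = tolerance(eigenvalues)
    eigenvalues.filter(_ > tol)
  }
}
```

For eigenvalues (3.0, 0.0, 0.0), as in an all-ones 3x3 matrix, only 3.0 survives. For (1.0, 1e-15) the tolerance is about 4.4e-16, so both values are kept, whereas 1e-17 in place of 1e-15 would be dropped.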

Memory Optimization

  • Lazy Evaluation: Expensive computations are cached using @transient private lazy val
  • Efficient Storage: Uses optimized Breeze operations internally
  • Minimal Memory Footprint: Only stores essential computed values
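
The caching pattern named in the first bullet can be sketched in a few lines of plain Scala. `CachedConstants` is a hypothetical class written for illustration, and the counter exists only to show that the expensive step runs once; the field is public here (rather than private) so it can be inspected.

```scala
// Illustrative only: a value computed once on first access, then reused.
class CachedConstants(val values: Array[Double]) extends Serializable {
  // Counts how often the expensive step runs (demonstration only)
  @transient var computations: Int = 0

  // @transient keeps the cached value out of the serialized form, so it is
  // recomputed lazily after deserialization instead of being shipped.
  @transient lazy val logSum: Double = {
    computations += 1
    values.map(math.log).sum
  }
}
```

Reading `logSum` twice on the same instance leaves `computations` at 1, and nothing is computed at all if the value is never read.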

Error Handling

The MultivariateGaussian constructor validates inputs:

// These will throw IllegalArgumentException:

// Non-square covariance matrix
val badCov1 = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
// new MultivariateGaussian(mean, badCov1) // throws exception

// Dimension mismatch between mean and covariance
val mean2D = Vectors.dense(0.0, 0.0)
val cov3D = DenseMatrix.eye(3)
// new MultivariateGaussian(mean2D, cov3D) // throws exception

// All-zero eigenvalues (no non-zero singular values)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(mean2D, zeroCov) // may throw IllegalArgumentException
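
The two shape checks can be mimicked with Scala's `require`, which throws IllegalArgumentException when its condition is false. The sketch below works on plain dimensions rather than Spark types; `ValidationSketch`, `validate`, `safely`, and the message strings are illustrative assumptions, not the library's code.

```scala
import scala.util.Try

// Illustrative only: shape validation and a wrapper that captures failures.
object ValidationSketch {
  // Mirrors the checks described above using plain dimensions
  def validate(meanSize: Int, covRows: Int, covCols: Int): Unit = {
    require(covRows == covCols, "Covariance matrix must be square")
    require(meanSize == covCols, "Mean vector length must match covariance matrix size")
  }

  // Turns a failing construction into a Left with the error message
  def safely[A](build: => A): Either[String, A] =
    Try(build).toEither.left.map(_.getMessage)
}
```

Wrapping a real construction the same way, for example `ValidationSketch.safely(new MultivariateGaussian(mean2D, cov3D))`, lets calling code handle invalid inputs as values instead of catching exceptions.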

Valid Cases:

// These are all valid:

// Standard non-singular covariance
val validMvn1 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Singular but non-zero covariance
val singularCov = Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))
val validMvn2 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  singularCov
)

// Very small non-zero eigenvalues (handled gracefully)
val nearSingular = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1e-15))
val validMvn3 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  nearSingular
)

Mathematical Background

The multivariate Gaussian probability density function is:

pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))

Where:

  • k: Dimensionality (length of mean vector)
  • μ: Mean vector
  • Σ: Covariance matrix
  • |Σ|: Determinant of covariance matrix

For numerical stability, the implementation computes:

logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)

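The implementation uses the eigendecomposition Σ = U * D * U^T to compute the inverse and determinant efficiently. For a singular Σ, the pseudo-inverse and the product of the non-zero eigenvalues take their place.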
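
The role of the eigendecomposition can be written out term by term. The derivation below is a sketch consistent with the formulas above; the symbols d_i (eigenvalues on the diagonal of D), u_i (columns of U), and tol (the tolerance from Tolerance Calculation) are notation introduced here, and restricting the sums to d_i > tol is an assumption about how the pseudo-inverse enters.

```latex
\begin{aligned}
\Sigma &= U D U^{T}, \qquad D = \operatorname{diag}(d_1, \dots, d_k), \qquad U^{T} U = I \\
\log|\Sigma| &= \sum_{i:\, d_i > \mathrm{tol}} \log d_i \\
(x-\mu)^{T} \Sigma^{-1} (x-\mu) &= \sum_{i:\, d_i > \mathrm{tol}} \frac{\bigl(u_i^{T}(x-\mu)\bigr)^{2}}{d_i} \\
\mathrm{logpdf}(x) &= -\frac{k}{2}\log(2\pi)
  - \frac{1}{2}\sum_{i:\, d_i > \mathrm{tol}} \log d_i
  - \frac{1}{2}\sum_{i:\, d_i > \mathrm{tol}} \frac{\bigl(u_i^{T}(x-\mu)\bigr)^{2}}{d_i}
\end{aligned}
```

When every eigenvalue exceeds the tolerance, the sums run over all k terms and reproduce the exact log-determinant and inverse; no explicit matrix inversion is needed in either case.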

Integration with Spark MLlib

Within Spark MLlib, MultivariateGaussian serves as the component distribution of Gaussian Mixture Models. The same class is also a convenient building block for:

  • Class-Conditional Densities: Gaussian generative classifiers
  • Anomaly Detection: Outlier scoring by low density
  • Clustering: Gaussian-based clustering algorithms