Multivariate statistical distributions with robust numerical implementations that handle edge cases like singular covariance matrices. The library provides probability density function calculations with numerical stability features for machine learning applications.
Implementation of the multivariate normal distribution, with support for singular covariance matrices through pseudo-inverse calculations and reduced-dimensional subspace computations.
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable {
val mean: Vector
val cov: Matrix
def pdf(x: Vector): Double
def logpdf(x: Vector): Double
}

Usage examples:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = DenseMatrix.eye(2) // Identity covariance matrix
val gaussian = new MultivariateGaussian(mean, cov)
// Evaluate probability density at a point
val x = Vectors.dense(1.0, 1.0)
val density = gaussian.pdf(x)
val logDensity = gaussian.logpdf(x)
println(s"PDF at $x: $density")
println(s"Log PDF at $x: $logDensity")
// Create distribution with custom covariance
// Values are passed in column-major order; this symmetric matrix reads the
// same either way
val customCov = Matrices.dense(2, 2, Array(
1.0, 0.5,
0.5, 2.0
))
val correlatedGaussian = new MultivariateGaussian(mean, customCov)
// Evaluate at multiple points
val points = Seq(
Vectors.dense(0.0, 0.0),
Vectors.dense(1.0, 0.0),
Vectors.dense(0.0, 1.0),
Vectors.dense(1.0, 1.0)
)
points.foreach { point =>
val prob = correlatedGaussian.pdf(point)
println(s"PDF at ${point.toArray.mkString("[", ", ", "]")}: $prob")
}

The implementation handles singular (non-invertible) covariance matrices through pseudo-inverse computation and reduced-dimensional analysis.
Working with singular covariance matrices:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Create a singular covariance matrix (rank deficient)
val singularCov = Matrices.dense(3, 3, Array(
1.0, 1.0, 1.0,
1.0, 1.0, 1.0,
1.0, 1.0, 1.0
))
val mean = Vectors.dense(0.0, 0.0, 0.0)
// The MultivariateGaussian handles singular matrices automatically
val singularGaussian = new MultivariateGaussian(mean, singularCov)
// Density computation works in the reduced subspace
val x = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(x)
val logDensity = singularGaussian.logpdf(x)
println(s"PDF with singular covariance: $density")
println(s"Log PDF with singular covariance: $logDensity")Multiple ways to create MultivariateGaussian distributions from different input formats.
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable
// Private constructor for Breeze types (internal use)
private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])

Different initialization patterns:
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian
// Standard initialization with Spark vectors/matrices
val mean1 = Vectors.dense(1.0, 2.0)
val cov1 = DenseMatrix.eye(2)
val gaussian1 = new MultivariateGaussian(mean1, cov1)
// With sparse mean vector
val sparseMean = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val cov2 = DenseMatrix.eye(3)
val gaussian2 = new MultivariateGaussian(sparseMean, cov2)
// With sparse covariance matrix
val sparseIdentity = SparseMatrix.speye(2)
val gaussian3 = new MultivariateGaussian(mean1, sparseIdentity)

The multivariate Gaussian PDF is computed as:
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))

Where:

k is the dimensionality (the length of the mean vector)
μ is the mean vector
Σ is the covariance matrix
|Σ| is the determinant of the covariance matrix

The log PDF is computed for numerical stability:
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
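As a quick check of these formulas, the closed-form value for a 2-D standard Gaussian can be compared against logpdf (a minimal sketch; the comparison tolerance is arbitrary):

import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Standard 2-D Gaussian: k = 2, |Σ| = 1, and at x = (1, 1) the quadratic
// form (x-μ)^T Σ^(-1) (x-μ) = 2, so logpdf = -log(2π) - 1 ≈ -2.8379
val g = new MultivariateGaussian(Vectors.dense(0.0, 0.0), DenseMatrix.eye(2))
val x = Vectors.dense(1.0, 1.0)
val expected = -math.log(2.0 * math.Pi) - 1.0
assert(math.abs(g.logpdf(x) - expected) < 1e-9)
assert(math.abs(g.pdf(x) - math.exp(expected)) < 1e-9)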
For singular covariance matrices, the implementation first computes the eigendecomposition of the covariance matrix:

Σ = U * D * U^T

It then substitutes the pseudo-inverse for Σ^(-1) and the pseudo-determinant (the product of the non-zero eigenvalues) for |Σ|, effectively evaluating the density in the reduced-dimensional subspace spanned by the non-degenerate eigenvectors. The implementation uses a machine precision-based tolerance to determine which eigenvalues are considered non-zero:
val tolerance = EPSILON * max(eigenvalues) * dimensionality

This prevents numerical instability from very small eigenvalues that should be treated as zero.
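A minimal Breeze sketch of this thresholding (illustrative only; math.ulp(1.0) stands in for Spark's internal EPSILON constant):

import breeze.linalg.{eigSym, max => bmax, DenseMatrix => BDM}

// Rank-1 covariance: the 3x3 all-ones matrix has eigenvalues (3, 0, 0)
val singular = BDM.ones[Double](3, 3)
val eigenvalues = eigSym(singular).eigenvalues
val tolerance = math.ulp(1.0) * bmax(eigenvalues) * eigenvalues.length
val effectiveRank = eigenvalues.toArray.count(_ > tolerance)
println(s"tolerance = $tolerance, effective rank = $effectiveRank") // rank 1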
The internal implementation optimizes matrix operations: from the eigendecomposition it precomputes the pseudo-inverse square root D^(-1/2) * U^T and the log of the normalization constant once, so each density evaluation reduces to a single matrix-vector product followed by a dot product, as sketched below.
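A condensed sketch of that pattern in Breeze (names such as GaussianSketch, rootSigmaInv, and u are illustrative, not the library's private fields):

import breeze.linalg.{diag, eigSym, max, DenseMatrix => BDM, DenseVector => BDV}

class GaussianSketch(mu: BDV[Double], sigma: BDM[Double]) {
  // Computed once, lazily, on the first density evaluation, then cached
  private lazy val constants: (BDM[Double], Double) = {
    val es = eigSym(sigma)
    val d = es.eigenvalues
    val tol = math.ulp(1.0) * max(d) * d.length
    // Pseudo-inverse square root: invert only the non-negligible eigenvalues
    val dInvSqrt = d.map(v => if (v > tol) 1.0 / math.sqrt(v) else 0.0)
    val rootSigmaInv = diag(dInvSqrt) * es.eigenvectors.t
    // Log normalization constant, built from the pseudo-determinant
    val logPseudoDet = d.toArray.filter(_ > tol).map(math.log).sum
    val u = -0.5 * (mu.length * math.log(2.0 * math.Pi) + logPseudoDet)
    (rootSigmaInv, u)
  }
  def logpdf(x: BDV[Double]): Double = {
    val v = constants._1 * (x - mu) // one matrix-vector product
    constants._2 - 0.5 * (v dot v)  // plus one dot product
  }
}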
The implementation validates inputs and provides meaningful error messages:
// Dimension validation
require(cov.numCols == cov.numRows, "Covariance matrix must be square")
require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")
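// Illustrative (hypothetical values): a 3-element mean paired with a 2x2
// covariance fails the second require() with an IllegalArgumentException
// new MultivariateGaussian(Vectors.dense(1.0, 2.0, 3.0), DenseMatrix.eye(2))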
// Singular matrix handling
try {
val gaussian = new MultivariateGaussian(mean, singularCov)
val density = gaussian.pdf(x)
} catch {
case e: IllegalArgumentException =>
println("Covariance matrix has no non-zero singular values")
}

Distribution-dependent constants are computed lazily and cached: the eigendecomposition work described above runs once, on the first call to pdf or logpdf, and is reused for every subsequent evaluation.
The implementation leverages optimized BLAS operations, which makes repeated evaluation over many points inexpensive:
val samples = Seq(
Vectors.dense(1.2, 0.8),
Vectors.dense(0.9, 1.1),
Vectors.dense(1.1, 0.9)
)
val gaussian = new MultivariateGaussian(mean, cov)
// Compute likelihood of samples
val likelihoods = samples.map(gaussian.pdf)
val logLikelihoods = samples.map(gaussian.logpdf)
// Total log-likelihood
val totalLogLikelihood = logLikelihoods.sum

Log densities are also useful for anomaly detection, flagging points whose log-probability under a fitted model falls below a threshold:

val normalDataMean = Vectors.dense(0.0, 0.0)
val normalDataCov = DenseMatrix.eye(2)
val normalModel = new MultivariateGaussian(normalDataMean, normalDataCov)
def isAnomalous(point: Vector, threshold: Double): Boolean = {
val logProb = normalModel.logpdf(point)
logProb < threshold
}
val testPoint = Vectors.dense(3.0, 3.0)
val anomalous = isAnomalous(testPoint, -5.0)

MultivariateGaussian instances can also serve as the components of a mixture model:

// Component distributions for a mixture model
val component1 = new MultivariateGaussian(
Vectors.dense(-1.0, -1.0),
DenseMatrix.eye(2)
)
val component2 = new MultivariateGaussian(
Vectors.dense(1.0, 1.0),
Matrices.dense(2, 2, Array(2.0, 0.5, 0.5, 2.0))
)
// Evaluate mixture probability (would need mixture weights)
def mixturePdf(x: Vector, weights: Array[Double]): Double = {
weights(0) * component1.pdf(x) + weights(1) * component2.pdf(x)
}
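For example, with equal (hypothetical) mixture weights:

val weights = Array(0.5, 0.5)
val mixtureDensity = mixturePdf(Vectors.dense(0.0, 0.0), weights)
println(s"Mixture PDF at origin: $mixtureDensity")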