# Statistical Distributions

Multivariate statistical distributions for machine learning applications with support for probability density functions and log-probability calculations. Includes robust handling of singular covariance matrices using pseudo-inverse techniques.

## Capabilities

### Multivariate Gaussian Distribution

Multivariate Gaussian (Normal) distribution supporting both regular and degenerate (singular) covariance matrices through pseudo-inverse computation.

```scala { .api }
/**
 * Multivariate Gaussian (Normal) Distribution
 * Handles singular covariance matrices using reduced dimensional subspace computation
 * @param mean mean vector of the distribution
 * @param cov covariance matrix of the distribution (must be square and same size as mean)
 */
@DeveloperApi
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {

  /**
   * Returns density of this multivariate Gaussian at given point
   * @param x point to evaluate density at
   * @return probability density value
   */
  def pdf(x: Vector): Double

  /**
   * Returns log-density of this multivariate Gaussian at given point
   * @param x point to evaluate log-density at
   * @return log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // Covariance matrix: [[1.0, 0.5],
  0.5, 2.0   //                     [0.5, 2.0]]
))

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate density at different points
val point1 = Vectors.dense(0.0, 0.0)   // At mean
val point2 = Vectors.dense(1.0, 1.0)   // Away from mean
val point3 = Vectors.dense(-1.0, 2.0)  // Different direction

val density1 = gaussian.pdf(point1)  // Highest density (at mean)
val density2 = gaussian.pdf(point2)  // Lower density
val density3 = gaussian.pdf(point3)  // Even lower density

// Log-densities (more numerically stable for extreme values)
val logDensity1 = gaussian.logpdf(point1)  // log(density1)
val logDensity2 = gaussian.logpdf(point2)  // log(density2)
val logDensity3 = gaussian.logpdf(point3)  // log(density3)

println(s"PDF at mean: $density1")         // ~0.120 = 1/(2π√det(cov)), with det(cov) = 1.75
println(s"PDF at (1,1): $density2")        // Lower value
println(s"Log-PDF at mean: $logDensity1")  // ~-2.118
```
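
The values printed above can be cross-checked without Spark. The following is a minimal sketch in plain Scala, assuming a non-singular 2×2 covariance `[[a, b], [b, d]]` and using the closed-form 2×2 inverse; the object `Bivariate` and its method names are hypothetical helpers for illustration, not part of the library.

```scala
object Bivariate {
  /** Log-density of a 2D Gaussian with covariance [[a, b], [b, d]] (hypothetical helper). */
  def logPdf(x: (Double, Double), mean: (Double, Double),
             a: Double, b: Double, d: Double): Double = {
    val det = a * d - b * b
    require(det > 0.0, "this sketch only covers non-singular covariance")
    val dx = x._1 - mean._1
    val dy = x._2 - mean._2
    // (x-μ)^T Σ^(-1) (x-μ), using Σ^(-1) = [[d, -b], [-b, a]] / det
    val quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    // k = 2, so the normalizer contributes 2 * log(2π) + log(det)
    -0.5 * (2.0 * math.log(2.0 * math.Pi) + math.log(det) + quad)
  }

  /** Density obtained by exponentiating the log-density. */
  def pdf(x: (Double, Double), mean: (Double, Double),
          a: Double, b: Double, d: Double): Double =
    math.exp(logPdf(x, mean, a, b, d))
}
```

For the covariance `[[1.0, 0.5], [0.5, 2.0]]`, `Bivariate.pdf((0.0, 0.0), (0.0, 0.0), 1.0, 0.5, 2.0)` evaluates to roughly 0.1203, and the density falls as the point moves to `(1, 1)` and then `(-1, 2)`.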

### Singular Covariance Handling

The implementation robustly handles singular (non-invertible) covariance matrices using pseudo-inverse techniques based on eigenvalue decomposition.

```scala { .api }
// Internal singular value handling (conceptual - not directly accessible)
private def calculateCovarianceConstants: (BDM[Double], Double) = {
  // 1. Eigendecomposition: cov = U * D * U^T
  // 2. Filter eigenvalues below tolerance: tol = EPSILON * max(eigenvalues) * dimension
  // 3. Compute pseudo-determinant from non-zero eigenvalues
  // 4. Compute pseudo-inverse square root: D^(-1/2) for non-zero eigenvalues
  // 5. Return (D^(-1/2) * U^T, log_normalizer)
}
```
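
Steps 2 to 4 above can be sketched in plain Scala once the eigenvalues are known. This is an illustrative sketch, not the library's code: the object `PseudoConstants` and its method are hypothetical, and `math.ulp(1.0)` (machine epsilon for `Double`) is assumed to play the role of `EPSILON`.

```scala
object PseudoConstants {
  // Assumption: machine epsilon for Double stands in for EPSILON in step 2
  val Epsilon: Double = math.ulp(1.0)

  /**
   * Given the eigenvalues of a symmetric covariance matrix, returns
   * (diagonal of D^(-1/2) with 0.0 for dropped eigenvalues, log pseudo-determinant).
   */
  def fromEigenvalues(eigenvalues: Array[Double]): (Array[Double], Double) = {
    require(eigenvalues.nonEmpty, "need at least one eigenvalue")
    // Step 2: eigenvalues at or below tol are treated as zero
    val tol = Epsilon * eigenvalues.max * eigenvalues.length
    // Step 3: log pseudo-determinant = sum of logs of the retained eigenvalues
    val logPseudoDet = eigenvalues.filter(_ > tol).map(math.log).sum
    // Step 4: invert the square root of retained eigenvalues, zero out the rest
    val invSqrt = eigenvalues.map(v => if (v > tol) 1.0 / math.sqrt(v) else 0.0)
    (invSqrt, logPseudoDet)
  }
}
```

As an example, the 3×3 all-ones matrix has eigenvalues 0, 0, and 3, so `PseudoConstants.fromEigenvalues(Array(0.0, 0.0, 3.0))` keeps only the last one: the log pseudo-determinant is `log(3)` and the only non-zero entry of `D^(-1/2)` is `1/√3`.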

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Example 1: Singular covariance (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,  // All rows/columns are linearly dependent
  1.0, 1.0, 1.0,  // Rank = 1, not full rank
  1.0, 1.0, 1.0
))

val singularGaussian = new MultivariateGaussian(mean, singularCov)

// Still works: the pseudo-inverse is used internally
val testPoint = Vectors.dense(0.5, 0.5, 0.5)
val density = singularGaussian.pdf(testPoint)        // Computed in reduced dimension
val logDensity = singularGaussian.logpdf(testPoint)  // More stable computation

// Example 2: Nearly singular covariance
val mean2 = Vectors.dense(1.0, 2.0)
val nearlySingularCov = Matrices.dense(2, 2, Array(
  1.0, 0.9999,  // Very high correlation, nearly singular
  0.9999, 1.0
))

val nearSingularGaussian = new MultivariateGaussian(mean2, nearlySingularCov)
val point = Vectors.dense(1.1, 2.1)
val stableDensity = nearSingularGaussian.pdf(point)  // Handles numerical issues

// Example 3: Diagonal covariance (well-conditioned)
val diagonalCov = Matrices.dense(2, 2, Array(
  2.0, 0.0,  // Independent dimensions
  0.0, 3.0
))

val diagonalGaussian = new MultivariateGaussian(mean2, diagonalCov)
val wellConditionedDensity = diagonalGaussian.pdf(point)  // Well-conditioned: no eigenvalues are dropped
```

### Construction Patterns

Different ways to create multivariate Gaussian distributions for common use cases.

```scala { .api }
// Standard constructor
new MultivariateGaussian(mean: Vector, cov: Matrix)

// Internal Breeze constructor (not typically used directly)
private[ml] def this(mean: BDV[Double], cov: BDM[Double])
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices, DenseMatrix}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 1. Isotropic Gaussian (spherical covariance)
val isotropicMean = Vectors.dense(0.0, 0.0, 0.0)
// Identity scaled by 0.5: σ² = 0.5 for all dimensions
val isotropicCov = Matrices.diag(Vectors.dense(0.5, 0.5, 0.5))

val isotropicGaussian = new MultivariateGaussian(isotropicMean, isotropicCov)

// 2. Diagonal Gaussian (independent dimensions with different variances)
val diagMean = Vectors.dense(1.0, -1.0, 2.0)
val diagCov = DenseMatrix.diag(Vectors.dense(1.0, 2.0, 0.5))  // Different variances

val diagonalGaussian = new MultivariateGaussian(diagMean, diagCov)

// 3. Full covariance Gaussian (correlated dimensions)
val corrMean = Vectors.dense(0.0, 0.0)
val corrCov = Matrices.dense(2, 2, Array(
  2.0, 1.0,  // Positive correlation
  1.0, 2.0
))

val correlatedGaussian = new MultivariateGaussian(corrMean, corrCov)

// 4. From empirical data (conceptual - you'd compute these from data)
val dataMean = Vectors.dense(2.5, 1.8)     // Sample mean
val dataCov = Matrices.dense(2, 2, Array(  // Sample covariance
  1.2, 0.3,
  0.3, 0.8
))

val empiricalGaussian = new MultivariateGaussian(dataMean, dataCov)

// Evaluate each distribution at the origin of its own space
// (the test point must have the same dimension as the distribution's mean)
val origin3 = Vectors.dense(0.0, 0.0, 0.0)
val origin2 = Vectors.dense(0.0, 0.0)
println(s"Isotropic density: ${isotropicGaussian.pdf(origin3)}")
println(s"Diagonal density: ${diagonalGaussian.pdf(origin3)}")
println(s"Correlated density: ${correlatedGaussian.pdf(origin2)}")
```
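
For the empirical-data pattern, the sample mean and covariance have to come from somewhere. The sketch below shows one way to compute them in plain Scala, assuming the data fits in memory and that the unbiased (n − 1) estimator is wanted; `SampleStats` and `meanAndCov` are hypothetical names, not library functions.

```scala
object SampleStats {
  /**
   * Returns (mean, covariance) for a set of observations of equal length.
   * The covariance is a flat array in column-major order and uses the
   * unbiased n - 1 denominator (an assumption of this sketch).
   */
  def meanAndCov(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    require(rows.size >= 2, "need at least two observations")
    val k = rows.head.length
    require(rows.forall(_.length == k), "all rows must have the same dimension")
    val n = rows.size
    // Per-dimension average
    val mean = Array.tabulate(k)(j => rows.map(_(j)).sum / n)
    // Column-major layout: entry (i, j) lives at index i + j * k
    val cov = Array.tabulate(k * k) { idx =>
      val i = idx % k
      val j = idx / k
      rows.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
    (mean, cov)
  }
}
```

The two arrays can then be wrapped with `Vectors.dense(mean)` and `Matrices.dense(k, k, cov)` and passed to the standard constructor.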

### Mathematical Properties

The formulas below define what `pdf` and `logpdf` compute.

```scala { .api }
// Probability density function formula (conceptual)
// pdf(x) = (2π)^(-k/2) * det(Σ)^(-1/2) * exp(-0.5 * (x-μ)^T * Σ^(-1) * (x-μ))
// where:
//   k = dimension
//   μ = mean vector
//   Σ = covariance matrix
//   det(Σ) = determinant of covariance matrix

// Log probability density (more numerically stable)
// logpdf(x) = -0.5 * [k*log(2π) + log(det(Σ)) + (x-μ)^T * Σ^(-1) * (x-μ)]
```
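
For reference, the same two formulas in typeset form, with the symbols k, μ, and Σ as defined above:

```latex
\[
p(x) = (2\pi)^{-k/2}\,\det(\Sigma)^{-1/2}\,
       \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
\]
\[
\log p(x) = -\tfrac{1}{2}\left[\,k\log(2\pi) + \log\det(\Sigma)
            + (x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right]
\]
```

Taking the logarithm of the first expression term by term gives the second. For a singular Σ, the determinant and inverse are replaced by the pseudo-determinant and pseudo-inverse described under Singular Covariance Handling.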

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Understanding the relationship between pdf and logpdf
val mean = Vectors.dense(1.0, 2.0)
val cov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // Identity covariance
val gaussian = new MultivariateGaussian(mean, cov)

val x = Vectors.dense(1.0, 2.0)  // Point at the mean

val pdf = gaussian.pdf(x)
val logpdf = gaussian.logpdf(x)

// Verify relationship: pdf = exp(logpdf)
val pdfFromLog = math.exp(logpdf)
println(s"PDF: $pdf")
println(s"PDF from log: $pdfFromLog")
println(s"Difference: ${math.abs(pdf - pdfFromLog)}")  // Should be very small

// Maximum density is at the mean
val atMean = gaussian.pdf(mean)
val awayFromMean = gaussian.pdf(Vectors.dense(3.0, 4.0))
println(s"Density at mean: $atMean")
println(s"Density away from mean: $awayFromMean")
assert(atMean > awayFromMean)  // Density decreases away from mean

// For identity covariance, the maximum density is 1/(2π)^(k/2)
val k = mean.size  // k = 2 dimensions
val theoreticalMax = 1.0 / math.pow(2.0 * math.Pi, k / 2.0)
println(s"Theoretical maximum density: $theoreticalMax")
println(s"Actual maximum density: $atMean")
```
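
The two densities in this example can be worked out by hand from the density formula. With k = 2, μ = (1, 2), and Σ = I, so that det(Σ) = 1 and Σ⁻¹ = I:

```latex
\[
p(\mu) = \frac{1}{(2\pi)^{2/2}} = \frac{1}{2\pi} \approx 0.15915
\]
\[
p\big((3,4)\big) = \frac{1}{2\pi}\,
  \exp\!\left(-\tfrac{1}{2}\left[(3-1)^2 + (4-2)^2\right]\right)
  = \frac{e^{-4}}{2\pi} \approx 0.002915
\]
```

The density at the mean is therefore about 55 times the density at (3, 4), consistent with the assertion in the example.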

## Error Handling

The MultivariateGaussian class validates inputs and handles edge cases gracefully:

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Dimension mismatch between mean and covariance
val mean = Vectors.dense(1.0, 2.0)
val wrongSizeCov = Matrices.eye(3)
// new MultivariateGaussian(mean, wrongSizeCov) // IllegalArgumentException: Mean vector length must match covariance matrix size

// Non-square covariance matrix
val nonSquareCov = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// new MultivariateGaussian(mean, nonSquareCov) // IllegalArgumentException: Covariance matrix must be square

// Completely zero covariance (no variation)
val zeroMean = Vectors.dense(0.0, 0.0)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(zeroMean, zeroCov) // IllegalArgumentException: Covariance matrix has no non-zero singular values

// Valid but challenging cases (handled gracefully)
val validMean = Vectors.dense(0.0, 0.0)
val validCov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val validGaussian = new MultivariateGaussian(validMean, validCov)

// Very small probabilities (handled via log-space)
val farPoint = Vectors.dense(10.0, 10.0)    // Very far from mean
val smallPdf = validGaussian.pdf(farPoint)  // Very small positive number
val logPdf = validGaussian.logpdf(farPoint) // Large negative number
println(s"Small PDF: $smallPdf")  // Close to 0
println(s"Log PDF: $logPdf")      // More informative

// Numerical stability demonstration
val extremePoint = Vectors.dense(50.0, 50.0)
val extremeLogPdf = validGaussian.logpdf(extremePoint)  // Large negative
val extremePdf = validGaussian.pdf(extremePoint)        // Essentially 0
println(s"Extreme log PDF: $extremeLogPdf")  // Still computable
println(s"Extreme PDF: $extremePdf")         // Underflows to 0.0 in double precision
```
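
The underflow behaviour above does not depend on Spark and can be reproduced with the closed-form log-density of a 2D standard Gaussian. The sketch below is plain Scala; `LogSpace` and `stdLogPdf` are hypothetical names chosen for illustration.

```scala
object LogSpace {
  /** Log-density of a 2D Gaussian with zero mean and identity covariance. */
  def stdLogPdf(x: Double, y: Double): Double =
    -math.log(2.0 * math.Pi) - 0.5 * (x * x + y * y)

  val farLog: Double = stdLogPdf(10.0, 10.0)      // about -101.84
  val extremeLog: Double = stdLogPdf(50.0, 50.0)  // about -2501.84

  // Exponentiating the extreme value underflows: the result is exactly 0.0
  val extremePdf: Double = math.exp(extremeLog)

  // The two points can still be compared in log space:
  // the log of the density ratio stays finite
  val logRatio: Double = farLog - extremeLog      // about 2400
}
```

Working with `logpdf` keeps comparisons such as likelihood ratios meaningful even when the densities themselves are too small to represent as a `Double`.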