# Statistical Distributions

Multivariate probability distributions with numerical stability and support for singular covariance matrices. Designed for robust statistical computations in machine learning applications.

## Capabilities

### Multivariate Gaussian Distribution

Implementation of the multivariate normal distribution, with support for degenerate (singular) covariance matrices.

```scala { .api }
/**
 * Multivariate Gaussian (Normal) Distribution
 * Handles singular covariance matrices by computing density in a reduced-dimensional subspace
 *
 * Note: This class is marked as @DeveloperApi in Spark MLlib
 *
 * @param mean Mean vector of the distribution
 * @param cov Covariance matrix of the distribution (must be square and same size as mean)
 */
class MultivariateGaussian(
  val mean: Vector,
  val cov: Matrix
) extends Serializable {

  /**
   * Private constructor taking Breeze types (internal use)
   * @param mean Mean vector as Breeze DenseVector
   * @param cov Covariance matrix as Breeze DenseMatrix
   */
  private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])

  /**
   * Mean vector of the distribution
   * @return Vector containing mean values for each dimension
   */
  def mean: Vector

  /**
   * Covariance matrix of the distribution
   * @return Square matrix representing covariance structure
   */
  def cov: Matrix

  /**
   * Compute the probability density function at a given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Probability density value (always non-negative)
   */
  def pdf(x: Vector): Double

  /**
   * Compute the log probability density function at a given point
   * @param x Point to evaluate (must have same size as mean)
   * @return Log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5, // column 1: [1.0, 0.5]
  0.5, 1.0  // column 2: [0.5, 1.0]
))

val mvn = new MultivariateGaussian(mean, cov)

// Evaluate density at specific points
val point1 = Vectors.dense(0.0, 0.0) // at mean
val point2 = Vectors.dense(1.0, 1.0) // away from mean

val density1 = mvn.pdf(point1) // higher density (near mean)
val density2 = mvn.pdf(point2) // lower density (away from mean)

val logDensity1 = mvn.logpdf(point1) // more numerically stable
val logDensity2 = mvn.logpdf(point2)

println(s"Density at mean: $density1")
println(s"Density at (1,1): $density2")
println(s"Log density at mean: $logDensity1")
```
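
As a quick sanity check (hand-derived, not library output): the density of a 2D Gaussian at its mean is `(2π)^(-1) * |Σ|^(-1/2)`, and for this covariance `|Σ| = 1.0·1.0 − 0.5·0.5 = 0.75`, so `density1` should come out near 0.184:

```scala
// Closed-form density at the mean for cov = [[1.0, 0.5], [0.5, 1.0]]
val det = 1.0 * 1.0 - 0.5 * 0.5                        // |Σ| = 0.75
val expectedAtMean = 1.0 / (2 * math.Pi * math.sqrt(det))
println(f"Expected density at mean: $expectedAtMean%.4f") // ≈ 0.1838
```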

### Identity Covariance

The simplest case: an identity covariance matrix gives uncorrelated dimensions with unit variance.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// 3D Gaussian with identity covariance (uncorrelated)
val mean = Vectors.dense(1.0, 2.0, 3.0)
val identityCov = DenseMatrix.eye(3)

val independentMvn = new MultivariateGaussian(mean, identityCov)

// Evaluate at the mean (should give the highest density) and at the origin
val atMean = independentMvn.pdf(mean)
val awayFromMean = independentMvn.pdf(Vectors.dense(0.0, 0.0, 0.0))

println(s"Density at mean: $atMean")
println(s"Density at origin: $awayFromMean")
```
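
Because the covariance is the identity, the joint density factorizes into a product of univariate standard normals. The following cross-check restates that textbook identity, continuing the values defined above (it is plain math, not a library guarantee):

```scala
// Product of univariate standard normal densities should match the joint pdf
val x = Vectors.dense(0.0, 0.0, 0.0)
val productOfMarginals = x.toArray.zip(mean.toArray).map { case (xi, mi) =>
  math.exp(-0.5 * (xi - mi) * (xi - mi)) / math.sqrt(2 * math.Pi)
}.product

println(s"Joint pdf:            ${independentMvn.pdf(x)}")
println(s"Product of marginals: $productOfMarginals")
```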

### Singular Covariance Matrices

Handling of degenerate covariance matrices where some dimensions are linearly dependent.

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Singular covariance matrix (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0, // column 1
  1.0, 1.0, 1.0, // column 2 (same as column 1)
  1.0, 1.0, 1.0  // column 3 (same as column 1)
))

// This will work despite the singular covariance
val singularMvn = new MultivariateGaussian(mean, singularCov)

// Density is computed in the reduced-dimensional subspace
val point = Vectors.dense(1.0, 1.0, 1.0)
val density = singularMvn.pdf(point)

println(s"Density with singular covariance: $density")
```

### Advanced Usage Patterns

**Working with High-Dimensional Data:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// High-dimensional Gaussian (e.g., for text analysis)
val dim = 100
val mean = Vectors.zeros(dim)
val cov = DenseMatrix.eye(dim)

val highDimMvn = new MultivariateGaussian(mean, cov)

// Use logpdf for numerical stability in high dimensions
val testPoint = Vectors.dense(Array.fill(dim)(0.1))
val logDensity = highDimMvn.logpdf(testPoint)

// Avoid pdf in high dimensions due to numerical underflow
// val density = highDimMvn.pdf(testPoint) // may underflow to 0.0
```
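
When candidate points must be compared in high dimensions, differences of `logpdf` values stay finite even where `pdf` itself underflows. A small sketch, continuing the variables from the block above:

```scala
// Compare two points by their log-density difference (stable even when pdf underflows)
val otherPoint = Vectors.dense(Array.fill(dim)(0.2))
val logRatio = highDimMvn.logpdf(testPoint) - highDimMvn.logpdf(otherPoint)
val relativeLikelihood = math.exp(logRatio) // how much more likely testPoint is than otherPoint

println(s"Log-density ratio: $logRatio")
println(s"Relative likelihood: $relativeLikelihood")
```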

**Batch Evaluation:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val mvn = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Evaluate multiple points
val testPoints = Array(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(1.0, 1.0)
)

val densities = testPoints.map(mvn.pdf)
val logDensities = testPoints.map(mvn.logpdf)

testPoints.zip(densities).foreach { case (point, density) =>
  println(s"Point ${point.toArray.mkString("(", ", ", ")")}: density = $density")
}
```
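
The distribution extends `Serializable`, so it can also be applied across a DataFrame of feature vectors. A sketch assuming an active `SparkSession` and a DataFrame `df` with a vector-typed `features` column (both hypothetical here), continuing with the `mvn` defined above:

```scala
// Score every row's "features" vector with the distribution defined above
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

val logDensityUdf = udf((v: Vector) => mvn.logpdf(v))
val scored = df.withColumn("logDensity", logDensityUdf(col("features")))
scored.show()
```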

## Internal Implementation

### Numerical Stability

The implementation uses several techniques for numerical stability:

1. **Eigenvalue Decomposition**: Uses eigendecomposition instead of direct matrix inversion
2. **Pseudo-Inverse**: Handles singular covariance matrices via the Moore-Penrose pseudo-inverse
3. **Tolerance-Based Filtering**: Eigenvalues below a machine-precision threshold are treated as zero
4. **Log-Space Computation**: `logpdf` avoids numerical underflow in high dimensions

### Tolerance Calculation

```scala
// Internal tolerance calculation (not part of the public API)
val tolerance = EPSILON * maxEigenvalue * matrixDimension
```

Where:

- `EPSILON`: Machine epsilon from `Utils.EPSILON`
- `maxEigenvalue`: Maximum eigenvalue of the covariance matrix
- `matrixDimension`: Size of the covariance matrix
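
To make the pieces above concrete, here is a minimal, self-contained sketch of how an eigendecomposition, the tolerance, and a pseudo-inverse can combine to evaluate a log density. It uses Breeze directly and illustrates the technique rather than reproducing Spark's actual source; the hard-coded machine epsilon and the names used are assumptions:

```scala
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, eigSym, max}

def logpdfSketch(x: BDV[Double], mu: BDV[Double], sigma: BDM[Double]): Double = {
  // sigma = u * diag(d) * u.t (symmetric eigendecomposition)
  val eigSym.EigSym(d, u) = eigSym(sigma)

  // tolerance = EPSILON * maxEigenvalue * matrixDimension (EPSILON hard-coded here)
  val tol = 2.220446049250313e-16 * max(d) * d.length

  // pseudo-determinant: sum of logs of eigenvalues above the tolerance
  val logPseudoDet = d.toArray.filter(_ > tol).map(math.log).sum

  // pseudo-inverse square root: invert only eigenvalues above the tolerance
  val dInvSqrt = BDV(d.toArray.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0))

  // rotate the centred point into the eigenbasis and rescale;
  // directions with (near-)zero eigenvalues contribute nothing
  val v = dInvSqrt *:* (u.t * (x - mu))

  -0.5 * (mu.length * math.log(2.0 * math.Pi) + logPseudoDet + (v dot v))
}

// Example: a rank-deficient 2x2 covariance still yields a finite log density
val sigma = BDM((1.0, 1.0), (1.0, 1.0))
println(logpdfSketch(BDV(0.5, 0.5), BDV(0.0, 0.0), sigma))
```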

### Memory Optimization

- **Lazy Evaluation**: Expensive computations are cached using `@transient private lazy val` (sketched below)
- **Efficient Storage**: Uses optimized Breeze operations internally
- **Minimal Memory Footprint**: Only stores essential computed values
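
A minimal illustration of that caching pattern (the class and field names here are hypothetical, not Spark's internals):

```scala
import org.apache.spark.ml.linalg.Matrix

class CachedDerivation(val cov: Matrix) extends Serializable {
  // Computed once on first access, excluded from serialization,
  // and recomputed lazily after the object is deserialized on another JVM.
  @transient private lazy val derivedConstant: Double =
    cov.toArray.sum // stand-in for an expensive derivation from cov
}
```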

## Error Handling

The `MultivariateGaussian` constructor validates its inputs:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// These will throw IllegalArgumentException:

// Non-square covariance matrix
val mean = Vectors.dense(0.0, 0.0)
val badCov1 = Matrices.dense(2, 3, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0))
// new MultivariateGaussian(mean, badCov1) // throws exception

// Dimension mismatch between mean and covariance
val mean2D = Vectors.dense(0.0, 0.0)
val cov3D = DenseMatrix.eye(3)
// new MultivariateGaussian(mean2D, cov3D) // throws exception

// All-zero eigenvalues (no non-zero singular values)
val zeroCov = Matrices.zeros(2, 2)
// new MultivariateGaussian(mean2D, zeroCov) // may throw IllegalArgumentException
```
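
If parameters come from an untrusted code path, construction can be guarded explicitly. A small sketch continuing the values above (exactly which call surfaces the `IllegalArgumentException` is an implementation detail):

```scala
import scala.util.{Failure, Success, Try}

Try(new MultivariateGaussian(mean2D, cov3D)) match {
  case Success(dist) => println(s"Constructed distribution over ${dist.mean.size} dimensions")
  case Failure(e)    => println(s"Rejected invalid parameters: ${e.getMessage}")
}
```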

**Valid Cases:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// These are all valid:

// Standard non-singular covariance
val validMvn1 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  DenseMatrix.eye(2)
)

// Singular but non-zero covariance
val singularCov = Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))
val validMvn2 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  singularCov
)

// Very small non-zero eigenvalues (handled gracefully)
val nearSingular = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1e-15))
val validMvn3 = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),
  nearSingular
)
```

## Mathematical Background

The multivariate Gaussian probability density function is:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:

- `k`: Dimensionality (length of the mean vector)
- `μ`: Mean vector
- `Σ`: Covariance matrix
- `|Σ|`: Determinant of the covariance matrix

For numerical stability, the implementation computes:

```
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
```

The implementation uses the eigendecomposition `Σ = U * D * U^T` to compute the inverse and determinant efficiently, falling back to their pseudo- counterparts when `Σ` is singular.
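
In the same notation, keeping only eigenvalues above the tolerance gives those pseudo- quantities (a restatement of the approach described under Internal Implementation):

```
Σ^(-1) ≈ U * D^(+) * U^T    where D^(+) inverts only eigenvalues above the tolerance
log|Σ| ≈ Σ log(λ_i)         summed over eigenvalues λ_i above the tolerance
```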

## Integration with Spark MLlib

MultivariateGaussian is used throughout Spark MLlib for:

- **Gaussian Mixture Models**: Component distributions (see the sketch below)
- **Naive Bayes**: Class-conditional distributions
- **Anomaly Detection**: Outlier scoring
- **Dimensionality Reduction**: Principal component analysis
- **Clustering**: Gaussian-based clustering algorithms
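
For example, a fitted Gaussian mixture model exposes its components as `MultivariateGaussian` instances. A sketch assuming an active `SparkSession` and a DataFrame `training` with a vector-typed `features` column (both hypothetical here):

```scala
import org.apache.spark.ml.clustering.GaussianMixture

val gmm = new GaussianMixture()
  .setK(3)
  .setFeaturesCol("features")

val model = gmm.fit(training)

// Each component is a MultivariateGaussian as documented above
model.gaussians.zipWithIndex.foreach { case (g, i) =>
  println(s"Component $i mean: ${g.mean}")
  println(s"Component $i log-density at its own mean: ${g.logpdf(g.mean)}")
}
```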