# Statistical Distributions

This module provides multivariate statistical distributions with numerically robust implementations that handle edge cases such as singular covariance matrices. It computes probability densities with numerical-stability safeguards suited to machine learning applications.

## Capabilities

### Multivariate Gaussian Distribution

`MultivariateGaussian` implements the multivariate normal distribution. It supports singular covariance matrices by using a pseudo-inverse and evaluating the density in the reduced-dimensional subspace where the distribution is supported.

```scala { .api }
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable {
  val mean: Vector
  val cov: Matrix
  def pdf(x: Vector): Double
  def logpdf(x: Vector): Double
}
```

**Usage examples:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = DenseMatrix.eye(2) // Identity covariance matrix
val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate probability density at a point
val x = Vectors.dense(1.0, 1.0)
val density = gaussian.pdf(x)
val logDensity = gaussian.logpdf(x)

println(s"PDF at $x: $density")
println(s"Log PDF at $x: $logDensity")

// Create distribution with custom covariance
val customCov = Matrices.dense(2, 2, Array(
  1.0, 0.5,
  0.5, 2.0
))
val correlatedGaussian = new MultivariateGaussian(mean, customCov)

// Evaluate at multiple points
val points = Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(0.0, 1.0),
  Vectors.dense(1.0, 1.0)
)

points.foreach { point =>
  val prob = correlatedGaussian.pdf(point)
  println(s"PDF at ${point.toArray.mkString("[", ", ", "]")}: $prob")
}
```

### Singular Covariance Matrix Handling

The implementation handles singular (non-invertible) covariance matrices by computing a pseudo-inverse and evaluating the density in the reduced-dimensional subspace.

**Working with singular covariance matrices:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a singular covariance matrix (rank deficient: all rows are identical)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,
  1.0, 1.0, 1.0,
  1.0, 1.0, 1.0
))

val mean = Vectors.dense(0.0, 0.0, 0.0)

// The MultivariateGaussian handles singular matrices automatically
val singularGaussian = new MultivariateGaussian(mean, singularCov)

// Density computation works in the reduced subspace;
// x lies along (1, 1, 1), the only direction with non-zero variance
val x = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(x)
val logDensity = singularGaussian.logpdf(x)

println(s"PDF with singular covariance: $density")
println(s"Log PDF with singular covariance: $logDensity")
```

### Constructor Variants

`MultivariateGaussian` has one public constructor, which accepts any Spark `Vector` and `Matrix` (dense or sparse), plus an internal constructor for Breeze types.

```scala { .api }
class MultivariateGaussian(mean: Vector, cov: Matrix) extends Serializable

// Private constructor for Breeze types (internal use)
private[ml] def this(mean: breeze.linalg.DenseVector[Double], cov: breeze.linalg.DenseMatrix[Double])
```

**Different initialization patterns:**

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Standard initialization with Spark vectors/matrices
val mean1 = Vectors.dense(1.0, 2.0)
val cov1 = DenseMatrix.eye(2)
val gaussian1 = new MultivariateGaussian(mean1, cov1)

// With sparse mean vector
val sparseMean = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val cov2 = DenseMatrix.eye(3)
val gaussian2 = new MultivariateGaussian(sparseMean, cov2)

// With sparse covariance matrix
val sparseIdentity = SparseMatrix.speye(2)
val gaussian3 = new MultivariateGaussian(mean1, sparseIdentity)
```

## Mathematical Background

### Probability Density Function

The multivariate Gaussian PDF is computed as:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:

- `k` is the dimensionality (length of mean vector)
- `μ` is the mean vector
- `Σ` is the covariance matrix
- `|Σ|` is the determinant of the covariance matrix
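
As a quick sanity check of the formula, the sketch below evaluates it by hand for the special case of a diagonal covariance matrix, where the determinant is the product of the variances and the quadratic form is a sum of variance-scaled squared residuals. It uses only the Scala standard library; `diagonalGaussianPdf` is a hypothetical helper written for this illustration and is not part of the Spark API.

```scala
// Hypothetical helper (not part of Spark): evaluates the PDF formula
// for a diagonal covariance matrix given by its per-dimension variances.
object DiagonalPdfSketch {
  def diagonalGaussianPdf(x: Array[Double], mu: Array[Double], variances: Array[Double]): Double = {
    require(x.length == mu.length && mu.length == variances.length, "dimension mismatch")
    val k = mu.length
    // |Σ| of a diagonal matrix is the product of its diagonal entries
    val det = variances.product
    // (x-μ)^T Σ^(-1) (x-μ) reduces to a sum of variance-scaled squared residuals
    val quad = x.indices.map(i => math.pow(x(i) - mu(i), 2) / variances(i)).sum
    math.pow(2 * math.Pi, -k / 2.0) * math.pow(det, -0.5) * math.exp(-0.5 * quad)
  }

  def main(args: Array[String]): Unit = {
    // Zero mean, identity covariance, evaluated at x = (1, 1)
    val p = diagonalGaussianPdf(Array(1.0, 1.0), Array(0.0, 0.0), Array(1.0, 1.0))
    println(f"pdf = $p%.6f") // equals e^(-1) / (2π), roughly 0.0585
  }
}
```

For a zero mean and identity covariance the result should agree with `MultivariateGaussian.pdf` at the same point, which makes this a convenient cross-check when experimenting.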

### Log Probability Density Function

The log PDF is evaluated directly in log space, which avoids floating-point underflow for points far from the mean:

```
logpdf(x) = -k/2 * log(2π) - 1/2 * log|Σ| - 1/2 * (x-μ)^T * Σ^(-1) * (x-μ)
```
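
To illustrate why the log form matters, here is a minimal sketch for a one-dimensional standard normal (`k = 1`, `Σ = 1`), written with the Scala standard library only. `stdNormalLogPdf` is a hypothetical helper for this example, not a Spark function.

```scala
object LogPdfUnderflowSketch {
  // Log-density of a 1D standard normal, taken directly from the logpdf formula
  def stdNormalLogPdf(x: Double): Double =
    -0.5 * math.log(2 * math.Pi) - 0.5 * x * x

  def main(args: Array[String]): Unit = {
    val x = 40.0
    val logP = stdNormalLogPdf(x) // about -800.9, easily representable
    val p = math.exp(logP)        // underflows to 0.0 in double precision
    println(s"logpdf = $logP, pdf = $p")
  }
}
```

Summing log-densities therefore keeps a usable likelihood even when every individual density has underflowed to zero.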

### Singular Covariance Handling

For singular covariance matrices, the implementation:

1. Computes eigendecomposition: `Σ = U * D * U^T`
2. Identifies non-zero eigenvalues using numerical tolerance
3. Computes pseudo-determinant from non-zero eigenvalues
4. Uses pseudo-inverse for the quadratic form computation
5. Operates in the reduced-dimensional subspace where the distribution is supported
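
The steps above can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not Spark's actual code: the eigenpairs are supplied by the caller because the standard library has no eigensolver, `logPdf` and its parameters are hypothetical names, and `k` is taken as the full length of the mean vector, following the definition given with the PDF formula.

```scala
// Illustrative sketch only: mirrors the listed steps, not Spark's implementation.
object SingularLogPdfSketch {
  // eigenvalues(i) pairs with the unit-length eigenvector eigenvectors(i) (step 1 is assumed done)
  def logPdf(x: Array[Double], mu: Array[Double],
             eigenvalues: Array[Double], eigenvectors: Array[Array[Double]],
             tol: Double): Double = {
    val k = mu.length
    val delta = x.zip(mu).map { case (a, b) => a - b }
    // Step 2: keep only eigenpairs whose eigenvalue exceeds the tolerance
    val kept = eigenvalues.zip(eigenvectors).filter { case (lambda, _) => lambda > tol }
    require(kept.nonEmpty, "Covariance matrix has no non-zero eigenvalues")
    // Step 3: log pseudo-determinant is the sum of logs of the retained eigenvalues
    val logPseudoDet = kept.map { case (lambda, _) => math.log(lambda) }.sum
    // Steps 4-5: pseudo-inverse quadratic form, evaluated in the retained subspace
    val quad = kept.map { case (lambda, u) =>
      val proj = u.zip(delta).map { case (a, b) => a * b }.sum
      proj * proj / lambda
    }.sum
    -0.5 * (k * math.log(2 * math.Pi) + logPseudoDet + quad)
  }

  def main(args: Array[String]): Unit = {
    // The all-ones 3x3 matrix has eigenvalues (3, 0, 0);
    // the eigenvector for eigenvalue 3 is (1, 1, 1) / sqrt(3).
    val s = 1.0 / math.sqrt(3.0)
    val r = 1.0 / math.sqrt(2.0)
    val t = 1.0 / math.sqrt(6.0)
    val eigenvalues = Array(3.0, 0.0, 0.0)
    val eigenvectors = Array(Array(s, s, s), Array(r, -r, 0.0), Array(t, t, -2 * t))
    val lp = logPdf(Array(1.0, 1.0, 1.0), Array(0.0, 0.0, 0.0), eigenvalues, eigenvectors, tol = 1e-12)
    println(s"logpdf = $lp")
  }
}
```

Only the eigenpair with eigenvalue 3 contributes here, so the quadratic form equals 1 and the log pseudo-determinant equals `log(3)`.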

## Numerical Stability Features

### Tolerance-Based Eigenvalue Filtering

The implementation uses a tolerance based on machine precision to decide which eigenvalues count as non-zero:

```scala
val tolerance = EPSILON * max(eigenvalues) * dimensionality
```

Eigenvalues below this tolerance are treated as zero, which prevents the instability that would come from inverting very small eigenvalues.
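
A small standalone sketch of this filter, assuming that `EPSILON` denotes double-precision machine epsilon (obtained here as `math.ulp(1.0)`) and that eigenvalues strictly greater than the tolerance are kept. The object and method names are hypothetical.

```scala
object EigenvalueToleranceSketch {
  // Machine epsilon for Double (2^-52); assumed to play the role of EPSILON
  val Epsilon: Double = math.ulp(1.0)

  def tolerance(eigenvalues: Array[Double]): Double =
    Epsilon * eigenvalues.max * eigenvalues.length

  // Eigenvalues strictly greater than the tolerance are treated as non-zero
  def nonZero(eigenvalues: Array[Double]): Array[Double] = {
    val tol = tolerance(eigenvalues)
    eigenvalues.filter(_ > tol)
  }

  def main(args: Array[String]): Unit = {
    // 1e-17 is round-off noise relative to the largest eigenvalue, so it is dropped
    val eigenvalues = Array(3.0, 1e-17, 0.0)
    println(nonZero(eigenvalues).mkString(", ")) // only 3.0 survives
  }
}
```

Scaling the tolerance by the largest eigenvalue makes the cutoff relative, so the same rule works for covariance matrices of very different magnitudes.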

### Efficient Matrix Operations

The internal implementation optimizes matrix operations:

- Uses eigendecomposition instead of direct matrix inversion, which improves numerical stability
- Computes the pseudo-inverse by manipulating eigenvalues rather than forming an explicit inverse
- Evaluates distribution-dependent constants lazily

### Error Handling

The implementation validates inputs and provides meaningful error messages:

```scala
// Dimension validation (a failed `require` throws IllegalArgumentException)
require(cov.numCols == cov.numRows, "Covariance matrix must be square")
require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")

// Caller side: a covariance matrix with no non-zero singular values
// is also reported as an IllegalArgumentException
try {
  val gaussian = new MultivariateGaussian(mean, singularCov)
  val density = gaussian.pdf(x)
} catch {
  case e: IllegalArgumentException =>
    println(s"Invalid distribution parameters: ${e.getMessage}")
}
```

## Performance Considerations

### Lazy Computation

Distribution-dependent constants are computed lazily and cached:

- Eigendecomposition is performed once, on the first PDF or log-PDF call
- Matrix square roots and determinants are cached
- Intermediate computations are reused across multiple evaluations
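
The caching pattern described above maps naturally onto Scala's `lazy val`. The following sketch is a hypothetical stand-in (the class `CachedLogNormalizer` and its counter are invented for illustration) that shows the expensive constant being computed on first use and reused afterwards.

```scala
object LazyConstantsSketch {
  // Hypothetical stand-in for a distribution whose constants are costly to derive
  class CachedLogNormalizer(variances: Array[Double]) {
    var computations = 0 // counts how often the expensive step actually runs

    // Evaluated on first access only; later accesses reuse the stored value
    lazy val logNormalizer: Double = {
      computations += 1
      -0.5 * (variances.length * math.log(2 * math.Pi) + variances.map(math.log).sum)
    }

    // Log-density at the mean, where the quadratic form is zero
    def logPdfAtMean: Double = logNormalizer
  }

  def main(args: Array[String]): Unit = {
    val d = new CachedLogNormalizer(Array(1.0, 1.0))
    println(d.computations)                   // 0: nothing computed at construction
    val values = Seq.fill(3)(d.logPdfAtMean)  // three evaluations
    println(d.computations)                   // 1: computed once, then cached
    println(values.head)
  }
}
```

A consequence of this design is that constructing a distribution is cheap, while the first density evaluation pays the one-time setup cost.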

### Memory Efficiency

- Uses efficient eigendecomposition instead of full matrix operations
- Minimal memory footprint for distribution parameters
- Optimized for repeated evaluations with the same distribution

### Integration with BLAS

The implementation leverages optimized BLAS operations:

- Matrix-vector multiplications use BLAS Level 2 routines
- Eigendecomposition uses optimized linear algebra libraries
- Vector operations benefit from vectorized implementations

## Common Use Cases

### Density Estimation

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val samples = Seq(
  Vectors.dense(1.2, 0.8),
  Vectors.dense(0.9, 1.1),
  Vectors.dense(1.1, 0.9)
)

// Example parameters for illustration; in practice these are estimated from data
val mean = Vectors.dense(1.0, 1.0)
val cov = DenseMatrix.eye(2)
val gaussian = new MultivariateGaussian(mean, cov)

// Compute likelihood of samples
val likelihoods = samples.map(gaussian.pdf)
val logLikelihoods = samples.map(gaussian.logpdf)

// Total log-likelihood
val totalLogLikelihood = logLikelihoods.sum
```

### Anomaly Detection

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

val normalDataMean = Vectors.dense(0.0, 0.0)
val normalDataCov = DenseMatrix.eye(2)
val normalModel = new MultivariateGaussian(normalDataMean, normalDataCov)

// A point is flagged when its log-density falls below the threshold
def isAnomalous(point: Vector, threshold: Double): Boolean = {
  val logProb = normalModel.logpdf(point)
  logProb < threshold
}

// For this model, logpdf at (3, 3) is -log(2π) - 9, about -10.84, so the point is flagged
val testPoint = Vectors.dense(3.0, 3.0)
val anomalous = isAnomalous(testPoint, -5.0)
```

### Gaussian Mixture Components

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Component distributions for a mixture model
val component1 = new MultivariateGaussian(
  Vectors.dense(-1.0, -1.0),
  DenseMatrix.eye(2)
)

val component2 = new MultivariateGaussian(
  Vectors.dense(1.0, 1.0),
  Matrices.dense(2, 2, Array(2.0, 0.5, 0.5, 2.0))
)

// Evaluate mixture probability; `weights` holds one mixing weight per component
def mixturePdf(x: Vector, weights: Array[Double]): Double = {
  require(weights.length == 2, "Expected one weight per component")
  weights(0) * component1.pdf(x) + weights(1) * component2.pdf(x)
}
```
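
When component densities are very small, a weighted sum of `pdf` values can underflow. One common alternative, sketched below under the assumption that each component's `logpdf` value is already available, combines them in log space with the log-sum-exp trick. `logSumExp` and `mixtureLogPdf` are hypothetical helpers, not part of the Spark API.

```scala
object MixtureLogPdfSketch {
  // log(sum_i exp(v_i)), shifted by the maximum so the largest term is exp(0) = 1
  def logSumExp(values: Array[Double]): Double = {
    val m = values.max
    if (m.isNegInfinity) m
    else m + math.log(values.map(v => math.exp(v - m)).sum)
  }

  // componentLogPdfs(i) stands for the i-th component's logpdf at the query point;
  // weights are the mixing proportions
  def mixtureLogPdf(weights: Array[Double], componentLogPdfs: Array[Double]): Double = {
    require(weights.length == componentLogPdfs.length, "one weight per component")
    logSumExp(weights.zip(componentLogPdfs).map { case (w, lp) => math.log(w) + lp })
  }

  def main(args: Array[String]): Unit = {
    // Both component densities underflow as plain doubles,
    // yet the mixture log-density stays finite
    val lp = mixtureLogPdf(Array(0.5, 0.5), Array(-1000.0, -1001.0))
    println(lp)
  }
}
```

With Spark, the second argument could be built as `Array(component1.logpdf(x), component2.logpdf(x))`.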