# Statistical Distributions

Statistical distribution implementations for probability computations and machine learning algorithms. Supports multivariate distributions with advanced numerical stability features.

## Capabilities

### Multivariate Gaussian Distribution

Implementation of multivariate Gaussian (Normal) distribution with support for singular covariance matrices through pseudo-inverse computations.

```scala { .api }
/**
 * Multivariate Gaussian (Normal) Distribution
 * Handles singular covariance matrices by computing density in reduced dimensional subspace
 */
class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  /** Mean vector of the distribution */
  val mean: Vector

  /** Covariance matrix of the distribution */
  val cov: Matrix

  /**
   * Computes probability density function at given point
   * @param x Point to evaluate density at
   * @return Probability density value
   */
  def pdf(x: Vector): Double

  /**
   * Computes log probability density function at given point
   * @param x Point to evaluate log density at
   * @return Log probability density value
   */
  def logpdf(x: Vector): Double
}
```

**Usage Examples:**

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a 2D Gaussian distribution
val mean = Vectors.dense(0.0, 0.0)
val cov = Matrices.dense(2, 2, Array(
  1.0, 0.5,  // First column: [1.0, 0.5]
  0.5, 1.0   // Second column: [0.5, 1.0]
))

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate probability density
val point1 = Vectors.dense(0.0, 0.0)  // At the mean
val density1 = gaussian.pdf(point1)
println(s"Density at mean: $density1")

val point2 = Vectors.dense(1.0, 1.0)  // Away from mean
val density2 = gaussian.pdf(point2)
println(s"Density at (1,1): $density2")

// Evaluate log probability density (more numerically stable)
val logDensity1 = gaussian.logpdf(point1)
val logDensity2 = gaussian.logpdf(point2)
println(s"Log density at mean: $logDensity1")
println(s"Log density at (1,1): $logDensity2")
```

### Advanced Usage Examples

#### Working with High-Dimensional Distributions

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create 5-dimensional Gaussian
val dim = 5
val mean = Vectors.zeros(dim)

// Create diagonal covariance matrix (values are laid out in column-major order)
val covValues = Array.tabulate(dim * dim) { i =>
  val col = i / dim
  val row = i % dim
  if (row == col) 1.0 else 0.0  // Identity matrix
}
val cov = Matrices.dense(dim, dim, covValues)

val gaussian = new MultivariateGaussian(mean, cov)

// Evaluate multiple points
val points = Array(
  Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0),  // Origin
  Vectors.dense(1.0, 0.0, 0.0, 0.0, 0.0),  // Unit vector in first dimension
  Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0)   // All ones (one unit along every dimension)
)

points.zipWithIndex.foreach { case (point, i) =>
  val density = gaussian.pdf(point)
  val logDensity = gaussian.logpdf(point)
  println(s"Point $i: density = $density, log density = $logDensity")
}
```

#### Working with Singular Covariance Matrices

The implementation handles singular (non-invertible) covariance matrices using pseudo-inverse computation:

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create a singular covariance matrix (rank deficient)
val mean = Vectors.dense(0.0, 0.0, 0.0)
val singularCov = Matrices.dense(3, 3, Array(
  1.0, 1.0, 1.0,  // First column
  1.0, 1.0, 1.0,  // Second column (identical to first)
  1.0, 1.0, 1.0   // Third column (identical to first)
))

// This will work despite the singular covariance matrix
val singularGaussian = new MultivariateGaussian(mean, singularCov)

val testPoint = Vectors.dense(1.0, 1.0, 1.0)
val density = singularGaussian.pdf(testPoint)
val logDensity = singularGaussian.logpdf(testPoint)

println(s"Density with singular covariance: $density")
println(s"Log density with singular covariance: $logDensity")
```

#### Correlated Multivariate Gaussian

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// Create correlated 3D Gaussian
val mean = Vectors.dense(1.0, 2.0, 3.0)

// Covariance matrix with non-zero off-diagonal entries.
// The off-diagonal values are covariances, not correlation coefficients.
val cov = Matrices.dense(3, 3, Array(
  2.0, 0.8, 0.3,  // var(dim1)=2.0, cov(dim1,dim2)=0.8, cov(dim1,dim3)=0.3
  0.8, 1.5, 0.6,  // cov(dim2,dim1)=0.8, var(dim2)=1.5, cov(dim2,dim3)=0.6
  0.3, 0.6, 1.0   // cov(dim3,dim1)=0.3, cov(dim3,dim2)=0.6, var(dim3)=1.0
))

val correlatedGaussian = new MultivariateGaussian(mean, cov)

// Evaluate the density at different points and compare
val points = Array(
  mean,                          // Should have highest density
  Vectors.dense(1.0, 2.0, 3.5),  // Close to mean
  Vectors.dense(0.0, 0.0, 0.0),  // Far from mean
  Vectors.dense(2.0, 3.0, 4.0)   // Mean shifted by +1 in every dimension
)

println("Evaluating correlated Gaussian:")
points.zipWithIndex.foreach { case (point, i) =>
  val density = correlatedGaussian.pdf(point)
  val logDensity = correlatedGaussian.logpdf(point)
  println(f"Point $i: density = $density%.6f, log density = $logDensity%.6f")
}
```

### Numerical Stability Features

The `MultivariateGaussian` implementation includes several numerical stability features:

1. **Eigenvalue Decomposition**: Uses eigendecomposition instead of direct matrix inversion
2. **Tolerance-based Computation**: Considers singular values as zero only if they fall below a machine precision-based tolerance
3. **Pseudo-inverse Computation**: Uses Moore-Penrose pseudo-inverse for singular matrices
4. **Log-space Computation**: Provides `logpdf` method for numerical stability in high dimensions
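
The first three items can be illustrated with a small sketch that starts from eigenvalues that have already been computed. It is a minimal illustration in plain Scala, not the library's code: the object and method names (`PseudoDeterminantSketch`, `tolerance`, `logPseudoDet`, `pseudoInverseDiagonal`) are hypothetical, and the tolerance rule (machine epsilon times the largest eigenvalue times the dimension) is an assumed reading of "machine precision-based tolerance".

```scala
object PseudoDeterminantSketch {
  // Assumed tolerance rule: machine epsilon scaled by the largest eigenvalue and the dimension.
  // Expects a non-empty array of eigenvalues.
  def tolerance(eigenvalues: Array[Double]): Double =
    math.ulp(1.0) * eigenvalues.max * eigenvalues.length

  // Log of the pseudo-determinant: sum of the logs of the eigenvalues above the tolerance.
  def logPseudoDet(eigenvalues: Array[Double]): Double = {
    val tol = tolerance(eigenvalues)
    eigenvalues.filter(_ > tol).map(math.log).sum
  }

  // Diagonal of the pseudo-inverse in the eigenbasis: 1/lambda for kept eigenvalues, 0 otherwise.
  def pseudoInverseDiagonal(eigenvalues: Array[Double]): Array[Double] = {
    val tol = tolerance(eigenvalues)
    eigenvalues.map(v => if (v > tol) 1.0 / v else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // The all-ones 3x3 matrix has exact eigenvalues 3, 0, 0; 1e-17 stands in for rounding noise.
    val eigenvalues = Array(3.0, 1e-17, 0.0)
    println(s"tolerance = ${tolerance(eigenvalues)}")
    println(s"log pseudo-determinant = ${logPseudoDet(eigenvalues)}")
    println(s"pseudo-inverse diagonal = ${pseudoInverseDiagonal(eigenvalues).mkString(", ")}")
  }
}
```

Scaling the tolerance by the largest eigenvalue makes the cut-off relative rather than absolute, so a covariance matrix with uniformly large or uniformly small entries is treated the same way after rescaling.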

```scala
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// The implementation automatically handles numerical issues
val mean = Vectors.dense(0.0, 0.0)
val poorlyConditioned = Matrices.dense(2, 2, Array(
  1e12, 1e12,         // Very large values
  1e12, 1e12 + 1e-6   // 1e-6 is below double precision at this scale, so the matrix is effectively singular
))

val stableGaussian = new MultivariateGaussian(mean, poorlyConditioned)

// These operations will be numerically stable
val point = Vectors.dense(1e6, 1e6)
val stableDensity = stableGaussian.pdf(point)
val stableLogDensity = stableGaussian.logpdf(point)  // Preferred for stability
```
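
To see why the log-space method matters in high dimensions, the sketch below evaluates a standard normal (zero mean, identity covariance) at the origin in 1000 dimensions using the closed-form log density. It is plain Scala and does not call the library; `LogSpaceSketch` and `standardNormalLogPdf` are hypothetical names used only for illustration.

```scala
object LogSpaceSketch {
  // Closed-form log density of a standard normal in k = x.length dimensions:
  // -1/2 * (k * log(2*pi) + ||x||^2)
  def standardNormalLogPdf(x: Array[Double]): Double = {
    val k = x.length
    val squaredNorm = x.map(v => v * v).sum
    -0.5 * (k * math.log(2.0 * math.Pi) + squaredNorm)
  }

  def main(args: Array[String]): Unit = {
    val origin = Array.fill(1000)(0.0)
    val logDensity = standardNormalLogPdf(origin)
    // Exponentiating a log density this negative underflows to 0.0 in double precision.
    val density = math.exp(logDensity)
    println(s"log density = $logDensity")
    println(s"density     = $density")
  }
}
```

The log density stays finite and comparable between points, while the density itself collapses to zero. When ranking candidate points or distributions, comparing log densities directly (or subtracting them to get a log ratio) avoids that loss of information.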

## Mathematical Background

The multivariate Gaussian PDF is given by:

```
pdf(x) = (2π)^(-k/2) * |Σ|^(-1/2) * exp(-1/2 * (x-μ)^T * Σ^(-1) * (x-μ))
```

Where:
- `k` is the dimensionality
- `μ` is the mean vector
- `Σ` is the covariance matrix
- `|Σ|` is the determinant of the covariance matrix
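
Taking the logarithm of the density above gives the quantity that `logpdf` is documented to return. The form below follows algebraically from the formula, using the same symbols; it is a derivation for reference, not a transcription of the library source. The second line works it through for the 5-dimensional identity-covariance case evaluated at the mean, where the determinant is 1 and the quadratic term vanishes.

```latex
\log \mathrm{pdf}(x) = -\frac{k}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)

k = 5,\ \Sigma = I,\ x = \mu:\quad
\log \mathrm{pdf}(\mu) = -\frac{5}{2}\log(2\pi) \approx -4.5947,
\qquad \mathrm{pdf}(\mu) = (2\pi)^{-5/2} \approx 0.010105
```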

For singular covariance matrices, the implementation computes the pseudo-determinant and uses the pseudo-inverse in a reduced-dimensional subspace.

## Types

```scala { .api }
import org.apache.spark.ml.linalg.{Vector, Matrix}

class MultivariateGaussian(val mean: Vector, val cov: Matrix) extends Serializable {
  require(cov.numCols == cov.numRows, "Covariance matrix must be square")
  require(mean.size == cov.numCols, "Mean vector length must match covariance matrix size")
}
```
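
Because the constructor checks its arguments with `require`, a dimension mismatch surfaces as an `IllegalArgumentException` at construction time. The sketch below wraps construction in `scala.util.Try` so a caller can handle the failure as a value. `DimensionCheckSketch` and `build` are hypothetical names, and the example assumes Spark's `ml.linalg` module is on the classpath, as in the earlier examples.

```scala
import scala.util.Try
import org.apache.spark.ml.linalg.{Vectors, Matrices}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

object DimensionCheckSketch {
  // Hypothetical helper: builds a Gaussian with identity covariance of size `dim`,
  // returning Left(message) instead of throwing when the constructor checks fail.
  def build(mean: Array[Double], dim: Int): Either[String, MultivariateGaussian] =
    Try(new MultivariateGaussian(Vectors.dense(mean), Matrices.eye(dim)))
      .toEither
      .left.map(_.getMessage)

  def main(args: Array[String]): Unit = {
    // Matching sizes: construction succeeds.
    println(build(Array(0.0, 0.0), 2).isRight)
    // A 2-element mean with a 3x3 covariance violates the size check.
    build(Array(0.0, 0.0), 3) match {
      case Left(message) => println(s"Rejected: $message")
      case Right(_)      => println("Unexpectedly accepted")
    }
  }
}
```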