# Distance Metrics

Apache Flink ML provides a comprehensive collection of distance metrics for measuring similarity between vectors. These metrics are used by algorithms like k-Nearest Neighbors and can be used standalone for similarity computations.

## Base Distance Metric Interface

All distance metrics implement the `DistanceMetric` trait.

```scala { .api }
trait DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}
```

## Available Distance Metrics

### Euclidean Distance

Standard Euclidean distance (L2 norm) - the straight-line distance between two points.

```scala { .api }
class EuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistanceMetric {
  def apply(): EuclideanDistanceMetric
}
```

**Formula:** √(Σ(aᵢ - bᵢ)²)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.math.DenseVector

val euclidean = EuclideanDistanceMetric()

val v1 = DenseVector(1.0, 2.0, 3.0)
val v2 = DenseVector(4.0, 5.0, 6.0)

val distance = euclidean.distance(v1, v2) // Returns: 5.196152422706632
```

### Squared Euclidean Distance

Squared Euclidean distance - faster than standard Euclidean when only relative distances matter.

```scala { .api }
class SquaredEuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object SquaredEuclideanDistanceMetric {
  def apply(): SquaredEuclideanDistanceMetric
}
```

**Formula:** Σ(aᵢ - bᵢ)²

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val squaredEuclidean = SquaredEuclideanDistanceMetric()
val distance = squaredEuclidean.distance(v1, v2) // Returns: 27.0
```
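
Because the square root is monotonic, ranking candidates by squared Euclidean distance selects the same nearest neighbor as ranking by Euclidean distance. The following is a minimal sketch of that property in plain Scala. It does not use the Flink ML classes: `SquaredRankingSketch`, `nearest`, and the `Array[Double]` vector representation are hypothetical names chosen for illustration.

```scala
object SquaredRankingSketch {
  // Plain-Scala stand-ins, not the Flink ML implementations.
  def squaredEuclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(squaredEuclidean(a, b))

  // Index of the candidate closest to `query` under the given distance function.
  def nearest(
      query: Array[Double],
      candidates: Seq[Array[Double]],
      dist: (Array[Double], Array[Double]) => Double): Int =
    candidates.indices.minBy(i => dist(query, candidates(i)))

  def main(args: Array[String]): Unit = {
    val query = Array(1.0, 2.0, 3.0)
    val candidates = Seq(
      Array(4.0, 5.0, 6.0), // squared distance 27
      Array(1.0, 2.0, 4.0), // squared distance 1
      Array(0.0, 0.0, 0.0)  // squared distance 14
    )
    println(nearest(query, candidates, euclidean))        // 1
    println(nearest(query, candidates, squaredEuclidean)) // 1
  }
}
```

Both calls pick the same candidate, so the square root can be skipped whenever only the ordering of distances is needed.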

### Manhattan Distance

Manhattan distance (L1 norm) - sum of absolute differences, also known as taxicab distance.

```scala { .api }
class ManhattanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ManhattanDistanceMetric {
  def apply(): ManhattanDistanceMetric
}
```

**Formula:** Σ|aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val manhattan = ManhattanDistanceMetric()
val distance = manhattan.distance(v1, v2) // Returns: 9.0
```

### Cosine Distance

Cosine distance - measures the angle between vectors, independent of magnitude.

```scala { .api }
class CosineDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistanceMetric {
  def apply(): CosineDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a|| × ||b||)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val cosine = CosineDistanceMetric()
val distance = cosine.distance(v1, v2) // Returns cosine distance (0 = identical direction)
```
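
To make the magnitude independence concrete, here is a minimal sketch of the formula above in plain Scala. It is not the Flink ML implementation: `CosineSketch` and its helpers are hypothetical, vectors are plain `Array[Double]`, and a zero-length vector would yield `NaN` because the sketch does not guard against division by zero.

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // 1 - (a·b) / (||a|| × ||b||), following the formula in this section.
  def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    1.0 - dot(a, b) / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    val v1 = Array(1.0, 2.0, 3.0)
    val v2 = Array(4.0, 5.0, 6.0)
    println(cosineDistance(v1, v2)) // about 0.0254

    // Scaling a vector changes its magnitude but not its direction,
    // so the distance stays at 0 up to floating-point rounding.
    val scaled = v1.map(_ * 10.0)
    println(cosineDistance(v1, scaled))
  }
}
```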

### Chebyshev Distance

Chebyshev distance (L∞ norm) - maximum absolute difference across all dimensions.

```scala { .api }
class ChebyshevDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ChebyshevDistanceMetric {
  def apply(): ChebyshevDistanceMetric
}
```

**Formula:** max|aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ChebyshevDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val chebyshev = ChebyshevDistanceMetric()
val distance = chebyshev.distance(v1, v2) // Returns: 3.0
```

### Minkowski Distance

Generalized Minkowski distance (Lp norm) - parameterized distance metric that includes other metrics as special cases.

```scala { .api }
class MinkowskiDistanceMetric(p: Double) extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object MinkowskiDistanceMetric {
  def apply(p: Double): MinkowskiDistanceMetric
}
```

**Formula:** (Σ|aᵢ - bᵢ|ᵖ)^(1/p)

**Special Cases:**
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p → ∞: Chebyshev distance (as a limit)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.MinkowskiDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val minkowski3 = MinkowskiDistanceMetric(3.0) // L3 norm
val minkowski1 = MinkowskiDistanceMetric(1.0) // Equivalent to Manhattan
val minkowski2 = MinkowskiDistanceMetric(2.0) // Equivalent to Euclidean

val distance = minkowski3.distance(v1, v2)
```
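
The special cases can be checked numerically with a minimal sketch of the formula in plain Scala. This is not the Flink ML implementation: `MinkowskiSketch` is a hypothetical name and vectors are plain `Array[Double]`.

```scala
object MinkowskiSketch {
  // (Σ|aᵢ - bᵢ|ᵖ)^(1/p), following the formula in this section.
  def minkowski(a: Array[Double], b: Array[Double], p: Double): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    require(p >= 1.0, "p must be at least 1")
    val sum = a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }.sum
    math.pow(sum, 1.0 / p)
  }

  def main(args: Array[String]): Unit = {
    val v1 = Array(1.0, 2.0, 3.0)
    val v2 = Array(4.0, 5.0, 6.0)
    println(minkowski(v1, v2, 1.0))   // 9.0, the Manhattan distance
    println(minkowski(v1, v2, 2.0))   // about 5.196, the Euclidean distance
    println(minkowski(v1, v2, 3.0))   // about 4.327
    println(minkowski(v1, v2, 100.0)) // about 3.03, approaching the Chebyshev distance 3.0
  }
}
```

As p grows, the largest coordinate difference dominates the sum, which is why the result approaches the Chebyshev distance without reaching it for any finite p.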

### Tanimoto Distance

Tanimoto distance (Jaccard distance) - measures similarity for binary or non-negative vectors.

```scala { .api }
class TanimotoDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object TanimotoDistanceMetric {
  def apply(): TanimotoDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a||² + ||b||² - a·b)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.TanimotoDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val tanimoto = TanimotoDistanceMetric()
val distance = tanimoto.distance(v1, v2)
```
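
For binary vectors the formula reduces to the Jaccard distance between the sets of active features: a·b counts the shared features, and ||a||² + ||b||² - a·b counts the features present in either vector. A minimal sketch in plain Scala, assuming `Array[Double]` vectors and a hypothetical `TanimotoSketch` object rather than the Flink ML class:

```scala
object TanimotoSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // 1 - (a·b) / (||a||² + ||b||² - a·b), following the formula in this section.
  def tanimotoDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    val ab = dot(a, b)
    1.0 - ab / (dot(a, a) + dot(b, b) - ab)
  }

  def main(args: Array[String]): Unit = {
    // Features {0, 1, 3} versus {0, 3}: 2 shared, 3 in the union.
    val a = Array(1.0, 1.0, 0.0, 1.0)
    val b = Array(1.0, 0.0, 0.0, 1.0)
    println(tanimotoDistance(a, b)) // about 0.333, i.e. 1 - 2/3
    println(tanimotoDistance(a, a)) // 0.0 for identical vectors
  }
}
```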

## Using Distance Metrics with Algorithms

Distance metrics are commonly used with machine learning algorithms, particularly k-Nearest Neighbors.

**Example with k-NN:**

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.nn.KNN
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

val trainingData: DataSet[Vector] = //... your training data
val testData: DataSet[Vector] = //... the points to find neighbors for

val knn = KNN()
  .setK(5)
  .setDistanceMetric(ManhattanDistanceMetric()) // Use Manhattan distance
  .setBlocks(10)

// fit returns Unit; the fitted state is kept in the KNN instance itself
knn.fit(trainingData)
val predictions = knn.predict(testData)
```

## Choosing the Right Distance Metric

Different distance metrics are suitable for different types of data and applications:

### Euclidean Distance
- **Best for:** Continuous numerical data, geometric problems
- **Characteristics:** Sensitive to magnitude, affected by the curse of dimensionality
- **Use cases:** Image processing, coordinate-based data, general-purpose similarity

### Manhattan Distance
- **Best for:** High-dimensional data, data with outliers
- **Characteristics:** Less sensitive to outliers than Euclidean, more robust in high dimensions
- **Use cases:** Recommendation systems, text analysis, categorical data
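
The outlier sensitivity noted above follows from squaring: a single large coordinate difference takes a bigger share of the Euclidean sum than of the Manhattan sum. A minimal sketch in plain Scala, using a hypothetical `OutlierSketch` object and `Array[Double]` vectors rather than the Flink ML classes:

```scala
object OutlierSketch {
  def manhattan(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Fraction of the summed per-coordinate terms |aᵢ - bᵢ|ᵖ that comes from
  // coordinate i (p = 1 for Manhattan, p = 2 for Euclidean).
  def share(a: Array[Double], b: Array[Double], i: Int, p: Double): Double = {
    val terms = a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }
    terms(i) / terms.sum
  }

  def main(args: Array[String]): Unit = {
    val a = Array(0.0, 0.0, 0.0, 0.0)
    val b = Array(1.0, 1.0, 1.0, 10.0) // the last coordinate is the outlier
    println(manhattan(a, b))     // 13.0
    println(euclidean(a, b))     // about 10.15
    println(share(a, b, 3, 1.0)) // about 0.77 of the Manhattan sum
    println(share(a, b, 3, 2.0)) // about 0.97 of the squared Euclidean sum
  }
}
```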

### Cosine Distance
- **Best for:** High-dimensional sparse data, text/document similarity
- **Characteristics:** Magnitude-independent, focuses on vector direction
- **Use cases:** Text mining, information retrieval, collaborative filtering

### Chebyshev Distance
- **Best for:** Applications where the maximum difference matters most
- **Characteristics:** Considers only the largest difference
- **Use cases:** Game theory, optimization, scheduling problems

### Minkowski Distance
- **Best for:** When you need flexibility to tune the distance behavior
- **Characteristics:** Generalizes other metrics, allows tuning via p parameter
- **Use cases:** Experimental settings, domain-specific requirements

### Tanimoto Distance
- **Best for:** Binary or non-negative feature data, chemical similarity
- **Characteristics:** Bounded between 0 and 1, good for sparse binary vectors
- **Use cases:** Chemical compound similarity, binary feature comparison

## Custom Distance Metrics

You can implement custom distance metrics by extending the `DistanceMetric` trait:

```scala
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.DistanceMetric
import org.apache.flink.ml.nn.KNN

class CustomDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double = {
    // Implement your custom distance calculation
    require(a.size == b.size, "Vectors must have the same size")

    var sum = 0.0
    for (i <- 0 until a.size) {
      val diff = a(i) - b(i)
      sum += math.pow(math.abs(diff), 1.5) // Example: L1.5 norm
    }

    math.pow(sum, 1.0 / 1.5)
  }
}

// Use with algorithms
val customMetric = new CustomDistanceMetric()
val knn = KNN().setDistanceMetric(customMetric)
```

## Performance Considerations

- **Squared Euclidean** is faster than Euclidean when you only need relative distances
- **Manhattan** is computationally cheaper than Euclidean (no square root calculation)
- **Cosine** requires computing vector magnitudes, which can be expensive for large vectors
- **Sparse vectors** can be more efficient with certain distance metrics that can skip zero elements
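
To illustrate the last point, the sketch below computes a dot product and a Manhattan distance while visiting only stored (non-zero) entries. It is an illustration in plain Scala, not Flink ML code: the `Map[Int, Double]` representation (index to value) and the `SparseSketch` object are assumptions made for this example.

```scala
object SparseSketch {
  // Hypothetical sparse representation: only non-zero entries are stored.
  type Sparse = Map[Int, Double]

  // Iterate over the smaller map; indices missing from the other contribute 0.
  def dot(a: Sparse, b: Sparse): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (i, x) => x * large.getOrElse(i, 0.0) }.sum
  }

  // Only indices stored in at least one vector can have a non-zero difference.
  def manhattan(a: Sparse, b: Sparse): Double =
    (a.keySet ++ b.keySet).iterator
      .map(i => math.abs(a.getOrElse(i, 0.0) - b.getOrElse(i, 0.0)))
      .sum

  def main(args: Array[String]): Unit = {
    // Two vectors in a very large index space with two stored entries each.
    val a: Sparse = Map(0 -> 1.0, 1000 -> 2.0)
    val b: Sparse = Map(1000 -> 3.0, 50000 -> 4.0)
    println(dot(a, b))       // 6.0 (only index 1000 overlaps)
    println(manhattan(a, b)) // 6.0 (|1 - 0| + |2 - 3| + |0 - 4|)
  }
}
```

The work is proportional to the number of stored entries rather than the nominal dimension, which is where the savings for sparse data come from.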

Choose the appropriate distance metric based on your data characteristics and computational requirements.