Tessl Tile for maven/org.apache.flink/flink-ml_2.12@1.8.0

# Distance Metrics

Apache Flink ML provides a comprehensive collection of distance metrics for measuring similarity between vectors. These metrics are used by algorithms like k-Nearest Neighbors and can be used standalone for similarity computations.

## Base Distance Metric Interface

All distance metrics implement the `DistanceMetric` trait.

```scala { .api }
trait DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}
```

## Available Distance Metrics

### Euclidean Distance

Standard Euclidean distance (L2 norm) - the straight-line distance between two points.

```scala { .api }
class EuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistanceMetric {
  def apply(): EuclideanDistanceMetric
}
```

**Formula:** √(Σ(aᵢ - bᵢ)²)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.math.DenseVector

val euclidean = EuclideanDistanceMetric()

val v1 = DenseVector(1.0, 2.0, 3.0)
val v2 = DenseVector(4.0, 5.0, 6.0)

val distance = euclidean.distance(v1, v2)  // Returns: 5.196152422706632
```

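To make the arithmetic behind the value above visible, here is a minimal plain-Scala sketch of the same formula. It uses `Array[Double]` instead of Flink's vector types, and `EuclideanSketch` is a hypothetical name for illustration only; it is not Flink's implementation.

```scala
object EuclideanSketch {
  // Plain-Scala stand-in for the formula above (assumption: not Flink's code).
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    // Sum the squared per-coordinate differences, then take the square root.
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  def main(args: Array[String]): Unit = {
    val v1 = Array(1.0, 2.0, 3.0)
    val v2 = Array(4.0, 5.0, 6.0)
    // Every coordinate differs by 3, so the sum of squares is 27
    // and the distance is sqrt(27).
    println(euclidean(v1, v2))
  }
}
```
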
### Squared Euclidean Distance

Squared Euclidean distance - skips the square root, so it is cheaper than standard Euclidean when only the relative ordering of distances matters.

```scala { .api }
class SquaredEuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object SquaredEuclideanDistanceMetric {
  def apply(): SquaredEuclideanDistanceMetric
}
```

**Formula:** Σ(aᵢ - bᵢ)²

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val squaredEuclidean = SquaredEuclideanDistanceMetric()
val distance = squaredEuclidean.distance(v1, v2)  // Returns: 27.0
```

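The claim that only relative distances matter can be checked directly: because the square root is monotonic, ranking candidates by squared distance gives the same order as ranking by true Euclidean distance. The sketch below illustrates this in plain Scala; `SquaredEuclideanSketch` and `rank` are hypothetical helpers, not part of Flink ML.

```scala
object SquaredEuclideanSketch {
  def squaredEuclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Indices of the candidates, ordered from nearest to farthest under `dist`.
  def rank(
      query: Array[Double],
      candidates: Seq[Array[Double]],
      dist: (Array[Double], Array[Double]) => Double): Seq[Int] =
    candidates.indices.sortBy(i => dist(query, candidates(i)))

  def main(args: Array[String]): Unit = {
    val query = Array(0.0, 0.0)
    val candidates = Seq(Array(3.0, 4.0), Array(1.0, 1.0), Array(0.0, 2.0))

    val bySquared = rank(query, candidates, squaredEuclidean)
    val byEuclidean = rank(query, candidates, (a, b) => math.sqrt(squaredEuclidean(a, b)))

    // Both rankings agree, so the square root can be skipped for nearest-neighbor search.
    println(bySquared == byEuclidean)
  }
}
```
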
### Manhattan Distance

Manhattan distance (L1 norm) - sum of absolute differences, also known as taxicab distance.

```scala { .api }
class ManhattanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ManhattanDistanceMetric {
  def apply(): ManhattanDistanceMetric
}
```

**Formula:** Σ|aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val manhattan = ManhattanDistanceMetric()
val distance = manhattan.distance(v1, v2)  // Returns: 9.0
```

### Cosine Distance

Cosine distance - one minus the cosine of the angle between two vectors, independent of their magnitudes.

```scala { .api }
class CosineDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistanceMetric {
  def apply(): CosineDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a|| × ||b||)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val cosine = CosineDistanceMetric()
val distance = cosine.distance(v1, v2)  // Returns cosine distance (0 = identical direction)
```

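As a rough illustration of the formula above, the following plain-Scala sketch computes the cosine distance on `Array[Double]` values and shows that rescaling a vector does not change the result. `CosineSketch` is a hypothetical helper, not Flink's implementation, and it does not handle zero vectors (the division would yield NaN).

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // 1 - (a·b)/(||a|| × ||b||); assumes neither vector is all zeros.
  def cosineDistance(a: Array[Double], b: Array[Double]): Double =
    1.0 - dot(a, b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    val v1 = Array(1.0, 2.0, 3.0)
    val v2 = Array(4.0, 5.0, 6.0)
    // a·b = 32, ||a||² = 14, ||b||² = 77, so the distance is 1 - 32 / sqrt(14 * 77).
    println(cosineDistance(v1, v2))
    // Scaling v1 by 10 leaves the direction, and hence the distance to v1, at (about) zero.
    println(cosineDistance(v1, v1.map(_ * 10.0)))
  }
}
```
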
### Chebyshev Distance

Chebyshev distance (L∞ norm) - maximum absolute difference across all dimensions.

```scala { .api }
class ChebyshevDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ChebyshevDistanceMetric {
  def apply(): ChebyshevDistanceMetric
}
```

**Formula:** max|aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ChebyshevDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val chebyshev = ChebyshevDistanceMetric()
val distance = chebyshev.distance(v1, v2)  // Returns: 3.0
```

### Minkowski Distance

Generalized Minkowski distance (Lp norm) - parameterized distance metric that includes other metrics as special cases.

```scala { .api }
class MinkowskiDistanceMetric(p: Double) extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object MinkowskiDistanceMetric {
  def apply(p: Double): MinkowskiDistanceMetric
}
```

**Formula:** (Σ|aᵢ - bᵢ|ᵖ)^(1/p)

**Special Cases:**
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p → ∞: Chebyshev distance (as a limit)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.MinkowskiDistanceMetric

val minkowski3 = MinkowskiDistanceMetric(3.0)     // L3 norm
val minkowski1 = MinkowskiDistanceMetric(1.0)     // Equivalent to Manhattan
val minkowski2 = MinkowskiDistanceMetric(2.0)     // Equivalent to Euclidean

// v1 and v2 are the DenseVectors defined in the Euclidean example
val distance = minkowski3.distance(v1, v2)
```

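The special cases listed above can be checked numerically. The sketch below is a plain-Scala version of the Minkowski formula on `Array[Double]` values; `MinkowskiSketch` is a hypothetical helper rather than Flink's class, and the `p >= 1` check is an assumption of this sketch (for smaller `p` the triangle inequality no longer holds).

```scala
object MinkowskiSketch {
  def minkowski(p: Double)(a: Array[Double], b: Array[Double]): Double = {
    require(p >= 1.0, "p must be at least 1 for a true metric")
    require(a.length == b.length, "Vectors must have the same size")
    val sum = a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }.sum
    math.pow(sum, 1.0 / p)
  }

  def chebyshev(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 2.0, 3.0)
    val b = Array(4.0, 6.0, 8.0) // per-coordinate differences: 3, 4, 5

    println(minkowski(1.0)(a, b))  // Manhattan: 3 + 4 + 5
    println(minkowski(2.0)(a, b))  // Euclidean: sqrt(9 + 16 + 25)
    println(minkowski(50.0)(a, b)) // approaches the largest difference as p grows
    println(chebyshev(a, b))       // the limiting value: max(3, 4, 5)
  }
}
```
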
### Tanimoto Distance

Tanimoto distance - measures dissimilarity for binary or non-negative vectors. On binary vectors it coincides with the Jaccard distance.

```scala { .api }
class TanimotoDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object TanimotoDistanceMetric {
  def apply(): TanimotoDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a||² + ||b||² - a·b)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.TanimotoDistanceMetric

// v1 and v2 are the DenseVectors defined in the Euclidean example
val tanimoto = TanimotoDistanceMetric()
val distance = tanimoto.distance(v1, v2)
```

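A small worked example may help with the Tanimoto formula. The plain-Scala sketch below evaluates it on `Array[Double]` values and shows that, for 0/1 vectors, it matches the Jaccard distance (one minus intersection over union). `TanimotoSketch` is a hypothetical helper and not Flink's implementation.

```scala
object TanimotoSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // 1 - (a·b)/(||a||² + ||b||² - a·b); assumes the vectors are not both all zeros.
  def tanimotoDistance(a: Array[Double], b: Array[Double]): Double = {
    val ab = dot(a, b)
    1.0 - ab / (dot(a, a) + dot(b, b) - ab)
  }

  def main(args: Array[String]): Unit = {
    // a·b = 32, ||a||² = 14, ||b||² = 77, so the distance is 1 - 32 / 59.
    println(tanimotoDistance(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))

    // Binary vectors: 2 shared features out of 4 present in either vector,
    // so the Jaccard distance is 1 - 2 / 4.
    println(tanimotoDistance(Array(1.0, 1.0, 0.0, 1.0), Array(1.0, 0.0, 1.0, 1.0)))
  }
}
```
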
## Using Distance Metrics with Algorithms

Distance metrics are commonly used with machine learning algorithms, particularly k-Nearest Neighbors.

**Example with k-NN:**

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric
import org.apache.flink.ml.nn.KNN

val trainingData: DataSet[Vector] = ???  // your training data
val testData: DataSet[Vector] = ???      // your test data

val knn = KNN()
  .setK(5)
  .setDistanceMetric(ManhattanDistanceMetric())  // Use Manhattan distance
  .setBlocks(10)

// fit trains the estimator in place and returns Unit
knn.fit(trainingData)
// predict pairs each test vector with its k nearest training vectors
val predictions = knn.predict(testData)
```

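To see why the choice of metric matters for k-NN, the sketch below runs a brute-force nearest-neighbor search in plain Scala with a pluggable distance function. It is only an illustration of the idea: `KnnSketch`, `Metric`, and `nearest` are hypothetical names, and Flink's `KNN` is a distributed implementation that works differently.

```scala
object KnnSketch {
  // A distance function over plain arrays (stand-in for a DistanceMetric).
  type Metric = (Array[Double], Array[Double]) => Double

  val manhattan: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  val chebyshev: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  // Return the k training points closest to the query under the given metric.
  def nearest(k: Int, metric: Metric)(
      training: Seq[Array[Double]],
      query: Array[Double]): Seq[Array[Double]] =
    training.sortBy(p => metric(query, p)).take(k)

  def main(args: Array[String]): Unit = {
    val training = Seq(Array(3.0, 3.0), Array(0.0, 4.0), Array(10.0, 10.0))
    val query = Array(0.0, 0.0)

    // Manhattan: (3,3) is 6 away and (0,4) is 4 away, so (0,4) wins.
    println(nearest(1, manhattan)(training, query).head.mkString(", "))
    // Chebyshev: (3,3) is 3 away and (0,4) is 4 away, so (3,3) wins.
    println(nearest(1, chebyshev)(training, query).head.mkString(", "))
  }
}
```

The same query returns a different nearest neighbor depending on the metric, which is the practical effect of `setDistanceMetric` in the example above.
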
## Choosing the Right Distance Metric

Different distance metrics are suitable for different types of data and applications:

### Euclidean Distance
- **Best for:** Continuous numerical data, geometric problems
- **Characteristics:** Sensitive to magnitude, affected by the curse of dimensionality
- **Use cases:** Image processing, coordinate-based data, general-purpose similarity

### Manhattan Distance
- **Best for:** High-dimensional data, data with outliers
- **Characteristics:** Less sensitive to outliers than Euclidean, more robust in high dimensions
- **Use cases:** Recommendation systems, text analysis, categorical data

### Cosine Distance
- **Best for:** High-dimensional sparse data, text/document similarity
- **Characteristics:** Magnitude-independent, focuses on vector direction
- **Use cases:** Text mining, information retrieval, collaborative filtering

### Chebyshev Distance
- **Best for:** Applications where the maximum difference matters most
- **Characteristics:** Considers only the largest difference
- **Use cases:** Game theory, optimization, scheduling problems

### Minkowski Distance
- **Best for:** Cases where you need flexibility to tune the distance behavior
- **Characteristics:** Generalizes other metrics, allows tuning via the `p` parameter
- **Use cases:** Experimental settings, domain-specific requirements

### Tanimoto Distance
- **Best for:** Binary or non-negative feature data, chemical similarity
- **Characteristics:** Bounded between 0 and 1 for non-negative vectors, good for sparse binary vectors
- **Use cases:** Chemical compound similarity, binary feature comparison

## Custom Distance Metrics

You can implement custom distance metrics by extending the `DistanceMetric` trait:

```scala
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.DistanceMetric
import org.apache.flink.ml.nn.KNN

class CustomDistanceMetric extends DistanceMetric {
  override def distance(a: Vector, b: Vector): Double = {
    // Implement your custom distance calculation
    require(a.size == b.size, "Vectors must have the same size")

    var sum = 0.0
    for (i <- 0 until a.size) {
      val diff = a(i) - b(i)
      sum += math.pow(math.abs(diff), 1.5)  // Example: L1.5 norm
    }

    math.pow(sum, 1.0 / 1.5)
  }
}

// Use with algorithms
val customMetric = new CustomDistanceMetric()
val knn = KNN().setDistanceMetric(customMetric)
```

## Performance Considerations

- **Squared Euclidean** is faster than Euclidean when you only need relative distances, because it skips the square root
- **Manhattan** is computationally cheaper than Euclidean (no squaring or square root calculation)
- **Cosine** requires computing both vector magnitudes in addition to the dot product, which can be expensive for large vectors
- **Sparse vectors** can make some metrics cheaper to compute, because terms where both elements are zero can be skipped

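The point about sparse vectors can be sketched as follows. This example assumes a hypothetical sparse representation, a `Map` from index to non-zero value, rather than Flink's own sparse vector type, and shows how a dot product (the core of cosine and Tanimoto distance) and a Manhattan distance only need to touch stored entries.

```scala
object SparseSketch {
  // Hypothetical sparse representation: index -> non-zero value.
  type Sparse = Map[Int, Double]

  // Dot product that visits only the entries of the smaller map;
  // indices missing from either side contribute nothing.
  def sparseDot(a: Sparse, b: Sparse): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (i, x) => x * large.getOrElse(i, 0.0) }.sum
  }

  // Manhattan distance over the union of stored indices;
  // positions that are zero in both vectors are never visited.
  def sparseManhattan(a: Sparse, b: Sparse): Double =
    (a.keySet ++ b.keySet).iterator
      .map(i => math.abs(a.getOrElse(i, 0.0) - b.getOrElse(i, 0.0)))
      .sum

  def main(args: Array[String]): Unit = {
    // Two vectors in a space of several thousand dimensions with two non-zeros each.
    val a: Sparse = Map(0 -> 1.0, 1000 -> 2.0)
    val b: Sparse = Map(1000 -> 3.0, 5000 -> 4.0)
    println(sparseDot(a, b))       // only index 1000 overlaps: 2 * 3
    println(sparseManhattan(a, b)) // |1 - 0| + |2 - 3| + |0 - 4|
  }
}
```
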
Choose the appropriate distance metric based on your data characteristics and computational requirements.
