# Distance Metrics

Apache Flink ML provides a set of distance metrics for measuring how far apart two vectors are. Algorithms such as k-Nearest Neighbors use these metrics, and they can also be called directly for standalone similarity computations.

## Base Distance Metric Interface

All distance metrics implement the `DistanceMetric` trait.

```scala { .api }
trait DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}
```

## Available Distance Metrics

### Euclidean Distance

Standard Euclidean distance (L2 norm): the straight-line distance between two points.

```scala { .api }
class EuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistanceMetric {
  def apply(): EuclideanDistanceMetric
}
```

**Formula:** √(Σ(aᵢ - bᵢ)²)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.math.DenseVector

val euclidean = EuclideanDistanceMetric()

val v1 = DenseVector(1.0, 2.0, 3.0)
val v2 = DenseVector(4.0, 5.0, 6.0)

val distance = euclidean.distance(v1, v2) // Returns: 5.196152422706632
```

### Squared Euclidean Distance

Squared Euclidean distance: omits the square root, so it is cheaper to compute than standard Euclidean distance while producing the same ordering of distances. Use it when only relative distances matter.

```scala { .api }
class SquaredEuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object SquaredEuclideanDistanceMetric {
  def apply(): SquaredEuclideanDistanceMetric
}
```

**Formula:** Σ(aᵢ - bᵢ)²

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

// Reuses v1 and v2 from the Euclidean example
val squaredEuclidean = SquaredEuclideanDistanceMetric()
val distance = squaredEuclidean.distance(v1, v2) // Returns: 27.0
```
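
Because the square root is monotonic, ranking candidates by squared Euclidean distance gives the same order as ranking by Euclidean distance. The following is a minimal plain-Scala sketch of that property. It uses `Array[Double]` instead of Flink's vector types, and the `OrderingSketch` object and its helper names are illustrative assumptions, not part of Flink ML.

```scala
object OrderingSketch {
  // Squared Euclidean distance: sum of squared coordinate differences.
  def squaredEuclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Euclidean distance: square root of the squared distance.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(squaredEuclidean(a, b))

  // Indices of the candidates, sorted from nearest to farthest under `dist`.
  def rank(query: Array[Double],
           candidates: Seq[Array[Double]],
           dist: (Array[Double], Array[Double]) => Double): Seq[Int] =
    candidates.indices.sortBy(i => dist(query, candidates(i)))

  val query = Array(1.0, 2.0, 3.0)
  val candidates = Seq(
    Array(4.0, 5.0, 6.0), // squared distance 27
    Array(1.0, 2.0, 4.0), // squared distance 1
    Array(0.0, 0.0, 0.0)  // squared distance 14
  )

  // Both rankings list index 1 first, then 2, then 0.
  val bySquared = rank(query, candidates, squaredEuclidean)
  val byEuclidean = rank(query, candidates, euclidean)
}
```

The two rankings agree, which is why a nearest-neighbor search can skip the square root without changing its result.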

### Manhattan Distance

Manhattan distance (L1 norm): the sum of absolute differences, also known as taxicab distance.

```scala { .api }
class ManhattanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ManhattanDistanceMetric {
  def apply(): ManhattanDistanceMetric
}
```

**Formula:** Σ|aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

// Reuses v1 and v2 from the Euclidean example
val manhattan = ManhattanDistanceMetric()
val distance = manhattan.distance(v1, v2) // Returns: 9.0
```

### Cosine Distance

Cosine distance: measures the angle between two vectors, independent of their magnitudes.

```scala { .api }
class CosineDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistanceMetric {
  def apply(): CosineDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a|| × ||b||)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

// Reuses v1 and v2 from the Euclidean example
val cosine = CosineDistanceMetric()
val distance = cosine.distance(v1, v2) // Returns cosine distance (0 = identical direction)
```
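
To make the formula concrete, here is a minimal plain-Scala sketch that evaluates it for the same two vectors and checks that rescaling one of them leaves the result unchanged. It works on `Array[Double]` rather than Flink's vector types, and `CosineSketch` and its helpers are hypothetical names introduced only for illustration.

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // 1 - (a·b)/(||a|| × ||b||); evaluates to NaN if either vector is all zeros.
  def cosineDistance(a: Array[Double], b: Array[Double]): Double =
    1.0 - dot(a, b) / (norm(a) * norm(b))

  val v1 = Array(1.0, 2.0, 3.0)
  val v2 = Array(4.0, 5.0, 6.0)

  // a·b = 32, ||a|| = √14, ||b|| = √77, so the distance is 1 - 32/√1078 ≈ 0.0254
  val d = cosineDistance(v1, v2)

  // Scaling a vector changes its magnitude but not its direction,
  // so the cosine distance stays the same up to floating-point rounding.
  val dScaled = cosineDistance(v1.map(_ * 10.0), v2)
}
```

The small value reflects that the two vectors point in nearly the same direction, even though their Euclidean distance is about 5.2.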

### Chebyshev Distance

Chebyshev distance (L∞ norm): the maximum absolute difference across all dimensions.

```scala { .api }
class ChebyshevDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ChebyshevDistanceMetric {
  def apply(): ChebyshevDistanceMetric
}
```

**Formula:** maxᵢ |aᵢ - bᵢ|

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.ChebyshevDistanceMetric

// Reuses v1 and v2 from the Euclidean example
val chebyshev = ChebyshevDistanceMetric()
val distance = chebyshev.distance(v1, v2) // Returns: 3.0
```

### Minkowski Distance

Generalized Minkowski distance (Lp norm): a parameterized distance metric that includes other metrics as special cases.

```scala { .api }
class MinkowskiDistanceMetric(p: Double) extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object MinkowskiDistanceMetric {
  def apply(p: Double): MinkowskiDistanceMetric
}
```

**Formula:** (Σ|aᵢ - bᵢ|ᵖ)^(1/p)

**Special Cases:**
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p → ∞: Chebyshev distance (as the limiting case)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.MinkowskiDistanceMetric

val minkowski3 = MinkowskiDistanceMetric(3.0) // L3 norm
val minkowski1 = MinkowskiDistanceMetric(1.0) // Equivalent to Manhattan
val minkowski2 = MinkowskiDistanceMetric(2.0) // Equivalent to Euclidean

// Reuses v1 and v2 from the Euclidean example
val distance = minkowski3.distance(v1, v2)
```
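
The special cases can be checked numerically. The sketch below is a plain-Scala reimplementation of the formula on `Array[Double]`, not Flink's own code; `MinkowskiSketch`, its `minkowski` helper, and the `p >= 1` precondition are assumptions made for this illustration.

```scala
object MinkowskiSketch {
  // (Σ|aᵢ - bᵢ|ᵖ)^(1/p) for a finite p >= 1
  def minkowski(p: Double)(a: Array[Double], b: Array[Double]): Double = {
    require(p >= 1.0, "p must be at least 1")
    require(a.length == b.length, "Vectors must have the same size")
    val sum = a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }.sum
    math.pow(sum, 1.0 / p)
  }

  val a = Array(0.0, 0.0)
  val b = Array(3.0, 4.0)

  val l1 = minkowski(1.0)(a, b)   // 3 + 4 = 7, the Manhattan distance
  val l2 = minkowski(2.0)(a, b)   // √(9 + 16) = 5, the Euclidean distance
  val l50 = minkowski(50.0)(a, b) // slightly above 4, approaching the Chebyshev distance
}
```

As p grows, the largest coordinate difference dominates the sum, which is why the result converges to the Chebyshev distance.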

### Tanimoto Distance

Tanimoto distance (a generalization of Jaccard distance): measures dissimilarity between binary or non-negative vectors.

```scala { .api }
class TanimotoDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object TanimotoDistanceMetric {
  def apply(): TanimotoDistanceMetric
}
```

**Formula:** 1 - (a·b)/(||a||² + ||b||² - a·b)

**Usage Example:**

```scala
import org.apache.flink.ml.metrics.distances.TanimotoDistanceMetric

// Reuses v1 and v2 from the Euclidean example
val tanimoto = TanimotoDistanceMetric()
val distance = tanimoto.distance(v1, v2)
```
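
The usage example does not show a result, so the following plain-Scala sketch works the formula through for the same vectors and for a pair of 0/1 vectors. It operates on `Array[Double]` and is not Flink's implementation; `TanimotoSketch` and its helpers are hypothetical names.

```scala
object TanimotoSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors must have the same size")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // 1 - (a·b)/(||a||² + ||b||² - a·b)
  def tanimotoDistance(a: Array[Double], b: Array[Double]): Double = {
    val ab = dot(a, b)
    1.0 - ab / (dot(a, a) + dot(b, b) - ab)
  }

  val v1 = Array(1.0, 2.0, 3.0)
  val v2 = Array(4.0, 5.0, 6.0)

  // a·b = 32, ||a||² = 14, ||b||² = 77, so the distance is 1 - 32/59 ≈ 0.4576
  val d = tanimotoDistance(v1, v2)

  // For 0/1 vectors the formula reduces to the Jaccard distance:
  // 1 shared feature out of 3 present in either vector gives 1 - 1/3.
  val jaccard = tanimotoDistance(Array(1.0, 1.0, 0.0), Array(0.0, 1.0, 1.0))
}
```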

## Using Distance Metrics with Algorithms

Distance metrics are commonly used with machine learning algorithms, particularly k-Nearest Neighbors.

**Example with k-NN:**

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric
import org.apache.flink.ml.nn.KNN

val trainingData: DataSet[Vector] = ??? // ... your training data
val testData: DataSet[Vector] = ???     // ... your test data

val knn = KNN()
  .setK(5)
  .setDistanceMetric(ManhattanDistanceMetric()) // Use Manhattan distance
  .setBlocks(10)

knn.fit(trainingData)                   // fit() trains the estimator in place
val predictions = knn.predict(testData) // nearest training vectors for each test vector
```
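
To see why the choice of metric matters for k-NN, consider a brute-force search in which the metric is just a function argument. This is a plain-Scala sketch on `Array[Double]`, not how Flink's distributed `KNN` is implemented; `KnnSketch`, `Metric`, and `kNearest` are illustrative names.

```scala
object KnnSketch {
  type Metric = (Array[Double], Array[Double]) => Double

  val manhattan: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  val chebyshev: Metric = (a, b) => a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  // Brute-force search: the k training points closest to `query` under `metric`.
  def kNearest(k: Int, metric: Metric)(
      training: Seq[Array[Double]],
      query: Array[Double]): Seq[Array[Double]] =
    training.sortBy(p => metric(query, p)).take(k)

  val training = Seq(Array(3.0, 3.0), Array(0.0, 5.0), Array(6.0, 6.0))
  val query = Array(0.0, 0.0)

  // Manhattan: (0, 5) is at distance 5, (3, 3) at distance 6.
  val nearestManhattan = kNearest(1, manhattan)(training, query).head
  // Chebyshev: (3, 3) is at distance 3, (0, 5) at distance 5.
  val nearestChebyshev = kNearest(1, chebyshev)(training, query).head
}
```

The same query has a different nearest neighbor under each metric, so the metric should be chosen with the data in mind rather than left at its default.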

## Choosing the Right Distance Metric

Different distance metrics suit different types of data and applications:

### Euclidean Distance
- **Best for:** Continuous numerical data, geometric problems
- **Characteristics:** Sensitive to magnitude, affected by the curse of dimensionality
- **Use cases:** Image processing, coordinate-based data, general-purpose similarity

### Manhattan Distance
- **Best for:** High-dimensional data, data with outliers
- **Characteristics:** Less sensitive to outliers than Euclidean, more robust in high dimensions
- **Use cases:** Recommendation systems, text analysis, categorical data
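
The difference in outlier sensitivity comes from squaring: a single large coordinate difference takes up a larger share of the Euclidean sum than of the Manhattan sum. A small plain-Scala sketch with made-up per-coordinate differences (the `OutlierSketch` object and its values are hypothetical):

```scala
object OutlierSketch {
  // Per-coordinate absolute differences between two hypothetical vectors;
  // the last coordinate is an outlier.
  val diffs = Array(1.0, 1.0, 1.0, 10.0)

  val sumAbs = diffs.sum                     // 13, the Manhattan distance
  val sumSquares = diffs.map(d => d * d).sum // 103, the squared Euclidean distance

  // Share of each sum contributed by the outlier coordinate.
  val outlierShareL1 = 10.0 / sumAbs      // 10/13 ≈ 0.77
  val outlierShareL2 = 100.0 / sumSquares // 100/103 ≈ 0.97
}
```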

### Cosine Distance
- **Best for:** High-dimensional sparse data, text/document similarity
- **Characteristics:** Magnitude-independent, focuses on vector direction
- **Use cases:** Text mining, information retrieval, collaborative filtering

### Chebyshev Distance
- **Best for:** Applications where the maximum difference matters most
- **Characteristics:** Considers only the largest difference
- **Use cases:** Game theory, optimization, scheduling problems

### Minkowski Distance
- **Best for:** When you need flexibility to tune the distance behavior
- **Characteristics:** Generalizes other metrics, allows tuning via the `p` parameter
- **Use cases:** Experimental settings, domain-specific requirements

### Tanimoto Distance
- **Best for:** Binary or non-negative feature data, chemical similarity
- **Characteristics:** Bounded between 0 and 1 for non-negative vectors, good for sparse binary vectors
- **Use cases:** Chemical compound similarity, binary feature comparison

## Custom Distance Metrics

You can implement custom distance metrics by extending the `DistanceMetric` trait:

```scala
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.DistanceMetric
import org.apache.flink.ml.nn.KNN

class CustomDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double = {
    // Implement your custom distance calculation
    require(a.size == b.size, "Vectors must have the same size")

    var sum = 0.0
    for (i <- 0 until a.size) {
      val diff = a(i) - b(i)
      sum += math.pow(math.abs(diff), 1.5) // Example: L1.5 norm
    }

    math.pow(sum, 1.0 / 1.5)
  }
}

// Use with algorithms
val customMetric = new CustomDistanceMetric()
val knn = KNN().setDistanceMetric(customMetric)
```

## Performance Considerations

- **Squared Euclidean** is faster than Euclidean when you only need relative distances, because it skips the square root
- **Manhattan** is computationally cheaper than Euclidean (no squaring and no square root)
- **Cosine** requires a dot product plus both vector magnitudes, which can be expensive for large vectors
- **Sparse vectors** can be processed more efficiently by metrics that only need to visit non-zero elements
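
As an illustration of the last point, a Manhattan distance over sparse data only needs to touch indices that are stored in at least one of the two vectors. The sketch below models a sparse vector as a `Map[Int, Double]`; this representation and the `SparseSketch` names are assumptions for the example, not Flink's `SparseVector` API.

```scala
object SparseSketch {
  // A sparse vector as a map from index to non-zero value (illustrative only).
  type Sparse = Map[Int, Double]

  // Manhattan distance that only visits indices stored in either vector.
  // Indices missing from both vectors contribute |0 - 0| = 0 and are skipped.
  def manhattan(a: Sparse, b: Sparse): Double =
    (a.keySet ++ b.keySet).toSeq
      .map(i => math.abs(a.getOrElse(i, 0.0) - b.getOrElse(i, 0.0)))
      .sum

  // Two vectors in a conceptually 1,000,000-dimensional space.
  val a: Sparse = Map(3 -> 1.0, 500000 -> 2.0)
  val b: Sparse = Map(3 -> 4.0, 999999 -> 1.5)

  // Only three indices are visited: |1 - 4| + |2 - 0| + |0 - 1.5| = 6.5
  val d = manhattan(a, b)
}
```

The cost depends on the number of stored entries rather than on the nominal dimensionality.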

Choose the appropriate distance metric based on your data characteristics and computational requirements.