Distance Metrics

Apache Flink ML provides a collection of distance metrics for measuring the similarity (or dissimilarity) between vectors. These metrics are used by algorithms such as k-Nearest Neighbors and can also be used standalone for similarity computations.

Base Distance Metric Interface

All distance metrics implement the DistanceMetric trait.

trait DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}
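
Because every metric shares this interface, code can be written against the trait and accept whichever concrete metric the caller supplies. A minimal sketch (the helper distancesTo is illustrative, not part of Flink ML):

import org.apache.flink.ml.math.{DenseVector, Vector}
import org.apache.flink.ml.metrics.distances.{DistanceMetric, EuclideanDistanceMetric, ManhattanDistanceMetric}

// Hypothetical helper: distance from one query point to each candidate,
// computed with whichever metric is passed in
def distancesTo(query: Vector, candidates: Seq[Vector], metric: DistanceMetric): Seq[Double] =
  candidates.map(metric.distance(query, _))

val query = DenseVector(0.0, 0.0)
val points = Seq[Vector](DenseVector(1.0, 1.0), DenseVector(3.0, 4.0))

distancesTo(query, points, EuclideanDistanceMetric())  // Seq(1.414..., 5.0)
distancesTo(query, points, ManhattanDistanceMetric())  // Seq(2.0, 7.0)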

Available Distance Metrics

Euclidean Distance

Standard Euclidean distance (L2 norm) - the straight-line distance between two points.

class EuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistanceMetric {
  def apply(): EuclideanDistanceMetric
}

Formula: √(Σ(aᵢ - bᵢ)²)

Usage Example:

import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.math.DenseVector

val euclidean = EuclideanDistanceMetric()

val v1 = DenseVector(1.0, 2.0, 3.0)
val v2 = DenseVector(4.0, 5.0, 6.0)

val distance = euclidean.distance(v1, v2)  // Returns: 5.196152422706632
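
Here the value follows directly from the formula: √((4 - 1)² + (5 - 2)² + (6 - 3)²) = √27 ≈ 5.196.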

Squared Euclidean Distance

Squared Euclidean distance - skips the square root, making it faster than standard Euclidean when only the relative ordering of distances matters.

class SquaredEuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object SquaredEuclideanDistanceMetric {
  def apply(): SquaredEuclideanDistanceMetric
}

Formula: Σ(aᵢ - bᵢ)²

Usage Example:

import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

val squaredEuclidean = SquaredEuclideanDistanceMetric()
val distance = squaredEuclidean.distance(v1, v2)  // Returns: 27.0

Manhattan Distance

Manhattan distance (L1 norm) - sum of absolute differences, also known as taxicab distance.

class ManhattanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ManhattanDistanceMetric {
  def apply(): ManhattanDistanceMetric
}

Formula: Σ|aᵢ - bᵢ|

Usage Example:

import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

val manhattan = ManhattanDistanceMetric()
val distance = manhattan.distance(v1, v2)  // Returns: 9.0

Cosine Distance

Cosine distance - measures the angle between vectors, independent of magnitude.

class CosineDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistanceMetric {
  def apply(): CosineDistanceMetric
}

Formula: 1 - (a·b)/(||a|| × ||b||)

Usage Example:

import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

val cosine = CosineDistanceMetric()
val distance = cosine.distance(v1, v2)  // Returns cosine distance (0 = identical direction)
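
For v1 and v2 above: v1·v2 = 32, ||v1|| = √14 and ||v2|| = √77, so the distance is 1 - 32/√(14 × 77) ≈ 0.0254.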

Chebyshev Distance

Chebyshev distance (L∞ norm) - maximum absolute difference across all dimensions.

class ChebyshevDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ChebyshevDistanceMetric {
  def apply(): ChebyshevDistanceMetric
}

Formula: max|aᵢ - bᵢ|

Usage Example:

import org.apache.flink.ml.metrics.distances.ChebyshevDistanceMetric

val chebyshev = ChebyshevDistanceMetric()
val distance = chebyshev.distance(v1, v2)  // Returns: 3.0

Minkowski Distance

Generalized Minkowski distance (Lp norm) - parameterized distance metric that includes other metrics as special cases.

class MinkowskiDistanceMetric(p: Double) extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object MinkowskiDistanceMetric {
  def apply(p: Double): MinkowskiDistanceMetric
}

Formula: (Σ|aᵢ - bᵢ|ᵖ)^(1/p)

Special Cases:

  • p = 1: Manhattan distance
  • p = 2: Euclidean distance
  • p → ∞: Chebyshev distance (in the limit; the class itself takes a finite p)

Usage Example:

import org.apache.flink.ml.metrics.distances.MinkowskiDistanceMetric

val minkowski3 = MinkowskiDistanceMetric(3.0)     // L3 norm
val minkowski1 = MinkowskiDistanceMetric(1.0)     // Equivalent to Manhattan
val minkowski2 = MinkowskiDistanceMetric(2.0)     // Equivalent to Euclidean

val distance = minkowski3.distance(v1, v2)
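
For v1 and v2 above, minkowski3 gives (3³ + 3³ + 3³)^(1/3) = 81^(1/3) ≈ 4.327, while minkowski1 and minkowski2 reproduce the Manhattan (9.0) and Euclidean (≈ 5.196) results.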

Tanimoto Distance

Tanimoto distance (an extension of the Jaccard distance) - measures dissimilarity between binary or non-negative real-valued vectors.

class TanimotoDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object TanimotoDistanceMetric {
  def apply(): TanimotoDistanceMetric
}

Formula: 1 - (a·b)/(||a||² + ||b||² - a·b)

Usage Example:

import org.apache.flink.ml.metrics.distances.TanimotoDistanceMetric

val tanimoto = TanimotoDistanceMetric()
val distance = tanimoto.distance(v1, v2)
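
For v1 and v2 above: v1·v2 = 32, ||v1||² = 14 and ||v2||² = 77, so the distance is 1 - 32/(14 + 77 - 32) = 1 - 32/59 ≈ 0.458.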

Using Distance Metrics with Algorithms

Distance metrics are commonly used with machine learning algorithms, particularly k-Nearest Neighbors.

Example with k-NN:

import org.apache.flink.api.scala._
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.nn.KNN
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

val trainingData: DataSet[Vector] = //... your training points
val testData: DataSet[Vector] = //... the points to find neighbors for

val knn = KNN()
  .setK(5)
  .setDistanceMetric(ManhattanDistanceMetric())  // Use Manhattan distance
  .setBlocks(10)

// fit stores the training set in the estimator; predict pairs each test point
// with its k nearest neighbors from the training set
knn.fit(trainingData)
val predictions = knn.predict(testData)

Choosing the Right Distance Metric

Different distance metrics are suitable for different types of data and applications:

Euclidean Distance

  • Best for: Continuous numerical data, geometric problems
  • Characteristics: Sensitive to magnitude, affected by the curse of dimensionality
  • Use cases: Image processing, coordinate-based data, general-purpose similarity

Manhattan Distance

  • Best for: High-dimensional data, data with outliers
  • Characteristics: Less sensitive to outliers than Euclidean, more robust in high dimensions
  • Use cases: Recommendation systems, text analysis, categorical data

Cosine Distance

  • Best for: High-dimensional sparse data, text/document similarity
  • Characteristics: Magnitude-independent, focuses on vector direction (see the sketch below)
  • Use cases: Text mining, information retrieval, collaborative filtering
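
A minimal illustration of this magnitude independence, using the metrics shown earlier (the values in the comments are what the formulas give, up to floating-point rounding):

import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.metrics.distances.{CosineDistanceMetric, EuclideanDistanceMetric}

val v = DenseVector(1.0, 2.0, 3.0)
val scaled = DenseVector(2.0, 4.0, 6.0)  // same direction, twice the magnitude

CosineDistanceMetric().distance(v, scaled)     // ≈ 0.0 - direction is identical
EuclideanDistanceMetric().distance(v, scaled)  // ≈ 3.742 - grows with the magnitude gap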

Chebyshev Distance

  • Best for: Applications where the maximum difference matters most
  • Characteristics: Considers only the largest difference
  • Use cases: Game theory, optimization, scheduling problems

Minkowski Distance

  • Best for: When you need flexibility to tune the distance behavior
  • Characteristics: Generalizes other metrics, allows tuning via p parameter
  • Use cases: Experimental settings, domain-specific requirements

Tanimoto Distance

  • Best for: Binary or non-negative feature data, chemical similarity
  • Characteristics: Bounded between 0 and 1, good for sparse binary vectors
  • Use cases: Chemical compound similarity, binary feature comparison

Custom Distance Metrics

You can implement custom distance metrics by extending the DistanceMetric trait:

import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.metrics.distances.DistanceMetric

class CustomDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double = {
    // Implement your custom distance calculation
    require(a.size == b.size, "Vectors must have the same size")

    var sum = 0.0
    for (i <- 0 until a.size) {
      val diff = a(i) - b(i)
      sum += math.pow(math.abs(diff), 1.5)  // Example: L1.5 norm
    }

    math.pow(sum, 1.0 / 1.5)
  }
}

// Use with algorithms
val customMetric = new CustomDistanceMetric()
val knn = KNN().setDistanceMetric(customMetric)

Performance Considerations

  • Squared Euclidean is faster than Euclidean when you only need relative distances (see the sketch after this list)
  • Manhattan is computationally cheaper than Euclidean (no square root calculation)
  • Cosine requires computing vector magnitudes, which can be expensive for large vectors
  • Sparse vectors can be more efficient with certain distance metrics that can skip zero elements
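
As a small illustration of the first point, a nearest-neighbor lookup picks the same winner with either metric, since squaring preserves the ordering of distances (the minBy-based search below is plain Scala, not a Flink ML API):

import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.metrics.distances.{EuclideanDistanceMetric, SquaredEuclideanDistanceMetric}

val query = DenseVector(0.0, 0.0)
val candidates = Seq(DenseVector(1.0, 1.0), DenseVector(0.5, 0.5), DenseVector(3.0, 0.0))

// Both metrics select the same nearest candidate; squared Euclidean just skips the square root
val nearestEuclidean = candidates.minBy(EuclideanDistanceMetric().distance(query, _))
val nearestSquared   = candidates.minBy(SquaredEuclideanDistanceMetric().distance(query, _))
// nearestEuclidean == nearestSquared == DenseVector(0.5, 0.5)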

Choose the appropriate distance metric based on your data characteristics and computational requirements.