tessl/maven-org-apache-flink--flink-ml-2-12

Machine learning library for Apache Flink providing scalable ML algorithms including classification (SVM), regression (multiple linear regression), and recommendation (ALS) optimized for distributed stream and batch processing

Distance Metrics

Apache Flink ML provides a collection of distance metrics for measuring the dissimilarity between vectors. These metrics are used internally by algorithms such as k-Nearest Neighbors, and can also be used standalone for similarity computations.

Base Distance Metric Interface

All distance metrics implement the DistanceMetric trait.

trait DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

Available Distance Metrics

Euclidean Distance

Standard Euclidean distance (L2 norm) - the straight-line distance between two points.

class EuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistanceMetric {
  def apply(): EuclideanDistanceMetric
}

Formula: √(Σ(aᵢ - bᵢ)²)

Usage Example:

import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
import org.apache.flink.ml.math.DenseVector

val euclidean = EuclideanDistanceMetric()

val v1 = DenseVector(1.0, 2.0, 3.0)
val v2 = DenseVector(4.0, 5.0, 6.0)

val distance = euclidean.distance(v1, v2)  // Returns: 5.196152422706632

Squared Euclidean Distance

Squared Euclidean distance - faster than standard Euclidean when only relative distances matter.

class SquaredEuclideanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object SquaredEuclideanDistanceMetric {
  def apply(): SquaredEuclideanDistanceMetric
}

Formula: Σ(aᵢ - bᵢ)²

Usage Example:

import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric

val squaredEuclidean = SquaredEuclideanDistanceMetric()
val distance = squaredEuclidean.distance(v1, v2)  // Returns: 27.0

Manhattan Distance

Manhattan distance (L1 norm) - sum of absolute differences, also known as taxicab distance.

class ManhattanDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ManhattanDistanceMetric {
  def apply(): ManhattanDistanceMetric
}

Formula: Σ|aᵢ - bᵢ|

Usage Example:

import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

val manhattan = ManhattanDistanceMetric()
val distance = manhattan.distance(v1, v2)  // Returns: 9.0

Cosine Distance

Cosine distance - measures the angle between vectors, independent of magnitude.

class CosineDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object CosineDistanceMetric {
  def apply(): CosineDistanceMetric
}

Formula: 1 - (a·b)/(||a|| × ||b||)

Usage Example:

import org.apache.flink.ml.metrics.distances.CosineDistanceMetric

val cosine = CosineDistanceMetric()
val distance = cosine.distance(v1, v2)  // Returns cosine distance (0 = identical direction)
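To make the formula concrete, here is a plain-Scala sketch (no Flink dependency; the helper names are illustrative, not part of the library) that computes the cosine distance for the sample vectors above and demonstrates its magnitude independence:

```scala
// Illustrative helpers implementing the cosine-distance formula directly.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

def cosineDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - dot(a, b) / (norm(a) * norm(b))

val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)

val d = cosineDistance(a, b)  // ≈ 0.0254 (vectors point in nearly the same direction)

// Scaling a vector changes only its magnitude, not the angle,
// so the distance is unchanged.
val dScaled = cosineDistance(a, b.map(_ * 10))
```

Because only the angle matters, cosine distance is a common choice for term-frequency vectors, where document length should not dominate the comparison.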

Chebyshev Distance

Chebyshev distance (L∞ norm) - maximum absolute difference across all dimensions.

class ChebyshevDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object ChebyshevDistanceMetric {
  def apply(): ChebyshevDistanceMetric
}

Formula: max|aᵢ - bᵢ|

Usage Example:

import org.apache.flink.ml.metrics.distances.ChebyshevDistanceMetric

val chebyshev = ChebyshevDistanceMetric()
val distance = chebyshev.distance(v1, v2)  // Returns: 3.0

Minkowski Distance

Generalized Minkowski distance (Lp norm) - parameterized distance metric that includes other metrics as special cases.

class MinkowskiDistanceMetric(p: Double) extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object MinkowskiDistanceMetric {
  def apply(p: Double): MinkowskiDistanceMetric
}

Formula: (Σ|aᵢ - bᵢ|ᵖ)^(1/p)

Special Cases:

  • p = 1: Manhattan distance
  • p = 2: Euclidean distance
  • p → ∞: Chebyshev distance (as a limiting case)

Usage Example:

import org.apache.flink.ml.metrics.distances.MinkowskiDistanceMetric

val minkowski3 = MinkowskiDistanceMetric(3.0)     // L3 norm
val minkowski1 = MinkowskiDistanceMetric(1.0)     // Equivalent to Manhattan
val minkowski2 = MinkowskiDistanceMetric(2.0)     // Equivalent to Euclidean

val distance = minkowski3.distance(v1, v2)
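The special cases can be checked directly with a plain-Scala sketch of the Minkowski formula (no Flink dependency; the helper name is illustrative):

```scala
// Direct implementation of (Σ|aᵢ - bᵢ|ᵖ)^(1/p).
def minkowski(a: Array[Double], b: Array[Double], p: Double): Double =
  math.pow(a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }.sum, 1.0 / p)

val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)

val manhattan = minkowski(a, b, 1.0)    // 9.0, matching Σ|aᵢ - bᵢ|
val euclidean = minkowski(a, b, 2.0)    // ≈ 5.196, matching √(Σ(aᵢ - bᵢ)²)
val large     = minkowski(a, b, 100.0)  // approaches max|aᵢ - bᵢ| = 3.0
```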

Tanimoto Distance

Tanimoto distance - a generalization of the Jaccard distance to real-valued, non-negative vectors.

class TanimotoDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double
}

object TanimotoDistanceMetric {
  def apply(): TanimotoDistanceMetric
}

Formula: 1 - (a·b)/(||a||² + ||b||² - a·b)

Usage Example:

import org.apache.flink.ml.metrics.distances.TanimotoDistanceMetric

val tanimoto = TanimotoDistanceMetric()
val distance = tanimoto.distance(v1, v2)
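As a worked example of the formula, here is a plain-Scala sketch (no Flink dependency; the helper names are illustrative) evaluated on the sample vectors used above:

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// 1 - (a·b) / (||a||² + ||b||² - a·b)
def tanimoto(a: Array[Double], b: Array[Double]): Double = {
  val ab = dot(a, b)
  1.0 - ab / (dot(a, a) + dot(b, b) - ab)
}

val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)

// a·b = 32, ||a||² = 14, ||b||² = 77, so d = 1 - 32/(14 + 77 - 32) ≈ 0.4576
val d = tanimoto(a, b)
val zero = tanimoto(a, a)  // identical vectors give distance 0.0
```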

Using Distance Metrics with Algorithms

Distance metrics are commonly used with machine learning algorithms, particularly k-Nearest Neighbors.

Example with k-NN:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.nn.KNN
import org.apache.flink.ml.metrics.distances.ManhattanDistanceMetric

val trainingData: DataSet[LabeledVector] = //... your training data
val testData: DataSet[Vector] = //... your test data

val knn = KNN()
  .setK(5)
  .setDistanceMetric(ManhattanDistanceMetric())  // Use Manhattan distance
  .setBlocks(10)

knn.fit(trainingData)  // the fitted model is held by the estimator itself
val predictions = knn.predict(testData)

Choosing the Right Distance Metric

Different distance metrics are suitable for different types of data and applications:

Euclidean Distance

  • Best for: Continuous numerical data, geometric problems
  • Characteristics: Sensitive to magnitude, affected by the curse of dimensionality
  • Use cases: Image processing, coordinate-based data, general-purpose similarity

Manhattan Distance

  • Best for: High-dimensional data, data with outliers
  • Characteristics: Less sensitive to outliers than Euclidean, more robust in high dimensions
  • Use cases: Recommendation systems, text analysis, categorical data

Cosine Distance

  • Best for: High-dimensional sparse data, text/document similarity
  • Characteristics: Magnitude-independent, focuses on vector direction
  • Use cases: Text mining, information retrieval, collaborative filtering

Chebyshev Distance

  • Best for: Applications where the maximum difference matters most
  • Characteristics: Considers only the largest difference
  • Use cases: Game theory, optimization, scheduling problems

Minkowski Distance

  • Best for: When you need flexibility to tune the distance behavior
  • Characteristics: Generalizes other metrics, allows tuning via p parameter
  • Use cases: Experimental settings, domain-specific requirements

Tanimoto Distance

  • Best for: Binary or non-negative feature data, chemical similarity
  • Characteristics: Bounded between 0 and 1, good for sparse binary vectors
  • Use cases: Chemical compound similarity, binary feature comparison

Custom Distance Metrics

You can implement custom distance metrics by extending the DistanceMetric trait:

class CustomDistanceMetric extends DistanceMetric {
  def distance(a: Vector, b: Vector): Double = {
    // Implement your custom distance calculation
    require(a.size == b.size, "Vectors must have the same size")
    
    var sum = 0.0
    for (i <- 0 until a.size) {
      val diff = a(i) - b(i)
      sum += math.pow(math.abs(diff), 1.5)  // Example: L1.5 norm
    }
    
    math.pow(sum, 1.0 / 1.5)
  }
}

// Use with algorithms
val customMetric = new CustomDistanceMetric()
val knn = KNN().setDistanceMetric(customMetric)

Performance Considerations

  • Squared Euclidean is faster than Euclidean when you only need relative distances
  • Manhattan is computationally cheaper than Euclidean (no square root calculation)
  • Cosine requires computing vector magnitudes, which can be expensive for large vectors
  • Sparse vectors can make some metrics cheaper to evaluate, since implementations can skip zero-valued elements
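The first point above can be illustrated with a plain-Scala sketch (no Flink dependency; the helper name is illustrative): squaring is monotonic on non-negative values, so squared Euclidean distance produces the same nearest-neighbour ranking while skipping the square root.

```scala
def squaredEuclidean(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val query  = Array(0.0, 0.0)
val points = Seq(Array(1.0, 1.0), Array(3.0, 4.0), Array(0.5, 0.5))

// Both rankings pick the same point, so the square root can be skipped
// when only the ordering of distances matters.
val nearestFast  = points.minBy(p => squaredEuclidean(query, p))
val nearestExact = points.minBy(p => math.sqrt(squaredEuclidean(query, p)))
```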

Choose the appropriate distance metric based on your data characteristics and computational requirements.

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-flink--flink-ml-2-12
