
tessl/maven-com-github-haifengl--smile-core

Statistical Machine Intelligence and Learning Engine providing comprehensive machine learning algorithms for classification, regression, clustering, and feature engineering in Java

docs/clustering.md

Clustering

Unsupervised learning algorithms for discovering patterns and groupings in data. Smile Core provides comprehensive clustering capabilities including partitioning methods, hierarchical clustering, density-based algorithms, and spectral clustering.

Capabilities

Core Clustering Interface

Clustering algorithms extend the base PartitionClustering class or implement specific clustering interfaces.

/**
 * Base class for partition clustering algorithms
 */
abstract class PartitionClustering implements Serializable {
    /** Number of clusters */
    public final int k;
    
    /** Cluster assignments for each data point */
    public final int[] y;
    
    /** Size of each cluster */
    public final int[] size;
    
    /** Constant for outlier points */
    public static final int OUTLIER = Integer.MAX_VALUE;
    
    /** K-means++ initialization for centroids */
    public static double[][] seed(double[][] data, int k);
    
    /** Run clustering algorithm multiple times and return best result */
    public static <T extends PartitionClustering> T run(Supplier<T> clustering, int runs);
}
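Because K-means-style algorithms converge to local optima, the `run` helper above restarts an algorithm several times and keeps the best result. A minimal sketch using the `KMeans` API shown below; note the parameter order follows the signature sketched in this document, which may differ from the upstream library:

```java
import smile.clustering.KMeans;
import smile.clustering.PartitionClustering;

// Two well-separated blobs; a single K-means run can land in a poor
// local optimum, so restart 10 times and keep the lowest-distortion model.
double[][] data = {
    {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2},
    {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0}
};
KMeans best = PartitionClustering.run(() -> KMeans.fit(data, 2), 10);
```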

K-Means Clustering

Partitioning algorithm that groups data into k clusters by minimizing within-cluster sum of squares.

/**
 * K-means clustering algorithm
 */
class KMeans extends CentroidClustering<double[], double[]> {
    /** Train K-means with specified number of clusters */
    public static KMeans fit(double[][] data, int k);
    
    /** Train with custom parameters */
    public static KMeans fit(double[][] data, int k, int maxIter, double tolerance);
    
    /** Train with initial centroids */
    public static KMeans fit(double[][] data, double[][] centroids, int maxIter, double tolerance);
    
    /** Predict cluster for new data point */
    public int predict(double[] x);
    
    /** Get cluster centroids */
    public double[][] centroids;
    
    /** Get within-cluster sum of squares */
    public double distortion;
    
    /** Get silhouette coefficient */
    public double[] silhouette();
}

Usage Example:

import smile.clustering.KMeans;

// Basic K-means clustering
KMeans kmeans = KMeans.fit(data, 3);
int[] clusters = kmeans.y;
double[][] centroids = kmeans.centroids;

// Predict cluster for new point
int cluster = kmeans.predict(newPoint);

// Evaluate clustering quality
double[] silhouette = kmeans.silhouette();

Hierarchical Clustering

Agglomerative clustering that builds a hierarchy of clusters using various linkage criteria.

/**
 * Hierarchical clustering with various linkage methods
 */
class HierarchicalClustering extends PartitionClustering {
    /** Perform hierarchical clustering with complete linkage */
    public static HierarchicalClustering fit(double[][] data);
    
    /** Cluster with specified linkage method */
    public static HierarchicalClustering fit(double[][] data, Linkage linkage);
    
    /** Cut the dendrogram to obtain k clusters */
    public int[] partition(int k);
    
    /** Cut dendrogram at specified height */
    public int[] partition(double height);
    
    /** Get dendrogram tree structure */
    public Node[] tree;
    
    /** Get merge heights */
    public double[] height;
}

Linkage Methods

Various linkage criteria for hierarchical clustering.

/**
 * Base linkage interface
 */
interface Linkage {
    /** Calculate distance between clusters */
    double distance(int[] cluster1, int[] cluster2);
}

/**
 * Single linkage (nearest neighbor)
 */
class SingleLinkage implements Linkage {
    public static SingleLinkage of(double[][] proximity);
}

/**
 * Complete linkage (farthest neighbor)
 */
class CompleteLinkage implements Linkage {
    public static CompleteLinkage of(double[][] proximity);
}

/**
 * Ward linkage (minimum variance)
 */
class WardLinkage implements Linkage {
    public static WardLinkage of(double[][] data);
}

/**
 * UPGMA linkage (unweighted pair group method)
 */
class UPGMALinkage implements Linkage {
    public static UPGMALinkage of(double[][] proximity);
}

Density-Based Clustering

Algorithms that discover clusters from the density of data points, capable of finding arbitrarily shaped clusters and identifying outliers.

/**
 * DBSCAN density-based clustering
 * @param <T> the type of input objects
 */
class DBSCAN<T> implements Serializable {
    /** Perform DBSCAN clustering */
    public static <T> DBSCAN<T> fit(T[] data, Distance<T> distance, int minPts, double radius);
    
    /** Cluster assignments */
    public final int[] y;
    
    /** Number of clusters found */
    public final int clusters;
    
    /** Predict the cluster of a new point, or OUTLIER if it lies in a low-density region */
    public int predict(T x);
}

/**
 * DENCLUE density-based clustering
 */
class DENCLUE implements Serializable {
    /** Perform DENCLUE clustering */
    public static DENCLUE fit(double[][] data, double sigma, int minPts);
    
    /** Cluster assignments */
    public final int[] y;
    
    /** Number of clusters */
    public final int k;
    
    /** Density attractors (cluster centers) */
    public final double[][] attractors;
}
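DENCLUE's density attractors act as cluster prototypes. A short sketch using the signature above (the meaning of `minPts` here follows this document; the upstream library may parameterize the density estimate differently):

```java
import smile.clustering.DENCLUE;

// Toy data; sigma controls the Gaussian kernel bandwidth of the
// density estimate, minPts filters attractors with too little support.
double[][] data = {
    {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2}, {0.0, 0.0},
    {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0}, {5.1, 5.1}
};
DENCLUE denclue = DENCLUE.fit(data, 1.0, 2);
double[][] attractors = denclue.attractors;  // one density peak per cluster
```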

Spectral Clustering

Graph-based clustering using eigendecomposition of similarity matrices.

/**
 * Spectral clustering algorithm
 */
class SpectralClustering extends PartitionClustering {
    /** Perform spectral clustering with RBF similarity */
    public static SpectralClustering fit(double[][] data, int k, double sigma);
    
    /** Spectral clustering with custom similarity matrix */
    public static SpectralClustering fit(double[][] similarity, int k);
    
    /** Get embedding coordinates */
    public double[][] coordinates();
    
    /** Get eigenvalues */
    public double[] eigenvalues();
}

K-Modes and Mixed-Type Clustering

Clustering algorithms for categorical data and mixed-type datasets.

/**
 * K-modes clustering for categorical data
 */
class KModes extends CentroidClustering<int[], int[]> {
    /** Train K-modes clustering */
    public static KModes fit(int[][] data, int k);
    
    /** Train with custom parameters */
    public static KModes fit(int[][] data, int k, int maxIter, int runs);
    
    /** Predict cluster for categorical data */
    public int predict(int[] x);
    
    /** Get cluster modes (most frequent values) */
    public int[][] centroids;
}
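The K-modes API above works on integer-encoded categorical attributes; a toy example, assuming the signatures sketched in this document:

```java
import smile.clustering.KModes;

// Each row is a record of integer-encoded categorical attributes.
int[][] data = {
    {0, 1, 0}, {0, 1, 1}, {0, 0, 0},   // records dominated by low codes
    {2, 3, 2}, {2, 3, 3}, {2, 2, 2}    // records dominated by high codes
};

KModes kmodes = KModes.fit(data, 2);
int[] assignments = kmodes.y;       // cluster label per record
int[][] modes = kmodes.centroids;   // most frequent value per attribute

// Assign a new record to the cluster with the nearest mode.
int cluster = kmodes.predict(new int[]{0, 1, 0});
```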

Advanced Clustering Algorithms

Sophisticated clustering methods for specific use cases.

/**
 * X-means clustering with automatic k selection
 */
class XMeans extends CentroidClustering<double[], double[]> {
    /** Perform X-means clustering with automatic k selection */
    public static XMeans fit(double[][] data, int kmax);
    
    /** Get final number of clusters */
    public int k();
    
    /** Get cluster centroids */
    public double[][] centroids;
}

/**
 * G-means clustering using Gaussian assumption test
 */
class GMeans extends CentroidClustering<double[], double[]> {
    /** Perform G-means clustering */
    public static GMeans fit(double[][] data, int kmax);
    
    /** Get final number of clusters */
    public int k();
}

/**
 * CLARANS clustering for large datasets
 */
class CLARANS extends PartitionClustering {
    /** Perform CLARANS clustering */
    public static CLARANS fit(double[][] data, int k, int maxNeighbor, int numLocal);
    
    /** Get medoid indices */
    public int[] medoids();
}

/**
 * Deterministic Annealing clustering
 */
class DeterministicAnnealing extends CentroidClustering<double[], double[]> {
    /** Perform deterministic annealing clustering */
    public static DeterministicAnnealing fit(double[][] data, int kmax);
    
    /** Get final temperature */
    public double temperature();
}
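X-means and G-means select the number of clusters automatically up to a ceiling, which avoids guessing k up front. A sketch using the X-means signature above:

```java
import smile.clustering.XMeans;

// Two obvious groups; X-means splits centroids only while an
// information criterion favors the split, so it may stop well below kmax.
double[][] data = {
    {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2}, {0.0, 0.0},
    {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0}, {5.1, 5.1}
};
XMeans xmeans = XMeans.fit(data, 10);
int chosenK = xmeans.k();              // selected automatically, at most 10
double[][] centers = xmeans.centroids;
```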

Centroid-Based Clustering

Base class for algorithms that represent clusters by centroids.

/**
 * Base class for centroid-based clustering algorithms
 * @param <T> the type of input objects
 * @param <C> the type of cluster centroids
 */
abstract class CentroidClustering<T, C> extends PartitionClustering {
    /** Cluster centroids */
    public final C[] centroids;
    
    /** Predict cluster assignment for new data point */
    public abstract int predict(T x);
    
    /** Calculate quantization error */
    public abstract double quantizationError(T[] data);
}

Evaluation Metrics

Metrics for evaluating clustering quality and comparing different clustering results.

/**
 * Silhouette analysis for cluster validation
 */
class Silhouette {
    /** Calculate silhouette coefficient for each point */
    public static double[] of(double[][] data, int[] clusters);
    
    /** Calculate mean silhouette coefficient */
    public static double mean(double[][] data, int[] clusters);
}

/**
 * Davies-Bouldin Index for cluster validation
 */
class DaviesBouldin {
    /** Calculate Davies-Bouldin index */
    public static double of(double[][] data, int[] clusters);
}

/**
 * Calinski-Harabasz Index (Variance Ratio Criterion)
 */
class CalinskiHarabasz {
    /** Calculate Calinski-Harabasz index */
    public static double of(double[][] data, int[] clusters);
}

Cluster Initialization

Methods for initializing cluster centers and parameters.

/**
 * K-means++ initialization
 */
class KMeansPlusPlus {
    /** Initialize centroids using K-means++ algorithm */
    public static double[][] init(double[][] data, int k);
    
    /** Initialize with custom distance metric */
    public static double[][] init(double[][] data, int k, Distance<double[]> distance);
}

/**
 * Random initialization strategies
 */
class RandomInit {
    /** Random initialization from data points */
    public static double[][] fromData(double[][] data, int k);
    
    /** Random initialization within data bounds */
    public static double[][] uniform(double[][] data, int k);
}
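Explicit seeding pairs naturally with the centroid-taking `KMeans.fit` overload shown earlier. A sketch using the classes above (`KMeansPlusPlus` is presented here as in this document; the base class's `seed` method provides equivalent K-means++ seeding):

```java
import smile.clustering.KMeans;

double[][] data = {
    {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2},
    {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0},
    {9.8, 0.2}, {10.1, 0.0}, {9.9, 0.1}
};

// Seed 3 spread-out centroids with K-means++, then refine with
// Lloyd iterations: at most 100, convergence tolerance 1e-4.
double[][] seeds = KMeansPlusPlus.init(data, 3);
KMeans kmeans = KMeans.fit(data, seeds, 100, 1e-4);
```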

Usage Examples:

// Hierarchical clustering with different linkages,
// built via the static factories shown above
// (proximity is a precomputed pairwise distance matrix)
HierarchicalClustering hc1 = HierarchicalClustering.fit(data, WardLinkage.of(data));
HierarchicalClustering hc2 = HierarchicalClustering.fit(data, CompleteLinkage.of(proximity));

// Cut dendrogram to get 5 clusters
int[] clusters = hc1.partition(5);

// DBSCAN for arbitrary-shaped clusters
DBSCAN<double[]> dbscan = DBSCAN.fit(data, new EuclideanDistance(), 5, 0.5);
int[] dbscanClusters = dbscan.y;
System.out.println("Found " + dbscan.clusters + " clusters");

// Spectral clustering for non-convex clusters
SpectralClustering sc = SpectralClustering.fit(data, 3, 1.0);
double[][] embedding = sc.coordinates();

// Evaluate clustering quality
double[] silhouette = Silhouette.of(data, clusters);
double meanSilhouette = Silhouette.mean(data, clusters);
double dbIndex = DaviesBouldin.of(data, clusters);

Common Clustering Parameters

Most clustering algorithms support these configuration options:

  • k: Number of clusters (for partitioning methods)
  • maxIter: Maximum iterations for convergence
  • tolerance: Convergence tolerance
  • runs: Number of random restarts
  • minPts: Minimum points for density-based clustering
  • radius/epsilon: Neighborhood radius for density-based clustering
  • sigma: Bandwidth parameter for kernel-based methods
  • linkage: Linkage criterion for hierarchical clustering
  • seed: Random seed for reproducible results

Distance Metrics

Clustering algorithms support various distance metrics:

  • EuclideanDistance: Standard L2 distance
  • ManhattanDistance: L1 distance
  • ChebyshevDistance: L∞ distance
  • CorrelationDistance: 1 - correlation coefficient
  • CosineDistance: 1 - cosine similarity
  • HammingDistance: For binary/categorical data
  • JaccardDistance: For set-based data
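Any of these metrics can be plugged into a distance-parameterized algorithm such as DBSCAN. Here `ManhattanDistance` replaces the Euclidean default; the distance classes live in `smile.math.distance`:

```java
import smile.clustering.DBSCAN;
import smile.math.distance.ManhattanDistance;

double[][] data = {
    {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2}, {0.0, 0.0},
    {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0}, {5.1, 5.1}
};

// Density-based clustering under the L1 metric:
// a core point needs 3 neighbors within an L1 radius of 0.6.
DBSCAN<double[]> dbscan = DBSCAN.fit(data, new ManhattanDistance(), 3, 0.6);
```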

Install with Tessl CLI

npx tessl i tessl/maven-com-github-haifengl--smile-core
