Statistical Machine Intelligence and Learning Engine providing comprehensive machine learning algorithms for classification, regression, clustering, and feature engineering in Java
—
Unsupervised learning algorithms for discovering patterns and groupings in data. Smile Core provides comprehensive clustering capabilities including partitioning methods, hierarchical clustering, density-based algorithms, and spectral clustering.
Clustering algorithms extend the base PartitionClustering class or implement specific clustering interfaces.
/**
* Base class for partition clustering algorithms
*/
abstract class PartitionClustering implements Serializable {
/** Number of clusters */
public final int k;
/** Cluster assignments for each data point */
public final int[] y;
/** Size of each cluster */
public final int[] size;
/** Constant for outlier points */
public static final int OUTLIER = Integer.MAX_VALUE;
/** K-means++ initialization for centroids */
public static double[][] seed(double[][] data, int k);
/** Run clustering algorithm multiple times and return best result */
public static <T extends PartitionClustering> T run(Supplier<T> clustering, int runs);
}
Partitioning algorithm that groups data into k clusters by minimizing the within-cluster sum of squares.
/**
* K-means clustering algorithm
*/
class KMeans extends CentroidClustering<double[], double[]> {
/** Train K-means with specified number of clusters */
public static KMeans fit(double[][] data, int k);
/** Train with custom parameters */
public static KMeans fit(double[][] data, int k, int maxIter, double tolerance);
/** Train with initial centroids */
public static KMeans fit(double[][] data, double[][] centroids, int maxIter, double tolerance);
/** Predict cluster for new data point */
public int predict(double[] x);
/** Get cluster centroids */
public double[][] centroids;
/** Get within-cluster sum of squares */
public double distortion;
/** Get silhouette coefficient */
public double[] silhouette();
}
Usage Example:
import smile.clustering.KMeans;
// Basic K-means clustering
KMeans kmeans = KMeans.fit(data, 3);
int[] clusters = kmeans.y;
double[][] centroids = kmeans.centroids;
// Predict cluster for new point
int cluster = kmeans.predict(newPoint);
// Evaluate clustering quality
double[] silhouette = kmeans.silhouette();
Agglomerative clustering that builds a hierarchy of clusters using various linkage criteria.
/**
* Hierarchical clustering with various linkage methods
*/
class HierarchicalClustering extends PartitionClustering {
/** Perform hierarchical clustering with complete linkage */
public static HierarchicalClustering fit(double[][] data);
/** Cluster with specified linkage method */
public static HierarchicalClustering fit(double[][] data, Linkage linkage);
/** Cut dendrogram at specified height to get k clusters */
public int[] partition(int k);
/** Cut dendrogram at specified height */
public int[] partition(double height);
/** Get dendrogram tree structure */
public Node[] tree;
/** Get merge heights */
public double[] height;
}
Various linkage criteria for hierarchical clustering.
/**
* Base linkage interface
*/
interface Linkage {
/** Calculate distance between clusters */
double distance(int[] cluster1, int[] cluster2);
}
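The implementations below differ only in how they aggregate pairwise distances between two clusters. As an illustration (a plain-Java sketch over a precomputed proximity matrix, not Smile's implementation; the class and method names here are hypothetical), single linkage takes the nearest pair across the two clusters while complete linkage takes the farthest:

```java
public class LinkageSketch {
    /** Single linkage: minimum pairwise distance between the two clusters. */
    static double single(double[][] proximity, int[] c1, int[] c2) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : c1)
            for (int j : c2)
                min = Math.min(min, proximity[i][j]);
        return min;
    }

    /** Complete linkage: maximum pairwise distance between the two clusters. */
    static double complete(double[][] proximity, int[] c1, int[] c2) {
        double max = Double.NEGATIVE_INFINITY;
        for (int i : c1)
            for (int j : c2)
                max = Math.max(max, proximity[i][j]);
        return max;
    }

    public static void main(String[] args) {
        // Toy symmetric proximity matrix for 4 points.
        double[][] d = {
            {0, 1, 4, 5},
            {1, 0, 3, 6},
            {4, 3, 0, 2},
            {5, 6, 2, 0}
        };
        int[] a = {0, 1};
        int[] b = {2, 3};
        System.out.println(single(d, a, b));   // nearest cross-cluster pair: 3.0
        System.out.println(complete(d, a, b)); // farthest cross-cluster pair: 6.0
    }
}
```

Single linkage tends to produce chained, elongated clusters; complete linkage favors compact ones.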
/**
* Single linkage (nearest neighbor)
*/
class SingleLinkage implements Linkage {
public static SingleLinkage of(double[][] proximity);
}
/**
* Complete linkage (farthest neighbor)
*/
class CompleteLinkage implements Linkage {
public static CompleteLinkage of(double[][] proximity);
}
/**
* Ward linkage (minimum variance)
*/
class WardLinkage implements Linkage {
public static WardLinkage of(double[][] data);
}
/**
* UPGMA linkage (unweighted pair group method)
*/
class UPGMALinkage implements Linkage {
public static UPGMALinkage of(double[][] proximity);
}
Algorithms that find clusters based on the density of data points, capable of discovering arbitrarily shaped clusters and identifying outliers.
/**
* DBSCAN density-based clustering
* @param <T> the type of input objects
*/
class DBSCAN<T> implements Serializable {
/** Perform DBSCAN clustering */
public static <T> DBSCAN<T> fit(T[] data, Distance<T> distance, int minPts, double radius);
/** Cluster assignments */
public final int[] y;
/** Number of clusters found */
public final int clusters;
/** Classify new point as core, border, or outlier */
public int predict(T x);
}
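DBSCAN labels a point a core point when at least minPts points (including itself) fall within the given radius, then grows clusters by chaining density-reachable core points. The neighborhood test at the heart of this can be sketched in self-contained Java (an illustration with hypothetical names, not Smile's code):

```java
import java.util.ArrayList;
import java.util.List;

public class DbscanSketch {
    /** Indices of all points within `radius` of data[p] (Euclidean distance). */
    static List<Integer> regionQuery(double[][] data, int p, double radius) {
        List<Integer> neighbors = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            double sum = 0;
            for (int d = 0; d < data[p].length; d++) {
                double diff = data[p][d] - data[i][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= radius) neighbors.add(i);
        }
        return neighbors;
    }

    /** A point is a core point if its neighborhood contains at least minPts points. */
    static boolean isCore(double[][] data, int p, double radius, int minPts) {
        return regionQuery(data, p, radius).size() >= minPts;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0.3, 0}, {0, 0.4}, {5, 5}};
        System.out.println(isCore(data, 0, 0.5, 3)); // dense point: true
        System.out.println(isCore(data, 3, 0.5, 3)); // isolated point: false
    }
}
```

Points that are neither core points nor within reach of one are reported as outliers, which is why DBSCAN needs no upfront cluster count.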
/**
* DENCLUE density-based clustering
*/
class DENCLUE implements Serializable {
/** Perform DENCLUE clustering */
public static DENCLUE fit(double[][] data, double sigma, int minPts);
/** Cluster assignments */
public final int[] y;
/** Number of clusters */
public final int k;
/** Density attractors (cluster centers) */
public final double[][] attractors;
}
Graph-based clustering using eigendecomposition of similarity matrices.
/**
* Spectral clustering algorithm
*/
class SpectralClustering extends PartitionClustering {
/** Perform spectral clustering with RBF similarity */
public static SpectralClustering fit(double[][] data, int k, double sigma);
/** Spectral clustering with custom similarity matrix */
public static SpectralClustering fit(double[][] similarity, int k);
/** Get embedding coordinates */
public double[][] coordinates();
/** Get eigenvalues */
public double[] eigenvalues();
}
Clustering algorithms for categorical data and mixed-type datasets.
/**
* K-modes clustering for categorical data
*/
class KModes extends CentroidClustering<int[], int[]> {
/** Train K-modes clustering */
public static KModes fit(int[][] data, int k);
/** Train with custom parameters */
public static KModes fit(int[][] data, int k, int maxIter, int runs);
/** Predict cluster for categorical data */
public int predict(int[] x);
/** Get cluster modes (most frequent values) */
public int[][] centroids;
}
Specialized clustering methods for specific use cases.
/**
* X-means clustering with automatic k selection
*/
class XMeans extends CentroidClustering<double[], double[]> {
/** Perform X-means clustering with automatic k selection */
public static XMeans fit(double[][] data, int kmax);
/** Get final number of clusters */
public int k();
/** Get cluster centroids */
public double[][] centroids;
}
/**
* G-means clustering using Gaussian assumption test
*/
class GMeans extends CentroidClustering<double[], double[]> {
/** Perform G-means clustering */
public static GMeans fit(double[][] data, int kmax);
/** Get final number of clusters */
public int k();
}
/**
* CLARANS clustering for large datasets
*/
class CLARANS extends PartitionClustering {
/** Perform CLARANS clustering */
public static CLARANS fit(double[][] data, int k, int maxNeighbor, int numLocal);
/** Get medoid indices */
public int[] medoids();
}
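CLARANS explores the space of medoid sets by repeatedly swapping one medoid for a non-medoid and keeping swaps that lower the total clustering cost. That cost, the sum of each point's distance to its nearest medoid, can be sketched in plain Java (hypothetical names; an illustration, not Smile's implementation):

```java
public class MedoidCostSketch {
    /** Total cost: each point contributes its Euclidean distance to the nearest medoid. */
    static double cost(double[][] data, int[] medoids) {
        double total = 0;
        for (double[] x : data) {
            double best = Double.POSITIVE_INFINITY;
            for (int m : medoids) {
                double sum = 0;
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - data[m][d];
                    sum += diff * diff;
                }
                best = Math.min(best, Math.sqrt(sum));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        // Swapping medoid 1 for point 2 moves a medoid into the second group,
        // lowering total cost, so a CLARANS-style search would accept the swap.
        System.out.println(cost(data, new int[]{0, 1}) > cost(data, new int[]{0, 2})); // true
    }
}
```

Because medoids are actual data points, this family of methods only needs pairwise distances, which is what makes it practical for large or non-vector datasets.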
/**
* Deterministic Annealing clustering
*/
class DeterministicAnnealing extends CentroidClustering<double[], double[]> {
/** Perform deterministic annealing clustering */
public static DeterministicAnnealing fit(double[][] data, int kmax);
/** Get final temperature */
public double temperature();
}
Base class for algorithms that represent clusters by centroids.
/**
* Base class for centroid-based clustering algorithms
* @param <T> the type of input objects
* @param <C> the type of cluster centroids
*/
abstract class CentroidClustering<T, C> extends PartitionClustering {
/** Cluster centroids */
public final C[] centroids;
/** Predict cluster assignment for new data point */
public abstract int predict(T x);
/** Calculate quantization error */
public abstract double quantizationError(T[] data);
}
Metrics for evaluating clustering quality and comparing different clustering results.
/**
* Silhouette analysis for cluster validation
*/
class Silhouette {
/** Calculate silhouette coefficient for each point */
public static double[] of(double[][] data, int[] clusters);
/** Calculate mean silhouette coefficient */
public static double mean(double[][] data, int[] clusters);
}
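As an illustration of what Silhouette.of computes per point, here is a minimal self-contained version of the coefficient s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster (class and method names are hypothetical, not Smile's):

```java
public class SilhouetteSketch {
    static double dist(double[] p, double[] q) {
        double sum = 0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Silhouette coefficient of point i in [-1, 1]; higher means better separated. */
    static double silhouette(double[][] data, int[] y, int i) {
        int k = 0;
        for (int c : y) k = Math.max(k, c + 1);
        double[] sum = new double[k];
        int[] count = new int[k];
        for (int j = 0; j < data.length; j++) {
            if (j == i) continue;
            sum[y[j]] += dist(data[i], data[j]);
            count[y[j]]++;
        }
        if (count[y[i]] == 0) return 0; // convention for singleton clusters
        double a = sum[y[i]] / count[y[i]];
        double b = Double.POSITIVE_INFINITY;
        for (int c = 0; c < k; c++)
            if (c != y[i] && count[c] > 0)
                b = Math.min(b, sum[c] / count[c]);
        return (b - a) / Math.max(a, b);
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] y = {0, 0, 1, 1};
        // Tight, well-separated clusters yield values near 1.
        System.out.println(silhouette(data, y, 0) > 0.9); // prints true
    }
}
```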
/**
* Davies-Bouldin Index for cluster validation
*/
class DaviesBouldin {
/** Calculate Davies-Bouldin index */
public static double of(double[][] data, int[] clusters);
}
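For intuition, the Davies-Bouldin index averages, over clusters, the worst ratio of combined within-cluster scatter to between-centroid distance, so lower values indicate tighter, better-separated clusters. A self-contained sketch under Euclidean distance (hypothetical names, not Smile's implementation):

```java
public class DaviesBouldinSketch {
    static double dist(double[] p, double[] q) {
        double sum = 0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Davies-Bouldin index: mean over clusters i of max over j != i of (S_i + S_j) / d(c_i, c_j). */
    static double index(double[][] data, int[] y, int k) {
        int dim = data[0].length;
        double[][] centroid = new double[k][dim];
        int[] size = new int[k];
        for (int i = 0; i < data.length; i++) {
            size[y[i]]++;
            for (int d = 0; d < dim; d++) centroid[y[i]][d] += data[i][d];
        }
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dim; d++) centroid[c][d] /= size[c];
        // S_c: average distance of cluster members to their centroid.
        double[] scatter = new double[k];
        for (int i = 0; i < data.length; i++)
            scatter[y[i]] += dist(data[i], centroid[y[i]]) / size[y[i]];
        double db = 0;
        for (int i = 0; i < k; i++) {
            double worst = 0;
            for (int j = 0; j < k; j++)
                if (i != j)
                    worst = Math.max(worst, (scatter[i] + scatter[j]) / dist(centroid[i], centroid[j]));
            db += worst / k;
        }
        return db;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        int[] y = {0, 0, 1, 1};
        // Tight, well-separated clusters give a small index.
        System.out.println(index(data, y, 2) < 0.2); // prints true
    }
}
```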
/**
* Calinski-Harabasz Index (Variance Ratio Criterion)
*/
class CalinskiHarabasz {
/** Calculate Calinski-Harabasz index */
public static double of(double[][] data, int[] clusters);
}
Methods for initializing cluster centers and parameters.
/**
* K-means++ initialization
*/
class KMeansPlusPlus {
/** Initialize centroids using K-means++ algorithm */
public static double[][] init(double[][] data, int k);
/** Initialize with custom distance metric */
public static double[][] init(double[][] data, int k, Distance<double[]> distance);
}
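K-means++ picks the first centroid uniformly at random, then draws each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far, spreading seeds across the data. A self-contained sketch of that procedure (hypothetical names; an illustration, not Smile's code):

```java
import java.util.Random;

public class KMeansPlusPlusSketch {
    static double sqDist(double[] p, double[] q) {
        double sum = 0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** K-means++ seeding: next centroid sampled with probability
     *  proportional to squared distance from the nearest chosen centroid. */
    static double[][] seed(double[][] data, int k, Random rng) {
        double[][] centroids = new double[k][];
        centroids[0] = data[rng.nextInt(data.length)];
        double[] d2 = new double[data.length];
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int i = 0; i < data.length; i++) {
                d2[i] = Double.POSITIVE_INFINITY;
                for (int j = 0; j < c; j++)
                    d2[i] = Math.min(d2[i], sqDist(data[i], centroids[j]));
                total += d2[i];
            }
            // Weighted sampling: walk the cumulative d2 mass up to a random threshold.
            double r = rng.nextDouble() * total;
            int pick = 0;
            for (double acc = d2[0]; acc < r && pick < data.length - 1; acc += d2[++pick]);
            centroids[c] = data[pick];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0.1, 0}, {10, 10}, {10.1, 10}};
        double[][] c = seed(data, 2, new Random(42));
        System.out.println(c.length); // 2
        boolean fromData = true;
        for (double[] ci : c) {
            boolean found = false;
            for (double[] x : data) found |= (ci == x);
            fromData &= found;
        }
        System.out.println(fromData); // true: seeds are always actual data points
    }
}
```

Compared with uniform random initialization, this weighting makes degenerate seedings (all seeds in one dense region) far less likely.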
/**
* Random initialization strategies
*/
class RandomInit {
/** Random initialization from data points */
public static double[][] fromData(double[][] data, int k);
/** Random initialization within data bounds */
public static double[][] uniform(double[][] data, int k);
}
Usage Examples:
// Hierarchical clustering with different linkages
HierarchicalClustering hc1 = HierarchicalClustering.fit(data, WardLinkage.of(data));
HierarchicalClustering hc2 = HierarchicalClustering.fit(data); // complete linkage by default
// Cut dendrogram to get 5 clusters
int[] clusters = hc1.partition(5);
// DBSCAN for arbitrary-shaped clusters
DBSCAN<double[]> dbscan = DBSCAN.fit(data, new EuclideanDistance(), 5, 0.5);
int[] densityLabels = dbscan.y;
System.out.println("Found " + dbscan.clusters + " clusters");
// Spectral clustering for non-convex clusters
SpectralClustering sc = SpectralClustering.fit(data, 3, 1.0);
double[][] embedding = sc.coordinates();
// Evaluate clustering quality
double[] silhouette = Silhouette.of(data, clusters);
double meanSilhouette = Silhouette.mean(data, clusters);
double dbIndex = DaviesBouldin.of(data, clusters);
Most clustering algorithms accept configuration options such as the maximum number of iterations (maxIter), a convergence tolerance, and the number of restarts (runs), as shown in the fit signatures above.
Algorithms parameterized over Distance<T>, such as DBSCAN, accept custom distance metrics (e.g., EuclideanDistance) in place of a fixed one.
Install with Tessl CLI
npx tessl i tessl/maven-com-github-haifengl--smile-core