Statistical Machine Intelligence and Learning Engine providing comprehensive machine learning algorithms for classification, regression, clustering, and feature engineering in Java
npx @tessl/cli install tessl/maven-com-github-haifengl--smile-core@3.1.0

Smile Core is the foundational library of the Statistical Machine Intelligence and Learning Engine (SMILE), providing a comprehensive suite of machine learning algorithms for classification, regression, clustering, feature engineering, and advanced analytics in Java. It offers high-performance implementations with optimized data structures, extensive validation utilities, and seamless integration with Java-based data science workflows.
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-core</artifactId>
<version>3.1.1</version>
</dependency>

import smile.classification.*;
import smile.regression.*;
import smile.clustering.*;
import smile.feature.*;
import smile.validation.*;

import smile.classification.RandomForest;
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.validation.CrossValidation;
// Load data (assuming DataFrame df with features and target)
Formula formula = Formula.lhs("target");
// Train a random forest classifier
RandomForest model = RandomForest.fit(formula, df);
// Make predictions on test DataFrame tuples
int prediction = model.predict(testTuple);
// Cross-validation
var results = CrossValidation.classification(10, formula, df, RandomForest::fit);
System.out.println("Accuracy: " + results.avg.accuracy);Smile Core is built around several key design principles:
Classifier<T>, Regression<T>, and PartitionClustering provide consistent APIs across algorithmsComprehensive supervised learning algorithms for predicting categorical outcomes, including ensemble methods, neural networks, and probabilistic models.
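A minimal classification sketch on in-memory arrays (the toy features, labels, and the choice of KNN.fit as the trainer are illustrative assumptions):

import smile.classification.KNN;

// Toy feature matrix (two numeric features per row) and class labels -- illustrative data only
double[][] x = {
    {5.1, 3.5}, {4.9, 3.0}, {6.2, 2.9}, {6.7, 3.1}, {5.0, 3.4}, {6.4, 3.2}
};
int[] y = {0, 0, 1, 1, 0, 1};

// Fit a k-nearest-neighbor classifier (k = 3) and predict a new sample
KNN<double[]> knn = KNN.fit(x, y, 3);
int label = knn.predict(new double[]{6.0, 3.0});

Any trainer used this way can be swapped for another implementation, since all classifiers share the Classifier contract sketched below.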
interface Classifier<T> extends ToIntFunction<T>, Serializable {
int predict(T x);
int predict(T x, double[] posteriori);
int numClasses();
int[] classes();
default void update(T x, int y);
}

Regression: supervised learning algorithms for predicting continuous values, from linear models to advanced ensemble methods and kernel machines.
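A minimal regression sketch in the same formula/DataFrame style as the quick start (the DataFrame housing, its numeric "price" column, and the Tuple testRow are assumed to exist):

import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.regression.LinearModel;
import smile.regression.OLS;

// Ordinary least squares; "housing" is an assumed DataFrame with a numeric "price" response column
LinearModel model = OLS.fit(Formula.lhs("price"), housing);

// Predict the response for a single row (testRow is an assumed Tuple from a test set)
double predicted = model.predict(testRow);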
interface Regression<T> extends ToDoubleFunction<T>, Serializable {
double predict(T x);
default void update(T x, double y);
}

Clustering: unsupervised learning algorithms for discovering patterns and groupings in data, including partitioning, hierarchical, and density-based methods.
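For example, k-means partitions an in-memory matrix into k groups (the toy points and k = 2 are illustrative assumptions); assignments come back through the fields shown in the sketch below:

import smile.clustering.KMeans;

// Toy 2-D points -- illustrative data only
double[][] data = {
    {1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8},
    {8.0, 8.2}, {7.9, 8.1}, {8.3, 7.7}
};

// Partition into k = 2 clusters; per-row cluster labels are exposed in the y field
KMeans model = KMeans.fit(data, 2);
int[] labels = model.y;
int clusters = model.k;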
abstract class PartitionClustering implements Serializable {
public final int k;
public final int[] y;
public final int[] size;
public static final int OUTLIER = Integer.MAX_VALUE;
}

Feature engineering: a complete preprocessing pipeline including dimensionality reduction, feature selection, transformation, and imputation utilities.
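A library-agnostic sketch of the transformation step: a hand-rolled standardizer written against plain Function<double[], double[]>, used here instead of a specific SMILE scaler class since those class names may differ across versions (the means and deviations are assumed to come from training data):

import java.util.function.Function;

// Column means and standard deviations estimated on training data -- illustrative values only
double[] mean = {5.8, 3.1};
double[] std  = {0.6, 0.3};

// Z-score each feature of an input vector
Function<double[], double[]> standardize = x -> {
    double[] z = new double[x.length];
    for (int i = 0; i < x.length; i++) {
        z[i] = (x[i] - mean[i]) / std[i];
    }
    return z;
};

double[] scaled = standardize.apply(new double[]{6.2, 2.9});

Because it is just a Function, such a step can be chained with further transforms via andThen, mirroring the Transform and Projection shapes sketched below.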
interface Transform extends Function<double[], double[]> {
double[] apply(double[] x);
}
abstract class Projection implements Transform {
public abstract double[] project(double[] x);
}

Model validation: a comprehensive framework with cross-validation, bootstrap sampling, and extensive performance metrics.
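Besides the k-fold helper used in the quick start, individual metrics can be computed directly from label arrays. A minimal sketch with made-up truth and prediction vectors (the smile.validation.metric package path for Accuracy is assumed):

import smile.validation.metric.Accuracy;

// Ground-truth labels and model predictions -- illustrative arrays only
int[] truth      = {0, 0, 1, 1, 1, 0};
int[] prediction = {0, 1, 1, 1, 0, 0};

// Fraction of correctly classified samples
double accuracy = Accuracy.of(truth, prediction);
System.out.println("Accuracy: " + accuracy);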
interface CrossValidation {
Bag[] split(int n);
static CrossValidation of(int k);
static CrossValidation stratify(int k, int[] y);
}
interface ClassificationMetric {
double score(int[] truth, int[] prediction);
}

Neural network components including multi-layer perceptrons, activation functions, and optimization algorithms.
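A rough sketch of building and training a small multi-layer perceptron on toy data; the MLP constructor and Layer builder names used below (Layer.input, Layer.rectifier, Layer.mle) may differ between SMILE releases, so treat the exact layer setup as an assumption to verify against the installed javadoc:

import smile.base.mlp.Layer;
import smile.base.mlp.OutputFunction;
import smile.classification.MLP;

// Toy training data: 2 features, 2 classes -- illustrative only
double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
int[] y = {0, 1, 1, 0};

// Network layout: 2 inputs -> 16 ReLU units -> softmax over 2 classes.
// NOTE: constructor and builder names are assumptions; check the installed version's API.
MLP net = new MLP(Layer.input(2), Layer.rectifier(16), Layer.mle(2, OutputFunction.SOFTMAX));

// A few passes of online updates, one sample at a time
for (int epoch = 0; epoch < 100; epoch++) {
    for (int i = 0; i < x.length; i++) {
        net.update(x[i], y[i]);
    }
}

int predicted = net.predict(new double[]{1, 0});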
abstract class MultilayerPerceptron implements Classifier<double[]> {
public abstract int predict(double[] x);
public abstract void update(double[] x, int y);
}

Specialized algorithms for manifold learning, time series analysis, sequence modeling, and association rule mining.
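For instance, sample autocorrelation of a series; the per-lag TimeSeries.acf call used here is an assumption about the installed API (the sketch below lists an autocorrelation helper on the same class):

import smile.timeseries.TimeSeries;

// A short synthetic series -- illustrative only
double[] series = {1.2, 1.4, 1.1, 1.5, 1.7, 1.6, 1.9, 2.1, 2.0, 2.3};

// Sample autocorrelation at lags 1..3 (per-lag acf(x, lag) is assumed here)
for (int lag = 1; lag <= 3; lag++) {
    System.out.println("lag " + lag + ": " + TimeSeries.acf(series, lag));
}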
interface SequenceLabeler<T> {
int[] predict(T[] sequence);
}
class TimeSeries {
public static double[] autocorrelation(double[] data);
public static double[] crosscorrelation(double[] x, double[] y);
}

// Main data structures
class Bag {
public final int[] samples;
public final int[] oob;
}
class SupportVector {
public final double[] x;
public final double alpha;
}
// Validation results
class ClassificationValidation {
public final double accuracy;
public final double error;
public final ConfusionMatrix confusion;
}
class RegressionValidation {
public final double rmse;
public final double mad;
public final double r2;
}

enum SplitRule {
GINI, ENTROPY, CLASSIFICATION_ERROR
}
enum Cost {
MEAN_SQUARED_ERROR, CROSS_ENTROPY, SPARSE_CROSS_ENTROPY
}
enum OutputFunction {
LINEAR, SIGMOID, SOFTMAX
}