# scikit-learn

scikit-learn is a comprehensive machine learning library for Python that provides simple and efficient tools for predictive data analysis. It features classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

## Package Information

- **Name**: scikit-learn
- **Language**: Python
- **Installation**: `pip install scikit-learn`
- **Version**: 1.7.1

## Core Imports

```python
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, classification_report
```

## Basic Usage

Here's a simple example demonstrating scikit-learn's consistent API for machine learning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
```

## Architecture

scikit-learn follows several key design principles:

### Estimator Pattern
Estimators share a small set of methods; which ones an object exposes depends on its role:
- `fit(X, y)` - Learn from training data (`y` is omitted or ignored by unsupervised estimators)
- `predict(X)` - Make predictions on new data (for predictors such as classifiers and regressors)
- `transform(X)` - Transform data (for transformers)
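
The three methods above can be seen side by side in a minimal sketch. The six-sample dataset is made up for illustration, and `StandardScaler` and `LogisticRegression` stand in for any transformer and any predictor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: one feature, two classes.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit() learns the mean and scale, transform() applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit() learns the model, predict() labels new samples.
clf = LogisticRegression()
clf.fit(X_scaled, y)
labels = clf.predict(scaler.transform(np.array([[1.5], [11.5]])))
print(labels)
```

New samples go through the same fitted scaler before `predict`, which is the bookkeeping that pipelines automate.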

### Pipeline Architecture
A `Pipeline` chains transformers with a final estimator so they can be fitted and applied as one object:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])
```
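
As a hedged usage sketch, a pipeline like this can be fitted and scored as a single estimator. The iris dataset and the split parameters are illustrative choices, not something the section prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC()),
])

# fit() fits the scaler on the training data, then fits the SVC on the scaled data.
pipeline.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before scoring the SVC.
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# Steps stay accessible by name, for example to inspect learned statistics.
scaler_means = pipeline.named_steps["scaler"].mean_
print(scaler_means)
```

Because the scaler is fitted only on the training split, no statistics from the test data leak into the model.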

### Consistent API Design
- **Estimators**: All learning algorithms (classifiers, regressors, clusterers)
- **Transformers**: Data preprocessing and feature engineering
- **Meta-estimators**: Combine multiple estimators (ensembles, pipelines)

## Core Capabilities

### Supervised Learning
```python
# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
```
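
Since the basic usage example covers classification, here is a minimal regression sketch using one of the imported regressors. The synthetic data and the `alpha` value are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise (illustrative parameters).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge is linear regression with an L2 penalty controlled by alpha.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {r2:.3f}")
```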

[Supervised Learning](./supervised-learning.md)

### Unsupervised Learning
```python
# Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Dimensionality Reduction
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.manifold import TSNE, Isomap
```
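
A short sketch of both task families, assuming synthetic blob data with made-up sizes: `KMeans` assigns a cluster index to each sample, and `PCA` projects the same data to two dimensions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Three synthetic Gaussian blobs in 5 dimensions; the true labels are discarded.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Clustering: fit_predict() learns the centroids and returns a cluster index per sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(kmeans.cluster_centers_.shape)  # (3, 5)
print(X_2d.shape)                     # (300, 2)
```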

[Unsupervised Learning](./unsupervised-learning.md)

### Data Preprocessing
```python
# Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Feature Engineering
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, RFE
```
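
To make scaling and encoding concrete, the sketch below applies `MinMaxScaler` to a numeric column and `OneHotEncoder` to a categorical one. The column values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Numeric column: rescale to the [0, 1] range (here 20 -> 0, 30 -> 0.5, 40 -> 1).
ages = np.array([[20.0], [30.0], [40.0]])
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())

# Categorical column: one binary column per category, in sorted category order.
colors = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()  # the encoder returns a sparse matrix by default
print(encoder.categories_[0])
print(encoded)
```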

[Data Preprocessing and Feature Engineering](./preprocessing.md)

### Model Selection and Evaluation
```python
# Cross-Validation
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
```
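
A minimal sketch of cross-validation and grid search on the iris data. The `SVC` model and the candidate `C` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validation: one accuracy score per fold.
scores = cross_val_score(SVC(), X, y, cv=cv)
print(f"Mean CV accuracy: {scores.mean():.3f}")

# Grid search: evaluate each C value on the same folds and keep the best one.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

After fitting, `search` itself behaves like an estimator refitted with the best parameters, so `search.predict(X)` works directly.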

[Model Selection and Evaluation](./model-selection.md)

### Built-in Datasets
```python
# Load toy datasets
from sklearn.datasets import load_iris, load_diabetes, load_wine, load_breast_cancer

# Generate synthetic data
from sklearn.datasets import make_classification, make_regression, make_blobs

# Fetch real-world datasets
from sklearn.datasets import fetch_20newsgroups, fetch_california_housing
```
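
The sketch below loads one toy dataset and generates one synthetic dataset; the generator arguments are illustrative values. The `fetch_*` functions are left out because they download data on first use.

```python
from sklearn.datasets import load_iris, make_classification

# Toy dataset: bundled with the package, returned here as an (X, y) tuple.
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)  # (150, 4) (150,)

# Synthetic dataset: sizes and seed are arbitrary illustrative values.
X_syn, y_syn = make_classification(
    n_samples=100, n_features=8, n_informative=4, n_classes=2, random_state=0
)
print(X_syn.shape, sorted(set(y_syn.tolist())))  # (100, 8) [0, 1]
```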

[Datasets and Data Generation](./datasets.md)

### Performance Metrics and Visualization
```python
# Classification metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import PredictionErrorDisplay
```
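
Metric functions take true values first and predictions second. The hand-written labels below are made up so the results can be checked by eye; the display classes are omitted because they require matplotlib.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Hand-written labels: 3 of the 5 predictions are correct.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.6

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: [1 1] and [1 2]

# Regression metric: mean of |3 - 2| and |5 - 7|.
mae = mean_absolute_error([3.0, 5.0], [2.0, 7.0])
print(mae)  # 1.5
```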

[Metrics and Visualization](./metrics.md)

### Feature Extraction and Text Processing
```python
# Text vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Dictionary and hashing
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Image processing
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph
```

[Feature Extraction](./feature-extraction.md)

### Pipelines and Workflow Composition
```python
# Pipeline construction
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

# Column-wise transformations
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.compose import TransformedTargetRegressor
```
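
A hedged sketch of column-wise preprocessing inside a pipeline. The data is made up, the step names `"num"` and `"cat"` are arbitrary labels, and columns are selected by integer position to avoid a pandas dependency.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Columns 0-1 are numeric measurements; column 2 is a category code (made-up data).
X = np.array([
    [1.0, 20.0, 0],
    [2.0, 25.0, 1],
    [3.0, 30.0, 2],
    [4.0, 35.0, 0],
    [5.0, 40.0, 1],
    [6.0, 45.0, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Scale the numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

# make_pipeline() names the steps automatically from the class names.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

transformed_shape = preprocess.fit_transform(X).shape
print(transformed_shape)  # (6, 5): 2 scaled columns + 3 one-hot columns
print(model.predict(X[[0, 5]]))
```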

[Pipelines and Composition](./pipelines.md)

### Nearest Neighbors Algorithms
```python
# Classification and regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neighbors import RadiusNeighborsClassifier, RadiusNeighborsRegressor

# Outlier detection and density estimation
from sklearn.neighbors import LocalOutlierFactor, KernelDensity

# Unsupervised neighbor search and nearest-centroid classification
from sklearn.neighbors import NearestNeighbors, NearestCentroid
```
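
A small sketch of neighbor-based classification and plain neighbor search, using invented one-dimensional points so the nearest neighbors are easy to verify by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Two well-separated groups of 1-D points (made-up data).
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Classification: majority vote among the 3 closest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
predicted = knn.predict([[1.5], [10.5]])
print(predicted)  # [0 1]

# Unsupervised search: distances and indices of the 2 nearest training points.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors([[0.4]])
print(distances, indices)  # closest are X[0] (0.4 away) and X[1] (0.6 away)
```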

[Nearest Neighbors](./neighbors.md)

### Utilities and Configuration
```python
# Core utilities
from sklearn.base import clone
from sklearn import get_config, set_config, config_context

# Version and system information
import sklearn
print(sklearn.__version__)
sklearn.show_versions()
```
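
The sketch below shows two of these utilities in use: `clone` copies an estimator's hyperparameters without its fitted state, and `config_context` changes a global setting only inside a `with` block. The estimator, data, and the `assume_finite` option are illustrative choices.

```python
from sklearn import config_context, get_config
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# clone() copies hyperparameters but not fitted attributes such as coef_.
original = LogisticRegression(C=0.5)
original.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
fresh = clone(original)
print(fresh.get_params()["C"])   # 0.5
print(hasattr(fresh, "coef_"))   # False

# config_context() overrides a setting temporarily and restores it on exit.
setting_before = get_config()["assume_finite"]
with config_context(assume_finite=True):
    setting_inside = get_config()["assume_finite"]
setting_after = get_config()["assume_finite"]
print(setting_before, setting_inside, setting_after)
```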

[Utilities and Core Functions](./utilities.md)

## Version Information

```python
import sklearn
print(sklearn.__version__)  # e.g. "1.7.1"

# Print versions of scikit-learn, its dependencies, and the Python runtime
sklearn.show_versions()
```

## Key Features

- **Consistent API**: Estimators share the same `fit`/`predict`/`transform` conventions
- **Comprehensive**: 300+ classes and 150+ functions covering classification, regression, clustering, dimensionality reduction, preprocessing, and model selection
- **Well-tested**: Extensive test suite ensuring reliability
- **Documentation**: Comprehensive user guide and API reference
- **Community**: Large, active community with regular releases
- **Integration**: Works with NumPy, SciPy, pandas, and matplotlib
- **Performance**: Optimized implementations with optional parallelization

scikit-learn covers the classical machine learning workflow from data preprocessing to model evaluation, which has made it a common default choice for machine learning in Python.