Tessl Tile for pypi/scikit-learn@1.7.0

# Data Preprocessing and Feature Engineering

This document covers scikit-learn's data preprocessing, feature engineering, and feature selection APIs, along with related feature extraction, imputation, kernel approximation, and random projection utilities.

## Scaling and Normalization

#### StandardScaler { .api }
```python
from sklearn.preprocessing import StandardScaler

StandardScaler(
    copy: bool = True,
    with_mean: bool = True,
    with_std: bool = True
)
```
Standardize features by removing the mean and scaling to unit variance.
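
As a minimal sketch of what standardization does, the snippet below fits a `StandardScaler` on a tiny array and checks the per-column mean and standard deviation afterwards. The data is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)           # per-feature means learned during fit
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```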

#### MinMaxScaler { .api }
```python
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler(
    feature_range: tuple[float, float] = (0, 1),
    copy: bool = True,
    clip: bool = False
)
```
Transform features by scaling each feature to a given range.
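
The effect of `clip` is easiest to see on values outside the fitted range. The following sketch uses made-up one-column data and compares the default behaviour with `clip=True`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])  # fitted range is [0, 10]
X_new = np.array([[5.0], [20.0]])    # 20.0 lies outside the fitted range

plain = MinMaxScaler().fit(X_train).transform(X_new)
clipped = MinMaxScaler(clip=True).fit(X_train).transform(X_new)

print(plain.ravel())    # [0.5 2. ]  values may leave feature_range
print(clipped.ravel())  # [0.5 1. ]  values are clipped to feature_range
```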

#### MaxAbsScaler { .api }
```python
from sklearn.preprocessing import MaxAbsScaler

MaxAbsScaler(
    copy: bool = True
)
```
Scale each feature by its maximum absolute value.

#### RobustScaler { .api }
```python
from sklearn.preprocessing import RobustScaler

RobustScaler(
    with_centering: bool = True,
    with_scaling: bool = True,
    quantile_range: tuple[float, float] = (25.0, 75.0),
    copy: bool = True,
    unit_variance: bool = False
)
```
Scale features using statistics that are robust to outliers (the median and the quantile range).

#### Normalizer { .api }
```python
from sklearn.preprocessing import Normalizer

Normalizer(
    norm: str = "l2",
    copy: bool = True
)
```
Normalize samples individually to unit norm.
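
Unlike the scalers above, `Normalizer` works row by row rather than column by column. A small sketch with invented data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])

X_unit = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)     # first row becomes [0.6, 0.8]
print(row_norms)  # every row now has L2 norm 1
```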

#### QuantileTransformer { .api }
```python
from sklearn.preprocessing import QuantileTransformer

QuantileTransformer(
    n_quantiles: int = 1000,
    output_distribution: str = "uniform",
    ignore_implicit_zeros: bool = False,
    subsample: int = 100000,
    random_state: int | RandomState | None = None,
    copy: bool = True
)
```
Transform features to follow a uniform or a normal distribution.

#### PowerTransformer { .api }
```python
from sklearn.preprocessing import PowerTransformer

PowerTransformer(
    method: str = "yeo-johnson",
    standardize: bool = True,
    copy: bool = True
)
```
Apply a power transform featurewise to make data more Gaussian-like.

## Encoding

#### LabelEncoder { .api }
```python
from sklearn.preprocessing import LabelEncoder

LabelEncoder()
```
Encode target labels with values between 0 and n_classes-1.
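
A short sketch of the round trip, using invented city labels. Classes are stored in sorted order, which determines the integer codes.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["paris", "tokyo", "paris", "amsterdam"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)
decoded = encoder.inverse_transform([0, 2])

print(encoder.classes_)  # ['amsterdam' 'paris' 'tokyo']
print(codes)             # [1 2 1 0]
print(decoded)           # ['amsterdam' 'tokyo']
```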

#### LabelBinarizer { .api }
```python
from sklearn.preprocessing import LabelBinarizer

LabelBinarizer(
    neg_label: int = 0,
    pos_label: int = 1,
    sparse_output: bool = False
)
```
Binarize labels in a one-vs-all fashion.

#### MultiLabelBinarizer { .api }
```python
from sklearn.preprocessing import MultiLabelBinarizer

MultiLabelBinarizer(
    classes: ArrayLike | None = None,
    sparse_output: bool = False
)
```
Transform between iterable of iterables and a multilabel format.

#### OneHotEncoder { .api }
```python
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(
    categories: str | list[ArrayLike] = "auto",
    drop: str | ArrayLike | None = None,
    sparse_output: bool = True,
    dtype: type = ...,
    handle_unknown: str = "error",
    min_frequency: int | float | None = None,
    max_categories: int | None = None,
    feature_name_combiner: str | Callable = "concat"
)
```
Encode categorical features as a one-hot numeric array.

#### OrdinalEncoder { .api }
```python
from sklearn.preprocessing import OrdinalEncoder

OrdinalEncoder(
    categories: str | list[ArrayLike] = "auto",
    dtype: type = ...,
    handle_unknown: str = "error",
    unknown_value: int | float | None = None,
    encoded_missing_value: int | float = ...,
    min_frequency: int | float | None = None,
    max_categories: int | None = None
)
```
Encode categorical features as an integer array.
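
When the categories have a natural order, passing them explicitly keeps that order in the codes. This sketch, with invented category names, also shows `handle_unknown="use_encoded_value"` mapping an unseen category to `unknown_value`.

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[["low", "medium", "high"]],  # explicit order: low < medium < high
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoder.fit([["low"], ["medium"], ["high"]])

# "extreme" was not among the fitted categories.
encoded = encoder.transform([["medium"], ["extreme"]])
print(encoded)  # [[ 1.], [-1.]]
```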

#### TargetEncoder { .api }
```python
from sklearn.preprocessing import TargetEncoder

TargetEncoder(
    categories: str | list[ArrayLike] = "auto",
    target_type: str = "auto",
    smooth: str | float = "auto",
    cv: int | BaseCrossValidator | Iterable = 5,
    shuffle: bool = True,
    random_state: int | RandomState | None = None
)
```
Target Encoder for regression and classification targets.

#### KBinsDiscretizer { .api }
```python
from sklearn.preprocessing import KBinsDiscretizer

KBinsDiscretizer(
    n_bins: int | ArrayLike = 5,
    encode: str = "onehot",
    strategy: str = "quantile",
    dtype: type | None = None,
    subsample: int | None = 200000,
    random_state: int | RandomState | None = None
)
```
Bin continuous data into intervals.
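
A small sketch of equal-width binning on invented data. It uses `strategy="uniform"` and `encode="ordinal"` rather than the defaults so that the bin edges and bin indices are easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [4.0], [9.0], [10.0]])

binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(X)

print(binner.bin_edges_[0])  # [ 0.  5. 10.]
print(binned.ravel())        # [0. 0. 0. 1. 1.]
```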

#### Binarizer { .api }
```python
from sklearn.preprocessing import Binarizer

Binarizer(
    threshold: float = 0.0,
    copy: bool = True
)
```
Binarize data (set feature values to 0 or 1) according to a threshold.

## Feature Engineering

#### PolynomialFeatures { .api }
```python
from sklearn.preprocessing import PolynomialFeatures

PolynomialFeatures(
    degree: int = 2,
    interaction_only: bool = False,
    include_bias: bool = True,
    order: str = "C"
)
```
Generate polynomial and interaction features.

#### SplineTransformer { .api }
```python
from sklearn.preprocessing import SplineTransformer

SplineTransformer(
    n_knots: int = 5,
    degree: int = 3,
    knots: str | ArrayLike = "uniform",
    extrapolation: str = "constant",
    include_bias: bool = True,
    order: str = "C",
    sparse_output: bool = False
)
```
Generate univariate B-spline bases for features.

#### FunctionTransformer { .api }
```python
from sklearn.preprocessing import FunctionTransformer

FunctionTransformer(
    func: Callable | None = None,
    inverse_func: Callable | None = None,
    validate: bool = False,
    accept_sparse: bool = False,
    check_inverse: bool = True,
    feature_names_out: str | Callable | None = None,
    kw_args: dict | None = None,
    inv_kw_args: dict | None = None
)
```
Constructs a transformer from an arbitrary callable.
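
One common use, sketched here with invented data, is wrapping a NumPy function so it can sit in a pipeline. Pairing `np.log1p` with `np.expm1` as `inverse_func` makes the transform reversible.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0.0, 1.0], [9.0, 99.0]])
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

print(X_log)   # element-wise log(1 + x)
print(X_back)  # recovers the original values up to rounding
```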

#### KernelCenterer { .api }
```python
from sklearn.preprocessing import KernelCenterer

KernelCenterer()
```
Center a kernel matrix.

## Feature Selection

### Univariate Selection

#### SelectKBest { .api }
```python
from sklearn.feature_selection import SelectKBest

SelectKBest(
    score_func: Callable = ...,
    k: int | str = 10
)
```
Select features according to the k highest scores.
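
The following sketch uses a hand-made dataset in which only the first column separates the two classes, so `SelectKBest` with `f_classif` and `k=1` should keep that column. The data is invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Column 0 differs strongly between classes; columns 1 and 2 barely do.
X = np.array([
    [1.0, 1.0, 0.5],
    [1.1, 2.0, 0.1],
    [0.9, 3.0, 0.3],
    [5.0, 1.0, 0.2],
    [5.1, 2.0, 0.4],
    [4.9, 3.0, 0.6],
])
y = np.array([0, 0, 0, 1, 1, 1])

selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)
mask = selector.get_support()

print(selector.scores_)  # F-statistic per feature
print(mask)              # [ True False False]
```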

#### SelectPercentile { .api }
```python
from sklearn.feature_selection import SelectPercentile

SelectPercentile(
    score_func: Callable = ...,
    percentile: int = 10
)
```
Select features according to a percentile of the highest scores.

#### SelectFpr { .api }
```python
from sklearn.feature_selection import SelectFpr

SelectFpr(
    score_func: Callable = ...,
    alpha: float = 0.05
)
```
Filter: Select the p-values below alpha based on a FPR test.

#### SelectFdr { .api }
```python
from sklearn.feature_selection import SelectFdr

SelectFdr(
    score_func: Callable = ...,
    alpha: float = 0.05
)
```
Filter: Select the p-values for an estimated false discovery rate.

#### SelectFwe { .api }
```python
from sklearn.feature_selection import SelectFwe

SelectFwe(
    score_func: Callable = ...,
    alpha: float = 0.05
)
```
Filter: Select the p-values corresponding to family-wise error rate.

#### GenericUnivariateSelect { .api }
```python
from sklearn.feature_selection import GenericUnivariateSelect

GenericUnivariateSelect(
    score_func: Callable = ...,
    mode: str = "percentile",
    param: int | float = 1e-05
)
```
Univariate feature selector with configurable strategy.

### Model-based Selection

#### SelectFromModel { .api }
```python
from sklearn.feature_selection import SelectFromModel

SelectFromModel(
    estimator: BaseEstimator,
    threshold: str | float | None = None,
    prefit: bool = False,
    norm_order: int = 1,
    max_features: int | Callable | None = None,
    importance_getter: str | Callable = "auto"
)
```
Meta-transformer for selecting features based on importance weights.

### Recursive Feature Elimination

#### RFE { .api }
```python
from sklearn.feature_selection import RFE

RFE(
    estimator: BaseEstimator,
    n_features_to_select: int | float | None = None,
    step: int | float = 1,
    verbose: int = 0,
    importance_getter: str | Callable = "auto"
)
```
Feature ranking with recursive feature elimination.

#### RFECV { .api }
```python
from sklearn.feature_selection import RFECV

RFECV(
    estimator: BaseEstimator,
    step: int | float = 1,
    min_features_to_select: int = 1,
    cv: int | BaseCrossValidator | Iterable | None = None,
    scoring: str | Callable | None = None,
    verbose: int = 0,
    n_jobs: int | None = None,
    importance_getter: str | Callable = "auto"
)
```
Recursive feature elimination with cross-validation.

### Sequential Feature Selection

#### SequentialFeatureSelector { .api }
```python
from sklearn.feature_selection import SequentialFeatureSelector

SequentialFeatureSelector(
    estimator: BaseEstimator,
    n_features_to_select: int | float | str = "auto",
    tol: float | None = None,
    direction: str = "forward",
    scoring: str | Callable | None = None,
    cv: int | BaseCrossValidator | Iterable = 5,
    n_jobs: int | None = None
)
```
Sequential Feature Selector.

### Variance-based Selection

#### VarianceThreshold { .api }
```python
from sklearn.feature_selection import VarianceThreshold

VarianceThreshold(
    threshold: float = 0.0
)
```
Feature selector that removes all low-variance features.
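
With the default `threshold=0.0`, only constant columns are dropped. A minimal sketch on invented data where the first and last columns never change:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant; columns 1 and 2 vary.
X = np.array([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])

selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [False  True  True False]
print(X_reduced)               # [[2 0], [1 4], [1 1]]
```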

### Base Classes

#### SelectorMixin { .api }
```python
from sklearn.feature_selection import SelectorMixin

SelectorMixin()
```
Transformer mixin that performs feature selection given a support mask.

## Feature Selection Functions

### Statistical Tests

#### chi2 { .api }
```python
from sklearn.feature_selection import chi2

chi2(
    X: ArrayLike,
    y: ArrayLike
) -> tuple[ArrayLike, ArrayLike]
```
Compute chi-squared stats between each non-negative feature and class.

#### f_classif { .api }
```python
from sklearn.feature_selection import f_classif

f_classif(
    X: ArrayLike,
    y: ArrayLike
) -> tuple[ArrayLike, ArrayLike]
```
Compute the ANOVA F-value for the provided sample.

#### f_oneway { .api }
```python
from sklearn.feature_selection import f_oneway

f_oneway(
    *samples: ArrayLike
) -> tuple[ArrayLike, ArrayLike]
```
Test for equal means in two or more samples from the normal distribution.

#### f_regression { .api }
```python
from sklearn.feature_selection import f_regression

f_regression(
    X: ArrayLike,
    y: ArrayLike,
    center: bool = True
) -> tuple[ArrayLike, ArrayLike]
```
Univariate linear regression tests returning F-statistic and p-values.

#### r_regression { .api }
```python
from sklearn.feature_selection import r_regression

r_regression(
    X: ArrayLike,
    y: ArrayLike,
    center: bool = True,
    force_finite: bool = True
) -> ArrayLike
```
Compute Pearson's r for each feature with the target. Returns one correlation coefficient per feature (no p-values).

### Mutual Information

#### mutual_info_classif { .api }
```python
from sklearn.feature_selection import mutual_info_classif

mutual_info_classif(
    X: ArrayLike,
    y: ArrayLike,
    discrete_features: str | bool | ArrayLike = "auto",
    n_neighbors: int = 3,
    copy: bool = True,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Estimate mutual information for a discrete target variable.

#### mutual_info_regression { .api }
```python
from sklearn.feature_selection import mutual_info_regression

mutual_info_regression(
    X: ArrayLike,
    y: ArrayLike,
    discrete_features: str | bool | ArrayLike = "auto",
    n_neighbors: int = 3,
    copy: bool = True,
    random_state: int | RandomState | None = None
) -> ArrayLike
```
Estimate mutual information for a continuous target variable.

## Preprocessing Functions

### Scaling Functions

#### scale { .api }
```python
from sklearn.preprocessing import scale

scale(
    X: ArrayLike,
    axis: int = 0,
    with_mean: bool = True,
    with_std: bool = True,
    copy: bool = True
) -> ArrayLike
```
Standardize a dataset along any axis.

#### minmax_scale { .api }
```python
from sklearn.preprocessing import minmax_scale

minmax_scale(
    X: ArrayLike,
    feature_range: tuple[float, float] = (0, 1),
    axis: int = 0,
    copy: bool = True
) -> ArrayLike
```
Transform features by scaling each feature to a given range.

#### maxabs_scale { .api }
```python
from sklearn.preprocessing import maxabs_scale

maxabs_scale(
    X: ArrayLike,
    axis: int = 0,
    copy: bool = True
) -> ArrayLike
```
Scale each feature to the [-1, 1] range without breaking sparsity.

#### robust_scale { .api }
```python
from sklearn.preprocessing import robust_scale

robust_scale(
    X: ArrayLike,
    axis: int = 0,
    with_centering: bool = True,
    with_scaling: bool = True,
    quantile_range: tuple[float, float] = (25.0, 75.0),
    copy: bool = True,
    unit_variance: bool = False
) -> ArrayLike
```
Standardize a dataset along any axis, centering to the median and scaling by the quantile range (the interquartile range by default).

#### normalize { .api }
```python
from sklearn.preprocessing import normalize

normalize(
    X: ArrayLike,
    norm: str = "l2",
    axis: int = 1,
    copy: bool = True,
    return_norm: bool = False
) -> ArrayLike | tuple[ArrayLike, ArrayLike]
```
Scale input vectors individually to unit norm (vector length).

#### quantile_transform { .api }
```python
from sklearn.preprocessing import quantile_transform

quantile_transform(
    X: ArrayLike,
    axis: int = 0,
    n_quantiles: int = 1000,
    output_distribution: str = "uniform",
    ignore_implicit_zeros: bool = False,
    subsample: int = 100000,
    random_state: int | RandomState | None = None,
    copy: bool = True
) -> ArrayLike
```
Transform features to follow a uniform or a normal distribution.

#### power_transform { .api }
```python
from sklearn.preprocessing import power_transform

power_transform(
    X: ArrayLike,
    method: str = "yeo-johnson",
    standardize: bool = True,
    copy: bool = True
) -> ArrayLike
```
Apply a power transform featurewise to make data more Gaussian-like.

### Encoding Functions

#### label_binarize { .api }
```python
from sklearn.preprocessing import label_binarize

label_binarize(
    y: ArrayLike,
    classes: ArrayLike,
    neg_label: int = 0,
    pos_label: int = 1,
    sparse_output: bool = False
) -> ArrayLike
```
Binarize labels in a one-vs-all fashion.
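
Because `classes` is passed explicitly, the output has one column per listed class even when some classes are absent from `y`. The sketch also shows the binary case, which collapses to a single column. The labels are invented.

```python
from sklearn.preprocessing import label_binarize

# Four known classes, only two of which occur in y.
multi = label_binarize([1, 6], classes=[1, 2, 4, 6])
print(multi)   # [[1 0 0 0], [0 0 0 1]]

# With exactly two classes the result is a single column.
binary = label_binarize(["yes", "no", "no", "yes"], classes=["no", "yes"])
print(binary)  # [[1], [0], [0], [1]]
```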

#### binarize { .api }
```python
from sklearn.preprocessing import binarize

binarize(
    X: ArrayLike,
    threshold: float = 0.0,
    copy: bool = True
) -> ArrayLike
```
Boolean thresholding of array-like or scipy.sparse matrix.

#### add_dummy_feature { .api }
```python
from sklearn.preprocessing import add_dummy_feature

add_dummy_feature(
    X: ArrayLike,
    value: float = 1.0
) -> ArrayLike
```
Augment dataset with an additional dummy feature.

## Feature Extraction

### Text Feature Extraction

#### DictVectorizer { .api }
```python
from sklearn.feature_extraction import DictVectorizer

DictVectorizer(
    dtype: type = ...,
    separator: str = "=",
    sparse: bool = True,
    sort: bool = True
)
```
Transforms lists of feature-value mappings to vectors.

#### FeatureHasher { .api }
```python
from sklearn.feature_extraction import FeatureHasher

FeatureHasher(
    n_features: int = 1048576,
    input_type: str = "dict",
    dtype: type = ...,
    alternate_sign: bool = True
)
```
Implements feature hashing, aka the hashing trick.

### Image Feature Extraction

#### img_to_graph { .api }
```python
from sklearn.feature_extraction import img_to_graph

img_to_graph(
    img: ArrayLike,
    mask: ArrayLike | None = None,
    return_as: type = ...,
    dtype: type | None = None
) -> ArrayLike
```
Graph of the pixel-to-pixel gradient connections.

#### grid_to_graph { .api }
```python
from sklearn.feature_extraction import grid_to_graph

grid_to_graph(
    n_x: int,
    n_y: int,
    n_z: int = 1,
    mask: ArrayLike | None = None,
    return_as: type = ...,
    dtype: type = ...
) -> ArrayLike
```
Graph of the pixel-to-pixel connections. An edge exists between two voxels when they are connected.

## Imputation

### Simple Imputation

#### SimpleImputer { .api }
```python
from sklearn.impute import SimpleImputer

SimpleImputer(
    missing_values: int | float | str | None = ...,
    strategy: str = "mean",
    fill_value: str | int | float | None = None,
    copy: bool = True,
    add_indicator: bool = False,
    keep_empty_features: bool = False
)
```
Imputation transformer for completing missing values.
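
A minimal sketch of the default `strategy="mean"` on invented data: the missing entry in the first column is replaced by the mean of the observed values in that column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column fill values; the first is 4.0
print(X_filled)             # the NaN is replaced by 4.0
```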

### Advanced Imputation

#### KNNImputer { .api }
```python
from sklearn.impute import KNNImputer

KNNImputer(
    missing_values: int | float | str | None = ...,
    n_neighbors: int = 5,
    weights: str | Callable = "uniform",
    metric: str | Callable = "nan_euclidean",
    copy: bool = True,
    add_indicator: bool = False,
    keep_empty_features: bool = False
)
```
Imputation for completing missing values using k-Nearest Neighbors.
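
In the sketch below, each missing entry is filled with the average of that feature over the two nearest rows that have it observed, with distances computed on the coordinates both rows share. The data is a small invented matrix.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[0, 2])  # 4.0, the mean of 3.0 and 5.0
print(X_filled[2, 0])  # 5.5, the mean of 3.0 and 8.0
```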

### Missing Value Indicators

#### MissingIndicator { .api }
```python
from sklearn.impute import MissingIndicator

MissingIndicator(
    missing_values: int | float | str | None = ...,
    features: str = "missing-only",
    sparse: bool | str = "auto",
    error_on_new: bool = True
)
```
Binary indicators for missing values.
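
With the default `features="missing-only"`, the indicator only tracks columns that had missing values during `fit`. A small sketch on invented arrays:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X_fit = np.array([[np.nan, 1.0, 3.0], [4.0, 0.0, np.nan], [8.0, 1.0, 0.0]])
X_new = np.array([[5.0, 1.0, np.nan], [np.nan, 2.0, 3.0], [2.0, 4.0, 0.0]])

indicator = MissingIndicator()
indicator.fit(X_fit)
flags = indicator.transform(X_new)

print(indicator.features_)  # [0 2]  columns with missing values at fit time
print(flags)                # one boolean column per tracked feature
```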

## Kernel Approximation

### RBF Kernel Approximation

#### RBFSampler { .api }
```python
from sklearn.kernel_approximation import RBFSampler

RBFSampler(
    gamma: float = 1.0,
    n_components: int = 100,
    random_state: int | RandomState | None = None
)
```
Approximate a RBF kernel feature map using random Fourier features.

#### Nystroem { .api }
```python
from sklearn.kernel_approximation import Nystroem

Nystroem(
    kernel: str | Callable = "rbf",
    gamma: float | None = None,
    coef0: float | None = None,
    degree: float | None = None,
    kernel_params: dict | None = None,
    n_components: int = 100,
    random_state: int | RandomState | None = None,
    n_jobs: int | None = None
)
```
Approximate a kernel map using a subset of the training data.

### Chi-squared Kernel Approximation

#### AdditiveChi2Sampler { .api }
```python
from sklearn.kernel_approximation import AdditiveChi2Sampler

AdditiveChi2Sampler(
    sample_steps: int = 2,
    sample_interval: float | None = None
)
```
Approximate feature map for additive chi2 kernel.

#### SkewedChi2Sampler { .api }
```python
from sklearn.kernel_approximation import SkewedChi2Sampler

SkewedChi2Sampler(
    skewedness: float = 1.0,
    n_components: int = 100,
    random_state: int | RandomState | None = None
)
```
Approximate feature map for "skewed chi-squared" kernel.

### Polynomial Kernel Approximation

#### PolynomialCountSketch { .api }
```python
from sklearn.kernel_approximation import PolynomialCountSketch

PolynomialCountSketch(
    gamma: float = 1.0,
    degree: int = 2,
    coef0: int = 0,
    n_components: int = 100,
    random_state: int | RandomState | None = None
)
```
Polynomial kernel approximation via Tensor Sketch.

## Random Projection

#### GaussianRandomProjection { .api }
```python
from sklearn.random_projection import GaussianRandomProjection

GaussianRandomProjection(
    n_components: int | str = "auto",
    eps: float = 0.1,
    random_state: int | RandomState | None = None,
    compute_inverse_components: bool = False
)
```
Reduce dimensionality through Gaussian random projection.

#### SparseRandomProjection { .api }
```python
from sklearn.random_projection import SparseRandomProjection

SparseRandomProjection(
    n_components: int | str = "auto",
    density: float | str = "auto",
    eps: float = 0.1,
    dense_output: bool = False,
    random_state: int | RandomState | None = None,
    compute_inverse_components: bool = False
)
```
Reduce dimensionality through sparse random projection.

### Random Projection Functions

#### johnson_lindenstrauss_min_dim { .api }
```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

johnson_lindenstrauss_min_dim(
    n_samples: int,
    eps: float | ArrayLike = 0.1
) -> int | ArrayLike
```
Find a 'safe' number of components to randomly project to.
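
The bound grows as the allowed distortion `eps` shrinks. The sketch below compares two values of `eps` for one million samples and checks the looser one against the Johnson-Lindenstrauss bound `4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)`.

```python
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_samples = 1_000_000

k_loose = johnson_lindenstrauss_min_dim(n_samples, eps=0.5)
k_tight = johnson_lindenstrauss_min_dim(n_samples, eps=0.1)

# The same bound computed by hand for eps = 0.5.
expected_loose = int(4 * np.log(n_samples) / (0.5**2 / 2 - 0.5**3 / 3))

print(k_loose)  # 663
print(k_tight)  # much larger, since less distortion is tolerated
```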

## Examples

### Basic Preprocessing Pipeline

```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Create preprocessing pipeline
numeric_features = ['age', 'income', 'score']
categorical_features = ['city', 'gender']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
```

### Feature Selection Pipeline

```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Univariate feature selection
selector = SelectKBest(score_func=f_classif, k=10)

# Recursive feature elimination, ranked by random forest feature importances
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100), n_features_to_select=10)

# Complete pipeline: scale, keep the 10 best features, then classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', selector),
    ('classifier', RandomForestClassifier())
])
```
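
To see the pipeline run end to end, the following sketch fits an equivalent pipeline on synthetic data from `make_classification`. The dataset sizes, `n_estimators=50`, and the `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which 5 are informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=10)),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# The selector step keeps exactly k features before the classifier sees them.
n_kept = int(pipeline.named_steps["selector"].get_support().sum())
predictions = pipeline.predict(X[:5])

print(n_kept)             # 10
print(predictions.shape)  # (5,)
```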
