# Variable Discretisation

Transformers for converting continuous variables into discrete intervals using equal width, equal frequency, decision tree-based, or user-defined boundaries.

## Capabilities

### Equal Width Discretisation

Sorts continuous variables into intervals of equal width.

```python { .api }
class EqualWidthDiscretiser:
    def __init__(self, bins=10, variables=None, return_object=False, return_boundaries=False):
        """
        Initialize EqualWidthDiscretiser.

        Parameters:
        - bins (int): Number of equal-width intervals to create
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Learn interval boundaries for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables into equal width intervals.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.discretisation import EqualWidthDiscretiser
import pandas as pd
import numpy as np

# Sample continuous data
data = {'age': np.random.normal(35, 10, 1000),
        'income': np.random.normal(50000, 15000, 1000)}
df = pd.DataFrame(data)

# Create 5 equal width intervals
discretiser = EqualWidthDiscretiser(bins=5)
df_discretised = discretiser.fit_transform(df)
# With the default return_boundaries=False, values become integer bin indices (0 to 4)

# Return with boundaries in labels
discretiser = EqualWidthDiscretiser(bins=3, return_boundaries=True)
df_discretised = discretiser.fit_transform(df)
# Labels are now intervals such as (25.2, 31.9] instead of bin indices

# Access learned boundaries
print(discretiser.binner_dict_)  # Shows interval boundaries per variable
```
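
To make the idea of equal-width binning concrete, here is a minimal pure-Python sketch. It is not feature_engine's implementation: the helper names `equal_width_edges` and `assign_bin` are hypothetical, and the choice to place out-of-range values in the first or last bin is an assumption of this sketch.

```python
from bisect import bisect_left

def equal_width_edges(values, bins):
    """Return bins + 1 evenly spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def assign_bin(value, edges):
    """Map a value to a bin index using right-closed intervals (a, b]."""
    inner = edges[1:-1]  # only the interior edges decide the bin
    return bisect_left(inner, value)

ages = [20, 22, 25, 31, 38, 44, 47, 60]
edges = equal_width_edges(ages, bins=4)
binned = [assign_bin(a, edges) for a in ages]
print(edges)   # [20.0, 30.0, 40.0, 50.0, 60.0]
print(binned)  # [0, 0, 0, 1, 1, 2, 2, 3]
```

Every interval spans 10 years here, but the number of observations per interval differs, which is the trade-off equal-width binning makes.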

### Equal Frequency Discretisation

Sorts continuous variables into intervals that each hold roughly the same number of observations (quantiles).

```python { .api }
class EqualFrequencyDiscretiser:
    def __init__(self, q=10, variables=None, return_object=False, return_boundaries=False):
        """
        Initialize EqualFrequencyDiscretiser.

        Parameters:
        - q (int): Number of intervals to create (quantiles)
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Learn quantile boundaries for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables into equal frequency intervals.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.discretisation import EqualFrequencyDiscretiser

# Create 5 quantile-based intervals (df is the DataFrame from the equal width example)
discretiser = EqualFrequencyDiscretiser(q=5)
df_discretised = discretiser.fit_transform(df)
# Each interval contains approximately 20% of the data

# Create quartiles (4 intervals)
discretiser = EqualFrequencyDiscretiser(q=4)
df_discretised = discretiser.fit_transform(df)
# Bin indices 0 to 3 correspond to the four quartiles
```
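
The quantile idea can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: `equal_frequency_edges` is a hypothetical helper, and the "inclusive" quantile method is one of several reasonable interpolation choices.

```python
from bisect import bisect_left
from statistics import quantiles

def equal_frequency_edges(values, q):
    """Return the q - 1 interior cut points that split values into q quantile bins."""
    return quantiles(values, n=q, method="inclusive")

# Skewed data: equal-width bins would crowd most values into the first interval
incomes = [12, 15, 18, 20, 24, 30, 41, 55, 70, 95, 140, 300]
cuts = equal_frequency_edges(incomes, q=4)
labels = [bisect_left(cuts, x) for x in incomes]
counts = [labels.count(b) for b in range(4)]
print(cuts)    # [19.5, 35.5, 76.25]
print(counts)  # [3, 3, 3, 3]
```

The intervals have very different widths, yet each one receives the same share of the observations.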

### Arbitrary Discretisation

Sorts continuous variables into intervals defined by user-specified boundaries.

```python { .api }
class ArbitraryDiscretiser:
    def __init__(self, binning_dict, return_object=False, return_boundaries=False):
        """
        Initialize ArbitraryDiscretiser.

        Parameters:
        - binning_dict (dict): Dictionary mapping variables to lists of cut points
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Validate binning dictionary and variables.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables using user-defined boundaries.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.discretisation import ArbitraryDiscretiser

# Define custom intervals for each variable
binning_dict = {
    'age': [18, 30, 45, 60, 100],
    'income': [0, 25000, 50000, 75000, 100000, float('inf')]
}

discretiser = ArbitraryDiscretiser(binning_dict=binning_dict)
df_discretised = discretiser.fit_transform(df)
# age is binned into (18,30], (30,45], (45,60], (60,100]
# income is binned into (0,25000], (25000,50000], etc.
# Values outside the listed boundaries (e.g. age below 18) match no interval

# Return as object type with boundaries
discretiser = ArbitraryDiscretiser(
    binning_dict=binning_dict,
    return_object=True,
    return_boundaries=True
)
df_discretised = discretiser.fit_transform(df)
```
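
How a list of cut points maps values to intervals can be sketched as follows. The helper `cut` is hypothetical and written for illustration only. Returning `None` for out-of-range values and placing the lowest edge in the first interval are assumptions of this sketch, not documented behaviour of `ArbitraryDiscretiser`.

```python
from bisect import bisect_left

def cut(value, edges):
    """Index of the interval (edges[i], edges[i + 1]] containing value, else None."""
    if value < edges[0] or value > edges[-1]:
        return None  # outside every user-defined interval
    # The lowest edge itself is placed in the first interval (a choice made for this sketch)
    return max(bisect_left(edges, value) - 1, 0)

age_edges = [18, 30, 45, 60, 100]
ages = [12, 18, 25, 30, 31, 59, 72, 101]
binned = [cut(a, age_edges) for a in ages]
print(binned)  # [None, 0, 0, 0, 1, 2, 3, None]
```

Because the boundaries are fixed by the user, it is worth checking how many observations fall outside them before relying on the result.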

### Decision Tree Discretisation

Uses a decision tree to find optimal cut points for discretisation based on the target variable.

```python { .api }
class DecisionTreeDiscretiser:
    def __init__(self, variables=None, cv=3, scoring='accuracy', param_grid=None,
                 regression=False, random_state=None, return_object=False,
                 return_boundaries=False):
        """
        Initialize DecisionTreeDiscretiser.

        Parameters:
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - cv (int): Cross-validation folds for hyperparameter tuning
        - scoring (str): Scoring metric for model selection
        - param_grid (dict): Parameter grid for decision tree hyperparameter tuning
        - regression (bool): Whether target is continuous (True) or categorical (False)
        - random_state (int): Random state for reproducibility
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y):
        """
        Train decision trees to find optimal cut points per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise variables using decision tree-derived cut points.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.discretisation import DecisionTreeDiscretiser

# Automatic discretisation based on target
# (y is a categorical target Series aligned with df)
discretiser = DecisionTreeDiscretiser(cv=5, scoring='accuracy', regression=False)
df_discretised = discretiser.fit_transform(df, y)
# Finds optimal cut points that best separate target classes

# For regression tasks (y_continuous is a numeric target Series)
discretiser = DecisionTreeDiscretiser(
    regression=True,
    scoring='neg_mean_squared_error'
)
df_discretised = discretiser.fit_transform(df, y_continuous)

# Access learned boundaries
print(discretiser.binner_dict_)  # Shows tree-derived cut points per variable
print(discretiser.scores_dict_)  # Shows cross-validation scores
```
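
The core idea behind tree-based cut points can be illustrated with a depth-1 tree (a single split) chosen by Gini impurity. This is a deliberately simplified sketch with hypothetical helpers `gini` and `best_split`. The discretiser described above fits full decision trees with cross-validated hyperparameters, which this example does not attempt.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return the threshold on x that minimises the weighted Gini impurity of y."""
    pairs = sorted(zip(x, y))
    xs = [v for v, _ in pairs]
    ys = [t for _, t in pairs]
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            # Candidate cut point: midpoint between neighbouring values
            best_threshold, best_score = (xs[i - 1] + xs[i]) / 2, score
    return best_threshold

age = [22, 25, 29, 33, 41, 48, 52, 60]
bought = [0, 0, 0, 0, 1, 1, 1, 1]
threshold = best_split(age, bought)
print(threshold)  # 37.0
```

Unlike equal-width or equal-frequency binning, the cut point is driven by the target: it lands where the classes separate best.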

## Usage Patterns

### Combining with Other Transformers

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import OneHotEncoder

# Pipeline for preprocessing continuous variables
pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    # return_object=True so the encoder treats the bins as categories
    ('discretiser', EqualFrequencyDiscretiser(q=5, return_object=True)),
    ('encoder', OneHotEncoder())  # Convert intervals to dummy variables
])

df_processed = pipeline.fit_transform(df)
```
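
What the three pipeline stages do to a single column can be traced with a small standard-library sketch. The functions `impute_median`, `quantile_bins`, and `one_hot` are hypothetical stand-ins for the transformers above and only approximate their behaviour.

```python
from bisect import bisect_left
from statistics import median, quantiles

def impute_median(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def quantile_bins(values, q):
    """Assign each value to one of q equal-frequency bins."""
    cuts = quantiles(values, n=q, method="inclusive")
    return [bisect_left(cuts, v) for v in values]

def one_hot(bins, q):
    """Turn bin indices into rows of 0/1 indicator columns."""
    return [[int(b == k) for k in range(q)] for b in bins]

raw = [10, None, 30, 40, 50, 60, None, 80, 90]
filled = impute_median(raw)              # gaps become the median, 50
rows = one_hot(quantile_bins(filled, q=3), q=3)
print(rows[0])  # [1, 0, 0]
```

Each stage consumes the previous stage's output, which is why imputation has to come before discretisation: the quantiles cannot be computed while values are missing.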

### Handling Mixed Data Types

```python
from feature_engine.discretisation import EqualWidthDiscretiser

# Specify only numerical variables to discretise
# (df_with_mixed_types holds both numerical and categorical columns)
discretiser = EqualWidthDiscretiser(
    bins=4,
    variables=['age', 'income', 'score']  # Only these will be discretised
)
df_mixed = discretiser.fit_transform(df_with_mixed_types)
# Categorical variables remain unchanged
```

## Common Attributes

All discretisation transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `binner_dict_` (dict): Dictionary with interval boundaries per variable

Additional attributes for specific discretisers:
- `scores_dict_` (dict): Cross-validation scores per variable (DecisionTreeDiscretiser)
- `models_dict_` (dict): Trained decision tree models per variable (DecisionTreeDiscretiser)
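
The role of the fitted attributes is easiest to see in a toy transformer that learns boundaries in `fit` and reuses them in `transform`. `SketchEqualWidthDiscretiser` below is a hypothetical stand-in operating on plain dicts, not a feature_engine class. The open-ended outer edges are an assumption of this sketch.

```python
from bisect import bisect_left

class SketchEqualWidthDiscretiser:
    """Hypothetical stand-in that mimics the fitted attributes listed above."""

    def __init__(self, bins=3):
        self.bins = bins

    def fit(self, X):
        # X is a plain dict of column name -> list of numbers
        self.variables_ = list(X)
        self.n_features_in_ = len(X)
        self.binner_dict_ = {}
        for name, values in X.items():
            lo, hi = min(values), max(values)
            width = (hi - lo) / self.bins
            inner = [lo + i * width for i in range(1, self.bins)]
            # Open-ended outer edges so unseen extremes still get a bin
            self.binner_dict_[name] = [float("-inf"), *inner, float("inf")]
        return self

    def transform(self, X):
        # Reuse the boundaries learned in fit; nothing is re-estimated here
        return {
            name: [bisect_left(self.binner_dict_[name][1:-1], v) for v in X[name]]
            for name in self.variables_
        }

train = {"age": [20, 35, 50]}
test = {"age": [10, 30, 31, 99]}
d = SketchEqualWidthDiscretiser(bins=3).fit(train)
print(d.binner_dict_)     # {'age': [-inf, 30.0, 40.0, inf]}
print(d.transform(test))  # {'age': [0, 0, 1, 2]}
```

Boundaries come from the training data only, so the same intervals are applied to any later dataset.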