# Variable Discretisation

Transformers for converting continuous variables into discrete intervals using equal width, equal frequency, decision tree-based, or user-defined boundaries.

## Capabilities

### Equal Width Discretisation

Sorts continuous variables into intervals of equal width.

```python { .api }
class EqualWidthDiscretiser:
    def __init__(self, bins=10, variables=None, return_object=False, return_boundaries=False):
        """
        Initialize EqualWidthDiscretiser.

        Parameters:
        - bins (int): Number of equal width intervals to create
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Learn interval boundaries for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables into equal width intervals.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.discretisation import EqualWidthDiscretiser
import pandas as pd
import numpy as np

# Sample continuous data
data = {'age': np.random.normal(35, 10, 1000),
        'income': np.random.normal(50000, 15000, 1000)}
df = pd.DataFrame(data)

# Create 5 equal width intervals (the parameter is named `bins`, not `q`)
discretiser = EqualWidthDiscretiser(bins=5)
df_discretised = discretiser.fit_transform(df)
# Values are replaced by interval indices (0-4); the intervals
# look like (18.5, 25.2], (25.2, 31.9], etc.

# Return with boundaries in labels
discretiser = EqualWidthDiscretiser(bins=3, return_boundaries=True)
df_discretised = discretiser.fit_transform(df)

# Access learned boundaries
print(discretiser.binner_dict_)  # Shows interval boundaries per variable
```
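
To make the idea of equal width concrete, here is a minimal standard-library sketch of how such boundaries can be derived by hand. It is an illustration under simple assumptions (right-closed intervals, the lowest value placed in the first bin), not the library's implementation; `equal_width_boundaries` and `assign_bin` are hypothetical helper names.

```python
def equal_width_boundaries(values, bins):
    """Return bins + 1 evenly spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def assign_bin(value, edges):
    """Return the index of the interval (edges[i], edges[i + 1]] holding value."""
    for i in range(len(edges) - 1):
        if value <= edges[i + 1]:
            return i
    return len(edges) - 2  # values above the last edge go to the top bin


ages = [18, 22, 25, 31, 40, 47, 52, 58]
edges = equal_width_boundaries(ages, bins=4)
labels = [assign_bin(a, edges) for a in ages]

print(edges)   # [18.0, 28.0, 38.0, 48.0, 58.0]
print(labels)  # [0, 0, 0, 1, 2, 2, 3, 3]
```

Every interval spans 10 years, but the number of observations per interval differs, which is the main trade-off against equal frequency binning.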

### Equal Frequency Discretisation

Sorts continuous variables into intervals of equal frequency (quantiles).

```python { .api }
class EqualFrequencyDiscretiser:
    def __init__(self, q=10, variables=None, return_object=False, return_boundaries=False):
        """
        Initialize EqualFrequencyDiscretiser.

        Parameters:
        - q (int): Number of intervals to create (quantiles)
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Learn quantile boundaries for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables into equal frequency intervals.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.discretisation import EqualFrequencyDiscretiser

# df is the DataFrame built in the equal width example

# Create 5 quantile-based intervals
discretiser = EqualFrequencyDiscretiser(q=5)
df_discretised = discretiser.fit_transform(df)
# Each interval contains approximately 20% of the data

# Create quartiles (4 intervals)
discretiser = EqualFrequencyDiscretiser(q=4)
df_discretised = discretiser.fit_transform(df)
# Creates four bins, one per quartile, each holding about 25% of the data
```
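
The effect of quantile-based boundaries is easiest to see on skewed data. The following standard-library sketch is an illustration only, not the library's implementation: it assumes interpolated quantiles via `statistics.quantiles` and right-closed intervals, and the `incomes` values are made up for the example.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Made-up, right-skewed sample: one large value dominates the range.
incomes = [10, 12, 15, 18, 20, 25, 40, 200]

# Three interior cut points split the data into four quantile bins.
cuts = quantiles(incomes, n=4, method="inclusive")

# bisect_left sends a value equal to a cut point to the lower bin,
# which corresponds to right-closed intervals.
labels = [bisect_left(cuts, v) for v in incomes]
counts = Counter(labels)

print(cuts)                    # [14.25, 19.0, 28.75]
print(sorted(counts.items()))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Each bin receives two observations even though the top bin spans a far wider range than the others. Equal width binning would instead place most values in the first interval.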

### Arbitrary Discretisation

Sorts continuous variables into intervals defined by user-specified boundaries.

```python { .api }
class ArbitraryDiscretiser:
    def __init__(self, binning_dict, return_object=False, return_boundaries=False):
        """
        Initialize ArbitraryDiscretiser.

        Parameters:
        - binning_dict (dict): Dictionary mapping variables to lists of cut points
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y=None):
        """
        Validate binning dictionary and variables.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise continuous variables using user-defined boundaries.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.discretisation import ArbitraryDiscretiser

# Define custom intervals for each variable
binning_dict = {
    'age': [18, 30, 45, 60, 100],
    'income': [0, 25000, 50000, 75000, 100000, float('inf')]
}

discretiser = ArbitraryDiscretiser(binning_dict=binning_dict)
df_discretised = discretiser.fit_transform(df)
# Creates intervals: (18,30], (30,45], (45,60], (60,100] for age
# Creates intervals: (0,25000], (25000,50000], etc. for income

# Return as object type with boundaries
discretiser = ArbitraryDiscretiser(
    binning_dict=binning_dict,
    return_object=True,
    return_boundaries=True
)
df_discretised = discretiser.fit_transform(df)
```
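
With user-defined cut points it helps to be explicit about what happens at and beyond the edges. The sketch below is a standard-library illustration, assuming the right-closed intervals shown in the comments above and mapping out-of-range values to `None`; `arbitrary_bin` is a hypothetical helper, and the library's own handling of out-of-range values may differ.

```python
from bisect import bisect_left


def arbitrary_bin(value, cut_points):
    """Return the index of (cut_points[i], cut_points[i + 1]] holding value, else None."""
    if value <= cut_points[0] or value > cut_points[-1]:
        return None  # outside the user-defined range
    return bisect_left(cut_points, value) - 1


age_cuts = [18, 30, 45, 60, 100]
ages = [17, 18, 25, 30, 31, 60, 100, 101]
labels = [arbitrary_bin(a, age_cuts) for a in ages]

print(labels)  # [None, None, 0, 0, 1, 2, 3, None]
```

Note that 18 itself falls outside the first interval because the interval is open on the left. Lowering the first cut point, or using `float('-inf')`, avoids losing such boundary values.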

### Decision Tree Discretisation

Uses decision trees to find, for each variable, the cut points that best predict the target variable.

```python { .api }
class DecisionTreeDiscretiser:
    def __init__(self, variables=None, cv=3, scoring='accuracy', param_grid=None,
                 regression=False, random_state=None, return_object=False,
                 return_boundaries=False):
        """
        Initialize DecisionTreeDiscretiser.

        Parameters:
        - variables (list): List of numerical variables to discretise. If None, selects all numerical variables
        - cv (int): Cross-validation folds for hyperparameter tuning
        - scoring (str): Scoring metric for model selection
        - param_grid (dict): Parameter grid for decision tree hyperparameter tuning
        - regression (bool): Whether target is continuous (True) or categorical (False)
        - random_state (int): Random state for reproducibility
        - return_object (bool): Whether to return discretised variables as object type
        - return_boundaries (bool): Whether to return interval boundaries as part of labels
        """

    def fit(self, X, y):
        """
        Train decision trees to find optimal cut points per variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Discretise variables using decision tree-derived cut points.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with continuous variables replaced by interval labels
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.discretisation import DecisionTreeDiscretiser

# y is a categorical target and y_continuous a numeric target,
# both assumed to be defined and aligned with the rows of df

# Automatic discretisation based on a classification target
# (regression=False is passed explicitly so the scorer matches the task)
discretiser = DecisionTreeDiscretiser(cv=5, scoring='accuracy', regression=False)
df_discretised = discretiser.fit_transform(df, y)
# Finds cut points that best separate the target classes

# For regression tasks
discretiser = DecisionTreeDiscretiser(
    regression=True,
    scoring='neg_mean_squared_error'
)
df_discretised = discretiser.fit_transform(df, y_continuous)

# Access learned boundaries
print(discretiser.binner_dict_)  # Shows tree-derived cut points per variable
print(discretiser.scores_dict_)  # Shows cross-validation scores
```
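
The core idea behind tree-derived cut points can be sketched with a single split. The example below is a simplified standard-library illustration, assuming a binary target and Gini impurity as the split criterion; a real decision tree repeats this search recursively and the library additionally tunes the tree with cross-validation. `gini` and `best_cut_point` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_cut_point(values, targets):
    """Return the threshold that minimises the weighted Gini impurity of one split."""
    pairs = sorted(zip(values, targets))
    xs = [x for x, _ in pairs]
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        threshold = (xs[i] + xs[i - 1]) / 2  # midpoint between neighbours
        left = [t for x, t in pairs if x <= threshold]
        right = [t for x, t in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


ages = [22, 25, 29, 33, 41, 48, 55, 60]
bought = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up binary target
cut = best_cut_point(ages, bought)

print(cut)  # 37.0
```

Unlike equal width or equal frequency binning, the cut point here depends on the target: it lands where the classes separate, not where the value range or the sample counts divide evenly.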

## Usage Patterns

### Combining with Other Transformers

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import OneHotEncoder

# Pipeline for preprocessing continuous variables
pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    # return_object=True casts the bins to object type so that the
    # encoder, which targets categorical variables, picks them up
    ('discretiser', EqualFrequencyDiscretiser(q=5, return_object=True)),
    ('encoder', OneHotEncoder())  # Convert intervals to dummy variables
])

df_processed = pipeline.fit_transform(df)
```

### Handling Mixed Data Types

```python
from feature_engine.discretisation import EqualWidthDiscretiser

# df_with_mixed_types is assumed to hold both numerical and categorical columns

# Specify only numerical variables to discretise
discretiser = EqualWidthDiscretiser(
    bins=4,
    variables=['age', 'income', 'score']  # Only these will be discretised
)
df_mixed = discretiser.fit_transform(df_with_mixed_types)
# Categorical variables remain unchanged
```

## Common Attributes

All discretisation transformers share these fitted attributes:

- `variables_` (list): Variables that will be transformed
- `n_features_in_` (int): Number of features in training set
- `binner_dict_` (dict): Dictionary with interval boundaries per variable

Additional attributes for specific discretisers:

- `scores_dict_` (dict): Cross-validation scores per variable (DecisionTreeDiscretiser)
- `models_dict_` (dict): Trained decision tree models per variable (DecisionTreeDiscretiser)
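
Because the boundaries are stored per variable after fitting, the same intervals can be applied to data that was not seen during `fit`. The sketch below illustrates this with a plain dictionary shaped like `binner_dict_`. The boundary values, the infinite outer edges, and the `apply_boundaries` helper are assumptions made for the example, not output from the library.

```python
from bisect import bisect_left

# Hypothetical boundaries shaped like a fitted `binner_dict_`:
# variable name -> sorted list of interval edges.
binner = {
    "age": [float("-inf"), 30, 45, float("inf")],
    "income": [float("-inf"), 25000, 50000, float("inf")],
}


def apply_boundaries(row, binner):
    """Replace each listed variable in row with its interval index; keep other keys."""
    out = dict(row)
    for name, edges in binner.items():
        # Interval i is (edges[i], edges[i + 1]]; infinite outer edges catch all values.
        out[name] = bisect_left(edges, row[name]) - 1
    return out


new_row = {"age": 52, "income": 25000, "city": "Leeds"}
binned = apply_boundaries(new_row, binner)

print(binned)  # {'age': 2, 'income': 0, 'city': 'Leeds'}
```

Variables without an entry in the dictionary, such as `city` here, pass through unchanged, which mirrors how only the selected numerical variables are discretised.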