
# Feature Extraction

Feature extraction utilities for converting raw data into numerical features suitable for machine learning algorithms. This includes text processing, image processing, and dictionary-based feature extraction.

## Text Feature Extraction

### CountVectorizer

Convert a collection of text documents to a matrix of token counts.

```python { .api }
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.int64
)
```

### TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

```python { .api }
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    analyzer: str = "word",
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.float64,
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```

### TfidfTransformer

Transform a count matrix to a normalized tf or tf-idf representation.

```python { .api }
from sklearn.feature_extraction.text import TfidfTransformer

TfidfTransformer(
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```
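A minimal usage sketch (the corpus and parameter values are illustrative, not from the original): `TfidfTransformer` is typically chained after `CountVectorizer` to rescale raw counts into tf-idf weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the cat sat", "the dog ran", "the cat ran"]

# Produce raw token counts first, then rescale them to tf-idf.
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer(norm="l2", use_idf=True).fit_transform(counts)

print(tfidf.shape)  # one row per document, one column per vocabulary token
```

The same two-step composition is what `TfidfVectorizer` performs internally in a single estimator.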

### HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences using the hashing trick.

```python { .api }
from sklearn.feature_extraction.text import HashingVectorizer

HashingVectorizer(
    n_features: int = 2**20,
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    binary: bool = False,
    norm: str = "l2",
    alternate_sign: bool = True,
    dtype: type = np.float64
)
```
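A brief sketch (toy corpus, deliberately small `n_features` for readability): because hashing is stateless, no fitting step or stored vocabulary is needed.

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["the quick brown fox", "jumped over the lazy dog"]

# Stateless: transform() works directly and no vocabulary_ is stored,
# which makes this vectorizer suitable for streaming / out-of-core learning.
vectorizer = HashingVectorizer(n_features=2**8)
X = vectorizer.transform(corpus)

print(X.shape)  # (2, 256)
```

The trade-off is that hashing is one-way: feature indices cannot be mapped back to token strings.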

### Text Preprocessing Functions

```python { .api }
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode, strip_tags

def strip_accents_ascii(s: str) -> str: ...
def strip_accents_unicode(s: str) -> str: ...
def strip_tags(s: str) -> str: ...
```
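Illustrative calls (the input strings are made up for this sketch):

```python
from sklearn.feature_extraction.text import (
    strip_accents_ascii,
    strip_accents_unicode,
    strip_tags,
)

# Both accent strippers remove diacritics; the ASCII variant also
# drops any character that cannot be transliterated to ASCII.
print(strip_accents_ascii("café naïve"))    # cafe naive
print(strip_accents_unicode("café naïve"))  # cafe naive

# strip_tags removes HTML/XML tags (they are replaced by whitespace).
print(strip_tags("<b>bold</b> text").split())
```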

## Dictionary Feature Extraction

### DictVectorizer

Transform lists of feature-value mappings to vectors.

```python { .api }
from sklearn.feature_extraction import DictVectorizer

DictVectorizer(
    dtype: type = np.float64,
    separator: str = "=",
    sparse: bool = True,
    sort: bool = True
)
```

## Hashing Feature Extraction

### FeatureHasher

Implements feature hashing for high-speed, low-memory vectorization.

```python { .api }
from sklearn.feature_extraction import FeatureHasher

FeatureHasher(
    n_features: int = 2**20,
    input_type: str = "dict",
    dtype: type = np.float64,
    alternate_sign: bool = True
)
```

## Image Feature Extraction

### Image to Graph

Convert images to graphs for machine learning applications.

```python { .api }
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

def img_to_graph(
    img: ndarray,
    mask: ndarray | None = None,
    return_as: type = sparse.coo_matrix,
    dtype: type | None = None
) -> ndarray | coo_matrix: ...

def grid_to_graph(
    n_x: int,
    n_y: int,
    n_z: int = 1,
    mask: ndarray | None = None,
    return_as: type = sparse.coo_matrix,
    dtype: type = int
) -> ndarray | coo_matrix: ...
```
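A small sketch (the 2x2 image is made up): both helpers build a pixel-connectivity matrix, commonly used as the `connectivity` argument for agglomerative or spectral clustering.

```python
import numpy as np
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

# Connectivity graph of a tiny 2x2 image; edge weights reflect
# the gradient between neighboring pixels.
img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
graph = img_to_graph(img)
print(graph.shape)  # (4, 4): one node per pixel

# Pure grid connectivity, independent of any pixel values.
conn = grid_to_graph(n_x=3, n_y=3)
print(conn.shape)  # (9, 9)
```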

## Usage Examples

### Text Vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.']

# Count vectorization: documents -> token-count matrix
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
```

### Dictionary Vectorization

```python
from sklearn.feature_extraction import DictVectorizer

# Convert a list of dicts to feature vectors; string values are
# one-hot encoded, numeric values pass through unchanged.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements)
print(vec.get_feature_names_out())
```

### Feature Hashing

```python
from sklearn.feature_extraction import FeatureHasher

# Hash features for large-scale learning
h = FeatureHasher(n_features=10)
D = [{'dog': 1, 'cat': 2, 'elephant': 4},
     {'dog': 2, 'run': 5}]
f = h.transform(D)
```

## Constants

```python { .api }
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS: frozenset  # Set of common English stop words
```