# Feature Extraction

Feature extraction utilities for converting raw data into numerical features suitable for machine learning algorithms. This includes text processing, image processing, and dictionary-based feature extraction.

## Text Feature Extraction

### CountVectorizer

Convert a collection of text documents to a matrix of token counts.

```python { .api }
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.int64
)
```

### TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

```python { .api }
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    analyzer: str = "word",
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.float64,
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```

### TfidfTransformer

Transform a count matrix to a normalized tf or tf-idf representation.

```python { .api }
from sklearn.feature_extraction.text import TfidfTransformer

TfidfTransformer(
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```

### HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences using the hashing trick.

```python { .api }
from sklearn.feature_extraction.text import HashingVectorizer

HashingVectorizer(
    n_features: int = 2**20,
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    binary: bool = False,
    norm: str = "l2",
    alternate_sign: bool = True,
    dtype: type = np.float64
)
```
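
Because hashing is stateless, `HashingVectorizer` needs no `fit` step and stores no vocabulary. A small sketch (documents and `n_features` chosen for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Small hash space for demonstration; the default is n_features=2**20.
vec = HashingVectorizer(n_features=2**8)

# transform() alone suffices: there is no vocabulary to learn.
X = vec.transform(["the cat sat on the mat", "dogs and cats"])
# X is a sparse matrix of shape (2, 256)
```

The trade-off is that the mapping is one-way: there is no `get_feature_names_out`, and distinct tokens can collide in the same hash bucket.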

### Text Preprocessing Functions

```python { .api }
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode, strip_tags

def strip_accents_ascii(s: str) -> str: ...
def strip_accents_unicode(s: str) -> str: ...
def strip_tags(s: str) -> str: ...
```
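
For example (inputs invented for illustration):

```python
from sklearn.feature_extraction.text import (
    strip_accents_ascii,
    strip_accents_unicode,
    strip_tags,
)

print(strip_accents_ascii("café"))    # "cafe": accents folded, output is ASCII
print(strip_accents_unicode("café"))  # "cafe": accents removed, other Unicode kept
print(strip_tags("<b>bold</b>"))      # tags stripped, leaving the text content
```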

## Dictionary Feature Extraction

### DictVectorizer

Transform lists of feature-value mappings to vectors.

```python { .api }
from sklearn.feature_extraction import DictVectorizer

DictVectorizer(
    dtype: type = np.float64,
    separator: str = "=",
    sparse: bool = True,
    sort: bool = True
)
```

## Hashing Feature Extraction

### FeatureHasher

Implements feature hashing for high-speed, low-memory vectorization.

```python { .api }
from sklearn.feature_extraction import FeatureHasher

FeatureHasher(
    n_features: int = 2**20,
    input_type: str = "dict",
    dtype: type = np.float64,
    alternate_sign: bool = True
)
```

## Image Feature Extraction

### Image to Graph

Convert images to graphs of pixel-to-pixel connections, e.g. for clustering with connectivity constraints.

```python { .api }
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

def img_to_graph(
    img: ndarray,
    mask: ndarray | None = None,
    return_as: type = np.ndarray,
    dtype: type | None = None
) -> ndarray | csr_matrix: ...

def grid_to_graph(
    n_x: int,
    n_y: int,
    n_z: int = 1,
    mask: ndarray | None = None,
    return_as: type = np.ndarray,
    dtype: type = np.int32
) -> ndarray | csr_matrix: ...
```
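
A small sketch of building a pixel-connectivity graph (the image below is synthetic, and the returned container, dense or sparse, depends on `return_as`):

```python
import numpy as np
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

# One graph node per pixel; edges connect neighboring pixels and, for
# img_to_graph, are weighted by the gradient between the pixel values.
img = np.arange(4, dtype=float).reshape(2, 2)
A = img_to_graph(img)   # 4 pixels -> 4x4 adjacency matrix

# grid_to_graph encodes only the connectivity of an n_x-by-n_y(-by-n_z)
# grid, with no image values involved.
G = grid_to_graph(2, 2)
```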

## Usage Examples

### Text Vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.']

# Count vectorizer: one column per vocabulary token, values are counts
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

# TF-IDF vectorizer: counts re-weighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
```

### Dictionary Vectorization

```python
from sklearn.feature_extraction import DictVectorizer

# Convert a list of feature-value mappings to a feature matrix;
# string values ('city') are one-hot encoded, numeric values pass through.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements)
print(vec.get_feature_names_out())
```

### Feature Hashing

```python
from sklearn.feature_extraction import FeatureHasher

# Hash features into a fixed-width vector for large-scale learning;
# no vocabulary is stored, so memory use stays bounded.
h = FeatureHasher(n_features=10)
D = [{'dog': 1, 'cat': 2, 'elephant': 4},
     {'dog': 2, 'run': 5}]
f = h.transform(D)
```

## Constants

```python { .api }
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS: frozenset  # Set of common English stop words
```
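
For instance, the set supports ordinary membership tests; it is the same list applied when a vectorizer is given `stop_words="english"`:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("the" in ENGLISH_STOP_WORDS)  # True
print(len(ENGLISH_STOP_WORDS))      # a few hundred entries
```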