# Feature Extraction

Feature extraction utilities for converting raw data into numerical features suitable for machine learning algorithms. This includes text processing, image processing, and dictionary-based feature extraction.

## Text Feature Extraction

### CountVectorizer

Convert a collection of text documents to a matrix of token counts.

```python { .api }
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.int64
)
```

### TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

```python { .api }
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer(
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    analyzer: str = "word",
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    max_df: float | int = 1.0,
    min_df: float | int = 1,
    max_features: int | None = None,
    vocabulary: dict | None = None,
    binary: bool = False,
    dtype: type = np.float64,
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```

### TfidfTransformer

Transform a count matrix to a normalized tf or tf-idf representation.

```python { .api }
from sklearn.feature_extraction.text import TfidfTransformer

TfidfTransformer(
    norm: str = "l2",
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False
)
```

### HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences using the hashing trick.

```python { .api }
from sklearn.feature_extraction.text import HashingVectorizer

HashingVectorizer(
    n_features: int = 2**20,
    input: str = "content",
    encoding: str = "utf-8",
    decode_error: str = "strict",
    strip_accents: str | None = None,
    lowercase: bool = True,
    preprocessor: callable | None = None,
    tokenizer: callable | None = None,
    stop_words: str | list | None = None,
    token_pattern: str = r"(?u)\b\w\w+\b",
    ngram_range: tuple = (1, 1),
    analyzer: str = "word",
    binary: bool = False,
    norm: str = "l2",
    alternate_sign: bool = True,
    dtype: type = np.float64
)
```
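
Because hashing is stateless, `HashingVectorizer` needs no `fit` step and stores no vocabulary. A small sketch (documents and `n_features` chosen for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Small hash space for demonstration; the default is n_features=2**20.
vec = HashingVectorizer(n_features=2**8)

# transform() alone suffices: there is no vocabulary to learn.
X = vec.transform(["the cat sat on the mat", "dogs and cats"])
# X is a sparse matrix of shape (2, 256)
```

The trade-off is that the mapping is one-way: there is no `get_feature_names_out`, and distinct tokens can collide in the same hash bucket.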

### Text Preprocessing Functions

```python { .api }
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode, strip_tags

def strip_accents_ascii(s: str) -> str: ...
def strip_accents_unicode(s: str) -> str: ...
def strip_tags(s: str) -> str: ...
```
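
For example (inputs invented for illustration):

```python
from sklearn.feature_extraction.text import (
    strip_accents_ascii,
    strip_accents_unicode,
    strip_tags,
)

print(strip_accents_ascii("café"))    # "cafe": accents folded, output is ASCII
print(strip_accents_unicode("café"))  # "cafe": accents removed, other Unicode kept
print(strip_tags("<b>bold</b>"))      # tags stripped, leaving the text content
```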

## Dictionary Feature Extraction

### DictVectorizer

Transform lists of feature-value mappings to vectors.

```python { .api }
from sklearn.feature_extraction import DictVectorizer

DictVectorizer(
    dtype: type = np.float64,
    separator: str = "=",
    sparse: bool = True,
    sort: bool = True
)
```

## Hashing Feature Extraction

### FeatureHasher

Implements feature hashing for high-speed, low-memory vectorization.

```python { .api }
from sklearn.feature_extraction import FeatureHasher

FeatureHasher(
    n_features: int = 2**20,
    input_type: str = "dict",
    dtype: type = np.float64,
    alternate_sign: bool = True
)
```

## Image Feature Extraction

### Image to Graph

Convert images to graphs of pixel-to-pixel connections, e.g. for clustering with connectivity constraints.

```python { .api }
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

def img_to_graph(
    img: ndarray,
    mask: ndarray | None = None,
    return_as: type = np.ndarray,
    dtype: type | None = None
) -> ndarray | csr_matrix: ...

def grid_to_graph(
    n_x: int,
    n_y: int,
    n_z: int = 1,
    mask: ndarray | None = None,
    return_as: type = np.ndarray,
    dtype: type = np.int32
) -> ndarray | csr_matrix: ...
```
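
A small sketch of building a pixel-connectivity graph (the image below is synthetic, and the returned container, dense or sparse, depends on `return_as`):

```python
import numpy as np
from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

# One graph node per pixel; edges connect neighboring pixels and, for
# img_to_graph, are weighted by the gradient between the pixel values.
img = np.arange(4, dtype=float).reshape(2, 2)
A = img_to_graph(img)   # 4 pixels -> 4x4 adjacency matrix

# grid_to_graph encodes only the connectivity of an n_x-by-n_y(-by-n_z)
# grid, with no image values involved.
G = grid_to_graph(2, 2)
```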

## Usage Examples

### Text Vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.']

# Count vectorizer: one column per vocabulary token, values are counts
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

# TF-IDF vectorizer: counts re-weighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
```

### Dictionary Vectorization

```python
from sklearn.feature_extraction import DictVectorizer

# Convert a list of feature-value mappings to a feature matrix;
# string values ('city') are one-hot encoded, numeric values pass through.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements)
print(vec.get_feature_names_out())
```

### Feature Hashing

```python
from sklearn.feature_extraction import FeatureHasher

# Hash features into a fixed-width vector for large-scale learning;
# no vocabulary is stored, so memory use stays bounded.
h = FeatureHasher(n_features=10)
D = [{'dog': 1, 'cat': 2, 'elephant': 4},
     {'dog': 2, 'run': 5}]
f = h.transform(D)
```

## Constants

```python { .api }
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS: frozenset  # Set of common English stop words
```
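
For instance, the set supports ordinary membership tests; it is the same list applied when a vectorizer is given `stop_words="english"`:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("the" in ENGLISH_STOP_WORDS)  # True
print(len(ENGLISH_STOP_WORDS))      # a few hundred entries
```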