# Text Processing

Text processing utilities for natural language processing tasks, including name normalization and tokenization.

## Capabilities

### Name Processing

Utilities for processing and normalizing person names.

```python { .api }
def generalize_names(name):
    """
    Generalize a person name for consistency,
    e.g. "Pozo, José Ángel" -> "pozo j".

    Parameters:
    - name: str, person name to generalize

    Returns:
    - generalized_name: str, normalized name
    """

def generalize_names_duplcheck(df, col_name):
    """
    Generalize names in a DataFrame column, with duplicate
    checking and removal.

    Parameters:
    - df: pandas.DataFrame, DataFrame containing a column of person names
    - col_name: str, name of the column to generalize

    Returns:
    - df_new: pandas.DataFrame, copy of df with generalized names and
      duplicate rows removed
    """
```
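
For intuition, the kind of transformation `generalize_names` performs can be sketched in a few lines of plain Python. The helper `simple_generalize` below and its exact rules (accent stripping, "last f" output) are illustrative assumptions, not mlxtend's implementation:

```python
import re
import unicodedata

def simple_generalize(name):
    # Hypothetical helper, not mlxtend's implementation:
    # normalize a "Last, First" or "First Last" name to "last f" form.
    # Strip accents (e.g. "José" -> "Jose"), then lowercase.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    name = name.lower()
    # Keep only letters, commas, and whitespace.
    name = re.sub(r"[^a-z,\s]", "", name)
    if "," in name:
        last, _, first = name.partition(",")
    else:
        parts = name.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    first_names = first.split()
    initial = first_names[0][0] if first_names else ""
    return f"{last.strip()} {initial}".strip()
```

In this sketch, `simple_generalize("Pozo, José Ángel")` and `simple_generalize("José Ángel Pozo")` both reduce to `"pozo j"`, which is what makes generalized names comparable for duplicate detection.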

### Text Tokenization

Tokenization utilities for text processing, including emoticon handling.

```python { .api }
def tokenizer_words_and_emoticons(text):
    """
    Tokenize text into words and emoticons.

    Parameters:
    - text: str, input text to tokenize

    Returns:
    - tokens: list, list of word and emoticon tokens
    """

def tokenizer_emoticons(text):
    """
    Extract emoticons from text.

    Parameters:
    - text: str, input text

    Returns:
    - emoticons: list, list of emoticon tokens found in text
    """
```
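
The tokenization behavior can be approximated with a short regex sketch; the emoticon pattern and helper names below are illustrative assumptions, not mlxtend's actual implementation:

```python
import re

# Hypothetical pattern covering common emoticons such as :), :-(, ;D, =P.
EMOTICON_RE = r"[:;=]-?[()DPp]"

def sketch_words_and_emoticons(text):
    # Collect emoticons first, then tokenize the rest into lowercase words.
    emoticons = re.findall(EMOTICON_RE, text)
    remainder = re.sub(EMOTICON_RE, " ", text)
    words = re.findall(r"[a-z0-9']+", remainder.lower())
    return words + emoticons

def sketch_emoticons(text):
    # Return only the emoticon tokens.
    return re.findall(EMOTICON_RE, text)
```

In this sketch, words come back lowercased with punctuation stripped, while emoticons are preserved verbatim and appended after the word tokens.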

## Usage Examples

```python
import pandas as pd

from mlxtend.text import generalize_names, generalize_names_duplcheck
from mlxtend.text import tokenizer_words_and_emoticons, tokenizer_emoticons

# Name processing example
name = "Pozo, José Ángel"
normalized = generalize_names(name)
print(f"Original: {name}")
print(f"Normalized: {normalized}")  # 'pozo j'

# Duplicate name handling on a DataFrame column
df = pd.DataFrame({"name": ["Smith, John", "Smith, J.", "Smith, John", "Doe, Jane"]})
df_unique = generalize_names_duplcheck(df, col_name="name")
print(df_unique)

# Text tokenization with emoticons
text = "I love machine learning! :) It's so cool :D"
tokens = tokenizer_words_and_emoticons(text)
emoticons = tokenizer_emoticons(text)

print(f"Text: {text}")
print(f"All tokens: {tokens}")
print(f"Emoticons only: {emoticons}")
```