
# Text Processing

Text processing utilities for natural language processing tasks, including name normalization and tokenization.

## Capabilities

### Name Processing

Utilities for processing and normalizing person names.

```python { .api }
def generalize_names(name):
    """
    Generalize person names for consistency.

    Parameters:
    - name: str, person name to generalize

    Returns:
    - generalized_name: str, normalized name
    """

def generalize_names_duplcheck(name_list):
    """
    Generalize names with duplicate checking and removal.

    Parameters:
    - name_list: list, list of person names

    Returns:
    - unique_names: list, deduplicated normalized names
    """
```
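A minimal sketch of how such normalization could work. The rule set here (the specific titles and suffixes stripped, and deduplication by exact normalized string) is illustrative, not the library's actual implementation:

```python
import re

# Illustrative rule set -- the real normalization rules may differ.
_TITLES = {"dr", "mr", "mrs", "ms", "prof"}
_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def generalize_names(name):
    """Lowercase a name, strip punctuation, and drop common titles/suffixes."""
    tokens = re.findall(r"[a-z']+", name.lower())
    kept = [t for t in tokens if t not in _TITLES and t not in _SUFFIXES]
    return " ".join(kept)

def generalize_names_duplcheck(name_list):
    """Normalize each name and keep only the first occurrence of each result."""
    seen, unique = set(), []
    for raw in name_list:
        normalized = generalize_names(raw)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique

print(generalize_names("Dr. John Smith Jr."))           # john smith
print(generalize_names_duplcheck(
    ["John Smith", "Dr. John Smith Jr.", "Jane Doe"]))  # ['john smith', 'jane doe']
```

Note that deduplication by exact normalized string will not merge genuinely different spellings such as "John Smith" and "J. Smith"; a production rule set would need a fuzzier matching strategy.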

### Text Tokenization

Tokenization utilities for text processing, including emoticon handling.

```python { .api }
def tokenizer_words_and_emoticons(text):
    """
    Tokenize text including words and emoticons.

    Parameters:
    - text: str, input text to tokenize

    Returns:
    - tokens: list, list of word and emoticon tokens
    """

def tokenizer_emoticons(text):
    """
    Extract emoticons from text.

    Parameters:
    - text: str, input text

    Returns:
    - emoticons: list, list of emoticon tokens found in text
    """
```
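As an illustrative sketch, both functions can be built on a single emoticon regex. The pattern below (eyes, optional nose, mouth) is an assumption for demonstration, not the library's actual emoticon pattern:

```python
import re

# Illustrative emoticon pattern: eyes, optional nose, mouth.
EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[)(\][DPp/\\]")

def tokenizer_emoticons(text):
    """Return only the emoticon tokens found in the text."""
    return EMOTICON_RE.findall(text)

def tokenizer_words_and_emoticons(text):
    """Return lowercased word tokens followed by the emoticons."""
    emoticons = EMOTICON_RE.findall(text)
    stripped = EMOTICON_RE.sub("", text)          # drop emoticons first
    words = re.sub(r"\W+", " ", stripped).lower().split()
    return words + emoticons

text = "I love machine learning! :) It's so cool :D"
print(tokenizer_emoticons(text))           # [':)', ':D']
print(tokenizer_words_and_emoticons(text))
# ['i', 'love', 'machine', 'learning', 'it', 's', 'so', 'cool', ':)', ':D']
```

Removing emoticons before the word pass keeps their punctuation characters from being split into stray tokens.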

## Usage Examples

```python
from mlxtend.text import generalize_names, generalize_names_duplcheck
from mlxtend.text import tokenizer_words_and_emoticons, tokenizer_emoticons

# Name processing examples
name = "Dr. John Smith Jr."
normalized = generalize_names(name)
print(f"Original: {name}")
print(f"Normalized: {normalized}")

# Duplicate name handling
names = ["John Smith", "J. Smith", "John Smith", "Jane Doe"]
unique_names = generalize_names_duplcheck(names)
print(f"Original names: {names}")
print(f"Unique normalized: {unique_names}")

# Text tokenization with emoticons
text = "I love machine learning! :) It's so cool :D"
tokens = tokenizer_words_and_emoticons(text)
emoticons = tokenizer_emoticons(text)

print(f"Text: {text}")
print(f"All tokens: {tokens}")
print(f"Emoticons only: {emoticons}")
```