# Text Tokenization

Standalone tokenizer functionality using the SentencePiece algorithm to convert text into token sequences. The tokenizer can be used separately from the embedding models and supports custom vocabularies.

## Capabilities

### Load Tokenizer

Creates a tokenizer instance with the default or a custom vocabulary.

```typescript { .api }
/**
 * Load a tokenizer for independent use from the Universal Sentence Encoder
 * @param pathToVocabulary - Optional path to a custom vocabulary file
 * @returns Promise that resolves to a Tokenizer instance
 */
function loadTokenizer(pathToVocabulary?: string): Promise<Tokenizer>;
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load with the default vocabulary
const tokenizer = await use.loadTokenizer();

// Load with a custom vocabulary
const customTokenizer = await use.loadTokenizer(
  'https://example.com/my-vocab.json'
);
```
### Tokenizer Class

SentencePiece tokenizer implementation that converts text strings into sequences of integer token IDs, using the Viterbi algorithm to pick the most likely segmentation.

```typescript { .api }
class Tokenizer {
  /**
   * Create a tokenizer with vocabulary and symbol configuration
   * @param vocabulary - Array of [token, score] pairs
   * @param reservedSymbolsCount - Number of reserved symbols (default: 6)
   */
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);

  /**
   * Tokenize an input string into an array of token IDs
   * Uses the Viterbi algorithm to find the most likely token sequence
   * @param input - String to tokenize
   * @returns Array of token IDs
   */
  encode(input: string): number[];
}
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Basic tokenization
const tokenizer = await use.loadTokenizer();
const tokens = tokenizer.encode('Hello, how are you?');
console.log(tokens); // e.g. [341, 4125, 8, 140, 31, 19, 54] — exact IDs depend on the vocabulary

// Tokenize multiple strings
const sentences = [
  'Machine learning is fascinating.',
  'TensorFlow.js runs in browsers.',
  'Tokenization converts text to numbers.'
];

const allTokens = sentences.map(text => tokenizer.encode(text));
console.log('Tokenized sentences:', allTokens);
```
### Vocabulary Loading

Load vocabulary files for creating custom tokenizers.

```typescript { .api }
/**
 * Load a vocabulary from a URL or local path
 * @param pathToVocabulary - URL or path to a vocabulary JSON file
 * @returns Promise that resolves to a vocabulary array
 */
function loadVocabulary(pathToVocabulary: string): Promise<Vocabulary>;
```

**Usage Example:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load a custom vocabulary
const vocab = await use.loadVocabulary('https://example.com/vocab.json');
const customTokenizer = new use.Tokenizer(vocab);

// Use the custom tokenizer
const tokens = customTokenizer.encode('Custom vocabulary example');
```
### Tokenization Process

The tokenizer follows the SentencePiece algorithm with these key steps:

1. **Input Normalization**: Unicode normalization (NFKC) and separator insertion
2. **Lattice Construction**: Build a lattice of candidate tokens using a trie data structure
3. **Viterbi Algorithm**: Find the most likely token sequence based on vocabulary scores
4. **Post-processing**: Merge consecutive unknown tokens and reverse the token order

**Example of tokenization steps:**

```typescript
const tokenizer = await use.loadTokenizer();

// Original text
const text = "Hello, world!";

// Internal processing (for illustration):
// 1. Normalized: "▁Hello,▁world!"
// 2. Lattice: multiple possible token combinations
// 3. Viterbi: best path selection
// 4. Result: an array of token IDs, e.g. [341, 8, 126, 54]

const tokens = tokenizer.encode(text);
console.log('Final tokens:', tokens);
```
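The Viterbi step (step 3) can be sketched independently of the library. Everything below is illustrative: `viterbiSegment`, the toy vocabulary, and its scores are made up for this sketch; real SentencePiece scores are log-probabilities learned during training.

```typescript
// Toy Viterbi segmentation over a [token, score] vocabulary.
// best[i] holds the best total score for input[0..i); back[i] records the
// token that ends at position i on that best path.
type Vocab = Array<[string, number]>;

function viterbiSegment(input: string, vocab: Vocab): string[] {
  const scores = new Map<string, number>(vocab);
  const n = input.length;
  const best: number[] = new Array(n + 1).fill(-Infinity);
  const back: string[] = new Array(n + 1).fill('');
  best[0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 0; j < i; j++) {
      const piece = input.slice(j, i);
      const s = scores.get(piece);
      if (s !== undefined && best[j] + s > best[i]) {
        best[i] = best[j] + s;
        back[i] = piece;
      }
    }
  }
  // Backtrack from the end, then reverse — mirroring step 4 above.
  // (A real tokenizer also handles characters not covered by any token.)
  const pieces: string[] = [];
  for (let i = n; i > 0; i -= back[i].length) pieces.push(back[i]);
  return pieces.reverse();
}

// Splitting wins over the whole-word token here because -1 + -1 > -3.
const toyVocab: Vocab = [['he', -1], ['llo', -1], ['hello', -3], ['l', -2], ['o', -2]];
console.log(viterbiSegment('hello', toyVocab)); // ['he', 'llo']
```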
## Advanced Usage

### Custom Vocabulary Integration

Create tokenizers with different vocabularies for specialized domains:

```typescript
// Load a domain-specific vocabulary
const medicalVocab = await use.loadVocabulary('https://example.com/medical-vocab.json');
const medicalTokenizer = new use.Tokenizer(medicalVocab);

// Tokenize medical text
const medicalText = "The patient shows symptoms of acute myocardial infarction.";
const medicalTokens = medicalTokenizer.encode(medicalText);
```

### Batch Tokenization

Efficiently tokenize multiple texts:

```typescript
const tokenizer = await use.loadTokenizer();

const documents = [
  "Natural language processing enables computers to understand text.",
  "Deep learning models can generate human-like responses.",
  "Tokenization is the first step in text preprocessing."
];

// Tokenize all documents, encoding each one only once
const tokenizedDocs = documents.map(doc => {
  const tokens = tokenizer.encode(doc);
  return { text: doc, tokens, tokenCount: tokens.length };
});

console.log('Tokenized documents:', tokenizedDocs);
```
### Vocabulary Analysis

Explore the tokenizer's vocabulary:

```typescript
// Load the default vocabulary for inspection
const vocab = await use.loadVocabulary(
  'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder/vocab.json'
);

console.log('Vocabulary size:', vocab.length);
console.log('First 10 tokens:', vocab.slice(0, 10));

// Find specific tokens
const commonWords = vocab.filter(([token, score]) =>
  token.includes('▁the') || token.includes('▁and') || token.includes('▁is')
);
console.log('Common word tokens:', commonWords);
```
### Trie Data Structure

Internal trie (prefix tree) used by the tokenizer for efficient token matching during SentencePiece tokenization.

```typescript { .api }
class Trie {
  /**
   * Create a new trie with an empty root node
   */
  constructor();

  /**
   * Insert a token into the trie with its score and index
   * @param word - Token string to insert
   * @param score - Score associated with the token
   * @param index - Index of the token in the vocabulary
   */
  insert(word: string, score: number, index: number): void;

  /**
   * Find all vocabulary tokens that are prefixes of the given symbol sequence
   * @param symbols - Array of characters to match against
   * @returns Array of matches, each a [token characters, score, index] tuple
   */
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}
```

**Usage Example:**

```typescript
import { Trie, stringToChars } from '@tensorflow-models/universal-sentence-encoder';

// Create and populate a trie
const trie = new Trie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);

// Search for matches
const prefix = stringToChars('hel');
const matches = trie.commonPrefixSearch(prefix);
console.log('Matching tokens:', matches);
// Each match is a [token characters, score, index] tuple
```
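As a rough illustration of how such a trie can support these two operations, here is a minimal sketch. `SketchTrie` and its node shape are assumptions made for this example, not the library's actual internals.

```typescript
// Minimal character-level trie sketch. Each node stores its children plus,
// when it terminates a vocabulary token, that token's score and index.
interface TrieNode {
  children: Map<string, TrieNode>;
  end: boolean;
  score: number;
  index: number;
}

const makeNode = (): TrieNode =>
  ({ children: new Map(), end: false, score: 0, index: -1 });

class SketchTrie {
  private root = makeNode();

  insert(word: string, score: number, index: number): void {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, makeNode());
      node = node.children.get(ch)!;
    }
    node.end = true;
    node.score = score;
    node.index = index;
  }

  // Walk the symbols from the root, emitting every vocabulary token that is
  // a prefix of the symbol sequence.
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]> {
    const out: Array<[string[], number, number]> = [];
    let node = this.root;
    const prefix: string[] = [];
    for (const ch of symbols) {
      const next = node.children.get(ch);
      if (!next) break;
      prefix.push(ch);
      if (next.end) out.push([prefix.slice(), next.score, next.index]);
      node = next;
    }
    return out;
  }
}

const sketch = new SketchTrie();
sketch.insert('he', 1.5, 7);
sketch.insert('hello', 2.5, 8);
sketch.insert('help', 3.5, 9);
console.log(sketch.commonPrefixSearch(['h', 'e', 'l', 'l', 'o']));
// Matches 'he' and 'hello', but not 'help'
```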
### Utility Functions

Unicode-aware text processing utilities used internally by the tokenizer.

```typescript { .api }
/**
 * Convert a string to an array of Unicode characters, keeping surrogate pairs intact
 * @param input - String to convert to a character array
 * @returns Array of Unicode characters
 */
function stringToChars(input: string): string[];
```

**Usage Example:**

```typescript
import { stringToChars } from '@tensorflow-models/universal-sentence-encoder';

const text = "Hello 🌍!";
const chars = stringToChars(text);
console.log(chars); // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
```
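The reason a dedicated helper is needed: JavaScript's `.length` and `.split('')` operate on UTF-16 code units, so astral characters like 🌍 would split into two broken halves. A plausible one-line equivalent (an assumption about the behavior, not the library's actual source) uses `Array.from`, which iterates a string by Unicode code point:

```typescript
// Code-point-aware character splitting, as a stand-in for stringToChars.
const toChars = (input: string): string[] => Array.from(input);

console.log(toChars('Hello 🌍!').length); // 8 — code points
console.log('Hello 🌍!'.length);          // 9 — UTF-16 code units ('🌍' uses two)
console.log(toChars('🌍')[0]);            // '🌍' — the emoji survives intact
```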
## Types

```typescript { .api }
// Vocabulary format: array of [token_string, score] pairs
type Vocabulary = Array<[string, number]>;

class Tokenizer {
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  encode(input: string): number[];
}

// Internal trie structure used during tokenization
class Trie {
  insert(word: string, score: number, index: number): void;
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

// Utility functions
function stringToChars(input: string): string[];
```
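For concreteness, a `Vocabulary` value might look like the following. The entries and scores are made up for illustration; the ▁ character marks a word boundary in SentencePiece vocabularies:

```typescript
// Illustrative vocabulary entries: [token string, score] pairs.
const vocab: Array<[string, number]> = [
  ['▁hello', -7.2],
  ['▁world', -8.1],
  ['ing', -4.9],
];

const [token, score] = vocab[0];
console.log(token, score); // '▁hello' -7.2
```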
## Constants

```typescript { .api }
// Base URL for the default model assets, including the vocabulary
const BASE_PATH = 'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder';

// Default reserved symbol count
const RESERVED_SYMBOLS_COUNT = 6;

// Unicode separator character
const separator = '\u2581'; // ▁ LOWER ONE EIGHTH BLOCK
```
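The `separator` constant feeds into step 1 of the tokenization process above. A hedged sketch of that separator-insertion step, assuming the preprocessing is NFKC normalization plus space replacement (the exact details may differ from the library's implementation):

```typescript
// Sketch of input normalization: NFKC-normalize, then replace spaces with
// U+2581 and prepend one to mark the leading word boundary.
const separator = '\u2581';
const normalize = (input: string): string =>
  separator + input.normalize('NFKC').trim().replace(/ /g, separator);

console.log(normalize('Hello, world!')); // '▁Hello,▁world!'
```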