# Text Tokenization

Standalone tokenizer functionality using the SentencePiece algorithm to convert text into token sequences. The tokenizer can be used separately from the embedding models and supports custom vocabularies.

## Capabilities

### Load Tokenizer

Creates a tokenizer instance with the default or a custom vocabulary.

```typescript { .api }
/**
 * Load a tokenizer for independent use from the Universal Sentence Encoder
 * @param pathToVocabulary - Optional path to a custom vocabulary file
 * @returns Promise that resolves to a Tokenizer instance
 */
function loadTokenizer(pathToVocabulary?: string): Promise<Tokenizer>;
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load with the default vocabulary
const tokenizer = await use.loadTokenizer();

// Load with a custom vocabulary
const customTokenizer = await use.loadTokenizer(
  'https://example.com/my-vocab.json'
);
```
### Tokenizer Class

SentencePiece tokenizer implementation that converts text strings into sequences of integer token IDs, using the Viterbi algorithm to pick the most likely segmentation.

```typescript { .api }
class Tokenizer {
  /**
   * Create a tokenizer with vocabulary and symbol configuration
   * @param vocabulary - Array of [token, score] pairs
   * @param reservedSymbolsCount - Number of reserved symbols (default: 6)
   */
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);

  /**
   * Tokenize an input string into an array of token IDs
   * Uses the Viterbi algorithm to find the most likely token sequence
   * @param input - String to tokenize
   * @returns Array of token IDs
   */
  encode(input: string): number[];
}
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Basic tokenization
const tokenizer = await use.loadTokenizer();
const tokens = tokenizer.encode('Hello, how are you?');
console.log(tokens); // e.g. [341, 4125, 8, 140, 31, 19, 54] — exact IDs depend on the vocabulary

// Tokenize multiple strings
const sentences = [
  'Machine learning is fascinating.',
  'TensorFlow.js runs in browsers.',
  'Tokenization converts text to numbers.'
];

const allTokens = sentences.map(text => tokenizer.encode(text));
console.log('Tokenized sentences:', allTokens);
```
### Vocabulary Loading

Load vocabulary files for creating custom tokenizers.

```typescript { .api }
/**
 * Load a vocabulary from a URL or local path
 * @param pathToVocabulary - URL or path to a vocabulary JSON file
 * @returns Promise that resolves to a vocabulary array
 */
function loadVocabulary(pathToVocabulary: string): Promise<Vocabulary>;
```

**Usage Example:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load a custom vocabulary
const vocab = await use.loadVocabulary('https://example.com/vocab.json');
const customTokenizer = new use.Tokenizer(vocab);

// Use the custom tokenizer
const tokens = customTokenizer.encode('Custom vocabulary example');
```
### Tokenization Process

The tokenizer follows the SentencePiece algorithm with these key steps:

1. **Input Normalization**: Unicode normalization (NFKC) and separator insertion
2. **Lattice Construction**: Build a lattice of candidate tokens using a trie data structure
3. **Viterbi Algorithm**: Find the most likely token sequence based on vocabulary scores
4. **Post-processing**: Merge consecutive unknown tokens and reverse the token order

**Example of tokenization steps:**

```typescript
const tokenizer = await use.loadTokenizer();

// Original text
const text = "Hello, world!";

// Internal processing (for illustration):
// 1. Normalized: "▁Hello,▁world!"
// 2. Lattice: multiple possible token combinations
// 3. Viterbi: best path selection
// 4. Result: an array of token IDs, e.g. [341, 8, 126, 54]

const tokens = tokenizer.encode(text);
console.log('Final tokens:', tokens);
```
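The Viterbi step (step 3) can be sketched independently of the library. Everything below is illustrative: `viterbiSegment`, the toy vocabulary, and its scores are made up for this sketch; real SentencePiece scores are log-probabilities learned during training.

```typescript
// Toy Viterbi segmentation over a [token, score] vocabulary.
// best[i] holds the best total score for input[0..i); back[i] records the
// token that ends at position i on that best path.
type Vocab = Array<[string, number]>;

function viterbiSegment(input: string, vocab: Vocab): string[] {
  const scores = new Map<string, number>(vocab);
  const n = input.length;
  const best: number[] = new Array(n + 1).fill(-Infinity);
  const back: string[] = new Array(n + 1).fill('');
  best[0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 0; j < i; j++) {
      const piece = input.slice(j, i);
      const s = scores.get(piece);
      if (s !== undefined && best[j] + s > best[i]) {
        best[i] = best[j] + s;
        back[i] = piece;
      }
    }
  }
  // Backtrack from the end, then reverse — mirroring step 4 above.
  // (A real tokenizer also handles characters not covered by any token.)
  const pieces: string[] = [];
  for (let i = n; i > 0; i -= back[i].length) pieces.push(back[i]);
  return pieces.reverse();
}

// Splitting wins over the whole-word token here because -1 + -1 > -3.
const toyVocab: Vocab = [['he', -1], ['llo', -1], ['hello', -3], ['l', -2], ['o', -2]];
console.log(viterbiSegment('hello', toyVocab)); // ['he', 'llo']
```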
## Advanced Usage

### Custom Vocabulary Integration

Create tokenizers with different vocabularies for specialized domains:

```typescript
// Load a domain-specific vocabulary
const medicalVocab = await use.loadVocabulary('https://example.com/medical-vocab.json');
const medicalTokenizer = new use.Tokenizer(medicalVocab);

// Tokenize medical text
const medicalText = "The patient shows symptoms of acute myocardial infarction.";
const medicalTokens = medicalTokenizer.encode(medicalText);
```

### Batch Tokenization

Efficiently tokenize multiple texts:

```typescript
const tokenizer = await use.loadTokenizer();

const documents = [
  "Natural language processing enables computers to understand text.",
  "Deep learning models can generate human-like responses.",
  "Tokenization is the first step in text preprocessing."
];

// Tokenize all documents, encoding each one only once
const tokenizedDocs = documents.map(doc => {
  const tokens = tokenizer.encode(doc);
  return { text: doc, tokens, tokenCount: tokens.length };
});

console.log('Tokenized documents:', tokenizedDocs);
```
### Vocabulary Analysis

Explore the tokenizer's vocabulary:

```typescript
// Load the default vocabulary for inspection
const vocab = await use.loadVocabulary(
  'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder/vocab.json'
);

console.log('Vocabulary size:', vocab.length);
console.log('First 10 tokens:', vocab.slice(0, 10));

// Find specific tokens
const commonWords = vocab.filter(([token, score]) =>
  token.includes('▁the') || token.includes('▁and') || token.includes('▁is')
);
console.log('Common word tokens:', commonWords);
```
### Trie Data Structure

Internal trie (prefix tree) used by the tokenizer for efficient token matching during SentencePiece tokenization.

```typescript { .api }
class Trie {
  /**
   * Create a new trie with an empty root node
   */
  constructor();

  /**
   * Insert a token into the trie with its score and index
   * @param word - Token string to insert
   * @param score - Score associated with the token
   * @param index - Index of the token in the vocabulary
   */
  insert(word: string, score: number, index: number): void;

  /**
   * Find all vocabulary tokens that are prefixes of the given symbol sequence
   * @param symbols - Array of characters to match against
   * @returns Array of matches, each a [token characters, score, index] tuple
   */
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}
```

**Usage Example:**

```typescript
import { Trie, stringToChars } from '@tensorflow-models/universal-sentence-encoder';

// Create and populate a trie
const trie = new Trie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);

// Search for matches
const prefix = stringToChars('hel');
const matches = trie.commonPrefixSearch(prefix);
console.log('Matching tokens:', matches);
// Each match is a [token characters, score, index] tuple
```
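As a rough illustration of how such a trie can support these two operations, here is a minimal sketch. `SketchTrie` and its node shape are assumptions made for this example, not the library's actual internals.

```typescript
// Minimal character-level trie sketch. Each node stores its children plus,
// when it terminates a vocabulary token, that token's score and index.
interface TrieNode {
  children: Map<string, TrieNode>;
  end: boolean;
  score: number;
  index: number;
}

const makeNode = (): TrieNode =>
  ({ children: new Map(), end: false, score: 0, index: -1 });

class SketchTrie {
  private root = makeNode();

  insert(word: string, score: number, index: number): void {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) node.children.set(ch, makeNode());
      node = node.children.get(ch)!;
    }
    node.end = true;
    node.score = score;
    node.index = index;
  }

  // Walk the symbols from the root, emitting every vocabulary token that is
  // a prefix of the symbol sequence.
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]> {
    const out: Array<[string[], number, number]> = [];
    let node = this.root;
    const prefix: string[] = [];
    for (const ch of symbols) {
      const next = node.children.get(ch);
      if (!next) break;
      prefix.push(ch);
      if (next.end) out.push([prefix.slice(), next.score, next.index]);
      node = next;
    }
    return out;
  }
}

const sketch = new SketchTrie();
sketch.insert('he', 1.5, 7);
sketch.insert('hello', 2.5, 8);
sketch.insert('help', 3.5, 9);
console.log(sketch.commonPrefixSearch(['h', 'e', 'l', 'l', 'o']));
// Matches 'he' and 'hello', but not 'help'
```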
### Utility Functions

Unicode-aware text processing utilities used internally by the tokenizer.

```typescript { .api }
/**
 * Convert a string to an array of Unicode characters, keeping surrogate pairs intact
 * @param input - String to convert to a character array
 * @returns Array of Unicode characters
 */
function stringToChars(input: string): string[];
```

**Usage Example:**

```typescript
import { stringToChars } from '@tensorflow-models/universal-sentence-encoder';

const text = "Hello 🌍!";
const chars = stringToChars(text);
console.log(chars); // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
```
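The reason a dedicated helper is needed: JavaScript's `.length` and `.split('')` operate on UTF-16 code units, so astral characters like 🌍 would split into two broken halves. A plausible one-line equivalent (an assumption about the behavior, not the library's actual source) uses `Array.from`, which iterates a string by Unicode code point:

```typescript
// Code-point-aware character splitting, as a stand-in for stringToChars.
const toChars = (input: string): string[] => Array.from(input);

console.log(toChars('Hello 🌍!').length); // 8 — code points
console.log('Hello 🌍!'.length);          // 9 — UTF-16 code units ('🌍' uses two)
console.log(toChars('🌍')[0]);            // '🌍' — the emoji survives intact
```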
## Types

```typescript { .api }
// Vocabulary format: array of [token_string, score] pairs
type Vocabulary = Array<[string, number]>;

class Tokenizer {
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  encode(input: string): number[];
}

// Internal trie structure used during tokenization
class Trie {
  insert(word: string, score: number, index: number): void;
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

// Utility functions
function stringToChars(input: string): string[];
```
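For concreteness, a `Vocabulary` value might look like the following. The entries and scores are made up for illustration; the ▁ character marks a word boundary in SentencePiece vocabularies:

```typescript
// Illustrative vocabulary entries: [token string, score] pairs.
const vocab: Array<[string, number]> = [
  ['▁hello', -7.2],
  ['▁world', -8.1],
  ['ing', -4.9],
];

const [token, score] = vocab[0];
console.log(token, score); // '▁hello' -7.2
```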
## Constants

```typescript { .api }
// Base URL for the default model assets, including the vocabulary
const BASE_PATH = 'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder';

// Default reserved symbol count
const RESERVED_SYMBOLS_COUNT = 6;

// Unicode separator character
const separator = '\u2581'; // ▁ LOWER ONE EIGHTH BLOCK
```
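The `separator` constant feeds into step 1 of the tokenization process above. A hedged sketch of that separator-insertion step, assuming the preprocessing is NFKC normalization plus space replacement (the exact details may differ from the library's implementation):

```typescript
// Sketch of input normalization: NFKC-normalize, then replace spaces with
// U+2581 and prepend one to mark the leading word boundary.
const separator = '\u2581';
const normalize = (input: string): string =>
  separator + input.normalize('NFKC').trim().replace(/ /g, separator);

console.log(normalize('Hello, world!')); // '▁Hello,▁world!'
```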