
# Text Tokenization

Independent tokenizer functionality using the SentencePiece algorithm for converting text into token sequences. The tokenizer can be used separately from the embedding models and supports custom vocabularies.

## Capabilities

### Load Tokenizer

Creates a tokenizer instance with the default or a custom vocabulary for text tokenization.

```typescript { .api }
/**
 * Load a tokenizer for independent use from the Universal Sentence Encoder.
 * @param pathToVocabulary - Optional path to a custom vocabulary file
 * @returns Promise that resolves to a Tokenizer instance
 */
function loadTokenizer(pathToVocabulary?: string): Promise<Tokenizer>;
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load with the default vocabulary
const tokenizer = await use.loadTokenizer();

// Load with a custom vocabulary
const customTokenizer = await use.loadTokenizer(
  'https://example.com/my-vocab.json'
);
```

### Tokenizer Class

SentencePiece tokenizer implementation that converts text strings into sequences of integer tokens using the Viterbi algorithm.

```typescript { .api }
class Tokenizer {
  /**
   * Create a tokenizer with vocabulary and symbol configuration.
   * @param vocabulary - Array of [token, score] pairs
   * @param reservedSymbolsCount - Number of reserved symbols (default: 6)
   */
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);

  /**
   * Tokenize an input string into an array of token IDs.
   * Uses the Viterbi algorithm to find the most likely token sequence.
   * @param input - String to tokenize
   * @returns Array of token IDs
   */
  encode(input: string): number[];
}
```

**Usage Examples:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Basic tokenization
const tokenizer = await use.loadTokenizer();
const tokens = tokenizer.encode('Hello, how are you?');
console.log(tokens); // [341, 4125, 8, 140, 31, 19, 54]

// Tokenize multiple strings
const sentences = [
  'Machine learning is fascinating.',
  'TensorFlow.js runs in browsers.',
  'Tokenization converts text to numbers.'
];

const allTokens = sentences.map(text => tokenizer.encode(text));
console.log('Tokenized sentences:', allTokens);
```

### Vocabulary Loading

Load vocabulary files for creating custom tokenizers.

```typescript { .api }
/**
 * Load a vocabulary from a remote URL or local path.
 * @param pathToVocabulary - URL or path to a vocabulary JSON file
 * @returns Promise that resolves to a vocabulary array
 */
function loadVocabulary(pathToVocabulary: string): Promise<Vocabulary>;
```

**Usage Example:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load a custom vocabulary
const vocab = await use.loadVocabulary('https://example.com/vocab.json');
const customTokenizer = new use.Tokenizer(vocab);

// Use the custom tokenizer
const tokens = customTokenizer.encode('Custom vocabulary example');
```

### Tokenization Process

The tokenizer follows the SentencePiece algorithm with these key steps:

1. **Input Normalization**: Unicode normalization (NFKC) and separator insertion
2. **Lattice Construction**: Build a lattice of candidate tokens using a trie data structure
3. **Viterbi Algorithm**: Find the most likely token sequence based on vocabulary scores
4. **Post-processing**: Merge consecutive unknown tokens and reverse the token order
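The lattice-and-Viterbi selection in steps 2–4 can be sketched without the package. This is a hedged, self-contained simplification: `bestTokenization` and `toyVocab` are hypothetical, there is no trie or unknown-token handling, and every vocabulary token is simply tried at each position while the highest-scoring path is kept.

```typescript
// Self-contained sketch of Viterbi-style token selection over a toy
// vocabulary. Simplified: the real tokenizer matches tokens with a trie and
// handles unknown characters; this version scans the whole vocabulary.
type Vocab = Array<[string, number]>;

function bestTokenization(input: string, vocab: Vocab): string[] {
  // best[i] = best total score for input.slice(0, i); back[i] = token ending at i
  const best: number[] = new Array(input.length + 1).fill(-Infinity);
  const back: string[] = new Array(input.length + 1).fill('');
  best[0] = 0;
  for (let i = 0; i < input.length; i++) {
    if (best[i] === -Infinity) continue;
    for (const [token, score] of vocab) {
      const j = i + token.length;
      if (input.startsWith(token, i) && best[i] + score > best[j]) {
        best[j] = best[i] + score;
        back[j] = token;
      }
    }
  }
  // Walk the back-pointers from the end, then reverse (compare step 4 above)
  const tokens: string[] = [];
  for (let i = input.length; i > 0; i -= back[i].length) tokens.push(back[i]);
  return tokens.reverse();
}

const toyVocab: Vocab = [
  ['\u2581hello', -0.5], ['\u2581hel', -1], ['lo', -1],
  ['h', -3], ['e', -3], ['l', -3], ['o', -3], ['\u2581', -2],
];
console.log(bestTokenization('\u2581hello', toyVocab)); // [ '▁hello' ]
```

Because the single whole-word entry scores better than any multi-token path, the Viterbi walk picks it over combinations like `'▁hel' + 'lo'`.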

**Example of tokenization steps:**

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

const tokenizer = await use.loadTokenizer();

// Original text
const text = "Hello, world!";

// Internal processing (for illustration):
// 1. Normalized: "▁Hello,▁world!"
// 2. Lattice: multiple possible token combinations
// 3. Viterbi: best path selection
// 4. Result: [341, 8, 126, 54]

const tokens = tokenizer.encode(text);
console.log('Final tokens:', tokens);
```
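The normalization step alone can be reproduced with standard string APIs. A minimal sketch, assuming normalization amounts to NFKC plus replacing spaces with the `'\u2581'` separator; `normalizeInput` is a hypothetical helper, not part of the package:

```typescript
// Sketch of input normalization: NFKC-normalize, then mark word boundaries
// with the '\u2581' separator character. `normalizeInput` is hypothetical.
const SEPARATOR = '\u2581';

function normalizeInput(input: string): string {
  return SEPARATOR + input.normalize('NFKC').trim().replace(/ /g, SEPARATOR);
}

console.log(normalizeInput('Hello, world!')); // "▁Hello,▁world!"
```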

## Advanced Usage

### Custom Vocabulary Integration

Create tokenizers with different vocabularies for specialized domains:

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load a domain-specific vocabulary
const medicalVocab = await use.loadVocabulary('https://example.com/medical-vocab.json');
const medicalTokenizer = new use.Tokenizer(medicalVocab);

// Tokenize medical text
const medicalText = "The patient shows symptoms of acute myocardial infarction.";
const medicalTokens = medicalTokenizer.encode(medicalText);
```

### Batch Tokenization

Efficiently tokenize multiple texts:

```typescript
import * as use from '@tensorflow-models/universal-sentence-encoder';

const tokenizer = await use.loadTokenizer();

const documents = [
  "Natural language processing enables computers to understand text.",
  "Deep learning models can generate human-like responses.",
  "Tokenization is the first step in text preprocessing."
];

// Tokenize all documents, encoding each one only once
const tokenizedDocs = documents.map(doc => {
  const tokens = tokenizer.encode(doc);
  return { text: doc, tokens, tokenCount: tokens.length };
});

console.log('Tokenized documents:', tokenizedDocs);
```

168

169

### Vocabulary Analysis

170

171

Explore the tokenizer's vocabulary:

172

173

```typescript

174

// Load vocabulary for inspection

175

const vocab = await use.loadVocabulary(

176

'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder/vocab.json'

177

);

178

179

console.log('Vocabulary size:', vocab.length);

180

console.log('First 10 tokens:', vocab.slice(0, 10));

181

182

// Find specific tokens

183

const commonWords = vocab.filter(([token, score]) =>

184

token.includes('▁the') || token.includes('▁and') || token.includes('▁is')

185

);

186

console.log('Common word tokens:', commonWords);

187

```

### Trie Data Structure

Internal trie (prefix tree) used by the tokenizer for efficient token matching during SentencePiece tokenization.

```typescript { .api }
class Trie {
  /**
   * Create a new trie with an empty root node.
   */
  constructor();

  /**
   * Insert a token into the trie with its score and vocabulary index.
   * @param word - Token string to insert
   * @param score - Score associated with the token
   * @param index - Index of the token in the vocabulary
   */
  insert(word: string, score: number, index: number): void;

  /**
   * Find all stored tokens that match a prefix of the input symbols.
   * @param symbols - Array of characters to match against
   * @returns Array of matches as [token characters, score, index] tuples
   */
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}
```

**Usage Example:**

```typescript
import { Trie, stringToChars } from '@tensorflow-models/universal-sentence-encoder';

// Create and populate a trie
const trie = new Trie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);

// Search for stored tokens matching a prefix of the input
const input = stringToChars('hello there');
const matches = trie.commonPrefixSearch(input);
console.log('Matching tokens:', matches);
// Each match is a [token characters, score, index] tuple
```
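To make the mechanics concrete, here is a hedged, self-contained sketch of such a trie. `MiniTrie` is a hypothetical simplification, not the package's class; it assumes `commonPrefixSearch` returns the stored tokens that are prefixes of the input, which is the form lattice construction needs.

```typescript
// Minimal prefix trie sketch. MiniTrie is hypothetical; it mirrors the
// documented shape: insert(word, score, index) and commonPrefixSearch(symbols).
interface TrieNode {
  children: Map<string, TrieNode>;
  // Set when a full token ends here: [token characters, score, vocab index]
  entry?: [string[], number, number];
}

class MiniTrie {
  private root: TrieNode = { children: new Map() };

  insert(word: string, score: number, index: number): void {
    let node = this.root;
    const chars = Array.from(word);
    for (const ch of chars) {
      let next = node.children.get(ch);
      if (!next) {
        next = { children: new Map() };
        node.children.set(ch, next);
      }
      node = next;
    }
    node.entry = [chars, score, index];
  }

  // Collect every stored token that is a prefix of `symbols`
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]> {
    const matches: Array<[string[], number, number]> = [];
    let node = this.root;
    for (const ch of symbols) {
      const next = node.children.get(ch);
      if (!next) break;
      node = next;
      if (node.entry) matches.push(node.entry);
    }
    return matches;
  }
}

const miniTrie = new MiniTrie();
miniTrie.insert('he', 1.0, 0);
miniTrie.insert('hell', 2.0, 1);
miniTrie.insert('hello', 3.0, 2);
console.log(miniTrie.commonPrefixSearch(Array.from('hello!')));
// → entries for 'he', 'hell', and 'hello'
```

During lattice construction this is called once per input position, so each position discovers all vocabulary tokens that could start there in a single walk down the trie.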

### Utility Functions

Unicode-aware text processing utilities used internally by the tokenizer.

```typescript { .api }
/**
 * Convert a string to an array of Unicode characters, handling
 * surrogate pairs correctly.
 * @param input - String to convert to a character array
 * @returns Array of Unicode characters
 */
function stringToChars(input: string): string[];
```

**Usage Example:**

```typescript
import { stringToChars } from '@tensorflow-models/universal-sentence-encoder';

const text = "Hello 🌍!";
const chars = stringToChars(text);
console.log(chars); // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
```
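If you only need this behavior without the package, modern JavaScript can reproduce it. A minimal sketch, assuming code-point splitting is the desired behavior (`splitChars` is a hypothetical stand-in):

```typescript
// Code-point-aware split: the string iterator yields whole code points,
// so surrogate pairs like '🌍' stay intact (unlike indexing by UTF-16 unit).
function splitChars(input: string): string[] {
  return Array.from(input);
}

console.log(splitChars('Hi 🌍')); // ['H', 'i', ' ', '🌍']
console.log('🌍'.length);         // 2 — UTF-16 units, which naive splitting would break on
```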

## Types

```typescript { .api }
// Vocabulary format: array of [token, score] pairs
type Vocabulary = Array<[string, number]>;

class Tokenizer {
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  encode(input: string): number[];
}

// Internal trie structure used during tokenization
class Trie {
  insert(word: string, score: number, index: number): void;
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

// Utility functions
function stringToChars(input: string): string[];
```

## Constants

```typescript { .api }
// Base URL for the default model assets, including the vocabulary
const BASE_PATH = 'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder';

// Default reserved symbol count
const RESERVED_SYMBOLS_COUNT = 6;

// Unicode separator character (U+2581, "Lower One Eighth Block")
const separator = '\u2581';
```