Universal Sentence Encoder for generating text embeddings using TensorFlow.js
---
Independent tokenizer functionality using the SentencePiece algorithm for converting text into token sequences. The tokenizer can be used separately from the embedding models and supports custom vocabularies.
Creates a tokenizer instance with the default or custom vocabulary for text tokenization.
```ts
/**
 * Load a tokenizer for independent use from the Universal Sentence Encoder
 * @param pathToVocabulary - Optional path to custom vocabulary file
 * @returns Promise that resolves to Tokenizer instance
 */
function loadTokenizer(pathToVocabulary?: string): Promise<Tokenizer>;
```

Usage Examples:
```js
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load with default vocabulary
const tokenizer = await use.loadTokenizer();

// Load with custom vocabulary
const customTokenizer = await use.loadTokenizer(
  'https://example.com/my-vocab.json'
);
```

SentencePiece tokenizer implementation that converts text strings into sequences of integer tokens using the Viterbi algorithm.
```ts
class Tokenizer {
  /**
   * Create a tokenizer with vocabulary and symbol configuration
   * @param vocabulary - Array of [token, score] pairs
   * @param reservedSymbolsCount - Number of reserved symbols (default: 6)
   */
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);

  /**
   * Tokenize input string into array of token IDs.
   * Uses the Viterbi algorithm to find the most likely token sequence.
   * @param input - String to tokenize
   * @returns Array of token IDs
   */
  encode(input: string): number[];
}
```

Usage Examples:
```js
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Basic tokenization
const tokenizer = await use.loadTokenizer();
const tokens = tokenizer.encode('Hello, how are you?');
console.log(tokens); // [341, 4125, 8, 140, 31, 19, 54]

// Tokenize multiple strings
const sentences = [
  'Machine learning is fascinating.',
  'TensorFlow.js runs in browsers.',
  'Tokenization converts text to numbers.'
];
const allTokens = sentences.map(text => tokenizer.encode(text));
console.log('Tokenized sentences:', allTokens);
```

Load vocabulary files for creating custom tokenizers.
```ts
/**
 * Load vocabulary from a remote URL
 * @param pathToVocabulary - URL or path to vocabulary JSON file
 * @returns Promise that resolves to vocabulary array
 */
function loadVocabulary(pathToVocabulary: string): Promise<Vocabulary>;
```

Usage Example:
```js
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load custom vocabulary
const vocab = await use.loadVocabulary('https://example.com/vocab.json');
const customTokenizer = new use.Tokenizer(vocab);

// Use custom tokenizer
const tokens = customTokenizer.encode('Custom vocabulary example');
```

The tokenizer follows the SentencePiece algorithm: the input is normalized, a lattice of candidate token combinations is built over it, and the Viterbi algorithm selects the best-scoring path.
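The Viterbi step can be sketched from scratch over a toy vocabulary. This is a simplified illustration, not the package's implementation: all tokens and scores below are invented, normalization is assumed already done, and candidates are found by brute force rather than a trie.

```js
// Toy unigram vocabulary of [token, log-probability] pairs (values invented).
const vocab = new Map([
  ['▁hello', -1.0],
  ['▁world', -1.2],
  ['▁', -3.0],
  ['h', -6.0], ['e', -6.0], ['l', -6.0], ['o', -6.0],
  ['w', -6.0], ['r', -6.0], ['d', -6.0],
]);

function viterbiSegment(text) {
  const chars = Array.from(text);
  const n = chars.length;
  // best[i] = best total score for segmenting chars[0..i);
  // back[i] = the token that ends at position i on the best path.
  const best = new Array(n + 1).fill(-Infinity);
  const back = new Array(n + 1).fill(null);
  best[0] = 0;
  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const piece = chars.slice(start, end).join('');
      const score = vocab.get(piece);
      if (score !== undefined && best[start] + score > best[end]) {
        best[end] = best[start] + score;
        back[end] = { start, piece };
      }
    }
  }
  // Walk the back pointers to recover the best token sequence.
  const tokens = [];
  for (let i = n; i > 0; i = back[i].start) tokens.unshift(back[i].piece);
  return tokens;
}

console.log(viterbiSegment('▁hello▁world')); // ['▁hello', '▁world']
```

The whole-word pieces score far better than spelling the text out character by character, so Viterbi picks the two-token segmentation.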
Example of tokenization steps:

```js
const tokenizer = await use.loadTokenizer();

// Original text
const text = "Hello, world!";

// Internal processing (for illustration):
// 1. Normalized: "▁Hello,▁world!"
// 2. Lattice: Multiple possible token combinations
// 3. Viterbi: Best path selection
// 4. Result: [341, 8, 126, 54]
const tokens = tokenizer.encode(text);
console.log('Final tokens:', tokens);
```

Create tokenizers with different vocabularies for specialized domains:
```js
// Load domain-specific vocabulary
const medicalVocab = await use.loadVocabulary('https://example.com/medical-vocab.json');
const medicalTokenizer = new use.Tokenizer(medicalVocab);

// Tokenize medical text
const medicalText = "The patient shows symptoms of acute myocardial infarction.";
const medicalTokens = medicalTokenizer.encode(medicalText);
```

Efficiently tokenize multiple texts:
```js
const tokenizer = await use.loadTokenizer();
const documents = [
  "Natural language processing enables computers to understand text.",
  "Deep learning models can generate human-like responses.",
  "Tokenization is the first step in text preprocessing."
];

// Tokenize all documents, encoding each document only once
const tokenizedDocs = documents.map(doc => {
  const tokens = tokenizer.encode(doc);
  return { text: doc, tokens, tokenCount: tokens.length };
});
console.log('Tokenized documents:', tokenizedDocs);
```

Explore the tokenizer's vocabulary:
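Before loading the real vocabulary file, the shape of the data can be illustrated with a hand-built toy vocabulary. The tokens and scores below are invented for illustration only; the real file is much larger.

```js
// Toy stand-in for the Vocabulary type: an array of [token, score] pairs.
// All values below are invented for illustration.
const toyVocab = [
  ['▁the', -2.1],
  ['▁and', -2.5],
  ['ing', -3.0],
  ['▁token', -5.2],
];

// The same kind of inspection works on the real vocabulary:
console.log('Vocabulary size:', toyVocab.length);

// '\u2581' ('▁') marks word-initial pieces in SentencePiece vocabularies.
const wordInitial = toyVocab.filter(([token]) => token.startsWith('\u2581'));
console.log('Word-initial tokens:', wordInitial.length);
```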
```js
// Load vocabulary for inspection
const vocab = await use.loadVocabulary(
  'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder/vocab.json'
);
console.log('Vocabulary size:', vocab.length);
console.log('First 10 tokens:', vocab.slice(0, 10));

// Find specific tokens
const commonWords = vocab.filter(([token, score]) =>
  token.includes('▁the') || token.includes('▁and') || token.includes('▁is')
);
console.log('Common word tokens:', commonWords);
```

Internal trie (prefix tree) data structure used by the tokenizer for efficient token matching during the SentencePiece tokenization process.
```ts
class Trie {
  /**
   * Create a new trie with an empty root node
   */
  constructor();

  /**
   * Insert a token into the trie with its score and index
   * @param word - Token string to insert
   * @param score - Score associated with the token
   * @param index - Index of the token in vocabulary
   */
  insert(word: string, score: number, index: number): void;

  /**
   * Find all tokens that start with the given prefix
   * @param symbols - Array of characters to match as prefix
   * @returns Array of matching tokens with their data [token, score, index]
   */
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}
```

Usage Example:
```js
import { Trie, stringToChars } from '@tensorflow-models/universal-sentence-encoder';

// Create and populate a trie
const trie = new Trie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);

// Search for matches
const prefix = stringToChars('hel');
const matches = trie.commonPrefixSearch(prefix);
console.log('Matching tokens:', matches);
```
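To make the data structure concrete, here is a from-scratch sketch of a scoring trie with prefix search. It mirrors the interface documented above but is not the package's internal implementation; the class name and node layout are invented for illustration.

```js
// Minimal scoring trie sketch (illustrative, not the package internals).
class SketchTrie {
  constructor() {
    this.root = { children: new Map(), data: null };
  }

  // Store [characters, score, index] at the node ending the token.
  insert(word, score, index) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), data: null });
      }
      node = node.children.get(ch);
    }
    node.data = [Array.from(word), score, index];
  }

  // Collect every inserted token that starts with the given prefix.
  commonPrefixSearch(symbols) {
    let node = this.root;
    for (const ch of symbols) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    const matches = [];
    const stack = [node];
    while (stack.length > 0) {
      const current = stack.pop();
      if (current.data) matches.push(current.data);
      for (const child of current.children.values()) stack.push(child);
    }
    return matches;
  }
}

const trie = new SketchTrie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);
console.log(trie.commonPrefixSearch(['h', 'e', 'l']).length); // 3 matches
```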
Each match returned by `commonPrefixSearch` is a `[token characters, score, vocabulary index]` tuple, per the declared return type.

Unicode-aware text processing utilities used internally by the tokenizer.
```ts
/**
 * Convert string to array of unicode characters with proper handling
 * @param input - String to convert to character array
 * @returns Array of unicode characters
 */
function stringToChars(input: string): string[];
```

Usage Example:
```js
import { stringToChars } from '@tensorflow-models/universal-sentence-encoder';

const text = "Hello 🌍!";
const chars = stringToChars(text);
console.log(chars); // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
```

The vocabulary format is an array of [token_string, score] pairs:
```ts
type Vocabulary = Array<[string, number]>;

class Tokenizer {
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  encode(input: string): number[];
}

// Internal Trie structure for tokenization
class Trie {
  insert(word: string, score: number, index: number): void;
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

// Utility functions
function stringToChars(input: string): string[];

// Default vocabulary URL
const BASE_PATH = 'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder';

// Default reserved symbol count
const RESERVED_SYMBOLS_COUNT = 6;

// Unicode separator character
const separator = '\u2581'; // Lower one eighth block
```

Install with Tessl CLI
```sh
npx tessl i tessl/npm-tensorflow-models--universal-sentence-encoder
```