
tessl/npm-tensorflow-models--universal-sentence-encoder

Universal Sentence Encoder for generating text embeddings using TensorFlow.js


Text Tokenization

Standalone tokenizer functionality based on the SentencePiece algorithm for converting text into token sequences. The tokenizer can be used independently of the embedding models and supports custom vocabularies.

Capabilities

Load Tokenizer

Creates a tokenizer instance with the default or custom vocabulary for text tokenization.

/**
 * Load a tokenizer for independent use from the Universal Sentence Encoder
 * @param pathToVocabulary - Optional path to custom vocabulary file
 * @returns Promise that resolves to Tokenizer instance
 */
function loadTokenizer(pathToVocabulary?: string): Promise<Tokenizer>;

Usage Examples:

import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load with default vocabulary
const tokenizer = await use.loadTokenizer();

// Load with custom vocabulary
const customTokenizer = await use.loadTokenizer(
  'https://example.com/my-vocab.json'
);

Tokenizer Class

SentencePiece tokenizer implementation that converts text strings into sequences of integer tokens using the Viterbi algorithm.

class Tokenizer {
  /**
   * Create a tokenizer with vocabulary and symbol configuration
   * @param vocabulary - Array of [token, score] pairs
   * @param reservedSymbolsCount - Number of reserved symbols (default: 6)
   */
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  
  /**
   * Tokenize input string into array of token IDs
   * Uses Viterbi algorithm to find most likely token sequence
   * @param input - String to tokenize
   * @returns Array of token IDs
   */
  encode(input: string): number[];
}

Usage Examples:

import * as use from '@tensorflow-models/universal-sentence-encoder';

// Basic tokenization
const tokenizer = await use.loadTokenizer();
const tokens = tokenizer.encode('Hello, how are you?');
console.log(tokens); // [341, 4125, 8, 140, 31, 19, 54]

// Tokenize multiple strings
const sentences = [
  'Machine learning is fascinating.',
  'TensorFlow.js runs in browsers.',
  'Tokenization converts text to numbers.'
];

const allTokens = sentences.map(text => tokenizer.encode(text));
console.log('Tokenized sentences:', allTokens);

Vocabulary Loading

Load vocabulary files for creating custom tokenizers.

/**
 * Load vocabulary from a URL or local path
 * @param pathToVocabulary - URL or path to vocabulary JSON file
 * @returns Promise that resolves to vocabulary array
 */
function loadVocabulary(pathToVocabulary: string): Promise<Vocabulary>;

Usage Example:

import * as use from '@tensorflow-models/universal-sentence-encoder';

// Load custom vocabulary
const vocab = await use.loadVocabulary('https://example.com/vocab.json');
const customTokenizer = new use.Tokenizer(vocab);

// Use custom tokenizer
const tokens = customTokenizer.encode('Custom vocabulary example');

Tokenization Process

The tokenizer follows the SentencePiece algorithm with these key steps:

  1. Input Normalization: Unicode normalization (NFKC) and separator insertion
  2. Lattice Construction: Build token possibility lattice using Trie data structure
  3. Viterbi Algorithm: Find most likely token sequence based on vocabulary scores
  4. Post-processing: Merge consecutive unknown tokens and reverse token order

Example of tokenization steps:

const tokenizer = await use.loadTokenizer();

// Original text
const text = "Hello, world!";

// Internal processing (for illustration):
// 1. Normalized: "▁Hello,▁world!"
// 2. Lattice: Multiple possible token combinations
// 3. Viterbi: Best path selection
// 4. Result: [341, 8, 126, 54]

const tokens = tokenizer.encode(text);
console.log('Final tokens:', tokens);
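The lattice-and-Viterbi steps above can be sketched end to end with a toy vocabulary. Everything below (token set, scores, the `viterbiTokenize` helper) is invented for illustration and is not the library's implementation; it only demonstrates the idea of picking the highest-scoring segmentation:

```typescript
// Toy vocabulary of [token, logScore] pairs (higher = more likely).
const toyVocab: Array<[string, number]> = [
  ['▁hello', 0.0],
  ['▁', -2.0],
  ['h', -5.0], ['e', -5.0], ['l', -5.0], ['o', -5.0],
  ['he', -3.0], ['llo', -3.0],
];
const scores = new Map(toyVocab);

// best[i] = best total score for the first i characters of the input;
// backPointer[i] = last token on that best path.
function viterbiTokenize(input: string): string[] {
  const n = input.length;
  const best = new Array<number>(n + 1).fill(-Infinity);
  const backPointer = new Array<string | null>(n + 1).fill(null);
  best[0] = 0;
  for (let i = 0; i < n; i++) {
    if (best[i] === -Infinity) continue;
    // Lattice step: try every vocabulary token that matches at position i.
    for (const [token, score] of scores) {
      if (input.startsWith(token, i)) {
        const j = i + token.length;
        if (best[i] + score > best[j]) {
          best[j] = best[i] + score;
          backPointer[j] = token;
        }
      }
    }
  }
  // Walk the back pointers from the end, then reverse to recover order.
  const tokens: string[] = [];
  let pos = n;
  while (pos > 0) {
    const tok = backPointer[pos];
    if (tok === null) throw new Error('no tokenization found');
    tokens.push(tok);
    pos -= tok.length;
  }
  return tokens.reverse();
}

console.log(viterbiTokenize('▁hello')); // ['▁hello'] — whole-token path beats character paths
```

The single token `▁hello` (score 0) wins over any multi-token split such as `▁` + `he` + `llo` (score −8), which is exactly the trade-off the real vocabulary scores encode.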

Advanced Usage

Custom Vocabulary Integration

Create tokenizers with different vocabularies for specialized domains:

// Load domain-specific vocabulary
const medicalVocab = await use.loadVocabulary('https://example.com/medical-vocab.json');
const medicalTokenizer = new use.Tokenizer(medicalVocab);

// Tokenize medical text
const medicalText = "The patient shows symptoms of acute myocardial infarction.";
const medicalTokens = medicalTokenizer.encode(medicalText);

Batch Tokenization

Efficiently tokenize multiple texts:

const tokenizer = await use.loadTokenizer();

const documents = [
  "Natural language processing enables computers to understand text.",
  "Deep learning models can generate human-like responses.",
  "Tokenization is the first step in text preprocessing."
];

// Tokenize all documents
const tokenizedDocs = documents.map(doc => {
  const tokens = tokenizer.encode(doc); // encode once, reuse for the count
  return { text: doc, tokens, tokenCount: tokens.length };
});

console.log('Tokenized documents:', tokenizedDocs);

Vocabulary Analysis

Explore the tokenizer's vocabulary:

// Load vocabulary for inspection
const vocab = await use.loadVocabulary(
  'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder/vocab.json'
);

console.log('Vocabulary size:', vocab.length);
console.log('First 10 tokens:', vocab.slice(0, 10));

// Find specific tokens
const commonWords = vocab.filter(([token, score]) => 
  token.includes('▁the') || token.includes('▁and') || token.includes('▁is')
);
console.log('Common word tokens:', commonWords);

Trie Data Structure

Internal trie (prefix tree) data structure used by the tokenizer for efficient token matching during the SentencePiece tokenization process.

class Trie {
  /**
   * Create a new trie with an empty root node
   */
  constructor();
  
  /**
   * Insert a token into the trie with its score and index
   * @param word - Token string to insert
   * @param score - Score associated with the token
   * @param index - Index of the token in vocabulary
   */
  insert(word: string, score: number, index: number): void;
  
  /**
   * Find all stored tokens that are prefixes of the given character sequence
   * @param symbols - Array of characters to scan from the start
   * @returns Array of matching tokens with their data [token chars, score, index]
   */
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

Usage Example:

import { Trie, stringToChars } from '@tensorflow-models/universal-sentence-encoder';

// Create and populate a trie
const trie = new Trie();
trie.insert('hello', 10.5, 100);
trie.insert('help', 8.2, 101);
trie.insert('helicopter', 5.1, 102);

// Search for matches: returns stored tokens that are prefixes of the input
const symbols = stringToChars('helping');
const matches = trie.commonPrefixSearch(symbols);
console.log('Matching tokens:', matches);
// e.g. [[['h', 'e', 'l', 'p'], 8.2, 101]] — only 'help' is a prefix of 'helping'
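For intuition, the prefix search can be sketched with a minimal self-contained trie. `MiniTrie` below is a simplified stand-in written for this doc, not the library's source; it mirrors the `insert`/`commonPrefixSearch` interface shown above:

```typescript
type TrieMatch = [string[], number, number]; // [token chars, score, vocab index]

interface TrieNode {
  children: Map<string, TrieNode>;
  // Set when a vocabulary token ends at this node.
  entry?: { word: string[]; score: number; index: number };
}

class MiniTrie {
  private root: TrieNode = { children: new Map() };

  insert(word: string, score: number, index: number): void {
    let node = this.root;
    const chars = [...word];
    for (const ch of chars) {
      if (!node.children.has(ch)) node.children.set(ch, { children: new Map() });
      node = node.children.get(ch)!;
    }
    node.entry = { word: chars, score, index };
  }

  // Walk the input left to right, collecting every stored token that
  // is a prefix of `symbols`.
  commonPrefixSearch(symbols: string[]): TrieMatch[] {
    const out: TrieMatch[] = [];
    let node: TrieNode = this.root;
    for (const ch of symbols) {
      const next = node.children.get(ch);
      if (!next) break;
      node = next;
      if (node.entry) out.push([node.entry.word, node.entry.score, node.entry.index]);
    }
    return out;
  }
}

const t = new MiniTrie();
t.insert('hello', 10.5, 100);
t.insert('help', 8.2, 101);
t.insert('helicopter', 5.1, 102);

// Only 'help' is a prefix of 'helping', so one match comes back.
console.log(t.commonPrefixSearch([...'helping']));
```

This prefix-of-input behavior is what lattice construction needs: at each position in the text, the tokenizer asks "which vocabulary tokens could start here?"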

Utility Functions

Unicode-aware text processing utilities used internally by the tokenizer.

/**
 * Convert string to array of unicode characters with proper handling
 * @param input - String to convert to character array
 * @returns Array of unicode characters
 */
function stringToChars(input: string): string[];

Usage Example:

import { stringToChars } from '@tensorflow-models/universal-sentence-encoder';

const text = "Hello 🌍!";
const chars = stringToChars(text);
console.log(chars); // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
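This kind of splitting can be reproduced with plain string iteration, which walks Unicode code points rather than UTF-16 units, so surrogate pairs like the emoji stay intact. The `toChars` helper below is a sketch written for this doc, not the package's source:

```typescript
// Spreading (or for...of) iterates a string by code point, so '🌍'
// (a surrogate pair, two UTF-16 units) is kept as a single element.
function toChars(input: string): string[] {
  return [...input];
}

const text = 'Hello 🌍!';
console.log(toChars(text));        // ['H', 'e', 'l', 'l', 'o', ' ', '🌍', '!']
console.log(text.length);          // 9 — UTF-16 units
console.log(toChars(text).length); // 8 — code points
```

Naive indexing (`text[6]`, `text.split('')`) would split the emoji into two unpaired surrogates, which is why a code-point-aware conversion matters for tokenization.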

Types

// Vocabulary format: array of [token_string, score] pairs
type Vocabulary = Array<[string, number]>;

class Tokenizer {
  constructor(vocabulary: Vocabulary, reservedSymbolsCount?: number);
  encode(input: string): number[];
}

// Internal Trie structure for tokenization
class Trie {
  insert(word: string, score: number, index: number): void;
  commonPrefixSearch(symbols: string[]): Array<[string[], number, number]>;
}

// Utility functions
function stringToChars(input: string): string[];

Constants

// Default vocabulary URL
const BASE_PATH = 'https://storage.googleapis.com/tfjs-models/savedmodel/universal_sentence_encoder';

// Default reserved symbol count
const RESERVED_SYMBOLS_COUNT = 6;

// Unicode separator character
const separator = '\u2581'; // Lower one eighth block
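The separator constant ties back to step 1 of the tokenization process: spaces are replaced with `▁` and a leading separator is prepended after NFKC normalization. The `normalize` helper below is an illustrative sketch of that step, not the library's exact implementation:

```typescript
const SEPARATOR = '\u2581'; // '▁', lower one eighth block

// Sketch of SentencePiece-style input normalization: NFKC-normalize,
// then mark word boundaries with the separator instead of spaces.
function normalize(input: string): string {
  return SEPARATOR + input.normalize('NFKC').replace(/ /g, SEPARATOR);
}

console.log(normalize('Hello, world!')); // '▁Hello,▁world!'
```

This matches the intermediate form shown in the tokenization-process example above (`"▁Hello,▁world!"`); encoding spaces as ordinary vocabulary symbols is what lets the tokenizer recover word boundaries losslessly.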
