Training System

The Training System creates custom language models from text documents. It handles tokenization, n-gram analysis, embedding generation, and model optimization through a multi-metric training pipeline.

Capabilities

Model Training

Core training functionality that processes text documents to create embeddings and n-gram structures for prediction.

/**
 * Train model on dataset with full embedding generation
 * @param {Object} dataset - Training dataset configuration
 * @param {string} dataset.name - Dataset identifier for saving embeddings
 * @param {string[]} dataset.files - Document filenames (without .txt extension)
 * @returns {Promise<void>} Resolves when training and context creation have finished
 */
train(dataset);

Usage Examples:

// Train on custom dataset
await model.train({
  name: 'shakespeare',
  files: ['hamlet', 'macbeth', 'othello', 'king-lear']
});

// Train on single document
await model.train({
  name: 'technical-docs',
  files: ['api-documentation']
});

// Train on mixed content
await model.train({
  name: 'mixed-content',
  files: [
    'news-articles',
    'scientific-papers',
    'literature-excerpts',
    'chat-conversations'
  ]
});
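
Once training finishes, the embeddings for the dataset are persisted as JSON (see Persistence under Training Process and the directory layout under Document Management). A minimal check, assuming the training/embeddings/<name>.json path shown later in this document:

const fs = require('fs').promises;

// After the 'shakespeare' training call above completes, its embeddings
// should exist on disk (path assumed from the Document Management layout).
const stats = await fs.stat('./training/embeddings/shakespeare.json');
console.log(`Embeddings saved (${stats.size} bytes)`);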

Text Ingestion

Ingests raw text directly, bypassing file-based training.

/**
 * Ingest text directly for processing
 * @param {string} text - Raw text content to process
 */
ingest(text);

Usage Examples:

// Ingest direct text content
const textContent = `
  This is sample text for training.
  The model will learn token relationships.
  Multiple sentences provide better context.
`;
model.ingest(textContent);

// Use with external text sources
const webContent = await fetch('https://example.com/articles.txt').then(r => r.text());
model.ingest(webContent);

Context Creation

Creates an in-memory model context from pre-computed embeddings, enabling fast model initialization.

/**
 * Create model context from embeddings
 * @param {Object} embeddings - Pre-computed token embeddings
 */
createContext(embeddings);

Usage Examples:

// Load and use pre-computed embeddings
const fs = require('fs').promises;
const embeddingsData = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);
model.createContext(embeddingsData);

// Use embeddings from training data object
const trainingData = {
  text: "Combined training text...",
  embeddings: { /* pre-computed embeddings */ }
};
model.createContext(trainingData.embeddings);
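
A common initialization pattern, sketched under the assumption that embeddings saved by train() can be reloaded on later runs: reuse the persisted file when it exists, and fall back to a full training pass otherwise.

const fs = require('fs').promises;

const datasetName = 'technical-docs';
const embeddingsPath = `./training/embeddings/${datasetName}.json`;

try {
  // Fast path: rebuild the context from embeddings saved by an earlier run.
  const embeddings = JSON.parse(await fs.readFile(embeddingsPath, 'utf8'));
  model.createContext(embeddings);
} catch (err) {
  // Slow path: no saved embeddings yet, so train from the source documents.
  await model.train({ name: datasetName, files: ['api-documentation'] });
}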

Training Metrics and Embeddings

The training system generates high-dimensional embeddings (144 dimensions by default) using multiple analysis metrics:

Embedding Dimensions

/**
 * Training embedding structure with multiple analysis metrics
 */
interface EmbeddingDimensions {
  // Character composition analysis (66 dimensions)
  characterDistribution: number[];  // Distribution of alphanumeric characters

  // Grammatical analysis (36 dimensions)
  partOfSpeech: number[];          // Part-of-speech tag probabilities

  // Statistical analysis (1 dimension)
  tokenPrevalence: number;         // Frequency in training dataset

  // Linguistic analysis (37 dimensions)
  suffixPatterns: number[];        // Common word ending patterns

  // Co-occurrence analysis (1 dimension)
  nextWordFrequency: number;       // Normalized co-occurrence frequency

  // Content filtering (1 dimension)
  vulgarityScore: number;          // Profanity detection (placeholder)

  // Stylistic analysis (2 dimensions)
  styleFeatures: number[];         // Pirate/Victorian language detection
}
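
The component sizes listed above sum to the default 144 dimensions (66 + 36 + 1 + 37 + 1 + 1 + 2). As a sketch only, one way such a per-token record could be flattened into a single vector; the field names follow the interface above, and the concatenation order is an assumption for illustration:

// Concatenate the per-metric components into one flat 144-dimensional vector.
// The ordering shown here is illustrative, not the library's internal layout.
function toVector(embedding) {
  return [
    ...embedding.characterDistribution, // 66 dims
    ...embedding.partOfSpeech,          // 36 dims
    embedding.tokenPrevalence,          //  1 dim
    ...embedding.suffixPatterns,        // 37 dims
    embedding.nextWordFrequency,        //  1 dim
    embedding.vulgarityScore,           //  1 dim
    ...embedding.styleFeatures          //  2 dims
  ];
}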

Training Process

The training pipeline follows these steps:

  1. Document Combination: Concatenates all training documents
  2. Tokenization: Splits text into individual tokens
  3. Token Analysis: Analyzes each token-nextToken pair
  4. Metric Calculation: Computes 144-dimensional embeddings
  5. Normalization: Normalizes frequency-based metrics
  6. Context Creation: Builds n-gram trie structure
  7. Persistence: Saves embeddings to JSON files
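
The sketch below walks through those steps in plain JavaScript. It is illustrative only: it computes a single next-token frequency metric instead of the full 144-dimensional embeddings, omits context creation (step 6), and is not the library's actual implementation.

const fs = require('fs').promises;

async function trainSketch(name, files) {
  // 1. Document Combination: read and concatenate the source documents.
  const docs = await Promise.all(
    files.map((f) => fs.readFile(`./training/documents/${f}.txt`, 'utf8'))
  );
  const text = docs.join('\n');

  // 2. Tokenization: a plain whitespace split stands in for the real tokenizer.
  const tokens = text.split(/\s+/).filter(Boolean);

  // 3-4. Token analysis and metric calculation: next-token counts only.
  const counts = {};
  for (let i = 0; i < tokens.length - 1; i++) {
    const token = tokens[i];
    const next = tokens[i + 1];
    counts[token] = counts[token] || {};
    counts[token][next] = (counts[token][next] || 0) + 1;
  }

  // 5. Normalization: convert raw counts to frequencies.
  for (const token of Object.keys(counts)) {
    const total = Object.values(counts[token]).reduce((a, b) => a + b, 0);
    for (const next of Object.keys(counts[token])) {
      counts[token][next] /= total;
    }
  }

  // 7. Persistence: save the result as JSON alongside the real embeddings.
  await fs.writeFile(`./training/embeddings/${name}.json`, JSON.stringify(counts));
}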

Training Configuration

/**
 * Environment variables affecting training behavior
 */
interface TrainingConfig {
  PARAMETER_CHUNK_SIZE: number;   // Training batch size (default: 50000)
  DIMENSIONS: number;             // Vector dimensionality (default: 144)
}

Configuration Examples:

# Increase batch size for faster training on large datasets
export PARAMETER_CHUNK_SIZE=100000

# Adjust vector dimensions (requires retraining all models)
export DIMENSIONS=256
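
Both variables are read from the process environment. A small sketch of how the documented defaults would apply when they are unset; the fallback logic here is illustrative rather than the library's exact code:

// Resolve training configuration from the environment, falling back to the
// documented defaults when the variables are not set (illustrative only).
const PARAMETER_CHUNK_SIZE = parseInt(process.env.PARAMETER_CHUNK_SIZE || '50000', 10);
const DIMENSIONS = parseInt(process.env.DIMENSIONS || '144', 10);

console.log(`chunk size: ${PARAMETER_CHUNK_SIZE}, vector dimensions: ${DIMENSIONS}`);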

Training Datasets

The system supports structured dataset configurations for reproducible training:

Dataset Structure

/**
 * Training dataset configuration object
 */
interface Dataset {
  name: string;           // Dataset identifier
  files: string[];        // Document filenames (without .txt extension)
}

/**
 * Built-in dataset examples
 */
const DefaultDataset = {
  name: 'default',
  files: [
    'animal-facts',
    'cat-facts',
    'facts-and-sentences',
    'heart-of-darkness',
    'lectures-on-alchemy',
    'legendary-islands-of-the-atlantic',
    'on-the-taboo-against-knowing-who-you-are',
    'paris',
    'test',
    'the-initiates-of-the-flame',
    'the-phantom-of-the-opera'
  ]
};
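
Because train() accepts a dataset object of exactly this shape, a configuration like DefaultDataset can be passed to it directly; the second dataset below is a hypothetical example:

// Train on the built-in default dataset configuration shown above.
await model.train(DefaultDataset);

// Custom datasets follow the same shape (hypothetical file names).
await model.train({
  name: 'legal-corpus',
  files: ['contracts', 'case-summaries']
});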

Document Management

Training documents must be placed in the training/documents/ directory relative to the project root:

project-root/
├── training/
│   ├── documents/
│   │   ├── document1.txt
│   │   ├── document2.txt
│   │   └── ...
│   └── embeddings/
│       ├── dataset1.json
│       ├── dataset2.json
│       └── ...
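
A short sketch of adding a new document before training; the file name, dataset name, and document text are hypothetical, and the paths follow the layout above:

const fs = require('fs').promises;

// Hypothetical document content for illustration.
const rawReleaseNotes = 'Version 2.0 adds streaming support and faster training.';

// Place the new document where the trainer expects to find it.
await fs.mkdir('./training/documents', { recursive: true });
await fs.writeFile('./training/documents/release-notes.txt', rawReleaseNotes, 'utf8');

// Reference the file by name, without the .txt extension.
await model.train({ name: 'release-notes', files: ['release-notes'] });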

Internal Utility Functions

The training system relies on a set of internal utility functions for document and embedding management. These are not exported from the main package:

/**
 * Internal utility functions (not directly accessible)
 */
interface InternalUtils {
  combineDocuments(documents: string[]): Promise<string>;
  fetchEmbeddings(name: string): Promise<Object>;
  tokenize(input: string): string[];
  getPartsOfSpeech(text: string): Object[];
}

These functions handle:

  • combineDocuments: Reads and concatenates training document files from training/documents/ directory
  • fetchEmbeddings: Loads pre-computed embeddings from training/embeddings/ directory
  • tokenize: Splits input text into tokens using regex-based tokenization (a sketch follows this list)
  • getPartsOfSpeech: Analyzes grammatical roles using wink-pos-tagger
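
For illustration, a minimal regex-based tokenizer in the spirit of the internal tokenize utility; the actual pattern used by the library is not documented here, so the regex below is an assumption:

// Split text into word and punctuation tokens with a simple regex.
// Illustrative only; the library's internal tokenize() may differ.
function tokenizeSketch(input) {
  return input.match(/[a-zA-Z0-9']+|[.,!?;:]/g) || [];
}

tokenizeSketch('The model will learn token relationships.');
// => ['The', 'model', 'will', 'learn', 'token', 'relationships', '.']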

Training Performance

Training Optimization

  • Chunked Processing: Large parameter sets are processed in configurable chunks (sketched after this list)
  • Memory Management: Efficient n-gram trie construction with merge operations
  • Progress Logging: Detailed console output showing training progress
  • File Persistence: Automatic saving of embeddings for reuse
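
A sketch of the chunked-processing idea from the list above: splitting a large parameter set into PARAMETER_CHUNK_SIZE batches so each pass stays within memory limits. The helper is illustrative, not the library's internal implementation.

// Process a large array of parameters in fixed-size chunks.
const CHUNK_SIZE = parseInt(process.env.PARAMETER_CHUNK_SIZE || '50000', 10);

function processInChunks(parameters, handleChunk) {
  for (let start = 0; start < parameters.length; start += CHUNK_SIZE) {
    handleChunk(parameters.slice(start, start + CHUNK_SIZE));
    console.log(`Processed ${Math.min(start + CHUNK_SIZE, parameters.length)} of ${parameters.length} parameters`);
  }
}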

Training Time Estimates

Training time varies based on dataset size and system performance:

  • Small datasets (< 1MB text): 30 seconds - 2 minutes
  • Medium datasets (1-10MB text): 2-15 minutes
  • Large datasets (10-100MB text): 15 minutes - 2 hours
  • Very large datasets (100MB+ text): 2+ hours

Memory Requirements

  • Base memory: ~50MB for library components
  • Training memory: ~1-5MB per 1MB of training text
  • Embedding storage: ~100KB per 1000 unique tokens
  • Context memory: ~10-50MB for typical models
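
As a rough, illustrative estimate using the figures above: training on a 10MB corpus containing roughly 20,000 unique tokens would need about 50MB of base memory, 10-50MB of training memory, and around 2MB of embedding storage, plus 10-50MB of context memory once the model is loaded.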

Error Handling

The training system handles various error conditions:

  • Missing Files: Clear error messages for missing training documents
  • Invalid Format: Validation of document file formats
  • Memory Limits: Chunked processing for large datasets
  • Disk Space: Automatic cleanup of temporary files
  • Encoding Issues: UTF-8 text processing with error recovery
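
Assuming failures reject the promise returned by train(), these conditions can be handled with a standard try/catch; the error message shown is not the library's exact wording:

try {
  await model.train({ name: 'my-dataset', files: ['missing-document'] });
} catch (err) {
  // A missing training/documents/missing-document.txt would surface here.
  console.error('Training failed:', err.message);
}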
