JavaScript library for creating language models with next-token prediction capabilities including autocomplete, text completion, and AI-powered text generation.
The Training System provides comprehensive capabilities for creating custom language models from text documents. It handles tokenization, n-gram analysis, embedding generation, and model optimization through a sophisticated multi-metric training pipeline.
Core training functionality that processes text documents to create embeddings and n-gram structures for prediction.
```javascript
/**
 * Train model on dataset with full embedding generation
 * @param {Object} dataset - Training dataset configuration
 * @param {string} dataset.name - Dataset identifier for saving embeddings
 * @param {string[]} dataset.files - Document filenames (without .txt extension)
 * @returns {Promise<void>} Completes when training and context creation are finished
 */
train(dataset);
```

Usage Examples:

```javascript
// Train on custom dataset
await model.train({
  name: 'shakespeare',
  files: ['hamlet', 'macbeth', 'othello', 'king-lear']
});

// Train on single document
await model.train({
  name: 'technical-docs',
  files: ['api-documentation']
});

// Train on mixed content
await model.train({
  name: 'mixed-content',
  files: [
    'news-articles',
    'scientific-papers',
    'literature-excerpts',
    'chat-conversations'
  ]
});
```

Direct text ingestion for processing without file-based training.
```javascript
/**
 * Ingest text directly for processing
 * @param {string} text - Raw text content to process
 */
ingest(text);
```

Usage Examples:

```javascript
// Ingest direct text content
const textContent = `
This is sample text for training.
The model will learn token relationships.
Multiple sentences provide better context.
`;
model.ingest(textContent);
// Use with external text sources
const webContent = await fetch('https://example.com/articles.txt').then(r => r.text());
model.ingest(webContent);
```

Creates an in-memory model context from pre-computed embeddings, enabling fast model initialization.
```javascript
/**
 * Create model context from embeddings
 * @param {Object} embeddings - Pre-computed token embeddings
 */
createContext(embeddings);
```

Usage Examples:

```javascript
// Load and use pre-computed embeddings
const fs = require('fs').promises;
const embeddingsData = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);
model.createContext(embeddingsData);

// Use embeddings from training data object
const trainingData = {
  text: "Combined training text...",
  embeddings: { /* pre-computed embeddings */ }
};
model.createContext(trainingData.embeddings);
```
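A common initialization pattern is to reuse previously saved embeddings when they exist and fall back to full training otherwise. The sketch below is illustrative: it assumes the training/embeddings/ naming convention (one JSON file per dataset name) described later in this section and a model instance created as in the examples above.

```javascript
// Illustrative sketch: load saved embeddings if present, otherwise retrain.
// Assumes embeddings are saved as training/embeddings/<dataset name>.json.
const fs = require('fs').promises;

async function initModel(model, dataset) {
  const embeddingsPath = `./training/embeddings/${dataset.name}.json`;
  try {
    // Fast path: build the context directly from pre-computed embeddings
    const embeddings = JSON.parse(await fs.readFile(embeddingsPath, 'utf8'));
    model.createContext(embeddings);
  } catch (err) {
    // Slow path: no saved embeddings yet, so run full training
    await model.train(dataset);
  }
}

await initModel(model, { name: 'shakespeare', files: ['hamlet', 'macbeth'] });
```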
The training system generates high-dimensional embeddings (144 dimensions by default) using multiple analysis metrics:

```typescript
/**
 * Training embedding structure with multiple analysis metrics
 */
interface EmbeddingDimensions {
  // Character composition analysis (66 dimensions)
  characterDistribution: number[]; // Distribution of alphanumeric characters

  // Grammatical analysis (36 dimensions)
  partOfSpeech: number[]; // Part-of-speech tag probabilities

  // Statistical analysis (1 dimension)
  tokenPrevalence: number; // Frequency in training dataset

  // Linguistic analysis (37 dimensions)
  suffixPatterns: number[]; // Common word ending patterns

  // Co-occurrence analysis (1 dimension)
  nextWordFrequency: number; // Normalized co-occurrence frequency

  // Content filtering (1 dimension)
  vulgarityScore: number; // Profanity detection (placeholder)

  // Stylistic analysis (2 dimensions)
  styleFeatures: number[]; // Pirate/Victorian language detection
}
```
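The per-metric counts above (66 + 36 + 1 + 37 + 1 + 1 + 2) add up to the default 144 dimensions. As a rough sanity check, a saved embeddings file can be inspected; this sketch assumes the file maps each token to a flat numeric vector, which is not guaranteed by the documented schema.

```javascript
// Illustrative sketch: confirm stored vectors match the configured dimensionality.
// Assumption: the embeddings JSON maps tokens to flat arrays of numbers.
const fs = require('fs').promises;

const DIMENSIONS = Number(process.env.DIMENSIONS) || 144;

const embeddings = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);

for (const [token, vector] of Object.entries(embeddings)) {
  if (Array.isArray(vector) && vector.length !== DIMENSIONS) {
    console.warn(`"${token}": ${vector.length} dimensions, expected ${DIMENSIONS}`);
  }
}
```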
The training pipeline follows these steps:

1. Combine the dataset's documents into a single text corpus.
2. Tokenize the corpus and tag parts of speech.
3. Analyze n-gram and co-occurrence statistics across tokens.
4. Generate a multi-metric embedding vector for each token.
5. Save the embeddings and create the in-memory model context.

```typescript
/**
 * Environment variables affecting training behavior
 */
interface TrainingConfig {
  PARAMETER_CHUNK_SIZE: number; // Training batch size (default: 50000)
  DIMENSIONS: number; // Vector dimensionality (default: 144)
}
```

Configuration Examples:

```bash
# Increase batch size for faster training on large datasets
export PARAMETER_CHUNK_SIZE=100000

# Adjust vector dimensions (requires retraining all models)
export DIMENSIONS=256
```
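The same settings can be applied from Node by assigning to process.env before the model is created; when exactly the library reads these values is an assumption here, so setting them as early as possible is the safer choice.

```javascript
// Illustrative sketch: configure training through environment variables from Node.
// process.env values are strings, so numbers are written as strings here.
process.env.PARAMETER_CHUNK_SIZE = '100000'; // larger batches for big datasets
process.env.DIMENSIONS = '144';              // keep the default vector size

// ...then create the model and train as in the earlier examples
```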
The system supports structured dataset configurations for reproducible training:

```typescript
/**
 * Training dataset configuration object
 */
interface Dataset {
  name: string;    // Dataset identifier
  files: string[]; // Document filenames (without .txt extension)
}

/**
 * Built-in dataset examples
 */
const DefaultDataset = {
  name: 'default',
  files: [
    'animal-facts',
    'cat-facts',
    'facts-and-sentences',
    'heart-of-darkness',
    'lectures-on-alchemy',
    'legendary-islands-of-the-atlantic',
    'on-the-taboo-against-knowing-who-you-are',
    'paris',
    'test',
    'the-initiates-of-the-flame',
    'the-phantom-of-the-opera'
  ]
};
```

Training documents must be placed in the training/documents/ directory relative to the project root:
```
project-root/
├── training/
│   ├── documents/
│   │   ├── document1.txt
│   │   ├── document2.txt
│   │   └── ...
│   └── embeddings/
│       ├── dataset1.json
│       ├── dataset2.json
│       └── ...
```
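For a quick setup from a Node script run at the project root, something like the following works; only the training/documents/ path convention comes from the layout above, the rest is illustrative.

```javascript
// Illustrative sketch: write a document where the trainer expects it, then train on it.
// Only the training/documents/<file>.txt convention is taken from the layout above.
const fs = require('fs').promises;

await fs.mkdir('./training/documents', { recursive: true });
await fs.writeFile(
  './training/documents/my-notes.txt',
  'Raw text to train on. One or more paragraphs of natural language.'
);

// Filenames are passed without the .txt extension
await model.train({ name: 'my-notes', files: ['my-notes'] });
```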
The training system uses internal utility functions for document and embedding management. These are not directly exported from the main package but are used internally by the training pipeline:

```typescript
/**
 * Internal utility functions (not directly accessible)
 */
interface InternalUtils {
  combineDocuments(documents: string[]): Promise<string>;
  fetchEmbeddings(name: string): Promise<Object>;
  tokenize(input: string): string[];
  getPartsOfSpeech(text: string): Object[];
}
```

These functions handle:
- Loading and combining documents from the training/documents/ directory
- Reading and saving embeddings in the training/embeddings/ directory
- Tokenizing text and tagging parts of speech during embedding generation

Training time varies based on dataset size and system performance.
The training system handles various error conditions.
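Because train() returns a promise, callers can rely on standard async error handling; the specific error types thrown are not documented here, so the sketch below treats them generically.

```javascript
// Illustrative sketch: guard a training run with ordinary promise error handling.
try {
  await model.train({ name: 'technical-docs', files: ['api-documentation'] });
} catch (err) {
  // e.g. a missing document file or unreadable embeddings output
  console.error('Training failed:', err.message);
}
```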
Install with Tessl CLI:

```bash
npx tessl i tessl/npm-next-token-prediction
```