# Training System

The Training System provides comprehensive capabilities for creating custom language models from text documents. It handles tokenization, n-gram analysis, embedding generation, and model optimization through a sophisticated multi-metric training pipeline.

## Capabilities

### Model Training

Core training functionality that processes text documents to create embeddings and n-gram structures for prediction.
```javascript { .api }
/**
 * Train model on dataset with full embedding generation
 * @param {Object} dataset - Training dataset configuration
 * @param {string} dataset.name - Dataset identifier for saving embeddings
 * @param {string[]} dataset.files - Document filenames (without .txt extension)
 * @returns {Promise<void>} Resolves when training and context creation have finished
 */
train(dataset);
```
**Usage Examples:**

```javascript
// Train on custom dataset
await model.train({
  name: 'shakespeare',
  files: ['hamlet', 'macbeth', 'othello', 'king-lear']
});

// Train on single document
await model.train({
  name: 'technical-docs',
  files: ['api-documentation']
});

// Train on mixed content
await model.train({
  name: 'mixed-content',
  files: [
    'news-articles',
    'scientific-papers',
    'literature-excerpts',
    'chat-conversations'
  ]
});
```
### Text Ingestion

Direct text ingestion for processing without file-based training.

```javascript { .api }
/**
 * Ingest text directly for processing
 * @param {string} text - Raw text content to process
 */
ingest(text);
```
**Usage Examples:**

```javascript
// Ingest direct text content
const textContent = `
This is sample text for training.
The model will learn token relationships.
Multiple sentences provide better context.
`;
model.ingest(textContent);

// Use with external text sources
const webContent = await fetch('https://example.com/articles.txt').then(r => r.text());
model.ingest(webContent);
```
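For very large sources, ingesting in chunks can keep memory use bounded. The helper below is a hypothetical sketch (not part of the library) that splits text on sentence boundaries before handing each chunk to `ingest()`; the chunk size is an arbitrary example:

```javascript
// Hypothetical helper: split text into sentence-aligned chunks of roughly
// maxChars characters each, so large sources can be ingested incrementally.
function splitIntoChunks(text, maxChars = 10000) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current);
      current = '';
    }
    current += (current ? ' ' : '') + sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage sketch: for (const chunk of splitIntoChunks(bigText)) model.ingest(chunk);
```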
### Context Creation

Creates an in-memory model context from pre-computed embeddings, enabling fast model initialization.

```javascript { .api }
/**
 * Create model context from embeddings
 * @param {Object} embeddings - Pre-computed token embeddings
 */
createContext(embeddings);
```
**Usage Examples:**

```javascript
// Load and use pre-computed embeddings
const fs = require('fs').promises;
const embeddingsData = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);
model.createContext(embeddingsData);

// Use embeddings from a training data object
const trainingData = {
  text: "Combined training text...",
  embeddings: { /* pre-computed embeddings */ }
};
model.createContext(trainingData.embeddings);
```
## Training Metrics and Embeddings

The training system generates high-dimensional embeddings (144 dimensions by default) using multiple analysis metrics:

### Embedding Dimensions

```javascript { .api }
/**
 * Training embedding structure with multiple analysis metrics
 */
interface EmbeddingDimensions {
  // Character composition analysis (66 dimensions)
  characterDistribution: number[]; // Distribution of alphanumeric characters

  // Grammatical analysis (36 dimensions)
  partOfSpeech: number[]; // Part-of-speech tag probabilities

  // Statistical analysis (1 dimension)
  tokenPrevalence: number; // Frequency in training dataset

  // Linguistic analysis (37 dimensions)
  suffixPatterns: number[]; // Common word ending patterns

  // Co-occurrence analysis (1 dimension)
  nextWordFrequency: number; // Normalized co-occurrence frequency

  // Content filtering (1 dimension)
  vulgarityScore: number; // Profanity detection (placeholder)

  // Stylistic analysis (2 dimensions)
  styleFeatures: number[]; // Pirate/Victorian language detection
}
```
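The per-metric dimension counts listed above sum to the 144-dimension default. A quick sanity check (the layout object below is illustrative, not a library export):

```javascript
// Dimension counts per metric, as documented above.
const EMBEDDING_LAYOUT = {
  characterDistribution: 66,
  partOfSpeech: 36,
  tokenPrevalence: 1,
  suffixPatterns: 37,
  nextWordFrequency: 1,
  vulgarityScore: 1,
  styleFeatures: 2,
};

// Sum all metric widths to confirm the total vector size.
const totalDimensions = Object.values(EMBEDDING_LAYOUT).reduce((sum, n) => sum + n, 0);
console.log(totalDimensions); // 144
```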
### Training Process

The training pipeline follows these steps:

1. **Document Combination**: Concatenates all training documents
2. **Tokenization**: Splits text into individual tokens
3. **Token Analysis**: Analyzes each token-nextToken pair
4. **Metric Calculation**: Computes 144-dimensional embeddings
5. **Normalization**: Normalizes frequency-based metrics
6. **Context Creation**: Builds n-gram trie structure
7. **Persistence**: Saves embeddings to JSON files
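The first five steps can be sketched in miniature as follows. This is an illustration of the flow, not the library's actual implementation (which also builds the trie and persists results), and the tokenizer regex is a stand-in:

```javascript
// Toy tokenizer: lowercase words and numbers only (stand-in for the real one).
function tokenize(input) {
  return input.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Steps 1-5 in miniature: combine, tokenize, analyze token/nextToken pairs,
// count co-occurrences, then normalize frequencies to [0, 1].
function analyzePairs(documents) {
  const text = documents.join('\n');            // 1. combine documents
  const tokens = tokenize(text);                // 2. tokenize
  const pairCounts = new Map();                 // 3-4. token/nextToken counts
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = `${tokens[i]} ${tokens[i + 1]}`;
    pairCounts.set(key, (pairCounts.get(key) ?? 0) + 1);
  }
  const max = Math.max(...pairCounts.values()); // 5. normalize frequencies
  const normalized = new Map();
  for (const [pair, count] of pairCounts) normalized.set(pair, count / max);
  return normalized;
}
```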
### Training Configuration

```javascript { .api }
/**
 * Environment variables affecting training behavior
 */
interface TrainingConfig {
  PARAMETER_CHUNK_SIZE: number; // Training batch size (default: 50000)
  DIMENSIONS: number;           // Vector dimensionality (default: 144)
}
```

**Configuration Examples:**

```bash
# Increase batch size for faster training on large datasets
export PARAMETER_CHUNK_SIZE=100000

# Adjust vector dimensions (requires retraining all models)
export DIMENSIONS=256
```
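A small sketch of how these variables might be consumed at startup. The helper name is hypothetical; the defaults match those stated above:

```javascript
// Hypothetical config reader: parses the documented environment variables,
// falling back to the documented defaults when they are unset.
function readTrainingConfig(env = process.env) {
  return {
    chunkSize: Number(env.PARAMETER_CHUNK_SIZE ?? 50000),
    dimensions: Number(env.DIMENSIONS ?? 144),
  };
}
```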
## Training Datasets

The system supports structured dataset configurations for reproducible training:

### Dataset Structure

```javascript { .api }
/**
 * Training dataset configuration object
 */
interface Dataset {
  name: string;    // Dataset identifier
  files: string[]; // Document filenames (without .txt extension)
}

/**
 * Built-in dataset examples
 */
const DefaultDataset = {
  name: 'default',
  files: [
    'animal-facts',
    'cat-facts',
    'facts-and-sentences',
    'heart-of-darkness',
    'lectures-on-alchemy',
    'legendary-islands-of-the-atlantic',
    'on-the-taboo-against-knowing-who-you-are',
    'paris',
    'test',
    'the-initiates-of-the-flame',
    'the-phantom-of-the-opera'
  ]
};
```
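Since `files` entries must omit the `.txt` extension, a pre-flight check can catch common mistakes before training starts. The validator below is a hypothetical helper, not part of the library:

```javascript
// Hypothetical pre-flight check that a dataset object matches the shape above:
// a non-empty name, and a non-empty files array with no .txt extensions.
function isValidDataset(dataset) {
  return (
    typeof dataset === 'object' && dataset !== null &&
    typeof dataset.name === 'string' && dataset.name.length > 0 &&
    Array.isArray(dataset.files) && dataset.files.length > 0 &&
    dataset.files.every((f) => typeof f === 'string' && !f.endsWith('.txt'))
  );
}
```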
### Document Management

Training documents must be placed in the `training/documents/` directory relative to the project root:

```
project-root/
├── training/
│   ├── documents/
│   │   ├── document1.txt
│   │   ├── document2.txt
│   │   └── ...
│   └── embeddings/
│       ├── dataset1.json
│       ├── dataset2.json
│       └── ...
```
### Internal Utility Functions

The training system uses internal utility functions for document and embedding management. These are not directly exported from the main package but are used internally by the training pipeline:

```javascript { .api }
/**
 * Internal utility functions (not directly accessible)
 */
interface InternalUtils {
  combineDocuments(documents: string[]): Promise<string>;
  fetchEmbeddings(name: string): Promise<Object>;
  tokenize(input: string): string[];
  getPartsOfSpeech(text: string): Object[];
}
```
These functions handle:

- **combineDocuments**: Reads and concatenates training document files from the `training/documents/` directory
- **fetchEmbeddings**: Loads pre-computed embeddings from the `training/embeddings/` directory
- **tokenize**: Splits input text into tokens using regex-based tokenization
- **getPartsOfSpeech**: Analyzes grammatical roles using wink-pos-tagger
## Training Performance

### Training Optimization

- **Chunked Processing**: Large parameter sets processed in configurable chunks
- **Memory Management**: Efficient n-gram trie construction with merge operations
- **Progress Logging**: Detailed console output showing training progress
- **File Persistence**: Automatic saving of embeddings for reuse

### Training Time Estimates

Training time varies based on dataset size and system performance:

- Small datasets (< 1MB text): 30 seconds - 2 minutes
- Medium datasets (1-10MB text): 2-15 minutes
- Large datasets (10-100MB text): 15 minutes - 2 hours
- Very large datasets (100MB+ text): 2+ hours

### Memory Requirements

- Base memory: ~50MB for library components
- Training memory: ~1-5MB per 1MB of training text
- Embedding storage: ~100KB per 1000 unique tokens
- Context memory: ~10-50MB for typical models
273
274
## Error Handling
275
276
The training system handles various error conditions:
277
278
- **Missing Files**: Clear error messages for missing training documents
279
- **Invalid Format**: Validation of document file formats
280
- **Memory Limits**: Chunked processing for large datasets
281
- **Disk Space**: Automatic cleanup of temporary files
282
- **Encoding Issues**: UTF-8 text processing with error recovery
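Callers can also add their own handling around `train()`. The wrapper below is a hypothetical sketch that surfaces missing-document errors (`ENOENT` is Node's standard missing-file error code):

```javascript
// Hypothetical wrapper: re-throw missing-file errors from train() with a
// clearer message; model is assumed to be an initialized instance.
async function safeTrain(model, dataset) {
  try {
    await model.train(dataset);
  } catch (err) {
    if (err.code === 'ENOENT') {
      throw new Error(`Training document not found: ${err.path}`);
    }
    throw err;
  }
}
```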