# Training System

The Training System provides comprehensive capabilities for creating custom language models from text documents. It handles tokenization, n-gram analysis, embedding generation, and model optimization through a sophisticated multi-metric training pipeline.

## Capabilities

### Model Training

Core training functionality that processes text documents to create embeddings and n-gram structures for prediction.
```javascript { .api }
/**
 * Train model on dataset with full embedding generation
 * @param {Object} dataset - Training dataset configuration
 * @param {string} dataset.name - Dataset identifier for saving embeddings
 * @param {string[]} dataset.files - Document filenames (without .txt extension)
 * @returns {Promise<void>} Resolves when training and context creation have finished
 */
train(dataset);
```
**Usage Examples:**

```javascript
// Train on custom dataset
await model.train({
  name: 'shakespeare',
  files: ['hamlet', 'macbeth', 'othello', 'king-lear']
});

// Train on single document
await model.train({
  name: 'technical-docs',
  files: ['api-documentation']
});

// Train on mixed content
await model.train({
  name: 'mixed-content',
  files: [
    'news-articles',
    'scientific-papers',
    'literature-excerpts',
    'chat-conversations'
  ]
});
```
### Text Ingestion

Direct text ingestion for processing without file-based training.

```javascript { .api }
/**
 * Ingest text directly for processing
 * @param {string} text - Raw text content to process
 */
ingest(text);
```
**Usage Examples:**

```javascript
// Ingest direct text content
const textContent = `
This is sample text for training.
The model will learn token relationships.
Multiple sentences provide better context.
`;
model.ingest(textContent);

// Use with external text sources
const webContent = await fetch('https://example.com/articles.txt').then(r => r.text());
model.ingest(webContent);
```
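For very large sources, ingesting in chunks can keep memory use bounded. The helper below is a hypothetical sketch (not part of the library) that splits text on sentence boundaries before handing each chunk to `ingest()`; the chunk size is an arbitrary example:

```javascript
// Hypothetical helper: split text into sentence-aligned chunks of roughly
// maxChars characters each, so large sources can be ingested incrementally.
function splitIntoChunks(text, maxChars = 10000) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current);
      current = '';
    }
    current += (current ? ' ' : '') + sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage sketch: for (const chunk of splitIntoChunks(bigText)) model.ingest(chunk);
```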
### Context Creation

Creates an in-memory model context from pre-computed embeddings, enabling fast model initialization.

```javascript { .api }
/**
 * Create model context from embeddings
 * @param {Object} embeddings - Pre-computed token embeddings
 */
createContext(embeddings);
```
**Usage Examples:**

```javascript
// Load and use pre-computed embeddings
const fs = require('fs').promises;
const embeddingsData = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);
model.createContext(embeddingsData);

// Use embeddings from a training data object
const trainingData = {
  text: "Combined training text...",
  embeddings: { /* pre-computed embeddings */ }
};
model.createContext(trainingData.embeddings);
```
## Training Metrics and Embeddings

The training system generates high-dimensional embeddings (144 dimensions by default) using multiple analysis metrics:

### Embedding Dimensions

```javascript { .api }
/**
 * Training embedding structure with multiple analysis metrics
 */
interface EmbeddingDimensions {
  // Character composition analysis (66 dimensions)
  characterDistribution: number[]; // Distribution of alphanumeric characters

  // Grammatical analysis (36 dimensions)
  partOfSpeech: number[]; // Part-of-speech tag probabilities

  // Statistical analysis (1 dimension)
  tokenPrevalence: number; // Frequency in training dataset

  // Linguistic analysis (37 dimensions)
  suffixPatterns: number[]; // Common word ending patterns

  // Co-occurrence analysis (1 dimension)
  nextWordFrequency: number; // Normalized co-occurrence frequency

  // Content filtering (1 dimension)
  vulgarityScore: number; // Profanity detection (placeholder)

  // Stylistic analysis (2 dimensions)
  styleFeatures: number[]; // Pirate/Victorian language detection
}
```
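The per-metric dimension counts listed above sum to the 144-dimension default. A quick sanity check (the layout object below is illustrative, not a library export):

```javascript
// Dimension counts per metric, as documented above.
const EMBEDDING_LAYOUT = {
  characterDistribution: 66,
  partOfSpeech: 36,
  tokenPrevalence: 1,
  suffixPatterns: 37,
  nextWordFrequency: 1,
  vulgarityScore: 1,
  styleFeatures: 2,
};

// Sum all metric widths to confirm the total vector size.
const totalDimensions = Object.values(EMBEDDING_LAYOUT).reduce((sum, n) => sum + n, 0);
console.log(totalDimensions); // 144
```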
### Training Process

The training pipeline follows these steps:

1. **Document Combination**: Concatenates all training documents
2. **Tokenization**: Splits text into individual tokens
3. **Token Analysis**: Analyzes each token-nextToken pair
4. **Metric Calculation**: Computes 144-dimensional embeddings
5. **Normalization**: Normalizes frequency-based metrics
6. **Context Creation**: Builds n-gram trie structure
7. **Persistence**: Saves embeddings to JSON files
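The first five steps can be sketched in miniature as follows. This is an illustration of the flow, not the library's actual implementation (which also builds the trie and persists results), and the tokenizer regex is a stand-in:

```javascript
// Toy tokenizer: lowercase words and numbers only (stand-in for the real one).
function tokenize(input) {
  return input.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Steps 1-5 in miniature: combine, tokenize, analyze token/nextToken pairs,
// count co-occurrences, then normalize frequencies to [0, 1].
function analyzePairs(documents) {
  const text = documents.join('\n');            // 1. combine documents
  const tokens = tokenize(text);                // 2. tokenize
  const pairCounts = new Map();                 // 3-4. token/nextToken counts
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = `${tokens[i]} ${tokens[i + 1]}`;
    pairCounts.set(key, (pairCounts.get(key) ?? 0) + 1);
  }
  const max = Math.max(...pairCounts.values()); // 5. normalize frequencies
  const normalized = new Map();
  for (const [pair, count] of pairCounts) normalized.set(pair, count / max);
  return normalized;
}
```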
### Training Configuration

```javascript { .api }
/**
 * Environment variables affecting training behavior
 */
interface TrainingConfig {
  PARAMETER_CHUNK_SIZE: number; // Training batch size (default: 50000)
  DIMENSIONS: number;           // Vector dimensionality (default: 144)
}
```

**Configuration Examples:**

```bash
# Increase batch size for faster training on large datasets
export PARAMETER_CHUNK_SIZE=100000

# Adjust vector dimensions (requires retraining all models)
export DIMENSIONS=256
```
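A small sketch of how these variables might be consumed at startup. The helper name is hypothetical; the defaults match those stated above:

```javascript
// Hypothetical config reader: parses the documented environment variables,
// falling back to the documented defaults when they are unset.
function readTrainingConfig(env = process.env) {
  return {
    chunkSize: Number(env.PARAMETER_CHUNK_SIZE ?? 50000),
    dimensions: Number(env.DIMENSIONS ?? 144),
  };
}
```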
## Training Datasets

The system supports structured dataset configurations for reproducible training:

### Dataset Structure

```javascript { .api }
/**
 * Training dataset configuration object
 */
interface Dataset {
  name: string;    // Dataset identifier
  files: string[]; // Document filenames (without .txt extension)
}

/**
 * Built-in dataset examples
 */
const DefaultDataset = {
  name: 'default',
  files: [
    'animal-facts',
    'cat-facts',
    'facts-and-sentences',
    'heart-of-darkness',
    'lectures-on-alchemy',
    'legendary-islands-of-the-atlantic',
    'on-the-taboo-against-knowing-who-you-are',
    'paris',
    'test',
    'the-initiates-of-the-flame',
    'the-phantom-of-the-opera'
  ]
};
```
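Since `files` entries must omit the `.txt` extension, a pre-flight check can catch common mistakes before training starts. The validator below is a hypothetical helper, not part of the library:

```javascript
// Hypothetical pre-flight check that a dataset object matches the shape above:
// a non-empty name, and a non-empty files array with no .txt extensions.
function isValidDataset(dataset) {
  return (
    typeof dataset === 'object' && dataset !== null &&
    typeof dataset.name === 'string' && dataset.name.length > 0 &&
    Array.isArray(dataset.files) && dataset.files.length > 0 &&
    dataset.files.every((f) => typeof f === 'string' && !f.endsWith('.txt'))
  );
}
```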
### Document Management

Training documents must be placed in the `training/documents/` directory relative to the project root:

```
project-root/
├── training/
│   ├── documents/
│   │   ├── document1.txt
│   │   ├── document2.txt
│   │   └── ...
│   └── embeddings/
│       ├── dataset1.json
│       ├── dataset2.json
│       └── ...
```
### Internal Utility Functions

The training system uses internal utility functions for document and embedding management. These are not directly exported from the main package but are used internally by the training pipeline:

```javascript { .api }
/**
 * Internal utility functions (not directly accessible)
 */
interface InternalUtils {
  combineDocuments(documents: string[]): Promise<string>;
  fetchEmbeddings(name: string): Promise<Object>;
  tokenize(input: string): string[];
  getPartsOfSpeech(text: string): Object[];
}
```
These functions handle:

- **combineDocuments**: Reads and concatenates training document files from the `training/documents/` directory
- **fetchEmbeddings**: Loads pre-computed embeddings from the `training/embeddings/` directory
- **tokenize**: Splits input text into tokens using regex-based tokenization
- **getPartsOfSpeech**: Analyzes grammatical roles using wink-pos-tagger
## Training Performance

### Training Optimization

- **Chunked Processing**: Large parameter sets processed in configurable chunks
- **Memory Management**: Efficient n-gram trie construction with merge operations
- **Progress Logging**: Detailed console output showing training progress
- **File Persistence**: Automatic saving of embeddings for reuse

### Training Time Estimates

Training time varies based on dataset size and system performance:

- Small datasets (< 1MB text): 30 seconds - 2 minutes
- Medium datasets (1-10MB text): 2-15 minutes
- Large datasets (10-100MB text): 15 minutes - 2 hours
- Very large datasets (100MB+ text): 2+ hours

### Memory Requirements

- Base memory: ~50MB for library components
- Training memory: ~1-5MB per 1MB of training text
- Embedding storage: ~100KB per 1000 unique tokens
- Context memory: ~10-50MB for typical models
273
274
## Error Handling
275
276
The training system handles various error conditions:
277
278
- **Missing Files**: Clear error messages for missing training documents
279
- **Invalid Format**: Validation of document file formats
280
- **Memory Limits**: Chunked processing for large datasets
281
- **Disk Space**: Automatic cleanup of temporary files
282
- **Encoding Issues**: UTF-8 text processing with error recovery
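Callers can also add their own handling around `train()`. The wrapper below is a hypothetical sketch that surfaces missing-document errors (`ENOENT` is Node's standard missing-file error code):

```javascript
// Hypothetical wrapper: re-throw missing-file errors from train() with a
// clearer message; model is assumed to be an initialized instance.
async function safeTrain(model, dataset) {
  try {
    await model.train(dataset);
  } catch (err) {
    if (err.code === 'ENOENT') {
      throw new Error(`Training document not found: ${err.path}`);
    }
    throw err;
  }
}
```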