
# Training System

The Training System provides comprehensive capabilities for creating custom language models from text documents. It handles tokenization, n-gram analysis, embedding generation, and model optimization through a multi-metric training pipeline.

## Capabilities

### Model Training

Core training functionality that processes text documents to create embeddings and n-gram structures for prediction.

```javascript { .api }
/**
 * Train model on dataset with full embedding generation
 * @param {Object} dataset - Training dataset configuration
 * @param {string} dataset.name - Dataset identifier for saving embeddings
 * @param {string[]} dataset.files - Document filenames (without .txt extension)
 * @returns {Promise<void>} Resolves when training and context creation have finished
 */
train(dataset);
```

**Usage Examples:**

```javascript
// Train on a custom dataset
await model.train({
  name: 'shakespeare',
  files: ['hamlet', 'macbeth', 'othello', 'king-lear']
});

// Train on a single document
await model.train({
  name: 'technical-docs',
  files: ['api-documentation']
});

// Train on mixed content
await model.train({
  name: 'mixed-content',
  files: [
    'news-articles',
    'scientific-papers',
    'literature-excerpts',
    'chat-conversations'
  ]
});
```

### Text Ingestion

Direct text ingestion for processing without file-based training.

```javascript { .api }
/**
 * Ingest text directly for processing
 * @param {string} text - Raw text content to process
 */
ingest(text);
```

**Usage Examples:**

```javascript
// Ingest direct text content
const textContent = `
This is sample text for training.
The model will learn token relationships.
Multiple sentences provide better context.
`;
model.ingest(textContent);

// Use with external text sources
const webContent = await fetch('https://example.com/articles.txt').then(r => r.text());
model.ingest(webContent);
```

### Context Creation

Creates an in-memory model context from pre-computed embeddings, enabling fast model initialization.

```javascript { .api }
/**
 * Create model context from embeddings
 * @param {Object} embeddings - Pre-computed token embeddings
 */
createContext(embeddings);
```

**Usage Examples:**

```javascript
// Load and use pre-computed embeddings
const fs = require('fs').promises;
const embeddingsData = JSON.parse(
  await fs.readFile('./training/embeddings/my-dataset.json', 'utf8')
);
model.createContext(embeddingsData);

// Use embeddings from a training data object
const trainingData = {
  text: "Combined training text...",
  embeddings: { /* pre-computed embeddings */ }
};
model.createContext(trainingData.embeddings);
```

## Training Metrics and Embeddings

The training system generates high-dimensional embeddings (144 dimensions by default) using multiple analysis metrics:

### Embedding Dimensions

```javascript { .api }
/**
 * Training embedding structure with multiple analysis metrics
 */
interface EmbeddingDimensions {
  // Character composition analysis (66 dimensions)
  characterDistribution: number[]; // Distribution of alphanumeric characters

  // Grammatical analysis (36 dimensions)
  partOfSpeech: number[]; // Part-of-speech tag probabilities

  // Statistical analysis (1 dimension)
  tokenPrevalence: number; // Frequency in training dataset

  // Linguistic analysis (37 dimensions)
  suffixPatterns: number[]; // Common word ending patterns

  // Co-occurrence analysis (1 dimension)
  nextWordFrequency: number; // Normalized co-occurrence frequency

  // Content filtering (1 dimension)
  vulgarityScore: number; // Profanity detection (placeholder)

  // Stylistic analysis (2 dimensions)
  styleFeatures: number[]; // Pirate/Victorian language detection
}
```
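The per-metric widths listed above should account for the full default vector size. A quick sanity check (the object below simply mirrors the comments in the interface; it is not an exported constant):

```javascript
// Per-metric embedding widths, taken from the interface comments above.
const DIMENSION_BREAKDOWN = {
  characterDistribution: 66, // character composition
  partOfSpeech: 36,          // POS tag probabilities
  tokenPrevalence: 1,        // dataset frequency
  suffixPatterns: 37,        // word-ending patterns
  nextWordFrequency: 1,      // co-occurrence
  vulgarityScore: 1,         // content filtering
  styleFeatures: 2,          // stylistic detection
};

// The widths must sum to the configured vector size (default 144).
const total = Object.values(DIMENSION_BREAKDOWN).reduce((a, b) => a + b, 0);
console.log(total); // 144
```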

### Training Process

The training pipeline follows these steps:

1. **Document Combination**: Concatenates all training documents
2. **Tokenization**: Splits text into individual tokens
3. **Token Analysis**: Analyzes each token-nextToken pair
4. **Metric Calculation**: Computes 144-dimensional embeddings
5. **Normalization**: Normalizes frequency-based metrics
6. **Context Creation**: Builds the n-gram trie structure
7. **Persistence**: Saves embeddings to JSON files
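The steps above can be sketched end to end. The sketch below is purely illustrative, using a whitespace tokenizer and a single metric (next-word frequency) rather than the library's actual 144-dimensional pipeline:

```javascript
// Minimal sketch of the pipeline: combine -> tokenize -> analyze pairs -> normalize.
// Only next-word frequency (one of the listed metrics) is tracked here,
// to show the shape of the flow.

function tokenize(text) {
  // Simplistic whitespace tokenizer; the library uses regex-based tokenization.
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function trainSketch(documents) {
  const combined = documents.join('\n');       // 1. document combination
  const tokens = tokenize(combined);           // 2. tokenization
  const nextCounts = new Map();                // 3. token–nextToken analysis
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = `${tokens[i]} ${tokens[i + 1]}`;
    nextCounts.set(key, (nextCounts.get(key) ?? 0) + 1);
  }
  const max = Math.max(...nextCounts.values());
  const normalized = new Map(                  // 5. normalize counts to [0, 1]
    [...nextCounts].map(([k, v]) => [k, v / max])
  );
  return normalized;
}

const freqs = trainSketch(['the cat sat', 'the cat ran']);
console.log(freqs.get('the cat')); // 1 (the most frequent pair)
```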

### Training Configuration

```javascript { .api }
/**
 * Environment variables affecting training behavior
 */
interface TrainingConfig {
  PARAMETER_CHUNK_SIZE: number; // Training batch size (default: 50000)
  DIMENSIONS: number; // Vector dimensionality (default: 144)
}
```

**Configuration Examples:**

```bash
# Increase batch size for faster training on large datasets
export PARAMETER_CHUNK_SIZE=100000

# Adjust vector dimensions (requires retraining all models)
export DIMENSIONS=256
```
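Since these are environment variables, a trainer presumably resolves them with fallbacks to the documented defaults. A sketch of how that might look in Node (the variable names match the docs; the parsing helper itself is an assumption, not the library's code):

```javascript
// Resolve training configuration from the environment, falling back to the
// documented defaults when a variable is unset or not a number.
function loadTrainingConfig(env = process.env) {
  const chunkSize = Number.parseInt(env.PARAMETER_CHUNK_SIZE ?? '', 10);
  const dimensions = Number.parseInt(env.DIMENSIONS ?? '', 10);
  return {
    PARAMETER_CHUNK_SIZE: Number.isFinite(chunkSize) ? chunkSize : 50000,
    DIMENSIONS: Number.isFinite(dimensions) ? dimensions : 144,
  };
}

console.log(loadTrainingConfig({}));
// { PARAMETER_CHUNK_SIZE: 50000, DIMENSIONS: 144 }
```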

## Training Datasets

The system supports structured dataset configurations for reproducible training:

### Dataset Structure

```javascript { .api }
/**
 * Training dataset configuration object
 */
interface Dataset {
  name: string; // Dataset identifier
  files: string[]; // Document filenames (without .txt extension)
}

/**
 * Built-in dataset examples
 */
const DefaultDataset = {
  name: 'default',
  files: [
    'animal-facts',
    'cat-facts',
    'facts-and-sentences',
    'heart-of-darkness',
    'lectures-on-alchemy',
    'legendary-islands-of-the-atlantic',
    'on-the-taboo-against-knowing-who-you-are',
    'paris',
    'test',
    'the-initiates-of-the-flame',
    'the-phantom-of-the-opera'
  ]
};
```
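A small validator can catch common dataset mistakes (an empty name, a `.txt` extension left on a filename) before training starts. This helper is a suggestion, not part of the package:

```javascript
// Validate a Dataset configuration before passing it to train().
// Throws a descriptive error on the first problem found.
function validateDataset(dataset) {
  if (!dataset || typeof dataset.name !== 'string' || dataset.name.length === 0) {
    throw new Error('Dataset requires a non-empty string "name"');
  }
  if (!Array.isArray(dataset.files) || dataset.files.length === 0) {
    throw new Error('Dataset requires a non-empty "files" array');
  }
  for (const file of dataset.files) {
    if (file.endsWith('.txt')) {
      // Filenames are specified WITHOUT the .txt extension.
      throw new Error(`File "${file}" should omit the .txt extension`);
    }
  }
  return dataset;
}

validateDataset({ name: 'default', files: ['cat-facts'] }); // passes
```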

### Document Management

Training documents must be placed in the `training/documents/` directory relative to the project root:

```
project-root/
├── training/
│   ├── documents/
│   │   ├── document1.txt
│   │   ├── document2.txt
│   │   └── ...
│   └── embeddings/
│       ├── dataset1.json
│       ├── dataset2.json
│       └── ...
```

### Internal Utility Functions

The training system uses internal utility functions for document and embedding management. These are not exported from the main package but are used internally by the training pipeline:

```javascript { .api }
/**
 * Internal utility functions (not directly accessible)
 */
interface InternalUtils {
  combineDocuments(documents: string[]): Promise<string>;
  fetchEmbeddings(name: string): Promise<Object>;
  tokenize(input: string): string[];
  getPartsOfSpeech(text: string): Object[];
}
```

These functions handle:

- **combineDocuments**: Reads and concatenates training document files from the `training/documents/` directory
- **fetchEmbeddings**: Loads pre-computed embeddings from the `training/embeddings/` directory
- **tokenize**: Splits input text into tokens using regex-based tokenization
- **getPartsOfSpeech**: Analyzes grammatical roles using wink-pos-tagger
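The exact regex the internal `tokenize` uses is not documented here; a plausible sketch in the same spirit, treating words (with internal apostrophes) and standalone punctuation marks as separate tokens:

```javascript
// Illustrative regex-based tokenizer. Words keep internal apostrophes;
// each punctuation character becomes its own token.
// The library's actual tokenize() may differ.
function tokenizeSketch(input) {
  return input.match(/[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[^\sA-Za-z0-9]/g) ?? [];
}

console.log(tokenizeSketch("Don't panic, reader."));
// [ "Don't", 'panic', ',', 'reader', '.' ]
```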

## Training Performance

### Training Optimization

- **Chunked Processing**: Large parameter sets processed in configurable chunks
- **Memory Management**: Efficient n-gram trie construction with merge operations
- **Progress Logging**: Detailed console output showing training progress
- **File Persistence**: Automatic saving of embeddings for reuse
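Chunked processing keeps peak memory bounded by handling one batch at a time. A sketch of splitting a parameter set into `PARAMETER_CHUNK_SIZE`-sized batches (the default chunk size is documented above; the splitting helper itself is illustrative):

```javascript
// Split a large array of parameters into fixed-size chunks so each batch
// can be processed and released before the next one is materialized.
function* chunks(items, size = 50000) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Example: 120,001 parameters with the default chunk size yields 3 batches.
const batches = [...chunks(new Array(120001).fill(0))];
console.log(batches.length); // 3
```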

### Training Time Estimates

Training time varies based on dataset size and system performance:

- Small datasets (< 1MB text): 30 seconds - 2 minutes
- Medium datasets (1-10MB text): 2-15 minutes
- Large datasets (10-100MB text): 15 minutes - 2 hours
- Very large datasets (100MB+ text): 2+ hours

### Memory Requirements

- Base memory: ~50MB for library components
- Training memory: ~1-5MB per 1MB of training text
- Embedding storage: ~100KB per 1000 unique tokens
- Context memory: ~10-50MB for typical models

## Error Handling

The training system handles various error conditions:

- **Missing Files**: Clear error messages for missing training documents
- **Invalid Format**: Validation of document file formats
- **Memory Limits**: Chunked processing for large datasets
- **Disk Space**: Automatic cleanup of temporary files
- **Encoding Issues**: UTF-8 text processing with error recovery
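Callers can surface the missing-file condition with an ordinary try/catch around `train()`. A hedged example: the `ENOENT` code is standard Node filesystem behavior, but the exact error shape the library throws is an assumption here:

```javascript
// Wrap training so a missing training document fails loudly with context.
// The error fields checked (code, path) are Node fs conventions, assumed
// to propagate from the library's file reads.
async function trainSafely(model, dataset) {
  try {
    await model.train(dataset);
  } catch (err) {
    if (err.code === 'ENOENT') {
      throw new Error(
        `Training document not found for dataset "${dataset.name}": ${err.path ?? 'unknown file'}`
      );
    }
    throw err; // re-throw anything we don't recognize
  }
}
```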