
# Model Operations

Core functionality for loading models, generating text, and managing model state. The Model class provides the primary interface for interacting with GGML language models through both streaming and batch generation methods.

## Capabilities

### Model Initialization

Initialize and configure a language model instance with extensive customization options for context size, GPU utilization, and model behavior.

```python { .api }
class Model:
    def __init__(
        self,
        model_path: str,
        prompt_context: str = '',
        prompt_prefix: str = '',
        prompt_suffix: str = '',
        log_level: int = logging.ERROR,
        n_ctx: int = 512,
        seed: int = 0,
        n_gpu_layers: int = 0,
        f16_kv: bool = False,
        logits_all: bool = False,
        vocab_only: bool = False,
        use_mlock: bool = False,
        embedding: bool = False
    ):
        """
        Initialize a Model instance.

        Parameters:
        - model_path: str, path to the GGML model file
        - prompt_context: str, global context for all interactions
        - prompt_prefix: str, prefix added to each prompt
        - prompt_suffix: str, suffix added to each prompt
        - log_level: int, logging level (default: logging.ERROR)
        - n_ctx: int, context window size in tokens (default: 512)
        - seed: int, random seed for generation (default: 0)
        - n_gpu_layers: int, number of layers to offload to GPU (default: 0)
        - f16_kv: bool, use fp16 for key/value cache (default: False)
        - logits_all: bool, compute all logits, not just the last token (default: False)
        - vocab_only: bool, only load the vocabulary, no weights (default: False)
        - use_mlock: bool, force the system to keep the model in RAM (default: False)
        - embedding: bool, enable embedding mode (default: False)
        """
```

Example usage:

```python
from pyllamacpp.model import Model

# Basic model loading
model = Model(model_path='./models/llama-7b.ggml')

# Advanced configuration
model = Model(
    model_path='./models/llama-13b.ggml',
    n_ctx=2048,
    n_gpu_layers=32,
    f16_kv=True,
    prompt_context="You are a helpful AI assistant.",
    prompt_prefix="\n\nHuman: ",
    prompt_suffix="\n\nAssistant: "
)
```
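The `prompt_context`, `prompt_prefix`, and `prompt_suffix` options wrap every prompt sent to the model. The exact composition is internal to pyllamacpp, but the idea can be sketched as a plain string assembly; `compose_prompt` below is a hypothetical illustration, not part of the library:

```python
def compose_prompt(user_text: str, context: str = '',
                   prefix: str = '', suffix: str = '') -> str:
    # Rough sketch: a global context, then the per-prompt prefix,
    # the user's text, and finally the per-prompt suffix.
    return f"{context}{prefix}{user_text}{suffix}"

full = compose_prompt(
    "What is a llama?",
    context="You are a helpful AI assistant.",
    prefix="\n\nHuman: ",
    suffix="\n\nAssistant: ",
)
print(full)
```

Seen this way, the prefix and suffix give a cheap chat-style template without rewriting each prompt by hand.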

### Streaming Text Generation

Generate text tokens iteratively using a generator pattern, allowing real-time display of generated text with extensive parameter control over sampling strategies.

```python { .api }
def generate(
    self,
    prompt: str,
    n_predict: Union[None, int] = None,
    n_threads: int = 4,
    seed: Union[None, int] = None,
    antiprompt: str = None,
    n_batch: int = 512,
    n_keep: int = 0,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    infinite_generation: bool = False
) -> Generator:
    """
    Generate text tokens iteratively.

    Parameters:
    - prompt: str, input prompt for generation
    - n_predict: int or None, max tokens to generate (None to generate until EOS)
    - n_threads: int, CPU threads to use (default: 4)
    - seed: int or None, random seed (None for a time-based seed)
    - antiprompt: str, stop word that halts generation
    - n_batch: int, batch size for prompt processing (default: 512)
    - n_keep: int, tokens to keep from the initial prompt (default: 0)
    - top_k: int, top-k sampling parameter (default: 40)
    - top_p: float, top-p sampling parameter (default: 0.95)
    - tfs_z: float, tail-free sampling parameter (default: 1.00)
    - typical_p: float, typical sampling parameter (default: 1.00)
    - temp: float, sampling temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, number of recent tokens to penalize (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat algorithm (0=disabled, 1=v1, 2=v2)
    - mirostat_tau: float, mirostat target entropy (default: 5.00)
    - mirostat_eta: float, mirostat learning rate (default: 0.1)
    - infinite_generation: bool, generate indefinitely (default: False)

    Yields:
    str: Individual tokens as they are generated
    """
```

Example usage:

```python
# Basic streaming generation
for token in model.generate("What is machine learning?"):
    print(token, end='', flush=True)

# Advanced parameter control
for token in model.generate(
    "Explain quantum computing",
    n_predict=200,
    temp=0.7,
    top_p=0.9,
    repeat_penalty=1.15,
    antiprompt="Human:"
):
    print(token, end='', flush=True)
```
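Because `generate` yields tokens one at a time, collecting a complete response is just a matter of joining the stream. `collect_stream` is a hypothetical helper, shown with a stand-in iterator so the sketch runs without a loaded model:

```python
def collect_stream(token_iter) -> str:
    # Accumulate streamed tokens into a single response string.
    parts = []
    for tok in token_iter:
        parts.append(tok)
    return "".join(parts)

# With a real model: text = collect_stream(model.generate("...", n_predict=50))
demo = collect_stream(iter(["Hello", ",", " world"]))
print(demo)
```

This keeps the streaming path (for live display) and the batch path (for a final string) behind one interface.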

### Batch Text Generation

Generate complete text responses using llama.cpp's native generation function, with callback support for monitoring generation progress.

```python { .api }
def cpp_generate(
    self,
    prompt: str,
    n_predict: int = 128,
    new_text_callback: Callable[[bytes], None] = None,
    n_threads: int = 4,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    n_batch: int = 8,
    n_keep: int = 0,
    interactive: bool = False,
    antiprompt: List = [],
    instruct: bool = False,
    verbose_prompt: bool = False
) -> str:
    """
    Generate text using llama.cpp's native generation function.

    Parameters:
    - prompt: str, input prompt
    - n_predict: int, number of tokens to generate (default: 128)
    - new_text_callback: callable, invoked with each new chunk of generated text as bytes
    - n_threads: int, CPU threads (default: 4)
    - top_k: int, top-k sampling (default: 40)
    - top_p: float, top-p sampling (default: 0.95)
    - tfs_z: float, tail-free sampling (default: 1.00)
    - typical_p: float, typical sampling (default: 1.00)
    - temp: float, temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, penalty window (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat mode (0=disabled, 1=v1, 2=v2; default: 0)
    - mirostat_tau: float, mirostat target entropy (default: 5.00)
    - mirostat_eta: float, mirostat learning rate (default: 0.1)
    - n_batch: int, batch size (default: 8)
    - n_keep: int, tokens to keep (default: 0)
    - interactive: bool, interactive mode (default: False)
    - antiprompt: list, stop phrases (default: [])
    - instruct: bool, instruction mode (default: False)
    - verbose_prompt: bool, print the prompt before generation (default: False)

    Returns:
    str: Complete generated text
    """
```

Example usage:

```python
# Basic batch generation
response = model.cpp_generate("Describe the solar system", n_predict=200)
print(response)

# With a callback for progress monitoring
def progress_callback(text: bytes):
    print(text.decode('utf-8'), end='', flush=True)

response = model.cpp_generate(
    "Write a short poem",
    n_predict=100,
    new_text_callback=progress_callback,
    temp=0.9
)
```
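Since `new_text_callback` receives raw bytes, a small callable object can accumulate chunks into one string instead of printing them. `TextAccumulator` is a hypothetical helper, not part of pyllamacpp:

```python
class TextAccumulator:
    """Collects the bytes chunks passed to new_text_callback into one string."""

    def __init__(self):
        self._chunks = []

    def __call__(self, data: bytes) -> None:
        # The callback receives raw bytes; decode defensively in case a
        # multi-byte character is ever split across chunks.
        self._chunks.append(data.decode('utf-8', errors='replace'))

    def text(self) -> str:
        return "".join(self._chunks)

acc = TextAccumulator()
# With a real model: model.cpp_generate("Write a short poem", new_text_callback=acc)
acc(b"Roses are red")
print(acc.text())
```

Any callable with the right signature works here, so the same pattern can also feed a progress bar or a log file.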

### Tokenization and Text Processing

Convert between text and token representations, essential for understanding model input processing and implementing custom text handling.

```python { .api }
def tokenize(self, text: str):
    """
    Convert text to a list of tokens.

    Parameters:
    - text: str, text to tokenize

    Returns:
    list: List of token integers
    """

def detokenize(self, tokens: list):
    """
    Convert tokens back to text.

    Parameters:
    - tokens: list or array, token integers

    Returns:
    str: Decoded text string
    """
```

Example usage:

```python
# Tokenize text
tokens = model.tokenize("Hello, world!")
print(f"Tokens: {tokens}")

# Convert back to text
text = model.detokenize(tokens)
print(f"Text: {text}")

# Analyze token count
prompt = "This is a test prompt for token counting"
token_count = len(model.tokenize(prompt))
print(f"Token count: {token_count}")
```
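A practical use of `tokenize` is checking that a prompt plus its generation budget fits inside `n_ctx` before calling `generate`. `fits_context` is a hypothetical helper; the stand-in tokenizer below splits on whitespace so the sketch runs without a loaded model:

```python
def fits_context(tokenize_fn, prompt: str,
                 n_ctx: int = 512, n_predict: int = 128) -> bool:
    # The prompt's tokens plus the tokens we intend to generate must
    # both fit inside the model's context window.
    return len(tokenize_fn(prompt)) + n_predict <= n_ctx

# With a real model: fits_context(model.tokenize, prompt, n_ctx=2048)
fake_tokenize = lambda s: s.split()  # stand-in: one "token" per word
print(fits_context(fake_tokenize, "a b c d", n_ctx=10, n_predict=5))
```

Failing this check up front is cheaper than having generation truncate or overflow the context mid-response.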

### Context Management

Reset and manage the model's conversational context, essential for multi-turn conversations and context window management.

```python { .api }
def reset(self) -> None:
    """
    Reset the model context and token history.

    Clears the conversation history and resets internal state to its
    initial conditions, useful for starting fresh conversations or
    managing context window limitations.
    """
```

Example usage:

```python
# Use the model for one conversation (generate returns a generator,
# so it must be consumed for generation to actually run)
for token in model.generate("Hello, how are you?"):
    print(token, end='', flush=True)

# Reset for a fresh conversation
model.reset()

# Start a new conversation with a clean context
for token in model.generate("What's the weather like?"):
    print(token, end='', flush=True)
```
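In a long-running chat loop, the reset can be triggered automatically when the conversation nears the context limit. `maybe_reset` is a hypothetical wrapper around that pattern, not a library function:

```python
def maybe_reset(model, used_tokens: int, n_ctx: int = 512, margin: int = 64) -> int:
    # Reset the model once the running token count gets within `margin`
    # tokens of the context window; returns the new running count.
    if used_tokens + margin >= n_ctx:
        model.reset()
        return 0
    return used_tokens
```

Here `used_tokens` would be accumulated per turn from `len(model.tokenize(prompt))` plus the number of tokens generated.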

### Performance and Debugging

Access performance metrics and system information for optimization and debugging purposes.

```python { .api }
def llama_print_timings(self):
    """Print detailed performance timing information."""

@staticmethod
def llama_print_system_info():
    """Print system information relevant to model execution."""

@staticmethod
def get_params(params) -> dict:
    """
    Convert a parameter object to a dictionary representation.

    Parameters:
    - params: parameter object

    Returns:
    dict: Dictionary representation of parameters
    """
```

Example usage:

```python
# Print system information
Model.llama_print_system_info()

# Generate text (consuming the generator), then check performance
for _ in model.generate("Test prompt"):
    pass
model.llama_print_timings()

# Inspect model parameters
params_dict = Model.get_params(model.llama_params)
print(params_dict)
```