# Model Operations

Core functionality for loading models, generating text, and managing model state. The `Model` class provides the primary interface for interacting with GGML language models through both streaming and batch generation methods.

## Capabilities

### Model Initialization

Initialize and configure a language model instance with extensive customization options for context size, GPU utilization, and model behavior.

```python { .api }
class Model:
    def __init__(
        self,
        model_path: str,
        prompt_context: str = '',
        prompt_prefix: str = '',
        prompt_suffix: str = '',
        log_level: int = logging.ERROR,
        n_ctx: int = 512,
        seed: int = 0,
        n_gpu_layers: int = 0,
        f16_kv: bool = False,
        logits_all: bool = False,
        vocab_only: bool = False,
        use_mlock: bool = False,
        embedding: bool = False
    ):
        """
        Initialize a Model instance.

        Parameters:
        - model_path: str, path to the GGML model file
        - prompt_context: str, global context for all interactions
        - prompt_prefix: str, prefix added to each prompt
        - prompt_suffix: str, suffix added to each prompt
        - log_level: int, logging level (default: logging.ERROR)
        - n_ctx: int, context window size in tokens (default: 512)
        - seed: int, random seed for generation (default: 0)
        - n_gpu_layers: int, number of layers to offload to the GPU (default: 0)
        - f16_kv: bool, use fp16 for the key/value cache (default: False)
        - logits_all: bool, compute logits for all tokens, not just the last (default: False)
        - vocab_only: bool, load only the vocabulary, no weights (default: False)
        - use_mlock: bool, force the system to keep the model in RAM (default: False)
        - embedding: bool, enable embedding mode (default: False)
        """
```
Example usage:

```python
from pyllamacpp.model import Model

# Basic model loading
model = Model(model_path='./models/llama-7b.ggml')

# Advanced configuration
model = Model(
    model_path='./models/llama-13b.ggml',
    n_ctx=2048,
    n_gpu_layers=32,
    f16_kv=True,
    prompt_context="You are a helpful AI assistant.",
    prompt_prefix="\n\nHuman: ",
    prompt_suffix="\n\nAssistant: "
)
```
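The `prompt_context`, `prompt_prefix`, and `prompt_suffix` settings wrap every prompt sent to the model. A minimal sketch of that composition; the exact ordering used internally is an assumption here, and the helper below is purely illustrative:

```python
def compose_prompt(prompt: str,
                   context: str = "You are a helpful AI assistant.",
                   prefix: str = "\n\nHuman: ",
                   suffix: str = "\n\nAssistant: ") -> str:
    """Wrap a raw prompt the way prompt_context/prefix/suffix presumably do."""
    return f"{context}{prefix}{prompt}{suffix}"

# Inspect the full string the model would likely receive.
print(compose_prompt("What is machine learning?"))
```

Printing the composed string makes it easier to pick a prefix and suffix that match the chat format the model was trained on.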
### Streaming Text Generation

Generate text token by token through a generator, allowing real-time display of output with fine-grained control over the sampling strategy.

```python { .api }
def generate(
    self,
    prompt: str,
    n_predict: Union[None, int] = None,
    n_threads: int = 4,
    seed: Union[None, int] = None,
    antiprompt: str = None,
    n_batch: int = 512,
    n_keep: int = 0,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    infinite_generation: bool = False
) -> Generator:
    """
    Generate text tokens iteratively.

    Parameters:
    - prompt: str, input prompt for generation
    - n_predict: int or None, max tokens to generate (None to generate until EOS)
    - n_threads: int, CPU threads to use (default: 4)
    - seed: int or None, random seed (None for a time-based seed)
    - antiprompt: str, stop word that halts generation
    - n_batch: int, batch size for prompt processing (default: 512)
    - n_keep: int, tokens to keep from the initial prompt (default: 0)
    - top_k: int, top-k sampling parameter (default: 40)
    - top_p: float, top-p sampling parameter (default: 0.95)
    - tfs_z: float, tail free sampling parameter (default: 1.00)
    - typical_p: float, typical sampling parameter (default: 1.00)
    - temp: float, sampling temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, number of recent tokens to penalize (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat algorithm (0=disabled, 1=v1, 2=v2)
    - mirostat_tau: float, mirostat target entropy (default: 5.00)
    - mirostat_eta: float, mirostat learning rate (default: 0.1)
    - infinite_generation: bool, generate indefinitely (default: False)

    Yields:
    str: Individual tokens as they are generated
    """
```
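To build intuition for how `temp`, `top_k`, and `top_p` interact, here is a self-contained sketch of one sampling step over a toy token distribution. This is not the library's implementation, just the standard temperature/top-k/top-p recipe in plain Python:

```python
import math
import random

def sample_token(logits: dict, top_k: int = 40, top_p: float = 0.95,
                 temp: float = 0.8, rng: random.Random = None) -> str:
    """One temperature/top-k/top-p sampling step over a {token: logit} map."""
    rng = rng or random.Random(0)
    # Temperature scaling, then a numerically stable softmax.
    scaled = {t: l / temp for t, l in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(weights.values())
    ranked = sorted(((t, w / z) for t, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)
    # top_k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest set whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize and draw.
    total = sum(p for _, p in kept)
    r = rng.random() * total
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]

print(sample_token({"the": 3.0, "a": 2.0, "cat": 1.0, "dog": 0.5}))
```

Lower `temp` sharpens the distribution before the cutoffs apply, so low temperature plus a tight `top_p` quickly collapses to greedy decoding.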
Example usage:

```python
# Basic streaming generation
for token in model.generate("What is machine learning?"):
    print(token, end='', flush=True)

# Advanced parameter control
for token in model.generate(
    "Explain quantum computing",
    n_predict=200,
    temp=0.7,
    top_p=0.9,
    repeat_penalty=1.15,
    antiprompt="Human:"
):
    print(token, end='', flush=True)
```
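Stopping on an `antiprompt` amounts to watching the accumulated output for the stop string. A hedged sketch of that loop around any token stream (not the library's internals); it holds back a few characters so a stop string split across tokens is never emitted:

```python
from typing import Iterable, Iterator

def stream_until(tokens: Iterable[str], antiprompt: str) -> Iterator[str]:
    """Yield streamed text, stopping (and trimming) at `antiprompt`."""
    pending = ""
    for tok in tokens:
        pending += tok
        idx = pending.find(antiprompt)
        if idx != -1:
            if idx:
                yield pending[:idx]  # emit text before the stop string
            return
        # Safe to emit all but a possible partial match at the end.
        safe = len(pending) - (len(antiprompt) - 1)
        if safe > 0:
            yield pending[:safe]
            pending = pending[safe:]
    if pending:
        yield pending

# Simulated token stream in place of model.generate(...).
fake_stream = ["I am ", "fine.", "\n\nHum", "an: next?"]
print("".join(stream_until(fake_stream, "Human:")))
```

The same wrapper works unchanged around `model.generate(...)`, which is useful when you want a stop phrase that `antiprompt` alone does not cover.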
### Batch Text Generation

Generate complete text responses using llama.cpp's native generation function, with callback support for monitoring generation progress.

```python { .api }
def cpp_generate(
    self,
    prompt: str,
    n_predict: int = 128,
    new_text_callback: Callable[[bytes], None] = None,
    n_threads: int = 4,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    n_batch: int = 8,
    n_keep: int = 0,
    interactive: bool = False,
    antiprompt: List = [],
    instruct: bool = False,
    verbose_prompt: bool = False
) -> str:
    """
    Generate text using llama.cpp's native generation function.

    Parameters:
    - prompt: str, input prompt
    - n_predict: int, number of tokens to generate (default: 128)
    - new_text_callback: callable, invoked with each new chunk of generated text as bytes
    - n_threads: int, CPU threads (default: 4)
    - top_k: int, top-k sampling (default: 40)
    - top_p: float, top-p sampling (default: 0.95)
    - tfs_z: float, tail free sampling (default: 1.00)
    - typical_p: float, typical sampling (default: 1.00)
    - temp: float, temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, penalty window (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat mode (default: 0)
    - mirostat_tau: float, mirostat tau (default: 5.00)
    - mirostat_eta: float, mirostat eta (default: 0.1)
    - n_batch: int, batch size (default: 8)
    - n_keep: int, tokens to keep (default: 0)
    - interactive: bool, interactive mode (default: False)
    - antiprompt: list, stop phrases (default: [])
    - instruct: bool, instruction mode (default: False)
    - verbose_prompt: bool, verbose prompting (default: False)

    Returns:
    str: Complete generated text
    """
```
Example usage:

```python
# Basic batch generation
response = model.cpp_generate("Describe the solar system", n_predict=200)
print(response)

# With a callback for progress monitoring
def progress_callback(text):
    print("Generated:", text.decode('utf-8'), end='')

response = model.cpp_generate(
    "Write a short poem",
    n_predict=100,
    new_text_callback=progress_callback,
    temp=0.9
)
```
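Because `new_text_callback` receives raw bytes, decoding each chunk separately can fail if a multi-byte UTF-8 sequence is split across chunks. A small collector sketch that accumulates the chunks and decodes once at the end (only the bytes-callback signature is taken from the API above):

```python
from typing import List

class TextCollector:
    """Accumulates the byte chunks passed to new_text_callback."""

    def __init__(self) -> None:
        self.chunks: List[bytes] = []

    def __call__(self, data: bytes) -> None:
        self.chunks.append(data)

    @property
    def text(self) -> str:
        # Decode once, so multi-byte UTF-8 sequences split across
        # chunks are still handled correctly.
        return b"".join(self.chunks).decode("utf-8", errors="replace")

collector = TextCollector()
# In real use: model.cpp_generate("Write a short poem", new_text_callback=collector)
for chunk in (b"Roses are red,", b" violets are blue."):
    collector(chunk)
print(collector.text)
```

Because the collector is callable, it can be passed directly as `new_text_callback` while also retaining the full output for later use.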
### Tokenization and Text Processing

Convert between text and token representations, which is essential for understanding how the model processes input and for implementing custom text handling.

```python { .api }
def tokenize(self, text: str):
    """
    Convert text to a list of tokens.

    Parameters:
    - text: str, text to tokenize

    Returns:
    list: List of token integers
    """

def detokenize(self, tokens: list):
    """
    Convert tokens back to text.

    Parameters:
    - tokens: list or array, token integers

    Returns:
    str: Decoded text string
    """
```
Example usage:

```python
# Tokenize text
tokens = model.tokenize("Hello, world!")
print(f"Tokens: {tokens}")

# Convert back to text
text = model.detokenize(tokens)
print(f"Text: {text}")

# Analyze token count
prompt = "This is a test prompt for token counting"
token_count = len(model.tokenize(prompt))
print(f"Token count: {token_count}")
```
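`tokenize` and `detokenize` also make it straightforward to trim a prompt so that it plus the requested completion fits inside `n_ctx`. A sketch that takes the two methods as plain callables; the word-level tokenizer below is a stub for demonstration, and with a loaded model you would pass `model.tokenize` and `model.detokenize`:

```python
from typing import Callable, List

def fit_prompt(prompt: str,
               tokenize: Callable[[str], List[int]],
               detokenize: Callable[[List[int]], str],
               n_ctx: int = 512,
               n_predict: int = 128) -> str:
    """Trim the prompt so its tokens plus n_predict fit inside n_ctx.

    Keeps the most recent tokens, which usually matter most in chat.
    """
    budget = n_ctx - n_predict
    tokens = tokenize(prompt)
    if len(tokens) <= budget:
        return prompt
    return detokenize(tokens[-budget:])

# Stub tokenizer: one "token" per word, joined back with spaces.
tok = lambda s: s.split()
detok = lambda ts: " ".join(ts)
print(fit_prompt("a b c d e f", tok, detok, n_ctx=5, n_predict=2))
```

Note that detokenizing a token slice can land mid-word with a real subword tokenizer, so trimming at a larger granularity (whole messages) is often the safer choice.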
### Context Management

Reset and manage the model's conversational context, which is essential for multi-turn conversations and for staying within the context window.

```python { .api }
def reset(self) -> None:
    """
    Reset the model context and token history.

    Clears the conversation history and restores the internal state to
    its initial conditions, useful for starting a fresh conversation or
    managing context window limitations.
    """
```
Example usage:

```python
# Use model for one conversation (generate is a generator,
# so it must be iterated to actually produce text)
for token in model.generate("Hello, how are you?"):
    print(token, end='', flush=True)

# Reset for a fresh conversation
model.reset()

# Start a new conversation with clean context
for token in model.generate("What's the weather like?"):
    print(token, end='', flush=True)
```
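In a long-running chat, a common pattern is to call `reset()` once the running token count approaches `n_ctx`. A sketch of the bookkeeping; the `ContextBudget` helper is hypothetical, not part of the API:

```python
class ContextBudget:
    """Tracks an approximate token count and signals when to reset."""

    def __init__(self, n_ctx: int = 512, reserve: int = 128) -> None:
        self.n_ctx = n_ctx
        self.reserve = reserve  # head-room kept for the next reply
        self.used = 0

    def add(self, n_tokens: int) -> None:
        self.used += n_tokens

    def should_reset(self) -> bool:
        return self.used + self.reserve >= self.n_ctx

budget = ContextBudget(n_ctx=512, reserve=128)
budget.add(300)
print(budget.should_reset())  # 300 + 128 < 512, still room
budget.add(100)
print(budget.should_reset())  # 400 + 128 >= 512, time to reset
```

In real use you would feed `add()` with `len(model.tokenize(...))` per turn and, when `should_reset()` returns True, call `model.reset()` and zero the counter.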
### Performance and Debugging

Access performance metrics and system information for optimization and debugging purposes.

```python { .api }
def llama_print_timings(self):
    """Print detailed performance timing information."""

@staticmethod
def llama_print_system_info():
    """Print system information relevant to model execution."""

@staticmethod
def get_params(params) -> dict:
    """
    Convert a parameter object to a dictionary representation.

    Parameters:
    - params: parameter object

    Returns:
    dict: Dictionary representation of parameters
    """
```
Example usage:

```python
# Print system information
Model.llama_print_system_info()

# Generate text and check performance (the generator must be
# consumed for any work, and therefore timings, to happen)
for _ in model.generate("Test prompt"):
    pass
model.llama_print_timings()

# Inspect model parameters
params_dict = Model.get_params(model.llama_params)
print(params_dict)
```