
# Model Operations

Core functionality for loading models, generating text, and managing model state. The Model class provides the primary interface for interacting with GGML language models through both streaming and batch generation methods.

## Capabilities

### Model Initialization

Initialize and configure a language model instance with extensive customization options for context size, GPU utilization, and model behavior.

```python { .api }
class Model:
    def __init__(
        self,
        model_path: str,
        prompt_context: str = '',
        prompt_prefix: str = '',
        prompt_suffix: str = '',
        log_level: int = logging.ERROR,
        n_ctx: int = 512,
        seed: int = 0,
        n_gpu_layers: int = 0,
        f16_kv: bool = False,
        logits_all: bool = False,
        vocab_only: bool = False,
        use_mlock: bool = False,
        embedding: bool = False
    ):
        """
        Initialize a Model instance.

        Parameters:
        - model_path: str, path to the GGML model file
        - prompt_context: str, global context for all interactions
        - prompt_prefix: str, prefix added to each prompt
        - prompt_suffix: str, suffix added to each prompt
        - log_level: int, logging level (default: logging.ERROR)
        - n_ctx: int, context window size in tokens (default: 512)
        - seed: int, random seed for generation (default: 0)
        - n_gpu_layers: int, number of layers to offload to GPU (default: 0)
        - f16_kv: bool, use fp16 for key/value cache (default: False)
        - logits_all: bool, compute all logits, not just the last token (default: False)
        - vocab_only: bool, only load the vocabulary, no weights (default: False)
        - use_mlock: bool, force the system to keep the model in RAM (default: False)
        - embedding: bool, enable embedding mode (default: False)
        """
```

Example usage:

```python
from pyllamacpp.model import Model

# Basic model loading
model = Model(model_path='./models/llama-7b.ggml')

# Advanced configuration
model = Model(
    model_path='./models/llama-13b.ggml',
    n_ctx=2048,
    n_gpu_layers=32,
    f16_kv=True,
    prompt_context="You are a helpful AI assistant.",
    prompt_prefix="\n\nHuman: ",
    prompt_suffix="\n\nAssistant: "
)
```
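The `prompt_context`, `prompt_prefix`, and `prompt_suffix` options wrap every prompt sent to the model. The exact composition is internal to pyllamacpp, but the idea can be sketched as a plain string assembly; `compose_prompt` below is a hypothetical illustration, not part of the library:

```python
def compose_prompt(user_text: str, context: str = '',
                   prefix: str = '', suffix: str = '') -> str:
    # Rough sketch: a global context, then the per-prompt prefix,
    # the user's text, and finally the per-prompt suffix.
    return f"{context}{prefix}{user_text}{suffix}"

full = compose_prompt(
    "What is a llama?",
    context="You are a helpful AI assistant.",
    prefix="\n\nHuman: ",
    suffix="\n\nAssistant: ",
)
print(full)
```

Seen this way, the prefix and suffix give a cheap chat-style template without rewriting each prompt by hand.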

### Streaming Text Generation

Generate text tokens iteratively using a generator pattern, allowing real-time display of generated text with extensive parameter control over sampling strategies.

```python { .api }
def generate(
    self,
    prompt: str,
    n_predict: Union[None, int] = None,
    n_threads: int = 4,
    seed: Union[None, int] = None,
    antiprompt: str = None,
    n_batch: int = 512,
    n_keep: int = 0,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    infinite_generation: bool = False
) -> Generator:
    """
    Generate text tokens iteratively.

    Parameters:
    - prompt: str, input prompt for generation
    - n_predict: int or None, max tokens to generate (None to generate until EOS)
    - n_threads: int, CPU threads to use (default: 4)
    - seed: int or None, random seed (None for a time-based seed)
    - antiprompt: str, stop word that halts generation
    - n_batch: int, batch size for prompt processing (default: 512)
    - n_keep: int, tokens to keep from the initial prompt (default: 0)
    - top_k: int, top-k sampling parameter (default: 40)
    - top_p: float, top-p sampling parameter (default: 0.95)
    - tfs_z: float, tail-free sampling parameter (default: 1.00)
    - typical_p: float, typical sampling parameter (default: 1.00)
    - temp: float, sampling temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, number of recent tokens to penalize (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat algorithm (0=disabled, 1=v1, 2=v2)
    - mirostat_tau: float, mirostat target entropy (default: 5.00)
    - mirostat_eta: float, mirostat learning rate (default: 0.1)
    - infinite_generation: bool, generate indefinitely (default: False)

    Yields:
    str: Individual tokens as they are generated
    """
```

Example usage:

```python
# Basic streaming generation
for token in model.generate("What is machine learning?"):
    print(token, end='', flush=True)

# Advanced parameter control
for token in model.generate(
    "Explain quantum computing",
    n_predict=200,
    temp=0.7,
    top_p=0.9,
    repeat_penalty=1.15,
    antiprompt="Human:"
):
    print(token, end='', flush=True)
```
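Because `generate` yields tokens one at a time, collecting a complete response is just a matter of joining the stream. `collect_stream` is a hypothetical helper, shown with a stand-in iterator so the sketch runs without a loaded model:

```python
def collect_stream(token_iter) -> str:
    # Accumulate streamed tokens into a single response string.
    parts = []
    for tok in token_iter:
        parts.append(tok)
    return "".join(parts)

# With a real model: text = collect_stream(model.generate("...", n_predict=50))
demo = collect_stream(iter(["Hello", ",", " world"]))
print(demo)
```

This keeps the streaming path (for live display) and the batch path (for a final string) behind one interface.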

### Batch Text Generation

Generate complete text responses using llama.cpp's native generation function, with callback support for monitoring generation progress.

```python { .api }
def cpp_generate(
    self,
    prompt: str,
    n_predict: int = 128,
    new_text_callback: Callable[[bytes], None] = None,
    n_threads: int = 4,
    top_k: int = 40,
    top_p: float = 0.95,
    tfs_z: float = 1.00,
    typical_p: float = 1.00,
    temp: float = 0.8,
    repeat_penalty: float = 1.10,
    repeat_last_n: int = 64,
    frequency_penalty: float = 0.00,
    presence_penalty: float = 0.00,
    mirostat: int = 0,
    mirostat_tau: float = 5.00,
    mirostat_eta: float = 0.1,
    n_batch: int = 8,
    n_keep: int = 0,
    interactive: bool = False,
    antiprompt: List = [],
    instruct: bool = False,
    verbose_prompt: bool = False
) -> str:
    """
    Generate text using llama.cpp's native generation function.

    Parameters:
    - prompt: str, input prompt
    - n_predict: int, number of tokens to generate (default: 128)
    - new_text_callback: callable, invoked with each new chunk of generated text as bytes
    - n_threads: int, CPU threads (default: 4)
    - top_k: int, top-k sampling (default: 40)
    - top_p: float, top-p sampling (default: 0.95)
    - tfs_z: float, tail-free sampling (default: 1.00)
    - typical_p: float, typical sampling (default: 1.00)
    - temp: float, temperature (default: 0.8)
    - repeat_penalty: float, repetition penalty (default: 1.10)
    - repeat_last_n: int, penalty window (default: 64)
    - frequency_penalty: float, frequency penalty (default: 0.00)
    - presence_penalty: float, presence penalty (default: 0.00)
    - mirostat: int, mirostat mode (0=disabled, 1=v1, 2=v2; default: 0)
    - mirostat_tau: float, mirostat target entropy (default: 5.00)
    - mirostat_eta: float, mirostat learning rate (default: 0.1)
    - n_batch: int, batch size (default: 8)
    - n_keep: int, tokens to keep (default: 0)
    - interactive: bool, interactive mode (default: False)
    - antiprompt: list, stop phrases (default: [])
    - instruct: bool, instruction mode (default: False)
    - verbose_prompt: bool, print the prompt before generation (default: False)

    Returns:
    str: Complete generated text
    """
```

Example usage:

```python
# Basic batch generation
response = model.cpp_generate("Describe the solar system", n_predict=200)
print(response)

# With a callback for progress monitoring
def progress_callback(text: bytes):
    print(text.decode('utf-8'), end='', flush=True)

response = model.cpp_generate(
    "Write a short poem",
    n_predict=100,
    new_text_callback=progress_callback,
    temp=0.9
)
```
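Since `new_text_callback` receives raw bytes, a small callable object can accumulate chunks into one string instead of printing them. `TextAccumulator` is a hypothetical helper, not part of pyllamacpp:

```python
class TextAccumulator:
    """Collects the bytes chunks passed to new_text_callback into one string."""

    def __init__(self):
        self._chunks = []

    def __call__(self, data: bytes) -> None:
        # The callback receives raw bytes; decode defensively in case a
        # multi-byte character is ever split across chunks.
        self._chunks.append(data.decode('utf-8', errors='replace'))

    def text(self) -> str:
        return "".join(self._chunks)

acc = TextAccumulator()
# With a real model: model.cpp_generate("Write a short poem", new_text_callback=acc)
acc(b"Roses are red")
print(acc.text())
```

Any callable with the right signature works here, so the same pattern can also feed a progress bar or a log file.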

### Tokenization and Text Processing

Convert between text and token representations, essential for understanding model input processing and implementing custom text handling.

```python { .api }
def tokenize(self, text: str):
    """
    Convert text to a list of tokens.

    Parameters:
    - text: str, text to tokenize

    Returns:
    list: List of token integers
    """

def detokenize(self, tokens: list):
    """
    Convert tokens back to text.

    Parameters:
    - tokens: list or array, token integers

    Returns:
    str: Decoded text string
    """
```

Example usage:

```python
# Tokenize text
tokens = model.tokenize("Hello, world!")
print(f"Tokens: {tokens}")

# Convert back to text
text = model.detokenize(tokens)
print(f"Text: {text}")

# Analyze token count
prompt = "This is a test prompt for token counting"
token_count = len(model.tokenize(prompt))
print(f"Token count: {token_count}")
```
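A practical use of `tokenize` is checking that a prompt plus its generation budget fits inside `n_ctx` before calling `generate`. `fits_context` is a hypothetical helper; the stand-in tokenizer below splits on whitespace so the sketch runs without a loaded model:

```python
def fits_context(tokenize_fn, prompt: str,
                 n_ctx: int = 512, n_predict: int = 128) -> bool:
    # The prompt's tokens plus the tokens we intend to generate must
    # both fit inside the model's context window.
    return len(tokenize_fn(prompt)) + n_predict <= n_ctx

# With a real model: fits_context(model.tokenize, prompt, n_ctx=2048)
fake_tokenize = lambda s: s.split()  # stand-in: one "token" per word
print(fits_context(fake_tokenize, "a b c d", n_ctx=10, n_predict=5))
```

Failing this check up front is cheaper than having generation truncate or overflow the context mid-response.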

### Context Management

Reset and manage the model's conversational context, essential for multi-turn conversations and context window management.

```python { .api }
def reset(self) -> None:
    """
    Reset the model context and token history.

    Clears the conversation history and resets internal state to its
    initial conditions, useful for starting fresh conversations or
    managing context window limitations.
    """
```

Example usage:

```python
# Use the model for one conversation (generate returns a generator,
# so it must be consumed for generation to actually run)
for token in model.generate("Hello, how are you?"):
    print(token, end='', flush=True)

# Reset for a fresh conversation
model.reset()

# Start a new conversation with a clean context
for token in model.generate("What's the weather like?"):
    print(token, end='', flush=True)
```
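In a long-running chat loop, the reset can be triggered automatically when the conversation nears the context limit. `maybe_reset` is a hypothetical wrapper around that pattern, not a library function:

```python
def maybe_reset(model, used_tokens: int, n_ctx: int = 512, margin: int = 64) -> int:
    # Reset the model once the running token count gets within `margin`
    # tokens of the context window; returns the new running count.
    if used_tokens + margin >= n_ctx:
        model.reset()
        return 0
    return used_tokens
```

Here `used_tokens` would be accumulated per turn from `len(model.tokenize(prompt))` plus the number of tokens generated.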

### Performance and Debugging

Access performance metrics and system information for optimization and debugging purposes.

```python { .api }
def llama_print_timings(self):
    """Print detailed performance timing information."""

@staticmethod
def llama_print_system_info():
    """Print system information relevant to model execution."""

@staticmethod
def get_params(params) -> dict:
    """
    Convert a parameter object to a dictionary representation.

    Parameters:
    - params: parameter object

    Returns:
    dict: Dictionary representation of parameters
    """
```

Example usage:

```python
# Print system information
Model.llama_print_system_info()

# Generate text (consuming the generator), then check performance
for _ in model.generate("Test prompt"):
    pass
model.llama_print_timings()

# Inspect model parameters
params_dict = Model.get_params(model.llama_params)
print(params_dict)
```