Tessl Tile for pypi/llama-cpp-python@0.3.0

# llama-cpp-python

Python bindings for the llama.cpp library, providing high-performance large language model inference with APIs for text completion, chat, embeddings, and multimodal processing. The package offers both a high-level Python interface and low-level C bindings, along with an OpenAI-compatible server.

## Package Information

- **Package Name**: llama-cpp-python
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install llama-cpp-python`

## Core Imports

```python
import llama_cpp
```

Common high-level imports:

```python
from llama_cpp import Llama, LlamaGrammar, LlamaCache
```

OpenAI-compatible types:

```python
from llama_cpp.llama_types import (
    CreateCompletionResponse,
    CreateChatCompletionResponse,
    CreateEmbeddingResponse
)
```

## Basic Usage

```python
from llama_cpp import Llama

# Initialize model
llm = Llama(
    model_path="./models/llama-model.gguf",
    n_ctx=2048,  # Context window
    n_threads=8,  # CPU threads
)

# Generate text completion
output = llm.create_completion(
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.7,
    top_p=0.9,
)
print(output['choices'][0]['text'])

# Create chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=100,
    temperature=0.7,
)
print(response['choices'][0]['message']['content'])

# Generate embeddings (the model must be created with embedding=True,
# otherwise create_embedding raises a RuntimeError)
embedder = Llama(
    model_path="./models/llama-model.gguf",
    embedding=True,
)
embeddings = embedder.create_embedding(
    input=["Hello world", "Python is great"],
)
print(embeddings['data'][0]['embedding'][:5])  # First 5 dimensions
```
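
The calls above return plain dictionaries, so reading a result is ordinary key access. The following is a minimal sketch of two defensive accessors; `completion_text` and `chat_text` are hypothetical helper names (not part of llama-cpp-python), and the sample dictionaries are hand-written stand-ins that only mimic the response shape shown in this document.

```python
from typing import Any, Dict


def completion_text(response: Dict[str, Any]) -> str:
    """Return the text of the first choice in a completion-style response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


def chat_text(response: Dict[str, Any]) -> str:
    """Return the message content of the first choice in a chat response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # 'content' is optional in the message types, so fall back to "".
    return choices[0]["message"].get("content") or ""


# Hand-written stand-ins (assumption: real responses carry more keys).
fake_completion = {"choices": [{"text": " Paris."}]}
fake_chat = {"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]}

print(completion_text(fake_completion))
print(chat_text(fake_chat))
```

Raising on an empty `choices` list keeps a malformed response from surfacing later as an opaque `IndexError`.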

## Architecture

The llama-cpp-python package provides multiple layers of abstraction:

- **High-level API**: The `Llama` class offers convenient methods for common operations with sensible defaults
- **Low-level bindings**: Direct access to llama.cpp C functions through ctypes for maximum control
- **OpenAI compatibility**: Drop-in replacement for OpenAI API endpoints with identical response formats
- **Extensible components**: Modular caching, tokenization, grammar, and formatting systems

Key design patterns:
- **Lazy loading**: Models and contexts are loaded only when needed
- **Memory management**: Automatic cleanup and manual control options
- **Hardware optimization**: CPU, CUDA, and Metal acceleration support
- **Format flexibility**: Support for various model formats (GGUF, GGML) and quantization levels

## Capabilities

### Core Model and Inference

High-level model loading, text generation, and inference operations including completion, sampling, state management, and performance optimization.

```python { .api }
class Llama:
    def __init__(self, model_path: str, **kwargs): ...
    def create_completion(self, prompt: str, **kwargs) -> CreateCompletionResponse: ...
    def create_chat_completion(self, messages: List[dict], **kwargs) -> CreateChatCompletionResponse: ...
    def create_embedding(self, input: Union[str, List[str]], **kwargs) -> CreateEmbeddingResponse: ...
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], prev_tokens: Optional[List[int]] = None, special: bool = False) -> bytes: ...
```
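
`create_embedding` returns one vector per input under `data[i]['embedding']`, and a common next step is comparing two of those vectors. The sketch below uses only the standard library; `cosine_similarity` is a hypothetical helper, and `fake_response` is a hand-written stand-in with tiny two-dimensional vectors rather than real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in shaped like an embedding response (assumption: illustrative values).
fake_response = {
    "data": [
        {"embedding": [1.0, 0.0]},
        {"embedding": [0.0, 1.0]},
    ]
}
first, second = (item["embedding"] for item in fake_response["data"])
print(cosine_similarity(first, second))
```

Values near 1 indicate vectors pointing in the same direction, and values near 0 indicate unrelated directions.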

[Core Model and Inference](./llama-model.md)

### Chat Completions and Formatting

OpenAI-compatible chat completions with extensive formatting options, role-based conversations, function calling, and custom message templates for different model types.

```python { .api }
def get_chat_completion_handler(chat_format: str) -> LlamaChatCompletionHandler: ...
def register_chat_completion_handler(chat_format: str, chat_handler: LlamaChatCompletionHandler): ...

class Jinja2ChatFormatter:
    def __init__(self, template: str, **kwargs): ...
    def format_messages(self, messages: List[dict]) -> ChatFormatterResponse: ...
```
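
Role-based conversations grow by two messages per turn, while the context window (`n_ctx`) stays fixed, so callers often prune old turns before calling `create_chat_completion`. The sketch below shows one simple policy; `trim_history` is a hypothetical helper, and it counts messages rather than tokens, which is only a rough proxy for context usage.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_recent: int) -> List[Message]:
    """Keep every system message plus the last `max_recent` other messages."""
    if max_recent < 0:
        raise ValueError("max_recent must be non-negative")
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # A slice of [-0:] would return everything, so handle zero explicitly.
    recent = others[-max_recent:] if max_recent else []
    return system + recent


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Doing well, thanks."},
    {"role": "user", "content": "What is GGUF?"},
]
trimmed = trim_history(history, max_recent=1)
print([m["role"] for m in trimmed])
```

Keeping the system message unconditionally preserves the assistant's instructions even when older turns are dropped.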

[Chat Completions and Formatting](./chat-completion.md)

### Tokenization

Native llama.cpp tokenization and HuggingFace tokenizer integration with support for different vocabulary types, encoding/decoding, and model-specific preprocessing.

```python { .api }
class LlamaTokenizer:
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], prev_tokens: Optional[List[int]] = None, special: bool = False) -> bytes: ...
    @classmethod
    def from_ggml_file(cls, path: str) -> "LlamaTokenizer": ...

class LlamaHFTokenizer:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str) -> "LlamaHFTokenizer": ...
```
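
A frequent use of tokenization is checking that a prompt plus the requested completion length fits in the context window. The sketch below assumes a tokenizer callable that takes UTF-8 bytes and returns a list of token ids (which is how `tokenize` is understood to behave; verify against your installed version). `fits_in_context` and `fake_tokenize` are hypothetical names, and the fake tokenizer just emits one id per whitespace-separated word so the example runs without a model.

```python
from typing import Callable, List


def fits_in_context(
    tokenize: Callable[[bytes], List[int]],
    prompt: str,
    n_ctx: int,
    max_tokens: int,
) -> bool:
    """True if the prompt tokens plus max_tokens fit within n_ctx."""
    n_prompt = len(tokenize(prompt.encode("utf-8")))
    return n_prompt + max_tokens <= n_ctx


def fake_tokenize(data: bytes) -> List[int]:
    """Stand-in tokenizer: one token id per whitespace-separated word."""
    return list(range(len(data.split())))


print(fits_in_context(fake_tokenize, "one two three", n_ctx=10, max_tokens=7))
print(fits_in_context(fake_tokenize, "one two three", n_ctx=10, max_tokens=8))
```

With a loaded model, the bound method `llm.tokenize` could be passed in place of `fake_tokenize`.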

[Tokenization](./tokenization.md)

### Caching

Memory and disk-based caching systems for model states, context, and computed results to improve inference performance and enable state persistence.

```python { .api }
class LlamaRAMCache:
    def __init__(self, capacity_bytes: int = 2 << 30): ...

class LlamaDiskCache:
    def __init__(self, cache_dir: str = ".cache/llama_cache"): ...
```
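
The default `capacity_bytes` is written as a bit shift, which can be hard to read at a glance. The arithmetic below shows that `2 << 30` is 2 GiB and gives a small conversion helper for choosing another size; `gib_to_bytes` is a hypothetical name introduced only for this illustration.

```python
GIB = 1 << 30  # 1 GiB = 2**30 bytes

# The default shown above: 2 shifted left by 30 bits equals 2 * 2**30.
DEFAULT_CAPACITY_BYTES = 2 << 30


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    if gib < 0:
        raise ValueError("size must be non-negative")
    return int(gib * GIB)


print(DEFAULT_CAPACITY_BYTES)
print(gib_to_bytes(0.5))
```

The result of `gib_to_bytes` could then be passed as `capacity_bytes` when constructing `LlamaRAMCache`.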

[Caching](./caching.md)

### Grammar and Structured Generation

Constrained text generation using formal grammars (GBNF), JSON Schema validation, and built-in templates for structured outputs like JSON, code, and domain-specific formats.

```python { .api }
class LlamaGrammar:
    @classmethod
    def from_string(cls, grammar_str: str, verbose: bool = True) -> "LlamaGrammar": ...
    @classmethod
    def from_json_schema(cls, json_schema: str, verbose: bool = True) -> "LlamaGrammar": ...

def json_schema_to_gbnf(schema: str, **kwargs) -> str: ...
```
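
To make the two grammar sources concrete, the sketch below prepares a tiny GBNF grammar and a JSON Schema serialized to text. It assumes `from_json_schema` expects the schema as a JSON string rather than a dict, which should be verified against the installed version. `YES_NO_GBNF`, `person_schema`, and `build_grammars` are illustrative names, and `build_grammars` imports llama_cpp lazily so the rest of the snippet runs without the package installed.

```python
import json

# A GBNF grammar that only permits the literal answers "yes" or "no".
YES_NO_GBNF = 'root ::= "yes" | "no"'

# An illustrative JSON Schema for a small object.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}
person_schema_json = json.dumps(person_schema)


def build_grammars():
    """Build both grammars; requires llama-cpp-python to be installed."""
    from llama_cpp import LlamaGrammar  # imported lazily on purpose

    yes_no = LlamaGrammar.from_string(YES_NO_GBNF, verbose=False)
    person = LlamaGrammar.from_json_schema(person_schema_json, verbose=False)
    return yes_no, person


print(person_schema_json)
```

A grammar built this way is typically passed to a completion call through its `grammar` keyword argument to constrain the generated text.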

[Grammar and Structured Generation](./grammar.md)

### Vision and Multimodal

LLaVA vision model integration for processing images alongside text, supporting various image formats and multimodal conversation flows.

```python { .api }
def llava_image_embed_make_with_filename(ctx_clip, n_threads: int, image_path: bytes): ...
def llava_image_embed_make_with_bytes(ctx_clip, n_threads: int, image_bytes: bytes, image_bytes_length: int): ...
def llava_validate_embed_size(ctx_llama, ctx_clip) -> bool: ...
```
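
In multimodal conversation flows, images are commonly supplied inside a chat message as a base64 data URI, following the OpenAI message format. The sketch below uses only the standard library; `image_bytes_to_data_uri` and `image_message` are hypothetical helpers, and the exact message shape accepted by a given vision chat handler is an assumption to check against [Vision and Multimodal](./vision.md).

```python
import base64
from typing import Any, Dict


def image_bytes_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(text: str, data_uri: str) -> Dict[str, Any]:
    """Build a user message that pairs a text part with an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }


# b"abc" stands in for real image bytes read from a file.
uri = image_bytes_to_data_uri(b"abc")
message = image_message("Describe this image.", uri)
print(uri)
```

In practice the bytes would come from reading an image file in binary mode, with `mime_type` set to match the file format.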

[Vision and Multimodal](./vision.md)

### Server Components

FastAPI-based web server with OpenAI-compatible endpoints, settings management, and multi-model configuration support for production deployments.

```python { .api }
class ModelSettings:
    model: str
    n_ctx: int = 2048
    temperature: float = 0.7

class ServerSettings:
    host: str = "127.0.0.1"
    port: int = 8000
    interrupt_requests: bool = True
```
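
Because the server exposes OpenAI-compatible endpoints, a client only needs to send JSON over HTTP. The sketch below builds (but does not send) a chat request with the standard library, using the host and port values listed above and the OpenAI-style `/v1/chat/completions` path. `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    host: str,
    port: int,
    messages: List[Dict[str, str]],
    max_tokens: int = 100,
) -> urllib.request.Request:
    """Prepare a POST request for an OpenAI-style chat completion endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "127.0.0.1", 8000, [{"role": "user", "content": "Hello!"}]
)
print(request.full_url)
# With a server running, urllib.request.urlopen(request) would send it.
```

Separating request construction from sending makes the payload easy to inspect or test without a running server.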

[Server Components](./server.md)

### Low-Level API

Direct access to llama.cpp C functions through ctypes bindings, providing maximum control over model loading, context management, and backend operations.

```python { .api }
def llama_load_model_from_file(path_model: bytes, params) -> llama_model_p: ...
def llama_new_context_with_model(model: llama_model_p, params) -> llama_context_p: ...
def llama_backend_init() -> None: ...
def llama_backend_free() -> None: ...
```

[Low-Level API](./low-level.md)

## Types

```python { .api }
# Core response types
CreateCompletionResponse = TypedDict('CreateCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[CompletionChoice],
    'usage': CompletionUsage,
})

CreateChatCompletionResponse = TypedDict('CreateChatCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[ChatCompletionResponseChoice],
    'usage': CompletionUsage,
})

CreateEmbeddingResponse = TypedDict('CreateEmbeddingResponse', {
    'object': str,
    'data': List[Embedding],
    'model': str,
    'usage': EmbeddingUsage,
})

# Message types for chat
ChatCompletionRequestMessage = TypedDict('ChatCompletionRequestMessage', {
    'role': str,
    'content': Optional[str],
})

ChatCompletionRequestSystemMessage = TypedDict('ChatCompletionRequestSystemMessage', {
    'role': Literal['system'],
    'content': str,
})

ChatCompletionRequestUserMessage = TypedDict('ChatCompletionRequestUserMessage', {
    'role': Literal['user'],
    'content': str,
})

ChatCompletionRequestAssistantMessage = TypedDict('ChatCompletionRequestAssistantMessage', {
    'role': Literal['assistant'],
    'content': Optional[str],
})

# JSON serializable type
JsonType = Union[None, int, float, str, bool, List['JsonType'], Dict[str, 'JsonType']]
```
