Python bindings for the llama.cpp library providing high-performance LLM inference with OpenAI-compatible APIs.
npx @tessl/cli install tessl/pypi-llama-cpp-python@0.3.00
# llama-cpp-python

Python bindings for the llama.cpp library that provide high-performance large language model inference, with APIs for text completion, chat, embeddings, and multimodal processing. The package offers a high-level Python interface, low-level C bindings, and OpenAI-compatible endpoints.

## Package Information

- **Package Name**: llama-cpp-python
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install llama-cpp-python`

## Core Imports

```python
import llama_cpp
```

Common high-level imports:

```python
from llama_cpp import Llama, LlamaGrammar, LlamaCache
```

OpenAI-compatible types:

```python
from llama_cpp.llama_types import (
    CreateCompletionResponse,
    CreateChatCompletionResponse,
    CreateEmbeddingResponse
)
```

## Basic Usage

```python
from llama_cpp import Llama

# Initialize model
llm = Llama(
    model_path="./models/llama-model.gguf",
    n_ctx=2048,  # Context window
    n_threads=8,  # CPU threads
)

# Generate text completion
output = llm.create_completion(
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.7,
    top_p=0.9,
)
print(output['choices'][0]['text'])

# Create chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=100,
    temperature=0.7,
)
print(response['choices'][0]['message']['content'])

# Generate embeddings (requires the model to be created with embedding=True)
embeddings = llm.create_embedding(
    input=["Hello world", "Python is great"],
)
print(embeddings['data'][0]['embedding'][:5])  # First 5 dimensions
```
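
The chat call above handles a single exchange. For a multi-turn conversation, the message list has to grow with each reply. The sketch below shows one way to do that; `chat_turn` and `EchoLlama` are hypothetical names introduced here for illustration, and `EchoLlama` is only a stand-in so the example runs without a model file. With a real model, pass the `Llama` instance instead.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def chat_turn(llm: Any, history: List[Message], user_text: str, **kwargs: Any) -> str:
    """Send one user turn and record the assistant reply in `history`."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, **kwargs)
    # Content may be None in some responses, so fall back to an empty string.
    reply = response["choices"][0]["message"]["content"] or ""
    history.append({"role": "assistant", "content": reply})
    return reply


class EchoLlama:
    """Hypothetical stand-in for `Llama`; it echoes the last message back."""

    def create_chat_completion(self, messages: List[Message], **kwargs: Any) -> Dict[str, Any]:
        last = messages[-1]["content"]
        return {"choices": [{"message": {"role": "assistant", "content": f"echo: {last}"}}]}


history: List[Message] = [{"role": "system", "content": "You are a helpful assistant."}]
reply = chat_turn(EchoLlama(), history, "Hello!", max_tokens=100)
print(reply)
```

Keeping the history in a plain list makes it easy to trim old turns when the conversation approaches the `n_ctx` limit.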

## Architecture

The llama-cpp-python package provides multiple layers of abstraction:

- **High-level API**: The `Llama` class offers convenient methods for common operations with sensible defaults
- **Low-level bindings**: Direct access to llama.cpp C functions through ctypes for maximum control
- **OpenAI compatibility**: Drop-in replacement for OpenAI API endpoints with identical response formats
- **Extensible components**: Modular caching, tokenization, grammar, and formatting systems

Key design patterns:
- **Lazy loading**: Models and contexts are loaded only when needed
- **Memory management**: Automatic cleanup and manual control options
- **Hardware optimization**: CPU, CUDA, and Metal acceleration support
- **Format flexibility**: Support for various model formats (GGUF, GGML) and quantization levels

## Capabilities

### Core Model and Inference

High-level model loading, text generation, and inference operations including completion, sampling, state management, and performance optimization.

```python { .api }
class Llama:
    def __init__(self, model_path: str, **kwargs): ...
    def create_completion(self, prompt: str, **kwargs) -> CreateCompletionResponse: ...
    def create_chat_completion(self, messages: List[dict], **kwargs) -> CreateChatCompletionResponse: ...
    def create_embedding(self, input: Union[str, List[str]], **kwargs) -> CreateEmbeddingResponse: ...
    def tokenize(self, text: str, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], decode: bool = True) -> str: ...
```

[Core Model and Inference](./llama-model.md)
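
A common next step after `create_embedding` is comparing two vectors. The following is a minimal sketch in plain Python, assuming each item in `data` carries one flat list of floats under `embedding`. The helper names and the hand-written `sample_response` are illustrative only and do not come from the library.

```python
import math
from typing import Any, Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


def embedding_vectors(response: Dict[str, Any]) -> List[List[float]]:
    """Pull the raw vectors out of a CreateEmbeddingResponse-shaped dict."""
    return [item["embedding"] for item in response["data"]]


# Hand-written sample shaped like the response; not real model output.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"object": "embedding", "index": 1, "embedding": [0.0, 2.0, 0.0]},
    ],
    "model": "hand-written-sample",
    "usage": {"prompt_tokens": 0, "total_tokens": 0},
}

first, second = embedding_vectors(sample_response)
print(cosine_similarity(first, second))
```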
106
107
### Chat Completions and Formatting
108
109
OpenAI-compatible chat completions with extensive formatting options, role-based conversations, function calling, and custom message templates for different model types.
110
111
```python { .api }
112
def get_chat_completion_handler(chat_format: str) -> LlamaChatCompletionHandler: ...
113
def register_chat_completion_handler(chat_format: str, chat_handler: LlamaChatCompletionHandler): ...
114
115
class Jinja2ChatFormatter:
116
def __init__(self, template: str, **kwargs): ...
117
def format_messages(self, messages: List[dict]) -> ChatFormatterResponse: ...
118
```
119
120
[Chat Completions and Formatting](./chat-completion.md)
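
Conceptually, a chat format turns a list of role-tagged messages into the single prompt string a model was trained on. The sketch below illustrates the idea with a ChatML-style layout. `format_chatml` is a hypothetical helper written for this page, not the library's formatter, and the exact template a given model expects may differ.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a ChatML-style layout (illustrative only)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(prompt)
```

In practice the library's registered handlers do this work, so a hand-rolled formatter like this is mainly useful for understanding or debugging what a template produces.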
121
122
### Tokenization
123
124
Native llama.cpp tokenization and HuggingFace tokenizer integration with support for different vocabulary types, encoding/decoding, and model-specific preprocessing.
125
126
```python { .api }
127
class LlamaTokenizer:
128
def tokenize(self, text: str, add_bos: bool = True, special: bool = False) -> List[int]: ...
129
def detokenize(self, tokens: List[int], decode: bool = True) -> str: ...
130
@classmethod
131
def from_ggml_file(cls, path: str) -> "LlamaTokenizer": ...
132
133
class LlamaHFTokenizer:
134
@classmethod
135
def from_pretrained(cls, pretrained_model_name_or_path: str) -> "LlamaHFTokenizer": ...
136
```
137
138
[Tokenization](./tokenization.md)

### Caching

Memory and disk-based caching systems for model states, context, and computed results to improve inference performance and enable state persistence.

```python { .api }
class LlamaRAMCache:
    def __init__(self, capacity_bytes: int = 2 << 30): ...

class LlamaDiskCache:
    def __init__(self, cache_dir: str = ".cache/llama_cpp"): ...
```

[Caching](./caching.md)
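
The default `capacity_bytes` of `2 << 30` is a bit shift that works out to 2 GiB. The sketch below spells out that arithmetic and shows how a cache might be attached. `gib` and `attach_ram_cache` are hypothetical helpers, and the sketch assumes the `Llama` instance exposes `set_cache`; the import is deferred so the arithmetic runs even where the package is not installed.

```python
from typing import Any


def gib(n: float) -> int:
    """Convert a size in GiB to bytes."""
    return int(n * (1 << 30))


# 2 << 30 shifts 2 left by 30 bits: 2 * 2**30 bytes, i.e. 2 GiB.
DEFAULT_CAPACITY = 2 << 30


def attach_ram_cache(llm: Any, capacity_bytes: int = DEFAULT_CAPACITY) -> None:
    """Attach an in-memory cache to a loaded model (assumes `llm.set_cache`)."""
    from llama_cpp import LlamaRAMCache  # Deferred: only needed when called.

    llm.set_cache(LlamaRAMCache(capacity_bytes=capacity_bytes))


print(DEFAULT_CAPACITY, gib(2))
```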

### Grammar and Structured Generation

Constrained text generation using formal grammars (GBNF), JSON Schema validation, and built-in templates for structured outputs like JSON, code, and domain-specific formats.

```python { .api }
class LlamaGrammar:
    @classmethod
    def from_string(cls, grammar_str: str, verbose: bool = True) -> "LlamaGrammar": ...
    @classmethod
    def from_json_schema(cls, schema: dict, verbose: bool = True) -> "LlamaGrammar": ...

def json_schema_to_gbnf(schema: dict, **kwargs) -> str: ...
```

[Grammar and Structured Generation](./grammar.md)
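
To make the idea of a GBNF constraint concrete, here is a small sketch of a grammar that permits only the words `yes` or `no`. `YES_NO_GBNF`, `accepts`, and `build_yes_no_grammar` are names invented for this example. `accepts` is a plain-Python mirror of what the grammar allows, included so the sketch can be checked without loading a model.

```python
# A two-rule GBNF grammar: the output must be exactly "yes" or "no".
YES_NO_GBNF = """
root ::= answer
answer ::= "yes" | "no"
""".strip()


def accepts(text: str) -> bool:
    """Plain-Python mirror of the strings the grammar above permits."""
    return text in ("yes", "no")


def build_yes_no_grammar():
    """Compile the grammar; requires llama-cpp-python to be installed."""
    from llama_cpp import LlamaGrammar  # Deferred: only needed when called.

    return LlamaGrammar.from_string(YES_NO_GBNF, verbose=False)


print(accepts("yes"), accepts("maybe"))
```

The compiled grammar would then typically be supplied to a generation call; the exact keyword for doing so should be confirmed against the grammar documentation linked above.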

### Vision and Multimodal

LLaVA vision model integration for processing images alongside text, supporting various image formats and multimodal conversation flows.

```python { .api }
def llava_image_embed_make_with_filename(ctx_clip, image_path: str): ...
def llava_image_embed_make_with_bytes(ctx_clip, image_bytes: bytes, image_bytes_length: int): ...
def llava_validate_embed_size(n_embd: int, n_image_embd: int) -> bool: ...
```

[Vision and Multimodal](./vision.md)

### Server Components

FastAPI-based web server with OpenAI-compatible endpoints, settings management, and multi-model configuration support for production deployments.

```python { .api }
class ModelSettings:
    model: str
    n_ctx: int = 2048
    temperature: float = 0.7

class ServerSettings:
    host: str = "127.0.0.1"
    port: int = 8000
    interrupt_requests: bool = True
```

[Server Components](./server.md)
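
Because the server exposes OpenAI-style endpoints, a client only needs to send JSON over HTTP. The sketch below builds such a request with the standard library, assuming the server is listening on the `host` and `port` defaults shown above and serves chat completions at `/v1/chat/completions`. `build_chat_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    messages: List[Dict[str, str]],
    host: str = "127.0.0.1",
    port: int = 8000,
    max_tokens: int = 100,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a local server."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request([{"role": "user", "content": "Hello!"}])
print(request.full_url)

# With a server running, the request could be sent like this:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```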

### Low-Level API

Direct access to llama.cpp C functions through ctypes bindings, providing maximum control over model loading, context management, and backend operations.

```python { .api }
def llama_model_load_from_file(path_model: bytes, params) -> llama_model_p: ...
def llama_new_context_with_model(model: llama_model_p, params) -> llama_context_p: ...
def llama_backend_init() -> None: ...
def llama_backend_free() -> None: ...
```

[Low-Level API](./low-level.md)
212
213
## Types
214
215
```python { .api }
216
# Core response types
217
CreateCompletionResponse = TypedDict('CreateCompletionResponse', {
218
'id': str,
219
'object': str,
220
'created': int,
221
'model': str,
222
'choices': List[CompletionChoice],
223
'usage': CompletionUsage,
224
})
225
226
CreateChatCompletionResponse = TypedDict('CreateChatCompletionResponse', {
227
'id': str,
228
'object': str,
229
'created': int,
230
'model': str,
231
'choices': List[ChatCompletionResponseChoice],
232
'usage': CompletionUsage,
233
})
234
235
CreateEmbeddingResponse = TypedDict('CreateEmbeddingResponse', {
236
'object': str,
237
'data': List[Embedding],
238
'model': str,
239
'usage': EmbeddingUsage,
240
})
241
242
# Message types for chat
243
ChatCompletionRequestMessage = TypedDict('ChatCompletionRequestMessage', {
244
'role': str,
245
'content': Optional[str],
246
})
247
248
ChatCompletionRequestSystemMessage = TypedDict('ChatCompletionRequestSystemMessage', {
249
'role': Literal['system'],
250
'content': str,
251
})
252
253
ChatCompletionRequestUserMessage = TypedDict('ChatCompletionRequestUserMessage', {
254
'role': Literal['user'],
255
'content': str,
256
})
257
258
ChatCompletionRequestAssistantMessage = TypedDict('ChatCompletionRequestAssistantMessage', {
259
'role': Literal['assistant'],
260
'content': Optional[str],
261
})
262
263
# JSON serializable type
264
JsonType = Union[None, int, float, str, bool, List['JsonType'], Dict[str, 'JsonType']]
265
```
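
Since these response types are plain dictionaries at runtime, reading them is ordinary key access. The sketch below shows two small accessors for a completion response. The helper names and the hand-written `sample` are illustrative, and the sketch assumes `usage` carries `prompt_tokens`, `completion_tokens`, and `total_tokens` as in the OpenAI response format.

```python
from typing import Any, Dict, Tuple


def completion_text(response: Dict[str, Any]) -> str:
    """Return the generated text of the first choice."""
    return response["choices"][0]["text"]


def token_usage(response: Dict[str, Any]) -> Tuple[int, int, int]:
    """Return (prompt, completion, total) token counts."""
    usage = response["usage"]
    return (usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])


# Hand-written sample shaped like CreateCompletionResponse; not real output.
sample = {
    "id": "cmpl-sample",
    "object": "text_completion",
    "created": 0,
    "model": "hand-written-sample",
    "choices": [{"text": " Paris.", "index": 0, "logprobs": None, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 6, "completion_tokens": 2, "total_tokens": 8},
}

print(completion_text(sample), token_usage(sample))
```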