# llama-cpp-python

Python bindings for the llama.cpp library, providing high-performance large language model inference with APIs for text completion, chat, embeddings, and multimodal processing. The package offers both high-level Python interfaces and low-level C bindings, along with OpenAI-compatible endpoints.

## Package Information

- **Package Name**: llama-cpp-python
- **Package Type**: PyPI
- **Language**: Python
- **Installation**: `pip install llama-cpp-python`

## Core Imports

```python
import llama_cpp
```

Common high-level imports:

```python
from llama_cpp import Llama, LlamaGrammar, LlamaCache
```

OpenAI-compatible types:

```python
from llama_cpp.llama_types import (
    CreateCompletionResponse,
    CreateChatCompletionResponse,
    CreateEmbeddingResponse
)
```

## Basic Usage

```python
from llama_cpp import Llama

# Initialize model
llm = Llama(
    model_path="./models/llama-model.gguf",
    n_ctx=2048,   # Context window
    n_threads=8,  # CPU threads
)

# Generate text completion
output = llm.create_completion(
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.7,
    top_p=0.9,
)
print(output['choices'][0]['text'])

# Create chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=100,
    temperature=0.7,
)
print(response['choices'][0]['message']['content'])

# Generate embeddings
# Note: the model must be created with embedding=True for this call to work.
embeddings = llm.create_embedding(
    input=["Hello world", "Python is great"],
)
print(embeddings['data'][0]['embedding'][:5])  # First 5 dimensions
```
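
The responses above are plain dictionaries shaped like OpenAI API responses. The following is a minimal sketch of pulling the generated text and token usage out of such a dictionary. It runs against a hand-written sample (the values are illustrative, not real model output), and `summarize_completion` is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict


def summarize_completion(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the first choice's text and the token usage from a completion response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written sample in the CreateCompletionResponse shape (illustrative values only).
sample_response = {
    "id": "cmpl-example",
    "object": "text_completion",
    "created": 0,
    "model": "./models/llama-model.gguf",
    "choices": [
        {"text": " Paris.", "index": 0, "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 6, "completion_tokens": 2, "total_tokens": 8},
}

summary = summarize_completion(sample_response)
print(summary["text"], summary["total_tokens"])
```

Using `.get` for optional fields keeps the helper tolerant of responses that omit `usage` or `finish_reason`.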

## Architecture

The llama-cpp-python package provides multiple layers of abstraction:

- **High-level API**: The `Llama` class offers convenient methods for common operations with sensible defaults
- **Low-level bindings**: Direct access to llama.cpp C functions through ctypes for maximum control
- **OpenAI compatibility**: Drop-in replacement for OpenAI API endpoints with identical response formats
- **Extensible components**: Modular caching, tokenization, grammar, and formatting systems

Key design patterns:
- **Lazy loading**: Models and contexts are loaded only when needed
- **Memory management**: Automatic cleanup and manual control options
- **Hardware optimization**: CPU, CUDA, and Metal acceleration support
- **Format flexibility**: Support for various model formats (GGUF, GGML) and quantization levels

## Capabilities

### Core Model and Inference

High-level model loading, text generation, and inference operations including completion, sampling, state management, and performance optimization.

```python { .api }
class Llama:
    def __init__(self, model_path: str, **kwargs): ...
    def create_completion(self, prompt: str, **kwargs) -> CreateCompletionResponse: ...
    def create_chat_completion(self, messages: List[dict], **kwargs) -> CreateChatCompletionResponse: ...
    def create_embedding(self, input: Union[str, List[str]], **kwargs) -> CreateEmbeddingResponse: ...
    # tokenize/detokenize work on UTF-8 bytes rather than str
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int]) -> bytes: ...
```

[Core Model and Inference](./llama-model.md)
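
To make the `temperature` and `top_p` parameters used in Basic Usage more concrete, here is a toy sketch of temperature scaling followed by top-p (nucleus) sampling over a tiny hand-written set of logits. It illustrates the general technique only and is not llama.cpp's sampler. `sample_top_p` and the logit values are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def sample_top_p(
    logits: Dict[str, float],
    temperature: float = 0.7,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Lower temperatures sharpen the distribution; temperature must be > 0 here.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    # Numerically stable softmax.
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    nucleus = []
    cumulative = 0.0
    for prob, tok in ranked:
        nucleus.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for _, tok in nucleus]
    probs = [prob for prob, _ in nucleus]
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical logits: "Paris" dominates, so the nucleus contains only that token.
token = sample_top_p({"Paris": 5.0, "Lyon": 1.0, "Rome": 0.5}, rng=random.Random(0))
print(token)
```

With these logits the top token alone already exceeds the 0.9 threshold, so the other candidates are filtered out before sampling.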

### Chat Completions and Formatting

OpenAI-compatible chat completions with extensive formatting options, role-based conversations, function calling, and custom message templates for different model types.

```python { .api }
def get_chat_completion_handler(chat_format: str) -> LlamaChatCompletionHandler: ...
def register_chat_completion_handler(chat_format: str, chat_handler: LlamaChatCompletionHandler): ...

class Jinja2ChatFormatter:
    def __init__(self, template: str, **kwargs): ...
    def format_messages(self, messages: List[dict]) -> ChatFormatterResponse: ...
```

[Chat Completions and Formatting](./chat-completion.md)
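
Conceptually, a chat formatter turns a list of role-based messages into the single prompt string a model expects. The sketch below shows that idea with a ChatML-style layout written in plain Python. `format_chatml` is a hypothetical helper, and the layout is one common convention, not necessarily what the package emits for a given model.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a ChatML-style layout (illustrative, not the library's code)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"},
]
prompt = format_chatml(messages)
print(prompt)
```

Different model families expect different markers, which is why the formatting layer is pluggable by `chat_format` name or by template.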

### Tokenization

Native llama.cpp tokenization and HuggingFace tokenizer integration with support for different vocabulary types, encoding/decoding, and model-specific preprocessing.

```python { .api }
class LlamaTokenizer:
    # tokenize/detokenize work on UTF-8 bytes rather than str
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int]) -> bytes: ...
    @classmethod
    def from_ggml_file(cls, path: str) -> "LlamaTokenizer": ...

class LlamaHFTokenizer:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str) -> "LlamaHFTokenizer": ...
```

[Tokenization](./tokenization.md)

### Caching

Memory and disk-based caching systems for model states, context, and computed results to improve inference performance and enable state persistence.

```python { .api }
class LlamaRAMCache:
    def __init__(self, capacity_bytes: int = 2 << 30): ...

class LlamaDiskCache:
    def __init__(self, cache_dir: str = ".cache/llama_cpp"): ...
```

[Caching](./caching.md)
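
The `capacity_bytes` default of `2 << 30` is 2 GiB. As a rough illustration of how a size-bounded state cache can work, the sketch below implements a toy LRU cache keyed by token sequences with longest-prefix lookup. `PrefixLRUCache` is a hypothetical class written for this example. The internals of `LlamaRAMCache` and `LlamaDiskCache` are not described here and may differ.

```python
from collections import OrderedDict
from typing import Optional, Sequence, Tuple


class PrefixLRUCache:
    """Toy LRU cache keyed by token sequences, bounded by total payload size in bytes."""

    def __init__(self, capacity_bytes: int = 2 << 30) -> None:
        self.capacity_bytes = capacity_bytes
        self._store: "OrderedDict[Tuple[int, ...], bytes]" = OrderedDict()

    @property
    def size_bytes(self) -> int:
        return sum(len(state) for state in self._store.values())

    def put(self, tokens: Sequence[int], state: bytes) -> None:
        key = tuple(tokens)
        self._store.pop(key, None)
        self._store[key] = state  # most recently used entries live at the end
        # Evict least recently used entries until the cache fits its budget again.
        while self.size_bytes > self.capacity_bytes and self._store:
            self._store.popitem(last=False)

    def get_longest_prefix(
        self, tokens: Sequence[int]
    ) -> Optional[Tuple[Tuple[int, ...], bytes]]:
        """Return the cached entry whose key is the longest prefix of `tokens`."""
        query = tuple(tokens)
        best: Optional[Tuple[int, ...]] = None
        for key in self._store:
            if query[: len(key)] == key and (best is None or len(key) > len(best)):
                best = key
        if best is None:
            return None
        self._store.move_to_end(best)  # mark as recently used
        return best, self._store[best]


cache = PrefixLRUCache(capacity_bytes=8)
cache.put([1, 2], b"aaaa")
cache.put([1, 2, 3], b"bbbb")
hit = cache.get_longest_prefix([1, 2, 3, 4])
cache.put([9], b"cccc")  # pushes the total past 8 bytes, evicting the oldest entry
print(hit, cache.get_longest_prefix([1, 2]))
```

Prefix matching is what makes such a cache useful for inference: a new prompt that shares a prefix with an earlier one can reuse the saved state instead of re-evaluating those tokens.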

### Grammar and Structured Generation

Constrained text generation using formal grammars (GBNF), JSON Schema validation, and built-in templates for structured outputs like JSON, code, and domain-specific formats.

```python { .api }
class LlamaGrammar:
    @classmethod
    def from_string(cls, grammar_str: str, verbose: bool = True) -> "LlamaGrammar": ...
    @classmethod
    def from_json_schema(cls, schema: dict, verbose: bool = True) -> "LlamaGrammar": ...

def json_schema_to_gbnf(schema: dict, **kwargs) -> str: ...
```

[Grammar and Structured Generation](./grammar.md)
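
A GBNF grammar is plain text made of rules, with `root` as the entry point. As a small sketch, the hypothetical helper below builds a grammar that restricts output to one of a fixed set of strings. It handles only backslash and double-quote escaping, so treat it as an illustration, not a complete GBNF writer.

```python
from typing import Iterable


def choices_to_gbnf(choices: Iterable[str]) -> str:
    """Build a GBNF grammar whose root rule matches exactly one of the given strings."""
    options = list(choices)
    if not options:
        raise ValueError("at least one choice is required")

    def quote(literal: str) -> str:
        # Escape backslashes and double quotes for a GBNF string literal.
        escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return "root ::= " + " | ".join(quote(option) for option in options)


grammar_text = choices_to_gbnf(["yes", "no", "maybe"])
print(grammar_text)  # root ::= "yes" | "no" | "maybe"
```

The resulting string is the kind of input `LlamaGrammar.from_string` is listed as accepting.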

### Vision and Multimodal

LLaVA vision model integration for processing images alongside text, supporting various image formats and multimodal conversation flows.

```python { .api }
def llava_image_embed_make_with_filename(ctx_clip, image_path: str): ...
def llava_image_embed_make_with_bytes(ctx_clip, image_bytes: bytes, image_bytes_length: int): ...
def llava_validate_embed_size(n_embd: int, n_image_embd: int) -> bool: ...
```

[Vision and Multimodal](./vision.md)
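
For multimodal conversations, an image is commonly attached to a user message as an OpenAI-style `image_url` content part, often encoded as a base64 data URI. The sketch below builds such a message with the standard library only. It assumes the OpenAI-style content-part layout. Both helper names are hypothetical, and the placeholder bytes stand in for a real image file.

```python
import base64
from typing import Any, Dict


def image_bytes_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_bytes: bytes) -> Dict[str, Any]:
    """Build a user message holding a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_bytes_to_data_uri(image_bytes)}},
        ],
    }


# Placeholder bytes stand in for the contents of a real image file.
placeholder_image = b"\x89PNG placeholder"
message = build_image_message("What is in this image?", placeholder_image)
print(message["content"][1]["image_url"]["url"][:22])
```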

### Server Components

FastAPI-based web server with OpenAI-compatible endpoints, settings management, and multi-model configuration support for production deployments.

```python { .api }
class ModelSettings:
    model: str
    n_ctx: int = 2048
    temperature: float = 0.7

class ServerSettings:
    host: str = "127.0.0.1"
    port: int = 8000
    interrupt_requests: bool = True
```

[Server Components](./server.md)
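
Because the endpoints are OpenAI-compatible, a client can talk to the server with ordinary HTTP. The sketch below only constructs the request object and leaves the send step commented out. It assumes the server is reachable at the `ServerSettings` defaults listed above and exposes the standard OpenAI path `/v1/chat/completions`. `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(host: str = "127.0.0.1", port: int = 8000) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a local server."""
    payload = {
        "messages": [{"role": "user", "content": "Hello! How are you?"}],
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request()
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```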

### Low-Level API

Direct access to llama.cpp C functions through ctypes bindings, providing maximum control over model loading, context management, and backend operations.

```python { .api }
def llama_model_load_from_file(path_model: bytes, params) -> llama_model_p: ...
def llama_new_context_with_model(model: llama_model_p, params) -> llama_context_p: ...
def llama_backend_init() -> None: ...
def llama_backend_free() -> None: ...
```

[Low-Level API](./low-level.md)
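
With manual bindings, initialization and cleanup calls need to be paired by hand. One way to keep them paired is a small context manager, sketched below. `backend_session` is a hypothetical helper, and the example uses stand-in callables so it runs without a model. With the real bindings one would presumably pass `llama_cpp.llama_backend_init` and `llama_cpp.llama_backend_free`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def backend_session(init: Callable[[], None], free: Callable[[], None]) -> Iterator[None]:
    """Call `init` on entry and guarantee `free` runs on exit, even after an error."""
    init()
    try:
        yield
    finally:
        free()


# Stand-in callables record the call order instead of touching a native library.
calls: List[str] = []
with backend_session(lambda: calls.append("init"), lambda: calls.append("free")):
    calls.append("work")
print(calls)  # ['init', 'work', 'free']
```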

## Types

```python { .api }
# Core response types
CreateCompletionResponse = TypedDict('CreateCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[CompletionChoice],
    'usage': CompletionUsage,
})

CreateChatCompletionResponse = TypedDict('CreateChatCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[ChatCompletionResponseChoice],
    'usage': CompletionUsage,
})

CreateEmbeddingResponse = TypedDict('CreateEmbeddingResponse', {
    'object': str,
    'data': List[Embedding],
    'model': str,
    'usage': EmbeddingUsage,
})

# Message types for chat
ChatCompletionRequestMessage = TypedDict('ChatCompletionRequestMessage', {
    'role': str,
    'content': Optional[str],
})

ChatCompletionRequestSystemMessage = TypedDict('ChatCompletionRequestSystemMessage', {
    'role': Literal['system'],
    'content': str,
})

ChatCompletionRequestUserMessage = TypedDict('ChatCompletionRequestUserMessage', {
    'role': Literal['user'],
    'content': str,
})

ChatCompletionRequestAssistantMessage = TypedDict('ChatCompletionRequestAssistantMessage', {
    'role': Literal['assistant'],
    'content': Optional[str],
})

# JSON serializable type
JsonType = Union[None, int, float, str, bool, List['JsonType'], Dict[str, 'JsonType']]
```
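
`TypedDict` types exist only for static type checkers. At runtime the values are ordinary dictionaries. The sketch below defines simplified stand-ins for the message types (the class names are hypothetical and shortened for this example) and shows a helper operating on a typed message list.

```python
from typing import List, Literal, Optional, TypedDict, Union


class SystemMessage(TypedDict):
    role: Literal["system"]
    content: str


class UserMessage(TypedDict):
    role: Literal["user"]
    content: str


class AssistantMessage(TypedDict):
    role: Literal["assistant"]
    content: Optional[str]


Message = Union[SystemMessage, UserMessage, AssistantMessage]


def last_user_content(messages: List[Message]) -> Optional[str]:
    """Return the content of the most recent user message, if any."""
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    return None


history: List[Message] = [
    SystemMessage(role="system", content="You are a helpful assistant."),
    UserMessage(role="user", content="Hello! How are you?"),
    AssistantMessage(role="assistant", content=None),
]
print(type(history[0]) is dict, last_user_content(history))
```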