llama-cpp-python

Python bindings for the llama.cpp library, providing high-performance large language model inference with comprehensive APIs for text completion, chat, embeddings, and multimodal processing. Offers both a high-level Python interface and low-level C bindings, along with OpenAI-compatible response types and server endpoints.

Package Information

  • Package Name: llama-cpp-python
  • Package Type: PyPI
  • Language: Python
  • Installation: pip install llama-cpp-python

Core Imports

import llama_cpp

Common high-level imports:

from llama_cpp import Llama, LlamaGrammar, LlamaCache

OpenAI-compatible types:

from llama_cpp.llama_types import (
    CreateCompletionResponse,
    CreateChatCompletionResponse,
    CreateEmbeddingResponse
)

Basic Usage

from llama_cpp import Llama

# Initialize model
llm = Llama(
    model_path="./models/llama-model.gguf",
    n_ctx=2048,  # Context window
    n_threads=8,  # CPU threads
)

# Generate text completion
output = llm.create_completion(
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.7,
    top_p=0.9,
)
print(output['choices'][0]['text'])

# Create chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! How are you?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=100,
    temperature=0.7,
)
print(response['choices'][0]['message']['content'])

# Generate embeddings (requires a model loaded with embedding=True)
embed_llm = Llama(model_path="./models/llama-model.gguf", embedding=True)
embeddings = embed_llm.create_embedding(
    input=["Hello world", "Python is great"],
)
print(embeddings['data'][0]['embedding'][:5])  # First 5 dimensions

Architecture

The llama-cpp-python package provides multiple layers of abstraction:

  • High-level API: The Llama class offers convenient methods for common operations with sensible defaults
  • Low-level bindings: Direct access to llama.cpp C functions through ctypes for maximum control
  • OpenAI compatibility: Drop-in replacement for OpenAI API endpoints with identical response formats
  • Extensible components: Modular caching, tokenization, grammar, and formatting systems

Key design patterns:

  • Lazy loading: Models and contexts are loaded only when needed
  • Memory management: Automatic cleanup and manual control options
  • Hardware optimization: CPU, CUDA, and Metal acceleration support
  • Format flexibility: Support for various model formats (GGUF, GGML) and quantization levels
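
These design points surface directly in the high-level constructor. A minimal sketch of GPU offloading (the model path is a placeholder; CUDA or Metal support must be enabled at build time):

from llama_cpp import Llama

# Offload all layers to the GPU; requires a build with CUDA or Metal enabled,
# e.g. CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
llm = Llama(
    model_path="./models/llama-model.gguf",
    n_gpu_layers=-1,  # -1 offloads every layer; 0 keeps inference on the CPU
)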

Capabilities

Core Model and Inference

High-level model loading, text generation, and inference operations including completion, sampling, state management, and performance optimization.

class Llama:
    def __init__(self, model_path: str, **kwargs): ...
    def create_completion(self, prompt: str, **kwargs) -> CreateCompletionResponse: ...
    def create_chat_completion(self, messages: List[dict], **kwargs) -> CreateChatCompletionResponse: ...
    def create_embedding(self, input: Union[str, List[str]], **kwargs) -> CreateEmbeddingResponse: ...
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], special: bool = False) -> bytes: ...
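
For example, passing stream=True to create_completion yields chunks incrementally instead of a single response. A minimal sketch (the model path is a placeholder):

from llama_cpp import Llama

llm = Llama(model_path="./models/llama-model.gguf", n_ctx=2048)

# Stream tokens as they are generated rather than waiting for the full response
for chunk in llm.create_completion(
    prompt="List three prime numbers:",
    max_tokens=48,
    stream=True,
):
    print(chunk['choices'][0]['text'], end="", flush=True)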


Chat Completions and Formatting

OpenAI-compatible chat completions with extensive formatting options, role-based conversations, function calling, and custom message templates for different model types.

def get_chat_completion_handler(chat_format: str) -> LlamaChatCompletionHandler: ...
def register_chat_completion_handler(chat_format: str): ...  # decorator: registers a custom handler under a format name

class Jinja2ChatFormatter:
    def __init__(self, template: str, **kwargs): ...
    def __call__(self, *, messages: List[dict], **kwargs) -> ChatFormatterResponse: ...
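
A sketch of selecting a built-in template by name and requesting JSON output (the model path is a placeholder):

from llama_cpp import Llama

# Select a built-in prompt template explicitly instead of relying on
# the format stored in the GGUF metadata
llm = Llama(model_path="./models/llama-model.gguf", chat_format="chatml")

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name a planet as JSON."}],
    response_format={"type": "json_object"},  # constrain output to valid JSON
)
print(response['choices'][0]['message']['content'])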


Tokenization

Native llama.cpp tokenization and HuggingFace tokenizer integration with support for different vocabulary types, encoding/decoding, and model-specific preprocessing.

class LlamaTokenizer:
    def tokenize(self, text: bytes, add_bos: bool = True, special: bool = False) -> List[int]: ...
    def detokenize(self, tokens: List[int], special: bool = False) -> bytes: ...
    @classmethod
    def from_ggml_file(cls, path: str) -> "LlamaTokenizer": ...

class LlamaHFTokenizer:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str) -> "LlamaHFTokenizer": ...
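
Note that the native tokenizer operates on bytes in both directions. A minimal round-trip sketch (the model path is a placeholder):

from llama_cpp import Llama

llm = Llama(model_path="./models/llama-model.gguf")

tokens = llm.tokenize(b"Hello world", add_bos=True)  # list of token ids
text = llm.detokenize(tokens).decode("utf-8")        # back to a string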


Caching

Memory and disk-based caching systems for model states, context, and computed results to improve inference performance and enable state persistence.

class LlamaRAMCache:
    def __init__(self, capacity_bytes: int = 2 << 30): ...

class LlamaDiskCache:
    def __init__(self, cache_dir: str = ".cache/llama_cpp"): ...
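
Caches attach to a model via Llama.set_cache, after which completions sharing a prompt prefix can reuse saved state. A minimal sketch (the model path is a placeholder):

from llama_cpp import Llama, LlamaRAMCache

llm = Llama(model_path="./models/llama-model.gguf")
llm.set_cache(LlamaRAMCache(capacity_bytes=2 << 30))  # ~2 GiB in-memory cache

llm.create_completion(prompt="Once upon a time", max_tokens=16)
# A second call sharing the same prefix can skip re-evaluating cached tokens
llm.create_completion(prompt="Once upon a time, in a castle", max_tokens=16)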


Grammar and Structured Generation

Constrained text generation using formal grammars (GBNF), JSON Schema validation, and built-in templates for structured outputs like JSON, code, and domain-specific formats.

class LlamaGrammar:
    @classmethod
    def from_string(cls, grammar_str: str, verbose: bool = True) -> "LlamaGrammar": ...
    @classmethod
    def from_json_schema(cls, json_schema: str, verbose: bool = True) -> "LlamaGrammar": ...

def json_schema_to_gbnf(schema: str, **kwargs) -> str: ...
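
Grammars are passed to the completion methods via the grammar keyword. A minimal sketch with an inline GBNF grammar that restricts output to a yes/no answer (the model path is a placeholder):

from llama_cpp import Llama, LlamaGrammar

grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(model_path="./models/llama-model.gguf")
output = llm.create_completion(
    prompt="Is Paris the capital of France? Answer yes or no: ",
    max_tokens=4,
    grammar=grammar,  # sampling is constrained to grammar-valid tokens
)
print(output['choices'][0]['text'])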


Vision and Multimodal

LLaVA vision model integration for processing images alongside text, supporting various image formats and multimodal conversation flows.

def llava_image_embed_make_with_filename(ctx_clip, n_threads: int, image_path: bytes): ...
def llava_image_embed_make_with_bytes(ctx_clip, n_threads: int, image_bytes: bytes, image_bytes_length: int): ...
def llava_validate_embed_size(ctx_llama, ctx_clip) -> bool: ...
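
At the high level, vision models are used by passing a LLaVA chat handler to Llama. A sketch assuming LLaVA 1.5-style model and CLIP projector files (the paths and image URL are placeholders):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./models/mmproj.gguf")
llm = Llama(
    model_path="./models/llava-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # extra room for image embeddings
)

response = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
        {"type": "text", "text": "Describe this image."},
    ]},
])
print(response['choices'][0]['message']['content'])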


Server Components

FastAPI-based web server with OpenAI-compatible endpoints, settings management, and multi-model configuration support for production deployments.

class ModelSettings:
    model: str
    n_ctx: int = 2048
    temperature: float = 0.7

class ServerSettings:
    host: str = "127.0.0.1"
    port: int = 8000
    interrupt_requests: bool = True
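
The server installs with the [server] extra and starts as a module; any OpenAI client can then target it. A minimal sketch (model path and port are placeholders):

# Shell: pip install 'llama-cpp-python[server]'
#        python -m llama_cpp.server --model ./models/llama-model.gguf --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="llama-model",  # informational for a single-model server
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)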


Low-Level API

Direct access to llama.cpp C functions through ctypes bindings, providing maximum control over model loading, context management, and backend operations.

def llama_model_load_from_file(path_model: bytes, params) -> llama_model_p: ...
def llama_new_context_with_model(model: llama_model_p, params) -> llama_context_p: ...
def llama_backend_init() -> None: ...
def llama_backend_free() -> None: ...
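
A minimal lifecycle sketch; exact function names track upstream llama.cpp and vary across versions (older releases use llama_load_model_from_file and llama_free_model):

import llama_cpp

# Initialize the ggml backend before any other calls
llama_cpp.llama_backend_init()

# Load a model and create a context from default parameter structs
model_params = llama_cpp.llama_model_default_params()
model = llama_cpp.llama_model_load_from_file(b"./models/llama-model.gguf", model_params)

ctx_params = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)

# ... tokenize, decode, and sample with the other llama_* functions ...

# Release resources in reverse order of creation
llama_cpp.llama_free(ctx)
llama_cpp.llama_model_free(model)
llama_cpp.llama_backend_free()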


Types

from typing import Dict, List, Literal, Optional, Union
from typing_extensions import TypedDict

# Core response types
CreateCompletionResponse = TypedDict('CreateCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[CompletionChoice],
    'usage': CompletionUsage,
})

CreateChatCompletionResponse = TypedDict('CreateChatCompletionResponse', {
    'id': str,
    'object': str,
    'created': int,
    'model': str,
    'choices': List[ChatCompletionResponseChoice],
    'usage': CompletionUsage,
})

CreateEmbeddingResponse = TypedDict('CreateEmbeddingResponse', {
    'object': str,
    'data': List[Embedding],
    'model': str,
    'usage': EmbeddingUsage,
})

# Message types for chat
ChatCompletionRequestMessage = TypedDict('ChatCompletionRequestMessage', {
    'role': str,
    'content': Optional[str],
})

ChatCompletionRequestSystemMessage = TypedDict('ChatCompletionRequestSystemMessage', {
    'role': Literal['system'],
    'content': str,
})

ChatCompletionRequestUserMessage = TypedDict('ChatCompletionRequestUserMessage', {
    'role': Literal['user'],
    'content': str,
})

ChatCompletionRequestAssistantMessage = TypedDict('ChatCompletionRequestAssistantMessage', {
    'role': Literal['assistant'],
    'content': Optional[str],
})

# JSON serializable type
JsonType = Union[None, int, float, str, bool, List['JsonType'], Dict[str, 'JsonType']]

Documentation

Detailed guides live in the docs/ directory:

  • caching.md
  • chat-completion.md
  • grammar.md
  • index.md
  • llama-model.md
  • low-level.md
  • server.md
  • tokenization.md
  • vision.md