Python bindings for llama.cpp enabling efficient local language model inference without external API dependencies
PyLLaMACpp provides Python bindings for llama.cpp, letting developers run Meta's LLaMA language models and other compatible large language models directly in Python applications. It exposes both a high-level Python API through the `Model` class for easy integration, and low-level access to llama.cpp C-API functions for advanced users who need custom implementations.

## Installation

Install via the tessl CLI:

```sh
npx @tessl/cli install tessl/pypi-pyllamacpp@2.4.0
```
Or install from PyPI:

```sh
pip install pyllamacpp
```

## Imports

```python
from pyllamacpp.model import Model
```

For utility functions:

```python
from pyllamacpp import utils
```

For LangChain integration:

```python
from pyllamacpp.langchain_llm import PyllamacppLLM
```

For logging configuration:

```python
from pyllamacpp._logger import get_logger, set_log_level
```

For package constants:

```python
from pyllamacpp.constants import PACKAGE_NAME, LOGGING_LEVEL
```

For the web interface:

```python
from pyllamacpp.webui import webui, run
```

## Quick start

```python
from pyllamacpp.model import Model
# Load a GGML model
model = Model(model_path='/path/to/model.ggml')
# Generate text, streaming tokens as they are produced
for token in model.generate("Tell me a joke"):
    print(token, end='', flush=True)
# Or generate all at once using cpp_generate
response = model.cpp_generate("What is artificial intelligence?", n_predict=100)
print(response)
```

Interactive dialogue example:

```python
from pyllamacpp.model import Model
model = Model(model_path='/path/to/model.ggml')
while True:
    try:
        prompt = input("You: ")
        if prompt == '':
            continue
        print("AI:", end='')
        for token in model.generate(prompt):
            print(token, end='', flush=True)
        print()
    except KeyboardInterrupt:
        break
```

## Architecture

PyLLaMACpp operates as a bridge between Python and the high-performance llama.cpp C++ library.
This design delivers high performance by leveraging llama.cpp's optimized C++ implementation while keeping the ease of use of a Python interface. It suits chatbots, text generation, interactive AI applications, and any project that needs efficient local language model inference without external API dependencies.
## Model

Core functionality for loading models, generating text, and managing model state. Includes both streaming token generation and batch text generation, with extensive parameter control.

```python
class Model:
    def __init__(self, model_path: str, prompt_context: str = '', prompt_prefix: str = '', prompt_suffix: str = '', log_level: int = logging.ERROR, n_ctx: int = 512, seed: int = 0, n_gpu_layers: int = 0, f16_kv: bool = False, logits_all: bool = False, vocab_only: bool = False, use_mlock: bool = False, embedding: bool = False): ...
    def generate(self, prompt: str, n_predict: Union[None, int] = None, n_threads: int = 4, **kwargs) -> Generator: ...
    def cpp_generate(self, prompt: str, n_predict: int = 128, **kwargs) -> str: ...
    def tokenize(self, text: str): ...
    def detokenize(self, tokens: list): ...
    def reset(self) -> None: ...
```
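A minimal sketch of the tokenization helpers, assuming `tokenize` returns a list of token ids that `detokenize` maps back to a string (the model path is a placeholder):

```python
from pyllamacpp.model import Model

model = Model(model_path='/path/to/model.ggml')

# Round-trip a string through the tokenizer.
tokens = model.tokenize("Hello, world")
text = model.detokenize(tokens)
print(tokens, text)

# Clear the internal prompt context before starting an unrelated conversation.
model.reset()
```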
## Utilities

Helper functions for model format conversion and quantization: converting LLaMA PyTorch checkpoints to GGML format, and quantizing models to reduce their size.

```python
def llama_to_ggml(dir_model: str, ftype: int = 1) -> str: ...
def quantize(ggml_model_path: str, output_model_path: str = None, itype: int = 2) -> str: ...
```
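A sketch of the conversion pipeline, assuming both helpers are exposed on `pyllamacpp.utils` (the checkpoint directory is a placeholder):

```python
from pyllamacpp import utils

# Convert a LLaMA PyTorch checkpoint directory to GGML (ftype=1 is the default).
ggml_path = utils.llama_to_ggml(dir_model='/path/to/llama-7b', ftype=1)

# Quantize the converted file to shrink it; with output_model_path=None the
# helper presumably chooses an output name itself (an assumption).
quantized_path = utils.quantize(ggml_model_path=ggml_path, itype=2)
print(quantized_path)
```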
## LangChain integration

LangChain-compatible wrapper class enabling seamless integration with LangChain workflows and chains. It exposes the same interface as other LangChain LLM implementations.

```python
class PyllamacppLLM(LLM):
    model: str
    n_ctx: int = 512
    seed: int = 0
    n_threads: int = 4
    n_predict: int = 50
    temp: float = 0.8
    top_p: float = 0.95
    top_k: int = 40
```
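A minimal usage sketch, assuming the wrapper is invoked like any other LangChain LLM (the model path is a placeholder, and the invocation style may vary across LangChain versions):

```python
from pyllamacpp.langchain_llm import PyllamacppLLM

llm = PyllamacppLLM(
    model='/path/to/model.ggml',  # the `model` field from the stub above
    n_predict=100,
    temp=0.7,
)

# Older LangChain versions allow calling the LLM directly with a prompt.
print(llm("Explain quantization in one sentence."))
```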
## Embeddings

Vector embeddings functionality for semantic similarity and RAG applications. Supports generating embeddings for individual prompts, or extracting the embeddings of the current model context.

```python
def get_embeddings(self) -> List[float]: ...
def get_prompt_embeddings(self, prompt: str, n_threads: int = 4, n_batch: int = 512) -> List[float]: ...
```
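A semantic-similarity sketch using `get_prompt_embeddings`; loading the model with `embedding=True` (a flag from the `Model` constructor above) is assumed to be required for embedding extraction:

```python
import math
from pyllamacpp.model import Model

# Assumption: embedding=True enables embedding extraction.
model = Model(model_path='/path/to/model.ggml', embedding=True)

a = model.get_prompt_embeddings("The cat sat on the mat")
b = model.get_prompt_embeddings("A feline rested on the rug")

# Cosine similarity between the two embedding vectors.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norm)
```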
## Web UI

Streamlit-based web interface for interactive model testing and development. Provides a browser-based chat interface with configurable parameters and real-time model interaction.

```python
def webui() -> None: ...
def run(): ...
```
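A minimal launch sketch; whether `run()` starts the Streamlit server directly or is meant to be wrapped by `streamlit run` is an assumption here:

```python
from pyllamacpp.webui import run

run()  # assumed to start the browser-based chat interface
```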
## Command-line interface

Interactive command-line interface for model testing and development. Provides a configurable chat interface with extensive parameter control and debugging features.

```sh
pyllamacpp path/to/model.ggml
```