tessl/pypi-pyllamacpp

Python bindings for llama.cpp enabling efficient local language model inference without external API dependencies

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/pyllamacpp@2.4.x

To install, run

npx @tessl/cli install tessl/pypi-pyllamacpp@2.4.0


PyLLaMACpp

Python bindings for llama.cpp that let developers run Facebook's LLaMA language models and other compatible large language models directly in Python applications. PyLLaMACpp provides a high-level Python API through the Model class for easy integration, as well as low-level access to llama.cpp C-API functions for advanced users who need custom implementations.

Package Information

  • Package Name: pyllamacpp
  • Language: Python
  • Installation: pip install pyllamacpp
  • Dependencies: CMake, pybind11 (for building from source); optional: numpy, torch, sentencepiece (for model conversion utilities)

Core Imports

from pyllamacpp.model import Model

For utility functions:

from pyllamacpp import utils

For LangChain integration:

from pyllamacpp.langchain_llm import PyllamacppLLM

For logging configuration:

from pyllamacpp._logger import get_logger, set_log_level

For package constants:

from pyllamacpp.constants import PACKAGE_NAME, LOGGING_LEVEL

For web interface:

from pyllamacpp.webui import webui, run
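
The logging helpers and package constants can be combined to control how verbose the bindings are. A minimal sketch, assuming set_log_level accepts a standard logging level constant (its exact signature is not documented above):

import logging

from pyllamacpp._logger import set_log_level
from pyllamacpp.constants import PACKAGE_NAME, LOGGING_LEVEL

# LOGGING_LEVEL is the package default; PACKAGE_NAME identifies the package's logger.
print(f"{PACKAGE_NAME} default log level: {LOGGING_LEVEL}")

# Assumption: set_log_level accepts a standard logging level constant.
set_log_level(logging.DEBUG)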

Basic Usage

from pyllamacpp.model import Model

# Load a GGML model
model = Model(model_path='/path/to/model.ggml')

# Generate text streaming tokens
for token in model.generate("Tell me a joke"):
    print(token, end='', flush=True)

# Or generate all at once using cpp_generate
response = model.cpp_generate("What is artificial intelligence?", n_predict=100)
print(response)

Interactive dialogue example:

from pyllamacpp.model import Model

model = Model(model_path='/path/to/model.ggml')

while True:
    try:
        prompt = input("You: ")
        if prompt == '':
            continue
        print("AI:", end='')
        for token in model.generate(prompt):
            print(token, end='', flush=True)
        print()
    except KeyboardInterrupt:
        break

Architecture

PyLLaMACpp operates as a bridge between Python and the high-performance llama.cpp C++ library:

  • Model Class: High-level Python interface providing text generation, tokenization, and embedding capabilities
  • C++ Extension (_pyllamacpp): Direct bindings to llama.cpp functions built with pybind11
  • Utility Functions: Model format conversion and quantization tools
  • Integration Wrappers: LangChain compatibility and web UI interfaces
  • CLI Interface: Command-line tool for interactive model testing

The architecture delivers high performance by leveraging llama.cpp's optimized C++ implementation while keeping usage simple through the Python interface. This makes it suitable for chatbots, text generation, interactive AI applications, and any project requiring efficient local language model inference without external API dependencies.

Capabilities

Model Operations

Core functionality for loading models, generating text, and managing model state. Includes both streaming token generation and batch text generation methods with extensive parameter control.

class Model:
    def __init__(self, model_path: str, prompt_context: str = '', prompt_prefix: str = '',
                 prompt_suffix: str = '', log_level: int = logging.ERROR, n_ctx: int = 512,
                 seed: int = 0, n_gpu_layers: int = 0, f16_kv: bool = False,
                 logits_all: bool = False, vocab_only: bool = False, use_mlock: bool = False,
                 embedding: bool = False): ...
    def generate(self, prompt: str, n_predict: Union[None, int] = None, n_threads: int = 4, **kwargs) -> Generator: ...
    def cpp_generate(self, prompt: str, n_predict: int = 128, **kwargs) -> str: ...
    def tokenize(self, text: str): ...
    def detokenize(self, tokens: list): ...
    def reset(self) -> None: ...
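
A usage sketch of these operations. The sampling keyword arguments (temp, top_k, top_p) are assumed to be accepted through **kwargs, mirroring the parameters exposed by the LangChain wrapper below:

from pyllamacpp.model import Model

model = Model(model_path='/path/to/model.ggml', n_ctx=1024)

# Streaming generation; temp/top_k/top_p are passed through **kwargs (assumed).
for token in model.generate("Explain quantization briefly.", n_predict=64, n_threads=4,
                            temp=0.7, top_k=40, top_p=0.95):
    print(token, end='', flush=True)
print()

# Batch generation returns the full completion as a single string
summary = model.cpp_generate("Summarize llama.cpp in one sentence.", n_predict=48)
print(summary)

# Tokenize/detokenize round-trip
tokens = model.tokenize("hello world")
print(len(tokens), model.detokenize(tokens))

# Clear the accumulated prompt context before starting an unrelated conversation
model.reset()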

See docs/model-operations.md for details.

Utility Functions

Helper functions for model format conversion and quantization. Includes conversion from LLaMA PyTorch models to GGML format and quantization for reduced model sizes.

def llama_to_ggml(dir_model: str, ftype: int = 1) -> str: ...
def quantize(ggml_model_path: str, output_model_path: str = None, itype: int = 2) -> str: ...
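
A sketch of a typical conversion-then-quantization flow. It assumes ftype=1 and itype=2 keep their default meanings and that quantize derives an output path automatically when output_model_path is left as None:

from pyllamacpp import utils

# Convert an original LLaMA PyTorch checkpoint directory to GGML;
# the function returns the path of the converted model file.
ggml_path = utils.llama_to_ggml('/path/to/llama-7b', ftype=1)

# Quantize the GGML model to reduce its size on disk and in memory.
quantized_path = utils.quantize(ggml_path, itype=2)
print(quantized_path)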

See docs/utilities.md for details.

LangChain Integration

LangChain-compatible wrapper class enabling seamless integration with LangChain workflows and chains. Provides the same interface as other LangChain LLM implementations.

class PyllamacppLLM(LLM):
    model: str
    n_ctx: int = 512
    seed: int = 0
    n_threads: int = 4
    n_predict: int = 50
    temp: float = 0.8
    top_p: float = 0.95
    top_k: int = 40
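
A minimal sketch of wiring the wrapper into LangChain, assuming the model field takes the GGML model path and that the class supports LangChain's classic call-style LLM interface:

from pyllamacpp.langchain_llm import PyllamacppLLM

# Assumption: `model` is the path to the GGML model file.
llm = PyllamacppLLM(
    model='/path/to/model.ggml',
    n_ctx=512,
    n_predict=64,
    temp=0.7,
    top_k=40,
    top_p=0.95,
)

# Use like any other LangChain LLM.
print(llm("List three uses of local language models."))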

See docs/langchain-integration.md for details.

Embeddings

Vector embeddings functionality for semantic similarity and RAG applications. Supports generating embeddings for individual prompts or extracting embeddings from current model context.

def get_embeddings(self) -> List[float]: ...
def get_prompt_embeddings(self, prompt: str, n_threads: int = 4, n_batch: int = 512) -> List[float]: ...
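
A sketch of comparing two prompts by cosine similarity. It assumes the model must be constructed with embedding=True (the flag shown in the Model constructor) before embeddings can be extracted:

import math

from pyllamacpp.model import Model

# Assumption: embedding=True is required to enable embedding extraction.
model = Model(model_path='/path/to/model.ggml', embedding=True)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

e1 = model.get_prompt_embeddings("llama.cpp runs models locally")
e2 = model.get_prompt_embeddings("local inference without an external API")
print(f"similarity: {cosine(e1, e2):.3f}")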

See docs/embeddings.md for details.

Web User Interface

Streamlit-based web interface for interactive model testing and development. Provides browser-based chat interface with configurable parameters and real-time model interaction.

def webui() -> None: ...
def run(): ...
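
A minimal sketch of launching the interface programmatically; how the model path is supplied (through the UI or command-line arguments) is not specified here:

from pyllamacpp.webui import run

# Assumed entry point: starts the Streamlit-based chat interface.
run()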

See docs/web-ui.md for details.

Command Line Interface

Interactive command-line interface for model testing and development. Provides configurable chat interface with extensive parameter control and debugging features.

pyllamacpp path/to/model.ggml

See docs/cli.md for details.