tessl/pypi-pyllamacpp

Python bindings for llama.cpp enabling efficient local language model inference without external API dependencies

# Utility Functions

Helper functions for model format conversion and quantization. These utilities convert models between formats and shrink them to reduce storage and memory requirements at inference time.

## Capabilities

### Model Format Conversion

Convert LLaMA PyTorch models to GGML format for use with pyllamacpp. This function replicates the functionality of llama.cpp's `convert-pth-to-ggml.py` script.

```python
def llama_to_ggml(dir_model: str, ftype: int = 1) -> str:
    """
    Convert LLaMA PyTorch models to GGML format.

    This function converts Facebook's original LLaMA model files
    from PyTorch format to GGML format compatible with llama.cpp.

    Parameters:
    - dir_model: str, path to directory containing LLaMA model files
                 (should contain params.json and consolidated.0X.pth files)
    - ftype: int, precision format (0 for f32, 1 for f16, default: 1)

    Returns:
    str: Path to the converted GGML model file

    Raises:
    Exception: If model directory structure is invalid or conversion fails
    """
```

Example usage:

```python
from pyllamacpp import utils

# Convert LLaMA-7B model to f16 GGML format
ggml_path = utils.llama_to_ggml('/path/to/llama-7b/', ftype=1)
print(f"Converted model saved to: {ggml_path}")

# Convert to f32 format for higher precision
ggml_path_f32 = utils.llama_to_ggml('/path/to/llama-13b/', ftype=0)
print(f"F32 model saved to: {ggml_path_f32}")

# Use the converted model
from pyllamacpp.model import Model
model = Model(model_path=ggml_path)
```
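Because `llama_to_ggml` raises a generic `Exception` when the directory layout is wrong, it can help to validate the layout before starting a long conversion. A minimal pre-flight sketch based only on the documented layout (`params.json` plus `consolidated.0X.pth` shards); `looks_like_llama_dir` is a hypothetical helper, not part of pyllamacpp:

```python
from pathlib import Path

def looks_like_llama_dir(dir_model: str) -> bool:
    # Hypothetical pre-flight check mirroring the documented layout:
    # params.json plus one or more consolidated.0X.pth weight shards.
    p = Path(dir_model)
    return (p / "params.json").is_file() and any(p.glob("consolidated.*.pth"))
```

Calling this before `utils.llama_to_ggml` lets you fail fast with a clearer error message.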

### Model Quantization

Quantize GGML models to reduce file size and memory usage while maintaining reasonable inference quality. Supports Q4_0 and Q4_1 quantization formats.

```python
def quantize(ggml_model_path: str, output_model_path: str = None, itype: int = 2) -> str:
    """
    Quantize GGML model to reduce size and memory usage.

    Applies quantization to reduce model precision, significantly
    decreasing file size and memory requirements with minimal
    quality loss for most applications.

    Parameters:
    - ggml_model_path: str, path to input GGML model file
    - output_model_path: str or None, output path for quantized model
                        (default: input_path + '-q4_0.bin' or '-q4_1.bin')
    - itype: int, quantization type:
        - 2: Q4_0 quantization (4-bit, smaller file size)
        - 3: Q4_1 quantization (4-bit, slightly better quality)

    Returns:
    str: Path to the quantized model file

    Raises:
    Exception: If quantization process fails
    """
```

Example usage:

```python
from pyllamacpp import utils

# Quantize model using Q4_0 (default)
original_model = '/path/to/llama-7b.ggml'
quantized_path = utils.quantize(original_model)
print(f"Quantized model: {quantized_path}")

# Quantize with custom output path and Q4_1 format
quantized_custom = utils.quantize(
    ggml_model_path=original_model,
    output_model_path='/path/to/llama-7b-q4_1.ggml',
    itype=3
)

# Compare file sizes
import os
original_size = os.path.getsize(original_model) / (1024**3)  # GB
quantized_size = os.path.getsize(quantized_path) / (1024**3)  # GB
print(f"Original: {original_size:.2f} GB")
print(f"Quantized: {quantized_size:.2f} GB")
print(f"Size reduction: {(1 - quantized_size/original_size)*100:.1f}%")

# Use quantized model
from pyllamacpp.model import Model
model = Model(model_path=quantized_path)
```
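As a back-of-envelope check before quantizing, you can estimate the quantized weight size from the parameter count. The arithmetic below assumes the original ggml Q4 block layouts (per 32-weight block: 16 bytes of 4-bit quants plus an f32 scale, with an extra f32 minimum for Q4_1); treat the result as an approximation, since file headers and any non-quantized tensors add overhead:

```python
def est_q4_size_bytes(n_params: int, itype: int = 2) -> int:
    # Q4_0 (itype=2): each 32-weight block stores 16 bytes of 4-bit quants
    # plus a 4-byte f32 scale -> 20 bytes per block (~5 bits/weight).
    # Q4_1 (itype=3): adds a 4-byte f32 minimum -> 24 bytes per block.
    bytes_per_block = 20 if itype == 2 else 24
    return (n_params // 32) * bytes_per_block

# A 7B-parameter model at Q4_0 works out to roughly 4 GiB of quantized
# weights, versus ~13 GiB at f16 (2 bytes per weight).
print(est_q4_size_bytes(7_000_000_000) / 1024**3)
```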

### GPT4All Conversion

Placeholder function for converting GPT4All models (currently not implemented).

```python
def convert_gpt4all() -> str:
    """
    Convert GPT4All models (placeholder implementation).

    Note: This function is currently not implemented and will
    pass without performing any operations.

    Returns:
    str: Conversion result (implementation pending)
    """
```

### Logging Configuration

Logger configuration functions for controlling PyLLaMACpp's internal logging behavior.

````python
def get_logger():
    """
    Get the package logger instance.

    Returns the configured logger instance used throughout
    the PyLLaMACpp package for debugging and information output.

    Returns:
    logging.Logger: Package logger instance
    """

def set_log_level(log_level):
    """
    Set the logging level for the PyLLaMACpp package.

    Controls the verbosity of logging output from the package.
    Use standard Python logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL).

    Parameters:
    - log_level: int or logging level constant, desired logging level

    Example:
    ```python
    import logging
    from pyllamacpp._logger import set_log_level

    # Set to INFO level for detailed output
    set_log_level(logging.INFO)

    # Set to ERROR level for minimal output
    set_log_level(logging.ERROR)
    ```
    """
````

Example usage:

```python
import logging
from pyllamacpp._logger import get_logger, set_log_level
from pyllamacpp.model import Model

# Configure logging for debugging
set_log_level(logging.DEBUG)
logger = get_logger()

# Load model with debug logging
model = Model(model_path='/path/to/model.ggml')
logger.info("Model loaded successfully")

# Generate text with logging
response = model.cpp_generate("Test prompt", n_predict=50)
logger.info(f"Generated {len(response)} characters")
```
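Since `get_logger` returns a standard `logging.Logger`, you can attach your own handlers, for example to capture package output in a file or buffer. The pattern is sketched below with a stand-in stdlib logger so it runs without the package installed; substitute the logger returned by `get_logger()` in real use:

```python
import io
import logging

log = logging.getLogger("demo")  # stand-in for pyllamacpp's get_logger()
log.setLevel(logging.DEBUG)

buf = io.StringIO()  # a logging.FileHandler works the same way
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
log.addHandler(handler)

log.info("model loaded")
print(buf.getvalue().strip())  # -> INFO demo: model loaded
```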

### Package Constants

Package-level constants for identification and configuration.

```python
import logging

PACKAGE_NAME = 'pyllamacpp'
"""Package name identifier constant."""

LOGGING_LEVEL = logging.INFO
"""Default logging level for the package."""
```

Example usage:

```python
from pyllamacpp.constants import PACKAGE_NAME, LOGGING_LEVEL
import logging

print(f"Using {PACKAGE_NAME} package")

# Use default logging level
logging.basicConfig(level=LOGGING_LEVEL)
```

## Complete Workflow Example

Here's a complete example of the typical workflow, from a PyTorch LLaMA model to an optimized quantized model:

```python
from pyllamacpp import utils
from pyllamacpp.model import Model
import os

# Step 1: Convert PyTorch LLaMA model to GGML
print("Converting PyTorch model to GGML...")
ggml_model = utils.llama_to_ggml(
    dir_model='/path/to/llama-7b-pytorch/',
    ftype=1  # f16 precision
)
print(f"GGML model created: {ggml_model}")

# Step 2: Quantize the GGML model
print("Quantizing model...")
quantized_model = utils.quantize(
    ggml_model_path=ggml_model,
    itype=2  # Q4_0 quantization
)
print(f"Quantized model created: {quantized_model}")

# Step 3: Compare sizes
original_size = os.path.getsize(ggml_model) / (1024**2)  # MB
quantized_size = os.path.getsize(quantized_model) / (1024**2)  # MB
print(f"Size reduction: {original_size:.1f}MB -> {quantized_size:.1f}MB")

# Step 4: Test the quantized model
print("Testing quantized model...")
model = Model(model_path=quantized_model)
response = model.cpp_generate("Hello, how are you?", n_predict=50)
print(f"Model response: {response}")
```
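When scripting workflows like this, it can be handy to predict the quantized filename without calling `quantize`, for example to skip steps that already ran. A hypothetical helper that mirrors the documented default naming (input path + `'-q4_0.bin'` or `'-q4_1.bin'`); it is illustrative, not part of pyllamacpp:

```python
def default_quant_path(ggml_model_path: str, itype: int = 2) -> str:
    # Mirrors the documented default output naming of utils.quantize.
    suffix = "-q4_0.bin" if itype == 2 else "-q4_1.bin"
    return ggml_model_path + suffix

print(default_quant_path("/models/llama-7b.ggml"))           # -> /models/llama-7b.ggml-q4_0.bin
print(default_quant_path("/models/llama-7b.ggml", itype=3))  # -> /models/llama-7b.ggml-q4_1.bin
```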

## Dependencies

The utility functions require additional dependencies:

```python
# Required for llama_to_ggml
import torch
import numpy as np
from sentencepiece import SentencePieceProcessor

# Built-in dependencies
import json
import struct
import sys
from pathlib import Path
```

Make sure these are installed:

```shell
pip install torch numpy sentencepiece
```
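To fail fast with a clear message when one of these is missing, you can probe for the modules before importing them. A small sketch using the stdlib `importlib`; `check_deps` is illustrative, not part of the package:

```python
import importlib.util

def check_deps(modules):
    # Return the subset of module names that cannot be imported.
    return [m for m in modules if importlib.util.find_spec(m) is None]

missing = check_deps(["torch", "numpy", "sentencepiece"])
if missing:
    print("missing dependencies:", ", ".join(missing))
```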

## Install with Tessl CLI

```shell
npx tessl i tessl/pypi-pyllamacpp
```
