Python bindings for llama.cpp enabling efficient local language model inference without external API dependencies
—
Interactive command-line interface for model testing and development. The CLI provides a configurable chat interface with extensive parameter control, debugging features, and direct access to model capabilities for experimentation.
Launch the interactive chat interface with a model file:
pyllamacpp /path/to/model.ggml

This starts an interactive session where you can chat with the model:
██████╗ ██╗ ██╗██╗ ██╗ █████╗ ███╗ ███╗ █████╗ ██████╗██████╗ ██████╗
██╔══██╗╚██╗ ██╔╝██║ ██║ ██╔══██╗████╗ ████║██╔══██╗██╔════╝██╔══██╗██╔══██╗
██████╔╝ ╚████╔╝ ██║ ██║ ███████║██╔████╔██║███████║██║ ██████╔╝██████╔╝
██╔═══╝ ╚██╔╝ ██║ ██║ ██╔══██║██║╚██╔╝██║██╔══██║██║ ██╔═══╝ ██╔═══╝
██║ ██║ ███████╗███████╗██║ ██║██║ ╚═╝ ██║██║ ██║╚██████╗██║ ██║
╚═╝ ╚═╝ ╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝
PyLLaMACpp
A simple Command Line Interface to test the package
Version: 2.4.3
You: Hello, how are you?
AI: I'm doing well, thank you for asking! How can I help you today?
You:

The CLI supports extensive parameter customization:
pyllamacpp --help
usage: pyllamacpp [-h] [--n_ctx N_CTX] [--seed SEED] [--f16_kv F16_KV]
[--logits_all LOGITS_ALL] [--vocab_only VOCAB_ONLY]
[--use_mlock USE_MLOCK] [--embedding EMBEDDING]
[--n_predict N_PREDICT] [--n_threads N_THREADS]
[--repeat_last_n REPEAT_LAST_N] [--top_k TOP_K]
[--top_p TOP_P] [--temp TEMP] [--repeat_penalty REPEAT_PENALTY]
[--n_batch N_BATCH]
model
positional arguments:
model The path of the model file
options:
-h, --help show this help message and exit
# Context Parameters
--n_ctx N_CTX text context (default: 512)
--seed SEED RNG seed (default: -1 for random)
--f16_kv F16_KV use fp16 for KV cache (default: False)
--logits_all LOGITS_ALL
compute all logits, not just the last one (default: False)
--vocab_only VOCAB_ONLY
only load vocabulary, no weights (default: False)
--use_mlock USE_MLOCK
force system to keep model in RAM (default: False)
--embedding EMBEDDING
embedding mode only (default: False)
# Generation Parameters
--n_predict N_PREDICT
Number of tokens to predict (default: 256)
--n_threads N_THREADS
Number of threads (default: 4)
--repeat_last_n REPEAT_LAST_N
Last n tokens to penalize (default: 64)
--top_k TOP_K top_k sampling (default: 40)
--top_p TOP_P top_p sampling (default: 0.95)
--temp TEMP temperature (default: 0.8)
--repeat_penalty REPEAT_PENALTY
repeat_penalty (default: 1.1)
--n_batch N_BATCH batch size for prompt processing (default: 512)

Configure the model for different use cases:
# High creativity configuration
pyllamacpp /path/to/model.ggml \
--temp 1.2 \
--top_p 0.9 \
--top_k 50 \
--n_predict 200
# Focused, deterministic responses
pyllamacpp /path/to/model.ggml \
--temp 0.1 \
--top_p 0.9 \
--top_k 20 \
--repeat_penalty 1.15
# Large context configuration
pyllamacpp /path/to/model.ggml \
--n_ctx 2048 \
--n_batch 1024 \
--n_threads 8
# Half-precision KV cache (reduces KV cache memory;
# note this CLI exposes no GPU offload flag)
pyllamacpp /path/to/model.ggml \
--f16_kv True
# Memory-optimized configuration
pyllamacpp /path/to/model.ggml \
--use_mlock True \
--n_batch 256

The CLI provides several interactive features:
The CLI includes built-in instruction-following templates:
# Default prompt templates in CLI
PROMPT_CONTEXT = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_PREFIX = "\n\n##Instruction:\n"
PROMPT_SUFFIX = "\n\n##Response:\n"

Example interaction with instruction format:
You: Explain how photosynthesis works
AI: Photosynthesis is the process by which plants convert light energy into chemical energy...

The CLI includes performance monitoring capabilities:
# Example CLI session with timing info
You: Tell me about machine learning
AI: Machine learning is a subset of artificial intelligence... (Generated in 2.3s, 45 tokens/s)
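A per-response timing line like the one above can be reproduced around any token generator with `time.perf_counter`. A minimal sketch, using a plain list of tokens as a stand-in for `model.generate` (the helper name is illustrative, not part of the package):

```python
import time

def timed_generate(token_iter):
    """Consume a token iterator; return (text, elapsed seconds, tokens/sec)."""
    start = time.perf_counter()
    tokens = list(token_iter)  # stand-in for streaming from model.generate
    elapsed = time.perf_counter() - start
    rate = len(tokens) / elapsed if elapsed > 0 else 0.0
    return "".join(tokens), elapsed, rate

# Stand-in for: model.generate("Tell me about machine learning", ...)
text, secs, rate = timed_generate(iter(["Machine ", "learning ", "is..."]))
print(f"{text} (Generated in {secs:.1f}s, {rate:.0f} tokens/s)")
```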
# System information display
Model: /path/to/llama-7b.ggml
Context size: 512 tokens
Threads: 4
Memory usage: 4.2 GB

The CLI uses structured parameter schemas for validation:
# Context parameters schema
LLAMA_CONTEXT_PARAMS_SCHEMA = {
    'n_ctx': {
        'type': int,
        'description': "text context",
        'default': 512
    },
    'seed': {
        'type': int,
        'description': "RNG seed",
        'default': -1
    },
    'f16_kv': {
        'type': bool,
        'description': "use fp16 for KV cache",
        'default': False
    },
    # ... more parameters
}
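A schema in this shape can also drive argument registration, keeping flags, help text, and defaults in one place. A hypothetical helper (not part of the package API) illustrating the idea:

```python
import argparse

def add_schema_args(parser, schema):
    """Register one --flag per schema entry with its type, help, and default."""
    for name, spec in schema.items():
        parser.add_argument(f"--{name}", type=spec['type'],
                            help=spec['description'], default=spec['default'])

schema = {
    'n_ctx': {'type': int, 'description': "text context", 'default': 512},
    'seed': {'type': int, 'description': "RNG seed", 'default': -1},
}
parser = argparse.ArgumentParser()
add_schema_args(parser, schema)
args = parser.parse_args(['--n_ctx', '1024'])
print(args.n_ctx, args.seed)  # 1024 -1
```

Boolean entries would need special handling in a real implementation, since argparse's `type=bool` treats any non-empty string (including `"False"`) as true.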
# Generation parameters schema
GPT_PARAMS_SCHEMA = {
    'n_predict': {
        'type': int,
        'description': "Number of tokens to predict",
        'default': 256
    },
    'n_threads': {
        'type': int,
        'description': "Number of threads",
        'default': 4
    },
    # ... more parameters
}

Access CLI functionality programmatically:
def main():
    """Main entry point for command line interface."""

def run(args):
    """
    Run interactive chat session with parsed arguments.

    Parameters:
    - args: Parsed command line arguments
    """

Example programmatic usage:
import argparse
from pyllamacpp.cli import run
# Create argument parser
parser = argparse.ArgumentParser()
parser.add_argument('model', help='Path to model file')
parser.add_argument('--temp', type=float, default=0.8)
parser.add_argument('--n_predict', type=int, default=128)
# Parse arguments and run
args = parser.parse_args(['/path/to/model.ggml', '--temp', '0.7'])
run(args)

Build custom CLI applications using the CLI components:
from pyllamacpp.model import Model
from pyllamacpp.cli import bcolors, PROMPT_CONTEXT, PROMPT_PREFIX, PROMPT_SUFFIX
import argparse

def custom_cli():
    parser = argparse.ArgumentParser(description="Custom PyLLaMACpp CLI")
    parser.add_argument('model', help='Model path')
    parser.add_argument('--system-prompt', default="You are a helpful assistant.")
    args = parser.parse_args()

    # Initialize model with custom configuration
    model = Model(
        model_path=args.model,
        prompt_context=args.system_prompt,
        prompt_prefix="\n\nUser: ",
        prompt_suffix="\n\nAssistant: "
    )

    print(f"{bcolors.HEADER}Custom PyLLaMACpp Chat{bcolors.ENDC}")
    print(f"Model: {args.model}")
    print(f"System: {args.system_prompt}")
    print("-" * 50)

    while True:
        try:
            user_input = input(f"{bcolors.OKBLUE}You: {bcolors.ENDC}")
            if user_input.lower() in ['exit', 'quit']:
                break
            print(f"{bcolors.OKGREEN}AI: {bcolors.ENDC}", end="")
            for token in model.generate(user_input, n_predict=150):
                print(token, end="", flush=True)
            print()
        except KeyboardInterrupt:
            print(f"\n{bcolors.WARNING}Goodbye!{bcolors.ENDC}")
            break

if __name__ == "__main__":
    custom_cli()

The CLI includes debugging features for development:
# Color codes for terminal output
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
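When stdout is redirected to a file or pipe, these escape codes end up as literal garbage in the output. A common guard is to colorize only when writing to a terminal; this is an illustrative pattern, not a feature of the shipped `bcolors` class:

```python
import sys

def colorize(text, color, reset='\033[0m'):
    """Wrap text in an ANSI color code only when stdout is a terminal."""
    if sys.stdout.isatty():
        return f"{color}{text}{reset}"
    return text

print(colorize("Model loaded successfully", '\033[92m'))
```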
# Usage in CLI output
print(f"{bcolors.OKGREEN}Model loaded successfully{bcolors.ENDC}")
print(f"{bcolors.WARNING}Warning: Large context size{bcolors.ENDC}")
print(f"{bcolors.FAIL}Error: Model file not found{bcolors.ENDC}")

Run the CLI in batch mode for automated testing:
# Process commands from file
echo "Tell me a joke" | pyllamacpp /path/to/model.ggml --n_predict 50
# Multiple prompts
cat prompts.txt | pyllamacpp /path/to/model.ggml --temp 0.5

Use the CLI for rapid prototyping and testing:
# Test different temperatures
for temp in 0.3 0.7 1.0; do
echo "Temperature: $temp"
echo "What is AI?" | pyllamacpp model.ggml --temp $temp --n_predict 50
echo "---"
done
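The shell loop above can be mirrored in Python with `subprocess` when you want to capture and compare outputs programmatically. A sketch that only builds the commands (the model path is a placeholder and the helper is hypothetical):

```python
import subprocess

def build_cmd(model_path, temp, n_predict=50):
    """Assemble the pyllamacpp invocation for one temperature setting."""
    return ["pyllamacpp", model_path,
            "--temp", str(temp), "--n_predict", str(n_predict)]

for temp in (0.3, 0.7, 1.0):
    cmd = build_cmd("model.ggml", temp)
    print("Temperature:", temp, "->", " ".join(cmd))
    # To actually run it:
    # subprocess.run(cmd, input=b"What is AI?", capture_output=True)
```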
# Performance testing
time pyllamacpp model.ggml --n_predict 1000 < test_prompt.txt
# Memory usage monitoring
/usr/bin/time -v pyllamacpp model.ggml --use_mlock True < test_prompt.txt

Install with Tessl CLI
npx tessl i tessl/pypi-pyllamacpp