dpearson2699/swift-ios-skills

Agent skills for iOS, iPadOS, Swift, SwiftUI, and modern Apple framework development.

MLX Swift & llama.cpp Reference

Name: dpearson2699/swift-ios-skills
Rating: 71.9 (1 review)
Author: dpearson2699

Complete reference for running open-source LLMs on Apple platforms using MLX Swift and llama.cpp.

MLX Swift
llama.cpp
Multi-Backend Architecture
Built-in Apple Frameworks
Performance Best Practices
Review Checklist

MLX Swift

Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Key Characteristics

Unified memory: operations run on CPU or GPU without data transfer
Lazy computation: operations computed only when needed
Automatic differentiation for training
Metal GPU acceleration
Research-oriented but increasingly used in production

Loading and Running LLMs

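import MLX
import MLXLLM
import MLXLMCommon  // assumption: recent releases declare ModelConfiguration, UserInput, and generate(...) here

// Identify a 4-bit quantized model published under mlx-community on Hugging Face.
let config = ModelConfiguration(
    id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
)
// Fetch the weights if needed and load them into a container.
let model = try await LLMModelFactory.shared.loadContainer(
    configuration: config
)

try await model.perform { context in
    // Convert the prompt into model input.
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Hello")
    )
    // Temperature 0.0 selects greedy decoding, so output is repeatable.
    let stream = try generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.0),
        context: context
    )
    // Print text chunks as they arrive; events without text have a nil chunk.
    for await part in stream {
        print(part.chunk ?? "", terminator: "")
    }
}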

Recommended Models by Device

Device	RAM	Recommended Model	Disk Size	RAM Usage
iPhone 12-14	4-6 GB	SmolLM2-135M or Qwen 2.5 0.5B	~278 MB	~0.3 GB
iPhone 15 Pro+	8 GB	Gemma 3n E4B 4-bit	~2.7 GB	~3.5 GB
Mac 8 GB	8 GB	Llama 3.2 3B 4-bit	~1.8 GB	~3 GB
Mac 16 GB+	16 GB+	Mistral 7B 4-bit	~4 GB	~6 GB
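
The table above can be turned into a small tier lookup. The following is a minimal sketch, not part of MLX Swift: `ModelTier` and `recommendedTier(ramGB:isMac:)` are hypothetical names, the model names are copied from the table, and the 7.5 GB and 15.5 GB cutoffs are assumptions that leave slack for devices reporting slightly less than their nominal RAM.

```swift
import Foundation

/// Rows of the recommendation table; the case names are hypothetical.
enum ModelTier: String {
    case small = "SmolLM2-135M or Qwen 2.5 0.5B"
    case phone8GB = "Gemma 3n E4B 4-bit"
    case mac8GB = "Llama 3.2 3B 4-bit"
    case mac16GB = "Mistral 7B 4-bit"
}

/// Maps installed RAM (in GB) to a table row.
/// Cutoffs of 7.5 and 15.5 GB are assumptions, not values from the table.
func recommendedTier(ramGB: Double, isMac: Bool) -> ModelTier {
    if isMac {
        if ramGB >= 15.5 { return .mac16GB }
        if ramGB >= 7.5 { return .mac8GB }
        return .small
    }
    // The table lists no phone row above 8 GB, so larger devices stay on this tier.
    return ramGB >= 7.5 ? .phone8GB : .small
}

// Query the current machine.
let installedGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
#if os(macOS)
let runningOnMac = true
#else
let runningOnMac = false
#endif
print(recommendedTier(ramGB: installedGB, isMac: runningOnMac).rawValue)
```

Keeping the lookup as a pure function makes it easy to unit test without a device for each tier.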

Memory Management Rules

Never exceed 60% of total RAM on iOS

Set GPU cache limits:

MLX.GPU.set(cacheLimit: 512 * 1024 * 1024) // 512 MB

Monitor memory pressure and reduce cache under pressure
Unload models on app backgrounding
Use "Increased Memory Limit" entitlement for larger models on iOS
Pre-flight memory checks before loading models
Physical device required (no simulator support for Metal GPU)
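
The 60% rule and the pre-flight check can be combined into one guard that runs before any weights are loaded. This is a minimal sketch under stated assumptions: `fitsMemoryBudget` is a hypothetical helper, and the estimated footprint is supplied by the caller (for example, the RAM Usage column of the device table).

```swift
import Foundation

/// Hypothetical pre-flight check: true when the estimated footprint
/// stays within `budgetFraction` of total RAM (0.6 mirrors the 60% rule).
func fitsMemoryBudget(
    estimatedBytes: UInt64,
    totalBytes: UInt64 = ProcessInfo.processInfo.physicalMemory,
    budgetFraction: Double = 0.6
) -> Bool {
    let budget = UInt64(Double(totalBytes) * budgetFraction)
    return estimatedBytes <= budget
}

let gib: UInt64 = 1 << 30

// A ~3.5 GB model on an 8 GB device: the budget is about 4.8 GB, so it fits.
print(fitsMemoryBudget(estimatedBytes: 7 * gib / 2, totalBytes: 8 * gib))

// A ~6 GB model on the same device exceeds the budget.
print(fitsMemoryBudget(estimatedBytes: 6 * gib, totalBytes: 8 * gib))
```

This only compares against total RAM. On iOS, `os_proc_available_memory()` reports how much the process can still allocate, which may be a tighter bound when other allocations are already live.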

Model Lifecycle Management

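import MLX
import MLXLLM
import MLXLMCommon  // assumption: recent releases declare ModelContainer and ModelConfiguration here
import Observation

@Observable
class ModelManager {
    // Type returned by loadContainer(configuration:).
    private var model: ModelContainer?
    // In-flight generations; 0 means the model is loaded but idle.
    private var generationCount = 0

    func loadModel() async throws {
        let config = ModelConfiguration(
            id: "mlx-community/Llama-3.2-3B-Instruct-4bit"
        )
        model = try await LLMModelFactory.shared.loadContainer(
            configuration: config
        )
    }

    func unloadModel() {
        // Drop this reference so the weights can be released.
        model = nil
        // A zero limit stops MLX from caching freed buffers;
        // set a nonzero limit again before the next load.
        MLX.GPU.set(cacheLimit: 0)
    }
}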

Key lifecycle patterns:

Track active generation count to distinguish "loaded but idle" from "generating"
Unconditional cancellation on app backgrounding
5-second delayed force-unload after backgrounding
Platform-specific memory monitoring (UIKit on iOS, DispatchSource on macOS)
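
The first three patterns amount to a small state machine. Below is a minimal sketch of that bookkeeping, kept free of MLX so the logic can be tested on its own. `ModelState` and `GenerationTracker` are hypothetical names, and scheduling the 5-second delay is left to the caller.

```swift
import Foundation

/// Distinguishes "loaded but idle" from "generating".
enum ModelState: Equatable {
    case unloaded
    case idle
    case generating(active: Int)
}

/// Hypothetical bookkeeping type; it does not touch the model itself.
struct GenerationTracker {
    private(set) var isLoaded = false
    private(set) var activeGenerations = 0

    var state: ModelState {
        guard isLoaded else { return .unloaded }
        return activeGenerations == 0 ? .idle : .generating(active: activeGenerations)
    }

    mutating func didLoad() { isLoaded = true }
    mutating func beginGeneration() { activeGenerations += 1 }
    mutating func endGeneration() { activeGenerations = max(0, activeGenerations - 1) }

    /// Backgrounding cancels unconditionally. Returns true when a model is
    /// still loaded, meaning the caller should schedule the delayed unload.
    mutating func didEnterBackground() -> Bool {
        activeGenerations = 0
        return isLoaded
    }

    mutating func didUnload() {
        isLoaded = false
        activeGenerations = 0
    }
}

var tracker = GenerationTracker()
tracker.didLoad()
tracker.beginGeneration()
print(tracker.state)
```

A manager class can own one tracker and consult `state` before deciding whether an unload would interrupt work.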

Background Handling

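import UIKit

// iOS: observe app lifecycle. Keep the returned token so the observer can be removed later.
let backgroundObserver = NotificationCenter.default.addObserver(
    forName: UIApplication.didEnterBackgroundNotification,
    object: nil, queue: .main
) { _ in
    // Stop any in-flight generation immediately.
    // cancelGeneration() is assumed to be defined on the manager.
    modelManager.cancelGeneration()
    Task {
        // Grace period before the force-unload. iOS may suspend the app before
        // this fires unless background execution time was requested.
        try await Task.sleep(for: .seconds(5))
        modelManager.unloadModel()
    }
}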

llama.cpp

C/C++ LLM inference engine. Best cross-platform support. Uses GGUF model format.

Swift Integration (swift-llama-cpp)

import SwiftLlamaCpp

let service = LlamaService(
    modelUrl: modelURL,
    config: .init(
        batchSize: 256,
        maxTokenCount: 4096,
        useGPU: true
    )
)

let messages = [
    LlamaChatMessage(role: .system, content: "You are helpful."),
    LlamaChatMessage(role: .user, content: "Hello")
]

let stream = try await service.streamCompletion(
    of: messages,
    samplingConfig: .init(temperature: 0.8)
)
for try await token in stream {
    print(token, terminator: "")
}

GGUF Quantization Levels

Level	Quality	Size	Use Case
Q2_K	Lowest	Smallest	Extreme memory constraints
Q4_K_M	Good	Balanced	Mobile devices (recommended)
Q5_K_M	Higher	Larger	When quality matters more
Q8_0	Near-original	Largest	Desktop with ample RAM
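
The Size column follows from simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below uses a hypothetical helper, `estimatedSizeGB`. Q8_0 stores blocks of 32 eight-bit weights plus one 16-bit scale, which gives 8.5 bits per weight. The 4.5 figure is that of the basic 4-bit block layout and is used here only as a stand-in, since the effective rate of K-quants such as Q4_K_M varies by model.

```swift
import Foundation

/// Approximate weight-file size in decimal gigabytes.
/// Ignores metadata and any tensors kept at higher precision.
func estimatedSizeGB(parameterCount: Double, bitsPerWeight: Double) -> Double {
    parameterCount * bitsPerWeight / 8 / 1_000_000_000
}

// 7B parameters at 8.5 bits per weight (Q8_0).
let q8 = estimatedSizeGB(parameterCount: 7e9, bitsPerWeight: 8.5)

// 7B parameters at 4.5 bits per weight (assumed stand-in for a 4-bit quant).
let q4 = estimatedSizeGB(parameterCount: 7e9, bitsPerWeight: 4.5)

print(String(format: "8.5 bpw: %.1f GB, 4.5 bpw: %.1f GB", q8, q4))
```

The 4-bit result of roughly 3.9 GB is in line with the ~4 GB disk size listed for Mistral 7B 4-bit in the device table.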

llama.cpp vs MLX Swift

Aspect	llama.cpp	MLX Swift
Model format	GGUF	Hugging Face / MLX format
Platform support	Cross-platform	Apple only
Throughput (Apple Silicon)	Good	Best
Model ecosystem	Broadest	mlx-community models
Maturity	Very mature	Evolving
Memory efficiency	Excellent	Good

Multi-Backend Architecture

When an app needs multiple AI backends:

Fallback Chain Pattern

func respond(to prompt: String) async throws -> String {
    // Try Foundation Models first (zero setup, best integration)
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    }

    // Fall back to MLX Swift (best throughput)
    if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    }

    // Fall back to llama.cpp (broadest compatibility)
    if llamaModelAvailable() {
        return try await llamaRespond(prompt)
    }

    throw AIError.noBackendAvailable
}

Architecture Guidelines

Create a router that checks Foundation Models availability first
Fall back to MLX or llama.cpp when Foundation Models is unavailable
Define model tiers based on device capabilities
Serialize all model access through a coordinator actor to prevent contention
Ensure tool systems work across backends (schema translation may be needed)
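
One way to keep the router independent of any single framework is to hide each backend behind a protocol and walk a priority-ordered list. This is a minimal sketch: `TextBackend`, `BackendRouter`, `NoBackendAvailable`, and `StubBackend` are hypothetical names, and the stubs stand in for real Foundation Models, MLX, and llama.cpp adapters.

```swift
import Foundation

/// Hypothetical abstraction over one inference backend.
protocol TextBackend: Sendable {
    var name: String { get }
    var isAvailable: Bool { get }
    func respond(to prompt: String) async throws -> String
}

struct NoBackendAvailable: Error {}

/// Picks the first available backend in priority order.
struct BackendRouter: Sendable {
    let backends: [any TextBackend]

    func selectBackend() -> (any TextBackend)? {
        backends.first { $0.isAvailable }
    }

    func respond(to prompt: String) async throws -> String {
        guard let backend = selectBackend() else { throw NoBackendAvailable() }
        return try await backend.respond(to: prompt)
    }
}

/// Stand-in used for illustration and tests only.
struct StubBackend: TextBackend {
    let name: String
    let isAvailable: Bool
    func respond(to prompt: String) async throws -> String {
        "[\(name)] \(prompt)"
    }
}

// Foundation Models first, then MLX, then llama.cpp.
let router = BackendRouter(backends: [
    StubBackend(name: "foundationModels", isAvailable: false),
    StubBackend(name: "mlx", isAvailable: true),
    StubBackend(name: "llamaCpp", isAvailable: true),
])
print(router.selectBackend()?.name ?? "none")
```

Because selection is a synchronous query over the list, the priority order can be tested without loading any model.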

Coordinator Actor

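actor ModelCoordinator {
    private var activeBackend: Backend?
    private var isBusy = false
    private var waiters: [CheckedContinuation<Void, Never>] = []

    func withExclusiveAccess<T: Sendable>(
        _ work: @Sendable () async throws -> T
    ) async rethrows -> T {
        // Actors are reentrant: while one caller is suspended inside work(),
        // another call could start. Queue later callers until it finishes.
        if isBusy {
            await withCheckedContinuation { waiters.append($0) }
        } else {
            isBusy = true
        }
        defer {
            // Hand the slot to the next waiter, or mark the coordinator idle.
            if waiters.isEmpty { isBusy = false } else { waiters.removeFirst().resume() }
        }
        return try await work()
    }

    enum Backend {
        case foundationModels
        case mlx
        case llamaCpp
    }
}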

Built-in Apple Frameworks

Before reaching for custom models, consider built-in frameworks:

Natural Language Framework

No model downloads required:

NLLanguageRecognizer -- Language detection
NLTokenizer -- Word, sentence, paragraph tokenization
NLTagger -- Parts of speech, named entity recognition, sentiment
NLEmbedding -- Word and sentence vectors, similarity search
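
As a brief illustration of the first two classes, the sketch below wraps language detection and word tokenization in two small functions. The function names are hypothetical; the code is guarded so it compiles only where the NaturalLanguage framework is present.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage

/// Returns the language code of the most likely language, or nil if undetermined.
func detectLanguage(of text: String) -> String? {
    NLLanguageRecognizer.dominantLanguage(for: text)?.rawValue
}

/// Splits text into word tokens.
func words(in text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    return tokenizer
        .tokens(for: text.startIndex..<text.endIndex)
        .map { String(text[$0]) }
}

print(detectLanguage(of: "Bonjour tout le monde, comment allez-vous ?") ?? "unknown")
print(words(in: "Unified memory avoids copies."))
#endif
```

Neither call downloads a model, which is the main reason to try these before a custom LLM for simple text tasks.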

Vision Framework

Built-in computer vision (legacy VN* API; for iOS 18+ prefer modern Swift equivalents like RecognizeTextRequest):

VNRecognizeTextRequest -- OCR
VNClassifyImageRequest -- Image classification
VNDetectFaceRectanglesRequest -- Face detection
VNDetectHumanBodyPoseRequest -- Body pose estimation

Create ML

Training custom classifiers directly on device or Mac:

Image classification
Text classification
Tabular data models
Sound classification

Performance Best Practices

Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
Use session.prewarm() for Foundation Models before user interaction
Batch Vision framework requests in a single perform() call
Use .fast recognition level for real-time camera processing
Neural Engine (Core ML) is most energy-efficient for compatible operations
For MLX Swift, monitor token generation speed and adjust model size if below acceptable thresholds
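
For the last point, throughput can be tracked with a small accumulator fed from the generation loop. This is a minimal sketch: `ThroughputMeter` is a hypothetical type, and the minimum acceptable rate is an app-specific choice that the source does not specify.

```swift
import Foundation

/// Accumulates token counts and wall-clock time across generations.
struct ThroughputMeter {
    private(set) var tokens = 0
    private(set) var seconds = 0.0

    mutating func record(tokens newTokens: Int, elapsed: TimeInterval) {
        tokens += newTokens
        seconds += elapsed
    }

    var tokensPerSecond: Double {
        seconds > 0 ? Double(tokens) / seconds : 0
    }

    /// True when measured throughput is below the caller's threshold.
    /// Returns false until at least one measurement has been recorded.
    func shouldDowngrade(minimumRate: Double) -> Bool {
        seconds > 0 && tokensPerSecond < minimumRate
    }
}

var meter = ThroughputMeter()
// Example measurement: 120 tokens generated in 10 seconds.
meter.record(tokens: 120, elapsed: 10)
print(meter.tokensPerSecond)
print(meter.shouldDowngrade(minimumRate: 15))
```

In a real loop, the elapsed time would come from timestamps taken around the generation call, measured outside the debugger as noted above.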

Review Checklist
