apple-on-device-ai by dpearson2699/swift-ios-skills
npx skills add https://github.com/dpearson2699/swift-ios-skills --skill apple-on-device-ai
Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.
Use this decision tree to pick the right framework for your use case.
When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
- Generating text or structured data with @Generable types
- Tool-augmented generation via the Tool protocol

Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.
When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.
Best for:
When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
- Hugging Face models from mlx-community

When to use: Cross-platform LLM inference using the GGUF model format. Production deployments needing broad device support.
Best for:
| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (@Generable) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |
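As a rough sketch, the LLM rows of this table can be expressed as routing logic. Python is used for illustration only, and the predicates are hypothetical simplifications of the table above:

```python
def pick_llm_backend(os_major, apple_silicon, needs_specific_model, cross_platform):
    """Route an LLM workload to a backend, per the decision table (simplified)."""
    if cross_platform:
        return "llama.cpp"                      # GGUF runs beyond Apple platforms
    if needs_specific_model:
        # A named open-source LLM: MLX on Apple Silicon, llama.cpp otherwise
        return "MLX Swift" if apple_silicon else "llama.cpp"
    if os_major >= 26:
        return "Foundation Models"              # zero-setup system model
    return "llama.cpp"                          # pre-iOS 26 fallback
```

A real app would branch on runtime checks (availability, RAM) rather than static flags.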
On-device language model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).
Check contextSize for the context window limit and supportedLanguages for the supported locales.

Always check availability before using the model. Never crash on unavailability.
import FoundationModels
switch SystemLanguageModel.default.availability {
case .available:
// Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
// Guide user to enable Apple Intelligence in Settings
case .unavailable(.modelNotReady):
// Model is downloading; show loading state
case .unavailable(.deviceNotEligible):
// Device cannot run Apple Intelligence; use fallback
default:
// Graceful fallback for any other reason
}
// Basic session
let session = LanguageModelSession()
// Session with instructions
let session = LanguageModelSession {
"You are a helpful cooking assistant."
}
// Session with tools
let session = LanguageModelSession(
tools: [weatherTool, recipeTool]
) {
"You are a helpful assistant with access to tools."
}
Key rules:
- One request per session at a time (check session.isResponding)
- Call session.prewarm() before user interaction for a faster first response
- Restore a previous conversation with LanguageModelSession(model: model, tools: [], transcript: savedTranscript)

The @Generable macro creates compile-time schemas for type-safe output:
@Generable
struct Recipe {
@Guide(description: "The recipe name")
var name: String
@Guide(description: "Cooking steps", .count(3))
var steps: [String]
@Guide(description: "Prep time in minutes", .range(1...120))
var prepTime: Int
}
let response = try await session.respond(
to: "Suggest a quick pasta recipe",
generating: Recipe.self
)
print(response.content.name)
| Constraint | Purpose |
|---|---|
| description: | Natural language hint for generation |
| .anyOf([values]) | Restrict to enumerated string values |
| .count(n) | Fixed array length |
| .range(min...max) | Numeric range |
| .minimum(n) / .maximum(n) | One-sided numeric bound |
| .minimumCount(n) / .maximumCount(n) | Array length bounds |
| .constant(value) | Always returns this value |
| .pattern(regex) | String format enforcement |
| .element(guide) | Guide applied to each array element |
Properties generate in declaration order. Place foundational data before dependent data for better results.
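To make the constraint semantics concrete, here is a hypothetical validator in Python that mirrors what the schema enforces. Names and behavior are illustrative, not the framework's implementation:

```python
import re

def satisfies(value, any_of=None, count=None, value_range=None,
              minimum=None, maximum=None, pattern=None):
    """Check a generated value against @Guide-style constraints (illustrative)."""
    if any_of is not None and value not in any_of:
        return False                    # .anyOf: value must be one of the listed options
    if count is not None and len(value) != count:
        return False                    # .count: fixed array length
    if value_range is not None and not (value_range[0] <= value <= value_range[1]):
        return False                    # .range: closed numeric interval
    if minimum is not None and value < minimum:
        return False                    # .minimum: one-sided lower bound
    if maximum is not None and value > maximum:
        return False                    # .maximum: one-sided upper bound
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False                    # .pattern: full-string regex match
    return True
```

The framework applies such constraints during decoding, so invalid values are never produced rather than rejected after the fact.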
let stream = session.streamResponse(
to: "Suggest a recipe",
generating: Recipe.self
)
for try await snapshot in stream {
// snapshot.content is Recipe.PartiallyGenerated (all properties optional)
if let name = snapshot.content.name { updateNameLabel(name) }
}
struct WeatherTool: Tool {
let name = "weather"
let description = "Get current weather for a city."
@Generable
struct Arguments {
@Guide(description: "The city name")
var city: String
}
func call(arguments: Arguments) async throws -> String {
let weather = try await fetchWeather(arguments.city)
return weather.description
}
}
Register tools at session creation. The model invokes them autonomously.
do {
let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
switch error {
case .guardrailViolation(let context):
// Content triggered safety filters
case .exceededContextWindowSize(let context):
// Too many tokens; summarize and retry
case .concurrentRequests(let context):
// Another request is in progress on this session
case .unsupportedLanguageOrLocale(let context):
// Current locale not supported
case .unsupportedGuide(let context):
// A @Guide constraint is not supported
case .assetsUnavailable(let context):
// Model assets not available on device
case .refusal(let refusal, _):
// Model refused; stream refusal.explanation for details
case .rateLimited(let context):
// Too many requests; back off and retry
case .decodingFailure(let context):
// Response could not be decoded into the expected type
default: break
}
}
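The exceededContextWindowSize branch above is typically handled by summarizing the transcript and retrying. A language-agnostic sketch of that recovery loop, in Python with stand-in functions:

```python
class ContextOverflow(Exception):
    """Stand-in for GenerationError.exceededContextWindowSize."""

def respond(prompt, limit=100):
    # Stand-in for session.respond(to:); fails when the prompt exceeds the window.
    if len(prompt) > limit:
        raise ContextOverflow()
    return "response"

def summarize(prompt, limit=100):
    # Stand-in for condensing the transcript; a real app would re-prompt the model.
    return prompt[:limit]

def respond_with_recovery(prompt):
    try:
        return respond(prompt)
    except ContextOverflow:
        # Shrink the context, then retry once with the condensed prompt.
        return respond(summarize(prompt))
```

In Swift the same shape applies: catch the error, rebuild the session with a summarized transcript, and resend.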
let options = GenerationOptions(
sampling: .random(top: 40),
temperature: 0.7,
maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).
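Conceptually, .greedy always picks the most likely token while .random(top:seed:) samples among the k most likely. A small Python sketch of the difference, illustrative rather than the framework's actual sampler:

```python
import random

def next_token(probs, top=None, seed=None):
    """Greedy when top is None; otherwise sample among the top-k tokens."""
    if top is None:
        return max(probs, key=probs.get)       # greedy: deterministic argmax
    rng = random.Random(seed)                   # a seed makes sampling reproducible
    candidates = sorted(probs, key=probs.get, reverse=True)[:top]
    weights = [probs[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Greedy decoding suits extraction and structured output; top-k sampling with a moderate temperature suits open-ended generation.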
Use tokenCount(for:) to monitor the context window budget.

Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:
- .general -- Default for text generation, summarization, and dialog
- .contentTagging -- Optimized for categorization and labeling tasks

Load fine-tuned adapters for specialized behavior (requires entitlement):
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
See references/foundation-models.md for the complete Foundation Models API reference.
Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).
| Format | Extension | When to Use |
|---|---|---|
| .mlpackage | Directory (mlprogram) | All new models (iOS 15+) |
| .mlmodel | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| .mlmodelc | Compiled | Pre-compiled for faster loading |
Always use mlprogram (.mlpackage) for new work.
import coremltools as ct
# PyTorch conversion (torch.jit.trace)
model.eval() # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
traced,
inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
minimum_deployment_target=ct.target.iOS18,
convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")
| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
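Back-of-envelope arithmetic behind the reduction factors above, for a hypothetical 100M-parameter model against an FP32 baseline (weights only; real packages add metadata):

```python
PARAMS = 100_000_000                 # hypothetical 100M-parameter model
fp32_bytes = PARAMS * 4              # 400 MB FP32 baseline (4 bytes/param)
int8_bytes = PARAMS * 1              # 1 byte/param: ~4x reduction
int4_bytes = PARAMS * 0.5            # half a byte/param: ~8x reduction
                                     # (per-block INT4 or 4-bit palettization)

print(fp32_bytes / int8_bytes)  # 4.0
print(fp32_bytes / int4_bytes)  # 8.0
```

Accuracy impact is model-dependent, so always validate a quantized model against a held-out set before shipping.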
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)
// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
Swift type for multidimensional array operations:
import CoreML
let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.
Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.
import MLX
import MLXLLM
let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)
try await model.perform { context in
let input = try await context.processor.prepare(
input: UserInput(prompt: "Hello")
)
let stream = try generate(
input: input,
parameters: GenerateParameters(temperature: 0.0),
context: context
)
for await part in stream {
print(part.chunk ?? "", terminator: "")
}
}
| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |
Limit the GPU cache to bound memory use: MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)

See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
func respond(to prompt: String) async throws -> String {
if SystemLanguageModel.default.isAvailable {
return try await foundationModelsRespond(prompt)
} else if canLoadMLXModel() {
return try await mlxRespond(prompt)
} else {
throw AIError.noBackendAvailable
}
}
Serialize all model access through a coordinator actor to prevent contention:
actor ModelCoordinator {
func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
try await work()
}
}
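An asyncio analogue of the coordinator actor above, in Python for illustration: a lock plays the actor's serialization role, so overlapping callers run one at a time:

```python
import asyncio

class ModelCoordinator:
    """Serializes model access, mirroring the Swift actor above."""
    def __init__(self):
        self._lock = asyncio.Lock()

    async def with_exclusive_access(self, work):
        async with self._lock:          # only one caller holds the model at a time
            return await work()

async def demo():
    coord = ModelCoordinator()
    order = []
    async def job(i):
        async def work():
            order.append(i)             # record entry order under the lock
            await asyncio.sleep(0)      # yield, proving the lock blocks the other job
            return i
        return await coord.with_exclusive_access(work)
    results = await asyncio.gather(job(1), job(2))
    return results, order
```

In Swift the actor gives you this for free; the point is that every backend (Foundation Models, MLX, Core ML) should be reached through one serialized entry point.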
Performance tips and common pitfalls:
- Call session.prewarm() for Foundation Models before user interaction.
- Pre-compile models to .mlmodelc for faster loading.
- Batch Vision framework requests into a single perform() call.
- Calling LanguageModelSession() without checking SystemLanguageModel.default.availability crashes on unsupported devices.
- Monitor context usage with tokenCount(for:) against the window limit (contextSize) and summarize when needed.
- A LanguageModelSession supports one request at a time; check session.isResponding or serialize access.
- Call model.eval() before Core ML tracing: PyTorch models must be in eval mode before torch.jit.trace, and training-mode artifacts corrupt output.
- Use mlprogram (.mlpackage) for new Core ML models; the legacy neuralnetwork format is deprecated.
- Unload models when scenePhase == .background.
- Types crossing concurrency boundaries must be Sendable-conformant or @MainActor-isolated.

Weekly Installs
404
Repository: dpearson2699/swift-ios-skills
GitHub Stars: 269
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: codex (401), kimi-cli (398), amp (398), cline (398), github-copilot (398), opencode (398)