apple-on-device-ai by dpearson2699/swift-ios-skills
npx skills add https://github.com/dpearson2699/swift-ios-skills --skill apple-on-device-ai
Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.
Use this decision tree to pick the right framework for your use case.
When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
- Generating text or structured data with @Generable types
- Tool-augmented generation via the Tool protocol

Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.
When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.
Best for:
When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
- Hugging Face models from mlx-community

When to use: Cross-platform LLM inference using the GGUF model format. Production deployments needing broad device support.
Best for:
| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (@Generable) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |
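As a rough sketch, the LLM rows of this table can be expressed as routing logic. Python is used for illustration only, and the predicates are hypothetical simplifications of the table above:

```python
def pick_llm_backend(os_major, apple_silicon, needs_specific_model, cross_platform):
    """Route an LLM workload to a backend, per the decision table (simplified)."""
    if cross_platform:
        return "llama.cpp"                      # GGUF runs beyond Apple platforms
    if needs_specific_model:
        # A named open-source LLM: MLX on Apple Silicon, llama.cpp otherwise
        return "MLX Swift" if apple_silicon else "llama.cpp"
    if os_major >= 26:
        return "Foundation Models"              # zero-setup system model
    return "llama.cpp"                          # pre-iOS 26 fallback
```

A real app would branch on runtime checks (availability, RAM) rather than static flags.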
On-device language model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).
Check contextSize for the context window limit and supportedLanguages for the supported locales.

Always check availability before using the model. Never crash on unavailability.
import FoundationModels
switch SystemLanguageModel.default.availability {
case .available:
// Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
// Guide user to enable Apple Intelligence in Settings
case .unavailable(.modelNotReady):
// Model is downloading; show loading state
case .unavailable(.deviceNotEligible):
// Device cannot run Apple Intelligence; use fallback
default:
// Graceful fallback for any other reason
}
// Basic session
let session = LanguageModelSession()
// Session with instructions
let session = LanguageModelSession {
"You are a helpful cooking assistant."
}
// Session with tools
let session = LanguageModelSession(
tools: [weatherTool, recipeTool]
) {
"You are a helpful assistant with access to tools."
}
Key rules:
- One request per session at a time (check session.isResponding)
- Call session.prewarm() before user interaction for a faster first response
- Restore a previous conversation with LanguageModelSession(model: model, tools: [], transcript: savedTranscript)

The @Generable macro creates compile-time schemas for type-safe output:
@Generable
struct Recipe {
@Guide(description: "The recipe name")
var name: String
@Guide(description: "Cooking steps", .count(3))
var steps: [String]
@Guide(description: "Prep time in minutes", .range(1...120))
var prepTime: Int
}
let response = try await session.respond(
to: "Suggest a quick pasta recipe",
generating: Recipe.self
)
print(response.content.name)
| Constraint | Purpose |
|---|---|
| description: | Natural language hint for generation |
| .anyOf([values]) | Restrict to enumerated string values |
| .count(n) | Fixed array length |
| .range(min...max) | Numeric range |
| .minimum(n) / .maximum(n) | One-sided numeric bound |
| .minimumCount(n) / .maximumCount(n) | Array length bounds |
| .constant(value) | Always returns this value |
| .pattern(regex) | String format enforcement |
| .element(guide) | Guide applied to each array element |
Properties generate in declaration order. Place foundational data before dependent data for better results.
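To make the constraint semantics concrete, here is a hypothetical validator in Python that mirrors what the schema enforces. Names and behavior are illustrative, not the framework's implementation:

```python
import re

def satisfies(value, any_of=None, count=None, value_range=None,
              minimum=None, maximum=None, pattern=None):
    """Check a generated value against @Guide-style constraints (illustrative)."""
    if any_of is not None and value not in any_of:
        return False                    # .anyOf: value must be one of the listed options
    if count is not None and len(value) != count:
        return False                    # .count: fixed array length
    if value_range is not None and not (value_range[0] <= value <= value_range[1]):
        return False                    # .range: closed numeric interval
    if minimum is not None and value < minimum:
        return False                    # .minimum: one-sided lower bound
    if maximum is not None and value > maximum:
        return False                    # .maximum: one-sided upper bound
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False                    # .pattern: full-string regex match
    return True
```

The framework applies such constraints during decoding, so invalid values are never produced rather than rejected after the fact.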
let stream = session.streamResponse(
to: "Suggest a recipe",
generating: Recipe.self
)
for try await snapshot in stream {
// snapshot.content is Recipe.PartiallyGenerated (all properties optional)
if let name = snapshot.content.name { updateNameLabel(name) }
}
struct WeatherTool: Tool {
let name = "weather"
let description = "Get current weather for a city."
@Generable
struct Arguments {
@Guide(description: "The city name")
var city: String
}
func call(arguments: Arguments) async throws -> String {
let weather = try await fetchWeather(arguments.city)
return weather.description
}
}
Register tools at session creation. The model invokes them autonomously.
do {
let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
switch error {
case .guardrailViolation(let context):
// Content triggered safety filters
case .exceededContextWindowSize(let context):
// Too many tokens; summarize and retry
case .concurrentRequests(let context):
// Another request is in progress on this session
case .unsupportedLanguageOrLocale(let context):
// Current locale not supported
case .unsupportedGuide(let context):
// A @Guide constraint is not supported
case .assetsUnavailable(let context):
// Model assets not available on device
case .refusal(let refusal, _):
// Model refused; stream refusal.explanation for details
case .rateLimited(let context):
// Too many requests; back off and retry
case .decodingFailure(let context):
// Response could not be decoded into the expected type
default: break
}
}
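The exceededContextWindowSize branch above is typically handled by summarizing the transcript and retrying. A language-agnostic sketch of that recovery loop, in Python with stand-in functions:

```python
class ContextOverflow(Exception):
    """Stand-in for GenerationError.exceededContextWindowSize."""

def respond(prompt, limit=100):
    # Stand-in for session.respond(to:); fails when the prompt exceeds the window.
    if len(prompt) > limit:
        raise ContextOverflow()
    return "response"

def summarize(prompt, limit=100):
    # Stand-in for condensing the transcript; a real app would re-prompt the model.
    return prompt[:limit]

def respond_with_recovery(prompt):
    try:
        return respond(prompt)
    except ContextOverflow:
        # Shrink the context, then retry once with the condensed prompt.
        return respond(summarize(prompt))
```

In Swift the same shape applies: catch the error, rebuild the session with a summarized transcript, and resend.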
let options = GenerationOptions(
sampling: .random(top: 40),
temperature: 0.7,
maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).
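Conceptually, .greedy always picks the most likely token while .random(top:seed:) samples among the k most likely. A small Python sketch of the difference, illustrative rather than the framework's actual sampler:

```python
import random

def next_token(probs, top=None, seed=None):
    """Greedy when top is None; otherwise sample among the top-k tokens."""
    if top is None:
        return max(probs, key=probs.get)       # greedy: deterministic argmax
    rng = random.Random(seed)                   # a seed makes sampling reproducible
    candidates = sorted(probs, key=probs.get, reverse=True)[:top]
    weights = [probs[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Greedy decoding suits extraction and structured output; top-k sampling with a moderate temperature suits open-ended generation.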
Use tokenCount(for:) to monitor the context window budget.

Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:
- .general -- Default for text generation, summarization, and dialog
- .contentTagging -- Optimized for categorization and labeling tasks

Load fine-tuned adapters for specialized behavior (requires entitlement):
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
See references/foundation-models.md for the complete Foundation Models API reference.
Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).
| Format | Extension | When to Use |
|---|---|---|
| .mlpackage | Directory (mlprogram) | All new models (iOS 15+) |
| .mlmodel | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| .mlmodelc | Compiled | Pre-compiled for faster loading |
Always use mlprogram (.mlpackage) for new work.
import coremltools as ct
# PyTorch conversion (torch.jit.trace)
model.eval() # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
traced,
inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
minimum_deployment_target=ct.target.iOS18,
convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")
| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
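Back-of-envelope arithmetic behind the reduction factors above, for a hypothetical 100M-parameter model against an FP32 baseline (weights only; real packages add metadata):

```python
PARAMS = 100_000_000                 # hypothetical 100M-parameter model
fp32_bytes = PARAMS * 4              # 400 MB FP32 baseline (4 bytes/param)
int8_bytes = PARAMS * 1              # 1 byte/param: ~4x reduction
int4_bytes = PARAMS * 0.5            # half a byte/param: ~8x reduction
                                     # (per-block INT4 or 4-bit palettization)

print(fp32_bytes / int8_bytes)  # 4.0
print(fp32_bytes / int4_bytes)  # 8.0
```

Accuracy impact is model-dependent, so always validate a quantized model against a held-out set before shipping.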
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)
// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
Swift type for multidimensional array operations:
import CoreML
let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.
Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.
import MLX
import MLXLLM
let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)
try await model.perform { context in
let input = try await context.processor.prepare(
input: UserInput(prompt: "Hello")
)
let stream = try generate(
input: input,
parameters: GenerateParameters(temperature: 0.0),
context: context
)
for await part in stream {
print(part.chunk ?? "", terminator: "")
}
}
| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |
Limit the GPU cache to bound memory use: MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)

See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
func respond(to prompt: String) async throws -> String {
if SystemLanguageModel.default.isAvailable {
return try await foundationModelsRespond(prompt)
} else if canLoadMLXModel() {
return try await mlxRespond(prompt)
} else {
throw AIError.noBackendAvailable
}
}
Serialize all model access through a coordinator actor to prevent contention:
actor ModelCoordinator {
func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
try await work()
}
}
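An asyncio analogue of the coordinator actor above, in Python for illustration: a lock plays the actor's serialization role, so overlapping callers run one at a time:

```python
import asyncio

class ModelCoordinator:
    """Serializes model access, mirroring the Swift actor above."""
    def __init__(self):
        self._lock = asyncio.Lock()

    async def with_exclusive_access(self, work):
        async with self._lock:          # only one caller holds the model at a time
            return await work()

async def demo():
    coord = ModelCoordinator()
    order = []
    async def job(i):
        async def work():
            order.append(i)             # record entry order under the lock
            await asyncio.sleep(0)      # yield, proving the lock blocks the other job
            return i
        return await coord.with_exclusive_access(work)
    results = await asyncio.gather(job(1), job(2))
    return results, order
```

In Swift the actor gives you this for free; the point is that every backend (Foundation Models, MLX, Core ML) should be reached through one serialized entry point.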
Performance tips and common pitfalls:
- Call session.prewarm() for Foundation Models before user interaction.
- Pre-compile models to .mlmodelc for faster loading.
- Batch Vision framework requests into a single perform() call.
- Calling LanguageModelSession() without checking SystemLanguageModel.default.availability crashes on unsupported devices.
- Monitor context usage with tokenCount(for:) against the window limit (contextSize) and summarize when needed.
- A LanguageModelSession supports one request at a time; check session.isResponding or serialize access.
- Call model.eval() before Core ML tracing: PyTorch models must be in eval mode before torch.jit.trace, and training-mode artifacts corrupt output.
- Use mlprogram (.mlpackage) for new Core ML models; the legacy neuralnetwork format is deprecated.
- Unload models when scenePhase == .background.
- Types crossing concurrency boundaries must be Sendable-conformant or @MainActor-isolated.

Weekly Installs
404
Repository: dpearson2699/swift-ios-skills
GitHub Stars: 269
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: codex (401), kimi-cli (398), amp (398), cline (398), github-copilot (398), opencode (398)