axiom-vision by charleswiltgen/axiom
npx skills add https://github.com/charleswiltgen/axiom --skill axiom-vision
Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition, barcode detection, document scanning, and combining Vision APIs to solve complex problems.
Use when you need to:
"How do I isolate a subject from the background?" "I need to detect hand gestures like pinch" "How can I get a bounding box around an object without including the hand holding it?" "Should I use VisionKit or Vision framework for subject lifting?" "How do I segment multiple people separately?" "I need to detect body poses for a fitness app" "How do I preserve HDR when compositing subjects on new backgrounds?" "How do I recognize text in an image?" "I need to scan QR codes from camera" "How do I extract data from a receipt?" "Should I use DataScannerViewController or Vision directly?" "How do I scan documents and correct perspective?" "I need to extract table data from a document"
Signs you're making this harder than it needs to be:
Before implementing any Vision feature:
What do you need to do?
┌─ Isolate subject(s) from background?
│ ├─ Need system UI + out-of-process → VisionKit
│ │ └─ ImageAnalysisInteraction (iOS/iPadOS)
│ │ └─ ImageAnalysisOverlayView (macOS)
│ ├─ Need custom pipeline / HDR / large images → Vision
│ │ └─ VNGenerateForegroundInstanceMaskRequest
│ └─ Need to EXCLUDE hands from object → Combine APIs
│ └─ Subject mask + Hand pose + custom masking (see Pattern 1)
│
├─ Segment people?
│ ├─ All people in one mask → VNGeneratePersonSegmentationRequest
│ └─ Separate mask per person (up to 4) → VNGeneratePersonInstanceMaskRequest
│
├─ Detect hand pose/gestures?
│ ├─ Just hand location → VNDetectHumanRectanglesRequest
│ └─ 21 hand landmarks → VNDetectHumanHandPoseRequest
│ └─ Gesture recognition → Hand pose + distance checks
│
├─ Detect body pose?
│ ├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
│ ├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
│ └─ Action classification → Body pose + CreateML model
│
├─ Face detection?
│ ├─ Just bounding boxes → VNDetectFaceRectanglesRequest
│ └─ Detailed landmarks → VNDetectFaceLandmarksRequest
│
├─ Person detection (location only)?
│ └─ VNDetectHumanRectanglesRequest
│
├─ Recognize text in images?
│ ├─ Real-time from camera + need UI → DataScannerViewController (iOS 16+)
│ ├─ Processing captured image → VNRecognizeTextRequest
│ │ ├─ Need speed (real-time camera) → recognitionLevel = .fast
│ │ └─ Need accuracy (documents) → recognitionLevel = .accurate
│ └─ Need structured documents (iOS 26+) → RecognizeDocumentsRequest
│
├─ Detect barcodes/QR codes?
│ ├─ Real-time camera + need UI → DataScannerViewController (iOS 16+)
│ └─ Processing image → VNDetectBarcodesRequest
│
└─ Scan documents?
├─ Need built-in UI + perspective correction → VNDocumentCameraViewController
├─ Need structured data (tables, lists) → RecognizeDocumentsRequest (iOS 26+)
└─ Custom pipeline → VNDetectDocumentSegmentationRequest + perspective correction
NEVER run Vision on the main thread:
let processingQueue = DispatchQueue(label: "com.yourapp.vision", qos: .userInitiated)
processingQueue.async {
do {
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
// Process observations...
DispatchQueue.main.async {
// Update UI
}
} catch {
// Handle error
}
}
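If your codebase uses Swift concurrency rather than GCD, one equivalent sketch (the function name and error handling are placeholders; adapt them to your own model):
import Vision

func subjectMask(from image: CGImage) async throws -> VNInstanceMaskObservation? {
    // A detached task keeps the Vision work off the main actor.
    try await Task.detached(priority: .userInitiated) {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])
        return request.results?.first
    }.value
}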
Processing video frames? Use VNSequenceRequestHandler (maintains inter-frame state for temporal smoothing). For single images, use VNImageRequestHandler. Creating a new VNImageRequestHandler per frame discards temporal context and causes jittery results. See axiom-vision-ref for full comparison and code examples.
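A minimal sketch of the sequence-handler approach (hand pose is just an example request, and the orientation is an assumption you should match to your camera):
import Vision
import CoreVideo

final class HandPoseTracker {
    // Keep ONE handler alive across frames so Vision can use inter-frame state.
    private let sequenceHandler = VNSequenceRequestHandler()

    func process(_ pixelBuffer: CVPixelBuffer) {
        let request = VNDetectHumanHandPoseRequest()
        request.maximumHandCount = 1
        do {
            try sequenceHandler.perform([request], on: pixelBuffer, orientation: .up)
            if let hand = request.results?.first {
                // Use hand.recognizedPoints(.all) and smooth/track across frames...
            }
        } catch {
            // Handle per-frame failures without tearing down the capture pipeline
        }
    }
}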
| API | Minimum Version |
|---|---|
| Subject segmentation (instance masks) | iOS 17+ |
| VisionKit subject lifting | iOS 16+ |
| Hand pose | iOS 14+ |
| Body pose (2D) | iOS 14+ |
| Body pose (3D) | iOS 17+ |
| Person instance segmentation | iOS 17+ |
| VNRecognizeTextRequest (basic) | iOS 13+ |
| VNRecognizeTextRequest (accurate, multi-lang) | iOS 14+ |
| VNDetectBarcodesRequest | iOS 11+ |
| VNDetectBarcodesRequest (revision 2: Codabar, MicroQR) | iOS 15+ |
| VNDetectBarcodesRequest (revision 3: ML-based) | iOS 16+ |
| DataScannerViewController | iOS 16+ |
| VNDocumentCameraViewController | iOS 13+ |
| VNDetectDocumentSegmentationRequest | iOS 15+ |
| RecognizeDocumentsRequest | iOS 26+ |
User's original problem : Getting a bounding box around an object held in hand, without including the hand.
Root cause : VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.
Solution : Combine subject mask with hand pose to create exclusion mask.
// 1. Get subject instance mask
let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([subjectRequest])
guard let subjectObservation = subjectRequest.results?.first as? VNInstanceMaskObservation else {
fatalError("No subject detected")
}
// 2. Get hand pose landmarks
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 2
try handler.perform([handRequest])
guard let handObservation = handRequest.results?.first as? VNHumanHandPoseObservation else {
// No hand detected - use full subject mask
let mask = try subjectObservation.createScaledMask(
for: subjectObservation.allInstances,
croppedToInstancesContent: false
)
return mask
}
// 3. Create hand exclusion region from landmarks
let handPoints = try handObservation.recognizedPoints(.all)
let handBounds = calculateConvexHull(from: handPoints) // Your implementation
// 4. Subtract hand region from subject mask using CoreImage
let subjectMask = try subjectObservation.createScaledMask(
for: subjectObservation.allInstances,
croppedToInstancesContent: false
)
let subjectCIMask = CIImage(cvPixelBuffer: subjectMask)
let handMask = createMaskFromRegion(handBounds, size: sourceImage.size)
let finalMask = subtractMasks(handMask: handMask, from: subjectCIMask)
// 5. Calculate bounding box from final mask
let objectBounds = calculateBoundingBox(from: finalMask)
Helper: Convex Hull
func calculateConvexHull(from points: [VNRecognizedPointKey: VNRecognizedPoint]) -> CGRect {
// Get high-confidence points
let validPoints = points.values.filter { $0.confidence > 0.5 }
guard !validPoints.isEmpty else { return .zero }
// Simple bounding rect (for more accuracy, use actual convex hull algorithm)
let xs = validPoints.map { $0.location.x }
let ys = validPoints.map { $0.location.y }
let minX = xs.min()!
let maxX = xs.max()!
let minY = ys.min()!
let maxY = ys.max()!
return CGRect(
x: minX,
y: minY,
width: maxX - minX,
height: maxY - minY
)
}
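The pattern above leaves createMaskFromRegion and subtractMasks to you; here is one minimal CoreImage sketch (the simple rectangular hand mask and these function names are assumptions, not the author's implementation):
import CoreImage

// Build a white-on-black mask for the (normalized, lower-left-origin) hand rect.
// Vision and CoreImage share the lower-left origin, so no Y flip is needed here.
func createMaskFromRegion(_ region: CGRect, size: CGSize) -> CIImage {
    let pixelRect = CGRect(x: region.origin.x * size.width,
                           y: region.origin.y * size.height,
                           width: region.width * size.width,
                           height: region.height * size.height)
    let white = CIImage(color: .white).cropped(to: pixelRect)
    let black = CIImage(color: .black).cropped(to: CGRect(origin: .zero, size: size))
    return white.composited(over: black)
}

// Keep subject pixels only where the hand mask is black: subject × (1 − hand).
func subtractMasks(handMask: CIImage, from subjectMask: CIImage) -> CIImage {
    let invertedHand = handMask.applyingFilter("CIColorInvert")
    return subjectMask.applyingFilter("CIMultiplyCompositing",
                                      parameters: [kCIInputBackgroundImageKey: invertedHand])
}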
Cost : 2-5 hours initial implementation, 30 min ongoing maintenance
Use case : Add system-like subject lifting UI with minimal code.
// iOS
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)
// macOS
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
When to use:
- You want the system's subject-lifting UI, out-of-process analysis, and minimal code (see the decision tree above)
Cost : 15 min implementation, 5 min ongoing
Use case : Need subject images/bounds without UI interaction.
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(sourceImage, configuration: configuration)
// Get all subjects
for subject in analysis.subjects {
let subjectImage = subject.image
let subjectBounds = subject.bounds
// Process subject...
}
// Tap-based lookup
if let subject = try await analysis.subject(at: tapPoint) {
let compositeImage = try await analysis.image(for: [subject])
}
Cost : 30 min implementation, 10 min ongoing
Use case : HDR preservation, large images, custom compositing.
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
// Get soft segmentation mask
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false // Full resolution for compositing
)
// Use with CoreImage for HDR preservation
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(CIImage(cgImage: sourceImage), forKey: kCIInputImageKey)
filter.setValue(CIImage(cvPixelBuffer: mask), forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
let compositedImage = filter.outputImage
Cost : 1 hour implementation, 15 min ongoing
Use case : User taps to select which subject/person to lift.
// Get instance at tap point
let instance = observation.instanceAtPoint(tapPoint)
if instance == 0 {
// Background tapped - select all instances
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
} else {
// Specific instance tapped
let mask = try observation.createScaledMask(
for: IndexSet(integer: instance),
croppedToInstancesContent: true
)
}
Alternative: Raw pixel buffer access
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
    tapPoint,
    imageWidth,
    imageHeight
)
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
let label = UnsafeRawPointer(baseAddress!).load(
fromByteOffset: offset,
as: UInt8.self
)
Cost : 45 min implementation, 10 min ongoing
Use case : Detect pinch gesture for custom camera trigger or UI control.
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1
try handler.perform([request])
guard let observation = request.results?.first as? VNHumanHandPoseObservation else {
return
}
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
// Check confidence
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
return
}
// Calculate distance (normalized coordinates)
let dx = thumbTip.location.x - indexTip.location.x
let dy = thumbTip.location.y - indexTip.location.y
let distance = sqrt(dx * dx + dy * dy)
let isPinching = distance < 0.05 // Adjust threshold
// State machine for evidence accumulation
if isPinching {
pinchFrameCount += 1
if pinchFrameCount >= 3 {
state = .pinched
}
} else {
pinchFrameCount = max(0, pinchFrameCount - 1)
if pinchFrameCount == 0 {
state = .apart
}
}
Cost : 2 hours implementation, 20 min ongoing
Use case : Apply different effects to each person or count people.
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
let peopleCount = observation.allInstances.count // Up to 4
for personIndex in observation.allInstances {
let personMask = try observation.createScaledMask(
for: IndexSet(integer: personIndex),
croppedToInstancesContent: false
)
// Apply effect to this person only
applyEffect(to: personMask, personIndex: personIndex)
}
Crowded scenes (>4 people):
// Count faces to detect crowding
let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])
let faceCount = faceRequest.results?.count ?? 0
if faceCount > 4 {
// Fallback: Use single mask for all people
let singleMaskRequest = VNGeneratePersonSegmentationRequest()
try handler.perform([singleMaskRequest])
}
Cost : 1.5 hours implementation, 15 min ongoing
Use case : Fitness app that recognizes exercises (jumping jacks, squats, etc.)
// 1. Collect body pose observations
var poseObservations: [VNHumanBodyPoseObservation] = []
let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])
if let observation = request.results?.first as? VNHumanBodyPoseObservation {
poseObservations.append(observation)
}
// 2. When you have 60 frames of poses, prepare for CreateML model
if poseObservations.count == 60 {
var multiArray = try MLMultiArray(
shape: [60, 18, 3], // 60 frames, 18 joints, (x, y, confidence)
dataType: .double
)
for (frameIndex, observation) in poseObservations.enumerated() {
let allPoints = try observation.recognizedPoints(.all)
for (jointIndex, (_, point)) in allPoints.enumerated() {
multiArray[[frameIndex, jointIndex, 0] as [NSNumber]] = NSNumber(value: point.location.x)
multiArray[[frameIndex, jointIndex, 1] as [NSNumber]] = NSNumber(value: point.location.y)
multiArray[[frameIndex, jointIndex, 2] as [NSNumber]] = NSNumber(value: point.confidence)
}
}
// 3. Run inference with CreateML model
let input = YourActionClassifierInput(poses: multiArray)
let output = try actionClassifier.prediction(input: input)
let action = output.label // "jumping_jacks", "squats", etc.
}
Cost : 3-4 hours implementation, 1 hour ongoing
Use case : Extract text from images, receipts, signs, documents.
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast for real-time
request.recognitionLanguages = ["en-US"] // Specify known languages
request.usesLanguageCorrection = true // Helps accuracy
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observations = request.results as? [VNRecognizedTextObservation] else {
return
}
for observation in observations {
// Get top candidate (most likely)
guard let candidate = observation.topCandidates(1).first else { continue }
let text = candidate.string
let confidence = candidate.confidence
// Get bounding box for specific substring
if let range = text.range(of: searchTerm) {
if let boundingBox = try? candidate.boundingBox(for: range) {
// Use for highlighting
}
}
}
Fast vs Accurate:
- .fast: lower latency; use for real-time camera feeds
- .accurate: better recognition quality; use for documents and captured photos
Language tips:
- Set automaticallyDetectsLanguage = true only when the language is unknown
- Check supportedRecognitionLanguages for the current revision (sketch below)
Cost: 30 min basic implementation, 2 hours with language handling
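A quick sketch of the two language settings above (automaticallyDetectsLanguage is iOS 16+, supportedRecognitionLanguages() is iOS 15+):
let request = VNRecognizeTextRequest()
// Only enable auto-detection when you genuinely don't know the language in advance.
request.automaticallyDetectsLanguage = true
// Languages the request's current revision can recognize
let languages = try request.supportedRecognitionLanguages()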
Use case : Scan product barcodes, QR codes, healthcare codes.
let request = VNDetectBarcodesRequest()
request.revision = VNDetectBarcodesRequestRevision3 // ML-based, iOS 16+
request.symbologies = [.qr, .ean13] // Specify only what you need!
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observations = request.results as? [VNBarcodeObservation] else {
return
}
for barcode in observations {
let payload = barcode.payloadStringValue // Decoded content
let symbology = barcode.symbology // Type of barcode
let bounds = barcode.boundingBox // Location (normalized)
print("Found \(symbology): \(payload ?? "no string")")
}
Performance tip: Specifying fewer symbologies = faster scanning.
Revision differences:
- Revision 2 (iOS 15+): adds Codabar and MicroQR symbologies
- Revision 3 (iOS 16+): ML-based detector (see the runtime check below)
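To check at runtime what a given revision can decode (supportedSymbologies() is iOS 15+):
let barcodeRequest = VNDetectBarcodesRequest()
barcodeRequest.revision = VNDetectBarcodesRequestRevision3
let available = try barcodeRequest.supportedSymbologies() // includes .codabar and .microQR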
Cost : 15 min implementation
Use case : Camera-based text/barcode scanning with built-in UI (iOS 16+).
import VisionKit
// Check support
guard DataScannerViewController.isSupported,
DataScannerViewController.isAvailable else {
// Not supported or camera access denied
return
}
// Configure what to scan
let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType> = [
.barcode(symbologies: [.qr]),
.text(textContentType: .URL) // Or nil for all text
]
// Create and present
let scanner = DataScannerViewController(
recognizedDataTypes: recognizedDataTypes,
qualityLevel: .balanced, // Or .fast, .accurate
recognizesMultipleItems: false, // Center-most if false
isHighFrameRateTrackingEnabled: true, // For smooth highlights
isPinchToZoomEnabled: true,
isGuidanceEnabled: true,
isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
try? scanner.startScanning()
}
Delegate methods :
func dataScanner(_ scanner: DataScannerViewController,
didTapOn item: RecognizedItem) {
switch item {
case .text(let text):
print("Tapped text: \(text.transcript)")
case .barcode(let barcode):
print("Tapped barcode: \(barcode.payloadStringValue ?? "")")
@unknown default: break
}
}
// For custom highlights
func dataScanner(_ scanner: DataScannerViewController,
didAdd addedItems: [RecognizedItem],
allItems: [RecognizedItem]) {
for item in addedItems {
let highlight = createHighlight(for: item)
scanner.overlayContainerView.addSubview(highlight)
}
}
Async stream alternative :
for await items in scanner.recognizedItems {
// Process current items
}
Cost : 45 min implementation with custom highlights
Use case : Scan paper documents with automatic edge detection and perspective correction.
import VisionKit
let documentCamera = VNDocumentCameraViewController()
documentCamera.delegate = self
present(documentCamera, animated: true)
// In delegate
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan) {
controller.dismiss(animated: true)
// Process each page
for pageIndex in 0..<scan.pageCount {
let image = scan.imageOfPage(at: pageIndex)
// Now run text recognition on the corrected image
let handler = VNImageRequestHandler(cgImage: image.cgImage!)
let textRequest = VNRecognizeTextRequest()
try? handler.perform([textRequest])
}
}
Cost : 30 min implementation
Use case : Detect document edges programmatically for custom camera UI.
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
try handler.perform([request])
guard let observation = request.results?.first,
let document = observation as? VNRectangleObservation else {
return
}
// Get corner points (normalized coordinates)
let topLeft = document.topLeft
let topRight = document.topRight
let bottomLeft = document.bottomLeft
let bottomRight = document.bottomRight
// Apply perspective correction with CoreImage
let correctedImage = inputImage
.cropped(to: document.boundingBox.scaled(to: imageSize))
.applyingFilter("CIPerspectiveCorrection", parameters: [
"inputTopLeft": CIVector(cgPoint: topLeft.scaled(to: imageSize)),
"inputTopRight": CIVector(cgPoint: topRight.scaled(to: imageSize)),
"inputBottomLeft": CIVector(cgPoint: bottomLeft.scaled(to: imageSize)),
"inputBottomRight": CIVector(cgPoint: bottomRight.scaled(to: imageSize))
])
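The snippet above assumes a small scaled(to:) helper for converting normalized corners to pixel coordinates (not part of CoreGraphics); a minimal sketch:
import CoreGraphics

extension CGPoint {
    // Scale a normalized point into pixel coordinates.
    func scaled(to size: CGSize) -> CGPoint {
        CGPoint(x: x * size.width, y: y * size.height)
    }
}

extension CGRect {
    // Scale a normalized rect into pixel coordinates.
    func scaled(to size: CGSize) -> CGRect {
        CGRect(x: origin.x * size.width,
               y: origin.y * size.height,
               width: width * size.width,
               height: height * size.height)
    }
}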
VNDetectDocumentSegmentationRequest vs VNDetectRectanglesRequest:
- VNDetectDocumentSegmentationRequest: ML-based and trained on documents; returns the document's four corners as above
- VNDetectRectanglesRequest: the older geometry-based detector for arbitrary rectangles; use it when the target isn't a document
Cost : 1-2 hours implementation
Use case : Extract tables, lists, paragraphs with semantic understanding.
// iOS 26+
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)
guard let document = observations.first?.document else {
return
}
// Extract tables
for table in document.tables {
for row in table.rows {
for cell in row {
let text = cell.content.text.transcript
print("Cell: \(text)")
}
}
}
// Get detected data (emails, phones, URLs, dates)
let allDetectedData = document.text.detectedData
for data in allDetectedData {
switch data.match.details {
case .emailAddress(let email):
print("Email: \(email.emailAddress)")
case .phoneNumber(let phone):
print("Phone: \(phone.phoneNumber)")
case .link(let url):
print("URL: \(url)")
default: break
}
}
Document hierarchy: each observation wraps a document; the document exposes text, tables, and lists; a table contains rows of cells, each with content.text; any text container exposes a transcript plus detectedData (emails, phones, URLs, dates).
Cost : 1 hour implementation
Use case : Scan phone numbers from camera like barcode scanner (from WWDC 2019).
// 1. Use region of interest to guide user
let textRequest = VNRecognizeTextRequest { request, error in
guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
for observation in observations {
guard let candidate = observation.topCandidates(1).first else { continue }
// Use domain knowledge to filter
if let phoneNumber = self.extractPhoneNumber(from: candidate.string) {
self.stringTracker.add(phoneNumber)
}
}
// Build evidence over frames
if let stableNumber = self.stringTracker.getStableString(threshold: 10) {
self.foundPhoneNumber(stableNumber)
}
}
textRequest.recognitionLevel = .fast // Real-time
textRequest.usesLanguageCorrection = false // Codes, not natural text
textRequest.regionOfInterest = guidanceBox // Crop to user's focus area
// 2. String tracker for stability
class StringTracker {
private var seenStrings: [String: Int] = [:]
func add(_ string: String) {
seenStrings[string, default: 0] += 1
}
func getStableString(threshold: Int) -> String? {
seenStrings.first { $0.value >= threshold }?.key
}
}
Key techniques from WWDC 2019:
- regionOfInterest to limit recognition to the user's guidance box
- usesLanguageCorrection = false for codes rather than natural text
- .fast recognition level for real-time processing
- A string tracker that accumulates evidence across frames before accepting a result
Cost: 2 hours implementation
Wrong :
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request]) // Blocks UI!
Right :
DispatchQueue.global(qos: .userInitiated).async {
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
DispatchQueue.main.async {
// Update UI
}
}
Why it matters : Vision is resource-intensive. Blocking main thread freezes UI.
Wrong :
let thumbTip = try observation.recognizedPoint(.thumbTip)
let location = thumbTip.location // May be unreliable!
Right :
let thumbTip = try observation.recognizedPoint(.thumbTip)
guard thumbTip.confidence > 0.5 else {
// Low confidence - landmark unreliable
return
}
let location = thumbTip.location
Why it matters : Low confidence points are inaccurate (occlusion, blur, edge of frame).
Wrong (mixing coordinate systems):
// Vision uses lower-left origin
let visionPoint = recognizedPoint.location // (0, 0) = bottom-left
// UIKit uses top-left origin
let uiPoint = CGPoint(x: visionPoint.x, y: visionPoint.y) // WRONG!
Right :
let visionPoint = recognizedPoint.location
// Convert to UIKit coordinates
let uiPoint = CGPoint(
x: visionPoint.x * imageWidth,
y: (1 - visionPoint.y) * imageHeight // Flip Y axis
)
Why it matters : Mismatched origins cause UI overlays to appear in wrong positions.
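For whole bounding boxes (e.g. VNRecognizedTextObservation.boundingBox), a small hedged helper built on Vision's conversion function:
import Vision

func uiKitRect(for normalizedBox: CGRect, imageSize: CGSize) -> CGRect {
    // Scale the normalized (lower-left-origin) box to pixels...
    let rect = VNImageRectForNormalizedRect(normalizedBox,
                                            Int(imageSize.width),
                                            Int(imageSize.height))
    // ...then flip into UIKit's top-left origin.
    return CGRect(x: rect.minX,
                  y: imageSize.height - rect.maxY,
                  width: rect.width,
                  height: rect.height)
}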
Wrong :
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 10 // "Just in case"
Right :
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Only compute what you need
Why it matters : Performance scales with maximumHandCount. Pose computed for all detected hands ≤ max.
Wrong (if you don't need AR):
// Requires AR session just for body pose
let arSession = ARBodyTrackingConfiguration()
Right :
// Vision works offline on still images
let request = VNDetectHumanBodyPoseRequest()
Why it matters : ARKit body pose requires rear camera, AR session, supported devices. Vision works everywhere (even offline).
Context : Product manager wants subject lifting "like in Photos app" by Friday. You're considering skipping background processing.
Pressure : "It's working on my iPhone 15 Pro, let's ship it."
Reality : Vision blocks UI on older devices. Users on iPhone 12 will experience frozen app.
Correct action :
Push-back template : "Subject lifting works, but it freezes the UI on older devices. I need 30 minutes to add background processing and prevent 1-star reviews."
Context : Designer wants to exclude hands from subject bounding box. Engineer suggests training custom CoreML model for specific object detection.
Pressure : "We need perfect bounds, let's train a model."
Reality : Training requires labeled dataset (weeks), ongoing maintenance, and still won't generalize to new objects. Built-in Vision APIs + hand pose solve it in 2-5 hours.
Correct action :
Push-back template : "Training a model takes weeks and only works for specific objects. I can combine Vision APIs to solve this in a few hours and it'll work for any object."
Context : You need instance masks but app supports iOS 15+.
Pressure : "Just use iOS 15 person segmentation and ship it."
Reality : VNGeneratePersonSegmentationRequest (iOS 15) returns single mask for all people. Doesn't solve multi-person use case.
Correct action :
- Use @available to conditionally enable features
Push-back template: "Person segmentation on iOS 15 combines all people into one mask. We can either require iOS 17 for the best experience, or disable multi-person features on older OS versions. Which do you prefer?"
Before shipping Vision features:
Performance:
- Vision requests run off the main thread
- maximumHandCount set to minimum needed value
Accuracy:
- Landmark confidence checked before using locations (e.g. confidence > 0.5)
Coordinates:
- Vision's lower-left origin converted to UIKit's top-left before drawing overlays
Platform Support:
- @available checks for iOS 17+ APIs (instance masks)
Edge Cases:
- Fallbacks in place when no subject, hand, or text is detected
CoreImage Integration (if applicable):
- croppedToInstancesContent set appropriately (false for compositing)
Text/Barcode Recognition (if applicable):
- symbologies and recognitionLanguages limited to what you actually need
WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2020-10653
Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
Skills: axiom-vision-ref, axiom-vision-diag