axiom-vision by charleswiltgen/axiom
npx skills add https://github.com/charleswiltgen/axiom --skill axiom-vision
Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition, barcode detection, document scanning, and combining Vision APIs to solve complex problems.
Use when you need to:
"How do I isolate a subject from the background?" "I need to detect hand gestures like pinch" "How can I get a bounding box around an object without including the hand holding it?" "Should I use VisionKit or Vision framework for subject lifting?" "How do I segment multiple people separately?" "I need to detect body poses for a fitness app" "How do I preserve HDR when compositing subjects on new backgrounds?" "How do I recognize text in an image?" "I need to scan QR codes from camera" "How do I extract data from a receipt?" "Should I use DataScannerViewController or Vision directly?" "How do I scan documents and correct perspective?" "I need to extract table data from a document"
Signs you're making this harder than it needs to be:
Before implementing any Vision feature:
What do you need to do?
┌─ Isolate subject(s) from background?
│ ├─ Need system UI + out-of-process → VisionKit
│ │ └─ ImageAnalysisInteraction (iOS/iPadOS)
│ │ └─ ImageAnalysisOverlayView (macOS)
│ ├─ Need custom pipeline / HDR / large images → Vision
│ │ └─ VNGenerateForegroundInstanceMaskRequest
│ └─ Need to EXCLUDE hands from object → Combine APIs
│ └─ Subject mask + Hand pose + custom masking (see Pattern 1)
│
├─ Segment people?
│ ├─ All people in one mask → VNGeneratePersonSegmentationRequest
│ └─ Separate mask per person (up to 4) → VNGeneratePersonInstanceMaskRequest
│
├─ Detect hand pose/gestures?
│ ├─ Just hand location → VNDetectHumanRectanglesRequest
│ └─ 21 hand landmarks → VNDetectHumanHandPoseRequest
│ └─ Gesture recognition → Hand pose + distance checks
│
├─ Detect body pose?
│ ├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
│ ├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
│ └─ Action classification → Body pose + CreateML model
│
├─ Face detection?
│ ├─ Just bounding boxes → VNDetectFaceRectanglesRequest
│ └─ Detailed landmarks → VNDetectFaceLandmarksRequest
│
├─ Person detection (location only)?
│ └─ VNDetectHumanRectanglesRequest
│
├─ Recognize text in images?
│ ├─ Real-time from camera + need UI → DataScannerViewController (iOS 16+)
│ ├─ Processing captured image → VNRecognizeTextRequest
│ │ ├─ Need speed (real-time camera) → recognitionLevel = .fast
│ │ └─ Need accuracy (documents) → recognitionLevel = .accurate
│ └─ Need structured documents (iOS 26+) → RecognizeDocumentsRequest
│
├─ Detect barcodes/QR codes?
│ ├─ Real-time camera + need UI → DataScannerViewController (iOS 16+)
│ └─ Processing image → VNDetectBarcodesRequest
│
└─ Scan documents?
├─ Need built-in UI + perspective correction → VNDocumentCameraViewController
├─ Need structured data (tables, lists) → RecognizeDocumentsRequest (iOS 26+)
└─ Custom pipeline → VNDetectDocumentSegmentationRequest + perspective correction
NEVER run Vision on the main thread:
let processingQueue = DispatchQueue(label: "com.yourapp.vision", qos: .userInitiated)
processingQueue.async {
do {
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
// Process observations...
DispatchQueue.main.async {
// Update UI
}
} catch {
// Handle error
}
}
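If your codebase uses Swift concurrency rather than GCD, one equivalent sketch (the function name and error handling are placeholders; adapt them to your own model):
import Vision

func subjectMask(from image: CGImage) async throws -> VNInstanceMaskObservation? {
    // A detached task keeps the Vision work off the main actor.
    try await Task.detached(priority: .userInitiated) {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])
        return request.results?.first
    }.value
}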
Processing video frames? Use VNSequenceRequestHandler (maintains inter-frame state for temporal smoothing). For single images, use VNImageRequestHandler. Creating a new VNImageRequestHandler per frame discards temporal context and causes jittery results. See axiom-vision-ref for full comparison and code examples.
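A minimal sketch of the sequence-handler approach (hand pose is just an example request, and the orientation is an assumption you should match to your camera):
import Vision
import CoreVideo

final class HandPoseTracker {
    // Keep ONE handler alive across frames so Vision can use inter-frame state.
    private let sequenceHandler = VNSequenceRequestHandler()

    func process(_ pixelBuffer: CVPixelBuffer) {
        let request = VNDetectHumanHandPoseRequest()
        request.maximumHandCount = 1
        do {
            try sequenceHandler.perform([request], on: pixelBuffer, orientation: .up)
            if let hand = request.results?.first {
                // Use hand.recognizedPoints(.all) and smooth/track across frames...
            }
        } catch {
            // Handle per-frame failures without tearing down the capture pipeline
        }
    }
}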
| API | Minimum Version |
|---|---|
| Subject segmentation (instance masks) | iOS 17+ |
| VisionKit subject lifting | iOS 16+ |
| Hand pose | iOS 14+ |
| Body pose (2D) | iOS 14+ |
| Body pose (3D) | iOS 17+ |
| Person instance segmentation | iOS 17+ |
| VNRecognizeTextRequest (basic) | iOS 13+ |
| VNRecognizeTextRequest (accurate, multi-lang) | iOS 14+ |
| VNDetectBarcodesRequest | iOS 11+ |
| VNDetectBarcodesRequest (revision 2: Codabar, MicroQR) | iOS 15+ |
| VNDetectBarcodesRequest (revision 3: ML-based) | iOS 16+ |
| DataScannerViewController | iOS 16+ |
| VNDocumentCameraViewController | iOS 13+ |
| VNDetectDocumentSegmentationRequest | iOS 15+ |
| RecognizeDocumentsRequest | iOS 26+ |
User's original problem : Getting a bounding box around an object held in hand, without including the hand.
Root cause : VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.
Solution : Combine subject mask with hand pose to create exclusion mask.
// 1. Get subject instance mask
let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([subjectRequest])
guard let subjectObservation = subjectRequest.results?.first as? VNInstanceMaskObservation else {
fatalError("No subject detected")
}
// 2. Get hand pose landmarks
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 2
try handler.perform([handRequest])
guard let handObservation = handRequest.results?.first as? VNHumanHandPoseObservation else {
// No hand detected - use full subject mask
let mask = try subjectObservation.createScaledMask(
for: subjectObservation.allInstances,
croppedToInstancesContent: false
)
return mask
}
// 3. Create hand exclusion region from landmarks
let handPoints = try handObservation.recognizedPoints(.all)
let handBounds = calculateConvexHull(from: handPoints) // Your implementation
// 4. Subtract hand region from subject mask using CoreImage
let subjectMask = try subjectObservation.createScaledMask(
for: subjectObservation.allInstances,
croppedToInstancesContent: false
)
let subjectCIMask = CIImage(cvPixelBuffer: subjectMask)
let handMask = createMaskFromRegion(handBounds, size: sourceImage.size)
let finalMask = subtractMasks(handMask: handMask, from: subjectCIMask)
// 5. Calculate bounding box from final mask
let objectBounds = calculateBoundingBox(from: finalMask)
Helper: Convex Hull
func calculateConvexHull(from points: [VNRecognizedPointKey: VNRecognizedPoint]) -> CGRect {
// Get high-confidence points
let validPoints = points.values.filter { $0.confidence > 0.5 }
guard !validPoints.isEmpty else { return .zero }
// Simple bounding rect (for more accuracy, use actual convex hull algorithm)
let xs = validPoints.map { $0.location.x }
let ys = validPoints.map { $0.location.y }
let minX = xs.min()!
let maxX = xs.max()!
let minY = ys.min()!
let maxY = ys.max()!
return CGRect(
x: minX,
y: minY,
width: maxX - minX,
height: maxY - minY
)
}
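The pattern above leaves createMaskFromRegion and subtractMasks to you; here is one minimal CoreImage sketch (the simple rectangular hand mask and these function names are assumptions, not the author's implementation):
import CoreImage

// Build a white-on-black mask for the (normalized, lower-left-origin) hand rect.
// Vision and CoreImage share the lower-left origin, so no Y flip is needed here.
func createMaskFromRegion(_ region: CGRect, size: CGSize) -> CIImage {
    let pixelRect = CGRect(x: region.origin.x * size.width,
                           y: region.origin.y * size.height,
                           width: region.width * size.width,
                           height: region.height * size.height)
    let white = CIImage(color: .white).cropped(to: pixelRect)
    let black = CIImage(color: .black).cropped(to: CGRect(origin: .zero, size: size))
    return white.composited(over: black)
}

// Keep subject pixels only where the hand mask is black: subject × (1 − hand).
func subtractMasks(handMask: CIImage, from subjectMask: CIImage) -> CIImage {
    let invertedHand = handMask.applyingFilter("CIColorInvert")
    return subjectMask.applyingFilter("CIMultiplyCompositing",
                                      parameters: [kCIInputBackgroundImageKey: invertedHand])
}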
Cost : 2-5 hours initial implementation, 30 min ongoing maintenance
Use case : Add system-like subject lifting UI with minimal code.
// iOS
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)
// macOS
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
When to use:
- You want the system's subject-lifting UI, out-of-process analysis, and minimal code (see the decision tree above)
Cost : 15 min implementation, 5 min ongoing
Use case : Need subject images/bounds without UI interaction.
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(sourceImage, configuration: configuration)
// Get all subjects
for subject in analysis.subjects {
let subjectImage = subject.image
let subjectBounds = subject.bounds
// Process subject...
}
// Tap-based lookup
if let subject = try await analysis.subject(at: tapPoint) {
let compositeImage = try await analysis.image(for: [subject])
}
Cost : 30 min implementation, 10 min ongoing
Use case : HDR preservation, large images, custom compositing.
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
// Get soft segmentation mask
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false // Full resolution for compositing
)
// Use with CoreImage for HDR preservation
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(CIImage(cgImage: sourceImage), forKey: kCIInputImageKey)
filter.setValue(CIImage(cvPixelBuffer: mask), forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
let compositedImage = filter.outputImage
Cost : 1 hour implementation, 15 min ongoing
Use case : User taps to select which subject/person to lift.
// Get instance at tap point
let instance = observation.instanceAtPoint(tapPoint)
if instance == 0 {
// Background tapped - select all instances
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
} else {
// Specific instance tapped
let mask = try observation.createScaledMask(
for: IndexSet(integer: instance),
croppedToInstancesContent: true
)
}
Alternative: Raw pixel buffer access
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
    tapPoint,
    imageWidth,
    imageHeight
)
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
let label = UnsafeRawPointer(baseAddress!).load(
fromByteOffset: offset,
as: UInt8.self
)
Cost : 45 min implementation, 10 min ongoing
Use case : Detect pinch gesture for custom camera trigger or UI control.
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1
try handler.perform([request])
guard let observation = request.results?.first as? VNHumanHandPoseObservation else {
return
}
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
// Check confidence
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
return
}
// Calculate distance (normalized coordinates)
let dx = thumbTip.location.x - indexTip.location.x
let dy = thumbTip.location.y - indexTip.location.y
let distance = sqrt(dx * dx + dy * dy)
let isPinching = distance < 0.05 // Adjust threshold
// State machine for evidence accumulation
if isPinching {
pinchFrameCount += 1
if pinchFrameCount >= 3 {
state = .pinched
}
} else {
pinchFrameCount = max(0, pinchFrameCount - 1)
if pinchFrameCount == 0 {
state = .apart
}
}
Cost : 2 hours implementation, 20 min ongoing
Use case : Apply different effects to each person or count people.
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
let peopleCount = observation.allInstances.count // Up to 4
for personIndex in observation.allInstances {
let personMask = try observation.createScaledMask(
for: IndexSet(integer: personIndex),
croppedToInstancesContent: false
)
// Apply effect to this person only
applyEffect(to: personMask, personIndex: personIndex)
}
Crowded scenes (>4 people):
// Count faces to detect crowding
let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])
let faceCount = faceRequest.results?.count ?? 0
if faceCount > 4 {
// Fallback: Use single mask for all people
let singleMaskRequest = VNGeneratePersonSegmentationRequest()
try handler.perform([singleMaskRequest])
}
Cost : 1.5 hours implementation, 15 min ongoing
Use case : Fitness app that recognizes exercises (jumping jacks, squats, etc.)
// 1. Collect body pose observations
var poseObservations: [VNHumanBodyPoseObservation] = []
let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])
if let observation = request.results?.first as? VNHumanBodyPoseObservation {
poseObservations.append(observation)
}
// 2. When you have 60 frames of poses, prepare for CreateML model
if poseObservations.count == 60 {
var multiArray = try MLMultiArray(
shape: [60, 18, 3], // 60 frames, 18 joints, (x, y, confidence)
dataType: .double
)
for (frameIndex, observation) in poseObservations.enumerated() {
let allPoints = try observation.recognizedPoints(.all)
for (jointIndex, (_, point)) in allPoints.enumerated() {
multiArray[[frameIndex, jointIndex, 0] as [NSNumber]] = NSNumber(value: point.location.x)
multiArray[[frameIndex, jointIndex, 1] as [NSNumber]] = NSNumber(value: point.location.y)
multiArray[[frameIndex, jointIndex, 2] as [NSNumber]] = NSNumber(value: point.confidence)
}
}
// 3. Run inference with CreateML model
let input = YourActionClassifierInput(poses: multiArray)
let output = try actionClassifier.prediction(input: input)
let action = output.label // "jumping_jacks", "squats", etc.
}
Cost : 3-4 hours implementation, 1 hour ongoing
Use case : Extract text from images, receipts, signs, documents.
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast for real-time
request.recognitionLanguages = ["en-US"] // Specify known languages
request.usesLanguageCorrection = true // Helps accuracy
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observations = request.results as? [VNRecognizedTextObservation] else {
return
}
for observation in observations {
// Get top candidate (most likely)
guard let candidate = observation.topCandidates(1).first else { continue }
let text = candidate.string
let confidence = candidate.confidence
// Get bounding box for specific substring
if let range = text.range(of: searchTerm) {
if let boundingBox = try? candidate.boundingBox(for: range) {
// Use for highlighting
}
}
}
Fast vs Accurate:
- .fast: lower latency; use for real-time camera feeds
- .accurate: better recognition quality; use for documents and captured photos
Language tips:
- Set automaticallyDetectsLanguage = true only when the language is unknown
- Check supportedRecognitionLanguages for the current revision (sketch below)
Cost: 30 min basic implementation, 2 hours with language handling
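A quick sketch of the two language settings above (automaticallyDetectsLanguage is iOS 16+, supportedRecognitionLanguages() is iOS 15+):
let request = VNRecognizeTextRequest()
// Only enable auto-detection when you genuinely don't know the language in advance.
request.automaticallyDetectsLanguage = true
// Languages the request's current revision can recognize
let languages = try request.supportedRecognitionLanguages()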
Use case : Scan product barcodes, QR codes, healthcare codes.
let request = VNDetectBarcodesRequest()
request.revision = VNDetectBarcodesRequestRevision3 // ML-based, iOS 16+
request.symbologies = [.qr, .ean13] // Specify only what you need!
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observations = request.results as? [VNBarcodeObservation] else {
return
}
for barcode in observations {
let payload = barcode.payloadStringValue // Decoded content
let symbology = barcode.symbology // Type of barcode
let bounds = barcode.boundingBox // Location (normalized)
print("Found \(symbology): \(payload ?? "no string")")
}
Performance tip: Specifying fewer symbologies = faster scanning.
Revision differences:
- Revision 2 (iOS 15+): adds Codabar and MicroQR symbologies
- Revision 3 (iOS 16+): ML-based detector (see the runtime check below)
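To check at runtime what a given revision can decode (supportedSymbologies() is iOS 15+):
let barcodeRequest = VNDetectBarcodesRequest()
barcodeRequest.revision = VNDetectBarcodesRequestRevision3
let available = try barcodeRequest.supportedSymbologies() // includes .codabar and .microQR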
Cost : 15 min implementation
Use case : Camera-based text/barcode scanning with built-in UI (iOS 16+).
import VisionKit
// Check support
guard DataScannerViewController.isSupported,
DataScannerViewController.isAvailable else {
// Not supported or camera access denied
return
}
// Configure what to scan
let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType> = [
.barcode(symbologies: [.qr]),
.text(textContentType: .URL) // Or nil for all text
]
// Create and present
let scanner = DataScannerViewController(
recognizedDataTypes: recognizedDataTypes,
qualityLevel: .balanced, // Or .fast, .accurate
recognizesMultipleItems: false, // Center-most if false
isHighFrameRateTrackingEnabled: true, // For smooth highlights
isPinchToZoomEnabled: true,
isGuidanceEnabled: true,
isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
try? scanner.startScanning()
}
Delegate methods :
func dataScanner(_ scanner: DataScannerViewController,
didTapOn item: RecognizedItem) {
switch item {
case .text(let text):
print("Tapped text: \(text.transcript)")
case .barcode(let barcode):
print("Tapped barcode: \(barcode.payloadStringValue ?? "")")
@unknown default: break
}
}
// For custom highlights
func dataScanner(_ scanner: DataScannerViewController,
didAdd addedItems: [RecognizedItem],
allItems: [RecognizedItem]) {
for item in addedItems {
let highlight = createHighlight(for: item)
scanner.overlayContainerView.addSubview(highlight)
}
}
Async stream alternative :
for await items in scanner.recognizedItems {
// Process current items
}
Cost : 45 min implementation with custom highlights
Use case : Scan paper documents with automatic edge detection and perspective correction.
import VisionKit
let documentCamera = VNDocumentCameraViewController()
documentCamera.delegate = self
present(documentCamera, animated: true)
// In delegate
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan) {
controller.dismiss(animated: true)
// Process each page
for pageIndex in 0..<scan.pageCount {
let image = scan.imageOfPage(at: pageIndex)
// Now run text recognition on the corrected image
let handler = VNImageRequestHandler(cgImage: image.cgImage!)
let textRequest = VNRecognizeTextRequest()
try? handler.perform([textRequest])
}
}
Cost : 30 min implementation
Use case : Detect document edges programmatically for custom camera UI.
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
try handler.perform([request])
guard let observation = request.results?.first,
let document = observation as? VNRectangleObservation else {
return
}
// Get corner points (normalized coordinates)
let topLeft = document.topLeft
let topRight = document.topRight
let bottomLeft = document.bottomLeft
let bottomRight = document.bottomRight
// Apply perspective correction with CoreImage
let correctedImage = inputImage
.cropped(to: document.boundingBox.scaled(to: imageSize))
.applyingFilter("CIPerspectiveCorrection", parameters: [
"inputTopLeft": CIVector(cgPoint: topLeft.scaled(to: imageSize)),
"inputTopRight": CIVector(cgPoint: topRight.scaled(to: imageSize)),
"inputBottomLeft": CIVector(cgPoint: bottomLeft.scaled(to: imageSize)),
"inputBottomRight": CIVector(cgPoint: bottomRight.scaled(to: imageSize))
])
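The snippet above assumes a small scaled(to:) helper for converting normalized corners to pixel coordinates (not part of CoreGraphics); a minimal sketch:
import CoreGraphics

extension CGPoint {
    // Scale a normalized point into pixel coordinates.
    func scaled(to size: CGSize) -> CGPoint {
        CGPoint(x: x * size.width, y: y * size.height)
    }
}

extension CGRect {
    // Scale a normalized rect into pixel coordinates.
    func scaled(to size: CGSize) -> CGRect {
        CGRect(x: origin.x * size.width,
               y: origin.y * size.height,
               width: width * size.width,
               height: height * size.height)
    }
}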
VNDetectDocumentSegmentationRequest vs VNDetectRectanglesRequest:
- VNDetectDocumentSegmentationRequest: ML-based and trained on documents; returns the document's four corners as above
- VNDetectRectanglesRequest: the older geometry-based detector for arbitrary rectangles; use it when the target isn't a document
Cost : 1-2 hours implementation
Use case : Extract tables, lists, paragraphs with semantic understanding.
// iOS 26+
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)
guard let document = observations.first?.document else {
return
}
// Extract tables
for table in document.tables {
for row in table.rows {
for cell in row {
let text = cell.content.text.transcript
print("Cell: \(text)")
}
}
}
// Get detected data (emails, phones, URLs, dates)
let allDetectedData = document.text.detectedData
for data in allDetectedData {
switch data.match.details {
case .emailAddress(let email):
print("Email: \(email.emailAddress)")
case .phoneNumber(let phone):
print("Phone: \(phone.phoneNumber)")
case .link(let url):
print("URL: \(url)")
default: break
}
}
Document hierarchy: each observation wraps a document; the document exposes text, tables, and lists; a table contains rows of cells, each with content.text; any text container exposes a transcript plus detectedData (emails, phones, URLs, dates).
Cost : 1 hour implementation
Use case : Scan phone numbers from camera like barcode scanner (from WWDC 2019).
// 1. Use region of interest to guide user
let textRequest = VNRecognizeTextRequest { request, error in
guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
for observation in observations {
guard let candidate = observation.topCandidates(1).first else { continue }
// Use domain knowledge to filter
if let phoneNumber = self.extractPhoneNumber(from: candidate.string) {
self.stringTracker.add(phoneNumber)
}
}
// Build evidence over frames
if let stableNumber = self.stringTracker.getStableString(threshold: 10) {
self.foundPhoneNumber(stableNumber)
}
}
textRequest.recognitionLevel = .fast // Real-time
textRequest.usesLanguageCorrection = false // Codes, not natural text
textRequest.regionOfInterest = guidanceBox // Crop to user's focus area
// 2. String tracker for stability
class StringTracker {
private var seenStrings: [String: Int] = [:]
func add(_ string: String) {
seenStrings[string, default: 0] += 1
}
func getStableString(threshold: Int) -> String? {
seenStrings.first { $0.value >= threshold }?.key
}
}
Key techniques from WWDC 2019:
- regionOfInterest to limit recognition to the user's guidance box
- usesLanguageCorrection = false for codes rather than natural text
- .fast recognition level for real-time processing
- A string tracker that accumulates evidence across frames before accepting a result
Cost: 2 hours implementation
Wrong :
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request]) // Blocks UI!
Right :
DispatchQueue.global(qos: .userInitiated).async {
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
DispatchQueue.main.async {
// Update UI
}
}
Why it matters : Vision is resource-intensive. Blocking main thread freezes UI.
Wrong :
let thumbTip = try observation.recognizedPoint(.thumbTip)
let location = thumbTip.location // May be unreliable!
Right :
let thumbTip = try observation.recognizedPoint(.thumbTip)
guard thumbTip.confidence > 0.5 else {
// Low confidence - landmark unreliable
return
}
let location = thumbTip.location
Why it matters : Low confidence points are inaccurate (occlusion, blur, edge of frame).
Wrong (mixing coordinate systems):
// Vision uses lower-left origin
let visionPoint = recognizedPoint.location // (0, 0) = bottom-left
// UIKit uses top-left origin
let uiPoint = CGPoint(x: visionPoint.x, y: visionPoint.y) // WRONG!
Right :
let visionPoint = recognizedPoint.location
// Convert to UIKit coordinates
let uiPoint = CGPoint(
x: visionPoint.x * imageWidth,
y: (1 - visionPoint.y) * imageHeight // Flip Y axis
)
Why it matters : Mismatched origins cause UI overlays to appear in wrong positions.
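For whole bounding boxes (e.g. VNRecognizedTextObservation.boundingBox), a small hedged helper built on Vision's conversion function:
import Vision

func uiKitRect(for normalizedBox: CGRect, imageSize: CGSize) -> CGRect {
    // Scale the normalized (lower-left-origin) box to pixels...
    let rect = VNImageRectForNormalizedRect(normalizedBox,
                                            Int(imageSize.width),
                                            Int(imageSize.height))
    // ...then flip into UIKit's top-left origin.
    return CGRect(x: rect.minX,
                  y: imageSize.height - rect.maxY,
                  width: rect.width,
                  height: rect.height)
}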
Wrong :
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 10 // "Just in case"
Right :
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Only compute what you need
Why it matters : Performance scales with maximumHandCount. Pose computed for all detected hands ≤ max.
Wrong (if you don't need AR):
// Requires AR session just for body pose
let arSession = ARBodyTrackingConfiguration()
Right :
// Vision works offline on still images
let request = VNDetectHumanBodyPoseRequest()
Why it matters : ARKit body pose requires rear camera, AR session, supported devices. Vision works everywhere (even offline).
Context : Product manager wants subject lifting "like in Photos app" by Friday. You're considering skipping background processing.
Pressure : "It's working on my iPhone 15 Pro, let's ship it."
Reality : Vision blocks UI on older devices. Users on iPhone 12 will experience frozen app.
Correct action :
Push-back template : "Subject lifting works, but it freezes the UI on older devices. I need 30 minutes to add background processing and prevent 1-star reviews."
Context : Designer wants to exclude hands from subject bounding box. Engineer suggests training custom CoreML model for specific object detection.
Pressure : "We need perfect bounds, let's train a model."
Reality : Training requires labeled dataset (weeks), ongoing maintenance, and still won't generalize to new objects. Built-in Vision APIs + hand pose solve it in 2-5 hours.
Correct action :
Push-back template : "Training a model takes weeks and only works for specific objects. I can combine Vision APIs to solve this in a few hours and it'll work for any object."
Context : You need instance masks but app supports iOS 15+.
Pressure : "Just use iOS 15 person segmentation and ship it."
Reality : VNGeneratePersonSegmentationRequest (iOS 15) returns single mask for all people. Doesn't solve multi-person use case.
Correct action :
- Use @available to conditionally enable features
Push-back template: "Person segmentation on iOS 15 combines all people into one mask. We can either require iOS 17 for the best experience, or disable multi-person features on older OS versions. Which do you prefer?"
Before shipping Vision features:
Performance:
- Vision requests run off the main thread
- maximumHandCount set to minimum needed value
Accuracy:
- Landmark confidence checked before using locations (e.g. confidence > 0.5)
Coordinates:
- Vision's lower-left origin converted to UIKit's top-left before drawing overlays
Platform Support:
- @available checks for iOS 17+ APIs (instance masks)
Edge Cases:
- Fallbacks in place when no subject, hand, or text is detected
CoreImage Integration (if applicable):
- croppedToInstancesContent set appropriately (false for compositing)
Text/Barcode Recognition (if applicable):
- symbologies and recognitionLanguages limited to what you actually need
WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2020-10653
Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
Skills: axiom-vision-ref, axiom-vision-diag