axiom-vision-ref by charleswiltgen/axiom
npx skills add https://github.com/charleswiltgen/axiom --skill axiom-vision-ref
Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.
Related skills : See axiom-vision for decision trees and patterns, axiom-vision-diag for troubleshooting
Vision provides computer vision algorithms for still images and video:
Core workflow :
1. Create a request (e.g. VNDetectHumanHandPoseRequest())
2. Create a request handler (VNImageRequestHandler(cgImage: image))
3. Perform the request (try handler.perform([request]))
4. Access observations via request.results
Coordinate system : Lower-left origin, normalized (0.0-1.0) coordinates
Performance : Run on a background queue; these requests are resource-intensive and block the UI if run on the main thread
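A minimal sketch of the background-queue guidance above (the queue label, function name, and completion shape are illustrative, not part of the Vision API):

```swift
import Vision

let visionQueue = DispatchQueue(label: "vision.processing", qos: .userInitiated)

func detectHands(in image: CGImage,
                 completion: @escaping ([VNHumanHandPoseObservation]) -> Void) {
    visionQueue.async {
        let request = VNDetectHumanHandPoseRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        do {
            try handler.perform([request])
            let observations = request.results ?? []
            // Hop back to the main thread before touching UI
            DispatchQueue.main.async { completion(observations) }
        } catch {
            DispatchQueue.main.async { completion([]) }
        }
    }
}
```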
Vision provides two request handlers for different scenarios.
Analyzes a single image. Initialize with the image, perform requests against it, then discard it.
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request1, request2]) // Multiple requests, one image
Initialize with : CGImage, CIImage, CVPixelBuffer, Data, or URL
Rule : One handler per image. Reusing a handler with a different image is unsupported.
Analyzes a sequence of frames (video, camera feed). Initialize empty, pass each frame to perform(). Maintains inter-frame state for temporal smoothing.
let sequenceHandler = VNSequenceRequestHandler()
// In your camera/video frame callback:
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
try sequenceHandler.perform([request], on: pixelBuffer)
}
Rule : Create once, reuse across frames. The handler tracks state between calls.
| Use Case | Handler |
|---|---|
| Single photo or screenshot | VNImageRequestHandler |
| Video stream or camera frames | VNSequenceRequestHandler |
| Temporal smoothing (pose, segmentation) | VNSequenceRequestHandler |
| One-off analysis of a CVPixelBuffer | VNImageRequestHandler |
These requests use inter-frame state when run through VNSequenceRequestHandler:
- VNDetectHumanBodyPoseRequest — smoother joint tracking
- VNDetectHumanHandPoseRequest — smoother landmark tracking
- VNGeneratePersonSegmentationRequest — temporally consistent masks
- VNGeneratePersonInstanceMaskRequest — stable person identity across frames
- VNDetectDocumentSegmentationRequest — stable document edges
- VNStatefulRequest subclasses — designed for sequences
Creating a new VNImageRequestHandler per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.
// Wrong — loses temporal context every frame
func processFrame(_ buffer: CVPixelBuffer) throws {
let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
try handler.perform([poseRequest])
}
// Right — maintains inter-frame state
let sequenceHandler = VNSequenceRequestHandler()
func processFrame(_ buffer: CVPixelBuffer) throws {
try sequenceHandler.perform([poseRequest], on: buffer)
}
Availability : iOS 17+, macOS 14+, tvOS 17+, visionOS 1+
Generates class-agnostic instance masks of foreground objects (people, pets, buildings, food, shoes, etc.).
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
allInstances : IndexSet containing all foreground instance indices (excludes background 0)
instanceMask : CVPixelBuffer with UInt8 labels (0 = background, 1+ = instance indices)
instanceAtPoint(_:) : Returns the instance index at a normalized point
let point = CGPoint(x: 0.5, y: 0.5) // Center of the image
let instance = observation.instanceAtPoint(point)
if instance == 0 {
print("Background tapped")
} else {
print("Instance \(instance) tapped")
}
createScaledMask(for:croppedToInstancesContent:)
Parameters:
- for: IndexSet of the instances to include
- croppedToInstancesContent:
  - false = output matches input resolution (for compositing)
  - true = tight crop around the selected instances
Returns: Single-channel floating-point CVPixelBuffer (a soft segmentation mask)
// All instances, full resolution
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
for: instances,
croppedToInstancesContent: true
)
Access the raw pixel buffer to map tap coordinates to instance labels:
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert the normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
CGPoint(x: normalizedX, y: normalizedY),
width: imageWidth,
height: imageHeight
)
// Calculate the byte offset
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
// Read the instance label
let label = UnsafeRawPointer(baseAddress!).load(
fromByteOffset: offset,
as: UInt8.self
)
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
Availability : iOS 16+, iPadOS 16+
Adds system-like subject lifting UI to views:
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject // Or .automatic
imageView.addInteraction(interaction)
Interaction types :
- .automatic: Subject lifting + Live Text + data detectors
- .imageSubject: Subject lifting only (no interactive text)
Availability : macOS 13+
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(image, configuration: configuration)
subjects : [Subject] - All subjects in the image
highlightedSubjects : Set<Subject> - Currently highlighted subjects (user long-pressed)
subject(at:) : Async lookup of the subject at a normalized point (returns nil if none)
// Get all subjects
let subjects = analysis.subjects
// Look up the subject at a tap point
if let subject = try await analysis.subject(at: tapPoint) {
// Process the subject
}
// Change the highlight state
analysis.highlightedSubjects = Set([subjects[0], subjects[1]])
image : UIImage/NSImage - The extracted subject image with transparency
bounds : CGRect - The subject's bounds in image coordinates
// Single subject image
let subjectImage = subject.image
// Composite multiple subjects
let compositeImage = try await analysis.image(for: [subject1, subject2])
Out-of-process : VisionKit analysis runs out of process (a performance benefit; image size is limited)
Availability : iOS 15+, macOS 12+
Returns a single mask containing all people in the image:
let request = VNGeneratePersonSegmentationRequest()
// Configure the quality level if needed
try handler.perform([request])
guard let observation = request.results?.first as? VNPixelBufferObservation else {
return
}
let personMask = observation.pixelBuffer // CVPixelBuffer
Availability : iOS 17+, macOS 14+
Returns separate masks for up to 4 people:
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances // Up to 4 people (1-4)
// Get the mask for person 1
let person1Mask = try observation.createScaledMask(
for: IndexSet(integer: 1),
croppedToInstancesContent: false
)
Limitations :
- Separates at most 4 people; use VNDetectFaceRectanglesRequest to count faces if you need to handle crowded scenes
Availability : iOS 14+, macOS 11+
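One way to apply that guidance, sketched under the assumption that falling back is acceptable when more than 4 people are present (the function name is illustrative):

```swift
import Vision

func personMasks(in image: CGImage) throws -> VNInstanceMaskObservation? {
    let faceRequest = VNDetectFaceRectanglesRequest()
    let maskRequest = VNGeneratePersonInstanceMaskRequest()
    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([faceRequest, maskRequest])

    let faceCount = faceRequest.results?.count ?? 0
    guard faceCount <= 4 else {
        // More people than the request can separate; handle the crowd differently
        return nil
    }
    return maskRequest.results?.first
}
```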
Detects 21 hand landmarks per hand:
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Default: 2; increase if needed
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNHumanHandPoseObservation] ?? [] {
// Process each hand
}
Performance note : maximumHandCount affects latency. Pose is computed only for up to the maximum number of hands. Set it to the lowest acceptable value.
Wrist : 1 landmark
Thumb (4 landmarks):
- .thumbTip
- .thumbIP (interphalangeal joint)
- .thumbMP (metacarpophalangeal joint)
- .thumbCMC (carpometacarpal joint)
Fingers (4 landmarks each):
- .indexTip, .middleTip, .ringTip, .littleTip
Access landmark groups:
| Group Key | Points |
|---|---|
| .all | All 21 landmarks |
| .thumb | 4 thumb joints |
| .indexFinger | 4 index finger joints |
| .middleFinger | 4 middle finger joints |
| .ringFinger | 4 ring finger joints |
| .littleFinger | 4 little finger joints |
// Get all points
let allPoints = try observation.recognizedPoints(.all)
// Get index finger points only
let indexPoints = try observation.recognizedPoints(.indexFinger)
// Get specific points
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
// Check confidence
guard thumbTip.confidence > 0.5 else { return }
// Access the location (normalized coordinates, lower-left origin)
let location = thumbTip.location // CGPoint
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
return
}
let distance = hypot(
thumbTip.location.x - indexTip.location.x,
thumbTip.location.y - indexTip.location.y
)
let isPinching = distance < 0.05 // Normalized threshold
let chirality = observation.chirality // .left, .right, or .unknown
Availability : iOS 14+, macOS 11+
Detects 18 body landmarks (2D normalized coordinates):
let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanBodyPoseObservation] ?? [] {
// Process each person
}
Face (5 landmarks):
- .nose, .leftEye, .rightEye, .leftEar, .rightEar
Arms (6 landmarks):
- .leftShoulder, .leftElbow, .leftWrist
- .rightShoulder, .rightElbow, .rightWrist
Torso (7 landmarks):
- .neck (between the shoulders)
- .leftShoulder, .rightShoulder (also in the arm groups)
- .leftHip, .rightHip
- .root (between the hips)
Legs (6 landmarks):
- .leftHip, .leftKnee, .leftAnkle
- .rightHip, .rightKnee, .rightAnkle
Note : Shoulders and hips appear in multiple groups
| Group Key | Points |
|---|---|
| .all | All 18 landmarks |
| .face | 5 face landmarks |
| .leftArm | shoulder, elbow, wrist |
| .rightArm | shoulder, elbow, wrist |
| .torso | neck, shoulders, hips, root |
| .leftLeg | hip, knee, ankle |
| .rightLeg | hip, knee, ankle |
// Get all body points
let allPoints = try observation.recognizedPoints(.all)
// Get the left arm only
let leftArmPoints = try observation.recognizedPoints(.leftArm)
// Get a specific joint
let leftWrist = try observation.recognizedPoint(.leftWrist)
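As a usage sketch, 2D landmarks can be combined into joint angles; here the left-elbow angle from shoulder, elbow, and wrist (the confidence threshold and function name are illustrative):

```swift
import Vision

func leftElbowAngle(from observation: VNHumanBodyPoseObservation) throws -> CGFloat? {
    let shoulder = try observation.recognizedPoint(.leftShoulder)
    let elbow = try observation.recognizedPoint(.leftElbow)
    let wrist = try observation.recognizedPoint(.leftWrist)
    guard [shoulder, elbow, wrist].allSatisfy({ $0.confidence > 0.3 }) else { return nil }

    // Angle at the elbow between the upper-arm and forearm vectors
    let v1 = CGVector(dx: shoulder.location.x - elbow.location.x,
                      dy: shoulder.location.y - elbow.location.y)
    let v2 = CGVector(dx: wrist.location.x - elbow.location.x,
                      dy: wrist.location.y - elbow.location.y)
    var degrees = abs(atan2(v2.dy, v2.dx) - atan2(v1.dy, v1.dx)) * 180 / .pi
    if degrees > 180 { degrees = 360 - degrees }  // Fold into 0-180
    return degrees
}
```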
Availability : iOS 17+, macOS 14+
Returns a 3D skeleton with 17 joints, in meters (real-world coordinates):
let request = VNDetectHumanBodyPose3DRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNHumanBodyPose3DObservation else {
return
}
// Get a 3D joint position
let leftWrist = try observation.recognizedPoint(.leftWrist)
let position = leftWrist.position // simd_float4x4 matrix
let localPosition = leftWrist.localPosition // Relative to the parent joint
3D body landmarks (17 points) : Similar to the 2D set, minus the ear landmarks (the 2D request provides 18)
bodyHeight : Estimated height in meters
heightEstimation : .measured or .reference
cameraOriginMatrix : simd_float4x4 camera position/orientation relative to the subject
pointInImage(_:) : Projects a 3D joint back to 2D image coordinates
let wrist2D = try observation.pointInImage(leftWrist)
VNPoint3D : Base class with a simd_float4x4 position matrix
VNRecognizedPoint3D : Adds an identifier (joint name)
VNHumanBodyRecognizedPoint3D : Adds localPosition and parentJoint
// Position relative to the skeleton root (center of the hips)
let modelPosition = leftWrist.position
// Position relative to the parent joint (the left elbow)
let relativePosition = leftWrist.localPosition
Vision accepts depth data alongside images:
// From AVDepthData
let handler = VNImageRequestHandler(
cvPixelBuffer: imageBuffer,
depthData: depthData,
orientation: orientation
)
// From a file (automatic depth extraction)
let handler = VNImageRequestHandler(url: imageURL) // Depth is fetched automatically
Depth formats : Disparity or depth (interchangeable via AVFoundation)
LiDAR : Use in live capture sessions for accurate scale/measurement
Availability : iOS 11+
Detects face bounding boxes:
let request = VNDetectFaceRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
let faceBounds = observation.boundingBox // Normalized rect
}
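Since boundingBox uses Vision's lower-left-origin normalized space, drawing it in UIKit requires a y-axis flip. A common conversion sketch (the function name is illustrative; VNImageRectForNormalizedRect covers the pixel-space case):

```swift
import UIKit

// Convert a Vision normalized rect (lower-left origin) to
// UIKit view coordinates (upper-left origin)
func viewRect(for boundingBox: CGRect, in viewSize: CGSize) -> CGRect {
    CGRect(
        x: boundingBox.minX * viewSize.width,
        y: (1 - boundingBox.maxY) * viewSize.height,  // Flip the y-axis
        width: boundingBox.width * viewSize.width,
        height: boundingBox.height * viewSize.height
    )
}
```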
Availability : iOS 11+
Detects faces with detailed landmarks:
let request = VNDetectFaceLandmarksRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
if let landmarks = observation.landmarks {
let leftEye = landmarks.leftEye
let nose = landmarks.nose
let leftPupil = landmarks.leftPupil // Revision 2+
}
}
Revisions :
Availability : iOS 13+
Detects human bounding boxes (torso detection):
let request = VNDetectHumanRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanObservation] ?? [] {
let humanBounds = observation.boundingBox // Normalized rect
}
Use case : Faster than pose detection when you only need a location
Composite a subject onto a new background using a Vision mask:
// 1. Get the mask from Vision
let observation = request.results?.first as? VNInstanceMaskObservation
let visionMask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
// 2. Convert to CIImage
let maskImage = CIImage(cvPixelBuffer: visionMask)
// 3. Apply the filter
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(sourceImage, forKey: kCIInputImageKey)
filter.setValue(maskImage, forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
let output = filter.outputImage // Composited result
Parameters : inputImage (foreground source), inputMaskImage (the Vision mask), inputBackgroundImage (the new background)
HDR preservation : Core Image preserves the input's high dynamic range (Vision/VisionKit output is SDR)
Availability : iOS 13+, macOS 10.15+
Recognizes text in images with a configurable accuracy/speed trade-off.
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast
request.recognitionLanguages = ["en-US", "de-DE"] // Order matters
request.usesLanguageCorrection = true
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNRecognizedTextObservation] ?? [] {
// Get the top candidates
let candidates = observation.topCandidates(3)
let bestText = candidates.first?.string ?? ""
}
| Level | Performance | Accuracy | Best For |
|---|---|---|---|
| .fast | Real-time | Good | Camera feeds, large text, signs |
| .accurate | Slower | Excellent | Documents, receipts, handwriting |
Fast path : Character-by-character recognition (neural network → character detection)
Accurate path : Full-line ML recognition (neural network → line/word recognition)
| Property | Type | Description |
|---|---|---|
| recognitionLevel | VNRequestTextRecognitionLevel | .fast or .accurate |
| recognitionLanguages | [String] | BCP 47 language codes; order = priority |
| usesLanguageCorrection | Bool | Apply language-model correction |
| customWords | [String] | Domain-specific vocabulary |
| automaticallyDetectsLanguage | Bool | Auto-detect language (iOS 16+) |
| minimumTextHeight | Float | Minimum text height as a fraction of image height (0-1) |
| revision | Int | API revision (affects supported languages) |
// Check the languages supported by the current settings
let languages = try VNRecognizeTextRequest.supportedRecognitionLanguages(
for: .accurate,
revision: VNRecognizeTextRequestRevision3
)
Language correction : Improves accuracy but costs processing time. Disable it for codes/serial numbers.
Custom words : Add domain-specific vocabulary for better recognition (medical terms, product codes).
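Putting those two notes together, a configuration sketch (the sample vocabulary and threshold are illustrative; my understanding is that customWords applies during language correction, so the two settings go with different use cases):

```swift
import Vision

// Reading serial numbers: disable correction so codes aren't "fixed" into words
let serialRequest = VNRecognizeTextRequest()
serialRequest.recognitionLevel = .accurate
serialRequest.usesLanguageCorrection = false
serialRequest.minimumTextHeight = 0.05  // Ignore small background text

// Reading domain text: keep correction on and supplement the lexicon
let medicalRequest = VNRecognizeTextRequest()
medicalRequest.recognitionLevel = .accurate
medicalRequest.usesLanguageCorrection = true
medicalRequest.customWords = ["Amoxicillin", "Ibuprofen"]
```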
boundingBox : Normalized rectangle containing the recognized text
topCandidates(_:) : Returns [VNRecognizedText] sorted by confidence
| Property | Type | Description |
|---|---|---|
| string | String | The recognized text |
| confidence | VNConfidence | 0.0-1.0 |
| boundingBox(for:) | VNRectangleObservation? | Bounding box for a substring range |
// Get the bounding box for a substring
let text = candidate.string
if let range = text.range(of: "invoice") {
let box = try candidate.boundingBox(for: range)
}
Availability : iOS 11+, macOS 10.13+
Detects and decodes barcodes and QR codes.
let request = VNDetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128] // Specific codes
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for barcode in request.results as? [VNBarcodeObservation] ?? [] {
let payload = barcode.payloadStringValue
let type = barcode.symbology
let bounds = barcode.boundingBox
}
1D barcodes :
- .codabar (iOS 15+)
- .code39, .code39Checksum, .code39FullASCII, .code39FullASCIIChecksum
- .code93, .code93i
- .code128
- .ean8, .ean13
- .gs1DataBar, .gs1DataBarExpanded, .gs1DataBarLimited (iOS 15+)
- .i2of5, .i2of5Checksum
- .itf14
- .upce
2D codes :
- .aztec
- .dataMatrix
- .microPDF417 (iOS 15+)
- .microQR (iOS 15+)
- .pdf417
- .qr
Performance : Specifying fewer symbologies = faster detection
| Revision | iOS | Capabilities |
|---|---|---|
| 1 | 11+ | Basic detection, one code at a time |
| 2 | 15+ | Codabar, GS1, MicroPDF, MicroQR, better ROI |
| 3 | 16+ | ML-based, multiple codes, better bounding boxes |
| Property | Type | Description |
|---|---|---|
| payloadStringValue | String? | Decoded content |
| symbology | VNBarcodeSymbology | Barcode type |
| boundingBox | CGRect | Normalized bounds |
| topLeft/topRight/bottomLeft/bottomRight | CGPoint | Corner points |
Availability : iOS 16+
Live camera-based scanner with built-in UI for text and barcodes.
// Hardware support
DataScannerViewController.isSupported
// Runtime availability (camera access, parental controls)
DataScannerViewController.isAvailable
import VisionKit
let dataTypes: Set<DataScannerViewController.RecognizedDataType> = [
.barcode(symbologies: [.qr, .ean13]),
.text(textContentType: .URL), // Or nil for all text
// .text(languages: ["ja"]) // Filter by language
]
let scanner = DataScannerViewController(
recognizedDataTypes: dataTypes,
qualityLevel: .balanced, // .fast, .balanced, .accurate
recognizesMultipleItems: true,
isHighFrameRateTrackingEnabled: true,
isPinchToZoomEnabled: true,
isGuidanceEnabled: true,
isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
try? scanner.startScanning()
}
| Type | Description |
|---|---|
| .barcode(symbologies:) | Specific barcode types |
| .text() | All text |
| .text(languages:) | Text filtered by language |
| .text(textContentType:) | Text filtered by type (URL, phone, email) |
protocol DataScannerViewControllerDelegate {
func dataScanner(_ dataScanner: DataScannerViewController,
didTapOn item: RecognizedItem)
func dataScanner(_ dataScanner: DataScannerViewController,
didAdd addedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
didUpdate updatedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
didRemove removedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable)
}
enum RecognizedItem {
case text(RecognizedItem.Text)
case barcode(RecognizedItem.Barcode)
var id: UUID { get }
var bounds: RecognizedItem.Bounds { get }
}
// Text items
struct Text {
let transcript: String
}
// Barcode items
struct Barcode {
let payloadStringValue: String?
let observation: VNBarcodeObservation
}
// Alternative to the delegate
for await items in scanner.recognizedItems {
// Currently recognized items
}
// Add custom views over recognized items
scanner.overlayContainerView.addSubview(customHighlight)
// Capture a still photo
let photo = try await scanner.capturePhoto()
Availability : iOS 13+
Document scanning with automatic edge detection, perspective correction, and lighting adjustment.
import VisionKit
let camera = VNDocumentCameraViewController()
camera.delegate = self
present(camera, animated: true)
protocol VNDocumentCameraViewControllerDelegate {
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan)
func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController)
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFailWithError error: Error)
}
| Property | Type | Description |
|---|---|---|
| pageCount | Int | Number of scanned pages |
| imageOfPage(at:) | UIImage | Page image at an index |
| title | String | User-editable title |
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan) {
controller.dismiss(animated: true)
for i in 0..<scan.pageCount {
let pageImage = scan.imageOfPage(at: i)
// Process with VNRecognizeTextRequest
}
}
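The per-page processing hinted at in that comment might look like this sketch (the helper name is illustrative):

```swift
import UIKit
import Vision

func recognizeText(on pageImage: UIImage) throws -> [String] {
    guard let cgImage = pageImage.cgImage else { return [] }
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate  // Scanned documents favor accuracy
    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])
    // One string per text observation, best candidate only
    return request.results?.compactMap { $0.topCandidates(1).first?.string } ?? []
}
```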
Availability : iOS 15+, macOS 12+
Detects document boundaries, for custom camera UI or post-processing.
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNRectangleObservation else {
return // No document found
}
// Get the corner points (normalized)
let corners = [
observation.topLeft,
observation.topRight,
observation.bottomLeft,
observation.bottomRight
]
vs VNDetectRectanglesRequest :
Availability : iOS 26+, macOS 26+
Structured document understanding with semantic parsing.
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)
guard let document = observations.first?.document else {
return
}
DocumentObservation
└── document: DocumentObservation.Document
├── text: TextObservation
├── tables: [Container.Table]
├── lists: [Container.List]
└── barcodes: [Container.Barcode]
for table in document.tables {
for row in table.rows {
for cell in row {
let text = cell.content.text.transcript
let detectedData = cell.content.text.detectedData
}
}
}
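Building on the traversal above, a sketch that flattens a recognized table into CSV lines (the fully qualified table type and the separator handling are assumptions based on the hierarchy shown):

```swift
// Flatten a recognized table into one CSV line per row
func csvLines(from table: DocumentObservation.Container.Table) -> [String] {
    table.rows.map { row in
        row.map { cell in
            // Strip commas so cell text doesn't break the CSV columns
            cell.content.text.transcript.replacingOccurrences(of: ",", with: " ")
        }.joined(separator: ",")
    }
}
```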
for data in document.text.detectedData {
switch data.match.details {
case .emailAddress(let email):
let address = email.emailAddress
case .phoneNumber(let phone):
let number = phone.phoneNumber
case .link(let url):
let link = url
case .address(let address):
let components = address
case .date(let date):
let dateValue = date
default:
break
}
}
TextObservation
├── transcript: String
├── lines: [TextObservation.Line]
├── paragraphs: [TextObservation.Paragraph]
├── words: [TextObservation.Word]
└── detectedData: [DetectedDataObservation]
Visual Intelligence is a system-level feature (iOS 26+) that lets users point the camera at real-world objects and find matching content across apps. This is distinct from the Vision framework covered above (VNRequest-based image analysis). Vision analyzes images inside your app; Visual Intelligence lets the system call into your app when users search with the camera or a screenshot.
- IntentValueQuery — queries participating apps
- SemanticContentDescriptor — describes what the user is looking at
- AppEntity — the result type your app returns
import VisualIntelligence
import AppIntents
The core system-provided object describing what the user is looking at.
| Property | Type | Description |
|---|---|---|
| labels | [String] | Classification labels for the detected item |
| pixelBuffer | CVReadOnlyPixelBuffer? | Visual data for the detected item |
Use the labels for fast keyword matching against your content catalog. When labels aren't enough, use the pixel buffer for an image-similarity search.
The entry point through which Visual Intelligence talks to your app. Implement values(for:) to receive search requests and return matching entities.
struct LandmarkIntentValueQuery: IntentValueQuery {
@Dependency var modelData: ModelData
func values(for input: SemanticContentDescriptor) async throws -> [LandmarkEntity] {
if !input.labels.isEmpty {
return try await modelData.search(matching: input.labels)
}
guard let pixelBuffer = input.pixelBuffer else { return [] }
return try await modelData.search(matching: pixelBuffer) // Assumed image-similarity overload of search
}
}
Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.
Related skills : See axiom-vision for decision trees and patterns, axiom-vision-diag for troubleshooting
Vision provides computer vision algorithms for still images and video:
Core workflow :
VNDetectHumanHandPoseRequest())VNImageRequestHandler(cgImage: image))try handler.perform([request]))request.resultsCoordinate system : Lower-left origin, normalized (0.0-1.0) coordinates
Performance : Run on background queue - resource intensive, blocks UI if on main thread
Vision provides two request handlers for different scenarios.
Analyzes a single image. Initialize with the image, perform requests against it, discard.
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request1, request2]) // Multiple requests, one image
Initialize with : CGImage, CIImage, CVPixelBuffer, Data, or URL
Rule : One handler per image. Reusing a handler with a different image is unsupported.
Analyzes a sequence of frames (video, camera feed). Initialize empty, pass each frame to perform(). Maintains inter-frame state for temporal smoothing.
let sequenceHandler = VNSequenceRequestHandler()
// In your camera/video frame callback:
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
try sequenceHandler.perform([request], on: pixelBuffer)
}
Rule : Create once, reuse across frames. The handler tracks state between calls.
| Use Case | Handler |
|---|---|
| Single photo or screenshot | VNImageRequestHandler |
| Video stream or camera frames | VNSequenceRequestHandler |
| Temporal smoothing (pose, segmentation) | VNSequenceRequestHandler |
| One-off analysis of a CVPixelBuffer | VNImageRequestHandler |
These requests use inter-frame state when run through VNSequenceRequestHandler:
VNDetectHumanBodyPoseRequest — Smoother joint trackingVNDetectHumanHandPoseRequest — Smoother landmark trackingVNGeneratePersonSegmentationRequest — Temporally consistent masksVNGeneratePersonInstanceMaskRequest — Stable person identity across framesVNDetectDocumentSegmentationRequest — Stable document edgesVNStatefulRequest subclass — Designed for sequencesCreating a new VNImageRequestHandler per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.
// Wrong — loses temporal context every frame
func processFrame(_ buffer: CVPixelBuffer) throws {
let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
try handler.perform([poseRequest])
}
// Right — maintains inter-frame state
let sequenceHandler = VNSequenceRequestHandler()
func processFrame(_ buffer: CVPixelBuffer) throws {
try sequenceHandler.perform([poseRequest], on: buffer)
}
Availability : iOS 17+, macOS 14+, tvOS 17+, visionOS 1+
Generates class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.)
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
allInstances : IndexSet containing all foreground instance indices (excludes background 0)
instanceMask : CVPixelBuffer with UInt8 labels (0 = background, 1+ = instance indices)
instanceAtPoint(_:) : Returns instance index at normalized point
let point = CGPoint(x: 0.5, y: 0.5) // Center of image
let instance = observation.instanceAtPoint(point)
if instance == 0 {
print("Background tapped")
} else {
print("Instance \(instance) tapped")
}
createScaledMask(for:croppedToInstancesContent:)
Parameters:
for: IndexSet of instances to includecroppedToInstancesContent:
false = Output matches input resolution (for compositing)true = Tight crop around selected instancesReturns: Single-channel floating-point CVPixelBuffer (soft segmentation mask)
// All instances, full resolution
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
for: instances,
croppedToInstancesContent: true
)
Access raw pixel buffer to map tap coordinates to instance labels:
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
CGPoint(x: normalizedX, y: normalizedY),
width: imageWidth,
height: imageHeight
)
// Calculate byte offset
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
// Read instance label
let label = UnsafeRawPointer(baseAddress!).load(
fromByteOffset: offset,
as: UInt8.self
)
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
Availability : iOS 16+, iPadOS 16+
Adds system-like subject lifting UI to views:
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject // Or .automatic
imageView.addInteraction(interaction)
Interaction types :
.automatic: Subject lifting + Live Text + data detectors.imageSubject: Subject lifting only (no interactive text)Availability : macOS 13+
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(image, configuration: configuration)
subjects : [Subject] - All subjects in image
highlightedSubjects : Set<Subject> - Currently highlighted (user long-pressed)
subject(at:) : Async lookup of subject at normalized point (returns nil if none)
// Get all subjects
let subjects = analysis.subjects
// Look up subject at tap
if let subject = try await analysis.subject(at: tapPoint) {
// Process subject
}
// Change highlight state
analysis.highlightedSubjects = Set([subjects[0], subjects[1]])
image : UIImage/NSImage - Extracted subject with transparency
bounds : CGRect - Subject boundaries in image coordinates
// Single subject image
let subjectImage = subject.image
// Composite multiple subjects
let compositeImage = try await analysis.image(for: [subject1, subject2])
Out-of-process : VisionKit analysis happens out-of-process (performance benefit, image size limited)
Availability : iOS 15+, macOS 12+
Returns single mask containing all people in image:
let request = VNGeneratePersonSegmentationRequest()
// Configure quality level if needed
try handler.perform([request])
guard let observation = request.results?.first as? VNPixelBufferObservation else {
return
}
let personMask = observation.pixelBuffer // CVPixelBuffer
Availability : iOS 17+, macOS 14+
Returns separate masks for up to 4 people :
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances // Up to 4 people (1-4)
// Get mask for person 1
let person1Mask = try observation.createScaledMask(
for: IndexSet(integer: 1),
croppedToInstancesContent: false
)
Limitations :
VNDetectFaceRectanglesRequest to count faces if you need to handle crowded scenesAvailability : iOS 14+, macOS 11+
Detects 21 hand landmarks per hand:
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2 // Default: 2, increase if needed
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNHumanHandPoseObservation] ?? [] {
// Process each hand
}
Performance note : maximumHandCount affects latency. Pose computed only for hands ≤ maximum. Set to lowest acceptable value.
Wrist : 1 landmark
Thumb (4 landmarks):
.thumbTip.thumbIP (interphalangeal joint).thumbMP (metacarpophalangeal joint).thumbCMC (carpometacarpal joint)Fingers (4 landmarks each):
.indexTip, .middleTip, .ringTip, .littleTip)Access landmark groups:
| Group Key | Points |
|---|---|
.all | All 21 landmarks |
.thumb | 4 thumb joints |
.indexFinger | 4 index finger joints |
.middleFinger | 4 middle finger joints |
.ringFinger | 4 ring finger joints |
.littleFinger | 4 little finger joints |
// Get all points
let allPoints = try observation.recognizedPoints(.all)
// Get index finger points only
let indexPoints = try observation.recognizedPoints(.indexFinger)
// Get specific point
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
// Check confidence
guard thumbTip.confidence > 0.5 else { return }
// Access location (normalized coordinates, lower-left origin)
let location = thumbTip.location // CGPoint
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
return
}
let distance = hypot(
thumbTip.location.x - indexTip.location.x,
thumbTip.location.y - indexTip.location.y
)
let isPinching = distance < 0.05 // Normalized threshold
let chirality = observation.chirality // .left or .right or .unknown
Availability : iOS 14+, macOS 11+
Detects 18 body landmarks (2D normalized coordinates):
let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanBodyPoseObservation] ?? [] {
// Process each person
}
Face (5 landmarks):
.nose, .leftEye, .rightEye, .leftEar, .rightEarArms (6 landmarks):
.leftShoulder, .leftElbow, .leftWrist.rightShoulder, .rightElbow, .rightWristTorso (7 landmarks):
.neck (between shoulders).leftShoulder, .rightShoulder (also in arm groups).leftHip, .rightHip.root (between hips)Legs (6 landmarks):
.leftHip, .leftKnee, .leftAnkle.rightHip, .rightKnee, .rightAnkleNote : Shoulders and hips appear in multiple groups
| Group Key | Points |
|---|---|
.all | All 18 landmarks |
.face | 5 face landmarks |
.leftArm | shoulder, elbow, wrist |
.rightArm | shoulder, elbow, wrist |
.torso | neck, shoulders, hips, root |
.leftLeg | hip, knee, ankle |
// Get all body points
let allPoints = try observation.recognizedPoints(.all)
// Get left arm only
let leftArmPoints = try observation.recognizedPoints(.leftArm)
// Get specific joint
let leftWrist = try observation.recognizedPoint(.leftWrist)
Availability : iOS 17+, macOS 14+
Returns 3D skeleton with 17 joints in meters (real-world coordinates):
let request = VNDetectHumanBodyPose3DRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNHumanBodyPose3DObservation else {
return
}
// Get 3D joint position
let leftWrist = try observation.recognizedPoint(.leftWrist)
let position = leftWrist.position // simd_float4x4 matrix
let localPosition = leftWrist.localPosition // Relative to parent joint
3D Body Landmarks (17 points): Same as 2D except no ears (15 vs 18 2D landmarks)
bodyHeight : Estimated height in meters
heightEstimation : .measured or .reference
cameraOriginMatrix : simd_float4x4 camera position/orientation relative to subject
pointInImage(_:) : Project 3D joint back to 2D image coordinates
let wrist2D = try observation.pointInImage(leftWrist)
VNPoint3D : Base class with simd_float4x4 position matrix
VNRecognizedPoint3D : Adds identifier (joint name)
VNHumanBodyRecognizedPoint3D : Adds localPosition and parentJoint
// Position relative to skeleton root (center of hip)
let modelPosition = leftWrist.position
// Position relative to parent joint (left elbow)
let relativePosition = leftWrist.localPosition
Vision accepts depth data alongside images:
// From AVDepthData
let handler = VNImageRequestHandler(
cvPixelBuffer: imageBuffer,
depthData: depthData,
orientation: orientation
)
// From file (automatic depth extraction)
let handler = VNImageRequestHandler(url: imageURL) // Depth auto-fetched
Depth formats : Disparity or Depth (interchangeable via AVFoundation)
LiDAR : Use in live capture sessions for accurate scale/measurement
Availability : iOS 11+
Detects face bounding boxes:
let request = VNDetectFaceRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
let faceBounds = observation.boundingBox // Normalized rect
}
Availability : iOS 11+
Detects face with detailed landmarks:
let request = VNDetectFaceLandmarksRequest()
try handler.perform([request])
for observation in request.results as? [VNFaceObservation] ?? [] {
if let landmarks = observation.landmarks {
let leftEye = landmarks.leftEye
let nose = landmarks.nose
let leftPupil = landmarks.leftPupil // Revision 2+
}
}
Revisions :
Availability : iOS 13+
Detects human bounding boxes (torso detection):
let request = VNDetectHumanRectanglesRequest()
try handler.perform([request])
for observation in request.results as? [VNHumanObservation] ?? [] {
let humanBounds = observation.boundingBox // Normalized rect
}
Use case : Faster than pose detection when you only need location
Composite subject on new background using Vision mask:
// 1. Get mask from Vision
let observation = request.results?.first as? VNInstanceMaskObservation
let visionMask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
// 2. Convert to CIImage
let maskImage = CIImage(cvPixelBuffer: visionMask)
// 3. Apply filter
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(sourceImage, forKey: kCIInputImageKey)
filter.setValue(maskImage, forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)
let output = filter.outputImage // Composited result
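The filter's `outputImage` is lazy; rendering it requires a CIContext. A minimal sketch (in production, reuse the context — creating one per frame is expensive):

```swift
let context = CIContext()
if let output, let cgImage = context.createCGImage(output, from: output.extent) {
    let result = UIImage(cgImage: cgImage) // composited image ready for display
}
```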
Parameters : kCIInputImageKey (foreground), kCIInputMaskImageKey, and kCIInputBackgroundImageKey, set as shown above
HDR preservation : Core Image preserves high dynamic range from the input (Vision/VisionKit masks are SDR)
Availability : iOS 13+, macOS 10.15+
Recognizes text in images with configurable accuracy/speed trade-off.
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate // Or .fast
request.recognitionLanguages = ["en-US", "de-DE"] // Order matters
request.usesLanguageCorrection = true
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for observation in request.results as? [VNRecognizedTextObservation] ?? [] {
// Get top candidates
let candidates = observation.topCandidates(3)
let bestText = candidates.first?.string ?? ""
}
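Observations arrive unordered; to assemble a readable transcript, a common pattern is sorting by vertical position (remember Vision's y-axis points up, so a larger minY is higher on the page). A sketch:

```swift
let transcript = (request.results as? [VNRecognizedTextObservation] ?? [])
    .sorted { $0.boundingBox.minY > $1.boundingBox.minY } // top of page first
    .compactMap { $0.topCandidates(1).first?.string }
    .joined(separator: "\n")
```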
| Level | Performance | Accuracy | Best For |
|---|---|---|---|
.fast | Real-time | Good | Camera feed, large text, signs |
.accurate | Slower | Excellent | Documents, receipts, handwriting |
Fast path : Character-by-character recognition (Neural Network → Character Detection)
Accurate path : Full-line ML recognition (Neural Network → Line/Word Recognition)
| Property | Type | Description |
|---|---|---|
| recognitionLevel | VNRequestTextRecognitionLevel | .fast or .accurate |
| recognitionLanguages | [String] | BCP 47 language codes, order = priority |
| usesLanguageCorrection | Bool | Use language model for correction |
| customWords | [String] | Domain-specific vocabulary |
| automaticallyDetectsLanguage | Bool | Auto-detect language (iOS 16+) |
| minimumTextHeight | Float | Min text height as fraction of image (0-1) |
| revision | Int | API version (affects supported languages) |
// Check supported languages for current settings
let languages = try VNRecognizeTextRequest.supportedRecognitionLanguages(
for: .accurate,
revision: VNRecognizeTextRequestRevision3
)
Language correction : Improves accuracy but takes processing time. Disable for codes/serial numbers.
Custom words : Add domain-specific vocabulary for better recognition (medical terms, product codes).
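For example, when reading serial numbers these settings typically go together (the vocabulary below is hypothetical):

```swift
request.usesLanguageCorrection = false        // don't "correct" codes into dictionary words
request.customWords = ["AXM-100", "AXM-200"]  // hypothetical product codes
request.minimumTextHeight = 0.05              // skip tiny background text
```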
boundingBox : Normalized rect containing recognized text
topCandidates(_:) : Returns [VNRecognizedText] ordered by confidence
| Property | Type | Description |
|---|---|---|
string | String | Recognized text |
confidence | VNConfidence | 0.0-1.0 |
boundingBox(for:) | VNRectangleObservation? | Box for substring range |
// Get bounding box for substring
let text = candidate.string
if let range = text.range(of: "invoice") {
let box = try candidate.boundingBox(for: range)
}
Availability : iOS 11+, macOS 10.13+
Detects and decodes barcodes and QR codes.
let request = VNDetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128] // Specific codes
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
for barcode in request.results as? [VNBarcodeObservation] ?? [] {
let payload = barcode.payloadStringValue
let type = barcode.symbology
let bounds = barcode.boundingBox
}
1D Barcodes :
- .codabar (iOS 15+)
- .code39, .code39Checksum, .code39FullASCII, .code39FullASCIIChecksum
- .code93, .code93i
- .code128
- .ean8, .ean13
- .gs1DataBar, .gs1DataBarExpanded, .gs1DataBarLimited (iOS 15+)
- .i2of5, .i2of5Checksum
- .itf14
- .upce

2D Codes :
- .aztec
- .dataMatrix
- .microPDF417 (iOS 15+)
- .microQR (iOS 15+)
- .pdf417
- .qr

Performance : Specifying fewer symbologies = faster detection
| Revision | iOS | Features |
|---|---|---|
| 1 | 11+ | Basic detection, one code at a time |
| 2 | 15+ | Codabar, GS1, MicroPDF, MicroQR, better ROI |
| 3 | 16+ | ML-based, multiple codes, better bounding boxes |
| Property | Type | Description |
|---|---|---|
| payloadStringValue | String? | Decoded content |
| symbology | VNBarcodeSymbology | Barcode type |
| boundingBox | CGRect | Normalized bounds |
| topLeft/topRight/bottomLeft/bottomRight | CGPoint | Corner points |
Availability : iOS 16+
Camera-based live scanner with built-in UI for text and barcodes.
// Hardware support
DataScannerViewController.isSupported
// Runtime availability (camera access, parental controls)
DataScannerViewController.isAvailable
import VisionKit
let dataTypes: Set<DataScannerViewController.RecognizedDataType> = [
.barcode(symbologies: [.qr, .ean13]),
.text(textContentType: .URL), // Or nil for all text
// .text(languages: ["ja"]) // Filter by language
]
let scanner = DataScannerViewController(
recognizedDataTypes: dataTypes,
qualityLevel: .balanced, // .fast, .balanced, .accurate
recognizesMultipleItems: true,
isHighFrameRateTrackingEnabled: true,
isPinchToZoomEnabled: true,
isGuidanceEnabled: true,
isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
try? scanner.startScanning()
}
| Type | Description |
|---|---|
.barcode(symbologies:) | Specific barcode types |
.text() | All text |
.text(languages:) | Text filtered by language |
.text(textContentType:) | Text filtered by type (URL, phone, email) |
protocol DataScannerViewControllerDelegate {
func dataScanner(_ dataScanner: DataScannerViewController,
didTapOn item: RecognizedItem)
func dataScanner(_ dataScanner: DataScannerViewController,
didAdd addedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
didUpdate updatedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
didRemove removedItems: [RecognizedItem],
allItems: [RecognizedItem])
func dataScanner(_ dataScanner: DataScannerViewController,
becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable)
}
enum RecognizedItem {
case text(RecognizedItem.Text)
case barcode(RecognizedItem.Barcode)
var id: UUID { get }
var bounds: RecognizedItem.Bounds { get }
}
// Text item
struct Text {
let transcript: String
}
// Barcode item
struct Barcode {
let payloadStringValue: String?
let observation: VNBarcodeObservation
}
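Handling a tapped item typically switches over the two cases; a minimal delegate sketch:

```swift
func dataScanner(_ dataScanner: DataScannerViewController,
                 didTapOn item: RecognizedItem) {
    switch item {
    case .text(let text):
        print(text.transcript)                      // recognized text
    case .barcode(let barcode):
        print(barcode.payloadStringValue ?? "n/a")  // decoded payload
    @unknown default:
        break
    }
}
```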
// Alternative to delegate
for await items in scanner.recognizedItems {
// Current recognized items
}
// Add custom views over recognized items
scanner.overlayContainerView.addSubview(customHighlight)
// Capture still photo
let photo = try await scanner.capturePhoto()
Availability : iOS 13+
Document scanning with automatic edge detection, perspective correction, and lighting adjustment.
import VisionKit
let camera = VNDocumentCameraViewController()
camera.delegate = self
present(camera, animated: true)
protocol VNDocumentCameraViewControllerDelegate {
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan)
func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController)
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFailWithError error: Error)
}
| Property | Type | Description |
|---|---|---|
pageCount | Int | Number of scanned pages |
imageOfPage(at:) | UIImage | Get page image at index |
title | String | User-editable title |
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
didFinishWith scan: VNDocumentCameraScan) {
controller.dismiss(animated: true)
for i in 0..<scan.pageCount {
let pageImage = scan.imageOfPage(at: i)
// Process with VNRecognizeTextRequest
}
}
Availability : iOS 15+, macOS 12+
Detects document boundaries for custom camera UIs or post-processing.
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNRectangleObservation else {
return // No document found
}
// Get corner points (normalized)
let corners = [
observation.topLeft,
observation.topRight,
observation.bottomLeft,
observation.bottomRight
]
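The detected corners feed directly into Core Image's CIPerspectiveCorrection filter to crop and de-skew the page. A sketch (corner points must be scaled from normalized to pixel coordinates first; `image` is the CIImage passed to the handler):

```swift
let size = image.extent.size
func scaled(_ p: CGPoint) -> CGPoint {
    CGPoint(x: p.x * size.width, y: p.y * size.height)
}
let corrected = image.applyingFilter("CIPerspectiveCorrection", parameters: [
    "inputTopLeft": CIVector(cgPoint: scaled(observation.topLeft)),
    "inputTopRight": CIVector(cgPoint: scaled(observation.topRight)),
    "inputBottomLeft": CIVector(cgPoint: scaled(observation.bottomLeft)),
    "inputBottomRight": CIVector(cgPoint: scaled(observation.bottomRight))
])
```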
vs VNDetectRectanglesRequest : VNDetectDocumentSegmentationRequest is machine-learning based and trained specifically on documents; VNDetectRectanglesRequest uses traditional edge detection for arbitrary rectangles.
Availability : iOS 26+, macOS 26+
Structured document understanding with semantic parsing.
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)
guard let document = observations.first?.document else {
return
}
DocumentObservation
└── document: DocumentObservation.Document
├── text: TextObservation
├── tables: [Container.Table]
├── lists: [Container.List]
└── barcodes: [Container.Barcode]
for table in document.tables {
for row in table.rows {
for cell in row {
let text = cell.content.text.transcript
let detectedData = cell.content.text.detectedData
}
}
}
for data in document.text.detectedData {
switch data.match.details {
case .emailAddress(let email):
let address = email.emailAddress
case .phoneNumber(let phone):
let number = phone.phoneNumber
case .link(let url):
let link = url
case .address(let address):
let components = address
case .date(let date):
let dateValue = date
default:
break
}
}
TextObservation
├── transcript: String
├── lines: [TextObservation.Line]
├── paragraphs: [TextObservation.Paragraph]
├── words: [TextObservation.Word]
└── detectedData: [DetectedDataObservation]
Visual Intelligence is a system-level feature (iOS 26+) that lets users point their camera at real-world objects and find matching content across apps. This is distinct from the Vision framework (VNRequest-based image analysis) covered above. Vision analyzes images within your app; Visual Intelligence lets the system invoke your app when users search with the camera or screenshots.
- Implement an IntentValueQuery
- Receive a SemanticContentDescriptor with labels and/or pixel data
- Return AppEntity results

import VisualIntelligence
import AppIntents
The core object the system provides to describe what the user is looking at.
| Property | Type | Description |
|---|---|---|
labels | [String] | Classification labels for the detected item |
pixelBuffer | CVReadOnlyPixelBuffer? | Visual data of the detected item |
Use labels for fast keyword matching against your content catalog. Use the pixel buffer for image-similarity search when labels are insufficient.
The entry point for Visual Intelligence to communicate with your app. Implement values(for:) to receive search requests and return matching entities.
struct LandmarkIntentValueQuery: IntentValueQuery {
@Dependency var modelData: ModelData
func values(for input: SemanticContentDescriptor) async throws -> [LandmarkEntity] {
if !input.labels.isEmpty {
return try await modelData.search(matching: input.labels)
}
guard let pixelBuffer = input.pixelBuffer else { return [] }
return try await modelData.search(matching: pixelBuffer)
}
}
Use @UnionValue when your app can return different entity types from a single search.
@UnionValue
enum VisualSearchResult {
case landmark(LandmarkEntity)
case collection(CollectionEntity)
}
Visual Intelligence uses your entity's DisplayRepresentation to show results. Provide a title, subtitle, and image for each result.
struct LandmarkEntity: AppEntity {
var id: String
var name: String
var location: String
static var typeDisplayRepresentation: TypeDisplayRepresentation {
TypeDisplayRepresentation(
name: LocalizedStringResource("Landmark", table: "AppIntents"),
numericFormat: "\(placeholder: .int) landmarks"
)
}
var displayRepresentation: DisplayRepresentation {
DisplayRepresentation(
title: "\(name)",
subtitle: "\(location)",
image: .init(named: thumbnailImageName)
)
}
}
When a user taps a result, your app should open to the relevant content. Provide an appLinkURL on your entity.
var appLinkURL: URL? {
URL(string: "yourapp://landmark/\(id)")
}
For large result sets, provide a VisualIntelligenceSearchIntent that opens your app's full search UI.
struct ViewMoreLandmarksIntent: AppIntent, VisualIntelligenceSearchIntent {
static var title: LocalizedStringResource = "View More Landmarks"
@Parameter(title: "Semantic Content")
var semanticContent: SemanticContentDescriptor
func perform() async throws -> some IntentResult {
// Open your app's search view with the semantic content
return .result()
}
}
Use LocalizedStringResource for all user-facing text.

| API | Platform | Purpose |
|---|---|---|
VNGenerateForegroundInstanceMaskRequest | iOS 17+ | Class-agnostic subject instances |
VNGeneratePersonInstanceMaskRequest | iOS 17+ | Up to 4 people separately |
VNGeneratePersonSegmentationRequest | iOS 15+ | All people (single mask) |
ImageAnalysisInteraction (VisionKit) | iOS 16+ | UI for subject lifting |
| API | Platform | Landmarks | Coordinates |
|---|---|---|---|
VNDetectHumanHandPoseRequest | iOS 14+ | 21 per hand | 2D normalized |
VNDetectHumanBodyPoseRequest | iOS 14+ | 18 body joints | 2D normalized |
VNDetectHumanBodyPose3DRequest | iOS 17+ | 17 body joints | 3D meters |
| API | Platform | Purpose |
|---|---|---|
VNDetectFaceRectanglesRequest | iOS 11+ | Face bounding boxes |
VNDetectFaceLandmarksRequest | iOS 11+ | Face with detailed landmarks |
VNDetectHumanRectanglesRequest | iOS 13+ | Human torso bounding boxes |
| API | Platform | Purpose |
|---|---|---|
| VNRecognizeTextRequest | iOS 13+ | Text recognition (OCR) |
| VNDetectBarcodesRequest | iOS 11+ | Barcode/QR detection |
| DataScannerViewController | iOS 16+ | Live camera scanner (text + barcodes) |
| VNDocumentCameraViewController | iOS 13+ | Document scanning with perspective correction |
| VNDetectDocumentSegmentationRequest | iOS 15+ | Programmatic document edge detection |
| RecognizeDocumentsRequest | iOS 26+ | Structured document extraction |
| API | Platform | Purpose |
|---|---|---|
SemanticContentDescriptor | iOS 26+ | Describes what the user is looking at (labels + pixel buffer) |
IntentValueQuery | iOS 26+ | Entry point for receiving visual search requests |
VisualIntelligenceSearchIntent | iOS 26+ | "More results" deep link to your app |
| Observation | Returned By |
|---|---|
| VNInstanceMaskObservation | Foreground/person instance masks |
| VNPixelBufferObservation | Person segmentation (single mask) |
| VNHumanHandPoseObservation | Hand pose |
| VNHumanBodyPoseObservation | Body pose (2D) |
| VNHumanBodyPose3DObservation | Body pose (3D) |
| VNFaceObservation | Face detection/landmarks |
| VNHumanObservation | Human rectangles |
| VNRecognizedTextObservation | Text recognition |
| VNBarcodeObservation | Barcode detection |
| VNRectangleObservation | Document segmentation |
| DocumentObservation | Structured document (iOS 26+) |
WWDC : 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2023-10048, 2020-10653, 2020-10043, 2020-10099
Docs : /vision, /visionkit, /visualintelligence, /visualintelligence/semanticcontentdescriptor, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
Skills : axiom-vision, axiom-vision-diag