Usability Testing Guide: Methods and Best Practices for Systematically Observing Users as They Complete Core Tasks

usability-tester by daffy0208/ai-dev-standards

89 weekly installs

21 GitHub Stars

Install command

npx skills add https://github.com/daffy0208/ai-dev-standards --skill usability-tester

Testing, User Experience, Product Management

Usability Tester

Validate that users can successfully complete core tasks through systematic observation.

Core Principle

Watch users struggle. The best way to find UX issues is to observe real users attempting real tasks. Their struggles reveal truth that surveys and analytics cannot.

Test Planning

1. Define Test Objectives

Good Objectives:
  - "Can users complete onboarding in <5 minutes?"
  - "Can users find and use the export feature?"
  - "Do users understand the pricing page?"

Bad Objectives:
  - "Test the UI" (too vague)
  - "See if users like it" (subjective, not behavioral)

2. Research Questions

Examples:
  - Where do users get stuck during sign-up?
  - Can users find the settings page?
  - Do users understand what each tier includes?
  - What errors do users encounter?

3. Identify Core Tasks

Choose 3-5 tasks that represent key user journeys:

Example Tasks (Project Management Tool):
  1. Sign up and create account
  2. Create your first project
  3. Invite a team member
  4. Assign a task to someone
  5. Export project data

4. Recruit Participants

Sample Size:
  - 5-8 users per persona
  - After 5 users, diminishing returns (Nielsen's research)
  - Test in waves: 5 users → fix issues → test 5 more

Recruitment Criteria:
  - Match target persona
  - Haven't used product before (for onboarding tests)
  - Or: Active users (for feature tests)

Incentives:
  - $50-100 per hour (B2C)
  - $100-200 per hour (B2B professionals)
  - Gift cards work well
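
The diminishing-returns note under Sample Size is usually illustrated with the Nielsen-Landauer model, in which the share of problems found by n users is 1 - (1 - L)^n. The sketch below assumes L = 0.31, the average per-user detection rate reported in that research; the rate for your own product may differ, and the function name is illustrative only.

```python
def share_of_problems_found(n_users: int, detection_rate: float = 0.31) -> float:
    """Expected share of usability problems found by n_users.

    Uses the Nielsen-Landauer model 1 - (1 - L)^n. The default L = 0.31
    is an assumed average, not a value measured for your product.
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    return 1 - (1 - detection_rate) ** n_users


if __name__ == "__main__":
    for n in (1, 3, 5, 8):
        print(f"{n} users: {share_of_problems_found(n):.0%} of problems found")
```

Under the assumed rate, five users surface roughly 84% of problems, which is why testing in waves of five tends to be more efficient than one large round.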

Task Scenarios

Best Practices

✅ Good task scenario:

"Your team is launching a new project next week. Create a project
called 'Q2 Launch' and invite john@example.com to collaborate."

Why it works:
  - Realistic context
  - Clear goal
  - Natural language
  - Doesn't give step-by-step instructions

❌ Bad task scenario:

"Click the 'New Project' button, then enter 'Q2 Launch', then
click Settings, then click Invite, then enter email."

Why it fails:
  - Step-by-step instructions
  - No context
  - Doesn't test discoverability
  - User just follows orders

Task Scenario Template

Scenario: [Context/Motivation]
Goal: [What they need to accomplish]
Success Criteria: [How to know they completed it]

Example:
  Scenario: You're preparing for a client meeting tomorrow and need to review past conversations.
  Goal: Find all conversations with "Acme Corp" from the last 30 days
  Success Criteria: User successfully uses search/filter to find conversations

Conducting Tests

Think-Aloud Protocol

Key instruction to participant:

"Please think aloud as you work. Tell me what you're looking for,
what you're thinking, what you're trying to do. There are no
wrong answers - we're testing the product, not you."

What to listen for:
  - "I'm looking for..." (what they expect)
  - "I thought this would..." (mental models)
  - "This is confusing because..." (friction points)
  - "I'm not sure if..." (uncertainty)

Facilitation Rules

✅ Do:
  - Observe silently
  - Take notes
  - Let them struggle (reveals issues)
  - Ask follow-up questions AFTER task
  - Stay neutral

❌ Don't:
  - Help or explain
  - Lead them ("maybe try clicking...")
  - Defend design choices
  - Interrupt during task
  - Show frustration

Questions to Ask After Each Task

Completion Questions:
  - 'On a scale of 1-5, how easy was that task?'
  - 'What were you expecting to see?'
  - 'What was confusing about that?'
  - 'If you could change one thing, what would it be?'

Discovery Questions:
  - 'Where did you expect to find that?'
  - 'What do you think this [feature] does?'
  - 'Why did you click there?'

Metrics to Track

Task Success Rate

Measurement:
  - Completed: User achieved goal without help
  - Partial: User achieved goal with hints
  - Failed: User could not complete task

Calculation: Task Success Rate = (Completed Tasks / Total Attempts) × 100

Target: ≥80% for core tasks
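
The calculation above can be sketched in a few lines of Python. The outcome labels ("completed", "partial", "failed") mirror the measurement categories; counting only "completed" as success follows the formula as written, and how to weight partial completions is left open here. The function name and sample data are illustrative.

```python
from collections import Counter


def task_success_rate(outcomes: list[str]) -> float:
    """Percentage of attempts where the user reached the goal without help."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    counts = Counter(outcomes)
    # Only unassisted completions count; "partial" and "failed" do not.
    return counts["completed"] / len(outcomes) * 100


if __name__ == "__main__":
    outcomes = ["completed"] * 7 + ["failed"]  # hypothetical session results
    rate = task_success_rate(outcomes)
    print(f"Success rate: {rate:.1f}% (meets 80% target: {rate >= 80})")
```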

Time on Task

Measurement:
  - Start timer when task begins
  - Stop when user completes or gives up

Analysis:
  - Compare to baseline/previous tests
  - Identify outliers (very fast or very slow)

Target: Varies by task complexity
  - Simple task (e.g., log in): <30 seconds
  - Medium task (e.g., create project): 1-2 minutes
  - Complex task (e.g., configure integration): 3-5 minutes
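
One simple way to flag very fast or very slow attempts is to compare each time against the median. The sketch below assumes a cutoff of twice (or half) the median, which is an arbitrary illustrative threshold rather than a rule from this guide; adjust it to your data.

```python
from statistics import median


def flag_outliers(times_s: list[float], factor: float = 2.0) -> list[float]:
    """Return times more than `factor` times above or below the median.

    The factor of 2.0 is an assumed threshold for illustration.
    """
    if not times_s:
        raise ValueError("no timings recorded")
    mid = median(times_s)
    return [t for t in times_s if t > mid * factor or t < mid / factor]


if __name__ == "__main__":
    times = [72, 80, 84, 90, 95, 88, 30, 210]  # hypothetical seconds per user
    print("Median:", median(times), "s")
    print("Outliers:", flag_outliers(times))
```

Outliers are worth reviewing in the session recordings: a very fast time may mean the user skipped part of the task, and a very slow one often points at a specific friction point.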

Error Rate

Errors:
  - Wrong path taken
  - Incorrect button clicked
  - Had to backtrack
  - Gave up and tried different approach

Calculation: Errors per Task = Total Errors / Number of Users

Target: <2 errors per task
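
As a minimal sketch of the error-rate calculation, the function below averages per-user error counts for a single task and checks the result against the target. The function name and sample counts are hypothetical.

```python
def errors_per_task(errors_by_user: list[int]) -> float:
    """Average number of errors per user for one task."""
    if not errors_by_user:
        raise ValueError("no users recorded")
    return sum(errors_by_user) / len(errors_by_user)


if __name__ == "__main__":
    counts = [0, 1, 2, 1, 3, 0, 1, 2]  # hypothetical errors for 8 users
    rate = errors_per_task(counts)
    print(f"Errors per task: {rate:.2f} (within target: {rate < 2})")
```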

Satisfaction Rating

Post-Task Question:
  "How satisfied are you with completing this task?" (1-5 scale)

  1 = Very Dissatisfied
  2 = Dissatisfied
  3 = Neutral
  4 = Satisfied
  5 = Very Satisfied

Target: ≥4.0 average

Issue Severity Rating

Severity Formula

Severity = Impact × Frequency

Impact Scale (1-3)

1 - Low Impact:
  - Minor inconvenience
  - User can easily recover
  - Cosmetic issue

2 - Medium Impact:
  - Causes delay or confusion
  - User eventually figures it out
  - Moderate frustration

3 - High Impact:
  - Blocks task completion
  - User cannot proceed without help
  - Critical to core functionality

Frequency Scale (1-3)

1 - Rare:
  - Only 1-2 users encountered
  - Edge case
  - Specific conditions

2 - Occasional:
  - 3-5 users encountered
  - Somewhat common
  - Specific user types

3 - Frequent:
  - Most/all users encountered
  - Consistent issue
  - All user types

Combined Severity

Critical (8-9):
  - Impact: 3, Frequency: 3
  - Blocks most users
  → Fix immediately before release

High (6-7):
  - Impact: 3, Frequency: 2 OR Impact: 2, Frequency: 3
  - Significant delay or frequent minor issue
  → Fix before release

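Medium (4-5):
  - Impact: 2, Frequency: 2 OR Impact: 3, Frequency: 1 (scores 3 by the formula; listed here as an exception because it blocks the task)
  - Minor frustration or rare blocker
  → Fix in next release

Low (1-3):
  - Impact: 1, Frequency: 1-3 OR Impact: 2, Frequency: 1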
  - Cosmetic or rare minor issue
  → Backlog
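
The rating scheme can be sketched as a small function. It multiplies impact by frequency and maps the score to the bands above. Note that the table lists Impact 3 / Frequency 1 under Medium even though its product is 3, so the sketch treats that pair as an explicit exception; this handling is an interpretation of the table, and the function name is illustrative.

```python
def severity(impact: int, frequency: int) -> tuple[int, str]:
    """Return (score, label) for an issue rated on the 1-3 scales."""
    for name, value in (("impact", impact), ("frequency", frequency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3")
    score = impact * frequency
    if (impact, frequency) == (3, 1):
        # Rare blocker: listed under Medium in the table despite scoring 3.
        return score, "Medium"
    if score >= 8:
        label = "Critical"
    elif score >= 6:
        label = "High"
    elif score >= 4:
        label = "Medium"
    else:
        label = "Low"
    return score, label


if __name__ == "__main__":
    for impact, frequency in [(3, 3), (2, 3), (2, 2), (3, 1), (1, 2)]:
        print(impact, frequency, severity(impact, frequency))
```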

System Usability Scale (SUS)

10-question survey (post-test, 1-5 Likert scale):

Questions (Odd = Positive, Even = Negative):
  1. I think I would like to use this product frequently
  2. I found the product unnecessarily complex
  3. I thought the product was easy to use
  4. I think I would need support to use this product
  5. I found the various functions well integrated
  6. I thought there was too much inconsistency
  7. I imagine most people would learn this quickly
  8. I found the product cumbersome to use
  9. I felt very confident using the product
  10. I needed to learn a lot before getting going

Scoring:
  - Odd questions: Score - 1
  - Even questions: 5 - Score
  - Sum all scores
  - Multiply by 2.5
  - Result: 0-100 score

Interpretation:
  ≥80: Excellent
  68-79: Good (industry average)
  51-67: OK
  <51: Needs significant improvement
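
The scoring steps above can be sketched as follows for one participant's responses. The function names are illustrative, and the interpretation bands simply restate the thresholds listed in this guide.

```python
def sus_score(responses: list[int]) -> float:
    """Compute a 0-100 SUS score from ten 1-5 responses, in question order."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for number, response in enumerate(responses, start=1):
        # Odd (positive) items contribute score - 1; even (negative) items 5 - score.
        total += (response - 1) if number % 2 == 1 else (5 - response)
    return total * 2.5


def interpret(score: float) -> str:
    """Map a SUS score to the bands used in this guide."""
    if score >= 80:
        return "Excellent"
    if score >= 68:
        return "Good"
    if score >= 51:
        return "OK"
    return "Needs significant improvement"


if __name__ == "__main__":
    answers = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]  # hypothetical participant
    score = sus_score(answers)
    print(score, interpret(score))
```

To report a study-level score, compute the score per participant and then average the results, rather than averaging raw responses first.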

Test Report Template

usability_test_summary:
  date: '2024-01-20'
  participants: 8
  participant_profile: 'New users, age 25-45, tech-savvy'

  tasks:
    - task: 'Create a new project'
      success_rate: '87.5% (7/8)'
      avg_time: '1m 24s'
      errors: 1.2 per user
      satisfaction: 4.3/5

    - task: 'Invite team member'
      success_rate: '62.5% (5/8)'
      avg_time: '2m 45s'
      errors: 2.8 per user
      satisfaction: 3.1/5

  issues:
    - issue: "Users can't find 'Invite' button"
      severity: critical
      impact: 3
      frequency: 3
      affected_users: 7/8
      recommendation: "Move 'Invite' button to top of project page, make it more prominent"

    - issue: 'Confusion about project vs workspace'
      severity: high
      impact: 2
      frequency: 3
      affected_users: 6/8
      recommendation: 'Add tooltip explaining difference, update onboarding'

    - issue: 'Export button text unclear'
      severity: low
      impact: 1
      frequency: 1
      affected_users: 2/8
      recommendation: "Change 'Export' to 'Export to CSV'"

  sus_score: 72 (Good)

  key_insights:
    - 'Onboarding is smooth (87.5% success)'
    - 'Team collaboration features hard to discover'
    - 'Overall product easy to use once features are found'

  recommended_actions:
    - 'Critical: Redesign invite flow'
    - 'High priority: Add contextual help for workspace vs project'
    - 'Low priority: Update button labels'

Remote vs In-Person Testing

Remote Testing (Moderated)

Tools: Zoom, Google Meet, UserTesting.com

Pros:
  - Can test with users anywhere
  - Lower cost (no travel)
  - Easier to recruit
  - Record sessions easily

Cons:
  - Can't see body language as well
  - Technical issues possible
  - Harder to build rapport
  - Screen sharing can lag

Best Practices:
  - Test your setup beforehand
  - Have backup communication method
  - Ask user to share screen + turn on camera
  - Record session (with permission)

In-Person Testing

Pros:
  - See full body language
  - Better rapport
  - No technical issues
  - Can see facial expressions

Cons:
  - Limited geographic reach
  - Higher cost
  - Harder to schedule
  - Need physical space

Best Practices:
  - Set up quiet room
  - Have snacks/water
  - Use screen recording software
  - Position yourself behind/beside user

Test Frequency

When to Test:
  - Pre-launch: Test prototypes/designs
  - Post-launch: Test new features
  - Ongoing: Test every major release
  - Quarterly: Full usability audit

Continuous Testing:
  - Week 1: Test with 5 users
  - Week 2: Fix issues
  - Week 3: Test with 5 new users
  - Repeat until success rate ≥80%

Tools & Software

Remote Testing:
  - UserTesting.com (recruit + test)
  - UserZoom (enterprise solution)
  - Lookback (live testing)
  - Maze (unmoderated testing)

Recording:
  - Zoom (screen + audio)
  - Loom (quick recordings)
  - OBS (advanced recording)

Analysis:
  - Dovetail (organize insights)
  - Notion (collaborative notes)
  - Miro (affinity mapping)
  - Excel/Sheets (metrics tracking)

Quick Start Checklist

Planning Phase

  - Define test objectives
  - Write 3-5 task scenarios
  - Recruit 5-8 participants
  - Prepare test script
  - Set up recording

Testing Phase

  - Welcome participant
  - Explain think-aloud protocol
  - Conduct tasks (don't help!)
  - Ask follow-up questions
  - Administer SUS survey
  - Thank participant

Analysis Phase

  - Calculate success rates
  - Identify common issues
  - Rate issue severity
  - Create report
  - Share with team
  - Prioritize fixes

Common Pitfalls

❌ Testing with employees: They know the product too well
❌ Helping users during tasks: Let them struggle to find real issues
❌ Only testing happy path: Test error cases and edge cases too
❌ Not enough participants: 5 minimum per persona
❌ Ignoring low-severity issues: They add up to poor experience
❌ Testing but not fixing: Usability tests are worthless if you don't act

Summary

Great usability testing:

✅ Test with 5-8 users per persona
✅ Use realistic task scenarios (not step-by-step)
✅ Think-aloud protocol (understand mental models)
✅ Don't help users during tasks
✅ Track success rate, time, errors, satisfaction
✅ Rate issues by severity (impact × frequency)
✅ Fix high-priority issues before release
✅ Test continuously, not just once

Repository: daffy0208/ai-de…tandards
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: gemini-cli (76), opencode (76), codex (73), cursor (70), claude-code (65), github-copilot (64)
