npx skills add https://github.com/borghei/claude-skills --skill ab-test-setup
Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing
A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.
1. HYPOTHESIZE → 2. DESIGN → 3. CALCULATE → 4. IMPLEMENT
      ↑                                           │
      │                                           ▼
7. ITERATE ← 6. DOCUMENT ← 5. ANALYZE ← [Run to completion]
Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].
We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.
| Quality | Hypothesis | Problem |
|---|---|---|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |
| Source | What to Look For | Example |
|---|---|---|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |
| Type | Variants | Traffic Need | Best For |
|---|---|---|---|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |
Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.
| Category | High Impact | Medium Impact | Low Impact |
|---|---|---|---|
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |
Every test needs three types of metrics:

Primary Metric (1 only): the single metric your ship/no-ship decision is based on.
Secondary Metrics (2-3): supporting metrics that help explain why the primary moved.
Guardrail Metrics (1-3): metrics that must not degrade (e.g. page load time, unsubscribe rate).
Minimum visitors PER VARIANT needed (95% confidence, 80% power):
| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---|---|---|---|---|---|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page
Minimum duration: 7 days (to capture day-of-week effects) Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
| Situation | Solution |
|---|---|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |
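The table and duration formula above can be reproduced with a short stdlib-only Python sketch (normal-approximation sample size for a two-proportion test). Expect small differences from the table, since calculators use slightly different approximations:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant for a two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def duration_days(per_variant: int, variants: int, daily_traffic: int) -> int:
    """Duration (days) = sample size per variant * number of variants / daily traffic."""
    return ceil(per_variant * variants / daily_traffic)

n = sample_size_per_variant(0.05, 0.20)  # 5% baseline, 20% relative lift
print(n, duration_days(n, 2, 1000))      # ~8,200 per variant; ~17 days at 1,000 visitors/day
```

The 1,000 visitors/day figure is only an illustration; plug in your own page traffic.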
JavaScript modifies the page after initial render.
Pros: Quick to implement, no deploy needed Cons: Can cause flicker (flash of original content), blocked by ad blockers Tools: PostHog, Optimizely, VWO, Google Optimize
Anti-flicker pattern:
<!-- In <head>, before any rendering: -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>
  document.documentElement.classList.add('ab-test-hide');
  // Safety net: un-hide after 2s even if the experiment script fails to load
  setTimeout(function () {
    document.documentElement.classList.remove('ab-test-hide');
  }, 2000);
</script>

// In your test script (runs after variant assignment):
document.documentElement.classList.remove('ab-test-hide');
Variant determined before page renders. No flicker, no client-side dependency.
Pros: No flicker, not blocked by ad blockers, works for logged-in features Cons: Requires engineering work, deploy needed Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags
Basic feature flag pattern:
# Server-side variant assignment
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures the same user always sees the same variant
    hash_input = f"{user_id}:{experiment}"
    # MD5 is fine here: this is bucketing, not cryptography
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    if bucket < 50:
        return "control"
    else:
        return "variant"
| Strategy | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |
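The 50/50 assignment above generalizes to the conservative and ramped splits in the table. A hypothetical sketch (the `control_pct` parameter and function name are illustrative, not from the original):

```python
import hashlib

def get_variant_weighted(user_id: str, experiment: str, control_pct: int = 90) -> str:
    """Deterministic assignment with an uneven split, e.g. 90/10 for risky changes."""
    hash_value = hashlib.md5(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    return "control" if bucket < control_pct else "variant"

# Ramping: increase the variant's share by lowering control_pct over time.
# Because bucketing is deterministic on (user_id, experiment), users already in
# the variant stay there as the split widens; never rename the experiment mid-test.
print(get_variant_weighted("user-123", "cta-size-test", control_pct=95))
```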
Critical rules:

DO:
- Keep assignment deterministic and sticky: the same user must see the same variant for the entire test
- Log the exposure event at the moment of assignment, not at conversion

DO NOT:
- Pool data gathered before and after an allocation change (ramping is fine for rollout safety, but measure from the final split)
- Remove users after assignment based on post-assignment behavior (survivorship bias)
Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a 25-40% false positive rate (vs the intended 5%).
Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.
Solutions:
- Calculate the sample size before launch and commit to it; ignore interim significance
- Check guardrails for harm during the test, but make the ship/no-ship call only at the planned end
- If you genuinely need continuous monitoring, use a sequential testing method designed for repeated looks
| Result | Primary Metric | Confidence | Action |
|---|---|---|---|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |
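The p-values in the table come from comparing conversion rates between control and variant. A minimal two-sided, pooled two-proportion z-test sketch (stdlib only; the counts below are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_control: int, n_control: int,
                           conv_variant: int, n_variant: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    p_pool = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: control 500/10,000 (5.0%) vs variant 600/10,000 (6.0%)
p = two_proportion_p_value(500, 10_000, 600, 10_000)
print(round(p, 4))  # p < 0.01: a "clear winner" by the table's thresholds
```

Report the confidence interval on the lift alongside the p-value, as the common-mistakes table below insists.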
| Mistake | Consequence | Prevention |
|---|---|---|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
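The Bonferroni correction for A/B/n from the table amounts to tightening the per-comparison threshold. A sketch with made-up p-values (each variant compared against control):

```python
# Hypothetical per-variant p-values from an A/B/n test with 3 variants
p_values = {"variant_a": 0.030, "variant_b": 0.041, "variant_c": 0.012}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni: 0.05 / 3 ~= 0.0167

significant = {name: p < adjusted_alpha for name, p in p_values.items()}
print(significant)  # only variant_c clears the corrected threshold
```

Note that all three would look "significant" at the naive 0.05 cutoff, which is exactly the inflated Type I error the correction prevents.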
Every test must be documented, regardless of outcome.
EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]
HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].
VARIANTS:
- Control: [description]
- Variant: [description + screenshot]
METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]
RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]
DECISION: [Ship variant / Keep control / Iterate]
LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]
| Factor | Score (1-10) | Question |
|---|---|---|
| Impact | How much will this move the metric? | Big change to primary KPI = 10 |
| Confidence | How sure are we it will work? | Strong data supporting hypothesis = 10 |
| Ease | How easy is it to implement and measure? | Can ship in a day = 10 |
ICE Score = (Impact + Confidence + Ease) / 3
Rank all test ideas by ICE score. Run highest first.
| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|---|---|---|---|---|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |
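The ICE formula above is simple enough to automate for a backlog. The ideas and individual impact/confidence/ease scores below are illustrative placeholders:

```python
# Hypothetical backlog; the 1-10 scores are illustrative, not from the original.
ideas = [
    {"name": "Larger CTA increases signups", "impact": 9, "confidence": 8, "ease": 8},
    {"name": "Social proof on pricing page", "impact": 8, "confidence": 6, "ease": 7},
    {"name": "Shorter onboarding flow",      "impact": 9, "confidence": 6, "ease": 5},
]

# ICE Score = (Impact + Confidence + Ease) / 3
for idea in ideas:
    idea["ice"] = round((idea["impact"] + idea["confidence"] + idea["ease"]) / 3, 1)

# Rank by ICE, highest first
ideas.sort(key=lambda i: i["ice"], reverse=True)
for rank, idea in enumerate(ideas, start=1):
    print(rank, idea["name"], idea["ice"])
```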
| Skill | Use When |
|---|---|
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |