ab-test-setup by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill ab-test-setup
You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
Before designing a test, understand:
- Test context
- Current state
- Constraints
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
Weak hypothesis: "Changing the button color might increase clicks."
Strong hypothesis: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
| Baseline Conversion Rate | 10% Relative Lift | 20% Relative Lift | 50% Relative Lift |
|---|---|---|---|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |
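The table's figures can be approximated with the standard normal-approximation formula for a two-proportion test. A minimal sketch (assuming a two-sided alpha of 0.05 and 80% power; results will differ slightly from the table depending on the power and alpha the table's author used):

```python
from math import sqrt

def sample_size_per_variant(baseline, relative_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant for a two-proportion z-test.

    z_alpha=1.96 -> two-sided alpha = 0.05; z_beta=0.84 -> 80% power.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(round(numerator / (p2 - p1) ** 2))

# Example: 5% baseline, aiming to detect a 20% relative lift
print(sample_size_per_variant(0.05, 0.20))
```

Note how the required sample grows quadratically as the detectable lift shrinks: halving the minimum lift roughly quadruples the sample you need.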
Duration (days) = (Sample size needed per variant × Number of variants) ÷ (Daily traffic to test page × Conversion rate)
Minimum: 1-2 business cycles (usually 1-2 weeks)
Maximum: Avoid running too long (novelty effects, external factors)
Homepage CTA test:
Pricing page test:
Signup flow test:
Best practices:
What to vary:
Headlines/Copy:
Visual Design:
CTA:
Content:
Control (A):
- Screenshot
- Description of current state
Variant (B):
- Screenshot or mockup
- Specific changes made
- Hypothesis for why this will win
Tools: PostHog, Optimizely, VWO, custom
How it works:
Best for:
Tools: PostHog, LaunchDarkly, Split, custom
How it works:
Best for:
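Server-side tools typically assign variants with a deterministic hash of the user ID, so a given user always lands in the same bucket across sessions and servers. A minimal sketch of that idea (not any specific tool's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")):
    """Deterministically bucket a user: same id + experiment -> same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return variants[int(bucket * len(variants)) % len(variants)]

print(assign_variant("user-42", "homepage-cta"))
```

Hashing the experiment name together with the user ID keeps bucketing independent across experiments, so the same user isn't systematically put in "treatment" everywhere.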
DO:
DON'T:
Looking at results before reaching sample size and stopping when you see significance leads to:
Solutions:
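The cost of peeking can be demonstrated with a quick A/A simulation: both arms are identical, yet checking a z-test at repeated interim peeks and stopping at the first p < 0.05 rejects far more often than the nominal 5%. A self-contained sketch (illustrative parameters, normal-approximation p-values):

```python
import random
from math import sqrt, erf

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference of two proportions (normal approx.)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def peeking_false_positive_rate(trials=400, peeks=10, batch=200,
                                rate=0.3, seed=1):
    """A/A test: arms are identical, so any 'significant' result is false."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if z_test_p(conv_a, n, conv_b, n) < 0.05:
                false_positives += 1
                break  # stopping at the first significant peek
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With ten peeks, the realized false-positive rate typically lands in the 15-25% range, which is why fixing the sample size (or using a proper sequential-testing method) before launch matters.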
Statistical significance ≠ practical significance: a result can clear the p-value bar while the lift is too small to be worth shipping.
- Did you reach sample size?
- Is it statistically significant?
- Is the effect size meaningful?
- Are secondary metrics consistent?
- Any guardrail concerns?
- Segment differences?
| Result | Conclusion |
|---|---|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |
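Classifying a finished test into the rows above starts from the relative lift and a two-sided p-value. A minimal sketch using the same normal approximation as the sample-size table (illustrative counts):

```python
from math import sqrt, erf

def analyze(conv_ctrl, n_ctrl, conv_var, n_var, alpha=0.05):
    """Relative lift and two-sided z-test p-value for variant vs. control."""
    p_c, p_v = conv_ctrl / n_ctrl, conv_var / n_var
    p_pool = (conv_ctrl + conv_var) / (n_ctrl + n_var)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_var))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    lift = (p_v - p_c) / p_c
    verdict = "significant" if p_value < alpha else "not significant"
    return lift, p_value, verdict

# 500/10k conversions in control vs. 590/10k in the variant
lift, p, verdict = analyze(conv_ctrl=500, n_ctrl=10_000,
                           conv_var=590, n_var=10_000)
print(f"lift={lift:+.1%}, p={p:.4f} ({verdict})")
```

Remember the checklist above: a "significant" verdict here still needs the effect size, secondary metrics, and guardrails checked before implementing the variant.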
Test Name: [Name]
Test ID: [ID in testing tool]
Dates: [Start] - [End]
Owner: [Name]
Hypothesis:
[Full hypothesis statement]
Variants:
- Control: [Description + screenshot]
- Variant: [Description + screenshot]
Results:
- Sample size: [achieved vs. target]
- Primary metric: [control] vs. [variant] ([% change], [confidence])
- Secondary metrics: [summary]
- Segment insights: [notable differences]
Decision: [Winner/Loser/Inconclusive]
Action: [What we're doing]
Learnings:
[What we learned, what to test next]
# A/B Test: [Name]
## Hypothesis
[Full hypothesis using framework]
## Test Design
- Type: A/B / A/B/n / MVT
- Duration: X weeks
- Sample size: X per variant
- Traffic allocation: 50/50
## Variants
[Control and variant descriptions with visuals]
## Metrics
- Primary: [metric and definition]
- Secondary: [list]
- Guardrails: [list]
## Implementation
- Method: Client-side / Server-side
- Tool: [Tool name]
- Dev requirements: [If any]
## Analysis Plan
- Success criteria: [What constitutes a win]
- Segment analysis: [Planned segments]
When the test is complete, provide next steps based on the results.
If you need more context:
Weekly Installs: 153
GitHub Stars: 22.6K
First Seen: Jan 25, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (132), gemini-cli (129), codex (121), cursor (118), github-copilot (117), claude-code (114)