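A/B Test Setup Guide: From Hypothesis to Analysis, Designing Statistically Valid Experiments

ab-test-setup by coreyhaines31/marketingskills

21,400 weekly installs

15,900 GitHub Stars

GitHub

Install command

npx skills add https://github.com/coreyhaines31/marketingskills --skill ab-test-setup

Data Analysis, Marketing, Productivity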



January 19, 2026

🇺🇸English

A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

Initial Assessment

Check for product marketing context first: If .agents/product-marketing-context.md exists (or .claude/product-marketing-context.md in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

Test Context - What are you trying to improve? What change are you considering?
Current State - Baseline conversion rate? Current traffic volume?
Constraints - Technical complexity? Timeline? Tools available?

Core Principles

1. Start with a Hypothesis

Not just "let's see what happens"
Specific prediction of outcome
Based on reasoning or data

2. Test One Thing

Single variable per test
Otherwise you don't know what worked

3. Statistical Rigor

Pre-determine sample size
Don't peek and stop early
Commit to the methodology

4. Measure What Matters

Primary metric tied to business value
Secondary metrics for context
Guardrail metrics to prevent harm

Hypothesis Framework

Structure

Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].

Example

Weak: "Changing the button color might increase clicks."

Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using a contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
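The structure above can be captured as a small record so that an empty slot is easy to spot. This is a minimal sketch: the `Hypothesis` class and its field names are illustrative and not part of the skill itself.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One field per slot in the hypothesis structure (names are illustrative)."""
    observation: str
    change: str
    expected_outcome: str
    audience: str
    metrics: str

    def render(self) -> str:
        # Fill the template in the same order as the structure.
        return (
            f"Because {self.observation}, "
            f"we believe {self.change} "
            f"will cause {self.expected_outcome} "
            f"for {self.audience}. "
            f"We'll know this is true when {self.metrics}."
        )

    def missing_fields(self) -> list[str]:
        # A weak hypothesis usually leaves one or more slots empty.
        return [name for name, value in vars(self).items() if not value.strip()]


strong = Hypothesis(
    observation="users report difficulty finding the CTA",
    change="making the button larger and using a contrasting color",
    expected_outcome="a 15%+ increase in CTA clicks",
    audience="new visitors",
    metrics="click-through rate from page view to signup start rises",
)
print(strong.render())
```

A hypothesis like the weak example would leave `observation`, `audience`, and `metrics` blank, which `missing_fields()` reports.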

Test Types

Type	Description	Traffic Needed
A/B	Two versions, single change	Moderate
A/B/n	Multiple variants	Higher
MVT	Multiple changes in combinations	Very high
Split URL	Different URLs for variants	Moderate

Sample Size

Quick Reference

Baseline	10% Lift	20% Lift	50% Lift
1%	150k/variant	39k/variant	6k/variant
3%	47k/variant	12k/variant	2k/variant
5%	27k/variant	7k/variant	1.2k/variant
10%	12k/variant	3k/variant	550/variant

Calculators:

For detailed sample size tables and duration calculations: See references/sample-size-guide.md
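The quick-reference table does not state its significance level or power, so the figures can only be approximated. The sketch below uses the standard two-proportion sample size formula and assumes a two-sided test at alpha = 0.05 with 80% power, treating each lift as relative to the baseline. Under those assumptions it lands in the same order of magnitude as the table, not on identical numbers. The function names and the traffic figure in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift (assumed two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant, variants, daily_visitors):
    """Days needed if all daily visitors enter the test and are split evenly."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline=0.05, relative_lift=0.20)
print(n, duration_days(n, variants=2, daily_visitors=1000))
```

Halving the detectable lift roughly quadruples the required sample, which is why small changes are so expensive to test on low-traffic pages.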

Metrics Selection

Primary Metric

Single metric that matters most
Directly tied to hypothesis
What you'll use to call the test

Secondary Metrics

Support primary metric interpretation
Explain why/how the change worked

Guardrail Metrics

Things that shouldn't get worse
Stop test if significantly negative

Example: Pricing Page Test

Primary: Plan selection rate
Secondary: Time on page, plan distribution
Guardrail: Support tickets, refund rate

Designing Variants

What to Vary

Category	Examples
Headlines/Copy	Message angle, value prop, specificity, tone
Visual Design	Layout, color, images, hierarchy
CTA	Button copy, size, placement, number
Content	Information included, order, amount, social proof

Best Practices

Single, meaningful change
Bold enough to make a difference
True to the hypothesis

Traffic Allocation

Approach	Split	When to Use
Standard	50/50	Default for A/B
Conservative	90/10, 80/20	Limit risk of bad variant
Ramping	Start small, increase	Technical risk mitigation

Considerations:

Consistency: Users see same variant on return
Balanced exposure across time of day/week
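One common way to get both a fixed split and return-visit consistency is to hash a stable user identifier into a bucket. The sketch below assumes such an identifier exists. `assign_variant`, the experiment name, and the weights are illustrative and do not come from any particular testing tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant; weights should sum to 1."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a point in [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    # Guard against floating-point rounding on the last bucket.
    return list(weights)[-1]


conservative = {"control": 0.9, "variant": 0.1}
print(assign_variant("user-123", "pricing-page-test", conservative))
```

Because the assignment depends only on the experiment name and the user ID, the same user gets the same variant on every visit, and including the experiment name keeps buckets independent across tests.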

Implementation

Client-Side

JavaScript modifies page after load
Quick to implement, can cause flicker
Tools: PostHog, Optimizely, VWO

Server-Side

Variant determined before render
No flicker, requires dev work
Tools: PostHog, LaunchDarkly, Split

Running the Test

Pre-Launch Checklist

Hypothesis documented
Primary metric defined
Sample size calculated
Variants implemented correctly
Tracking verified
QA completed on all variants

During the Test

DO:

Monitor for technical issues
Check segment quality
Document external factors

Avoid:

Peeking at results and stopping early
Making changes to variants
Adding traffic from new sources

The Peeking Problem

Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
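A small simulation illustrates the effect. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares how often a test would be called significant if it were stopped at the first significant interim look versus only at the planned sample size. All parameters (number of looks, rate, seed) are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(sims=300, n=2000, looks=10, rate=0.2, alpha=0.05, seed=1):
    """Return (rate with peeking, rate with a single final look) for A/A tests."""
    rng = random.Random(seed)
    step = n // looks
    peeking_hits = 0
    final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        significant_at_some_look = False
        for i in range(1, n + 1):
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            # An interim look every `step` visitors per arm.
            if i % step == 0 and p_value(conv_a, i, conv_b, i) < alpha:
                significant_at_some_look = True
        peeking_hits += int(significant_at_some_look)
        final_hits += int(p_value(conv_a, n, conv_b, n) < alpha)
    return peeking_hits / sims, final_hits / sims


with_peeking, final_only = false_positive_rates()
print(f"peeking: {with_peeking:.1%}, final look only: {final_only:.1%}")
```

Since there is no real difference between the arms, every "significant" result here is a false positive. The peeking rate is typically several times higher than the rate from a single look at the planned sample size.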

Analyzing Results

Statistical Significance

95% confidence = p-value < 0.05
Means that if there were truly no difference, a result at least this extreme would occur less than 5% of the time
Not a guarantee, just a threshold
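For a conversion-rate test, the p-value is often computed with a two-proportion z-test. The sketch below is one such calculation using only the standard library. The function name and the visitor counts are made up for illustration, and a testing tool may use a different method.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pool the data under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: control converts 500 of 10,000, variant 580 of 10,000.
p = two_proportion_p_value(500, 10_000, 580, 10_000)
print(f"p-value: {p:.4f}, significant at 95%: {p < 0.05}")
```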

Analysis Checklist

Reached sample size? If not, result is preliminary
Statistically significant? Check confidence intervals
Effect size meaningful? Compare to MDE, project impact
Secondary metrics consistent? Support the primary?
Guardrail concerns? Anything get worse?
Segment differences? Mobile vs. desktop? New vs. returning?
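The confidence-interval and effect-size checks can be sketched together: compute an interval for the absolute difference in conversion rates, then see whether it excludes zero and how it compares with the minimum detectable effect (MDE). This uses a normal-approximation (Wald) interval, which is an assumption of the sketch. The data and the MDE value are hypothetical.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (variant rate - control rate)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se


# Hypothetical data: control 500/10,000, variant 580/10,000, MDE of 0.5 points.
low, high = diff_confidence_interval(500, 10_000, 580, 10_000)
mde = 0.005
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
print("excludes zero:", low > 0 or high < 0)
print("whole interval clears the MDE:", low >= mde)
```

An interval that excludes zero but still dips below the MDE means the change is probably real but may be smaller than the improvement the test was designed to detect.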

Interpreting Results

Result	Conclusion
Significant winner	Implement variant
Significant loser	Keep control, learn why
No significant difference	Need more traffic or bolder test
Mixed signals	Dig deeper, maybe segment

Documentation

Document every test with:

Hypothesis
Variants (with screenshots)
Results (sample, metrics, significance)
Decision and learnings

For templates: See references/test-templates.md

Common Mistakes

Test Design

Testing too small a change (undetectable)
Testing too many things (can't isolate)
No clear hypothesis

Execution

Stopping early
Changing things mid-test
Not checking implementation

Analysis

Ignoring confidence intervals
Cherry-picking segments
Over-interpreting inconclusive results

Task-Specific Questions

What's your current conversion rate?
How much traffic does this page get?
What change are you considering and why?
What's the smallest improvement worth detecting?
What tools do you have for testing?
Have you tested this area before?

Related Skills

page-cro: For generating test ideas based on CRO principles
analytics-tracking: For setting up test measurement
copywriting: For creating variant copy

Weekly Installs

14.2K

Repository

coreyhaines31/m…ngskills

GitHub Stars

12.6K

First Seen

Jan 19, 2026

Security Audits

Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass

Installed on

opencode: 11.5K

claude-code: 11.3K

gemini-cli: 11.2K

codex: 11.0K

cursor: 10.3K

github-copilot: 9.9K
