npx skills add https://github.com/borghei/claude-skills --skill ab-test-setup
Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing
A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.
1. HYPOTHESIZE → 2. DESIGN → 3. CALCULATE → 4. IMPLEMENT
      ↑                                           │
      │                                           ▼
7. ITERATE ← 6. DOCUMENT ← 5. ANALYZE ← [Run to completion]
Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].
We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.
| Quality | Hypothesis | Problem |
|---|---|---|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |
| Source | What to Look For | Example |
|---|---|---|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |
| Type | Variants | Traffic Need | Best For |
|---|---|---|---|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |
Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.
| Category | High Impact | Medium Impact | Low Impact |
|---|---|---|---|
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |
Every test needs three types of metrics:

Primary Metric (1 only): the single metric your ship/no-ship decision is based on.
Secondary Metrics (2-3): supporting metrics that help explain why the primary moved.
Guardrail Metrics (1-3): metrics that must not degrade (e.g. page load time, unsubscribe rate).
Minimum visitors PER VARIANT needed (95% confidence, 80% power):
| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---|---|---|---|---|---|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page
Minimum duration: 7 days (to capture day-of-week effects) Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
| Situation | Solution |
|---|---|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |
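The table and duration formula above can be reproduced with a short stdlib-only Python sketch (normal-approximation sample size for a two-proportion test). Expect small differences from the table, since calculators use slightly different approximations:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant for a two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

def duration_days(per_variant: int, variants: int, daily_traffic: int) -> int:
    """Duration (days) = sample size per variant * number of variants / daily traffic."""
    return ceil(per_variant * variants / daily_traffic)

n = sample_size_per_variant(0.05, 0.20)  # 5% baseline, 20% relative lift
print(n, duration_days(n, 2, 1000))      # ~8,200 per variant; ~17 days at 1,000 visitors/day
```

The 1,000 visitors/day figure is only an illustration; plug in your own page traffic.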
JavaScript modifies the page after initial render.
Pros: Quick to implement, no deploy needed Cons: Can cause flicker (flash of original content), blocked by ad blockers Tools: PostHog, Optimizely, VWO, Google Optimize
Anti-flicker pattern:
<!-- In <head>, before any rendering: -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>
  document.documentElement.classList.add('ab-test-hide');
  // Safety net: un-hide after 2s even if the experiment script fails to load
  setTimeout(function () {
    document.documentElement.classList.remove('ab-test-hide');
  }, 2000);
</script>

// In your test script (runs after variant assignment):
document.documentElement.classList.remove('ab-test-hide');
Variant determined before page renders. No flicker, no client-side dependency.
Pros: No flicker, not blocked by ad blockers, works for logged-in features Cons: Requires engineering work, deploy needed Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags
Basic feature flag pattern:
# Server-side variant assignment
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures the same user always sees the same variant
    hash_input = f"{user_id}:{experiment}"
    # MD5 is fine here: this is bucketing, not cryptography
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    if bucket < 50:
        return "control"
    else:
        return "variant"
| Strategy | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |
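The 50/50 assignment above generalizes to the conservative and ramped splits in the table. A hypothetical sketch (the `control_pct` parameter and function name are illustrative, not from the original):

```python
import hashlib

def get_variant_weighted(user_id: str, experiment: str, control_pct: int = 90) -> str:
    """Deterministic assignment with an uneven split, e.g. 90/10 for risky changes."""
    hash_value = hashlib.md5(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100
    return "control" if bucket < control_pct else "variant"

# Ramping: increase the variant's share by lowering control_pct over time.
# Because bucketing is deterministic on (user_id, experiment), users already in
# the variant stay there as the split widens; never rename the experiment mid-test.
print(get_variant_weighted("user-123", "cta-size-test", control_pct=95))
```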
Critical rules:

DO:
- Keep assignment deterministic and sticky: the same user must see the same variant for the entire test
- Log the exposure event at the moment of assignment, not at conversion

DO NOT:
- Pool data gathered before and after an allocation change (ramping is fine for rollout safety, but measure from the final split)
- Remove users after assignment based on post-assignment behavior (survivorship bias)
Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a 25-40% false positive rate (vs the intended 5%).
Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.
Solutions:
- Calculate the sample size before launch and commit to it; ignore interim significance
- Check guardrails for harm during the test, but make the ship/no-ship call only at the planned end
- If you genuinely need continuous monitoring, use a sequential testing method designed for repeated looks
| Result | Primary Metric | Confidence | Action |
|---|---|---|---|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |
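The p-values in the table come from comparing conversion rates between control and variant. A minimal two-sided, pooled two-proportion z-test sketch (stdlib only; the counts below are hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_control: int, n_control: int,
                           conv_variant: int, n_variant: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    p_pool = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: control 500/10,000 (5.0%) vs variant 600/10,000 (6.0%)
p = two_proportion_p_value(500, 10_000, 600, 10_000)
print(round(p, 4))  # p < 0.01: a "clear winner" by the table's thresholds
```

Report the confidence interval on the lift alongside the p-value, as the common-mistakes table below insists.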
| Mistake | Consequence | Prevention |
|---|---|---|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
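The Bonferroni correction for A/B/n from the table amounts to tightening the per-comparison threshold. A sketch with made-up p-values (each variant compared against control):

```python
# Hypothetical per-variant p-values from an A/B/n test with 3 variants
p_values = {"variant_a": 0.030, "variant_b": 0.041, "variant_c": 0.012}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni: 0.05 / 3 ~= 0.0167

significant = {name: p < adjusted_alpha for name, p in p_values.items()}
print(significant)  # only variant_c clears the corrected threshold
```

Note that all three would look "significant" at the naive 0.05 cutoff, which is exactly the inflated Type I error the correction prevents.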
Every test must be documented, regardless of outcome.
EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]
HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].
VARIANTS:
- Control: [description]
- Variant: [description + screenshot]
METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]
RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]
DECISION: [Ship variant / Keep control / Iterate]
LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]
| Factor | Score (1-10) | Question |
|---|---|---|
| Impact | How much will this move the metric? | Big change to primary KPI = 10 |
| Confidence | How sure are we it will work? | Strong data supporting hypothesis = 10 |
| Ease | How easy is it to implement and measure? | Can ship in a day = 10 |
ICE Score = (Impact + Confidence + Ease) / 3
Rank all test ideas by ICE score. Run highest first.
| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|---|---|---|---|---|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |
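The ICE formula above is simple enough to automate for a backlog. The ideas and individual impact/confidence/ease scores below are illustrative placeholders:

```python
# Hypothetical backlog; the 1-10 scores are illustrative, not from the original.
ideas = [
    {"name": "Larger CTA increases signups", "impact": 9, "confidence": 8, "ease": 8},
    {"name": "Social proof on pricing page", "impact": 8, "confidence": 6, "ease": 7},
    {"name": "Shorter onboarding flow",      "impact": 9, "confidence": 6, "ease": 5},
]

# ICE Score = (Impact + Confidence + Ease) / 3
for idea in ideas:
    idea["ice"] = round((idea["impact"] + idea["confidence"] + idea["ease"]) / 3, 1)

# Rank by ICE, highest first
ideas.sort(key=lambda i: i["ice"], reverse=True)
for rank, idea in enumerate(ideas, start=1):
    print(rank, idea["name"], idea["ice"])
```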
| Skill | Use When |
|---|---|
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |