backtest-expert by tradermonty/claude-trading-skills
npx skills add https://github.com/tradermonty/claude-trading-skills --skill backtest-expert
Systematic approach to backtesting trading strategies based on professional methodology that prioritizes robustness over optimistic results.
Goal: Find strategies that "break the least", not strategies that "profit the most" on paper.
Principle: Add friction, stress-test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.
Use this skill when:
Define the edge in one sentence.
Example: "Stocks that gap up >3% on earnings and pull back to the previous day's close within the first hour provide a mean-reversion opportunity."
If you can't articulate the edge clearly, don't proceed to testing.
Define with complete specificity:
Critical: No subjective judgment allowed. Every decision must be rule-based and unambiguous.
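As an illustration of "no subjective judgment", the rules can be captured as a plain data structure plus a crude completeness check. All field names and thresholds below are hypothetical, not part of the skill itself:

```python
# Hypothetical sketch: a fully specified, rule-based strategy definition.
# Every field is a concrete number or rule -- nothing left to judgment.
strategy_spec = {
    "universe": "US common stocks, price > 5 USD, avg daily volume > 1M shares",
    "entry": "gap up > 3.0% at open after earnings, then price touches prior close within first 60 min",
    "position_size_pct": 2.0,      # percent of equity per trade
    "stop_loss_pct": 2.0,          # fixed stop below entry
    "profit_target_pct": 1.5,      # fixed target above entry
    "max_holding_minutes": 120,    # time-based exit if neither level is hit
}

def is_fully_specified(spec: dict) -> bool:
    """Crude completeness check: every required field present and non-empty."""
    required = {"universe", "entry", "position_size_pct", "stop_loss_pct",
                "profit_target_pct", "max_holding_minutes"}
    return required.issubset(spec) and all(spec[k] not in (None, "") for k in required)

print(is_fully_specified(strategy_spec))  # True
```

A spec that fails this check (e.g. `{"entry": "buy dips"}`) still contains judgment calls and is not ready for testing.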
Test over:
Examine initial results for basic viability. If fundamentally broken, iterate on hypothesis.
This is where 80% of testing time should be spent.
Parameter sensitivity:
Execution friction:
Time robustness:
Sample size:
Walk-forward analysis:
Warning signs:
Questions to answer:
Decision criteria:
Use the evaluation script for a structured, quantitative assessment:
python3 skills/backtest-expert/scripts/evaluate_backtest.py \
--total-trades 150 \
--win-rate 62 \
--avg-win-pct 1.8 \
--avg-loss-pct 1.2 \
--max-drawdown-pct 15 \
--years-tested 8 \
--num-parameters 3 \
--slippage-tested \
--output-dir reports/
The script scores across 5 dimensions (Sample Size, Expectancy, Risk Management, Robustness, Execution Realism), detects red flags, and outputs a Deploy/Refine/Abandon verdict.
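The expectancy implied by the numbers passed to the script above can be sanity-checked by hand. The formula below is the standard expectancy calculation (an assumption here; the script's internal scoring is not shown):

```python
# Expectancy per trade from the CLI example's numbers.
# Assumed formula: expectancy = win_rate * avg_win - (1 - win_rate) * avg_loss
win_rate = 0.62
avg_win_pct = 1.8
avg_loss_pct = 1.2

expectancy_pct = win_rate * avg_win_pct - (1 - win_rate) * avg_loss_pct
print(f"Expectancy: {expectancy_pct:.3f}% per trade")  # Expectancy: 0.660% per trade
```

A positive expectancy is necessary but not sufficient; it still has to survive the friction and robustness checks below.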
Add friction everywhere:
Rationale: Strategies that survive pessimistic assumptions often outperform in live trading.
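A minimal sketch of what "adding friction" can look like in code. The slippage and commission figures are illustrative assumptions, not recommendations:

```python
# Apply pessimistic round-trip friction to a list of raw per-trade returns (%).
def apply_friction(trade_returns_pct, slippage_pct=0.10, commission_pct=0.05):
    """Subtract assumed slippage and commission from every trade's return."""
    cost = slippage_pct + commission_pct
    return [r - cost for r in trade_returns_pct]

raw = [1.8, -1.2, 0.9, 1.8, -1.2]   # fabricated example trades
net = apply_friction(raw)
print(sum(raw) / len(raw), sum(net) / len(net))
```

If the edge disappears under these haircuts, it was probably never large enough to trade.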
Look for parameter ranges where performance is stable, not optimal values that create performance spikes.
Good: Strategy is profitable with a stop loss anywhere from 1.5% to 3.0%.
Bad: Strategy only works with a stop loss at exactly 2.13%.
Stable performance indicates genuine edge; narrow optima suggest curve-fitting.
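The plateau-vs-spike distinction can be sketched as a parameter sweep. `backtest` below is a toy stand-in for a real backtest, shaped so that profits form a broad plateau:

```python
# Sweep the stop-loss parameter and look for a stable profitable plateau.
def backtest(stop_loss_pct):
    # Toy response surface: broadly profitable around a 2.25% stop.
    return 1.0 - (stop_loss_pct - 2.25) ** 2

grid = [round(1.0 + 0.25 * i, 2) for i in range(9)]   # 1.00% .. 3.00%
results = {sl: backtest(sl) for sl in grid}
plateau = [sl for sl, pnl in results.items() if pnl > 0]
print(plateau)  # a contiguous profitable range, not a single spike
```

A real sweep would replace the toy function with your backtest and also check that the metric varies smoothly across the plateau, not just that it stays positive.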
Wrong approach: Study hand-picked "market leaders" that worked.
Right approach: Test every stock that met the criteria, including those that failed.
Selective examples create survivorship bias and overestimate strategy quality.
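A small illustration of how survivorship filtering inflates results. Symbols, gaps, and returns are fabricated:

```python
# Build the test universe from the rule, not from memorable winners.
candidates = [
    {"symbol": "AAA", "gap_pct": 4.1, "delisted": False, "return_pct":  2.0},
    {"symbol": "BBB", "gap_pct": 3.5, "delisted": True,  "return_pct": -8.0},
    {"symbol": "CCC", "gap_pct": 2.0, "delisted": False, "return_pct":  1.0},
    {"symbol": "DDD", "gap_pct": 5.0, "delisted": False, "return_pct": -1.5},
]

# Right: everything meeting the criterion, including names that later delisted.
universe = [c for c in candidates if c["gap_pct"] > 3.0]
avg_all = sum(c["return_pct"] for c in universe) / len(universe)

# Wrong: quietly dropping the failures flips the sign of the result.
survivors = [c for c in universe if not c["delisted"]]
avg_survivors = sum(c["return_pct"] for c in survivors) / len(survivors)
print(avg_all, avg_survivors)
```

Here the honest average is negative while the survivor-only average looks profitable, which is exactly the overestimate the text warns about.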
Intuition: Useful for generating hypotheses.
Validation: Must be purely data-driven.
Never let attachment to an idea influence interpretation of test results.
Recognize these patterns early to save time:
See references/failed_tests.md for detailed examples and diagnostic framework.
reports/backtest_eval_<timestamp>.json — structured evaluation with per-dimension scores, red flags, and verdict
reports/backtest_eval_<timestamp>.md — human-readable report with dimension table, key metrics, and red flag details
File: references/methodology.md
When to read: For detailed guidance on specific testing techniques.
Contents:
File: references/failed_tests.md
When to read: When a strategy fails tests, or when learning from past mistakes.
Contents:
Time allocation: Spend 20% of your time generating ideas and 80% trying to break them.
Context-free requirement: If a strategy requires "perfect context" to work, it's not robust enough for systematic trading.
Red flag: If backtest results look too good to be true (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues.
Tool limitations: Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues).
Statistical significance: Small edges require large sample sizes to prove. A 5% edge per trade needs 100+ trades to distinguish from luck.
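The 100-trade figure can be reproduced with a back-of-envelope t-statistic calculation, assuming (illustratively) 25% per-trade return volatility and requiring the mean edge to sit two standard errors above zero:

```python
import math

# Rough sample-size check: how many trades before a mean edge of edge_pct
# per trade stands t_stat standard errors above zero?
# Derivation: t = mean * sqrt(n) / stdev >= t_stat  =>  n >= (t_stat * stdev / mean)^2
def trades_needed(edge_pct, stdev_pct, t_stat=2.0):
    return math.ceil((t_stat * stdev_pct / edge_pct) ** 2)

# 5% edge, assumed 25% per-trade volatility -> 100 trades
print(trades_needed(5, 25))  # 100
```

The volatility input dominates the answer; plug in your own strategy's per-trade return standard deviation rather than the illustrative 25%.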
This skill focuses on systematic/quantitative backtesting where:
Discretionary traders do their research differently; this skill may not apply to setups requiring subjective judgment.
Weekly Installs
287
Repository
GitHub Stars
398
First Seen
Jan 26, 2026
Security Audits
Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass
Installed on
gemini-cli: 269
opencode: 269
codex: 264
cursor: 263
github-copilot: 262
kimi-cli: 258