meta-cognitive-reasoning by 89jobrien/steve
npx skills add https://github.com/89jobrien/steve --skill meta-cognitive-reasoning
This skill provides disciplined reasoning frameworks for avoiding cognitive failures in analysis, reviews, and decision-making. It enforces evidence-based conclusions, multiple hypothesis generation, and systematic verification.
Universal Rule: Never conclude without proof
MANDATORY SEQUENCE:
1. Show tool output FIRST
2. Quote specific evidence
3. THEN interpret
Forbidden Phrases:
Required Phrases:
When identical observations can arise from different mechanisms with opposite implications, investigate before concluding.
Three-Layer Reasoning Model:
Layer 1: OBSERVATION (What do I see?)
Layer 2: MECHANISM (How/why does this exist?)
Layer 3: ASSESSMENT (Is this good/bad/critical?)
FAILURE: Jump from Layer 1 -> Layer 3 (skip mechanism)
CORRECT: Layer 1 -> Layer 2 (investigate) -> Layer 3 (assess with context)
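The ordering rule above can be sketched as a small data structure that refuses the Layer 1 -> Layer 3 jump. This is a hypothetical illustration, not part of the skill itself; the class and field names are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    """One reviewed item, forced through the three layers in order."""
    observation: str                        # Layer 1: what was seen
    mechanism: Optional[str] = None         # Layer 2: how/why it exists
    assessment: Optional[str] = None        # Layer 3: good/bad/critical

    def assess(self, verdict: str) -> str:
        # Refuse to jump from observation straight to assessment.
        if self.mechanism is None:
            raise ValueError("Investigate the mechanism before assessing.")
        self.assessment = verdict
        return self.assessment

f = Finding(observation="Three files have identical content")
try:
    f.assess("CRITICAL: duplication")       # Layer 1 -> Layer 3: rejected
except ValueError as e:
    print(e)
f.mechanism = "Symlinks to a single source file"
print(f.assess("EXCELLENT: single source of truth"))
```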
Decision Framework:
Recognize multiple hypotheses exist
Generate competing hypotheses explicitly
Identify discriminating evidence
Gather discriminating evidence
Assess with mechanism context
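The five steps above amount to: list hypotheses, attach to each the evidence that would confirm it, then let one discriminating observation pick the winner. A minimal sketch (all names hypothetical; evidence here is a hand-built dict standing in for real tool output):

```python
def discriminate(hypotheses, evidence):
    """Return only the hypotheses consistent with the gathered evidence."""
    return [h for h in hypotheses if h["predicts"](evidence)]

hypotheses = [
    {"name": "duplicated copies", "verdict": "CRITICAL: consolidate",
     "predicts": lambda ev: not ev["is_symlink"]},
    {"name": "symlinks to one source", "verdict": "EXCELLENT: keep",
     "predicts": lambda ev: ev["is_symlink"]},
]

evidence = {"is_symlink": True}      # e.g. learned from `ls -la`
survivors = discriminate(hypotheses, evidence)
assert len(survivors) == 1           # the evidence truly discriminates
print(survivors[0]["verdict"])       # EXCELLENT: keep
```

The assertion is the point: if more than one hypothesis survives, the evidence gathered was not discriminating, and more investigation is needed before assessing.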
Training data has a timestamp; absence of knowledge ≠ evidence of absence
Critical Context Check:
Before making claims about what exists:
1. What is my knowledge cutoff date?
2. What is today's date?
3. How much time has elapsed?
4. Could versions/features beyond my training exist?
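The four questions above reduce to simple date arithmetic. A sketch, with a made-up cutoff date standing in for the model's real training cutoff:

```python
from datetime import date

# Hypothetical cutoff; substitute the model's actual training cutoff.
KNOWLEDGE_CUTOFF = date(2024, 6, 1)

def temporal_gap_warning(today: date, cutoff: date = KNOWLEDGE_CUTOFF) -> str:
    """Answer questions 1-4: how stale is my knowledge right now?"""
    elapsed_days = (today - cutoff).days
    if elapsed_days > 0:
        return (f"{elapsed_days} days have passed since the cutoff; "
                "versions/features beyond training may exist - verify first.")
    return "Within training window."

print(temporal_gap_warning(date(2025, 10, 5)))
```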
High Risk Domains (always verify):
Anti-Patterns:
When discovering errors in previous output:
STEP 1: ACKNOWLEDGE EXPLICITLY
- Lead with "CRITICAL CORRECTION"
- Make it impossible to miss
STEP 2: STATE PREVIOUS CLAIM
- Quote exact wrong statement
STEP 3: PROVIDE EVIDENCE
- Show what proves the correction
STEP 4: EXPLAIN ERROR CAUSE
- Root cause: temporal gap? assumption?
STEP 5: CLEAR ACTION
- "NO CHANGE NEEDED" or "Revert suggestion"
Parsimony Principle:
Scope Discipline:
Information Economy:
Atomicity Principle:
Never declare success until ALL requirements are verified
High-Risk Scenarios for Premature Completion:
Completion Protocol:
Warning Signs:
Core Principle: Every item deserves individual attention
Apply to:
Anti-Patterns:
Look for conceptual overlap, not just text/pattern duplication
Key Questions:
Applications:
Verify that package X version Y exists before recommending changes
Check if this file structure is symlinks or duplicates before recommending consolidation
The tests are failing with timeout errors. What are the possible mechanisms?
These three files have identical content. What could explain this?
Review this code and show evidence for every claim
When encountering unfamiliar versions/features:
When analyzing code, architecture, or configurations:
For code reviews, documentation reviews, or any analysis:
False premise: "X doesn't exist" (unverified)
↓
Amplified: "This is CRITICAL/BLOCKER"
↓
Harmful: "Change X to older version Y"
↓
Impact: Downgrade from newer to older
BREAK THE CASCADE: Verify the premise first
Situation: Reviewing code that uses certifi==2025.10.5
Wrong Approach:
CRITICAL: Version 2025.10.5 doesn't exist!
Action: Change to 2024.x.x
Correct Approach:
Let me verify this version exists...
$ grep -A 3 'name = "certifi"' uv.lock
name = "certifi"
version = "2025.10.5"
source = { registry = "https://pypi.org/simple" }
Evidence shows version EXISTS on PyPI.
Action: NO CHANGE NEEDED - version is valid.
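The grep step above can also be scripted. A minimal sketch with deliberately naive parsing, assuming uv.lock's simple `name`/`version` TOML layout (function name is invented):

```python
import re

def lockfile_has_version(lock_text: str, package: str, version: str) -> bool:
    """Naively scan uv.lock-style TOML for a resolved package version."""
    # Match the package's name line immediately followed by its version line.
    pattern = (rf'name = "{re.escape(package)}"\s*\n'
               rf'version = "{re.escape(version)}"')
    return re.search(pattern, lock_text) is not None

lock_text = '''
[[package]]
name = "certifi"
version = "2025.10.5"
source = { registry = "https://pypi.org/simple" }
'''
print(lockfile_has_version(lock_text, "certifi", "2025.10.5"))  # True
```

Checking the lockfile only proves the resolver accepted the version; a real TOML parser (or a registry query) would be more robust than this regex.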
Observation: Three files with identical 1494-line content
Wrong Approach:
CRITICAL: File duplication - consolidate to single file
Correct Approach:
Competing hypotheses:
A: Duplicated copies -> CRITICAL: Consolidate
B: Symlinks to single source -> EXCELLENT: Keep
Gathering discriminating evidence:
$ ls -la specs/
lrwxrwxrwx CLAUDE.md -> AGENTS.md
lrwxrwxrwx GEMINI.md -> AGENTS.md
Mechanism: Symlinks (Hypothesis B confirmed)
Assessment: EXCELLENT architecture - agent-specific entry points with single source of truth
Action: Keep as-is
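The discriminating check in this example is mechanical: separate symlinks from real copies before judging "duplication". A sketch assuming a POSIX filesystem (the demo directory and file names are invented, mirroring the specs/ layout above):

```python
import os
import tempfile

def classify(paths):
    """Split paths into symlinks vs. regular files before judging duplication."""
    links = [p for p in paths if os.path.islink(p)]
    copies = [p for p in paths if not os.path.islink(p)]
    return links, copies

# Demo in a throwaway directory: one source file, two symlinked entry points.
d = tempfile.mkdtemp()
src = os.path.join(d, "AGENTS.md")
with open(src, "w") as fh:
    fh.write("single source of truth\n")
for name in ("CLAUDE.md", "GEMINI.md"):
    os.symlink(src, os.path.join(d, name))

paths = [os.path.join(d, n) for n in sorted(os.listdir(d))]
links, copies = classify(paths)
print(f"{len(links)} symlinks, {len(copies)} regular file(s)")  # 2 symlinks, 1 regular file(s)
```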
Observation: 5 tests failing with "connection timeout"
Hypotheses:
Investigation:
Then conclude based on evidence.
DO NOT:
- "File X doesn't exist" without: ls X
- "Function not used" without: grep -r "function_name"
- "Version invalid" without: checking registry/lockfile
- "Tests fail" without: running tests
- "CRITICAL/BLOCKER" without verification
- Use strong language without evidence
- Skip mechanism investigation
- Pattern match to first familiar case
DO:
- Show grep/ls/find output BEFORE claiming
- Quote actual lines: "file.py:123: 'code here' - issue"
- Check lockfiles for resolved versions
- Run available tools and show output
- Reserve strong language for evidence-proven issues
- "Let me verify..." -> tool output -> interpretation
- Generate multiple hypotheses before gathering evidence
- Distinguish observation from mechanism
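The "Let me verify... -> tool output -> interpretation" pattern can be sketched as a helper that refuses to state a claim without first running and showing the verifying command (the function name and output format are invented; POSIX tools assumed):

```python
import subprocess

def evidence_first(claim: str, cmd: list) -> str:
    """Run the verifying command and show its output before stating the claim."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    block = (f"Let me verify...\n$ {' '.join(cmd)}\n"
             f"{result.stdout}{result.stderr}")
    # A failing command retracts the claim instead of asserting it anyway.
    verdict = claim if result.returncode == 0 else f"RETRACTED: {claim}"
    return f"{block}\n{verdict}"

# Example: back a "directory exists" claim with actual `ls` output.
print(evidence_first("the current directory is listable", ["ls", "."]))
```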
Before proceeding with complex tasks, ask:
For reviews specifically:
Universal Rule: ALL reviews are comprehensive unless explicitly scoped
Never assume limited scope based on:
Always include:
Universal Process:
Recognition Pattern:
WRONG: "Other components do X, so this needs X"
RIGHT: "Let me analyze if this component actually needs X for its purpose"
Weekly Installs: 68
GitHub Stars: 4
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub (pass), Socket (pass), Snyk (pass)
Installed on: opencode (57), gemini-cli (55), codex (51), cursor (51), openclaw (48), github-copilot (47)