constitutional-ai by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill constitutional-ai
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: models learn to critique and revise their own responses using a "constitution" (a set of principles).
Two phases:
1. Supervised phase: generate responses, critique them against the constitution, revise, then fine-tune on the revised responses.
2. RL phase (RLAIF): sample response pairs, have an AI judge choose the more constitution-compliant one, train a reward model on those preferences, and run RL against it.
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
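The RLAIF judge prompt later on this page formats a `CONSTITUTION` variable that no snippet defines. A minimal sketch (an assumption, not part of the original code) is simply the numbered principles joined into one string:

```python
# Hypothetical definition: the preference-evaluation prompt below expects a
# CONSTITUTION string; the simplest form is the numbered principles joined.
CONSTITUTION = "\n".join([
    "1. Choose the response that is most helpful, honest, and harmless",
    "2. Avoid responses that are toxic, racist, or sexist",
    "3. Prefer responses that explain objections rather than refuse",
    "4. Choose responses that are thoughtful and nuanced",
])
```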
Phase 1 (supervised), Step 1: Generate initial responses:
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")  # placeholder model name
prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]
# The pipeline returns [{"generated_text": ...}] per prompt; keep just the strings
# so they can be formatted into the critique prompts below
initial_responses = [out[0]["generated_text"]
                     for out in generator(prompts, max_length=200)]
Step 2: Self-critique with the constitution:
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
# Keyword names must match the {question}/{response} placeholders
critiques = [out[0]["generated_text"] for out in generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
Step 3: Revision based on critique:
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = [out[0]["generated_text"] for out in generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)]
Step 4: Fine-tune on revised responses:
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024  # in recent trl releases this argument lives on SFTConfig
)
trainer.train()
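`create_dataset` above is not defined in the snippet. Assuming the goal is simply to pair each prompt with its revised response, one plain-Python sketch is:

```python
def create_dataset(prompts, revised_responses):
    """Hypothetical helper: pair each prompt with its revised response.

    Returns plain dict rows; convert with datasets.Dataset.from_list(rows)
    before handing the result to SFTTrainer.
    """
    return [
        {"prompt": p, "completion": r}
        for p, r in zip(prompts, revised_responses)
    ]
```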
Phase 2 (RLAIF), Step 1: Generate comparison pairs:
# Sample two candidate responses per prompt at non-zero temperature
samples = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
responses_a = [out[0]["generated_text"] for out in samples]
responses_b = [out[1]["generated_text"] for out in samples]
Step 2: AI preference evaluation:
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
# Get AI preferences (no human labels needed!)
preferences = [out[0]["generated_text"] for out in generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                              constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)]
# Parse each verdict (A or B) into chosen/rejected lists
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
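`parse_preferences` is not defined in the snippet. A minimal sketch, assuming the judge's text ends with a standalone "A" or "B" verdict (and defaulting to A when no verdict is found):

```python
import re

def parse_preferences(preferences, responses_a, responses_b):
    """Hypothetical helper: split (A, B) pairs into chosen/rejected lists.

    Takes the last standalone A/B token in each judgment as the verdict;
    falls back to A if none is found.
    """
    chosen, rejected = [], []
    for text, a, b in zip(preferences, responses_a, responses_b):
        verdicts = re.findall(r"\b([AB])\b", text)
        pick = verdicts[-1] if verdicts else "A"
        if pick == "A":
            chosen.append(a)
            rejected.append(b)
        else:
            chosen.append(b)
            rejected.append(a)
    return chosen, rejected
```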
Step 3: Train a preference model (reward model):
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)
reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
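`create_preference_dataset` is likewise undefined. Assuming RewardTrainer's conventional `chosen`/`rejected` columns, a plain-Python sketch:

```python
def create_preference_dataset(prompts, chosen, rejected):
    """Hypothetical helper: build one row per prompt with the preferred and
    dispreferred completion. Convert with datasets.Dataset.from_list(rows)
    before passing to RewardTrainer.
    """
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]
```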
Step 4: RL training with RLAIF:
from trl import PPOTrainer, PPOConfig

# Note: the exact PPOConfig/PPOTrainer signatures vary across trl versions;
# check the docs for your installed release
ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)
ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
Enable reasoning transparency:
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = [out[0]["generated_text"] for out in generator(
    [cot_critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
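The structured Yes/No checklist makes these critiques machine-parseable. A rough sketch of extracting the four judgments, assuming the judge echoes the checklist with explicit Yes/No answers in order (standalone "no"s elsewhere in the text would confuse it):

```python
import re

def parse_cot_critique(critique_text):
    """Hypothetical helper: pull the four Yes/No judgments from a
    chain-of-thought critique, in checklist order."""
    answers = re.findall(r"\b(Yes|No)\b", critique_text, flags=re.IGNORECASE)
    labels = ["helpful", "honest", "harmless", "non-toxic"]
    return dict(zip(labels, [a.capitalize() for a in answers[:4]]))
```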
Use Constitutional AI when:
Principles:
Use alternatives instead:
Issue: Model refuses too much (evasive)
Add constitution principle:
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
Issue: Self-critiques are weak
Use stronger critique prompts:
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
Issue: Revisions don't improve quality
Iterate multiple times:
for _ in range(3):  # three rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
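The loop above relies on `generate_critique`/`generate_revision` helpers that the page never defines. One sketch, assuming they wrap any text-in/text-out callable (such as the pipeline from earlier) and reuse the critique/revision prompt shapes shown above:

```python
def make_critique_revision_fns(question, generate_fn):
    """Hypothetical factory: build critique/revision helpers with the
    signatures the loop above expects. generate_fn is any text-in/text-out
    callable, e.g. a thin wrapper around the text-generation pipeline."""
    def generate_critique(response):
        return generate_fn(
            f"Question: {question}\nResponse: {response}\n"
            "Identify any ways this response violates the constitution:"
        )

    def generate_revision(response, critique):
        return generate_fn(
            f"Question: {question}\nOriginal response: {response}\n"
            f"Critique: {critique}\n"
            "Please revise the response to better align with the constitution:"
        )

    return generate_critique, generate_revision
```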
Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)
# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
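`majority_vote` can be a few lines of standard library code; a sketch assuming each evaluator returns a parallel list of "A"/"B" labels:

```python
from collections import Counter

def majority_vote(*preference_lists):
    """Hypothetical helper: per-example majority label across evaluators.

    Counter.most_common is stable on ties, so the first evaluator's vote
    wins when there is no majority.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*preference_lists)]
```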
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Compute requirements:
Weekly Installs: 172
Repository
GitHub Stars: 23.4K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on:
opencode: 142
claude-code: 142
gemini-cli: 137
cursor: 130
codex: 123
github-copilot: 117