constitutional-ai by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill constitutional-ai
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (a set of principles).
Two phases:
1. Supervised phase: the model generates responses, critiques them against the constitution, revises them, and is fine-tuned on the revisions.
2. RL phase (RLAIF): an AI evaluator labels preference pairs under the constitution, a reward model is trained on those labels, and the policy is optimized with RL.
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
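The RLAIF preference prompt later interpolates a CONSTITUTION value that these snippets never define; a minimal sketch, assuming the principles above are stored as a plain string:
# Assumed constant: the RLAIF preference prompt below references
# CONSTITUTION, which is otherwise undefined in these snippets.
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
"""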
Step 1: Generate initial responses:
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# Keep just the generated text so later steps can treat each
# response as a plain string.
initial_responses = [
    out[0]["generated_text"]
    for out in generator(prompts, max_new_tokens=200, return_full_text=False)
]
Step 2: Self-critique with constitution:
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
critiques = [
    out[0]["generated_text"]
    for out in generator(
        [critique_prompt.format(question=q, response=r)  # match {question}/{response}
         for q, r in zip(prompts, initial_responses)],
        max_new_tokens=200, return_full_text=False)
]
Step 3: Revision based on critique:
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = [
    out[0]["generated_text"]
    for out in generator(
        [revision_prompt.format(question=q, response=r, critique=c)  # match placeholders
         for q, r, c in zip(prompts, initial_responses, critiques)],
        max_new_tokens=200, return_full_text=False)
]
Step 4: Fine-tune on revised responses:
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
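create_dataset is used above but never defined; one minimal sketch, assuming SFTTrainer's default "text" column and a simple prompt-plus-revision format:
from datasets import Dataset

def create_dataset(prompts, revised_responses):
    # Hypothetical helper: pair each prompt with its revised response
    # as a single training string in a "text" column.
    return Dataset.from_dict({
        "text": [f"{p}\n{r}" for p, r in zip(prompts, revised_responses)]
    })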
Step 1: Generate comparison pairs:
# Sample two independent responses per prompt at non-zero temperature,
# so each prompt gets an A/B pair to compare.
sampling = dict(do_sample=True, temperature=0.8,
                max_new_tokens=200, return_full_text=False)
responses_a = [out[0]["generated_text"] for out in generator(prompts, **sampling)]
responses_b = [out[0]["generated_text"] for out in generator(prompts, **sampling)]
Step 2: AI preference evaluation:
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
# Get AI preferences (no human labels needed!)
preferences = [
    out[0]["generated_text"]
    for out in generator(
        [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                                  constitution=CONSTITUTION)  # match placeholders
         for q, ra, rb in zip(prompts, responses_a, responses_b)],
        return_full_text=False)
]

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
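parse_preferences is likewise undefined; a naive sketch, assuming the evaluator ends its answer with the letter of the preferred response:
def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: treat a trailing "A" as preferring response A,
    # anything else as preferring response B.
    chosen, rejected = [], []
    for pref, ra, rb in zip(preferences, responses_a, responses_b):
        if pref.strip().rstrip(".").endswith("A"):
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected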
Step 3: Train preference model (reward model):
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)

reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)

reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
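create_preference_dataset is not defined above either; a minimal sketch, assuming RewardTrainer's paired "chosen"/"rejected" text columns:
from datasets import Dataset

def create_preference_dataset(prompts, chosen, rejected):
    # Hypothetical helper: prepend the prompt to each completion so the
    # reward model scores full (prompt, response) texts.
    return Dataset.from_dict({
        "chosen": [f"{p}\n{c}" for p, c in zip(prompts, chosen)],
        "rejected": [f"{p}\n{r}" for p, r in zip(prompts, rejected)],
    })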
Step 4: RL training with RLAIF:
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
Enable reasoning transparency:
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = [
    out[0]["generated_text"]
    for out in generator(
        [cot_critique_prompt.format(question=q, response=r)  # match placeholders
         for q, r in zip(prompts, initial_responses)],
        return_full_text=False)
]
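If you need the per-principle verdicts programmatically, a rough sketch of a parser (the parse_cot_verdicts name and regex are illustrative assumptions, not part of the skill):
import re

def parse_cot_verdicts(critique_text):
    # Hypothetical parser: grab the first Yes/No on each numbered line
    # of the step-by-step critique.
    return re.findall(r"^\s*\d\.\s[^\n]*?\b(Yes|No)\b", critique_text,
                      flags=re.MULTILINE | re.IGNORECASE)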
Use Constitutional AI when:
Principles:
Use alternatives instead:
Issue: Model refuses too much (evasive)
Add constitution principle:
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
Issue: Self-critiques are weak
Use stronger critique prompts:
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
Issue: Revisions don't improve quality
Iterate multiple times:
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
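generate_critique and generate_revision are placeholders; a sketch of possible wrappers, assuming the generator and prompt templates from the supervised phase and a single question being refined:
def generate_critique(response):
    # Hypothetical wrapper reusing critique_prompt from step 2.
    text = critique_prompt.format(question=question, response=response)
    return generator(text, max_new_tokens=200,
                     return_full_text=False)[0]["generated_text"]

def generate_revision(response, critique):
    # Hypothetical wrapper reusing revision_prompt from step 3.
    text = revision_prompt.format(question=question, response=response,
                                  critique=critique)
    return generator(text, max_new_tokens=200,
                     return_full_text=False)[0]["generated_text"]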
Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)
# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
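majority_vote is undefined above; a minimal sketch, assuming each evaluator returns a list of "A"/"B" labels aligned by prompt:
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per prompt, keep the label chosen by the
    # most evaluators.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*preference_lists)]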
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Compute requirements:
Weekly Installs: 64
Repository: github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: codex (55), opencode (55), cursor (55), gemini-cli (54), github-copilot (53), claude-code (52)