constitutional-ai by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill constitutional-ai
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (a set of principles).
Two phases:
1. Supervised phase: the model generates responses, critiques them against the constitution, revises them, and is fine-tuned on the revisions.
2. RL phase (RLAIF): an AI evaluator labels preference pairs under the constitution, a reward model is trained on those labels, and the policy is optimized with RL.
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
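The RLAIF preference prompt later interpolates a CONSTITUTION value that these snippets never define; a minimal sketch, assuming the principles above are stored as a plain string:
# Assumed constant: the RLAIF preference prompt below references
# CONSTITUTION, which is otherwise undefined in these snippets.
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
"""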
Step 1: Generate initial responses:
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# Keep just the generated text so later steps can treat each
# response as a plain string.
initial_responses = [
    out[0]["generated_text"]
    for out in generator(prompts, max_new_tokens=200, return_full_text=False)
]
Step 2: Self-critique with constitution:
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
critiques = [
    out[0]["generated_text"]
    for out in generator(
        [critique_prompt.format(question=q, response=r)  # match {question}/{response}
         for q, r in zip(prompts, initial_responses)],
        max_new_tokens=200, return_full_text=False)
]
Step 3: Revision based on critique:
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = [
    out[0]["generated_text"]
    for out in generator(
        [revision_prompt.format(question=q, response=r, critique=c)  # match placeholders
         for q, r, c in zip(prompts, initial_responses, critiques)],
        max_new_tokens=200, return_full_text=False)
]
Step 4: Fine-tune on revised responses:
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
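create_dataset is used above but never defined; one minimal sketch, assuming SFTTrainer's default "text" column and a simple prompt-plus-revision format:
from datasets import Dataset

def create_dataset(prompts, revised_responses):
    # Hypothetical helper: pair each prompt with its revised response
    # as a single training string in a "text" column.
    return Dataset.from_dict({
        "text": [f"{p}\n{r}" for p, r in zip(prompts, revised_responses)]
    })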
Step 1: Generate comparison pairs:
# Sample two independent responses per prompt at non-zero temperature,
# so each prompt gets an A/B pair to compare.
sampling = dict(do_sample=True, temperature=0.8,
                max_new_tokens=200, return_full_text=False)
responses_a = [out[0]["generated_text"] for out in generator(prompts, **sampling)]
responses_b = [out[0]["generated_text"] for out in generator(prompts, **sampling)]
Step 2: AI preference evaluation:
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
# Get AI preferences (no human labels needed!)
preferences = [
    out[0]["generated_text"]
    for out in generator(
        [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                                  constitution=CONSTITUTION)  # match placeholders
         for q, ra, rb in zip(prompts, responses_a, responses_b)],
        return_full_text=False)
]

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
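parse_preferences is likewise undefined; a naive sketch, assuming the evaluator ends its answer with the letter of the preferred response:
def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: treat a trailing "A" as preferring response A,
    # anything else as preferring response B.
    chosen, rejected = [], []
    for pref, ra, rb in zip(preferences, responses_a, responses_b):
        if pref.strip().rstrip(".").endswith("A"):
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected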
Step 3: Train preference model (reward model):
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)

reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)

reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
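create_preference_dataset is not defined above either; a minimal sketch, assuming RewardTrainer's paired "chosen"/"rejected" text columns:
from datasets import Dataset

def create_preference_dataset(prompts, chosen, rejected):
    # Hypothetical helper: prepend the prompt to each completion so the
    # reward model scores full (prompt, response) texts.
    return Dataset.from_dict({
        "chosen": [f"{p}\n{c}" for p, c in zip(prompts, chosen)],
        "rejected": [f"{p}\n{r}" for p, r in zip(prompts, rejected)],
    })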
Step 4: RL training with RLAIF:
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
Enable reasoning transparency:
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = [
    out[0]["generated_text"]
    for out in generator(
        [cot_critique_prompt.format(question=q, response=r)  # match placeholders
         for q, r in zip(prompts, initial_responses)],
        return_full_text=False)
]
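If you need the per-principle verdicts programmatically, a rough sketch of a parser (the parse_cot_verdicts name and regex are illustrative assumptions, not part of the skill):
import re

def parse_cot_verdicts(critique_text):
    # Hypothetical parser: grab the first Yes/No on each numbered line
    # of the step-by-step critique.
    return re.findall(r"^\s*\d\.\s[^\n]*?\b(Yes|No)\b", critique_text,
                      flags=re.MULTILINE | re.IGNORECASE)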
Use Constitutional AI when:
Principles:
Use alternatives instead:
Issue: Model refuses too much (evasive)
Add constitution principle:
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
Issue: Self-critiques are weak
Use stronger critique prompts:
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
Issue: Revisions don't improve quality
Iterate multiple times:
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
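generate_critique and generate_revision are placeholders; a sketch of possible wrappers, assuming the generator and prompt templates from the supervised phase and a single question being refined:
def generate_critique(response):
    # Hypothetical wrapper reusing critique_prompt from step 2.
    text = critique_prompt.format(question=question, response=response)
    return generator(text, max_new_tokens=200,
                     return_full_text=False)[0]["generated_text"]

def generate_revision(response, critique):
    # Hypothetical wrapper reusing revision_prompt from step 3.
    text = revision_prompt.format(question=question, response=response,
                                  critique=critique)
    return generator(text, max_new_tokens=200,
                     return_full_text=False)[0]["generated_text"]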
Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)
# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
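majority_vote is undefined above; a minimal sketch, assuming each evaluator returns a list of "A"/"B" labels aligned by prompt:
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per prompt, keep the label chosen by the
    # most evaluators.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*preference_lists)]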
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Compute requirements:
Weekly Installs: 64
Repository: github.com/orchestra-research/ai-research-skills
GitHub Stars: 5.5K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
Installed on: codex (55), opencode (55), cursor (55), gemini-cli (54), github-copilot (53), claude-code (52)