constitutional-ai by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill constitutional-ai
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: models learn to critique and revise their own responses using a "constitution" (a set of principles).
Two phases:
1. Supervised phase: generate responses, critique them against the constitution, revise, then fine-tune on the revised responses.
2. RL phase (RLAIF): sample response pairs, have an AI judge choose the more constitution-compliant one, train a reward model on those preferences, and run RL against it.
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
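The RLAIF judge prompt later on this page formats a `CONSTITUTION` variable that no snippet defines. A minimal sketch (an assumption, not part of the original code) is simply the numbered principles joined into one string:

```python
# Hypothetical definition: the preference-evaluation prompt below expects a
# CONSTITUTION string; the simplest form is the numbered principles joined.
CONSTITUTION = "\n".join([
    "1. Choose the response that is most helpful, honest, and harmless",
    "2. Avoid responses that are toxic, racist, or sexist",
    "3. Prefer responses that explain objections rather than refuse",
    "4. Choose responses that are thoughtful and nuanced",
])
```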
Phase 1 (supervised), Step 1: Generate initial responses:
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")  # placeholder model name
prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]
# The pipeline returns [{"generated_text": ...}] per prompt; keep just the strings
# so they can be formatted into the critique prompts below
initial_responses = [out[0]["generated_text"]
                     for out in generator(prompts, max_length=200)]
Step 2: Self-critique with the constitution:
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
# Keyword names must match the {question}/{response} placeholders
critiques = [out[0]["generated_text"] for out in generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
Step 3: Revision based on critique:
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = [out[0]["generated_text"] for out in generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)]
Step 4: Fine-tune on revised responses:
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024  # in recent trl releases this argument lives on SFTConfig
)
trainer.train()
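`create_dataset` above is not defined in the snippet. Assuming the goal is simply to pair each prompt with its revised response, one plain-Python sketch is:

```python
def create_dataset(prompts, revised_responses):
    """Hypothetical helper: pair each prompt with its revised response.

    Returns plain dict rows; convert with datasets.Dataset.from_list(rows)
    before handing the result to SFTTrainer.
    """
    return [
        {"prompt": p, "completion": r}
        for p, r in zip(prompts, revised_responses)
    ]
```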
Phase 2 (RLAIF), Step 1: Generate comparison pairs:
# Sample two candidate responses per prompt at non-zero temperature
samples = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
responses_a = [out[0]["generated_text"] for out in samples]
responses_b = [out[1]["generated_text"] for out in samples]
Step 2: AI preference evaluation:
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
# Get AI preferences (no human labels needed!)
preferences = [out[0]["generated_text"] for out in generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                              constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)]
# Parse each verdict (A or B) into chosen/rejected lists
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
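`parse_preferences` is not defined in the snippet. A minimal sketch, assuming the judge's text ends with a standalone "A" or "B" verdict (and defaulting to A when no verdict is found):

```python
import re

def parse_preferences(preferences, responses_a, responses_b):
    """Hypothetical helper: split (A, B) pairs into chosen/rejected lists.

    Takes the last standalone A/B token in each judgment as the verdict;
    falls back to A if none is found.
    """
    chosen, rejected = [], []
    for text, a, b in zip(preferences, responses_a, responses_b):
        verdicts = re.findall(r"\b([AB])\b", text)
        pick = verdicts[-1] if verdicts else "A"
        if pick == "A":
            chosen.append(a)
            rejected.append(b)
        else:
            chosen.append(b)
            rejected.append(a)
    return chosen, rejected
```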
Step 3: Train a preference model (reward model):
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)
reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
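`create_preference_dataset` is likewise undefined. Assuming RewardTrainer's conventional `chosen`/`rejected` columns, a plain-Python sketch:

```python
def create_preference_dataset(prompts, chosen, rejected):
    """Hypothetical helper: build one row per prompt with the preferred and
    dispreferred completion. Convert with datasets.Dataset.from_list(rows)
    before passing to RewardTrainer.
    """
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]
```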
Step 4: RL training with RLAIF:
from trl import PPOTrainer, PPOConfig

# Note: the exact PPOConfig/PPOTrainer signatures vary across trl versions;
# check the docs for your installed release
ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)
ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
Enable reasoning transparency:
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = [out[0]["generated_text"] for out in generator(
    [cot_critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
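The structured Yes/No checklist makes these critiques machine-parseable. A rough sketch of extracting the four judgments, assuming the judge echoes the checklist with explicit Yes/No answers in order (standalone "no"s elsewhere in the text would confuse it):

```python
import re

def parse_cot_critique(critique_text):
    """Hypothetical helper: pull the four Yes/No judgments from a
    chain-of-thought critique, in checklist order."""
    answers = re.findall(r"\b(Yes|No)\b", critique_text, flags=re.IGNORECASE)
    labels = ["helpful", "honest", "harmless", "non-toxic"]
    return dict(zip(labels, [a.capitalize() for a in answers[:4]]))
```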
Use Constitutional AI when:
Principles:
Use alternatives instead:
Issue: Model refuses too much (evasive)
Add constitution principle:
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
Issue: Self-critiques are weak
Use stronger critique prompts:
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
Issue: Revisions don't improve quality
Iterate multiple times:
for _ in range(3):  # three rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
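The loop above relies on `generate_critique`/`generate_revision` helpers that the page never defines. One sketch, assuming they wrap any text-in/text-out callable (such as the pipeline from earlier) and reuse the critique/revision prompt shapes shown above:

```python
def make_critique_revision_fns(question, generate_fn):
    """Hypothetical factory: build critique/revision helpers with the
    signatures the loop above expects. generate_fn is any text-in/text-out
    callable, e.g. a thin wrapper around the text-generation pipeline."""
    def generate_critique(response):
        return generate_fn(
            f"Question: {question}\nResponse: {response}\n"
            "Identify any ways this response violates the constitution:"
        )

    def generate_revision(response, critique):
        return generate_fn(
            f"Question: {question}\nOriginal response: {response}\n"
            f"Critique: {critique}\n"
            "Please revise the response to better align with the constitution:"
        )

    return generate_critique, generate_revision
```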
Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)
# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
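`majority_vote` can be a few lines of standard library code; a sketch assuming each evaluator returns a parallel list of "A"/"B" labels:

```python
from collections import Counter

def majority_vote(*preference_lists):
    """Hypothetical helper: per-example majority label across evaluators.

    Counter.most_common is stable on ties, so the first evaluator's vote
    wins when there is no majority.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*preference_lists)]
```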
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Compute requirements:
Weekly Installs: 172
Repository
GitHub Stars: 23.4K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on:
opencode: 142
claude-code: 142
gemini-cli: 137
cursor: 130
codex: 123
github-copilot: 117