llm-architect by 404kidwiz/claude-supercode-skills
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill llm-architect
Provides expert large language model system architecture for designing, deploying, and optimizing LLM applications at scale. Specializes in model selection, RAG (Retrieval Augmented Generation) pipelines, fine-tuning strategies, serving infrastructure, cost optimization, and safety guardrails for production LLM systems.
Invoke this skill when:
Do NOT invoke when:
| Requirement | Recommended Approach |
|---|---|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│ ├─ Knowledge changes frequently?
│ │ └─ RAG (Retrieval Augmented Generation)
│ └─ Knowledge is static?
│ └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│ ├─ Can describe in prompt?
│ │ └─ Prompt engineering (try first)
│ └─ Format too complex for prompt?
│ └─ Fine-tuning
│
└─ Need latency <100ms?
└─ Fine-tuned small model (7B-13B)
[Client] → [API Gateway + Rate Limiting]
↓
[Request Router]
(Route by intent/complexity)
↓
┌────────┴────────┐
↓ ↓
[Fast Model] [Powerful Model]
(Haiku/Small) (Sonnet/Large)
↓ ↓
[Cache Layer] ← [Response Aggregator]
↓
[Logging & Monitoring]
↓
[Response to Client]
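The [Request Router] stage above can be sketched as a cheap heuristic classifier. The model names, the keyword list, and the length cutoff below are illustrative assumptions, not part of any SDK:

```python
# Sketch of the [Request Router] stage: send short, simple queries to a
# fast model and everything else to a powerful model. The heuristic
# (word count plus complexity keywords) is an assumption for illustration.

FAST_MODEL = "claude-3-haiku"
POWERFUL_MODEL = "claude-3-sonnet"

COMPLEX_HINTS = ("analyze", "compare", "summarize", "reason", "prove")

def route_request(prompt: str, max_fast_words: int = 200) -> str:
    """Pick a model tier from crude complexity signals."""
    words = prompt.lower().split()
    if len(words) > max_fast_words or any(h in words for h in COMPLEX_HINTS):
        return POWERFUL_MODEL
    return FAST_MODEL
```

In production this heuristic is typically replaced by a small classifier model, but the routing contract stays the same: one function from request to model tier.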
Ask these questions:
def select_model(requirements):
    """Map latency, budget, and accuracy requirements to a model choice.

    `requirements` is assumed to expose latency_p95 (ms), task_complexity,
    budget, domain_specific, and accuracy_critical attributes.
    """
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    else:  # batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
--model claude-3-sonnet \
--dataset data/eval_1000_examples.jsonl \
--metrics accuracy,latency,cost
# Expected output:
# Accuracy: 94.3%
# P95 Latency: 1,245ms
# Cost per 1K requests: $2.15
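The benchmark script above is repository-specific, but the metric aggregation it implies can be sketched as follows. The per-record field names (`correct`, `latency_ms`, `cost_usd`) are assumptions, not taken from `scripts/evaluate_model.py`:

```python
# Sketch of the accuracy / P95 latency / cost aggregation an evaluation
# script like the one above might perform over per-example records.
import math

def summarize(results: list[dict]) -> dict:
    """Aggregate per-example eval records into the three headline metrics."""
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    latencies = sorted(r["latency_ms"] for r in results)
    p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank P95
    cost_per_1k = sum(r["cost_usd"] for r in results) / n * 1000
    return {"accuracy": accuracy, "p95_latency_ms": p95, "cost_per_1k": cost_per_1k}
```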
| Strategy | Savings | When to Use |
|---|---|---|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
| Observation | Action |
|---|---|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency 2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |
| Metric | Target | Critical |
|---|---|---|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |
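The target and critical columns above can drive an automated health check. The function below is an illustrative sketch of that check, not a monitoring-product API:

```python
# Classify one observed metric against the target/critical thresholds in
# the table above. higher_is_better distinguishes metrics like accuracy
# and cache hit rate from latency, error rate, and cost.

def classify(value: float, target: float, critical: float,
             higher_is_better: bool) -> str:
    """Return 'ok', 'warn', or 'critical' for one metric reading."""
    if higher_is_better:
        if value >= target:
            return "ok"
        return "warn" if value >= critical else "critical"
    if value <= target:
        return "ok"
    return "warn" if value <= critical else "critical"
```

Wiring each table row through this function in a periodic job gives a cheap dashboard: "warn" maps to the escalation actions in the previous table, "critical" pages on-call.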
Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md
Weekly Installs: 73
GitHub Stars: 42
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Warn; Socket: Pass; Snyk: Pass
Installed on: opencode (62), codex (57), gemini-cli (57), claude-code (55), cursor (50), github-copilot (50)