llm-architect by 404kidwiz/claude-supercode-skills
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill llm-architect
Provides expert large language model system architecture for designing, deploying, and optimizing LLM applications at scale. Specializes in model selection, RAG (Retrieval Augmented Generation) pipelines, fine-tuning strategies, serving infrastructure, cost optimization, and safety guardrails for production LLM systems.
Invoke this skill when:
Do NOT invoke when:
| Requirement | Recommended Approach |
|---|---|
| Latency <100ms | Small fine-tuned model (7B quantized) |
| Latency <2s, budget unlimited | Claude 3 Opus / GPT-4 |
| Latency <2s, domain-specific | Claude 3 Sonnet fine-tuned |
| Latency <2s, cost-sensitive | Claude 3 Haiku |
| Batch/async acceptable | Batch API, cheapest tier |
Need to customize LLM behavior?
│
├─ Need domain-specific knowledge?
│ ├─ Knowledge changes frequently?
│ │ └─ RAG (Retrieval Augmented Generation)
│ └─ Knowledge is static?
│ └─ Fine-tuning OR RAG (test both)
│
├─ Need specific output format/style?
│ ├─ Can describe in prompt?
│ │ └─ Prompt engineering (try first)
│ └─ Format too complex for prompt?
│ └─ Fine-tuning
│
└─ Need latency <100ms?
└─ Fine-tuned small model (7B-13B)
[Client] → [API Gateway + Rate Limiting]
↓
[Request Router]
(Route by intent/complexity)
↓
┌────────┴────────┐
↓ ↓
[Fast Model] [Powerful Model]
(Haiku/Small) (Sonnet/Large)
↓ ↓
[Cache Layer] ← [Response Aggregator]
↓
[Logging & Monitoring]
↓
[Response to Client]
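The [Request Router] stage above can be sketched as a cheap heuristic classifier. The model names, the keyword list, and the length cutoff below are illustrative assumptions, not part of any SDK:

```python
# Sketch of the [Request Router] stage: send short, simple queries to a
# fast model and everything else to a powerful model. The heuristic
# (word count plus complexity keywords) is an assumption for illustration.

FAST_MODEL = "claude-3-haiku"
POWERFUL_MODEL = "claude-3-sonnet"

COMPLEX_HINTS = ("analyze", "compare", "summarize", "reason", "prove")

def route_request(prompt: str, max_fast_words: int = 200) -> str:
    """Pick a model tier from crude complexity signals."""
    words = prompt.lower().split()
    if len(words) > max_fast_words or any(h in words for h in COMPLEX_HINTS):
        return POWERFUL_MODEL
    return FAST_MODEL
```

In production this heuristic is typically replaced by a small classifier model, but the routing contract stays the same: one function from request to model tier.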
Ask these questions:
def select_model(requirements):
    """Map latency, budget, and accuracy requirements to a model choice.

    `requirements` is assumed to expose latency_p95 (ms), task_complexity,
    budget, domain_specific, and accuracy_critical attributes.
    """
    if requirements.latency_p95 < 100:  # milliseconds
        if requirements.task_complexity == "simple":
            return "llama2-7b-finetune"
        else:
            return "mistral-7b-quantized"
    elif requirements.latency_p95 < 2000:
        if requirements.budget == "unlimited":
            return "claude-3-opus"
        elif requirements.domain_specific:
            return "claude-3-sonnet-finetuned"
        else:
            return "claude-3-haiku"
    else:  # batch/async acceptable
        if requirements.accuracy_critical:
            return "gpt-4-with-ensemble"
        else:
            return "batch-api-cheapest-tier"
# Run benchmark on eval dataset
python scripts/evaluate_model.py \
--model claude-3-sonnet \
--dataset data/eval_1000_examples.jsonl \
--metrics accuracy,latency,cost
# Expected output:
# Accuracy: 94.3%
# P95 Latency: 1,245ms
# Cost per 1K requests: $2.15
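The benchmark script above is repository-specific, but the metric aggregation it implies can be sketched as follows. The per-record field names (`correct`, `latency_ms`, `cost_usd`) are assumptions, not taken from `scripts/evaluate_model.py`:

```python
# Sketch of the accuracy / P95 latency / cost aggregation an evaluation
# script like the one above might perform over per-example records.
import math

def summarize(results: list[dict]) -> dict:
    """Aggregate per-example eval records into the three headline metrics."""
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    latencies = sorted(r["latency_ms"] for r in results)
    p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank P95
    cost_per_1k = sum(r["cost_usd"] for r in results) / n * 1000
    return {"accuracy": accuracy, "p95_latency_ms": p95, "cost_per_1k": cost_per_1k}
```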
| Strategy | Savings | When to Use |
|---|---|---|
| Semantic caching | 40-80% | 60%+ similar queries |
| Multi-model routing | 30-50% | Mixed complexity queries |
| Prompt compression | 10-20% | Long context inputs |
| Batching | 20-40% | Async-tolerant workloads |
| Smaller model cascade | 40-60% | Simple queries first |
| Observation | Action |
|---|---|
| Accuracy <80% after prompt iteration | Consider fine-tuning |
| Latency 2x requirement | Review infrastructure |
| Cost >2x budget | Aggressive caching/routing |
| Hallucination rate >5% | Add RAG or stronger guardrails |
| Safety bypass detected | Immediate security review |
| Metric | Target | Critical |
|---|---|---|
| P95 Latency | <2x requirement | <3x requirement |
| Accuracy | >90% | >80% |
| Cache Hit Rate | >60% | >40% |
| Error Rate | <1% | <5% |
| Cost/1K requests | Within budget | <150% budget |
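The target and critical columns above can drive an automated health check. The function below is an illustrative sketch of that check, not a monitoring-product API:

```python
# Classify one observed metric against the target/critical thresholds in
# the table above. higher_is_better distinguishes metrics like accuracy
# and cache hit rate from latency, error rate, and cost.

def classify(value: float, target: float, critical: float,
             higher_is_better: bool) -> str:
    """Return 'ok', 'warn', or 'critical' for one metric reading."""
    if higher_is_better:
        if value >= target:
            return "ok"
        return "warn" if value >= critical else "critical"
    if value <= target:
        return "ok"
    return "warn" if value <= critical else "critical"
```

Wiring each table row through this function in a periodic job gives a cheap dashboard: "warn" maps to the escalation actions in the previous table, "critical" pages on-call.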
Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md
Weekly Installs: 73
GitHub Stars: 42
First Seen: Jan 24, 2026
Security Audits: Gen Agent Trust Hub: Warn; Socket: Pass; Snyk: Pass
Installed on: opencode (62), codex (57), gemini-cli (57), claude-code (55), cursor (50), github-copilot (50)