it-operations by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill it-operations
A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.
1. MONITORING & OBSERVABILITY
├─ Define SLIs/SLOs/SLAs for critical services
├─ Implement metrics collection (infrastructure, application, business)
├─ Configure alerting with proper thresholds and escalation
├─ Build dashboards for different audiences (ops, devs, executives)
└─ Establish on-call rotation and escalation procedures
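The SLI/SLO work in step 1 comes down to an error-budget calculation. A minimal sketch, not tied to any monitoring product (the function name is hypothetical):

```python
def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Minutes of allowed downtime in a window for a given availability SLO.

    slo_target is a fraction (0.999 for 99.9%); the budget is simply the
    complement of the target applied to the window length.
    """
    return (1 - slo_target) * window_minutes


# A 30-day window at 99.9% allows roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999, 30 * 24 * 60)
```

Tracking the remaining budget (budget minus downtime consumed so far) is what turns an SLO into an actionable alerting signal.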
2. INCIDENT MANAGEMENT
├─ Receive alert or user report
├─ Assess severity and impact (P1/P2/P3/P4)
├─ Engage appropriate responders
├─ Investigate and diagnose root cause
├─ Implement fix or workaround
├─ Communicate status to stakeholders
├─ Document resolution in knowledge base
└─ Conduct post-incident review
3. CHANGE MANAGEMENT
├─ Submit change request with impact assessment
├─ Review and approve through CAB (Change Advisory Board)
├─ Schedule change window
├─ Execute change with rollback plan ready
├─ Validate success criteria
├─ Document actual vs planned results
└─ Close change ticket
4. CAPACITY PLANNING
├─ Collect resource utilization trends
├─ Analyze growth patterns
├─ Forecast future requirements
├─ Plan procurement or provisioning
├─ Execute capacity additions
└─ Monitor effectiveness
5. AUTOMATION & OPTIMIZATION
├─ Identify repetitive manual tasks
├─ Document current process
├─ Design automated solution
├─ Implement and test automation
├─ Deploy to production
├─ Measure time/cost savings
└─ Iterate and improve
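The "Measure time/cost savings" step above is a simple before/after calculation. A sketch with hypothetical names and example numbers:

```python
def weekly_minutes_saved(runs_per_week: int,
                         manual_minutes: int,
                         automated_minutes: int) -> int:
    """Minutes reclaimed per week by automating one repetitive task."""
    return runs_per_week * (manual_minutes - automated_minutes)


# A task run 10x/week, cut from 30 manual minutes to 2 automated minutes,
# frees up 280 minutes (~4.7 hours) of team capacity per week.
saved = weekly_minutes_saved(10, 30, 2)
```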
| Scenario | Alert Type | Threshold | Response Time | Escalation |
|---|---|---|---|---|
| Service completely down | Page | Immediate | < 5 min | Immediate to on-call |
| Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call |
| High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead |
| Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review |
| Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
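The table above can be encoded as a lookup for automation. The scenario keys and minute values below merely mirror the table; they are illustrative names, not part of any real alerting API:

```python
# Response-time targets in minutes, taken from the alert-threshold table.
RESPONSE_TARGET_MIN = {
    "service_down": 5,
    "service_degraded": 15,
    "high_resource_usage": 60,
    "approaching_capacity": 24 * 60,
    "configuration_drift": 7 * 24 * 60,
}


def breached(scenario: str, minutes_open: float) -> bool:
    """True when an open alert has exceeded its response-time target."""
    return minutes_open > RESPONSE_TARGET_MIN[scenario]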
Priority 1 (Critical)
Priority 2 (High)
Priority 3 (Medium)
Priority 4 (Low)
Risk Level = Impact × Likelihood × Complexity
Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing
Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before
Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide
Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
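The risk formula and score bands above translate directly into a small classifier. A sketch — the function and label strings are hypothetical, but the arithmetic and thresholds follow the interpretation table:

```python
def classify_change(impact: int, likelihood: int, complexity: int) -> str:
    """Map a change's risk score (Impact x Likelihood x Complexity,
    each rated 1-5) to its approval path."""
    score = impact * likelihood * complexity
    if score <= 20:
        return "standard (pre-approved)"
    if score <= 50:
        return "normal (CAB review)"
    if score <= 75:
        return "high-risk (extensive testing, senior approval)"
    return "emergency only (executive approval)"
```

For example, a routine, well-tested change touching a few components for one team (2 x 2 x 2 = 8) is pre-approved, while a customer-facing, never-before-done, enterprise-wide change (5 x 5 x 5 = 125) requires executive approval.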
| Requirement | Prometheus + Grafana | Datadog | New Relic | ELK Stack | Splunk |
|---|---|---|---|---|---|
| Cost | Free (self-hosted) | $$$$ | $$$$ | Free-$$ | $$$$$ |
| Metrics | Excellent | Excellent | Excellent | Good | Good |
| Logs | Via Loki | Excellent | Excellent | Excellent | Excellent |
| Traces | Via Tempo | Excellent | Excellent | Limited | Good |
| Learning Curve | Steep | Moderate | Moderate | Steep | Steep |
| Cloud-Native | Excellent | Excellent | Excellent | Good | Good |
| On-Premises | Excellent | Good | Good | Excellent | Excellent |
| APM | Via exporters | Excellent | Excellent | Limited | Good |
Problem: Too many false-positive alerts causing team burnout
Solution:
Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
- Actionable + Urgent = Keep as page
- Actionable + Not Urgent = Ticket
- Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
- MTTA (Mean Time to Acknowledge): < 5 min target
- False Positive Rate: < 20% target
- Alert Volume per Week: Trending down
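Step 2's actionability rules can be expressed as a tiny triage function (illustrative only; the return labels are hypothetical):

```python
def triage_alert(actionable: bool, urgent: bool) -> str:
    """Route an alert per the actionability matrix:
    actionable+urgent -> page, actionable -> ticket,
    otherwise remove it or demote it to a dashboard metric."""
    if actionable and urgent:
        return "page"
    if actionable:
        return "ticket"
    return "remove-or-dashboard"
```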
Problem: Teams skip documentation during high-pressure incidents
Solution:
Problem: Critical knowledge trapped in individual team members' heads
Solution:
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
Problem: Operations team resists change in order to maintain stability
Solution:
Availability:
Formula: (Total Time - Downtime) / Total Time × 100
Target: 99.9% (43.8 min/month downtime)
Measurement: Per service, monthly
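As a sanity check on the availability formula: a 43,800-minute month (365.25 days / 12) with 43.8 minutes of downtime lands exactly on the 99.9% target. A minimal sketch (hypothetical function name):

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (Total Time - Downtime) / Total Time x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


# 43.8 min of downtime in an average month meets the 99.9% target.
monthly = availability_pct(43_800, 43.8)
```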
MTTR (Mean Time to Recovery):
Formula: Sum of recovery times / Number of incidents
Target: < 30 minutes for P1, < 4 hours for P2
Measurement: Per severity level, monthly
MTBF (Mean Time Between Failures):
Formula: Total operational time / Number of failures
Target: > 720 hours (30 days)
Measurement: Per service, quarterly
MTTA (Mean Time to Acknowledge):
Formula: Sum of acknowledgment times / Number of alerts
Target: < 5 minutes for pages
Measurement: Per on-call engineer, weekly
Change Success Rate:
Formula: Successful changes / Total changes × 100
Target: > 95%
Measurement: Monthly
Incident Recurrence Rate:
Formula: Repeat incidents / Total incidents × 100
Target: < 10%
Measurement: Quarterly (same root cause within 90 days)
Toil Percentage:
Definition: Time spent on manual, repetitive tasks
Target: < 30% of team capacity
Measurement: Weekly time tracking
Automation Coverage:
Formula: Automated tasks / Total repetitive tasks × 100
Target: > 70%
Measurement: Quarterly audit
On-Call Load:
Formula: Alerts per on-call shift
Target: < 5 actionable alerts per shift
Measurement: Per engineer, weekly
Runbook Coverage:
Formula: Services with runbooks / Total services × 100
Target: 100%
Measurement: Monthly audit
Knowledge Base Utilization:
Formula: Incidents resolved via KB / Total incidents × 100
Target: > 40%
Measurement: Monthly
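Several of the formulas above are simple ratios; a minimal sketch for MTTR and change success rate (hypothetical helper names, matching the formulas as stated):

```python
def mttr_minutes(recovery_minutes: list[float]) -> float:
    """MTTR = sum of recovery times / number of incidents."""
    return sum(recovery_minutes) / len(recovery_minutes)


def change_success_rate(successful: int, total: int) -> float:
    """Change success rate = successful changes / total changes x 100."""
    return successful / total * 100


# Two P1 incidents recovered in 20 and 40 minutes -> MTTR of 30 min,
# meeting the "< 30 minutes for P1" target only at the boundary.
mttr = mttr_minutes([20, 40])
```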
Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action Items (with owners and due dates)
- Lessons Learned (shareable insights)
Rules:
- No blame or punishment
- Focus on systems and processes, not people
- Everyone can speak freely
- Action items must be tracked to completion
Runbook Contents:
- Service Overview: Purpose, dependencies, architecture
- SLIs/SLOs/SLAs: Defined thresholds and targets
- Common Issues: Symptoms, causes, solutions
- Troubleshooting Steps: Step-by-step procedures
- Escalation Paths: Who to contact and when
- Useful Commands: Copy-paste ready commands
- Dashboard Links: Direct links to relevant dashboards
- Recent Changes: Link to change log
- Contact Information: Team, product owner, SMEs
Maintenance:
- Review quarterly or after major incidents
- Test procedures during low-traffic periods
- Update after every significant change
- Track usage metrics (page views, helpfulness ratings)
On-Call Preparation:
- Laptop with VPN access
- Mobile device with notification apps
- Contact list (escalation paths)
- Access to all critical systems
- Runbooks bookmarked
- Backup on-call identified
During On-Call:
- Acknowledge alerts within 5 minutes
- Update incident status regularly
- Follow escalation procedures
- Document all actions in incident ticket
- Handoff clearly to next on-call
Post On-Call:
- Complete incident reports
- Submit toil reduction tickets
- Provide feedback on runbooks
- Update on-call documentation
Standard Change Process:
1. Create change request (RFC)
2. Document:
- What: Specific changes being made
- Why: Business justification
- When: Proposed date/time
- Who: Change implementer and approver
- How: Step-by-step procedure
- Risk: Assessment and mitigation
- Rollback: Detailed rollback plan
- Testing: Validation steps
3. Submit for CAB review (7 days advance notice)
4. Implement during approved window
5. Validate success criteria
6. Close change with actual results
7. Post-implementation review if issues occurred
Emergency Change Process:
- Executive approval required
- Implement with heightened monitoring
- Full team notification
- Complete documentation within 24 hours
- Mandatory post-change review
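The RFC fields listed under "Document" in the standard change process map naturally onto a record type. The class below is a hypothetical sketch, not part of any ticketing system's API:

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """One RFC, with the What/Why/When/Who/How/Risk/Rollback/Testing
    fields required by the standard change process."""
    what: str       # specific changes being made
    why: str        # business justification
    when: str       # proposed date/time
    who: str        # implementer and approver
    how: str        # step-by-step procedure
    risk: str       # assessment and mitigation
    rollback: str   # detailed rollback plan
    testing: str    # validation steps
    status: str = "submitted"  # advances through CAB review to closure
```

Requiring every field at construction time is a cheap way to enforce that no RFC reaches CAB review with its rollback plan or validation steps missing.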
For detailed technical guidance, see: