Azure Resource Health Diagnosis and Remediation Guide - Analyzing Issues with Logs and Telemetry

azure-resource-health-diagnose by github/awesome-copilot

7,300 weekly installs

26,700 GitHub Stars

Install command

npx skills add https://github.com/github/awesome-copilot --skill azure-resource-health-diagnose


Azure Resource Health & Issue Diagnosis

This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.

Prerequisites

- Azure MCP server configured and authenticated
- Target Azure resource identified (name and optionally resource group/subscription)
- Resource must be deployed and running to generate logs/telemetry
- Prefer Azure MCP tools (azmcp-*) over direct Azure CLI when available

Workflow Steps

Step 1: Get Azure Best Practices

Action: Retrieve diagnostic and troubleshooting best practices
Tools: Azure MCP best practices tool
Process:

Load Best Practices:
- Execute Azure best practices tool to get diagnostic guidelines
- Focus on health monitoring, log analysis, and issue resolution patterns
- Use these practices to inform diagnostic approach and remediation recommendations

Step 2: Resource Discovery & Identification

Action: Locate and identify the target Azure resource
Tools: Azure MCP tools + Azure CLI fallback
Process:

Resource Lookup:
- If only resource name provided: Search across subscriptions using azmcp-subscription-list
- Use az resource list --name <resource-name> to find matching resources
- If multiple matches found, prompt user to specify subscription/resource group
- Gather detailed resource information:
  - Resource type and current status
  - Location, tags, and configuration
  - Associated services and dependencies
Resource Type Detection:
- Identify resource type to determine appropriate diagnostic approach:
  - Web Apps/Function Apps: Application logs, performance metrics, dependency tracking
  - Virtual Machines: System logs, performance counters, boot diagnostics
  - Cosmos DB: Request metrics, throttling, partition statistics
  - Storage Accounts: Access logs, performance metrics, availability
  - SQL Database: Query performance, connection logs, resource utilization
  - Application Insights: Application telemetry, exceptions, dependencies
  - Key Vault: Access logs, certificate status, secret usage
  - Service Bus: Message metrics, dead letter queues, throughput
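
The lookup and type-detection logic in Step 2 could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it presumes the JSON array printed by `az resource list --name <resource-name>` has already been captured as a string, and the names `DIAGNOSTIC_FOCUS`, `GENERIC_FOCUS`, `resolve_resource`, and `diagnostic_focus` are hypothetical helpers, not part of Azure MCP or the Azure CLI.

```python
import json

# Hypothetical lookup table: ARM resource type (lower-cased) -> diagnostic
# focus areas taken from the Resource Type Detection list in Step 2.
DIAGNOSTIC_FOCUS = {
    "microsoft.web/sites": ["application logs", "performance metrics", "dependency tracking"],
    "microsoft.compute/virtualmachines": ["system logs", "performance counters", "boot diagnostics"],
    "microsoft.documentdb/databaseaccounts": ["request metrics", "throttling", "partition statistics"],
    "microsoft.storage/storageaccounts": ["access logs", "performance metrics", "availability"],
    "microsoft.sql/servers/databases": ["query performance", "connection logs", "resource utilization"],
    "microsoft.insights/components": ["application telemetry", "exceptions", "dependencies"],
    "microsoft.keyvault/vaults": ["access logs", "certificate status", "secret usage"],
    "microsoft.servicebus/namespaces": ["message metrics", "dead letter queues", "throughput"],
}

# Assumed fallback for resource types the table does not cover.
GENERIC_FOCUS = ["generic health assessment"]


def resolve_resource(az_output, resource_group=None):
    """Pick exactly one resource from `az resource list` JSON output.

    Raises LookupError when nothing matches, or when several resources
    match and the user has to specify a subscription/resource group.
    """
    matches = json.loads(az_output)
    if resource_group is not None:
        wanted = resource_group.lower()
        matches = [m for m in matches if m.get("resourceGroup", "").lower() == wanted]
    if not matches:
        raise LookupError("no matching resource found")
    if len(matches) > 1:
        groups = sorted({m.get("resourceGroup", "?") for m in matches})
        raise LookupError(f"multiple matches, specify a resource group: {groups}")
    return matches[0]


def diagnostic_focus(resource_type):
    """Return the focus areas for a resource type (case-insensitive)."""
    return DIAGNOSTIC_FOCUS.get(resource_type.lower(), GENERIC_FOCUS)
```

When `resolve_resource` raises because of multiple matches, the workflow would prompt the user and call it again with the chosen resource group.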

Step 3: Health Status Assessment

Action: Evaluate current resource health and availability
Tools: Azure MCP monitoring tools + Azure CLI
Process:

Basic Health Check:
- Check resource provisioning state and operational status
- Verify service availability and responsiveness
- Review recent deployment or configuration changes
- Assess current resource utilization (CPU, memory, storage, etc.)
Service-Specific Health Indicators:
- Web Apps: HTTP response codes, response times, uptime
- Databases: Connection success rate, query performance, deadlocks
- Storage: Availability percentage, request success rate, latency
- VMs: Boot diagnostics, guest OS metrics, network connectivity
- Functions: Execution success rate, duration, error frequency
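
As a rough illustration of the basic health check, the sketch below folds a provisioning state and a request success rate into one of the three status labels used later in the report (Healthy/Warning/Critical). The function name `assess_health` and both thresholds are assumptions chosen for the example; real thresholds depend on the service and its SLOs.

```python
def assess_health(provisioning_state, total_requests, failed_requests,
                  warn_error_rate=1.0, critical_error_rate=5.0):
    """Return (status, error_rate_percent) for a resource.

    Assumed rules for this sketch:
    - provisioning state "Failed" is Critical
    - any other state than "Succeeded" (for example "Updating") is Warning
    - with no traffic the error rate cannot be assessed, so report Warning
    - otherwise compare the error rate against the two thresholds
    """
    state = provisioning_state.lower()
    if state == "failed":
        return "Critical", None
    if state != "succeeded":
        return "Warning", None
    if total_requests == 0:
        return "Warning", None

    error_rate = 100.0 * failed_requests / total_requests
    if error_rate >= critical_error_rate:
        status = "Critical"
    elif error_rate >= warn_error_rate:
        status = "Warning"
    else:
        status = "Healthy"
    return status, round(error_rate, 2)
```

The same shape works for other indicators in the list, such as connection success rate for databases or execution success rate for Functions.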

Step 4: Log & Telemetry Analysis

Action: Analyze logs and telemetry to identify issues and patterns
Tools: Azure MCP monitoring tools for Log Analytics queries
Process:

Find Monitoring Sources:
- Use azmcp-monitor-workspace-list to identify Log Analytics workspaces
- Locate Application Insights instances associated with the resource
- Identify relevant log tables using azmcp-monitor-table-list
Execute Diagnostic Queries: Use azmcp-monitor-log-query with targeted KQL queries based on resource type:

General Error Analysis:

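     // Recent errors and exceptions
     union isfuzzy=true 
         AzureDiagnostics,
         AppServiceHTTPLogs,
         AppServiceAppLogs,
         AzureActivity
     | where TimeGenerated > ago(24h)
     // Tables without a ResultType column yield empty values in the union;
     // the isnotempty() guard keeps those rows from being counted as failures
     | where Level == "Error" or (isnotempty(ResultType) and ResultType != "Success")
     | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
     | order by TimeGenerated desc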

Performance Analysis:

     // Performance degradation patterns
     Perf
     | where TimeGenerated > ago(7d)
     | where ObjectName == "Processor" and CounterName == "% Processor Time"
     | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
     | where avg_CounterValue > 80

Application-Specific Queries:

     // Application Insights - Failed requests
     requests
     | where timestamp > ago(24h)
     | where success == false
     | summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
     | order by timestamp desc
     
     // Database - Connection failures
     AzureDiagnostics
     | where ResourceProvider == "MICROSOFT.SQL"
     | where Category == "SQLSecurityAuditEvents"
     | where action_name_s == "CONNECTION_FAILED"
     | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)

Pattern Recognition:
- Identify recurring error patterns or anomalies
- Correlate errors with deployment times or configuration changes
- Analyze performance trends and degradation patterns
- Look for dependency failures or external service issues
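
One way to picture the pattern-recognition step is to post-process the hourly error counts returned by a query like the ones above. The sketch below is a simple illustration, not the workflow's prescribed method: it flags buckets whose count exceeds the mean by a chosen number of standard deviations and pairs each spike with nearby deployments. The function names, the two-sigma default, and the one-hour lookback are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def find_error_spikes(hourly_counts, threshold_sigmas=2.0):
    """Return the (bucket_start, count) pairs that stand out.

    hourly_counts is a list of (datetime, int) pairs, for example the
    rows of a `summarize count() by bin(TimeGenerated, 1h)` result.
    A bucket is a spike when count > mean + threshold_sigmas * stddev.
    """
    counts = [count for _, count in hourly_counts]
    if len(counts) < 2:
        return []
    limit = mean(counts) + threshold_sigmas * pstdev(counts)
    return [(ts, count) for ts, count in hourly_counts if count > limit]


def correlate_with_deployments(spikes, deployments,
                               lookback=timedelta(hours=1),
                               bucket=timedelta(hours=1)):
    """Attach to each spike the deployments that happened inside the
    spike's bucket or up to `lookback` before the bucket started."""
    correlated = []
    for ts, count in spikes:
        nearby = [d for d in deployments if ts - lookback <= d < ts + bucket]
        correlated.append((ts, count, nearby))
    return correlated
```

A spike with an empty deployment list points away from a release and toward resource constraints or external dependencies.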

Step 5: Issue Classification & Root Cause Analysis

Action: Categorize identified issues and determine root causes
Process:

Issue Classification:
- Critical: Service unavailable, data loss, security breaches
- High: Performance degradation, intermittent failures, high error rates
- Medium: Warnings, suboptimal configuration, minor performance issues
- Low: Informational alerts, optimization opportunities
Root Cause Analysis:
- Configuration Issues: Incorrect settings, missing dependencies
- Resource Constraints: CPU/memory/disk limitations, throttling
- Network Issues: Connectivity problems, DNS resolution, firewall rules
- Application Issues: Code bugs, memory leaks, inefficient queries
- External Dependencies: Third-party service failures, API limits
- Security Issues: Authentication failures, certificate expiration
Impact Assessment:
- Determine business impact and affected users/systems
- Evaluate data integrity and security implications
- Assess recovery time objectives and priorities
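
The four severity levels above map naturally onto an ordered enum. The following sketch shows one possible encoding; the signal names in `SEVERITY_BY_SIGNAL` are hypothetical labels derived from the classification list, and treating unknown signals as Low is an assumption made for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means more severe, so sorting puts Critical first.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical signal names, one per example in the classification list.
SEVERITY_BY_SIGNAL = {
    "service_unavailable": Severity.CRITICAL,
    "data_loss": Severity.CRITICAL,
    "security_breach": Severity.CRITICAL,
    "performance_degradation": Severity.HIGH,
    "intermittent_failure": Severity.HIGH,
    "high_error_rate": Severity.HIGH,
    "warning": Severity.MEDIUM,
    "suboptimal_configuration": Severity.MEDIUM,
    "minor_performance_issue": Severity.MEDIUM,
    "informational_alert": Severity.LOW,
    "optimization_opportunity": Severity.LOW,
}


def classify(signals):
    """Return the most severe level among the given signals.

    Unknown signals, and an empty signal list, are treated as LOW.
    """
    levels = (SEVERITY_BY_SIGNAL.get(s, Severity.LOW) for s in signals)
    return min(levels, default=Severity.LOW)


def prioritize(issues):
    """Sort (title, signals) pairs into (Severity, title), most severe first."""
    ranked = [(classify(signals), title) for title, signals in issues]
    return sorted(ranked, key=lambda pair: pair[0])
```

Because `sorted` is stable, issues of equal severity keep the order in which they were discovered.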

Step 6: Generate Remediation Plan

Action: Create a comprehensive plan to address identified issues
Process:

Immediate Actions (Critical issues):
- Emergency fixes to restore service availability
- Temporary workarounds to mitigate impact
- Escalation procedures for complex issues
Short-term Fixes (High/Medium issues):
- Configuration adjustments and resource scaling
- Application updates and patches
- Monitoring and alerting improvements
Long-term Improvements (All issues):
- Architectural changes for better resilience
- Preventive measures and monitoring enhancements
- Documentation and process improvements
Implementation Steps:
- Prioritized action items with specific Azure CLI commands
- Testing and validation procedures
- Rollback plans for each change
- Monitoring to verify issue resolution
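
The implementation-step requirements (a command, a validation procedure, and a rollback plan for each change) can be enforced with a small data structure. This is a sketch under assumptions: `ActionItem` and `build_plan` are hypothetical names, and the command field is expected to hold whatever Azure CLI command the operator supplies.

```python
from dataclasses import dataclass

# Phase names follow the three groups described in Step 6.
PHASES = ("Immediate Actions", "Short-term Fixes", "Long-term Improvements")


@dataclass
class ActionItem:
    description: str
    phase: str       # must be one of PHASES
    command: str     # Azure CLI command supplied by the operator
    validation: str  # how to confirm the change worked
    rollback: str    # how to undo the change


def build_plan(items):
    """Group action items by phase, keeping the phases in execution order.

    Raises ValueError for an unknown phase or for an item that lacks a
    validation procedure or a rollback plan.
    """
    plan = {phase: [] for phase in PHASES}
    for item in items:
        if item.phase not in plan:
            raise ValueError(f"unknown phase: {item.phase!r}")
        if not item.validation or not item.rollback:
            raise ValueError(f"{item.description!r} needs validation and rollback")
        plan[item.phase].append(item)
    return plan
```

Rejecting incomplete items at plan-building time keeps a change without a rollback path from reaching the user confirmation step.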

Step 7: User Confirmation & Report Generation

Action: Present findings and get approval for remediation actions
Process:

Display Health Assessment Summary:

🏥 Azure Resource Health Assessment

📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Location: [Region]
• Last Analyzed: [Timestamp]

🚨 Issues Identified:
• Critical: X issues requiring immediate attention
• High: Y issues affecting performance/reliability  
• Medium: Z issues for optimization
• Low: N informational items

🔍 Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]

🛠️ Remediation Plan:
• Immediate Actions: X items
• Short-term Fixes: Y items  
• Long-term Improvements: Z items
• Estimated Resolution Time: [Timeline]

❓ Proceed with detailed remediation plan? (y/n)

Generate Detailed Report:

# Azure Resource Health Report: [Resource Name]

**Generated**: [Timestamp]  
**Resource**: [Full Resource ID]  
**Overall Health**: [Status with color indicator]

## 🔍 Executive Summary
[Brief overview of health status and key findings]

## 📊 Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]

## 🚨 Issues Identified

### Critical Issues
- **[Issue 1]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Business impact]
  - **Immediate Action**: [Required steps]

### High Priority Issues  
- **[Issue 2]**: [Description]
  - **Root Cause**: [Analysis]
  - **Impact**: [Performance/reliability impact]
  - **Recommended Fix**: [Solution steps]

## 🛠️ Remediation Plan

### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
```

### Phase 2: Short-term Fixes (2-24 hours)
```bash
# Performance and reliability improvements
[Azure CLI commands with explanations]
```

### Phase 3: Long-term Improvements (1-4 weeks)
```bash
# Architectural and preventive measures
[Azure CLI commands and configuration changes]
```

## 📈 Monitoring Recommendations
- **Alerts to Configure**: [List of recommended alerts]
- **Dashboards to Create**: [Monitoring dashboard suggestions]
- **Regular Health Checks**: [Recommended frequency and scope]

## ✅ Validation Steps
- Verify issue resolution through logs
- Confirm performance improvements
- Test application functionality
- Update monitoring and alerting
- Document lessons learned

## 📝 Prevention Measures
- [Recommendations to prevent similar issues]
- [Process improvements]
- [Monitoring enhancements]

Error Handling

- Resource Not Found: Provide guidance on resource name/location specification
- Authentication Issues: Guide user through Azure authentication setup
- Insufficient Permissions: List required RBAC roles for resource access
- No Logs Available: Suggest enabling diagnostic settings and waiting for data
- Query Timeouts: Break down analysis into smaller time windows
- Service-Specific Issues: Provide generic health assessment with limitations noted
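
For the query-timeout case, breaking the analysis into smaller time windows could look like the sketch below. It is an illustration only: `split_time_range` and `kql_time_filter` are hypothetical helpers, and the filter assumes the table uses a `TimeGenerated` column, as the Log Analytics queries in Step 4 do.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, window):
    """Yield consecutive (window_start, window_end) pairs covering
    the half-open interval [start, end); the last pair may be shorter."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + window, end)
        yield cursor, upper
        cursor = upper


def kql_time_filter(start, end, column="TimeGenerated"):
    """Build a half-open KQL time predicate so adjacent windows
    never count the same row twice."""
    return (f"{column} >= datetime({start.isoformat()}) "
            f"and {column} < datetime({end.isoformat()})")
```

Each predicate can replace the `ago(...)` filter in a diagnostic query, and the per-window results are then combined on the client side.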

Success Criteria

✅ Resource health status accurately assessed
✅ All significant issues identified and categorized
✅ Root cause analysis completed for major problems
✅ Actionable remediation plan with specific steps provided
✅ Monitoring and prevention recommendations included
✅ Clear prioritization of issues by business impact
✅ Implementation steps include validation and rollback procedures

First Seen: Feb 25, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)

Installed on: codex 7.2K, gemini-cli 7.2K, opencode 7.2K, cursor 7.2K, github-copilot 7.1K, kimi-cli 7.1K
