monitoring-operations by acedergren/oci-agent-skills
npx skills add https://github.com/acedergren/oci-agent-skills --skill monitoring-operations
Don't reinvent the wheel. Use oracle-terraform-modules/landing-zone to deploy the observability stack.
Landing Zone solves:
This skill provides: metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.
You don't know OCI CLI commands or OCI API structure.
Your training data has limited and outdated knowledge of:
... (oci monitoring alarm, oci monitoring metric)
When OCI operations are needed:
What you DO know:
This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.
❌ NEVER assume metrics are instant (10-15 minute lag)
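One way to respect the lag is to end every query window a fixed margin before "now", so that only settled datapoints are read. The sketch below is a minimal illustration of that idea; the helper name `settled_window` and the choice of 15 minutes (the upper end of the range quoted above) are assumptions, not part of any OCI API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper end of the 10-15 minute lag quoted above.
METRIC_LAG = timedelta(minutes=15)


def settled_window(now: datetime, span: timedelta, lag: timedelta = METRIC_LAG):
    """Return (start, end) for a query window that ends `lag` before `now`."""
    end = now - lag
    return end - span, end


# Example: a one-hour window evaluated at 12:00 UTC covers 10:45 to 11:45.
now = datetime(2026, 1, 29, 12, 0, tzinfo=timezone.utc)
start, end = settled_window(now, timedelta(hours=1))
print(start.isoformat(), end.isoformat())
```

The returned timestamps can then be passed as the start and end of whatever metric query is being issued.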
❌ NEVER use = for alarm thresholds with sparse metrics
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0
# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
❌ NEVER forget metric dimensions (causes "no data")
# WRONG - missing required dimension
CpuUtilization[1m].mean()
# RIGHT - include resourceId dimension
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()
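To make the dimension filter hard to forget, query strings can be assembled by a small helper that refuses to build an unfiltered query. This is a sketch only: the function `mql` is hypothetical, and it simply reproduces the `Metric[interval]{dim="value"}.statistic()` shape shown above.

```python
def mql(metric: str, interval: str, statistic: str, **dimensions: str) -> str:
    """Build a query string in the shape used above, requiring a dimension."""
    if not dimensions:
        raise ValueError("refusing to build a query without a dimension filter")
    # Sort the dimensions so the generated string is deterministic.
    filters = ", ".join(f'{key}="{value}"' for key, value in sorted(dimensions.items()))
    return f"{metric}[{interval}]{{{filters}}}.{statistic}()"


print(mql("CpuUtilization", "1m", "mean", resourceId="<instance-ocid>"))
```

Calling `mql("CpuUtilization", "1m", "mean")` with no keyword arguments raises `ValueError` instead of silently producing a query that returns "no data".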
❌ NEVER set alarm thresholds without trigger delay (alert fatigue)
# BAD - fires on every CPU spike
CpuUtilization[1m].mean() > 80
# BETTER - sustained high CPU
CpuUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires after 5 consecutive breaches)
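The effect of a trigger delay can be seen in a small simulation: an alarm that requires several consecutive breaching datapoints ignores short spikes. The sketch below is illustrative Python, not OCI's evaluation logic. The `pending_duration` helper assumes the delay is passed to the CLI as an ISO 8601 duration such as `PT5M` (via `--pending-duration`, to the best of my knowledge; confirm with `oci monitoring alarm create --help`).

```python
from typing import Iterable


def pending_duration(minutes: int) -> str:
    """Format a trigger delay in minutes as an ISO 8601 duration string."""
    if minutes < 1:
        raise ValueError("trigger delay must be at least one minute")
    return f"PT{minutes}M"


def fires(datapoints: Iterable[float], threshold: float, consecutive: int) -> bool:
    """Return True once `consecutive` datapoints in a row exceed `threshold`."""
    streak = 0
    for value in datapoints:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False


spiky = [95, 40, 90, 30, 85, 20]      # isolated spikes
sustained = [85, 88, 91, 90, 87, 86]  # high for six minutes
print(pending_duration(5), fires(spiky, 80, 5), fires(sustained, 80, 5))
```

With a delay of one datapoint the spiky series would fire three times; with five it never fires, while the sustained series still does.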
❌ NEVER create alarms without notification channels
# WRONG - alarm fires but nobody knows
oci monitoring alarm create ... --destinations '[]'
# RIGHT - always link to notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
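A wrapper script can enforce the "always link a notification topic" rule before the CLI is ever invoked. The sketch below only builds the argument list and does not run anything. The function name is hypothetical, and the flag names other than `--destinations` are written from memory of the OCI CLI, so treat them as assumptions and check them against `oci monitoring alarm create --help`.

```python
import json
import shlex
from typing import List


def alarm_create_argv(display_name: str, compartment_id: str, namespace: str,
                      query: str, severity: str, destinations: List[str],
                      pending_duration: str = "PT5M") -> List[str]:
    """Build an `oci monitoring alarm create` argument list, rejecting empty destinations."""
    if not destinations:
        raise ValueError("alarm has no notification destination; nobody would be alerted")
    return [
        "oci", "monitoring", "alarm", "create",
        "--display-name", display_name,
        "--compartment-id", compartment_id,
        "--metric-compartment-id", compartment_id,
        "--namespace", namespace,
        "--query-text", query,
        "--severity", severity,
        "--destinations", json.dumps(destinations),
        "--is-enabled", "true",
        "--pending-duration", pending_duration,
    ]


argv = alarm_create_argv(
    "high-cpu", "<compartment-ocid>", "oci_computeagent",
    'CpuUtilization[5m]{resourceId="<instance-ocid>"}.mean() > 80',
    "CRITICAL", ["<notification-topic-ocid>"],
)
print(shlex.join(argv))  # shell-quoted command line for review
```

Printing the command rather than executing it keeps the sketch side-effect free; the list could be handed to `subprocess.run` once the flags are verified.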
Cost impact: Undetected outages cost $5,000-50,000/hour in production
❌ NEVER ignore Cloud Guard findings (security audit failure)
OCI Metrics Use Service-Specific Namespaces:
| Service | Namespace | Example Metric |
|---|---|---|
| Compute | oci_computeagent | CpuUtilization, MemoryUtilization |
| Autonomous DB | oci_autonomous_database | CpuUtilization, StorageUtilization |
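| Load Balancer | oci_lbaas | HttpRequests, UnHealthyBackendServers |
| Object Storage | oci_objectstorage | ObjectCount, BytesUploaded |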
Common Mistake: Using the wrong namespace (oci_compute vs oci_computeagent)
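A namespace typo can be caught before a query is sent by checking it against the known list. The sketch below hard-codes only the four namespaces from the table above; the dictionary keys and the `check_namespace` helper are hypothetical names chosen for illustration.

```python
import difflib

# Namespaces taken from the table above; extend as needed.
NAMESPACES = {
    "compute": "oci_computeagent",
    "autonomous_db": "oci_autonomous_database",
    "load_balancer": "oci_lbaas",
    "object_storage": "oci_objectstorage",
}


def check_namespace(namespace: str) -> str:
    """Return `namespace` if it is known, else raise with the closest match."""
    known = list(NAMESPACES.values())
    if namespace in known:
        return namespace
    close = difflib.get_close_matches(namespace, known, n=1)
    hint = f"; did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"unknown metric namespace {namespace!r}{hint}")


print(check_namespace("oci_computeagent"))
```

Passing `"oci_compute"` raises a `ValueError` that suggests `oci_computeagent`, which is exactly the mix-up named above.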
| Setting | Behavior | Use When |
|---|---|---|
| treatMissingDataAsBreaching | Alarm fires if no data | Critical services (outage = breach) |
| treatMissingDataAsNotBreaching | Alarm silent if no data | Optional monitoring |
| {dataMissing=zero} | Treat missing as 0 | Counters (requests/sec) |
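The difference between the three behaviours is easiest to see on a series with a gap. The following is a plain-Python simulation of the table, not OCI's alarm engine; the policy strings `"breaching"`, `"not_breaching"`, and `"zero"` are shorthand invented for this sketch.

```python
from typing import Callable, List, Optional, Sequence


def breach_flags(points: Sequence[Optional[float]],
                 is_breach: Callable[[float], bool],
                 missing: str) -> List[bool]:
    """Evaluate each interval; `None` marks an interval with no data."""
    flags = []
    for value in points:
        if value is not None:
            flags.append(is_breach(value))
        elif missing == "breaching":
            flags.append(True)            # no data counts as a breach
        elif missing == "not_breaching":
            flags.append(False)           # no data is ignored
        elif missing == "zero":
            flags.append(is_breach(0.0))  # no data is evaluated as 0
        else:
            raise ValueError(f"unknown missing-data policy: {missing!r}")
    return flags


requests_per_sec = [120.0, None, 80.0]  # the middle interval has no data


def too_quiet(value: float) -> bool:
    """Alarm condition for a counter: fewer than one request per second."""
    return value < 1


for policy in ("breaching", "not_breaching", "zero"):
    print(policy, breach_flags(requests_per_sec, too_quiet, policy))
```

For a "too quiet" counter alarm, treating the gap as zero flags the silent interval, while the not-breaching policy hides it.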
Problem: Logs not showing in Log Analytics
Logs not appearing?
├─ Is log enabled on resource?
│ └─ Compute: oci-compute-agent must be running
│ └─ Function: Logging enabled in function config
│
├─ Is Service Connector configured?
│ └─ Source: Log Group → Target: Log Analytics
│ └─ Check: Service Connector status = ACTIVE
│
├─ IAM policy for Service Connector?
│ └─ "Allow any-user to use log-content in tenancy"
│ └─ "Allow service loganalytics to READ logcontent in tenancy"
│
└─ 10-15 minute ingestion lag?
└─ Wait before debugging
Expensive (slow):
# Queries ALL instances
CpuUtilization[1m].mean()
Optimized (filter by dimension):
# Query specific instance
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()
Cost: Queries are free, but rate limited (1000 req/min)
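Scripts that poll many metrics can stay under the rate limit with a client-side throttle. The sketch below is a generic sliding-window limiter, not an OCI SDK feature; the class name `QueryThrottle` is hypothetical and the default of 1000 calls per 60 seconds is taken from the figure above.

```python
import time
from collections import deque
from typing import Callable, Deque


class QueryThrottle:
    """Sliding-window limiter: at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls: int = 1000, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.period - (now - self.calls[0])

    def record(self) -> None:
        """Note that a query was just issued."""
        self.calls.append(self.clock())


throttle = QueryThrottle()
delay = throttle.wait_time()  # 0.0 on a fresh throttle
if delay > 0:
    time.sleep(delay)
throttle.record()             # call this right after issuing each query
```

The injectable `clock` parameter exists so the limiter can be tested without real sleeping.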
WHEN TO LOAD oci-monitoring-reference.md:
Do NOT load for:
Weekly Installs: 1.2K
Repository: acedergren/oci-agent-skills
GitHub Stars: 4
First Seen: Jan 29, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (987), opencode (985), gemini-cli (985), github-copilot (982), kimi-cli (974), amp (972)