Airflow DAG Testing and Debugging Guide: Quickly Validate Workflows with af runs trigger-wait

testing-dags by astronomer/agents

462 weekly installs

290 GitHub Stars

Install command

npx skills add https://github.com/astronomer/agents --skill testing-dags

Automation, DevOps, Testing


DAG Testing Skill

Use af commands to test, debug, and fix DAGs in iterative cycles.

Running the CLI

Run all af commands using uvx (no installation required):

uvx --from astro-airflow-mcp af <command>

Throughout this document, af is shorthand for uvx --from astro-airflow-mcp af.
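When the commands are driven from a script, the shorthand can be expanded programmatically. The following is a minimal sketch using only the Python standard library; the helper name `af_argv` is hypothetical, and the code only builds the argument list without running anything.

```python
import shlex

# Prefix taken from the text above: `af` stands for this uvx invocation.
AF_PREFIX = ["uvx", "--from", "astro-airflow-mcp", "af"]


def af_argv(*args: str) -> list[str]:
    """Return the full argument list for an `af` subcommand (hypothetical helper)."""
    return AF_PREFIX + list(args)


argv = af_argv("runs", "trigger-wait", "my_dag", "--timeout", "300")
# Render the list as a single shell-safe command line for display.
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` as-is, which avoids shell quoting issues.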

Quick Validation with Astro CLI

If the user has the Astro CLI available, these commands provide fast feedback without needing a running Airflow instance:

# Parse DAGs to catch import errors, syntax issues, and DAG-level problems
astro dev parse

# Run pytest against DAGs (runs tests in tests/ directory)
astro dev pytest

Use these for quick validation during development. For full end-to-end testing against a live Airflow instance, continue to the trigger-and-wait workflow below.

FIRST ACTION: Just Trigger the DAG

When the user asks to test a DAG, your FIRST AND ONLY action should be:

af runs trigger-wait <dag_id>

DO NOT:

Call af dags list first
Call af dags get first
Call af dags errors first
Use grep or ls or any other bash command
Do any "pre-flight checks"

Just trigger the DAG. If it fails, THEN debug.

Testing Workflow Overview

┌─────────────────────────────────────┐
│ 1. TRIGGER AND WAIT                 │
│    Run DAG, wait for completion     │
└─────────────────────────────────────┘
                 ↓
        ┌───────┴───────┐
        ↓               ↓
   ┌─────────┐    ┌──────────┐
   │ SUCCESS │    │ FAILED   │
   │ Done!   │    │ Debug... │
   └─────────┘    └──────────┘
                       ↓
        ┌─────────────────────────────────────┐
        │ 2. DEBUG (only if failed)           │
        │    Get logs, identify root cause    │
        └─────────────────────────────────────┘
                       ↓
        ┌─────────────────────────────────────┐
        │ 3. FIX AND RETEST                   │
        │    Apply fix, restart from step 1   │
        └─────────────────────────────────────┘

Philosophy: Try first, debug on failure. Don't waste time on pre-flight checks — just run the DAG and diagnose if something goes wrong.

Phase 1: Trigger and Wait

Use af runs trigger-wait to test the DAG:

Primary Method: Trigger and Wait

af runs trigger-wait <dag_id> --timeout 300

Example:

af runs trigger-wait my_dag --timeout 300

Why this is the preferred method:

Single command handles trigger + monitoring
Returns immediately when DAG completes (success or failure)
Includes failed task details if run fails
No manual polling required

Response Interpretation

Success:

{
  "dag_run": {
    "dag_id": "my_dag",
    "dag_run_id": "manual__2025-01-14T...",
    "state": "success",
    "start_date": "...",
    "end_date": "..."
  },
  "timed_out": false,
  "elapsed_seconds": 45.2
}

Failure:

{
  "dag_run": {
    "state": "failed"
  },
  "timed_out": false,
  "elapsed_seconds": 30.1,
  "failed_tasks": [
    {
      "task_id": "extract_data",
      "state": "failed",
      "try_number": 2
    }
  ]
}

Timeout:

{
  "dag_id": "my_dag",
  "dag_run_id": "manual__...",
  "state": "running",
  "timed_out": true,
  "elapsed_seconds": 300.0,
  "message": "Timed out after 300 seconds. DAG run is still running."
}
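The three response shapes above can be told apart with a few dictionary lookups. This is a sketch that assumes the responses look exactly like the samples shown (state nested under `dag_run` for finished runs, `timed_out` at the top level); the function names are hypothetical.

```python
import json


def classify_response(resp: dict) -> str:
    """Map a trigger-wait response to 'timed_out', 'success', 'failed', or 'unknown'."""
    if resp.get("timed_out"):
        return "timed_out"
    # The success and failure samples nest the run state under "dag_run".
    state = resp.get("dag_run", {}).get("state")
    if state in ("success", "failed"):
        return state
    return "unknown"


def failed_task_ids(resp: dict) -> list[str]:
    """Collect task ids from the optional "failed_tasks" list."""
    return [task["task_id"] for task in resp.get("failed_tasks", [])]


# Sample copied from the failure response above.
failure = json.loads("""
{
  "dag_run": {"state": "failed"},
  "timed_out": false,
  "elapsed_seconds": 30.1,
  "failed_tasks": [{"task_id": "extract_data", "state": "failed", "try_number": 2}]
}
""")
print(classify_response(failure), failed_task_ids(failure))
```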

Alternative: Trigger and Monitor Separately

Use this only when you need more control:

# Step 1: Trigger
af runs trigger my_dag
# Returns: {"dag_run_id": "manual__...", "state": "queued"}

# Step 2: Check status
af runs get my_dag manual__2025-01-14T...
# Returns current state
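When triggering and monitoring separately, the status check has to be repeated until the run reaches a terminal state. The sketch below shows one way to structure that polling loop; `get_state` is a hypothetical callable standing in for a wrapper around `af runs get`, and the set of terminal states is an assumption based on the sample responses.

```python
import time
from typing import Callable

# Assumed terminal states, based on the sample responses in this guide.
TERMINAL_STATES = {"success", "failed"}


def wait_for_run(
    get_state: Callable[[], str],
    timeout: float,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, bool]:
    """Poll get_state() until a terminal state or timeout; return (state, timed_out)."""
    deadline = time.monotonic() + timeout
    state = get_state()
    while state not in TERMINAL_STATES:
        if time.monotonic() >= deadline:
            return state, True
        sleep(interval)
        state = get_state()
    return state, False


# Stand-in for repeated status checks: queued -> running -> success.
states = iter(["queued", "running", "success"])
print(wait_for_run(lambda: next(states), timeout=60, sleep=lambda seconds: None))
```

Injecting `sleep` keeps the loop testable without real delays; in practice the single `af runs trigger-wait` command already covers this case.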

Handling Results

If Success

The DAG ran successfully. Summarize for the user:

Total elapsed time
Number of tasks completed
Any notable outputs (if visible in logs)

You're done!

If Timed Out

The DAG is still running. Options:

Check current status: af runs get <dag_id> <dag_run_id>
Ask user if they want to continue waiting
Increase timeout and try again

If Failed

Move to Phase 2 (Debug) to identify the root cause.

Phase 2: Debug Failures (Only If Needed)

When a DAG run fails, use these commands to diagnose:

Get Comprehensive Diagnosis

af runs diagnose <dag_id> <dag_run_id>

Returns in one call:

Run metadata (state, timing)
All task instances with states
Summary of failed tasks
State counts (success, failed, skipped, etc.)
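To illustrate what the state counts and failed-task summary represent, here is a small sketch that derives both from a list of task instances. The record layout is hypothetical; the actual output of `af runs diagnose` may use different field names.

```python
from collections import Counter

# Hypothetical task-instance records; real diagnose output may differ.
task_instances = [
    {"task_id": "extract_data", "state": "failed"},
    {"task_id": "transform", "state": "upstream_failed"},
    {"task_id": "load", "state": "upstream_failed"},
    {"task_id": "notify", "state": "success"},
]

# Count how many task instances ended in each state.
state_counts = Counter(ti["state"] for ti in task_instances)

# Only tasks whose own state is "failed" are candidates for the root cause.
failed = [ti["task_id"] for ti in task_instances if ti["state"] == "failed"]

print(dict(state_counts))
print(failed)
```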

Get Task Logs

af tasks logs <dag_id> <dag_run_id> <task_id>

Example:

af tasks logs my_dag manual__2025-01-14T... extract_data

For specific retry attempt:

af tasks logs my_dag manual__2025-01-14T... extract_data --try 2

Look for:

Exception messages and stack traces
Connection errors (database, API, S3)
Permission errors
Timeout errors
Missing dependencies

Check Upstream Tasks

If a task shows upstream_failed, the root cause is in an upstream task. Use af runs diagnose to find which task actually failed.
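The upstream rule can be expressed as a short graph walk: starting from an `upstream_failed` task, follow its upstream dependencies until tasks that failed on their own are reached. This is a sketch over hypothetical data; the `states` and `upstream` mappings are illustrative and not produced by any `af` command in this form.

```python
def root_failures(task_id: str, states: dict, upstream: dict) -> list[str]:
    """Walk upstream from task_id and return tasks whose own state is 'failed'."""
    roots, seen, stack = [], set(), [task_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if states.get(current) == "failed":
            roots.append(current)
        elif states.get(current) == "upstream_failed":
            # Not the root cause itself: keep following its upstream tasks.
            stack.extend(upstream.get(current, []))
    return sorted(roots)


# Hypothetical three-task chain: extract_data -> transform -> load.
states = {"extract_data": "failed", "transform": "upstream_failed", "load": "upstream_failed"}
upstream = {"transform": ["extract_data"], "load": ["transform"]}
print(root_failures("load", states, upstream))
```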

Check Import Errors (If DAG Didn't Run)

If the trigger failed because the DAG doesn't exist:

af dags errors

This reveals syntax errors or missing dependencies that prevented the DAG from loading.

Phase 3: Fix and Retest

Once you identify the issue:

Common Fixes

Issue	Fix
Missing import	Add to DAG file
Missing package	Add to `requirements.txt`
Connection error	Check `af config connections`, verify credentials
Variable missing	Check `af config variables`, create if needed
Timeout	Increase task timeout or optimize query
Permission error	Check credentials in connection

After Fixing

Save the file
Retest: af runs trigger-wait <dag_id>

Repeat the test → debug → fix loop until the DAG succeeds.

CLI Quick Reference

Phase	Command	Purpose
Test	`af runs trigger-wait <dag_id>`	Primary test method — start here
Test	`af runs trigger <dag_id>`	Start run (alternative)
Test	`af runs get <dag_id> <run_id>`	Check run status
Debug	`af runs diagnose <dag_id> <run_id>`	Comprehensive failure diagnosis
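Debug	`af tasks logs <dag_id> <run_id> <task_id>`	Get task output/errors
Debug	`af dags errors`	Check for parse errors (if DAG won't load)
Debug	`af dags get <dag_id>`	Verify DAG config
Debug	`af dags explore <dag_id>`	Full DAG inspection
Config	`af config connections`	List connections
Config	`af config variables`	List variables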

Testing Scenarios

Scenario 1: Test a DAG (Happy Path)

af runs trigger-wait my_dag
# Success! Done.

Scenario 2: Test a DAG (With Failure)

# 1. Run and wait
af runs trigger-wait my_dag
# Failed...

# 2. Find failed tasks
af runs diagnose my_dag manual__2025-01-14T...

# 3. Get error details
af tasks logs my_dag manual__2025-01-14T... extract_data

# 4. [Fix the issue in DAG code]

# 5. Retest
af runs trigger-wait my_dag

Scenario 3: DAG Doesn't Exist / Won't Load

# 1. Trigger fails - DAG not found
af runs trigger-wait my_dag
# Error: DAG not found

# 2. Find parse error
af dags errors

# 3. [Fix the issue in DAG code]

# 4. Retest
af runs trigger-wait my_dag

Scenario 4: Debug a Failed Scheduled Run

# 1. Get failure summary
af runs diagnose my_dag scheduled__2025-01-14T...

# 2. Get error from failed task
af tasks logs my_dag scheduled__2025-01-14T... failed_task_id

# 3. [Fix the issue]

# 4. Retest
af runs trigger-wait my_dag

Scenario 5: Test with Custom Configuration

af runs trigger-wait my_dag --conf '{"env": "staging", "batch_size": 100}' --timeout 600
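The `--conf` value is a JSON string, so quoting mistakes are easy to make when it is typed by hand. A minimal sketch, assuming the command is assembled in Python: `json.dumps` serializes the configuration and `shlex.join` adds the quoting a POSIX shell needs.

```python
import json
import shlex

# Configuration values taken from the scenario above.
conf = {"env": "staging", "batch_size": 100}

argv = [
    "af", "runs", "trigger-wait", "my_dag",
    "--conf", json.dumps(conf),
    "--timeout", "600",
]

# shlex.join wraps the JSON argument in single quotes for the shell.
command = shlex.join(argv)
print(command)
```

This reproduces the command line shown in the scenario; passing `argv` as a list to a process launcher avoids the shell quoting step entirely.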

Scenario 6: Long-Running DAG

# Wait up to 1 hour
af runs trigger-wait my_dag --timeout 3600

# If timed out, check current state
af runs get my_dag manual__2025-01-14T...

Debugging Tips

Common Error Patterns

Connection Refused / Timeout:

Check af config connections for correct host/port
Verify network connectivity to external system
Check if connection credentials are correct

ModuleNotFoundError:

Package missing from requirements.txt
After adding, may need environment restart

PermissionError:

Check IAM roles, database grants, API keys
Verify connection has correct credentials

Task Timeout:

Query or operation taking too long
Consider adding timeout parameter to task
Optimize underlying query/operation
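The error patterns above lend themselves to a simple keyword triage of log text. The sketch below is illustrative only: the keyword lists and category names are assumptions, not an exhaustive catalogue, and real logs may need more specific matching.

```python
# Keyword lists are illustrative assumptions; extend them for your own stack.
PATTERNS = [
    ("missing_package", ("ModuleNotFoundError",)),
    ("permission", ("PermissionError", "permission denied", "access denied")),
    ("connection", ("connection refused", "ConnectionRefusedError")),
    ("timeout", ("TimeoutError", "timed out", "AirflowTaskTimeout")),
]


def classify_error(log_text: str) -> str:
    """Return the first matching category for a log excerpt, or 'unknown'."""
    lowered = log_text.lower()
    for category, needles in PATTERNS:
        if any(needle.lower() in lowered for needle in needles):
            return category
    return "unknown"


print(classify_error("ModuleNotFoundError: No module named 'pandas'"))
```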

Reading Task Logs

Task logs typically show:

Task start timestamp
Any print/log statements from task code
Return value (for @task decorated functions)
Exception + full stack trace (if failed)
Task end timestamp and duration

Focus on the exception at the bottom of failed task logs.
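Finding the exception line can be automated for long logs. The sketch below locates the last Python traceback in a log and returns the first non-indented line after it, which is where the exception type and message appear in standard traceback formatting. The sample log is made up and simplified; real task logs carry more metadata on each line.

```python
from typing import Optional

TRACEBACK_MARKER = "Traceback (most recent call last):"


def last_exception_line(log_text: str) -> Optional[str]:
    """Return the exception line that closes the last traceback, if any."""
    lines = log_text.splitlines()
    start = None
    for index, line in enumerate(lines):
        if TRACEBACK_MARKER in line:
            start = index  # keep the position of the final traceback
    if start is None:
        return None
    # Frame lines are indented; the exception line is the first one that is not.
    for line in lines[start + 1:]:
        if line and not line[0].isspace():
            return line
    return None


# Made-up, simplified log excerpt for illustration.
sample_log = """\
INFO - Starting attempt 1
Traceback (most recent call last):
  File "my_dag.py", line 12, in extract_data
    import pandas
ModuleNotFoundError: No module named 'pandas'
ERROR - Task failed
"""
print(last_exception_line(sample_log))
```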

On Astro

Astro deployments support environment promotion, which helps structure your testing workflow:

Dev deployment: Test DAGs freely with astro deploy --dags for fast iteration
Staging deployment: Run integration tests against production-like data
Production deployment: Deploy only after validation in lower environments
Use separate Astro deployments for each environment and promote code through them

Related Skills

authoring-dags: For creating new DAGs (includes validation before testing)
debugging-dags: For general Airflow troubleshooting
deploying-airflow: For deploying DAGs to production after testing

First Seen

Jan 23, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass

Installed on

opencode: 275
codex: 269
cursor: 268
github-copilot: 265
gemini-cli: 250
claude-code: 241
