content-parser: a URL content extraction tool that automatically parses web pages into structured data

content-parser by marswaveai/skills

403 weekly installs

28 GitHub Stars


Install command

npx skills add https://github.com/marswaveai/skills --skill content-parser

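Content creation, automation, data processing

🇨🇳 Chinese introduction

When to Use

User provides a URL and wants to extract/read its content
Another skill needs to parse source material from a URL before generation
User says "parse this URL", "extract content from this link"
User says "解析链接", "提取内容"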

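When NOT to Use

User already has text content and doesn't need URL parsing
User wants to generate audio/video content (not content extraction)
User wants to read a local file (use standard file reading tools)

Purpose

Extract and normalize content from URLs across supported platforms. Returns structured data including content body, metadata, and references. Useful as a preprocessing step for content generation skills or as a standalone content extraction tool.

Hard Constraints

No shell scripts. Construct curl commands from the API reference files listed in Resources
Always read shared/authentication.md for API key and headers
Follow shared/common-patterns.md for polling, errors, and interaction patterns
URL must be a valid HTTP(S) URL
Always read config following shared/config-pattern.md before any interaction
Never save files to ~/Downloads/ or .listenhub/; save to the current working directory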

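Step -1: API Key Check

Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.

Step 0: Config Setup

Follow shared/config-pattern.md Step 0 (Zero-Question Boot).

If the file doesn't exist, silently create it with defaults and proceed: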

mkdir -p ".listenhub/content-parser"
echo '{"autoDownload":true}' > ".listenhub/content-parser/config.json"
CONFIG_PATH=".listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Do NOT ask any setup questions. Proceed directly to the Interaction Flow.

If the file exists, read the config silently and proceed:

CONFIG_PATH=".listenhub/content-parser/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Setup Flow (user-initiated reconfigure only)

Only run when the user explicitly asks to reconfigure. Display current settings:

当前配置 (content-parser)：
  自动下载：{是 / 否}

autoDownload : "自动保存提取的内容到当前目录？"
- "是（推荐）" → autoDownload: true
- "否" → autoDownload: false

NEW_CONFIG=$(echo "$CONFIG" | jq --argjson dl {true/false} '. + {"autoDownload": $dl}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")

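Step 1: URL Input

Free text input. Ask the user:

What URL would you like to extract content from?

Step 2: Options (optional)

Ask if the user wants to configure extraction options:

Question: "Do you want to configure extraction options?"
Options:
  - "No, use defaults": extract with default settings
  - "Yes, configure options": set summarize, maxLength, or Twitter tweet count

If "Yes", ask follow-up questions:

Summarize : "Generate a summary of the content?" (Yes/No)
Max Length : "Set maximum content length?" (Free text, e.g., "5000")
Twitter count (only if URL is Twitter/X profile): "How many tweets to fetch?" (1-100, default 20)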

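Step 3: Confirm & Extract

Ready to extract content:

  URL: {url}
  Options: {summarize: true, maxLength: 5000, twitter.count: 50} / default

  Proceed?

Wait for explicit confirmation before calling the API.

Validate URL : Must be HTTP(S). Normalize if needed (see references/supported-platforms.md)
Build request body :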

{ "source": { "type": "url", "uri": "{url}" }, "options": { "summarize": true/false, "maxLength": 5000, "twitter": { "count": 50 } } }

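Omit options if the user chose defaults.

Submit (foreground) : POST /v1/content/extract → extract taskId
Tell the user extraction is in progress
Poll (background) : Run the following exact bash command with run_in_background: true and timeout: 300000. Note: the status field is .data.status (not processStatus), the interval is 5s, and the values are processing/completed/failed: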

TASK_ID="<id-from-step-3>"
for i in $(seq 1 60); do
  RESULT=$(curl -sS "https://api.marswave.ai/openapi/v1/content/extract/$TASK_ID" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY" \
    -H "X-Source: skills" 2>/dev/null)
  STATUS=$(echo "$RESULT" | tr -d '\000-\037\177' | jq -r '.data.status // "processing"')
  case "$STATUS" in
    completed) echo "$RESULT"; exit 0 ;;
    failed) echo "FAILED: $RESULT" >&2; exit 1 ;;
    *) sleep 5 ;;
  esac
done
echo "TIMEOUT" >&2; exit 2

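When notified, download and present result:

If autoDownload is true, generate a slug from the extracted title (falling back to the domain name if there is no title). Follow shared/config-pattern.md § Artifact Naming for slug generation and dedup.

 * Write `{slug}.md` to the **current directory**: full extracted content in Markdown
 * Write `{slug}.json` to the **current directory**: full raw API response data

SLUG="{title-slug}"  # e.g. "topology-wikipedia"
# Dedup: check if files exist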
BASE="$SLUG"; i=2
while [ -e "${SLUG}.md" ] || [ -e "${SLUG}.json" ]; do SLUG="${BASE}-${i}"; i=$((i+1)); done
echo "$CONTENT_MD" > "${SLUG}.md"
echo "$RESULT" > "${SLUG}.json"

    内容提取完成！

来源：{url}
标题：{metadata.title}
长度：~{character count} 字符
消耗积分：{credits}

已保存到当前目录：
  {slug}.md
  {slug}.json

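7. Show a preview of the extracted content (first ~500 chars)

Offer to use the content in another skill (e.g. /podcast, /tts)

Estimated time : 10-30 seconds depending on content size and platform.

API Reference

Content extract: shared/api-content-extract.md
Supported platforms: references/supported-platforms.md
Polling: shared/common-patterns.md § Async Polling
Error handling: shared/common-patterns.md § Error Handling
Config pattern: shared/config-pattern.md

Example

User : "Parse this article: https://en.wikipedia.org/wiki/Topology"

Agent workflow :

URL: https://en.wikipedia.org/wiki/Topology
Options: defaults (omit options)
Submit extraction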

curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://en.wikipedia.org/wiki/Topology"
    }
  }'

4. Poll until complete:

curl -sS "https://api.marswave.ai/openapi/v1/content/extract/69a7dac700cf95938f86d9bb" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "X-Source: skills"

5. Present extracted content preview and offer next actions.

User : "Extract recent tweets from @elonmusk, get 50 tweets"

Agent workflow :

URL: https://x.com/elonmusk
Options: {"twitter": {"count": 50}}
Submit extraction

curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://x.com/elonmusk"
    },
    "options": {
      "twitter": {
        "count": 50
      }
    }
  }'

4. Poll until complete, present results.

🇺🇸English

When to Use

User provides a URL and wants to extract/read its content
Another skill needs to parse source material from a URL before generation
User says "parse this URL", "extract content from this link"
User says "解析链接", "提取内容"

When NOT to Use

User already has text content and doesn't need URL parsing
User wants to generate audio/video content (not content extraction)
User wants to read a local file (use standard file reading tools)

Purpose

Extract and normalize content from URLs across supported platforms. Returns structured data including content body, metadata, and references. Useful as a preprocessing step for content generation skills or standalone content extraction.

Hard Constraints

No shell scripts. Construct curl commands from the API reference files listed in Resources
Always read shared/authentication.md for API key and headers
Follow shared/common-patterns.md for polling, errors, and interaction patterns
URL must be a valid HTTP(S) URL
Always read config following shared/config-pattern.md before any interaction
Never save files to ~/Downloads/ or .listenhub/ — save to the current working directory
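
The HTTP(S) constraint can be checked before any API call. The following is a minimal sketch, not the skill's own validation: `is_http_url` is a hypothetical helper name, and it only accepts a lowercase `http://` or `https://` scheme followed by a non-empty host.

```shell
# Hypothetical helper: succeed only for http:// or https:// URLs with a host.
# Assumption: a lowercase scheme is required; this is a sketch, not a full URL parser.
is_http_url() {
  case "$1" in
    *[[:space:]]*) return 1 ;;      # reject embedded whitespace
    http://?*|https://?*) ;;        # scheme plus at least one more character
    *) return 1 ;;
  esac
  _host=${1#*://}                   # drop the scheme
  _host=${_host%%[/?#]*}            # keep everything before the first /, ? or #
  [ -n "$_host" ]                   # an empty host means the URL is unusable
}
```

For example, `is_http_url "https://en.wikipedia.org/wiki/Topology"` succeeds, while `is_http_url "ftp://example.com/file"` fails.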

Step -1: API Key Check

Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.
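
The authoritative check lives in shared/config-pattern.md, which is not reproduced here. As a rough sketch, and assuming the key is held in the `LISTENHUB_API_KEY` environment variable used by the curl examples, a stop-if-missing guard might look like this (`require_api_key` is a hypothetical name):

```shell
# Hypothetical guard. Assumption: the key is read from LISTENHUB_API_KEY,
# the variable the curl examples in this document use.
require_api_key() {
  if [ -z "${LISTENHUB_API_KEY:-}" ]; then
    echo "LISTENHUB_API_KEY is not set; stopping." >&2
    return 1
  fi
}
```

A caller would run `require_api_key || exit 1` before any other step.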

Step 0: Config Setup

Follow shared/config-pattern.md Step 0 (Zero-Question Boot).

If file doesn't exist — silently create with defaults and proceed:

mkdir -p ".listenhub/content-parser"
echo '{"autoDownload":true}' > ".listenhub/content-parser/config.json"
CONFIG_PATH=".listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Do NOT ask any setup questions. Proceed directly to the Interaction Flow.

If file exists — read config silently and proceed:

CONFIG_PATH=".listenhub/content-parser/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")
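
Once `CONFIG` holds the JSON text, the `autoDownload` flag still has to be read out of it. The sketch below is one way to do that with jq; `read_auto_download` is a hypothetical helper, and defaulting to `true` when the key is absent is an assumption based on the default config shown above. Note that jq's `//` operator treats `false` like a missing value, so `.autoDownload // true` would always yield `true`; the sketch uses `has` instead.

```shell
# Hypothetical helper: print "true" or "false" for autoDownload.
# $1 = config JSON text. Assumption: a missing key means true (the default config).
read_auto_download() {
  printf '%s' "$1" \
    | jq -r 'if has("autoDownload") then .autoDownload else true end'
}
```

Usage: `AUTO_DOWNLOAD=$(read_auto_download "$CONFIG")`.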

Setup Flow (user-initiated reconfigure only)

Only run when the user explicitly asks to reconfigure. Display current settings:

当前配置 (content-parser)：
  自动下载：{是 / 否}

Then ask:

autoDownload : "自动保存提取的内容到当前目录？"
- "是（推荐）" → autoDownload: true
- "否" → autoDownload: false

Save immediately:

NEW_CONFIG=$(echo "$CONFIG" | jq --argjson dl {true/false} '. + {"autoDownload": $dl}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
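
In the save command, `{true/false}` is a placeholder for the user's answer. A possible concrete form is sketched below; `set_auto_download` is a hypothetical helper, and writing through a temporary file is a design choice of this sketch so that a failed jq run does not truncate the existing config.

```shell
# Hypothetical helper: $1 = config path, $2 = literal true or false.
set_auto_download() {
  case "$2" in
    true|false) ;;
    *) echo "expected true or false, got: $2" >&2; return 1 ;;
  esac
  _tmp="$1.tmp"
  # Merge the new value into the existing object, then replace the file.
  jq --argjson dl "$2" '. + {"autoDownload": $dl}' "$1" > "$_tmp" && mv "$_tmp" "$1"
}
```

Usage: `set_auto_download "$CONFIG_PATH" false`, then re-read the file into `CONFIG`.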

Interaction Flow

Step 1: URL Input

Free text input. Ask the user:

What URL would you like to extract content from?

Step 2: Options (optional)

Ask if the user wants to configure extraction options:

Question: "Do you want to configure extraction options?"
Options:
  - "No, use defaults" — Extract with default settings
  - "Yes, configure options" — Set summarize, maxLength, or Twitter tweet count

If "Yes", ask follow-up questions:

Summarize : "Generate a summary of the content?" (Yes/No)
Max Length : "Set maximum content length?" (Free text, e.g., "5000")
Twitter count (only if URL is Twitter/X profile): "How many tweets to fetch?" (1-100, default 20)
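
The Twitter count has a stated range (1-100) and default (20), so the user's free-text answer can be checked before it goes into the request. A minimal sketch, with `resolve_twitter_count` as a hypothetical helper name:

```shell
# Hypothetical helper: print the tweet count to request.
# An empty answer falls back to the documented default of 20.
resolve_twitter_count() {
  _count=${1:-20}
  case "$_count" in
    ''|*[!0-9]*) echo "count must be a whole number" >&2; return 1 ;;
  esac
  if [ "$_count" -lt 1 ] || [ "$_count" -gt 100 ]; then
    echo "count must be between 1 and 100" >&2
    return 1
  fi
  echo "$_count"
}
```

Whether the API itself rejects out-of-range values is not stated in this document; the check here only mirrors the range given in the prompt text.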

Step 3: Confirm & Extract

Summarize:

Ready to extract content:

  URL: {url}
  Options: {summarize: true, maxLength: 5000, twitter.count: 50} / default

  Proceed?

Wait for explicit confirmation before calling the API.

Workflow

Validate URL : Must be HTTP(S). Normalize if needed (see references/supported-platforms.md)
Build request body :

{ "source": { "type": "url", "uri": "{url}" }, "options": { "summarize": true/false, "maxLength": 5000, "twitter": { "count": 50 } } }

Omit options if user chose defaults.
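
The request body template above can be assembled with jq so that the URL is escaped correctly and `options` is dropped when the user chose defaults. This is a sketch under assumptions: `build_request_body` is a hypothetical helper, and the options are passed in as an already-formed JSON object.

```shell
# Hypothetical helper: $1 = URL, $2 = options as a JSON object (optional).
# With no second argument the "options" key is left out entirely.
build_request_body() {
  if [ -n "${2:-}" ]; then
    jq -cn --arg uri "$1" --argjson options "$2" \
      '{source: {type: "url", uri: $uri}, options: $options}'
  else
    jq -cn --arg uri "$1" '{source: {type: "url", uri: $uri}}'
  fi
}
```

For instance, `build_request_body "https://x.com/elonmusk" '{"twitter":{"count":50}}'` produces a compact body that can be passed to curl with `-d`.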

Submit (foreground) : POST /v1/content/extract → extract taskId
Tell the user extraction is in progress
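Poll (background) : Run the following exact bash command with run_in_background: true and timeout: 300000. Note: status field is .data.status (not processStatus), interval is 5s, values are processing/completed/failed:

TASK_ID="<id-from-step-3>"
for i in $(seq 1 60); do
  RESULT=$(curl -sS "https://api.marswave.ai/openapi/v1/content/extract/$TASK_ID" \
    -H "Authorization: Bearer $LISTENHUB_API_KEY" \
    -H "X-Source: skills" 2>/dev/null)
  STATUS=$(echo "$RESULT" | tr -d '\000-\037\177' | jq -r '.data.status // "processing"')
  case "$STATUS" in
    completed) echo "$RESULT"; exit 0 ;;
    failed) echo "FAILED: $RESULT" >&2; exit 1 ;;
    *) sleep 5 ;;
  esac
done
echo "TIMEOUT" >&2; exit 2

When notified, download and present result:
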
If autoDownload is true, generate a slug from the extracted title (falling back to domain name if no title). Follow shared/config-pattern.md § Artifact Naming for slug generation and dedup.

 * Write `{slug}.md` to the **current directory** — full extracted content in markdown
 * Write `{slug}.json` to the **current directory** — full raw API response data

SLUG="{title-slug}"  # e.g. "topology-wikipedia"
# Dedup: check if files exist
BASE="$SLUG"; i=2
while [ -e "${SLUG}.md" ] || [ -e "${SLUG}.json" ]; do SLUG="${BASE}-${i}"; i=$((i+1)); done
echo "$CONTENT_MD" > "${SLUG}.md"
echo "$RESULT" > "${SLUG}.json"
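
The snippet above takes `{title-slug}` as given. The actual naming rules are in shared/config-pattern.md § Artifact Naming and are not shown here, so the following is only an illustrative sketch of title-to-slug conversion with the domain fallback; `slugify` and `slug_for` are hypothetical helpers, and the lowercase-with-hyphens convention is an assumption inferred from the "topology-wikipedia" example.

```shell
# Hypothetical helper: lowercase the text and collapse every run of
# characters outside a-z0-9 into a single hyphen.
slugify() {
  printf '%s' "$1" \
    | LC_ALL=C tr 'A-Z' 'a-z' \
    | LC_ALL=C sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//'
}

# $1 = title (may be empty), $2 = source URL.
# Falls back to the URL's host when the title yields an empty slug,
# which also covers titles with no ASCII letters or digits.
slug_for() {
  _slug=$(slugify "$1")
  if [ -z "$_slug" ]; then
    _host=${2#*://}
    _host=${_host%%/*}
    _slug=$(slugify "$_host")
  fi
  echo "$_slug"
}
```

With these definitions, `slugify "Topology - Wikipedia"` gives `topology-wikipedia`, and an empty title for `https://en.wikipedia.org/wiki/Topology` gives `en-wikipedia-org`.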

Present:

    内容提取完成！

来源：{url}
标题：{metadata.title}
长度：~{character count} 字符
消耗积分：{credits}

已保存到当前目录：
  {slug}.md
  {slug}.json

7. Show a preview of the extracted content (first ~500 chars)

Offer to use content in another skill (e.g. /podcast, /tts)

Estimated time : 10-30 seconds depending on content size and platform.
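
The polling numbers relate to each other as follows: the loop makes at most 60 attempts (`seq 1 60`) with a 5-second sleep between them, which matches the `timeout: 300000` value in milliseconds. A short worked calculation:

```shell
attempts=60        # seq 1 60 in the polling loop
interval_s=5       # sleep 5 between attempts
budget_ms=$((attempts * interval_s * 1000))
echo "$budget_ms"  # prints 300000
```

Each curl request also takes some time, so in a slow case the tool timeout could fire before the loop reaches its own "TIMEOUT" branch. With the estimated 10-30 seconds, a typical extraction should finish within the first several attempts.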

API Reference

Content extract: shared/api-content-extract.md
Supported platforms: references/supported-platforms.md
Polling: shared/common-patterns.md § Async Polling
Error handling: shared/common-patterns.md § Error Handling
Config pattern: shared/config-pattern.md

Example

User : "Parse this article: https://en.wikipedia.org/wiki/Topology"

Agent workflow :

URL: https://en.wikipedia.org/wiki/Topology
Options: defaults (omit options)
Submit extraction

curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://en.wikipedia.org/wiki/Topology"
    }
  }'

4. Poll until complete:

curl -sS "https://api.marswave.ai/openapi/v1/content/extract/69a7dac700cf95938f86d9bb" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "X-Source: skills"

5. Present extracted content preview and offer next actions.

User : "Extract recent tweets from @elonmusk, get 50 tweets"

Agent workflow :

URL: https://x.com/elonmusk
Options: {"twitter": {"count": 50}}
Submit extraction

curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://x.com/elonmusk"
    },
    "options": {
      "twitter": {
        "count": 50
      }
    }
  }'

4. Poll until complete, present results.
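
In this example the user gives a handle ("@elonmusk") and the workflow uses a profile URL. The real normalization rules are in references/supported-platforms.md; the sketch below only illustrates the mapping seen in the example, with `handle_to_url` as a hypothetical helper and the allowed character set (letters, digits, underscore) as an assumption.

```shell
# Hypothetical helper: turn "@handle" or "handle" into an x.com profile URL.
handle_to_url() {
  _handle=${1#@}                       # strip one leading @ if present
  case "$_handle" in
    ''|*[!A-Za-z0-9_]*) echo "invalid handle: $1" >&2; return 1 ;;
  esac
  echo "https://x.com/$_handle"
}
```

Here `handle_to_url "@elonmusk"` prints `https://x.com/elonmusk`, the URL used in the request above.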

Weekly Installs: 371
Repository: marswaveai/skills
First Seen: 11 days ago
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: codex (367), gemini-cli (365), cursor (365), opencode (365), kimi-cli (364), amp (364)
