article-extractor by michalparkola/tapestry-skills-for-claude-code
npx skills add https://github.com/michalparkola/tapestry-skills-for-claude-code --skill article-extractor
This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter. Saves clean, readable text.
Activate when the user asks to extract, clean up, or save the readable content of a web article or blog post.
Check for article extraction tools in this order:
command -v reader
If not installed:
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
command -v trafilatura
If not installed:
pip3 install trafilatura
If no tools are available, fall back to basic curl + text extraction (less reliable, but it works without dependencies)
# Extract article
reader "URL" > article.txt
Pros:
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt
# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
Pros:
Options:
--no-comments: Skip comment sections
--no-tables: Skip data tables
--precision: Favor precision over recall
--recall: Extract more content (may include some noise)
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # depth inside script/style/nav/etc.
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.in_content = True
    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.content_tags:
            self.in_content = False
    def handle_data(self, data):
        # Capture text only inside content tags and outside skipped regions
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())
    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
Note: This is less reliable but works without dependencies.
Extract title for filename:
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
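The grep -oP form relies on GNU grep's PCRE support, which the BSD grep shipped with macOS lacks. A portable sketch using only Python's standard library; the inline sample HTML stands in for a page that would normally come from `curl -s "URL"`:

```shell
# Portable <title> extraction without GNU grep. The sample HTML keeps the
# snippet self-contained; replace it with the curl output in practice.
HTML='<html><head><title>My Post - Example Site</title></head></html>'
TITLE=$(printf '%s' "$HTML" | python3 -c "
import re, sys
m = re.search(r'<title>(.*?)</title>', sys.stdin.read(), re.S)
print(m.group(1).strip() if m else '')
")
TITLE=${TITLE%% - *}  # strip the trailing site name, as above
echo "$TITLE"
```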
Clean title for filesystem:
# Get title
TITLE="Article Title from Website"
# Clean for filesystem (remove special chars, limit length)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')
# Add extension
FILENAME="${FILENAME}.txt"
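The cleanup pipeline above can be wrapped in a small reusable function (the name sanitize_title is illustrative, not part of any tool):

```shell
# Illustrative helper: map /, :, | to '-', delete ?, ", <, >,
# cap the result at 100 characters, and trim trailing spaces.
sanitize_title() {
    printf '%s\n' "$1" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//'
}
sanitize_title 'How/To: Extract? "Articles" <Fast>'
```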
ARTICLE_URL="https://example.com/article"
# Check for tools
if command -v reader &> /dev/null; then
TOOL="reader"
echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
TOOL="trafilatura"
echo "Using trafilatura"
else
TOOL="fallback"
echo "Using fallback method (may be less accurate)"
fi
# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line of the markdown output, after the leading '# ')
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate separator)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # depth inside script/style/nav/etc.
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3'}
    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.in_content = True
    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.content_tags:
            self.in_content = False
    def handle_data(self, data):
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())
    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac
# Clean filename
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"
# Move to final filename
mv temp_article.txt "$FILENAME"
# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
1. Tool not installed
2. Paywall or login required
3. Invalid URL
4. No content extracted
5. Special characters in title
Replace /, :, ?, ", <, >, | with - or remove them.
1. Use reader for most articles
2. Use trafilatura for:
3. Fallback method limitations:
4. Check extraction quality:
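The "no content extracted" case above can be guarded with a minimal size check (the 200-byte threshold is an arbitrary assumption, and the sample file stands in for real extractor output):

```shell
# Stand-in for a failed extraction; in practice temp_article.txt comes from
# reader, trafilatura, or the curl fallback.
printf 'too short' > temp_article.txt
STATUS=ok
# Hypothetical guard: treat under 200 bytes of output as a failed extraction.
if [ "$(wc -c < temp_article.txt)" -lt 200 ]; then
    STATUS=failed
    echo "Warning: little or no content extracted (paywall, login, or JS-rendered page?)"
fi
```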
Simple extraction:
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
With error handling:
if ! reader "$URL" > temp.txt 2>/dev/null; then
if command -v trafilatura &> /dev/null; then
trafilatura --URL "$URL" --output-format txt > temp.txt
else
echo "Error: Could not extract article. Install reader or trafilatura."
exit 1
fi
fi
Display to user:
Ask if needed:
Weekly Installs: 218
GitHub Stars: 294
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub: Warn; Socket: Pass; Snyk: Warn
Installed on: opencode (190), gemini-cli (181), codex (177), cursor (172), github-copilot (165), amp (149)