Python web scraping tool: quickly extract website data with BeautifulSoup and requests, for competitor research and price monitoring

web-scraper by guia-matthieu/clawfu-skills

189 weekly installs

62 GitHub Stars

Install command

npx skills add https://github.com/guia-matthieu/clawfu-skills --skill web-scraper

Automation, data analysis, data processing

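🇨🇳 Chinese Introduction

Web Scraper

Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.

When to Use This Skill

Competitor research - Scrape pricing, features, positioning
Lead generation - Extract contact info from directories
Content audit - Pull headings, links, metadata
Price monitoring - Track competitor pricing changes
Data collection - Gather research data from multiple sources

What Claude Does vs What You Decide

Claude Does	You Decide
Structures analysis frameworks	Strategic priorities
Synthesizes market data	Competitive positioning
Identifies opportunities	Resource allocation
Creates strategic options	Final strategy selection
Suggests implementation approaches	Execution decisions

Dependencies

pip install beautifulsoup4 requests pandas click lxml

Commands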

Extract Structured Data

python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product

Example 1: Scrape Competitor Pricing

python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"

# Output:
# Extracted 6 elements
# 1. Starter - $29/mo
# 2. Pro - $99/mo
# 3. Enterprise - Contact us

Example 2: Extract Article Content

python scripts/main.py structured https://blog.example.com/post --schema article

# Output: article_data.json
# {
#   "title": "How to Scale Your Startup",
#   "author": "Jane Doe",
#   "date": "2024-01-15",
#   "content": "...",
#   "word_count": 1523
# }

CSS Selector Reference

Selector	Description	Example
`tag`	Element type	`h1`, `p`, `div`
`.class`	Class name	`.price`, `.title`
`#id`	Element ID	`#main-content`
`tag.class`	Tag with class	`div.product`
`tag[attr]`	Has attribute	`a[href]`
`parent > child`	Direct child	`ul > li`
`tag1, tag2`	Multiple selectors	`h1, h2, h3`

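Ethical Scraping Guidelines

Check robots.txt - Respect the site's crawling policy
Rate limit - Don't overload servers (1-2 req/sec)
Identify yourself - Use a descriptive User-Agent
Cache requests - Don't re-scrape unchanged pages
Terms of Service - Check if scraping is allowed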

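What This Skill Does Well

Structuring strategic analysis
Identifying market opportunities
Creating strategic frameworks
Synthesizing competitive data

What This Skill Cannot Do

Replace market research
Guarantee strategic success
Know proprietary competitor info
Make executive decisions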

Related Skills

competitor-monitor - Monitor competitor changes
pdf-extractor - Extract from PDFs

Skill Metadata

Mode : centaur

category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week

🇺🇸English

Web Scraper

Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.

When to Use This Skill

Competitor research - Scrape pricing, features, positioning
Lead generation - Extract contact info from directories
Content audit - Pull headings, links, metadata
Price monitoring - Track competitor pricing changes
Data collection - Gather research data from multiple sources

What Claude Does vs What You Decide

Claude Does	You Decide
Structures analysis frameworks	Strategic priorities
Synthesizes market data	Competitive positioning
Identifies opportunities	Resource allocation
Creates strategic options	Final strategy selection
Suggests implementation approaches	Execution decisions

Dependencies

pip install beautifulsoup4 requests pandas click lxml

Commands

Scrape Elements

python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
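The source of scripts/main.py is not shown on this page, so the following is only a minimal sketch of what a selector-based scrape could look like with BeautifulSoup. The function name extract_elements and the sample HTML are hypothetical; the sketch works on an HTML string so it can be tried without network access.

```python
from bs4 import BeautifulSoup


def extract_elements(html: str, selector: str) -> list[str]:
    """Return the stripped text of every element matching a CSS selector.

    Hypothetical helper, not the skill's actual implementation.
    """
    soup = BeautifulSoup(html, "html.parser")
    # select() accepts comma-separated selector groups and returns
    # the matching elements in document order.
    return [el.get_text(strip=True) for el in soup.select(selector)]


# Made-up markup standing in for a fetched pricing page.
SAMPLE_HTML = """
<div class="plan"><span class="plan-name">Starter</span><span class="price">$29/mo</span></div>
<div class="plan"><span class="plan-name">Pro</span><span class="price">$99/mo</span></div>
"""

if __name__ == "__main__":
    matches = extract_elements(SAMPLE_HTML, ".price,.plan-name")
    print(f"Extracted {len(matches)} elements")
    for i, text in enumerate(matches, start=1):
        print(f"{i}. {text}")
```

For a live page, the HTML string would typically come from requests.get(url).text before being passed to the helper.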

Extract Links

python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
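How the links command filters results is not documented here. One plausible reading of --internal-only is "keep only links whose host matches the page's host". The sketch below illustrates that reading; extract_links and its parameters are hypothetical names.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str, internal_only: bool = False) -> list[str]:
    """Collect unique absolute http(s) links from a page, in document order.

    Assumption: "internal" means the link's host equals the base URL's host.
    """
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links: list[str] = []
    for anchor in soup.select("a[href]"):
        url = urljoin(base_url, anchor["href"])  # resolve relative hrefs
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        if internal_only and parsed.netloc != base_host:
            continue
        if url not in links:
            links.append(url)
    return links


# Made-up markup for illustration.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://other.example.org/x">Partner</a>
<a href="mailto:hi@example.com">Mail</a>
<a href="/about">About again</a>
"""
```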

Extract Emails

python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
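Email extraction is commonly done with a regular expression over the page source. The sketch below assumes that approach; the pattern is deliberately simple and will miss some valid addresses. The name extract_emails is hypothetical, and following links to the given --depth is not shown.

```python
import re

# A pragmatic pattern, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for match in EMAIL_RE.findall(html):
        seen.setdefault(match.lower(), None)
    return list(seen)


# Made-up markup for illustration.
SAMPLE_HTML = (
    '<a href="mailto:Sales@Example.com">sales@example.com</a>'
    "<p>Questions? Write to support@example.com.</p>"
)
```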

Extract Structured Data

python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product

Examples

Example 1: Scrape Competitor Pricing

python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"

# Output:
# Extracted 6 elements
# 1. Starter - $29/mo
# 2. Pro - $99/mo
# 3. Enterprise - Contact us

Example 2: Extract Article Content

python scripts/main.py structured https://blog.example.com/post --schema article

# Output: article_data.json
# {
#   "title": "How to Scale Your Startup",
#   "author": "Jane Doe",
#   "date": "2024-01-15",
#   "content": "...",
#   "word_count": 1523
# }
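The fields in the sample output suggest a mapping from page elements to keys. The skill's actual article schema is not shown, so the selectors below (h1, meta[name="author"], time[datetime], article p) are assumptions chosen for illustration, and extract_article is a hypothetical name.

```python
from bs4 import BeautifulSoup


def extract_article(html: str) -> dict:
    """Build an article record from common HTML conventions (assumed, not the skill's schema)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    author = soup.select_one('meta[name="author"]')
    date = soup.select_one("time[datetime]")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("article p")]
    content = "\n\n".join(paragraphs)
    return {
        "title": title.get_text(strip=True) if title else None,
        "author": author.get("content") if author else None,
        "date": date.get("datetime") if date else None,
        "content": content,
        "word_count": len(content.split()),
    }


# Made-up markup for illustration.
SAMPLE_HTML = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article>
<h1>How to Scale Your Startup</h1>
<time datetime="2024-01-15">Jan 15, 2024</time>
<p>First paragraph here.</p>
<p>Second one.</p>
</article></body></html>
"""
```

Missing elements yield None rather than raising, which keeps a batch run going when one page deviates from the expected layout.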

CSS Selector Reference

Selector	Description	Example
`tag`	Element type	`h1`, `p`, `div`
`.class`	Class name	`.price`, `.title`
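`#id`	Element ID	`#main-content`
`tag.class`	Tag with class	`div.product`
`tag[attr]`	Has attribute	`a[href]`
`parent > child`	Direct child	`ul > li`
`tag1, tag2`	Multiple selectors	`h1, h2, h3`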
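Common CSS selector forms can be tried directly against a small document with BeautifulSoup's select(). The markup and the texts helper below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup covering each selector form.
HTML = """
<div id="main-content">
  <h1 class="title">Plans</h1>
  <ul>
    <li><a href="/starter">Starter</a></li>
    <li><a>Pro</a></li>
  </ul>
  <div class="product"><span class="price">$29</span></div>
  <p class="product">Not a div</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector: str) -> list[str]:
    """Stripped text of each element matching the selector, in document order."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


if __name__ == "__main__":
    for selector in ["h1", ".price", "#main-content", "div.product",
                     "a[href]", "ul > li", "h1, p"]:
        print(f"{selector!r}: {len(soup.select(selector))} match(es)")
```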

Ethical Scraping Guidelines

Check robots.txt - Respect site's scraping policy
Rate limit - Don't overload servers (1-2 req/sec)
Identify yourself - Use descriptive User-Agent
Cache requests - Don't re-scrape unchanged pages
Terms of Service - Check if scraping is allowed
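The first three guidelines can be sketched with the standard library alone. This is a minimal illustration, not the skill's implementation. The USER_AGENT string, the Throttle class, and the sample robots.txt are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity string; use a real contact for your own crawler.
USER_AGENT = "example-research-bot/0.1 (+mailto:you@example.com)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between requests (1.0 s is about 1 req/sec)."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last: float | None = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                time.sleep(delay)
        self._last = time.monotonic()
        return delay


# Made-up robots.txt for illustration.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

A crawl loop would call can_fetch(USER_AGENT, url) on the parser before each request, call wait() on the throttle, and send USER_AGENT in the request's User-Agent header.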

Skill Boundaries

What This Skill Does Well

Structuring strategic analysis
Identifying market opportunities
Creating strategic frameworks
Synthesizing competitive data

What This Skill Cannot Do

Replace market research
Guarantee strategic success
Know proprietary competitor info
Make executive decisions

Related Skills

competitor-monitor - Monitor competitor changes
pdf-extractor - Extract from PDFs

Skill Metadata

Mode : centaur

category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week

Weekly Installs

103

Repository

guia-matthieu/c…u-skills

GitHub Stars

First Seen

Feb 13, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn

Installed on

gemini-cli: 102
opencode: 102
codex: 101
github-copilot: 100
cursor: 100
kimi-cli: 99
