scrapy-web-scraping by mindrally/skills
npx skills add https://github.com/mindrally/skills --skill scrapy-web-scraping
You are an expert in Scrapy, Python web scraping, spider development, and building scalable crawlers for extracting data from websites.
```
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            myspider.py
```
- Set allowed_domains to prevent crawling outside scope
- Use start_requests() for custom starting logic
- Keep parse() methods with clear, single responsibilities
- Use ItemLoader for consistent data extraction
- Prefer CSS selectors for readability when possible
- Use XPath for complex selections (parent traversal, text normalization)
- Always extract data into defined Item classes
- Handle missing data gracefully with default values
- Use ::text and ::attr() pseudo-elements in CSS selectors
```python
from scrapy.loader import ItemLoader
from myproject.items import ProductItem

def parse_product(self, response):
    loader = ItemLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1.product-title::text')
    loader.add_css('price', 'span.price::text')
    loader.add_xpath('description', '//div[@class="desc"]/text()')
    yield loader.load_item()
```
- Set DOWNLOAD_DELAY appropriately (1-3 seconds minimum)
- Use AUTOTHROTTLE for dynamic rate adjustment
- Set CONCURRENT_REQUESTS_PER_DOMAIN to limit parallel requests
- Use scrapy-fake-useragent for realistic User-Agent rotation
- Validate data completeness and format in pipelines
- Implement deduplication logic
- Clean and normalize extracted data
- Store data in appropriate formats (JSON, CSV, databases)
- Use async pipelines for database operations
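The cleaning step above can be as simple as a shared helper called from a pipeline's process_item. A stdlib-only sketch; normalize_text is an illustrative name, not a Scrapy API:

```python
import re
import unicodedata

def normalize_text(raw):
    """Collapse whitespace and normalize Unicode forms in a scraped string."""
    if raw is None:
        return None
    # NFKC folds compatibility characters (e.g. full-width digits) to canonical forms
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of spaces, tabs, and newlines left over from HTML extraction
    return re.sub(r"\s+", " ", text).strip()
```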
```python
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem("Missing name field")
        return item
```
- Use errback handlers for request failures
- Enable HTTPCACHE_ENABLED to avoid redundant requests
- Profile memory usage with scrapy.extensions.memusage

```python
# Recommended production settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
LOG_LEVEL = 'INFO'
```
Use scrapy.contracts for spider contract tests.

Weekly Installs: 517
Repository: https://github.com/mindrally/skills
GitHub Stars: 43
First Seen: Jan 25, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: opencode (428), gemini-cli (416), codex (410), cursor (403), github-copilot (389), kimi-cli (367)
agent-browser browser automation tool - Vercel Labs command-line web interaction and testing
138,300 weekly installs
AI storyboard creation guide: rapidly generate film shot breakdowns and shot scripts with the inference.sh CLI
7,500 weekly installs
SEO content brief tool - a guide to data-driven content strategy and SERP analysis
7,600 weekly installs
AI product photography guide: generate professional e-commerce product images with the inference.sh CLI
7,500 weekly installs
AI news briefing curation tool - automate the creation of high-quality industry briefings with the inference.sh CLI
7,600 weekly installs
Interactive coding assistant | REPL-based system exploration and modification tool | GitHub Copilot enhancement plugin
7,500 weekly installs
VS Code extension localization tool - add multi-language support quickly (vscode-ext-localization)
7,600 weekly installs