robots-txt by kostja94/marketing-skills
npx skills add https://github.com/kostja94/marketing-skills --skill robots-txt
Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use, or when the user asks to skip, go directly to the main output.
Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.
Identify: the site URL (e.g., https://example.com).
| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |
| Tool | Controls | Prevents indexing? |
|---|---|---|
| robots.txt | Crawl (path-level) | No—blocked URLs may still appear in SERP |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No—does not control indexing |
| Use | Tool | Example |
|---|---|---|
| Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex—crawlers must access the page to read the directive |
Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
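A minimal robots.txt sketch of this split, using the example paths above (example.com and the paths are placeholders; adapt to your own site):

```txt
# Block paths that never need crawling
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

# Do NOT add /login/, /signup/, or /thank-you/ here.
# Mark those pages instead with <meta name="robots" content="noindex">
# or an X-Robots-Tag: noindex response header, and leave them crawlable.
```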
| Item | Requirement |
|---|---|
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
| Directive | Purpose | Example |
|---|---|---|
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block path prefix | Disallow: /admin/ |
| Allow: | Allow path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query params (Yandex) | See below |
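Putting the directives together, a hypothetical example (the paths and sitemap URL are placeholders; `#` comments are valid per RFC 9309):

```txt
User-agent: *
Disallow: /admin/
# Allow can carve an exception out of a broader Disallow
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
```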
| Do not block | Reason |
|---|---|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| /_next/ (Next.js) | Breaks CSS/JS loading; static assets showing as "Crawled - not indexed" in GSC is expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |
Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.
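The rules above can be sanity-checked locally with Python's standard-library robots.txt parser. A sketch using the example paths from this guide (note that urllib.robotparser predates RFC 9309 and resolves Allow/Disallow conflicts by first match rather than longest match, so verify edge cases against your search engine's own tooling):

```python
from urllib import robotparser

# Hypothetical rules mirroring the recommendations above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse in-memory lines instead of fetching

# Check representative URLs: blocked paths vs. normal pages
checks = {
    "/admin/users": rp.can_fetch("mybot", "https://example.com/admin/users"),
    "/pricing": rp.can_fetch("mybot", "https://example.com/pricing"),
}
for path, allowed in checks.items():
    print(f"{path}: {'crawlable' if allowed else 'blocked'}")
```

Running this kind of check in CI catches accidental over-blocking (for example, a new Disallow rule that also matches a marketing page) before it ships.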
robots.txt is effective for all measured AI crawlers (Vercel/MERJ study, 2024). Set rules per user-agent; check each vendor's docs for current tokens.
| User-agent | Purpose | Typical |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl (LLM training) | Disallow |
| Bytespider | ByteDance | Disallow |
| Meta-ExternalAgent | Meta | Disallow |
| AppleBot | Apple (Siri, Spotlight); renders JS | Allow for indexing |
Allow vs Disallow : Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See site-crawlability for AI crawler optimization (SSR, URL management).
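A sketch of per-agent rules following the table above (user-agent tokens change over time; confirm against each vendor's current documentation before deploying):

```txt
# AI search crawlers: allow (they can cite and link to you)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Training-only crawlers: block if you don't want content used for training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```

Grouping several User-agent lines over one rule set is valid per RFC 9309 and keeps the file easy to audit.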
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
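Clean-param (the line above) is Yandex-specific; other engines ignore it. A sketch of how it might sit in a Yandex-targeted group:

```txt
User-agent: Yandex
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```

This tells Yandex to treat URLs that differ only in these tracking parameters as one URL, reducing duplicate-content crawling.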
Weekly Installs: 219
GitHub Stars: 239
First Seen: Mar 1, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (200), kimi-cli (200), cursor (200), gemini-cli (199), opencode (199), github-copilot (199)