robots-txt by kostja94/marketing-skills
npx skills add https://github.com/kostja94/marketing-skills --skill robots-txt
Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use, or when the user asks to skip, go directly to the main output.
Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.
Identify: the site URL (e.g., https://example.com).
| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |
| Tool | Controls | Prevents indexing? |
|---|---|---|
| robots.txt | Crawl (path-level) | No—blocked URLs may still appear in SERP |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No—does not control indexing |
| Use | Tool | Example |
|---|---|---|
| Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex—crawlers must access the page to read the directive |
Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
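A minimal robots.txt sketch of this split, using the example paths above (example.com and the paths are placeholders; adapt to your own site):

```txt
# Block paths that never need crawling
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

# Do NOT add /login/, /signup/, or /thank-you/ here.
# Mark those pages instead with <meta name="robots" content="noindex">
# or an X-Robots-Tag: noindex response header, and leave them crawlable.
```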
| Item | Requirement |
|---|---|
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
| Directive | Purpose | Example |
|---|---|---|
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block path prefix | Disallow: /admin/ |
| Allow: | Allow path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query params (Yandex) | See below |
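Putting the directives together, a hypothetical example (the paths and sitemap URL are placeholders; `#` comments are valid per RFC 9309):

```txt
User-agent: *
Disallow: /admin/
# Allow can carve an exception out of a broader Disallow
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
```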
| Do not block | Reason |
|---|---|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| /_next/ (Next.js) | Breaks CSS/JS loading; static assets showing as "Crawled - not indexed" in GSC is expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |
Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.
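The rules above can be sanity-checked locally with Python's standard-library robots.txt parser. A sketch using the example paths from this guide (note that urllib.robotparser predates RFC 9309 and resolves Allow/Disallow conflicts by first match rather than longest match, so verify edge cases against your search engine's own tooling):

```python
from urllib import robotparser

# Hypothetical rules mirroring the recommendations above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse in-memory lines instead of fetching

# Check representative URLs: blocked paths vs. normal pages
checks = {
    "/admin/users": rp.can_fetch("mybot", "https://example.com/admin/users"),
    "/pricing": rp.can_fetch("mybot", "https://example.com/pricing"),
}
for path, allowed in checks.items():
    print(f"{path}: {'crawlable' if allowed else 'blocked'}")
```

Running this kind of check in CI catches accidental over-blocking (for example, a new Disallow rule that also matches a marketing page) before it ships.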
robots.txt is effective for all measured AI crawlers (Vercel/MERJ study, 2024). Set rules per user-agent; check each vendor's docs for current tokens.
| User-agent | Purpose | Typical |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl (LLM training) | Disallow |
| Bytespider | ByteDance | Disallow |
| Meta-ExternalAgent | Meta | Disallow |
| AppleBot | Apple (Siri, Spotlight); renders JS | Allow for indexing |
Allow vs Disallow : Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See site-crawlability for AI crawler optimization (SSR, URL management).
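A sketch of per-agent rules following the table above (user-agent tokens change over time; confirm against each vendor's current documentation before deploying):

```txt
# AI search crawlers: allow (they can cite and link to you)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Training-only crawlers: block if you don't want content used for training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```

Grouping several User-agent lines over one rule set is valid per RFC 9309 and keeps the file easy to audit.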
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
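Clean-param (the line above) is Yandex-specific; other engines ignore it. A sketch of how it might sit in a Yandex-targeted group:

```txt
User-agent: Yandex
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```

This tells Yandex to treat URLs that differ only in these tracking parameters as one URL, reducing duplicate-content crawling.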
Weekly Installs: 219
GitHub Stars: 239
First Seen: Mar 1, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (200), kimi-cli (200), cursor (200), gemini-cli (199), opencode (199), github-copilot (199)