geo-crawlers by zubair-trabzada/geo-seo-claude
`npx skills add https://github.com/zubair-trabzada/geo-seo-claude --skill geo-crawlers`

This skill analyzes a website's accessibility to AI crawlers: the bots that AI companies use to discover, index, and train on web content. If AI crawlers are blocked, the site's content cannot appear in AI-generated responses regardless of its quality. Crawler access is the foundational technical requirement for GEO.
As of early 2026, many websites inadvertently block AI crawlers through overly aggressive robots.txt rules inherited from legacy SEO configurations. A 2025 Originality.ai study found that over 35% of the top 1,000 websites block at least one major AI crawler, and 5-10% block all of them. Blocking AI crawlers is the single fastest way to become invisible in AI-generated search results.
**Tier 1.** These crawlers power the AI search products where users actively look for answers. Blocking them directly reduces your visibility in AI-generated responses.

- GPTBot: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)`
- OAI-SearchBot: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://docs.openai.com/bots/overview)`
- ChatGPT-User: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)`
- ClaudeBot: `ClaudeBot/1.0; +https://www.anthropic.com/claude-bot`
- PerplexityBot: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`

**Tier 2.** These crawlers serve large AI platforms or search ecosystems. Allowing them increases your content's reach.

- Google-Extended
- GoogleOther
- Applebot-Extended
- Amazonbot: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)`
- FacebookBot

**Tier 3.** These crawlers are primarily used for AI model training rather than live search features. Blocking them does not affect AI search visibility.
- CCBot: `CCBot/2.0 (https://commoncrawl.org/faq/)`
- anthropic-ai
- Bytespider
- cohere-ai

| Crawler | Tier | Recommendation | Reason |
|---|---|---|---|
| GPTBot | 1 | ALLOW | Powers ChatGPT Search (300M+ users) |
| OAI-SearchBot | 1 | ALLOW | Search-only, no training use |
| ChatGPT-User | 1 | ALLOW | User-initiated browsing |
| ClaudeBot | 1 | ALLOW | Claude web search and analysis |
| PerplexityBot | 1 | ALLOW | AI search with the best referral traffic |
| Google-Extended | 2 | ALLOW | Gemini features; no search rank impact |
| GoogleOther | 2 | ALLOW | Google AI research |
| Applebot-Extended | 2 | ALLOW | Apple Intelligence (2B+ devices) |
| Amazonbot | 2 | ALLOW | Alexa and Amazon AI |
| FacebookBot | 2 | ALLOW | Meta AI (3B+ app users) |
| CCBot | 3 | Context-dependent | Training data only |
| anthropic-ai | 3 | Context-dependent | Training data only |
| Bytespider | 3 | BLOCK | Aggressive crawler, low benefit |
| cohere-ai | 3 | Context-dependent | Training data only |
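The recommendations above can be checked mechanically against a site's robots.txt. A minimal sketch using Python's standard-library `urllib.robotparser` (the sample robots.txt and the test URL are illustrative, not from the skill itself):

```python
from urllib.robotparser import RobotFileParser

# Tier 1 crawlers from the table above
TIER_1 = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

def tier1_access(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each Tier 1 crawler name to whether robots_txt lets it fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in TIER_1}

# Example: a legacy robots.txt that singles out GPTBot
sample = "User-agent: GPTBot\nDisallow: /\n"
print(tier1_access(sample))
```

Crawlers not mentioned in the file (and with no `User-agent: *` group) default to allowed, which is why a report should distinguish "Allowed" from "Not Mentioned" as the template below does.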
For sites that want maximum AI search visibility:
```
# AI Crawlers - ALLOWED for AI search visibility
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: FacebookBot
Allow: /

# AI Crawlers - BLOCKED (aggressive/low value)
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```
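A policy file like the one above can be sanity-checked offline before deployment. A sketch, again using Python's stdlib `urllib.robotparser`; the abbreviated `ROBOTS_TXT` and the audit URL are placeholders for your real file:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the recommended policy (placeholder)
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
"""

def audit(robots_txt: str, expectations: dict) -> list:
    """Return the crawlers whose parsed access differs from the intended policy."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [bot for bot, should_allow in expectations.items()
            if rp.can_fetch(bot, "https://example.com/") != should_allow]

mismatches = audit(ROBOTS_TXT, {"GPTBot": True, "ClaudeBot": True,
                                "Bytespider": False, "CCBot": False})
print(mismatches)  # an empty list means the file matches intent
```

Running this in CI catches the common failure mode described earlier: a legacy wildcard or copy-paste error silently blocking a Tier 1 crawler.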
The skill checks the following:

- `[domain]/robots.txt`: rules for each AI user agent above, plus any wildcard (`User-agent: *`) block that would apply.
- `Crawl-delay` directives that may slow AI crawler access.
- `Sitemap` directives (AI crawlers use these for discovery).
- Meta robots tags:
  - `<meta name="robots" content="noindex">`: blocks all bots
  - `<meta name="robots" content="nofollow">`: prevents link following
  - `<meta name="robots" content="noai">`: emerging tag to block AI use
  - `<meta name="robots" content="noimageai">`: blocks AI image training
  - `<meta name="GPTBot" content="noindex">`: bot-specific directive
- `X-Robots-Tag` HTTP headers:
  - `X-Robots-Tag: noindex`: HTTP header equivalent of meta noindex
  - `X-Robots-Tag: noai`: blocks AI use
  - `X-Robots-Tag: noimageai`: blocks AI image training
  - `X-Robots-Tag: GPTBot: noindex`: bot-specific header
- AI-specific files:
  - `/llms.txt` (emerging standard for AI crawler guidance)
  - `/.well-known/ai-plugin.json` (OpenAI plugin manifest)
  - `/ai.txt` (proposed standard, similar to ads.txt for AI)

Generate a file called `GEO-CRAWLER-ACCESS.md`:
# AI Crawler Access Report: [Domain]
**Analysis Date:** [Date]
**Domain:** [Domain]
**robots.txt Status:** [Found/Not Found/Error]
---
## Crawler Access Summary
| Crawler | Operator | Tier | Status | Impact |
|---|---|---|---|---|
| GPTBot | OpenAI | 1 | [Allowed/Blocked/Not Mentioned] | [Impact description] |
| OAI-SearchBot | OpenAI | 1 | [Status] | [Impact] |
| ChatGPT-User | OpenAI | 1 | [Status] | [Impact] |
| ClaudeBot | Anthropic | 1 | [Status] | [Impact] |
| PerplexityBot | Perplexity | 1 | [Status] | [Impact] |
| Google-Extended | Google | 2 | [Status] | [Impact] |
| GoogleOther | Google | 2 | [Status] | [Impact] |
| Applebot-Extended | Apple | 2 | [Status] | [Impact] |
| Amazonbot | Amazon | 2 | [Status] | [Impact] |
| FacebookBot | Meta | 2 | [Status] | [Impact] |
| CCBot | Common Crawl | 3 | [Status] | [Impact] |
| anthropic-ai | Anthropic | 3 | [Status] | [Impact] |
| Bytespider | ByteDance | 3 | [Status] | [Impact] |
| cohere-ai | Cohere | 3 | [Status] | [Impact] |
## AI Visibility Score: [X]/100
**Tier 1 Access:** [X/5 crawlers allowed]
**Tier 2 Access:** [X/5 crawlers allowed]
**Tier 3 Access:** [X/4 crawlers allowed]
---
## Critical Issues
[List any Tier 1 crawlers that are blocked]
## Recommendations
### Immediate Actions
[Specific robots.txt changes needed]
### robots.txt Recommendation
[Complete recommended robots.txt content for AI crawlers]
### Additional Technical Findings
- **Meta Robots Tags:** [Findings]
- **X-Robots-Tag Headers:** [Findings]
- **JavaScript Rendering:** [Assessment]
- **llms.txt:** [Present/Absent]
- **Sitemap Accessibility:** [Assessment]

The AI Crawler Access Score is calculated as:

| Component | Weight | Scoring |
|---|---|---|
| Tier 1 Crawlers Allowed | 50% | 20 points per Tier 1 crawler allowed (5 crawlers = 100 points max, scaled to 50) |
| Tier 2 Crawlers Allowed | 25% | 20 points per Tier 2 crawler allowed (5 crawlers = 100 points max, scaled to 25) |
| No Blanket AI Blocks | 15% | Full points if no `User-agent: * Disallow: /` and no `noai` meta tags |
| AI-Specific Files Present | 10% | 5 points for llms.txt, 5 points for a sitemap accessible to AI crawlers |

Final score = sum of all weighted components, capped at 100.
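The weighting above reduces to simple arithmetic: after scaling, each Tier 1 crawler is worth 10 final points and each Tier 2 crawler 5. A minimal sketch (function and parameter names are illustrative, not part of the skill):

```python
def crawler_access_score(tier1_allowed: int, tier2_allowed: int,
                         no_blanket_block: bool, has_llms_txt: bool,
                         sitemap_accessible: bool) -> int:
    """AI Crawler Access Score per the weighting table above (max 100)."""
    score = 0
    score += min(tier1_allowed, 5) * 10  # 50% weight: 5 crawlers x 10 points
    score += min(tier2_allowed, 5) * 5   # 25% weight: 5 crawlers x 5 points
    score += 15 if no_blanket_block else 0   # no blanket AI blocks
    score += 5 if has_llms_txt else 0        # AI-specific files...
    score += 5 if sitemap_accessible else 0  # ...and sitemap access
    return min(score, 100)

# All Tier 1 and Tier 2 crawlers allowed, no blanket blocks, both files present:
print(crawler_access_score(5, 5, True, True, True))  # 100
```

Note that Tier 3 crawlers do not contribute to the score, consistent with their "context-dependent" recommendation.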
**Weekly Installs:** 72
**Repository:** https://github.com/zubair-trabzada/geo-seo-claude
**GitHub Stars:** 3.9K
**First Seen:** Feb 27, 2026
**Security Audits:** Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
**Installed on:** opencode (71), codex (71), cline (69), gemini-cli (69), cursor (69), github-copilot (69)