Site Crawlability Optimization Guide: robots.txt, Site Structure, Internal Linking, and AI Crawler Optimization

site-crawlability by kostja94/marketing-skills

207 weekly installs

239 GitHub Stars

Install command

npx skills add https://github.com/kostja94/marketing-skills --skill site-crawlability

DevOps, SEO, Website Optimization

SEO Technical: Crawlability

Guides crawlability improvements: robots, X-Robots-Tag, site structure, and internal linking.

When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

Redirect chains & loops: Fix multi-hop redirects; point directly to final URL
Broken links (4xx): Fix broken internal/external links; 301 or remove
Site architecture: Logical hierarchy; pages within 3–4 clicks from homepage
Orphan pages: Add internal links to pages with no incoming links
Pagination: Prefer pagination over infinite scroll for crawlability
Crawl budget: Reduce waste on duplicates, redirects, low-value URLs (see below)
AI crawler optimization: SSR for critical content; URL management; reduce 404/redirect waste (see below)

Initial Assessment

Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site structure.

Identify:

Site structure: Flat vs. deep hierarchy
Framework: Next.js, static, SPA, etc.
Key paths: Sitemap, robots.txt, API, static assets
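
As part of identifying key paths, it can help to confirm which of them each crawler is allowed to fetch. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt body, the `example.com` host, and the `allowed_paths` helper are illustrative assumptions, not part of this skill.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice, read the site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/

User-agent: GPTBot
Disallow: /
"""

# Hypothetical key URLs to check.
KEY_URLS = [
    "https://example.com/",
    "https://example.com/sitemap.xml",
    "https://example.com/api/items",
]


def allowed_paths(robots_txt: str, user_agent: str, urls: list[str]) -> dict[str, bool]:
    """Report, per URL, whether the given user agent may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


for agent in ("Googlebot", "GPTBot"):
    print(agent, allowed_paths(ROBOTS_TXT, agent, KEY_URLS))
```

Running the check for both a search crawler and an AI crawler shows at a glance whether a rule meant for one group accidentally blocks the other.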

Best Practices

Redirect Chains & Loops

Fix multi-hop redirects; point directly to final URL
Loops: URLs redirecting back to themselves; break the cycle
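
Chains and loops can be found offline once the redirects are exported as a source-to-target map. This is a sketch under that assumption; the `REDIRECTS` data and the `resolve_redirect` name are hypothetical.

```python
def resolve_redirect(url: str, redirects: dict[str, str]) -> tuple[str, int, bool]:
    """Follow a redirect map starting at url.

    Returns (last URL reached, number of hops, whether a loop was hit).
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen:
            # We came back to a URL already visited: this is a loop.
            return url, hops, True
        seen.add(url)
    return url, hops, False


# Hypothetical redirect map, e.g. exported from a crawl or server config.
REDIRECTS = {
    "/old": "/interim",
    "/interim": "/new",
    "/a": "/b",
    "/b": "/a",
}

for start in ("/old", "/a"):
    final, hops, loop = resolve_redirect(start, REDIRECTS)
    print(start, "->", final, f"({hops} hops, loop={loop})")
```

Any start URL with more than one hop is a chain: update links and the redirect rule itself to point at the final URL. Any result flagged as a loop needs one of its rules removed.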

Broken Links (4xx)

Fix broken internal/external links; 301 or remove
Audit regularly; update or remove broken links
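
A regular audit is easier to act on when broken targets are grouped by the pages that link to them. The sketch below assumes crawl results are available as `(source_page, target_url, status_code)` tuples; that shape and the sample data are illustrative.

```python
from collections import defaultdict


def broken_link_report(links):
    """Group 4xx link targets by the pages that link to them.

    links: iterable of (source_page, target_url, status_code) tuples.
    """
    report = defaultdict(list)
    for source, target, status in links:
        if 400 <= status < 500:
            report[target].append(source)
    return dict(report)


# Hypothetical crawl output.
CRAWL = [
    ("/", "/pricing", 200),
    ("/blog/a", "/old-guide", 404),
    ("/blog/b", "/old-guide", 404),
    ("/about", "https://partner.example/gone", 410),
]

for target, sources in broken_link_report(CRAWL).items():
    print(target, "is linked from", sources)
```

Each entry then maps to one decision: 301 the target to a replacement, or remove the link from every listed source page.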

Site Architecture

Principle	Guideline
Depth	Important pages within 3–4 clicks from homepage
Orphan pages	Add internal links to pages with no incoming links; see internal-links for link strategy
Hierarchy	Logical structure; hub pages link to content
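
Click depth and orphan pages can both be derived from the internal link graph with a breadth-first search from the homepage. This sketch assumes the graph is available as a page-to-outlinks dictionary; the sample graph and function names are hypothetical. Note that it treats any page unreachable from the homepage as an orphan, which is slightly broader than "no incoming links".

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str = "/") -> dict[str, int]:
    """Breadth-first search: minimum clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_structure(links, all_pages, home="/", max_depth=3):
    """Return (pages unreachable from home, pages deeper than max_depth)."""
    depths = click_depths(links, home)
    orphans = sorted(p for p in all_pages if p not in depths)
    too_deep = sorted(p for p, d in depths.items() if d > max_depth)
    return orphans, too_deep


# Hypothetical internal link graph and full page inventory.
LINKS = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-1/comments"],
    "/blog/post-1/comments": ["/blog/post-1/comments/42"],
}
ALL_PAGES = {
    "/", "/blog", "/products", "/blog/post-1",
    "/blog/post-1/comments", "/blog/post-1/comments/42",
    "/landing/old-campaign",
}

orphans, too_deep = audit_structure(LINKS, ALL_PAGES)
print("orphans:", orphans)
print("too deep:", too_deep)
```

The page inventory would typically come from the sitemap or CMS, since a crawl that starts at the homepage cannot discover orphans on its own.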

Pagination vs Infinite Scroll

Problem: With infinite scroll, crawlers cannot reliably emulate user behavior (scrolling, clicking "Load more"), so content loaded after the initial page load may not be discovered. The same applies to masonry + infinite scroll, lazy-loaded lists, and similar patterns.

Solution: Prefer pagination for key content. If keeping infinite scroll, make it search-friendly per Google's recommendations:

Requirement	Practice
Component pages	Chunk content into paginated pages accessible without JavaScript
Full URLs	Each page has a unique URL (e.g. `?page=1`, `?lastid=567`); avoid fragment-only URLs such as `#1`
No overlap	Each item listed once in series; no duplication across pages
Direct access	URL works in new tab; no cookie/history dependency
pushState/replaceState	Update URL as user scrolls; enables back/forward, shareable links
404 for out-of-bounds	`?page=999` returns 404 when only 998 pages exist

Reference: Infinite scroll search-friendly recommendations (Google Search Central, 2014)
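
The "no overlap" and "404 for out-of-bounds" requirements can be sketched as a small server-side helper. The page size, the `component_page` name, and the return shape are assumptions for illustration; a real handler would map the status code onto its framework's response type.

```python
import math

PAGE_SIZE = 10  # assumed page size


def component_page(items: list, page: int, page_size: int = PAGE_SIZE):
    """Return (status_code, items_on_page) for a ?page=N request."""
    total_pages = max(1, math.ceil(len(items) / page_size))
    if page < 1 or page > total_pages:
        # Out-of-bounds pages return 404 instead of an empty 200 page.
        return 404, []
    start = (page - 1) * page_size
    # Fixed, non-overlapping slices: each item appears on exactly one page.
    return 200, items[start:start + page_size]


ITEMS = list(range(25))  # hypothetical content list: 3 pages at size 10
for n in (1, 3, 4):
    status, chunk = component_page(ITEMS, n)
    print(n, status, chunk)
```

Because slices are computed from the page number alone, each URL works on direct access in a new tab without depending on cookies or scroll history.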

Pagination (Traditional)

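Link to next/previous pages with plain anchor links; add rel="prev" / rel="next" where applicable (Google stated in 2019 that it no longer uses these as an indexing signal, though other consumers may)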
Avoid dynamic-only loading; ensure links in HTML
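
To keep pagination links in the server-rendered HTML rather than injected by script, the links can be generated alongside the page content. A minimal sketch, assuming a `?page=N` URL scheme; the `pagination_links` helper and the `/blog` base path are hypothetical.

```python
def pagination_links(base_url: str, page: int, total_pages: int) -> str:
    """Render previous/next anchors as plain HTML for a given page."""
    parts = []
    if page > 1:
        parts.append(f'<a href="{base_url}?page={page - 1}" rel="prev">Previous</a>')
    if page < total_pages:
        parts.append(f'<a href="{base_url}?page={page + 1}" rel="next">Next</a>')
    return "\n".join(parts)


print(pagination_links("/blog", 2, 3))
```

The anchors are ordinary `href` links, so a crawler that does not execute JavaScript can still follow them.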

Crawl Budget

Crawl budget is the number of URLs Googlebot can and wants to crawl on your site in a given period. Large sites (10,000+ pages) may waste up to 30% of that budget on duplicates, redirects, and low-value URLs.

Waste source	Fix
Duplicate URLs	Canonical; consolidate; 301 to preferred
Redirect chains	Point directly to final URL
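Parameter proliferation	Use `rel="canonical"`; consider the `Clean-param` robots.txt directive (Yandex only)
Low-value pages	noindex for thin/duplicate pages (note that a page must still be crawled for its noindex to be seen, so this controls indexing more than crawling); see indexing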
Crawl traps	Avoid infinite URL generation (e.g. faceted filters)
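
Duplicate and parameter-heavy URLs are easier to consolidate once they are grouped under a normalized form. The sketch below uses Python's standard `urllib.parse`; the list of ignored parameters is an assumption and must be adapted per site, since dropping a parameter that does change content would merge distinct pages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change page content on this site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonical_url(url: str) -> str:
    """Drop ignored parameters, sort the rest, and remove the fragment."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def group_duplicates(urls):
    """Map each normalized URL to the crawled variants that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}


# Hypothetical crawled URLs.
CRAWLED = [
    "https://Example.com/shoes?utm_source=x&color=red&size=9#top",
    "https://example.com/shoes?size=9&color=red",
    "https://example.com/hats",
]
print(group_duplicates(CRAWLED))
```

Each group with more than one variant is a candidate for a canonical tag or a 301 to the preferred URL.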

Sitemap: Include only indexable, canonical URLs. See xml-sitemap, canonical-tag.
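
One way to enforce the "indexable, canonical only" rule is to filter page records before writing the sitemap. This is a sketch using Python's standard `xml.etree.ElementTree`; the record schema (`url`, `status`, `noindex`, `canonical`) is an assumption about what a crawl or CMS export provides.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[dict]) -> str:
    """Build a <urlset> containing only indexable, self-canonical URLs.

    pages: dicts with 'url', 'status', 'noindex', 'canonical' keys (assumed schema).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        indexable = page["status"] == 200 and not page["noindex"]
        self_canonical = page["canonical"] == page["url"]
        if indexable and self_canonical:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page["url"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page records.
PAGES = [
    {"url": "https://example.com/", "status": 200, "noindex": False,
     "canonical": "https://example.com/"},
    {"url": "https://example.com/shoes?sort=price", "status": 200, "noindex": False,
     "canonical": "https://example.com/shoes"},
    {"url": "https://example.com/thin", "status": 200, "noindex": True,
     "canonical": "https://example.com/thin"},
    {"url": "https://example.com/gone", "status": 404, "noindex": False,
     "canonical": "https://example.com/gone"},
]
print(build_sitemap(PAGES))
```

The string returned here omits the XML declaration; a production sitemap file would normally add it when writing to disk.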

AI Crawler Optimization

AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) now generate request volume equal to roughly 28% of Googlebot's, according to a Vercel/MERJ study (Dec 2024) of traffic on Vercel's network. Their behavior differs from that of search engine crawlers, so optimizing for both improves GEO (AI search visibility). See generative-engine-optimization for GEO strategy. Findings from the study:

Factor	AI Crawlers (GPTBot, Claude)	Googlebot
JavaScript	Do not execute JS; cannot read client-side rendered content	Full JS rendering
404 rate	~34% of fetches hit 404s	~8%
Redirects	~14% of fetches follow redirects	~1.5%
Content in initial HTML	JSON, RSC in initial response can be indexed	Same
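
The same 404 and redirect rates can be measured on your own site from access logs. The sketch below assumes log lines have already been parsed into `(user_agent, status_code)` pairs; that shape, the bot list, and the sample data are illustrative. Matching on the user-agent string alone is a simplification, since user agents can be spoofed.

```python
from collections import Counter

REDIRECT_CODES = {301, 302, 303, 307, 308}


def crawl_waste(entries, bot_names=("GPTBot", "ClaudeBot", "Googlebot")):
    """Compute per-bot fetch counts, 404 rate, and redirect rate.

    entries: iterable of (user_agent, status_code) pairs (assumed shape).
    """
    totals, not_found, redirects = Counter(), Counter(), Counter()
    for user_agent, status in entries:
        agent = user_agent.lower()
        bot = next((b for b in bot_names if b.lower() in agent), None)
        if bot is None:
            continue  # not one of the crawlers we track
        totals[bot] += 1
        if status == 404:
            not_found[bot] += 1
        elif status in REDIRECT_CODES:
            redirects[bot] += 1
    return {
        bot: {
            "fetches": count,
            "404_rate": not_found[bot] / count,
            "redirect_rate": redirects[bot] / count,
        }
        for bot, count in totals.items()
    }


# Hypothetical parsed log entries.
LOG = [
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", 200),
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", 404),
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", 301),
    ("Mozilla/5.0 (compatible; GPTBot/1.0)", 200),
    ("Mozilla/5.0 (compatible; Googlebot/2.1)", 200),
    ("Mozilla/5.0 Firefox/120.0", 404),
]
print(crawl_waste(LOG))
```

Comparing the per-bot rates against the study's figures gives a rough sense of whether stale URLs are a bigger problem on your site than on a typical one.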

Recommendations for AI crawlability:

Practice	Action
Server-side rendering	Critical content in initial HTML. Use SSR, ISR, or SSG. See rendering-strategies for full guide.
URL management	Keep sitemaps updated; use consistent URL patterns; avoid outdated /static/ assets that cause 404s. AI crawlers frequently hit outdated URLs.
Redirects	Fix redirect chains; point directly to final URL. AI crawlers waste ~14% of fetches on redirects.
404 handling	Fix broken links; remove or redirect outdated URLs. High 404 rates suggest AI crawlers may use stale URL lists.

Reference: The rise of the AI crawler (Vercel, 2024)
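
A quick way to test the server-side rendering recommendation is to look for critical phrases in the raw HTML response, before any JavaScript runs. This sketch uses Python's standard `html.parser`; the helper names and sample markup are hypothetical. It is deliberately conservative: it ignores everything inside `<script>` and `<style>`, so content that exists only as embedded JSON is reported as missing.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def missing_from_initial_html(html: str, required_phrases: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the non-script text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in text]


# Hypothetical responses: one client-rendered, one server-rendered.
CSR_HTML = '<html><body><div id="root"></div><script>render("Pricing plans")</script></body></html>'
SSR_HTML = "<html><body><h1>Pricing plans</h1><p>Starts at $10</p></body></html>"

print(missing_from_initial_html(CSR_HTML, ["Pricing plans"]))
print(missing_from_initial_html(SSR_HTML, ["Pricing plans"]))
```

The HTML passed in should be the unrendered server response, for example from a plain HTTP client, not the DOM captured from a browser.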

Common Issues

Issue	Check
Redirect chains	Update links to point directly to final URL
Broken links	301 or remove; audit internal and external
Orphan pages	Add internal links from hub or navigation; see internal-links for strategy
Infinite scroll	Provide paginated component pages; or replace with pagination for key content; see above
AI crawlers missing content	Ensure critical content in initial HTML; see rendering-strategies

Output Format

Redirect audit: Chains and loops to fix
Broken link audit: 4xx links to fix
Site structure: Orphan pages, hierarchy
Pagination: Implementation for crawlable content
AI crawler: SSR/URL/redirect checks if GEO or AI visibility is a goal

Related Skills

seo-strategy: SEO workflow; crawlability is Technical phase (P0)
website-structure: Plan which pages to build, page priority, structure planning; use before or alongside crawlability audit
robots-txt: robots.txt configuration; AI crawler allow/block (GPTBot, ClaudeBot)
xml-sitemap: URL discovery; keep updated to reduce AI crawler 404s
google-search-console: Index status, Page indexing (formerly Coverage) report
indexing: Fix indexing issues
internal-links: Internal linking best practices
masonry: Masonry + infinite scroll has the same crawl issue; the layout skill references this skill for SEO
generative-engine-optimization: GEO strategy; AI search visibility; crawlability enables AI citation
canonical-tag: Canonical reduces crawl budget waste on duplicates
rendering-strategies: SSR, SSG, CSR; content in initial HTML; crawler visibility

First Seen

Mar 1, 2026
