cloudflare-workers-ai by jezweb/claude-skills
npx skills add https://github.com/jezweb/claude-skills --skill cloudflare-workers-ai

Status: Production Ready ✅ Last Updated: 2026-01-21 Dependencies: cloudflare-worker-base (for Worker setup) Latest Versions: wrangler@4.58.0, @cloudflare/workers-types@4.20260109.0, workers-ai-provider@3.0.2
Recent Updates (2025):
// 1. Add the AI binding to wrangler.jsonc
{ "ai": { "binding": "AI" } }
// 2. Run the model with streaming (recommended)
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always stream for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
},
};
Why streaming? It avoids buffering the whole response in memory, delivers a faster time-to-first-token, and sidesteps Worker timeout issues.
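On the client side, the returned stream arrives as server-sent events. As a rough sketch (not part of any SDK), a tiny parser for the chunks looks like this — it assumes each `data:` line carries a JSON object with a `response` field and that `data: [DONE]` terminates the stream, which matches Workers AI's documented SSE format but should be verified against current docs:

```typescript
// Minimal SSE chunk parser for Workers AI token events (illustrative helper).
// Skips the terminal `data: [DONE]` sentinel and non-data lines.
function parseSSEChunk(chunk: string): string {
  let out = '';
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
    out += JSON.parse(line.slice(6)).response ?? '';
  }
  return out;
}

// Consuming the stream from a Worker route (illustrative /chat path):
// const res = await fetch('/chat', { method: 'POST', body: JSON.stringify({ prompt }) });
// const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
// for (let r = await reader.read(); !r.done; r = await reader.read()) {
//   text += parseSSEChunk(r.value); // NB: production code should buffer partial lines across chunks
// }
```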
This skill prevents 7 documented issues:
Error: "Exceeded character limit" despite the model supporting a larger context Source: Cloudflare Changelog Why It Happens: Before February 2025, Workers AI validated prompts against a hard 6144-character limit, even for models with larger token-based context windows (e.g., Mistral with 32K tokens). After the update, validation switched to token-based counting. Prevention: Count tokens (not characters) when checking context-window limits.
import { encode } from 'gpt-tokenizer'; // or a model-specific tokenizer
const tokens = encode(prompt);
const contextWindow = 32768; // model's max tokens (check the docs)
const maxResponseTokens = 2048;
if (tokens.length + maxResponseTokens > contextWindow) {
throw new Error(`Prompt exceeds context window: ${tokens.length} tokens`);
}
const response = await env.AI.run('@cf/mistral/mistral-7b-instruct-v0.2', {
messages: [{ role: 'user', content: prompt }],
max_tokens: maxResponseTokens,
});
Error: Dashboard neuron usage significantly exceeds expected token-based calculations Source: Cloudflare Community Discussion Why It Happens: Users report the dashboard showing neuron consumption in the hundreds of millions for token usage in the thousands, particularly with AutoRAG features and certain models. The discrepancy between expected neuron consumption (based on the pricing docs) and actual dashboard metrics is not fully documented. Prevention: Monitor neuron usage via AI Gateway logs and correlate it with requests. File a support ticket if consumption significantly exceeds expectations.
// Use AI Gateway for detailed request logging
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ messages: [{ role: 'user', content: query }] },
{ gateway: { id: 'my-gateway' } }
);
// Monitor the dashboard at: https://dash.cloudflare.com → AI → Workers AI
// Compare neuron usage with token counts
// File a support ticket with details if the discrepancy persists
Error: "MiniflareCoreError: wrapped binding module can't be resolved (internal modules only)" Source: GitHub Issue #6796 Why It Happens: When using Workers AI bindings with Miniflare in local development (particularly with custom Vite plugins), the AI binding requires external workers that older unstable_getMiniflareWorkerOptions fails to expose properly. The error occurs when Miniflare can't resolve the internal AI worker module. Prevention: Use remote bindings for AI in local dev, or update to the latest @cloudflare/vite-plugin.
// wrangler.jsonc - Option 1: Use a remote AI binding in local dev
{
"ai": { "binding": "AI" },
"dev": {
"remote": true // use the production AI binding locally
}
}
# Option 2: Update to the latest tooling
npm install -D @cloudflare/vite-plugin@latest
# Option 3: Use wrangler dev instead of custom Miniflare
npm run dev
Error: "AiError: Input prompt contains NSFW content (code 3030)" for innocent prompts Source: Cloudflare Community Discussion Why It Happens: Flux image generation models (@cf/black-forest-labs/flux-1-schnell) sometimes raise false-positive NSFW errors even for innocent single-word prompts like "hamburger"; the filter can be overly sensitive without context. Prevention: Add descriptive context around potential trigger words instead of using single-word prompts.
// ❌ May trigger error 3030
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'hamburger', // single word triggers the filter
});
// ✅ Add context to avoid false positives
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A photo of a delicious large hamburger on a plate with lettuce and tomato',
num_steps: 4,
});
Error: "Error: unexpected type 'int32' with value 'undefined' (code 1000)" Source: Cloudflare Community Discussion Why It Happens: Image generation calls return error code 1000 when the num_steps parameter is missing, even though the documentation suggests it is optional; in practice most Flux models require it. Prevention: Always include num_steps (typically 4 for Flux Schnell) on image generation calls.
// ✅ Always include num_steps for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A beautiful sunset over mountains',
num_steps: 4, // required - typically 4 for Flux Schnell
});
// Note: FLUX.2 [klein] 4B has a fixed steps=4 (cannot be adjusted)
Error: Syntax errors and failed transpilation when using Stagehand with Zod v4 Source: GitHub Issue #10798 Why It Happens: Stagehand (browser automation) and some structured-output examples in Workers AI fail with Zod v4 (now the default). The underlying zod-to-json-schema library doesn't yet support Zod v4, causing transpilation failures. Prevention: Pin Zod to v3 until zod-to-json-schema supports v4.
# Install Zod v3 specifically
npm install zod@3
# Or pin the version in package.json
{
"dependencies": {
"zod": "~3.23.8" // pinned to v3 for compatibility
}
}
Not an error, but an important feature: AI Gateway supports per-request cache control via HTTP headers for custom TTLs, cache bypass, and custom cache keys beyond the dashboard defaults. Source: AI Gateway Caching documentation. Use when: you need different caching behavior for different requests (e.g., 1 hour for expensive queries, skip the cache for real-time data). Implementation: see the AI Gateway Integration section below for header usage.
env.AI.run(
model: string,
inputs: ModelInputs,
options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
| Model | Best For | Rate Limit | Size | Notes |
|---|---|---|---|---|
| 2025 Models | | | | |
| @cf/meta/llama-4-scout-17b-16e-instruct | Latest Llama, general purpose | 300/min | 17B | NEW 2025 |
| @cf/openai/gpt-oss-120b | Largest open-source GPT | 300/min | 120B | NEW 2025 |
| @cf/openai/gpt-oss-20b | Smaller open-source GPT | 300/min | 20B | NEW 2025 |
| @cf/google/gemma-3-12b-it | 128K context, 140+ languages | 300/min | 12B | NEW 2025, vision |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Vision + tool calling | 300/min | 24B | NEW 2025 |
| @cf/qwen/qwq-32b | Reasoning, complex tasks | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen2.5-coder-32b-instruct | Coding specialist | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen3-30b-a3b-fp8 | Fast quantized | 300/min | 30B | NEW 2025 |
| @cf/ibm-granite/granite-4.0-h-micro | Small, efficient | 300/min | Micro | NEW 2025 |
| Performance (2025) | | | | |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | 2-4x faster (2025 update) | 300/min | 70B | Speculative decoding |
| @cf/meta/llama-3.1-8b-instruct-fp8-fast | Fast 8B variant | 300/min | 8B | - |
| Standard Models | | | | |
| @cf/meta/llama-3.1-8b-instruct | General purpose | 300/min | 8B | - |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B | - |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical | 300/min | 32B | - |
| Model | Dimensions | Best For | Rate Limit | Notes |
|---|---|---|---|---|
| @cf/google/embeddinggemma-300m | 768 | Best-in-class RAG | 3000/min | NEW 2025 |
| @cf/baai/bge-base-en-v1.5 | 768 | General RAG (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy (2x faster) | 1500/min | pooling: "cls" recommended |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/qwen/qwen3-embedding-0.6b | 768 | Qwen embeddings | 3000/min | NEW 2025 |
CRITICAL (2025): BGE models now support the pooling: "cls" parameter (recommended), but embeddings produced with "cls" are NOT backwards compatible with the default pooling: "mean" — don't mix the two in one index.
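A minimal sketch of opting into "cls" pooling (assumes a Worker with an `AI` binding; if you switch pooling modes, re-embed the whole corpus so stored vectors stay comparable):

```typescript
// Request cls pooling explicitly; omitting `pooling` falls back to the
// legacy 'mean' behaviour. Vectors from the two modes are not comparable.
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Cloudflare Workers AI runs models on the edge'],
  pooling: 'cls',
});
const vector = data[0]; // 768-dimensional number[], per the table above
```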
| Model | Best For | Rate Limit | Notes |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | ⚠️ See warnings below |
| @cf/leonardo/lucid-origin | Leonardo AI style | 720/min | NEW 2025, requires num_steps |
| @cf/leonardo/phoenix-1.0 | Leonardo AI variant | 720/min | NEW 2025, requires num_steps |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Requires num_steps |
⚠️ Common image generation issues:
Error 1000: always include the num_steps: 4 parameter (required in practice, despite docs suggesting it is optional)
Error 3030 (NSFW filter): single words like "hamburger" can trigger false positives - add descriptive context to prompts
// ✅ Correct pattern for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A photo of a delicious hamburger on a plate with fresh vegetables',
num_steps: 4, // required to avoid error 1000
});
// Descriptive context helps avoid NSFW false positives (error 3030)
| Model | Best For | Rate Limit | Notes |
|---|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min | - |
| @cf/google/gemma-3-12b-it | Vision + text (128K context) | 300/min | NEW 2025 |
| Model | Type | Rate Limit | Notes |
|---|---|---|---|
| @cf/deepgram/aura-2-en | Text-to-speech (English) | 720/min | NEW 2025 |
| @cf/deepgram/aura-2-es | Text-to-speech (Spanish) | 720/min | NEW 2025 |
| @cf/deepgram/nova-3 | Speech-to-text (+ WebSocket) | 720/min | NEW 2025 |
| @cf/openai/whisper-large-v3-turbo | Speech-to-text (faster) | 720/min | NEW 2025 |
// 1. Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });
// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// 3. Generate with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: `Answer using this context:\n${context}` },
{ role: 'user', content: userQuery },
],
stream: true,
});
import { z } from 'zod';
const Schema = z.object({ name: z.string(), items: z.array(z.string()) });
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Generate JSON matching: ${JSON.stringify(Schema.shape)}`
}],
});
const validated = Schema.parse(JSON.parse(response.response));
Provides caching, logging, cost tracking, and analytics for AI requests.
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ prompt: 'Hello' },
{ gateway: { id: 'my-gateway', skipCache: false } }
);
// Access logs and send feedback
const gateway = env.AI.gateway('my-gateway');
await gateway.patchLog(env.AI.aiGatewayLogId, {
feedback: { rating: 1, comment: 'Great response' },
});
Override the default cache behavior with HTTP headers for fine-grained control:
// Custom cache TTL (1 hour for expensive queries)
const response = await fetch(
`https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/workers-ai/@cf/meta/llama-3.1-8b-instruct`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${env.CLOUDFLARE_API_KEY}`,
'Content-Type': 'application/json',
'cf-aig-cache-ttl': '3600', // 1 hour in seconds (min: 60, max: 2592000)
},
body: JSON.stringify({
messages: [{ role: 'user', content: prompt }],
}),
}
);
// Skip the cache for real-time data
const response = await fetch(gatewayUrl, {
headers: {
'cf-aig-skip-cache': 'true', // bypass the cache entirely
},
// ...
});
// Check whether the response came from the cache
const cacheStatus = response.headers.get('cf-aig-cache-status'); // "HIT" or "MISS"
Available cache headers:
- cf-aig-cache-ttl: set a custom TTL in seconds (60s to 1 month)
- cf-aig-skip-cache: bypass the cache entirely ('true')
- cf-aig-cache-key: custom cache key for granular control
- cf-aig-cache-status: response header showing "HIT" or "MISS"

Benefits: cost tracking, caching (reduces duplicate inference), logging, rate limiting, analytics, per-request cache customization.
| Task Type | Default Limit | Notes |
|---|---|---|
| Text generation | 300/min | Some fast models: 400-1500/min |
| Text embeddings | 3000/min | BGE-large: 1500/min |
| Image generation | 720/min | All image models |
| Vision models | 720/min | Image understanding |
| Audio (TTS/STT) | 720/min | Deepgram, Whisper |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
Free Tier:
Paid Tier ($0.011 per 1,000 neurons):
2025 Model Costs (per 1M tokens):
| Model | Input | Output | Notes |
|---|---|---|---|
| 2025 Models | | | |
| Llama 4 Scout 17B | $0.270 | $0.850 | NEW 2025 |
| GPT-OSS 120B | $0.350 | $0.750 | NEW 2025 |
| GPT-OSS 20B | $0.200 | $0.300 | NEW 2025 |
| Gemma 3 12B | $0.345 | $0.556 | NEW 2025 |
| Mistral 3.1 24B | $0.351 | $0.555 | NEW 2025 |
| Qwen QwQ 32B | $0.660 | $1.000 | NEW 2025 |
| Qwen Coder 32B | $0.660 | $1.000 | NEW 2025 |
| IBM Granite Micro | $0.017 | $0.112 | NEW 2025 |
| EmbeddingGemma 300M | $0.012 | N/A | NEW 2025 |
| Qwen3 Embedding 0.6B | $0.012 | N/A | NEW 2025 |
| Performance (2025) | | | |
| Llama 3.3 70B Fast | $0.293 | $2.253 | 2-4x faster |
| Llama 3.1 8B FP8 Fast | $0.045 | $0.384 | Fast variant |
| Standard Models | | | |
| Llama 3.2 1B | $0.027 | $0.201 | - |
| Llama 3.1 8B | $0.282 | $0.827 | - |
| Deepseek R1 32B | $0.497 | $4.881 | - |
| BGE-base (2x faster) | $0.067 | N/A | 2025 speedup |
| BGE-large (2x faster) | $0.204 | N/A | 2025 speedup |
| Image Models (2025) | | | |
| Flux 1 Schnell | $0.0000528 per 512x512 tile | - | - |
| Leonardo Lucid | $0.006996 per 512x512 tile | - | NEW 2025 |
| Leonardo Phoenix | $0.005830 per 512x512 tile | - | NEW 2025 |
| Audio Models (2025) | | | |
| Deepgram Aura 2 | $0.030 per 1k chars | - | NEW 2025 |
| Deepgram Nova 3 | $0.0052 per audio min | - | NEW 2025 |
| Whisper v3 Turbo | $0.0005 per audio min | - | NEW 2025 |
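As a sanity check on billing, the per-1M-token prices above can be turned into a rough cost estimate. This is an illustrative helper, not an SDK API; the price constants are copied from the table and should be re-verified against current Cloudflare pricing:

```typescript
// Back-of-envelope cost estimate from the per-1M-token prices above.
// Prices are USD per million tokens, taken from the pricing table.
const PRICES: Record<string, { input: number; output: number }> = {
  '@cf/meta/llama-3.1-8b-instruct': { input: 0.282, output: 0.827 },
  '@cf/openai/gpt-oss-20b': { input: 0.2, output: 0.3 },
};

function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No pricing data for ${model}`);
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}

// e.g. a 2,000-token prompt with a 500-token reply on Llama 3.1 8B
const cost = estimateCostUSD('@cf/meta/llama-3.1-8b-instruct', 2000, 500);
```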
async function runAIWithRetry(
env: Env,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let lastError: Error;
for (let i = 0; i < maxRetries; i++) {
try {
return await env.AI.run(model, inputs);
} catch (error) {
lastError = error as Error;
// Rate limited - retry with exponential backoff
if (lastError.message.toLowerCase().includes('rate limit')) {
await new Promise((resolve) => setTimeout(resolve, Math.pow(2, i) * 1000));
continue;
}
throw error; // other errors - fail immediately
}
}
throw lastError!;
}
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: env.CLOUDFLARE_API_KEY,
baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/v1`,
});
// Chat completions
await openai.chat.completions.create({
model: '@cf/meta/llama-3.1-8b-instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
Endpoints: /v1/chat/completions, /v1/embeddings
import { createWorkersAI } from 'workers-ai-provider'; // v3.0.2 with AI SDK v5
import { generateText, streamText } from 'ai';
const workersai = createWorkersAI({ binding: env.AI });
// Generate or stream
await generateText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Write a poem',
});
Note: These tips come from community discussions and production experience.
When using Workers AI streaming with Hono, return the stream directly as a Response (not through Hono's streaming utilities):
import { Hono } from 'hono';
type Bindings = { AI: Ai };
const app = new Hono<{ Bindings: Bindings }>();
app.post('/chat', async (c) => {
const { prompt } = await c.req.json();
const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: prompt }],
stream: true,
});
// Return the stream directly (not via c.stream())
return new Response(stream, {
headers: {
'content-type': 'text/event-stream',
'cache-control': 'no-cache',
'connection': 'keep-alive',
},
});
});
Source: Hono Discussion #2409

If you hit unexplained Workers AI failures:
# 1. Check the wrangler version
npx wrangler --version
# 2. Clear the wrangler cache
rm -rf ~/.wrangler
# 3. Update to the latest stable release
npm install -D wrangler@latest
# 4. Check local network/firewall settings
# Some corporate firewalls block Workers AI endpoints
Note: Most "version incompatibility" issues turn out to be network configuration problems.
Use mcp__cloudflare-docs__search_cloudflare_documentation for the latest documentation.