cloudflare-workers-ai by jezweb/claude-skills
npx skills add https://github.com/jezweb/claude-skills --skill cloudflare-workers-ai

Status: Production Ready ✅ Last Updated: 2026-01-21 Dependencies: cloudflare-worker-base (for Worker setup) Latest Versions: wrangler@4.58.0, @cloudflare/workers-types@4.20260109.0, workers-ai-provider@3.0.2
Recent Updates (2025):
// 1. Add the AI binding to wrangler.jsonc
{ "ai": { "binding": "AI" } }
// 2. Run the model with streaming (recommended)
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always stream for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
},
};
Why streaming? It avoids buffering the whole response in memory, delivers a faster time-to-first-token, and sidesteps Worker timeout issues.
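On the client side, the returned stream arrives as server-sent events. As a rough sketch (not part of any SDK), a tiny parser for the chunks looks like this — it assumes each `data:` line carries a JSON object with a `response` field and that `data: [DONE]` terminates the stream, which matches Workers AI's documented SSE format but should be verified against current docs:

```typescript
// Minimal SSE chunk parser for Workers AI token events (illustrative helper).
// Skips the terminal `data: [DONE]` sentinel and non-data lines.
function parseSSEChunk(chunk: string): string {
  let out = '';
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
    out += JSON.parse(line.slice(6)).response ?? '';
  }
  return out;
}

// Consuming the stream from a Worker route (illustrative /chat path):
// const res = await fetch('/chat', { method: 'POST', body: JSON.stringify({ prompt }) });
// const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
// for (let r = await reader.read(); !r.done; r = await reader.read()) {
//   text += parseSSEChunk(r.value); // NB: production code should buffer partial lines across chunks
// }
```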
This skill prevents 7 documented issues:
Error: "Exceeded character limit" despite the model supporting a larger context Source: Cloudflare Changelog Why It Happens: Before February 2025, Workers AI validated prompts against a hard 6144-character limit, even for models with larger token-based context windows (e.g., Mistral with 32K tokens). After the update, validation switched to token-based counting. Prevention: Count tokens (not characters) when checking context-window limits.
import { encode } from 'gpt-tokenizer'; // or a model-specific tokenizer
const tokens = encode(prompt);
const contextWindow = 32768; // model's max tokens (check the docs)
const maxResponseTokens = 2048;
if (tokens.length + maxResponseTokens > contextWindow) {
throw new Error(`Prompt exceeds context window: ${tokens.length} tokens`);
}
const response = await env.AI.run('@cf/mistral/mistral-7b-instruct-v0.2', {
messages: [{ role: 'user', content: prompt }],
max_tokens: maxResponseTokens,
});
Error: Dashboard neuron usage significantly exceeds expected token-based calculations Source: Cloudflare Community Discussion Why It Happens: Users report the dashboard showing neuron consumption in the hundreds of millions for token usage in the thousands, particularly with AutoRAG features and certain models. The discrepancy between expected neuron consumption (based on the pricing docs) and actual dashboard metrics is not fully documented. Prevention: Monitor neuron usage via AI Gateway logs and correlate it with requests. File a support ticket if consumption significantly exceeds expectations.
// Use AI Gateway for detailed request logging
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ messages: [{ role: 'user', content: query }] },
{ gateway: { id: 'my-gateway' } }
);
// Monitor the dashboard at: https://dash.cloudflare.com → AI → Workers AI
// Compare neuron usage with token counts
// File a support ticket with details if the discrepancy persists
Error: "MiniflareCoreError: wrapped binding module can't be resolved (internal modules only)" Source: GitHub Issue #6796 Why It Happens: When using Workers AI bindings with Miniflare in local development (particularly with custom Vite plugins), the AI binding requires external workers that older unstable_getMiniflareWorkerOptions fails to expose properly. The error occurs when Miniflare can't resolve the internal AI worker module. Prevention: Use remote bindings for AI in local dev, or update to the latest @cloudflare/vite-plugin.
// wrangler.jsonc - Option 1: Use a remote AI binding in local dev
{
"ai": { "binding": "AI" },
"dev": {
"remote": true // use the production AI binding locally
}
}
# Option 2: Update to the latest tooling
npm install -D @cloudflare/vite-plugin@latest
# Option 3: Use wrangler dev instead of custom Miniflare
npm run dev
Error: "AiError: Input prompt contains NSFW content (code 3030)" for innocent prompts Source: Cloudflare Community Discussion Why It Happens: Flux image generation models (@cf/black-forest-labs/flux-1-schnell) sometimes raise false-positive NSFW errors even for innocent single-word prompts like "hamburger"; the filter can be overly sensitive without context. Prevention: Add descriptive context around potential trigger words instead of using single-word prompts.
// ❌ May trigger error 3030
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'hamburger', // single word triggers the filter
});
// ✅ Add context to avoid false positives
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A photo of a delicious large hamburger on a plate with lettuce and tomato',
num_steps: 4,
});
Error: "Error: unexpected type 'int32' with value 'undefined' (code 1000)" Source: Cloudflare Community Discussion Why It Happens: Image generation calls return error code 1000 when the num_steps parameter is missing, even though the documentation suggests it is optional; in practice most Flux models require it. Prevention: Always include num_steps (typically 4 for Flux Schnell) on image generation calls.
// ✅ Always include num_steps for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A beautiful sunset over mountains',
num_steps: 4, // required - typically 4 for Flux Schnell
});
// Note: FLUX.2 [klein] 4B has a fixed steps=4 (cannot be adjusted)
Error: Syntax errors and failed transpilation when using Stagehand with Zod v4 Source: GitHub Issue #10798 Why It Happens: Stagehand (browser automation) and some structured-output examples in Workers AI fail with Zod v4 (now the default). The underlying zod-to-json-schema library doesn't yet support Zod v4, causing transpilation failures. Prevention: Pin Zod to v3 until zod-to-json-schema supports v4.
# Install Zod v3 specifically
npm install zod@3
# Or pin the version in package.json
{
"dependencies": {
"zod": "~3.23.8" // pinned to v3 for compatibility
}
}
Not an error, but an important feature: AI Gateway supports per-request cache control via HTTP headers for custom TTLs, cache bypass, and custom cache keys beyond the dashboard defaults. Source: AI Gateway Caching documentation. Use when: you need different caching behavior for different requests (e.g., 1 hour for expensive queries, skip the cache for real-time data). Implementation: see the AI Gateway Integration section below for header usage.
env.AI.run(
model: string,
inputs: ModelInputs,
options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
| Model | Best For | Rate Limit | Size | Notes |
|---|---|---|---|---|
| 2025 Models | | | | |
| @cf/meta/llama-4-scout-17b-16e-instruct | Latest Llama, general purpose | 300/min | 17B | NEW 2025 |
| @cf/openai/gpt-oss-120b | Largest open-source GPT | 300/min | 120B | NEW 2025 |
| @cf/openai/gpt-oss-20b | Smaller open-source GPT | 300/min | 20B | NEW 2025 |
| @cf/google/gemma-3-12b-it | 128K context, 140+ languages | 300/min | 12B | NEW 2025, vision |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Vision + tool calling | 300/min | 24B | NEW 2025 |
| @cf/qwen/qwq-32b | Reasoning, complex tasks | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen2.5-coder-32b-instruct | Coding specialist | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen3-30b-a3b-fp8 | Fast quantized | 300/min | 30B | NEW 2025 |
| @cf/ibm-granite/granite-4.0-h-micro | Small, efficient | 300/min | Micro | NEW 2025 |
| Performance (2025) | | | | |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | 2-4x faster (2025 update) | 300/min | 70B | Speculative decoding |
| @cf/meta/llama-3.1-8b-instruct-fp8-fast | Fast 8B variant | 300/min | 8B | - |
| Standard Models | | | | |
| @cf/meta/llama-3.1-8b-instruct | General purpose | 300/min | 8B | - |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B | - |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical | 300/min | 32B | - |
| Model | Dimensions | Best For | Rate Limit | Notes |
|---|---|---|---|---|
| @cf/google/embeddinggemma-300m | 768 | Best-in-class RAG | 3000/min | NEW 2025 |
| @cf/baai/bge-base-en-v1.5 | 768 | General RAG (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy (2x faster) | 1500/min | pooling: "cls" recommended |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/qwen/qwen3-embedding-0.6b | 768 | Qwen embeddings | 3000/min | NEW 2025 |
CRITICAL (2025): BGE models now support the pooling: "cls" parameter (recommended), but embeddings produced with "cls" are NOT backwards compatible with the default pooling: "mean" — don't mix the two in one index.
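A minimal sketch of opting into "cls" pooling (assumes a Worker with an `AI` binding; if you switch pooling modes, re-embed the whole corpus so stored vectors stay comparable):

```typescript
// Request cls pooling explicitly; omitting `pooling` falls back to the
// legacy 'mean' behaviour. Vectors from the two modes are not comparable.
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Cloudflare Workers AI runs models on the edge'],
  pooling: 'cls',
});
const vector = data[0]; // 768-dimensional number[], per the table above
```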
| Model | Best For | Rate Limit | Notes |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | ⚠️ See warnings below |
| @cf/leonardo/lucid-origin | Leonardo AI style | 720/min | NEW 2025, requires num_steps |
| @cf/leonardo/phoenix-1.0 | Leonardo AI variant | 720/min | NEW 2025, requires num_steps |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Requires num_steps |
⚠️ Common image generation issues:
Error 1000: always include the num_steps: 4 parameter (required in practice, despite docs suggesting it is optional)
Error 3030 (NSFW filter): single words like "hamburger" can trigger false positives - add descriptive context to prompts
// ✅ Correct pattern for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
prompt: 'A photo of a delicious hamburger on a plate with fresh vegetables',
num_steps: 4, // required to avoid error 1000
});
// Descriptive context helps avoid NSFW false positives (error 3030)
| Model | Best For | Rate Limit | Notes |
|---|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min | - |
| @cf/google/gemma-3-12b-it | Vision + text (128K context) | 300/min | NEW 2025 |
| Model | Type | Rate Limit | Notes |
|---|---|---|---|
| @cf/deepgram/aura-2-en | Text-to-speech (English) | 720/min | NEW 2025 |
| @cf/deepgram/aura-2-es | Text-to-speech (Spanish) | 720/min | NEW 2025 |
| @cf/deepgram/nova-3 | Speech-to-text (+ WebSocket) | 720/min | NEW 2025 |
| @cf/openai/whisper-large-v3-turbo | Speech-to-text (faster) | 720/min | NEW 2025 |
// 1. Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });
// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// 3. Generate with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: `Answer using this context:\n${context}` },
{ role: 'user', content: userQuery },
],
stream: true,
});
import { z } from 'zod';
const Schema = z.object({ name: z.string(), items: z.array(z.string()) });
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Generate JSON matching: ${JSON.stringify(Schema.shape)}`
}],
});
const validated = Schema.parse(JSON.parse(response.response));
Provides caching, logging, cost tracking, and analytics for AI requests.
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ prompt: 'Hello' },
{ gateway: { id: 'my-gateway', skipCache: false } }
);
// Access logs and send feedback
const gateway = env.AI.gateway('my-gateway');
await gateway.patchLog(env.AI.aiGatewayLogId, {
feedback: { rating: 1, comment: 'Great response' },
});
Override the default cache behavior with HTTP headers for fine-grained control:
// Custom cache TTL (1 hour for expensive queries)
const response = await fetch(
`https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/workers-ai/@cf/meta/llama-3.1-8b-instruct`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${env.CLOUDFLARE_API_KEY}`,
'Content-Type': 'application/json',
'cf-aig-cache-ttl': '3600', // 1 hour in seconds (min: 60, max: 2592000)
},
body: JSON.stringify({
messages: [{ role: 'user', content: prompt }],
}),
}
);
// Skip the cache for real-time data
const response = await fetch(gatewayUrl, {
headers: {
'cf-aig-skip-cache': 'true', // bypass the cache entirely
},
// ...
});
// Check whether the response came from the cache
const cacheStatus = response.headers.get('cf-aig-cache-status'); // "HIT" or "MISS"
Available cache headers:
- cf-aig-cache-ttl: set a custom TTL in seconds (60s to 1 month)
- cf-aig-skip-cache: bypass the cache entirely ('true')
- cf-aig-cache-key: custom cache key for granular control
- cf-aig-cache-status: response header showing "HIT" or "MISS"

Benefits: cost tracking, caching (reduces duplicate inference), logging, rate limiting, analytics, per-request cache customization.
| Task Type | Default Limit | Notes |
|---|---|---|
| Text generation | 300/min | Some fast models: 400-1500/min |
| Text embeddings | 3000/min | BGE-large: 1500/min |
| Image generation | 720/min | All image models |
| Vision models | 720/min | Image understanding |
| Audio (TTS/STT) | 720/min | Deepgram, Whisper |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
Free Tier:
Paid Tier ($0.011 per 1,000 neurons):
2025 Model Costs (per 1M tokens):
| Model | Input | Output | Notes |
|---|---|---|---|
| 2025 Models | | | |
| Llama 4 Scout 17B | $0.270 | $0.850 | NEW 2025 |
| GPT-OSS 120B | $0.350 | $0.750 | NEW 2025 |
| GPT-OSS 20B | $0.200 | $0.300 | NEW 2025 |
| Gemma 3 12B | $0.345 | $0.556 | NEW 2025 |
| Mistral 3.1 24B | $0.351 | $0.555 | NEW 2025 |
| Qwen QwQ 32B | $0.660 | $1.000 | NEW 2025 |
| Qwen Coder 32B | $0.660 | $1.000 | NEW 2025 |
| IBM Granite Micro | $0.017 | $0.112 | NEW 2025 |
| EmbeddingGemma 300M | $0.012 | N/A | NEW 2025 |
| Qwen3 Embedding 0.6B | $0.012 | N/A | NEW 2025 |
| Performance (2025) | | | |
| Llama 3.3 70B Fast | $0.293 | $2.253 | 2-4x faster |
| Llama 3.1 8B FP8 Fast | $0.045 | $0.384 | Fast variant |
| Standard Models | | | |
| Llama 3.2 1B | $0.027 | $0.201 | - |
| Llama 3.1 8B | $0.282 | $0.827 | - |
| Deepseek R1 32B | $0.497 | $4.881 | - |
| BGE-base (2x faster) | $0.067 | N/A | 2025 speedup |
| BGE-large (2x faster) | $0.204 | N/A | 2025 speedup |
| Image Models (2025) | | | |
| Flux 1 Schnell | $0.0000528 per 512x512 tile | - | - |
| Leonardo Lucid | $0.006996 per 512x512 tile | - | NEW 2025 |
| Leonardo Phoenix | $0.005830 per 512x512 tile | - | NEW 2025 |
| Audio Models (2025) | | | |
| Deepgram Aura 2 | $0.030 per 1k chars | - | NEW 2025 |
| Deepgram Nova 3 | $0.0052 per audio min | - | NEW 2025 |
| Whisper v3 Turbo | $0.0005 per audio min | - | NEW 2025 |
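As a sanity check on billing, the per-1M-token prices above can be turned into a rough cost estimate. This is an illustrative helper, not an SDK API; the price constants are copied from the table and should be re-verified against current Cloudflare pricing:

```typescript
// Back-of-envelope cost estimate from the per-1M-token prices above.
// Prices are USD per million tokens, taken from the pricing table.
const PRICES: Record<string, { input: number; output: number }> = {
  '@cf/meta/llama-3.1-8b-instruct': { input: 0.282, output: 0.827 },
  '@cf/openai/gpt-oss-20b': { input: 0.2, output: 0.3 },
};

function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No pricing data for ${model}`);
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}

// e.g. a 2,000-token prompt with a 500-token reply on Llama 3.1 8B
const cost = estimateCostUSD('@cf/meta/llama-3.1-8b-instruct', 2000, 500);
```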
async function runAIWithRetry(
env: Env,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let lastError: Error;
for (let i = 0; i < maxRetries; i++) {
try {
return await env.AI.run(model, inputs);
} catch (error) {
lastError = error as Error;
// Rate limited - retry with exponential backoff
if (lastError.message.toLowerCase().includes('rate limit')) {
await new Promise((resolve) => setTimeout(resolve, Math.pow(2, i) * 1000));
continue;
}
throw error; // other errors - fail immediately
}
}
throw lastError!;
}
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: env.CLOUDFLARE_API_KEY,
baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/v1`,
});
// Chat completions
await openai.chat.completions.create({
model: '@cf/meta/llama-3.1-8b-instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
Endpoints: /v1/chat/completions, /v1/embeddings
import { createWorkersAI } from 'workers-ai-provider'; // v3.0.2 with AI SDK v5
import { generateText, streamText } from 'ai';
const workersai = createWorkersAI({ binding: env.AI });
// Generate or stream
await generateText({
model: workersai('@cf/meta/llama-3.1-8b-instruct'),
prompt: 'Write a poem',
});
Note: These tips come from community discussions and production experience.
When using Workers AI streaming with Hono, return the stream directly as a Response (not through Hono's streaming utilities):
import { Hono } from 'hono';
type Bindings = { AI: Ai };
const app = new Hono<{ Bindings: Bindings }>();
app.post('/chat', async (c) => {
const { prompt } = await c.req.json();
const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: prompt }],
stream: true,
});
// Return the stream directly (not via c.stream())
return new Response(stream, {
headers: {
'content-type': 'text/event-stream',
'cache-control': 'no-cache',
'connection': 'keep-alive',
},
});
});
Source: Hono Discussion #2409

If you hit unexplained Workers AI failures:
# 1. Check the wrangler version
npx wrangler --version
# 2. Clear the wrangler cache
rm -rf ~/.wrangler
# 3. Update to the latest stable release
npm install -D wrangler@latest
# 4. Check local network/firewall settings
# Some corporate firewalls block Workers AI endpoints
Note: Most "version incompatibility" issues turn out to be network configuration problems.
Use mcp__cloudflare-docs__search_cloudflare_documentation for the latest documentation.