Important note
Installing AI Skills requires a working network proxy with TUN mode enabled. This is critical and directly determines whether the installation completes successfully, so enable your proxy before anything else. See the full installation guide →
hugging-face-jobs by sickn33/antigravity-awesome-skills
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill hugging-face-jobs
Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.
Common use cases:
For model training specifically: see the model-trainer skill for TRL-based training workflows.
Use this skill when users want to:
When assisting with jobs:
ALWAYS use hf_jobs() MCP tool - Submit jobs using hf_jobs("uv", {...}) or hf_jobs("run", {...}). The script parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to hf_jobs().
Always handle authentication - Jobs that interact with the Hub require HF_TOKEN via secrets. See Token Usage section below.
Provide job details after submission - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
Set appropriate timeouts - Default 30min may be insufficient for long-running tasks.
Before starting any job, verify:
- Run hf_whoami() to confirm you are authenticated
When tokens are required:
How to provide tokens:
{
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Recommended: automatic token
}
⚠️ CRITICAL: The $HF_TOKEN placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.
What are HF Tokens?
- Stored securely on your machine after hf auth login
Token Types:
Always Required:
Not Required:
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
})
How it works:
- $HF_TOKEN is a placeholder that gets replaced with your actual token (from hf auth login)
Benefits:
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
})
When to use:
Security concerns:
hf_jobs("uv", {
"script": "your_script.py",
"env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
})
Difference from secrets:
- env variables are visible in job logs
- secrets are encrypted server-side
- Prefer secrets for tokens
In your Python script, tokens are available as environment variables:
# /// script
# dependencies = ["huggingface-hub"]
# ///
import os
from huggingface_hub import HfApi
# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")
# Use with Hub API
api = HfApi(token=token)
# Or let huggingface_hub auto-detect
api = HfApi() # Automatically uses HF_TOKEN env var
Best practices:
- Use os.environ.get("HF_TOKEN") to access the token
- Let huggingface_hub auto-detect when possible
Check if you're logged in:
from huggingface_hub import whoami
user_info = whoami() # Returns your username if authenticated
Verify token in job:
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...") # Should start with "hf_"
Error: 401 Unauthorized
- Add secrets={"HF_TOKEN": "$HF_TOKEN"} to job config
- Verify hf_whoami() works locally
Error: 403 Forbidden
Error: Token not found in environment
- Cause: secrets not passed, or wrong key name
- Pass secrets={"HF_TOKEN": "$HF_TOKEN"} (not env)
- Access with os.environ.get("HF_TOKEN")
Error: Repository access denied
- Use the $HF_TOKEN placeholder or environment variables
# Example: Push results to Hub
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset
# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])
# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
print("✅ Dataset pushed successfully!")
""",
"flavor": "cpu-basic",
"timeout": "30m",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided securely
})
UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
MCP Tool:
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
from transformers import pipeline
import torch
# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
"flavor": "cpu-basic",
"timeout": "30m"
})
CLI Equivalent:
hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
Python API:
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
Benefits: Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
When to use: Default choice for all workloads, custom logic, any scenario requiring hf_jobs()
By default, UV scripts use ghcr.io/astral-sh/uv:python3.12-bookworm-slim. For ML workloads with complex dependencies, use pre-built images:
hf_jobs("uv", {
"script": "inference.py",
"image": "vllm/vllm-openai:latest", # Pre-built image with vLLM
"flavor": "a10g-large"
})
CLI:
hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
Benefits: Faster startup, pre-installed dependencies, optimized for specific frameworks
By default, UV scripts use Python 3.12. Specify a different version:
hf_jobs("uv", {
"script": "my_script.py",
"python": "3.11", # Use Python 3.11
"flavor": "cpu-basic"
})
Python API:
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", python="3.11")
⚠️ Important: There are two "script path" stories depending on how you run Jobs:
- hf_jobs() MCP tool (recommended in this repo): the script value must be inline code (a string) or a URL. A local filesystem path (like "./scripts/foo.py") won't exist inside the remote container.
- hf jobs uv run CLI: local file paths do work (the CLI uploads your script).
Common mistake with hf_jobs() MCP tool:
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
Correct patterns with hf_jobs() MCP tool:
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})
# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
# ✅ URL from GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py"})
CLI equivalent (local paths supported):
hf jobs uv run ./scripts/foo.py -- --your --args
Add extra dependencies beyond what's in the PEP 723 header:
hf_jobs("uv", {
"script": "inference.py",
"dependencies": ["transformers", "torch>=2.0"], # Extra deps
"flavor": "a10g-small"
})
Python API:
from huggingface_hub import run_uv_job
run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
Run jobs with custom Docker images and commands.
MCP Tool:
hf_jobs("run", {
"image": "python:3.12",
"command": ["python", "-c", "print('Hello from HF Jobs!')"],
"flavor": "cpu-basic",
"timeout": "30m"
})
CLI Equivalent:
hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
Python API:
from huggingface_hub import run_job
run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
Benefits: Full Docker control, use pre-built images, run any command
When to use: Need specific Docker images, non-Python workloads, complex environments
Example with GPU:
hf_jobs("run", {
"image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
"command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
"flavor": "a10g-small",
"timeout": "1h"
})
Using Hugging Face Spaces as Images:
You can use Docker images from HF Spaces:
hf_jobs("run", {
"image": "hf.co/spaces/lhoestq/duckdb", # Space as Docker image
"command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
"flavor": "cpu-basic"
})
CLI:
hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
The uv-scripts organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
Popular collections: OCR, classification, synthetic-data, vLLM, dataset-creation
Reference: HF Jobs Hardware Docs (updated 07/2025)
| Workload Type | Recommended Hardware | Use Case |
|---|---|---|
| Data processing, testing | cpu-basic, cpu-upgrade | Lightweight tasks |
| Small models, demos | t4-small | <1B models, quick tests |
| Medium models | t4-medium, l4x1 | 1-7B models |
| Large models, production | a10g-small, a10g-large | 7-13B models |
| Very large models | a100-large | 13B+ models |
| Batch inference | a10g-large, a100-large | High-throughput |
| Multi-GPU workloads | l4x4, a10g-largex2, a10g-largex4 | Parallel/large models |
| TPU workloads | v5e-1x1, v5e-2x2, v5e-2x4 | JAX/Flax, TPU-optimized |
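As a rough rule of thumb, the size-to-flavor mapping in the table above can be encoded as a small helper. The thresholds are illustrative; real requirements also depend on precision, batch size, and context length:

```python
def pick_flavor(model_params_b: float) -> str:
    """Suggest a Jobs flavor for a model of the given size
    (in billions of parameters), following the table above."""
    if model_params_b < 1:
        return "t4-small"      # small models, demos
    if model_params_b < 7:
        return "l4x1"          # medium models
    if model_params_b <= 13:
        return "a10g-large"    # large models, production
    return "a100-large"        # very large models
```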
All Available Flavors:
- CPU: cpu-basic, cpu-upgrade
- GPU: t4-small, t4-medium, l4x1, l4x4, a10g-small, a10g-large, a10g-largex2, a10g-largex4, a100-large
- TPU: v5e-1x1, v5e-2x2, v5e-2x4
Guidelines:
- See references/hardware_guide.md for detailed specifications
⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS
The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, ALL WORK IS LOST.
1. Push to Hugging Face Hub (Recommended)
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])
# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])
# Push artifacts
api.upload_file(
path_or_fileobj="results.json",
path_in_repo="results.json",
repo_id="username/results",
token=os.environ["HF_TOKEN"]
)
2. Use External Storage
# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
3. Send Results via API
# POST results to your API
import requests
requests.post("https://your-api.com/results", json=results)
In job submission:
{
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
}
In script:
import os
from huggingface_hub import HfApi
# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))
# Push your results
api.upload_file(...)
Before submitting:
- Include secrets={"HF_TOKEN": "$HF_TOKEN"} if using Hub
See: references/hub_saving.md for detailed Hub persistence guide
⚠️ DEFAULT: 30 MINUTES
Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.
MCP Tool:
{
"timeout": "2h" # 2 hours
}
Supported formats:
- Integer seconds (e.g., 300 = 5 minutes)
- Strings: "5m" (minutes), "2h" (hours), "1d" (days)
- Examples: "90m", "2h", "1.5h", 300, "1d"
Python API:
from huggingface_hub import run_job, run_uv_job
run_job(image="python:3.12", command=[...], timeout="2h")
run_uv_job("script.py", timeout=7200) # 2 hours in seconds
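The accepted formats above can be normalized with a small helper. This is an illustrative sketch, not the library's actual parser:

```python
def to_seconds(timeout) -> int:
    """Normalize a Jobs timeout (int seconds, or a string like
    '90m', '2h', '1.5h', '1d') to whole seconds."""
    if isinstance(timeout, (int, float)):
        return int(timeout)
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value, unit = timeout[:-1], timeout[-1]
    return int(float(value) * units[unit])
```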
| Scenario | Recommended | Notes |
|---|---|---|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |
Always add 20-30% buffer for setup, network delays, and cleanup.
On timeout: Job killed immediately, all unsaved progress lost
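Applying the 20-30% buffer can look like this (a hypothetical helper, rounding up to whole minutes so the result maps cleanly onto a "<N>m" timeout string):

```python
import math

def buffered_timeout(estimated_s: int, buffer_frac: float = 0.25) -> int:
    """Pad an estimated runtime by a safety buffer, rounded up
    to a whole minute."""
    padded = estimated_s * (1 + buffer_frac)
    return math.ceil(padded / 60) * 60

# A 60-minute estimate becomes a 75-minute timeout request.
```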
General guidelines:
Total Cost = (Hours of runtime) × (Cost per hour)
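The formula translates directly into code; the rate used below is a placeholder for illustration, not actual HF pricing:

```python
def job_cost(runtime_hours: float, rate_per_hour: float) -> float:
    """Total Cost = (hours of runtime) x (cost per hour)."""
    return round(runtime_hours * rate_per_hour, 2)

# e.g. a 2.5 h batch job on a hypothetical $1.20/h flavor:
# job_cost(2.5, 1.20) -> 3.0
```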
Example calculations:
Quick test:
Data processing:
Batch inference:
Cost optimization tips:
MCP Tool:
# List all jobs
hf_jobs("ps")
# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
Python API:
from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job
# List your jobs
jobs = list_jobs()
# List running jobs only
running = [j for j in list_jobs() if j.status.stage == "RUNNING"]
# Inspect specific job
job_info = inspect_job(job_id="your-job-id")
# View logs
for log in fetch_job_logs(job_id="your-job-id"):
print(log)
# Cancel a job
cancel_job(job_id="your-job-id")
CLI:
hf jobs ps # List jobs
hf jobs logs <job-id> # View logs
hf jobs cancel <job-id> # Cancel job
Remember: Wait for user to request status checks. Avoid polling repeatedly.
After submission, jobs have monitoring URLs:
https://huggingface.co/jobs/username/job-id
View logs, status, and details in the browser.
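For scripting convenience, the URL pattern above can be built with a trivial (hypothetical) helper:

```python
def job_url(username: str, job_id: str) -> str:
    """Build the browser monitoring URL for a submitted job."""
    return f"https://huggingface.co/jobs/{username}/{job_id}"
```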
import time
from huggingface_hub import inspect_job, run_job
# Run multiple jobs
jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]
# Wait for all to complete
for job in jobs:
while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
time.sleep(10)
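The wait loop above polls forever if a job hangs. A bounded variant (a sketch, with the status lookup injected as a callable) gives up after a deadline:

```python
import time

def wait_for(get_stage, terminal=("COMPLETED", "ERROR"),
             poll_s=10, max_wait_s=4 * 3600):
    """Poll get_stage() until it returns a terminal stage or the
    deadline passes. Returns the final stage, or None on deadline."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        stage = get_stage()
        if stage in terminal:
            return stage
        time.sleep(poll_s)
    return None
```

Use it as `wait_for(lambda: inspect_job(job_id=job.id).status.stage)`.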
Run jobs on a schedule using CRON expressions or predefined schedules.
MCP Tool:
# Schedule a UV script that runs every hour
hf_jobs("scheduled uv", {
"script": "your_script.py",
"schedule": "@hourly",
"flavor": "cpu-basic"
})
# Schedule with CRON syntax
hf_jobs("scheduled uv", {
"script": "your_script.py",
"schedule": "0 9 * * 1", # 9 AM every Monday
"flavor": "cpu-basic"
})
# Schedule a Docker-based job
hf_jobs("scheduled run", {
"image": "python:3.12",
"command": ["python", "-c", "print('Scheduled!')"],
"schedule": "@daily",
"flavor": "cpu-basic"
})
Python API:
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job
# Schedule a Docker job
create_scheduled_job(
image="python:3.12",
command=["python", "-c", "print('Running on schedule!')"],
schedule="@hourly"
)
# Schedule a UV script
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")
# Schedule with GPU
create_scheduled_uv_job(
"ml_inference.py",
schedule="0 */6 * * *", # Every 6 hours
flavor="a10g-small"
)
Available schedules:
- @annually, @yearly - Once per year
- @monthly - Once per month
- @weekly - Once per week
- @daily - Once per day
- @hourly - Once per hour
- Custom CRON expressions (e.g., "*/5 * * * *" for every 5 minutes)
Manage scheduled jobs:
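The predefined names are conventional cron aliases; assuming HF expands them the standard way, the mapping looks like:

```python
# Conventional cron equivalents of the predefined schedules
# (standard cron aliases; assumed to match HF's expansion).
PREDEFINED = {
    "@hourly":   "0 * * * *",   # minute 0 of every hour
    "@daily":    "0 0 * * *",   # midnight every day
    "@weekly":   "0 0 * * 0",   # midnight on Sunday
    "@monthly":  "0 0 1 * *",   # midnight on the 1st
    "@annually": "0 0 1 1 *",   # midnight on Jan 1
    "@yearly":   "0 0 1 1 *",
}

def expand_schedule(schedule: str) -> str:
    """Return the raw CRON expression for a schedule string;
    custom expressions pass through unchanged."""
    return PREDEFINED.get(schedule, schedule)
```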
# MCP Tool
hf_jobs("scheduled ps") # List scheduled jobs
hf_jobs("scheduled inspect", {"job_id": "..."}) # Inspect details
hf_jobs("scheduled suspend", {"job_id": "..."}) # Pause
hf_jobs("scheduled resume", {"job_id": "..."}) # Resume
hf_jobs("scheduled delete", {"job_id": "..."}) # Delete
Python API for management:
from huggingface_hub import (
list_scheduled_jobs,
inspect_scheduled_job,
suspend_scheduled_job,
resume_scheduled_job,
delete_scheduled_job
)
# List all scheduled jobs
scheduled = list_scheduled_jobs()
# Inspect a scheduled job
info = inspect_scheduled_job(scheduled_job_id)
# Suspend (pause) a scheduled job
suspend_scheduled_job(scheduled_job_id)
# Resume a scheduled job
resume_scheduled_job(scheduled_job_id)
# Delete a scheduled job
delete_scheduled_job(scheduled_job_id)
Trigger jobs automatically when changes happen in Hugging Face repositories.
Python API:
from huggingface_hub import create_webhook
# Create webhook that triggers a job when a repo changes
webhook = create_webhook(
job_id=job.id,
watched=[
{"type": "user", "name": "your-username"},
{"type": "org", "name": "your-org-name"}
],
domains=["repo", "discussion"],
secret="your-secret"
)
How it works:
- The job receives the event payload in the WEBHOOK_PAYLOAD environment variable
Use cases:
Access webhook payload in script:
import os
import json
payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
print(f"Event type: {payload.get('event', {}).get('action')}")
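A script can branch on the payload instead of just printing it. The action values below are illustrative, not an exhaustive list:

```python
import json

def classify(raw: str) -> str:
    """Map a webhook payload string to a job action.
    The 'create'/'update' routing here is illustrative."""
    payload = json.loads(raw or "{}")
    action = payload.get("event", {}).get("action")
    if action == "create":
        return "run-pipeline"
    if action == "update":
        return "refresh"
    return "ignore"

# In a job: classify(os.environ.get("WEBHOOK_PAYLOAD", ""))
```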
See Webhooks Documentation for more details.
This repository ships ready-to-run UV scripts in hf-jobs/scripts/. Prefer using them instead of inventing new templates.
scripts/generate-responses.pyWhat it does: loads a Hub dataset (chat messages or a prompt column), applies a model chat template, generates responses with vLLM, and pushes the output dataset + dataset card back to the Hub.
Requires: GPU + write token (it pushes a dataset).
from pathlib import Path
script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"username/input-dataset",
"username/output-dataset",
"--messages-column", "messages",
"--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
"--temperature", "0.7",
"--top-p", "0.8",
"--max-tokens", "2048",
],
"flavor": "a10g-large",
"timeout": "4h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
scripts/cot-self-instruct.pyWhat it does: generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then pushes the generated dataset + dataset card to the Hub.
Requires: GPU + write token (it pushes a dataset).
from pathlib import Path
script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"--seed-dataset", "davanstrien/s1k-reasoning",
"--output-dataset", "username/synthetic-math",
"--task-type", "reasoning",
"--num-samples", "5000",
"--filter-method", "answer-consistency",
],
"flavor": "l4x4",
"timeout": "8h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
scripts/finepdfs-stats.pyWhat it does: scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.
Requires: CPU is often enough; token needed only if you pass --output-repo (upload).
from pathlib import Path
script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"--limit", "10000",
"--show-plan",
"--output-repo", "username/finepdfs-temporal-stats",
],
"flavor": "cpu-upgrade",
"timeout": "2h",
"env": {"HF_XET_HIGH_PERFORMANCE": "1"},
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
Fix:
Fix:
- Increase the timeout, e.g. "timeout": "3h"
Fix:
- Pass secrets={"HF_TOKEN": "$HF_TOKEN"}
- Verify with assert "HF_TOKEN" in os.environ
Fix: Add to PEP 723 header:
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
Fix:
- Verify hf_whoami() works locally
- Include secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
- Re-login with hf auth login
Common issues:
See: references/troubleshooting.md for complete troubleshooting guide
- references/token_usage.md - Complete token usage guide
- references/hardware_guide.md - Hardware specs and selection
- references/hub_saving.md - Hub persistence guide
- references/troubleshooting.md - Common issues and solutions
- scripts/generate-responses.py - vLLM batch generation: dataset → responses → push to Hub
- scripts/cot-self-instruct.py - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- scripts/finepdfs-stats.py - Polars streaming stats over finepdfs-edu parquet on Hub (optional push)
Official Documentation:
Related Tools:
- script parameter accepts Python code directly; no file saving required unless the user requests it
- Use secrets={"HF_TOKEN": "$HF_TOKEN"} for Hub operations
- Prefer hf_jobs("uv", {...}) with inline scripts for Python workloads

| Operation | MCP Tool | CLI | Python API |
|---|---|---|---|
| Run UV script | hf_jobs("uv", {...}) | hf jobs uv run script.py | run_uv_job("script.py") |
| Run Docker job | hf_jobs("run", {...}) | hf jobs run image cmd | run_job(image, command) |
| List jobs | hf_jobs("ps") | hf jobs ps | list_jobs() |
| View logs | hf_jobs("logs", {...}) | hf jobs logs <id> | fetch_job_logs(job_id) |
| Cancel job | hf_jobs("cancel", {...}) | hf jobs cancel <id> | cancel_job(job_id) |
| Schedule UV | hf_jobs("scheduled uv", {...}) | - | create_scheduled_uv_job() |
| Schedule Docker | hf_jobs("scheduled run", {...}) | - | create_scheduled_job() |