karpathy-jobs-bls-visualizer by aradotso/trending-skills

npx skills add https://github.com/aradotso/trending-skills --skill karpathy-jobs-bls-visualizer

Skill by ara.so — Daily 2026 Skills collection.
A research tool for visually exploring Bureau of Labor Statistics Occupational Outlook Handbook data across 342 occupations. The interactive treemap sizes rectangles by employment (area) and colors them by any chosen metric: BLS growth outlook, median pay, education requirements, or LLM-scored AI exposure. The pipeline is fully forkable — write a new prompt, re-run scoring, get a new color layer.

Live demo: karpathy.ai/jobs
# Clone the repo
git clone https://github.com/karpathy/jobs
cd jobs

# Install dependencies (uses uv)
uv sync
uv run playwright install chromium
Create a .env file with your OpenRouter API key (required only for LLM scoring):
OPENROUTER_API_KEY=your_openrouter_key_here
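Before kicking off a full 342-occupation scoring run, it can be worth confirming the key is actually visible to the process. A minimal illustrative check (the helper name is mine, not part of the repo; the real score.py may load .env differently, e.g. via python-dotenv):

```python
import os

# Illustrative pre-flight check (not part of the repo): verify the OpenRouter
# key is present in the environment before starting a long scoring run.
def has_openrouter_key(env=os.environ) -> bool:
    return bool(env.get("OPENROUTER_API_KEY"))
```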
Run these in order for a complete fresh build:

# 1. Scrape BLS pages (non-headless Playwright; BLS blocks bots)
#    Results are cached in html/ — only needed once
uv run python scrape.py
# 2. Convert the raw HTML to clean Markdown in pages/
uv run python process.py
# 3. Extract structured fields into occupations.csv
uv run python make_csv.py
# 4. Score AI exposure via LLM (uses the OpenRouter API, saves scores.json)
uv run python score.py
# 5. Merge the CSV and scores into site/data.json for the frontend
uv run python build_site_data.py
# 6. Serve the visualization locally
cd site && python -m http.server 8000
# Open http://localhost:8000
| File | Description |
|---|---|
| occupations.json | Master list of 342 occupations (title, URL, category, slug) |
| occupations.csv | Summary stats: pay, education, job count, growth projections |
| scores.json | AI exposure scores (0–10) + rationales for all 342 occupations |
| prompt.md | All data in one ~45K-token file for pasting into an LLM |
| html/ | Raw HTML pages from BLS (~40MB, source of truth) |
| pages/ | Clean Markdown versions of each occupation page |
| site/index.html | The treemap visualization (single HTML file) |
| site/data.json | Compact merged data consumed by the frontend |
| score.py | LLM scoring pipeline — fork this to write custom prompts |
The most powerful feature: write any scoring prompt, run score.py, and get a new treemap color layer.

Edit the prompt in score.py:

# score.py (simplified structure)
SYSTEM_PROMPT = """
You are evaluating occupations for exposure to humanoid robotics over the next 10 years.
Score each occupation from 0 to 10:
- 0 = no meaningful exposure (e.g., requires fine social judgment, non-physical)
- 5 = moderate exposure (some tasks automatable, but humans still central)
- 10 = high exposure (repetitive physical tasks, predictable environments)
Consider: physical task complexity, environment predictability, dexterity requirements,
cost of robot vs human, regulatory barriers.
Respond ONLY with JSON: {"score": <int 0-10>, "rationale": "<1-2 sentences>"}
"""
# The pipeline reads each occupation's Markdown from pages/,
# sends it to the LLM, and writes the results to scores.json.
# scores.json structure:
{
"software-developers": {
"score": 1,
"rationale": "Software development is digital and cognitive; humanoid robots provide no advantage."
},
"construction-laborers": {
"score": 7,
"rationale": "Physical, repetitive outdoor tasks are targets for humanoid robotics, though unstructured environments remain challenging."
}
// ... 342 occupations total
}
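The request/response handling behind this can be sketched as below. The payload shape follows OpenRouter's OpenAI-compatible chat API, but this is an illustration, not the repo's actual score.py:

```python
import json

# Illustrative sketch of score.py's core logic (the real script may differ):
# build an OpenAI-style chat payload for OpenRouter, then parse the JSON reply.
def build_request(system_prompt: str, page_markdown: str, model: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": page_markdown},
        ],
    }

def parse_reply(raw: str) -> dict:
    result = json.loads(raw)  # the prompt demands JSON-only output
    assert 0 <= result["score"] <= 10, "score out of range"
    return result
```

Sending `build_request(...)` as the POST body to OpenRouter's chat completions endpoint with the Bearer key, then feeding the model's message content to `parse_reply`, yields one `{"score", "rationale"}` record per occupation.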
Then rebuild the site data and serve the result:

uv run python build_site_data.py
cd site && python -m http.server 8000
occupations.json entry:

{
"title": "Software Developers",
"url": "https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm",
"category": "Computer and Information Technology",
"slug": "software-developers"
}
occupations.csv columns:

slug, title, category, median_pay, education, job_count, growth_percent, growth_outlook

Example row:
software-developers, Software Developers, Computer and Information Technology,
130160, Bachelor's degree, 1847900, 17, Much faster than average
site/data.json entry (merged frontend data):

{
"slug": "software-developers",
"title": "Software Developers",
"category": "Computer and Information Technology",
"median_pay": 130160,
"education": "Bachelor's degree",
"job_count": 1847900,
"growth_percent": 17,
"growth_outlook": "Much faster than average",
"ai_score": 9,
"ai_rationale": "AI is deeply transforming software development workflows..."
}
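A quick way to sanity-check records of this shape before they reach the treemap. The field list comes from the entry above; the helper itself is illustrative, not part of the repo:

```python
# Illustrative check: every treemap record needs a slug, an area driver
# (job_count), and at least one color metric (ai_score here).
REQUIRED_FIELDS = {"slug", "title", "category", "job_count", "ai_score"}

def valid_record(rec: dict) -> bool:
    return REQUIRED_FIELDS.issubset(rec)
```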
The visualization (site/index.html) is a single, self-contained HTML file built with D3.js.
| Layer | What it shows |
|---|---|
| BLS Outlook | BLS projected growth category (green = fast growth) |
| Median Pay | Annual median wage (color gradient) |
| Education | Minimum education required |
| Digital AI Exposure | LLM-scored 0–10 AI impact estimate |
<!-- In site/index.html, find the layer toggle buttons -->
<button onclick="setLayer('ai_score')">Digital AI Exposure</button>
<!-- Add a button for your new layer -->
<button onclick="setLayer('robotics_score')">Humanoid Robotics</button>
// In the getColor function, add a case for your new field:
function getColor(d, layer) {
if (layer === 'robotics_score') {
// scores 0-10, blue = low exposure, red = high
return d3.interpolateRdYlBu(1 - d.robotics_score / 10);
}
// ... 现有 case
}
Then update build_site_data.py to include your new score field in data.json.
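The merge step in build_site_data.py boils down to a slug-keyed join. A minimal sketch of attaching a hypothetical robotics_score layer (the field name, dict shape, and helper are illustrative, not the repo's actual code):

```python
# Illustrative merge: attach a new score layer to the frontend records by slug.
# "robotics_score" and the scores-dict shape are hypothetical examples.
def merge_layer(records: list, scores: dict, field: str) -> list:
    for rec in records:
        rec[field] = scores.get(rec["slug"], {}).get("score")
    return records
```

Records without a score get None, which the frontend can render as "no data" rather than a misleading zero.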
Package all 342 occupations plus aggregate stats into a single file for LLM chat:

uv run python make_prompt.py
# Produces prompt.md (~45K tokens)
# Paste into Claude, GPT-4, Gemini, etc. for data-grounded conversation
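Conceptually this just concatenates the per-occupation Markdown into one file. A minimal sketch under that assumption (the real make_prompt.py likely also adds headers and the aggregate stats):

```python
from pathlib import Path

# Illustrative sketch: join every page's Markdown into a single prompt file.
def build_prompt(pages_dir: str, out_file: str) -> int:
    parts = [p.read_text() for p in sorted(Path(pages_dir).glob("*.md"))]
    Path(out_file).write_text("\n\n---\n\n".join(parts))
    return len(parts)  # number of occupations packed
```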
BLS blocks automated bots, so scrape.py uses non-headless Playwright (a real, visible browser window):

# scrape.py key behavior
browser = await p.chromium.launch(headless=False)  # must stay visible
# Pages are saved to html/<slug>.html
# Already-scraped pages are skipped (cached)
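The caching described in those comments amounts to a file-existence check; a sketch of the idea (illustrative, not scrape.py verbatim):

```python
from pathlib import Path

# Illustrative cache check: a page is only scraped if html/<slug>.html is absent.
def needs_scrape(slug: str, out_dir: str = "html") -> bool:
    return not (Path(out_dir) / f"{slug}.html").exists()
```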
If scraping fails or is rate-limited:

- The html/ directory in the repo already contains cached pages
- Run the pipeline from process.py onward

To find occupations that are missing scores:

import json

with open("scores.json") as f:
    existing = json.load(f)
with open("occupations.json") as f:
    all_occupations = json.load(f)

# Find the gaps
missing = [o for o in all_occupations if o["slug"] not in existing]
print(f"Missing scores: {len(missing)}")
# Then run score.py with a filter for the missing slugs
Parse a single cached page:

from parse_detail import parse_occupation_page
from pathlib import Path
html = Path("html/software-developers.html").read_text()
data = parse_occupation_page(html)
print(data["median_pay"]) # 例如 130160
print(data["job_count"]) # 例如 1847900
print(data["growth_outlook"]) # 例如 "Much faster than average"
Explore the CSV with pandas:

import pandas as pd
df = pd.read_csv("occupations.csv")
# 薪酬最高的前 10 种职业
top_pay = df.nlargest(10, "median_pay")[["title", "median_pay", "growth_outlook"]]
print(top_pay)
# 筛选:快速增长 + 高薪酬
high_value = df[
(df["growth_percent"] > 10) &
(df["median_pay"] > 80000)
].sort_values("median_pay", ascending=False)
Join the AI scores onto the CSV:

import pandas as pd, json
df = pd.read_csv("occupations.csv")
with open("scores.json") as f:
scores = json.load(f)
df["ai_score"] = df["slug"].map(lambda s: scores.get(s, {}).get("score"))
df["ai_rationale"] = df["slug"].map(lambda s: scores.get(s, {}).get("rationale"))
# High AI exposure, high pay — reshaping, not disappearing
high_exposure_high_pay = df[
(df["ai_score"] >= 8) &
(df["median_pay"] > 100000)
][["title", "median_pay", "ai_score", "growth_outlook"]]
print(high_exposure_high_pay)
Troubleshooting:

playwright install fails:

uv run playwright install --with-deps chromium

BLS scraping blocked / returns empty pages:

- Confirm headless=False in scrape.py (already the default)
- The cached html/ directory in the repo can be used directly

score.py OpenRouter errors:

- Check that OPENROUTER_API_KEY is set in .env
- Change model in score.py to use a different LLM

site/data.json not updating after re-scoring:

# Always rebuild site data after changing scores.json
uv run python build_site_data.py

Treemap shows blank / no data:

- Check that site/data.json exists and is valid JSON
- Serve via python -m http.server (not file:// — CORS blocks local JSON fetch)