llmfit-hardware-model-matcher by aradotso/trending-skills
npx skills add https://github.com/aradotso/trending-skills --skill llmfit-hardware-model-matcher
Skill by ara.so, from the Daily 2026 Skills collection.
llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions, telling you exactly which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).
brew install llmfit
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
# No sudo; installs to ~/.local/bin
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
scoop install llmfit
docker run ghcr.io/alexsjones/llmfit
# With jq for scripting
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# Binary at target/release/llmfit
Fit levels: perfect (runs great), good (runs well), marginal (runs but tight), too_tight (won't run).
llmfit
llmfit --cli
llmfit system
llmfit --json system # JSON output
llmfit list
llmfit search "llama 8b"
llmfit search "mistral"
llmfit search "qwen coding"
# All runnable models, ranked by fit
llmfit fit
# Only perfect fits, top 5
llmfit fit --perfect -n 5
# JSON output
llmfit --json fit -n 10
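The four fit levels form an ordering, which is handy when post-filtering JSON output in scripts. A minimal sketch — the `FIT_ORDER` ranking and the sample records are illustrative assumptions, not llmfit's internal representation:

```python
# Hypothetical ranking of llmfit's fit levels, best (0) to worst (3).
FIT_ORDER = {"perfect": 0, "good": 1, "marginal": 2, "too_tight": 3}

def meets_min_fit(fit: str, min_fit: str = "good") -> bool:
    """True if `fit` is at least as good as `min_fit`."""
    return FIT_ORDER[fit] <= FIT_ORDER[min_fit]

# Sample records shaped like `llmfit --json fit` output (illustrative only).
models = [
    {"name": "Mistral-7B", "fit": "perfect"},
    {"name": "Llama-3.1-70B", "fit": "too_tight"},
    {"name": "Qwen3-4B", "fit": "good"},
]
runnable = [m["name"] for m in models if meets_min_fit(m["fit"], "good")]
print(runnable)  # ['Mistral-7B', 'Qwen3-4B']
```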
llmfit info "Mistral-7B"
llmfit info "Llama-3.1-70B"
# Top 5 recommendations (JSON by default)
llmfit recommend --json --limit 5
# Filter by use case: general, coding, reasoning, chat, multimodal, embedding
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 5
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
llmfit serve
llmfit serve --host 0.0.0.0 --port 8787
When autodetection fails (VMs, a broken nvidia-smi, passthrough setups):
# Override GPU VRAM
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json
# Megabytes
llmfit --memory=32000M
# Works with any subcommand
llmfit --memory=16G info "Llama-3.1-70B"
Accepted suffixes: G/GB/GiB, M/MB/MiB, T/TB/TiB (case-insensitive).
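The accepted suffixes can be sketched as a small parser. This is only an illustration of the grammar above, not llmfit's actual parsing code; in particular, treating G/GB as decimal and GiB as binary is an assumption:

```python
import re

# Multipliers in bytes. Decimal for G/GB, binary for GiB is an assumption
# for illustration -- llmfit may normalize these differently.
SUFFIXES = {
    "M": 10**6, "MB": 10**6, "MIB": 2**20,
    "G": 10**9, "GB": 10**9, "GIB": 2**30,
    "T": 10**12, "TB": 10**12, "TIB": 2**40,
}

def parse_memory(value: str) -> int:
    """Parse strings like '32G', '32000M', '24GiB' into bytes (case-insensitive)."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", value.strip())
    if not m:
        raise ValueError(f"unrecognized memory value: {value!r}")
    number, suffix = float(m.group(1)), m.group(2).upper()
    if suffix not in SUFFIXES:
        raise ValueError(f"unrecognized suffix: {suffix}")
    return int(number * SUFFIXES[suffix])

print(parse_memory("32G"))    # 32000000000
print(parse_memory("24GiB"))  # 25769803776
```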
# Estimate memory fit at 4K context
llmfit --max-context 4096 --cli
# With subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5
# Environment-variable alternative
export OLLAMA_CONTEXT_LENGTH=8192
llmfit recommend --json
Start the server:
llmfit serve --host 0.0.0.0 --port 8787
# Health check
curl http://localhost:8787/health
# Node hardware info
curl http://localhost:8787/api/v1/system
# Full model list with filters
curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"
# Top runnable models for this node (the key scheduling endpoint)
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
# Search by model name/provider
curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"
Query parameters for /models and /models/top:

| Param | Values | Description |
|---|---|---|
| limit / n | integer | Max rows returned |
| min_fit | perfect / good / marginal | Minimum fit level to include |
| perfect | true / false | Only perfect fits |
| runtime | any / mlx / llamacpp / … | Filter by runtime |
| use_case | general / coding / reasoning / chat / multimodal / embedding | Filter by use case |
| provider | string | Substring match on provider |
| search | string | Free-text search across name/provider/size/use-case |
| sort | score / tps / … | Sort order |
| include_too_tight | true / false | Include too_tight models |
| max_context | integer | Per-request context cap |
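In scripts, the query strings above can be assembled with the standard library instead of hand-concatenating. This is plain `urllib` against the endpoint paths shown in the curl examples, not an llmfit-specific client:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8787"

def models_top_url(**params) -> str:
    """Build a /api/v1/models/top URL from keyword filters, skipping None values."""
    qs = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/api/v1/models/top?{qs}"

url = models_top_url(limit=5, min_fit="good", use_case="coding")
print(url)
# http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding
```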
#!/bin/bash
# Get the top 3 recommended coding models
llmfit recommend --json --use-case coding --limit 3 | \
jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'
#!/bin/bash
MODEL="Mistral-7B"
RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
FIT=$(echo "$RESULT" | jq -r '.fit')
if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
echo "$MODEL will run well (fit: $FIT)"
else
echo "$MODEL may not run well (fit: $FIT)"
fi
#!/bin/bash
# Get the top-fitting model name and pull it with Ollama
TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
echo "Pulling: $TOP_MODEL"
ollama pull "$TOP_MODEL"
import requests
BASE_URL = "http://localhost:8787"
def get_system_info():
resp = requests.get(f"{BASE_URL}/api/v1/system")
return resp.json()
def get_top_models(use_case="coding", limit=5, min_fit="good"):
params = {
"use_case": use_case,
"limit": limit,
"min_fit": min_fit,
"sort": "score"
}
resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
return resp.json()
def search_models(query, runtime="any"):
resp = requests.get(
f"{BASE_URL}/api/v1/models/{query}",
params={"runtime": runtime}
)
return resp.json()
# Example usage
system = get_system_info()
print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")
models = get_top_models(use_case="reasoning", limit=3)
for m in models.get("models", []):
print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
import subprocess
import json
def get_best_model_for_task(use_case: str, min_fit: str = "good") -> dict:
"""使用 llmfit 为给定任务选择最佳模型。"""
result = subprocess.run(
["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
capture_output=True,
text=True
)
data = json.loads(result.stdout)
models = data.get("models", [])
return models[0] if models else None
def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
"""获取运行特定模型所需的硬件要求。"""
result = subprocess.run(
["llmfit", "plan", model_name, "--context", str(context), "--json"],
capture_output=True,
text=True
)
return json.loads(result.stdout)
# Select the best coding model
best = get_best_model_for_task("coding")
if best:
print(f"Best coding model: {best['name']}")
print(f" Quantization: {best['quantization']}")
print(f" Estimated tok/s: {best['tps']}")
print(f" Memory usage: {best['mem_pct']}%")
# Plan hardware for a specific model
plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")
version: "3.8"
services:
llmfit-api:
image: ghcr.io/alexsjones/llmfit
command: serve --host 0.0.0.0 --port 8787
ports:
- "8787:8787"
environment:
- OLLAMA_CONTEXT_LENGTH=8192
devices:
- /dev/nvidia0:/dev/nvidia0 # pass the GPU through
| Key | Action |
|---|---|
| ↑/↓ or j/k | Navigate models |
| / | Search (name, provider, params, use case) |
| Esc/Enter | Exit search |
| Ctrl-U | Clear search |
| f | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| a | Cycle availability: All → GGUF available → Installed |
| s | Cycle sort: Score → Params → Mem% → Context → Date → Use case |
| t | Cycle color theme (auto-saved) |
| v | Visual mode (multi-select for comparison) |
| V | Select mode (column-based filtering) |
| p | Plan mode (what hardware does this model need?) |
| P | Provider filter popup |
| U | Use-case filter popup |
| C | Capability filter popup |
| m | Mark model for comparison |
| c | Compare view (marked vs selected) |
| d | Download model (via the detected runtime) |
| r | Refresh installed models from runtimes |
| Enter | Toggle detail view |
| g/G | Jump to top/bottom |
| q | Quit |
t cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox
The theme is saved to ~/.config/llmfit/theme
| GPU Vendor | Detection Method |
|---|---|
| NVIDIA | nvidia-smi (multi-GPU, aggregates VRAM) |
| AMD | rocm-smi |
| Intel Arc | sysfs (discrete) / lspci (integrated) |
| Apple Silicon | system_profiler (unified memory = VRAM) |
| Ascend | npu-smi |
llmfit fit --perfect -n 10
# Or interactively:
llmfit
# press 'f' to filter to Perfect fit
llmfit recommend --json --use-case coding | jq '.models[]'
# Or override manually if detection fails
llmfit --memory=24G recommend --json --use-case coding
llmfit info "Llama-3.1-70B"
# Plan what hardware you'd need
llmfit plan "Llama-3.1-70B" --context 4096 --json
llmfit
# press 'a' to cycle to the Installed filter
# or
llmfit fit -n 20 # then press 'i' in the TUI for installed-first
MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
ollama serve &
ollama run "$MODEL"
# Check a node and get the top 3 good-or-better reasoning models
curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
jq '.models[].name'
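For multi-node scheduling, the per-node `/models/top` responses can be merged into a single placement decision. A sketch assuming each response has the `{"models": [{"name": ..., "score": ...}]}` shape shown elsewhere on this page; the HTTP fetching is stubbed out, and in practice you would `requests.get` each node's endpoint:

```python
def best_placement(node_responses: dict) -> tuple:
    """Pick the (node, model, score) triple with the highest score across nodes.

    `node_responses` maps a node base URL to its parsed /models/top JSON.
    """
    best = None
    for node, payload in node_responses.items():
        for model in payload.get("models", []):
            if best is None or model["score"] > best[2]:
                best = (node, model["name"], model["score"])
    return best

# Illustrative responses from two nodes (made-up model scores).
responses = {
    "http://node1:8787": {"models": [{"name": "Mistral-7B", "score": 84}]},
    "http://node2:8787": {"models": [{"name": "Qwen3-4B", "score": 91}]},
}
print(best_placement(responses))  # ('http://node2:8787', 'Qwen3-4B', 91)
```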
GPU not detected / wrong VRAM reported:
# Verify detection
llmfit system
# Override manually
llmfit --memory=24G --cli
nvidia-smi not found but you have an NVIDIA GPU:
# Install the CUDA toolkit or nvidia-utils, then retry
# Or override manually:
llmfit --memory=8G fit --perfect
Models show as too_tight but you have enough RAM:
# llmfit may be using context-inflated estimates; cap the context length
llmfit --max-context 2048 fit --perfect -n 10
REST API: testing the endpoints
# Spawn a server and run the validation suite
python3 scripts/test_api.py --spawn
# Test an already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
Apple Silicon: VRAM shows as system RAM (expected):
# This is correct: Apple Silicon uses unified memory
# llmfit accounts for this automatically
llmfit system # should show backend: Metal
Context-length environment variable:
export OLLAMA_CONTEXT_LENGTH=4096
llmfit recommend --json # uses 4096 as the context cap
Weekly Installs: 375
Repository: https://github.com/AlexsJones/llmfit
GitHub Stars: 10
First Seen: 8 days ago
Security Audits: Gen Agent Trust Hub: Fail / Socket: Warn / Snyk: Fail
Installed on: cursor (372), gemini-cli (372), github-copilot (372), codex (372), amp (372), cline (372)