
llama.cpp: pure C/C++ LLM inference engine, optimized for CPU and non-NVIDIA hardware, suited to edge deployment

llama-cpp by orchestra-research/ai-research-skills

105 weekly installs

6,500 GitHub Stars

Install command

npx skills add https://github.com/orchestra-research/ai-research-skills --skill llama-cpp

AI/Machine Learning, High-Performance Computing, C++


llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when:

Running on CPU-only machines
Deploying on Apple Silicon (M1/M2/M3/M4)
Using AMD or Intel GPUs (no CUDA)
Edge deployment (Raspberry Pi, embedded systems)
Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:

Have NVIDIA GPUs (A100/H100)
Need maximum throughput (100K+ tok/s)
Running in datacenter with CUDA

Use vLLM instead when:

Have NVIDIA GPUs
Need Python-first API
Want PagedAttention

Quick start

Installation

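# macOS/Linux
brew install llama.cpp

# Or build from source (recent llama.cpp releases build with CMake; the Makefile build has been retired)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries are written to build/bin/; add that directory to PATH or copy
# llama-cli and llama-server next to your models to run the ./llama-cli examples as written

# With Metal (Apple Silicon): enabled by default on macOS, no extra option needed
cmake -B build

# With CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON

# With ROCm (AMD)
cmake -B build -DGGML_HIP=ON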
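
Recent llama.cpp releases select GPU backends through CMake options rather than Makefile variables. The sketch below is one possible way to choose the option for the current machine. The helper names `pick_backend_flag` and `detect`, and the heuristic of looking for `nvidia-smi` or `rocminfo` on PATH, are assumptions made for illustration and are not part of llama.cpp.

```shell
# Hypothetical helper: print the CMake option for a GPU backend.
# Arguments: OS name (as printed by `uname -s`), then "yes"/"no" for
# whether an NVIDIA and an AMD ROCm toolchain were detected.
pick_backend_flag() {
    os=$1; has_nvidia=$2; has_rocm=$3
    if [ "$os" = "Darwin" ]; then
        echo ""                    # Metal is on by default on macOS
    elif [ "$has_nvidia" = "yes" ]; then
        echo "-DGGML_CUDA=ON"
    elif [ "$has_rocm" = "yes" ]; then
        echo "-DGGML_HIP=ON"
    else
        echo ""                    # CPU-only build
    fi
}

# Crude detection: treat a tool on PATH as evidence of the toolchain.
detect() { command -v "$1" >/dev/null 2>&1 && echo yes || echo no; }

flag=$(pick_backend_flag "$(uname -s)" "$(detect nvidia-smi)" "$(detect rocminfo)")
echo "cmake -B build $flag"
```

The script only prints the configure command, so it can be reviewed before anything is built.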

Download model

# Download from HuggingFace (GGUF format)
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/

Run inference

# Simple chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    -p "Explain quantum computing" \
    -n 256  # Max tokens

# Interactive chat
./llama-cli \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --interactive

Server mode

# Start OpenAI-compatible server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
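
Loading a model can take a while, so a request sent immediately after starting the server may fail. The sketch below polls the server's `/health` endpoint until it responds successfully. `wait_for_server` is a hypothetical helper name, and the URL and retry count are example values.

```shell
# Poll the server's /health endpoint until it answers or attempts run out.
# wait_for_server is a hypothetical helper; URL and retry count are examples.
wait_for_server() {
    base_url=$1; attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP errors (e.g. while the model is still loading)
        if curl -sf "$base_url/health" >/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage: wait_for_server http://localhost:8080 60 && echo "server ready"
```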

Quantization formats

GGUF format overview

Format	Bits	Size (7B)	Speed	Quality	Use Case
Q4_K_M	4.5	4.1 GB	Fast	Good	Recommended default
Q4_K_S	4.3	3.9 GB	Faster	Lower	Speed critical
Q5_K_M	5.5	4.8 GB	Medium	Better	Quality critical
Q6_K	6.5	5.5 GB	Slower	Best	Maximum quality
Q8_0	8.0	7.0 GB	Slow	Excellent	Minimal degradation
Q2_K	2.5	2.7 GB	Fastest	Poor	Testing only

Choosing quantization

# General use (balanced)
Q4_K_M  # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S  # Lower bits to fit in memory
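
To check whether a quantization will fit in memory, file size can be approximated from the parameter count and the bits per weight. The sketch below is a rough estimate only: the helper name `estimate_gguf_gb` is hypothetical, and the formula ignores metadata and tensors kept at higher precision, which is why the table's 4.1 GB for Q4_K_M sits slightly above the figure it gives.

```shell
# Rough size estimate: parameters (billions) x bits per weight / 8 = gigabytes.
# Ignores GGUF metadata and tensors stored at higher precision, so real files
# run somewhat larger than this figure.
estimate_gguf_gb() {
    awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

estimate_gguf_gb 7 4.5     # 7B model at Q4_K_M's listed 4.5 bits, prints 3.9
estimate_gguf_gb 70 4.5    # 70B model at the same bit width, prints 39.4
```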

Hardware acceleration

Apple Silicon (Metal)

# Build with Metal (enabled by default on macOS in the CMake build)
cmake -B build && cmake --build build --config Release

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)

NVIDIA GPUs (CUDA)

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
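
Picking a value for `-ngl` in hybrid mode is mostly a matter of dividing the available VRAM by the size of one layer. The sketch below does that arithmetic. The helper name `layers_that_fit` and the example figures (a roughly 40 GB file, 80 layers, a 24 GB card with 2 GB held back) are assumptions for illustration. It treats all layers as equal in size and ignores the KV cache, so the result is a starting point rather than a guarantee.

```shell
# Hypothetical helper: estimate a value for -ngl.
# Args: model file size (GB), total layer count, VRAM (GB), VRAM to keep free (GB).
# Assumes all layers are the same size and ignores the KV cache, so treat the
# result as a starting point and lower it if the load fails.
layers_that_fit() {
    awk -v size="$1" -v layers="$2" -v vram="$3" -v reserve="$4" 'BEGIN {
        per_layer = size / layers
        n = int((vram - reserve) / per_layer)
        if (n > layers) n = layers   # cannot offload more layers than exist
        if (n < 0) n = 0
        print n
    }'
}

layers_that_fit 40 80 24 2   # prints 44
```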

AMD GPUs (ROCm)

# Build with ROCm (HIP)
cmake -B build -DGGML_HIP=ON && cmake --build build --config Release

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999

Common patterns

Batch processing

# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
    -m model.gguf \
    --batch-size 512 \
    -n 100
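
When each line of the file should be treated as a separate prompt, a more explicit alternative is to loop over the file and invoke the CLI once per line. The sketch below assumes one prompt per line; `run_prompts` and the `LLAMA_CLI` variable are hypothetical names, and depending on the llama.cpp version extra flags may be needed to make the CLI exit after a single completion.

```shell
# Hypothetical helper: run each non-empty line of a file as its own prompt.
# LLAMA_CLI can point at the binary; it defaults to ./llama-cli.
run_prompts() {
    model=$1; prompts_file=$2
    while IFS= read -r prompt || [ -n "$prompt" ]; do
        [ -z "$prompt" ] && continue
        # </dev/null keeps the CLI from swallowing the rest of the prompt file
        "${LLAMA_CLI:-./llama-cli}" -m "$model" -p "$prompt" -n 100 </dev/null
    done < "$prompts_file"
}

# Usage: run_prompts model.gguf prompts.txt > outputs.txt
```

This reloads the model for every prompt, which is simple but slow. For larger batches, sending requests to a running `llama-server` avoids the repeated load.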

Constrained generation

# JSON output with grammar
./llama-cli \
    -m model.gguf \
    -p "Generate a person: " \
    --grammar-file grammars/json.gbnf

# Outputs valid JSON only

Context size

# Increase context (the default depends on the llama.cpp version; older builds used 512)
./llama-cli \
    -m model.gguf \
    -c 4096  # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context
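
A larger context costs memory because the KV cache grows linearly with it. The sketch below estimates the cache size from the model's dimensions. The helper name `kv_cache_mib` is hypothetical, and the example assumes Llama 2 7B's shape (32 layers, 32 KV heads, head dimension 128) with a 16-bit cache; other models and cache types will differ.

```shell
# KV cache size in MiB: 2 (keys and values) x layers x context x KV heads
# x head dimension x bytes per element.
kv_cache_mib() {
    n_layers=$1; n_ctx=$2; n_kv_heads=$3; head_dim=$4; bytes_per_elem=$5
    echo $(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1048576 ))
}

# Llama 2 7B (32 layers, 32 KV heads, head dimension 128) with a 16-bit cache
kv_cache_mib 32 4096 32 128 2     # prints 2048
kv_cache_mib 32 32768 32 128 2    # prints 16384
```

Under these assumptions, moving from a 4K to a 32K context takes the cache from about 2 GiB to about 16 GiB, on top of the model weights.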

Performance benchmarks

CPU performance (Llama 2-7B Q4_K_M)

CPU	Threads	Speed	Cost
Apple M3 Max	16	50 tok/s	$0 (local)
AMD Ryzen 9 7950X	32	35 tok/s	$0.50/hour
Intel i9-13900K	32	30 tok/s	$0.40/hour
AWS c7i.16xlarge	64	40 tok/s	$2.88/hour

GPU acceleration (Llama 2-7B Q4_K_M)

GPU	Speed	vs CPU	Cost
NVIDIA RTX 4090	120 tok/s	3-4×	$0 (local)
NVIDIA A10	80 tok/s	2-3×	$1.00/hour
AMD MI250	70 tok/s	2×	$2.00/hour
Apple M3 Max (Metal)	50 tok/s	~Same	$0 (local)
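
The speed and cost columns can be combined into a cost per million generated tokens, which makes the rows easier to compare. The sketch below does the conversion for a single stream running continuously at the listed speed; `cost_per_million` is a hypothetical helper name, and the inputs are taken from the tables above.

```shell
# Cost in USD per one million tokens, given tokens/second and price per hour.
# Assumes one stream generating continuously at the listed speed.
cost_per_million() {
    awk -v tps="$1" -v usd_per_hour="$2" 'BEGIN {
        printf "%.2f\n", usd_per_hour / (tps * 3600) * 1000000
    }'
}

cost_per_million 40 2.88   # AWS c7i.16xlarge row, prints 20.00
cost_per_million 80 1.00   # NVIDIA A10 row, prints 3.47
```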

Supported models

LLaMA family :

Llama 2 (7B, 13B, 70B)
Llama 3 (8B, 70B, 405B)
Code Llama

Mistral family :

Mistral 7B
Mixtral 8x7B, 8x22B

Other :

Falcon, BLOOM, GPT-J
Phi-3, Gemma, Qwen
LLaVA (vision); Whisper (audio) is served by the separate whisper.cpp project rather than llama.cpp

Find models : https://huggingface.co/models?library=gguf

References

Quantization Guide - GGUF formats, conversion, quality comparison
Server Deployment - API endpoints, Docker, monitoring
Optimization - Performance tuning, hybrid CPU+GPU

Resources

GitHub : https://github.com/ggerganov/llama.cpp
Models : https://huggingface.co/models?library=gguf
Discord : https://discord.gg/llama-cpp

Repository : orchestra-resea…h-skills
GitHub Stars : 5.6K
First Seen : Feb 7, 2026
Security Audits : Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on : codex (53), cursor (53), opencode (53), gemini-cli (52), claude-code (51), github-copilot (51)
