
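LLaVA open-source vision-language model: an image understanding and multimodal chat assistant with support for visual question answering and image captioning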

llava by orchestra-research/ai-research-skills

67 weekly installs

4 GitHub Stars

Install command

npx skills add https://github.com/orchestra-research/ai-research-skills --skill llava

AI/Machine Learning, Multimodal Processing, Computer Vision


LLaVA - Large Language and Vision Assistant

Open-source vision-language model for conversational image understanding.

When to use LLaVA

Use when:

Building vision-language chatbots
Visual question answering (VQA)
Image description and captioning
Multi-turn image conversations
Visual instruction following
Document understanding with images

Key metrics:

23,000+ GitHub stars
GPT-4V level capabilities (targeted)
Apache 2.0 License
Multiple model sizes (7B-34B params)

Use alternatives instead:

GPT-4V: Highest quality, API-based
CLIP: Simple zero-shot classification
BLIP-2: Better when you only need captioning
Flamingo: Research, not open-source

Quick start

Installation

# Clone repository
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA

# Install
pip install -e .

Basic usage

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load image
image = Image.open("image.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )

response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
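
The `tokenizer_image_token` step above is what lets a single prompt carry both text and an image placeholder. The following is a simplified sketch of that idea, not the library's implementation: `toy_tokenize` and `tokenize_with_image` are hypothetical stand-ins, and the real function works with the model's tokenizer and also handles the BOS token. The prompt is split on `<image>` and a sentinel index is inserted at each split point, which the model later replaces with image features.

```python
IMAGE_TOKEN = "<image>"    # value of DEFAULT_IMAGE_TOKEN in llava.constants
IMAGE_TOKEN_INDEX = -200   # sentinel id; negative, so never a real vocabulary index


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one positive id per whitespace-separated word."""
    return [sum(ord(c) for c in word) for word in text.split()]


def tokenize_with_image(prompt):
    """Tokenize the text chunks and insert the sentinel wherever <image> appeared."""
    chunks = prompt.split(IMAGE_TOKEN)
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # A split point: this is where the image placeholder sat in the prompt.
            ids.append(IMAGE_TOKEN_INDEX)
        ids.extend(toy_tokenize(chunk))
    return ids


ids = tokenize_with_image("USER: <image>\nWhat is in this image?")
print(ids)
```

Because the sentinel is negative, it cannot collide with a text token id, which makes the image position easy to locate later.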

Available models

Model	Parameters	VRAM	Quality
LLaVA-v1.5-7B	7B	~14 GB	Good
LLaVA-v1.5-13B	13B	~28 GB	Better
LLaVA-v1.6-34B	34B	~70 GB	Best

# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization for lower VRAM (pass load_4bit=True to load_pretrained_model)
load_4bit = True  # Reduces VRAM by ~4×

CLI usage

# Single image query (one-shot evaluation script)
python -m llava.eval.run_llava \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
# Then type questions interactively

Web UI (Gradio)

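# Launch the Gradio interface: controller, web server, then a model worker
# (run each command in its own terminal)
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit  # Optional: reduce VRAM

# Access at http://localhost:7860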

Multi-turn conversations

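# Initialize conversation
# generate() is a placeholder for your own helper that runs the "Basic usage" steps
# (build the prompt with conv.get_prompt(), tokenize, call model.generate, decode)
conv = conv_templates["llava_v1"].copy()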

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # Add previous response
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
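
The pattern above (append the question, append an empty assistant slot, generate, then fill the slot before the next turn) can be sketched without a model. In this minimal sketch, `ToyConversation`, `chat_turn`, and `fake_generate` are hypothetical stand-ins, and the prompt format is simplified rather than the actual `llava_v1` template.

```python
class ToyConversation:
    """Hypothetical stand-in for a conversation template: a list of [role, text] pairs."""

    def __init__(self, roles=("USER", "ASSISTANT")):
        self.roles = roles
        self.messages = []

    def append_message(self, role, text):
        self.messages.append([role, text])

    def get_prompt(self):
        # An open slot (text is None) renders as "ROLE:" so the model continues from there.
        parts = [
            f"{role}: {text}" if text is not None else f"{role}:"
            for role, text in self.messages
        ]
        return " ".join(parts)


def chat_turn(conv, question, generate_fn):
    """Append a question, generate a reply from the full history, and fill the open slot."""
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    reply = generate_fn(conv.get_prompt())
    conv.messages[-1][1] = reply  # store the reply so the next turn sees it
    return reply


def fake_generate(prompt):
    # Placeholder backend: reports how many user turns the prompt contains.
    return f"reply to turn {prompt.count('USER:')}"


conv = ToyConversation()
chat_turn(conv, "<image>\nWhat is in this image?", fake_generate)
chat_turn(conv, "What breed is the dog?", fake_generate)
print(conv.get_prompt())
```

Wrapping the three steps in one function removes the easy mistake of forgetting to write the previous response back into the history.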

Common tasks

Image captioning

# ask() is a placeholder for your own helper that wraps the "Basic usage" steps
question = "Describe this image in detail."
response = ask(model, image, question)

Visual question answering

question = "How many people are in the image?"
response = ask(model, image, question)

Object detection (textual)

question = "List all the objects you can see in this image."
response = ask(model, image, question)

Scene understanding

question = "What is happening in this scene?"
response = ask(model, image, question)

Document understanding

question = "What is the main topic of this document?"
response = ask(model, document_image, question)
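
The tasks above differ only in the question text, so they can be driven from a single table of prompts. A minimal sketch, assuming an `ask_fn(image, question)` callable that you supply (here replaced by the hypothetical `fake_ask`, so the example runs without a model); `TASK_PROMPTS` and `run_tasks` are illustrative names, not part of LLaVA.

```python
# Prompts taken from the task examples above, keyed by an illustrative task name.
TASK_PROMPTS = {
    "captioning": "Describe this image in detail.",
    "vqa": "How many people are in the image?",
    "objects": "List all the objects you can see in this image.",
    "scene": "What is happening in this scene?",
}


def run_tasks(image, ask_fn, tasks=TASK_PROMPTS):
    """Ask every task question about one image and collect the answers by task name."""
    return {name: ask_fn(image, question) for name, question in tasks.items()}


def fake_ask(image, question):
    # Placeholder backend; a real one would run LLaVA inference on the image.
    return f"[{image}] answer to: {question}"


results = run_tasks("image.jpg", fake_ask)
for name, answer in results.items():
    print(name, "->", answer)
```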

Training custom model

# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: Visual instruction tuning (665K instruction-following mixture for v1.5)
bash scripts/v1_5/finetune.sh

Quantization (reduce VRAM)

# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # Reduces VRAM ~4×
)

# 8-bit quantization: pass load_8bit=True instead of load_4bit=True
load_8bit=True  # Reduces VRAM ~2×

Best practices

Start with 7B model - Good quality, manageable VRAM
Use 4-bit quantization - Reduces VRAM significantly
GPU required - CPU inference extremely slow
Clear prompts - Specific questions get better answers
Multi-turn conversations - Maintain conversation context
Temperature 0.2-0.7 - Balance creativity/consistency
max_new_tokens 512-1024 - For detailed responses
Batch processing - Process multiple images sequentially
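
To see why a temperature of 0.2 gives more consistent output than 0.7, it helps to look at how temperature is commonly applied during sampling: the logits are divided by the temperature before the softmax. This is a generic sketch of that calculation with made-up logits, not code from LLaVA.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 0.7)
print("T=0.2:", [round(p, 3) for p in low])
print("T=0.7:", [round(p, 3) for p in high])
```

At the lower temperature almost all probability mass sits on the top token, so sampling behaves nearly deterministically; at the higher temperature the alternatives keep a meaningful share.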

Performance

Model	VRAM (FP16)	VRAM (4-bit)	Speed (tokens/s)
7B	~14 GB	~4 GB	~20
13B	~28 GB	~8 GB	~12
34B	~70 GB	~18 GB	~5

Measured on an A100 GPU.
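
The FP16 column follows closely from parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming weights only (no activations, KV cache, or framework overhead) and 1 GB = 1e9 bytes; `weight_memory_gb` is an illustrative helper, not a LLaVA function.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for size in (7, 13, 34):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

The weights-only 4-bit estimates (3.5, 6.5, and 17 GB) come out slightly below the table's figures. That would be consistent with additional runtime overhead, though the source does not state how its numbers were measured.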

Benchmarks

LLaVA achieves competitive scores on:

VQAv2: 78.5%
GQA: 62.0%
MM-Vet: 35.4%
MMBench: 64.3%

Limitations

Hallucinations - May describe things not in image
Spatial reasoning - Struggles with precise locations
Small text - Difficulty reading fine print
Object counting - Imprecise for many objects
VRAM requirements - Need powerful GPU
Inference speed - Slower than CLIP

Integration with frameworks

LangChain

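from langchain.llms.base import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self):
        # Required by the LLM base class; identifies this wrapper
        return "llava"

    def _call(self, prompt, stop=None, run_manager=None, **kwargs):
        # Custom LLaVA inference; ask_llava, model, and image are placeholders
        # for your own helper and objects
        response = ask_llava(model, image, prompt)
        return response

llm = LLaVALLM()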

Gradio App

import gradio as gr

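def chat(message, history, image):
    # ChatInterface passes (message, history) first, then any additional_inputs
    # ask_llava is a placeholder for your own inference helper
    response = ask_llava(model, image, message)
    return response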

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()

Resources

GitHub: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
Paper: https://arxiv.org/abs/2304.08485
Demo: https://llava.hliu.cc
Models: https://huggingface.co/liuhaotian
License: Apache 2.0

Repository: orchestra-resea…h-skills
GitHub Stars: 5.6K
First Seen: Feb 7, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
Installed on: opencode (58), codex (57), cursor (57), gemini-cli (56), claude-code (55), github-copilot (55)
