blip-2-vision-language by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill blip-2-vision-language
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
Use BLIP-2 when:
Key features:
Use alternatives instead:
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow
# Or LAVIS library (Salesforce official)
pip install salesforce-lavis
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
device_map="auto"
)
# Load image
image = Image.open("photo.jpg").convert("RGB")
# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_opt",
model_type="pretrain_opt2.7b",
is_eval=True,
device=device
)
# Process image
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
# Caption
caption = model.generate({"image": image})
print(caption)
# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Q-Former │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Learned Queries (32 queries × 768 dim) │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Cross-Attention with Image Features │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Self-Attention Layers (Transformer) │ │
│ └────────────────────────┬────────────────────────────┘ │
└───────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Frozen Vision Encoder │ Frozen LLM │
│ (ViT-G/14 from EVA-CLIP) │ (OPT or FlanT5) │
└─────────────────────────────────────────────────────────────┘
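The diagram reflects BLIP-2's training recipe: only the Q-Former, its learned query tokens, and the projection into the LLM are updated, while the ViT and the LLM stay frozen. A minimal sketch of reproducing that split with the HuggingFace model for a fine-tuning setup (assuming the standard vision_model / qformer / language_model attribute names on Blip2ForConditionalGeneration):

import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
)

# Freeze the vision encoder and the LLM; leave qformer, query_tokens,
# and language_projection trainable, mirroring the BLIP-2 recipe
for p in model.vision_model.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e6:.0f}M / {total/1e6:.0f}M parameters")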
| Model | LLM Backend | Size | Use Case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
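To sanity-check the table, the learned queries are exposed directly on the HuggingFace model; a quick sketch (assuming the model loaded in the quick start above):

print(model.query_tokens.shape)  # expected: torch.Size([1, 32, 768]) — 32 queries × 768 dim
qformer_params = sum(p.numel() for p in model.qformer.parameters())
print(f"Q-Former parameters: {qformer_params / 1e6:.0f}M")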
from PIL import Image
import torch
# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
"What is shown in this image?",
"Describe the scene.",
"What colors are prominent?",
"Is there a person in this image?"
]
# Process batch
inputs = processor(
images=images,
text=questions,
return_tensors="pt",
padding=True
).to("cuda", torch.float16)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
print(f"Q: {q}\nA: {a}\n")
# Control generation parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=100,
min_length=20,
num_beams=5, # Beam search
no_repeat_ngram_size=2, # Avoid repetition
top_p=0.9, # Nucleus sampling
temperature=0.7, # Creativity
do_sample=True, # Enable sampling
)
# For deterministic output
generated_ids = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=False,
)
# 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-6.7b",
quantization_config=quantization_config,
device_map="auto"
)
# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-flan-t5-xxl",
quantization_config=quantization_config,
device_map="auto"
)
# Using LAVIS for ITM (Image-Text Matching)
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_image_text_matching",
model_type="pretrain",
is_eval=True,
device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # PIL image to score against the text
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")
# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
# Extract image features with Q-Former
from lavis.models import load_model_and_preprocess
model, vis_processors, _ = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # PIL image to embed
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds # Shape: [1, 32, 768]
image_features = features.image_embeds_proj # Projected for matching
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from pathlib import Path
class ImageCaptioner:
def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def caption(self, image_path: str, prompt: str = None) -> str:
image = Image.open(image_path).convert("RGB")
if prompt:
inputs = self.processor(images=image, text=prompt, return_tensors="pt")
else:
inputs = self.processor(images=image, return_tensors="pt")
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def caption_batch(self, image_paths: list, prompt: str = None) -> list:
images = [Image.open(p).convert("RGB") for p in image_paths]
if prompt:
inputs = self.processor(
images=images,
text=[prompt] * len(images),
return_tensors="pt",
padding=True
)
else:
inputs = self.processor(images=images, return_tensors="pt", padding=True)
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(**inputs, max_new_tokens=50)
return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
# Usage
captioner = ImageCaptioner()
# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")
# With prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")
# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
print(f"Image {i+1}: {cap}")
class VisualQA:
def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
self.current_image = None
self.current_inputs = None
def set_image(self, image_path: str):
"""Load image for multiple questions."""
self.current_image = Image.open(image_path).convert("RGB")
def ask(self, question: str) -> str:
"""Ask a question about the current image."""
if self.current_image is None:
raise ValueError("No image set. Call set_image() first.")
# Format question for FlanT5
prompt = f"Question: {question} Answer:"
inputs = self.processor(
images=self.current_image,
text=prompt,
return_tensors="pt"
).to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def ask_multiple(self, questions: list) -> dict:
"""Ask multiple questions about current image."""
return {q: self.ask(q) for q in questions}
# Usage
vqa = VisualQA()
vqa.set_image("scene.jpg")
# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))
# Batch questions
results = vqa.ask_multiple([
"What is the main subject?",
"What colors are dominant?",
"Is this indoors or outdoors?"
])
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess
class ImageSearchEngine:
def __init__(self):
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=self.device
)
self.image_features = []
self.image_paths = []
def index_images(self, image_paths: list):
"""Build index from images."""
self.image_paths = image_paths
for path in image_paths:
image = Image.open(path).convert("RGB")
image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
with torch.no_grad():
features = self.model.extract_features({"image": image}, mode="image")
# Use projected features for matching
self.image_features.append(
features.image_embeds_proj.mean(dim=1).cpu().numpy()
)
self.image_features = np.vstack(self.image_features)
def search(self, query: str, top_k: int = 5) -> list:
"""Search images by text query."""
# Get text features
text = self.txt_processors["eval"](query)
text_input = {"text_input": [text]}
with torch.no_grad():
text_features = self.model.extract_features(text_input, mode="text")
text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
# Compute similarities
similarities = np.dot(self.image_features, text_embeds.T).squeeze()
top_indices = np.argsort(similarities)[::-1][:top_k]
return [(self.image_paths[i], similarities[i]) for i in top_indices]
# Usage
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])
# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
print(f"{path}: {score:.3f}")
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]
# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj # [B, 32, 256] - Projected for matching
features.text_embeds # [B, seq_len, 768] - Text features
features.text_embeds_proj # [B, seq_len, 256] - Projected text ([:, 0] is the CLS token)
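One way to turn these projected features into an image-text score: BLIP-2's contrastive head compares the text embedding against each of the 32 query embeddings and keeps the best match. A sketch, assuming image features and text features extracted with the LAVIS feature extractor as in the search-engine example above:

import torch
import torch.nn.functional as F

img = F.normalize(features.image_embeds_proj, dim=-1)          # [B, 32, 256]
txt = F.normalize(text_features.text_embeds_proj[:, 0], dim=-1)  # [B, 256], CLS token

sim_per_query = torch.bmm(img, txt.unsqueeze(-1))   # [B, 32, 1] cosine similarity per query
score = sim_per_query.max(dim=1).values.squeeze(-1)  # keep the best-matching query
print(score)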
| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
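These figures are rough estimates; actual usage depends on batch size, generation length, and the quantization backend. A quick way to measure the real peak on your GPU (assuming a model and inputs prepared as in the quick start):

import torch

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**inputs, max_new_tokens=50)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated VRAM: {peak_gb:.1f} GB")  # lower bound on what nvidia-smi reports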
# Use Flash Attention if available
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2", # Requires flash-attn
device_map="auto"
)
# Compile model (PyTorch 2.0+)
model = torch.compile(model)
# Image size: the processor resizes inputs to 224x224 (the vision encoder's
# native resolution), so pre-shrinking very large images only saves CPU
# preprocessing time and does not change the model input
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization, smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase question, provide context |
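For the slow-generation case, the fastest configuration is greedy decoding with a tight token budget; a minimal sketch using the same model and inputs as the quick start:

generated_ids = model.generate(
    **inputs,
    max_new_tokens=20,   # short captions/answers rarely need more
    num_beams=1,         # greedy decoding
    do_sample=False,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]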