blip-2-vision-language by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill blip-2-vision-language
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
Use BLIP-2 when:
Key features:
Use alternatives instead:
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow
# Or LAVIS library (Salesforce official)
pip install salesforce-lavis
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
device_map="auto"
)
# Load image
image = Image.open("photo.jpg").convert("RGB")
# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_opt",
model_type="pretrain_opt2.7b",
is_eval=True,
device=device
)
# Process image
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
# Caption
caption = model.generate({"image": image})
print(caption)
# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Q-Former │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Learned Queries (32 queries × 768 dim) │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Cross-Attention with Image Features │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Self-Attention Layers (Transformer) │ │
│ └────────────────────────┬────────────────────────────┘ │
└───────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Frozen Vision Encoder │ Frozen LLM │
│ (ViT-G/14 from EVA-CLIP) │ (OPT or FlanT5) │
└─────────────────────────────────────────────────────────────┘
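The diagram reflects BLIP-2's training recipe: only the Q-Former, its learned query tokens, and the projection into the LLM are updated, while the ViT and the LLM stay frozen. A minimal sketch of reproducing that split with the HuggingFace model for a fine-tuning setup (assuming the standard vision_model / qformer / language_model attribute names on Blip2ForConditionalGeneration):

import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
)

# Freeze the vision encoder and the LLM; leave qformer, query_tokens,
# and language_projection trainable, mirroring the BLIP-2 recipe
for p in model.vision_model.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable/1e6:.0f}M / {total/1e6:.0f}M parameters")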
| Model | LLM Backend | Size | Use Case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
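To sanity-check the table, the learned queries are exposed directly on the HuggingFace model; a quick sketch (assuming the model loaded in the quick start above):

print(model.query_tokens.shape)  # expected: torch.Size([1, 32, 768]) — 32 queries × 768 dim
qformer_params = sum(p.numel() for p in model.qformer.parameters())
print(f"Q-Former parameters: {qformer_params / 1e6:.0f}M")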
from PIL import Image
import torch
# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
"What is shown in this image?",
"Describe the scene.",
"What colors are prominent?",
"Is there a person in this image?"
]
# Process batch
inputs = processor(
images=images,
text=questions,
return_tensors="pt",
padding=True
).to("cuda", torch.float16)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
print(f"Q: {q}\nA: {a}\n")
# Control generation parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=100,
min_length=20,
num_beams=5, # Beam search
no_repeat_ngram_size=2, # Avoid repetition
top_p=0.9, # Nucleus sampling
temperature=0.7, # Creativity
do_sample=True, # Enable sampling
)
# For deterministic output
generated_ids = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=False,
)
# 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-6.7b",
quantization_config=quantization_config,
device_map="auto"
)
# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-flan-t5-xxl",
quantization_config=quantization_config,
device_map="auto"
)
# Using LAVIS for ITM (Image-Text Matching)
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_image_text_matching",
model_type="pretrain",
is_eval=True,
device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # PIL image to score against the text
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")
# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
# Extract image features with Q-Former
from lavis.models import load_model_and_preprocess
model, vis_processors, _ = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # PIL image to embed
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds # Shape: [1, 32, 768]
image_features = features.image_embeds_proj # Projected for matching
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from pathlib import Path
class ImageCaptioner:
def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def caption(self, image_path: str, prompt: str = None) -> str:
image = Image.open(image_path).convert("RGB")
if prompt:
inputs = self.processor(images=image, text=prompt, return_tensors="pt")
else:
inputs = self.processor(images=image, return_tensors="pt")
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def caption_batch(self, image_paths: list, prompt: str = None) -> list:
images = [Image.open(p).convert("RGB") for p in image_paths]
if prompt:
inputs = self.processor(
images=images,
text=[prompt] * len(images),
return_tensors="pt",
padding=True
)
else:
inputs = self.processor(images=images, return_tensors="pt", padding=True)
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(**inputs, max_new_tokens=50)
return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
# Usage
captioner = ImageCaptioner()
# Single image
caption = captioner.caption("photo.jpg")
print(f"Caption: {caption}")
# With prompt for style
caption = captioner.caption("photo.jpg", "a detailed description of")
print(f"Detailed: {caption}")
# Batch processing
captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])
for i, cap in enumerate(captions):
print(f"Image {i+1}: {cap}")
class VisualQA:
def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
self.current_image = None
self.current_inputs = None
def set_image(self, image_path: str):
"""Load image for multiple questions."""
self.current_image = Image.open(image_path).convert("RGB")
def ask(self, question: str) -> str:
"""Ask a question about the current image."""
if self.current_image is None:
raise ValueError("No image set. Call set_image() first.")
# Format question for FlanT5
prompt = f"Question: {question} Answer:"
inputs = self.processor(
images=self.current_image,
text=prompt,
return_tensors="pt"
).to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def ask_multiple(self, questions: list) -> dict:
"""Ask multiple questions about current image."""
return {q: self.ask(q) for q in questions}
# Usage
vqa = VisualQA()
vqa.set_image("scene.jpg")
# Ask questions
print(vqa.ask("What objects are in this image?"))
print(vqa.ask("What is the weather like?"))
print(vqa.ask("How many people are there?"))
# Batch questions
results = vqa.ask_multiple([
"What is the main subject?",
"What colors are dominant?",
"Is this indoors or outdoors?"
])
import torch
import numpy as np
from PIL import Image
from lavis.models import load_model_and_preprocess
class ImageSearchEngine:
def __init__(self):
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=self.device
)
self.image_features = []
self.image_paths = []
def index_images(self, image_paths: list):
"""Build index from images."""
self.image_paths = image_paths
for path in image_paths:
image = Image.open(path).convert("RGB")
image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
with torch.no_grad():
features = self.model.extract_features({"image": image}, mode="image")
# Use projected features for matching
self.image_features.append(
features.image_embeds_proj.mean(dim=1).cpu().numpy()
)
self.image_features = np.vstack(self.image_features)
def search(self, query: str, top_k: int = 5) -> list:
"""Search images by text query."""
# Get text features
text = self.txt_processors["eval"](query)
text_input = {"text_input": [text]}
with torch.no_grad():
text_features = self.model.extract_features(text_input, mode="text")
text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()
# Compute similarities
similarities = np.dot(self.image_features, text_embeds.T).squeeze()
top_indices = np.argsort(similarities)[::-1][:top_k]
return [(self.image_paths[i], similarities[i]) for i in top_indices]
# Usage
engine = ImageSearchEngine()
engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])
# Search
results = engine.search("a sunset over the ocean", top_k=5)
for path, score in results:
print(f"{path}: {score:.3f}")
# Direct generation returns token IDs
generated_ids = model.generate(**inputs, max_new_tokens=50)
# Shape: [batch_size, sequence_length]
# Decode to text
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# Returns: list of strings
# Q-Former outputs
features = model.extract_features({"image": image}, mode="image")
features.image_embeds # [B, 32, 768] - Q-Former outputs
features.image_embeds_proj # [B, 32, 256] - Projected for matching
features.text_embeds # [B, seq_len, 768] - Text features
features.text_embeds_proj # [B, seq_len, 256] - Projected text ([:, 0] is the CLS token)
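One way to turn these projected features into an image-text score: BLIP-2's contrastive head compares the text embedding against each of the 32 query embeddings and keeps the best match. A sketch, assuming image features and text features extracted with the LAVIS feature extractor as in the search-engine example above:

import torch
import torch.nn.functional as F

img = F.normalize(features.image_embeds_proj, dim=-1)          # [B, 32, 256]
txt = F.normalize(text_features.text_embeds_proj[:, 0], dim=-1)  # [B, 256], CLS token

sim_per_query = torch.bmm(img, txt.unsqueeze(-1))   # [B, 32, 1] cosine similarity per query
score = sim_per_query.max(dim=1).values.squeeze(-1)  # keep the best-matching query
print(score)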
| Model | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| blip2-opt-2.7b | ~8GB | ~5GB | ~3GB |
| blip2-opt-6.7b | ~16GB | ~9GB | ~5GB |
| blip2-flan-t5-xl | ~10GB | ~6GB | ~4GB |
| blip2-flan-t5-xxl | ~26GB | ~14GB | ~8GB |
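These figures are rough estimates; actual usage depends on batch size, generation length, and the quantization backend. A quick way to measure the real peak on your GPU (assuming a model and inputs prepared as in the quick start):

import torch

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**inputs, max_new_tokens=50)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated VRAM: {peak_gb:.1f} GB")  # lower bound on what nvidia-smi reports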
# Use Flash Attention if available
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2", # Requires flash-attn
device_map="auto"
)
# Compile model (PyTorch 2.0+)
model = torch.compile(model)
# Image size: the processor resizes inputs to 224x224 (the vision encoder's
# native resolution), so pre-shrinking very large images only saves CPU
# preprocessing time and does not change the model input
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
| Issue | Solution |
|---|---|
| CUDA OOM | Use INT8/INT4 quantization, smaller model |
| Slow generation | Use greedy decoding, reduce max_new_tokens |
| Poor captions | Try FlanT5 variant, use prompts |
| Hallucinations | Lower temperature, use beam search |
| Wrong answers | Rephrase question, provide context |
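For the slow-generation case, the fastest configuration is greedy decoding with a tight token budget; a minimal sketch using the same model and inputs as the quick start:

generated_ids = model.generate(
    **inputs,
    max_new_tokens=20,   # short captions/answers rarely need more
    num_beams=1,         # greedy decoding
    do_sample=False,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]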