Important prerequisite
Installing AI Skills requires unrestricted network access to GitHub. If you are behind a restricted network, enable a VPN with TUN mode before installing; this alone determines whether the installation succeeds.
clip by orchestra-research/ai-research-skills
npx skills add https://github.com/orchestra-research/ai-research-skills --skill clip
OpenAI's CLIP, a model that understands images through natural language: it scores how well an image matches arbitrary text, enabling zero-shot classification, image search, and content moderation.
Use when:
Metrics:
Use alternatives instead:
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarities, scaled by the model's learned temperature
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
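The table is indicative; to see every checkpoint your installed clip package actually ships (the exact list depends on the version), the package exposes clip.available_models():

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
#       'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']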
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() assumes a single image and a single text)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
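Stacking every image into a single tensor only works for small sets; for larger collections, encoding in chunks bounds GPU memory. A minimal sketch (the encode_in_batches helper and batch_size value are illustrative):

def encode_in_batches(paths, batch_size=32):
    feats = []
    for i in range(0, len(paths), batch_size):
        # Preprocess and encode one manageable chunk at a time
        batch = torch.stack(
            [preprocess(Image.open(p)) for p in paths[i:i + batch_size]]
        ).to(device)
        with torch.no_grad():
            f = model.encode_image(batch)
        feats.append(f / f.norm(dim=-1, keepdim=True))
    return torch.cat(feats)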
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5,
)
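Two notes on this setup. Chroma's default distance is squared L2, which ranks identically to cosine on these normalized embeddings; cosine can also be requested explicitly via create_collection("image_embeddings", metadata={"hnsw:space": "cosine"}). For the FAISS option mentioned in the comment above, a minimal sketch (assumes pip install faiss-cpu):

import faiss

# Inner product over L2-normalized vectors is cosine similarity
embs = image_embeddings.cpu().numpy().astype("float32")
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)

query_vec = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
query_vec = query_vec.cpu().numpy().astype("float32")  # shape (1, dim)
scores, ids = index.search(query_vec, 3)
for i, score in zip(ids[0], scores[0]):
    print(f"{image_paths[i]}: {score:.3f}")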
| Operation | CPU | GPU (V100) |
|---|---|---|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
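These figures vary with hardware, model choice, and batch size; a rough sketch for measuring on your own machine (the time_encode helper is illustrative; torch.cuda.synchronize keeps GPU timings honest):

import time

def time_encode(fn, x, n=20):
    with torch.no_grad():
        fn(x)  # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n):
            fn(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n * 1000  # ms per call

print(f"image encoding: {time_encode(model.encode_image, image):.1f} ms")
print(f"text encoding:  {time_encode(model.encode_text, text):.1f} ms")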