umap-learn by k-dense-ai/claude-scientific-skills
npx skills add https://github.com/k-dense-ai/claude-scientific-skills --skill umap-learn
UMAP (Uniform Manifold Approximation and Projection) is a technique for visualization and general non-linear dimensionality reduction. Apply this skill for fast, scalable embeddings that preserve both local and global structure, for supervised dimension reduction, and for clustering preprocessing.
uv pip install umap-learn
UMAP follows scikit-learn conventions and can be used as a drop-in replacement for t-SNE or PCA.
import umap
from sklearn.preprocessing import StandardScaler
# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)
# Method 1: Single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)
# Method 2: Separate steps (for reusing trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_ # Access the trained embedding
Critical preprocessing requirement: Always standardize features to comparable scales before applying UMAP to ensure equal weighting across dimensions.
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)
# 2. Create and fit UMAP
reducer = umap.UMAP(
n_neighbors=15,
min_dist=0.1,
n_components=2,
metric='euclidean',
random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()
UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.
Purpose: Balances local versus global structure in the embedding.
How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.
Effects by value:
- Low values (2-15): concentrate on fine local structure, which can fragment the data into many small components
- High values (50-200): take a broader view, preserving global relationships at the cost of fine local detail
Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.
Purpose: Controls how tightly points cluster in the low-dimensional space.
How it works: Sets the minimum distance apart that points are allowed to be in the output representation.
Effects by value:
- Near 0.0: points pack into tight clumps, which is useful for clustering but can exaggerate separation
- Larger values (up to ~0.99): points spread more evenly, giving a smoother picture of broad structure
Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.
Purpose: Determines the dimensionality of the embedded output space.
Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.
Common uses:
- 2 or 3 components for plotting
- 5-50 components as compact features for clustering or downstream models
Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.
Purpose: Specifies how distance is calculated between input data points.
Supported metrics include:
- Minkowski-family: euclidean, manhattan, chebyshev, minkowski
- Angular: cosine, correlation
- Binary: hamming, jaccard, dice
Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.
# For visualization with emphasis on local structure
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
# For clustering preprocessing
umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')
# For document embeddings
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')
# For preserving global structure
umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')
UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.
Pass target labels via the y parameter when fitting:
# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)
Key benefits:
- Known classes separate cleanly while the internal structure of each class is preserved
- The resulting embedding serves as a strong feature space for downstream classifiers
When to use: When you have labeled data and want to separate known classes while keeping meaningful point embeddings.
For partial labels, mark unlabeled points with -1 following scikit-learn convention:
# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)
When to use: When labeling is expensive or you have more data than labels available.
Train a supervised embedding on labeled data, then apply to new unlabeled data:
# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)
# Transform unlabeled test data
test_embedding = mapper.transform(test_data)
# Use as feature engineering for downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)
When to use: For supervised feature engineering in machine learning pipelines.
UMAP serves as effective preprocessing for density-based clustering algorithms like HDBSCAN, overcoming the curse of dimensionality.
Key principle: Configure UMAP differently for clustering than for visualization.
Recommended parameters:
- n_neighbors=30: larger neighborhoods give more stable manifold estimates
- min_dist=0.0: lets points pack tightly so density-based clustering can find structure
- n_components=10: higher than 2 to reduce distortion from aggressive compression
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)
# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
n_neighbors=30,
min_dist=0.0,
n_components=10, # Higher than 2 for better density preservation
metric='euclidean',
random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
min_cluster_size=15,
min_samples=5,
metric='euclidean'
)
labels = clusterer.fit_predict(embedding)
# 4. Evaluate
from sklearn.metrics import adjusted_rand_score
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {sum(labels == -1)}")
# Create 2D embedding for visualization (separate from clustering)
vis_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
vis_embedding = vis_reducer.fit_transform(scaled_data)
# Plot with cluster labels
import matplotlib.pyplot as plt
plt.scatter(vis_embedding[:, 0], vis_embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Visualization with HDBSCAN Clusters')
plt.show()
Important caveat: UMAP does not completely preserve density and can create artificial cluster divisions. Always validate and explore resulting clusters.
UMAP enables preprocessing of new data through its transform() method, allowing trained models to project unseen data into the learned embedding space.
# Train on training data
trans = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train)
# Transform test data
test_embedding = trans.transform(X_test)
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import umap
# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train UMAP
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_embedded = reducer.fit_transform(X_train_scaled)
X_test_embedded = reducer.transform(X_test_scaled)
# Train classifier on embeddings
clf = SVC()
clf.fit(X_train_embedded, y_train)
accuracy = clf.score(X_test_embedded, y_test)
print(f"Test accuracy: {accuracy:.3f}")
Data consistency: The transform method assumes the overall distribution in the higher-dimensional space is consistent between training and test data. When this assumption fails, consider using Parametric UMAP instead.
Performance: Transform operations are efficient (typically <1 second), though initial calls may be slower due to Numba JIT compilation.
Scikit-learn compatibility: UMAP follows standard sklearn conventions and works seamlessly in pipelines:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('umap', umap.UMAP(n_components=10)),
('classifier', SVC())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Parametric UMAP replaces direct embedding optimization with a learned neural network mapping function.
Key differences from standard UMAP:
- The mapping is a trained neural network (encoder), not a directly optimized point set
- New data is embedded with a fast forward pass through the network
- Requires TensorFlow and typically longer training time
Installation:
uv pip install 'umap-learn[parametric_umap]'
# Requires TensorFlow 2.x
Basic usage:
from umap.parametric_umap import ParametricUMAP
# Default architecture (3-layer 100-neuron fully-connected network)
embedder = ParametricUMAP()
embedding = embedder.fit_transform(data)
# Transform new data efficiently
new_embedding = embedder.transform(new_data)
Custom architecture:
import tensorflow as tf
# Define custom encoder
encoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(input_dim,)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(2) # Output dimension
])
embedder = ParametricUMAP(encoder=encoder, dims=(input_dim,))
embedding = embedder.fit_transform(data)
When to use Parametric UMAP:
- New data arrives continuously and must be embedded quickly
- The mapping should generalize beyond the training distribution
- The embedding needs to plug into a larger neural network (e.g., as part of an autoencoder)
When to use standard UMAP:
- One-off exploratory analysis and visualization
- Avoiding the TensorFlow dependency and extra training cost
Inverse transforms enable reconstruction of high-dimensional data from low-dimensional embeddings.
Basic usage:
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)
# Reconstruct high-dimensional data from embedding coordinates
reconstructed = reducer.inverse_transform(embedding)
Important limitations:
- Reconstruction is approximate and computationally expensive
- Quality degrades for coordinates far from the training data in embedding space
Use cases:
- Understanding what different regions of the embedding space represent
- Generating illustrative samples by decoding interpolated embedding coordinates
Example: Exploring embedding space:
import numpy as np
# Create grid of points in embedding space
x = np.linspace(embedding[:, 0].min(), embedding[:, 0].max(), 10)
y = np.linspace(embedding[:, 1].min(), embedding[:, 1].max(), 10)
xx, yy = np.meshgrid(x, y)
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Reconstruct samples from grid
reconstructed_samples = reducer.inverse_transform(grid_points)
For analyzing temporal or related datasets (e.g., time-series experiments, batch data):
from umap import AlignedUMAP
# List of related datasets
datasets = [day1_data, day2_data, day3_data]
# Create aligned embeddings; AlignedUMAP also requires `relations`,
# dicts mapping row indices of each dataset to the corresponding rows
# of the next one (here the same samples tracked across days)
relations = [{i: i for i in range(len(day1_data))},
             {i: i for i in range(len(day2_data))}]
mapper = AlignedUMAP().fit(datasets, relations=relations)
aligned_embeddings = mapper.embeddings_  # List of embeddings
When to use: Comparing embeddings across related datasets while maintaining consistent coordinate systems.
To ensure reproducible results, always set the random_state parameter:
reducer = umap.UMAP(random_state=42)
UMAP uses stochastic optimization, so results will vary slightly between runs without a fixed random state.
Issue: Disconnected components or fragmented clusters
Fix: Increase n_neighbors to emphasize more global structure.

Issue: Clusters too spread out or not well separated
Fix: Decrease min_dist to allow tighter packing.

Issue: Poor clustering results
Fix: Use clustering-oriented parameters (min_dist=0.0, n_components of 5-10) and cluster on that embedding rather than the 2D visualization.

Issue: Transform results differ significantly from training
Fix: Check that new data follows the training distribution; consider Parametric UMAP if it does not.

Issue: Slow performance on large datasets
Fix: Keep low_memory=True (the default), or consider dimensionality reduction with PCA first.

Issue: All points collapsed to a single cluster
Fix: Increase min_dist.

Contains detailed API documentation:
api_reference.md: Complete UMAP class parameters and methods
Load these references when detailed parameter information or advanced method usage is needed.
Weekly Installs: 56
Repository: https://github.com/k-dense-ai/claude-scientific-skills
GitHub Stars: 17.3K
First Seen: Jan 20, 2026
Security Audits: Gen Agent Trust Hub: Pass, Socket: Pass, Snyk: Pass
Installed on: opencode (49), codex (48), gemini-cli (47), claude-code (45), cursor (45), github-copilot (44)