umap-learn by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill umap-learn
UMAP (Uniform Manifold Approximation and Projection) is a technique for visualization and general non-linear dimensionality reduction. Apply this skill for fast, scalable embeddings that preserve local and global structure, for supervised dimension reduction, and for clustering preprocessing.
uv pip install umap-learn
UMAP follows scikit-learn conventions and can be used as a drop-in replacement for t-SNE or PCA.
import umap
from sklearn.preprocessing import StandardScaler
# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)
# Method 1: Single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)
# Method 2: Separate steps (for reusing trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_ # Access the trained embedding
Critical preprocessing requirement: Always standardize features to comparable scales before applying UMAP to ensure equal weighting across dimensions.
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)
# 2. Create and fit UMAP
reducer = umap.UMAP(
n_neighbors=15,
min_dist=0.1,
n_components=2,
metric='euclidean',
random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()
UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.
Purpose: Balances local versus global structure in the embedding.
How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.
Effects by value:
- Low values (2-15): focus on fine local detail; the global picture may fragment
- Moderate values (15-50): balance local and global structure (a good default range)
- High values (50-200): emphasize global structure at the cost of local detail
Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.
Purpose: Controls how tightly points cluster in the low-dimensional space.
How it works: Sets the minimum distance apart that points are allowed to be in the output representation.
Effects by value:
- 0.0: tightest possible packing; points clump, which suits density-based clustering
- 0.1-0.3: compact clusters that remain readable in a scatter plot
- 0.5-0.99: looser, more even dispersal that emphasizes broad topological structure
Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.
Purpose: Determines the dimensionality of the embedded output space.
Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.
Common uses:
- 2 or 3: scatter-plot visualization
- 5-10: feature space for density-based clustering
- 10+: general-purpose features for downstream ML models
Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.
Purpose: Specifies how distance is calculated between input data points.
Supported metrics include:
- Minkowski-style: euclidean, manhattan, chebyshev, minkowski
- Normalized/angular: cosine, correlation
- Binary: hamming, jaccard, dice
Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.
# For visualization with emphasis on local structure
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
# For clustering preprocessing
umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')
# For document embeddings
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')
# For preserving global structure
umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')
UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.
Pass target labels via the y parameter when fitting:
# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)
Key benefits:
- Strong separation between known classes
- Internal structure within each class is preserved
- Resulting embeddings remain useful as features for downstream models
When to use: When you have labeled data and want to separate known classes while keeping meaningful point embeddings.
For partial labels, mark unlabeled points with -1 following scikit-learn convention:
# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)
When to use: When labeling is expensive or you have more data than labels available.
Train a supervised embedding on labeled data, then apply to new unlabeled data:
# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)
# Transform unlabeled test data
test_embedding = mapper.transform(test_data)
# Use as feature engineering for downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)
When to use: For supervised feature engineering in machine learning pipelines.
UMAP serves as effective preprocessing for density-based clustering algorithms like HDBSCAN, overcoming the curse of dimensionality.
Key principle: Configure UMAP differently for clustering than for visualization.
Recommended parameters:
- n_neighbors=30 (or higher): more global structure for stable clusters
- min_dist=0.0: allow points to pack tightly
- n_components=5-10: more room than 2D to preserve density
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)
# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
n_neighbors=30,
min_dist=0.0,
n_components=10, # Higher than 2 for better density preservation
metric='euclidean',
random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
min_cluster_size=15,
min_samples=5,
metric='euclidean'
)
labels = clusterer.fit_predict(embedding)
# 4. Evaluate
from sklearn.metrics import adjusted_rand_score
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {sum(labels == -1)}")
# Create 2D embedding for visualization (separate from clustering)
vis_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
vis_embedding = vis_reducer.fit_transform(scaled_data)
# Plot with cluster labels
import matplotlib.pyplot as plt
plt.scatter(vis_embedding[:, 0], vis_embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Visualization with HDBSCAN Clusters')
plt.show()
Important caveat: UMAP does not completely preserve density and can create artificial cluster divisions. Always validate and explore resulting clusters.
UMAP enables preprocessing of new data through its transform() method, allowing trained models to project unseen data into the learned embedding space.
# Train on training data
trans = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train)
# Transform test data
test_embedding = trans.transform(X_test)
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import umap
# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train UMAP
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_embedded = reducer.fit_transform(X_train_scaled)
X_test_embedded = reducer.transform(X_test_scaled)
# Train classifier on embeddings
clf = SVC()
clf.fit(X_train_embedded, y_train)
accuracy = clf.score(X_test_embedded, y_test)
print(f"Test accuracy: {accuracy:.3f}")
Data consistency: The transform method assumes the overall distribution in the higher-dimensional space is consistent between training and test data. When this assumption fails, consider using Parametric UMAP instead.
Performance: Transform operations are efficient (typically <1 second), though initial calls may be slower due to Numba JIT compilation.
Scikit-learn compatibility: UMAP follows standard sklearn conventions and works seamlessly in pipelines:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('umap', umap.UMAP(n_components=10)),
('classifier', SVC())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Parametric UMAP replaces direct embedding optimization with a learned neural network mapping function.
Key differences from standard UMAP:
- Learns an explicit encoder network instead of optimizing embedding coordinates directly
- New data is transformed by a fast forward pass through the network
- The trained encoder can be saved and reloaded like any Keras model
- Requires TensorFlow, and training is typically slower than standard UMAP
Installation:
uv pip install 'umap-learn[parametric_umap]'
# Requires TensorFlow 2.x
Basic usage:
from umap.parametric_umap import ParametricUMAP
# Default architecture (3-layer 100-neuron fully-connected network)
embedder = ParametricUMAP()
embedding = embedder.fit_transform(data)
# Transform new data efficiently
new_embedding = embedder.transform(new_data)
Custom architecture:
import tensorflow as tf
# Define custom encoder
encoder = tf.keras.Sequential([
tf.keras.layers.InputLayer(input_shape=(input_dim,)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(2) # Output dimension
])
embedder = ParametricUMAP(encoder=encoder, dims=(input_dim,))
embedding = embedder.fit_transform(data)
When to use Parametric UMAP:
- Streaming or production settings that must embed new data quickly and repeatedly
- Data whose distribution may drift from the training set
- When a reusable, saveable mapping model is needed
When to use standard UMAP:
- One-off exploratory analysis and visualization
- When avoiding the TensorFlow dependency matters
- Small to medium datasets where refitting is cheap
Inverse transforms enable reconstruction of high-dimensional data from low-dimensional embeddings.
Basic usage:
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)
# Reconstruct high-dimensional data from embedding coordinates
reconstructed = reducer.inverse_transform(embedding)
Important limitations:
- Computationally expensive relative to the forward transform
- Accurate only near regions covered by the training data; extrapolation is unreliable
- Not supported for every metric or configuration
Use cases:
- Generating synthetic samples from embedding coordinates
- Probing what different regions of the embedding space represent
Example: Exploring embedding space:
import numpy as np
# Create grid of points in embedding space
x = np.linspace(embedding[:, 0].min(), embedding[:, 0].max(), 10)
y = np.linspace(embedding[:, 1].min(), embedding[:, 1].max(), 10)
xx, yy = np.meshgrid(x, y)
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Reconstruct samples from grid
reconstructed_samples = reducer.inverse_transform(grid_points)
For analyzing temporal or related datasets (e.g., time-series experiments, batch data):
from umap import AlignedUMAP
# List of related datasets
datasets = [day1_data, day2_data, day3_data]
# Relations map row indices between consecutive datasets; the identity
# mapping here assumes the same samples appear in each slice
relations = [
{i: i for i in range(len(day1_data))},
{i: i for i in range(len(day2_data))},
]
# Create aligned embeddings
mapper = AlignedUMAP().fit(datasets, relations=relations)
aligned_embeddings = mapper.embeddings_ # List of embeddings
When to use: Comparing embeddings across related datasets while maintaining consistent coordinate systems.
To ensure reproducible results, always set the random_state parameter:
reducer = umap.UMAP(random_state=42)
UMAP uses stochastic optimization, so results will vary slightly between runs without a fixed random state.
Issue: Disconnected components or fragmented clusters
Solution: Increase n_neighbors to emphasize more global structure

Issue: Clusters too spread out or not well separated
Solution: Decrease min_dist to allow tighter packing

Issue: Poor clustering results
Solution: Use clustering-oriented settings (min_dist=0.0, n_components of 5-10) rather than visualization defaults

Issue: Transform results differ significantly from training
Solution: Check that training and new data share the same distribution, or switch to Parametric UMAP

Issue: Slow performance on large datasets
Solution: Keep low_memory=True (default), or consider dimensionality reduction with PCA first

Issue: All points collapsed to a single cluster
Solution: Increase min_dist

Detailed API documentation is available in the bundled reference files:
- api_reference.md: Complete UMAP class parameters and methods

Load these references when detailed parameter information or advanced method usage is needed.
Weekly Installs: 135
GitHub Stars: 23.4K
First Seen: Jan 21, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Pass
Installed on: claude-code (120), opencode (110), gemini-cli (106), cursor (104), antigravity (102), codex (96)