geo-database by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill geo-database
The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.
This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.
GEO organizes data hierarchically using different accession types:
Series (GSE): A complete experiment with a set of related samples
Sample (GSM): A single experimental sample or biological replicate
Platform (GPL): The microarray or sequencing platform used
DataSet (GDS): Curated collections with consistent formatting
Profiles: Gene-specific expression data linked to sequence features
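The accession prefixes listed above can be told apart mechanically. The following is a minimal sketch; `classify_accession` and `ACCESSION_TYPES` are hypothetical names introduced here for illustration, not part of any GEO library:

```python
import re

# Prefix-to-type mapping taken from the accession list above
ACCESSION_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if match is None:
        return None
    return ACCESSION_TYPES[match.group(1)]

print(classify_accession("GSE123456"))
print(classify_accession("gpl570"))
print(classify_accession("XYZ1"))
```

A check like this can be used to route an accession to the right download or search path before any network call is made.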
GEO DataSets Search:
Search for studies by keywords, organism, or experimental conditions:
from Bio import Entrez

# Configure Entrez (NCBI requires a contact email)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search the GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform (ACCN is the GEO accession field tag)
results = search_geo_datasets("GPL570[ACCN]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
GEO Profiles Search:
Find gene-specific expression patterns:
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
Advanced Search Patterns:
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput sequencing (RNA-seq) studies
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
search_terms = [
    "Smith[Author]",
    "diabetes"  # no field tag, so the term is matched against all fields
]
results = advanced_geo_search(search_terms)
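Field-tagged queries like the ones above can also be assembled from a mapping rather than hand-written strings. This is a small sketch under stated assumptions: `build_geo_query` is a hypothetical helper, and the field names passed to it must be ones the target database actually supports.

```python
def build_geo_query(fields, operator="AND"):
    """Join {field: value} pairs into an Entrez query string."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    terms = []
    for field, value in fields.items():
        # Quote multi-word values so they are searched as a phrase
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{value}[{field}]")
    return f" {operator} ".join(terms)

query = build_geo_query({"Organism": "Homo sapiens", "Author": "Smith"})
print(query)
```

The resulting string can be passed to a search function such as `search_geo_datasets` in place of a hand-built query.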
GEOparse is the primary Python library for accessing GEO data:
Installation:
uv pip install GEOparse
Basic Usage:
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata (each value is a list of strings)
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
Working with Expression Data:
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot the per-sample tables into one matrix (simplest)
expression_df = gse.pivot_samples('VALUE')
print(expression_df.shape)  # probes x samples

# Method 2: Build the matrix from individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if not gsm.table.empty:
        # Index by probe ID so values align across samples
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']
expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
Accessing Supplementary Files:
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files (URLs are stored in the sample metadata)
for gsm_name, gsm in gse.gsms.items():
    file_urls = gsm.metadata.get('supplementary_file', [])
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")
Filtering and Subsetting Data:
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]
treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]
print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
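Matching on sample titles can be fragile. Sample characteristics are often submitted as "key: value" strings (for example "tissue: liver"), which can be turned into a dictionary for more precise filtering. The helper below is a hypothetical sketch that assumes this "key: value" convention holds for the series at hand:

```python
def parse_characteristics(entries):
    """Turn ['tissue: liver', ...] into {'tissue': 'liver', ...}."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition(":")
        if sep:  # skip entries that do not follow the key: value convention
            parsed[key.strip().lower()] = value.strip()
    return parsed

example = ["tissue: liver", "Treatment: control", "free text without separator"]
print(parse_characteristics(example))
```

With GEOparse, the input list would typically come from `gsm.metadata.get('characteristics_ch1', [])`.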
E-utilities provide lower-level programmatic access to GEO metadata:
Basic E-utilities Workflow:
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records as text"""
    ids = ",".join(id_list)
    # efetch on the gds database is expected to return plain-text records
    # rather than XML, so read the handle directly instead of Entrez.read()
    handle = Entrez.efetch(db=db, id=ids)
    records = handle.read()
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]
summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    # The accession may be a GDS, GSE, GPL, or GSM depending on the entry type
    print(f"Accession: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()
Batch Processing with E-utilities:
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}
    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]
        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[ACCN]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()
                if search_results['IdList']:
                    # Fetch summary for the first matching entry
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]
                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second
            except Exception as e:
                print(f"Error fetching {accession}: {e}")
    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
FTP URLs for GEO Data:
GEO data can be downloaded directly via FTP:
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO series files via FTP"""
    # Each file type lives in its own subdirectory of the series folder
    file_layout = {
        "matrix": ("matrix", f"{accession}_series_matrix.txt.gz"),
        "soft": ("soft", f"{accession}_family.soft.gz"),
        "miniml": ("miniml", f"{accession}_family.xml.tgz"),
    }
    if not accession.startswith("GSE"):
        raise ValueError(f"Only GSE accessions are supported: {accession}")
    if file_type not in file_layout:
        raise ValueError(f"Unknown file_type: {file_type}")

    # Construct FTP path, e.g. /geo/series/GSE123nnn/GSE123456/matrix/
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"
    subdir, filename = file_layout[file_type]
    ftp_path = f"/geo/series/GSE{base_num}/{accession}/{subdir}/"

    # Connect to FTP server (anonymous login)
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)
    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)
    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
Using wget or curl for Downloads:
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz
# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/
# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
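The URLs in the commands above follow a regular pattern: the series number with its last three digits replaced by "nnn", then the accession, then a per-format subdirectory. A minimal sketch of that pattern follows; `geo_series_url` is a hypothetical helper, and it covers only the matrix and SOFT layouts shown above:

```python
def geo_series_url(accession, file_type="matrix"):
    """Build the FTP URL for a GEO series file, following the layout above."""
    layout = {
        "matrix": ("matrix", "{acc}_series_matrix.txt.gz"),
        "soft": ("soft", "{acc}_family.soft.gz"),
    }
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"Not a GSE accession: {accession}")
    if file_type not in layout:
        raise ValueError(f"Unknown file_type: {file_type}")
    # GSE123456 -> GSE123nnn; accessions below 1000 fall into GSEnnn
    bucket = "GSE" + accession[3:][:-3] + "nnn"
    subdir, pattern = layout[file_type]
    filename = pattern.format(acc=accession)
    return f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{bucket}/{accession}/{subdir}/{filename}"

print(geo_series_url("GSE123456"))
print(geo_series_url("GSE123456", file_type="soft"))
```

The returned string can be handed to wget, curl, or any FTP-capable downloader.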
Quality Control and Preprocessing:
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
# Heuristic: strictly positive values with a large maximum suggest raw-scale data
if expression_df.min().min() > 0 and expression_df.max().max() > 100:
    expression_df = np.log2(expression_df + 1)
    print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")
plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
Differential Expression Analysis:
import GEOparse
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]
    # A difference of means is a log2 fold change only for log2-scale values
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)
    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame; drop genes whose test returned NaN
de_results = pd.DataFrame(results).dropna(subset=['p_value'])

# Multiple testing correction (Benjamini-Hochberg)
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (de_results['log2_fold_change'].abs() > 1)
]
print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
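To make the multiple-testing step less opaque, here is a plain-Python sketch of the Benjamini-Hochberg adjustment. It illustrates the procedure itself and is not the statsmodels implementation; `benjamini_hochberg` is a name chosen here for illustration.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p))
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a smaller p-value from receiving a larger q-value than a bigger one.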
Correlation and Clustering Analysis:
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering (pdist cannot handle NaN, so drop incomplete rows)
distances = pdist(expression_df.dropna().T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')
plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=expression_df.columns.tolist())
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
Download and Process Multiple Series:
import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    os.makedirs(destdir, exist_ok=True)
    results = {}
    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)
            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }
            # Save expression data
            expr_df = gse.pivot_samples('VALUE')
            expr_df.to_csv(os.path.join(destdir, f"{gse_id}_expression.csv"))
            results[gse_id]['num_genes'] = len(expr_df)
        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}

    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(os.path.join(destdir, "batch_summary.csv"))
    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
Meta-Analysis Across Studies:
import GEOparse
import pandas as pd

def meta_analysis_geo(gse_list, gene_of_interest):
    """Summarize expression of one gene across studies"""
    results = []
    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
            # Get platform annotation (uses the first platform only)
            gpl = list(gse.gpls.values())[0]
            # Find gene in platform; the 'Gene Symbol' column name is
            # platform-specific, and a substring match also hits related
            # symbols (e.g. TP53 matches TP53BP1)
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]
                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')
                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]
                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })
        except Exception as e:
            print(f"Error in {gse_id}: {e}")
    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
# Primary GEO access library (recommended)
uv pip install GEOparse
# For E-utilities and programmatic NCBI access
uv pip install biopython
# For data analysis
uv pip install pandas numpy scipy
# For visualization
uv pip install matplotlib seaborn
# For statistical analysis
uv pip install statsmodels scikit-learn
Set up NCBI E-utilities access:
from Bio import Entrez
# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"
# Optional: Set API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"
# With API key: 10 requests/second
# Without API key: 3 requests/second
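The request limits quoted above translate into a minimum spacing between calls. One way to enforce that spacing is a small throttle object; the sketch below is illustrative only, and `RequestThrottle` is a hypothetical class, not part of Biopython:

```python
import time

class RequestThrottle:
    """Space out calls so they stay under the E-utilities request rate."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # 10 requests/second with an API key, 3 without (limits quoted above)
        self.interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        if self._last is not None:
            delay = self.interval - (self._clock() - self._last)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()

throttle = RequestThrottle(has_api_key=False)
for accession in ["GSE100001", "GSE100002"]:
    throttle.wait()
    # An Entrez call for this accession would go here
    print(f"ready to request {accession}")
```

Injecting the clock and sleep functions keeps the class easy to test without real delays.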
SOFT (Simple Omnibus Format in Text): GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.
MINiML (MIAME Notation in Markup Language): XML format for GEO data, used for programmatic access and data exchange.
Series Matrix: Tab-delimited expression matrix with samples as columns and genes/probes as rows. Fastest format for getting expression data.
MIAME Compliance: Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions.
Expression Value Types: Different types of expression measurements (raw signal, normalized, log-transformed). Always check platform and processing methods.
Platform Annotation: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
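As an illustration of the Series Matrix layout described above, the sketch below parses a tiny hand-written example using only the standard library. It assumes metadata lines start with "!" and that the table sits between `!series_matrix_table_begin` and `!series_matrix_table_end` markers; the sample text is invented for the example and `parse_series_matrix` is a hypothetical helper.

```python
import csv
import io

SAMPLE = (
    '!Series_title\t"Toy series"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_a"\t1.5\t2.5\n'
    '"probe_b"\t3.0\t\n'
    '!series_matrix_table_end\n'
)

def parse_series_matrix(text):
    """Split series matrix text into (metadata, sample names, probe -> values)."""
    metadata, table_lines, in_table = {}, [], False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            key, _, rest = line[1:].partition("\t")
            values = [v.strip('"') for v in rest.split("\t")]
            metadata.setdefault(key, []).extend(values)
    if not table_lines:
        return metadata, [], {}
    reader = csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    samples = next(reader)[1:]  # first column is the probe ID
    matrix = {
        row[0]: [float(v) if v else None for v in row[1:]]
        for row in reader if row
    }
    return metadata, samples, matrix

metadata, samples, matrix = parse_series_matrix(SAMPLE)
print(metadata["Series_title"], samples, matrix)
```

For real files, decompress the `.txt.gz` first (for example with `gzip.open(path, "rt")`) and pass the text to the function. Empty cells are returned as `None`.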
For quick analysis without coding, use GEO2R:
NCBI E-utilities Rate Limits:
Use time.sleep(0.34) (no API key) or time.sleep(0.1) (with API key) between requests.
FTP Access:
GEOparse Caching:
Best Practices:
Comprehensive reference documentation covering:
Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.