GEO Database Guide: Retrieving Gene Expression Data with Python for Bioinformatics Analysis

geo-database by davila7/claude-code-templates

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill geo-database

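GEO Database

Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities

1. Understanding GEO Data Organization

GEO organizes data hierarchically using different accession types:

Series (GSE): A complete experiment with a set of related samples

Example: GSE123456
Contains experimental design, samples, and overall study information
Largest organizational unit in GEO
Current count: 264,928+ series

Sample (GSM): A single experimental sample or biological replicate

Example: GSM987654
Contains individual sample data, protocols, and metadata
Linked to platforms and series
Current count: 8,068,632+ samples

Platform (GPL): The microarray or sequencing platform used

Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
Describes the technology and probe/feature annotations
Shared across multiple experiments
Current count: 27,739+ platforms

DataSet (GDS): Curated collections with consistent formatting

Example: GDS5678
Experimentally comparable samples organized by study design
Processed for differential analysis
Subset of GEO data (4,348 curated datasets)
Ideal for quick comparative analyses

Profiles: Gene-specific expression data linked to sequence features

Queryable by gene name or annotation
Cross-references to Entrez Gene
Enables gene-centric searches across all studies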

2. Searching GEO Data

GEO DataSets Search:

Search for studies by keywords, organism, or experimental conditions:

from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")

GEO Profiles Search:

Find gene-specific expression patterns:

# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")

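Advanced Search Patterns:

# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput sequencing studies
# (DataSet Type takes the full type name rather than "RNA-seq")
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
# (GEO DataSets has no [Disease] field; MeSH Terms is used here instead)
search_terms = [
    "Smith[Author]",
    "diabetes[MeSH Terms]"
]
results = advanced_geo_search(search_terms)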

3. Retrieving GEO Data with GEOparse (Recommended)

GEOparse is the primary Python library for accessing GEO data:

Installation:

uv pip install GEOparse

Basic Usage:

import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
Working with Expression Data:

import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot the per-sample tables into one matrix (simplest)
expression_df = gse.pivot_samples('VALUE')
print(expression_df.shape)  # probes x samples

# Method 2: Build the matrix from individual samples,
# indexing each column by probe ID so rows align across samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if not gsm.table.empty:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")

Accessing Supplementary Files:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files (URLs are stored in the sample metadata)
for gsm_name, gsm in gse.gsms.items():
    file_urls = gsm.metadata.get('supplementary_file', [])
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")

Filtering and Subsetting Data:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]

treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]

print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]

4. Using NCBI E-utilities for GEO Access

E-utilities provide lower-level programmatic access to GEO metadata:

Basic E-utilities Workflow:

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch GEO records as plain text"""
    ids = ",".join(id_list)
    # efetch for db="gds" is documented to return text summaries rather than XML,
    # so the handle is read as text instead of being parsed with Entrez.read
    handle = Entrez.efetch(db=db, id=ids, rettype="summary", retmode="text")
    records = handle.read()
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]

summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()

Batch Processing with E-utilities:

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}

    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]

        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()

                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]

                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second

            except Exception as e:
                print(f"Error fetching {accession}: {e}")

    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)

5. Direct FTP Access for Data Files

FTP URLs for GEO Data:

GEO data can be downloaded directly via FTP:

import ftplib
import os

# Each file type lives in its own subdirectory of the series directory
GEO_SERIES_FILES = {
    "matrix": ("matrix", "{acc}_series_matrix.txt.gz"),
    "soft": ("soft", "{acc}_family.soft.gz"),
    "miniml": ("miniml", "{acc}_family.xml.tgz"),
}

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO series files via FTP"""
    if not accession.startswith("GSE"):
        raise ValueError(f"Only GSE accessions are supported, got {accession!r}")
    if file_type not in GEO_SERIES_FILES:
        raise ValueError(f"Unknown file_type: {file_type!r}")

    # Series are grouped into directories such as GSE123nnn
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"
    subdir, pattern = GEO_SERIES_FILES[file_type]
    ftp_path = f"/geo/series/GSE{base_num}/{accession}/{subdir}/"
    filename = pattern.format(acc=accession)

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)

    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)

    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")

Using wget or curl for Downloads:

# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz

6. Analyzing GEO Data

Quality Control and Preprocessing:

import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
# Heuristic: non-negative values with a large maximum suggest raw, unlogged intensities
if expression_df.min().min() >= 0 and expression_df.max().max() > 100:
    expression_df = np.log2(expression_df + 1)
    print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")

plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')

Differential Expression Analysis:

import GEOparse
import pandas as pd
import numpy as np
from scipy import stats

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups (placeholders; use real GSM accessions from gse.gsms)
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]

    # Calculate statistics
    # The difference of means is a log2 fold change only if values are already log2-transformed
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)

    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame, dropping genes whose t-test returned NaN
# so they do not distort the multiple-testing correction
de_results = pd.DataFrame(results).dropna(subset=['p_value']).reset_index(drop=True)

# Multiple testing correction (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]

print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)

Correlation and Clustering Analysis:

import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')

plt.figure(figsize=(12, 6))
# Pass labels as a plain list; some SciPy versions do not accept a pandas Index here
hierarchy.dendrogram(linkage, labels=expression_df.columns.tolist())
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')

7. Batch Processing Multiple Datasets

Download and Process Multiple Series:

import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    results = {}

    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)

            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }

            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)

        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}

    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")

    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)

Meta-Analysis Across Studies:

import GEOparse
import pandas as pd
import numpy as np

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []

    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

            # Get platform annotation
            gpl = list(gse.gpls.values())[0]

            # Find gene in platform ('Gene Symbol' is the Affymetrix-style column name)
            if hasattr(gpl, 'table') and 'Gene Symbol' in gpl.table.columns:
                # Match whole symbols only, so TP53 does not also match TP53BP1;
                # multiple symbols per probe are separated by " /// "
                symbols = gpl.table['Gene Symbol'].fillna('').astype(str)
                is_match = symbols.str.upper().str.split(' /// ').apply(
                    lambda names: gene_of_interest.upper() in names
                )
                gene_probes = gpl.table[is_match]

                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')

                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]

                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })

        except Exception as e:
            print(f"Error in {gse_id}: {e}")

    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
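The per-study rows returned above can be combined into a single summary value. The following is a minimal sketch, assuming rows shaped like the `mean_expression` and `num_samples` fields produced by `meta_analysis_geo`; the helper name `weighted_mean_expression` and the example numbers are hypothetical. A sample-size-weighted mean is only meaningful when the studies report values on a comparable scale (for example, all log2-transformed and measured on the same platform).

```python
def weighted_mean_expression(rows):
    """Sample-size-weighted mean of per-study mean expression values."""
    total_n = sum(row["num_samples"] for row in rows)
    if total_n == 0:
        raise ValueError("No samples to combine")
    weighted_sum = sum(row["mean_expression"] * row["num_samples"] for row in rows)
    return weighted_sum / total_n

# Hypothetical per-study summaries, for illustration only
rows = [
    {"study": "study_A", "mean_expression": 8.0, "num_samples": 10},
    {"study": "study_B", "mean_expression": 6.0, "num_samples": 30},
]
print(weighted_mean_expression(rows))  # 6.5
```

Larger studies pull the combined value toward their own mean, which is why the result sits closer to 6.0 than to 8.0.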

# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn

Setting up NCBI E-utilities access:

from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: set an API key for higher rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"

# With an API key: 10 requests/second
# Without an API key: 3 requests/second

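Download gene expression data for specific conditions
Compare expression profiles across studies
Identify differentially expressed genes
Perform meta-analysis across multiple datasets

Analyze gene expression changes after drug treatment
Identify biomarkers of drug response
Compare drug effects across cell lines or patients
Build predictive models of drug sensitivity

Study gene expression in diseased versus normal tissue
Identify disease-associated expression signatures
Compare patient subgroups and disease stages
Correlate expression with clinical outcomes

Biomarker Discovery

Screen for diagnostic or prognostic markers
Validate biomarkers in independent cohorts
Compare marker performance across platforms
Integrate expression data with clinical data

SOFT (Simple Omnibus Format in Text): GEO's primary text-based format, containing metadata and data tables. Easily parsed by GEOparse.

MINiML (MIAME Notation in Markup Language): An XML format for GEO data, used for programmatic access and data exchange.

Series Matrix: A tab-delimited expression matrix with samples as columns and genes/probes as rows. The fastest format for obtaining expression data.

MIAME compliance: Minimum Information About a Microarray Experiment, the standardized annotation that GEO enforces for all submissions.

Expression value types: Different types of expression measurements (raw signal, normalized, log-transformed). Always check the platform and processing methods.

Platform annotation: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.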
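To make the Series Matrix layout concrete, here is a minimal sketch that splits such a file into metadata and the expression table using only the standard library. It assumes the usual layout in which metadata lines start with `!` and the table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`; the function name `parse_series_matrix` and the toy content are illustrative, not part of any GEO library.

```python
import csv
import io

def parse_series_matrix(text):
    """Split series-matrix text into (metadata dict, header list, data rows)."""
    metadata, table_lines, in_table = {}, [], False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            # Metadata lines look like: !Series_title<TAB>"value"
            key, _, value = line[1:].partition("\t")
            metadata.setdefault(key, []).append(value.strip('"'))
    rows = list(csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t"))
    header, data = (rows[0], rows[1:]) if rows else ([], [])
    return metadata, header, data

# Toy file content, for illustration only
sample = (
    '!Series_title\t"Toy study"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_a"\t7.5\t8.1\n'
    '!series_matrix_table_end\n'
)
metadata, header, data = parse_series_matrix(sample)
print(metadata, header, data)
```

Real files are gzip-compressed, so in practice the text would first be read with `gzip.open(path, "rt")`; values are returned as strings and still need converting to numbers.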

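For quick analysis without coding, use GEO2R:

Web-based statistical analysis tool integrated into GEO
Available at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
Performs differential expression analysis
Generates R scripts for reproducibility
Useful for exploratory analysis before downloading data

Rate Limiting and Best Practices

NCBI E-utilities rate limits:

Without an API key: 3 requests per second
With an API key: 10 requests per second
Implement delays between requests: time.sleep(0.34) (no API key) or time.sleep(0.1) (with API key)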
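A fixed `time.sleep` after every call waits longer than necessary when the request itself is slow. The sketch below is one possible alternative: it sleeps only for the time still missing from the minimum interval. The `Throttle` class and `min_request_interval` helper are hypothetical names introduced here, and the intervals are the ones quoted in the list above.

```python
import time

def min_request_interval(has_api_key):
    """Seconds between requests: 10 per second with a key, 3 per second without."""
    return 0.1 if has_api_key else 0.34

class Throttle:
    """Sleep just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_request_interval(has_api_key=False))
for _ in range(3):
    throttle.wait()  # an Entrez call would go right after this line
```

The `clock` and `sleep` parameters exist so the timing logic can be exercised without real delays.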

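FTP downloads have no rate limit
Preferred method for bulk downloads
Entire directories can be downloaded with wget -r

GEOparse caching:

GEOparse automatically caches downloaded files in destdir
Subsequent calls use the cached data
Clean the cache periodically to save disk space

Use GEOparse for series-level access (simplest)
Use E-utilities for metadata searches and batch queries
Use FTP for direct file downloads and bulk operations
Cache data locally to avoid repeated downloads
Always set Entrez.email when using Biopython

references/geo_reference.md

Comprehensive reference documentation covering:

Detailed E-utilities API specifications and endpoints
Complete SOFT and MINiML file format documentation
Advanced GEOparse usage patterns and examples
FTP directory structure and file naming conventions
Data processing pipelines and normalization methods
Troubleshooting common issues and error handling
Platform-specific considerations and quirks

Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.

GEO accepts user-submitted data of varying quality
Always check platform annotation and processing methods
Verify sample metadata and experimental design
Watch for batch effects across studies
Consider reprocessing raw data for consistency

Series matrix files can be large (>1 GB for large studies)
Supplementary files (e.g., CEL files) can be very large
Plan for sufficient disk space before downloading
Consider downloading samples incrementally

Data Usage and Citation

GEO data is freely available for research use
Always cite the original study when using GEO data
Cite the GEO database: Barrett et al. (2013) Nucleic Acids Research
Check individual datasets for usage restrictions (if any)
Follow NCBI guidelines for programmatic access

Different platforms use different probe IDs (annotation mapping is required)
Expression values may be raw, normalized, or log-transformed (check the metadata)
Sample metadata may be formatted inconsistently across studies
Not all series have series matrix files (older submissions)
Platform annotations may be outdated (renamed genes, deprecated IDs)

GEO website: https://www.ncbi.nlm.nih.gov/geo/
GEO submission guidelines: https://www.ncbi.nlm.nih.gov/geo/info/submission.html
GEOparse documentation: https://geoparse.readthedocs.io/
E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
GEO FTP site: ftp://ftp.ncbi.nlm.nih.gov/geo/
GEO2R tool: https://www.ncbi.nlm.nih.gov/geo/geo2r/
NCBI API keys: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Biopython tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html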

🇺🇸English

GEO Database

Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities

1. Understanding GEO Data Organization

GEO organizes data hierarchically using different accession types:

Series (GSE): A complete experiment with a set of related samples

Example: GSE123456
Contains experimental design, samples, and overall study information
Largest organizational unit in GEO
Current count: 264,928+ series

Sample (GSM): A single experimental sample or biological replicate

Example: GSM987654
Contains individual sample data, protocols, and metadata
Linked to platforms and series
Current count: 8,068,632+ samples

Platform (GPL): The microarray or sequencing platform used

Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
Describes the technology and probe/feature annotations
Shared across multiple experiments
Current count: 27,739+ platforms

DataSet (GDS): Curated collections with consistent formatting

Example: GDS5678
Experimentally-comparable samples organized by study design
Processed for differential analysis
Subset of GEO data (4,348 curated datasets)
Ideal for quick comparative analyses

Profiles: Gene-specific expression data linked to sequence features

Queryable by gene name or annotation
Cross-references to Entrez Gene
Enables gene-centric searches across all studies
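Because each record type carries its own accession prefix, code that handles mixed accession lists can route them by prefix. The sketch below illustrates this; the function name `classify_accession` is hypothetical and the prefix table simply restates the list above.

```python
import re

# Prefix-to-record-type mapping, restating the accession types listed above
GEO_RECORD_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

_ACCESSION_RE = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")

def classify_accession(accession):
    """Return (record_type, number) for a GEO accession, or raise ValueError."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a recognised GEO accession: {accession!r}")
    prefix, number = match.groups()
    return GEO_RECORD_TYPES[prefix], int(number)

print(classify_accession("GSE123456"))  # ('Series', 123456)
print(classify_accession("gpl570"))     # ('Platform', 570)
```

Validating accessions up front gives a clear error before any network request is made.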

2. Searching GEO Data

GEO DataSets Search:

Search for studies by keywords, organism, or experimental conditions:

from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")

GEO Profiles Search:

Find gene-specific expression patterns:

# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")

Advanced Search Patterns:

# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

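# Find recent high-throughput sequencing studies
# (DataSet Type takes the full type name rather than "RNA-seq")
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
# (GEO DataSets has no [Disease] field; MeSH Terms is used here instead)
search_terms = [
    "Smith[Author]",
    "diabetes[MeSH Terms]"
]
results = advanced_geo_search(search_terms)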
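Hand-written query strings are easy to get wrong when a value contains spaces. As a sketch, the helper below builds the string from (value, field) pairs and quotes multi-word values so they are searched as a phrase. The name `build_geo_query` is hypothetical; the field names in the example are used for illustration and should be checked against the fields the target database actually supports.

```python
def build_geo_query(filters, operator="AND"):
    """Join (value, field) pairs into an Entrez query string.

    Multi-word values are wrapped in double quotes so they are treated as a phrase.
    """
    parts = []
    for value, field in filters:
        value = value.strip()
        if " " in value and not value.startswith('"'):
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return f" {operator} ".join(parts)

query = build_geo_query([
    ("breast cancer", "Title"),
    ("Homo sapiens", "Organism"),
])
print(query)  # "breast cancer"[Title] AND "Homo sapiens"[Organism]
```

The resulting string can be passed directly as the `query` argument of `search_geo_datasets`.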

3. Retrieving GEO Data with GEOparse (Recommended)

GEOparse is the primary Python library for accessing GEO data:

Installation:

uv pip install GEOparse

Basic Usage:

import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")

Working with Expression Data:

import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot the per-sample tables into one matrix (simplest)
expression_df = gse.pivot_samples('VALUE')
print(expression_df.shape)  # probes x samples

# Method 2: Build the matrix from individual samples,
# indexing each column by probe ID so rows align across samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if not gsm.table.empty:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")

Accessing Supplementary Files:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files (URLs are stored in the sample metadata)
for gsm_name, gsm in gse.gsms.items():
    file_urls = gsm.metadata.get('supplementary_file', [])
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")

Filtering and Subsetting Data:

import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]

treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]

print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
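Matching on the sample title is fragile, and the `characteristics_ch1` metadata often carries the same information in a more structured form. The sketch below assumes entries follow the common `key: value` pattern, which submitters are not required to use; `parse_characteristics` is a hypothetical helper and the example values are made up.

```python
def parse_characteristics(characteristics):
    """Turn ['tissue: liver', 'age: 45'] into {'tissue': 'liver', 'age': '45'}."""
    parsed = {}
    for entry in characteristics:
        key, sep, value = entry.partition(":")
        if sep:  # keep only well-formed "key: value" entries
            parsed[key.strip().lower()] = value.strip()
    return parsed

# Made-up metadata entries, for illustration only
example = ["tissue: liver", "treatment: control", "free text without a key"]
print(parse_characteristics(example))
# {'tissue': 'liver', 'treatment': 'control'}
```

With GEOparse, the input list would come from `gsm.metadata.get('characteristics_ch1', [])`, and samples could then be grouped on a specific key instead of a substring of the title.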

4. Using NCBI E-utilities for GEO Access

E-utilities provide lower-level programmatic access to GEO metadata:

Basic E-utilities Workflow:

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

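# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch GEO records as plain text"""
    ids = ",".join(id_list)
    # efetch for db="gds" is documented to return text summaries rather than XML,
    # so the handle is read as text instead of being parsed with Entrez.read
    handle = Entrez.efetch(db=db, id=ids, rettype="summary", retmode="text")
    records = handle.read()
    handle.close()
    return records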

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]

summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()

Batch Processing with E-utilities:

from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}

    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]

        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()

                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]

                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second

            except Exception as e:
                print(f"Error fetching {accession}: {e}")

    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
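The loop above issues one summary request per accession. Since `fetch_geo_summaries` already passes a comma-separated ID list to `esummary`, UIDs can instead be grouped so that each request covers many records. The following is a minimal sketch of that grouping step; `chunked` and `id_batches` are hypothetical helper names and the UIDs are placeholders.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def id_batches(uids, size=100):
    """Return comma-separated UID strings, one string per batch."""
    return [",".join(batch) for batch in chunked(uids, size)]

# Placeholder UIDs, for illustration only
print(id_batches(["200100001", "200100002", "200100003"], size=2))
# ['200100001,200100002', '200100003']
```

Each returned string can be used as the `id` argument of a single `Entrez.esummary` call, which reduces the number of requests counted against the rate limit.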

5. Direct FTP Access for Data Files

FTP URLs for GEO Data:

GEO data can be downloaded directly via FTP:

import ftplib
import os

# Each file type lives in its own subdirectory of the series directory
GEO_SERIES_FILES = {
    "matrix": ("matrix", "{acc}_series_matrix.txt.gz"),
    "soft": ("soft", "{acc}_family.soft.gz"),
    "miniml": ("miniml", "{acc}_family.xml.tgz"),
}

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO series files via FTP"""
    if not accession.startswith("GSE"):
        raise ValueError(f"Only GSE accessions are supported, got {accession!r}")
    if file_type not in GEO_SERIES_FILES:
        raise ValueError(f"Unknown file_type: {file_type!r}")

    # Series are grouped into directories such as GSE123nnn
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"
    subdir, pattern = GEO_SERIES_FILES[file_type]
    ftp_path = f"/geo/series/GSE{base_num}/{accession}/{subdir}/"
    filename = pattern.format(acc=accession)

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)

    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)

    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")

Using wget or curl for Downloads:

# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
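The URLs above follow a fixed pattern: the series number with its last three digits replaced by `nnn` names the parent directory. As a sketch, the helper below derives the directory URL from an accession so it does not have to be typed by hand; `geo_series_url` is a hypothetical name, and the subdirectory names are the ones that appear in the wget examples.

```python
def geo_series_url(accession, subdir="matrix"):
    """Build the FTP directory URL for a GSE accession and subdirectory."""
    digits = accession[3:]
    if not (accession.startswith("GSE") and digits.isdigit()):
        raise ValueError(f"Expected a GSE accession, got {accession!r}")
    # Drop the last three digits and append "nnn": GSE123456 -> GSE123nnn
    stub = "GSE" + digits[:-3] + "nnn"
    return f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/{subdir}/"

print(geo_series_url("GSE123456", "soft"))
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/
print(geo_series_url("GSE12", "suppl"))
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE12/suppl/
```

Accessions with three or fewer digits fall into the `GSEnnn` directory because nothing remains once the last three digits are dropped.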

6. Analyzing GEO Data

Quality Control and Preprocessing:

import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
# Heuristic: non-negative values with a large maximum suggest raw, unlogged intensities
if expression_df.min().min() >= 0 and expression_df.max().max() > 100:
    expression_df = np.log2(expression_df + 1)
    print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")

plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')

Differential Expression Analysis:

import GEOparse
import pandas as pd
import numpy as np
from scipy import stats

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]

    # Calculate statistics
    # The difference of means is a log2 fold change only if values are already log2-transformed
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)

    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

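# Create results DataFrame, dropping genes whose t-test returned NaN
# so they do not distort the multiple-testing correction
de_results = pd.DataFrame(results).dropna(subset=['p_value']).reset_index(drop=True)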

# Multiple testing correction (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]

print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
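To show what the `fdr_bh` correction computes, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. It is meant for illustration rather than as a replacement for `multipletests`; the function name `benjamini_hochberg` and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with rank
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Made-up p-values, for illustration only
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(value, 4) for value in q])  # [0.04, 0.0533, 0.0533, 0.2]
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone, which is why the second and third genes end up with the same q-value.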

Correlation and Clustering Analysis:

import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

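# Hierarchical clustering. Drop probes with missing values first,
# because NaNs would otherwise propagate into the distance matrix.
complete_df = expression_df.dropna()
distances = pdist(complete_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')

plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=complete_df.columns.tolist())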
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')

7. Batch Processing Multiple Datasets

Download and Process Multiple Series:

import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    results = {}

    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)

            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
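                # Series metadata usually stores the organism under 'sample_organism';
                # fall back to 'organism' in case a file uses that key
                'organism': gse.metadata.get(
                    'sample_organism', gse.metadata.get('organism', ['N/A'])
                )[0],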
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }

            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)

        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}

    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")

    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)

Meta-Analysis Across Studies:

import GEOparse
import pandas as pd
import numpy as np

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []

    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

            # Get platform annotation
            gpl = list(gse.gpls.values())[0]

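            # Find gene in platform. 'Gene Symbol' is the column name used by
            # Affymetrix-style annotations and may differ on other platforms.
            # The substring match also hits related symbols (e.g. TP53BP1 for TP53).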
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]

                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')

                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]

                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })

        except Exception as e:
            print(f"Error in {gse_id}: {e}")

    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
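
Because `str.contains` in `meta_analysis_geo` matches substrings, a stricter lookup can be useful. The sketch below matches whole symbols only. It assumes that multi-gene probes list their symbols separated by `///`, as Affymetrix-style annotations typically do; the helper name `probes_for_symbol` and the example rows are hypothetical.

```python
def probes_for_symbol(annotation_rows, symbol, sep="///"):
    """Return probe IDs whose gene-symbol field contains `symbol` exactly.

    annotation_rows: iterable of (probe_id, gene_symbol_field) pairs.
    """
    target = symbol.strip().upper()
    matches = []
    for probe_id, field in annotation_rows:
        if not isinstance(field, str):
            continue  # skip probes without a gene annotation (None or NaN)
        # Split multi-gene fields and compare whole symbols, case-insensitively
        symbols = {part.strip().upper() for part in field.split(sep)}
        if target in symbols:
            matches.append(probe_id)
    return matches


# Hypothetical annotation rows, for illustration only
rows = [
    ("probe_a", "TP53"),
    ("probe_b", "TP53BP1"),
    ("probe_c", "TP53 /// WRAP53"),
    ("probe_d", None),
]
print(probes_for_symbol(rows, "tp53"))
```

With a platform table, the pairs could be built with `zip(gpl.table['ID'], gpl.table['Gene Symbol'])`, provided the platform actually uses those column names.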

Installation and Setup

Python Libraries

# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn

Configuration

Set up NCBI E-utilities access:

from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: Set API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"

# With API key: 10 requests/second
# Without API key: 3 requests/second

Common Use Cases

Transcriptomics Research

Download gene expression data for specific conditions
Compare expression profiles across studies
Identify differentially expressed genes
Perform meta-analyses across multiple datasets

Drug Response Studies

Analyze gene expression changes after drug treatment
Identify biomarkers for drug response
Compare drug effects across cell lines or patients
Build predictive models for drug sensitivity

Disease Biology

Study gene expression in disease vs. normal tissues
Identify disease-associated expression signatures
Compare patient subgroups and disease stages
Correlate expression with clinical outcomes

Biomarker Discovery

Screen for diagnostic or prognostic markers
Validate biomarkers across independent cohorts
Compare marker performance across platforms
Integrate expression with clinical data

Key Concepts

SOFT (Simple Omnibus Format in Text): GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.

MINiML (MIAME Notation in Markup Language): XML format for GEO data, used for programmatic access and data exchange.

Series Matrix: Tab-delimited expression matrix with samples as columns and genes/probes as rows. Fastest format for getting expression data.

MIAME Compliance: Minimum Information About a Microarray Experiment, the reporting standard that GEO's submission procedures are designed to satisfy. MINSEQE is the corresponding standard for sequencing experiments.

Expression Value Types: Different types of expression measurements (raw signal, normalized, log-transformed). Always check platform and processing methods.

Platform Annotation: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
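
To make the Series Matrix layout described above concrete, here is a minimal standard-library sketch that separates the `!`-prefixed metadata lines from the expression table. It assumes the table is delimited by `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines; the function names and the tiny sample text are illustrative, and GEOparse or pandas are the more practical choice for real files.

```python
import csv
import io


def _to_float(value):
    """Convert a table cell to float, returning None for empty or non-numeric cells."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_series_matrix(text):
    """Split series-matrix text into (metadata, header, rows)."""
    metadata, header, rows = {}, None, []
    in_table = False
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or not fields[0]:
            continue  # skip blank lines
        key = fields[0]
        if key == "!series_matrix_table_begin":
            in_table = True
        elif key == "!series_matrix_table_end":
            in_table = False
        elif in_table and header is None:
            header = fields  # first table line: ID_REF plus sample accessions
        elif in_table:
            rows.append((key, [_to_float(v) for v in fields[1:]]))
        elif key.startswith("!"):
            metadata.setdefault(key[1:], []).extend(fields[1:])
    return metadata, header, rows


# Tiny made-up example in the series-matrix layout
sample_text = (
    '!Series_title\t"Example series"\n'
    '!Sample_geo_accession\t"GSM1"\t"GSM2"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_1"\t7.5\t8.25\n'
    '"probe_2"\t\t3.0\n'
    '!series_matrix_table_end\n'
)
metadata, header, rows = parse_series_matrix(sample_text)
print(header)
print(rows)
```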

GEO2R Web Tool

For quick analysis without coding, use GEO2R:

Web-based statistical analysis tool integrated into GEO
Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
Performs differential expression analysis
Generates R scripts for reproducibility
Useful for exploratory analysis before downloading data

Rate Limiting and Best Practices

NCBI E-utilities Rate Limits:

Without API key: 3 requests per second
With API key: 10 requests per second
Implement delays between requests: time.sleep(0.34) (no API key) or time.sleep(0.1) (with API key)
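
Rather than scattering fixed `time.sleep` calls, the delay can be wrapped in a small helper that only sleeps for the time still owed since the previous request. This is a minimal sketch; the `RateLimiter` class and the placeholder queries are assumptions for illustration, and the actual Entrez call is left as a comment.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the configured request rate."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


limiter = RateLimiter(max_per_second=3)  # use 10 when an API key is set
for query in ["query one", "query two", "query three"]:  # placeholder queries
    limiter.wait()
    # An Entrez request such as Entrez.esearch(db="gds", term=query) would go here
```

Measuring elapsed time with `time.monotonic()` means slow requests do not add unnecessary extra delay on top of the time they already took.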

FTP Access:

No rate limits for FTP downloads
Preferred method for bulk downloads
Can download entire directories with wget -r
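
For scripted downloads it helps to derive the directory for an accession instead of browsing for it. The sketch below assumes the GEO FTP layout in which the last three digits of an accession are replaced by `nnn` in the parent directory (for example `GSE123nnn` for `GSE123456`); the function name `geo_ftp_dir` is hypothetical, and the resulting path should be checked against the FTP site before relying on it.

```python
import re

GEO_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo"
# Top-level directory for each accession type
_KIND_DIRS = {"GSE": "series", "GSM": "samples", "GPL": "platforms", "GDS": "datasets"}


def geo_ftp_dir(accession):
    """Build the FTP directory URL for a GEO accession such as 'GSE123456'."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a recognized GEO accession: {accession!r}")
    prefix, digits = match.groups()
    # Replace the last three digits with 'nnn'; short IDs collapse to e.g. 'GSEnnn'
    stub = prefix + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/{_KIND_DIRS[prefix]}/{stub}/{prefix}{digits}/"


print(geo_ftp_dir("GSE123456"))
```

The returned URL can be passed to `wget -r`; series directories generally contain subdirectories such as `matrix/`, `soft/`, and `suppl/`, though not every series provides all of them.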

GEOparse Caching:

GEOparse automatically caches downloaded files in destdir
Subsequent calls use cached data
Clean cache periodically to save disk space
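
Before cleaning the cache it is useful to know how much space it occupies. This standard-library sketch totals the files under a download directory; the helper name `cache_usage` is hypothetical and `./data` simply mirrors the `destdir` used in the earlier examples.

```python
import os


def cache_usage(destdir):
    """Return (file_count, total_bytes) for all files under destdir."""
    count, total = 0, 0
    for root, _dirs, files in os.walk(destdir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                count += 1
                total += os.path.getsize(path)
    return count, total


# Reports (0, 0) if the directory does not exist yet
n_files, n_bytes = cache_usage("./data")
print(f"{n_files} cached files, {n_bytes / 1e6:.1f} MB")
```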

Recommended Practices:

Use GEOparse for series-level access (easiest)
Use E-utilities for metadata searching and batch queries
Use FTP for direct file downloads and bulk operations
Cache data locally to avoid repeated downloads
Always set Entrez.email when using Biopython

Resources

references/geo_reference.md

Comprehensive reference documentation covering:

Detailed E-utilities API specifications and endpoints
Complete SOFT and MINiML file format documentation
Advanced GEOparse usage patterns and examples
FTP directory structure and file naming conventions
Data processing pipelines and normalization methods
Troubleshooting common issues and error handling
Platform-specific considerations and quirks

Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.

Important Notes

Data Quality Considerations

GEO accepts user-submitted data with varying quality standards
Always check platform annotation and processing methods
Verify sample metadata and experimental design
Be cautious with batch effects across studies
Consider reprocessing raw data for consistency

File Size Warnings

Series matrix files can be large (>1 GB for large studies)
Supplementary files (e.g., CEL files) can be very large
Plan for adequate disk space before downloading
Consider downloading samples incrementally

Data Usage and Citation

GEO data is freely available for research use
Always cite original studies when using GEO data
Cite GEO database: Barrett et al. (2013) Nucleic Acids Research
Check individual dataset usage restrictions (if any)
Follow NCBI guidelines for programmatic access

Common Pitfalls

Different platforms use different probe IDs (requires annotation mapping)
Expression values may be raw, normalized, or log-transformed (check metadata)
Sample metadata can be inconsistently formatted across studies
Not all series have series matrix files (older submissions)
Platform annotations may be outdated (genes renamed, IDs deprecated)
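
The second pitfall can be partly guarded against with a quick scale check before computing fold changes. The sketch below is a rough heuristic: log2-scale intensities rarely exceed a few tens, so a high upper quantile suggests untransformed values. The function name and the threshold of 100 are assumptions for illustration, not a GEO rule, and the submitter's processing description remains the authoritative source.

```python
def looks_log_transformed(values, high_quantile=0.99, threshold=100.0):
    """Guess whether expression values are on a log scale.

    Returns True when the chosen upper quantile is at or below `threshold`.
    """
    # Drop missing values (None or NaN; NaN is the only value not equal to itself)
    clean = sorted(v for v in values if v is not None and v == v)
    if not clean:
        raise ValueError("no numeric values to inspect")
    idx = min(len(clean) - 1, int(high_quantile * len(clean)))
    return clean[idx] <= threshold


# Made-up values for illustration
print(looks_log_transformed([5.2, 7.9, 10.1, 13.4]))
print(looks_log_transformed([120.0, 850.0, 15000.0, 40.0]))
```

If the check returns False, applying a log2 transform (with a small offset for zeros) before the difference-of-means fold change is usually appropriate.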

Additional Resources

GEO Website: https://www.ncbi.nlm.nih.gov/geo/
GEO Submission Guidelines: https://www.ncbi.nlm.nih.gov/geo/info/submission.html
GEOparse Documentation: https://geoparse.readthedocs.io/
E-utilities Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
GEO FTP Site: ftp://ftp.ncbi.nlm.nih.gov/geo/
GEO2R Tool: https://www.ncbi.nlm.nih.gov/geo/geo2r/
NCBI API Keys: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Biopython Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html

First Seen

Jan 21, 2026
