
IDC Imaging Data Commons Guide: Querying and Downloading Cancer Imaging Data with Python, and Visualizing DICOM

imaging-data-commons by k-dense-ai/claude-scientific-skills

50 weekly installs

17,200 GitHub Stars

GitHub

Install command

npx skills add https://github.com/k-dense-ai/claude-scientific-skills --skill imaging-data-commons

Tags: medical technology, research tools, data processing

🇨🇳 Chinese introduction

Imaging Data Commons

Overview

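Use the idc-index Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication is required for data access.

Current IDC data version: v23 (always verify with IDCClient().get_idc_version())

Primary tool: idc-index (GitHub)

CRITICAL - Check the package version and upgrade if needed (run this FIRST):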

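import re
import subprocess
import sys

import idc_index

REQUIRED_VERSION = "0.11.10"  # Must match metadata.idc-index in this file
installed = idc_index.__version__


def version_tuple(version):
    # Compare numerically: as plain strings, "0.9.0" sorts after "0.11.10"
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


if version_tuple(installed) < version_tuple(REQUIRED_VERSION):
    print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
    # Use the running interpreter's pip so the upgrade lands in the same environment
    subprocess.run([sys.executable, "-m", "pip", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
    print("Upgrade complete. Restart Python to use new version.")
else:
    print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")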


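When to Use This Skill

Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
Selecting image subsets by cancer type, modality, anatomical site, or other metadata
Downloading DICOM data from IDC
Checking data licenses before use in research or commercial applications
Visualizing medical images in a browser without local DICOM viewer software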

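Quick Navigation

Core Sections (inline):

IDC Data Model - Collection and analysis result hierarchy
Index Tables - Available tables and joining patterns
Installation - Package setup and version verification
Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
Best Practices - Usage guidelines
Troubleshooting - Common issues and solutions

Reference Guides (load on demand):

Guide	When to Load
`index_tables_guide.md`	Complex JOINs, schema discovery, DataFrame access
`use_cases.md`	End-to-end workflow examples (training datasets, batch downloads)
`sql_patterns.md`	Quick SQL patterns for filter discovery, annotations, size estimation
`clinical_data_guide.md`	Clinical/tabular data, imaging+clinical joins, value mapping
`cloud_storage_guide.md`	Direct S3/GCS access, versioning, UUID mapping
`dicomweb_guide.md`	DICOMweb endpoints, PACS integration
`digital_pathology_guide.md`	Slide microscopy (SM), annotations (ANN), pathology workflows
`bigquery_guide.md`	Full DICOM metadata, private elements (requires GCP)
`cli_guide.md`	Command-line tools (`idc download`, manifest files)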

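IDC Data Model

IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):

collection_id : Groups patients by disease, modality, or research focus (e.g., tcga_luad, nlst). A patient belongs to exactly one collection.
analysis_result_id : Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

Use collection_id to find original imaging data, which may include annotations deposited along with the images; use analysis_result_id to find AI-generated or expert annotations.

Key identifiers for queries:

Identifier	Scope	Use for
`collection_id`	Dataset grouping	Filtering by project/study
`PatientID`	Patient	Grouping images by patient
`StudyInstanceUID`	DICOM study	Grouping of related series, visualization
`SeriesInstanceUID`	DICOM series	Grouping of related instances (images), download, visualization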

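Index Tables

The idc-index package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

Complete index table documentation: Use https://idc-index.readthedocs.io/en/latest/indices_reference.html for a quick check of available tables and columns without executing any code.

Important: Use client.indices_overview to get current table descriptions and column schemas. This is the authoritative source for available columns and their types; always query it when writing SQL or exploring data structure.

Available Tables

Table	Row Granularity	Loaded	Description
`index`	1 row = 1 DICOM series	Auto	Primary metadata for all current IDC data
`prior_versions_index`	1 row = 1 DICOM series	Auto	Series from previous IDC releases; for downloading deprecated data
`collections_index`	1 row = 1 collection	fetch_index()	Collection-level metadata and descriptions
`analysis_results_index`	1 row = 1 analysis result collection	fetch_index()	Metadata about derived datasets (annotations, segmentations)
`clinical_index`	1 row = 1 clinical data column	fetch_index()	Dictionary mapping clinical table columns to collections
`sm_index`	1 row = 1 slide microscopy series	fetch_index()	Slide microscopy (pathology) series metadata
`sm_instance_index`	1 row = 1 slide microscopy instance	fetch_index()	Instance-level (SOPInstanceUID) metadata for slide microscopy
`seg_index`	1 row = 1 DICOM Segmentation series	fetch_index()	Segmentation metadata: algorithm, segment count, reference to source image series
`ann_index`	1 row = 1 DICOM ANN series	fetch_index()	Microscopy Bulk Simple Annotations series metadata; references the annotated image series
`ann_group_index`	1 row = 1 annotation group	fetch_index()	Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm
`contrast_index`	1 row = 1 series with contrast information	fetch_index()	Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF)

Auto = loaded automatically when IDCClient() is instantiated
fetch_index() = requires client.fetch_index("table_name") to load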

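Joining Tables

Key columns are not explicitly labeled; the following is a subset that can be used in joins.

Join Column	Tables	Use Case
`collection_id`	index, prior_versions_index, collections_index, clinical_index	Link series to collection metadata or clinical data
`SeriesInstanceUID`	index, prior_versions_index, sm_index, sm_instance_index	Link series across tables; connect to slide microscopy details
`StudyInstanceUID`	index, prior_versions_index	Link studies across current and historical data
`PatientID`	index, prior_versions_index	Link patients across current and historical data
`analysis_result_id`	index, analysis_results_index	Link series to analysis result metadata (annotations, segmentations)
`source_DOI`	index, analysis_results_index	Link by publication DOI
`crdc_series_uuid`	index, prior_versions_index	Link by CRDC unique identifier
`Modality`	index, prior_versions_index	Filter by imaging modality
`SeriesInstanceUID`	index, seg_index, ann_index, ann_group_index, contrast_index	Link segmentation/annotation/contrast series to their index metadata
`segmented_SeriesInstanceUID`	seg_index → index	Link a segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID)
`referenced_SeriesInstanceUID`	ann_index → index	Link an annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID)

Note: Subjects, Updated, and Description appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see references/index_tables_guide.md.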
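The seg_index → index join from the table can be sketched as a reusable query string. This is a minimal sketch, not the library's own helper: it uses only the column names listed in the table, the names `SEG_JOIN_QUERY`, `build_seg_join_query`, and `find_segmentations` are hypothetical, and it assumes `client` is an already-created IDCClient.

```python
# Sketch only: join each segmentation series to the image series it segments.
# Column names come from the join table; the helper names are hypothetical.
SEG_JOIN_QUERY = """
SELECT
  s.SeriesInstanceUID AS seg_series_uid,
  s.segmented_SeriesInstanceUID AS source_series_uid,
  i.collection_id,
  i.Modality AS source_modality
FROM seg_index s
JOIN index i
  ON s.segmented_SeriesInstanceUID = i.SeriesInstanceUID
LIMIT {limit}
"""


def build_seg_join_query(limit=10):
    """Return the join query with a validated, positive integer LIMIT."""
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be positive")
    return SEG_JOIN_QUERY.format(limit=limit)


def find_segmentations(client, limit=10):
    """Run the query on an existing IDCClient (defined here, not called)."""
    client.fetch_index("seg_index")  # seg_index is not loaded automatically
    return client.sql_query(build_seg_join_query(limit))
```

Calling `find_segmentations(client, limit=5)` would return a DataFrame with one row per segmentation series. Keeping the LIMIT small while exploring follows the "start with small queries" advice in this guide.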

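Clinical Data Access

# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")

# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")

See references/clinical_data_guide.md for detailed workflows, including value mapping patterns and joining clinical data with imaging.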

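Data Access Options

Method	Auth Required	Best For
`idc-index`	No	Key queries and downloads (recommended)
IDC Portal	No	Interactive exploration, manual selection, browser-based download
BigQuery	Yes (GCP account)	Complex queries, full DICOM metadata
DICOMweb proxy	No	Tool integration via DICOMweb API
Cloud storage (S3/GCS)	No	Direct file access, bulk downloads, custom pipelines

Cloud storage organization

IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.

Bucket (AWS / GCS)	License	Content
`idc-open-data` / `idc-open-data`	No commercial restriction	>90% of IDC data
`idc-open-data-two` / `idc-open-idc1`	No commercial restriction	Collections with potential head scans
`idc-open-data-cr` / `idc-open-cr`	Commercial use restricted (CC BY-NC)	~4% of data

Files are stored as <crdc_series_uuid>/<crdc_instance_uuid>.dcm. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the series_aws_url column from the index for S3 URLs; GCS uses the same path structure.

See references/cloud_storage_guide.md for bucket details, access commands, UUID mapping, and versioning.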
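Because GCS mirrors the S3 layout with the same path structure, an S3 URL taken from series_aws_url can be mapped to its GCS counterpart by swapping the bucket name. The following is a minimal sketch under that assumption; the bucket pairs are copied from the table, and the helper name `s3_to_gcs_url` is hypothetical rather than part of idc-index.

```python
# Bucket pairs copied from the bucket table (AWS name -> GCS name).
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}


def s3_to_gcs_url(s3_url):
    """Map an IDC S3 URL to the mirrored GCS URL (hypothetical helper)."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    bucket, _, key = s3_url[len(prefix):].partition("/")
    if bucket not in AWS_TO_GCS_BUCKET:
        raise ValueError(f"unknown IDC bucket: {bucket!r}")
    # Only the bucket name changes; the <crdc_series_uuid>/... key is kept as-is.
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}/{key}"


print(s3_to_gcs_url("s3://idc-open-data-two/cb09464a-c5cc-4428-9339-d7fa87cfe837/*"))
```

Rejecting unknown buckets, instead of passing them through, keeps a typo in a manifest from silently producing a URL that points nowhere.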

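DICOMweb access

IDC data is available via a DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.

Endpoint	Auth	Use Case
Public proxy	No	Testing, moderate queries, daily quota
Google Healthcare	Yes (GCP)	Production use, higher quotas

See references/dicomweb_guide.md for endpoint URLs, code examples, supported operations, and implementation details.

Installation and Setup

Required (for basic access):

pip install --upgrade idc-index

Important: Each new IDC data release triggers a new version of idc-index. Always use the --upgrade flag when installing, unless an older version is needed for reproducibility.

Important: IDC data version v23 is current. Always verify your version:

print(client.get_idc_version())  # Should return "v23"

If you see an older version, upgrade with: pip install --upgrade idc-index

Tested with: idc-index 0.11.10 (IDC data version v23)

Optional (for data analysis):

pip install pandas numpy pydicom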

Core Capabilities

1. Data Discovery and Exploration

Discover what imaging collections and data are available in IDC:

from idc_index import IDCClient

client = IDCClient()

# Get summary statistics from primary index
query = """
SELECT
  collection_id,
  COUNT(DISTINCT PatientID) as patients,
  COUNT(DISTINCT SeriesInstanceUID) as series,
  SUM(series_size_MB) as size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)

# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
""")

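collections_index provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types, without needing to aggregate from the primary index.

analysis_results_index lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

2. Querying Metadata with SQL

Query the IDC mini-index using SQL to find specific datasets.

First, explore available values for filter columns: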

from idc_index import IDCClient

client = IDCClient()

# Check what Modality values exist
modalities = client.sql_query("""
    SELECT DISTINCT Modality, COUNT(*) as series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
""")
print(modalities)

# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
""")
print(body_parts)

Then query with validated filter values:

# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
    SELECT
      collection_id,
      PatientID,
      SeriesInstanceUID,
      Modality,
      SeriesDescription,
      license_short_name
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined = 'BREAST'
    LIMIT 20
""")

# Access results as pandas DataFrame
for idx, row in results.iterrows():
    print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")

To filter by cancer type, join with collections_index:

client.fetch_index("collections_index")
results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
      AND i.Modality = 'MR'
    LIMIT 20
""")

Available metadata fields (use client.indices_overview for the complete list):

Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
Clinical: PatientAge, PatientSex, StudyDate
Descriptions: StudyDescription, SeriesDescription
Licensing: license_short_name

Note: Cancer type is in collections_index.CancerTypes, not in the primary index table.

3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage:

Download an entire collection:

from idc_index import IDCClient

client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
    collection_id="rider_pilot",
    downloadDir="./data/rider"
)

Download specific series:

# First, query for series UIDs
series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND collection_id = 'nlst'
    LIMIT 5
""")

# Download only those series
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/lung_ct"
)

Custom directory structure:

Default dirTemplate: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID

# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
    collection_id="tcga_luad",
    downloadDir="./data",
    dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/flat",
    dirTemplate=""
)
# Results in: ./data/flat/*.dcm
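To preview where files would land before starting a download, the placeholder substitution can be imitated locally. This is an illustrative sketch only: `expand_dir_template` is a hypothetical helper, not the routine idc-index uses internally, and the two UIDs in the sample record are made-up placeholders.

```python
def expand_dir_template(template, metadata):
    """Replace each %Key placeholder in template with metadata[Key]."""
    path = template
    # Substitute longer keys first so a short key can never clobber
    # part of a longer placeholder that happens to start with it.
    for key in sorted(metadata, key=len, reverse=True):
        path = path.replace(f"%{key}", str(metadata[key]))
    return path


# Sample record; the UIDs are made-up placeholders, not real IDC identifiers.
series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.3.4",
}

default_template = "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"
print(expand_dir_template(default_template, series))
print(expand_dir_template("%collection_id/%PatientID/%Modality", series))
```

An empty template expands to an empty string, which matches the flat layout where every file sits directly in downloadDir.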

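Downloaded file names:

Individual DICOM files are named using their CRDC instance UUID: <crdc_instance_uuid>.dcm (e.g., 0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm). This UUID-based naming:

Enables version tracking (UUIDs change when file content changes)
Matches cloud storage organization (s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm)
Differs from DICOM UIDs (SOPInstanceUID), which are preserved inside the file metadata

To identify files, use the crdc_instance_uuid column in queries or read DICOM metadata (SOPInstanceUID) from the files.

Command-Line Download

The idc download command provides command-line access to download functionality without writing Python code. It is available after installing idc-index.

Auto-detects input type: manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data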

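Options:

Option	Description
`--download-dir`	Output directory (default: current directory)
`--dir-template`	Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`)
`--log-level`	Verbosity: debug, info, warning, error, critical

Manifest files:

Manifest files contain S3 URLs (one per line) and can be:

Exported from the IDC Portal after cohort selection
Shared by collaborators for reproducible data access
Generated programmatically from query results

Format (one S3 URL per line):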

s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*

Example: Generate manifest from Python query:

from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')

idc download ct_manifest.txt --download-dir ./ct_data
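Before handing a manifest to `idc download`, it can help to check that every line is a well-formed S3 URL and to see whether any entry points at the commercially restricted bucket. The sketch below assumes the one-URL-per-line format shown earlier; skipping lines that start with `#` is an assumption about manifests that carry header comments, and both helper names are hypothetical.

```python
# Bucket holding CC BY-NC data, taken from the cloud storage section of this guide.
NONCOMMERCIAL_BUCKET = "idc-open-data-cr"


def parse_manifest(text):
    """Return (bucket, series_uuid) pairs for each S3 URL in a manifest."""
    entries = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        if not line.startswith("s3://"):
            raise ValueError(f"line {line_no}: not an S3 URL: {line!r}")
        bucket, _, rest = line[len("s3://"):].partition("/")
        series_uuid = rest.split("/", 1)[0]
        entries.append((bucket, series_uuid))
    return entries


def restricted_entries(entries):
    """Keep only entries stored in the non-commercial bucket."""
    return [e for e in entries if e[0] == NONCOMMERCIAL_BUCKET]


manifest_text = """\
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
"""
entries = parse_manifest(manifest_text)
print(len(entries), "series,", len(restricted_entries(entries)), "restricted")
```

The bucket check is only a first pass; the license_short_name column remains the place to confirm the terms for each series.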

4. Visualizing IDC Images

View DICOM data in the browser without downloading:

from idc_index import IDCClient
import webbrowser

client = IDCClient()

# First query to get valid UIDs
results = client.sql_query("""
    SELECT SeriesInstanceUID, StudyInstanceUID
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
""")

# View a single series
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)

# View all series in a study (useful for multi-series exams such as MRI protocols)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)

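The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM study contains multiple series (e.g., T1, T2, and DWI sequences from a single MRI session).

5. Understanding and Checking Licenses

Check data licensing before use (critical for commercial applications):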

from idc_index import IDCClient

client = IDCClient()

# Check licenses for all collections
query = """
SELECT DISTINCT
  collection_id,
  license_short_name,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""

licenses = client.sql_query(query)
print(licenses)

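License types in IDC:

CC BY 4.0 / CC BY 3.0 (~97% of data) - Allows commercial use with attribution
CC BY-NC 4.0 / CC BY-NC 3.0 (~3% of data) - Non-commercial use only
Custom licenses (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

Important: Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.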
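The three license groups above can be turned into a simple triage step for query results. This is a minimal sketch that assumes license_short_name values are spelled as in the list (for example "CC BY 4.0" or "CC BY-NC 3.0"); `commercial_use_status` is a hypothetical helper, and anything it does not recognize is flagged for manual review rather than guessed.

```python
def commercial_use_status(license_short_name):
    """Classify a license_short_name as 'allowed', 'non-commercial', or 'review'."""
    name = license_short_name.strip().upper()
    if name.startswith("CC BY-NC"):
        return "non-commercial"
    if name.startswith("CC BY "):
        return "allowed"  # commercial use permitted, attribution still required
    return "review"  # custom or unrecognized terms: read them before use


for value in ["CC BY 4.0", "CC BY-NC 3.0", "NLM Terms and Conditions"]:
    print(value, "->", commercial_use_status(value))
```

Returning "review" for unknown values errs on the side of caution, which suits the custom-license collections mentioned above.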

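Generating Citations for Attribution

The source_DOI column contains DOIs that link to publications describing how the data was generated. To satisfy attribution requirements, use citations_from_selection() to generate properly formatted citations: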

from idc_index import IDCClient

client = IDCClient()

# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
    print(citation)

# Get citations for specific series
results = client.sql_query("""
    SELECT SeriesInstanceUID FROM index
    WHERE collection_id = 'tcga_luad' LIMIT 5
""")
citations = client.citations_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)

# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
    collection_id="tcga_luad",
    citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)

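Parameters:

collection_id: Filter by collection
patientId: Filter by patient ID
studyInstanceUID: Filter by study UID
seriesInstanceUID: Filter by series UID
citation_format: Use the IDCClient.CITATION_FORMAT_* constants:
- CITATION_FORMAT_APA (default) - APA style
- CITATION_FORMAT_BIBTEX - BibTeX for LaTeX
- CITATION_FORMAT_JSON - CSL JSON
- CITATION_FORMAT_TURTLE - RDF Turtle

Best practice: When publishing results that use IDC data, include the generated citations to attribute the data sources properly and satisfy license requirements.

6. Batch Processing and Filtering

Process large datasets efficiently with filtering: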

from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners
query = """
SELECT
  SeriesInstanceUID,
  PatientID,
  collection_id,
  ManufacturerModelName
FROM index
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
  AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""

results = client.sql_query(query)

# Save manifest for later use
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeouts
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=f"./data/batch_{i//batch_size}"
    )

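7. Advanced Queries with BigQuery

For queries that need full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. This requires a GCP account with billing enabled.

Dataset: bigquery-public-data.idc_current.*
Main table: dicom_all (combined metadata)
Full metadata: dicom_metadata (all DICOM tags)
Private elements: OtherElements column (vendor-specific tags such as diffusion b-values)

See references/bigquery_guide.md for setup, table schemas, query patterns, private element access, and cost optimization.

Before using BigQuery, always check whether the specialized index tables already contain the metadata you need:

Use client.indices_overview or the idc-index indices reference to discover all available tables and their columns
Fetch the relevant index: client.fetch_index("table_name")
Query locally with client.sql_query() (free, no GCP account needed)

Common specialized indices: seg_index (segmentations), ann_index / ann_group_index (microscopy annotations), sm_index (slide microscopy), collections_index (collection metadata). Use BigQuery only when you need private DICOM elements or attributes that are not present in any index.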

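8. Tool Selection Guide

Task	Tool	Reference
Programmatic queries and downloads	`idc-index`	This document
Interactive exploration	IDC Portal	https://portal.imaging.datacommons.cancer.gov/
Complex metadata queries	BigQuery	`references/bigquery_guide.md`
3D visualization and analysis	SlicerIDCBrowser	https://github.com/ImagingDataCommons/SlicerIDCBrowser

Default choice: Use idc-index for most tasks (no authentication, easy-to-use API, batch downloads).

9. Integration with Analysis Pipelines

Integrate IDC data into imaging analysis workflows:

Read downloaded DICOM files: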

import pydicom
import os

# Read DICOM files from a downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

# Load the first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")

Build a 3D volume from a CT series:

import pydicom
import numpy as np
from pathlib import Path

def load_ct_series(series_path):
    """Load a CT series as a 3D numpy array"""
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]

    # Sort by slice position
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

    # Stack into a 3D array
    volume = np.stack([s.pixel_array for s in slices])

    return volume, slices[0]  # Return the volume and the first slice for metadata

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}")  # (z, y, x)
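Sorting by ImagePositionPatient[2] orders the slices but does not reveal whether any are missing. A quick spacing check on the z positions can flag that before the volume is used. This is a minimal sketch: `slice_spacing_report` is a hypothetical helper, the 0.01 tolerance is an assumed value in the same units as the positions, and the sample positions are made up.

```python
def slice_spacing_report(z_positions, tolerance=0.01):
    """Sort z positions and report gaps that differ from the median gap."""
    zs = sorted(float(z) for z in z_positions)
    gaps = [b - a for a, b in zip(zs, zs[1:])]
    if not gaps:
        return {"spacing": None, "irregular_gaps": []}
    median_gap = sorted(gaps)[len(gaps) // 2]
    irregular = [
        (zs[i], zs[i + 1])
        for i, gap in enumerate(gaps)
        if abs(gap - median_gap) > tolerance
    ]
    return {"spacing": median_gap, "irregular_gaps": irregular}


# Made-up positions with one slice missing between 2.0 and 4.0.
report = slice_spacing_report([4, 0, 1, 5, 2])
print(report)
```

With the loader above, the positions could be passed as `[s.ImagePositionPatient[2] for s in slices]`; a non-empty irregular_gaps list suggests an incomplete download or a series with variable spacing.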

Integrate with SimpleITK:

import SimpleITK as sitk

# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

# CurvatureFlow needs a floating-point image; CT series are usually stored as integers
image = sitk.Cast(image, sitk.sitkFloat32)

# Apply processing
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")

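See references/use_cases.md for complete end-to-end workflow examples, including:

Building a deep learning training dataset from lung CT scans
Comparing image quality across scanner manufacturers
Previewing data in the browser before downloading
License-aware batch downloads for commercial use

Best Practices

Verify the IDC version before generating responses - Always call client.get_idc_version() at the start of a session to confirm you are using the expected data version (currently v23). If an older version is in use, recommend pip install --upgrade idc-index
Check licenses before use - Always query the license_short_name field and respect the license terms (CC BY vs CC BY-NC)
Generate citations for attribution - Use citations_from_selection() to get properly formatted citations from source_DOI values; include them in publications
Start with small queries - Use a LIMIT clause while exploring to avoid long downloads and to understand the data structure
Use the mini-index for simple queries - Use BigQuery only when you need comprehensive metadata or complex JOINs
Organize downloads with dirTemplate - Use a meaningful directory structure such as %collection_id/%PatientID/%Modality
Cache query results - Save DataFrames to CSV files to avoid re-querying and to ensure reproducibility
Estimate size first - Check the collection size before downloading; some collections are terabytes in size
Save manifests - Always save query results with Series UIDs for reproducibility and data provenance
Read the documentation - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/
Use the IDC forum - Search for questions and answers, and ask IDC maintainers and users, at https://discourse.canceridc.dev/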
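The "estimate size first" advice can be sketched as a small query builder plus a formatter for the result. Treat this as a sketch under stated assumptions: it relies on the series_size_MB column used elsewhere in this guide, the collection ID pattern (lowercase letters, digits, underscores) is inferred from the IDs shown here, and both helper names are hypothetical.

```python
import re

# Assumption: collection IDs look like the ones in this guide, e.g. "rider_pilot".
_COLLECTION_ID = re.compile(r"[a-z0-9_]+")


def build_size_query(collection_id):
    """Return SQL that sums series_size_MB for one collection."""
    if not _COLLECTION_ID.fullmatch(collection_id):
        raise ValueError(f"unexpected collection_id: {collection_id!r}")
    return (
        "SELECT collection_id, "
        "COUNT(DISTINCT SeriesInstanceUID) AS series, "
        "SUM(series_size_MB) AS size_mb "
        "FROM index "
        f"WHERE collection_id = '{collection_id}' "
        "GROUP BY collection_id"
    )


def format_size(size_mb):
    """Format a size given in MB using decimal units (1000 MB = 1 GB)."""
    if size_mb >= 1_000_000:
        return f"{size_mb / 1_000_000:.1f} TB"
    if size_mb >= 1_000:
        return f"{size_mb / 1_000:.1f} GB"
    return f"{size_mb:.1f} MB"


print(build_size_query("rider_pilot"))
print(format_size(1500))
```

The query string can be passed to client.sql_query(), and the size_mb value of the returned row to format_size(). Validating the ID before it is interpolated keeps stray quotes out of the SQL.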

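Troubleshooting

Issue: ModuleNotFoundError: No module named 'idc_index'

Cause: The idc-index package is not installed
Solution: Install it with pip install --upgrade idc-index

Issue: Download fails with a connection timeout

Cause: Unstable network or a download that is too large
Solution:
- Download smaller batches (e.g., 10-20 series at a time)
- Check the network connection
- Use dirTemplate to organize downloads by batch
- Implement retry logic with delays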
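Retry logic with delays, combined with small batches, might look like the sketch below. It is not part of idc-index: `retry` and `batches` are hypothetical helpers, the attempt count and delays are arbitrary defaults, and the approach assumes that a failed download surfaces as a raised exception.

```python
import time


def retry(func, attempts=3, delay_s=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds; re-raise the error after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_s)      # wait before the next try
            delay_s *= backoff  # and wait longer each time


def batches(items, size):
    """Yield consecutive slices of items with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(list(batches(["uid1", "uid2", "uid3"], 2)))
```

A download loop could then wrap each batch as `retry(lambda: client.download_from_selection(seriesInstanceUID=batch, downloadDir="./data"))`. The `sleep` parameter exists so the waiting can be replaced in tests.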

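Issue: BigQuery quota exceeded or billing errors

Cause: BigQuery requires a GCP project with billing enabled
Solution: Use the idc-index mini-index for simple queries (no billing required), or see references/bigquery_guide.md for cost optimization tips

Issue: Series UID not found or no data returned

Cause: A typo in the UID, data that is not in the current IDC version, or a wrong field name
Solution:
- Check whether the data is in the current IDC version (some older data may be deprecated)
- Test the query with LIMIT 5 first
- Check field names against the metadata schema documentation

Issue: Downloaded DICOM files won't open

Cause: Corrupted download or incompatible viewer
Solution:
- Check the DICOM object type (Modality and SOPClassUID attributes); some object types require specialized tools
- Verify file integrity (check file sizes)
- Validate with pydicom: pydicom.dcmread(file, force=True)
- Try a different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
- Re-download the series

Common SQL Query Patterns

See references/sql_patterns.md for quick-reference SQL patterns, including:

Filter value discovery (modality, body part, manufacturer)
Annotation and segmentation queries (including seg_index and ann_index joins)
Slide microscopy queries (sm_index patterns)
Download size estimation
Clinical data linking

For details on segmentations and annotations, see also references/digital_pathology_guide.md.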

🇺🇸English

Imaging Data Commons

Overview

Use the idc-index Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.

Current IDC Data Version: v23 (always verify with IDCClient().get_idc_version())

Primary tool: idc-index (GitHub)

CRITICAL - Check package version and upgrade if needed (run this FIRST):

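import re
import subprocess
import sys

import idc_index

REQUIRED_VERSION = "0.11.10"  # Must match metadata.idc-index in this file
installed = idc_index.__version__


def version_tuple(version):
    # Compare numerically: as plain strings, "0.9.0" sorts after "0.11.10"
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


if version_tuple(installed) < version_tuple(REQUIRED_VERSION):
    print(f"Upgrading idc-index from {installed} to {REQUIRED_VERSION}...")
    # Use the running interpreter's pip so the upgrade lands in the same environment
    subprocess.run([sys.executable, "-m", "pip", "install", "--upgrade", "--break-system-packages", "idc-index"], check=True)
    print("Upgrade complete. Restart Python to use new version.")
else:
    print(f"idc-index {installed} meets requirement ({REQUIRED_VERSION})")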

Verify IDC data version and check current data scale:

from idc_index import IDCClient
client = IDCClient()

# Verify IDC data version (should be "v23")
print(f"IDC data version: {client.get_idc_version()}")

# Get collection count and total series
stats = client.sql_query("""
    SELECT
        COUNT(DISTINCT collection_id) as collections,
        COUNT(DISTINCT analysis_result_id) as analysis_results,
        COUNT(DISTINCT PatientID) as patients,
        COUNT(DISTINCT StudyInstanceUID) as studies,
        COUNT(DISTINCT SeriesInstanceUID) as series,
        SUM(instanceCount) as instances,
        SUM(series_size_MB)/1000000 as size_TB
    FROM index
""")
print(stats)

Core workflow:

Query metadata → client.sql_query()
Download DICOM files → client.download_from_selection()
Visualize in browser → client.get_viewer_URL(seriesInstanceUID=...)

When to Use This Skill

Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
Selecting image subsets by cancer type, modality, anatomical site, or other metadata
Downloading DICOM data from IDC
Checking data licenses before use in research or commercial applications
Visualizing medical images in a browser without local DICOM viewer software

Quick Navigation

Core Sections (inline):

IDC Data Model - Collection and analysis result hierarchy
Index Tables - Available tables and joining patterns
Installation - Package setup and version verification
Core Capabilities - Essential API patterns (query, download, visualize, license, citations, batch)
Best Practices - Usage guidelines
Troubleshooting - Common issues and solutions

Reference Guides (load on demand):

Guide	When to Load
`index_tables_guide.md`	Complex JOINs, schema discovery, DataFrame access
`use_cases.md`	End-to-end workflow examples (training datasets, batch downloads)
`sql_patterns.md`	Quick SQL patterns for filter discovery, annotations, size estimation
`clinical_data_guide.md`	Clinical/tabular data, imaging+clinical joins, value mapping
`cloud_storage_guide.md`	Direct S3/GCS access, versioning, UUID mapping
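`dicomweb_guide.md`	DICOMweb endpoints, PACS integration
`digital_pathology_guide.md`	Slide microscopy (SM), annotations (ANN), pathology workflows
`bigquery_guide.md`	Full DICOM metadata, private elements (requires GCP)
`cli_guide.md`	Command-line tools (`idc download`, manifest files)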

IDC Data Model

IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):

collection_id : Groups patients by disease, modality, or research focus (e.g., tcga_luad, nlst). A patient belongs to exactly one collection.
analysis_result_id : Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

Use collection_id to find original imaging data, which may include annotations deposited along with the images; use analysis_result_id to find AI-generated or expert annotations.

Key identifiers for queries:

Identifier	Scope	Use for
`collection_id`	Dataset grouping	Filtering by project/study
`PatientID`	Patient	Grouping images by patient
`StudyInstanceUID`	DICOM study	Grouping of related series, visualization
`SeriesInstanceUID`	DICOM series	Grouping of related instances (images), download, visualization

Index Tables

The idc-index package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

Complete index table documentation: Use https://idc-index.readthedocs.io/en/latest/indices_reference.html for a quick check of available tables and columns without executing any code.

Important: Use client.indices_overview to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.

Available Tables

Table	Row Granularity	Loaded	Description
`index`	1 row = 1 DICOM series	Auto	Primary metadata for all current IDC data
`prior_versions_index`	1 row = 1 DICOM series	Auto	Series from previous IDC releases; for downloading deprecated data
`collections_index`	1 row = 1 collection	fetch_index()	Collection-level metadata and descriptions
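`analysis_results_index`	1 row = 1 analysis result collection	fetch_index()	Metadata about derived datasets (annotations, segmentations)
`clinical_index`	1 row = 1 clinical data column	fetch_index()	Dictionary mapping clinical table columns to collections
`sm_index`	1 row = 1 slide microscopy series	fetch_index()	Slide microscopy (pathology) series metadata
`sm_instance_index`	1 row = 1 slide microscopy instance	fetch_index()	Instance-level (SOPInstanceUID) metadata for slide microscopy
`seg_index`	1 row = 1 DICOM Segmentation series	fetch_index()	Segmentation metadata: algorithm, segment count, reference to source image series
`ann_index`	1 row = 1 DICOM ANN series	fetch_index()	Microscopy Bulk Simple Annotations series metadata; references the annotated image series
`ann_group_index`	1 row = 1 annotation group	fetch_index()	Detailed annotation group metadata: graphic type, annotation count, property codes, algorithm
`contrast_index`	1 row = 1 series with contrast information	fetch_index()	Contrast agent metadata: agent name, ingredient, administration route (CT, MR, PT, XA, RF)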

Auto = loaded automatically when IDCClient() is instantiated
fetch_index() = requires client.fetch_index("table_name") to load

Joining Tables

Key columns are not explicitly labeled; the following is a subset that can be used in joins.

Join Column	Tables	Use Case
`collection_id`	index, prior_versions_index, collections_index, clinical_index	Link series to collection metadata or clinical data
`SeriesInstanceUID`	index, prior_versions_index, sm_index, sm_instance_index	Link series across tables; connect to slide microscopy details
`StudyInstanceUID`	index, prior_versions_index	Link studies across current and historical data
`PatientID`	index, prior_versions_index	Link patients across current and historical data
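`analysis_result_id`	index, analysis_results_index	Link series to analysis result metadata (annotations, segmentations)
`source_DOI`	index, analysis_results_index	Link by publication DOI
`crdc_series_uuid`	index, prior_versions_index	Link by CRDC unique identifier
`Modality`	index, prior_versions_index	Filter by imaging modality
`SeriesInstanceUID`	index, seg_index, ann_index, ann_group_index, contrast_index	Link segmentation/annotation/contrast series to their index metadata
`segmented_SeriesInstanceUID`	seg_index → index	Link a segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID)
`referenced_SeriesInstanceUID`	ann_index → index	Link an annotation to its source image series (join ann_index.referenced_SeriesInstanceUID = index.SeriesInstanceUID)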

Note: Subjects, Updated, and Description appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

For detailed join examples, schema discovery patterns, key columns reference, and DataFrame access, see references/index_tables_guide.md.

Clinical Data Access

# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")

# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")

See references/clinical_data_guide.md for detailed workflows including value mapping patterns and joining clinical data with imaging.

Data Access Options

Method	Auth Required	Best For
`idc-index`	No	Key queries and downloads (recommended)
IDC Portal	No	Interactive exploration, manual selection, browser-based download
BigQuery	Yes (GCP account)	Complex queries, full DICOM metadata
DICOMweb proxy	No	Tool integration via DICOMweb API
Cloud storage (S3/GCS)	No	Direct file access, bulk downloads, custom pipelines

Cloud storage organization

IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.

Bucket (AWS / GCS)	License	Content
`idc-open-data` / `idc-open-data`	No commercial restriction	>90% of IDC data
`idc-open-data-two` / `idc-open-idc1`	No commercial restriction	Collections with potential head scans
`idc-open-data-cr` / `idc-open-cr`	Commercial use restricted (CC BY-NC)	~4% of data

Files are stored as <crdc_series_uuid>/<crdc_instance_uuid>.dcm. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use series_aws_url column from the index for S3 URLs; GCS uses the same path structure.

See references/cloud_storage_guide.md for bucket details, access commands, UUID mapping, and versioning.

DICOMweb access

IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.

Endpoint	Auth	Use Case
Public proxy	No	Testing, moderate queries, daily quota
Google Healthcare	Yes (GCP)	Production use, higher quotas

See references/dicomweb_guide.md for endpoint URLs, code examples, supported operations, and implementation details.

Installation and Setup

Required (for basic access):

pip install --upgrade idc-index

Important: Each new IDC data release triggers a new version of idc-index. Always use the --upgrade flag when installing, unless an older version is needed for reproducibility.

Important: IDC data version v23 is current. Always verify your version:

print(client.get_idc_version())  # Should return "v23"

If you see an older version, upgrade with: pip install --upgrade idc-index

Tested with: idc-index 0.11.10 (IDC data version v23)

Optional (for data analysis):

pip install pandas numpy pydicom

Core Capabilities

1. Data Discovery and Exploration

Discover what imaging collections and data are available in IDC:

from idc_index import IDCClient

client = IDCClient()

# Get summary statistics from primary index
query = """
SELECT
  collection_id,
  COUNT(DISTINCT PatientID) as patients,
  COUNT(DISTINCT SeriesInstanceUID) as series,
  SUM(series_size_MB) as size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)

# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
""")

collections_index provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types — without needing to aggregate from the primary index.

analysis_results_index lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

2. Querying Metadata with SQL

Query the IDC mini-index using SQL to find specific datasets.

First, explore available values for filter columns:

from idc_index import IDCClient

client = IDCClient()

# Check what Modality values exist
modalities = client.sql_query("""
    SELECT DISTINCT Modality, COUNT(*) as series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
""")
print(modalities)

# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
""")
print(body_parts)

Then query with validated filter values:

# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
    SELECT
      collection_id,
      PatientID,
      SeriesInstanceUID,
      Modality,
      SeriesDescription,
      license_short_name
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined = 'BREAST'
    LIMIT 20
""")

# Access results as pandas DataFrame
for idx, row in results.iterrows():
    print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")

To filter by cancer type, join with collections_index:

client.fetch_index("collections_index")
results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
      AND i.Modality = 'MR'
    LIMIT 20
""")

Available metadata fields (use client.indices_overview for complete list):

Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
Clinical: PatientAge, PatientSex, StudyDate
Descriptions: StudyDescription, SeriesDescription
Licensing: license_short_name

Note: Cancer type is in collections_index.CancerTypes, not in the primary index table.

3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage:

Download entire collection:

from idc_index import IDCClient

client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
    collection_id="rider_pilot",
    downloadDir="./data/rider"
)

Download specific series:

# First, query for series UIDs
series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND collection_id = 'nlst'
    LIMIT 5
""")

# Download only those series
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/lung_ct"
)

Custom directory structure:

Default dirTemplate: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID

# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
    collection_id="tcga_luad",
    downloadDir="./data",
    dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/flat",
    dirTemplate=""
)
# Results in: ./data/flat/*.dcm
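To make the mapping from a dirTemplate string to a directory path concrete, here is a minimal sketch that mimics the substitution locally. The helper `expand_dir_template` is hypothetical and is not part of idc-index; it only assumes that each `%Name` placeholder is replaced by the corresponding attribute value, as the examples above suggest. The UID values are placeholders, not real identifiers.

```python
import re


def expand_dir_template(template: str, attrs: dict) -> str:
    """Illustrative only: replace %Name placeholders with values from attrs."""
    if not template or not attrs:
        return template  # an empty template corresponds to a flat layout
    # List longer names first so a name is never shadowed by a shorter one.
    names = sorted(attrs, key=len, reverse=True)
    pattern = re.compile("%(" + "|".join(re.escape(n) for n in names) + ")")
    return pattern.sub(lambda m: str(attrs[m.group(1)]), template)


series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",     # placeholder, not a real UID
    "SeriesInstanceUID": "1.2.3.4",  # placeholder, not a real UID
    "Modality": "CT",
}

print(expand_dir_template("%collection_id/%PatientID/%Modality", series))
# tcga_luad/TCGA-05-4244/CT
```

Matching against the known attribute names, rather than any run of word characters, matters for the default template: in `%Modality_%SeriesInstanceUID` the underscore is a literal separator, not part of the placeholder name.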

Downloaded file names:

Individual DICOM files are named using their CRDC instance UUID: <crdc_instance_uuid>.dcm (e.g., 0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm). This UUID-based naming:

Enables version tracking (UUIDs change when file content changes)
Matches cloud storage organization (s3://idc-open-data/<crdc_series_uuid>/<crdc_instance_uuid>.dcm)
Differs from DICOM UIDs (SOPInstanceUID) which are preserved inside the file metadata

To identify files, use the crdc_instance_uuid column in queries or read DICOM metadata (SOPInstanceUID) from the files.
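As a small illustration of the naming convention, the sketch below recovers the UUID from a downloaded file name so it can be matched against the crdc_instance_uuid column. The helper name `crdc_uuid_from_filename` is hypothetical; it uses only the Python standard library and assumes the `<uuid>.dcm` pattern described above.

```python
import uuid
from pathlib import PurePosixPath


def crdc_uuid_from_filename(path: str):
    """Return the UUID in a '<uuid>.dcm' file name, or None if the name does not fit."""
    p = PurePosixPath(path)
    if p.suffix.lower() != ".dcm":
        return None
    try:
        parsed = str(uuid.UUID(p.stem))
    except ValueError:
        return None
    # Accept only the canonical hyphenated form used in the example above.
    return parsed if parsed == p.stem.lower() else None


print(crdc_uuid_from_filename("./data/flat/0d73f84e-70ae-4eeb-96a0-1c613b5d9229.dcm"))
```

Files whose names do not follow the pattern return None, which is a cheap way to spot files that did not come from an IDC download.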

Command-Line Download

The idc download command exposes the same download functionality from the command line, with no Python code required. It is available once idc-index is installed.

Auto-detects input type: manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data

Options:

Option	Description
`--download-dir`	Output directory (default: current directory)
`--dir-template`	Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`)
`--log-level`	Verbosity: debug, info, warning, error, critical

Manifest files:

Manifest files contain S3 URLs (one per line) and can be:

Exported from the IDC Portal after cohort selection
Shared by collaborators for reproducible data access
Generated programmatically from query results

Format (one S3 URL per line):

s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*

Example: Generate manifest from Python query:

from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')

Then download:

idc download ct_manifest.txt --download-dir ./ct_data
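Before handing a manifest to idc download, it can be useful to sanity-check its lines. The sketch below is a hypothetical validator, not part of idc-index. It assumes every line has the `s3://<bucket>/<series uuid>/*` shape shown in the format example, which may be stricter than what the tool itself accepts.

```python
import re

# Assumed line shape, taken from the format example: s3://<bucket>/<series uuid>/*
MANIFEST_LINE = re.compile(
    r"^s3://(?P<bucket>[a-z0-9.-]+)/"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/\*$"
)


def parse_manifest(text: str):
    """Return (bucket, series_uuid) pairs; raise ValueError on the first malformed line."""
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        match = MANIFEST_LINE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: not an S3 series URL: {line!r}")
        entries.append((match["bucket"], match["uuid"]))
    return entries


example = """s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
"""
print(len(parse_manifest(example)), "series listed")
```

Reporting the line number of the first bad entry makes it easier to fix a manifest that was edited by hand or shared by a collaborator.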

4. Visualizing IDC Images

View DICOM data in the browser without downloading it:

from idc_index import IDCClient
import webbrowser

client = IDCClient()

# First query to get valid UIDs
results = client.sql_query("""
    SELECT SeriesInstanceUID, StudyInstanceUID
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
""")

# View single series
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)

# View all series in a study (useful for multi-series exams like MRI protocols)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)

The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).

5. Understanding and Checking Licenses

Check data licensing before use (critical for commercial applications):

from idc_index import IDCClient

client = IDCClient()

# Check licenses for all collections
query = """
SELECT
  collection_id,
  license_short_name,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""

licenses = client.sql_query(query)
print(licenses)

License types in IDC:

CC BY 4.0 / CC BY 3.0 (~97% of data) - Allows commercial use with attribution
CC BY-NC 4.0 / CC BY-NC 3.0 (~3% of data) - Non-commercial use only
Custom licenses (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

Important: Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
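The license categories above can be turned into a simple screening step over the license_short_name values returned by a query. The function below is a hypothetical helper, and its three labels are illustrative names rather than anything defined by IDC. It sorts CC BY and CC BY-NC values into the groups listed above and flags everything else for manual review.

```python
def commercial_use_status(license_short_name: str) -> str:
    """Map a license_short_name value to a coarse, illustrative category."""
    name = license_short_name.strip().upper()
    if name.startswith("CC BY-NC"):
        return "non-commercial"
    if name.startswith("CC BY "):
        return "commercial-ok"
    # Custom or unrecognized licenses need a human to read the terms.
    return "review-terms"


for value in ["CC BY 4.0", "CC BY-NC 3.0", "NLM Terms and Conditions"]:
    print(value, "->", commercial_use_status(value))
```

Treating anything unrecognized as "review-terms" is the conservative choice: a new or custom license is never silently classified as safe for commercial use.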

Generating Citations for Attribution

The source_DOI column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use citations_from_selection() to generate properly formatted citations:

from idc_index import IDCClient

client = IDCClient()

# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
    print(citation)

# Get citations for specific series
results = client.sql_query("""
    SELECT SeriesInstanceUID FROM index
    WHERE collection_id = 'tcga_luad' LIMIT 5
""")
citations = client.citations_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)

# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
    collection_id="tcga_luad",
    citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)

Parameters:

collection_id: Filter by collection(s)
patientId: Filter by patient ID(s)
studyInstanceUID: Filter by study UID(s)
seriesInstanceUID: Filter by series UID(s)
citation_format: Use IDCClient.CITATION_FORMAT_* constants:
- CITATION_FORMAT_APA (default) - APA style
- CITATION_FORMAT_BIBTEX - BibTeX for LaTeX
- CITATION_FORMAT_JSON - CSL JSON

Best practice: When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.

6. Batch Processing and Filtering

Process large datasets efficiently with filtering:

from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners
query = """
SELECT
  SeriesInstanceUID,
  PatientID,
  collection_id,
  ManufacturerModelName
FROM index
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
  AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""

results = client.sql_query(query)

# Save manifest for later
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=f"./data/batch_{i//batch_size}"
    )

7. Advanced Queries with BigQuery

For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled.

Quick reference:

Dataset: bigquery-public-data.idc_current.*
Main table: dicom_all (combined metadata)
Full metadata: dicom_metadata (all DICOM tags)
Private elements: OtherElements column (vendor-specific tags like diffusion b-values)

See references/bigquery_guide.md for setup, table schemas, query patterns, private element access, and cost optimization.

Before using BigQuery, always check whether a specialized index table already has the metadata you need:

Use client.indices_overview or the idc-index indices reference to discover all available tables and their columns
Fetch the relevant index: client.fetch_index("table_name")
Query locally with client.sql_query() (free, no GCP account needed)

Common specialized indices: seg_index (segmentations), ann_index / ann_group_index (microscopy annotations), sm_index (slide microscopy), collections_index (collection metadata). Only use BigQuery if you need private DICOM elements or attributes not in any index.

8. Tool Selection Guide

Task	Tool	Reference
Programmatic queries & downloads	`idc-index`	This document
Interactive exploration	IDC Portal	https://portal.imaging.datacommons.cancer.gov/
Complex metadata queries	BigQuery	`references/bigquery_guide.md`
3D visualization & analysis	SlicerIDCBrowser	https://github.com/ImagingDataCommons/SlicerIDCBrowser

Default choice: Use idc-index for most tasks (no auth, easy API, batch downloads).

9. Integration with Analysis Pipelines

Integrate IDC data into imaging analysis workflows:

Read downloaded DICOM files:

import pydicom
import os

# Read DICOM files from downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

# Load first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")

Build 3D volume from CT series:

import pydicom
import numpy as np
from pathlib import Path

def load_ct_series(series_path):
    """Load CT series as 3D numpy array"""
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]

    # Sort by slice location
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

    # Stack into 3D array
    volume = np.stack([s.pixel_array for s in slices])

    return volume, slices[0]  # Return volume and first slice for metadata

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}")  # (z, y, x)
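Stacking slices into a volume implicitly assumes they are evenly spaced. A quick check on the z positions can catch missing or duplicated slices before any analysis. The helper below is a hypothetical sketch that uses only the standard library; like the loader above, it assumes an axial series where the third component of ImagePositionPatient orders the slices.

```python
import math


def slice_spacing(z_positions, rel_tol=1e-3):
    """Return the common gap between sorted z positions, or raise if gaps differ."""
    zs = sorted(float(z) for z in z_positions)
    if len(zs) < 2:
        raise ValueError("need at least two slices")
    gaps = [b - a for a, b in zip(zs, zs[1:])]
    first = gaps[0]
    if first == 0 or any(not math.isclose(g, first, rel_tol=rel_tol) for g in gaps):
        raise ValueError(f"non-uniform or duplicate slice positions: {gaps}")
    return first


print(slice_spacing([5.0, 0.0, 2.5]))  # 2.5
```

With the loader above, the positions could be passed as `slice_spacing(s.ImagePositionPatient[2] for s in slices)`. The tolerance value is an arbitrary illustrative default.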

Integrate with SimpleITK:

import SimpleITK as sitk

# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

# CurvatureFlow expects a floating-point image; CT series are usually stored as integers
image = sitk.Cast(image, sitk.sitkFloat32)

# Apply processing
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")

Common Use Cases

See references/use_cases.md for complete end-to-end workflow examples including:

Building deep learning training datasets from lung CT scans
Comparing image quality across scanner manufacturers
Previewing data in browser before downloading
License-aware batch downloads for commercial use

Best Practices

Verify IDC version before generating responses - Always call client.get_idc_version() at the start of a session to confirm you're using the expected data version (currently v23). If using an older version, recommend pip install --upgrade idc-index
Check licenses before use - Always query the license_short_name field and respect licensing terms (CC BY vs CC BY-NC)
Generate citations for attribution - Use citations_from_selection() to get properly formatted citations from source_DOI values; include these in publications
Start with small queries - Use LIMIT clause when exploring to avoid long downloads and understand data structure
Use mini-index for simple queries - Only use BigQuery when you need comprehensive metadata or complex JOINs
Organize downloads with dirTemplate - Use meaningful directory structures like %collection_id/%PatientID/%Modality
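For the version check in the first practice, comparing version labels as plain strings can mislead (for example, "v9" sorts after "v23"). The sketch below compares them numerically. The helper names are hypothetical, and the code assumes the version is reported as a label like "v23", as stated above.

```python
import re


def parse_idc_version(version) -> int:
    """Turn a label such as 'v23' (or '23') into the integer 23."""
    match = re.fullmatch(r"v?(\d+)", str(version).strip())
    if match is None:
        raise ValueError(f"unrecognized IDC version label: {version!r}")
    return int(match.group(1))


def is_outdated(reported, expected="v23") -> bool:
    """True when the reported data version is older than the expected one."""
    return parse_idc_version(reported) < parse_idc_version(expected)


print(is_outdated("v9"))   # True
print(is_outdated("v23"))  # False
```

In a session this could be called as `is_outdated(client.get_idc_version())`, followed by the upgrade recommendation when it returns True.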

Troubleshooting

Issue: ModuleNotFoundError: No module named 'idc_index'

Cause: idc-index package not installed
Solution: Install with pip install --upgrade idc-index

Issue: Download fails with connection timeout

Cause: Network instability or large download size
Solution:
- Download smaller batches (e.g., 10-20 series at a time)
- Check network connection
- Use dirTemplate to organize downloads by batch
- Implement retry logic with delays
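The retry suggestion can be sketched as a small wrapper with exponential backoff. This is a generic, hypothetical helper, not an idc-index feature. It assumes the wrapped call signals failure by raising an exception, which may or may not hold for a given download call and idc-index version.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds, then try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


waits = []
print(retry(flaky, sleep=waits.append), waits)  # ok [2.0, 4.0]
```

A batch download could be wrapped as `retry(lambda: client.download_from_selection(seriesInstanceUID=uids, downloadDir="./data"))`, where `uids` is a list of series UIDs from a query. The `sleep` parameter exists only so the delays can be replaced in tests.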

Issue: BigQuery quota exceeded or billing errors

Cause: BigQuery requires billing-enabled GCP project
Solution: Use idc-index mini-index for simple queries (no billing required), or see references/bigquery_guide.md for cost optimization tips

Issue: Series UID not found or no data returned

Cause: Typo in UID, data not in current IDC version, or wrong field name
Solution:
- Check if data is in current IDC version (some old data may be deprecated)
- Use LIMIT 5 to test query first
- Check field names against metadata schema documentation

Issue: Downloaded DICOM files won't open

Cause: Corrupted download or incompatible viewer
Solution:
- Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
- Verify file integrity (check file sizes)
- Use pydicom to validate: pydicom.dcmread(file, force=True)
- Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
- Re-download the series

Common SQL Query Patterns

See references/sql_patterns.md for quick-reference SQL patterns including:

Filter value discovery (modalities, body parts, manufacturers)
Annotation and segmentation queries (including seg_index, ann_index joins)
Slide microscopy queries (sm_index patterns)
Download size estimation
Clinical data linking

For segmentation and annotation details, also see references/digital_pathology_guide.md.

Related Skills

The following skills complement IDC workflows for downstream analysis and visualization:

DICOM Processing

pydicom - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).

Pathology and Slide Microscopy

See references/digital_pathology_guide.md for DICOM-compatible tools (highdicom, wsidicom, TIA-Toolbox, Slim viewer).

Metadata Visualization

matplotlib - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
seaborn - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
plotly - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.

Data Exploration

exploratory-data-analysis - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.

Resources

Schema Reference (Primary Source)

Always use client.indices_overview for current column schemas. This ensures accuracy with the installed idc-index version:

# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
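Once a schema is in hand, a keyword search over column names and descriptions is a quick way to find the right field. The sketch below is a hypothetical helper that assumes the schema layout used in the snippet above (a "columns" list of dicts with "name", "type", and an optional "description"). The sample schema is made up for illustration and is not the real index schema.

```python
def find_columns(schema: dict, keyword: str):
    """Return names of columns whose name or description mentions keyword."""
    needle = keyword.lower()
    return [
        col["name"]
        for col in schema["columns"]
        if needle in col["name"].lower()
        or needle in (col.get("description") or "").lower()
    ]


# Made-up sample in the same shape as indices_overview["index"]["schema"].
sample_schema = {
    "columns": [
        {"name": "collection_id", "type": "STRING", "description": "Collection identifier"},
        {"name": "license_short_name", "type": "STRING", "description": "Short license label"},
        {"name": "Modality", "type": "STRING"},
    ]
}

print(find_columns(sample_schema, "license"))  # ['license_short_name']
```

With a live client, the real schema from `client.indices_overview["index"]["schema"]` could be passed in place of the sample.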

Reference Documentation

See the Quick Navigation section at the top for the full list of reference guides with decision triggers.

indices_reference - External documentation for index tables (may be ahead of the installed version)

External Links

IDC Portal: https://portal.imaging.datacommons.cancer.gov/explore/
Documentation: https://learn.canceridc.dev/
Tutorials: https://github.com/ImagingDataCommons/IDC-Tutorials
User Forum: https://discourse.canceridc.dev/
idc-index GitHub: https://github.com/ImagingDataCommons/idc-index
Citation : Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180

Skill Updates

This skill version is available in skill metadata. To check for updates:

Visit the releases page
Watch the repository on GitHub (Watch → Custom → Releases)

First Seen

Jan 25, 2026


Additional citation format:

CITATION_FORMAT_TURTLE - RDF Turtle

Additional best practices:

Cache query results - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility

Estimate size first - Check collection size before downloading - some collection sizes are in terabytes!

Save manifests - Always save query results with Series UIDs for reproducibility and data provenance

Read documentation - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/

Use the IDC forum - Search existing questions and answers, or ask the IDC maintainers and users, at https://discourse.canceridc.dev/