tessl install github:K-Dense-AI/claude-scientific-skills --skill imaging-data-commonsgithub.com/K-Dense-AI/claude-scientific-skills
Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
Review Score
91%
Validation Score
13/16
Implementation Score
85%
Activation Score
100%
Use the idc-index Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.
Primary tool: idc-index (GitHub)
Check current data scale for the latest version:
from idc_index import IDCClient
client = IDCClient()
# get IDC data version
print(client.get_idc_version())
# Get collection count and total series
stats = client.sql_query("""
SELECT
COUNT(DISTINCT collection_id) as collections,
COUNT(DISTINCT analysis_result_id) as analysis_results,
COUNT(DISTINCT PatientID) as patients,
COUNT(DISTINCT StudyInstanceUID) as studies,
COUNT(DISTINCT SeriesInstanceUID) as series,
SUM(instanceCount) as instances,
SUM(series_size_MB)/1000000 as size_TB
FROM index
""")
print(stats)Core workflow:
client.sql_query()client.download_from_selection()client.get_viewer_URL(seriesInstanceUID=...)IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
tcga_luad, nlst). A patient belongs to exactly one collection.Use collection_id to find original imaging data, may include annotations deposited along with the images; use analysis_result_id to find AI-generated or expert annotations.
Key identifiers for queries:
| Identifier | Scope | Use for |
|---|---|---|
collection_id | Dataset grouping | Filtering by project/study |
PatientID | Patient | Grouping images by patient |
StudyInstanceUID | DICOM study | Grouping of related series, visualization |
SeriesInstanceUID | DICOM series | Grouping of related series, visualization |
The idc-index package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.
Important: Use client.indices_overview to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.
| Table | Row Granularity | Loaded | Description |
|---|---|---|---|
index | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
prior_versions_index | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
collections_index | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
analysis_results_index | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
clinical_index | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
sm_index | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
sm_instance_index | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
seg_index | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |
Auto = loaded automatically when IDCClient() is instantiated
fetch_index() = requires client.fetch_index("table_name") to load
Key columns are not explicitly labeled, the following is a subset that can be used in joins.
| Join Column | Tables | Use Case |
|---|---|---|
collection_id | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
SeriesInstanceUID | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
StudyInstanceUID | index, prior_versions_index | Link studies across current and historical data |
PatientID | index, prior_versions_index | Link patients across current and historical data |
analysis_result_id | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
source_DOI | index, analysis_results_index | Link by publication DOI |
crdc_series_uuid | index, prior_versions_index | Link by CRDC unique identifier |
Modality | index, prior_versions_index | Filter by imaging modality |
SeriesInstanceUID | index, seg_index | Link segmentation series to its index metadata |
segmented_SeriesInstanceUID | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |
Note: Subjects, Updated, and Description appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).
Example joins:
from idc_index import IDCClient
client = IDCClient()
# Join index with collections_index to get cancer types
client.fetch_index("collections_index")
result = client.sql_query("""
SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
FROM index i
JOIN collections_index c ON i.collection_id = c.collection_id
WHERE i.Modality = 'MR'
LIMIT 10
""")
# Join index with sm_index for slide microscopy details
client.fetch_index("sm_index")
result = client.sql_query("""
SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
FROM index i
JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
LIMIT 10
""")
# Join seg_index with index to find segmentations and their source images
client.fetch_index("seg_index")
result = client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
src.collection_id,
src.Modality as source_modality,
src.BodyPartExamined
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE s.AlgorithmType = 'AUTOMATIC'
LIMIT 10
""")Via SQL (recommended for filtering/aggregation):
from idc_index import IDCClient
client = IDCClient()
# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")
# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")
client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")As pandas DataFrames (direct access):
# Primary index (always available after client initialization)
df = client.index
# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_indexThe indices_overview dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.
DICOM attribute mapping: Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like PatientID, StudyInstanceUID, Modality, BodyPartExamined work as expected.
from idc_index import IDCClient
client = IDCClient()
# List all available indices with descriptions
for name, info in client.indices_overview.items():
print(f"\n{name}:")
print(f" Installed: {info['installed']}")
print(f" Description: {info['description']}")
# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
desc = col.get('description', 'No description')
# Description indicates if column is from DICOM attribute
print(f" {col['name']} ({col['type']}): {desc}")
# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")Alternative: use get_index_schema() method:
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}index TableMost common columns for queries (use indices_overview for complete list and descriptions):
| Column | Type | DICOM | Description |
|---|---|---|---|
collection_id | STRING | No | IDC collection identifier |
analysis_result_id | STRING | No | If applicable, indicates what analysis results collection given series is part of |
source_DOI | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
PatientID | STRING | Yes | Patient identifier |
StudyInstanceUID | STRING | Yes | DICOM Study UID |
SeriesInstanceUID | STRING | Yes | DICOM Series UID — use for downloads/viewing |
Modality | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
BodyPartExamined | STRING | Yes | Anatomical region |
SeriesDescription | STRING | Yes | Description of the series |
Manufacturer | STRING | Yes | Equipment manufacturer |
StudyDate | STRING | Yes | Date study was performed |
PatientSex | STRING | Yes | Patient sex |
PatientAge | STRING | Yes | Patient age at time of study |
license_short_name | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
series_size_MB | FLOAT | No | Size of series in megabytes |
instanceCount | INTEGER | No | Number of DICOM instances in series |
DICOM = Yes: Column value extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")
# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")
# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")| Method | Auth Required | Best For |
|---|---|---|
idc-index | No | Key queries and downloads (recommended) |
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
DICOMweb access
IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.
| Endpoint | Auth | Use Case |
|---|---|---|
| Public proxy | No | Testing, moderate queries, daily quota |
| Google Healthcare | Yes (GCP) | Production use, higher quotas |
See references/dicomweb_guide.md for endpoint URLs, code examples, supported operations, and implementation details.
Required (for basic access):
pip install --upgrade idc-indexImportant: New IDC data release will always trigger a new version of idc-index. Always use --upgrade flag while installing, unless an older version is needed for reproducibility.
Tested with: idc-index 0.11.7 (IDC data version v23)
Optional (for data analysis):
pip install pandas numpy pydicomDiscover what imaging collections and data are available in IDC:
from idc_index import IDCClient
client = IDCClient()
# Get summary statistics from primary index
query = """
SELECT
collection_id,
COUNT(DISTINCT PatientID) as patients,
COUNT(DISTINCT SeriesInstanceUID) as series,
SUM(series_size_MB) as size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)
# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
FROM collections_index
""")
# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
FROM analysis_results_index
""")collections_index provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types — without needing to aggregate from the primary index.
analysis_results_index lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.
Query the IDC mini-index using SQL to find specific datasets.
First, explore available values for filter columns:
from idc_index import IDCClient
client = IDCClient()
# Check what Modality values exist
modalities = client.sql_query("""
SELECT DISTINCT Modality, COUNT(*) as series_count
FROM index
GROUP BY Modality
ORDER BY series_count DESC
""")
print(modalities)
# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
FROM index
WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
GROUP BY BodyPartExamined
ORDER BY series_count DESC
LIMIT 20
""")
print(body_parts)Then query with validated filter values:
# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
SELECT
collection_id,
PatientID,
SeriesInstanceUID,
Modality,
SeriesDescription,
license_short_name
FROM index
WHERE Modality = 'MR'
AND BodyPartExamined = 'BREAST'
LIMIT 20
""")
# Access results as pandas DataFrame
for idx, row in results.iterrows():
print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")To filter by cancer type, join with collections_index:
client.fetch_index("collections_index")
results = client.sql_query("""
SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
FROM index i
JOIN collections_index c ON i.collection_id = c.collection_id
WHERE c.CancerTypes LIKE '%Breast%'
AND i.Modality = 'MR'
LIMIT 20
""")Available metadata fields (use client.indices_overview for complete list):
Note: Cancer type is in collections_index.CancerTypes, not in the primary index table.
Download imaging data efficiently from IDC's cloud storage:
Download entire collection:
from idc_index import IDCClient
client = IDCClient()
# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
collection_id="rider_pilot",
downloadDir="./data/rider"
)Download specific series:
# First, query for series UIDs
series_df = client.sql_query("""
SELECT SeriesInstanceUID
FROM index
WHERE Modality = 'CT'
AND BodyPartExamined = 'CHEST'
AND collection_id = 'nlst'
LIMIT 5
""")
# Download only those series
client.download_from_selection(
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
downloadDir="./data/lung_ct"
)Custom directory structure:
Default dirTemplate: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID
# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
collection_id="tcga_luad",
downloadDir="./data",
dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/
# Flat structure (all files in one directory)
client.download_from_selection(
seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
downloadDir="./data/flat",
dirTemplate=""
)
# Results in: ./data/flat/*.dcmThe idc download command provides command-line access to download functionality without writing Python code. Available after installing idc-index.
Auto-detects input type: manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
# Download entire collection
idc download rider_pilot --download-dir ./data
# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data
# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data
# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./dataOptions:
| Option | Description |
|---|---|
--download-dir | Output directory (default: current directory) |
--dir-template | Directory hierarchy template (default: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID) |
--log-level | Verbosity: debug, info, warning, error, critical |
Manifest files:
Manifest files contain S3 URLs (one per line) and can be:
Format (one S3 URL per line):
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*Example: Generate manifest from Python query:
from idc_index import IDCClient
client = IDCClient()
# Query for series URLs
results = client.sql_query("""
SELECT series_aws_url
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")
# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
for url in results['series_aws_url']:
f.write(url + '\n')Then download:
idc download ct_manifest.txt --download-dir ./ct_dataView DICOM data in browser without downloading:
from idc_index import IDCClient
import webbrowser
client = IDCClient()
# First query to get valid UIDs
results = client.sql_query("""
SELECT SeriesInstanceUID, StudyInstanceUID
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 1
""")
# View single series
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)
# View all series in a study (useful for multi-series exams like MRI protocols)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).
Check data licensing before use (critical for commercial applications):
from idc_index import IDCClient
client = IDCClient()
# Check licenses for all collections
query = """
SELECT DISTINCT
collection_id,
license_short_name,
COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""
licenses = client.sql_query(query)
print(licenses)License types in IDC:
Important: Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
The source_DOI column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use citations_from_selection() to generate properly formatted citations:
from idc_index import IDCClient
client = IDCClient()
# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
print(citation)
# Get citations for specific series
results = client.sql_query("""
SELECT SeriesInstanceUID FROM index
WHERE collection_id = 'tcga_luad' LIMIT 5
""")
citations = client.citations_from_selection(
seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)
# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
collection_id="tcga_luad",
citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)Parameters:
collection_id: Filter by collection(s)patientId: Filter by patient ID(s)studyInstanceUID: Filter by study UID(s)seriesInstanceUID: Filter by series UID(s)citation_format: Use IDCClient.CITATION_FORMAT_* constants:
CITATION_FORMAT_APA (default) - APA styleCITATION_FORMAT_BIBTEX - BibTeX for LaTeXCITATION_FORMAT_JSON - CSL JSONCITATION_FORMAT_TURTLE - RDF TurtleBest practice: When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.
Process large datasets efficiently with filtering:
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
# Find chest CT scans from GE scanners
query = """
SELECT
SeriesInstanceUID,
PatientID,
collection_id,
ManufacturerModelName
FROM index
WHERE Modality = 'CT'
AND BodyPartExamined = 'CHEST'
AND Manufacturer = 'GE MEDICAL SYSTEMS'
AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""
results = client.sql_query(query)
# Save manifest for later
results.to_csv('lung_ct_manifest.csv', index=False)
# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
batch = results.iloc[i:i+batch_size]
client.download_from_selection(
seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
downloadDir=f"./data/batch_{i//batch_size}"
)For queries requiring full DICOM metadata, complex JOINs, or clinical data tables, use Google BigQuery. Requires GCP account with billing enabled.
Quick reference:
bigquery-public-data.idc_current.*dicom_all (combined metadata)dicom_metadata (all DICOM tags)See references/bigquery_guide.md for setup, table schemas, query patterns, and cost optimization.
| Task | Tool | Reference |
|---|---|---|
| Programmatic queries & downloads | idc-index | This document |
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
| Complex metadata queries | BigQuery | references/bigquery_guide.md |
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |
Default choice: Use idc-index for most tasks (no auth, easy API, batch downloads).
Integrate IDC data into imaging analysis workflows:
Read downloaded DICOM files:
import pydicom
import os
# Read DICOM files from downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
if f.endswith('.dcm')]
# Load first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")Build 3D volume from CT series:
import pydicom
import numpy as np
from pathlib import Path
def load_ct_series(series_path):
"""Load CT series as 3D numpy array"""
files = sorted(Path(series_path).glob('*.dcm'))
slices = [pydicom.dcmread(str(f)) for f in files]
# Sort by slice location
slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
# Stack into 3D array
volume = np.stack([s.pixel_array for s in slices])
return volume, slices[0] # Return volume and first slice for metadata
volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}") # (z, y, x)Integrate with SimpleITK:
import SimpleITK as sitk
from pathlib import Path
# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()
# Apply processing
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)
# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")Objective: Build training dataset of lung CT scans from NLST collection
Steps:
from idc_index import IDCClient
client = IDCClient()
# 1. Query for lung CT scans with specific criteria
query = """
SELECT
PatientID,
SeriesInstanceUID,
SeriesDescription
FROM index
WHERE collection_id = 'nlst'
AND Modality = 'CT'
AND BodyPartExamined = 'CHEST'
AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""
results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")
# 2. Download data organized by patient
client.download_from_selection(
seriesInstanceUID=list(results['SeriesInstanceUID'].values),
downloadDir="./training_data",
dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)
# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)Objective: Compare image quality across different MRI scanner manufacturers
Steps:
from idc_index import IDCClient
import pandas as pd
client = IDCClient()
# Query for brain MRI grouped by manufacturer
query = """
SELECT
Manufacturer,
ManufacturerModelName,
COUNT(DISTINCT SeriesInstanceUID) as num_series,
COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""
manufacturers = client.sql_query(query)
print(manufacturers)
# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
mfr = row['Manufacturer']
model = row['ManufacturerModelName']
query = f"""
SELECT SeriesInstanceUID
FROM index
WHERE Manufacturer = '{mfr}'
AND ManufacturerModelName = '{model}'
AND Modality = 'MR'
AND BodyPartExamined LIKE '%BRAIN%'
LIMIT 5
"""
series = client.sql_query(query)
client.download_from_selection(
seriesInstanceUID=list(series['SeriesInstanceUID'].values),
downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
)Objective: Preview imaging data before committing to download
from idc_index import IDCClient
import webbrowser
client = IDCClient()
series_list = client.sql_query("""
SELECT SeriesInstanceUID, PatientID, SeriesDescription
FROM index
WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
LIMIT 10
""")
# Preview each in browser
for _, row in series_list.iterrows():
viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
print(f" View at: {viewer_url}")
# webbrowser.open(viewer_url) # Uncomment to open automaticallyFor additional visualization options, see the IDC Portal getting started guide or SlicerIDCBrowser for 3D Slicer integration.
Objective: Download only CC-BY licensed data suitable for commercial applications
Steps:
from idc_index import IDCClient
client = IDCClient()
# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
SeriesInstanceUID,
collection_id,
PatientID,
Modality
FROM index
WHERE license_short_name LIKE 'CC BY%'
AND license_short_name NOT LIKE '%NC%'
AND Modality IN ('CT', 'MR')
AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""
cc_by_data = client.sql_query(query)
print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")
# Download with license verification
client.download_from_selection(
seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
downloadDir="./commercial_dataset",
dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)
# Save license information
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)license_short_name field and respect licensing terms (CC BY vs CC BY-NC)citations_from_selection() to get properly formatted citations from source_DOI values; include these in publicationsLIMIT clause when exploring to avoid long downloads and understand data structure%collection_id/%PatientID/%ModalityIssue: ModuleNotFoundError: No module named 'idc_index'
pip install --upgrade idc-indexIssue: Download fails with connection timeout
dirTemplate to organize downloads by batchIssue: BigQuery quota exceeded or billing errors
references/bigquery_guide.md for cost optimization tipsIssue: Series UID not found or no data returned
LIMIT 5 to test query firstIssue: Downloaded DICOM files won't open
pydicom.dcmread(file, force=True)Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")
# What body parts for a specific modality?
client.sql_query("""
SELECT DISTINCT BodyPartExamined, COUNT(*) as n
FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
GROUP BY BodyPartExamined ORDER BY n DESC
""")
# What manufacturers for MR?
client.sql_query("""
SELECT DISTINCT Manufacturer, COUNT(*) as n
FROM index WHERE Modality = 'MR'
GROUP BY Manufacturer ORDER BY n DESC
""")Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
SELECT collection_id, Modality, COUNT(*) as series_count
FROM index
WHERE Modality IN ('SEG', 'RTSTRUCT')
GROUP BY collection_id, Modality
ORDER BY series_count DESC
""")
# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
FROM index
WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")
# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
SELECT analysis_result_id, analysis_result_title, Collections, Modalities
FROM analysis_results_index
""")
# Find analysis results for a specific source collection
client.sql_query("""
SELECT analysis_result_id, analysis_result_title
FROM analysis_results_index
WHERE Collections LIKE '%tcga_luad%'
""")
# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")
# Get segmentation statistics by algorithm
client.sql_query("""
SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
FROM seg_index
WHERE AlgorithmName IS NOT NULL
GROUP BY AlgorithmName, AlgorithmType
ORDER BY seg_count DESC
LIMIT 10
""")
# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
SELECT
s.SeriesInstanceUID as seg_series,
s.AlgorithmName,
s.total_segments,
s.segmented_SeriesInstanceUID as source_series
FROM seg_index s
JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
LIMIT 10
""")
# Find TotalSegmentator results with source image context
client.sql_query("""
SELECT
seg_info.collection_id,
COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
SUM(s.total_segments) as total_segments
FROM seg_index s
JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
GROUP BY seg_info.collection_id
ORDER BY seg_count DESC
""")# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
SELECT i.collection_id, COUNT(*) as slides,
MIN(s.min_PixelSpacing_2sf) as min_resolution
FROM sm_index s
JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
GROUP BY i.collection_id
ORDER BY slides DESC
""")# Size for specific criteria
client.sql_query("""
SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
FROM index
WHERE collection_id = 'nlst' AND Modality = 'CT'
""")client.fetch_index("clinical_index")
# Find collections with clinical data and their tables
client.sql_query("""
SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
FROM clinical_index
GROUP BY collection_id, table_name
ORDER BY collection_id
""")The following skills complement IDC workflows for downstream analysis and visualization:
Always use client.indices_overview for current column schemas. This ensures accuracy with the installed idc-index version:
# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]