May 08, 2025
Pipeline for data download from The Cancer Genome Atlas (TCGA) simplifies navigation for researchers

Graphic for the TCGA Download Helper
Relevant publication in Frontiers in Genetics
A new article by our external PhD student Alexandra Baumann has been published under the title "TCGADownloadHelper: Simplifying TCGA data extraction and preprocessing". It addresses how to simplify the extraction and preprocessing of data from The Cancer Genome Atlas (TCGA).
The article is based on the latest research work of Alexandra Baumann, Dr. Markus Wolfien and our Rostock colleague, Prof. Dr. Olaf Wolkenhauer. Following a new internship contract with us at the ZMI, Alexandra Baumann now also holds our affiliation in addition to her Rostock one. The work is part of our scientific activities within PM4Onco - Personalised Medicine for Oncology, a research project of the Medical Informatics Initiative (MII) dedicated to the further development of "personalised medicine" in the treatment of oncological diseases.
The TCGA database provides comprehensive genomic data for various types of cancer. However, complex file naming conventions and the need to link different data types to individual case IDs can be a challenge for first-time users. While other tools have been introduced to facilitate the handling of TCGA data, they lack a straightforward combination of all necessary steps.
Based on this, our team developed an optimised pipeline that uses the cart system of the Genomic Data Commons (GDC) portal for file selection and the GDC Data Transfer Tool for data download. We use the sample sheet provided by the GDC portal to replace the default opaque file IDs and filenames with human-readable case IDs. The pipeline integrates customisable Python scripts in a Jupyter notebook and a Snakemake workflow for ID mapping and for automating data pre-processing tasks.
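To illustrate the ID mapping idea, here is a minimal Python sketch, not the code from the publication. It assumes the GDC sample sheet is a tab-separated file with columns named "File ID", "File Name" and "Case ID". The function names, the toy IDs and the naming scheme (case ID prefixed to the original filename) are hypothetical choices made for this example.

```python
import csv
import io


def load_case_ids(sample_sheet):
    """Map each file ID to (file name, case ID) from a tab-separated sample sheet.

    Assumes the column names "File ID", "File Name" and "Case ID".
    """
    reader = csv.DictReader(sample_sheet, delimiter="\t")
    return {row["File ID"]: (row["File Name"], row["Case ID"]) for row in reader}


def readable_name(file_name, case_id):
    """Prefix the original file name with the case ID (hypothetical scheme)."""
    return f"{case_id}_{file_name}"


# Toy sample sheet with made-up IDs, standing in for the file exported from the portal.
SHEET = (
    "File ID\tFile Name\tCase ID\n"
    "aaaa-1111\tx9f3.counts.tsv\tTCGA-XX-0001\n"
    "bbbb-2222\tk27q.vcf.gz\tTCGA-XX-0002\n"
)

mapping = load_case_ids(io.StringIO(SHEET))
renamed = {file_id: readable_name(*info) for file_id, info in mapping.items()}
print(renamed["aaaa-1111"])  # TCGA-XX-0001_x9f3.counts.tsv
```

In a real run, the open file handle of the downloaded sample sheet would take the place of the in-memory string, and the resulting names could be used to rename or link the downloaded files.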
The pipeline developed here simplifies the data download process and includes a case ID filtering step, which facilitates the handling of multimodal datasets related to individual patients. Thus, the pipeline significantly reduces the effort required to pre-process the data and allows researchers to efficiently navigate through the complexity of TCGA data extraction and pre-processing.
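One plausible reading of such a case ID filtering step is to keep only those patients for whom every required data type is available. The following Python sketch shows that idea under this assumption. The function name, the record layout and the example data types are hypothetical and are not taken from the article.

```python
def cases_with_all_types(records, required_types):
    """Return the sorted case IDs that have a file for every required data type.

    `records` is an iterable of (case_id, data_type) pairs.
    """
    types_per_case = {}
    for case_id, data_type in records:
        types_per_case.setdefault(case_id, set()).add(data_type)
    required = set(required_types)
    # A case qualifies when its data types are a superset of the required ones.
    return sorted(case for case, types in types_per_case.items() if required <= types)


# Made-up records: the second case lacks mutation data and is filtered out.
records = [
    ("TCGA-XX-0001", "Gene Expression Quantification"),
    ("TCGA-XX-0001", "Masked Somatic Mutation"),
    ("TCGA-XX-0002", "Gene Expression Quantification"),
]

complete = cases_with_all_types(
    records, ["Gene Expression Quantification", "Masked Somatic Mutation"]
)
print(complete)  # ['TCGA-XX-0001']
```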
You can find the full article here: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2025.1569290/full