Making Big Strides in Cancer Research by Mining Big Data
DNA sequencing technology has improved so much in the fifteen years since the completion of the Human Genome Project that you can receive information about your ancestry from a saliva sample in just a few weeks. Now, the challenge for researchers is developing tools and methods to extract as much information as possible from large biomedical datasets of genetic information, molecular profiles of cells, clinical diagnostic tests, and electronic health records.
The Center for Predictive Computational Phenotyping (CPCP) was established at UW-Madison, under the leadership of UW Carbone Cancer Center members Mark Craven, PhD, and Michael Newton, PhD, after the National Institutes of Health awarded a four-year, $12 million grant to the institution in 2014. The grant has primarily supported graduate students, postdoctoral fellows, and faculty, bringing together people from different data science disciplines across campus to work on open research questions.
Some CPCP researchers, like Beth Burnside, MD, lead investigator in a breast cancer screening project, are developing more sensitive methods for detecting signals in biomedical datasets.
“The computational challenge is that with all these potential data sources we have more evidence, but how do you really identify what signals in those data sources are important for assessing risk?” Craven says.
Burnside has shown that combining genetic data with mammogram and other imaging data, as well as electronic health records, provides a more accurate assessment of breast cancer risk. These predictions could potentially inform a patient’s screening plan, and Burnside is working on a tool that would refine doctors’ risk calculators to incorporate the best data possible.
Other CPCP investigators, like UW Carbone member Colin Dewey, PhD, and computer science professor AnHai Doan, PhD, are helping researchers make use of large repositories of existing data. Over the last fifteen years, many research groups have studied how strongly each gene is expressed in different cell types and tissues, and how that expression changes under different conditions and disease states. The main repository of this data is the Sequence Read Archive. While its DNA sequences are in a uniform format, the metadata describing the samples in each dataset is very messy.
“There’s so much potential to having all this data publicly available,” Craven says. “But that potential can’t be realized if you can’t always tell what’s in the database.”
Dewey and Doan have built a system called MetaSRA that takes all the data from human samples in the Sequence Read Archive and automatically maps the messy metadata for each sample into cleaned-up metadata. In the cleaned-up version, everything is expressed in terms of standard gene naming conventions and standard vocabularies for descriptors like cell lines and experimental methods. With consistent metadata, researchers can find relevant datasets from other labs and strengthen their results by integrating them with their own data.
The CPCP has also developed a software tool, atSNP, that helps researchers understand the relationship between genomic variants and cellular factors involved in cancer. Through the CPCP, computational biologist Sunduz Keles, PhD, connected with researchers across campus to study how variation at a single genomic region affects the way proteins called transcription factors bind to DNA, and how that binding affects the abundance of each gene product in a cell – a process that is commonly altered in cancer and other diseases.
In addition to the initiatives and researchers funded through this grant over the last four years, the CPCP has had a lasting influence on collaborations between data scientists at UW.
“Across campus, there are various subgroups of people who are good with numbers: statisticians, biostatisticians, epidemiologists, computer scientists,” Newton says. “The grant as a center has provided a nice framework for all these folks to start to work together in ways that they never did before.”
Date Published: 10/09/2018