Genomic Data Science

As humans dig deeper into the genome, the analysis and interpretation of the genomic data collected are helping researchers better understand human health and disease, while also raising questions about privacy and ethics.

The Big Picture

Genomic data science is a field of study that enables researchers to use powerful computational and statistical methods to decode the functional information hidden in DNA sequences.
Estimates predict that genomics research will generate between 2 and 40 exabytes of data within the next decade.
Our ability to sequence DNA has far outpaced our ability to decipher the information it contains, so genomic data science will be a vibrant field of research for many years to come.
Genomic data science carries a set of ethical responsibilities, because each person's sequence data raise issues of privacy and identity.

How it affects you

As biomedical research projects and large-scale collaborations grow rapidly, the amount of genomic data being generated is also increasing, with roughly 2 to 40 billion gigabytes of data now generated each year. Researchers are working to extract valuable information from such complicated and large datasets so they can better understand human health and disease.

What is genomic data science?

Genomic data science is a field of study that enables researchers to use powerful computational and statistical methods to decode the functional information hidden in DNA sequence. Applied in the context of genomic medicine, these data science tools help researchers and clinicians uncover how differences in DNA affect human health and disease.

Genomic data science emerged as a field in the 1990s to bring together two laboratory activities:

Experimentation: Generating genomic information from studying the genomes of living organisms.
Data analysis: Using statistical and computational tools to analyze and visualize genomic data, which includes processing and storing data and using algorithms and software to make predictions based on available genomic data.

Both activities help researchers acquire vast amounts of genomic data and gain insights from them.
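
As a rough illustration of the data-analysis activity described above, the sketch below counts the DNA letters in a short sequence and reports the fraction that are G or C, a common summary statistic. It is a toy example under stated assumptions: the sequence is made up, and the function names `base_counts` and `gc_content` are hypothetical, not taken from any particular genomics tool.

```python
from collections import Counter


def base_counts(sequence):
    """Count each DNA letter (A, C, G, T) in a sequence, ignoring case."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(sequence):
    """Return the fraction of A/C/G/T letters that are G or C (0.0 if none)."""
    counts = base_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


toy_sequence = "ATGCGCTA"  # made-up example, not real genomic data
print(base_counts(toy_sequence))
print(gc_content(toy_sequence))
```

Real analyses work on far larger inputs and use specialized file formats and software, but the basic pattern of reading sequence data and summarizing it computationally is the same.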

Why does genomics involve so much data?

Human genomics gained mainstream attention in the early 2000s when the Human Genome Project successfully generated the first sequence of the chemical bases (“letters”) in the human genome: As, Cs, Gs and Ts. Each of the trillions of cells in the human body contains a complete copy of the genome (i.e., our DNA blueprint). Most cells actually have two copies of the genome, which together contain about 6 billion DNA letters.

Researchers are now generating more genomic data than ever before to understand how the genome functions and affects human health and disease. These data are coming from millions of people in various populations across the world. Data about a single human genome sequence alone would take up 200 gigabytes, or the space of about 200 copies of Jaws. We will need an estimated 40 exabytes to store the genome-sequence data generated worldwide by 2025. That’s almost one billion DVDs of Jaws! In comparison, five exabytes could store all of the words ever spoken by human beings.
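
To put the storage figures above on a common scale, the following sketch converts exabytes to gigabytes and estimates how many 200-gigabyte genome datasets would fit in 40 exabytes. It assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one exabyte is 10^18 bytes. The function names are hypothetical and chosen only for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
GIGABYTE = 10**9
EXABYTE = 10**18

GENOME_SIZE_GB = 200       # per-genome figure quoted in the text
WORLDWIDE_TOTAL_EB = 40    # worldwide estimate quoted in the text


def exabytes_to_gigabytes(exabytes):
    """Convert a whole number of exabytes to gigabytes."""
    return exabytes * EXABYTE // GIGABYTE


def genomes_that_fit(total_exabytes, genome_gb):
    """Number of whole genome datasets of genome_gb gigabytes that fit."""
    return exabytes_to_gigabytes(total_exabytes) // genome_gb


total_gb = exabytes_to_gigabytes(WORLDWIDE_TOTAL_EB)
n_genomes = genomes_that_fit(WORLDWIDE_TOTAL_EB, GENOME_SIZE_GB)
print(f"{WORLDWIDE_TOTAL_EB} exabytes = {total_gb:,} gigabytes")
print(f"Room for about {n_genomes:,} genomes at {GENOME_SIZE_GB} GB each")
```

Under these assumptions, one exabyte equals one billion gigabytes, so 40 exabytes would hold roughly 200 million genome datasets of 200 gigabytes each.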