This is the first part of a three-part series.

Part 1: The bottleneck is not data generation

In 1951, when Frederick Sanger first worked out an amino acid sequence for insulin, the structure of DNA had not yet been deciphered (I would call breakthroughs like these "ignite moments"). It took almost two decades to sequence DNA in a similar way. The world has never looked back since.

The first major test for DNA sequencing was the Human Genome Project, where a race between Celera and the public consortium (with UCSC assembling the public draft) led to the first publicly available human genome. It cost around 3 billion dollars to get there. Improvements in technology and key innovations by sequencing companies have reduced this cost dramatically in recent years, and the decrease coincides with the development of new sequencing technologies.

[Figure: Cost per Genome]
[Figure: History of sequencing]

Omics big data comprises not only DNA sequencing but also transcriptome sequencing, proteomics, metabolomics, and epigenetics. Fortunately, similar ignite moments occurred in the other omics as well. In proteomics, the rise of the MALDI technique scaled up the field.

Similarly, metabolomics scaled up with the rise of LC-MS, and transcriptomics scaled in a big way with the advent of microarrays and, subsequently, RNA sequencing. Since biology is an intricate dance of all the omics together giving rise to a phenotype, these ignite moments are necessary to get a holistic picture of a cell or group of cells.

To understand the impact of these ignite moments on omics research, we can look at the number of studies that use omics data. Two databases used by researchers around the world are NCBI GEO and TCGA. The number of studies submitted by researchers each year shows how the volume of data has grown over time.
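As a rough illustration of the counting described above, here is a minimal Python sketch that tallies studies per submission year and keeps a running total. The records are made-up placeholders, not real GEO entries, and the function names `studies_per_year` and `cumulative` are hypothetical; a real analysis would start from metadata exported from the database.

```python
from collections import Counter
from datetime import date

# Hypothetical placeholder records: (study identifier, submission date).
# These are NOT real GEO entries; they only illustrate the tally.
submissions = [
    ("study_a", date(2001, 3, 14)),
    ("study_b", date(2001, 11, 2)),
    ("study_c", date(2002, 6, 30)),
    ("study_d", date(2003, 1, 9)),
    ("study_e", date(2003, 5, 21)),
    ("study_f", date(2003, 12, 1)),
]


def studies_per_year(records):
    """Return a {year: count} dict ordered by year."""
    counts = Counter(submitted.year for _, submitted in records)
    return dict(sorted(counts.items()))


def cumulative(per_year):
    """Return the running total of studies up to and including each year."""
    total = 0
    running = {}
    for year, n in per_year.items():
        total += n
        running[year] = total
    return running


per_year = studies_per_year(submissions)
print(per_year)
print(cumulative(per_year))
```

Plotting the per-year counts gives a chart of the kind this post refers to, while the running total shows how much data has accumulated overall.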

[Figure: No. of studies per year on GEO]

Similarly, if we look at the number of publications citing TCGA over the years, it really tells the story of how data generation is influencing research.

[Figure: Publications using TCGA Data]
[Figure: Data accumulation by type]

With great data comes the great responsibility of interpreting it.

Interpretation requires human expertise and is laborious. Papers such as "Bridging the gap: the need for genomic and clinical -omics data integration and standardization in overcoming the bottleneck of variant interpretation" clearly explain the need for reproducibility and transparency in interpreting variant data.

To deal with data at this scale, a growing number of algorithms provide quantitative interpretation for varied types of data. However, not everyone knows how to interpret the results of those algorithms, or whether an algorithm exists for a particular context.

The next part discusses how data interpretation requires context and awareness of the available algorithms (Genome → Transcriptome → Proteome → Metabolome).

Found this article useful? Talk to our data science team that wrote this.
Drop a message at datascience@elucidata.io
