The outbreak of the coronavirus and its disease, COVID-19, has taken the world by storm. It has created a global health crisis that has profoundly changed the way we perceive our everyday lives. As COVID-19 positive cases continue to climb worldwide, we must work in unity across organizations to combat the virus. Coronavirus data plays an essential role in understanding, planning, researching, and fighting this disease. Getting this information into the right hands across biomedical research and the healthcare sector is more critical than ever. With scientists working in limbo, public data will increasingly help labs and organizations worldwide to mitigate the impact of the virus.
Access to open-source datasets, and to tools that can analyze this data on cloud infrastructure, can fast-track global collaborative efforts. Almost every company uses publicly available multi-omics data alongside its proprietary data to supplement its research efforts and gain new insights. These freely available datasets can be leveraged to produce new insights as the world continues its fight against the coronavirus. We are fortunate to live in a world where data is treasured and efforts are made to collect and refine such datasets. Hence, the question is how to analyze and extract value from these public datasets for COVID-19 research. The answer will affect how vaccines are developed and how an effective solution is reached.
The journey from raw omics data to insights
Finding relevant data
Since late March 2020, researchers have been making COVID-19 data accessible to the scientific community to help find novel insights. Many omics data repositories have archived influenza virus genome sequences, transcriptomics data, protein structures, and much more information in the public domain.
We are grateful to all the researchers across the globe who are working to understand the mechanism of this deadly virus and who make their high-throughput data accessible in the public domain. The community has made its data available on several pre-existing and new data repositories. Notable ones worth mentioning are:
Finding the perfect dataset of interest in an ocean of data repositories is like Finding Nemo! For example, the number of gene expression datasets in the GEO database alone (accessed on July 13, 2020) is 3,684,233. The various repositories provide different datasets that serve different functions. A researcher may have to query each of these repositories to get the desired information, and might not even be aware of all of them, which limits the search. Searching for datasets across repositories calls for a specialized solution.
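To make the idea of querying a single repository concrete, here is a minimal sketch that builds a search URL for GEO DataSets through the NCBI E-utilities `esearch` endpoint. The function name `build_geo_search_url` and the example search term are illustrative assumptions, not part of any existing tool. The sketch only constructs the URL; fetching it needs network access and covers just one of the many repositories a researcher may have to search.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the GEO DataSets database.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term: str, max_results: int = 20) -> str:
    """Return an esearch URL for GEO DataSets (hypothetical helper)."""
    params = {
        "db": "gds",            # GEO DataSets
        "term": term,           # free-text or fielded Entrez query
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Illustrative query: SARS-CoV-2 studies restricted to human samples.
url = build_geo_search_url('SARS-CoV-2 AND "Homo sapiens"[Organism]')
print(url)
```

Every other repository would need its own query of this kind, each with a different interface, which is the burden a cross-repository search is meant to remove.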
Making data process-ready
With the data available, the next step is to put it in a “usable” form. 76% of data scientists say that they spend most of their time cleaning data rather than mining it. Despite being the least fun part of the job, it is a crucial prerequisite for putting data in context, performing analysis, and turning it into insights. This process usually includes cleaning, standardizing data formats, and normalization.
Having data in a raw form is an opportunity to integrate several different kinds of data. However, omics experiments are inherently biased by differences in data collection and in the instruments used. Integrating data from several sources therefore introduces biologically irrelevant variance, collectively called the batch effect, which lowers the power to detect biologically interesting signals. It is essential to remove this bias and make samples comparable. Choosing the right normalization method is pivotal for the downstream analysis and the reliability of the results.
The manual part of the cleaning process makes it an overwhelming task, so a platform for making data process-ready is imperative. Various GUI-based analytical platforms such as KNIME, Galaxy, and Elucidata’s Polly provide an easy-to-understand interface for processing data.
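As a rough illustration of making samples comparable, the sketch below log-transforms the values of a single gene and then subtracts each batch's mean. This per-batch centering is a deliberately simplified stand-in for dedicated batch-correction methods such as ComBat. The counts, the batch labels, and the helper names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def log2_transform(values, pseudocount=1.0):
    """Log2-transform raw values; the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def center_by_batch(values, batches):
    """Subtract each batch's mean from the samples of that batch."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]


# Hypothetical raw counts for one gene: three samples from each of two batches.
counts = [7, 15, 31, 63, 127, 255]
batches = ["A", "A", "A", "B", "B", "B"]

logged = log2_transform(counts)
corrected = center_by_batch(logged, batches)
print(corrected)  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

Centering of this kind is only safe when the biological groups are balanced across batches. If a condition is confined to one batch, the correction removes the biological signal along with the technical one.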
Processed data to insights
We now have data for several thousand variables (genes, metabolites, proteins, concepts, or word embeddings) in place with a comparable number of samples. Different methodologies are required to carry out hypothesis testing and get insights in which strong confidence can be placed.
Statistics is the first line of methods that we apply to interpret data and get insights that hold high statistical significance. Selecting an appropriate statistical method is a crucial step. A researcher needs to be aware of the assumptions and conditions of each statistical method. A wrong choice of method causes severe problems in interpreting the insights and in the research conclusions.
To better harmonize all the COVID study data, researchers would also need machine learning techniques, in addition to a careful experimental design. Applications of statistical methods and machine learning techniques for studying biological data in the context of COVID research include:
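When several thousand variables are tested at once, some will look significant purely by chance, so p-values are usually adjusted for multiple testing. The sketch below implements the Benjamini-Hochberg adjustment in plain Python. The p-values are made up for illustration, and this is one common choice of correction rather than the only valid one.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print([round(q, 3) for q in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

The adjustment assumes the input p-values are valid in the first place, so it does not replace checking the assumptions of the test that produced them.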
- Clustering and/or classification of samples into different categories
- Cell type annotation
- Differential gene expression analysis
- Phylogenetic analysis of viral genomes
- Drug discovery and repurposing predictions
- Identification of gene signatures
- Identification of mechanisms exploited by infected cells to alter neighboring cells
- Network-based methodologies for understanding cellular biology of COVID-19 infection
Together, statistics and machine learning can be leveraged to find significant differences and/or relationships across the different COVID-19 studies.
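As an example of the first application in the list, clustering samples, here is a minimal k-means sketch in plain Python. The two-gene expression profiles and the starting centroids are hypothetical. Real analyses would normally rely on an established library and on many more genes, so this only shows the mechanics.

```python
from math import dist
from statistics import mean


def k_means(points, centroids, iterations=10):
    """Cluster points around the given starting centroids (simple k-means)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(mean(axis) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical expression of two genes in six samples.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(samples, centroids=[samples[0], samples[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids and on the chosen number of clusters, which is one reason the careful experimental design mentioned earlier still matters.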
Insights to function
Last but not least, individual insights are useful for understanding the biology of SARS-CoV-2, identifying therapies, and understanding the effects of the virus on human biology. Still, they are not as discrete as we think; they are like pieces of a puzzle that show the bigger picture when combined. Annotation comes into play to group those sets of alterations into a functional category, which can be viewed as a single entity connecting different individual insights. With the emergence of this pandemic, the community acted fast to curate resources related to COVID-19, including annotation resources like:
Elucidata has indexed 4000+ transcriptional datasets from heterogeneous repositories in its data lake, corresponding to SARS viruses, viral infections, therapeutics for COVID-19, and normal lung tissue. These numbers keep increasing to keep up with the rapidly evolving research on COVID. This curated, COVID-specific data lake is available on our platform, Polly.
Polly represents a tremendous opportunity to help researchers significantly improve COVID data analysis. (1) It is a cloud platform that addresses the inefficiencies in using biomedical data and tools. (2) The platform serves as a single place for storage, reproducible analysis, and high-throughput biomedical data integration.
We’re as dedicated to your success as ever! If you have pivoted your efforts towards understanding the SARS-CoV-2 virus or COVID-19 disease progression, get free access to our resources at resources.elucidata.io/covid19. Collaborate on data analysis, and please don’t hesitate to reach out via email or by using the Intercom app on our website.