Managing data FAIRly to get to 100% reproducibility.
In 1907, an article in The American Journal of Psychology described a peculiar phenomenon. The authors observed that looking at a word or phrase for too long can render it meaningless to the reader. In his doctoral thesis, published in 1962 at McGill, Leon James coined the phrase “Semantic Satiation” to describe this phenomenon. He explained it as a process where meaningful words fall prey to irrelevance upon repetition. Working in the drug-target discovery space, we cannot help but wonder if the conversation around reproducible research is heading the same way.
Concerns around research reproducibility have long been a fixture in conversations involving academia, industry, and funding bodies. The earliest discussions on reproducibility focused on refining the protocols and techniques used in low-throughput bench experiments across labs. Although many of these studies yielded valuable empirical findings, they tested hypotheses using a mix of intuition and hit-and-miss trials, relying heavily on a priori knowledge of the known molecular mechanisms of the disease context. Notably, many of them also suffered from a lack of reproducibility across different research settings.
The data revolution triggered by the Human Genome Project, and later by high-throughput next-generation sequencing, has increasingly propelled us towards a big data-driven discovery paradigm dominated by diverse R&D teams made up of experimental biologists, bioinformaticians, and data scientists. The exponential growth in data comes with a real opportunity to tie the molecular underpinnings of disease to phenotypic traits and patient outcomes. Notably, our increasing ability to rapidly mine this data is hailed as a panacea for declining R&D productivity.
Unsurprisingly, this opportunity comes with its own set of challenges. A single experiment in pre-clinical research today can produce terabytes of data. This data is accumulating in ever-growing public data repositories. The data explosion has reignited the conversation on establishing rigorous standards for the reproducibility of computational pipelines in biomedical sciences. As data continues to grow in volume and complexity, it has been widely acknowledged that data access, use, and management are not isolated goals but critical requirements for enabling innovation and discovery.
Are you playing FAIR?
A small but growing collection of voices is advocating a move away from traditional data management practices towards providing data and its curation in machine-readable formats. Implementing the findable, accessible, interoperable and reusable (FAIR) principles of data management and stewardship [2] has emerged as an important practice for organizations aspiring to innovate in biomedical research. This shift towards FAIR data management is being driven by a range of organizations, including the National Institutes of Health (NIH) in the USA.
Introduction to the FAIR Principles
“The FAIR principles put the onus on organizations that own and publish data to make it ‘machine actionable’, i.e. a machine can read the metadata that describes the data, and this enables the machine to access and utilize the data for various applications.” More broadly, implementing the FAIR principles will be critical to organizations that aim to reuse both legacy and newly generated data to tackle high-value health care challenges. The NIH and ELIXIR have been key supporters of efforts to establish standards for data curation and metadata annotation, so that Big Data can be reused and integrated in line with the FAIR principles. Recently, the Ma’ayan Lab at the Icahn School of Medicine at Mount Sinai developed FAIRshake [3]. The FAIRshake platform can be used to assess the FAIR compliance of datasets, tools, repositories and other digital biomedical objects. By scoring digital resources for FAIRness, data and tool producers can learn which standards apply to them, which in turn can enhance the utility of the resources they generate.
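To make “machine actionable” a little more concrete, here is a minimal Python sketch of how a program might read a dataset's metadata and score it against a simple FAIR rubric. Everything in it is an illustrative assumption: the rubric, the field names, the example record and the fair_scores function are hypothetical, and none of it reflects the FAIRshake API or its actual rubrics.

```python
import json

# Hypothetical rubric: for each FAIR facet, the metadata fields we expect.
# The field names are illustrative assumptions, not a FAIRshake schema.
RUBRIC = {
    "findable": ["identifier", "title", "keywords"],
    "accessible": ["access_url", "license"],
    "interoperable": ["format", "ontology_terms"],
    "reusable": ["provenance", "version"],
}


def fair_scores(metadata: dict) -> dict:
    """Return, per facet, the fraction of expected fields that are non-empty."""
    scores = {}
    for facet, fields in RUBRIC.items():
        present = sum(1 for field in fields if metadata.get(field))
        scores[facet] = present / len(fields)
    return scores


# A made-up metadata record, as a machine might receive it in JSON form.
record = json.loads("""
{
  "identifier": "doi:10.0000/example",
  "title": "Example RNA-Seq dataset",
  "keywords": ["RNA-Seq", "liver"],
  "access_url": "https://example.org/data/rnaseq.tsv",
  "license": "",
  "format": "text/tab-separated-values",
  "ontology_terms": [],
  "provenance": "Generated by a hypothetical lab pipeline",
  "version": "1.0"
}
""")

for facet, score in fair_scores(record).items():
    print(f"{facet}: {score:.2f}")
```

In this toy record the missing license and the empty list of ontology terms pull down the accessible and interoperable scores, which is the kind of feedback a data producer could act on.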
The quest for the holy grail: Achieving reproducibility in computational biology.
A significant bottleneck for reproducible computational analysis of biomedical data is the fragmented manner in which we currently access data, analysis, and insights. This status quo is partly driven by the way research results are typically communicated: through print journals. Additionally, despite the meticulous standards that apply to data generation, researchers have a culture of adopting home-brewed or community-sourced DIY solutions for data analysis. There have been stellar efforts to bring data and analysis together in a single computational environment, for example Galaxy [4], GenePattern [5], and the more recent BioJupies [6], developed at the Ma’ayan Lab at Mount Sinai. The simple ways in which users can interact with the data and tools these platforms encapsulate reiterate that reproducibility will ultimately be achieved through comprehensive, interactive environments rather than an ad hoc mishmash of datasets and tools.
Closer to home, at Elucidata, we have been working on our own efforts to create a comprehensive environment that brings data and computation together. Our platform Polly now has diversified offerings that target specific challenges in harnessing data for asset discovery. Whether it is building high-throughput workflows with independent modules or creating cloud infrastructure that enables scalable data analysis, our vision is to create computing environments that interact effectively with FAIRified data to generate insights. At its core, data analytics on Polly is powered by Jupyter notebooks [7] with multi-language capabilities. Jupyter notebooks are also a critical part of our research and innovation efforts, enabling reproducible analysis. Every analysis made in a Polly Jupyter notebook can be used to generate a proprietary git repo on Polly, the Knowledge Book. Using a continuous integration pipeline for Knowledge repos, we also make sure that reproducibility is not a one-off end goal but an evolving solution that constantly re-evaluates every analysis and insight. The Polly platform also allows you to host these repos so that they can be shared easily with collaborators.
Whilst there has been a push towards increasing the FAIRness of publicly funded data, private players have also been sensitized to this challenge. There is a misconception that FAIR data has to be open access. Experts, however, agree that FAIR data can be private whilst firmly adhering to the guidelines. Equally critical is the establishment of an in-house computational infrastructure that lets you store, analyze and generate data in accordance with FAIR guidelines. As a managed cloud platform, Polly is hosted in-house for industry to support diverse biological discovery teams. The Polly infrastructure allows teams to run reproducible computations, build R GUI applications and share insights with different stakeholders. More importantly, Polly makes attaining FAIR not a chore but an opportunity. In line with this, our most recent efforts have been to create data lakes that can host proprietary and context-dependent public data on Polly for faster insight and discovery. The ease of programmatic access to data in a Polly data lake is key to making the data in your organization “FAIR” and valuable to multiple stakeholders.
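As a rough illustration of what an automated reproducibility check can look like, the Python sketch below reruns a toy analysis with a fixed random seed and compares SHA-256 digests of the serialized results. It is only a sketch under stated assumptions: the analysis, digest and is_reproducible functions are hypothetical stand-ins and do not describe how the Polly continuous integration pipeline is actually implemented.

```python
import hashlib
import json
import random


def analysis(values, seed=0):
    """Toy stand-in for a notebook analysis: a seeded bootstrap of the mean."""
    rng = random.Random(seed)  # a fixed seed keeps the resampling deterministic
    resampled_means = [
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(100)
    ]
    mean_of_means = sum(resampled_means) / len(resampled_means)
    return {"n": len(values), "bootstrap_mean": round(mean_of_means, 6)}


def digest(result):
    """Hash a canonical JSON serialization so equal results give equal digests."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(values, seed=0):
    """Run the analysis twice and check that both runs produce the same digest."""
    return digest(analysis(values, seed)) == digest(analysis(values, seed))


data = [1.0, 2.0, 3.0, 4.0]
print("reproducible:", is_reproducible(data))
```

In a real pipeline the digest of a reference run would more likely be stored alongside the repository and compared against each new run, so that any change in data, code or environment that alters the result is flagged.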
In follow-up blogs, we will delve into the details of Pollyglot (Multi-Language Jupyter notebooks on Polly) and innovations with R GUI (Shiny) infrastructure.
1. E. Severance and M.F. Washburn in The American Journal of Psychology.
2. Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific data 3 (2016).
3. Clark, Daniel JB, et al. “FAIRshake: toolkit to evaluate the findability, accessibility, interoperability, and reusability of research digital resources.” BioRxiv (2019): 657676.
4. Zhou, Shuigeng, Ruiqi Liao, and Jihong Guan. “When cloud computing meets bioinformatics: a review.” Journal of bioinformatics and computational biology 11.05 (2013): 1330002.
5. Reich, Michael, et al. “The GenePattern Notebook Environment.” Cell systems 5.2 (2017): 149-151.
6. Torre, Denis, Alexander Lachmann, and Avi Ma’ayan. “BioJupies: automated generation of interactive notebooks for RNA-Seq data analysis in the cloud.” Cell systems 7.5 (2018): 556-561.
7. Kluyver, Thomas, et al. “Jupyter Notebooks - a publishing format for reproducible computational workflows.” ELPUB. 2016.