Justin van der Hooft
FAIRifying Multi-Omics Resources: the Paired Omics Data Platform
Open Science Objectives & Practices
The Paired Omics Data Platform (PoDP) provides an effective way to connect genomic and metabolomic (and proteomic) data derived from a single biological source. It uses and promotes the Findable, Accessible, Interoperable, and Reusable (FAIR) principles by requiring that all linked datasets be publicly available and accompanied by metadata. Furthermore, the PoDP adds a layer of metadata for the recorded links, using existing field-specific ontologies where possible. Finally, the PoDP includes a manual curation step and version control for projects.
Introduction
Data-driven discovery of novel chemistry from natural sources can be greatly accelerated by applying multi-omics approaches. Advances in (meta)genomic sequencing and increased sensitivity in metabolomic data acquisition are paving the way to our deepest understanding yet of the chemical language of microbial life. The synergy of multi-omics data analysis relies on access to well-documented, curated datasets from individual biological sources. Although sharing genomic or metabolomic data through public repositories is becoming common, the connections between these data types are hard, and sometimes impossible, to find. Yet connecting genomes to metabolomics data enables multi-omics tools to facilitate structural elucidation of metabolic products and to obtain additional information, such as mode of action, resistance mechanisms, and new enzymatic functions, that is not available from single-omics approaches. Here, a platform was built that records annotated links between omics data types in a human- and computer-readable manner. The platform itself is available through Docker, and backups of the projects it contains can be found on Zenodo. Furthermore, all projects are under version control: if data are added or changed, this is logged and can be made visible to the user.
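To make the idea of an annotated, versioned link concrete, the following is a minimal Python sketch of how such a record could be represented and serialised to JSON. It is an illustration only and not the PoDP schema: the class names (PairedLink, Project), all field names, and the example accession and URL are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PairedLink:
    """One annotated genome-to-metabolomics link (hypothetical fields)."""
    genome_accession: str        # identifier in a public genome repository
    metabolomics_file_url: str   # public location of the mass-spectral data
    sample_preparation: str      # free text or an ontology term


@dataclass
class Project:
    """A project holding links plus a simple change log (hypothetical)."""
    project_id: str
    version: int = 1
    links: list = field(default_factory=list)
    changelog: list = field(default_factory=list)

    def add_link(self, link: PairedLink, note: str) -> None:
        # Every change bumps the version and is logged, so that the
        # history of the project can be shown to a user later on.
        self.links.append(link)
        self.version += 1
        self.changelog.append({"version": self.version, "note": note})

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, giving a record that is
        # readable by both people and programs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


project = Project(project_id="example-project")
project.add_link(
    PairedLink(
        genome_accession="GENOME_0001",
        metabolomics_file_url="https://example.org/data/run1.mzML",
        sample_preparation="methanol extract",
    ),
    note="added first genome-metabolome link",
)
print(project.to_json())
```

Keeping the change log inside the record is just one possible design; a real system could equally store each version separately.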
Motivation
Multi-omics approaches are on the rise, and the PoDP facilitates their development by recording paired data as well as validated links between genomes and mass-spectral data. This will assist in validating novel algorithms that, in turn, will spark the creation of novel paired data sets. The PoDP enables and encourages FAIR data exchange within the scientific community, an initiative that is widely accepted but in practice difficult to realise without an incentive. The PoDP gives researchers an incentive to make their data public, as this curated source of paired data sets makes it much more likely that their data will be reused and cited. The PoDP also has the potential to catalyse new collaborations between groups across the world to creatively re-analyse datasets and discover new biochemistry that would otherwise have remained buried in the data. Finally, by collecting both recorded omics data links and validated genomic-metabolomic entities therein, the platform also contributes to the development of novel algorithms, which benefits tool developers like myself.
Lessons learned
The funding of this research was part of a larger eScience grant. In general, it is very difficult to find funding purely for the FAIRification of workflows or data, and doing so is not (yet) common practice. We were able to present the platform at several conferences and seminars, and we will also integrate its use and existence in future workshops.
During the project itself, it was hard to derive a mutually agreed minimal metadata list, as each laboratory (more than 100 researchers from >10 different countries were involved) has its own specific ways of doing things, and not everything is easily captured in (existing) ontologies. Furthermore, the (perceived) additional time needed to register paired omics data projects in the PoDP was also an important factor. Yet, the platform now contains an easy-to-fill form that accommodates most typical workflows. Michelle also bridged biochemistry and bioinformatics by designing a new community standard for paired data sets.
It is encouraging to observe that the platform has already been picked up by the community, and 75 new entries have been made since the launch. Furthermore, the first tools that automatically connect gene clusters to mass spectra are integrating the platform and/or using the recorded paired data for training and validation.
How much extra time did the open practices require?
Building the platform took quite some time, and maintaining it will continue to take time: because contributors can submit projects themselves, the platform is also (more) vulnerable to security issues. For a user, it takes 2-3 hours to collect the necessary information and submit a project to the platform. Usually, one or two rounds of review are needed to reach the final approved version, during which metadata details are added or the location of the data is further specified using correctly formatted URLs.
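As an illustration of the kind of URL check that could reduce review rounds, here is a minimal Python sketch that tests whether a data location is a well-formed URL. It is not part of the PoDP: the function name, the set of allowed schemes, and the example URL are assumptions, and the check does not test whether the data can actually be reached.

```python
from urllib.parse import urlparse

# Assumption: only these schemes count as valid data locations.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_well_formed_url(value: str) -> bool:
    """Return True if value has an allowed scheme and a host part."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # for example an unclosed IPv6 bracket.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


for candidate in ["https://example.org/data/run1.mzML", "www.example.org/run1"]:
    print(candidate, is_well_formed_url(candidate))
```

A check like this only catches formatting problems; whether the URL points to the intended dataset still needs manual review.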
URLs, references and further information
The PoDP projects (downloaded >100 times)
The PoDP web application in a zip
Tool that integrated the PoDP platform and used validated entities therein