Special Report: From tools to infrastructure
Bioinformatics is moving from individual tools to integrated infrastructures in order to handle Big Data.
The genetic information in DNA is transcribed into RNA. A large fraction of this RNA is translated into proteins, and the rest is active as RNA molecules, known as non-coding RNAs. This is the basis for all life as we know it. In particular, proteins are used as building blocks, enzymes, motor molecules, or receptors: basically everything that is needed to make functional cells and organisms. However, the process from DNA to RNA and protein has to be regulated. Important aspects of this regulation have been the main research area of the Bioinformatics & Gene Regulation (BiGR) research group for several years, and will be used here to illustrate how bioinformatics is moving from individual tools and databases to large integrated systems and infrastructures.
Case study: How are genes regulated?
Gene regulation is complex and there is still much that we do not understand. However, at least three different mechanisms are essential. Access to genes in the genome is regulated by epigenetics: specific modifications to the DNA, or to the proteins (histones) used to pack the genome inside the cell. The expression of accessible genes is regulated by transcription factors: proteins that can recognise and bind to different regions of DNA in the genome. The fate of the RNA transcript can be modified by microRNAs: short RNA molecules that can bind to transcripts and inhibit translation or target them for degradation.
Most steps in these processes can be analysed with bioinformatics, for example by predicting where transcription factors will bind to DNA, and several tools have been developed for this. In our 2006 survey we identified more than 100 published tools, all of them trying to solve the same problem, and the number is still increasing. This shows how challenging the problem is: many different approaches have been tested. It is also quite a common situation in bioinformatics, where a range of individual tools has been developed for many different problems.
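As a hedged illustration of one widely used way to approach this prediction problem (not necessarily the method of any particular tool in the survey), the sketch below scores every window of a DNA sequence against a position weight matrix and reports the windows above a threshold. The count matrix, pseudocount, background frequency, threshold and example sequence are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical count matrix for a 4 bp motif (one dict per motif position).
# The counts are invented for illustration and do not describe a real factor.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((n + pseudocount) / total) / background)
            for base, n in column.items()
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold.

    The sequence is assumed to contain only uppercase A, C, G and T.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


PWM = log_odds_matrix(COUNTS)
hits = scan("TTACGTGGACGTAA", PWM, threshold=5.0)
for start, window, score in hits:
    print(start, window, score)
```

A matrix of this kind treats each motif position independently and only scans the given strand. These are simplifications, and the choice of threshold strongly affects how many false positives are reported.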
Several research groups, including our own, have shown that the best approach for improving the situation is to combine more information into an integrated analysis, including different types of genome-wide experimental data. In 2008 we began to develop a workbench for analysing gene regulation, known as MotifLab, by integrating several important tools and methods into a common framework. MotifLab (Fig. 1) is now one of the best workbenches available for this type of analysis. The most recent version can even take the three-dimensional organisation of the genome into account during the analysis. We have also made a database resource, EpiFactors, for epigenetic components of gene regulation.
Consortia and big data
Access to high-throughput (or ‘next generation’) DNA sequencing has led to a wave of new data, including large amounts of data on gene regulation. An important driving force has been the formation of large consortia for data generation and analysis, for example the FANTOM consortium, in which the BiGR group participates, and the ENCODE consortium. Such consortia have facilitated large-scale data generation, covering both multiple cell types and a variety of experimental methods. Because they use standardised approaches, the resulting data can be integrated and compared. But knowing how to utilise such data efficiently is a challenge, as the total amount of data is growing faster than the computer capacity for analysing it. Our own MotifLab is still important for analysing specific problems, but for general data analysis we need more powerful and clever approaches.
Towards infrastructures for bioinformatics
A traditional approach to the integration of genomic data has been the ‘genome browser’, a window into a genomic region where different types of properties and experimental data can be integrated and visualised. Important examples are the Ensembl and UCSC genome browsers. The Genomic HyperBrowser, to which we have contributed, is an interesting extension to traditional genome browsers. Here statistical methods are used to test whether specific genomic properties are correlated. The calculation is done across the entire genome, not just inside a certain window, and can, for example, show whether the binding of a certain transcription factor is correlated with specific epigenetic modifications. This is an important step towards large-scale data analysis.
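The idea of testing whether two genomic properties are correlated across the whole genome can be sketched with a simple permutation test. This is a toy example under stated assumptions, not the statistics actually used by the Genomic HyperBrowser: the genome is reduced to 200 equal bins, each track is a list of 0/1 flags per bin, and the null model shuffles one track uniformly, which ignores the clustering seen in real genomic data. All track contents are invented.

```python
import random


def overlap(track_a, track_b):
    """Number of bins where both tracks carry a feature."""
    return sum(1 for a, b in zip(track_a, track_b) if a and b)


def permutation_p_value(track_a, track_b, n_permutations=1000, seed=0):
    """Estimate how often a shuffled track_b overlaps track_a at least as
    much as the observed track_b does (one-sided test for enrichment)."""
    rng = random.Random(seed)
    observed = overlap(track_a, track_b)
    shuffled = list(track_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if overlap(track_a, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical tracks: transcription factor binding in bins 0-19, and an
# epigenetic mark in bins 0-17 plus two unrelated bins.
binding = [1 if i < 20 else 0 for i in range(200)]
mark = [1 if (i < 18 or i in (100, 101)) else 0 for i in range(200)]

observed, p_value = permutation_p_value(binding, mark)
print(observed, p_value)
```

A small p-value here only says that the overlap is larger than expected under uniform shuffling. A more realistic null model would preserve properties such as feature length and local density.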
However, in order to facilitate such integrated analyses we need suitable infrastructures. We need access to data through high quality databases, we need standards for how these data are described, stored and accessed, we need tools that can read and write these standards and communicate with each other in an efficient and reliable way, we need knowledge about where different tools and data are available, we need confidence that the resources will be available when needed, and we need access to competence on how to use the resources. This level of infrastructure quality is becoming essential for current research in molecular biology and medicine using methods from bioinformatics.
There are several ongoing infrastructure projects trying to address at least some of the challenges mentioned above. The BiGR research group is part of the Norwegian node in ELIXIR. This European infrastructure is co-ordinating European bioinformatics by making sure that key resources, both tools and databases, are available to users, actively maintained and integrated with each other.
Sensitive data
Integration projects require access to data of various types and from different sources. This can become a challenge if the data are sensitive, i.e. from human patients or donors, such as data from population surveys like HUNT or from medical projects at hospitals. Such data are essential for understanding diseases caused by, for example, mutations affecting gene regulation. Access to sensitive data is normally regulated through the written, informed consent of the donor. However, there is variation in how this is handled and in how strict the rules are, particularly for genomic data, as genome sequences in principle contain enough information to identify participants. Access to such data is therefore carefully regulated. At the same time, it is important that potential users actually know which data exist, what type of information they contain, and how they can be accessed. We have developed eGenVar, a data management system that can be used to provide structured and standardised information about data sets, without exposing the actual data. This makes it possible to advertise data in a consistent way, independent of whether the actual data are sensitive or not, which makes it easy for users to design studies. Legitimate users can then be given access to the sensitive part of the data after approval of the study.
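The principle of advertising a data set without exposing it can be illustrated with a small catalogue of metadata records. This is a hypothetical sketch and does not reflect the actual schema or interface of eGenVar: the record fields, identifiers, contact addresses and storage locations are all invented for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; the field names are assumptions."""
    identifier: str
    description: str
    data_type: str
    sensitive: bool
    contact: str
    location: str  # where the actual data live; never shown publicly


def public_view(record):
    """Return the metadata that may be advertised, without the location."""
    view = asdict(record)
    del view["location"]
    return view


def search(catalogue, data_type):
    """Find advertised data sets of a given type, sensitive or not."""
    return [public_view(r) for r in catalogue if r.data_type == data_type]


# Invented example entries.
CATALOGUE = [
    DatasetRecord("DS-001", "Tumour exome sequences", "exome", True,
                  "data-access@example.org", "/secure/storage/ds001"),
    DatasetRecord("DS-002", "Cell line expression profiles", "expression",
                  False, "lab@example.org", "/public/ds002"),
]

results = search(CATALOGUE, "exome")
print(results)
```

A user designing a study can see that a relevant data set exists, whether it is sensitive, and whom to contact, while the data themselves stay behind the approval process.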
Where to go from here
Reliable integration of data and tools is an important and ongoing process. Although a lot of work remains, we feel confident that it will provide more powerful approaches for large-scale data analysis, leading to novel insights in, for example, gene regulation and associated diseases. However, there are related problems that will need a lot of attention. One is the challenge of data transfer: with increasing data volumes distributed across servers, how can large and complex analyses be facilitated without saturating network capacity? Another is reproducibility: how to ensure that both reviewers and users are able to reproduce a given study based on a complex collection of tools, parameter settings and input data. The ability to combine data from multiple consortia and laboratories makes it increasingly challenging to reproduce and verify studies, but at the same time this highlights the importance of proper integration of data and tools.
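One possible way to make a study easier to reproduce is to record the tools, parameter settings and input data in a manifest and to derive a single fingerprint from it. The sketch below is a minimal illustration under stated assumptions, not an established standard: the manifest layout, tool names, versions and parameters are hypothetical, and the inputs are passed as in-memory bytes, whereas a real workflow would read and checksum files.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 checksum of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(tools, parameters, inputs):
    """Describe an analysis by tool versions, parameters and input checksums."""
    return {
        "tools": dict(tools),
        "parameters": dict(parameters),
        "inputs": {name: sha256_of(content) for name, content in inputs.items()},
    }


def manifest_id(manifest):
    """Fingerprint of a manifest; key order does not affect the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_of(canonical.encode("utf-8"))


# Hypothetical analysis description.
manifest = build_manifest(
    tools={"motif_scanner": "1.2.0", "aligner": "0.9.1"},
    parameters={"threshold": 5.0, "genome_build": "hg38"},
    inputs={"peaks.bed": b"chr1\t100\t200\n", "regions.bed": b"chr1\t150\t250\n"},
)
print(manifest_id(manifest))
```

Two analyses with the same fingerprint used the same recorded tool versions, settings and input contents. Any difference in one of these changes the fingerprint, which gives reviewers a quick first check before attempting a full rerun.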
Important web links
MotifLab – http://www.motiflab.org
Genomic HyperBrowser – https://hyperbrowser.uio.no/hb/
UCSC Genome Browser – https://genome.ucsc.edu/
Ensembl Genome Browser – http://www.ensembl.org/
EpiFactors – http://epifactors.autosome.ru/
HUNT – https://www.ntnu.edu/hunt
ELIXIR – https://www.elixir-europe.org/
ELIXIR Norway – https://www.elixir-europe.org/about/elixir-norway
eGenVar – http://bigr.medisin.ntnu.no/data/eGenVar/
Professor Finn Drabløs
Bioinformatics & Gene Regulation
Department of Cancer Research and Molecular Medicine
Norwegian University of Science and Technology (NTNU)
+47 72 57 33 33
[email protected]
http://www.ntnu.edu/employees/finn.drablos