Decoding living systems

Dr Simon Moxon from the Earlham Institute and Dr Matthew Stocks from UEA discuss the importance of a better understanding of miRNAs and the challenges involved in achieving this.

All cells within an organism carry the same genetic instructions but are clearly not equal. What makes a brain cell different from a skin cell is that each expresses a different subset of genes, and the resulting active proteins allow the cell to carry out its particular function. Gene expression is a complex process and can be regulated at several levels. One of the most recently discovered regulatory layers, called RNA silencing, involves microRNAs (miRNAs), tiny RNA (ribonucleic acid) molecules of around 22 nucleotides in length.

MiRNAs play a crucial role in the RNA silencing machinery and can interact with messenger RNAs (mRNAs) by binding to complementary target regions. This interaction, which can be predicted using specialised computer algorithms, can block protein production and lead to the degradation or ‘silencing’ of the mRNA molecule.

Pan European Networks spoke to Dr Simon Moxon, a project leader in the Swarbreck Group (Organisms and Ecosystems) at the Earlham Institute in the UK, about the importance of a better understanding of miRNAs and the challenges involved in achieving this. Dr Moxon has created algorithms and developed several complete bioinformatics tools for the discovery of miRNAs and their targets in plants and animals, as well as for the study of general small RNA (sRNA) features.

What role do miRNAs play in RNA silencing?

They form a key part of the RNA silencing machinery and act by regulating messenger RNA molecules in a variety of ways. They guide the silencing machinery to regulate target genes by binding to specific regions of an mRNA via sequence complementarity. The binding either prevents translation into protein or can initiate cleavage and degradation of the mRNA molecule.
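As a rough illustration of what 'binding via sequence complementarity' means computationally, the minimal Python sketch below scans an mRNA for sites that are the reverse complement of a short stretch of a miRNA. It assumes the simplified 'seed' model often used for animal miRNAs (nucleotides 2 to 8 of the miRNA), which is not described in the interview; the function names and sequences are purely illustrative, and real target prediction tools apply many more criteria.

```python
def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A-U, G-C pairing only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna.upper()))


def find_seed_sites(mirna: str, mrna: str) -> list[int]:
    """Return 0-based start positions in the mRNA that pair perfectly with the seed.

    Assumption (not from the interview): the seed is nucleotides 2-8 of the miRNA.
    """
    seed = mirna.upper()[1:8]           # nucleotides 2-8 (1-based)
    site = reverse_complement(seed)     # what the mRNA must contain to pair with the seed
    mrna = mrna.upper()
    hits = []
    start = mrna.find(site)
    while start != -1:
        hits.append(start)
        start = mrna.find(site, start + 1)  # keep searching, allowing overlapping sites
    return hits


# Illustrative sequences only: a 22-nucleotide miRNA and a made-up mRNA fragment.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_mrna = "AAAACUACCUCGGGGCUACCUC"
sites = find_seed_sites(example_mirna, example_mrna)
```

Here `sites` holds the positions at which the hypothetical mRNA fragment could pair with the miRNA seed. A perfect seed match is only a candidate site; whether it is a genuine target still has to be confirmed experimentally.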

What impact does this have on plant and animal health?

The impact microRNA regulation has on the health and development of animal and plant species is enormous. Without gene silencing, plants and animals fail to regulate gene expression and thus fail to develop correctly. Experiments in both plants and animals have shown that removing key enzymes in the miRNA biogenesis pathway leads to severe developmental defects which are usually lethal early on.

How can we use biological data to understand these regulatory and developmental processes, and what implications might this have for health more widely?

It is now straightforward to sequence the small RNA content of biological samples. However, the analysis and interpretation of this data is still challenging. The main focus of sRNA research is to identify regulatory molecules such as miRNAs, profile their expression (e.g. across tissues, developmental stages, or in healthy and diseased states) and then to identify the genes and biological pathways that they regulate.

In terms of the impact on health, miRNAs have been proven to be crucial for plant health in driving the response to biotic and abiotic stresses. Therefore, understanding these regulatory mechanisms is important for agriculture and plant breeding to develop new crops that are, for example, more resistant to drought or soil salinity.

In human health, we have seen a huge move towards the profiling and analysing of miRNA expression in a variety of diseases. The misregulation of miRNAs has been shown to be a hallmark of certain types of cancer, as well as neurodegenerative disorders, leading to great interest in the use of miRNAs as biomarkers for diagnosis and predictors of prognosis.

MiRNA-based therapeutics to treat disease are currently under development, and progress in this area is an exciting prospect for the future.

The miRNA Workshop 2016 sought to train early-career bioinformaticians to analyse sRNA data. Can you expand on some of the techniques used in this and the difficulties involved in applying them in practice?

The techniques for teaching bioinformatics are, in general, example driven. We gave our attendees a set of data to work from and widely used bioinformatics software packages for analysis. However, it is important to understand the underlying principles involved in any analysis of this type, because bioinformatics is by nature a prediction-based science. Algorithms can only go so far in providing direction for a wet-lab scientist seeking answers from their sequencing experiment. Therefore, all software examples came with a set of lectures on how the software makes its predictions and on the pitfalls and caveats that should be taken into account to get the best out of the data.

The difficulties in applying such techniques in practice are often of a technical nature. Computers are by their very nature finite systems, and limited resources can pose a significant challenge when working with large sets of experimental data. It is important to learn how to streamline analyses so that questions can be answered both in a reasonable amount of time and within the available system resources. On top of this, problems can often be encountered with the input data itself. The accuracy of bioinformatics algorithms is generally linked to the quality of the input data. However, sequencing instrumentation and library preparation protocols have technical biases, and experimental design is a key consideration when analysing sRNA data.

Considerable care must be taken to prevent downstream analysis from providing a false picture of the process being studied. For this reason, we spend a good proportion of our time teaching not only sequence identification techniques, but also quality checking and procedures for computationally correcting differences in sequencing depth (termed normalisation).
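To make the idea of normalisation concrete, the short Python sketch below scales raw read counts to reads per million (RPM), one simple and widely used way of correcting for sequencing depth. It is an illustration only, not necessarily the method taught at the workshop; the function name, sequence labels and counts are hypothetical.

```python
def reads_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that each sample sums to one million reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {seq: count * 1_000_000 / total for seq, count in counts.items()}


# Hypothetical counts: sample B was sequenced ten times more deeply than sample A,
# but the relative abundance of the two sRNAs is identical.
sample_a = {"sRNA-1": 100, "sRNA-2": 900}
sample_b = {"sRNA-1": 1000, "sRNA-2": 9000}

rpm_a = reads_per_million(sample_a)
rpm_b = reads_per_million(sample_b)
```

After scaling, `rpm_a` and `rpm_b` are equal, so the tenfold difference in raw counts is correctly attributed to sequencing depth, not to a change in expression. Simple scaling of this kind can itself be skewed when a few very abundant sequences dominate a sample, which is one reason more elaborate normalisation methods exist.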

There is a critical skills gap in life science data management. What needs to be done to attract more talent to the field?

People don’t necessarily think of bioinformatics as a career choice. There is potentially a lack of undergraduate training and of awareness of what bioinformatics is and what opportunities are available for computer scientists who choose to specialise in this field. On another note, the cross-disciplinary nature of the field could also have an impact; students studying biology may be put off by the computational aspect and their lack of coding skills, while computer scientists may be put off by their apparent lack of biological knowledge.

Overall, it is difficult to find people who know both biology and computer science – they have to have training in two different, highly specialised fields, and this is a rare skill set to find.

I also think that salaries for computer scientists are higher in industry, particularly in ‘big data’ analysis jobs. Many computer scientists might therefore ask why they would take short-term postdoc contracts over the stability of more general software development positions in industry, for example. Maybe this suggests the need for dedicated bioinformatics support and permanent contracts in academia.

Looking more widely, many life science data resources now double in size every six to 12 months and by 2020 estimates suggest that these data will be generated at up to one million times the current rate. What challenges – and indeed opportunities – do you foresee this data deluge bringing?

Small RNA sequence data is inherently noisy (the biological material is often filled with randomly degraded RNA products), and the level of noise is linked to the depth of sequencing. Therefore, the obvious challenge posed by this is detecting the signal (genuine sRNAs) as the depth (and, therefore, noise level) increases.

However, the overwhelming issues that will develop in this area are how to manage both storage and overall processing resources. Processing extremely large datasets poses significant and varied challenges, and I speak from experience when I say that just storing the initial sequence data for a medium-sized molecular biology group can prove extremely problematic. This problem is compounded when actually processing the data, where intermediate files are often huge in comparison to the original input data, especially when datasets are being processed in parallel by multiple individuals. Therefore, the main challenge posed by the coming data deluge is producing efficient and robust methods for handling such huge data volumes, a challenge that is perhaps not receiving the attention it requires.

On a more positive note, more data means greater opportunities for gaining a more complete insight into biology. This includes not just deeper and cheaper sequencing, but also new applications and technologies allowing us to answer biological questions on a genome-wide scale. High-throughput sequencing is a great example of the popular, discipline-spanning term ‘big data’, and it should also be noted that, in addition to yielding new biological insights, newly developed computational techniques for dealing with such huge volumes of data can potentially be applied successfully to other scientific fields.

Dr Matthew Stocks

Vincent Moulton’s Computational Biology Group

University of East Anglia (UEA)

Dr Simon Moxon

Swarbreck Group (Organisms and Ecosystems)

The Earlham Institute

This article first appeared in issue 20 of Pan European Networks: Science and Technology.