Partnering for progress

The Earlham Institute has joined forces with Kx Systems for an ambitious new project which will revolutionise bioinformatics research. EI's head of scientific computing, Tim Stitt, explains all to Portal

The Earlham Institute (EI), a UK-based life sciences research organisation, is partnering with data analytics company Kx Systems on an innovative new project that aims to transform bioinformatics research and promote a sustainable bioeconomy. By combining Kx's expertise in high-speed, big data analytics – built up over years of experience in the world's major financial centres – with EI's leading bioscience research capabilities, the six-month project will develop and embed the latest machine learning algorithms to create predictive models for studying crop growth patterns and agricultural methods. It's hoped that the project will provide an effective, real-time means of coping with the ever-expanding datasets being rapidly generated in the life sciences – a problem which can only be solved by taking full advantage of stream data processing and in-memory computing.

To find out more, Portal spoke to EI's head of scientific computing, Tim Stitt, who here discusses the ambitious new project, the institute's partnership with Kx and the challenges posed by big omics (phenomics and genomics) data, as well as sharing his thoughts on the data science skills shortage evident in Europe and the UK government's new approach to technology within the agriculture sector.

What would you say are the biggest challenges posed by big omics data?

Traditionally, when it comes to big data, the challenges are defined by the ‘four Vs’: volume, velocity, variety and veracity. In terms of the Earlham Institute and the big omics data that we work with, it is the first two of these that are currently our primary concerns. One reason for this is that, with next-generation sequencing technologies and platforms – and the sensors and drones that we are using to measure and monitor living systems – huge amounts of data are being generated, so this is high volume as well as high velocity.

There are various facets to these challenges. One side, which is where I am involved, includes ensuring that we have the infrastructure required to handle the large amounts of data we are receiving, as well as making sure that we can transfer that data across computing systems efficiently and securely.

On the other side, we need to explore how to process the data quickly, and we also need to know how to process it to extract all the information we need from these large and complex datasets. Scientists want this as quickly as possible, so ideally we want to do this in real time as well – so as we are generating the data we stream it to the computing and storage systems, analyse it there and then, and hopefully generate knowledge from it as quickly as possible.

That’s the desired workflow and environment, but it is unfortunately easier said than done. There are additional issues such as security and privacy, which tie into how you store and transfer the data; in some cases, if the data is sensitive – as can be the case with next-generation sequencing, and particularly genome sequencing and precision medicine – you have to be very careful about how you secure that information.

Analysis can also be a challenge, i.e. developing the algorithms and computer code that can analyse the data. This naturally involves interdisciplinary teams with various skills (IT personnel to manage the infrastructure that stores the data, computer scientists to develop the algorithms, mathematicians and statisticians to define the mathematical theory, domain scientists who can interpret the data, etc.), and it can be a challenge to make sure that everyone is co-ordinated and is communicating effectively.

How are you addressing these communication issues?

We try to reduce those problems by using open standards for our data, which is actually a strength of the bioinformatics community in the UK and even Europe. Various initiatives exist that are trying to define and utilise open standards for data formats – for instance ELIXIR, which spans Europe – and these allow us to know in advance what the data will look like. The project we are working on with Kx deals with more structured data; so, if we use a standard format, then it is easier to anticipate that and store it more effectively and more efficiently.
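To illustrate the point, here is a minimal sketch in Python of why an open, agreed format makes data predictable. The interview does not name a specific standard; FASTA is used purely as a familiar example, and the file name reads.fa is hypothetical.

    # Because FASTA is an open standard, every record can be anticipated:
    # a '>' header line followed by one or more sequence lines.
    def read_fasta(path):
        header, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line[1:], []
                elif line:
                    seq.append(line)
        if header is not None:
            yield header, "".join(seq)

    # Hypothetical usage: reads.fa stands in for any standards-compliant file.
    for name, sequence in read_fasta("reads.fa"):
        print(name, len(sequence))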

Communication between people is more difficult because mathematicians speak one language, computer scientists speak another, and biologists aren’t necessarily trained in the computing skills they need. That’s a challenge, but just getting people in a room and communicating with each other does work eventually; it’s just a case of investing the time to make sure everyone’s on the same page and there’s no misinterpretation.

How do you predict these challenges will develop given the evolution of the IoT etc.?

I mentioned that volume and velocity are the two things we are dealing with at the moment, and I think that side of things will get worse as the technology improves and is applied in different areas – the amount of data and the speed at which we receive it are just going to grow, and there will obviously be more scrutiny over security and privacy as the data is supplied to different and more sensitive areas.

At that point, the other Vs will come into play. Questions will be raised over the veracity of the data – for example, is the data reliable, where has it been collected from and has it been collected correctly, how is it being used, and does it make sense? And its structure – is it structured or unstructured data, how do you represent that, and how do you store it in an effective way? That is when standards etc. will come in.

I would hope these challenges can be resolved, but the pace of the technology and of its application could well outstrip the efforts to streamline it. It’s a nice problem to have, however, and I am sure there are many computer scientists and groups around the world who are working on this.

For us, data is our currency, and getting the data is the most important thing – the more data we have, the more data points we have and so the better predictions we can make. It’s very important that things like the IoT and data generation in general continue to happen. We just have to work a little bit harder to support it.

That leads into what may be seen as a side topic: a skills shortage, which is a big concern. Are we creating the next generation of data scientists and university graduates who are able to help manage this data? I am very concerned that we are not necessarily doing that as well as we could be.

How far is that changing? Primary school children are now being taught to code – does that perhaps signal a recognition of the skills gap challenge?

It’s finally been realised that there is a dearth of talent – or at least that there will be if something isn’t done – but it will be quite a few years before that pays off and begins generating benefits. I have been in the high-performance computing and supercomputing arena for many years, and we have been using multicore processors for ten or more years, but it is only now that we are teaching undergraduate computer scientists how to do parallel computing and programming for multicore processors. For a long time we have had all this tremendous technology, but it’s been very difficult to find the programmers who can come and develop the code, and I think it will be similar for data scientists.

We are starting now at primary school level, which is great, but it will still take over ten years for that cohort to come through the educational system and begin applying their skills. We really need the talent now, but I think that we are a few years behind that.

That puts Europe at a disadvantage to other areas, particularly Asian countries.

Asian countries have perhaps invested more in this over the last few years and could well benefit from it, yes, although the skills gap is not just a problem in Europe. It’s a big challenge in the USA too, and various initiatives are being put in place there to increase high-performance computing, parallel computing and data analytics skills.

How do you feel Kx's experience in the financial sector will benefit the Earlham Institute's bioscience activities? Do you foresee any issues with transferring the knowledge, etc. across disciplines/fields?

Kx has been developing high-speed streaming data analytics platforms within the financial trading and markets sector for more than 20 years, so they’ve essentially been handling and processing big data for a long time. They’ve perfected this technology within their own domain and are now looking for opportunities to exploit it in other areas.

In our case, high-speed streaming data analytics is effectively what we’re doing with our living systems project: Ji Zhou, the project leader, is leading a team of computer scientists to utilise IoT-powered technologies to monitor and model the growth patterns of different crops, particularly wheat, which has a very large and complex genome and important implications for global food security. Around 40-50% of the world’s food supply is dependent on wheat, but we are having difficulty generating the yields we need to support an ever-growing population. If we can understand the wheat genome through sequencing and assembly, then we can give information back to the wheat breeders to show, for instance, which genes are affected by heat or pathogens, and they can try to grow different types of wheat that are not as susceptible. So Zhou and his team are capturing adaptive traits of many varieties of wheat, including height, vegetative greenness and plot canopy, in the field – via remote sensors, UAVs, 3D laser scanners, etc. – which generates big phenotypic datasets such as time-lapse image series, 3D point clouds and 4K videos. Machine learning and neural network algorithms are then applied to that data to try to extract information from it.
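As a rough illustration of the kind of trait extraction described above, the sketch below computes a greenness index and a canopy fraction from an image with Python and NumPy. The interview does not specify the algorithms Zhou's team uses; the excess green index, the 0.1 threshold and the randomly generated frames are assumptions for illustration only.

    import numpy as np

    def excess_green(rgb):
        # Excess green index (2g - r - b) on chromaticity-normalised channels,
        # an assumed but common proxy for vegetative greenness.
        rgb = rgb.astype(np.float64)
        total = rgb.sum(axis=-1)
        total[total == 0] = 1.0                     # guard against all-black pixels
        r, g, b = (rgb[..., i] / total for i in range(3))
        return 2 * g - r - b

    def canopy_fraction(rgb, threshold=0.1):
        # Fraction of pixels classified as vegetation: a crude stand-in for
        # the plot-canopy trait mentioned in the interview.
        return float((excess_green(rgb) > threshold).mean())

    # Hypothetical time-lapse stack: (frames, height, width, RGB) of 8-bit pixels.
    frames = np.random.randint(0, 256, size=(5, 64, 64, 3), dtype=np.uint8)
    print([round(canopy_fraction(f), 3) for f in frames])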

That’s an example of high-volume data being streamed from the fields. It’s difficult for more traditional supercomputing systems to deal with that in real time, which is why we have gone with Kx: their platform has been dealing with streaming data for decades, and by using their high-speed analytics engine, coupled with our machine learning algorithms, etc., we hope to be able to improve our predictive models for crop growth, which we can then feed back to the breeders far quicker.
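The streaming, in-memory style of analysis described here can be sketched generically in Python, without reference to Kx's actual platform or query language, which the interview does not detail. The per-plot rolling mean, the window size and the sample readings are all illustrative assumptions.

    from collections import defaultdict, deque

    class StreamingAggregator:
        # Keeps a fixed-length in-memory window of readings per field plot and
        # updates a summary on every arrival, so a downstream predictive model
        # can be refreshed immediately rather than waiting for a batch job.
        def __init__(self, window=100):
            self.windows = defaultdict(lambda: deque(maxlen=window))

        def ingest(self, plot_id, value):
            w = self.windows[plot_id]
            w.append(value)
            return sum(w) / len(w)                  # rolling mean, available at once

    # Hypothetical greenness readings arriving from the field, one at a time.
    agg = StreamingAggregator(window=3)
    for plot, greenness in [("plot-1", 0.42), ("plot-1", 0.45), ("plot-1", 0.47)]:
        print(plot, round(agg.ingest(plot, greenness), 3))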

How will the Earlham Institute approach this new relationship? Where will your priorities lie?

Our first priority will simply be to acquaint ourselves with the Kx team. Most of their expertise will be on the data streaming side of things, while our team will bring the biology and bioinformatics knowhow. Going back to your first question, then, we’ll need to bridge that gap with communication; we need to communicate to them our needs, and they need to be able to communicate to us what they can do and how they can implement what we need to do.

By the end of the project we hope to have demonstrated that we can use this platform in our particular setting to help us generate more predictive models in real time. Once we get some of this initial data, we hope to be able to move forwards with larger proposals – perhaps aiming at an Innovate UK or even a European award – as we expect to have data to back up our proposals.

How open is your data? Can anyone access it?

We are funded by the Biotechnology and Biological Sciences Research Council, so it is effectively taxpayers’ money that funds our research. We therefore ensure that all our data is open. As soon as we generate data, we look for ways to publish it – online or in databases – so that other researchers in the UK, Europe or around the world can access it and do further post-analysis on it.

Given the increasing importance that is being placed on IoT technologies in agriculture by the UK government, how do you feel the benefits of your new partnership will extend into policy areas, etc.?

This is perhaps even more important now that we are leaving the EU, as the UK will no longer be part of the EU agricultural policies.

In the UK there has now been a change of strategy towards data-intensive IoT, etc. – particularly within agriculture – so this project is hopefully a proof of concept, for us and for others, of what we can achieve with current IoT technology, as well as of our ability to back it up with real-time streaming computing infrastructure and data analytics solutions – in our domain and in other industrial sectors. If we can demonstrate that it works, then there is no reason why it can’t be applied in other areas as well. It will hopefully also help to justify the UK government’s change of strategy towards IoT and its investment in the area.

 

Tim Stitt

Head of Scientific Computing

Earlham Institute

Tweet @EarlhamInst

www.earlham.ac.uk