Reviving Genomics Progress: Overcoming Big Data Challenges

I have had the opportunity to get to know a great many people in life sciences. Whether pharmaceutical, biotechnology, or direct-to-consumer (DTC) companies, they all share the common goal of advancing medicine and improving life. These brilliant researchers and scientists also share a common challenge: they are drowning in Big Data – the very data that holds the key to new discoveries and the promise of revolutionising health care.

Genomics has progressed faster than anyone could have foreseen. Since 2003, when the first whole human genome was sequenced (at a cost of US$2.7 billion), we have reduced sequencing time from years to mere hours and the cost to about US$550 (with the anticipated ‘$100 genome’ set to push the boundaries even further). Not surprisingly, this has produced torrents of genetic data worldwide, with no abatement in sight.

By 2025, estimates predict that over 60 million patients will have had their genomes sequenced in a healthcare setting. Another study estimates that up to 2 billion genomes in total will be sequenced by then, translating to approximately 40 exabytes of data.

We have come so far, so fast.

The UK Biobank recently reported that its data will grow to 15 petabytes by 2025, making downloading entirely unfeasible. To put this in perspective, downloading 15 petabytes over the fastest available retail fibre optics would take 7.6 years – and even the most advanced cloud transfers would take more than 14 days.
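These figures can be sanity-checked with simple arithmetic. The link speeds below (~500 Mbit/s for retail fibre, ~100 Gbit/s for a cloud-scale transfer) are my own illustrative assumptions, chosen because they roughly reproduce the quoted durations:

```python
# Back-of-the-envelope transfer times for 15 petabytes of data.
# Link speeds are assumptions, not measured figures.
DATA_BITS = 15 * 10**15 * 8  # 15 PB expressed in bits

def transfer_time_seconds(bits: int, link_bits_per_sec: float) -> float:
    """Idealised transfer time: data size divided by sustained link speed."""
    return bits / link_bits_per_sec

retail_fibre = transfer_time_seconds(DATA_BITS, 500 * 10**6)  # ~500 Mbit/s
cloud_link = transfer_time_seconds(DATA_BITS, 100 * 10**9)    # ~100 Gbit/s

print(f"Retail fibre: {retail_fibre / (365.25 * 86400):.1f} years")
print(f"Cloud link:   {cloud_link / 86400:.1f} days")
```

In practice real transfers are slower still, since sustained throughput rarely matches a link's rated speed.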

Currently, most of the genomics data on the planet is collected and stored in silos – in public and private biobanks, research facilities, DTC genetic testing companies, pharmaceutical organisations, and so on.

The pharmaceutical companies I regularly meet with typically have data distributed across their teams and across their organisations, spanning countries and jurisdictions. Even within organisations there is the major challenge of trying to combine disparate data sets to perform meaningful analyses, with regulations associated with cross-border patient data transfers further exacerbating the problem.  

Now let’s add partnerships to the mix. Keeping with my previous example, large pharmaceutical companies often collaborate with biobanks that restrict data from leaving their environments – making integration of private and public data sets impossible. The lack of standardisation across these diverse datasets introduces yet another hurdle.

Organisations and consortia need a way to preserve the safety and integrity of their data while streamlining access and analyses. 

Some biobanks now allow researchers to BYOD (bring your own data), essentially permitting users to upload their data and run analyses over the combined datasets. However, this is not practical for a number of reasons:

  1. Transfer costs are prohibitive;
  2. Copying massive datasets doubles storage costs;
  3. Uploading genomics data takes an extremely long time (see above); and
  4. Regulations and restrictions on moving sensitive data present roadblocks – especially for pharmaceutical companies concerned about patents.

And now we’re back to where we started. It’s all too problematic.

So what’s the answer? 

I have had the privilege of seeing first-hand how pioneering organisations can, for the very first time, run collaborative, multi-party analyses across massively distributed data sets – without the data ever moving – and yield far more impactful results.

Seems like magic. But it’s not.

This is federated data analysis. It abstracts the most complex, distributed, and fragmented IT landscapes into a user experience that makes it appear as if all your data – whether local in your HPC or hybrid cloud environment, in the public cloud, or sitting in a biobank on the other side of the world – is in one place, much like on a personal computer.

The most beautiful part is that the data never moves. Unnecessary storage and duplication costs are eliminated, and painful data transfers are a thing of the past. Access and analyses are near-instantaneous where previously they took days, weeks, or months – and all the while data compliance and security are assured.
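The core idea behind the federated pattern is that the computation travels to each data silo, and only small, aggregate summaries travel back. Here is a minimal illustrative sketch (my own toy example, not Lifebit’s actual API): computing an allele frequency across three ‘sites’ whose raw genotypes never leave their silos.

```python
from typing import Dict, List

def local_allele_counts(genotypes: List[int]) -> Dict[str, int]:
    """Runs inside a site's own environment: count alternate alleles.
    Genotypes are coded 0/1/2 = copies of the alternate allele."""
    return {"alt_alleles": sum(genotypes), "total_alleles": 2 * len(genotypes)}

def federated_allele_frequency(site_summaries: List[Dict[str, int]]) -> float:
    """Runs at the coordinator: combines per-site summaries only."""
    alt = sum(s["alt_alleles"] for s in site_summaries)
    total = sum(s["total_alleles"] for s in site_summaries)
    return alt / total

# Each silo holds its own cohort's genotypes; the raw values never move.
site_a = [0, 1, 2, 0, 1]
site_b = [1, 1, 0, 2]
site_c = [0, 0, 1]

summaries = [local_allele_counts(g) for g in (site_a, site_b, site_c)]
print(federated_allele_frequency(summaries))  # → 0.375
```

Real federated platforms add authentication, governance, and privacy safeguards on top, but the division of labour – local computation, central aggregation – is the same.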

At Lifebit, our mission is to democratise the analysis and understanding of genetic big data, accelerating cures and enhancing life. Recognising that massively distributed omics data is the major problem impeding progress, we built Lifebit CloudOS, an end-to-end genomics cloud operating system that brings computation and analysis to the data, wherever it resides. CloudOS is accelerating genomics research and delivering enriched insights in personalised medicine. Users can scale quickly while drastically reducing costs and speeding time to insight. We see transformative impact for our customers daily – the scientists and researchers who share our mission to radically change how we do healthcare.

If you would like to learn how Lifebit can help solve your genomics data access and analysis challenges or just want to chat about life sciences, please drop me a line at