Creating research-ready data from 135,000 cancer, rare disease & COVID-19 participants


February 2024

Author: Hannah Gaimster, PhD
Contributors: Amanda White 



Around the world, organisations across the healthcare system produce huge quantities of information about patients, diseases, and treatments. Highly sensitive and detailed, this information holds the potential for discovering cures, identifying successful therapeutics, and increasing our understanding of health and disease.

For example, a recent groundbreaking study using data from 67 million people in England, led by Health Data Research UK, highlighted the impact of missed COVID-19 vaccinations during the pandemic. Another high-profile study, led by Genomics England, demonstrated how linking genomic data to clinical data can identify changes in cancer DNA that could be relevant for an individual patient’s care. These important research findings were only made possible by secure access to high-quality health data.

Unfortunately, these types of health-related data come with an array of problems that slow down research progress and ultimately limit the benefits for patients and communities. They are highly sensitive; fragmented (coming from multiple, disparate sources, e.g. hospitals, biobanks, laboratories, wearable devices); and can be multi-modal and unstandardised (e.g. in the form of written notes, electronic health records, medical images).

Organisations across the life sciences are working to solve these challenges of data quality and standardisation.


In this article we explore how Lifebit and Genomics England are working in partnership to create an incredible data resource for research, based on information from over 135,000 participants with cancer, rare disease and COVID-19.



In December 2022 HDR UK and EHDEN selected Genomics England and Lifebit as partners to deliver one of 22 data initiatives to improve the use of health data for research and innovation. 

The Genomics England and Lifebit project has mapped England’s national registry data to the Observational Medical Outcomes Partnership (OMOP) common data model. The main benefit of the project was the ability to use Lifebit’s Cloud OS platform to incorporate pre-validated mapped data and automatically feed it into the ETL pipeline suite.

The aim is to create standardised data to further enable research progress and scientific discoveries.  


Background to the project:

Genomics England began as a vehicle to execute the UK Government's bold plan to sequence 100,000 whole genomes and incorporate genomic medicine into routine care in the NHS.

Though recruitment to the Project has ended, its impacts are still being realised – transforming the way people are cared for and bringing advanced diagnosis and personalised treatments to those who need them.

Through its partnership with Lifebit, Genomics England uses a secure cloud technology platform to give approved researchers access to pseudo-anonymised data for research studies. While the technology enables safe access to these data by the research community, there remains an essential need to standardise the data. Genomics England’s data uses several varied medical vocabularies and, as a result, the majority of a data scientist’s time is spent cleaning and organising data. Mapping the data to OMOP enables the numerous, disparate datasets to be standardised and, when ingested into Lifebit’s platform, to be securely accessed and queried for research.

The aim is to unlock the power of these data and drive research progress and precision medicine efforts for patients worldwide.



What was involved?

Dr Prabhu Arumugam, Director of Clinical Data and Imaging, at Genomics England said:


"We had already started to standardise associated clinical data from participants of the 100,000 Genomes Project and the COVID-19 study in 2022, but we required further resources to complete the project."

The project focused on a dataset that included 135,000 participants within the National Genomic Research Library. These include people with cancer, rare diseases and COVID-19, across all age groups ranging from infants and toddlers through to adolescents and older people.


Anastasios Siapos, Data Engineering Manager, at Lifebit said:


“Recognising the need to protect patient privacy at all times, our work was conducted within Genomics England’s Trusted Research Environment (TRE) so the data did not move. This environment uses the Lifebit Platform to run workloads and access data.


We applied our proprietary ETL (extraction, transformation, loading) pipeline suite to carry out the transformation of Genomics England data to the OMOP data model. These pipelines were run via the Lifebit Platform. This included cleaning and profiling the data, carrying out high-level field mappings and low-level concept mappings, and ingesting the final OMOP tables.

Our team of experts worked in tandem with Genomics England’s data scientists to create the necessary mapping file inputs required by the pipelines.” 
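
To make the two levels of mapping described above more concrete, here is a minimal Python sketch. It is not Lifebit's proprietary ETL suite; the source column names and the mapping dictionaries are hypothetical stand-ins for the mapping files mentioned in the quote. The only real values are the OMOP standard gender concepts (8507 for male, 8532 for female) and concept 0, which OMOP uses for "no matching concept".

```python
# Illustrative only: a toy version of field-level and concept-level mapping.
# FIELD_MAP and CONCEPT_MAP are hypothetical stand-ins for user-provided mapping files.

FIELD_MAP = {  # source column -> (OMOP table, OMOP column)
    "participant_id": ("person", "person_source_value"),
    "sex": ("person", "gender_concept_id"),
    "birth_year": ("person", "year_of_birth"),
}

CONCEPT_MAP = {  # (source column, source value) -> OMOP concept_id
    ("sex", "M"): 8507,  # OMOP standard concept: male
    ("sex", "F"): 8532,  # OMOP standard concept: female
}


def to_omop(row):
    """Apply field mappings, then concept mappings, to one source record."""
    out = {}
    for src_col, value in row.items():
        if src_col not in FIELD_MAP:
            continue  # unmapped source fields are dropped in this sketch
        table, column = FIELD_MAP[src_col]
        if column.endswith("_concept_id"):
            # 0 is the OMOP convention for "no matching concept"
            value = CONCEPT_MAP.get((src_col, value), 0)
        out.setdefault(table, {})[column] = value
    return out
```

Calling to_omop on a record such as {"participant_id": "P001", "sex": "F", "birth_year": 1980} yields a single "person" entry with the source values moved into OMOP columns and the sex code replaced by a concept ID.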

What were the challenges?


Millahat Asif, Health Data Scientist, at Lifebit said:


“OMOP and clinical experts across Lifebit and Genomics England contributed to producing high-quality mappings for the highly heterogeneous source data. Support from our key collaborators Dr Prabhu Arumugam and Dr Laura Kerr was key to ensuring that data integrity was preserved during the transformation.

This was a challenging endeavour and needed close working practices with the team at Genomics England in order to get their mappings into a structure that could be fed directly into our ETL suite, which would then automatically apply those mapping decisions to the raw data. This was possible due to the flexible nature of the ETL suite, which can utilise user-provided field mappings as well as custom vocabulary mappings to automatically transform source data into OMOP.

The process included some refinement sessions where we worked together to optimise the mappings provided, to ensure they aligned with OMOP specifications, and Lifebit was able to validate their mappings against those provided by Genomics England. Throughout this process, it was important to make sure that Lifebit’s implementation of Genomics England’s mappings had a high fidelity to what was provided, and this was achieved through sessions where Genomics England reviewed output files from the ETL suite that illustrated exactly how their mappings were utilised.

Another challenging aspect of this project was the cleaning and harmonising of source data. Due to the size and complexity of the data, this was challenging from both a technical and operational point of view. The profiling component of our ETL suite, as well as the direct collaboration with Genomics England’s data team, was invaluable in ensuring the success of this part of the project. The reports produced by profiling pointed us to where we needed to focus our standardisation efforts, and we were able to utilise a mix of the ETL suite’s capabilities, as well as some custom scripts, in order to carry out the cleaning work needed to map the data.”
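
As a rough illustration of what a profiling report of this kind might contain, the Python sketch below summarises a single source column. The function name, its inputs and the report fields are assumptions made for this example and do not describe the profiling component of Lifebit's ETL suite.

```python
from collections import Counter


def profile_column(values, known_codes):
    """Summarise one source column: size, missing entries, distinct values,
    and values that have no entry in the (hypothetical) mapping vocabulary."""
    cleaned = [None if v is None else str(v).strip() for v in values]
    missing = sum(1 for v in cleaned if v is None or v == "")
    counts = Counter(v for v in cleaned if v)  # skips None and empty strings
    unmapped = {v: n for v, n in counts.items() if v not in known_codes}
    return {
        "total": len(values),
        "missing": missing,
        "distinct": len(counts),
        "unmapped": unmapped,
    }
```

A report like this shows at a glance which values still need a mapping decision, for example a free-text "male" appearing in a column where only "M" and "F" have been mapped.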


What was the outcome?


Millahat continues: 


"The data mapped as part of the EHDEN project is now in a format that is more accessible for researchers, both in terms of a simplified structure, and in terms of standardisation of content where source values have been aligned to known vocabularies."


“Genomics England hosts a wealth of health data from various sources. However, this is siloed in source-specific structures that researchers need to know intimately in order to make the best use of the data for research. When this research spans various data sources, this dependency on knowing about the specific data sources quickly becomes a limiting factor for researchers, as they need to understand each data source’s particularities.

The act of standardising this information into a common data model, like OMOP, means that researchers can easily navigate the data to extract what is most useful to their research. Standardised data can also enable federation, meaning it can be securely accessed and more efficiently linked to other data for research. 

Furthermore, the data is now in an optimal format to be utilised through the Lifebit Platform, which can leverage the structural constraints of OMOP to allow users to do things such as create custom cohorts to which they can apply analyses like GWAS, greatly speeding up the research process.”
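
The sketch below gives a sense of why a common data model simplifies cohort building. It uses Python's built-in sqlite3 module and two heavily cut-down OMOP tables (person and condition_occurrence, with a subset of their standard columns). It does not represent the Lifebit Platform's cohort tooling, and the condition concept IDs 900001 and 900002 and the demo rows are invented for illustration.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with cut-down OMOP tables and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY,
            year_of_birth INTEGER,
            gender_concept_id INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            condition_concept_id INTEGER
        );
        INSERT INTO person VALUES (1, 1950, 8532), (2, 1990, 8507), (3, 1960, 8507);
        -- 900001 and 900002 are hypothetical concept IDs, not real OMOP concepts
        INSERT INTO condition_occurrence VALUES
            (10, 1, 900001), (11, 2, 900001), (12, 3, 900002), (13, 1, 900001);
        """
    )
    return conn


def build_cohort(conn, condition_concept_id, born_before):
    """Return person_ids with the given condition who were born before a given year."""
    sql = """
        SELECT DISTINCT p.person_id
        FROM person AS p
        JOIN condition_occurrence AS c ON c.person_id = p.person_id
        WHERE c.condition_concept_id = ? AND p.year_of_birth < ?
        ORDER BY p.person_id
    """
    return [row[0] for row in conn.execute(sql, (condition_concept_id, born_before))]
```

Because every source has been mapped to the same tables and concept IDs, the same query shape works regardless of which original dataset a record came from.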

Jonny Blanksby, Product Manager at Genomics England, said: 

“The mappings that are being shared today are a great step in the right direction for standardising the great breadth of data that Genomics England has in its Research Environment, but there is still some distance to go in the complete unification of data to a single data model.

The team has been working diligently to achieve the current milestone and we hope that the mappings and transformed datasets will be widely used and will encourage expansion of the datasets to include further programmes and improved iterations of the model in the future.”

The OMOP mappings are available on GitLab here.

About Lifebit

Lifebit provides health data solutions for clients, including Genomics England, Boehringer Ingelheim, Flatiron Health and more, to help researchers transform data into discoveries. 
