What is federated data analysis?
Hannah Gaimster, PhD
In research and healthcare, the size of the datasets needed to solve crucial problems continues to grow. Several developments contribute to these large datasets, including the digitisation of healthcare tools, the accumulation of electronic healthcare records and massively reduced costs for high-throughput technologies such as genome sequencing.
These vast datasets can help provide answers to important questions and ultimately change lives for the better. Recent landmark studies that have utilised the power of big data in health research include the 100,000 Genomes study on rare diseases and the work detailing the host factors underlying severe COVID-19 which was conducted on almost 60,000 individuals.
However, secure storage and analysis of these large, sensitive datasets is becoming significantly harder. There are three key reasons for this:
- Globally, there are increasing restrictions on data access to help keep sensitive information private, e.g. the General Data Protection Regulation (GDPR).
- These datasets are large and can be hard to manage, making it difficult for researchers to effectively identify the right data for their analyses.
- Datasets reside in disparate labs and clinics across the globe and are all too commonly siloed.
The World Economic Forum estimates that
Researchers and clinicians are missing out on the potential of these huge health datasets because the data is difficult to access and combine for analysis without risking security. Research progress and patient benefits are stalling due to inefficient models for secure health data access.
Data federation as a solution
This article explores how data federation is solving the problem of data access, without compromising data security.
In its simplest terms: Data federation is a software process that enables numerous databases to work together as one. Using this technology is highly relevant for accessing sensitive biomedical health data, as the data remains within appropriate jurisdictional boundaries, while metadata (information about the data) is centralised and searchable.
Data federation is an alternative to a model in which data is moved or duplicated and then centrally housed. When data is moved it becomes vulnerable to interception, and moving large datasets is often very costly for researchers. Instead, approved users access the data through linking technologies such as Application Programming Interfaces (APIs).
- Federated architectures of individual organisations may be connected together into a federated data platform, enabling data access for users across organisations.
- Federated data analysis takes access a step further and brings approved researchers' analysis and computation to where the data resides. This allows researchers to analyse data across multiple distinct organisations in a secure manner.
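To make the split between centralised metadata and locally held data more concrete, here is a minimal Python sketch of a searchable metadata catalogue. It is an illustration only: the class names (DatasetMetadata, MetadataCatalogue), the site names and the dataset identifiers are all hypothetical, and the sketch does not represent the API of any particular platform. The point it shows is that the central index holds descriptions of datasets, never the participant-level records themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatasetMetadata:
    """Descriptive information only; no participant-level records."""
    dataset_id: str      # hypothetical identifier
    site: str            # organisation hosting the data (hypothetical)
    jurisdiction: str    # where the data must remain
    data_type: str
    n_participants: int


class MetadataCatalogue:
    """Central, searchable index of metadata. The data stays at each site."""

    def __init__(self) -> None:
        self._entries: List[DatasetMetadata] = []

    def register(self, entry: DatasetMetadata) -> None:
        self._entries.append(entry)

    def search(self, data_type: str, min_participants: int = 0) -> List[DatasetMetadata]:
        # Only metadata is inspected here; no site is contacted for raw data.
        return [
            e for e in self._entries
            if e.data_type == data_type and e.n_participants >= min_participants
        ]


catalogue = MetadataCatalogue()
catalogue.register(DatasetMetadata("ds-001", "site_a", "UK", "whole_genome", 5000))
catalogue.register(DatasetMetadata("ds-002", "site_b", "DK", "whole_genome", 1200))
catalogue.register(DatasetMetadata("ds-003", "site_c", "CA", "clinical_records", 8000))

matches = catalogue.search("whole_genome", min_participants=2000)
for m in matches:
    print(m.dataset_id, m.site, m.jurisdiction)
```

A researcher would use a search like this to find out which organisations hold relevant data, and then request approved access through each organisation's own interface.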
The video below highlights Professor Serena Nik-Zainal, Professor of Genomic Medicine and Bioinformatics at The University of Cambridge, discussing why researchers need to securely access and analyse health data and how organisations are solving this problem using data federation.
How does federated data analysis work?
The video below demonstrates how federated data analysis functions. Historically, researchers have had to download data from its disparate sources and analyse it together in a centralised location (steps 1 and 2). Federated analysis (step 3) allows the distributed data from multiple sources to be analysed in parallel, saving the researcher time and money, while also keeping the data secure.
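The idea of sending the computation to the data can be sketched in a few lines of Python. In this simplified example, each site computes a small summary (a sum and a count) over its own records, and a coordinator combines those summaries into a pooled mean. The names SiteNode, LocalSummary and federated_mean are hypothetical, the values are made up, and the sites are called one after another rather than in parallel to keep the sketch short. A production system would also need authentication, approval workflows and disclosure controls, none of which are shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LocalSummary:
    """Aggregate returned by a site; contains no individual records."""
    total: float
    count: int


class SiteNode:
    """Stands in for one organisation holding its own data (hypothetical)."""

    def __init__(self, name: str, values: Iterable[float]) -> None:
        self.name = name
        self._values = list(values)  # never leaves this object

    def summarise(self) -> LocalSummary:
        # The computation runs where the data resides.
        return LocalSummary(total=sum(self._values), count=len(self._values))


def federated_mean(nodes: List[SiteNode]) -> float:
    """Combine per-site summaries into a mean over all sites."""
    summaries = [node.summarise() for node in nodes]
    total = sum(s.total for s in summaries)
    count = sum(s.count for s in summaries)
    if count == 0:
        raise ValueError("no records available across the sites")
    return total / count


nodes = [
    SiteNode("site_a", [5.0, 7.0, 9.0]),
    SiteNode("site_b", [4.0, 6.0]),
    SiteNode("site_c", [10.0]),
]
pooled_mean = federated_mean(nodes)
print(round(pooled_mean, 3))
```

The result is the same mean that would be obtained by pooling all six values in one place, but only three small summaries cross organisational boundaries.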
A federated approach to data analysis allows researchers and clinicians to combine global cohorts of data, to maximise new scientific discoveries that can be made when this data is securely combined.
Where is federated data analysis being used?
Whilst this groundbreaking technology is still relatively new, federated architectures and data federation are increasingly becoming a trusted way to ensure data security while enabling global collaboration. At Lifebit, we employ a federated architecture as part of the Lifebit Platform. The Lifebit Platform is being used by leading research organisations, precision medicine initiatives and government biobanks globally, including Genomics England, The Greek Newborn Genomic Screening project and the Danish National Genome Center.
Below are other key examples where federated data analysis is gaining traction across the public sector and industries worldwide:
The UK government has published a strategy paper, Genome UK, outlining its ambition to create a federated infrastructure for the administration of UK genomics data resources. According to the report, a federated approach to genomic data access will provide substantial benefits for patients and the National Health Service (NHS) and ensure that a patient's genomic information can guide their care throughout their lives.
STANDARDS IMPLEMENTING ORGANISATIONS:
Federated data analysis is supported by the Global Alliance for Genomics and Health (GA4GH), which was established to encourage the international secure exchange of genomic and health-related data.
According to GA4GH, federation provides organisations more authority over their sensitive data without restricting openness and collaborations, promoting flexibility and adaptability.
The Canadian Distributed Infrastructure for Genomics (CanDIG) uses federation to glean new information from both genomic and clinical datasets. Each province in Canada has its own legislation protecting the privacy of health data, so data generated in a province must abide by that province's governance standards. The CanDIG platform addresses this compliance issue with a fully distributed federated data analysis model, which allows federated querying and analysis while ensuring that local data governance laws are followed.
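A federated query that respects local governance can be sketched as follows. This is a hypothetical Python illustration, not CanDIG's actual API or policy: the class ProvincialNode, the function federated_count, the field name has_variant and the minimum cell size rule are all assumptions made for the example. Each node applies its own disclosure rule before answering, so the coordinator only ever sees what each jurisdiction permits.

```python
from typing import Dict, List, Optional


class ProvincialNode:
    """One site that enforces its own (hypothetical) disclosure rule."""

    def __init__(self, province: str, records: List[Dict], min_cell_size: int) -> None:
        self.province = province
        self._records = records            # stays inside the node
        self.min_cell_size = min_cell_size  # assumed local governance rule

    def count_where(self, field: str, value) -> Optional[int]:
        n = sum(1 for r in self._records if r.get(field) == value)
        # Suppress small counts that local policy does not allow to be released.
        return n if n >= self.min_cell_size else None


def federated_count(nodes: List[ProvincialNode], field: str, value) -> Dict:
    """Send the same query to every node and combine the released counts."""
    results = {node.province: node.count_where(field, value) for node in nodes}
    released = [n for n in results.values() if n is not None]
    suppressed = [p for p, n in results.items() if n is None]
    return {"total_released": sum(released), "suppressed": suppressed}


nodes = [
    ProvincialNode(
        "province_x",
        [{"has_variant": True}] * 4 + [{"has_variant": False}],
        min_cell_size=3,
    ),
    ProvincialNode(
        "province_y",
        [{"has_variant": True}] * 2 + [{"has_variant": False}] * 3,
        min_cell_size=3,
    ),
]

answer = federated_count(nodes, "has_variant", True)
print(answer)
```

The design choice worth noting is that the policy check lives inside each node rather than in the coordinator, so a change in one province's rules does not require changes anywhere else.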
To close the gap between its national health system and state-funded genetic services, Australian Genomics, the country's national genomics service, is also creating a federated library of genomic and phenotypic data.
By connecting the Trusted Research Environments (TREs) of the University of Cambridge and Genomics England, multi-party federation was successfully demonstrated for the first time between the TREs of a higher education institution and a national biobank in the UK. This method, which uses Lifebit’s Platform for data federation, has the potential to eliminate the logistical, financial and geographic obstacles that come with transferring exceptionally large genomic datasets. Conducting analyses across the integrated cohorts will also significantly reduce the time researchers currently need to spend.
Boehringer Ingelheim recently announced an approach to speed up research and development efforts by using data federation. The approach will offer strong analytical capabilities and worldwide secure biobank connectivity, creating a secure "dataland" for analytics and research with the ultimate aim of hastening the development of novel medications and enhancing patient outcomes.
In summary, federated data analysis is a crucial approach to secure data access that enables authorised research on combined data. It keeps data secure while ensuring that disparate datasets can be combined for analysis.
Look out for the next blog in our series where we will take a deep dive into some of the other key benefits of data federation.
At Lifebit, we develop secure federated data analysis solutions for clients including Genomics England, NIHR Cambridge Biomedical Research Centre, Danish National Genome Center and Boehringer Ingelheim to help researchers turn data into discoveries.
If you are interested in learning more about Lifebit’s federated data solution, please