Four key requirements for enabling federated data analysis
Hannah Gaimster, PhD
In research and healthcare, the size of datasets needed to solve crucial problems continues to increase. New technologies, including the digitisation of healthcare tools, the accumulation of electronic health records and massively reduced costs for high-throughput technologies like genome sequencing, all contribute to these large datasets.
However, secure storage and analysis of these large, sensitive datasets are becoming significantly harder. There are three key reasons for this:
- Globally, there are increasing restrictions on data access to help keep sensitive information private (e.g. the General Data Protection Regulation (GDPR))
- These datasets are large and can be hard to manage, making it difficult for researchers to identify the right data for their analyses.
- Datasets reside in disparate labs and clinics in locations across the globe, and as a consequence they are often effectively siloed
Data federation as a solution
Data federation solves the problem of data access without compromising data security. In its simplest terms, data federation is a software process that enables numerous databases to work together as one. This technology is highly relevant for accessing sensitive biomedical health data: the data remains within appropriate jurisdictional boundaries, while metadata is centralised and searchable, and researchers can be virtually linked to where the data resides for analysis.
This is an alternative to a model in which data is moved or duplicated and then centrally housed. When data is moved it becomes vulnerable to interception, and moving large datasets is often very costly for researchers.
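The core idea can be sketched in a few lines. In this minimal, hypothetical example, two "sites" each hold patient records in their own local database; each site answers the same query locally and only the aggregate count crosses the boundary, so row-level data never moves. The schema and diagnosis codes are illustrative, not from any real deployment.

```python
import sqlite3

# Hypothetical site databases: each holds its own patient records locally.
def make_site(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (id INTEGER, diagnosis TEXT)")
    conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    return conn

site_a = make_site([(1, "T2D"), (2, "T2D"), (3, "asthma")])
site_b = make_site([(1, "T2D"), (2, "asthma")])

# Each site runs the query against its own data; only the aggregate
# count leaves the site -- individual records are never transferred.
def local_count(conn, diagnosis):
    return conn.execute(
        "SELECT COUNT(*) FROM patients WHERE diagnosis = ?", (diagnosis,)
    ).fetchone()[0]

total = sum(local_count(s, "T2D") for s in (site_a, site_b))
print(total)  # 3
```

Real federated platforms add authentication, governance and richer query types on top, but the principle is the same: the computation travels to the data, and only results travel back.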
In the video below, Thorben Seeger, Lifebit’s Chief Business Development Officer, discusses how researchers are limited in their ability to access and analyse sensitive data, and how organisations are solving this problem using data federation.
Data federation is the future of secure genomic and health data access at a global level, as it brings multiple advantages over traditional methods of data access.
This article highlights the crucial requirements to enable data federation, which include:
- Appropriate computing infrastructure
- Authentication and analytics technology
- Standardised, interoperable data
- Best-in-class security measures
What is required for employing a federated data analysis approach?
There are four prerequisites for performing health data federation for research, whether as a researcher or an organisation:
1. Scalable infrastructure
Computational resources are an important consideration, as platforms must be able to process immense datasets. A robust database infrastructure is also required for efficient data processing and integrated data analysis. Handling data at this scale demands a highly scalable platform.
The scale of distributed multi-omics and clinical datasets available today has brought an increasing shift towards commercial cloud infrastructure.
Being cloud-based provides ultimate flexibility and the ‘elastic’ nature of cloud computing means researchers only pay for what they need.
2. Advanced APIs, authentication and analytics technology
Achieving a federated connection to where the data resides requires a platform that can communicate with distributed data sources and other platforms. Typically, this will require:
- A set of APIs that provides computational coordination and communication between the platforms taking part in the federation.
- The ability to integrate an authentication/authorisation system, so that only authorised users can access data across platforms.
- All the downstream tools necessary to perform analytics on federated data, enabling researchers to run their analyses, accelerate research and gain novel insights.
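The authentication/authorisation step above can be illustrated with a minimal sketch. Here a shared signing key, a per-dataset access table and the dataset names are all hypothetical; real platforms typically use standards such as OAuth 2.0 rather than raw HMAC tokens, but the gatekeeping logic is the same: verify who the user is, then check what they are allowed to query before dispatching anything to a data source.

```python
import hashlib
import hmac

SECRET = b"shared-signing-key"  # assumption: federation members share a signing key

def sign(user_id: str) -> str:
    """Issue a token binding the user's identity to the shared secret."""
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()

def verify(user_id: str, token: str) -> bool:
    """Constant-time check that the token matches the claimed identity."""
    return hmac.compare_digest(sign(user_id), token)

# Hypothetical per-dataset authorisation table.
ACCESS = {"cohort-uk": {"alice"}, "cohort-dk": {"alice", "bob"}}

def authorise(user_id: str, token: str, dataset: str) -> bool:
    """Authenticate the user, then check dataset-level permission."""
    return verify(user_id, token) and user_id in ACCESS.get(dataset, set())

token = sign("alice")
print(authorise("alice", token, "cohort-uk"))  # True
print(authorise("bob", token, "cohort-uk"))    # False: token invalid and user not listed
```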
3. Standardised data
Once the relevant infrastructure and data access requirements are in place, researchers will still be limited in the novel insights they can gain if the data cannot be effectively combined to enhance its statistical power. Common Data Models (CDMs) are crucial to ensuring data is interoperable. Several have grown in popularity in the health sciences sector recently, including the Observational Medical Outcomes Partnership (OMOP) CDM for clinical-genomic data.
Harmonising health data to OMOP provides structure according to common international standards which ensures it is fully interoperable with other clinical datasets from other labs or clinics. This fully enables the integration and analysis of datasets across distributed sources and platforms.
Additionally, extraction, transformation and loading (ETL) pipelines that automate the conversion of raw data into analysis-ready data further simplify this process for researchers.
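A single ETL step of this kind might look as follows. The target field names follow the OMOP CDM `person` table, and 8507/8532 are the standard OMOP gender concept IDs for male/female; the raw input schema, however, is an invented example of what one site's records might look like.

```python
# Standard OMOP gender concept IDs (8507 = MALE, 8532 = FEMALE).
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def to_omop_person(raw: dict) -> dict:
    """Map one raw site-specific record onto a minimal OMOP-style person row."""
    return {
        "person_id": raw["patient_ref"],
        "gender_concept_id": GENDER_CONCEPTS.get(raw["sex"], 0),  # 0 = no matching concept
        "year_of_birth": int(raw["dob"][:4]),
    }

# Hypothetical raw records as they might arrive from a clinic's system.
raw_rows = [
    {"patient_ref": 101, "sex": "F", "dob": "1984-03-02"},
    {"patient_ref": 102, "sex": "M", "dob": "1990-11-17"},
]
persons = [to_omop_person(r) for r in raw_rows]
print(persons[0])  # {'person_id': 101, 'gender_concept_id': 8532, 'year_of_birth': 1984}
```

Once every site has applied the same mapping, a federated query can treat all of them as one logical OMOP dataset.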
Combining these datasets securely via federation then allows researchers to increase the statistical power of their research. For example, one genome-wide association study revealed that increasing sample size 10-fold led to an approximately 100-fold increase in findings, enabling disease-causing genetic variants of interest to be more easily validated and studied. Secure, federated access to fully standardised and interoperable large datasets can therefore help accelerate research by providing greater power for clinical studies.
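One common way such gains are realised in practice is fixed-effect inverse-variance meta-analysis: each site shares only its effect estimate (beta) and standard error, and the pooled estimate is more precise than any single site's, without individual-level data ever being pooled. The numbers below are purely illustrative.

```python
# Each site contributes only summary statistics for a given variant:
# an effect estimate (beta) and its standard error (se). Values are invented.
site_results = [
    {"beta": 0.12, "se": 0.05},  # site 1
    {"beta": 0.08, "se": 0.04},  # site 2
    {"beta": 0.15, "se": 0.06},  # site 3
]

# Fixed-effect inverse-variance weighting: weight = 1 / se^2.
weights = [1 / r["se"] ** 2 for r in site_results]
pooled_beta = sum(w * r["beta"] for w, r in zip(weights, site_results)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

# The pooled standard error is smaller than any individual site's,
# which is exactly the power gain federation makes possible.
print(round(pooled_beta, 3), round(pooled_se, 3))
```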
4. Maximum security
Patient or volunteer data used for research may contain highly sensitive health information. To protect patient and public privacy and build trust in the use of health data, it is vital to follow strict data security measures, including:
- Strict data encryption standards. Data should be encrypted at all stages: at rest (when it resides in storage), in transit (when it moves between storage buckets and computing machines), and during analysis.
- Data pseudonymisation. Sometimes referred to as ‘de-identification’, this refers to the removal of any personal identifiers from a dataset, ensuring that a participant’s privacy is maintained.
- Role-based access control to data. This ensures that only authorised employees can decrypt it, and the security network imposes additional restrictions on which users can access, view or edit encrypted files.
- Careful consideration of how to safely export results, for example via an ‘airlock’ process as used by Genomics England.
- Here, data cannot be exported or downloaded out of the environment. Users can only export appropriate, aggregate-level data via the secure airlock process, which allows authorised personnel to approve and validate the purpose of any data download.
- The airlock policy can be fully enforced through workspaces, where no data may be extracted apart from previously authorised, whitelisted cases.
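Of the measures above, pseudonymisation is the easiest to show concretely. In this minimal sketch, a keyed hash replaces each direct identifier, so records remain linkable within the study (the same identifier always yields the same pseudonym) while the original identity cannot be recovered without the secret key. The key, field names and identifier are all hypothetical.

```python
import hashlib
import hmac

# Assumed study-specific secret; in practice this would be managed in a
# key store, never hard-coded.
PSEUDONYM_KEY = b"study-specific-secret"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed, irreversible pseudonym."""
    return hmac.new(PSEUDONYM_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"nhs_number": "943-476-5919", "diagnosis": "asthma"}

# The released record carries no direct identifier, only the pseudonym.
safe = {"pid": pseudonymise(record["nhs_number"]), "diagnosis": record["diagnosis"]}
```

Because the mapping is deterministic under one key, repeat visits by the same participant still link together, which is what distinguishes pseudonymisation from full anonymisation.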
Data federation can bring wide-ranging benefits to researchers. It can provide secure access to global cohorts of data to help power analyses and ultimately answer important research questions. To enable federated data analysis, researchers and organisations need standardised, interoperable data; appropriate infrastructure, including APIs, authentication and analytics technology; and robust security measures.
By enabling data federation, organisations can provide researchers with secure data access and analysis, ensuring they spend time and effort on what matters most: gaining new insights into health and disease.
Author: Hannah Gaimster, PhD
Contributors: Hadley E. Sheppard, PhD and Amanda White
At Lifebit, we develop secure federated data analysis solutions for clients including Genomics England, NIHR Cambridge Biomedical Research Centre, Danish National Genome Centre and Boehringer Ingelheim to help researchers turn data into discoveries.
Interested in learning more about Lifebit’s federated data solution?