Federated Architecture in Genomics with Dr. Pablo Prieto Barja
Divya Narasimhan, MSc
Genome initiatives spearheaded by burgeoning numbers of data custodians such as Genomics England (GEL), a Lifebit customer, leverage technological advances to amass large volumes of patient data. As a result, data custodians are veritable gold mines of genome data that could transform precision medicine and provide an equitable solution for diverse healthcare requirements. Pharmaceutical companies recognize the growing value of genome data and are partnering with biobanks to power massive sequencing projects, such as AstraZeneca’s 2 Million Genomes Project and Boehringer Ingelheim’s participation in a nationwide research collaboration in Finland to analyse 500K genomes.
There is, however, a significant barrier that deters scientists from tapping into these genomic initiatives: the sequenced data is fragmented and stored in silos, where the available data remains locked in safe environments that are accessible only to a few authorised personnel to mitigate security risks. Data consumers such as big pharma thus need to find innovative solutions that collate fragmented data successfully while keeping data privacy and security at their core.
Federation is a technology that offers a seamless solution for bridging the gap between holistic data accessibility and very large, unwieldy datasets, enabling data consumers to grapple with exabytes of data while leveraging it for deep clinical analysis. Talking us through what federation means, and what a federated analysis for genome data looks like, is Lifebit co-founder and CTO Dr Pablo Prieto Barja, who is a recognized industry leader in informatics programmes for whole genome sequencing, bioinformatics, medical informatics and high performance computing.
Missing ingredient in Data Structures
In the early days of his research, Dr Prieto Barja recalls the excitement surrounding the advent of next-generation sequencing (NGS) techniques, which heralded the transformation of precision medicine. Precision medicine is an evidence-based healthcare approach that focuses on an individual’s genome data to stratify them into treatment groups for drug discovery, thus holding the potential to tailor a patient’s treatment according to their genetic constitution.
Maximising the scope and scale of precision medicine requires multiple data samples to validate clinical insights. NGS exponentially increased dataset volumes, and launched multiple genome sequencing enterprises such as the Encyclopedia of DNA Elements Project (ENCODE), in which Dr Prieto Barja was a researcher.
However, as data volumes increased exponentially, scientists struggled to analyse them.
“ENCODE was a huge project…but there was so much data, [that] was really unstructured, and required a lot of processing and data analysis. And people didn’t know how to properly use it,” says Dr Prieto Barja.
Bringing order to the overwhelming data influx needed collaborations and cross-functional expertise from diverse fields such as bioinformatics and technical data engineering. As Dr Prieto Barja worked toward structuring genome data, he was amazed at the sheer potential and insights that could be gleaned from genomic data. However, individual organisations were building their own software to wrangle their data, which led to increasing fragmentation.
Finding himself in the unique position of understanding the potential of harnessing genome data to power healthcare, while combating the challenges of managing huge datasets, Dr Prieto Barja turned his focus to ways of supporting researchers’ efforts to organise and analyse genomic data without compromising its security.
“There was a huge reproducibility crisis being raised,” Dr Prieto Barja explains. “[Genome data] created a lot of confusion on how to use the right tools for analysis. We had to ask: what are the standards and best practices that can be used to standardise data and store it?”
Genome data needed to be benchmarked and secured so that a global standard could be maintained for data normalisation, formatting and storage. Also, distributed datasets are conventionally siloed to forestall security breaches.
Closing the gap between data custodians (providers) and data consumers (researchers) needed innovative platform solutions such as federated analysis, for data management, security, scaling and accessibility.
Federation is a disruptor in the Genomics Field
The Global Alliance for Genomics and Health (GA4GH) is a policy-framing body that sets standards and frameworks, and provides open source Application Programming Interfaces (APIs), to enable secure access to genomic data. As genome sequencing initiatives continue to multiply, the GA4GH supports the genome data ‘life-cycle’ from generation to analysis through approaches such as data federation, which diverse institutions can adopt so that data becomes more discoverable and researchers get better access to resources around the world.
While working closely with researchers in the genomics field, Dr Prieto Barja recognized the value of adopting standardised industry practices that would enable responsible data sharing and data normalisation.
“Combining technology with infrastructure, we thought of building [platforms] that conform to industry standards, allowing organisations around the world to use our solutions for solving real-world problems.”
Researchers in genomics need secure, accessible and collaborative platforms that can adapt to advancing technologies and successfully manage large volumes of data. Most importantly, a trusted research environment (TRE) that provides a virtual collaboration of fragmented datasets would enable data analysis without having to shift data around. A federated architecture accomplishes this, where data can stay in its location for analysis, thus maintaining all security and compliance requirements.
Federation technology has been deployed in diverse fields; for example, Google coined the term Federated Learning in 2016, for a machine learning exercise that leveraged data from multiple distributed datasets.
“Federation could be understood in terms of tech companies and our mobile phones. Mobiles generate tons of data from multiple applications, such as usage patterns, which remain stored on the device. These data can be compiled and tracked in a decentralised manner while conforming to app restrictions, thus providing data for analytics. For example, a GPS system helps us navigate routes based on traffic data generated by multiple users.”
“Similarly, multiple datasets can be compiled for federated analytics to help researchers validate the quality of a genome dataset, and leverage it for learning about a disease, without having to move sensitive information from its secure location.”
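The idea in the quote above can be sketched in a few lines of Python. This is a minimal illustration, not Lifebit's implementation: the `Site` class, the `local_allele_counts` method and the genotype values are all hypothetical. Each site computes a summary on its own data, and only those aggregate counts travel to the coordinator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Site:
    """A hypothetical data custodian whose genotypes never leave the site."""
    name: str
    # Number of alternate alleles (0, 1 or 2) carried by each participant.
    genotypes: List[int]

    def local_allele_counts(self) -> Tuple[int, int]:
        """Run the analysis locally and return only (alt_alleles, total_alleles)."""
        alt = sum(self.genotypes)
        total = 2 * len(self.genotypes)  # diploid: two alleles per participant
        return alt, total


def federated_allele_frequency(sites: List[Site]) -> float:
    """Combine per-site aggregates; individual-level records are never requested."""
    alt_total = 0
    allele_total = 0
    for site in sites:
        alt, total = site.local_allele_counts()
        alt_total += alt
        allele_total += total
    if allele_total == 0:
        raise ValueError("no genotypes available at any site")
    return alt_total / allele_total


# Made-up example data for two sites.
sites = [
    Site("site_a", [0, 1, 2, 0]),
    Site("site_b", [1, 1, 0, 0, 0, 2]),
]
frequency = federated_allele_frequency(sites)
print(f"Pooled alternate allele frequency: {frequency:.2f}")
```

In a real deployment the call to each site would go through an authenticated service governed by the custodian, and small counts might be suppressed before release; the sketch only shows the direction of data flow.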
Federated analysis is thus gaining popularity to power healthcare initiatives, such as the UK National Health Service (NHS) adopting federated learning to manage diverse clinical data, and the Canadian Distributed Infrastructure for Genomics (CanDIG) employing federation to draw insights from both genomic and clinical datasets.
The UK government, in their recent Genome UK policy paper, have also outlined that they plan to set up a federated infrastructure for management of UK genomics data resources. Federated analysis in genomics is advantageous in many aspects. Data custodians have full control over their data, and can follow their own custom guidelines to deploy infrastructures that conform to their governance models. It also promotes data traceability, allowing researchers to understand the scale and scope of genome data usability.
Future of federated analysis in Genomics
By 2025, more than 60 million patients are expected to have their genomes sequenced, a gold mine for big pharmaceutical companies. Federated analysis of complex data allows a seamless integration of distributed datasets, but its application is not confined to the genomics space.
“Federated analysis and federated learning, [they’re] going to be life changing, and a huge disruptor for healthcare,” says Dr Prieto Barja. “As a use case, federated machine learning in the NHS is using imaging data for diagnosis of eye diseases. The eye condition can be picked up in its early stages without ever having to go to the doctor.”
Currently, relevant clinical data is scattered throughout different health centres, hospitals, clinics and healthcare providers. Also, the data may not be standardised, thus leading to poor interfacing between different datasets.
Therefore, while federated analysis could redefine the future of healthcare and genomics, it hinges on data standardisation to reach its full potential. Lifebit and other platforms use key standards such as the common data model (CDM) of the Observational Medical Outcomes Partnership (OMOP), which captures data uniformly across different health institutions.
Other standards that are getting adopted more widely across the healthcare industry and which are used on Lifebit include the Fast Healthcare Interoperability Resources (FHIR) and Health Level 7 (HL7).
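To make the role of a common data model concrete, here is a small Python sketch of two sites mapping differently shaped records into one shared layout. The field names (`person_id`, `sex`, `year_of_birth`), the two source formats and the mapper functions are hypothetical simplifications for illustration; they are not the actual OMOP CDM tables or FHIR resources.

```python
from typing import Any, Dict

CommonRecord = Dict[str, Any]

# Lookup that folds site-specific spellings into one vocabulary.
_SEX_LABELS = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalise_sex(raw: str) -> str:
    """Map a free-text sex value onto the shared vocabulary."""
    return _SEX_LABELS.get(raw.strip().lower(), "unknown")


def from_site_a(record: Dict[str, Any]) -> CommonRecord:
    """Hypothetical site A stores {'id', 'sex', 'dob' as 'YYYY-MM-DD'}."""
    return {
        "person_id": f"A-{record['id']}",
        "sex": normalise_sex(record["sex"]),
        "year_of_birth": int(record["dob"][:4]),
    }


def from_site_b(record: Dict[str, Any]) -> CommonRecord:
    """Hypothetical site B stores {'patient', 'gender', 'birth_year'}."""
    return {
        "person_id": f"B-{record['patient']}",
        "sex": normalise_sex(record["gender"]),
        "year_of_birth": int(record["birth_year"]),
    }


record_a = from_site_a({"id": 17, "sex": "M", "dob": "1980-05-17"})
record_b = from_site_b({"patient": "x9", "gender": "female", "birth_year": 1975})
print(record_a)
print(record_b)
```

Once both sites expose the same fields and vocabulary, a single query or analysis script can run unchanged at each location, which is what makes federated analysis practical.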
Thus, federated analysis and federated learning could, in the future, allow researchers to apply their algorithms and analytics to distributed data, avoiding issues with compliance since no data needs to be moved.
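As a rough illustration of how an algorithm can be applied to distributed data, the sketch below trains a one-parameter model in the spirit of federated averaging. It is a toy under stated assumptions: the datasets, the function names `local_update` and `federated_round`, and the learning rate are invented for the example, and real systems add secure transport, privacy safeguards and far larger models. Each site takes a training step on its own data and returns only the updated weight and its sample count.

```python
from typing import List, Tuple

# Each site holds (x, y) pairs locally; here they follow y = 2 * x.
Dataset = List[Tuple[float, float]]


def local_update(weight: float, data: Dataset, learning_rate: float) -> Tuple[float, int]:
    """One gradient step on mean squared error for the model y = weight * x.

    Runs at the site; only the new weight and the sample count are returned.
    """
    gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    return weight - learning_rate * gradient, len(data)


def federated_round(weight: float, site_datasets: List[Dataset], learning_rate: float) -> float:
    """Average the sites' updated weights, weighted by how many samples each holds."""
    updates = [local_update(weight, data, learning_rate) for data in site_datasets]
    total_samples = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total_samples


# Made-up data for two sites.
site_datasets = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, site_datasets, learning_rate=0.05)

print(f"Learned weight: {weight:.4f}")
```

The coordinator never sees an individual (x, y) pair, only model parameters, which is the property the article attributes to federated learning.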
“Federation is thus the way forward, not just for genomics, but for any sort of clinical healthcare data where multiple data sources are going to be in the future,” concluded Dr Prieto Barja.
If you have any questions about federated analysis and Lifebit’s patented platform to deploy it, reach out to us here.
Dr Pablo Prieto Barja has over 15 years of experience in IT, including service management experience maintaining and managing bioinformatics platforms. He was instrumental in the development of novel and innovative methods, frameworks and best practices for big bioinformatics data analysis, including Nextflow and the assessment of reproducibility in HPC and its impact in large scale bioinformatics analysis.