Better together: the promise of health data linkage and its challenges





When different types of health data on the same person are linked, for example electronic primary care records with hospital mortality statistics, bigger, better, multi-faceted datasets can be created. These offer rich opportunities for life sciences researchers to draw new conclusions and improve health and care.

One of the greatest strengths of data linkage is that it can create ready-made big, complex datasets for a fraction of the cost of collecting primary data. Furthermore, it can provide a more thorough picture of a person’s health, enable better follow-up of clinical trials and gather real world evidence on whole populations to inform health policy.

To build a greater depth of understanding of an individual’s health, medical data can be linked with other data types, such as lifestyle, environmental or social data, to get a fuller picture of what factors influence health and disease.


Types of data (1)

But data linkage is not without its difficulties. To succeed, it must address several critical issues: securing patient trust and consent, governing the management and use of data well, and ensuring the data are of sufficient quality.

This article explores the opportunities and challenges that the linkage of health data opens up for research.


What are the benefits of data linkage?


In a nutshell, the power of data linkage is that it brings together disparate health data from different sources that relate to the same person, family, place or event. That might be connecting a person’s routine primary care electronic health records with their hospital blood test data. Or health data could be linked with education information, postcode demographics, insurance claims, receipt of benefits or the air quality in the local area – there’s a wide array of possibilities.

By doing this, a new, enhanced, large-scale data resource is created with a broad range of benefits, helping researchers to explore the pathways that lead to health, disease and disease outcomes.


Linking data increases researchers’ overall knowledge of a person’s health by uniting, fully exploiting and using research data that is available, which has often languished in different silos and systems.


Some of the exciting benefits of data linkage in health research include:


Creating ‘analysis-ready’ large-scale research data

Linking different datasets allows researchers to take advantage of existing data sources without having to spend time and money on collecting primary data from scratch. One study on the US Government’s National Center for Health Statistics data, for example, set out to discover whether social security disability insurance beneficiaries could access healthcare in the mandatory two-year waiting period for Medicare national health insurance. When data from a large household survey of a cross section of the US civilian population were linked with Medicare data, findings showed that a quarter of those on benefits had no health insurance and that this group reported more problems with access to care than insured people. This policy-relevant insight couldn’t have been gained by simply looking at the administrative data.

Also in the US, the STAnford medicine Research data Repository, or STARR, is a research ecosystem containing research-ready, linked data from different sources in a secure environment. This resource is designed on the principles of data commons and contains reusable data processing pipelines, cohort and analysis tools, training, and user support for researchers.

In another example of linked data research, this time from the UK, researchers were able to show that first-time pregnant women and their babies were at lower risk of death around the time of childbirth if the birth was induced at 40 weeks. The study linked NHS Hospital Episode Statistics delivery records with birth records using GP codes, baby birth weight, sex, and mother’s age. This gave the researchers a matched, but anonymous, sample of 77,327 women to analyse. The large sample size was important because such deaths are very rare and the study needed sufficient data to produce meaningful results.
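To make the idea of matching on a combination of shared fields more concrete, here is a minimal sketch of exact (deterministic) linkage on a composite key. The field names and records are invented for illustration and are not taken from the UK study, whose actual matching procedure is not described here.

```python
from collections import defaultdict

def composite_key(record, fields):
    """Build a tuple key from the chosen fields of one record."""
    return tuple(record[f] for f in fields)

def deterministic_link(left, right, fields):
    """Pair records from two datasets that agree exactly on every field.

    Keys that occur more than once in either dataset are skipped, because an
    exact-match rule cannot tell such records apart.
    """
    index = defaultdict(list)
    for rec in right:
        index[composite_key(rec, fields)].append(rec)
    left_counts = defaultdict(int)
    for rec in left:
        left_counts[composite_key(rec, fields)] += 1
    links = []
    for rec in left:
        key = composite_key(rec, fields)
        candidates = index.get(key, [])
        if len(candidates) == 1 and left_counts[key] == 1:
            links.append((rec["id"], candidates[0]["id"]))
    return links

# Hypothetical field names and values, for illustration only.
FIELDS = ["gp_code", "birth_weight_g", "sex", "mother_age"]
delivery_records = [
    {"id": "D1", "gp_code": "G01", "birth_weight_g": 3400, "sex": "F", "mother_age": 29},
    {"id": "D2", "gp_code": "G02", "birth_weight_g": 3100, "sex": "M", "mother_age": 34},
]
birth_records = [
    {"id": "B7", "gp_code": "G01", "birth_weight_g": 3400, "sex": "F", "mother_age": 29},
    {"id": "B8", "gp_code": "G02", "birth_weight_g": 3150, "sex": "M", "mother_age": 34},
]
links = deterministic_link(delivery_records, birth_records, FIELDS)
print(links)
```

In this toy data the second pair differs in recorded birth weight, so only the first pair is linked. That illustrates the main weakness of exact matching: a single recording error is enough to lose a true link.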

Improving trial follow-up

Linked data can be used to supplement follow-up of people enrolled in conventional cohort studies or trials, improving the original data resource in the process. One UK study, for example, found that babies given nutritionally enhanced milk formula went on to perform no better than their peers in their exams at age 16. The study used probabilistic linkage via a trusted third party. This third party linked data from seven trials of infant formula with maths exam results from the UK National Pupil Database using names, postcodes and dates of birth, without access to any of the data.
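Probabilistic linkage scores how likely two records are to belong to the same person, rather than demanding exact agreement. The sketch below uses Fellegi-Sunter style match weights, a common approach to this problem; it is not a description of the method used in the study above. The m- and u-probabilities and the two thresholds are assumed values chosen for illustration, and in practice they are estimated from the data.

```python
import math

# Assumed probabilities per field:
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
PARAMS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.001),
    "date_of_birth": (0.98, 0.003),
}

def match_weight(rec_a, rec_b, params=PARAMS):
    """Sum the log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Assumed thresholds: above upper is a match, below lower a non-match."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible"  # typically sent for clerical review

trial = {"surname": "SMITH", "postcode": "AB101XY", "date_of_birth": "2001-05-14"}
typo = {"surname": "SMYTH", "postcode": "AB101XY", "date_of_birth": "2001-05-14"}
other = {"surname": "JONES", "postcode": "ZZ99ZZ", "date_of_birth": "1999-01-01"}

print(classify(match_weight(trial, typo)))
print(classify(match_weight(trial, other)))
```

Unlike exact matching, a single misspelt surname does not prevent the link here, because agreement on postcode and date of birth still carries enough weight.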

Generating real world evidence

Traditional longitudinal studies, which measure and follow a cohort of people over long periods of time, are enormously valuable for research. That’s because they can answer research questions that need large sample sizes in order to consider a wide range of risk factors and include hard-to-reach segments of the population. However, they are also costly to set up, run and maintain. In the UK, population-level electronic patient cohorts made up entirely of administrative data have been successfully created through data linkage, making the whole process of tracking people over long periods of time more efficient. A landmark exemplar of this took place during the COVID-19 pandemic, when electronic health records from primary care, hospital episodes, the death registry, COVID-19 lab test results and community dispensing data were linked along with intensive care, cardiovascular and COVID-19 vaccination data. This created a dataset representing an impressive 54 million people, or 96% of the population of England. This was used to assess COVID-related cardiovascular events and plan the national pandemic response.

Tailor-made treatments

The 360-degree view of health provided by linked datasets can help to generate individual risk scores for diseases, and more accurately test emerging vaccines, diagnostics and treatments. This is the nascent, promising field of precision medicine. One organisation at the heart of precision medicine in the USA is the All of Us Research Program, which is gathering health data from one million US citizens to improve understanding of how biology, lifestyle and environment are linked and to accelerate health research. All of Us wants to close some of the gaps in health inequality by making its dataset more representative of all of the ethnicities and races within the American population and including communities that have been under-represented in biomedical research in the past, such as African Americans and Hispanic Americans.


Find out more in this interview between Thorban Seeger, Chief Business Development Officer at Lifebit and Chris Lunt, CTO of the All of Us program:


What can be discovered by linking health data with other data types?


Linking up different forms of health data is the tip of the iceberg when it comes to the rewards that can be reaped from integrating different data sets. Recent decades have seen an explosion in big data in just about every area of modern life, and exploiting these data sources can give researchers unprecedented insights into people’s health, and the causes, effects and outcomes of disease.  

Whether that’s climate data, social data on recipients of disability benefits, tax information or data on our daily movements taken from wearable devices, the research possibilities are almost endless. Here are four examples:


Housing benefit data and lead exposure: Research in the US on linked housing and health survey data has found that children from families in receipt of housing benefit actually had lower levels of lead in their blood than the rest of the population, as a result of federal policy change on lead-based paints.

This is thanks to the US National Center for Health Statistics (NCHS) data linkage program, through which housing assistance data from the US Department of Housing and Urban Development has been integrated with routine blood testing data from the NCHS’s national health survey in a cross-section of US civilians.


Assessing stroke risk from space: A NASA-led study linked satellite and surface level data on environmental variables like particulate matter and land surface temperature with public health data on geographic and racial differences in stroke. The aim was to see if environmental risk factors are related to cognitive decline and other health outcomes. While no link was found, the study shows the imaginative ways in which very different types of data can be combined to answer research questions.


Linking local data for national insights: Quality of data is often better at a local level than at a national level. To take full advantage of this, the Networked Data Lab Project in Aberdeen, Scotland, has linked datasets for prescriptions, outpatient care and hospital admissions as well as local council data.

Through use of the Grampian Data Safe Haven, a safe and secure environment (also known as a Trusted Research Environment), unconsented patient data can be processed. This project uses data from five sites across the UK to gain insight into public health problems, including the impact of shielding from COVID on vulnerable people and children’s mental health.


Climate change and heart health: The impact of environmental exposures, including climate factors, on heart, lung, blood and sleep conditions is to be explored through a new project currently in the pipeline from the US National Heart, Lung, and Blood Institute (NHLBI).

The research will link geospatial, temporal environment, and climate data with complex multi-modal health and population data available in the institute’s data repositories. The hope is that the results of this work can be applied to Social Determinants of Health research for assessing the effectiveness of interventions to minimise the impact of climate change on health.
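Linking health data with environmental data usually means joining on place and time rather than on a person identifier. As a rough sketch of that idea, the example below attaches a daily air quality reading to each health event for the same area and date. The area codes, readings and field names are all hypothetical, and this is not a description of how any of the projects above are implemented.

```python
from datetime import date

# Hypothetical daily PM2.5 readings keyed by (area code, date).
air_quality = {
    ("A1", date(2021, 6, 1)): 8.2,
    ("A1", date(2021, 6, 2)): 15.4,
    ("B2", date(2021, 6, 1)): 5.1,
}

# Hypothetical, already de-identified hospital admissions.
admissions = [
    {"patient": "p1", "area": "A1", "admitted": date(2021, 6, 2)},
    {"patient": "p2", "area": "B2", "admitted": date(2021, 6, 1)},
    {"patient": "p3", "area": "B2", "admitted": date(2021, 6, 3)},
]

def attach_exposure(events, readings):
    """Add the reading for the same area and day to each event.

    Events with no matching reading get None, so missing exposure data
    stays visible instead of being silently dropped.
    """
    linked = []
    for event in events:
        exposure = readings.get((event["area"], event["admitted"]))
        linked.append({**event, "pm25": exposure})
    return linked

linked = attach_exposure(admissions, air_quality)
for row in linked:
    print(row["patient"], row["pm25"])
```

Real studies typically need more than a same-day lookup, for example averaging exposure over a preceding time window, but the join on area and date is the core of the linkage.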


So what about the challenges of linking data?


At a technical level, one challenge is the availability of accurate identifiers that can be used to link the same person across multiple data sources, with errors or missing information often hampering efforts to find the correct link. Interpreting and quantifying errors in the linkage process is not easy.
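When a trusted reference set of true links is available for at least a sample of records, linkage error can be summarised by counting false matches and missed matches. The sketch below assumes such a gold standard exists, which is often not the case in practice; the record identifiers are invented.

```python
def linkage_quality(predicted_links, true_links):
    """Compare predicted record pairs against a gold-standard set of pairs."""
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    return {
        "false_matches": len(predicted - truth),    # linked, but wrongly
        "missed_matches": len(truth - predicted),   # should have been linked
        "precision": true_positives / len(predicted) if predicted else 0.0,
        "recall": true_positives / len(truth) if truth else 0.0,
    }

# Invented example pairs of (record in dataset A, record in dataset B).
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
gold = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]

quality = linkage_quality(predicted, gold)
print(quality)
```

False matches and missed matches bias results in different ways, so it is worth reporting both rather than a single accuracy figure.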


There is also the sheer size of linked datasets to contend with, particularly when it comes to genomics data, and the difficulty and risk of moving them for analysis. The linkage process can be riddled with inefficiencies. Furthermore, data quality and standards are key to success and it is critical that data governance is in place.


However, perhaps one of the biggest barriers to realising the full potential of data linkage is gaining and maintaining public trust and confidence in the process. Here we take a closer look at four of these challenges.


1. Patient safety, security and trust

The safety and security of patients’ sensitive information when it is linked and used for research is absolutely essential. Patients’ trust in the process is also important, and that can be hard to secure and maintain. 

For example, when it was proposed that health records in primary and secondary care should be routinely linked in England to support planning and research (in proposals running from 2012 to General Practice Data for Planning and Research in 2021), there was public concern about the lack of transparency surrounding how linked data were to be used, processes for opting out, and commercial interests.

To protect patient and public privacy and build trust in the use of health data, it is vital to follow strict data security measures and to communicate their use. These include strict data encryption standards at all stages; data de-identification, or the removal of any personal identifiers from a data set; role-based access control to ensure only authorised employees can decrypt the data; and careful consideration of how to safely export results, e.g. via an ‘airlock’ process as used by Genomics England.
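One common way to de-identify records while still allowing them to be linked is to replace the direct identifier with a keyed hash, so the same person always gets the same token but the token cannot be reversed without the secret key. The sketch below uses Python’s standard hmac module. The field names and the list of direct identifiers are assumptions made for illustration, and a real deployment would also need secure key management.

```python
import hashlib
import hmac

# Assumed set of direct identifiers to strip; a real list would be longer.
DIRECT_IDENTIFIERS = {"name", "nhs_number", "postcode"}

def pseudonymise(record, secret_key, id_field="nhs_number"):
    """Replace direct identifiers with a keyed-hash token.

    The same identifier and key always give the same token, so records can
    still be linked across datasets that were processed with the same key.
    """
    token = hmac.new(
        secret_key, record[id_field].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudo_id"] = token
    return cleaned

key = b"example-key-held-by-trusted-third-party"  # illustrative only
gp_record = {"name": "A Patient", "nhs_number": "0000000000",
             "postcode": "AB10 1XY", "diagnosis": "asthma"}
hospital_record = {"name": "A. Patient", "nhs_number": "0000000000",
                   "postcode": "AB10 1XY", "admissions": 2}

gp_safe = pseudonymise(gp_record, key)
hospital_safe = pseudonymise(hospital_record, key)
print(gp_safe["pseudo_id"] == hospital_safe["pseudo_id"])
```

Pseudonymised data can still be re-identifiable when combined with other information, which is why it is normally paired with the access controls and export checks described above.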

In the UK, the COVID-19 pandemic demonstrated that efficient and secure access to linked data can support agile and responsive research, a boon for patient trust in health data linkage. Successful initiatives to link primary and secondary care data for the UK, including OpenSAFELY and the British Heart Foundation's Data Science Centre CVD-COVID-UK consortium, are paving the way for public trust in future projects, as reported in the British Medical Journal.


2. The challenge of giant datasets


By 2025, more than 500 million human genomes will have been sequenced. That represents a staggering amount of data: more than Twitter and YouTube combined. These behemoths of genomic data, including linked genomics datasets, are becoming increasingly hard to store securely and analyse because of their sheer size, as well as new restrictions and regulations on data collection and storage such as the General Data Protection Regulation (GDPR) in Europe.

Therefore, one challenge facing researchers and data custodians is how to extract answers from massive data that is distributed, complex and inaccessible, particularly when linking different genomic data sets for research. Furthermore, there is the issue of how it can be analysed and managed in a secure way that protects people’s privacy.

Federated data analysis is one way of solving the problem of data access, without compromising data security. Lifebit’s federated analytics platform provides a solution here, in that the data stays exactly where it is, with the organisation which generated it, with the compute, analysis and tools coming to the data, not the other way around.
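The general principle of bringing the analysis to the data can be shown with a toy example: each organisation computes a summary locally, and only those summaries, never the individual-level records, are sent to a coordinator to be combined. This is a generic sketch of the idea with invented values, not a description of how Lifebit’s platform is implemented.

```python
def local_summary(values):
    """Runs inside each organisation; returns aggregates only."""
    return {"n": len(values), "total": sum(values)}

def combine(summaries):
    """Runs at the coordinator; sees summaries, never individual records."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n if n else None

# Invented measurements held separately by two organisations.
site_a = [5.1, 6.3, 5.8]
site_b = [6.0, 7.2]

federated_mean = combine([local_summary(site_a), local_summary(site_b)])
print(round(federated_mean, 2))
```

The combined mean is the same as if the data had been pooled in one place, but no row-level data left either site. Real federated analyses also have to guard against summaries that are based on very few people, since those can reveal individuals.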

Genomics England, the UK’s public sector genomics research organisation, currently hosts the data from over 135,000 NHS patients within a cloud-based Trusted Research Environment powered by AWS and Lifebit. Approved researchers, with separate processes for public and private sector applicants, can apply to access the clinical and genomic data from participants with cancer, rare diseases and COVID-19.

So far, over 200 publications and 560 collaborative research projects have been approved to use the data across a wide range of disease areas. With the recent implementation of Genomics England’s TRE, the collaborative potential for research using this data will continue to grow.


3. The importance of data quality


Successful health data research depends on the quality and utility of the data. Good quality data is data that is fit for purpose; good enough to support the outcomes for which it is being used. 

Quality can be measured using six dimensions: completeness, uniqueness, consistency, timeliness, validity and accuracy. Different data uses will need different combinations of these dimensions. Some key considerations with regard to data quality include:


  • Data cleaning and de-duplication: this is the process of detecting and correcting corrupt or inaccurate records, finding incomplete, inaccurate or irrelevant parts of the data and replacing, modifying or deleting the ‘dirty’ data. After cleaning, the dataset should be consistent with the dataset to which it will be linked.  De-duplication is a technique for eliminating duplicate copies of repeating data. Done well, this can save considerable amounts of storage capacity and costs and make the whole process more efficient.

  • Data de-identification: To ensure the privacy of the person to whom the data belongs, it must be de-identified, or pseudonymised, which means removing any personal identifiers from a dataset.

  • Data Quality Frameworks: A data quality framework – also called data quality lifecycle – is usually designed in a loop where data is consistently monitored to detect and resolve data quality issues, iteratively improving data quality. Such frameworks are important because they provide organisations, or whole countries, with a structured approach to understanding, documenting and improving the quality of their data. In a similar vein, in the UK, Health Data Research UK has developed the nation’s first Data Utility Framework to define, categorise and curate the nation’s vast quantity of health datasets to make them more available for health research.

  • Common Data Models: researchers will be limited in the new insights they can gain from data if it cannot be effectively linked and combined to enhance its statistical power. Common data models are therefore crucial to ensuring data is interoperable, with several growing in popularity in the health sciences sector recently.
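The cleaning and de-duplication steps above, along with two of the six quality dimensions (uniqueness and completeness), can be sketched in a few lines. The field names, records and normalisation rules are assumptions chosen for illustration; real cleaning rules depend on the datasets being linked.

```python
def clean(record):
    """Normalise formatting so equivalent values compare equal."""
    return {
        "surname": record["surname"].strip().upper(),
        "postcode": record["postcode"].replace(" ", "").upper(),
        "dob": record["dob"].strip() or None,  # empty string becomes missing
    }

def deduplicate(records):
    """Keep the first copy of each record, judged on all cleaned fields."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["surname"], rec["postcode"], rec["dob"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def completeness(records, field):
    """Share of records with a non-missing value for the field."""
    return sum(1 for r in records if r[field] is not None) / len(records)

# Invented raw records: the first two describe the same person.
raw = [
    {"surname": " Smith", "postcode": "ab10 1xy", "dob": "1980-02-01"},
    {"surname": "SMITH", "postcode": "AB101XY", "dob": "1980-02-01"},
    {"surname": "Jones", "postcode": "AB21 9zz", "dob": ""},
    {"surname": "Khan", "postcode": "ab15 4dt", "dob": "1975-07-30 "},
]

cleaned = [clean(r) for r in raw]
unique = deduplicate(cleaned)
report = {
    "uniqueness": len(unique) / len(cleaned),
    "dob_completeness": completeness(unique, "dob"),
}
print(report)
```

Note that the two Smith records only become recognisable as duplicates after cleaning, which is why cleaning usually comes before both de-duplication and linkage.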




One development towards interoperable data is the Observational Medical Outcomes Partnership (OMOP) common data model, a model widely used by leading biobanks, research institutions and pharmaceutical companies. Developed by the Observational Health Data Sciences and Informatics (OHDSI) community, it is based on the concept of harmonising disparate data sources by transforming them to a common format and using a standard set of vocabularies so they can be analysed using a library of standard analytic pipelines.
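As a small illustration of what transforming to a common format means, the sketch below maps patient rows from two differently structured sources into rows shaped like the OMOP person table. The source layouts and the mapping helper are hypothetical; the output field names follow the OMOP person table, and the gender concept IDs shown should be checked against the current OHDSI vocabularies before any real use.

```python
# Standard gender concepts as we understand the OMOP vocabulary;
# 0 is used when no matching concept is found.
GENDER_CONCEPTS = {"male": 8507, "female": 8532}
GENDER_ALIASES = {"m": "male", "f": "female"}

def to_omop_person(source_row, field_map):
    """Map one source row to an OMOP-style person row.

    field_map says which source column holds the id, gender and birth date,
    so each source keeps its own layout and only the mapping differs.
    """
    raw_gender = str(source_row[field_map["gender"]])
    gender = raw_gender.strip().lower()
    gender = GENDER_ALIASES.get(gender, gender)
    birth_date = str(source_row[field_map["birth_date"]])  # assumes YYYY-MM-DD
    return {
        "person_id": source_row[field_map["id"]],
        "gender_concept_id": GENDER_CONCEPTS.get(gender, 0),
        "year_of_birth": int(birth_date[:4]),
        "gender_source_value": raw_gender,  # original value kept for audit
    }

# Two hypothetical sources with different column names and codings.
site_a_row = {"pid": 1, "sex": "F", "dob": "1984-03-12"}
site_b_row = {"patient": 2, "gender": "male", "birth": "1990-11-02"}

person_a = to_omop_person(site_a_row, {"id": "pid", "gender": "sex", "birth_date": "dob"})
person_b = to_omop_person(site_b_row, {"id": "patient", "gender": "gender", "birth_date": "birth"})
print(person_a)
print(person_b)
```

Once both sources share the same structure and vocabulary, the same analysis code can run against either of them without modification.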

4. The importance of data governance


Data governance is essentially the policies and procedures controlling how data is managed, to ensure that data is secure, trustworthy, documented, managed and audited. In the context of health data research, data governance helps to ensure that people’s precious health data is usable and accessible to authorised users, and that patients’ privacy is protected. It plays a critical role in compliance with regulations on data storage, management and access, such as the General Data Protection Regulation (GDPR) in Europe. Good governance also enables data analytics to be carried out, and the use of data to be carefully monitored.
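Two governance ideas mentioned above, restricting access to authorised users and monitoring how data is used, can be illustrated with a simple role-based check that records every request in an audit log. The roles, permissions and dataset name are hypothetical; a production system would rely on an established identity and access management service instead of an in-memory table.

```python
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "researcher": {"read_deidentified"},
    "data_manager": {"read_deidentified", "read_identifiable", "export"},
}

audit_log = []

def request_access(user, role, action, dataset):
    """Decide whether the action is allowed and record the decision."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed

granted = request_access("alice", "researcher", "read_deidentified", "cohort_v1")
denied = request_access("alice", "researcher", "export", "cohort_v1")
print(granted, denied, len(audit_log))
```

Refused requests are logged as well as granted ones, so that attempted misuse is visible when the log is reviewed.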

On the global stage, a recent paper in the International Journal of Health Governance makes the case for the adoption of a new set of health data governance principles into global, regional and national policy and practice. These principles are clustered around the interconnected objectives of protecting people, promoting health value and prioritising equity. According to the authors, the principles “offer a strong vision for HDG that reaps the public good benefits of health data whilst safeguarding individual rights. They can be used by governments and other actors as a guide for the equitable collection and use of health data.”




Linking health data has already brought about some impressive breakthroughs, such as monitoring COVID in almost the entire population of England to plan the national pandemic response.

Linking large-scale datasets in genomics offers still greater potential for improving human health, through precision medicine approaches. The advent of Trusted Research Environments and federated data analytics could provide the long-needed solution to balancing patient privacy with smooth and efficient data access.

Ongoing challenges to data linkage such as patient trust, data quality and data governance are beginning to be addressed through developments such as the Observational Medical Outcomes Partnership and the UK’s first Data Utility Framework.




About Lifebit


At Lifebit, we develop secure federated data analysis solutions for clients including Genomics England, NIHR Cambridge Biomedical Research Centre, Danish National Genome Centre and Boehringer Ingelheim to help researchers turn data into discoveries.

