What Is The Best Imputation Pipeline?

Lifebit

Comparison of three different imputation pipelines: The good, the bad & the ugly

What is imputation?

Genome imputation lets researchers infer markers that were not directly genotyped and include them in Genome-Wide Association Studies (GWAS). This is useful for three main applications:

Finding new risk alleles prior to GWAS,
Increasing resolution to identify causal variants, and
Integrating multiple samples and platforms for meta-analysis.

What did we do?

Here, we compared three different imputation pipelines in terms of quality of the imputation (as measured by R²), runtime and cost. The pipelines we compared were Beagle5, The Michigan Imputation Server and Impute2.

Impute2 gave the best quality imputation, Beagle5 was the fastest pipeline and the Michigan Imputation Server was the cheapest. So which pipeline should you choose? We would argue Beagle5. Find out why below.

Which gave the best quality imputation?

As shown by the figure above, at very low R² values the Michigan Imputation Server and Impute2 have far more SNPs than Beagle5. At an R² value of 0.2, the three pipelines have approximately the same number of SNPs. At higher R² values (e.g. 0.9), Impute2 has the most SNPs, at over 20,000, followed by Beagle5 and then the Michigan Imputation Server, which both have around 15,000. This matters because these are the SNPs for which we have the highest confidence in the imputation.
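To make the comparison concrete, here is a minimal Python sketch of counting SNPs at or above a set of R² thresholds. It is not the script used for this post: the function name `count_snps_at_thresholds` and the toy R² values are hypothetical, and it assumes that the per-SNP R² values have already been extracted from each pipeline's output and that the figure counts SNPs at or above each R² value.

```python
from bisect import bisect_left

def count_snps_at_thresholds(r2_values, thresholds):
    """Return {threshold: number of SNPs with R² >= threshold}."""
    ordered = sorted(r2_values)
    total = len(ordered)
    # bisect_left gives the index of the first value >= threshold,
    # so everything from that index onwards passes the cut-off.
    return {t: total - bisect_left(ordered, t) for t in thresholds}

# Toy R² values for illustration only (not data from this comparison).
toy_r2 = [0.05, 0.15, 0.2, 0.45, 0.8, 0.9, 0.95, 0.99]
counts = count_snps_at_thresholds(toy_r2, [0.2, 0.9])
print(counts)
```

Running the same function over each pipeline's R² values would give one count per threshold per pipeline, which is the kind of comparison described above.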

Which was the fastest?

*Runtime and costs are shown for jobs run on an m4.2xlarge instance on CloudOS

The fastest pipeline was Beagle5, which took 15 min 6 s. The next fastest was the Michigan Imputation Server, which took 17 min. The slowest was the Impute2 pipeline, which took 33 min 4 s.

For the pipelines run on CloudOS, job sharing links can be found here for Beagle5 and Impute2.

Which was the cheapest?

The cheapest pipeline was the Michigan Imputation Server, which is free. However, unlike the other two pipelines, it is not run on CloudOS and it limits users to three submitted jobs at any one time. The next cheapest was the Beagle5 pipeline, which cost approximately $0.11 per 23andMe file. By far the most expensive was the Impute2 pipeline, which cost $1.06 per 23andMe file despite using spot instances.
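As a rough illustration of how these figures compare, the sketch below puts the runtimes and per-file costs quoted in this post side by side. The batch size `n_files` is a hypothetical value chosen only to show how per-file costs add up, and the linear scaling is an assumption: it ignores the Michigan Imputation Server's job submission limit and any change in spot pricing.

```python
# Figures quoted in this post: runtime in seconds, cost in USD per 23andMe file.
runtime_s = {
    "Beagle5": 15 * 60 + 6,
    "Michigan Imputation Server": 17 * 60,
    "Impute2": 33 * 60 + 4,
}
cost_usd = {
    "Beagle5": 0.11,
    "Michigan Imputation Server": 0.0,
    "Impute2": 1.06,
}

def ratio(values, a, b):
    """How many times larger the value for `a` is than the value for `b`."""
    return values[a] / values[b]

runtime_ratio = ratio(runtime_s, "Impute2", "Beagle5")
cost_ratio = ratio(cost_usd, "Impute2", "Beagle5")
print(f"Impute2 took {runtime_ratio:.1f}x as long as Beagle5")
print(f"Impute2 cost {cost_ratio:.1f}x as much as Beagle5")

# Hypothetical batch size, purely to show how per-file costs scale.
n_files = 1000
for name, cost in cost_usd.items():
    print(f"{name}: ${cost * n_files:,.2f} for {n_files} files")
```

On these numbers, the cost gap between Impute2 and Beagle5 is noticeably wider than the runtime gap.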

Which was the best?

Overall, Impute2 gave the best quality imputation, Beagle5 was the fastest pipeline and the Michigan Imputation Server was the cheapest. If you are unsure which imputation pipeline to use, we would recommend Beagle5, as it gave the best trade-off between these metrics. Beagle5 gave the second-best quality of imputation, was the fastest and was significantly cheaper than Impute2, all while being more flexible than the Michigan Imputation Server.
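One simple way to sanity-check a trade-off like this is a rank sum: rank the pipelines on each metric and add up the ranks. The sketch below is only an illustration and not how the recommendation above was reached. It assumes equal weighting of the three metrics, uses the quality ordering described in this post as a rank, and leaves out flexibility, which is harder to quantify.

```python
# Lower is better in every column: quality is given as a rank (1 = best),
# runtime in seconds and cost in USD per file, all as reported in this post.
metrics = {
    "Beagle5": {"quality_rank": 2, "runtime_s": 906, "cost_usd": 0.11},
    "Michigan Imputation Server": {"quality_rank": 3, "runtime_s": 1020, "cost_usd": 0.0},
    "Impute2": {"quality_rank": 1, "runtime_s": 1984, "cost_usd": 1.06},
}

def rank_sum(metrics):
    """Sum each pipeline's rank (1 = best) across all metrics."""
    totals = {name: 0 for name in metrics}
    metric_names = next(iter(metrics.values())).keys()
    for metric in metric_names:
        # Sort pipelines from best (lowest value) to worst on this metric.
        ordered = sorted(metrics, key=lambda name: metrics[name][metric])
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals

totals = rank_sum(metrics)
best = min(totals, key=totals.get)
print(totals)
print("Lowest rank sum:", best)
```

With equal weights, Beagle5 has the lowest rank sum. A reader who cares mostly about quality or mostly about cost could weight the metrics differently and reach a different answer.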

Disagree with us? Let us know on Twitter. You can read more about the methods below.

You can see the scripts used to calculate this on GitHub and view the raw data. Documentation for each of the tools can be found here for Michigan Imputation Server, Beagle5 and Impute2.

We would like to know what you think! Contact us at hello@lifebit.ai. We welcome your comments and suggestions!
