Published August 6, 2020 | Version v1
Dataset Open

On the cross-population generalizability of gene expression prediction models

Description

The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.
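The contrast between shared and non-identical eQTL architecture can be illustrated with a toy simulation. The sketch below is not the authors' simulation pipeline: it assumes independent SNPs (no linkage disequilibrium), a ridge-regression predictor, and hypothetical parameter values (`n`, `n_snps`, `n_eqtl`, `h2`) chosen only for illustration. It trains a model in one simulated population and scores it in a second population whose causal eQTLs are either identical or disjoint.

```python
import numpy as np


def simulate_genotypes(rng, n, freqs):
    # Diploid genotypes: 0, 1, or 2 copies of the alternate allele per SNP.
    return rng.binomial(2, freqs, size=(n, freqs.size)).astype(float)


def simulate_expression(rng, genotypes, effects, h2):
    # Scale the noise so the genetic component explains a fraction h2 of variance.
    genetic = genotypes @ effects
    noise_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
    return genetic + rng.normal(0.0, noise_sd, genotypes.shape[0])


def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression on centered data; returns weights and intercept.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta


def r2(y, y_hat):
    # Squared Pearson correlation between measured and predicted expression.
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


def cross_population_r2(shared, seed=0, n=1000, n_snps=50, n_eqtl=5, h2=0.5):
    rng = np.random.default_rng(seed)
    # Each population gets its own allele frequencies (assumed, illustrative).
    freqs_a = rng.uniform(0.1, 0.9, n_snps)
    freqs_b = rng.uniform(0.1, 0.9, n_snps)

    # Population A: the first n_eqtl SNPs are causal.
    effects_a = np.zeros(n_snps)
    effects_a[:n_eqtl] = rng.normal(0.0, 1.0, n_eqtl)

    # Population B: identical eQTLs, or a disjoint set of causal SNPs.
    if shared:
        effects_b = effects_a.copy()
    else:
        effects_b = np.zeros(n_snps)
        effects_b[n_eqtl:2 * n_eqtl] = rng.normal(0.0, 1.0, n_eqtl)

    train_a = simulate_genotypes(rng, n, freqs_a)
    test_a = simulate_genotypes(rng, n, freqs_a)
    test_b = simulate_genotypes(rng, n, freqs_b)

    beta, intercept = fit_ridge(train_a, simulate_expression(rng, train_a, effects_a, h2))

    r2_same = r2(simulate_expression(rng, test_a, effects_a, h2), test_a @ beta + intercept)
    r2_cross = r2(simulate_expression(rng, test_b, effects_b, h2), test_b @ beta + intercept)
    return r2_same, r2_cross


if __name__ == "__main__":
    for shared in (True, False):
        same, cross = cross_population_r2(shared)
        print(f"shared eQTLs={shared}: within-population R2={same:.2f}, cross-population R2={cross:.2f}")
```

Under these assumptions, cross-population accuracy stays close to within-population accuracy only when the causal eQTLs are shared; real data additionally involve linkage disequilibrium and partially overlapping architectures, which this sketch does not model.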

Notes

All source code for this project is available on GitHub.

Genotype data are stored on dbGaP under accession number phs000921.v4.p1.

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL117004

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL128439

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL135156

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: X01HL134589

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL141992

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL104608

Funding provided by: National Human Genome Research Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000051
Award Number: U01HG007419

Funding provided by: National Institute of Environmental Health Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000066
Award Number: R01ES015794

Funding provided by: National Institute on Minority Health and Health Disparities
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100006545
Award Number: P60MD006902

Funding provided by: National Institute of General Medical Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000057
Award Number: RL5GM118984

Funding provided by: Tobacco-Related Disease Research Program
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100005188
Award Number: 24RT-0025

Funding provided by: Tobacco-Related Disease Research Program
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100005188
Award Number: 27IR-0030

Funding provided by: National Institute of General Medical Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000057
Award Number: TL4GM118986

Funding provided by: National Institute of General Medical Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000057
Award Number: UL1GM118985

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL135156-S1

Funding provided by: Gordon and Betty Moore Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000936
Award Number: GBMF3834

Funding provided by: Alfred P. Sloan Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000879
Award Number: 2013-10-27

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: R01HL117004-S1

Funding provided by: National Institute of General Medical Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000057
Award Number: K12GM081266

Funding provided by: National Heart, Lung, and Blood Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000050
Award Number: K01HL140218

Funding provided by: National Institute of General Medical Sciences
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000057
Award Number: T34GM008574

Funding provided by: National Human Genome Research Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000051
Award Number: R56HG010297

Funding provided by: National Human Genome Research Institute
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000051
Award Number: T32HG00044

Files

chr22.genelist.txt

Files (2.7 GB)

MD5 checksum    Size
md5:7c244d8c57270e13d757cdb967f8b256    95.0 MB
md5:e7b9851f3a5ff72dd601857f3d01c2bd    9.8 MB
md5:1067b0a5516b1e870404a8849c76b1de    16.0 MB
md5:147e86a1edb201803523f6685588ae83    55.8 kB
md5:8e3eb072f08533f3d7f27b8fad809adf    447.2 kB
md5:b512b72d6581fcfb5f7c98014508be8a    2.1 GB
md5:bced800e30fef6741434aeb6c06f026a    16.5 MB
md5:eb673fc12e57b1b26733cb8759121bb0    419.7 MB
md5:397aee37afdd74eeb2e00069779d4445    12.3 MB
md5:5c493c00593379849ba9233337dfc25c    1.5 MB
md5:80e86f243a2685f11cce1ef00e535b3c    21.2 MB
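The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_checksum` and the example path are hypothetical and not part of this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    # Stream the file in chunks so multi-gigabyte files do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    # Accept either a bare hex digest or the "md5:<hex>" form used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("downloaded_file", "md5:147e86a1edb201803523f6685588ae83")` would return `True` only if the local file (here a placeholder path) has that exact digest.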

Additional details