GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank (5-fold cross-validation)
Description
This dataset contains GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank.
The dataset is designed to enable systematic polygenic risk score (PRS) analyses with 5-fold cross-validation. For each phenotype and fold, we provide GWAS summary statistics for the training, validation, and test sets. The validation summary statistics can be used for model selection and tuning, and the test summary statistics can be used to evaluate PRS models via pseudo-validation metrics. Association testing for all phenotypes and sample sets was performed with plink2.
The phenotypes included in this dataset are:
- HEIGHT: Standing height (Data-Field: 50)
- BMI: Body mass index (Data-Field: 21001)
- WC: Waist circumference (Data-Field: 48)
- HC: Hip circumference (Data-Field: 49)
- BW: Birth weight (Data-Field: 20022)
- FVC: Forced vital capacity (Data-Field: 3062)
- FEV1: Forced expiratory volume in 1-second (Data-Field: 3063)
- HDL: HDL cholesterol (Data-Field: 30760)
- LDL: LDL cholesterol (Data-Field: 30780)
To allow users to assess PRS performance as a function of sample size, we also provide subsampled training GWAS summary statistics. Each subsampled set is obtained by randomly selecting (without replacement) a subset of the training samples and running association testing on that subset only. The training sample sizes are:
- N = 5000
- N = 10000
- N = 20000
- N = 40000
- N = 80000
- N = 160000
- Full training set (sample size varies by phenotype).
NOTE: Due to the smaller overall sample size for the Birth weight phenotype, we do not include training data for the `N=160000` setting.
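As a rough illustration of the sample-size settings listed above, the following sketch enumerates the training directories one would expect for a given phenotype. The directory names (`N_5000`, ..., `full`) and phenotype codes are taken from this description; the helper name `training_settings` is hypothetical and not part of the dataset.

```python
# Subsampled training directories, named as in the folder structure
# described for this dataset.
SUBSAMPLE_DIRS = ["N_5000", "N_10000", "N_20000", "N_40000", "N_80000", "N_160000"]
FULL_DIR = "full"


def training_settings(phenotype):
    """Return the training directory names expected for a phenotype code.

    Hypothetical helper: it only encodes the note that Birth weight (BW)
    has no N=160000 setting.
    """
    dirs = list(SUBSAMPLE_DIRS)
    if phenotype == "BW":
        dirs.remove("N_160000")
    return dirs + [FULL_DIR]


if __name__ == "__main__":
    for pheno in ("HEIGHT", "BW"):
        print(pheno, training_settings(pheno))
```

Looping over these names makes it straightforward to compare PRS accuracy across training sample sizes within the same fold.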
The folder structure of the GWAS data for each phenotype is as follows:
    train
        N_5000
            fold_1
                chr_1.PHENO1.glm.linear
                chr_2.PHENO1.glm.linear
                ...
            fold_2
            fold_3
            ...
        N_10000
        N_20000
        N_40000
        N_80000
        N_160000
        full
    validation
        fold_1
            chr_1.PHENO1.glm.linear
            chr_2.PHENO1.glm.linear
            ...
        fold_2
        fold_3
        ...
    test
        fold_1
        fold_2
        fold_3
        ...
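The layout above can be navigated programmatically. The sketch below builds the path to one per-chromosome file and reads it with the Python standard library. It assumes that the files are tab-delimited with a single header row (the usual plink2 `--glm` output layout), that `test` folds contain the same per-chromosome files as `validation`, and that `root` points at an extracted phenotype directory. The function names and the `HEIGHT` root are hypothetical.

```python
import csv
from pathlib import Path


def sumstats_path(root, split, fold, chrom, train_size=None):
    """Build the path to one per-chromosome summary statistics file.

    split is "train", "validation", or "test"; train_size (e.g. "N_5000"
    or "full") is required only for the training split.
    """
    root = Path(root)
    if split == "train":
        if train_size is None:
            raise ValueError("train_size is required for the training split")
        base = root / "train" / train_size
    elif split in ("validation", "test"):
        base = root / split
    else:
        raise ValueError(f"unknown split: {split!r}")
    return base / f"fold_{fold}" / f"chr_{chrom}.PHENO1.glm.linear"


def read_sumstats(path):
    """Read a tab-delimited file into a list of dicts keyed by column name.

    Assumption: the first line is a header row; values are kept as strings.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # "HEIGHT" is a placeholder for wherever the archive was extracted.
    print(sumstats_path("HEIGHT", "train", fold=1, chrom=1, train_size="N_5000"))
```

Column names are deliberately not hard-coded here, since the exact set of columns depends on the plink2 options used; inspect the header of one file before relying on specific fields.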
For more details about the GWAS, the quality control (QC) criteria, or other information, please consult our publication:
Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009
If you use this data in your work, please cite the publication above.
Files
Total size: 13.7 GB
MD5 checksum | Size |
---|---|
md5:36dd7cd6a79a060c52d2b68447416d95 | 1.5 GB |
md5:629f7c40ed8fbafb28c784d228410366 | 1.4 GB |
md5:ec26a062a958ccf821db826a3fe3b3b4 | 1.5 GB |
md5:2ec04cdf06b9dfcdb9b3c923a340bd87 | 1.5 GB |
md5:87c82dd948fb999190be91ef12c6ef88 | 1.5 GB |
md5:2f80e3c76fd22745fa485b05407a4e5f | 1.5 GB |
md5:3c568e74ee8351eb0c82aa779f77b573 | 1.5 GB |
md5:502b6766acc86e680bb96b815804dfe6 | 1.5 GB |
md5:3064bba7e64e2ee51a9f2852e341bf73 | 1.5 GB |
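The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. A minimal sketch using Python's `hashlib` follows; the function names and the archive file name in the usage example are hypothetical, since the file names are not shown here.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 digest against a listed checksum.

    Accepts the checksum with or without the "md5:" prefix.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5sum(path) == expected_hex


# Usage (hypothetical file name; substitute the archive you downloaded):
# verify_md5("HEIGHT.tar.gz", "md5:36dd7cd6a79a060c52d2b68447416d95")
```

Reading in chunks keeps memory use small, which matters for archives of roughly 1.5 GB each.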
Additional details
Related works
- Is described by
- Publication: 10.1016/j.ajhg.2023.03.009 (DOI)