Published February 5, 2025 | Version v3
Dataset Open

GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank (5-fold cross-validation)

Description

This dataset contains genome-wide association study (GWAS) summary statistics for 9 quantitative phenotypes from the UK Biobank.

The dataset is designed to enable systematic polygenic risk score (PRS) analyses with 5-fold cross-validation. For each phenotype and fold, we provide GWAS summary statistics for the training, validation, and test sets. The validation summary statistics can be used for model selection and tuning, and the test summary statistics can be used to evaluate PRS models via pseudo-validation metrics. Association testing for all phenotypes and sample sets was performed with plink2.

The phenotypes included in this dataset are:

  • HEIGHT: Standing height (Data-Field: 50)
  • BMI: Body mass index (Data-Field: 21001)
  • WC: Waist circumference (Data-Field: 48)
  • HC: Hip circumference (Data-Field: 49)
  • BW: Birth weight (Data-Field: 20022)
  • FVC: Forced vital capacity (Data-Field: 3062)
  • FEV1: Forced expiratory volume in 1-second (Data-Field: 3063)
  • HDL: HDL cholesterol (Data-Field: 30760)
  • LDL: LDL cholesterol (Data-Field: 30780)

To allow users to assess PRS performance as a function of sample size, we also provide subsampled training GWAS summary statistics. For each setting, we randomly select a subset of the training samples (without replacement) and run association testing on that subset only. The training sample sizes are:

  • N = 5000
  • N = 10000
  • N = 20000
  • N = 40000
  • N = 80000
  • N = 160000
  • Full training set (sample size varies by phenotype).

NOTE: Due to the smaller overall sample size for the Birth weight phenotype, we do not include training data for the `N=160000` setting.
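The subsampling step described above can be sketched as follows. This is a minimal illustration and not the pipeline used to build the dataset: the function name `subsample_training_ids`, the `seed` parameter, and the placeholder sample IDs are all assumptions made for the example.

```python
import random

# Subsample sizes listed above; the full training set is handled separately.
SUBSAMPLE_SIZES = [5000, 10000, 20000, 40000, 80000, 160000]


def subsample_training_ids(train_ids, n, seed=0):
    """Draw n training sample IDs without replacement.

    Returns None when the training set has fewer than n samples, which
    mirrors a setting being skipped for a phenotype with a smaller
    overall sample size.
    """
    if n > len(train_ids):
        return None
    # random.Random(seed).sample draws without replacement.
    return random.Random(seed).sample(train_ids, n)


# Hypothetical placeholder IDs standing in for one fold's training samples.
train_ids = list(range(100_000))
subsets = {n: subsample_training_ids(train_ids, n) for n in SUBSAMPLE_SIZES}
```

In this toy setup the training set has only 100,000 placeholder samples, so the `N = 160000` entry is `None` while every smaller setting yields a list of distinct IDs.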

The folder structure of the GWAS data for each phenotype is as follows:

  • train
    • N_5000
      • fold_1
        • chr_1.PHENO1.glm.linear
        • chr_2.PHENO1.glm.linear
        • ...
      • fold_2
      • fold_3
      • ...
    • N_10000
    • N_20000
    • N_40000
    • N_80000
    • N_160000
    • full
  • validation
    • fold_1
      • chr_1.PHENO1.glm.linear
      • chr_2.PHENO1.glm.linear
      • ...
    • fold_2
    • fold_3
    • ...
  • test
    • fold_1
    • fold_2
    • fold_3
    • ...
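As a convenience, the layout above can be turned into a small path helper. The sketch below is hypothetical: `sumstats_path` is not part of any released tool, the root directory name is a placeholder for wherever a phenotype archive was extracted, and it assumes that the `full` and `test` folders (whose contents are elided in the listing) follow the same `fold_k/chr_j.PHENO1.glm.linear` pattern shown for `N_5000` and `validation`.

```python
from pathlib import Path


def sumstats_path(root, split, fold, chrom, n=None):
    """Build the path to one per-chromosome summary statistics file.

    root  : directory holding one phenotype's extracted data (placeholder)
    split : "train", "validation", or "test"
    fold  : cross-validation fold number
    chrom : chromosome number
    n     : training subsample size; None selects the "full" folder
            (only used when split == "train")
    """
    root = Path(root)
    if split == "train":
        size_dir = "full" if n is None else f"N_{n}"
        base = root / "train" / size_dir
    elif split in ("validation", "test"):
        base = root / split
    else:
        raise ValueError(f"unknown split: {split!r}")
    return base / f"fold_{fold}" / f"chr_{chrom}.PHENO1.glm.linear"


# Example: chromosome 2 of fold 1 for the N = 5000 training subsample.
example = sumstats_path("HEIGHT", "train", fold=1, chrom=2, n=5000)
```

Keeping the subsample size as an optional argument lets the same call cover both the subsampled folders and the `full` training set.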

For more details about the GWAS analyses, quality control (QC) criteria, or other information, please consult our publication:

Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009

If you use this data in your work, please cite the publication above.

Files (13.7 GB)

  • md5:36dd7cd6a79a060c52d2b68447416d95 (1.5 GB)
  • md5:629f7c40ed8fbafb28c784d228410366 (1.4 GB)
  • md5:ec26a062a958ccf821db826a3fe3b3b4 (1.5 GB)
  • md5:2ec04cdf06b9dfcdb9b3c923a340bd87 (1.5 GB)
  • md5:87c82dd948fb999190be91ef12c6ef88 (1.5 GB)
  • md5:2f80e3c76fd22745fa485b05407a4e5f (1.5 GB)
  • md5:3c568e74ee8351eb0c82aa779f77b573 (1.5 GB)
  • md5:502b6766acc86e680bb96b815804dfe6 (1.5 GB)
  • md5:3064bba7e64e2ee51a9f2852e341bf73 (1.5 GB)
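After downloading, a file can be checked against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the file name in the commented usage are hypothetical, since the archive names are not given here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (hypothetical file name):
# ok = matches_listed_checksum("downloaded_archive", "md5:36dd7cd6a79a060c52d2b68447416d95")
```

Reading in chunks keeps memory use flat for archives of this size (about 1.5 GB each).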

Additional details

Related works

Is described by
Publication: 10.1016/j.ajhg.2023.03.009 (DOI)