GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank (5-fold cross-validation)

Zabad, Shadi

doi:10.5281/zenodo.14823164

Published February 5, 2025 | Version v3

Dataset Open

This dataset contains GWAS summary statistics for 9 quantitative phenotypes from the UK Biobank.

The dataset is designed to enable systematic PRS analyses with 5-fold cross-validation. For each phenotype and fold, we provide GWAS summary statistics for the training, validation, and test sets. The validation summary statistics can be used for model selection and tuning. The test summary statistics can be used to evaluate PRS models via pseudo-validation metrics. Association testing for all phenotypes and sample sets was performed with plink2.
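As an illustration of how one of these per-chromosome files might be read, here is a minimal sketch using only the Python standard library. It assumes the files follow plink2's usual tab-delimited `.glm.linear` layout, with a header line whose first column is `#CHROM` and columns such as `ID`, `A1`, `OBS_CT`, `BETA`, `SE`, and `P`. The exact column set is an assumption and may differ in these files, so check the header of a real file first. The function name `parse_glm_linear` is hypothetical.

```python
import csv

# Columns converted to float when present (assumed plink2 column names).
NUMERIC_COLUMNS = ("BETA", "SE", "P")


def parse_glm_linear(lines):
    """Parse tab-delimited plink2 linear-regression output into a list of dicts.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    rows = []
    for raw in csv.DictReader(lines, delimiter="\t"):
        # Drop the leading '#' that plink2 puts on the first header field.
        row = {key.lstrip("#"): value
               for key, value in raw.items() if key is not None}
        for column in NUMERIC_COLUMNS:
            if column in row:
                value = row[column]
                # plink2 writes 'NA' for undefined statistics.
                row[column] = None if value in ("NA", "") else float(value)
        rows.append(row)
    return rows
```

A file would then be read with something like `with open(path, newline="") as handle: rows = parse_glm_linear(handle)`, where `path` points to one of the `chr_*.PHENO1.glm.linear` files.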

The phenotypes included in this dataset are:

HEIGHT: Standing height (Data-Field: 50)
BMI: Body mass index (Data-Field: 21001)
WC: Waist circumference (Data-Field: 48)
HC: Hip circumference (Data-Field: 49)
BW: Birth weight (Data-Field: 20022)
FVC: Forced vital capacity (Data-Field: 3062)
FEV1: Forced expiratory volume in 1 second (Data-Field: 3063)
HDL: HDL cholesterol (Data-Field: 30760)
LDL: LDL cholesterol (Data-Field: 30780)

To allow users to assess PRS performance as a function of sample size, we also provide subsampled training GWAS summary statistics. Each subsample is drawn by randomly selecting, without replacement, a subset of the training samples and running association testing on that subset. The available training sample sizes are:

N = 5000
N = 10000
N = 20000
N = 40000
N = 80000
N = 160000
Full training set (sample size varies by phenotype).

NOTE: Due to the smaller overall sample size for the Birth weight phenotype, we do not include training data for the `N=160000` setting.
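The available training settings, including the Birth weight exception noted above, can be sketched as a small helper. The setting names (`N_5000`, ..., `full`) mirror the folder names used in the dataset; the function name `training_settings` is hypothetical.

```python
PHENOTYPES = ("HEIGHT", "BMI", "WC", "HC", "BW", "FVC", "FEV1", "HDL", "LDL")
SUBSAMPLE_SIZES = (5000, 10000, 20000, 40000, 80000, 160000)


def training_settings(phenotype):
    """Return the training settings expected for a phenotype, smallest first."""
    if phenotype not in PHENOTYPES:
        raise ValueError(f"Unknown phenotype: {phenotype!r}")
    settings = [
        f"N_{n}"
        for n in SUBSAMPLE_SIZES
        # Birth weight (BW) has no N=160000 subsample.
        if not (phenotype == "BW" and n == 160000)
    ]
    # The full training set is available for every phenotype.
    return settings + ["full"]
```

For example, `training_settings("BW")` omits `N_160000`, while every other phenotype lists all six subsample sizes followed by `full`.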

The folder structure of the GWAS data for each phenotype is as follows:

train
- N_5000
  - fold_1
    - chr_1.PHENO1.glm.linear
    - chr_2.PHENO1.glm.linear
    - ...
  - fold_2
  - fold_3
  - ...
- N_10000
- N_20000
- N_40000
- N_80000
- N_160000
- full
validation
- fold_1
  - chr_1.PHENO1.glm.linear
  - chr_2.PHENO1.glm.linear
  - ...
- fold_2
- fold_3
- ...
test
- fold_1
- fold_2
- fold_3
- ...
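The layout above can be turned into a small path helper. This is a sketch under stated assumptions: `pheno_dir` is wherever the phenotype's archive was extracted (the name of the top-level directory inside each archive is not stated here), the `test` and `full` folders are assumed to contain the same per-fold, per-chromosome files as the folders shown in detail, and the function name `sumstats_path` is hypothetical.

```python
from pathlib import Path

TRAIN_SETTINGS = ("N_5000", "N_10000", "N_20000", "N_40000",
                  "N_80000", "N_160000", "full")


def sumstats_path(pheno_dir, split, fold, chrom, train_setting=None):
    """Build the path to one per-chromosome summary statistics file."""
    filename = f"chr_{chrom}.PHENO1.glm.linear"
    fold_dir = f"fold_{fold}"
    if split == "train":
        # Training files sit one level deeper, under the sample-size folder.
        if train_setting not in TRAIN_SETTINGS:
            raise ValueError(f"train_setting must be one of {TRAIN_SETTINGS}")
        return Path(pheno_dir) / "train" / train_setting / fold_dir / filename
    if split in ("validation", "test"):
        return Path(pheno_dir) / split / fold_dir / filename
    raise ValueError("split must be 'train', 'validation', or 'test'")
```

For instance, `sumstats_path("BMI", "train", fold=1, chrom=2, train_setting="N_5000")` yields `BMI/train/N_5000/fold_1/chr_2.PHENO1.glm.linear`, relative to the current directory.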

For more details about the GWAS analyses, quality control (QC) criteria, or other information, please consult our publication:

Zabad, S., Gravel, S., & Li, Y. (2023). Fast and accurate Bayesian polygenic risk modeling with variational inference. The American Journal of Human Genetics, 110(5), 741–761. https://doi.org/10.1016/j.ajhg.2023.03.009

If you use this data in your work, please cite the publication above.

Files (13.7 GB)

Name	MD5	Size
BMI.tar.gz	36dd7cd6a79a060c52d2b68447416d95	1.5 GB
BW.tar.gz	629f7c40ed8fbafb28c784d228410366	1.4 GB
FEV1.tar.gz	ec26a062a958ccf821db826a3fe3b3b4	1.5 GB
FVC.tar.gz	2ec04cdf06b9dfcdb9b3c923a340bd87	1.5 GB
HC.tar.gz	87c82dd948fb999190be91ef12c6ef88	1.5 GB
HDL.tar.gz	2f80e3c76fd22745fa485b05407a4e5f	1.5 GB
HEIGHT.tar.gz	3c568e74ee8351eb0c82aa779f77b573	1.5 GB
LDL.tar.gz	502b6766acc86e680bb96b815804dfe6	1.5 GB
WC.tar.gz	3064bba7e64e2ee51a9f2852e341bf73	1.5 GB
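Since each archive is listed with an MD5 checksum, a download can be checked before extraction. The following is a minimal sketch using Python's `hashlib`; the checksums are copied from the file listing, while the function names `md5_of_file` and `verify_archive` and the local file locations are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 checksums as given in the file listing.
EXPECTED_MD5 = {
    "BMI.tar.gz": "36dd7cd6a79a060c52d2b68447416d95",
    "BW.tar.gz": "629f7c40ed8fbafb28c784d228410366",
    "FEV1.tar.gz": "ec26a062a958ccf821db826a3fe3b3b4",
    "FVC.tar.gz": "2ec04cdf06b9dfcdb9b3c923a340bd87",
    "HC.tar.gz": "87c82dd948fb999190be91ef12c6ef88",
    "HDL.tar.gz": "2f80e3c76fd22745fa485b05407a4e5f",
    "HEIGHT.tar.gz": "3c568e74ee8351eb0c82aa779f77b573",
    "LDL.tar.gz": "502b6766acc86e680bb96b815804dfe6",
    "WC.tar.gz": "3064bba7e64e2ee51a9f2852e341bf73",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the listed checksum for its name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low for archives of this size. `verify_archive("downloads/BMI.tar.gz")` would return `True` only if the downloaded file matches the listed checksum.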

Additional details

Is described by: Publication: 10.1016/j.ajhg.2023.03.009 (DOI)