Published May 13, 2024 | Version v2
Dataset Open

Example datasets for testing the GWAStic software

  • 1. Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung Gatersleben

Description

The first dataset, called barley_set, contains barley data for validating the peak associations of two-row barley genes previously identified in Milner, Jost, and Taketa (2019). To replicate the Genome-Wide Association Study (GWAS) results, use the pre-filtered and formatted genotypic file ‘WGS300_005_0020.bed’. The corresponding phenotype data for row type, pre-formatted for direct use with the GWAStic software, is available in ‘bridge_row_type_GWAS.txt’. To reproduce the genomic prediction experiments, use the same genotypic file, ‘WGS300_005_0020.bed’, together with ‘bridge_row_type_GP.txt’ for the phenotypes. The file ‘validation_set.txt’ lists 30 genotypes that were excluded from the training data and serve as a validation set.
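The hold-out logic behind ‘validation_set.txt’ can be sketched in a few lines of Python. This is only an illustration of partitioning phenotype records by genotype ID; the function name, the (ID, value) row layout, and the toy IDs are hypothetical and do not reflect the actual file formats, which are not described here.

```python
def split_by_validation_ids(phenotype_rows, validation_ids):
    """Partition (genotype_id, value) rows into training and validation lists.

    ASSUMPTION: each row starts with a genotype ID; the real phenotype
    files may use a different layout.
    """
    held_out = set(validation_ids)
    train = [row for row in phenotype_rows if row[0] not in held_out]
    valid = [row for row in phenotype_rows if row[0] in held_out]
    return train, valid


# Toy stand-in data; real IDs and values come from the dataset files.
rows = [("G1", 2.0), ("G2", 6.0), ("G3", 2.0), ("G4", 6.0)]
train, valid = split_by_validation_ids(rows, ["G2", "G4"])
print(len(train), "training rows,", len(valid), "validation rows")
```

With the real data, the validation list would hold the 30 genotype IDs and the training list the remaining genotypes.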

Additionally, a minimal dataset called small_dataset is provided for quick testing of the GWAStic software. This dataset includes:

  • ‘example.vcf.gz’ to test the VCF to BED conversion.
  • ‘example.bed’, a filtered genotypic file ready for use.
  • ‘pheno_gwas.csv’ as a phenotypic file for GWAS.
  • ‘pheno_gp.csv’ as a phenotypic file for genomic prediction.

We generated two synthetic datasets using PLINK, one with binary and one with quantitative phenotypes. Each synthetic dataset contains 2,000 samples (1,000 cases and 1,000 controls) and a total of 90,010 SNPs. For the dataset with binary phenotypes (called synthetic_binary), most SNPs fall into three null groups, nullA, nullB, and nullC, each containing 30,000 SNPs not associated with the disease phenotype. In addition, we included 5 SNPs labeled diseaseA and 5 labeled diseaseB, designed to mimic disease-associated loci. The diseaseA SNPs had allele frequencies between 0.1 and 0.2, with a relative risk of 2.5 under a multiplicative model, while the diseaseB SNPs had allele frequencies between 0.2 and 0.25, with a relative risk of 3.0. The null SNPs had a relative risk of 1.0, indicating no effect.
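The grouping described above maps onto the parameter file read by PLINK's --simulate option, in which each row gives a SNP count, a label, a lower and an upper allele frequency, a heterozygote effect, and a homozygote effect (or mult for a multiplicative model). The Python sketch below builds such a file from the numbers in this description and checks that the groups add up to 90,010 SNPs. It is not the authors' actual parameter file: the allele frequency range of the null groups is not stated here, so NULL_FREQ_RANGE is a placeholder, and placing the stated relative risks in the heterozygote effect column is an assumption.

```python
# ASSUMPTION: the null-group frequency range is not given in the description.
NULL_FREQ_RANGE = (0.05, 0.95)

# (label, number of SNPs, lower freq, upper freq, effect per risk allele)
SNP_SETS = [
    ("nullA", 30000, *NULL_FREQ_RANGE, 1.0),
    ("nullB", 30000, *NULL_FREQ_RANGE, 1.0),
    ("nullC", 30000, *NULL_FREQ_RANGE, 1.0),
    ("diseaseA", 5, 0.10, 0.20, 2.5),
    ("diseaseB", 5, 0.20, 0.25, 3.0),
]


def build_simulation_file(snp_sets):
    """Return the text of a PLINK --simulate parameter file."""
    lines = []
    for label, n_snps, low, high, effect in snp_sets:
        # "mult" asks PLINK to derive the homozygote effect multiplicatively.
        hom = "mult" if effect != 1.0 else "1.00"
        lines.append(f"{n_snps} {label} {low:.2f} {high:.2f} {effect:.2f} {hom}")
    return "\n".join(lines) + "\n"


TOTAL_SNPS = sum(n_snps for _, n_snps, *_ in SNP_SETS)
SIM_FILE_TEXT = build_simulation_file(SNP_SETS)
print(SIM_FILE_TEXT)
print("total SNPs:", TOTAL_SNPS)
```

The case and control counts are not part of this file; in PLINK they are passed separately with --simulate-ncases and --simulate-ncontrols.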

For the dataset with quantitative phenotypes (called synthetic_qt), we followed a similar structure. The SNPs were again divided into nullA, nullB, and nullC categories, with 30,000 SNPs each. We also included 5 SNPs labeled qtlA and 5 labeled qtlB, representing quantitative trait loci. The qtlA SNPs had allele frequencies from 0.1 to 0.2, with an effect size of 0.02, while qtlB SNPs had allele frequencies from 0.2 to 0.25, with an effect size of 0.03. These effect sizes describe each SNP's contribution to the variance of the quantitative trait.
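For the quantitative case, PLINK's --simulate-qt option reads a similar parameter file whose last two columns are the additive genetic variance of each SNP and the ratio of dominance to additive effect. Reading the effect sizes above as per-SNP variance, the Python sketch below lists the five groups and sums the variance attributed to the ten QTL SNPs. The null-group frequency range and the purely additive model (dominance ratio 0) are assumptions that are not stated in the description.

```python
# ASSUMPTIONS: null-group frequency range and zero dominance are placeholders.
NULL_FREQ_RANGE = (0.05, 0.95)
DOMINANCE_RATIO = 0.0

# (label, number of SNPs, lower freq, upper freq, variance per SNP)
QT_SETS = [
    ("nullA", 30000, *NULL_FREQ_RANGE, 0.0),
    ("nullB", 30000, *NULL_FREQ_RANGE, 0.0),
    ("nullC", 30000, *NULL_FREQ_RANGE, 0.0),
    ("qtlA", 5, 0.10, 0.20, 0.02),
    ("qtlB", 5, 0.20, 0.25, 0.03),
]


def build_qt_file(qt_sets, dominance=DOMINANCE_RATIO):
    """Return the text of a PLINK --simulate-qt parameter file."""
    rows = [
        f"{n_snps} {label} {low:.2f} {high:.2f} {variance:.2f} {dominance:.2f}"
        for label, n_snps, low, high, variance in qt_sets
    ]
    return "\n".join(rows) + "\n"


# Total trait variance attributed to the simulated QTL SNPs.
EXPLAINED = sum(n_snps * variance for _, n_snps, _, _, variance in QT_SETS)
QT_FILE_TEXT = build_qt_file(QT_SETS)
print(QT_FILE_TEXT)
print(f"variance attributed to QTL SNPs: {EXPLAINED:.2f}")
```

Under this reading, the ten QTL SNPs jointly account for 5 × 0.02 + 5 × 0.03 = 0.25 of the trait variance, with the remainder left to residual variation.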

Files

Supplementary_dataset.zip (66.3 MB)
md5:cae75d8fdbd46fe8f8e2f07f7b5b76f5
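After downloading, the archive can be checked against the MD5 checksum listed above. The following Python sketch uses only the standard library; the function names and the local file path in the usage note are illustrative.

```python
import hashlib

# Checksum published on this record for Supplementary_dataset.zip.
EXPECTED_MD5 = "cae75d8fdbd46fe8f8e2f07f7b5b76f5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("Supplementary_dataset.zip") should return True for an intact download saved under that name in the current directory.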

Additional details

Software

Repository URL: https://github.com/snowformatics/gwastic_desktop
Programming language: Python
Development Status: Active