Example datasets for testing the GWAStic software
Creators
- Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung Gatersleben
Description
The first dataset, barley_set, contains barley data for validating the peak associations of the two-row barley genes previously identified in Milner, Jost, and Taketa (2019). To replicate the genome-wide association study (GWAS) results, use the pre-filtered and formatted genotype file ‘WGS300_005_0020.bed’ together with ‘bridge_row_type_GWAS.txt’, which holds the row-type phenotype data pre-formatted for direct use with GWAStic. To reproduce the genomic prediction experiments, use the same genotype file, ‘WGS300_005_0020.bed’, with the phenotype file ‘bridge_row_type_GP.txt’. The file ‘validation_set.txt’ lists 30 genotypes that were excluded from the training data and serve as a validation set.
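The hold-out described above can be sketched in Python. This is only an illustration: it assumes ‘validation_set.txt’ lists one genotype ID per line, which this record does not state, and `read_ids` and `split_holdout` are hypothetical helpers, not part of GWAStic.

```python
from pathlib import Path


def read_ids(path):
    """Read one ID per line, skipping blank lines (assumed file layout)."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_holdout(all_ids, validation_ids):
    """Split sample IDs into training and validation lists, keeping input order."""
    held_out = set(validation_ids)
    training = [sample for sample in all_ids if sample not in held_out]
    validation = [sample for sample in all_ids if sample in held_out]
    return training, validation
```

For the barley data, `split_holdout(sample_ids, read_ids("validation_set.txt"))` should return a validation list of 30 entries if every held-out ID is present in `sample_ids`; a shorter list would point to mismatched identifiers.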
Additionally, a minimal dataset called small_dataset is provided for quick testing of the GWAStic software. It includes:
- ‘example.vcf.gz’ to test the VCF to BED conversion.
- ‘example.bed’, a filtered genotypic file ready for use.
- ‘pheno_gwas.csv’ as a phenotypic file for GWAS.
- ‘pheno_gp.csv’ as a phenotypic file for genomic prediction.
We generated two synthetic datasets with PLINK, one with a binary phenotype and one with a quantitative phenotype. Each synthetic dataset contains 2,000 samples (1,000 cases and 1,000 controls) and a total of 90,010 SNPs. For the dataset with the binary phenotype (called synthetic_binary), the SNPs fall into three null groups, nullA, nullB, and nullC, each containing 30,000 SNPs not associated with the disease phenotype. In addition, we included 5 SNPs labeled diseaseA and 5 labeled diseaseB, designed to mimic disease-associated loci. The diseaseA SNPs had allele frequencies between 0.1 and 0.2 and a relative risk of 2.5 under a multiplicative model, while the diseaseB SNPs had allele frequencies between 0.2 and 0.25 and a relative risk of 3.0. The null SNPs had a relative risk of 1.0, indicating no effect.
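The SNP sets above map naturally onto a PLINK simulation parameter file. The following is a minimal sketch, not the authors' actual script. It follows the column layout documented for PLINK 1.x `--simulate` (count, label, lower and upper allele frequency, heterozygote odds ratio, and `mult` for a multiplicative homozygote effect), so check it against your PLINK version. Two assumptions are made: the allele-frequency range of the null sets is not given in this record, so `NULL_FREQ` is a placeholder, and the stated relative risks are passed in the odds-ratio column.

```python
# ASSUMPTION: the record does not state the allele-frequency range of the
# null SNPs; 0.0-1.0 is a placeholder, not the value used by the authors.
NULL_FREQ = (0.0, 1.0)

# (number of SNPs, label, lower freq, upper freq, relative risk)
SNP_SETS = [
    (30000, "nullA", *NULL_FREQ, 1.0),
    (30000, "nullB", *NULL_FREQ, 1.0),
    (30000, "nullC", *NULL_FREQ, 1.0),
    (5, "diseaseA", 0.1, 0.2, 2.5),
    (5, "diseaseB", 0.2, 0.25, 3.0),
]


def simulate_params(snp_sets):
    """Render one parameter line per SNP set; 'mult' requests a multiplicative model."""
    lines = [
        f"{count} {label} {low:.2f} {high:.2f} {risk:.2f} mult"
        for count, label, low, high, risk in snp_sets
    ]
    return "\n".join(lines) + "\n"


def total_snps(snp_sets):
    """Sum the SNP counts over all sets."""
    return sum(count for count, *_ in snp_sets)


if __name__ == "__main__":
    print(simulate_params(SNP_SETS), end="")
    print("total SNPs:", total_snps(SNP_SETS))
```

The total comes to 3 × 30,000 + 5 + 5 = 90,010, matching the description. Written to a file, the parameter text could then be passed to something like `plink --simulate params.sim --simulate-ncases 1000 --simulate-ncontrols 1000 --make-bed --out synthetic_binary`, where the file and output names are illustrative.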
For the dataset with the quantitative phenotype (called synthetic_qt), we followed a similar structure. The SNPs were again divided into the nullA, nullB, and nullC categories, with 30,000 SNPs each. We also included 5 SNPs labeled qtlA and 5 labeled qtlB, representing quantitative trait loci. The qtlA SNPs had allele frequencies from 0.1 to 0.2 and an effect size of 0.02, while the qtlB SNPs had allele frequencies from 0.2 to 0.25 and an effect size of 0.03. These effect sizes describe each SNP's contribution to the variance of the quantitative trait.
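As a quick sanity check on these numbers, the sketch below adds up the variance attributed to the simulated QTLs. It assumes each effect size is the proportion of trait variance explained by a single SNP, which fits the last sentence above but is not confirmed by this record. The names `QTL_SETS` and `total_variance_explained` are illustrative.

```python
# label -> (number of SNPs, assumed per-SNP proportion of trait variance)
QTL_SETS = {"qtlA": (5, 0.02), "qtlB": (5, 0.03)}


def total_variance_explained(qtl_sets):
    """Sum the per-SNP variance proportions over all QTL sets."""
    return sum(count * variance for count, variance in qtl_sets.values())


explained = total_variance_explained(QTL_SETS)
residual = 1.0 - explained
print(f"explained: {explained:.2f}, residual: {residual:.2f}")
# prints "explained: 0.25, residual: 0.75"
```

Under that reading, the ten QTLs together account for a quarter of the trait variance, and the remaining three quarters are unexplained by the simulated loci.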
Files
Name | Size | MD5 |
---|---|---|
Supplementary_dataset.zip | 66.3 MB | cae75d8fdbd46fe8f8e2f07f7b5b76f5 |
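The MD5 checksum listed for the archive can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the local path is an assumption and should point to wherever ‘Supplementary_dataset.zip’ was saved.

```python
import hashlib

# Checksum published with this record for Supplementary_dataset.zip.
EXPECTED_MD5 = "cae75d8fdbd46fe8f8e2f07f7b5b76f5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("Supplementary_dataset.zip")` returns `False` for a truncated or corrupted file, in which case the archive should be downloaded again before unpacking.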
Additional details
Software
- Repository URL: https://github.com/snowformatics/gwastic_desktop
- Programming language: Python
- Development status: Active