There is a newer version of the record available.

Published February 26, 2026 | Version 1.0
Dataset Open

Genomic Variant Dataset (VCF) from Ethiopian Sorghum Founder Lines and Key Landrace Accessions

Description

This dataset contains high-quality genome-wide variant calls (VCF format) generated from whole-genome sequencing (WGS) of 185 Ethiopian sorghum (Sorghum bicolor) accessions that passed quality control filtering.

The original dataset included 188 accessions: founder lines prioritized by the Ethiopian Institute of Agricultural Research (EIAR) breeding program and diverse landraces spanning the major agroecological zones of Ethiopia, including arid, semi-arid, sub-humid, and humid environments. After quality assessment with FastQC and read filtering, 185 samples were retained for downstream variant calling; the remaining three were excluded.

Raw FASTQ files for all 188 accessions have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA1428287.

Reads were aligned to the Sorghum bicolor reference genome assembly NCBIv3 (GCF_000003195.3). Variants were called with a standardized pipeline and then filtered stringently to retain high-confidence biallelic SNPs located on the primary chromosomes. The final VCF file contains only variants that meet the following criteria:

• Depth (DP) between 10 and 50
• Missing rate ≤ 20%
• Minor allele frequency (MAF) ≥ 0.05
• Main chromosomes only
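
The four criteria above can be read as a single per-site test. The Python sketch below is illustrative only and is not the pipeline used to produce this file. The record does not state whether the depth threshold applies per genotype or per site, so the sketch assumes the mean depth across called samples. The function name, the argument layout, and the set of chromosome names are hypothetical.

```python
from typing import Optional, Sequence, Set, Tuple

# A diploid call: (allele, allele) with 0 = reference, 1 = alternate; None = missing.
Genotype = Optional[Tuple[int, int]]


def passes_filters(
    genotypes: Sequence[Genotype],
    depths: Sequence[int],
    chrom: str,
    main_chroms: Set[str],
    min_dp: float = 10,
    max_dp: float = 50,
    max_missing: float = 0.20,
    min_maf: float = 0.05,
) -> bool:
    """Return True if a biallelic site meets all four criteria (assumed semantics)."""
    # Main chromosomes only.
    if chrom not in main_chroms:
        return False

    called = [g for g in genotypes if g is not None]
    if not called:
        return False

    # Missing rate <= 20%.
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if missing_rate > max_missing:
        return False

    # Depth between 10 and 50 (ASSUMPTION: mean depth over called samples).
    called_depths = [d for g, d in zip(genotypes, depths) if g is not None]
    mean_dp = sum(called_depths) / len(called_depths)
    if not (min_dp <= mean_dp <= max_dp):
        return False

    # Minor allele frequency >= 0.05, computed over called alleles.
    alt_count = sum(allele for g in called for allele in g)
    alt_freq = alt_count / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf
```

With 185 samples, the missing-rate threshold corresponds to at most 37 missing genotypes per site (37/185 = 20%).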

The file provided here:

Sorghum_WGS_185samples_MAINCHR_DP10_50_MISS20_MAF05_NCBIv3.vcf.gz

contains a curated genome-wide variant dataset suitable for analyses of genetic diversity and population structure, genome-wide association studies, and genomics-assisted breeding.
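
Before running a full analysis on a file of this size, it can help to confirm that the header lists the expected 185 samples. The Python sketch below uses only the standard library and stops reading at the column header line, so it does not scan the whole file. It assumes the file is gzip- or bgzip-compressed, as the .vcf.gz extension suggests, and follows the standard VCF layout in which sample names begin at the tenth column. The function name is hypothetical.

```python
import gzip
from typing import List


def read_sample_names(vcf_gz_path: str) -> List[str]:
    """Return the sample names from the #CHROM header line of a gzipped VCF."""
    with gzip.open(vcf_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
                return fields[9:]
            if not line.startswith("#"):
                # Reached data records without seeing a column header.
                break
    raise ValueError("no #CHROM header line found")


# Example use (path is the file name given in this record):
# names = read_sample_names(
#     "Sorghum_WGS_185samples_MAINCHR_DP10_50_MISS20_MAF05_NCBIv3.vcf.gz"
# )
# print(len(names))
```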

Files (6.9 GB)

Additional details

Related works

Is supplemented by
Dataset: PRJNA1428287 (Other)

Funding

United States Department of State
Climate Resilient Cereals Innovation Lab Cooperative Agreement No. 7200AA23LE00003

Dates

Submitted
2026-02-26