Published September 15, 2021 | Version v1
Dataset Open

Data from: Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices

  • 1. Michigan State University
  • 2. International Maize and Wheat Improvement Center
  • 3. Colegio de Postgraduados

Description

Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulates, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic-BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using SSI with additive relationships (GSSI) led to 5-17% increases in accuracy relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.

Notes

The data contain phenotypic observations on 3528 genotypes, of which 901, 1419, 722, and 486 are from 2017, 2018, 2019, and 2020, respectively.

Missing values: In 2018, the genotype with GID '973680' was not observed in Optimal experiments, and the genotype with GID '976304' was not observed in Drought experiments. In 2019, the genotype with GID '1132699' was not recorded in Drought experiments.

Phenotypic data: File 'Pheno_data.csv' is a matrix containing the adjusted means for all 3528 genotypes (in rows) for each combination of trait and environmental condition (in columns). Column 'GID' contains the genotype ID and column 'Year' contains the cycle to which each genotype belongs.
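As a quick sanity check on the layout described above, the phenotype table can be read with Python's standard library and summarized by cycle. This is a minimal sketch, not part of the dataset: it assumes the file is comma-separated with a header row, that 'GID' and 'Year' are the exact column names, and that missing adjusted means appear as empty cells or as 'NA' (the record does not state how missing values are encoded). The function names are hypothetical.

```python
import csv
from collections import Counter

# Assumption: missing adjusted means are stored as empty cells, "NA", or "NaN".
MISSING = {"", "NA", "NaN"}

def read_table(lines):
    """Parse CSV text (an open file or any iterable of lines) into a list of dicts."""
    return list(csv.DictReader(lines))

def genotypes_per_year(rows):
    """Count genotypes per cycle using the 'Year' column."""
    return Counter(row["Year"] for row in rows)

def missing_cells(rows):
    """Return (GID, column) pairs whose adjusted mean is missing."""
    found = []
    for row in rows:
        for column, value in row.items():
            if column is None or column in ("GID", "Year"):
                continue
            if value is None or value.strip() in MISSING:
                found.append((row["GID"], column))
    return found
```

With the real file, `with open("Pheno_data.csv", newline="") as fh: rows = read_table(fh)` followed by `genotypes_per_year(rows)` should, if the assumptions hold, reproduce the per-year counts given in the notes (901, 1419, 722, and 486), and `missing_cells(rows)` should list the three GIDs named under "Missing values".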

Genotypic data: File 'Geno_data.csv' contains, for each genotype (in rows), presence-absence information on 4612 markers (in columns). Column 'GID' contains the genotype ID and matches the 'GID' column in 'Pheno_data.csv'.
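Because the two files are linked only through 'GID', the marker rows usually need to be put in the same order as the phenotype rows before any model is fitted. The sketch below shows one way to do this with the standard library. It assumes comma-separated files with header rows and presence-absence calls coded as 0/1 with no missing entries; none of these details are confirmed by the record, and the function names are hypothetical.

```python
import csv

def read_table(lines):
    """Parse CSV text (an open file or any iterable of lines) into a list of dicts."""
    return list(csv.DictReader(lines))

def align_by_gid(pheno_rows, geno_rows):
    """Return (gids, marker_names, matrix) with marker rows ordered as in pheno_rows."""
    geno = {row["GID"]: row for row in geno_rows}
    absent = [row["GID"] for row in pheno_rows if row["GID"] not in geno]
    if absent:
        raise KeyError(f"{len(absent)} GIDs have no marker data, e.g. {absent[:3]}")
    marker_names = [name for name in geno_rows[0] if name != "GID"]
    gids = [row["GID"] for row in pheno_rows]
    # Assumption: every marker call is numeric (0 = absent, 1 = present).
    matrix = [[int(float(geno[gid][name])) for name in marker_names] for gid in gids]
    return gids, marker_names, matrix
```

Raising an error for unmatched GIDs, rather than silently dropping them, makes it obvious if the two files ever fall out of step.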

Funding provided by: Bill and Melinda Gates Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000865
Award Number:

Funding provided by: Monsanto Beachell-Borlaug International Scholar Program*
Crossref Funder Registry ID:
Award Number:

Funding provided by: National Institute of Food and Agriculture
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100005825
Award Number: 2021-67015-33413

Files

Files (32.8 MB)

Checksum  Size
md5:7d36038b66775ee6d62fd7957ceab7cb  32.6 MB
md5:3b1cb8adb2f2b697d2873f313fafe1cf  194.6 kB
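The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's `hashlib`. The listing here does not say which checksum belongs to which file name, so the sketch only checks whether a file's digest is one of the two listed values; the function names are hypothetical.

```python
import hashlib

# Checksums as listed in this record (file-to-checksum mapping not stated here).
LISTED_MD5 = {
    "7d36038b66775ee6d62fd7957ceab7cb",
    "3b1cb8adb2f2b697d2873f313fafe1cf",
}

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listing("Geno_data.csv")` returning False would suggest a corrupted or incomplete download.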

Additional details

Related works

Is cited by
10.3389/fpls.2019.01502 (DOI)