vignettes/vignettes_cross-validation.Rmd
vignettes_cross-validation.Rmd
In this section, we will use the data in data_ge
. For more information, please, see ?data_ge
. Other data sets can be used provided that the following columns are in the dataset: environment, genotype, block/replicate and response variable(s).
The predictive accucary of both AMMI and BLUP models may be obtained using a cross-validation procedure implemented by the functions cv_ammif()
and cv_blup()
The cv_ammif()
function provides a complete cross-validation procedure for all member of AMMI model family (AMMI0-AMMIF) using replicate-based data, according to the diagram below. Automatically the first validation is carried out considering the AMMIF (all possible axis used). Considering this model, the original data set is split up into two sets: training set and validation set.
Diagram for cross-validation of AMMI models.
The training set has all combinations (genotype x environment) with the number of replications informed in nrepval
argument. The validation set has one replication that were not included in the training set. The splitting of the data set into training and validation sets depends on the design considered. For a Randomized Complete Block Design (default option) and the procedure we used in the article, completely blocks are randomly selected within environments, as suggested by Piepho (1994). The remaining block serves as validation data. If design = "CRD"
is informed, thus declaring that a completely randomized design was used, single observations are randomized for each treatment (genotype-by-environment combination). This is the same procedure suggested by Gauch and Zobel (1988). The estimated values for each member of the AMMI model family in each re-sampling cycle are compared with the observed values in the validation data. Then, the Root Mean Square Prediction Difference is computed as follows:
\[ RMSPD = {\left[ {\left( {\sum\nolimits_{i = 1}^n {{{\left( {{{\hat y}_{ij}} - {y_{ij}}} \right)}^2}} } \right)/n} \right]^{0.5}} \]
where \(\widehat{y}_{ij}\) is the model predicted value; and \(y_{ij}\) is the observed value in the validation set. The number of random selection of blocks/replicates (n) is defined in the argument nboot
. At the end of the n cycles for all models, a list with all estimated RMSPD and the average of RMSPD is returned.
The following code computes the cross-validation of the oat data using the AMMI model. There are two options for doing that. The first is to perform the cross-validation for a specific member of the AMMI-family model. The second (and more realistic) is to perform the cross-validation for all AMMI-family models in the same procedure.
The function cv_ammi()
is used to compute a cross-validation procedure for the AMMI0, AMMI2 and AMMIF (9 axes) models.
The function cv_ammif()
is used to compute a cross-validation procedure for all members of the AMMI-family model. In this case, AMMI0-AMMI9.
AMMIF <- cv_ammif(data_ge, ENV, GEN, REP, GY)
The function cv_blup()
provides a cross-validation of replicate-based data using mixed models. By default, complete blocks are randomly selected for each environment. Using the argument random
it is possible to chose the random effects of the model, as shown bellow.
BLUP_g <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "gen")
BLUP_e <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "env")
BLUP_ge <- cv_blup(data_ge, ENV, GEN, REP, GY, random = "all")
bind_mod <- bind_cv(AMMIF, BLUP_g, bind = "means") print_table(bind_mod$RMSPD)
The table above showns the descriptive statistics (mean, standard deviation, standar error of the mean, and quantiles 2.5% and 97.5%) of the 100 RMSPD estimates for each model, and are presented from the most accurate model (smallest RMSPD mean) to the least accurate model (highest RMSPD mean).
The values of the RMSPD estimates obtained in the cross-validation process may be plotted using the functionplot()
.
bind1 <- bind_cv(AMMI0, AMMI2, AMMI9) bind2 <- bind_cv(AMMIF, BLUP_g, BLUP_e, BLUP_ge) a <- plot(bind1, violin = TRUE) b <- plot(bind2, width.boxplot = 0.6, order_box = TRUE, plot_theme = theme_metan_minimal()) arrange_ggplot(a, b, labels = letters[1:2])
Six statistics are shown in this boxplot. The mean (black rhombus), the median (black line), the lower and upper hinges that correspond sto the first and third quartiles (the 25th and 75th percentiles, respectively). The upper whisker extends from the hinge to the largest value no further than \(1.5\times{IQR}\) from the hinge (where IQR is the inter-quartile range). The lower whisker extends from the hinge to the smallest value at most \(1.5\times{IQR}\) of the hinge. Data beyond the end of the whiskers are considered outlying points. If the condition violin = TRUE
, a violin plot is added along with the boxplot. A violin plot is a compact display of a continuous distribution displayed in the same way as a boxplot.
Rendering engine
This vignette was built with pkgdown. All tables were produced with the package DT
using the following function.
library(DT) # Used to make the tables # Function to make HTML tables print_table <- function(table, rownames = FALSE, digits = 3, ...){ df <- datatable(table, rownames = rownames, extensions = 'Buttons', options = list(scrollX = TRUE, dom = '<<t>Bp>', buttons = c('copy', 'excel', 'pdf', 'print')), ...) num_cols <- c(as.numeric(which(sapply(table, class) == "numeric"))) if(length(num_cols) > 0){ formatSignif(df, columns = num_cols, digits = digits) } else{ df } }
Gauch, H.G, and Zobel R. W. 1988. “Predictive and Postdictive Success of Statistical Analyses of Yield Trials.” Theor. Appl. Genet. 76 (1): 1–10. https://doi.org/10.1007/BF00288824.
Piepho, H. P. 1994. “Best Linear Unbiased Prediction (Blup) for Regional Yield Trials: A Comparison to Additive Main Effects and Multiplicative Interaction (Ammi) Analysis.” Theor. Appl. Genet. 89 (5): 647–54. https://doi.org/10.1007/BF00222462.