Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data

May, Jacqueline A.; Feng, Zeny; Adamowicz, Sarah J.

doi:10.5061/dryad.crjdfn37m

Published February 15, 2023 | Version v1

Dataset Open

1. University of Guelph

Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single-gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performance of each method was evaluated under each missingness mechanism by determining the mean squared error for numerical traits and the proportion falsely classified for categorical traits. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data, as phylogeny did not improve performance for every trait or in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
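
As a rough illustration of the evaluation loop described above, the following Python sketch removes values completely at random (MCAR), fills them with the simplest candidate method (mean/mode imputation), and scores the result with the mean squared error (numerical trait) and the proportion falsely classified (categorical trait). It is not the authors' code: the function names, the toy trait values, and the 30% missingness level are all assumptions made for illustration, and the MAR and MNAR mechanisms and the phylogeny-informed methods are not covered.

```python
import random
from collections import Counter
from statistics import mean


def induce_mcar(values, proportion, rng):
    """Mask a proportion of values uniformly at random (MCAR).

    Returns the masked copy (None marks a missing value) and the masked indices.
    """
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, missing_idx


def impute_mean(values):
    """Replace missing numerical values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing categorical values with the most common observed one."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def mse(truth, imputed, missing_idx):
    """Mean squared error over the positions that were masked."""
    return mean((truth[i] - imputed[i]) ** 2 for i in missing_idx)


def pfc(truth, imputed, missing_idx):
    """Proportion of masked positions whose imputed category is wrong."""
    return sum(truth[i] != imputed[i] for i in missing_idx) / len(missing_idx)


# Hypothetical complete-case traits (made-up values, not from the dataset).
body_length = [5.1, 7.3, 6.8, 12.0, 9.4, 8.2, 10.5, 6.1, 7.9, 11.3]
activity = ["diurnal", "diurnal", "nocturnal", "diurnal", "nocturnal",
            "diurnal", "diurnal", "nocturnal", "diurnal", "diurnal"]

rng = random.Random(42)  # fixed seed so the run is repeatable
num_masked, num_idx = induce_mcar(body_length, 0.3, rng)
cat_masked, cat_idx = induce_mcar(activity, 0.3, rng)

num_error = mse(body_length, impute_mean(num_masked), num_idx)
cat_error = pfc(activity, impute_mode(cat_masked), cat_idx)
print(f"MSE (numerical trait): {num_error:.3f}")
print(f"PFC (categorical trait): {cat_error:.3f}")
```

Because the true values are known for every masked position, the same scoring functions could be reused to compare any other candidate method on the same masked data.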

Notes

Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.
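
For a quick look at the files without dedicated software, both formats are plain text and can be read with a few lines of Python. The sketch below is a deliberately simple reader, not a full parser: `read_fasta` and `newick_tip_labels` are hypothetical helper names, the sequences and tree are made-up stand-ins for the real files, and the tip-label pattern assumes unquoted labels. For real analyses, an established library such as Biopython (`Bio.SeqIO`, `Bio.Phylo`) or the R package ape is the safer choice.

```python
import io
import re


def read_fasta(handle):
    """Read FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def newick_tip_labels(newick):
    """Return tip labels of a Newick string (assumes unquoted labels).

    A tip label directly follows "(" or ","; internal node labels follow ")"
    and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Made-up stand-ins for Squamata_COI_Alignment.fasta and a .tre file.
fasta_text = ">Anolis_carolinensis\nACGTAC\nGGTA\n>Gekko_gecko\nACGTTC\nGGTA\n"
tree_text = "((Anolis_carolinensis:0.1,Gekko_gecko:0.2):0.05,Varanus_komodoensis:0.3);"

alignment = read_fasta(io.StringIO(fasta_text))
tips = newick_tip_labels(tree_text)
print(len(alignment), "sequences;", len(tips), "tips")
```

To read a downloaded file instead of the in-memory text, pass an open file handle to `read_fasta` and the contents of a `.tre` file to `newick_tip_labels`.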

Funding provided by: Canada First Research Excellence Fund
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100010785

Funding provided by: University of Guelph
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100008986

Funding provided by: Natural Sciences and Engineering Research Council of Canada
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100000038

Funding provided by: Genome Canada
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100008762

Funding provided by: Ontario Ministry of Economic Development, Job Creation and Trade
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100013569

Files

Files (235.8 kB)

Name	MD5	Size
README.md	b26af5f7cc625db928dff5c7511ebc1e	4.5 kB
Squamata_CMOS_Gene.tre	cda66a0c013c861be77be9a7abcc45a6	10.4 kB
Squamata_CMOS_Gene_FinalImp.tre	9155440a3774cd67b21e2d11cfa7d614	47.4 kB
Squamata_COI_Alignment.fasta	46ad8192217bc074fd978be17924ee66	142.2 kB
Squamata_COI_Gene.tre	7f5f1f2bd02c0683637c440c747b25e8	10.4 kB
Squamata_Multigene.tre	c73188e8bdfbf9d9ad83d84e77113525	10.4 kB
Squamata_RAG1_Gene.tre	b7c7d884bebad5b761742aae96e29fc5	10.4 kB
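
The md5 values in the file listing can be used to confirm that a download is intact. A minimal Python sketch follows, assuming the files were saved under their listed names in one directory; `md5_of_file` and `check_downloads` are hypothetical helper names, and the `"."` directory is a placeholder.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "README.md": "b26af5f7cc625db928dff5c7511ebc1e",
    "Squamata_CMOS_Gene.tre": "cda66a0c013c861be77be9a7abcc45a6",
    "Squamata_CMOS_Gene_FinalImp.tre": "9155440a3774cd67b21e2d11cfa7d614",
    "Squamata_COI_Alignment.fasta": "46ad8192217bc074fd978be17924ee66",
    "Squamata_COI_Gene.tre": "7f5f1f2bd02c0683637c440c747b25e8",
    "Squamata_Multigene.tre": "c73188e8bdfbf9d9ad83d84e77113525",
    "Squamata_RAG1_Gene.tre": "b7c7d884bebad5b761742aae96e29fc5",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory):
    """Map each listed name to True (match), False (mismatch), or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


# "." is a placeholder for the directory holding the downloaded files.
for name, status in check_downloads(".").items():
    print(name, status)
```

A result of `False` suggests a corrupted or modified file that should be downloaded again; `None` only means the file was not found in the given directory.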

Additional details

Is cited by: 10.1101/2022.05.03.490388 (DOI)
Is source of: 10.5281/zenodo.7618009 (DOI)

	All versions	This version
Views	168	168
Downloads	101	101
Data volume	3.4 MB	3.4 MB
