The main processes included generating alignments, fitting four different partition and mixture models, and analysing results. The data were generated under Seq-Gen-1.3.4 (Rambaut and Grass 1997). The model fitting was performed IQ-TREE2 (Minh et al. 2020) on a Linux system. The results were analysed using the R package phangorn in R (version 3.6.2) (Schliep 2011, R Core Team 2019). We wrote custom bash scripts to extract relevant parts of the results from IQ-TREE2, and these results were processed in R.
The zip files contain four folders: "bash-scripts", "data", "R-codes", and "results-IQTREE2". The bash-scripts folder contains all the bash scripts for simulating alignments and performing model fitting. The data folder contains two child folders: "sequence-data" and "Rdata". The child folder "sequence-data" contains the alignments created for the simulations. The other child folder, "Rdata", contains the files created by R to store the results extracted from "IQTREE2" and the results calculated in R. The "R-codes" folder includes the R codes for analysing the results from "IQTREE2". The folder "results-IQTREE2" stores all the results from the fitted models. The three simulations we performed were essentially the same. We used the same parameters of the evolutionary models, and the trees with the same topologies but dierent edge lengths to generate the sequences.
The steps we used were: simulating alignments, model fitting and extracting results, and processing the extracted results. The first two steps were performed on a Linux system using bash scripts, and the last step was performed in R.
Simulating Alignment
To simulate heterogeneous data we created two multiple sequence alignments (MSAs) under simple homogeneous models with each model comprising a substitution model and an edge-weighted phylogenetic tree (the tree topology was fixed). Each MSA contained eight taxa and 1000 sites. This was performed using the bash script "step1_seqgen_data.sh" in Linux.
These two MSAs were then concatenated together giving a MSA with 2000 sites. This was equivalent to generating the concatenated MSA under a two-block unlinked edge lengths partition model (P-UEL). This was performed using the bash script "step2_concat_data.sh". This created the 0% group of MSAs.
In order to simulate a situation where the initial choice of blocks does not properly account for the heterogeneity in the concatenated MSA (i.e., mispartitioning), we randomly selected a proportion of 0%, 5%, 10%, 15%, …, up to 50% of sites from each block and swapped them. That is, the sites drawn from the first block were placed in the second block, and the sites drawn from the second block were placed in the first block. This process was repeated 100 times for each proportion of mispartitioned sites giving a total of 1100 MSAs.
This process involved two steps. The first step was to generate ten sets of different amounts of numbers without duplicates from each of the two intervals [1,1000] and [1001,2000]. The amounts of numbers were based on the proportions of incorrectly partitioning sites. For example, the first set has 50 numbers on each interval, and the second set has 100 numbers on each interval, etc. The first step was performed in R, and the R code was not provided but the random number text files were included. The second step was to select sites from the concatenated MSAs from the locations based on the numbers created in the first step. This created 5%, 10%, 15%, …, 50% groups of MSAs. The second steps used the following bash scripts: "step3_1_mixmatch_pre_data.sh" and "step3_2_mixmatch_data.sh". The MSAs used in the simulations were created and stored in the "data" folder.
Model Fitting and Extracting Results
The next steps were to fit four different partition and mixture models to the data in IQ-TREE2 and extract the results. The models used were P-LEL partition model, P-UEL partition model, M-UGP mixture model and M-LGP mixture model. For the partition models, the partitioning schemes were the same: the first 1000 sites as a block and the second 1000 sites as another. For the groups of MSAs with different proportions of mispartitioned sites, this was equivalent to fitting the partition models with an incorrect partitioning scheme. The partitioning scheme was called "parscheme.nex". The bash scripts for model fitting were stored in the "bash-scripts" folder. To run the bash scripts, users can follow the order which was shown in the names of these bash scripts.
The inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors and AIC values were extracted from the IQTREE2 results. These extracted results were stored in the "results-IQTREE2" folder and were used to evaluate the performance of AIC and performance of models in R.
Processing Extracted Results in R
To evaluate the performance of AIC and the performance of fitted partition models and mixture models, we calculated the following measures: the rEKL values, the bias of AIC in estimating the rEKL, the Robinson-Foulds (RF) distances, and the branch scores (bs). We also compared the distribution of the estimated model parameters (i.e. base frequencies and rate matrices) to the generating model parameters. These processes were performed in R.
The first step was to read in the inferred trees, estimated base frequencies, estimated rate matrices, estimated weight factors and AIC values that were extracted from IQTREE2 results. These R scripts were stored in the "R-codes" folder, and the names of these scripts started with "readpara_..." (e.g. "readpara_MLGP_standard"). After reading in all the parameters for each model, we estimated the measures mentioned above using the corresponding R scripts that were also in the "R-codes" folder.
The functions used in these R scripts were stored in the "R_functions_simulation". It is worth noting that the directories need to be changed if users want to run these R scripts on their computers.
Funding provided by: University of Tasmania
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001249
Award Number: