Published April 1, 2026 | Version v1
Software Open

Constructing empirical phylogenetic tree datasets with controlled taxon overlap

Description

Phylogenetic methods such as supertree construction, tree comparison, and tree clustering often operate on trees defined over partially overlapping taxon sets. However, reproducible empirical benchmarks for evaluating such methods remain limited, especially when branch lengths, controlled overlap structure, and reference trees are required. We present a reproducible method for constructing empirical phylogenetic tree datasets with controlled taxon overlap from broad-coverage biological phylogenies. The method provides two complementary construction modes. The first mode constructs overlapping collections of empirical trees with branch lengths. The second mode builds benchmark datasets around a fixed reference tree: it prunes a broad-coverage phylogeny to a target taxon set to obtain the reference tree, then generates overlapping input trees by pruning that reference tree to partially intersecting taxon subsets, optionally adding controlled noise to topology and branch lengths. We implement the method in an open-source pipeline and apply it to amphibians, birds, mammals, sharks, and squamates. Validation confirms that the generated datasets meet the intended subset sizes, overlap constraints, taxon coverage, and anchor-taxon requirements, that the output is valid Newick, and that branch lengths remain positive after perturbation. A demonstration with supertree construction illustrates how the generated datasets can be used to evaluate phylogenetic methods under incomplete taxon sampling. The scripts and released datasets are publicly available to support reproducible method development in biodiversity and environmental data science.
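As an illustration of the subset-selection step described above, the following minimal Python sketch draws partially intersecting taxon subsets that share a set of anchor taxa and jointly cover the target taxon set, then checks the properties listed under validation. It is not taken from the released pipeline: the function names (overlapping_subsets, validate), their parameters, the round-robin strategy, and the example taxon labels are all assumptions made for this example, and pruning the reference tree to each subset is left out.

```python
import random
from itertools import combinations


def overlapping_subsets(taxa, n_subsets, subset_size, anchors, seed=0):
    """Hypothetical helper: draw taxon subsets that all contain `anchors`
    and whose union is the full taxon set."""
    rng = random.Random(seed)
    anchors = set(anchors)
    others = [t for t in taxa if t not in anchors]
    free = subset_size - len(anchors)  # slots per subset not used by anchors
    if free < 0 or free > len(others):
        raise ValueError("subset_size is incompatible with anchors and taxa")
    if n_subsets * free < len(others):
        raise ValueError("subsets are too small to cover every taxon")

    # Deal non-anchor taxa round-robin so each one lands in at least one subset.
    rng.shuffle(others)
    subsets = [set(anchors) for _ in range(n_subsets)]
    for i, taxon in enumerate(others):
        subsets[i % n_subsets].add(taxon)

    # Top up each subset to the target size; the extra taxa create overlap
    # beyond the shared anchors.
    for s in subsets:
        pool = [t for t in others if t not in s]
        s.update(rng.sample(pool, subset_size - len(s)))
    return subsets


def validate(subsets, taxa, anchors, subset_size, min_overlap):
    """Hypothetical checks mirroring the validation criteria in the text."""
    assert all(len(s) == subset_size for s in subsets)       # subset sizes
    assert all(set(anchors) <= s for s in subsets)           # anchor taxa
    assert set().union(*subsets) == set(taxa)                # taxon coverage
    assert all(len(a & b) >= min_overlap                     # pairwise overlap
               for a, b in combinations(subsets, 2))


# Example with made-up taxon labels: 20 taxa, 3 anchors, 4 subsets of size 8.
taxa = [f"taxon_{i}" for i in range(20)]
anchors = taxa[:3]
subsets = overlapping_subsets(taxa, n_subsets=4, subset_size=8,
                              anchors=anchors, seed=1)
validate(subsets, taxa, anchors, subset_size=8, min_overlap=len(anchors))
```

In this sketch the anchors guarantee a minimum pairwise overlap equal to their number, and the fixed seed makes the draw reproducible. Each resulting subset would then be used to prune the reference tree into one input tree.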

Files

overlap-tree-data-pipeline-main.zip (6.1 MB)
md5:61198f5df3276901e70d0b770f58892d