github.com/aofarrel/tree_nine/tree_nine

Ash O'Farrell

doi:10.5281/zenodo.17344734

Published October 13, 2025 | Version 0.4.2


Ash O'Farrell¹

1. University of California Santa Cruz

Tree Nine

Place diff files on an existing phylogenetic tree using UShER's usher sampled task with a bit of help from SRANWRP, followed by conversion of that tree to Taxonium, Newick, and Nextstrain formats. Pairwise SNP distances between samples are calculated and output as a distance matrix, and samples are placed into clusters based on those distances.
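To make the distance-matrix step concrete, here is a minimal sketch in Python. It is not Tree Nine's implementation: it assumes each sample can be reduced to a mapping of genome position to alternate allele (positions not listed are taken to match the reference), and the names snp_distance, distance_matrix, and the sample IDs are all hypothetical. The actual pipeline may derive distances from the tree itself rather than from raw variant calls.

```python
from itertools import combinations


def snp_distance(a: dict[int, str], b: dict[int, str]) -> int:
    """Count positions where two samples carry different alleles.

    Assumption: a position missing from a mapping matches the reference.
    """
    return sum(1 for pos in a.keys() | b.keys() if a.get(pos) != b.get(pos))


def distance_matrix(samples: dict[str, dict[int, str]]) -> dict[str, dict[str, int]]:
    """Build a symmetric matrix of pairwise SNP distances."""
    names = sorted(samples)
    matrix = {n: {m: 0 for m in names} for n in names}
    for x, y in combinations(names, 2):
        d = snp_distance(samples[x], samples[y])
        matrix[x][y] = matrix[y][x] = d
    return matrix


# Hypothetical toy data: position -> alternate allele
samples = {
    "sampleA": {100: "T", 250: "G"},
    "sampleB": {100: "T", 900: "C"},
    "sampleC": {100: "T", 250: "G", 400: "A"},
}
matrix = distance_matrix(samples)
```

In this toy data, sampleA and sampleC differ at a single position, so they would be the first pair to fall under any clustering threshold.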

Verified on Terra-Cromwell and miniwdl. When running with miniwdl, make sure to add --copy-input-files. Default inputs assume you're working with Mycobacterium tuberculosis; be sure to change them if you are working with a different organism.
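As a small illustration of the miniwdl note above, the following Python sketch assembles a miniwdl run command that includes --copy-input-files. The file names tree_nine.wdl and inputs.json are placeholders, not confirmed paths from the repository; substitute the workflow file and inputs JSON you actually use.

```python
import shlex


def miniwdl_command(wdl_path: str, inputs_json: str) -> list[str]:
    """Assemble a miniwdl invocation that copies input files into the run directory."""
    return ["miniwdl", "run", wdl_path, "--input", inputs_json, "--copy-input-files"]


# Placeholder file names; both are assumptions for illustration only
cmd = miniwdl_command("tree_nine.wdl", "inputs.json")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed string can be pasted into a shell.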

This repo also contains the following subworkflows:

Annotate
Convert to Nextstrain (for viewing in Auspice, non-clade sample annotations, etc.)
Extract
Mask tree
Mask subtree
Summarize

features

Highly scalable, even on lower-end compute
Can input a single pre-combined diff file
Includes a sample input tree created from SRA data if no input tree is specified
Trees are automatically converted to UShER (.pb), Taxonium (.jsonl.gz), Newick (.nwk), and Nextstrain (.json) formats
Automatic clustering based on configurable genetic distance
- Nextstrain tree(s) will be annotated by cluster
- Clustering can be limited to only samples specified by the user, all newly added samples, or all samples
- Clustering is also performed after backmasking
- (optional) Create per-cluster Nextstrain subtrees
(optional) Reroot the tree to a specified node
(optional) Backmask newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files
- Designed for highly clonal samples which have a plausible direct epidemiological relationship
- Backmasking can only be performed on samples which have sample-level diff files
(optional) Summarize input, reroot, and output trees with matutils
(optional) Filter out positions by coverage at that position and/or entire samples by overall coverage
(optional) Specify your own reference genome if you don't want to work with H37Rv
(optional) Annotate clades via matutils with a specified annotation TSV
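The distance-based clustering listed above can be sketched as follows. This is an illustrative Python example, not the pipeline's code: it assumes single-linkage grouping (any two samples within the threshold end up in the same cluster, transitively) and an inclusive threshold, neither of which is confirmed by this README. The function name cluster_by_distance and the candidates parameter, which stands in for limiting clustering to a chosen subset of samples, are hypothetical.

```python
def cluster_by_distance(matrix, threshold, candidates=None):
    """Group samples whose pairwise distance is <= threshold (single linkage).

    matrix: dict of dicts of pairwise distances.
    candidates: optional subset of sample names to cluster; defaults to all.
    Returns a sorted list of clusters, omitting singletons.
    """
    names = sorted(matrix) if candidates is None else sorted(candidates)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative, compressing the path
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if matrix[x][y] <= threshold:
                parent[find(x)] = find(y)  # merge the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)


# Hypothetical distances between four samples
matrix = {
    "A": {"A": 0, "B": 2, "C": 1, "D": 30},
    "B": {"A": 2, "B": 0, "C": 3, "D": 29},
    "C": {"A": 1, "B": 3, "C": 0, "D": 31},
    "D": {"A": 30, "B": 29, "C": 31, "D": 0},
}
all_clusters = cluster_by_distance(matrix, threshold=2)
subset_clusters = cluster_by_distance(matrix, threshold=2, candidates={"B", "C", "D"})
```

With all samples considered, B and C join the same cluster through A even though they are 3 SNPs apart. Restricting the candidates to B, C, and D removes that bridge, so no cluster forms at the same threshold.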

benchmarking

Formal benchmarks have not been established, but a full run of placing 60 new TB samples on an existing tree of 7000+ TB samples, converting to Taxonium and Newick formats, calculating the distance matrix, finding clusters, and creating cluster-specific Nextstrain trees executes in about five minutes on a 2019 MacBook Pro.

Backmasking is the least scalable part of the pipeline. The comparison itself theoretically scales as n², and once the comparison is complete, n backmasked diff files must be written to disk. We have observed that memory problems tend to arise during the file-writing step when n≥55 on a local machine. Runtime attributes are adjustable as task-level variables to aid with scaling on cloud backends, although we have seen the defaults handle 60 samples at a time without much issue.
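To illustrate why backmasking grows quadratically, here is a simplified Python sketch. It assumes each newly added sample can be described by a set of positions that lack data, which is a simplification of the real diff file format, and the names backmask and apply_backmask are hypothetical. Each of the n samples is compared against the other n - 1, and one output is produced per sample.

```python
def backmask(masked: dict[str, set[int]]) -> dict[str, set[int]]:
    """For each sample, also mask positions where any other sample lacks data."""
    result = {}
    for name, own in masked.items():  # n samples...
        combined = set(own)
        for other, other_masked in masked.items():  # ...each checked against the rest
            if other != name:
                combined |= other_masked
        result[name] = combined  # one backmasked output per sample
    return result


def apply_backmask(snps: dict[int, str], mask: set[int]) -> dict[int, str]:
    """Drop SNP calls that fall on backmasked positions."""
    return {pos: allele for pos, allele in snps.items() if pos not in mask}


# Hypothetical positions lacking data in three newly added samples
masked = {"s1": {10, 11}, "s2": {50}, "s3": set()}
new_masks = backmask(masked)

# s1 has a SNP at position 50, where s2 lacks data, so that call is hidden
s1_snps = apply_backmask({50: "T", 200: "G"}, new_masks["s1"])
```

In this simplified model every sample ends up with the same union of masked positions, so a single union would give the same result; the nested loop is kept only to mirror the pairwise comparison described above.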
