Published October 13, 2025 | Version 0.4.2
Software Open

github.com/aofarrel/tree_nine/tree_nine

  • 1. University of California Santa Cruz

Description

Tree Nine

Put diff files on an existing phylogenetic tree using UShER's usher sampled task with a bit of help from SRANWRP, followed by conversion of that tree to Taxonium, Newick, and Nextstrain formats. Samples' SNP distance is calculated and output as a distance matrix, and samples will be placed into clusters based on the distance.

Verified on Terra-Cromwell and miniwdl. Make sure to add --copy-input-files for miniwdl. Default inputs assume you're working with Mycobacterium tuberculosis, be sure to change them if you aren't working with that bacterium.

This repo also contains the following subworkflows:

features

  • Highly scalable, even on lower-end computes
  • Can input a single pre-combined diff file
  • Includes a sample input tree created from SRA data if no input tree is specified
  • Trees automatically converted to UsHER (.pb), Taxonium (.jsonl.gz), Newick (.nwk), and Nextstrain (.json) formats
  • Automatic clustering based on configurable genetic distance
    • Nextstrain tree(s) will be annotated by cluster
    • Clustering can be limited to only samples specified by the user, all newly added samples, or all samples
    • Clustering is also performed after backmasking
    • (optional) Create per-cluster Nextstrain subtrees
  • (optional) Reroot the tree to a specified node
  • (optional) Backmask newly-added samples against each other to hide positions where any newly-added sample lacks data, then create a new set of trees based on the backmasked diff files
    • Designed for highly clonal samples which have a plausible direct epidemiological relationship
    • Backmasking can only be performed on samples which have a sample-level diff files
  • (optional) Summarize input, reroot, and output trees with matutils
  • (optional) Filter out positions by coverage at that position and/or entire samples by overall coverage
  • (optional) Specify your own reference genome if you don't want to work with H37Rv
  • (optional) Annotate clades via matutils with a specified annotation TSV

benchmarking

Formal benchmarks have not been established, but a full run of placing 60 new TB samples on an existing 7000+ TB sample tree, conversion to taxonium and newick formats, distance matrixing, clustering finding, and creating cluster-specific Nextstrain trees executes in about five minutes on a 2019 Macbook Pro.

Backmasking is the least scalable part of the pipeline. The comparison itself theoretically scales n2 and once the comparison is completed, n backmasked disk files must be written to the disk. We have observed that memory problems tend to arise during the file-writing part when n≥55 on a local machine. Runtime attributes are adjustable as task-level variables to aid with scaling on cloud backends, although we have seen the default handle 60 samples at a time without much issue.

Files

github.com-aofarrel-tree_nine-tree_nine_0.4.2.zip

Files (13.6 kB)

Additional details