#Readme document describing the Grusz & Schuettpelz approach to obtaining homologous orthogroups from transcriptome-derived ProteinOrtho output

-> 1_ProteinOrtho_2499_orthogroups
 - ProteinOrtho produced 2499 orthogroups (OGs) that contain *at least 1* sequence from each of our 13 taxa.
 - OGs meeting the above requirement (13 taxa) were pulled from *.faa files to make a new fasta file for each orthogroup (getseqs.GB.pl); output files == Gene.OG#.faa, found within: './1_ProteinOrtho_2499_orthogroups/'
 - 2499 OGs (fastas) were each aligned automatically in MUSCLE

-> 2_RAxML_2499_trees
 - Tree inference in RAxML; trees were then rooted using Newick Utilities (nw_reroot)

-> 3_Treefix
 - 2499 OGs (.fasta alignments + rooted .ML.tre trees) were run through Treefix under the command:
   Treefix -i list_of_files.txt -s species.stree -S species.smap -A .fasta -o .ML.tre -n .treefix.tre -l treefix.log -V 1 -e "-m PROTGAMMAJTT"

-> 4_PruneParalogs
 - 2499 treefix trees were run through prune_paralogs_MO.py; prior to executing the analysis, treefix output trees were converted to one-line newick format with TextWrangler (hereafter TW) multifile search-and-replace; command:
   python prune_paralogs_MO.py '.' '13' './pruned_trees/'
 - after running the .py script, I used TW multifile search-and-replace to add ":1" branch lengths to the trees
 - yielded 930 1to1 orthologous trees and 1325 orthologous subtrees for a total of 2255 trees (trees labeled '.reroot' are intermediate files, post-ortholog pruning but prior to pruning duplicated tips; the '.ortho.tre' trees have had their duplicated tips arbitrarily removed by prune_paralogs_MO.py)

-> 5_Optroot
 - (1) prune_paralogs_MO.py file endings '.ortho.tre' and '_1to1ortho.tre' were changed to '.tre'
 - (2) 2254 pruned trees were rooted arbitrarily in nw_reroot, using command:
   for file in *; do nw_reroot $file > $file.rooted.out; done
 - (3) 2254 pruned trees were run through optroot (-d rf) to determine which OG topologies == assumed species tree, using command:
   optroot.macosx -i trees4optroot.txt -o optroot.out -d rf
 - (4) optroot '.out' was filtered for trees with rf=0 (total: 2113 trees; rf≠0: 140 trees) using 'filter_for_goodtrees.py'; retained trees were given a new suffix ('.good.tre')

-> 6_Backtranslate
 - 2113 final trees (13 taxa, 13 tips, matching expected spp. tree topology; files == OG#.good.tre, found within: './6_Backtranslate_2113_final_trees/') were then processed as follows:
 - (1) using TW multifile search-and-replace, '.good.tre' files were converted to OG-id list text files
 - (2) OG-id text files were processed through 'og_nucleotide_get.py' (to build new nuc fasta files for each OG; extension: '.dna')
 - (3) taxon IDs in '.dna' files were modified using TW to only include [A-Z]{4}-\d+
 - (4) *.phy files were converted to *.fasta (phy2fasta.py) for use with og_aa_get.py
 - (5) '.dna' files were run through './6_backtranslate/trees4backtranslate/og_aa_get.py' (to build new aa fasta files for each OG; extension: '.aa')
 - (6) line endings were removed from *.sorted files (TW) so sequences are each on a single line
 - (7) './4_backtranslated/back_translate.GB.pl' was executed using './runbacktrans.pl'
 - (8) run './4_backtranslated/fixsequences.pl'
 - (9) modify the input file extension from '.dna.sorted' to '.dna.ED.sorted' in './runbacktrans.pl' and execute './runbacktrans.pl'
 - (10) 14 OG files failed to reverse translate all taxon sequences and were excluded; final data set == 2091 trees (2099 before removing plastid-associated loci); fasta-formatted nucleotide alignments used for all HyPhy analyses == OG#.fasta, found within: './6-10_fasta4hyphy/'
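
Note on 6_Backtranslate step (3): the taxon-ID cleanup was done by hand in TW. A scripted equivalent might look like the minimal Python sketch below. It assumes the '.dna' files are FASTA with '>' header lines and that the first match of [A-Z]{4}-\d+ in each header is the ID to keep. The function name trim_fasta_headers and the example record are hypothetical and not part of the original pipeline.

```python
import re

# Pattern quoted in step (3): four capital letters, a hyphen, then digits.
TAXON_ID = re.compile(r"[A-Z]{4}-\d+")


def trim_fasta_headers(text):
    """Return FASTA text with each header reduced to its taxon ID.

    Headers with no match are left unchanged so that nothing is
    dropped silently; sequence lines are passed through untouched.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            match = TAXON_ID.search(line)
            if match:
                line = ">" + match.group(0)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical record; the real header layout in the '.dna' files may differ.
example = ">ABCD-2001234_some_annotation len=300\nATGGCC\n"
print(trim_fasta_headers(example), end="")
```

Leaving unmatched headers as-is (rather than deleting them) makes any sequence with an unexpected ID easy to spot afterwards with a simple search for long header lines.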