#Readme document describing the Grusz & Schuettpelz approach to obtaining homologous orthogroups from transcriptome-derived ProteinOrtho output

-> 1_ProteinOrtho_2499_orthogroups
	- ProteinOrtho produced 2499 orthogroups (OGs) that contain *at least 1* sequence from each of our 13 taxa.
	- OGs meeting the above requirement (13 taxa) were pulled from the *.faa files to make a new fasta file for each orthogroup (getseqs.GB.pl); output files == Gene.OG#.faa can be found within: ‘./1_ProteinOrtho_2499_orthogroups/’
	- 2499 OGs (fastas) were each aligned automatically in MUSCLE
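
A minimal sketch of the 13-taxon filter described above; it is not the getseqs.GB.pl script itself. It assumes the usual tab-separated ProteinOrtho table (three leading summary columns, then one column per input .faa file, with ‘*’ marking a taxon that has no gene in the group). The function name complete_orthogroups and the toy table are hypothetical.

```python
def complete_orthogroups(table_text, n_taxa=13):
    """Return the orthogroups in which every taxon has at least one gene.

    Each returned orthogroup is a list with one entry per taxon column,
    and each entry is the list of gene IDs for that taxon.
    """
    groups = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = line.split("\t")
        # Assumption: the first three columns are summary values
        # (species count, gene count, connectivity), not gene IDs.
        species_cols = fields[3:]
        if len(species_cols) != n_taxa:
            continue  # row does not match the expected number of taxa
        if all(cell.strip() not in ("", "*") for cell in species_cols):
            groups.append([cell.strip().split(",") for cell in species_cols])
    return groups


if __name__ == "__main__":
    # Toy table with 3 taxa instead of 13 (hypothetical data).
    toy = (
        "# Species\tGenes\tAlg.-Conn.\ta.faa\tb.faa\tc.faa\n"
        "3\t4\t0.8\tg1\tg2,g3\tg4\n"
        "2\t2\t1\tg5\t*\tg6\n"
    )
    for og in complete_orthogroups(toy, n_taxa=3):
        print(og)
```

Only the first toy row is kept, because the second row has no gene for taxon b. Rows with more than one gene per taxon are retained here, matching the "at least 1" criterion; paralogs are dealt with later in 4_PruneParalogs.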

-> 2_RAxML_2499_trees
	- Tree inference in RAxML; trees were then rooted using Newick Utilities (nw_reroot)

-> 3_Treefix
	- 2499 OGs (.fasta alignments + rooted .ML.tre trees) run through Treefix under the command: 
Treefix -i list_of_files.txt -s species.stree -S species.smap -A .fasta -o .ML.tre -n .treefix.tre -l treefix.log -V 1 -e "-m PROTGAMMAJTT"

-> 4_PruneParalogs
	- 2499 Treefix trees were run through prune_paralogs_MO.py; before the analysis, the Treefix output trees were converted to one-line newick format using TextWrangler (hereafter TW) multifile search-and-replace; command: 
python prune_paralogs_MO.py '.' '13' './pruned_trees/'
	- after running the .py script, I used TW multifile search-and-replace to add “:1” branch lengths to the trees
	- this yielded 930 1to1 orthologous trees and 1325 orthologous subtrees for a total of 2255 trees (trees labeled ‘.reroot’ are intermediate files, written after ortholog pruning but before duplicated tips were pruned; the ‘.ortho.tre’ trees have had their duplicated tips arbitrarily removed by prune_paralogs_MO.py)
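
The two TW search-and-replace edits in this step (collapsing each tree to one line, then adding “:1” branch lengths) could also be scripted. The sketch below is one possible way to do it, assuming topology-only newick strings with no whitespace inside taxon labels; the function names are hypothetical and this is not part of prune_paralogs_MO.py.

```python
import re


def to_one_line(newick_text):
    """Collapse a multi-line newick tree onto a single line.

    Assumption: labels contain no spaces, so all whitespace can be dropped.
    """
    return "".join(newick_text.split())


def add_unit_branch_lengths(newick):
    """Append ':1' to every tip and internal branch of a topology-only tree.

    The root (the final ')' before ';') is left without a branch length.
    """
    if ":" in newick:
        raise ValueError("tree already contains branch lengths")
    # Match a label or a closing parenthesis that is directly followed by
    # ',' or ')', i.e. the end of a branch, and append ':1' to it.
    return re.sub(r"([^(),;]+|\))(?=[,)])", r"\1:1", newick)


if __name__ == "__main__":
    raw = "((A,\n  B),\n C);\n"
    flat = to_one_line(raw)
    print(flat)
    print(add_unit_branch_lengths(flat))
```

The guard against existing “:” characters is a design choice: it stops the substitution from being applied twice to the same file.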

-> 5_Optroot
	- (1) prune_paralogs_MO.py file endings ‘.ortho.tre’ and ‘_1to1ortho.tre’ were changed to ‘.tre’ 
	- (2) 2254 pruned trees rooted arbitrarily with nw_reroot (‘for file in *; do nw_reroot "$file" > "$file.rooted.out"; done’)
	- (3) 2254 pruned trees run through optroot (-d rf) to determine which OG topologies == assumed species trees, using command: ‘optroot.macosx -i trees4optroot.txt -o optroot.out -d rf’  
	- (4) optroot output (‘optroot.out’) was filtered for trees with rf=0 (total: 2113 trees; rf≠0: 140 trees) using ‘filter_for_goodtrees.py’; the retained trees were given a new suffix (‘.good.tre’)
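
Steps (1) and (4) above can be sketched as two small helpers. This is an illustration only, not the contents of filter_for_goodtrees.py. The layout of optroot.out is not described here, so the parsing step is omitted and split_by_rf assumes the (tree name, rf distance) pairs have already been extracted; both function names are hypothetical.

```python
def normalized_name(filename):
    """Map the prune_paralogs_MO.py endings onto a plain '.tre' ending."""
    for suffix in ("_1to1ortho.tre", ".ortho.tre"):
        if filename.endswith(suffix):
            return filename[: -len(suffix)] + ".tre"
    return filename  # already normalized, or not a pruned tree


def split_by_rf(records):
    """Separate trees matching the species tree (rf == 0) from the rest.

    `records` is an iterable of (tree_name, rf_distance) pairs, assumed
    to have been parsed from the optroot output beforehand.
    """
    good, bad = [], []
    for name, rf in records:
        (good if rf == 0 else bad).append(name)
    return good, bad


if __name__ == "__main__":
    print(normalized_name("OG12_1to1ortho.tre"))
    print(normalized_name("OG7.ortho.tre"))
    # Hypothetical rf values, for illustration only.
    print(split_by_rf([("OG12.tre", 0), ("OG7.tre", 2)]))
```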

-> 6_Backtranslate
	- 2113 final trees (13 taxa, 13 tips, matching expected spp. tree topology; files == OG#.good.tre can be found within: ‘./6_Backtranslate_2113_final_trees/’) were then processed as follows: 
	- (1) using TW multifile search-and-replace, ‘.good.tre’ files were converted to OG-id list text files 
	- (2) OG-id text files were processed through ‘og_nucleotide_get.py’ (to build new nuc fasta files for each OG; extension: ‘.dna’)
	- (3) taxon IDs in ‘.dna’ files modified using TW to only include [A-Z]{4}-\d+ 
	- (4) *.phy files were converted to *.fasta (phy2fasta.py) for use with og_aa_get.py
	- (5) ‘.dna’ files run through ‘./6_backtranslate/trees4backtranslate/og_aa_get.py’ (to build new aa fasta files for each OG; extension: ‘.aa’)
	- (6) line endings removed from *.sorted files (TW) so sequences are each on a single line
	- (7) ‘./4_backtranslated/back_translate.GB.pl’ executed using ‘./runbacktrans.pl’
	- (8) run ‘./4_backtranslated/fixsequences.pl’
	- (9) Modify input file extension from ‘.dna.sorted’ to ‘.dna.ED.sorted’ in ‘./runbacktrans.pl’ and execute ‘./runbacktrans.pl’
	- (10) 14 OG files failed to reverse translate all taxon sequences and were excluded; final data set == 2091 trees (2099 before removing plastid-associated loci); fasta-formatted nucleotide alignments used for all HyPhy analyses == OG#.fasta can be found within: ‘./6-10_fasta4hyphy/’.
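
For readers unfamiliar with back-translation, the core idea behind steps (7)-(10) is to thread the codons of each unaligned nucleotide sequence onto its aligned amino acid sequence, inserting three gap characters wherever the protein alignment has a gap. The sketch below illustrates that idea only; it is not the logic of back_translate.GB.pl, the function name is hypothetical, and it does not check that each codon actually encodes the aligned residue.

```python
def back_translate(aligned_protein, cds, gap="-"):
    """Thread codons from an unaligned CDS onto an aligned protein.

    Raises ValueError when the CDS is too short to cover every residue,
    which is one way a sequence can fail to reverse translate.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = []
    pos = 0  # current offset into the unaligned CDS
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)  # one protein gap becomes a codon-sized gap
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    # Toy example: protein 'MK' aligned with one gap between the residues.
    print(back_translate("M-K", "ATGAAA"))
```

Because gaps are always inserted in multiples of three, the resulting nucleotide alignment stays in frame, which is what codon-based analyses expect.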