##########
# README #
##########

Readme file generated on 2024-03-22 by Francisco Pereira Lobo.


###############
# DESCRIPTION #
###############

The base directory "reproducibility/" stores both the raw data ("data/"
directory) and the code ("bin/" directory) used to generate all results
described in:

https://www.biorxiv.org/content/10.1101/2024.03.14.585020v1

It also contains the outputs (figures and R objects) produced by executing the
R code sequentially ("results/" directory).

To interactively explore the results through dynamic HTML files, open the files
"results/CALANGO/<experiment>/index.html", where <experiment> corresponds to
each of the distinct analyses reported in our publication.

Of special interest are <experiment> directories "gene2GO" and "homologous2IPR",
which respectively contain the GO terms and IPR homologs associated with NCT
discussed in our main findings.


#######################
# DIRECTORY STRUCTURE #
#######################

Starting from "reproducibility/", the directory structure is as follows:

.
├── README.txt
├── bin
├── data
│   ├── annotation
│   ├── dics
│   ├── metadata
│   ├── parameters_CALANGO
│   └── trees
└── results
    ├── CALANGO
    ├── RData
    └── figures


####################################
# DATA/DIRECTORY/FILE DESCRIPTIONS #
####################################

-- bin/

* R code used to generate the results found within the "results/" directory.

* Execute the scripts sequentially to generate all figures of our publication.

* Generating all results from scratch takes a couple of hours.
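
A minimal sketch of running the scripts in order from the "reproducibility/"
base directory. It assumes, without confirmation from this README, that the
scripts carry an ".R" extension, that sorting by file name gives the intended
execution order, and that "Rscript" is available on the PATH. The function
names are hypothetical helpers, not part of the original code.

```python
import subprocess
from pathlib import Path


def ordered_scripts(bin_dir="bin"):
    """Return the R scripts in bin_dir, sorted by file name.

    Assumption: file-name order matches the intended execution order.
    """
    return sorted(
        p for p in Path(bin_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".r"
    )


def run_all(bin_dir="bin", rscript="Rscript"):
    """Run each script in order, stopping at the first one that fails."""
    for script in ordered_scripts(bin_dir):
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run([rscript, str(script)], check=True)
```

Calling run_all() from the base directory would then execute every script in
turn. If the file names do not encode the order, pass the scripts to Rscript
by hand in the order intended by the authors.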


-- data/

* Raw data files used to generate the results found in the "results/" directory
  (phylogenetic / genome annotation / phenotype).


---- data/annotation/

 * Genome annotation data to run CALANGO.

 * Each annotation file is a two-column file that maps individual genomic
   components to annotation terms.

 * One annotation file per genome/annotation schema.

 * Annotation directories contain the following data:


------ data/annotation/SUPERFAMILY2GO/

  * genomic components: homologs as predicted by the SUPERFAMILY HMMs.
  * annotation schema: GO


------ data/annotation/gene2GO/

  * genomic components: protein-coding genes (longest isoform per locus).
  * annotation schema: GO


------ data/annotation/gene2IPR/

  * genomic components: protein-coding genes (longest isoform per locus).
  * annotation schema: IPR IDs


------ data/annotation/homologous2SUPERFAMILY/

  * genomic components: homologs as predicted by the SUPERFAMILY HMMs.
  * annotation schema: SUPERFAMILY IDs
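
To inspect the two-column annotation files described above outside of R, a
sketch along these lines may help. It assumes the files are tab-separated with
no header line, with the genomic component ID in the first column and the
annotation term in the second; check one file before relying on this. The
function name and the example IDs are made up for illustration.

```python
from collections import defaultdict


def read_annotation(lines, sep="\t"):
    """Group annotation terms by genomic component.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: column 1 is the component ID, column 2 is the term ID.
    A non-blank line without the separator raises ValueError.
    """
    terms = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        component, term = line.rstrip("\n").split(sep, 1)
        terms[component.strip()].add(term.strip())
    return dict(terms)


# Illustrative, made-up input lines
example = ["geneA\tGO:0008150\n", "geneA\tGO:0003674\n", "geneB\tGO:0008150\n"]
annotation = read_annotation(example)
```

For a real file, pass the open handle instead of the example list, for
instance: with open(path) as handle: annotation = read_annotation(handle).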


---- data/dics/

   * Stores the dictionary file linking the IDs of homologs to their biological
     roles.
   
   * Needed to run CALANGO (IPR and SUPERFAMILY annotation schemas).


---- data/metadata/

  * Metadata files linking external information (e.g. BUSCO values,
    number of protein-coding genes, major phylogenetic groups) to species.

  * Used to generate many of the figures and CALANGO output files.


---- data/parameters_CALANGO/
  
   * CALANGO configuration files used to run the tool and search for
     genotype-phenotype associations.


---- data/trees/
  
   * Phylogenetic data. Includes raw phylogenetic trees (scaffold and donor
     trees), together with the final tree used to build phylogeny-aware models
     when searching for genotype-phenotype associations.

   * Final tree produced using code found in "bin/".


-- results/

* Results produced after sequentially executing the R code in "bin/".


---- results/CALANGO/

 * Dynamic HTML files as produced by CALANGO.

 * Useful to explore/visualize the associations described in our analysis
   (hopefully ;-).

 * "gene2GO" and "homologous2IPR" contain the main results described in our
   article.

 * Open the "index.html" file within each result directory to display the main
   results page.

 * From there, navigate to the dynamic tables to explore significant
   associations.
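
As a convenience, the result pages can be listed and opened from a script. The
sketch below assumes it is run from the "reproducibility/" base directory and
relies only on the "results/CALANGO/<experiment>/index.html" layout described
above; the function names are hypothetical.

```python
import webbrowser
from pathlib import Path


def find_reports(calango_dir="results/CALANGO"):
    """Map each <experiment> directory name to its index.html path."""
    base = Path(calango_dir)
    return {p.parent.name: p for p in sorted(base.glob("*/index.html"))}


def open_report(experiment, calango_dir="results/CALANGO"):
    """Open one experiment's main results page in the default browser."""
    page = find_reports(calango_dir)[experiment]
    # as_uri() needs an absolute path, hence resolve()
    webbrowser.open(page.resolve().as_uri())
```

For example, open_report("gene2GO") would display the main page of the gene2GO
analysis, provided that directory exists under "results/CALANGO/".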


---- results/RData/ 

 * R objects as produced by CALANGO and used in downstream analyses.

    * gene2GO.R: gene-level GO annotation. Used to generate most results.

    * gene2GO_less_H_sapiens.R: gene-level GO annotation excluding H. sapiens.
      Used to generate Supplementary Figure 2.

    * gene2GO_original_NCT.R: gene-level GO annotation considering original NCT
      value for H. sapiens (254.5). Used to generate Supplementary Figure 2.

    * homologous2IPR.R: gene-level IPR annotation. Used to generate most
      results.

    * homologous2IPR_less_H_sapiens.R: gene-level IPR annotation excluding H.
      sapiens. Used to generate Supplementary Figure 2.

    * homologous2IPR_original_NCT.R: gene-level IPR annotation considering
      original NCT value for H. sapiens (254.5). Used to generate
      Supplementary Figure 2.

    * homologous2SUPERFAMILY.R: SUPERFAMILY-level annotation using homology.
      Used to compare our work with VC data.

    * SUPERFAMILY2GO.R: SUPERFAMILY-level annotation using GO terms. Used to
      compare our work with VC data.

    * homologous2SUPERFAMILY_VC.R: SUPERFAMILY-level annotation using homology
      and a "naive" statistical modelling that does not consider phylogenetic
      information when searching for associations. Used to compare our work
      with VC data.


---- results/figures/

 * Raw figures (main and supplementary) from our article.
