# Changes

These changes were made between v1 and v2:

- SAG:
  - contigs have been renamed and only those longer than 2.5 kb were kept
  - gene prediction was done with _GeneMark-EP+_
  - genes are available in _gff3_, transcripts and proteins as _Fasta_ sequences
- The DNA-dependent RNA polymerase phylogenomic tree was updated with two SAGs
  from this dataset
- Mapping public metagenomes
  - rows corresponding to metagenomes that had fewer than 100 mapped reads were
    removed from the table
  - column removed:
    - "_used_for_UMAP_": after the removal of some rows, this column contained
      only '_yes_' values
  - columns added:
    - "_clean_bases_"
    - "_Nb_MAG_detected_sup_10pct_"
    - "_cluster_"
- A metadata table for MAGs and SAGs has been added; it contains the assembly
  statistics, BUSCO results and taxonomic affiliation

# Archive contents

This Zenodo record contains the following items:

## Single Amplified Genome (SAG)

The files present in the _Zip_ archive `SAG.zip` are:

- `genome/*.fa.gz`: SAG genomes
- `gene_prediction/<SAG>.gff3.gz`: gene predictions
- `gene_prediction/<SAG>.transcript.fa.gz`: transcript sequences
- `gene_prediction/<SAG>.protein.fa.gz`: protein sequences

A summary for each SAG is available in the spreadsheet `MAG_SAG_metadata.ods`.

## Metagenome-Assembled Genomes (MAG)

The files present in the _Zip_ archive `MAG.zip` are:

- `euk_mag/*.fa.gz`: MAG genomes
- `gene_prediction_final/<MAG>.<GeneCaller>.gff3.gz`: gene predictions
- `gene_prediction_final/<MAG>.<GeneCaller>.transcript.fa.gz`: transcript sequences
- `gene_prediction_final/<MAG>.<GeneCaller>.prot.fa.gz`: protein sequences

Some MAGs for which gene prediction failed (sequence too short or other reasons)
and that were not used in the analysis based on the mapping of public metagenomes
are available in the directory `not_used/`.

A summary for each MAG is available in the spreadsheet `MAG_SAG_metadata.ods`.

## Metagenomic assemblies and bins

The files present in the _Zip_ archive `metagenomic_assemblies.zip` are:

- `contigs_db/`: the 4 _anvi'o_ contigs databases that contain the contig sequences
- `merged_profiles_db`: the 4 _anvi'o_ profile databases that contain the mapping
  information; the binning results are stored in these databases
- `README_metagenomic_assemblies.md` : a text file with more details

**WARNING:** the files have been compressed to `.bz2` and must be decompressed
(`bunzip2`) before use. They use about 110 GB of disk space.

Even though these artifacts were generated with a previous version of _anvi'o_,
they can still be used after being updated with the _anvi'o_ command
`anvi-migrate`.
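
For users who prefer not to call `bunzip2` by hand, the decompression step can be sketched in Python with the standard library only. This is an illustrative sketch, not a tool shipped with this record: the function names are hypothetical, and it assumes the archive has already been unzipped into a local directory. The compressed copies are kept, so extra disk space is needed until they are deleted.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(path: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Decompress one .bz2 file next to the original and return the new path."""
    target = path.with_suffix("")  # drops only the trailing ".bz2"
    # Stream in chunks so that large databases never have to fit in memory.
    with bz2.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return target


def decompress_tree(root: Path) -> list[Path]:
    """Decompress every *.bz2 file found under `root` (compressed copies are kept)."""
    return [decompress_bz2(p) for p in sorted(root.rglob("*.bz2"))]
```

For example, `decompress_tree(Path("metagenomic_assemblies"))` would process the whole extracted archive, where the directory name is an assumption about where the _Zip_ archive was unpacked.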

## Phylogenetic tree

All files are provided in the _Zip_ archive `phylogenetic_tree.zip`.

### The _DNA-dependent RNA polymerase_

The files present in the directory `phylogenetic_tree/DNA_dependent_RNA_pol` are:

- `00_hmm_profiles/` : the HMM profiles that target the DNA-dependent RNA polymerase sub-units
- `01_sequences/` : best hit for each sub-unit
- `02_alignments_raw/` : raw alignment, by _MAFFT v7.526_
- `03_alignments_cleaned/` : alignment after _goalign clean sites -c 0.5_
- `merge_alignments.py` : Python script to concatenate the 6 sub-units
- `RNAP_aln_v6_concat.fa` : the concatenated alignment
- `04_tree/` : the run of _IQ-Tree v2.4.0_
- `metadata_RNAP_tree.tsv` : Metadata to decorate the tree

The Python script `merge_alignments.py` requires Python 3 and Biopython
(https://biopython.org/).
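
To illustrate what concatenating the sub-unit alignments involves, here is a minimal sketch that uses only the Python standard library. It is not the `merge_alignments.py` script provided in the archive, which relies on Biopython. The sketch assumes that each sub-unit alignment is a FASTA file in which the first word of the header identifies the genome, and that a genome lacking a sub-unit should be padded with gaps; both points are assumptions made for the example.

```python
def read_fasta(path):
    """Return {record_id: sequence} for a FASTA file (ID = first word of the header)."""
    records, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                current = line[1:].split()[0]
                records[current] = []
            elif line and current is not None:
                records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def concatenate_alignments(alignments):
    """Concatenate a list of {id: aligned_sequence} dicts, gap-padding missing IDs."""
    all_ids = sorted(set().union(*alignments))
    merged = {name: [] for name in all_ids}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share the same length")
        width = lengths.pop()
        for name in all_ids:
            # A genome absent from this alignment gets a run of gaps of equal width.
            merged[name].append(aln.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in merged.items()}
```

The alignments are joined in the order in which they are passed, so the caller decides the order of the sub-units in the concatenated result.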

The file `metadata_RNAP_tree.tsv` lists the reference genomes. The "source"
column corresponds to:

- _Mendota_: Krinos et al., 2024, Microbiome (https://doi.org/10.1186/s40168-024-01831-y).
  MAGs are available at https://osf.io/9epa8/?view_only=152af26e11894ac0bcdfe542e02c6ab1
- _public_database_ : EBI / NCBI / DDBJ
- _METDB_ : https://metdb.sb-roscoff.fr/metdb/. DNA-dependent RNA polymerase
  sequences are available at https://www.genoscope.cns.fr/tara/
  section _Curated DNA-dependent RNA polymerase_
- _Tara_ : Delmont et al., 2022, Cell Genomics (https://doi.org/10.1016/j.xgen.2022.100123).
  DNA-dependent RNA polymerase sequences are available at
  https://www.genoscope.cns.fr/tara/ section _Curated DNA-dependent RNA polymerase_

### _Phylosift_ tree

The files present in the directory `phylogenetic_tree/phylosift_tree/` are :

- `01_marker_present/` : markers identified and aligned by _Phylosift_
- `02_marker_selected/` : markers selected for the tree, i.e. those present in at least 50% of the genomes
- `03_marker_alignment_cleaned/` : alignments cleaned by _Trimal v1.5_ with the parameter `-automated1`
- `04_phylosift_concatenated_alignment.fa` : concatenation of the 50 markers
- `05_tree/` : phylogeny built by _IQ-Tree v2.2.3_
- `metadata_phylosift_tree.csv` : list of reference genomes and their taxonomy.
  This file can be directly used to decorate the tree visualised with
  [TreeViewer](https://github.com/arklumpus/TreeViewer/).
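
The marker selection rule (markers present in at least 50% of the genomes) can be expressed as a short Python sketch. The function name and the input structure, a mapping from genome identifier to the set of markers found in that genome, are hypothetical and chosen for illustration; the sketch does not reproduce the actual selection procedure used for this record.

```python
def select_markers(presence, min_fraction=0.5):
    """Return the markers found in at least `min_fraction` of the genomes.

    `presence` maps each genome identifier to the set of markers found in it.
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        return []
    counts = {}
    for markers in presence.values():
        for marker in set(markers):  # count each marker once per genome
            counts[marker] = counts.get(marker, 0) + 1
    return sorted(m for m, c in counts.items() if c / n_genomes >= min_fraction)
```

With four genomes, a marker found in two of them sits exactly on the 50% threshold and is retained, because the comparison is inclusive.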

### Taxonomy affiliation

The file `phylogenetic_tree/taxonomy_affiliation_MAG_SAG.tsv` summarises the
taxonomic affiliation proposed for the MAGs and SAGs in which a sufficient
number of markers were present.

The final taxonomic affiliation is also available in the spreadsheet
`MAG_SAG_metadata.ods`.

## Mapping public metagenomes

The files present in the _Zip_ archive `mapping_public_metagenomes.zip` are :

- `public_metagenomes_metadata.tsv` : the metagenomes metadata table. The columns
  are:

  - accession : identifier in the public databases, except for datasets of the project
  - origin : where the metagenome was collected
  - origin_simplified : simplified version, as some names were long
  - country : country in which the sample was taken
  - broad_geo_region : the UN geoscheme code corresponding to the country (https://en.wikipedia.org/wiki/United_Nations_geoscheme)
  - dataset : this project or public data
  - DLATITUDE : latitude, in decimal degrees
  - DLONGITUDE : longitude, in decimal degrees
  - salinity : relation of the sample to the salinity
  - ECOSYSTEM.TYPE : type of ecosystem sampled
  - MINIMUM.SIZE.FRACTION : when available, the pore size on which the genetic material was collected
  - MAXIMUM.SIZE.FRACTION : when available, the pore size used to prefilter the sample
  - SAMPLE.MATERIAL : nature of the sample, mostly water
  - clean_reads : number of reads used for the mapping
  - clean_bases : number of bases from the "clean_reads"
  - mapped_reads : number of reads that mapped on the MAGs and SAGs of this project
  - filtered_reads : number of reads that passed the filters from _mSamtools_
  - Nb_MAG_detected_sup_10pct : number of MAGs and SAGs that were detected at more
    than 10 % (breadth of coverage) in the metagenome
  - cluster : the cluster the metagenome belongs to

- The table that summarises the read counts is `public_metagenomes_read_count_on_MAGs_and_SAGs.tsv`.
  The first column holds the MAG and SAG identifiers,
  and each of the other 3097 columns represents one public metagenome. The values
  in this file are the number of reads mapped per MAG/SAG per
  public metagenome, after filtering the mapping with _mSamtools_ v1.1.0
  and the parameters _filter -b -l 50 -p 95 -z 80_.

- The file `public_metagenomes_detection_of_MAGs_and_SAGs.tsv` summarises the
  breadth of coverage, in percent, of each MAG/SAG in each public metagenome.
  A value of "100" means that all positions of a given genome are covered by
  at least one read from the given metagenome, and "0" means that no read from
  the metagenome mapped on the MAG/SAG.
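
As an illustration of how the detection table can be used, the sketch below counts, for each metagenome, the genomes whose breadth of coverage exceeds 10 %, which is how the `Nb_MAG_detected_sup_10pct` column is described. It assumes that the detection table has the same layout as the read count table (first column with the MAG/SAG identifiers, then one column per metagenome) and that every cell holds a number; the function name is hypothetical and this is not the code used to produce the metadata table.

```python
import csv


def count_detected(tsv_path, threshold=10.0):
    """Return {metagenome: number of genomes with breadth of coverage > threshold}."""
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        metagenomes = header[1:]  # first column holds the MAG/SAG identifiers
        counts = dict.fromkeys(metagenomes, 0)
        for row in reader:
            for name, value in zip(metagenomes, row[1:]):
                if float(value) > threshold:  # strictly more than the threshold
                    counts[name] += 1
    return counts
```

Reading the file row by row keeps memory use low even with several thousand metagenome columns.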

## Unigenes

The files stored in the _Zip_ archive `unigenes.zip` are :

- `unigenes_sequences.fa.gz` : unigene sequences, cleaned of
  contamination (human, metazoans, bacteria, archaea and viruses)
- `table_readCount.noHuman.noConta.noMetazoa.annot.tsv.gz`: counts of
  reads mapped on the unigenes, plus functional annotations: KEGG KO, Pfam and GO
  (derived from Pfam)
- `table_taxonomy.perUnigene.allUnigenes.tsv.gz` : unigenes taxonomic
  annotation

See also the work of [**Monjot et al., 2023**](https://doi.org/10.1111/1462-2920.16531).
