Metatranscriptomic unigenes catalog of MICROSTORE project
Authors/Creators
- 1. CNRS, Laboratoire Microorganismes : Génome et Environnement, Université Clermont Auvergne, Clermont-Ferrand, F-63000, France.
- 2. Genoscope, Institut de Biologie François Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France.
- 3. Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France.
Description
Extracted from Monjot et al., 2023
Sequencing data are archived at ENA under accession number PRJEB61527.
The metatranscriptome-derived unigene catalog and the unigene expression estimates were obtained as described in Carradec et al. (2018). Paired-end reads from each metatranscriptomic sample were assembled using velvet (v1.2.07) with a k-mer size of 89. Isoform detection was performed using oases (v0.2.08). Contigs shorter than 150 bp were removed from further analysis. Contig redundancy was removed using CD-HIT-EST (v4.6.1) with the following parameters: -id 95 -aS 90 (95% nucleotide identity over 90% of the length of the shortest sequence). For each cluster of contigs, the longest sequence was kept as the reference for the unigene catalog. To estimate the expression of each unigene in each sample, cleaned reads were mapped against the reference catalog using bwa (v0.7.15) with the following parameters: bwa aln -l 30 -O 11 -R 1; bwa sampe -a 20000 -n 1 -N; samtools rmdup. Low-complexity reads were removed. Alignments covering at least 80% of the read length with at least 95% identity were retained for further analysis. When a read had several equally good best matches, one was picked at random.
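The read-retention rule described above (at least 80% of the read covered, at least 95% identity, random choice among tied best matches) can be illustrated with a minimal Python sketch. It is not the authors' code: the `Alignment` record, its field names, and the use of the number of matching bases as the "best match" criterion are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record; field names are illustrative, not taken from bwa or samtools.
    read_id: str
    unigene_id: str
    read_length: int      # full length of the read
    aligned_length: int   # number of read bases covered by the alignment
    matches: int          # identical bases within the aligned region


def passes_filter(aln, min_coverage=0.80, min_identity=0.95):
    """Return True if the alignment covers enough of the read at high enough identity."""
    if aln.read_length == 0 or aln.aligned_length == 0:
        return False
    coverage = aln.aligned_length / aln.read_length
    identity = aln.matches / aln.aligned_length
    return coverage >= min_coverage and identity >= min_identity


def pick_hits(alignments, rng):
    """Map each read to one unigene, breaking ties among best matches at random.

    Assumption: "best" means the highest number of matching bases.
    """
    by_read = {}
    for aln in alignments:
        if passes_filter(aln):
            by_read.setdefault(aln.read_id, []).append(aln)

    chosen = {}
    for read_id, candidates in by_read.items():
        best_score = max(a.matches for a in candidates)
        best = [a for a in candidates if a.matches == best_score]
        chosen[read_id] = rng.choice(best).unigene_id
    return chosen
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible between runs.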
Proteins were predicted from all unigenes with TransDecoder.LongOrfs followed by TransDecoder.Predict (v5.5.0) using the default parameters. Unigenes without a predicted protein were then submitted to a second run with a minimum protein length of 70 (-m 70). Finally, the predicted proteins were screened against the AntiFam database (v7.0) (Eberhardt et al., 2012) with hmmsearch using the --cut_ga parameter (Eddy, 2011).
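The bookkeeping between the two prediction passes and the AntiFam screen can be sketched as follows. This is an illustrative sketch, not the pipeline used for the catalog: it assumes that predicted protein identifiers are the unigene identifier followed by a `.p<N>` suffix, and that AntiFam matches are read from an hmmsearch `--tblout` table whose first column is the target sequence name. The function names are hypothetical.

```python
import re

# Assumption: protein IDs look like "<unigene_id>.p1", "<unigene_id>.p2", ...
_ORF_SUFFIX = re.compile(r"\.p\d+$")


def parent_unigene(protein_id):
    """Strip the assumed '.p<N>' suffix to recover the unigene identifier."""
    return _ORF_SUFFIX.sub("", protein_id)


def unigenes_for_second_pass(unigene_ids, protein_ids):
    """Return unigenes that received no predicted protein in the first pass."""
    with_protein = {parent_unigene(p) for p in protein_ids}
    return [u for u in unigene_ids if u not in with_protein]


def antifam_hits(tblout_lines):
    """Collect target names from hmmsearch --tblout lines, skipping comments."""
    hits = set()
    for line in tblout_lines:
        if line.startswith("#") or not line.strip():
            continue
        hits.add(line.split()[0])
    return hits


def remove_spurious(protein_ids, tblout_lines):
    """Drop proteins that matched an AntiFam profile."""
    spurious = antifam_hits(tblout_lines)
    return [p for p in protein_ids if p not in spurious]
```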
KEGG Orthology (KO) identifiers were assigned with KoFamScan (v1.3.0) using the KO HMM profiles (2022-01-03 release). For proteins without a significant hit, the best hit with an e-value < 1e-5 was retained, as described in Hu et al. (2018).
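The two-tier KO assignment can be sketched as below. This is a minimal sketch under stated assumptions: hits are given as plain tuples `(protein_id, ko, score, threshold, evalue)`, a hit is treated as significant when its score reaches the profile threshold, and the fallback "best hit" is the one with the lowest e-value. The tuple layout and the function name are hypothetical and do not reproduce the KoFamScan output format.

```python
from collections import defaultdict


def assign_ko(hits, max_evalue=1e-5):
    """Assign KO identifiers per protein.

    hits: iterable of (protein_id, ko, score, threshold, evalue) tuples
          (hypothetical layout; threshold may be None when no cutoff is defined).
    Returns a dict mapping protein_id to a list of KO identifiers.
    """
    by_protein = defaultdict(list)
    for hit in hits:
        by_protein[hit[0]].append(hit)

    assigned = {}
    for protein_id, protein_hits in by_protein.items():
        # Tier 1: hits whose score reaches the profile-specific threshold.
        significant = [
            h for h in protein_hits if h[3] is not None and h[2] >= h[3]
        ]
        if significant:
            assigned[protein_id] = sorted({h[1] for h in significant})
            continue
        # Tier 2: otherwise keep the single best hit below the e-value cutoff.
        fallback = [h for h in protein_hits if h[4] < max_evalue]
        if fallback:
            best = min(fallback, key=lambda h: h[4])
            assigned[protein_id] = [best[1]]
    return assigned
```

Proteins with neither a significant hit nor a hit below the cutoff are simply absent from the returned dictionary.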
Taxonomic affiliation was performed on proteins with the MMseqs2 suite (v407b315) (Steinegger & Söding, 2017) against the MetaEuk database (Levy Karin et al., 2020). Taxonomy was assigned with mmseqs taxonomy using the parameters --tax-lineage 1 --lca-mode 2 --max-seqs 100 -e 0.00001 -s 6 --max-accept 100. The unigene catalog was cleaned of contaminants by excluding proteins and unigenes affiliated with humans, Bacteria, Archaea, viruses and metazoans.
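The contaminant-removal step can be illustrated with a short Python sketch that filters proteins by lineage. It is not the authors' script: it assumes each protein comes with a semicolon-separated lineage string whose entries may carry a one-character rank prefix such as `d_`, it uses the names `Bacteria`, `Archaea`, `Viruses`, `Metazoa` and `Homo sapiens` as stand-ins for the excluded groups, and it keeps proteins with no taxonomic assignment. All three choices are assumptions.

```python
# Assumed taxon names for the excluded groups listed in the text.
EXCLUDED_TAXA = {"Bacteria", "Archaea", "Viruses", "Metazoa", "Homo sapiens"}


def lineage_names(lineage):
    """Split a lineage string into taxon names, dropping an optional 'x_' rank prefix."""
    names = []
    for part in lineage.split(";"):
        part = part.strip()
        if len(part) > 2 and part[1] == "_":
            part = part[2:]
        if part:
            names.append(part)
    return names


def is_contaminant(lineage):
    """Return True if any taxon in the lineage belongs to an excluded group."""
    return any(name in EXCLUDED_TAXA for name in lineage_names(lineage))


def clean_catalog(protein_lineages):
    """Keep only proteins whose lineage contains no excluded taxon.

    protein_lineages: dict mapping protein_id to its lineage string
    (an empty string stands for an unassigned protein, which is kept here).
    """
    return {
        protein_id: lineage
        for protein_id, lineage in protein_lineages.items()
        if not is_contaminant(lineage)
    }
```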
References
Carradec, Q., Pelletier, E., Da Silva, C., Alberti, A., Seeleuthner, Y., Blanc-Mathieu, R., et al. (2018) A global ocean atlas of eukaryotic genes. Nat Commun 9: 373.
Eddy, S.R. (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7: e1002195.
Hu, S.K., Liu, Z., Alexander, H., Campbell, V., Connell, P.E., Dyhrman, S.T., et al. (2018) Shifting metabolic priorities among key protistan taxa within and below the euphotic zone. Environmental Microbiology 20: 2865–2879.
Levy Karin, E., Mirdita, M., and Söding, J. (2020) MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 8: 48.
Steinegger, M. and Söding, J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35: 1026–1028.
Files
(2.4 GB)
| MD5 checksum | Size |
|---|---|
| 8b745ef5f33b0417377e5c015f550a2a | 1.5 GB |
| 2d9c8fd80e34e6664c62d5e9f0843c96 | 864.4 MB |