Metatranscriptomic unigenes catalog of MICROSTORE project

Monjot Arthur; Bronner Gisèle; Courtine Damien; Cruaud Corinne; Da Silva Corinne; Aury Jean-Marc; Moné Anne; Vellet Agnès; Wawrzyniak Ivan; Colombet Jonathan; Billard Hermine; Debroas Didier; Lepère Cécile

doi:10.5281/zenodo.8376851

Published December 25, 2023 | Version v1

Dataset Open

1. CNRS, Laboratoire Microorganismes : Génome et Environnement, Université Clermont Auvergne, Clermont-Ferrand, F-63000, France.
2. Genoscope, Institut de Biologie François Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France.
3. Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France.

Extracted from Monjot et al., 2023

Sequencing data are archived at ENA under accession number PRJEB61527.

The metatranscriptome-derived unigene catalog and the expression estimates were obtained as described in Carradec et al. (2018). Paired-end reads from each metatranscriptomic sample were assembled using velvet (v1.2.07) with a k-mer size of 89. Isoform detection was performed using oases (v0.2.08). Contigs shorter than 150 bp were removed from further analysis. Contig redundancy was removed using CD-HIT-EST (v4.6.1) with the following parameters: -id 95 -aS 90 (95% nucleotide identity over 90% of the length of the shortest sequence). For each cluster of contigs, the longest sequence was kept as the reference for the unigene catalog. To estimate the expression of each unigene in each sample, cleaned reads were mapped against the reference catalog using bwa (v0.7.15) with the following parameters: bwa aln -l 30 -O 11 -R 1; bwa sampe -a 20000 -n 1 -N; samtools rmdup. Low-complexity reads were removed. Reads aligned over at least 80% of their length with at least 95% identity were retained for further analysis. When a read had several equally good best matches, one was picked at random.
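The read-retention rule described above (at least 80% of the read aligned, at least 95% identity, random choice among tied best matches) can be illustrated with a minimal Python sketch. The `Alignment` record, its field names, and the way coverage and identity are computed are assumptions made for illustration; the record does not state how the original pipeline derived these values from the bwa output.

```python
import random
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one read-to-unigene alignment."""
    read_id: str
    unigene_id: str
    read_length: int     # full length of the read, in bases
    aligned_length: int  # number of read bases covered by the alignment
    matches: int         # identical bases within the aligned region
    score: int           # alignment score, higher is better


def passes_filter(aln, min_coverage=0.80, min_identity=0.95):
    """Keep alignments covering >= 80% of the read with >= 95% identity."""
    if aln.read_length == 0 or aln.aligned_length == 0:
        return False
    coverage = aln.aligned_length / aln.read_length
    identity = aln.matches / aln.aligned_length
    return coverage >= min_coverage and identity >= min_identity


def pick_best(alignments, rng=None):
    """Return one retained alignment for a read, or None if none passes.

    Ties between equally scored best matches are broken at random.
    """
    rng = rng or random.Random()
    kept = [a for a in alignments if passes_filter(a)]
    if not kept:
        return None
    best_score = max(a.score for a in kept)
    best = [a for a in kept if a.score == best_score]
    return rng.choice(best)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible between runs.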

Proteins were predicted from all unigenes with TransDecoder.LongOrfs followed by TransDecoder.Predict (v5.5.0) using the default parameters. Unigenes without a predicted protein were then used for a second run with a minimum protein length of 70 amino acids (-m). Finally, the predicted proteins were screened against the AntiFam database (v7.0) (Eberhardt et al., 2012) with hmmsearch using the --cut_ga parameter (Eddy, 2011).
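As an illustration of the two-pass prediction, the following Python sketch selects the unigenes that should enter the second run. It assumes that protein identifiers follow the TransDecoder naming pattern in which `.p<N>` is appended to the source sequence identifier; the function names and the example identifiers are hypothetical.

```python
import re

# Assumed identifier pattern: "<unigene_id>.p<N>" for each predicted protein.
_PROTEIN_SUFFIX = re.compile(r"\.p\d+$")


def unigene_of(protein_id):
    """Strip the trailing '.p<N>' to recover the unigene identifier."""
    return _PROTEIN_SUFFIX.sub("", protein_id)


def unigenes_for_second_pass(all_unigenes, first_pass_proteins):
    """List the unigenes that yielded no protein in the first run."""
    with_protein = {unigene_of(p) for p in first_pass_proteins}
    return [u for u in all_unigenes if u not in with_protein]
```

For example, `unigenes_for_second_pass(["u1", "u2", "u3"], ["u1.p1", "u3.p1"])` leaves only `"u2"` for the second run.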

KEGG Orthology (KO) identifiers were assigned with KoFamScan (v1.3.0) using the KO HMM profiles (2022-01-03 release). For proteins without a significant hit, the best hit with an e-value < 1e-5 was retained, as described in Hu et al. (2018).
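The KO assignment rule can be sketched in Python as follows. The `KoHit` record and its fields are hypothetical stand-ins for parsed KoFamScan output. Choosing the highest-scoring hit when several are significant, and the lowest e-value as the "best" fallback hit, are assumptions; the description does not specify either.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KoHit:
    """Hypothetical parsed hit of one protein against one KO profile."""
    protein_id: str
    ko: str
    score: float
    evalue: float
    significant: bool  # True when the hit passes the KO-specific threshold


def assign_ko(hits: List[KoHit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the KO for one protein, or None when no hit qualifies."""
    significant = [h for h in hits if h.significant]
    if significant:
        # Assumption: prefer the highest-scoring significant hit.
        return max(significant, key=lambda h: h.score).ko
    # Fallback: best non-significant hit with e-value below the cutoff.
    fallback = [h for h in hits if h.evalue < max_evalue]
    if fallback:
        return min(fallback, key=lambda h: h.evalue).ko
    return None
```

The function expects all hits of a single protein, so hits should be grouped by `protein_id` before it is called.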

Taxonomic affiliation was performed on proteins with the MMseqs2 suite (v407b315) (Steinegger & Söding, 2017) against the MetaEuk database (Levy Karin et al., 2020). Taxonomy was assigned with mmseqs taxonomy using the parameters --tax-lineage 1 --lca-mode 2 --max-seqs 100 -e 0.00001 -s 6 --max-accept 100. Contaminants were removed from the unigene catalog by excluding proteins and unigenes affiliated with human, Bacteria, Archaea, viruses, and Metazoa.
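The contaminant-removal step can be sketched as a filter on lineage strings. This Python example assumes that each unigene carries a semicolon-separated lineage, optionally with one-character rank prefixes such as `d_` or `k_`; the exact lineage layout and the taxon labels in `EXCLUDED_TAXA` are assumptions and should be checked against the taxonomy table before use.

```python
# Assumed labels for the excluded groups; adjust to the actual taxonomy names.
EXCLUDED_TAXA = {"Homo sapiens", "Bacteria", "Archaea", "Viruses", "Metazoa"}


def lineage_names(lineage):
    """Split a lineage string into taxon names, dropping rank prefixes."""
    names = []
    for field in lineage.split(";"):
        field = field.strip()
        # Drop an assumed one-character rank prefix such as "d_" or "-_".
        if len(field) > 2 and field[1] == "_":
            field = field[2:]
        if field:
            names.append(field)
    return names


def is_contaminant(lineage, excluded=EXCLUDED_TAXA):
    """True when any taxon in the lineage belongs to an excluded group."""
    return any(name in excluded for name in lineage_names(lineage))


def clean_catalog(lineage_by_unigene):
    """Keep only the unigenes whose lineage is free of excluded taxa."""
    return {
        unigene: lineage
        for unigene, lineage in lineage_by_unigene.items()
        if not is_contaminant(lineage)
    }
```

Matching on whole taxon names rather than substrings avoids discarding lineages that merely contain an excluded word inside a longer name.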

References

Carradec, Q., Pelletier, E., Da Silva, C., Alberti, A., Seeleuthner, Y., Blanc-Mathieu, R., et al. (2018) A global ocean atlas of eukaryotic genes. Nat Commun 9: 373.

Eddy, S.R. (2011) Accelerated Profile HMM Searches. PLoS Comput Biol 7: e1002195.

Hu, S.K., Liu, Z., Alexander, H., Campbell, V., Connell, P.E., Dyhrman, S.T., et al. (2018) Shifting metabolic priorities among key protistan taxa within and below the euphotic zone. Environmental Microbiology 20: 2865–2879.

Levy Karin, E., Mirdita, M., and Söding, J. (2020) MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 8: 48.

Steinegger, M. and Söding, J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35: 1026–1028.

Files (2.4 GB)

Name	MD5	Size
main_table.mapping.unique.raw.noHuman.noConta.noMetazoa.annot.tsv	8b745ef5f33b0417377e5c015f550a2a	1.5 GB
table_taxonomy.perUnigene.allUnigenes.tsv	2d9c8fd80e34e6664c62d5e9f0843c96	864.4 MB
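The MD5 checksums listed with the files can be used to check the integrity of a download. The following Python sketch reads each file in chunks, since both are large, and compares the digest with the listed value. The function names and the assumption that both files sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "main_table.mapping.unique.raw.noHuman.noConta.noMetazoa.annot.tsv":
        "8b745ef5f33b0417377e5c015f550a2a",
    "table_taxonomy.perUnigene.allUnigenes.tsv":
        "2d9c8fd80e34e6664c62d5e9f0843c96",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads("path/to/downloads")` returns `False` for any file that is missing or whose checksum differs from the listed one.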

	All versions	This version
Views	205	203
Downloads	76	76
Data volume	106.3 GB	106.3 GB
