Published July 10, 2018 | Version v1
Journal article Open

The evolutionary signal in metagenome phyletic profiles predicts many gene functions

  • 1. Faculty of Information Studies, 8000, Novo Mesto, Slovenia
  • 2. Division of Electronics, Rudjer Boskovic Institute, 10000, Zagreb, Croatia
  • 3. Department of Knowledge Technologies, Jozef Stefan Institute, 1000, Ljubljana, Slovenia
  • 4. Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028, Barcelona, Spain

Description

Background: The functions of many genes remain unknown, even in model organisms. The increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function systematically.

Results: We evaluated whether the evolutionary signal contained in metagenome phyletic profiles (MPPs) is predictive of a broad array of gene functions. MPPs encode environmental DNA sequencing data as the relative abundances of gene families across metagenomes. We find that MPPs can accurately predict 826 Gene Ontology functional categories when drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. In this task, the MPPs are highly accurate, and they cover a set of Gene Ontology terms that is largely complementary to the set covered by standard phylogenetic profiles derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundances obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistent with this, simulations on more than 5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, whereas the diversity of sampled environments appears to be the critical factor for obtaining robust models.
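To make the encoding concrete, the following is a minimal sketch of how a phyletic profile could be assembled from per-metagenome gene family counts. It is an illustration only, not the authors' pipeline: the input layout, the function name `phyletic_profiles`, the toy sample and family names, and the simple normalization by each sample's total count are all assumptions, and the article itself should be consulted for the actual normalization used.

```python
from typing import Dict, List


def phyletic_profiles(counts: Dict[str, Dict[str, int]]) -> Dict[str, List[float]]:
    """Build one relative-abundance vector per gene family.

    counts[metagenome][gene_family] holds a raw count (hypothetical layout).
    The result maps each gene family to a list with one entry per
    metagenome, ordered by sorted metagenome name.
    """
    samples = sorted(counts)
    families = sorted({fam for s in samples for fam in counts[s]})
    profiles: Dict[str, List[float]] = {fam: [] for fam in families}
    for s in samples:
        total = sum(counts[s].values())
        for fam in families:
            # Assumed normalization: share of the sample's total count.
            # Families absent from a sample get abundance 0.
            rel = counts[s].get(fam, 0) / total if total else 0.0
            profiles[fam].append(rel)
    return profiles


# Toy data with made-up sample and family names.
toy_counts = {
    "gut_1": {"famA": 30, "famB": 10},
    "ocean_1": {"famA": 5, "famC": 15},
}
profiles = phyletic_profiles(toy_counts)
for family, vector in profiles.items():
    print(family, vector)
```

Each resulting vector is one gene family's profile across the sampled environments. In a function-prediction setting, such vectors would serve as the feature rows passed to a classifier, with one label set per Gene Ontology term.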

Conclusions: In past work, metagenomics has provided invaluable insight into the ecology of various habitats, the diversity of microbial life, and human health and disease mechanisms. We propose that environmental DNA sequencing is additionally a useful tool for predicting the biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches.

Files

40168_2018_506_MOESM1_ESM.pdf

Files (31.5 MB)

Checksum  Size
md5:4ff6915e1f6e6bc239aff9fe6d5dc40b  1.7 MB
md5:149ce1ca9e34fbbf949defb7f196eed5  306.1 kB
md5:048d6ed3d1b41a2e83f0d62f8d7b2894  19.4 kB
md5:52cfee1f08d1437bf019a3b724241c7a  119.9 kB
md5:9cfe7f709b16cdeba0565f01e3ee6d45  156.1 kB
md5:069abc4f60b5c47ec3edd459d6199774  7.2 MB
md5:dbfd302dbb0dffad3f8433bfdbfcd3af  400.9 kB
md5:5216d71ed7e32efad0336394e5a105d4  18.5 MB
md5:e8ea3971c0ab67257d85b2647fcfb025  143.9 kB
md5:60239366837ade95c131c0bc52c748e3  3.0 MB
md5:dc802c06bd57c8ddf2ed11f111f70723  18.8 kB

Additional details

Funding

European Commission: MAESTRA – Learning from Massive, Incompletely annotated, and Structured Data (grant 612944)