There is a newer version of the record available.

Published October 2025 | Version 1.0
Dataset Open

MycoMobilome: A non-redundant database of transposable element consensus sequences for the fungal kingdom

  • 1. Université de Neuchâtel Institut de Biologie
  • 2. University of Neuchâtel

Description

 
For more information on using the database, how it was constructed, best practices, and how to contribute, please visit the MycoMobilome GitHub repository (https://github.com/TobyBaril/MycoMobilome).

Three versions of the database are provided:

  • MycoMobilome_v1.0-allConsensus_TE_library.fasta: All known and unknown TE consensus sequences detected across fungal diversity. Recommended for most use cases.
  • MycoMobilome_v1.0-proteinEvidence_TE_library.fasta: All TE consensus sequences with ORF hits to known TE proteins. Note the evidence markers in sequence headers, and that this subset will not contain any non-autonomous TEs (e.g. SINEs, MITEs, solo LTRs).
  • MycoMobilome_v1.0-unknown_TE_library.fasta: All TE consensus sequences with NO protein evidence supporting their status as true TEs. These may still be real TEs, given how little is known about TE diversity across the kingdom. Many are likely non-autonomous elements, such as MITEs (non-autonomous DNA elements), solo LTRs, and SINEs, which will NOT be found in the proteinEvidence subset. However, some sequences are also likely to be erroneous, so use this subset with care.

In addition to these three database files, the following files are also provided:

  • MycoMobilome_v1.0_assemblyRecord.xlsx: A record of all publicly available genome assemblies used to generate MycoMobilome. Here, you will find information on assembly length, N50, L50, GC content, species phylogenetic information, genome assembly source and ID, publication, and BUSCO scores.
  • MycoMobilome-hitsToKnownTransposonProteins-repetPfam35.txt: A TAB-separated file showing hmmscan hits for each MycoMobilome consensus sequence open reading frame to TE domains from the REPET Pfam 35.0 and Gypsy DB curated TE domain dataset. Here, qseqid ends with _n, where n is the ORF number. The query sequence to match to MycoMobilome sequence headers can be found in the column named qseqid_noFrame.
  • MycoMobilome-hitsToKnownTransposonProteins-rmRepeatPeps.txt: A TAB-separated file showing BLASTp hits for each MycoMobilome consensus sequence open reading frame to TE domains from the RepeatMasker RepeatPeps.lib file supplied with RepeatMasker v4.1.9.
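
The relationship between qseqid and qseqid_noFrame described above can be sketched in Python. This is a minimal illustration that assumes the ORF number is always the final underscore-separated integer of qseqid; the function name and the example identifier are hypothetical and are not taken from the database files.

```python
import re
from typing import Tuple

# Assumption: an ORF identifier is "<consensus header>_<n>", where n is an integer.
_ORF_SUFFIX = re.compile(r"_(\d+)$")


def split_orf_id(qseqid: str) -> Tuple[str, int]:
    """Split an ORF identifier into (qseqid_noFrame, ORF number)."""
    match = _ORF_SUFFIX.search(qseqid)
    if match is None:
        raise ValueError(f"no ORF suffix found in {qseqid!r}")
    return qseqid[: match.start()], int(match.group(1))


# Hypothetical identifier, for illustration only.
header, orf_number = split_orf_id("exampleConsensus_PE_3")
```

In practice the qseqid_noFrame column already holds this value, so a helper like this is only needed when working from qseqid alone.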
 
More About MycoMobilome Curation
MycoMobilome was generated from all publicly available fungal genome resources from JGI (excluding restricted assemblies) and NCBI. The database has not been manually curated, but all consensus sequences were generated using a consistent and reproducible automated curation process.
 
The MycoMobilome database was generated by applying a standardised de novo TE curation approach across all publicly available fungal genome resources (n=4,309 genomes). A table containing information on all assemblies used to generate version 1.0 of this database is provided within MycoMobilome_v1.0 in the file MycoMobilome_v1.0_assemblyRecord.xlsx.

Putative TE consensus sequences were generated for each genome using earlGreyLibConstruct in Earl Grey (v4.4.0)[1], configured with Dfam curated elements (v3.7)[2], using default settings. All putative consensus sequences were combined into a single FASTA file containing 773,843 entries. A non-redundant TE library was then constructed with a scalable cascaded clustering approach, using MMseqs2[3] easy-cluster with --min-seq-id 0.8 -c 0.8 --cov-mode 1 --cluster-reassign, resulting in 354,315 non-redundant sequences. Representative sequences for each cluster were extracted and labelled with the species name from which the representative originated.
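
As a rough illustration of the clustering step, the sketch below assembles an MMseqs2 easy-cluster command line with the parameters listed above and computes the fraction of sequences retained. The file names and the helper function are hypothetical; only the flags and the sequence counts come from the description, and the command is built but not executed.

```python
import shlex
from typing import List


def build_easy_cluster_command(input_fasta: str, output_prefix: str, tmp_dir: str) -> List[str]:
    """Build an `mmseqs easy-cluster` argument list using the flags given in the text."""
    return [
        "mmseqs", "easy-cluster", input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", "0.8",   # minimum sequence identity
        "-c", "0.8",             # minimum alignment coverage
        "--cov-mode", "1",       # coverage mode used in the text
        "--cluster-reassign",    # reassign members after cascaded clustering
    ]


# Hypothetical file names, for illustration only.
command = build_easy_cluster_command("all_putative_consensus.fasta", "clusterRes", "tmp")
command_line = shlex.join(command)

# Counts reported in the text: 773,843 input entries, 354,315 cluster representatives.
retained_fraction = 354_315 / 773_843
```

With these counts, clustering retains roughly 46% of the input entries as representative sequences.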

Open reading frames (ORFs) were detected in all six frames of each consensus sequence using transeq in EMBOSS (v6.6.0)[4] with -clean -frame 6. Matches to known host proteins were identified by searching the Fungi RefSeq[5] database (Release 228) with DIAMOND BLASTp[6] using --sensitive --matrix BLOSUM62 --evalue 1e-3. Potential hits were combined for each query sequence. Sequences with hits to RefSeq and either no hits to known TE protein domains, or only partial hits to known TE protein domains that do not overlap the RefSeq hit coordinates, were labelled as potential host genes and removed from the MycoMobilome dataset. Sequences whose hits were to proteins labelled as uncharacterized|hypothetical|low quality|predicted protein were kept, as these proteins may be TE-derived.
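
The host gene rule described above can be read as a small decision function. The sketch below is one possible interpretation, not the pipeline's actual code: it assumes hits are given as inclusive (start, end) coordinates on the query, that RefSeq hits to uninformatively labelled proteins are ignored, and that a query is flagged when no TE domain hit overlaps any remaining RefSeq hit. All names are hypothetical.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates on the query

# Protein labels that are not treated as host gene evidence, as listed in the text.
UNINFORMATIVE = re.compile(
    r"uncharacterized|hypothetical|low quality|predicted protein", re.IGNORECASE
)


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_putative_host_gene(
    refseq_hits: List[Tuple[Interval, str]], te_hits: List[Interval]
) -> bool:
    """Flag a query with informative RefSeq hits and no overlapping TE domain hit."""
    informative = [iv for iv, desc in refseq_hits if not UNINFORMATIVE.search(desc)]
    if not informative:
        return False  # no host gene evidence at all
    if not te_hits:
        return True   # RefSeq hit and no TE domain hit
    # Assumption: TE hits count only if they overlap a RefSeq hit.
    return not any(overlaps(te, ref) for te in te_hits for ref in informative)
```

For example, a query with a single RefSeq hit described as "hypothetical protein" is kept, whereas one with a RefSeq hit to a named host protein and no TE domain hit is flagged.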

Matches to known TE proteins were identified using two complementary approaches: (i) Using hmmscan in HMMER (v3.4)[7] to detect homology to known TE protein domains curated by the REPET group. Matches were identified using hmmscan -E 10 --noali. Hits were filtered to retain those where fseq_evalue <= 0.001 and fseq_bitscore >= 50. Hits were retained as potential TEs unless the query also matched RefSeq proteins, in which case they were removed to avoid including host genes or chimeric TE-host gene models.
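
The hit thresholds above amount to a simple filter. The sketch below assumes each hmmscan hit has been parsed into a dictionary keyed by the column names fseq_evalue and fseq_bitscore; the function name and the parsing step are assumptions, not part of the published pipeline.

```python
from typing import Dict, Iterable, List


def passes_domain_thresholds(
    hit: Dict[str, str], max_evalue: float = 0.001, min_bitscore: float = 50.0
) -> bool:
    """Keep a hit when fseq_evalue <= 0.001 and fseq_bitscore >= 50."""
    return (
        float(hit["fseq_evalue"]) <= max_evalue
        and float(hit["fseq_bitscore"]) >= min_bitscore
    )


def filter_hits(hits: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the hits that satisfy both thresholds."""
    return [hit for hit in hits if passes_domain_thresholds(hit)]
```

Values are converted with float() so that rows read directly from a TAB-separated file, where every field is a string, can be passed in unchanged.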

(ii) Using BLASTp to detect homology to known TE protein domains in the RepeatPeps.lib file supplied with RepeatMasker (v4.1.5) (repeatmasker.org). Matches were identified using blastp -evalue 1e-3. Nested hits were removed to retain the highest quality protein hit for each query, followed by combining adjacent and overlapping hits. Hits were retained as potential TEs unless non-overlapping hits to the same query were also found in the RefSeq hits set, in which case they were removed because they could be host genes or chimeric TE-host gene models.
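
Combining adjacent and overlapping hits is a standard interval merge. The sketch below illustrates only that step, under the assumptions that hits are inclusive (start, end) coordinates on the query and that "adjacent" means separated by at most max_gap positions (a hypothetical parameter). It does not reproduce the selection of the highest quality hit among nested hits; here a nested hit is simply absorbed by the interval that contains it.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates on the query


def merge_hits(hits: Iterable[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping hits and hits separated by at most max_gap positions."""
    merged: List[List[int]] = []
    for start, end in sorted(hits):
        # With inclusive coordinates, (1, 10) and (11, 20) are directly adjacent.
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]
```

For instance, merge_hits([(1, 10), (11, 20), (30, 40), (35, 38)]) combines the first two hits because they are adjacent and absorbs the nested hit (35, 38) into (30, 40).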

A total of 24,571 consensus sequences were identified as putative host genes and removed from the database, resulting in a potential TE consensus set containing 329,744 sequences. This set was further filtered to remove all putative TE consensus sequences <120 bp in length, as these are likely to be poor quality and incomplete. In addition, the base composition of each consensus was calculated using seqtk comp (https://github.com/lh3/seqtk) and all sequences with an N content >=5% were removed as poor quality, reducing the final MycoMobilome library to 276,641 sequences.
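
The two quality filters can be expressed as a single check on each sequence. This is a minimal sketch that counts N characters directly in Python rather than calling seqtk comp; the function name is hypothetical, and the thresholds are the ones stated above.

```python
def passes_quality_filters(
    sequence: str, min_length: int = 120, max_n_fraction: float = 0.05
) -> bool:
    """Keep sequences of at least 120 bp whose N content is below 5%."""
    length = len(sequence)
    if length < min_length:
        return False
    n_count = sequence.upper().count("N")
    # Sequences with N content >= 5% are removed, so the comparison is strict.
    return n_count / length < max_n_fraction


# Counts reported in the text.
after_host_gene_removal = 354_315 - 24_571
removed_by_quality_filters = after_host_gene_removal - 276_641
```

The reported counts are internally consistent: 354,315 minus 24,571 gives 329,744, and the length and N content filters together account for a further 53,103 removed sequences.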

Each consensus sequence with hits to known TE protein domains was labelled as "supported". The identity of each protein domain hit was then evaluated to determine whether the consensus sequence classification is supported by protein hits from the REPET profiles bank or RepeatMasker RepeatPeps. If the identified domains support the consensus classification, the consensus sequence is labelled with _PE for protein evidence. If the identified domains conflict with the consensus classification, the consensus sequence is labelled with _DA for disagreement. If there are no identified domains, the consensus sequence is labelled with _NE for no evidence. The appropriate domains for each classification are defined in the table below:

High level TE classification | Appropriate Domain Hits from REPET | RepeatMasker RepeatPeps
DNA | Tase,Tase*,DDE,HTH,[ATP,INT,AP for crypton,maverick] | DNA
RC | HEL,EN,RPA | RC
LTR | RT,INT,RH,GAG,AP,VirusRelated,LTRrelated,Caulimovirus,ClassIrelated,ENV | LTR
LINE | RT,EN,RH,GAG,ClassIrelated,LINErelated | LINE
PLE | RT,EN,ClassIrelated | PLE
Retroposon | RT,INT,RH,GAG,AP,VirusRelated,LTRrelated,Caulimovirus,ClassIrelated,ENV,EN,LINErelated | Retroposon
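
The labelling rule and the REPET column of the table can be sketched as a lookup. This is an illustrative reading, not the pipeline's code: it assumes that a single matching domain is enough to support a classification, that the bracketed ATP, INT and AP domains apply only when the sub level classification starts with Crypton or Maverick, and it leaves classifications outside the table unhandled. The RepeatPeps column is not modelled.

```python
from typing import Iterable

_LTR_DOMAINS = {"RT", "INT", "RH", "GAG", "AP", "VirusRelated", "LTRrelated",
                "Caulimovirus", "ClassIrelated", "ENV"}

# REPET domains taken from the table above, keyed by high level classification.
SUPPORTING_DOMAINS = {
    "DNA": {"Tase", "Tase*", "DDE", "HTH"},
    "RC": {"HEL", "EN", "RPA"},
    "LTR": _LTR_DOMAINS,
    "LINE": {"RT", "EN", "RH", "GAG", "ClassIrelated", "LINErelated"},
    "PLE": {"RT", "EN", "ClassIrelated"},
    "Retroposon": _LTR_DOMAINS | {"EN", "LINErelated"},
}
CRYPTON_MAVERICK_EXTRA = {"ATP", "INT", "AP"}


def evidence_label(high_level: str, sub_level: str, domain_hits: Iterable[str]) -> str:
    """Return 'PE', 'DA' or 'NE' for a consensus sequence (assumed rule, see lead-in)."""
    hits = set(domain_hits)
    if not hits:
        return "NE"  # no identified domains
    if high_level not in SUPPORTING_DOMAINS:
        raise ValueError(f"classification {high_level!r} is not covered by the table")
    allowed = set(SUPPORTING_DOMAINS[high_level])
    if high_level == "DNA" and sub_level.lower().startswith(("crypton", "maverick")):
        allowed |= CRYPTON_MAVERICK_EXTRA
    return "PE" if hits & allowed else "DA"
```

Under these assumptions, an LTR consensus with RT and INT hits is labelled PE, while a DNA transposon consensus whose only hit is RT is labelled DA.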

Sequences are named with the convention MycMob1.0_family-[n]-[six digit species code]_[protein evidence]#[high level classification]/[sub level classification] @[genus species]. Protein hits to known TE proteins are provided with MycoMobilome to support further investigation in specific use cases. No changes were made to classifications assigned during automated curation. This database should therefore be treated as uncurated, and important or interesting TE loci should be checked on a case-by-case basis. Please note that all non-autonomous elements will have the label _NE, as they do not contain any intact protein domains. This does not mean they are not real TEs. As such, for most use cases we suggest using the complete MycoMobilome v1.0 dataset, unless you are specifically interested in autonomous TEs only.
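
The naming convention lends itself to a regular expression. The sketch below is an assumption-laden parser: it assumes [n] is an integer, that the species code contains no underscore, that the evidence field is one of PE, DA or NE, and that the sub level classification may be absent. The example headers are invented for illustration and do not come from the database.

```python
import re
from typing import Dict, Optional

HEADER_PATTERN = re.compile(
    r"^>?MycMob1\.0_family-(?P<family>\d+)-(?P<species_code>[^_\s#]+)"
    r"_(?P<evidence>PE|DA|NE)"
    r"#(?P<high_level>[^/\s]+)(?:/(?P<sub_level>\S+))?"
    r"\s+@(?P<species>.+)$"
)


def parse_header(header: str) -> Dict[str, Optional[str]]:
    """Split a MycoMobilome sequence header into its named components."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"header does not follow the expected convention: {header!r}")
    return match.groupdict()


# Hypothetical header, for illustration only.
fields = parse_header(">MycMob1.0_family-42-abcdef_PE#LTR/Gypsy @Examplomyces fictus")
```

A parser like this makes it straightforward to subset the complete library by evidence label or classification without relying on the pre-split FASTA files.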

Bibliography

  1. Baril T, Galbraith J, Hayward A. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Molecular Biology and Evolution. 2024 Apr;41(4):msae068. 

  2. Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic acids research. 2016 Jan 4;44(D1):D81-9. 

  3. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology. 2017 Nov;35(11):1026-8. 

  4. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends in genetics. 2000 Jun 1;16(6):276-7. 

  5. Goldfarb T, Kodali VK, Pujar S, Brover V, Robbertse B, Farrell CM, Oh DH, Astashyn A, Ermolaeva O, Haddad D, Hlavina W. NCBI RefSeq: reference sequence standards through 25 years of curation and annotation. Nucleic Acids Research. 2025 Jan 6;53(D1):D243-57. 

  6. Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods. 2021 Apr;18(4):366-8. 

  7. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic acids research. 1998 Jan 1;26(1):320-2. 

Files

Files (471.6 MB)

Checksum Size
md5:bd050e241af4bebb85d13cf382ca521a 471.6 MB
md5:44292318be7227d32a41d82a791aa15a 2.2 kB

Additional details

Funding

Swiss National Science Foundation
Crop pathogen evolution in a changing climate 201149

Software

Repository URL
https://github.com/TobyBaril/MycoMobilome
Development Status
Active