Dataset for training SENMAP, an automatic tool to curate LTR-retrotransposons using convolutional neural networks
Description
Transposable elements (TEs) are genomic sequences that can move from one location in the genome to another. As a result, they can cause mutations or changes that may be harmful, such as the appearance of diseases, or beneficial, such as contributing to genome evolution and genetic diversity. Long terminal repeat retrotransposons (LTR-RTs) are the most abundant TEs in plant species, hence the importance of studying these elements in particular. Over time, these elements can undergo changes called nested insertions, which can inactivate or modify the functioning of the element; such elements are no longer considered intact and cannot be used in identification and classification studies. We created a dataset containing 56,442 LTR-RTs tagged as "non-intact" elements and 49,215 considered "intact".
We formatted the sequence IDs to retain relevant information such as the superfamily and the lineage, as well as the category (Negative for "non-intact" and Positive for "intact" elements).
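As a rough illustration of how the category encoded in the sequence IDs could be used, the sketch below tallies FASTA headers by label. It assumes only that the words "Positive" and "Negative" appear literally somewhere in each header; the exact ID layout is not specified here, so the example headers and the function names `fasta_headers` and `count_categories` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA header line without the leading '>'."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def count_categories(lines: Iterable[str]) -> Counter:
    """Count headers by the category word found in the sequence ID.

    Assumption: the ID contains the literal word 'Positive' (intact)
    or 'Negative' (non-intact); the real ID layout may differ.
    """
    counts = Counter()
    for header in fasta_headers(lines):
        if "Positive" in header:
            counts["intact"] += 1
        elif "Negative" in header:
            counts["non-intact"] += 1
        else:
            counts["unlabeled"] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    ">seq1 Positive",
    "ACGTACGT",
    ">seq2 Negative",
    "TTGACCA",
    ">seq3 Negative",
    "GGCATT",
]
print(count_categories(example))
```

Because `count_categories` accepts any iterable of lines, an open file handle for the extracted FASTA file can be passed directly. If the assumption about the headers holds, the tallies should match the 49,215 intact and 56,442 non-intact elements reported above.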
This dataset (the npy files obtained from the FASTA file) was used to train SENMAP, a convolutional neural network architecture for obtaining intact LTR-RT sequences in plant genomes. The network is composed of four convolutional layers and uses LeakyReLU as the activation function and BinaryFocalLoss as the loss function. SENMAP achieved an F1-score of 91.37% on test data, identifying low-quality sequences rapidly and efficiently and thereby helping to curate the LTR retrotransposon libraries of plant genomes published by the large-scale sequencing projects of the post-genomic era.
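For readers unfamiliar with the loss function named above, the following is a minimal sketch of the standard binary focal loss for a single prediction, written in plain Python. It is not SENMAP's training code: the values `gamma=2.0` and `alpha=0.25` are the commonly cited defaults and are assumptions here, since the hyperparameters used for SENMAP are not stated in this description.

```python
import math


def binary_focal_loss(y_true: int, p: float, gamma: float = 2.0,
                      alpha: float = 0.25, eps: float = 1e-7) -> float:
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    y_true is 1 for the positive class ("intact") and 0 otherwise;
    p is the predicted probability of the positive class.
    gamma and alpha are assumed defaults, not SENMAP's settings.
    """
    p = min(max(p, eps), 1.0 - eps)           # clip to avoid log(0)
    p_t = p if y_true == 1 else 1.0 - p       # probability of the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalised far less than a confident wrong one.
print(binary_focal_loss(1, 0.9) < binary_focal_loss(1, 0.1))
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified examples, so training concentrates on the hard ones; with `gamma=0` the expression reduces to a class-weighted cross-entropy.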
Files
2_labels_DB.fa.filtered_center.zip
Total size: 1.3 GB

| MD5 checksum | Size |
|---|---|
| c814788ae5925be831eb159110c504cb | 776.9 MB |
| 9f7c6dd4294d4266aa0b9cbf5a8a7615 | 486.3 MB |
| 9ae8a03cd86a2254325ec33674290e99 | 105.8 kB |
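The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that the larger archives do not need to fit in memory; the helper names `md5_of_file` and `verify` are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the checksum published for it."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, calling `verify(path, "c814788ae5925be831eb159110c504cb")` on the locally saved copy of the 776.9 MB file should return `True` for an intact download, where `path` stands for wherever that file was saved.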
Additional details
Software
- Repository URL: https://github.com/simonorozcoarias/SENMAP