Published March 24, 2021 | Version v1
Dataset Open

Data from: Transposable element annotation in non-model species - on the benefits of species specific repeat libraries using semi-automated EDTA and DeepTE de novo pipelines

  • 1. University of East Anglia

Description

Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing underestimates of TE abundance. Here, we describe the semi-automated generation of a de novo TE library which combines the newly described EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras sp. C115). We assess performance using both genomic and transcriptomic input by five metrics: (i) abundance, (ii) composition, (iii) fragmentation, (iv) age distributions and (v) capture of potential horizontally transferred TEs. We identified notable differences in these metrics between different TE libraries, and highlight how library choice can have a major impact on TE content estimates in non-model species.

This repository incorporates six raw (unparsed) Repeat Masker (RM) output files for two genomes (Corydoras sp. C115 and Corydoras maculifer) and one transcriptome (C. maculifer), together with two repeat libraries (one based on the RepBase Danio rerio library and one de novo library built on the C. sp. C115 genome). The RM output files correspond to one homology-based transposon search using the D. rerio library and one species-specific search using the de novo library. It also includes a script to accompany the horizontal transfer analysis and a transposable element renaming script.

Notes

Please note that the Repeat Masker output files are raw and unparsed. To parse the data as in the manuscript, please use the parse script published here: https://github.com/clbutler/RM_TRIPS
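
For a quick look at a raw output file before running the parse script, the following is a minimal Python sketch. It is not the RM_TRIPS method used in the manuscript. It assumes the standard whitespace-delimited Repeat Masker .out layout (three header lines, then rows of at least 15 columns beginning with an integer score), and the function names parse_rm_out and masked_bp_by_class are hypothetical. The per-class totals simply sum annotated lengths and do not correct for overlapping annotations, so they may overestimate the masked sequence.

```python
from collections import defaultdict


def parse_rm_out(lines):
    """Yield one dict per annotation row of a Repeat Masker .out file.

    Assumes the standard column order: score, % divergence, % deletion,
    % insertion, query name, query begin, query end, (left), strand,
    repeat name, class/family, three repeat coordinates, ID.
    """
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        yield {
            "perc_div": float(fields[1]),
            "query": fields[4],
            "begin": int(fields[5]),
            "end": int(fields[6]),
            # Repeat Masker writes "C" for the complement strand.
            "strand": "-" if fields[8] == "C" else "+",
            "repeat": fields[9],
            "class_family": fields[10],
        }


def masked_bp_by_class(records):
    """Sum annotated lengths per class/family (overlaps are not removed)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["class_family"]] += rec["end"] - rec["begin"] + 1
    return dict(totals)
```

To apply it to one of the files listed below, open the file in text mode and pass the file handle to parse_rm_out, then pass the resulting records to masked_bp_by_class.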

File List:

DanioLib_DeepTE_clean.fasta -> The RepBase Danio library which has been run through the DeepTE program for TE classification

Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta -> The de novo transposable element library we produced for the Corydoras sp. C115 genome using EDTA and then DeepTE

Horizontal_transfer_Analysis_script.R -> The R script used for the analysis of horizontal transfer of transposable elements

Unparsed_DanioDeepTElib_CmaculiferGenome_AssemblyNameCM_19_scafSeq.fas.out -> Unparsed Repeat Masker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras maculifer genome (available on GenBank)

Unparsed_DanioDeepTElib_CmaculiferSample56_transcriptome.out -> Unparsed Repeat Masker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras maculifer transcriptome (available on GenBank)

Unparsed_DanioDeepTElib_CorydorasC115genome_AssemblyNameLin1PacBio.ctg.fa.r3p3_pilon_3.fasta.out -> Unparsed Repeat Masker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras sp. C115 genome (available on GenBank)

Unparsed_DeNovolib_CmaculiferGenome_AssemblyNameCM_19_scafSeq.fas.out -> Unparsed Repeat Masker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras maculifer genome (available on GenBank)

Unparsed_DeNovolib_CmaculiferSample56_transcriptome.out -> Unparsed Repeat Masker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras maculifer transcriptome (available on GenBank)

Unparsed_DeNovolib_CorydorasC115genome_AssemblyNameLin1PacBio.ctg.fa.r3p3_pilon_3.fasta.out -> Unparsed Repeat Masker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras sp. C115 genome (available on GenBank)

Unparsed_DanioDeepTElib_DanioGenome_AccessionNoGCF_000002035.6_GRCz11.out -> Unparsed Repeat Masker output using DanioLib_DeepTE_clean.fasta as the repeat library against the Danio rerio genome (Accession number: GCF_000002035.6_GRCz11)

Funding provided by: Biotechnology and Biological Sciences Research Council
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100000268
Award Number: BB/R017174/1

Funding provided by: NERC Environmental Bioinformatics Centre
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100008668
Award Number: NE/L002582/1
