Published July 27, 2021 | Version 2
Dataset Open

REPIN population analysis in 4 Dokdonia genomes

  • 1. Max Planck Institute for Evolutionary Biology

Description

This dataset is the output of RAREFAN (http://rarefan.evolbio.mpg.de/), a webserver that identifies REPIN populations across an entire bacterial species. The data were created with the following command: "java -jar -Xmx10g rarefan.jar dokdonia/in/ dokdonia/out/ 4h-3-7-5.fas 55 21 in/yafM_Ecoli.faa dokdonia.nwk 1e-10 true 1"

All input files are located in the folder dokdonia/in/, all output data is located in dokdonia/out/.

The input files include the four fasta-formatted Dokdonia genome files (*.fas) and a RAYT protein sequence file called yafM_Ecoli.faa.

The output files include the following:

A phylogenetic tree "dokdonia.nwk" of all genomes generated with andi (http://github.com/evolbioinf/andi/) and clustDist (http://guanine.evolbio.mpg.de/problemsBook/node1.html).

A file containing the frequencies of all 21bp long sequences found in the 4h-3-7-5 genome: 4h-3-7-5.wfr

A file containing all 21bp long sequences that occur more than 55 times in the 4h-3-7-5 genome: 4h-3-7-5.overrep

A file containing information on the RAYTs and their cooccurrence with different REPIN populations: prox.stats

A file containing the nucleotide sequences of all yafM_Ecoli.faa relatives identified with BLAST+ in the Dokdonia species: yafM_relatives.fna

maxREPIN_0.txt contains the most frequent REPIN identified for each sequence type in each Dokdonia strain.

presAbs_0.txt contains, for each strain, information on the number of RAYTs, the number of REPINs, the master sequence, the number of master sequences, the entire REP/REPIN population size, the number of REPIN clusters that contain more than 10 sequences, all REPINs in the population, as well as all REPINs that differ from the master sequence in at most three nucleotides.

rayt_[strain name].tab contains location information for each identified RAYT relative for each strain. The files can be viewed with artemis.

results.txt contains for each strain the frequency of the six identified 21bp long seeds.

There is one folder called groupSeedSequences, which includes the data for identifying the most common 21bp long sequences in D. sp. 4h-3-7-5. All 21bp long sequences in the genome that occur more than 55 times are sorted into 6 sequence groups. These sequence groups are stored in the files Group_4h-3-7-5_*.out and .out.fas. There is also a 4h-3-7-5_words.tab file, which contains the locations of all overrepresented 21bp long sequences in the 4h-3-7-5 genome. This file can be viewed in artemis (https://www.sanger.ac.uk/tool/artemis/) together with the 4h-3-7-5 genome file. The most common sequence in each group is used as a seed sequence to determine REPIN populations across all 4 genomes.
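The counting and filtering step behind the *.wfr and *.overrep files can be illustrated with a minimal Python sketch. It is not RAREFAN's implementation: the function names are hypothetical, only the given strand of a single sequence is scanned, and whether RAREFAN also counts the reverse strand or how it handles multiple contigs is not stated in this description. The defaults of 21 and 55 mirror the word length and frequency threshold from the command above.

```python
from collections import Counter


def count_words(genome: str, k: int = 21) -> Counter:
    """Count every k-long word on the given strand of `genome` (assumption: forward strand only)."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def overrepresented(counts: Counter, min_count: int = 55) -> dict:
    """Keep only words that occur more than `min_count` times."""
    return {word: n for word, n in counts.items() if n > min_count}


# Toy example with short words and a low threshold so it can be checked by hand.
toy = "ACGTACGTACGA"
toy_counts = count_words(toy, k=4)
toy_over = overrepresented(toy_counts, min_count=1)
print(toy_over)
```

For a real genome, the sequence string would be read from one of the *.fas files and the defaults k=21 and min_count=55 would apply.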

For each genome there is one output folder per sequence group (for example, a folder ending in _0).

Each folder contains the following files:

*.dd: Degree distribution of the REPIN network, where each REPIN is a node. A REPIN is connected to another REPIN if they differ in exactly one position. The degree distribution is a histogram of the number of connections of all the nodes. 

*.hist For the largest sequence cluster determined by mcl that consists of REPINs (two REPs in inverted orientation), this file contains the number of REPINs in each sequence class. Sequence class 0 is the master sequence, which is by definition the most common REPIN in the sequence population. Sequence class 1 contains all REPINs that differ from the master sequence in exactly one position, sequence class 2 contains REPINs that differ in two positions, and so on.

*.mcl Contains the clustering output by mcl. Each line contains the members of one cluster. Lines are sorted by cluster size.

*.mw Contains the most common 21bp long sequence and its frequency in the genome. This sequence is the basis for first identifying all related REP sequences and then the REPINs formed by these REP sequences.

*.nodes The identity and frequency of all REPINs and REP sequences for either all sequences or only for the largest sequence cluster.

*.ss Contains REPINs and REP sequences as well as their positions in fasta format. Position information starts with the index of the sequence within the genome fasta file (counting from 0), followed by the start and end position of the entire REPIN/REP sequence.

*.ss.REP REP sequence information in fasta format.

*.tab Location in tab format. Can be used to display locations of REPs and REPINs in the genome via artemis.

*_[0-9].ss Contains REPIN/REP sequence information for each subcluster separately.

*_[0-9].tab Contains the location of REP/REPINs for each subcluster separately for viewing in artemis.

*allSeed.nw Contains network connections between nodes of all sequences. Can be used together with the nodes file to view the network in, for example, R or cytoscape.

*largestCluster.nodes Information on nodes only from the largest REPIN cluster.

*largestCluster.ss *.ss file for the largest REPIN cluster.

*largestCluster.tab *.tab file for the largest REPIN cluster.

*_rayt_repin_prox.txt shows which REPIN/REP cluster is in proximity to any of the RAYT genes identified in the genome (within 200bp).

Each folder also contains a subfolder with the complete sequences (including the variable region) of all identified REPs and REPINs.
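The definitions behind the *.dd and *.hist files (nodes linked when they differ in exactly one position, and sequence classes defined by the number of differences from the master sequence) can be sketched in Python as follows. This is an illustration under stated assumptions, not RAREFAN's code: the function names are hypothetical, all sequences are assumed to have equal length, and the class counts are assumed to be weighted by how often each sequence occurs, which this description does not specify.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def degree_distribution(nodes: list[str]) -> dict[int, int]:
    """Map degree -> number of nodes; two nodes are linked if they differ in exactly one position."""
    degree = {seq: 0 for seq in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            degree[a] += 1
            degree[b] += 1
    return dict(sorted(Counter(degree.values()).items()))


def sequence_classes(freq: dict[str, int]) -> dict[int, int]:
    """Map sequence class -> number of copies in that class.

    Class 0 is the master sequence (the most frequent one); class d holds
    sequences that differ from it in exactly d positions.
    Assumption: counts are weighted by each sequence's frequency.
    """
    master = max(freq, key=freq.get)
    classes = Counter()
    for seq, n in freq.items():
        classes[hamming(seq, master)] += n
    return dict(sorted(classes.items()))


# Hypothetical toy population: sequence -> number of copies in the genome.
toy_freq = {"AAAA": 5, "AAAT": 2, "AATT": 1, "TAAA": 1}
toy_dd = degree_distribution(list(toy_freq))
toy_classes = sequence_classes(toy_freq)
print(toy_dd)
print(toy_classes)
```

The pairwise comparison is quadratic in the number of distinct sequences, which is acceptable for a sketch but may be slow for large populations.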

The dataset was generated using the following external tools:

andi for tree building:

B Haubold, F Klötzl, and P Pfaffelhuber. andi: fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics, 2015 vol. 31 (8) pp. 1169-1175.

MCL for REPIN population clustering:

A J Enright, S Van Dongen, and C A Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 2002 vol. 30 (7) pp. 1575-1584.

BLAST+ for identifying RAYT relatives in the different genomes:

C Camacho, G Coulouris, V Avagyan, N Ma, J Papadopoulos, K Bealer, and T L Madden. BLAST+: architecture and applications. BMC Bioinformatics, 2009 vol. 10 (1) pp. 421-9.

Files

dokdonia.zip (28.2 MB)
md5:62e11e0164a360a2299bcd3ef61359ac
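A downloaded copy of the archive can be checked against the md5 value listed above. The following is a minimal sketch using Python's standard hashlib module; the helper name md5_of_file and the local file path are assumptions for illustration.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# md5 value listed for dokdonia.zip in this record.
EXPECTED_MD5 = "62e11e0164a360a2299bcd3ef61359ac"

# Example (assumes dokdonia.zip was saved to the current directory):
# print(md5_of_file("dokdonia.zip") == EXPECTED_MD5)
```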
