Published July 27, 2021 | Version 2
Dataset Open

REPIN population analysis in 42 Pseudomonas chlororaphis genomes

  • 1. Max Planck Institute for Evolutionary Biology

Description

This dataset is the output of RAREFAN (http://rarefan.evolbio.mpg.de/), a web server that identifies REPIN populations across an entire bacterial species. The data were generated with the following command: "java -jar -Xmx10g rarefan.jar chlororaphis/in/ chlororaphis/out/ chlTAMOak81.fas 55 21 chlororaphis/in/yafM_SBW25.faa chlororaphis.nwk 1e-30 true 1"

All input files are located in the folder chlororaphis/in/, all output data is located in chlororaphis/out/.

The input files include the 42 FASTA-formatted P. chlororaphis genome files (*.fas) and a RAYT protein sequence file, yafM_SBW25.faa.

The output files include the following:

A phylogenetic tree "chlororaphis.nwk" of all genomes generated with andi (http://github.com/evolbioinf/andi/) and clustDist (http://guanine.evolbio.mpg.de/problemsBook/node1.html).

A file containing the frequencies of all 21bp long sequences found in the TAMOak81 genome: chlTAMOak81.wfr

A file containing all 21 bp long sequences that occur more than 55 times in the TAMOak81 genome: chlTAMOak81.overrep
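The word counting behind the .wfr and .overrep files can be illustrated with a minimal Python sketch. It is not RAREFAN's code: the function names are hypothetical, and counting on the forward strand only is an assumption made for illustration.

```python
from collections import Counter


def count_words(sequence: str, k: int = 21) -> Counter:
    """Count every k-long word on the forward strand of `sequence`.

    Assumption: only the forward strand is counted; RAREFAN may treat
    strands differently.
    """
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def overrepresented(counts: Counter, min_count: int = 55) -> dict:
    """Keep only the words that occur more than `min_count` times."""
    return {word: n for word, n in counts.items() if n > min_count}


# Toy example with k=3 and a threshold of 2 instead of 21 and 55.
toy_counts = count_words("ACGACGACGTT", k=3)
toy_over = overrepresented(toy_counts, min_count=2)
```

In the toy example only the word ACG occurs more than twice, so it is the only entry that survives the filter.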

A file containing information on the RAYTs and their cooccurrence with different REPIN populations: prox.stats

A file containing the nucleotide sequences of all yafM_SBW25.faa relatives identified with BLAST+ in the P. chlororaphis species: yafM_relatives.fna

maxREPIN_[0-5] contains the most frequent REPIN identified for each sequence type in each P. chlororaphis strain.

presAbs_[0-5].txt contains, for each strain, the number of RAYTs, the number of REPINs, the master sequence, the number of master sequences, the size of the entire REP/REPIN population, the number of REPIN clusters that contain more than 10 sequences, all REPINs in the population, and all REPINs that differ from the master sequence in at most three nucleotides.

rayt_[strain name].tab contains location information for each RAYT relative identified in each strain. The files can be viewed with Artemis.

results.txt contains for each strain the frequency of the six identified 21bp long seeds.

There is one folder called groupSeedSequences, which contains the data used to identify the most common 21 bp long sequences in P. chlororaphis TAMOak81. All 21 bp long sequences in the genome that occur more than 55 times are sorted into six sequence groups. These sequence groups are stored in the files Group_chlTAMOak81_*.out and .out.fas. There is also a chlTAMOak81_words.tab file, which contains the locations of all overrepresented 21 bp long sequences in the TAMOak81 genome. This file can be viewed in Artemis (https://www.sanger.ac.uk/tool/artemis/) together with the TAMOak81 genome file. The most common sequence in each group is used as a seed sequence to determine REPIN populations across all 42 genomes.

 

For each genome there are six output folders (ending in _0 to _5), one for each sequence group.

Each folder contains the following files:

*.dd: Degree distribution of the REPIN network, where each REPIN is a node. A REPIN is connected to another REPIN if they differ in exactly one position. The degree distribution is a histogram of the number of connections of all the nodes. 
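The degree distribution described above can be sketched in a few lines of Python. This is an illustration only, not RAREFAN's implementation: the function names are hypothetical, and treating each distinct REPIN sequence as one node is an assumption.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def degree_distribution(repins: list[str]) -> dict[int, int]:
    """Map each degree to the number of nodes that have that degree.

    Assumption: every distinct sequence is one node, and two nodes are
    connected when they differ in exactly one position.
    """
    nodes = sorted(set(repins))
    degree = {node: 0 for node in nodes}
    for a, b in combinations(nodes, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            degree[a] += 1
            degree[b] += 1
    return dict(sorted(Counter(degree.values()).items()))


# Toy network: AAAA - AAAT - AATT form a chain, CCCC is isolated.
toy_dd = degree_distribution(["AAAA", "AAAT", "AATT", "CCCC"])
```

The pairwise comparison is quadratic in the number of distinct sequences, which is acceptable for a sketch but may be slow for very large populations.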

*.hist For the largest sequence cluster determined by mcl that consists of REPINs (two REPs in inverted orientation), this file contains the number of REPINs in each sequence class. Sequence class 0 is the master sequence, which is by definition the most common REPIN in the sequence population. Sequence class 1 contains all REPINs that differ from the master sequence in exactly one position. Sequence class 2 contains REPINs that differ in two positions, etc.
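As a hedged illustration of the sequence classes, the following Python sketch picks the most common sequence as the master sequence and bins all others by their number of mismatches to it. The function name is hypothetical, and counting every occurrence (rather than every distinct sequence) is an assumption; the sketch does not reproduce the exact layout of the *.hist files.

```python
from collections import Counter


def sequence_classes(repins: list[str]) -> dict[int, int]:
    """Map sequence class (mismatches to the master) to REPIN count.

    Assumptions: the master sequence is the most frequent sequence (ties
    are broken by first occurrence), every occurrence is counted, and
    sequences whose length differs from the master are skipped.
    """
    counts = Counter(repins)
    master, _ = counts.most_common(1)[0]
    classes: Counter = Counter()
    for seq, n in counts.items():
        if len(seq) != len(master):
            continue
        distance = sum(x != y for x, y in zip(seq, master))
        classes[distance] += n
    return dict(sorted(classes.items()))


# Toy population: AAAA is the master sequence (5 copies).
toy_population = ["AAAA"] * 5 + ["AAAT"] * 2 + ["AATT"] + ["TAAA"]
toy_classes = sequence_classes(toy_population)
```

In the toy population, class 0 holds the five master copies, class 1 holds AAAT (twice) and TAAA, and class 2 holds AATT.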

*.mcl Contains the clustering output by mcl. Each line contains the members of one cluster. Lines are sorted by cluster size.
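A file with one cluster per line can be read with a short Python sketch such as the one below. It assumes that cluster members are separated by whitespace (tabs or spaces), which should be checked against the actual *.mcl files; the function name and the toy member labels are hypothetical.

```python
def read_clusters(text: str) -> list[list[str]]:
    """Parse one cluster per line and return clusters, largest first.

    Assumption: members on a line are separated by whitespace.
    """
    clusters = [line.split() for line in text.splitlines() if line.strip()]
    return sorted(clusters, key=len, reverse=True)


# Toy content standing in for the text of a *.mcl file.
toy_text = "d\te\na\tb\tc\nf\n"
toy_clusters = read_clusters(toy_text)
largest = toy_clusters[0]
```

For a real file, the text could be obtained with `pathlib.Path(filename).read_text()` before calling `read_clusters`.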

*.mw Contains the most common 21 bp long sequence and its frequency in the genome. This sequence is the basis for identifying first all related REP sequences and then the REPINs formed by those REP sequences.

*.nodes The identity and frequency of all REPINs and REP sequences for either all sequences or only for the largest sequence cluster.

*.ss Contains REPINs and REP sequences as well as their positions in FASTA format. Position information starts with the index of the sequence in the genome FASTA file (the first sequence is 0...), followed by the start and end positions of the entire REPIN/REP sequence.

*.ss.REP REP sequence information in fasta format.

*.tab Location in tab format. Can be used to display the locations of REPs and REPINs in the genome in Artemis.

*_[0-9].ss Contains REPIN/REP sequence information for each subcluster separately.

*_[0-9].tab Contains the locations of REPs/REPINs for each subcluster separately, for viewing in Artemis.

*allSeed.nw Contains network connections between the nodes of all sequences. Can be used together with the nodes file to view the network in, for example, R or Cytoscape.

*largestCluster.nodes Information on nodes only from the largest REPIN cluster.

*largestCluster.ss *.ss file for the largest REPIN cluster.

*largestCluster.tab *.tab file for the largest REPIN cluster.

*_rayt_repin_prox.txt shows which REPIN/REP cluster is in proximity to any of the RAYT genes identified in the genome (within 200bp).
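The 200 bp proximity criterion can be illustrated with a minimal Python sketch. How RAREFAN measures the distance is not stated here, so the edge-to-edge gap between intervals used below is an assumption, and the function names and coordinates are hypothetical.

```python
def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Edge-to-edge distance between two intervals, 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def near_rayt(rayt: tuple[int, int],
              elements: list[tuple[int, int]],
              max_gap: int = 200) -> list[tuple[int, int]]:
    """Return the REP/REPIN intervals lying within `max_gap` bp of the RAYT."""
    return [e for e in elements
            if gap(rayt[0], rayt[1], e[0], e[1]) <= max_gap]


# Toy coordinates: one RAYT gene and four REP/REPIN elements.
toy_rayt = (1000, 1500)
toy_elements = [(700, 790), (800, 850), (1600, 1650), (1701, 1750)]
toy_near = near_rayt(toy_rayt, toy_elements)
```

In the toy data the elements at 800-850 and 1600-1650 lie 150 bp and 100 bp from the gene and are kept, while the other two are 210 bp and 201 bp away and are dropped.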

Each folder also contains a subfolder with the complete sequences (including the variable region) of all identified REPs and REPINs.

The dataset was generated using the following external tools:

andi for tree building:

B Haubold, F Klötzl, and P Pfaffelhuber. andi: fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics, 2015 vol. 31 (8) pp. 1169-1175.

MCL for REPIN population clustering:

A J Enright, S Van Dongen, and C A Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 2002 vol. 30 (7) pp. 1575-1584.

BLAST+ for identifying RAYT relatives in the different genomes:

C Camacho, G Coulouris, V Avagyan, N Ma, J Papadopoulos, K Bealer, and T L Madden. BLAST+: architecture and applications. BMC Bioinformatics, 2009 vol. 10 (1) pp. 421-9.

Files

chlororaphis.zip (147.7 MB)
md5:3e9f5e69f5c84a0be8bd497842f6ce82
