REPIN population analysis in 130 Neisseria meningitidis and N. gonorrhoeae genomes

Frederic Bertels; Carsten Fortmann-Grote; Paul Rainey

doi:10.5281/zenodo.5139705

Published July 27, 2021 | Version 2

Dataset Open


1. Max Planck Institute for Evolutionary Biology

This dataset is the output of RAREFAN (http://rarefan.evolbio.mpg.de/), a web server that identifies REPIN populations across an entire bacterial species. The data were generated with the following command: "java -jar -Xmx10g rarefan.jar neisseria/in/ neisseria/out/ Nmen_2594.fas 55 21 neisseria/in/NMAA_0235.faa neisseria.nwk 1e-80 true 1"

All input files are located in the folder neisseria/in/; all output data are located in neisseria/out/.

The input files comprise the 130 FASTA-formatted Neisseria genome files (*.fas) and a RAYT protein sequence, NMAA_0235.faa.

The output files include the following:

A phylogenetic tree "neisseria.nwk" of all genomes generated with andi (http://github.com/evolbioinf/andi/) and clustDist (http://guanine.evolbio.mpg.de/problemsBook/node1.html).

A file containing the frequencies of all 21 bp long sequences found in the Nmen_2594 genome: Nmen_2594.wfr

A file containing all 21 bp long sequences that occur more than 55 times in the Nmen_2594 genome: Nmen_2594.overrep

A file containing information on the RAYTs and their co-occurrence with different REPIN populations: prox.stats

A file containing the nucleotide sequences of all NMAA_0235.faa relatives identified with BLAST+ in the Neisseria species: yafM_relatives.fna

maxREPIN_[0-5] Contains the most frequent REPIN identified for each sequence type in each Neisseria strain.

presAbs_[0-5].txt Contains the following information for each strain: the number of RAYTs, the number of REPINs, the master sequence, the number of master sequences, the size of the entire REP/REPIN population, the number of REPIN clusters that contain more than 10 sequences, all REPINs in the population, and all REPINs that differ from the master sequence by at most three nucleotides.

rayt_[strain name].tab Contains location information for each RAYT relative identified in each strain. The files can be viewed with Artemis.

results.txt contains for each strain the frequency of the six identified 21bp long seeds.

There is one folder called groupSeedSequences, which contains the data used to identify the most common 21 bp long sequences in Neisseria meningitidis WUE 2594. All 21 bp long sequences that occur more than 55 times in the genome are sorted into six sequence groups. These sequence groups are stored in the files Group_Nmen_2594_*.out and .out.fas. There is also a Nmen_2594_words.tab file, which contains the locations of all overrepresented 21 bp long sequences in the Nmen_2594 genome. This file can be viewed in Artemis (https://www.sanger.ac.uk/tool/artemis/) together with the Nmen_2594 genome file. The most common sequence in each group is used as a seed sequence to determine REPIN populations across all 130 genomes.

For each genome there are six output folders (ending in _0 to _5), one for each sequence group.
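The word-frequency step described above can be illustrated with a minimal Python sketch. It is not RAREFAN's implementation: the function names count_words and overrepresented are hypothetical, and the sketch assumes overlapping windows counted on the forward strand only, which may differ from how RAREFAN treats strands. The defaults mirror the word length (21) and frequency threshold (55) given in the command above.

```python
from collections import Counter


def count_words(genome: str, k: int = 21) -> Counter:
    """Count every overlapping k-long word on the forward strand.

    Assumption: the reverse strand is ignored in this sketch.
    """
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def overrepresented(counts: Counter, min_freq: int = 55) -> dict:
    """Keep only the words that occur more than min_freq times."""
    return {word: n for word, n in counts.items() if n > min_freq}


# Toy example with a short word length and a low threshold.
toy_genome = "ACGT" * 5 + "TTTT"
toy_counts = count_words(toy_genome, k=4)
toy_frequent = overrepresented(toy_counts, min_freq=3)
print(sorted(toy_frequent.items()))
```

For a real genome, the sequence string would first be read from the corresponding *.fas file; grouping the overrepresented words into sequence groups is a separate step that this sketch does not attempt.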

Each folder contains the following files:

*.dd: Degree distribution of the REPIN network, where each REPIN is a node. A REPIN is connected to another REPIN if they differ in exactly one position. The degree distribution is a histogram of the number of connections of all the nodes.

*.hist For the largest sequence cluster determined by mcl that consists of REPINs (two REPs in inverted orientation), this file contains the number of REPINs in each sequence class. Sequence class 0 is the master sequence, which is by definition the most common REPIN in the sequence population. Sequence class 1 contains all REPINs that differ from the master sequence in exactly one position. Sequence class 2 contains REPINs that differ in two positions, and so on.

*.mcl Contains the clustering output by mcl. Each line contains the members of one cluster. Lines are sorted by cluster size.

*.mw Contains the most common 21 bp long sequence and its frequency in the genome. This sequence is the basis for first identifying all related REP sequences and then the REPINs formed by these REP sequences.

*.nodes The identity and frequency of all REPINs and REP sequences for either all sequences or only for the largest sequence cluster.

*.ss Contains REPINs and REP sequences as well as their positions in fasta format. Position information starts with the index of the sequence in the genome fasta file (the first sequence is 0...), followed by the start and end position of the entire REPIN/REP sequence.

*.ss.REP REP sequence information in fasta format.

*.tab Location in tab format. Can be used to display the locations of REPs and REPINs in the genome in Artemis.

*_[0-9].ss Contains REPIN/REP sequence information for each subcluster separately.

*_[0-9].tab Contains the locations of REPs/REPINs for each subcluster separately, for viewing in Artemis.

*allSeed.nw Contains the network connections between the nodes of all sequences. Together with the nodes file, it can be used to view the network, for example in R or Cytoscape.

*largestCluster.nodes Information on nodes only from the largest REPIN cluster.

*largestCluster.ss *.ss file for the largest REPIN cluster.

*largestCluster.tab *.tab file for the largest REPIN cluster.

*_rayt_repin_prox.txt Shows which REPIN/REP cluster is in proximity to (within 200 bp of) any of the RAYT genes identified in the genome.

Each folder also contains a subfolder with the complete sequences (including the variable region) of all identified REPs and REPINs.
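The definitions behind the *.dd and *.hist files can be illustrated with a small Python sketch. This is an illustration under stated assumptions, not RAREFAN's code: the function names are hypothetical, all sequences are assumed to have equal length, each unique sequence is treated as one network node, and the sequence-class histogram is assumed to count every copy of a sequence. The actual files may tally these quantities differently.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def degree_distribution(seqs) -> dict:
    """Histogram of node degrees in the sequence network.

    Nodes are the unique sequences; two nodes are connected if they
    differ in exactly one position.
    """
    nodes = sorted(set(seqs))
    degree = {s: 0 for s in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            degree[a] += 1
            degree[b] += 1
    return dict(sorted(Counter(degree.values()).items()))


def sequence_classes(freqs: dict):
    """Return the master sequence and the number of copies per class.

    The master sequence is the most frequent sequence (class 0);
    class d holds sequences differing from it in exactly d positions.
    """
    master = max(freqs, key=freqs.get)
    classes = Counter()
    for seq, n in freqs.items():
        classes[hamming(seq, master)] += n
    return master, dict(sorted(classes.items()))


# Toy population: sequence -> copy number (made-up values).
toy = {"AAAA": 10, "AAAT": 3, "AATA": 2, "AATT": 1, "CCCC": 1}
print(degree_distribution(toy))
print(sequence_classes(toy))
```

The pairwise comparison is quadratic in the number of unique sequences, which is acceptable for a sketch but may be slow for large populations.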

The dataset was generated using the following external tools:

andi for tree building:

B Haubold, F Klötzl, and P Pfaffelhuber. andi: fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics, 2015 vol. 31 (8) pp. 1169-1175.

MCL for REPIN population clustering:

A J Enright, S Van Dongen, and C A Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 2002 vol. 30 (7) pp. 1575-1584.

BLAST+ for identifying RAYT relatives in the different genomes:

C Camacho, G Coulouris, V Avagyan, N Ma, J Papadopoulos, K Bealer, and T L Madden. BLAST+: architecture and applications. BMC Bioinformatics, 2009 vol. 10 (1) pp. 421-9.

Files

neisseria.zip (130.9 MB), md5:b046eb8fa532c0bc34e55070c5825bcd
