SigmoID overview
The SigmoID front end is a GUI application written in Xojo,
which gives it the expected look and usability on all three supported
platforms (OS X, Linux and Windows). The GUI provides the interface for
selected programmes from the HMMER, MEME Suite and TransTerm HP, which are responsible for the actual searches. Sequence logos are calculated by WebLogo
which is written in Python, hence Python (version 2.7.x) and BioPython (version 1.64
and above) are required. Processing of nhmmer, mast and TransTerm HP
output, sequence format conversions and adding regulatory sites to
genome annotation are implemented as separate Python scripts. The
scripts are called from the GUI, but could easily be used separately
and integrated in an annotation pipeline if desired. Detailed
installation instructions are provided within distributions for each
platform. The source code for the whole SigmoID application is available under the GPL 2.0 license.
SigmoID allows you to:
- get binding site data from specialised databases;
- visualise binding site alignments with sequence logos;
- extend, shorten and mask alignments;
- create optimised hmm profiles from alignments;
- search bacterial genomes with calibrated (and uncalibrated) hmm profiles;
- add annotation of promoters and transcription factor binding sites to GenBank-formatted genome files;
- edit genome annotation.
This version includes 80 calibrated profiles (for 5 sigma factors and 36 TFs) optimised for enterobacterial phytopathogens Pectobacterium spp. and Dickeya dadantii.
Efficiency of these profiles will be lower for other bacteria, but with
threshold adjustment they may be usable for many enterobacteria.
The search for binding sites is done by nhmmer which is expected to be
installed in the default location.
Adding annotations to GenBank files is done by the HmmGen.py script
which could be used on its own. BioPython (version 1.64 and up) is
required.
At the moment SigmoID is known to work on OS X (10.8-10.11), Windows (Vista, 7
and 8), Ubuntu (12.04 and 14.04). It may also work on other Linux
distributions, provided the required libraries are installed.
Installation
SigmoID relies heavily on Python and will have severely limited
functionality without it. The various python scripts included with
SigmoID require python version 2.7 and will not work with version 3.
Linux and OS X systems should have python installed, while Windows
users can download it from python.org.
Please note that on Windows you need to modify system PATH environment
variable to include path to python. The easiest way to do it is to
select the "Add python.exe to path" option in the installer (it's not
selected by default!). You may check that python is installed correctly
by typing 'python' at the command prompt, which should launch the
python interpreter.
Biopython (v. 1.64+) should be installed on top of
python 2.7. You can download the distribution and read the installation
details for your system at biopython.org. Please make sure you download and install the correct version for your system and python version.
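Both requirements can be checked with a short script such as the one below. It is only a sketch and is not part of SigmoID: the helper name version_at_least is hypothetical, and the script assumes Biopython reports its version as Bio.__version__. Run it with the same python interpreter that SigmoID will use.

```python
import re
import sys


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the minimum tuple."""
    # Keep only the leading numeric components, so "1.64+dev" -> (1, 64).
    numbers = re.findall(r"\d+", version)[:len(minimum)]
    return tuple(int(n) for n in numbers) >= tuple(minimum)


if __name__ == "__main__":
    print("Python %s" % sys.version.split()[0])
    if sys.version_info[:2] != (2, 7):
        print("Warning: the SigmoID scripts expect python 2.7")
    try:
        import Bio
    except ImportError:
        print("Biopython is not installed for this interpreter")
    else:
        if version_at_least(Bio.__version__, (1, 64)):
            print("Biopython %s is recent enough" % Bio.__version__)
        else:
            print("Biopython %s is too old (1.64+ required)" % Bio.__version__)
```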
Depending on your system set-up, MEME
may require additional python and perl modules. Please check the log
pane of SigmoID's window for error messages and install missing modules
if necessary. MEME within SigmoID could be called in two ways: as a
simple converter of the aligned sequences to MEME format (via the Convert to MEME menu command) or for finding binding sites within unaligned sequences (via the Find Sites with MEME
menu command). The second option generates html output via a perl
script and relies on template files that MEME expects to find in
certain locations, hence this is only likely to work when MEME is
installed on the user system and the path to the system version is appropriately set
in SigmoID preferences.
There are two Linux distributions of SigmoID, for
32-bit and 64-bit systems. Both may require installation of additional
libraries. The 32-bit version depends on WebKit version 1 for
displaying help and database search results. WebKit1 is actually
included with the 32-bit Linux SigmoID distribution; please see the SigmoID.sh
file for the correct command to launch SigmoID with included WebKit1
libraries.
Supported file formats
SigmoID can open two types of sequence files: with genome sequences and TFBS/promoter sequences.
SigmoID expects annotated genome sequence files that
should be in GenBank format. Only the files in GenBank format can be
opened in SigmoID via the command from the File menu. The current
version of genome browser can
properly open only files with a single accession per file. You
can still perform a nhmmer (but not MAST) search of the genome split
into several separate accessions, but please don't try to open this
type of GenBank files in genome browser.
SigmoID can also work with unannotated genome
sequences in fasta format. An unannotated genome in fasta format
can't be opened directly in genome browser, but can be selected
as a target for nhmmer/MAST search. Of course, filtering
options of the post-processing script (HmmGen.py) that rely on feature
table of .gbk files will not work with genome sequences in fasta format.
TFBS/promoter sequences can be in either fasta or
special SigmoID profile format (with .sig extension). TFBS/promoter
sequences in fasta files should be aligned and of the same length. If
the sequences differ in length, SigmoID displays an error message and
doesn't show the sequence logo; the sequences, however, are loaded (for
viewing purposes and/or for aligning them properly with MEME).
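If you want to check a fasta file before loading it, the equal-length requirement is easy to test outside SigmoID. The following is a minimal sketch using only the python standard library; the function names and the example sequences are made up for illustration and do not come from SigmoID itself.

```python
def read_fasta(lines):
    """Parse fasta text lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def is_aligned(records):
    """True if there is at least one record and all sequences share one length."""
    lengths = set(len(seq) for _, seq in records)
    return len(lengths) == 1


example = """>site1
TTGACAATTAAT
>site2
TTGACTATTAAT
>site3
TTGACAATTA
""".splitlines()

records = read_fasta(example)
# The third site is two bases shorter, so this set is not a usable alignment.
print(is_aligned(records))
```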
SigmoID profile format
files (.sig extension) are virtual folders containing several separate
files: the actual binding site sequences in fasta format, calibrated
hmm and MEME profiles, as well as two text files with profile
description and search engine/postprocessing options. The contents of
the files within the .sig virtual folder can be viewed in the main SigmoID window via commands from the View menu.
SigmoID has two hidden menu commands for converting a real folder to
the .sig file and vice versa; holding down the 'Alt' button while
selecting the File menu reveals them.
Please note that due to current Xojo limitation
SigmoID can't handle virtual folders properly on 64-bit machines,
therefore 64-bit Linux version converts all .sig files to real folders.
These can only be accessed via toolbar of the main window and can't be
opened via any menu command. Therefore, on 64-bit Linux machine, please
move to the current profile folder any .sig folders you may want to open and
use the leftmost toolbar button of the main window to open them.
SigmoID can save genome files in standard GenBank
format, export unannotated genome sequence in fasta format and export
feature table in Sequin table format required by the NCBI tbl2asn
program. The appropriate commands are located in the File menu.
Interface
Windows
Main Window
The main window opens at SigmoID launch and is split into two major interface elements: the topmost Viewer and the Log pane behind it. The Log
displays informative messages (including errors) and also shows textual
output from some of the included command line programmes and python
scripts. The Viewer
is hidden at SigmoID launch and opens once binding site data is loaded.
It displays the sequence logo by default, but can be switched to
display other info for the loaded data via the View menu.
The info that can be shown by the Viewer depends on the type of the
file opened: only the logo and the sequences could be shown for fasta
files, while all options are available for .sig files. The Viewer
can be used to edit binding site sequences which has two consequences:
sequence logo is recalculated and all profile settings are discarded
since they become invalid (this is also reflected in the Profile Wizard
window). Please note that other types of information (settings,
description, hmm profile, etc.) should not be edited here as this will
have no effect. Only the changes made via the Profile Wizard window can be saved (in a .sig file) and reused.
The sequence logo displayed by the Viewer
is interactive and allows part(s) of the alignment to be selected. A
single area can be selected by dragging a mouse across the logo;
additional area(s) could then be added by pressing the "Shift" button
and dragging again. Such a selection can have two possible uses. First,
you can save the sequences corresponding to the selected area of the
logo in a new fasta file (via the Save Profile Selection...
command from File menu). Second, you can launch the nhmmer search with
selected parts of the alignment masked. The masking happens by default
if you initiate a search when a part of the alignment is selected. This
works by invoking the alimask programme from the HMMER package to
produce a masked hmm profile which is then used to search the target
genome. You can set masking options in the nhmmer configuration window.
Please refer to HMMER User Guide for the details.
The toolbar contains buttons for a few of the most used functions. The leftmost "Load Alignment"
button lets you open binding site data from either fasta or .sig
files. The .sig files provided with SigmoID could be chosen from the
drop down menu. User files can be opened by just pressing this button
(on Linux or Windows) or choosing More... at the very bottom of the
drop down list (OS X).
The next "Search" button launches nhmmer search with the currently loaded profile. The raw search results appear in the Log pane. In case a .sig file is opened the post processing python script is launched by default and the Genome Browser window
is opened to display search results. For non-calibrated profiles
(opened from fasta files) the post processing script has to be launched
separately by pressing the third "PostProcess"
toolbar button. Please note that the original GenBank file is never
modified and SigmoID will ask where to save the file in the same
(GenBank) format with the additions it makes.
The fourth "Terminators"
toolbar button lets you search for terminators. This function uses
TransTerm HP, performs the necessary format conversions and adds the
terminators to the genome annotation. As TransTerm takes some time to run,
the results may take a couple of minutes to appear, depending on
available processing power.
The fifth "Palindromise"
button does a simple thing: it reverse-complements the sequences of the
currently loaded binding sites, adds them to the currently loaded data
and recalculates the sequence logo. This function is only meaningful
for sites known to be palindromic and is especially useful when only
a few sequences are available. When searching with a palindromised profile,
the "Palindromic" check box should be checked. This function should not
be used in combination with MEME (since MEME itself does a similar
thing). Please also avoid using this function before saving a
calibrated profile via the Profile Wizard window, as setting the
"Palindromic" check box in Profile Wizard does exactly the same thing
(and you'll end up with every sequence duplicated).
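The effect of palindromising, and of doing it twice by accident, can be illustrated with a few lines of python. This is a sketch of the idea only, not the code SigmoID runs; the function names are hypothetical and only the bases A, C, G, T and N are handled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T and N only in this sketch)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def palindromise(sites):
    """Return the original sites followed by their reverse complements."""
    return list(sites) + [reverse_complement(site) for site in sites]


sites = ["AACGTT", "AACGTA"]
once = palindromise(sites)
twice = palindromise(once)   # what happens if the step is applied again

print(once)
print("%d -> %d -> %d sequences" % (len(sites), len(once), len(twice)))
```

Applying the step a second time adds no new information: every sequence in the doubled set already has its reverse complement present, so the result simply contains each sequence twice.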
All genome search commands are also available from the Genome menu.
The last toolbar button, "Settings", currently lets you set the paths to command line programmes and key scripts used by the GUI.
Genome Browser Window
This window is opened after a search for binding sites and
could be
used to quickly skim through the sites just found. Alternatively, the
browser could be used to view an existing GenBank file independently of
any search function. The window is split into three viewers which
display feature map (on top), the actual sequence with six frame
translation (in the middle) and search results (in the bottom part).
The feature map is interactive and can be used to select either a
feature by clicking on it or part of the displayed genome fragment by
dragging across it. Selected sequence can be copied to the clipboard,
used as a query to launch database searches, edited or deleted. The
corresponding commands are located in the contextual menu (brought up
by right-clicking in the feature map); you can also double-click a
feature to open its editor. Please note that currently no format checks
are done in this window: be careful to adhere to GenBank format!
Double-clicking outside any features centers feature map at the clicked
coordinate.
Database searches can be launched via contextual
menu. Depending on the current selection, the menu will contain
commands to search (with BLAST) against the nr database or (with
hmmer/BLAST) against SwissProt/Uniprot/CDD database. Since NCBI servers
are overloaded most of the time, hmmer searches of SwissProt/UniProt
usually run much faster.
Search results are displayed in the bottom of the window which in
essence is a very simple web browser. Rudimentary navigation here
(Back/Forward/Reload) is possible via the contextual menu.
You can manually resize the top and bottom parts of
this window by dragging the separator (the line with three dots above
the browser pane) up or down.
The toolbar located on top of this window can be
used to navigate the last hmmer/mast search results (the leftmost
arrows control); keyboard 'left arrow' and 'right arrow' keys could
also be used for navigation. The hit sequences could be saved to
a text
file (in fasta format) via the corresponding command from the Genome
menu. The check box
to the right of this control could be used to exclude the undesired
hits when saving them. The toolbar also allows to zoom in/out feature
map (the rightmost
control with +/- signs) or to search within the genome. The "smart"
search field can distinguish three types of queries (sequence,
coordinate or
feature text) and performs the search according to query type.
Navigating to the next search result is possible via the Control-G
shortcut (Command-G on a Mac) or the command from the Genome menu.
Database Windows
The RegPrecise and RegulonDB windows
provide access to the corresponding databases with regulatory
information. These windows have similar organisation and behaviour.
Most of each window is occupied by the regulator list. Since RegulonDB
contains information only for E. coli, the regulators are displayed
straight away, while in case of RegPrecise you have to choose a species
first (from the popup above the list). The top of the RegulonDB window
lets you switch between the transcription and sigma factor
binding sites and filter the sites according to the evidence confidence
level.
Clicking a regulator in the list activates action
buttons in the bottom of the windows. The leftmost button (with an "i"
letter) connects to the database and displays the information on the
regulator in a new window.
The Check TF Presence
button, located to the right of the info button, can be used to verify
the presence of the transcription factor in the currently opened
genome. The button will be disabled if there's no genome opened. This
button connects to the corresponding database to get the amino acid
sequence of the regulator and then launches tfastx search vs the opened
genome. The three topmost hits of this search are displayed in the log
pane of the main window. Since similarity levels vary greatly between
genomes, it's up to the user to estimate its significance. A reciprocal
confirmation could be helpful here: copy the coordinate of the topmost
hit, find the corresponding ORF in the genome browser and launch phmmer
search vs SwissProt/Uniprot to see if the original TF and its obvious
orthologues come up as the top hits.
Please note that the path to the genome file should
not contain spaces! This is due to the way options are treated by
tfastx.
The Regulon Logo
button is located in the lower right of the windows and could be used
to load the binding site data for the currently selected regulator and
display its logo in the main window. The RegPrecise window contains an
additional button to display the logo of the binding sites from the
corresponding regulog. Depending on the number of sites available for
the regulator and their diversity, either button can be more
usable.
Profile Wizard Window
This window lets you enter the settings required to make a calibrated
profile. The top left of the window contains search thresholds, of
which only the nhmmer gathering threshold is strictly required as it is used by default by nhmmer. Choosing the right threshold can be simplified by the Find Minimal Score
command which finds the minimal score for binding sites in the training set.
Although the other two nhmmer cutoffs (and the MAST p-value threshold) are not
required for saving the calibrated .sig file, they are still desirable.
Entering the correct post-processing options in the top right of this
window is critical for making correct additions to genome
annotation. The Palindromic site
check box sets corresponding options when running MEME and MAST and
filters overlapping results produced for palindromic sites by nhmmer.
Checking the Use next locus_tag
option will pick up both the locus_tag and gene qualifiers from the
following gene when possible (these qualifiers won't be added if the
binding site is located between divergently transcribed genes). The Ignore sites within ORFs
option can significantly reduce the amount of non-specific hits for
"noisy" profiles. This option, however, should be used with caution, as
it may also remove some of the specific hits, especially for
repressors. The text entered in the protein name field will be used as the value for the required bound_moiety qualifier when adding sites to genome annotation.
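To make the role of these options more concrete, the sketch below formats a binding site as a GenBank feature table entry. It is an illustration under stated assumptions rather than the output of HmmGen.py: the feature key protein_bind is assumed here (the INSDC feature table definition lists bound_moiety as mandatory for that key), and the function name, coordinates and qualifier values are invented.

```python
def format_site_feature(start, end, strand, bound_moiety,
                        gene=None, locus_tag=None):
    """Format one binding site as GenBank feature table lines.

    The feature key starts in column 6 and the location in column 22,
    as in standard GenBank flat files. 'protein_bind' is an assumption.
    """
    location = "%d..%d" % (start, end)
    if strand == -1:
        location = "complement(%s)" % location
    lines = ["     %-16s%s" % ("protein_bind", location)]
    qualifiers = [("bound_moiety", bound_moiety)]
    # gene and locus_tag are only added when a following gene was found
    if gene:
        qualifiers.append(("gene", gene))
    if locus_tag:
        qualifiers.append(("locus_tag", locus_tag))
    for key, value in qualifiers:
        lines.append('%s/%s="%s"' % (" " * 21, key, value))
    return "\n".join(lines)


# Hypothetical site on the reverse strand, upstream of a made-up gene.
feature = format_site_feature(1234, 1250, -1, "ExampleTF",
                              gene="exaA", locus_tag="XYZ_0001")
print(feature)
```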
Profile description should be entered in the text box in the lower part
of the window. It is required to activate the Save... button and is
expected to include the information on the data source(s) and
description of the profile construction procedure.
If an existing .sig file is opened, all settings
from this file are entered in the fields of the Profile Wizard.
However, these values will be erased if you edit the alignment
sequences. If you want to prevent this, press the Lock button located in the lower right of this window.
Command Configuration Windows
These simple windows are opened in most cases before launching command
line utilities (nhmmer, meme, mast, TransTerm HP) and python scripts to
allow changing some of the options. The options are hopefully
self-explanatory, but an explanation is provided in most cases via help
tags: hold a mouse pointer over the option for a second to see this help.
Web Browser and Help Windows
The minimalistic web browser windows are used to display info from
RegPrecise and RegulonDB databases, as well as SigmoID help. If these
seem inconvenient, a link could be copied (via a contextual menu) and
opened in a browser of your choice.
SigmoID Preferences Window
This window can be opened via the Preferences... command located in the
SigmoID menu on OS X or in the Edit menu on Windows/Linux or by
pressing the rightmost toolbar button in the main window. The buttons
in the top part of this window switch between three preference panels.
The panels allow you to:
1) Set the paths to executable files (nhmmer, meme, mast, etc.) used by
SigmoID. This can be useful if SigmoID can't find some of the required
programmes or if you like to use the ones already installed on your
system. You can also reset all paths to their defaults (pointing to the
files distributed with SigmoID) with the button in the bottom left of
this window.
2) Set the databases searched by BLAST and optionally restrict the
searches to a smaller taxonomic group to speed them up. This panel also
lets you switch between two result formats that can be output by the
HMMER web server: the full graphics-rich html format (default) and
a simple text format. The HMMER search pages
have changed a lot recently, and some versions could not be
displayed properly by the default browser engines used by SigmoID on
Windows and 32-bit Linux. This option should be used if you have
problems with the default html format.
3) Switch to an alternative folder with calibrated profiles from the
one provided with SigmoID. Only the profiles from this folder will be
accessible via the leftmost toolbar button in the main window. Only
these profiles could be used by the Scan Genome function.
Menu Reference
File Menu
This menu contains the standard open and save commands separated into three groups.
The topmost group contains commands related to alignments/profiles, the next one – to genome files.
Open Profile...
Displays an Open File dialog where you can select a
profile/alignment file from your local disk. SigmoID can open files in
its own format (.sig) or text files in fasta format. The file should
have one of the following extensions: .sig, .fasta, .fas, .fsa,
.fa.
Save Profile As...
becomes enabled if the binding site sequences are
changed. Rather than saving the changes directly, this command opens
the Profile Wizard window which allows you
to enter new profile settings and save the alignment in a .sig file. If
you want to save just the sequences in fasta format, please use the
next command.
Save Profile Selection...
saves (in fasta
format) the part of the profile corresponding to the currently selected
part of the sequence logo.
Save Logo Picture...
does what it says and does it in PNG format.
Close
Closes the current window. The main window can't be closed with this command.
Open Genome...
Displays an Open File dialog where you can select a
genome file from your local disk. The file should be in the GenBank
format and have the .gb or .gbk extension.
Save Genome...
Saves the file currently opened in the Genome Browser window with the same filename.
Save Genome As...
Saves the genome currently opened in the Genome Browser window with a different filename.
Export Sequence...
Exports the contents of the current genome as a plain text file in fasta format. This discards the feature table.
Export Feature Table...
Exports the feature table in GenBank Sequin table format.
The resulting .tbl file can be used to prepare GenBank submission with
the help of tbl2asn.
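For orientation, the Sequin table is a plain tab-delimited text file: a ">Feature" header line with the sequence identifier, one line per feature interval (start, stop, feature key) and indented lines for qualifiers. The sketch below writes such a table for simple single-interval features. It is not SigmoID's exporter; the function name and the example features are hypothetical, and partial or multi-interval locations are left out.

```python
def to_sequin_table(seq_id, features):
    """Build a minimal Sequin feature table as a string.

    features: list of (start, end, strand, key, qualifiers) with
    1-based coordinates, strand +1 or -1 and qualifiers as a list
    of (name, value) pairs.
    """
    lines = [">Feature %s" % seq_id]
    for start, end, strand, key, qualifiers in features:
        if strand == -1:
            # minus-strand features are written with start > stop
            start, end = end, start
        lines.append("%d\t%d\t%s" % (start, end, key))
        for name, value in qualifiers:
            lines.append("\t\t\t%s\t%s" % (name, value))
    return "\n".join(lines) + "\n"


# Invented example: one gene and one binding site on the reverse strand.
table = to_sequin_table("Example1", [
    (100, 400, 1, "gene", [("gene", "exaA")]),
    (50, 66, -1, "protein_bind", [("bound_moiety", "ExampleTF")]),
])
print(table)
```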
Quit
Closes all SigmoID windows and exits SigmoID completely. If you
select this option with unsaved profile or genome, SigmoID will first
ask you to save the changes.
Edit Menu
Undo
Undoes the last editing action done in the currently
active text field. Unfortunately, Undo is not available for changes
made to genome files.
Cut
Copies the selected text to the clipboard and deletes it from the original position.
Copy
Copies the selected text to the clipboard. In Genome
Browser window this command copies the nucleotide sequence.
Copy Protein Sequence
If a CDS is currently selected in the Genome
Browser, this command copies its amino acid sequence to the clipboard.
Paste
Pastes the text from the clipboard copied using Cut or Copy command to the current cursor location.
Clear
Deletes the selected text.
Select All
Selects all text in the current text field.
Preferences
Opens the Preferences window to change personal
preferences for SigmoID. Currently it only allows you to set the paths to
executable files. On OS X this submenu is located in the SigmoID menu.
View Menu
This menu can be used to change the information displayed in the Main window and in the Genome Browser. Only the last command, View Details, is related to the Genome Browser. The remaining commands are related to the main window and switch the type of information displayed in the topmost Viewer
pane. This menu lets you view the contents of all components of a .sig
file, which is actually a virtual folder containing six text files.
Editing the sequences can be done directly in the Viewer, while the
rest of information can only be edited via the Profile Wizard.
Logo
Displays
sequence logo for the sequences in the currently opened .sig or fasta
file (or downloaded from the RegPrecise or RegulonDB databases).
Currently, the logo is calculated using the original T. Schneider
(1986) formula without small sample correction.
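Without the small sample correction, the information content of an alignment column for DNA is 2 bits minus the Shannon entropy of the base frequencies in that column. The sketch below computes it for a set of aligned sites. It only illustrates the formula and is not the WebLogo code; the function names are hypothetical, and ignoring gaps and ambiguous bases is an assumption of this example.

```python
import math


def column_information(column):
    """Information content in bits of one DNA alignment column.

    Uses R = 2 - H with no small sample correction. Characters other
    than A, C, G and T are ignored (an assumption of this sketch).
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    total = float(len(bases))
    entropy = 0.0
    for base in "ACGT":
        freq = bases.count(base) / total
        if freq > 0:
            entropy -= freq * math.log(freq, 2)
    return 2.0 - entropy


def logo_information(sites):
    """Per-column information content for equal-length aligned sites."""
    return [column_information("".join(col)) for col in zip(*sites)]


sites = ["TTGACA", "TTGACT", "TTGATA", "TTTACA"]
for position, bits in enumerate(logo_information(sites), 1):
    print("%d\t%.2f" % (position, bits))
```

In a logo, the height of each letter is its frequency in the column multiplied by the column's information content, so fully conserved columns reach the maximum of 2 bits.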
Sequences
Shows the actual nucleotide sequences for the loaded binding site data.
You can edit the sequences in this view, but if you wish to use the
edited data further, please switch to the Logo view, as this finalises
your editing and recalculates the logo and the hmm profile.
Profile Info
The description of the profile as given by its author. Available only for data from .sig files.
Hmm Profile
The calibrated hmm profile produced by hmmbuild when creating the .sig
file. Available only for data from .sig files.
MEME data
The same sequences in MEME format. These are used for MAST searches. Available only for data from .sig files.
Profile Settings
Various settings, including profile calibration thresholds and post
processing options. Available only for data from .sig files.
Hide Viewer
Hides/unhides the Viewer in the main window to give more/less room for the Log pane.
View Details
shows/hides sequence display with six frame translation in the Genome Browser window.
Profile Menu
Extend Binding Sites...
Opens a small window where you can specify the left
and right extension limits, as well as the genome file to search. This
command finds every sequence from the currently opened binding site
data in the genome sequence and adds the specified number of bases to
the left and to the right. The results are written to the log pane.
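The operation can be pictured with the following sketch, which looks up each site in the genome string and returns it with the requested flanks. It is a simplified illustration, not the script used by SigmoID: the function name is hypothetical, only the forward strand is searched (sites on the reverse strand would need a second search with the reverse-complemented site), and flanks are clipped at the ends of the sequence.

```python
def extend_sites(genome, sites, left, right):
    """Return every forward-strand occurrence of each site with flanks added."""
    genome = genome.upper()
    extended = []
    for site in sites:
        site = site.upper()
        pos = genome.find(site)
        while pos != -1:
            start = max(0, pos - left)
            end = min(len(genome), pos + len(site) + right)
            extended.append(genome[start:end])
            # continue after this hit to pick up further occurrences
            pos = genome.find(site, pos + 1)
    return extended


genome = "GGGAAATTGACATTTCCC"
print(extend_sites(genome, ["TTGACA"], 2, 3))
```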
Convert to Stockholm
Converts the current profile to the minimal
Stockholm format (as required by hmmbuild) and outputs the results to
the Log pane.
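A minimal Stockholm file consists of the header line "# STOCKHOLM 1.0", one "name sequence" line per aligned sequence and a closing "//" line. The sketch below produces this layout from a list of named sequences. The function name is hypothetical; the sketch assumes the names are unique and contain no spaces and that the sequences are already aligned.

```python
def to_stockholm(records):
    """Format (name, aligned_sequence) pairs as a minimal Stockholm alignment."""
    width = max(len(name) for name, _ in records)
    lines = ["# STOCKHOLM 1.0"]
    for name, seq in records:
        # pad names so that the sequences line up in one column
        lines.append("%s  %s" % (name.ljust(width), seq))
    lines.append("//")
    return "\n".join(lines) + "\n"


alignment = to_stockholm([
    ("site1", "TTGACA"),
    ("site_2", "TTGACT"),
])
print(alignment)
```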
Convert to Hmm
Runs hmmbuild from the HMMER package to create a hmm profile that could be used as input for nhmmer.
Convert to MEME
Runs MEME with the currently loaded binding site sequences and outputs the results as plain text to the Log
pane. These results can be used as input by MAST (in fact, when running
MAST with uncalibrated data MEME is run first in exactly the same way).
Find Sites with MEME...
Shows a window allowing you to configure MEME parameters. For this
command MEME is configured to produce results in html format, hence
they are displayed in the Web Browser window. This command may be useful when dealing with unaligned data, e.g. from RegulonDB.
The command is currently not available in SigmoID for Windows. Please use the Convert To MEME command which runs the same command, but outputs plain text into the log pane of the main window.
Profile Wizard...
Opens the Profile Wizard window which lets you enter calibrated profile settings and then save the profile as a .sig file.
Regulon menu
The first two items of this menu provide access to databases with regulon information. The RegPrecise database contains high quality information on binding sites for many bacteria while the RegulonDB is a specialised E. coli
regulon database. While the information from RegulonDB in most cases
requires additional steps to be usable, it has data on
regulators not present in RegPrecise. When using the data from RegulonDB for genomes other than E. coli, it might be worthwhile to check for the presence of the TF orthologue in the studied genome. This can be done with the Check TF Presence
command.
RegPrecise...
Opens the window providing access to the RegPrecise database.
RegulonDB...
Opens the window providing access to the RegulonDB database.
Regulon Info
Opens the RegPrecise or RegulonDB web page with info for the regulon currently selected in one of the database windows.
Show Logo
Shows the logo of the currently selected regulator
binding sites in the main SigmoID window. The result can usually be
used for nhmmer/mast search for RegPrecise data. The binding sites in
RegulonDB are often misaligned, in which case the Find Sites with MEME command from the Profile menu may (or may not) be useful.
Check TF Presence
This command gets the regulator protein sequence from
RegulonDB and runs tfastx search versus the currently open genome. The
top three tfastx search results are displayed in the log pane of the
main SigmoID window. We recommend a reciprocal check using the
coordinates of the best hit to locate it in the current genome and run
the phmmer search vs the SwissProt database which (in case of
orthology) should bring the original regulator as the best E. coli hit. At the moment this
command is not available for RegPrecise since there's no
straightforward way to get regulator sequences from this database.
Find Minimal Score
May be helpful when determining search thresholds.
This command
can only be issued when a nhmmer search has just been run and its
results are displayed in the Genome Browser window. This function
simply compares the current hits to the training set (original binding
sites opened in the main window), outputs the lowest score among the
training set and lists missed hits. The lowest specific and highest
unspecific scores found are also entered as the nhmmer trusted and
noise thresholds into the Profile Wizard
window. If the noise threshold appears lower than the trusted one,
their mean is entered in this window as the gathering threshold
(otherwise the value of the trusted threshold is entered here). The
gathering threshold is actually the one that will be used for further
searches. Depending on the original data and the actual genome, this
simple approach may fail to choose the right values. You still have to
verify thoroughly that these scores are the ones that you want! Also
note that this command can't find sites with redundant bases or gaps.
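The threshold logic described above can be summarised in a short sketch. It reflects one reading of the description (the noise threshold is compared with the trusted threshold) and is not the code used by SigmoID; the function name, the data layout and the example scores are hypothetical.

```python
def suggest_thresholds(hits, training_sites):
    """Derive trusted, noise and gathering thresholds from search hits.

    hits: list of (sequence, score) pairs from the last search
    training_sites: sequences of the original binding sites
    """
    training = set(site.upper() for site in training_sites)
    found = set(seq.upper() for seq, _ in hits)
    specific = [score for seq, score in hits if seq.upper() in training]
    unspecific = [score for seq, score in hits if seq.upper() not in training]

    trusted = min(specific) if specific else None    # lowest specific score
    noise = max(unspecific) if unspecific else None  # highest unspecific score
    if trusted is not None and noise is not None and noise < trusted:
        gathering = (noise + trusted) / 2.0
    else:
        gathering = trusted
    return {
        "trusted": trusted,
        "noise": noise,
        "gathering": gathering,
        "missed": sorted(training - found),
    }


hits = [("TTGACA", 12.0), ("TTGACT", 9.0), ("TAGACA", 5.0)]
result = suggest_thresholds(hits, ["TTGACA", "TTGACT", "TTGATA"])
print(result)
```

As in SigmoID itself, the suggested values are only a starting point and should be checked against the actual hit list.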
Genome Menu
This menu collects genome related commands and is mostly oriented at
various ways of searching for regulatory information in currently
opened genome.
nhmmer Search...
Opens a window that lets you configure and
launch nhmmer, which is the primary search engine in SigmoID. This
function is enabled if binding site data is loaded (and sequence logo
of the site is displayed in the main window). To launch the
search, you have to choose a file with genome sequence in GenBank
format and choose the cutoff score (which is critical for getting the
correct results). If a calibrated profile is loaded, then the correct
cutoff will be chosen already. The raw search results appear in the Log
pane. If you are sure that the cutoff is correct, you may check the
"Add annotation to the genome" check box which will run the HmmGen.py
python script to filter nhmmer results, add the binding sites to genome
annotation and open the updated genome sequence in the Genome Browser window. In case a calibrated profile (.sig file) is opened, this script is launched by default. For non-calibrated profiles
(opened from fasta files) the post processing script has to be launched
separately by the Annotate Current Sites command.
Annotate Current Sites...
Opens a window that lets you configure and launch the python script
(HmmGen.py) to filter nhmmer results, add the binding sites to genome
annotation and open the updated genome sequence in the Genome Browser window.
Using this command separately from nhmmer search may be convenient for
unoptimised profiles when deciding on the correct search options. This
command doesn't modify the original GenBank file, but will write a new
one (asking where to save it)
with the additions it makes.
MAST Search...
Opens a window that lets you launch MAST from the MEME Suite package.
Please note that, compared to nhmmer, MAST hasn't been extensively tested
within SigmoID. The configuration window provides minimal options
(basically, just one: the p-value cutoff), but lets you enter
additional options if desired. These will be appended to the end of
MAST command line. If a non-calibrated profile is opened, MEME is run
before MAST to convert the binding site sequences to the required
format. The raw search
results (and, in case MEME was run, its output as well) appear in
the Log
pane. The checkboxes in the bottom of this window instruct SigmoID to show the results filtered by the post processing script (MastGen.py) in the Genome Browser window with or without modifying the annotation.
Terminator Search...
Opens a window that lets you configure and launch TransTerm HP to
search for terminators. This command performs the necessary format
conversions and adds the
terminators to the genome annotation (with the TermGen.py script). As TransTerm takes some time to run,
the results may take a couple of minutes to appear, depending on
available processing power.
Scan Genome...
This command is designed to perform a full genome scan with all available
calibrated profiles with minimal user interaction. You can select the
desired profiles (or use all of them) and choose whether a terminator search
will be performed. After you press the Run button, SigmoID runs
nhmmer followed by the HmmGen.py script for each checked profile using
preconfigured settings, and then runs the terminator search. The results
are written in GenBank format to the file you specify.
With many profiles this function may take a while to run.
Save Checked Sites...
Saves the hits from the last search currently displayed in the Genome Browser
to a text file (in fasta format). The check box to the right of the navigation
arrows in the Genome Browser toolbar can be used to exclude undesired
hits when saving them.
List Regulons...
Outputs to the Log pane of the main window either a single regulon
(controlled by the specified regulator) or all regulons currently
annotated in the genome. The regulons are output by sequentially
listing the operons/divergons controlled by a regulator. For the
purpose of this command the operon is defined as the genes between a
binding site and the nearest terminator or a long intergenic gap. Two
divergently transcribed operons are output as a single divergon with
the regulator site in the middle. The settings window opened by this
command allows you to choose the criteria for operon beginning and end.
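The gap-based part of the operon definition above can be illustrated with a minimal Python sketch. It is not the code used by SigmoID or OperOn.py: the gene tuples, the function name split_into_operons and the max_gap value of 100 bp are assumptions chosen for illustration, and terminators and binding sites are ignored.

```python
# Illustrative sketch only; not the SigmoID or OperOn.py implementation.
# Genes are hypothetical (name, start, end) tuples on the same strand,
# sorted by start coordinate. max_gap=100 is an arbitrary example value.

def split_into_operons(genes, max_gap=100):
    """Group consecutive genes; a gap longer than max_gap starts a new group."""
    operons = []
    current = []
    previous_end = None
    for name, start, end in genes:
        if previous_end is not None and start - previous_end > max_gap:
            operons.append(current)
            current = []
        current.append(name)
        previous_end = end
    if current:
        operons.append(current)
    return operons


genes = [("geneA", 100, 900), ("geneB", 920, 1800), ("geneC", 2500, 3300)]
print(split_into_operons(genes))
```

Here geneA and geneB are 20 bp apart and stay together, while the 700 bp gap before geneC starts a new group.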
Find
This command simply puts the cursor into the search
field located in the top right corner of the Genome Browser window. The
search field distinguishes three types of queries (sequence, coordinate or
feature text) and performs the search according to the type of query.
Type what to search for and press "Enter" on the keyboard to initiate
the search. The position of the first occurrence of the query in the genome
will be highlighted.
Find Again
Highlights the next occurrence of the previous search query in the genome.
Add Plot...
This command can be used to plot either RNA-seq coverage data produced by samtools (see "Using SigmoID to view RNA-seq coverage data") or simple numerical data (e.g. %GC). Up to four overlapping graphs can be shown in the plot area; just use the Add Plot...
command repeatedly. All graphs are shown in the same plot area, which
makes it possible, for example, to compare RNA-seq data for two conditions and for
both strands. Each plot is currently scaled separately, so the maximal
values plotted (shown on the left and on the right) should be taken
into account when comparing the plots.
Remove Plots
Removes all plots that are currently displayed.
Merge Plot Data...
This is an auxiliary command that can be used to
merge two data files produced by the samtools depth command. This is
required to properly display RNA-seq data as described in "Using SigmoID to view RNA-seq coverage data".
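To make the idea of merging two depth files concrete, here is a minimal Python sketch. The samtools depth output has three tab-separated columns (reference name, 1-based position, depth). The merge rule shown here, summing the two depth values at each position, is an assumption; the text does not state how SigmoID combines the files. The function names read_depth and merge_depth are hypothetical.

```python
import io
from collections import defaultdict


def read_depth(handle):
    """Parse 'samtools depth' output: reference, 1-based position, depth."""
    depth = {}
    for line in handle:
        if not line.strip():
            continue
        ref, pos, count = line.rstrip("\n").split("\t")[:3]
        depth[(ref, int(pos))] = int(count)
    return depth


def merge_depth(first, second):
    """Sum depth values position by position (assumed merge rule)."""
    merged = defaultdict(int)
    for table in (first, second):
        for key, count in table.items():
            merged[key] += count
    return dict(sorted(merged.items()))


# Tiny in-memory stand-ins for two depth files (e.g. one per mate).
mate1 = io.StringIO("chr\t1\t3\nchr\t2\t5\n")
mate2 = io.StringIO("chr\t2\t4\nchr\t3\t1\n")
merged = merge_depth(read_depth(mate1), read_depth(mate2))
for (ref, pos), count in merged.items():
    print(ref, pos, count, sep="\t")
```

Positions missing from one file are treated as zero depth in that file, since samtools depth omits uncovered positions by default.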
Window Menu
Lists all currently opened SigmoID windows. Selecting a window from this list brings it to the front.
Help Menu
About SigmoID
Displays a window with information about SigmoID including the current version and a brief list of credits.
SigmoID Help
Opens the Help Viewer window.
HMMER User Guide
Opens (in the default PDF viewer) the HMMER User guide from the HMMER package distribution.
Hmmer.org
Opens HMMER website in the web browser window.
MEME Suite Web Portal
Opens the main MEME Suite website in the web browser window.
Using SigmoID to view RNA-seq coverage data
SigmoID can display graphs of RNA-seq coverage, which
can be very helpful for verifying regulatory sequences and operon
boundaries. At present SigmoID does not include all the functions required
and can only load and display read count values. These can be produced
in many different ways, one of which (not necessarily the best one) is
described below. This approach requires bowtie2 for read mapping and samtools for processing the resulting file.
The commands below produce the required files for
paired reads. The switches -p 8 and -@ 8
run the bowtie2 and samtools tasks on eight processor cores; adjust
them for your system.
The commands below assume that the genome file is
called 'genome.fasta' and RNA-seq data are in the files 'read1.fastq'
and 'read2.fastq'.
1. Index your genome file:
bowtie2-build genome.fasta genome_index
2. Map the reads to your sequence:
bowtie2 -x genome_index -p 8 --very-sensitive-local --no-mixed --no-discordant -1 read1.fastq -2 read2.fastq -S mapped.sam
3. Convert sam to bam, sort and index it (the sort command uses the -o option of samtools 1.x to name the output file):
samtools view -bS -@ 8 mapped.sam | samtools sort -@ 8 -o mapped.bam -
samtools index mapped.bam mapped.bai
4. Remove reads with mapping quality below 2 (which map to more than one location):
samtools view -b -q 2 -@ 8 mapped.bam > mapped2.bam
5. Split the bam file into reads mapping to the sense and antisense
strands. Since the mates of a paired read map to different strands and
samtools can't extract both at the same time, samtools has to be run
four times:
samtools view -b -@ 8 -f 99 mapped2.bam > sense1.bam
samtools view -b -@ 8 -f 147 mapped2.bam > sense2.bam
samtools view -b -@ 8 -f 83 mapped2.bam > antisense1.bam
samtools view -b -@ 8 -f 163 mapped2.bam > antisense2.bam
(note: the three extra bits added to -f exclude unmapped and improperly
mapped reads, which is not required in this particular case but does
no harm)
6. Count the reads:
samtools depth sense1.bam > sense1.depth
samtools depth sense2.bam > sense2.depth
samtools depth antisense1.bam > antisense1.depth
samtools depth antisense2.bam > antisense2.depth
7. Open the genome file in SigmoID and combine read counts for sense
and antisense strands using the Merge Plot Data... command from the
Genome menu.
8. Load the combined read count data for the sense strand using the Add Plot... command from the Genome menu, then load the data for the antisense strand via the same menu command. Repeat for another sample.
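The -f values used in step 5 are sums of SAM flag bits. The short Python sketch below decodes them using the bit meanings from the SAM format specification; the function name decode_flag and the label strings are chosen here for illustration only.

```python
# SAM flag bits as defined in the SAM format specification
# (only the bits relevant to the -f values used in step 5 and their neighbours).
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """Return the names of all flag bits set in the given value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for value in (99, 147, 83, 163):
    print(value, decode_flag(value))
```

For example, 99 = 1 + 2 + 32 + 64, so -f 99 selects first mates of properly paired reads whose mate maps to the reverse strand. Note that samtools view -f keeps only reads with all of the requested bits set.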
Python Scripts
The scripts described below process the output produced
by various search programmes, perform format conversions and add
features to the genome annotation. The scripts are called by the SigmoID GUI
when necessary, but can be used separately if desired. Type the command below in a terminal to get help on command line usage:
python <path_to_the_script> -h
HmmGen.py
SigmoID processes nhmmer results (the table of hits) with the help of the HmmGen.py
Python script, which adds the corresponding feature annotations to the GenBank
file being searched and saves the result in a new file. Several useful
options are provided to make annotation more convenient. You can
find these in the "HmmGen Settings" window, which pops up after clicking the "Postprocess" button in the main window.
To run the script, enter the appropriate threshold
(either bit score or E-value). By default SigmoID chooses the same
value that was used to run nhmmer, but you can increase the bit score or decrease the E-value to reduce the number of hits without re-running nhmmer.
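The effect of tightening the threshold without re-running nhmmer can be sketched in Python as below. This is not the HmmGen.py code. It assumes the standard column order of the nhmmer --tblout table (alignment coordinates in columns 7 and 8, strand in column 12, E-value in column 13, bit score in column 14); the function names and the two example rows are invented for illustration.

```python
def parse_tblout(lines):
    """Yield hits from nhmmer --tblout text (assumed standard column order)."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        yield {
            "target": fields[0],
            "start": int(fields[6]),     # alifrom
            "end": int(fields[7]),       # ali to
            "strand": fields[11],
            "evalue": float(fields[12]),
            "score": float(fields[13]),
        }


def filter_hits(hits, min_score=None, max_evalue=None):
    """Keep hits that pass the bit score and/or E-value threshold."""
    kept = []
    for hit in hits:
        if min_score is not None and hit["score"] < min_score:
            continue
        if max_evalue is not None and hit["evalue"] > max_evalue:
            continue
        kept.append(hit)
    return kept


# Two made-up rows in tblout layout, for illustration only.
table = [
    "# target name accession query name ...",
    "chr - site - 1 20 1000 1019 995 1025 5000000 + 1.2e-05 15.3 0.1 -",
    "chr - site - 1 20 3050 3031 3055 3026 5000000 - 0.02 8.1 0.0 -",
]
hits = list(parse_tblout(table))
strict = filter_hits(hits, min_score=10.0)
print(len(hits), len(strict))
```

Raising min_score or lowering max_evalue can only shrink the hit list, which is why the threshold can be adjusted on existing results.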
To filter out all intragenic hits, check the "Consider intergenic regions only" box. nhmmer
reports hits on both strands, and in the case of palindromic sites
there will be two hits with the same coordinates and identical (or very
close) scores. To remove one of the duplicate sites, check the "Palindromic site" checkbox.
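One way to picture the removal of palindromic duplicates is the Python sketch below. It is an assumption about how such filtering can be done, not the HmmGen.py implementation: hits are hypothetical dictionaries, coordinates are normalised so that plus and minus strand hits over the same span compare equal, and the higher-scoring hit of each pair is kept.

```python
def drop_palindromic_duplicates(hits):
    """Keep one hit per (target, span), preferring the higher bit score."""
    best = {}
    for hit in hits:
        low = min(hit["start"], hit["end"])
        high = max(hit["start"], hit["end"])
        key = (hit["target"], low, high)
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: min(h["start"], h["end"]))


# Hypothetical hits: the first two cover the same span on opposite strands.
hits = [
    {"target": "chr", "start": 1000, "end": 1019, "strand": "+", "score": 15.3},
    {"target": "chr", "start": 1019, "end": 1000, "strand": "-", "score": 15.1},
    {"target": "chr", "start": 4000, "end": 4019, "strand": "+", "score": 12.0},
]
unique = drop_palindromic_duplicates(hits)
print([(h["start"], h["end"], h["strand"]) for h in unique])
```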
This script can also add 'locus_tag' and 'gene'
qualifiers to the feature being annotated, but please note that GenBank
will object to such additions if you later decide to submit this sequence
to the database. If you are certain you really want this addition,
check the "Add qualifier" box.
Choose the feature type ("promoter" or "protein_bind") from the "Feature to add:"
box (or just type in a valid feature type). The window also allows you to
configure one qualifier for this feature. The qualifier name can be
typed in, but it should remain as 'protein_bind' in most cases. A valid
protein name should be entered in the field to the right.
Pressing the Run
button will prompt you for the name of the file in which to
save the genome sequence with the modified annotation. If the "Show hits in
genome browser" box is checked, you'll see the results in the browser
window. The script also appends a detailed text report to the Log
pane.
MastGen.py
This script adds features to a GenBank file
according to MAST results. From SigmoID it is called when
usage:
  MastGen <report_file> <input_file> <output_file> [options]
positional arguments:
  report_file           path to MAST report file produced with -tblout option.
  input_file            path to input GenBank file.
  output_file           path to output GenBank file.
optional arguments:
  -h, --help            show this help message and exit
  -L <integer>, --length <integer>
                        final feature's length in genbank file
  -q [<key#"value"> [<key#"value"> ...]], --qual [<key#"value"> [<key#"value"> ...]]
                        add this qualifier to each annotated feature.
  -p, --palindromic     filter palindromic sites.
  -n, --name            don't pick 'locus_tag' and 'gene' qualifiers from the
                        next CDS feature.
  -V <float or integer>, --pval <float or integer>
                        threshold E-Value.
  -S <float or integer>, --score <float or integer>
                        threshold Bit Score.
  -i, --insert          don't add features inside CDS
  -d, --duplicate       no duplicate features with the same location and the
                        same protein_bind qualifier value
  -v, --version         show program's version number and exit
  -f <"feature key">, --feature <"feature key">
                        feature key to add (promoter, protein_bind etc.)
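If MastGen.py is used outside the GUI, for instance in an annotation pipeline, the command line can be assembled programmatically. The Python sketch below only builds the argument list from options shown in the usage text above (-f, -p, -i); the helper name build_mastgen_command and all file names are hypothetical, and the command is printed rather than executed.

```python
def build_mastgen_command(script, report, genbank_in, genbank_out,
                          feature="protein_bind", palindromic=False,
                          intergenic_only=False):
    """Build an argument list for MastGen.py from the documented options."""
    command = ["python", script, report, genbank_in, genbank_out,
               "-f", feature]
    if palindromic:
        command.append("-p")   # filter palindromic sites
    if intergenic_only:
        command.append("-i")   # don't add features inside CDS
    return command


# All file names below are placeholders.
command = build_mastgen_command(
    "MastGen.py", "mast_report.txt", "genome.gbk", "genome_annotated.gbk",
    palindromic=True, intergenic_only=True,
)
print(" ".join(command))
```

The resulting list could be passed to subprocess.call() once the placeholder paths are replaced with real ones.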
TermGen.py
This script adds terminators to a GenBank file according to TransTerm HP results.
usage:
  TermGen <input_file> <output_file> [options]
positional arguments:
  input_file            path to input GenBank file.
  output_file           path to output GenBank file.
optional arguments:
  -h, --help            show this help message and exit
  -o <path>, --output <path>
                        redirects TransTerm HP output file to directory given
  -C <integer>, --confidence <integer>
                        threshold Score.
  --minstem <integer>   Stem must be n nucleotides long
  --minloop <integer>   Loop portion of the hairpin must be at least n long
  --maxlen <integer>    Total extent of hairpin <= n NT long
  --maxloop <integer>   The loop portion can be no longer than n
  -v, --version         show program's version number and exit
ptt_converter.py
This script converts a GenBank file into the .ptt file format.
usage:
  Genbank to PTT converter <input_file>
positional arguments:
  input_file            path to input Genbank file.
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
OperOn.py
This script finds putative operons between regulator binding sites and/or terminators/long intergenic gaps.
usage:
  OperOn <input_file> [options]
positional arguments:
  input_file            path to input GenBank file.
optional arguments:
  -h, --help            show this help message and exit
  -g <int>, --gap <int>
                        minimal gap between operons
  -i <int>, --indent <int>
                        maximal distance from binding site to the first
                        downstream CDS
  -t, --terminator      terminators are regarded as operon separator
  -r <name of regulator>, --regulator <name of regulator>
                        only specified regulators are considered
  -p, --palindromic     treat all binding sites as palindromic
  -s, --strict          operon stops on first terminator (if -t is set)
  -v, --version         show program's version number and exit
gbk2tbl.py
This script converts a GenBank file into the .tbl file format. The resulting table is output to stdout.
usage:
  Genbank to .tbl converter <input_file> [options]
positional arguments:
  input_file            path to input GenBank file.
optional arguments:
  -h, --help            show this help message and exit
  -f, --fasta           creates fasta from genbank file.
  -p PREFIX, --prefix PREFIX
                        sequencing centre prefix.
  -t, --translation     adds translation qualifier to CDS features in .tbl
  -v, --version         show program's version number and exit