RSA-Tools - Tutorials - analyzing regulatory sequences in a unix shell
Introduction
This tutorial aims at introducing how to use Regulatory Sequence Analysis Tools (RSAT) directly from the unix shell.
RSAT is a package combining a series of specialized programs for the detection of regulatory signals in non-coding sequences. A variety of tasks can be performed: retrieval of upstream or downstream sequences, motif discovery, pattern matching, and graphical representation of regulatory regions.
A web interface has been developed for the most common tools, and is freely available for academic users.
All programs can also be used directly from the unix shell. Shell access is less intuitive than the web interface, but is very convenient for automating repetitive tasks.
This tutorial was written by Jacques van Helden (http://www.ucmb.ulb.ac.be/~jvanheld/). Unless otherwise specified, the programs presented here were written by Jacques van Helden.
Accessing the programs
In order to use the shell version of RSAT, you first need to have a local installation of the scripts. If this is not your case, please contact .
- Open an ssh session to your account.
- To check whether the tools are installed, just type:
    random-seq -l 350
If your configuration is correct, this command should return a random sequence of 350 nucleotides.
You are now able to use any program from the RSAT package until you quit your ssh session. It is however not very convenient to set the path manually each time you open a new connection. You can modify your default configuration by adding the following line to the file .personal-cshrc in the root of your home directory.
    set path=(~jvanheld/rsa-tools/perl-scripts/ ~jvanheld/rsa-tools/bin/ $path)
If you don't know how to modify this file, see the system administrator.
Getting help
The first step before using any program is to read the manual. All programs in the RSAT package come with on-line help, which is obtained by typing the name of the program followed by -h. For example, to get a detailed description of the functionality and options of the program retrieve-seq, type:
    retrieve-seq -h
The detailed help is especially convenient before using a program for the first time. A complementary functionality is offered by the option -help, which prints a short list of options. Try:
    retrieve-seq -help
This is convenient as a reminder of the precise formulation of the arguments of a given program.
Retrieving sequences
The program retrieve-seq allows you to retrieve sequences from a genome (provided this genome is supported on your machine). In particular (and by default), this program extracts the non-coding sequences located upstream of the start codon of a series of genes, where regulatory elements are generally found, at least in microbial organisms.
Retrieving a single upstream sequence
First trial: we will extract the upstream sequence of a single E. coli gene. Try:
    retrieve-seq -type upstream -org ecoli -q metA -from -200 -to -1
This command retrieves a 200 bp upstream sequence for the gene metA of Escherichia coli. Note the negative coordinates, indicating the upstream side. Also note that all coordinates are calculated relative to the start codon (position 0 is the A of the start ATG).
Combining upstream and coding sequence
For E. coli genes, regulatory signals sometimes overlap the 5' side of the coding sequence. This is often associated with a repression effect: the bound transcription factor prevents RNA polymerase from binding DNA. retrieve-seq allows you to extract a sequence that combines an upstream and a coding segment. Try:
    retrieve-seq -type upstream -org ecoli -q metA -from -200 -to 49
Retrieving a few upstream sequences
The option -q can be used repeatedly in a command to retrieve sequences for several genes.
    retrieve-seq -org ecoli -from -200 -to 49 -q metA -q metB -q metC
Retrieving many upstream sequences
If you have to retrieve a large number of sequences, it might become cumbersome to type each gene name on the command line. A list of gene names can be provided in a text file, each gene name coming as the first word of a new line.
To create a test file, you can execute the following steps:
- To create a new file, call the standard unix command
    cat > PHO_genes.txt
- You can now type a list of gene names, one per line, for example:
    PHO5
    PHO8
    PHO11
    PHO81
    PHO84
- Once you have finished typing gene names, press Ctrl-D.
- Check the content of your file by typing
    cat PHO_genes.txt
This file can now be used as input to indicate the list of genes.
    retrieve-seq -type upstream -i PHO_genes.txt \
        -org yeast -from -800 -to -1 -label orf
The option -o allows you to indicate a file where the sequence will be stored.
    retrieve-seq -type upstream -i PHO_genes.txt -org yeast \
        -from -800 -to -1 -label gene \
        -o PHO_up800.fasta
Check the sequence file:
    more PHO_up800.fasta
Retrieving all upstream sequences
For genome-scale analyses, it is convenient to retrieve upstream sequences for all the genes of a given genome, without having to specify the complete list of names. For this, simply use the option -all.
    retrieve-seq -type upstream -org ecoli -from 0 -to 2 \
        -all -format wc -nocomments -label orf_gene \
        -o ecoli_start_codons.wc
Check the result:
    more ecoli_start_codons.wc
Retrieving downstream sequences
retrieve-seq can also be used to retrieve downstream sequences. In this case, the origin (position 0) is the third base of the stop codon, positive coordinates indicate downstream (3') locations, and negative coordinates indicate locations upstream (5') of the stop codon (i.e. in the coding sequence). Thus,
    retrieve-seq -type downstream -org ecoli -from -2 -to 0 \
        -all -format wc -nocomments -label orf_gene \
        -o ecoli_stop_codons.wc
returns all the stop codons of E. coli.
Motif discovery
In a motif discovery problem, you start from a set of functionally related sequences (e.g. upstream sequences for a set of co-regulated genes) and you try to extract motifs (e.g. regulatory elements) that are characteristic of these sequences.
Several approaches exist, either string-based or matrix-based. For yeast regulatory elements, string-based approaches give excellent results. Their advantages:
- simple to use
- deterministic (if you run it repeatedly, you always get the same result)
- easily parametrizable
- easy to interpret
- fast
- ability to return a negative answer: if no motif is significant, the programs return an empty list of motifs. This is particularly important to reduce the rate of false positives.
Matrix-based approaches can provide a more refined description of motifs presenting a high degree of degeneracy. The problem with matrix-based approaches is that it is impossible to analyze all possible position-weight matrices, and thus one has to use heuristics. There is thus a risk of missing the global optimum because the program is attracted to local maxima. Another problem is that there are more parameters to select (typically, matrix width and expected number of occurrences of the motif), and their choice drastically affects the quality of the result. A last problem: the result is not easily interpretable because the programs always return an answer.
Basically, I would tend to prefer string-based approaches for any problem of motif discovery. In contrast, matrix-based approaches are much more sensitive for pattern matching problems (see below). The ideal would thus be to combine string-based pattern discovery and matrix-based pattern matching.
Requirements
This part of the tutorial assumes that you have already completed the tutorial about sequence retrieval (above), and that you have the result files in the current directory. Check with the command:
    ls -1
You should see the following file list:
    PHO_genes.txt
    PHO_up800.fasta
    ecoli_start_codons.wc
    ecoli_stop_codons.wc
oligo-analysis
The program oligo-analysis is the simplest motif discovery program. It counts the number of occurrences of all oligonucleotides (words) of a given length (typically 6), compares, for each word, the observed and expected occurrences, and returns the words with a significant level of over-representation.
Despite its simplicity, this program already returns good results for many families of co-regulated genes in yeast.
First, we will simply use the program to count word occurrences. The application will be to check the start and stop codons retrieved above.
We will then use oligo-analysis in a motif discovery process, to detect over-represented words in the set of 5 upstream sequences retrieved above (the PHO family). We will start with the appropriate parameters, which have been optimized for pattern discovery in yeast upstream sequences (van Helden et al., 1998). We will then use sub-optimal settings to illustrate the fact that the success of word-based pattern discovery crucially depends on a rigorous statistical approach.
Counting word occurrences and frequencies
Try the following command:
    oligo-analysis -i ecoli_start_codons.wc -format wc -l 3 -1str
Call the on-line option description to understand the meaning of the options you used:
    oligo-analysis -help
Or, to obtain more details:
    oligo-analysis -h
You can also ask for more information (verbose) and store the result in a file:
    oligo-analysis -i ecoli_start_codons.wc -format wc -l 3 -1str \
        -return occ,freq -v -o ecoli_start_codon_frequencies
Read the result file:
    more ecoli_start_codon_frequencies
Note the effect of the verbose option. You receive information about sequence length, number of possible oligonucleotides, the content of the output columns, ...
Exercise: check the frequencies of E.coli stop codons.
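As an independent cross-check for this kind of count, occurrences and frequencies can be recomputed with standard unix tools. The sketch below is only an illustration: it assumes a hypothetical plain-text input with one word per line (not the wc format), and the function name count_words is ours, not part of RSAT.

```shell
# Hypothetical helper (not an RSAT program): reads one word per line on stdin
# and prints "word<TAB>occurrences<TAB>frequency", most frequent word first.
count_words() {
    tr 'A-Z' 'a-z' |
    awk 'NF { occ[$1]++; total++ }
         END { for (w in occ) printf "%s\t%d\t%.3f\n", w, occ[w], occ[w] / total }' |
    sort -k2,2nr -k1,1
}

# Toy input (made-up codons, for illustration only).
printf '%s\n' ATG ATG GTG atg TTG ATG | count_words
```

On this toy input, atg is counted 4 times out of 6 words (frequency 0.667), gtg and ttg once each (0.167).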
Motif discovery in yeast upstream regions
Try the following command:
    oligo-analysis -i PHO_up800.fasta -format fasta -v -l 6 -2str \
        -return occ,proba -thosig 0 -ncf -org yeast -sort -o \
        PHO_up800_6nt_2str_ncf_sig0
Call the on-line help to understand the meaning of the parameters.
    oligo-analysis -h
Note that we used pre-calibrated tables as estimators of expected word frequencies. These tables have been previously calculated (with oligo-analysis) by counting hexanucleotide frequencies in the whole set of yeast non-coding (intergenic) regions. Our experience is that these frequencies are the optimal estimator for discovering regulatory elements in non-coding sequences.
Look at the result file:
    more PHO_up800_6nt_2str_ncf_sig0
A few questions:
- How many hexanucleotides can be formed with the 4-letter alphabet A,T,G,C?
- How many possible oligonucleotides are indicated? Is it the number you would expect? Why?
- How many patterns have been selected as significant?
- Do you see some similarity between some of the selected patterns?
Answers
- 4^6=4,096
- 2,080. This is due to the fact that the analysis was performed on both strands. Each oligonucleotide is thus equivalent to its reverse complement: the 4,096 words form 2,016 pairs of distinct reverse complements, plus 64 words that are their own reverse complement (2,016 + 64 = 2,080).
- 9
- There is strong mutual overlap between some words (e.g. cACGTG and ACGTGc).
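The first two answers can be checked by brute force. The sketch below (plain awk, independent of RSAT; the function name is hypothetical) enumerates all words of length k, computes the reverse complement of each, and keeps one representative per pair, namely the word that does not come after its reverse complement in alphabetical order.

```shell
# Hypothetical helper: prints "<number of words> <number of words when each
# word is grouped with its reverse complement>" for word length $1.
count_strand_insensitive_words() {
    awk -v k="$1" 'BEGIN {
        split("a c g t", base, " ")
        comp["a"] = "t"; comp["c"] = "g"; comp["g"] = "c"; comp["t"] = "a"
        total = 4 ^ k
        grouped = 0
        for (i = 0; i < total; i++) {
            # Build the i-th word by writing i in base 4.
            word = ""; x = i
            for (j = 0; j < k; j++) { word = base[x % 4 + 1] word; x = int(x / 4) }
            # Reverse complement: complement each letter, in reverse order.
            rc = ""
            for (j = 1; j <= k; j++) rc = comp[substr(word, j, 1)] rc
            # Count one representative per {word, reverse complement} pair.
            if (word <= rc) grouped++
        }
        print total, grouped
    }'
}

count_strand_insensitive_words 6
```

For k = 6 this prints 4096 2080, in agreement with the answers above.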
Assembling the patterns
A separate program, pattern-assembly, assembles a list of patterns in order to group those that overlap mutually. Try:
    pattern-assembly -v -i PHO_up800_6nt_2str_ncf_sig0 -sc 7 -subst 1 \
        -2str -o PHO_up800_6nt_2str_ncf_sig0.assemb
Call the on-line help to have a look at the assembly parameters.
    pattern-assembly -h
Look at the result.
    more PHO_up800_6nt_2str_ncf_sig0.assemb
There are two alignments (two contigs), and two isolated patterns. Each alignment is made of strongly overlapping patterns. The first alignment (cgcacgtgcg) corresponds to the high affinity binding site for Pho4p, the protein controlling the transcriptional response to phosphate in yeast. The second alignment (cgcacgttt) corresponds to the medium affinity binding site for Pho4p. Medium affinity binding sites have been shown to participate in the transcriptional response of some PHO genes.
Suboptimal settings
This chapter only aims at emphasizing how crucial the choice of appropriate statistical parameters is. We saw above that the optimal parameters give good results with the PHO family: despite the simplicity of the algorithm (counting non-degenerate hexanucleotide occurrences), we were able to extract a description of the regulatory motif over a larger width than 6 (by pattern assembly), and we got some description of the degeneracy (the high and medium affinity sites).
We will now intentionally try other parameter settings and see how they affect the quality of the results.
Equiprobable oligonucleotides
Let us try the simplest approach: each word is considered equiprobable. For this, we simply remove the options -ncf -org yeast from the above command. We also omit the output file, so that the results immediately appear on the screen.
    oligo-analysis -i PHO_up800.fasta -format fasta -v -l 6 -2str \
        -return occ,proba -thosig 0 -sort
Note that
- The number of selected motifs is higher (27) than in the previous trial.
- The most significant motifs have nothing to do with Pho4p binding sites. All these false positives are A-rich motifs (or T-rich, since we are grouping patterns with their reverse complement).
Two patterns (acgttt and acgtgc) are selected which are related to the Pho4p binding site. However, they only come at the 12th and 14th positions.
You can combine oligo-analysis and pattern-assembly in a single command, by using the pipe character "|".
    oligo-analysis -i PHO_up800.fasta -format fasta -v \
        -l 6 -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
On unix systems, this special character is used to chain commands, i.e. the output of the first command (in this case oligo-analysis) is not printed to the screen, but is sent as input to the second command (in this case pattern-assembly).
Note that the most significant patterns are associated with the poly-A (aaaaaa) contig. The true positives come isolated. Due to the bad choice of expected frequencies (all hexanucleotides were considered equiprobable here), regulatory sites were lost within a majority of false positives, and their description is much less accurate than with the option -ncf.
Markov chains
Another possibility is to use Markov chain models to estimate expected word frequencies. Try the following commands and compare the results. None is as good as the -ncf option, but in case one does not have the pre-calibrated non-coding frequencies (for instance if the organism has not been completely sequenced), Markov chains can provide an interesting approach.
    oligo-analysis -markov 0 -i PHO_up800.fasta -format fasta -v -l 6 \
        -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
    oligo-analysis -markov 1 -i PHO_up800.fasta -format fasta -v -l 6 \
        -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
    oligo-analysis -markov 2 -i PHO_up800.fasta -format fasta -v -l 6 \
        -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
    oligo-analysis -markov 3 -i PHO_up800.fasta -format fasta -v -l 6 \
        -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
    oligo-analysis -markov 4 -i PHO_up800.fasta -format fasta -v -l 6 \
        -2str -return occ,proba -thosig 0 -sort \
        | pattern-assembly -2str -sc 7 -subst 1 -v
Remarks
- Markov order 0 returns AT-rich patterns with the highest significance, but the Pho4p high affinity site is described with good accuracy. The medium affinity site appears as a single word (acgttt) among the isolated patterns.
- Markov order 1 returns fewer AT-rich motifs. The poly-A (aaaaaa) is however still associated with the highest significance, but comes as an isolated pattern.
- The higher the order of the Markov chain, the more stringent the conditions. For small sequence sets, selecting too high an order prevents any pattern from being selected. Markov order 2 misses most of the patterns, and higher orders do not return a single significant word.
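To make the notion of expected word frequency more concrete, here is a sketch of the simplest case, a Markov chain of order 0 (independent nucleotides), where the expected frequency of a word is the product of the frequencies of its letters. This is a textbook calculation on a made-up toy sequence, written with awk; it is not meant to describe how oligo-analysis implements the option -markov, and the function name is hypothetical.

```shell
# Hypothetical helper: expected frequency of word $2 under a Markov model of
# order 0 whose nucleotide frequencies are estimated from the (non-empty)
# sequence $1.
expected_freq_order0() {
    awk -v seq="$1" -v word="$2" 'BEGIN {
        seq = tolower(seq); word = tolower(word)
        n = length(seq)
        # Count each nucleotide in the background sequence.
        for (i = 1; i <= n; i++) count[substr(seq, i, 1)]++
        # Multiply the frequencies of the letters of the word.
        p = 1
        for (i = 1; i <= length(word); i++) p *= count[substr(word, i, 1)] / n
        printf "%.4f\n", p
    }'
}

# Toy A-rich background (40% a, 20% each of c, g, t): made-up data.
expected_freq_order0 aaaattccgg acgt   # 0.4 * 0.2 * 0.2 * 0.2
expected_freq_order0 aaaattccgg aaaa   # 0.4 ^ 4
```

With this toy background the two calls print 0.0032 and 0.0256: the A-rich word is expected 8 times more often than acgt, so it needs more occurrences before it can be considered over-represented.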
dyad-analysis
gibbs motif sampler (program developed by Andrew Neuwald)
consensus (program developed by Jerry Hertz)
An alternative approach for matrix-based motif discovery is consensus, a program written by Jerry Hertz, and based on a greedy algorithm. We will see how to extract a profile matrix from the upstream regions of the PHO genes.
Getting help
As for RSAT programs, there are two ways to get help from Jerry Hertz' programs: a detailed manual can be obtained with the option -h, and a summary of options with -help. Try these options and read the manual.
    consensus -h
    consensus -help
Sequence conversion
consensus uses a custom sequence format. Fortunately, the RSAT package contains a sequence conversion program (convert-seq) which supports Jerry Hertz' format. We will thus start by converting the fasta sequences to this format.
    convert-seq -i PHO_up800.fasta -from fasta -to wc -o PHO_up800.wc
Running consensus
Using consensus requires choosing appropriate values for a series of parameters. We found the following combination of parameters quite efficient for discovering patterns in yeast upstream sequences.
    consensus -L 10 -f PHO_up800.wc -A a:t c:g -c2 -N 10
The two main assumptions here are that the pattern has a length of about 10 bp (-L 10), and that we expect to find about 10 occurrences in the sequence set (-N 10). Since there are 5 genes in the family, this means that we expect on average 2 regulatory sites per gene, which is generally a good guess for yeast.
Notice that several matrices are returned. Each matrix is followed by the alignment of the sites on which it is based. Notice that the 4 matrices are highly similar: basically, they are all made of several occurrences of the high affinity site CACGTG, and matrices 1 and 3 contain one occurrence of the medium affinity site CACGTT.
Also notice that these matrices are not made of exactly 10 sites each. consensus is able to adapt the number of sites in the alignment in order to get the highest information content. The option -N 10 was an indication rather than a rigid requirement.
To save the result in a file, you can use the symbol ">", which redirects the output of a program to a file.
    consensus -L 10 -f PHO_up800.wc -A a:t c:g -c2 -N 10 > PHO_consensus
(this takes a few minutes). Once the task is completed, check the result.
    more PHO_consensus
Pattern matching
In a pattern matching problem, you start from one or several predefined patterns, and you match this pattern against a sequence, i.e. you locate all occurrences of this pattern in the sequences.
Patterns can be represented as strings (with dna-pattern) or position-weight matrices (with patser).
dna-pattern
dna-pattern is a string-based pattern matching program, specialized for searching patterns in DNA sequences.
- This specialization mainly consists in the ability to search on both the direct and reverse complement strands.
- A single run can either search for a single pattern, or for a list of patterns.
- Multi-sequence file formats (fasta, filelist, wc, ig) are supported, which allows patterns to be matched against a list of sequences in a single run of the program.
- String descriptions can be refined by using the 15-letter IUPAC code for incompletely specified nucleotides, or by using regular expressions.
- The program can either return a list of matching positions (default behaviour), or the count of occurrences of each pattern.
- Imperfect matches can be searched by allowing substitutions. Insertions and deletions are not supported. The reason is that, when a regulatory site presents variations, it is generally in the form of a tolerance for substitution at a specific position, rather than insertions or deletions. It is thus essential to be able to distinguish between these types of imperfect matches.
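Searching on both strands amounts to searching the pattern and its reverse complement on the direct strand. The following sketch shows how a reverse complement can be computed with standard unix tools; it is only an illustration (the function name revcomp is hypothetical and it handles the four plain nucleotides only), not a description of the internals of dna-pattern.

```shell
# Hypothetical helper: prints the reverse complement of the DNA word $1.
revcomp() {
    printf '%s\n' "$1" |
    tr 'acgtACGT' 'tgcaTGCA' |    # complement each nucleotide
    awk '{ out = ""               # then reverse the string
           for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
           print out }'
}

revcomp cacgtt   # prints aacgtg
revcomp cacgtg   # prints cacgtg (this word is its own reverse complement)
```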
Matching a single pattern
We will start by searching all positions of a single pattern in a sequence set. The sequence set contains the upstream regions of the PHO genes, obtained in the tutorial on sequence retrieval. We will search all occurrences of the most conserved core of the Pho4p medium affinity binding site ("CACGTT") in this sequence set.
Try the following command:
    dna-pattern -i PHO_up800.fasta -format fasta -1str -p cacgtt -id 'Pho4p_site'
You see a list of positions for all the occurrences of CACGTT in the sequences. Each row represents one match, and the columns provide the following information:
- pattern identifier
- strand
- pattern searched
- sequence identifier
- start position of the match
- end position of the match
- matched sequence
- matching score
Matching on both strands
To perform the search on both strands, type :
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtt -id 'Pho4p_site'
Notice that the strand column now contains two possible values: D for "direct" and R for "reverse complement".
Allowing substitutions
To allow one substitution, type:
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtt -id 'Pho4p_site' -subst 1
Notice that the score column now contains 2 values: 1.00 for perfect matches, 0.83 (=5/6) for single substitutions. This is one possible use of the score column: when substitutions are allowed, the score indicates the fraction of matching nucleotides.
Actually, for regulatory patterns, allowing substitutions usually returns many false positives, and this option is usually avoided. We will not use it further in the tutorial.
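The two score values can be reproduced by hand. The sketch below computes the fraction of identical positions between a pattern and a site of the same length; it is an illustration with a hypothetical function name, assuming plain nucleotide strings, and does not claim to reproduce the code of dna-pattern.

```shell
# Hypothetical helper: fraction of positions at which pattern $1 and site $2
# (same length) carry the same letter, printed with two decimals.
match_score() {
    awk -v pattern="$1" -v site="$2" 'BEGIN {
        n = length(pattern); same = 0
        for (i = 1; i <= n; i++)
            if (substr(pattern, i, 1) == substr(site, i, 1)) same++
        printf "%.2f\n", same / n
    }'
}

match_score cacgtt cacgtt   # perfect match: 6/6
match_score cacgtt cacgtg   # one substitution: 5/6
```

The two calls print 1.00 and 0.83 respectively.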
Extracting flanking sequences
The matching positions can be extracted along with their flanking nucleotides. Try:
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtt \
        -id 'Pho4p_site' -N 4
Notice the change in the matched sequence column: each matched sequence contains the pattern CACGTT in uppercase, and 4 lowercase letters on each side (the flanks).
Changing the origin
When working with upstream sequences, it is convenient to work with coordinates relative to the start codon (i.e. the right side of the sequence). Sequence matching programs (including dna-pattern) return the positions relative to the beginning (i.e. the left side) of the sequence. The reference (coordinate 0) can however be changed with the option -origin. In this case, we retrieved upstream sequences over 800 bp. The start codon is thus located at position 801. Try:
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtt \
        -id 'Pho4p_site' -N 4 -origin 801
Notice the change in coordinates.
In some cases, a sequence file will contain a mixture of sequences of different lengths (for example if one clipped the sequences to avoid upstream coding sequences). The origin should thus vary from sequence to sequence. A convenient way to circumvent the problem is to use a negative value with the option -origin. For example, -origin -100 would take as origin the 100th nucleotide starting from the right of each sequence in the sequence file. But in our case we want to take as origin the position immediately after the last nucleotide. For this, there is a special convention: -origin -0.
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtt \
        -id 'Pho4p_site' -N 4 -origin -0
In the current example, since all sequences are exactly 800 bp long, the result is identical to the one obtained with -origin 801.
Matching degenerate patterns
As we said before, there are two forms of Pho4p binding sites: the protein has high affinity for motifs containing the core CACGTG, but can also bind CACGTT sites with a medium affinity. The IUPAC code for partly specified nucleotides allows any combination of nucleotides to be represented by a single letter.
    A (Adenine)
    C (Cytosine)
    G (Guanine)
    T (Thymine)
    R = A or G (puRines)
    Y = C or T (pYrimidines)
    W = A or T (Weak hydrogen bonding)
    S = G or C (Strong hydrogen bonding)
    M = A or C (aMino group at common position)
    K = G or T (Keto group at common position)
    H = A, C or T (not G)
    B = G, C or T (not A)
    V = G, A or C (not T)
    D = G, A or T (not C)
    N = G, A, C or T (aNy)
Thus, we could use the string CACGTK to represent the Pho4p consensus, and search both high and medium affinity sites in a single run of the program.
    dna-pattern -i PHO_up800.fasta -format fasta -2str -p cacgtk \
        -id 'Pho4p_site' -N 4 -origin -0
Matching regular expressions
Another way to represent partly specified strings is by using regular expressions. This not only allows combinations of letters to be represented as we did above, but also spacings of variable width. For example, we could search for tandem repeats of 2 Pho4p binding sites, separated by at most 100 bp. This can be represented by the following regular expression:
    cacgt[gt].{0,100}cacgt[gt]
which means
- cacgt
- followed by either g or t [gt]
- followed by 0 to 100 unspecified letters .{0,100}
- followed by cacgt
- followed by either g or t [gt]
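The link between the IUPAC code and regular expressions can be illustrated with standard unix tools. In the sketch below, a hypothetical helper (iupac_to_regex, not part of RSAT) rewrites each degenerate IUPAC letter as a bracket expression, and grep -E -o (available in GNU and BSD grep) prints the part of a made-up toy sequence matched by the expression above. Only the direct strand is searched here.

```shell
# Hypothetical helper: converts an IUPAC string into a regular expression.
# The substitutions can be chained safely because they only introduce
# a, c, g, t and brackets, which no later substitution touches.
iupac_to_regex() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed \
        -e 's/r/[ag]/g' -e 's/y/[ct]/g' -e 's/w/[at]/g' -e 's/s/[cg]/g' \
        -e 's/m/[ac]/g' -e 's/k/[gt]/g' -e 's/h/[act]/g' -e 's/b/[cgt]/g' \
        -e 's/v/[acg]/g' -e 's/d/[agt]/g' -e 's/n/[acgt]/g'
}

site=$(iupac_to_regex CACGTK)     # gives cacgt[gt]
toy='ttcacgtgaaaacccacgttgg'      # made-up sequence, for illustration only

# Two sites separated by 0 to 100 unspecified letters.
printf '%s\n' "$toy" | grep -E -o "${site}.{0,100}${site}"
```

On the toy sequence, grep prints cacgtgaaaacccacgtt, i.e. a high affinity core followed, 7 letters further, by a medium affinity core.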
Let us try to use it with dna-pattern:
    dna-pattern -i PHO_up800.fasta -format fasta -2str -id 'Pho4p_pair' \
        -N 4 -origin -0 -p 'cacgt[gt].{0,100}cacgt[gt]'
Note that the pattern has to be quoted, to avoid possible conflicts between special characters used in the regular expression and the unix shell.
Matching several patterns
To match a series of patterns, you first need to store these patterns in a file. Let us create a pattern file:
    cat > test_patterns.txt
    cacgtg high
    cacgtt medium
(then type Ctrl-D to close). Check the content of your pattern file.
    more test_patterns.txt
There are two lines, each representing a pattern. The first word of each line contains the pattern, the second word the identifier for that pattern. This second column can be left blank, in which case the pattern is used as identifier.
We can now use this file to search all matching positions of both patterns in the PHO sequences.
    dna-pattern -i PHO_up800.fasta -format fasta -2str \
        -pl test_patterns.txt -N 4 -origin -0
Counting pattern matches
In the previous examples, we were interested in matching positions. It is sometimes interesting to get more synthetic information, in the form of a count of matching positions for each sequence. Try:
    dna-pattern -i PHO_up800.fasta -format fasta -2str \
        -pl test_patterns.txt -N 4 -origin -0 -c
With the option -c, the program returns the number of occurrences of each pattern in each sequence. The output format is different: there is one row for each pattern-sequence combination. The columns indicate respectively
- sequence identifier
- pattern identifier
- pattern sequence
- match count
An even more synthetic result can be obtained with the option -ct (count total).
    dna-pattern -i PHO_up800.fasta -format fasta -2str \
        -pl test_patterns.txt -N 4 -origin -0 -ct
This time, only two rows are returned, one per pattern.
Getting a count table
Another way to display the count information is in the form of a table, where each row represents a gene and each column a pattern.
    dna-pattern -i PHO_up800.fasta -format fasta -2str \
        -pl test_patterns.txt -N 4 -origin -0 -table
This representation is very convenient for applying multivariate statistics to the results (e.g. classifying genes according to the patterns found in their upstream sequences).
Last detail: we can add one column and one row for the totals per gene and per pattern.
    dna-pattern -i PHO_up800.fasta -format fasta -2str \
        -pl test_patterns.txt -N 4 -origin -0 -table -total
patser (program developed by Jerry Hertz)
We will now see how to match a profile matrix against a sequence set. For this, we use patser, a program written by Jerry Hertz.
Getting help
Help can be obtained with the two usual options.
    patser -h
    patser -help
Matrix conversion
patser expects as input a matrix like the 4 matrices we obtained above with consensus. The output from consensus can however not be used directly, because it contains several matrices and a lot of additional information. One possibility is to copy-paste the matrix of interest to a separate file.
To avoid manual editing, RSAT contains a program, matrix-from-consensus, which automatically extracts the first matrix from a consensus output.
    matrix-from-consensus -i PHO_consensus -o PHO_matrix
    more PHO_matrix
Detecting Pho4p sites in the PHO genes
After having extracted the matrix, we can match it against the PHO sequences to detect putative regulatory sites.
    patser -m PHO_matrix -f PHO_up800.wc -A a:t c:g -c -l 9
Detecting Pho4p sites in all upstream regions
We will now match our PHO matrix against the whole set of upstream regions from the 6200 yeast genes. This should allow us to detect new genes potentially regulated by Pho4p.
One possibility would be to use retrieve-seq to extract all yeast upstream regions, and save the result in a file, which would then be used as input by patser. To avoid occupying too much space on the disk, we can combine both tasks in a single command, and immediately redirect the output of retrieve-seq as input for patser. This can be done with the pipe ("|") character.
    retrieve-seq -type upstream -from -800 -to -1 -org yeast \
        -all -format wc -label gene \
        | patser -m PHO_matrix -l 9 -A a:t c:g
Drawing graphs
Accessing graphs in your account from the web
feature-map
XYgraph
Utilities
orf-info
orf-info allows you to get information on ORFs, given a series of query words. Queries are matched against ORF identifiers and ORF names. Imperfect matches can be specified by using regular expressions. For example, to get all info about the yeast gene GAT1:
    orf-info -org yeast -q GAT1
And to get all the purine genes from Escherichia coli, type:
    orf-info -org ecoli -q 'pur.*'
Note the use of quotes, which is necessary whenever the query contains a *.
You can also combine several queries on the same command line, by repeating the -q option:
    orf-info -org ecoli -q 'met.*' -q 'thr.*' -q 'lys.*'
Advanced use of RSA-Tools
Installing additional organisms
In this chapter, we explain how to add support for an organism to your local configuration of RSAT. This assumes that you have the complete sequence of a genome, and a table describing the predicted location of genes.
Genome data
First, prepare a directory where you will store the data for your organism. For example:
    ~myaccount/rsat-add/data/Mygenus_myspecies/
You need two pieces of information to start installing a new genome:
- The genome in fasta format. If the genome contains multiple chromosomes, they should all be included in a common multi-sequence fasta file.
- A feature-table giving the basic information about genes. This is a tab-delimited text file. Each row contains information about one gene. The columns contain the following information :
- Identifier
- Feature type (e.g. ORF, tRNA, ...)
- Name
- Chromosome. This must correspond to one of the sequence identifiers from the fasta file.
- Left limit
- Right limit
- Strand (D for direct, R for reverse complement)
- Description. A one-sentence description of the gene function.
- Optionally, you can provide a synonym file, which contains two columns:
  - ID. This must be one identifier found in the feature table
  - Synonym
Multiple synonyms can be given for a gene, by adding several lines with the same ID in the first column.
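Before installing a genome, it can be useful to check that the feature table follows the 8-column layout described above. The awk sketch below is only an illustration with a hypothetical function name. It assumes that the file contains no header or comment lines, that the limits are positive integers, and that the left limit is not greater than the right limit (assumptions on our side, not stated requirements of install-organism). The gene row used in the example is made up.

```shell
# Hypothetical helper: checks a tab-delimited feature table ($1) and returns
# a non-zero status if at least one row looks inconsistent.
check_feature_table() {
    awk -F '\t' '
        NF != 8 {
            printf "line %d: expected 8 columns, found %d\n", NR, NF
            bad = 1; next
        }
        $5 !~ /^[0-9]+$/ || $6 !~ /^[0-9]+$/ {
            printf "line %d: left and right limits must be integers\n", NR
            bad = 1; next
        }
        $5 + 0 > $6 + 0 {
            printf "line %d: left limit is greater than right limit\n", NR
            bad = 1; next
        }
        $7 != "D" && $7 != "R" {
            printf "line %d: strand must be D or R\n", NR
            bad = 1
        }
        END { exit (bad + 0) }
    ' "$1"
}

# Made-up one-gene table, written to a temporary file for the demonstration.
tmp=$(mktemp)
printf 'G001\tORF\tgenA\tchr1\t100\t400\tD\thypothetical protein\n' > "$tmp"
check_feature_table "$tmp" && echo "feature table looks consistent"
rm -f "$tmp"
```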
Installing the genome locally
Once you have this information, start the program
    install-organism
You will be asked to enter the information needed for genome installation.
Updating your local configuration
- Modify the local config file.
- You need to define an environment variable called RSA_LOCAL_CONFIG, which indicates the local config file.
Checking that the organism is installed properly
To check the installation, start by checking whether your newly installed organism now appears in the list of supported organisms.
    retrieve-seq -help
will give you a list of installed organisms.
Once the organism is found in your configuration, you need to check whether sequences are retrieved properly. A good test for this is to retrieve all the start codons, and check whether they are made of the expected codons (mainly ATG, plus some alternative start codons like GTG or TTG for bacteria).
    retrieve-seq -org myorganism -all -from 0 -to 2 -format multi \
        | oligo-analysis -format multi -v -1str -l 3 -return occ,freq
References
- van Helden, J., André, B. & Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281(5), 827-842. Pubmed 9719638
- van Helden, J., André, B. & Collado-Vides, J. (2000). A web site for the computational analysis of yeast regulatory sequences. Yeast 16(2), 177-187. Pubmed 10641039
- van Helden, J., Olmo, M. & Perez-Ortin, J. E. (2000). Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28(4), 1000-1010. Pubmed 10648794
- van Helden, J., Rios, A. F. & Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 28(8), 1808-1818. Pubmed 10734201
- van Helden, J., Gilbert, D., Wernisch, L., Schroeder, M. & Wodak, S. (2001). Applications of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. Lecture Notes in Computer Science 2066, 155-172.