RSA-tools - Tutorials - genome-scale patser
Prerequisite
This tutorial assumes that you already followed the tutorials on
Introduction
Position-Specific Scoring Matrices (PSSM) offer a more sensitive way than strings to describe the variability of DNA binding sites. However, the motifs are usually very short, so genome-scale matching usually returns many matches.
Applying matrix-based pattern matching to predict the target genes of a given transcription factor is thus far from trivial. The threshold has to be selected carefully, and there is always a tradeoff between sensitivity (what percentage of the true target genes will be detected?) and specificity (among the predictions, how many represent true targets?).
Patser is able to select a threshold automatically, on the basis of the information content of the matrix and the sequence size. This usually provides a good first guess, but it might be wise to sort the results by score and select a more restrictive threshold on the basis of the function of the selected genes.
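The tradeoff described above can be made concrete with a small sketch. It assumes you already have one score per gene (for example, the top match in its upstream region) and a set of known targets; the gene names, scores and the function name `evaluate_predictions` are all hypothetical and only serve the illustration. "Specificity" is computed here in the sense used in this tutorial, that is, the fraction of predicted genes that are true targets.

```python
def evaluate_predictions(scored_genes, known_targets, threshold):
    """Return (sensitivity, fraction of predictions that are true targets)."""
    # Genes whose score reaches the threshold are the predicted targets.
    predicted = {gene for gene, score in scored_genes.items() if score >= threshold}
    true_hits = predicted & known_targets
    sensitivity = len(true_hits) / len(known_targets) if known_targets else 0.0
    true_fraction = len(true_hits) / len(predicted) if predicted else 0.0
    return sensitivity, true_fraction


# Hypothetical data: top score per gene, and a hypothetical set of true targets.
scored_genes = {"geneA": 12.1, "geneB": 9.4, "geneC": 7.8, "geneD": 6.0, "geneE": 10.3}
known_targets = {"geneA", "geneB", "geneC"}

# Sort the results by score, then look at the effect of several thresholds.
for gene, score in sorted(scored_genes.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}\t{score:.1f}")

for threshold in (6.0, 9.0, 11.0):
    sens, frac = evaluate_predictions(scored_genes, known_targets, threshold)
    print(f"threshold={threshold:4.1f}  sensitivity={sens:.2f}  true fraction={frac:.2f}")
```

With these made-up numbers, raising the threshold lowers the sensitivity while raising the fraction of predictions that are true targets, which is the behaviour the threshold choice has to balance.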
Example of utilization
We will use a position-specific scoring matrix from Transfac (entry F$GCN4_01) to scan the whole genome for genes potentially regulated by the transcription factor Gcn4p.
- In the left frame, select the tool genome-scale patser
- Paste the following matrix in the matrix box.
  A | 4 2 7 2 3 5 5 4 6 19 0 0 43 0 2 1 43 1 12 4 12 6 7 4 5 8 6
  C | 6 7 6 7 6 4 11 12 9 6 0 0 0 43 0 42 0 11 16 14 10 9 10 8 10 7 7
  G | 2 5 2 9 6 10 7 9 11 15 0 43 0 0 0 0 0 5 2 13 4 11 4 11 6 3 4
  T | 2 1 3 2 8 6 4 5 11 2 43 0 0 0 41 0 0 26 8 3 8 7 10 5 6 7 6
- Note: This matrix is already in consensus format. Make sure that the matrix format option is set to consensus.
- You can try to run the program with a threshold of 0, but this usually returns many spurious matches. We will increase the threshold to be more selective. Set the lower threshold to 9.
- Let us assume that, as a first step, we simply want a list of the genes that have at least one match above the selected threshold in their upstream region. For the time being, we are thus not interested in the precise positions of the matching sites. For this, we can choose to return the top value for each sequence.
- To be even more selective, we will prevent matches that overlap with upstream ORFs. Beware that this might cause a loss in sensitivity, since there are many false ORFs in the yeast genome, as discussed in the tutorial on sequence retrieval. Deactivate the option Allow overlaps with neighbour genes.
- Click GO.
After a few seconds, the result of patser will be displayed.
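To give an idea of what "return the top value for each sequence" means, here is a minimal sketch of matrix-based scanning. It is not patser's implementation: the log-odds weights, the pseudocount of 1, the uniform background of 0.25 and the scanning of both strands are assumptions made for this illustration, and the function names are hypothetical. Scores computed this way are not expected to match patser's exactly, so the threshold of 9 used above should not be carried over. Sequences are assumed to contain only the uppercase letters A, C, G and T.

```python
import math

# Count matrix from this tutorial (Transfac entry F$GCN4_01), one row per base.
COUNTS = {
    "A": [4, 2, 7, 2, 3, 5, 5, 4, 6, 19, 0, 0, 43, 0, 2, 1, 43, 1, 12, 4, 12, 6, 7, 4, 5, 8, 6],
    "C": [6, 7, 6, 7, 6, 4, 11, 12, 9, 6, 0, 0, 0, 43, 0, 42, 0, 11, 16, 14, 10, 9, 10, 8, 10, 7, 7],
    "G": [2, 5, 2, 9, 6, 10, 7, 9, 11, 15, 0, 43, 0, 0, 0, 0, 0, 5, 2, 13, 4, 11, 4, 11, 6, 3, 4],
    "T": [2, 1, 3, 2, 8, 6, 4, 5, 11, 2, 43, 0, 0, 0, 41, 0, 0, 26, 8, 3, 8, 7, 10, 5, 6, 7, 6],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to natural-log odds weights (assumed scheme, see text)."""
    width = len(counts["A"])
    weights = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + pseudocount
        weights.append({
            b: math.log(((counts[b][i] + pseudocount * background) / total) / background)
            for b in "ACGT"
        })
    return weights


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def top_score(seq, weights):
    """Best score over all positions on both strands; None if seq is too short."""
    width = len(weights)
    best = None
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - width + 1):
            window = strand[start:start + width]
            score = sum(weights[i][base] for i, base in enumerate(window))
            if best is None or score > best:
                best = score
    return best


weights = log_odds_matrix(COUNTS)

# Hypothetical test sequences: the first one embeds the most frequent base
# of every matrix column, the second one is a plain ACGT repeat.
best_site = "".join(max("ACGT", key=lambda b: COUNTS[b][i]) for i in range(len(weights)))
upstream = {
    "with_site": "ACGT" * 5 + best_site + "ACGT" * 5,
    "without_site": "ACGT" * 20,
}

# Keep only the top value per sequence, then list sequences by decreasing score.
tops = {name: top_score(seq, weights) for name, seq in upstream.items()}
for name, score in sorted(tops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{score:.2f}")
```

Keeping a single value per sequence discards the positions of the sites, which is acceptable here because the goal is only a list of candidate genes.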
Interpreting the results
When analyzing the selected genes, you will notice that many of them are associated with amino-acid metabolism. This is consistent with the function of Gcn4p, the general control of amino-acid metabolism.
The relatively good results might come from the fact that the Gcn4p matrix is quite informative, due to its large size. For most transcription factors, the PSSM is restricted to a few nucleotides, and the matrix discriminates poorly between true sites and background sequence.
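One way to look at how informative a matrix is: compute the information content of each column. The sketch below uses the usual relative-entropy definition in bits against a uniform background of 0.25 per base; this background, the absence of pseudocounts and the function name `column_information` are assumptions of this example, and patser's own information content may be computed with different priors and units.

```python
import math

# Count matrix from this tutorial (Transfac entry F$GCN4_01), one row per base.
COUNTS = {
    "A": [4, 2, 7, 2, 3, 5, 5, 4, 6, 19, 0, 0, 43, 0, 2, 1, 43, 1, 12, 4, 12, 6, 7, 4, 5, 8, 6],
    "C": [6, 7, 6, 7, 6, 4, 11, 12, 9, 6, 0, 0, 0, 43, 0, 42, 0, 11, 16, 14, 10, 9, 10, 8, 10, 7, 7],
    "G": [2, 5, 2, 9, 6, 10, 7, 9, 11, 15, 0, 43, 0, 0, 0, 0, 0, 5, 2, 13, 4, 11, 4, 11, 6, 3, 4],
    "T": [2, 1, 3, 2, 8, 6, 4, 5, 11, 2, 43, 0, 0, 0, 41, 0, 0, 26, 8, 3, 8, 7, 10, 5, 6, 7, 6],
}


def column_information(counts, background=0.25):
    """Per-column information content in bits, relative to a uniform background."""
    width = len(counts["A"])
    info = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT")
        bits = 0.0
        for b in "ACGT":
            freq = counts[b][i] / total
            if freq > 0:  # a base never observed contributes nothing
                bits += freq * math.log2(freq / background)
        info.append(bits)
    return info


info = column_information(COUNTS)
for position, bits in enumerate(info, start=1):
    # Crude text profile: one '#' per 0.1 bit.
    print(f"{position:2d}  {bits:4.2f}  {'#' * round(bits * 10)}")
print(f"total: {sum(info):.2f} bits")
```

A column in which a single base is always observed reaches the maximum of 2 bits under this definition, and a column with equal counts for the four bases contributes 0. The printed profile therefore shows which positions of the matrix carry most of the discrimination.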
Additional exercises
- Use the Pho4p matrix from SCPD (the same as in the patser tutorial) to predict targets for this transcription factor in the genome of Saccharomyces cerevisiae. Apply the same strategy as in the patser tutorial to select the lower scoring threshold on the basis of the known binding sites. Compare the results obtained with this strategy with those obtained with the automatic threshold selection.
You can now come back to the tutorial main page and follow the next tutorials.
For suggestions or information request, please contact