RSA-tools - Tutorials - TRANSFAC
Introduction
TRANSFAC is a database of transcription factors. It contains a collection of binding sites for transcription factors of many prokaryotes and eukaryotes.
TRANSFAC is equipped with query tools that let you perform different tasks:
- browse the data through a classification of transcription factors
- perform basic search (e.g. search by factor name)
- Match - Matrix Search for Transcription Factor Binding Sites.
RSAT and TRANSFAC offer complementary services. The information about transcription factor binding sites stored in TRANSFAC can be used as input for various tools in RSAT (e.g. genome-scale pattern matching). Reciprocally, promoter sequences can be retrieved from RSAT and then submitted to TRANSFAC for detecting putative binding sites for known factors. We will illustrate and discuss those approaches.
We will test the matching tools with some genes for which we already know something about transcriptional regulation, and compare the results of library matching with our prior knowledge. We will then use the same tool with random sequences and compare the results.
TRANSFAC registration
The public version of TRANSFAC is freely available for academic users, but the first time you access it, you need to register. Registration is quick and immediately gives you access to the tools. Follow the registration steps at TRANSFAC before executing the next steps.
Example of utilization
Warning: This tutorial needs to be updated, because the matching tool has evolved since I wrote the first version, and it now includes a choice between different levels of stringency (minimize false positives, minimize false negatives, or a compromise between both).
- Retrieve the upstream sequence of the gene PHO5 (as you have seen in the tutorial on sequence retrieval). Once the result is displayed, select the sequence and copy it.
- Open the Match form (http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi?). Paste the PHO5 upstream sequence in the appropriate box.
- Specify that you want an output sorted by quality.
- Submit your query.
- Select Fungi for the Matrix group.
Analyze the result.
- How many binding sites were predicted? For how many distinct transcription factors?
- How many matrices were matched against your sequence?
We will now analyze different types of random sequences, in order to estimate the rate of spurious matches.
- In RSAT, open the form random-sequences.
- Specify a length of 800 bp, and a single sequence (in order to analyze the same length as the PHO5 promoter analyzed above).
- Leave all parameters unchanged, and click GO.
- Copy the sequence, and paste it in another MatSearch form.
- Use the same settings as above, and submit the query.
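The random-sequences form draws each sequence from a background model rather than from equiprobable nucleotides. As a minimal sketch of that idea (not RSAT's implementation), the Python below estimates a Markov model of a chosen order from a set of training sequences and samples a new sequence from it. The function names, the pseudocount of 1, and the toy training sequence are assumptions made for illustration; a realistic model would be calibrated on real intergenic sequences.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order):
    """Count which nucleotide follows each prefix of length `order`."""
    counts = defaultdict(lambda: {nt: 0 for nt in ALPHABET})
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            prefix, nxt = seq[i:i + order], seq[i + order]
            # Skip windows containing ambiguous characters such as N.
            if nxt in ALPHABET and set(prefix) <= set(ALPHABET):
                counts[prefix][nxt] += 1
    return counts

def generate(counts, order, length, rng):
    """Sample a sequence of `length` nucleotides from the trained model."""
    # Seed the chain with `order` uniformly drawn nucleotides.
    seq = [rng.choice(ALPHABET) for _ in range(order)]
    while len(seq) < length:
        prefix = "".join(seq[-order:]) if order > 0 else ""
        # Pseudocount of 1 (an assumption) so unseen prefixes stay usable.
        weights = [counts[prefix][nt] + 1 for nt in ALPHABET]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq[:length])

if __name__ == "__main__":
    # Toy training data, purely illustrative (not yeast intergenic DNA).
    training = ["ATATATTTTGCATATAAAAGGCTTTATATACGA" * 30]
    model = train_markov(training, order=5)
    print(generate(model, order=5, length=800, rng=random.Random(42)))
```

A sequence produced this way inherits the short-range composition of its training set, which is why it is a more demanding negative control than a sequence with equiprobable nucleotides.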
Analyze the result obtained with a random sequence. Note that this random sequence was generated with a Markov model of order 5, calibrated on yeast intergenic regions. Thus, although it is a random sequence, it has the
Interpreting the results
The principal problem with whole-library matching is that any sequence is expected to contain matches for some of the motifs by chance. Results are thus not easy to evaluate.
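To give a rough sense of scale, the sketch below computes the expected number of chance occurrences of one exact word in a sequence. It rests on strong simplifying assumptions that are mine, not the tutorial's: independent, equiprobable nucleotides, an exact word rather than a matrix, and both strands counted separately (so reverse-complement palindromes are counted twice). The function name is hypothetical.

```python
def expected_chance_matches(seq_len, motif_width, both_strands=True):
    """Expected occurrences of one exact word under a uniform i.i.d. model."""
    positions = seq_len - motif_width + 1
    if positions <= 0:
        return 0.0
    # Probability that one given position spells the word exactly.
    p_word = 0.25 ** motif_width
    strands = 2 if both_strands else 1
    return positions * p_word * strands

if __name__ == "__main__":
    for width in (4, 6, 8, 10):
        print(width, round(expected_chance_matches(800, width), 4))
```

Under these assumptions, an 800 bp sequence is expected to contain about 6 chance occurrences of a given 4-letter word and about 0.4 of a given 6-letter word. These figures are per motif; when a whole library is matched, the expectations of all its motifs add up, which is why some hits are expected in any sequence.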
In addition, the PSSMs (position-specific scoring matrices) stored in TRANSFAC represent binding sites of varying widths and information content. Short motifs thus tend to be found in any sequence: they are reported frequently, but the confidence in these predictions is quite poor.
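The effect of matrix width can be explored with a small scanner. The following is a minimal sketch, assuming a log-odds matrix built from a count matrix with a pseudocount and a simple relative-score cutoff; it is not the scoring scheme used by Match, and the function names, the cutoff, and the count matrix in the usage example are hypothetical.

```python
import math

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def log_odds_matrix(counts, pseudo=1.0, background=0.25):
    """Convert per-position count dicts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.get(nt, 0) for nt in ALPHABET) + pseudo * len(ALPHABET)
        matrix.append({
            nt: math.log2(((column.get(nt, 0) + pseudo) / total) / background)
            for nt in ALPHABET
        })
    return matrix

def scan(sequence, matrix, min_fraction=0.85):
    """Return (position, strand, score) for windows above the cutoff.

    Positions on the "-" strand are offsets in the reverse complement.
    """
    width = len(matrix)
    best = sum(max(col.values()) for col in matrix)
    worst = sum(min(col.values()) for col in matrix)
    if best == worst:  # flat matrix: nothing to discriminate
        return []
    sequence = sequence.upper()
    reverse = sequence.translate(COMPLEMENT)[::-1]
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse)):
        for i in range(len(seq) - width + 1):
            window = seq[i:i + width]
            if any(nt not in ALPHABET for nt in window):
                continue  # skip windows with ambiguous characters
            score = sum(col[nt] for col, nt in zip(matrix, window))
            # Rescale the score to 0..1 between the worst and best possible.
            if (score - worst) / (best - worst) >= min_fraction:
                hits.append((i, strand, round(score, 2)))
    return hits

if __name__ == "__main__":
    # Hypothetical count matrix, perfectly conserved at every position.
    counts = [{"C": 10}, {"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}, {"G": 10}]
    print(scan("TTTTCACGTGTTTT", log_odds_matrix(counts)))
```

Scanning the same random sequence with only the first few columns of a matrix and then with the full matrix is a simple way to see how the number of reported hits tends to grow as the motif gets shorter.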
Additional exercises
You can now come back to the tutorial main page and follow the next tutorials.
For suggestions or information requests, please contact