Published April 29, 2019 | Version v1
Journal article Open

Simulation data for approximating pairwise evolutionary distances between amino acid sequences

  • 1. Hub de Bioinformatique et Biostatistique – Département Biologie Computationnelle, Institut Pasteur, USR 3756 CNRS, Paris, France

Description

This repository contains simulation data (and the associated source code) generated to study the strong positive monotonic relationship between the number d of substitution events per site that occurred during the evolution of a pair of homologous amino acid sequences and the proportion p of observed differences between the two aligned sequences.
The repository contains five zipped archives, whose contents are described below.


0.src.zip — source code used for generating and analyzing the provided data

For more details, see README.md inside 0.src.zip.

 

1.modeltrees.zip — a set of 20,000 phylogenetic trees selected from PhylomeDB (ftp://phylomedb.org/phylomedb/phylomes) for simulating sequence evolution

Each line of this tab-separated text file consists of 10 entries: the PhylomeDB identifier, the number of taxa, seven statistics of the patristic distances (min, 25th permille, first, second and third quartiles, 975th permille, and max, in that order), and the NEWICK-formatted phylogenetic tree with four-character leaf names (e.g. t001).

Below is an example of a tab-delimited line:

Phy0059AU4_PENEN	25	0.004061	0.029313	0.221575	0.290080	0.366015	0.474169	0.517979	(t001:0.0273393,t002:0.0910109,(((t003:0.0148896,t004:0.0136687):0.0598389,(t005:0.0648873,(t006:0.0372885,(t007:0.100912,t008:0.0643177):0.0304199):0.0289566):0.0225943):0.0230693,(t009:0.207337,((t010:0.0606602,(t011:0.019222,t012:0.0291446):0.0426208):0.0331557,((t013:0.0156203,(t014:0.0000002,t015:0.00406072):0.00963212):0.0581148,((t016:0.0146537,(t017:0.0274094,(t018:0.00415298,(t019:0.00820694,t020:0.0120778):0.00804507):0.010822):0.0612382):0.0474965,(t021:0.0321492,((t022:0.0197832,t023:0.0365331):0.0327991,(t024:0.0351663,t025:0.108968):0.0241437):0.021989):0.0149717):0.0415465):0.028888):0.0383597):0.0236143):0.124487);
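As an illustration of this layout, the following minimal sketch splits one such line into its 10 entries. The names `ModelTree` and `parse_modeltree_line` are hypothetical (they are not taken from 0.src.zip), and the only assumption is the tab-separated layout described above.

```python
from typing import NamedTuple, Tuple


class ModelTree(NamedTuple):
    phylome_id: str
    n_taxa: int
    # min, 25th permille, Q1, Q2, Q3, 975th permille, max patristic distance
    quantiles: Tuple[float, ...]
    newick: str


def parse_modeltree_line(line: str) -> ModelTree:
    """Split one tab-separated line into its 10 entries (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 tab-separated entries, got {len(fields)}")
    return ModelTree(
        phylome_id=fields[0],
        n_taxa=int(fields[1]),
        quantiles=tuple(float(x) for x in fields[2:9]),
        newick=fields[9],
    )
```

Applied to the example line above, `parse_modeltree_line(line).n_taxa` would be 25 and `quantiles[0]` the minimum patristic distance 0.004061.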

 

2.tpd.zip — simulation data (trees t, uncorrected distances p and evolutionary distances d) available as raw text files

For each empirical model of amino acid evolution (i.e. AB, BLOSUM62, cpREV64, cpREV, Dayhoff, DCMut-Dayhoff, DCMut-JTT, DEN, FLU, gcpREV, HIVb, HIVw, JTT, LG, MtArt, mtInv, mtMam, mtMet, mtREV, mtVer, MtZoa, PMB, rtREV, stmtREV, VT, WAGstar, WAG; see http://giphy.pasteur.fr/empirical-models-of-amino-acid-substitution), a tpd file is available containing blocks of simulation data. Each block corresponds to five consecutive lines:

  1. The first line contains the PhylomeDB identifier followed by the integer seed used for simulating sequence evolution with INDELible.
  2. The second line contains the original phylogenetic tree t (from 1.modeltrees.zip) used for simulating sequence evolution.
  3. The third line contains the same tree with branch lengths refitted using RAxML (for inferring d).
  4. The fourth line contains the uncorrected distances p estimated using FastME, each multiplied by the total number of simulated amino acid characters (i.e. 50,000), rounded to the nearest integer, and sorted according to the alphabetical order of the taxon names.
  5. The fifth line contains the evolutionary distances d derived from the tree on the third line, each multiplied by the total number of simulated amino acid characters (i.e. 50,000), rounded to the nearest integer, and sorted according to the alphabetical order of the taxon names.

Below is an example of a 5-line block:

Phy0059AU4_PENEN 268
(t001:0.0273393,t002:0.0910109,(((t003:0.0148896,t004:0.0136687):0.0598389,(t005:0.0648873,(t006:0.0372885,(t007:0.100912,t008:0.0643177):0.0304199):0.0289566):0.0225943):0.0230693,(t009:0.207337,((t010:0.0606602,(t011:0.019222,t012:0.0291446):0.0426208):0.0331557,((t013:0.0156203,(t014:0.0000002,t015:0.00406072):0.00963212):0.0581148,((t016:0.0146537,(t017:0.0274094,(t018:0.00415298,(t019:0.00820694,t020:0.0120778):0.00804507):0.010822):0.0612382):0.0474965,(t021:0.0321492,((t022:0.0197832,t023:0.0365331):0.0327991,(t024:0.0351663,t025:0.108968):0.0241437):0.021989):0.0149717):0.0415465):0.028888):0.0383597):0.0236143):0.124487);
(t001:0.027267,t002:0.092904,(((t003:0.014953,t004:0.013712):0.059210,(t005:0.063628,(t006:0.036517,(t007:0.100320,t008:0.064210):0.029943):0.030255):0.023485):0.022767,(t009:0.206030,((t010:0.060839,(t011:0.019723,t012:0.029707):0.043531):0.033382,((t013:0.015545,(t014:0.000001,t015:0.004376):0.009642):0.057519,((t016:0.014228,(t017:0.026438,(t018:0.003880,(t019:0.008531,t020:0.012421):0.007888):0.011186):0.059736):0.046510,(t021:0.033029,((t022:0.019875,t023:0.035399):0.034937,(t024:0.035286,t025:0.108043):0.023522):0.020945):0.014965):0.040146):0.030080):0.040800):0.022987):0.127039):0.0;
 5578 10744 13088 10691 13041 1404 11232 13513 7285 7250 11354 13611 7419 7356 6015 14517 16560 11034 10972 9761 7494 13312 15473 9695 9641 8356 6030 7437 15172 17128 13386 13338 13804 13868 16735 15664 12951 15121 10844 10818 11358 11469 14609 13451 13854 13031 15182 10953 10889 11461 11564 14700 13563 13922 5720 13323 15470 11283 11242 11800 11897 15021 13887 14229 6156 2393 13223 15383 11165 11084 11652 11801 14927 13790 14130 8710 8805 9177 13016 15201 10957 10880 11434 11582 14754 13593 13965 8478 8567 8952 1238 13164 15341 11120 11046 11581 11722 14891 13735 14108 8638 8735 9109 1450 218 14092 16143 12169 12114 12638 12755 15769 14683 14981 9824 9848 10227 7798 7574 7752 16298 18162 14499 14449 15003 15042 17826 16811 17120 12377 12455 12799 10566 10350 10519 4716 15960 17866 14141 14095 14638 14718 17523 16496 16765 11979 12061 12402 10142 9934 10102 4207 2018 16336 18202 14541 14502 15052 15114 17889 16867 17143 12410 12495 12829 10611 10398 10564 4769 2609 997 16451 18318 14640 14598 15152 15213 17958 16953 17245 12521 12596 12924 10747 10541 10704 4922 2776 1193 1032 13675 15779 11707 11652 12161 12296 15376 14282 14612 9328 9403 9771 7270 7029 7215 5056 8071 7606 8115 8256 15117 17081 13184 13151 13649 13767 16719 15662 15924 10880 10964 11316 8998 8776 8952 6892 9737 9283 9757 9897 5088 15600 17524 13713 13668 14178 14300 17187 16154 16386 11469 11570 11895 9579 9375 9547 7535 10346 9897 10363 10502 5746 2668 15147 17104 13305 13238 13756 13851 16768 15747 16069 11017 11129 11492 9119 8883 9060 7046 9863 9407 9894 10032 5239 5287 5964 17388 19136 15678 15638 16106 16177 18857 17944 18220 13590 13671 14029 11861 11663 11826 9919 12471 12071 12488 12628 8254 8277 8891 6541
 6009 12562 15844 12500 15782 1433 13209 16491 8064 8002 13367 16648 8221 8159 6520 18054 21336 12908 12846 11207 8339 16248 19530 11103 11041 9402 6534 8227 19166 22448 16297 16235 16945 17102 21789 19984 15616 18898 12747 12685 13394 13552 18239 16433 17053 15736 19018 12868 12806 13515 13672 18360 16554 17173 6205 16236 19518 13367 13305 14014 14172 18859 17053 17673 6704 2472 16062 19344 13193 13131 13841 13998 18685 16880 17499 9868 9989 10488 15767 19049 12898 12836 13545 13703 18390 16584 17204 9573 9694 10193 1259 15986 19267 13117 13055 13764 13921 18609 16803 17422 9792 9913 10412 1478 219 17453 20735 14584 14522 15232 15389 20076 18271 18890 11259 11380 11879 8697 8402 8621 21050 24332 18181 18119 18829 18986 23673 21868 22487 14857 14977 15477 12295 12000 12218 5020 20482 23763 17613 17551 18260 18417 23105 21299 21918 14288 14409 14908 11726 11431 11650 4452 2075 21109 24390 18240 18178 18887 19044 23732 21926 22545 14915 15036 15535 12353 12058 12277 5078 2702 1015 21303 24585 18434 18372 19082 19239 23926 22121 22740 15109 15230 15729 12548 12252 12471 5273 2897 1209 1048 16816 20098 13947 13885 14594 14752 19439 17633 18253 10622 10743 11242 8060 7765 7984 5437 9034 8465 9092 9287 18952 22234 16083 16021 16731 16888 21575 19770 20389 12758 12879 13378 10197 9902 10120 7573 11170 10602 11229 11423 5439 19728 23010 16859 16797 17507 17664 22351 20546 21165 13535 13655 14155 10973 10678 10896 8349 11947 11378 12005 12199 6216 2764 19152 22434 16283 16221 16931 17088 21775 19970 20589 12958 13079 13578 10396 10101 10320 7773 11370 10802 11428 11623 5639 5681 6457 22790 26072 19921 19859 20568 20726 25413 23607 24227 16596 16717 17216 14034 13739 13958 11411 15008 14439 15066 15261 9277 9319 10095 7166

Of note, the i-th value on the 4th line (number of observed amino acid differences) and the i-th value on the 5th line (number of substitution events) refer to the same pair of sequences; e.g. under the evolutionary model used for generating the above block, 6,009 substitution events correspond to 5,578 observed amino acid differences.
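The 5-line layout can be read with a short streaming parser such as the sketch below. The names `TpdBlock`, `parse_tpd` and `pair_labels` are hypothetical and not taken from 0.src.zip. Two assumptions are made: blank lines carry no meaning, and the pairs are listed in the order (t001,t002), (t001,t003), (t002,t003), and so on. That ordering is inferred from the example block, whose first values on the 5th line match the corresponding path lengths in the refitted tree; it should be checked before being relied upon.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple

N_SITES = 50_000  # number of simulated amino acid characters stated in the description


class TpdBlock(NamedTuple):
    phylome_id: str
    seed: int
    model_tree: str     # line 2: original tree t
    refitted_tree: str  # line 3: tree with refitted branch lengths
    p: List[float]      # line 4, divided by N_SITES
    d: List[float]      # line 5, divided by N_SITES


def parse_tpd(lines: Iterable[str]) -> Iterator[TpdBlock]:
    """Yield one TpdBlock per group of five non-empty lines."""
    buffer: List[str] = []
    for raw in lines:
        row = raw.strip()
        if not row:  # assumption: blank lines carry no meaning
            continue
        buffer.append(row)
        if len(buffer) < 5:
            continue
        header, tree, refit, p_row, d_row = buffer
        buffer = []
        phylome_id, seed = header.split()
        p = [int(x) / N_SITES for x in p_row.split()]
        d = [int(x) / N_SITES for x in d_row.split()]
        if len(p) != len(d):
            raise ValueError(f"{phylome_id}: lines 4 and 5 differ in length")
        yield TpdBlock(phylome_id, int(seed), tree, refit, p, d)
    if buffer:
        raise ValueError("incomplete block at end of input")


def pair_labels(newick: str) -> List[Tuple[str, str]]:
    """Leaf pairs in the assumed order (t001,t002), (t001,t003), (t002,t003), ..."""
    # A leaf name follows '(' or ',' and precedes ':'; internal nodes follow ')'.
    names = sorted(re.findall(r"[(,]([^(),:;]+):", newick))
    return [(names[i], names[j]) for j in range(1, len(names)) for i in range(j)]
```

For a tree with n leaves, lines 4 and 5 are expected to hold n(n-1)/2 values each, so `len(pair_labels(block.model_tree)) == len(block.p)` is a cheap consistency check.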

For each empirical model of amino acid evolution (see above), the directory all/ contains all simulated blocks, i.e. 20,000 data blocks from the 20,000 model trees (from 1.modeltrees.zip), whereas the directory select/ contains the blocks that have been selected to obtain at least 500,000 pairs (p,d) (available in 3.pd.zip) whose values p approximately follow a uniform distribution between 0 and 0.9.
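One simple way to inspect how close a set of p values is to a uniform distribution on [0, 0.9) is to count them in equal-width bins. The sketch below is only such a check and is not the selection procedure used to build select/; the function name `p_bin_counts` and the choice of 9 bins are assumptions made for illustration. It works on the integer values as stored in the files (p multiplied by 50,000) to avoid floating-point edge effects at bin boundaries.

```python
from typing import Iterable, List


def p_bin_counts(p_counts: Iterable[int], n_bins: int = 9,
                 n_sites: int = 50_000, upper: float = 0.9) -> List[int]:
    """Count integer p values (numbers of differences) in equal-width bins on [0, upper)."""
    upper_count = round(upper * n_sites)  # e.g. 45,000 differences for p = 0.9
    counts = [0] * n_bins
    for c in p_counts:
        if 0 <= c < upper_count:          # values at or above `upper` are ignored
            counts[c * n_bins // upper_count] += 1
    return counts
```

With the default settings each bin spans 5,000 differences (0.1 in p); roughly equal counts across bins would be consistent with an approximately uniform distribution.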

 

3.pd.zip — simulation data (uncorrected distances p and evolutionary distances d) available as tab-delimited text files

For each empirical model of amino acid evolution (see above), a pd file is available. Each line of a pd file consists of 3 entries: the PhylomeDB identifier, an uncorrected distance p (integer value), and the corresponding evolutionary distance d (integer value). Of note, these entries are derived from the selected tpd files (2.tpd/selected/), i.e. from the first (identifier), fourth (p) and fifth (d) lines of each block. To obtain the floating-point values of p and d, each entry should be divided by the number of simulated characters, i.e. 50,000.
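The conversion back to floating-point values can be sketched as follows. The function name `parse_pd` is hypothetical, and the code assumes exactly three tab-separated entries per line, as described above.

```python
import csv
from typing import Iterable, Iterator, Tuple

N_SITES = 50_000  # number of simulated characters stated in the description


def parse_pd(lines: Iterable[str]) -> Iterator[Tuple[str, float, float]]:
    """Yield (identifier, p, d) with p and d rescaled to per-site values."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip empty lines, if any
            continue
        phylome_id, p_count, d_count = row
        yield phylome_id, int(p_count) / N_SITES, int(d_count) / N_SITES
```

Because `parse_pd` accepts any iterable of lines, it can be fed an open file handle directly, e.g. `with open(path, newline="") as handle: pairs = list(parse_pd(handle))`, where `path` points to one extracted pd file.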

 

4.img.zip — scatter plots available as tiff files

For each empirical model of amino acid evolution, the pairs (p,d) from the pd files (see above), divided by the total number of simulated characters (i.e. 50,000), are represented as a scatter plot in a tiff-formatted file.
The directory nc/ contains the scatter plots without regression curves. The directory wc/ contains the same figures with regression curves added.

Files (9.2 GB)

md5:bd85987b8d272936ffab4bdb97e77012 (11.5 kB)
md5:0ff7f3ffe9a6d2a4e9fdd04a05af1dcf (12.4 MB)
md5:c738bff9bf39f0d8a02d03459135d4ac (9.1 GB)
md5:7319db7149eaca49638b685291993f4a (90.1 MB)
md5:0f2d463a8c3a999c3d97db576cf6e11c (7.2 MB)
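A downloaded archive can be checked against the md5 values listed above with a few lines of standard-library Python. This is a minimal sketch; the function name `md5_of_file` is hypothetical, and the file is read in chunks so that the largest archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string should equal the part after `md5:` of the value listed for the corresponding archive.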