ReadMe File for Molecular Ecology manuscript, "GENE DUPLICATION AND DIVERGENCE PRODUCE DIVERSE MHC GENOTYPES WITHOUT DISASSORTATIVE MATING."

Donald C. Dearborn, Andrea B. Gager, Andrew G. McArthur, Morgan E. Gilmour, Elena Mandzhukova, & Robert A. Mauck

Correspondence: Don Dearborn, Department of Biology, Bates College, 44 Campus Ave, Lewiston, Maine 04240 USA.  ddearbor@bates.edu

- - - - - - - - - SUMMARY

The data package on Dryad contains this ReadMe file, 5 data files, 1 Microsoft Excel file containing a macro for randomization tests, and 5 text files of 4th Dimension scripts for the phylogeny permutation model.  File names are denoted as <filename.type>, followed by description of contents.

All molecular work is from samples obtained from Leach's Storm-petrels (Oceanodroma leucorhoa) that were breeding (188 adults) or were known to have hatched (34 nestlings) on Kent Island, New Brunswick, Canada.

- - - - - - - - - ALLELES

<allele_sequences.txt>  
24 MHC Class II B alleles found at Ocle-DAB1 and Ocle-DAB2 in this sample of 188 adults and 22 nestlings.  Each allele is trimmed to exon 2 and is described with a DNA sequence, an amino acid translation, and a GenBank accession number.  Four alleles have an in-frame 3-bp deletion; this is indicated by '---' in the DNA sequence, but the gap is closed in the amino acid sequence.

- - - - - - - - - DATA OVERVIEW

<data_list_by_nest.csv>  
List of genetic data obtained for individuals in each of 94 nests.  Columns are Year (year of nesting attempt: 2010 or 2013), Nest # (burrow number, unique within year but not between years), Adult Female Band # (leg band # of breeding female), Adult Male Band # (leg band # of breeding male), and then a series of Yes/No columns reporting whether genetic data are available for parents and for nestlings at MHC genes and at microsatellite loci.

- - - - - - - - - MICROSATELLITE DATA

<microsatellite_genotypes.txt>  
Genotypes of 188 adults and 34 nestlings at 15 microsatellite loci.  File is in GenePop format. The file has 3 introductory rows: 1) metadata, 2) 15 locus names, 3) 'POP'.  Next is a table of 222 rows x 16 tab-delimited columns. The first column of the table is the identity of the bird.  Adults are identified by the last 5 digits of their Canadian Wildlife Service metal leg band.  Nestlings were too young to band and thus are identified by their nest number and the suffix -Chk (for 'Chick').  The remaining 15 columns are microsatellite genotypes at 15 loci, with the loci ordered according to the names in the second row of the file. Genotypes are 6 digits: two 3-digit numbers, each corresponding to the size (in base pairs) of an allele.  Missing genotypes are represented by a single zero.
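
The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the data package: it assumes the locus names in the second row are separated by commas or tabs (the README does not say which), and the locus names and bird IDs in the example are made up for illustration.

```python
import re


def parse_genotype(field):
    """Split a 6-digit genotype into two allele sizes (bp); a single '0' means missing."""
    field = field.strip()
    if field == "0":
        return None
    if len(field) != 6 or not field.isdigit():
        raise ValueError(f"unexpected genotype field: {field!r}")
    return int(field[:3]), int(field[3:])


def parse_genepop(text):
    """Parse: row 1 metadata, row 2 locus names, row 3 'POP', then one row per bird."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumption: locus names are comma- or tab-separated on the second row.
    loci = [name.strip() for name in re.split(r"[,\t]+", lines[1]) if name.strip()]
    if lines[2].strip().upper() != "POP":
        raise ValueError("expected 'POP' on the third row")
    birds = {}
    for row in lines[3:]:
        cells = row.split("\t")
        # Defensive: standard GenePop files put a comma after the ID.
        bird_id = cells[0].strip().rstrip(",").strip()
        genotypes = [parse_genotype(cell) for cell in cells[1:]]
        if len(genotypes) != len(loci):
            raise ValueError(f"{bird_id}: expected {len(loci)} genotypes")
        birds[bird_id] = dict(zip(loci, genotypes))
    return loci, birds


# Made-up two-locus example (hypothetical locus names and IDs, not real data).
example = (
    "illustrative metadata row\n"
    "LocA\tLocB\n"
    "POP\n"
    "12345\t101105\t0\n"
    "17-Chk\t099101\t200204\n"
)
example_loci, example_birds = parse_genepop(example)
```

For the real file, the same function would be called on the contents of <microsatellite_genotypes.txt>, and the result should contain 222 birds and 15 loci.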

- - - - - - - - -  MHC DATA

<MHC_noCNV.csv>
MHC genotypes of 188 adults + 22 chicks assuming no Copy Number Variation, in a table of 211 rows x 9 columns.  Genotypes were determined from Illumina sequencing based on an assumption of no Copy Number Variation as described in Supporting Information lines 104-127.  First row is header, and each remaining row corresponds to one bird.  Columns are Year (year of nesting attempt: 2010 or 2013), Nest # (burrow number, unique within year but not between years), Age (Adult or Chick), Sex (determined by PCR), CWS Bird Band #, Ocle-DAB1*allele1, Ocle-DAB1*allele2, Ocle-DAB2*allele1, and Ocle-DAB2*allele2.  

<MHC_withCNV.csv>
MHC genotypes of 188 adults + 22 chicks allowing Copy Number Variation, in a table of 211 rows x 12 columns.  Genotypes were determined from Illumina sequencing based on an algorithm that permits Copy Number Variation as described in Supporting Information lines 129-145.  First row is header, and each remaining row corresponds to one bird.  Columns are Year (year of nesting attempt: 2010 or 2013), Nest # (burrow number, unique within year but not between years), Age (Adult or Chick), Sex (determined by PCR), CWS Bird Band #, Ocle-DAB1*allele1, Ocle-DAB1*allele2, Ocle-DAB1*allele3, Ocle-DAB2*allele1, Ocle-DAB2*allele2, Ocle-DAB2*allele3, and Ocle-DAB2*allele4.
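
As a rough illustration of the 12-column layout, the sketch below groups each bird's alleles by locus using column position only. It assumes that unused allele slots are left empty, which this ReadMe does not state, and the allele labels and band number in the example are invented.

```python
import csv
import io

N_COLUMNS = 12
META_COLS = slice(0, 5)    # Year, Nest #, Age, Sex, CWS Bird Band #
DAB1_COLS = slice(5, 8)    # Ocle-DAB1*allele1 .. allele3
DAB2_COLS = slice(8, 12)   # Ocle-DAB2*allele1 .. allele4


def alleles_by_locus(row):
    """Return the non-empty allele labels at each locus for one bird's row."""
    def keep(cells):
        # Assumption: an unused allele slot is an empty cell.
        return [cell.strip() for cell in cells if cell.strip()]
    return {"Ocle-DAB1": keep(row[DAB1_COLS]), "Ocle-DAB2": keep(row[DAB2_COLS])}


def read_cnv_table(handle):
    """Read the header plus one row per bird; return (metadata, alleles) tuples."""
    reader = csv.reader(handle)
    header = next(reader)
    if len(header) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, found {len(header)}")
    return [(row[META_COLS], alleles_by_locus(row)) for row in reader if row]


# Made-up example row (hypothetical band number and allele labels).
example_csv = io.StringIO(
    "Year,Nest #,Age,Sex,CWS Bird Band #,"
    "D1a1,D1a2,D1a3,D2a1,D2a2,D2a3,D2a4\n"
    "2010,7,Adult,F,00001,x01,x02,,y01,y02,y03,\n"
)
example_rows = read_cnv_table(example_csv)
```

With real data, the handle would come from opening <MHC_withCNV.csv>; the same approach works for <MHC_noCNV.csv> after adjusting the column count and slices to its 9 columns.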

- - - - - - - - - MATE CHOICE RANDOMIZATION TESTS

<Excel_macro.xls>
Randomization tests were used to test whether the mean and variance in MHC similarity of actual mates were significantly different from random.  These tests were run in Microsoft Excel using a macro to create 10,000 iterations of randomly pairing each of the 94 females with one male (without replacement), saving the average and variance of the MHC similarity between random mates for each iteration.  This file is an example of one such Excel file -- in this case, the one that analyzes average p-distance between MHC alleles of mates using all 89 amino acids of exon 2, the results of which are shown in the graph in Figure S4a in Supporting Information.  Similar analyses were conducted for other metrics of MHC similarity and for microsatellite-based estimates of relatedness coefficients between mates.
  Column A: instructions
  Column B: ID number of 94 females, each repeated 94 times in a row
  Column C: ID number of 94 males, with entire set repeated 94 times
  Column D: MHC difference between each possible male-female pairing
  Column E: 94 random numbers in cells E2 to E95, then referentially copied in 93 more sets continuing down Column E; each male gets a random number that reappears in Column E each time that male appears in Column C
  Column F: mechanism for choosing the nth smallest random number of a male, where n is the number of a female; this is the mechanism for pairing males and females without replacement
  Column G: this looks up and reports the MHC difference for each of the 94 random pairings in the current iteration
  Column H: empty column
  Column I: temporary storage of MEAN output from 1 of 10,000 iterations
  Column J: temporary storage of VARIANCE output from 1 of 10,000 iterations
  Column K: empty column
  Column L: empty column
  Column M: labels for Columns O and P
  Column N: iteration number (1 to 10,000)
  Column O: composite list of MEAN values from each of 10,000 iterations
  Column P: composite list of VARIANCE values from each of 10,000 iterations
Additional specific cells:
  Q1 = mean of MHC difference for actual pairs
  Q2 = variance of MHC differences for actual pairs
  Q13 = average value across 10,000 simulations for mean MHC difference between random pairs
  Q16 = variance across 10,000 simulations for mean MHC difference between random pairs
  S9 = 2-tailed p-value for whether mean of actual pairs is larger or smaller than random
  S10 = 1-tailed p-value for whether variance of actual pairs is smaller than random
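
For readers without Excel, the logic of the macro can be approximated in Python. This is a sketch under stated assumptions rather than a port of the macro: it assumes the actual mate of female f is stored at index f, that variance means sample variance, and that the 2-tailed p-value is twice the smaller tail proportion; none of these details is specified in this ReadMe. The small matrix is made-up data.

```python
import random
import statistics


def random_pairing(n_pairs, rng):
    """Pair each female with one male, without replacement.

    Mirrors the spreadsheet mechanism: every male draws a random number, and
    female k is paired with the male holding the k-th smallest draw.
    """
    draws = [rng.random() for _ in range(n_pairs)]
    return sorted(range(n_pairs), key=lambda male: draws[male])


def randomization_test(diff, iterations=10_000, seed=1):
    """diff[f][m] is the MHC difference between female f and male m.

    Assumption for this sketch: the actual mate of female f is male f.
    """
    rng = random.Random(seed)
    n_pairs = len(diff)
    actual = [diff[i][i] for i in range(n_pairs)]
    obs_mean = statistics.mean(actual)
    obs_var = statistics.variance(actual)  # sample variance (assumed)

    sim_means, sim_vars = [], []
    for _ in range(iterations):
        mates = random_pairing(n_pairs, rng)
        values = [diff[f][m] for f, m in enumerate(mates)]
        sim_means.append(statistics.mean(values))
        sim_vars.append(statistics.variance(values))

    # 2-tailed p for the mean: twice the smaller tail, capped at 1 (assumed formula).
    lower = sum(m <= obs_mean for m in sim_means) / iterations
    upper = sum(m >= obs_mean for m in sim_means) / iterations
    p_mean = min(1.0, 2 * min(lower, upper))
    # 1-tailed p for the variance: share of iterations at or below the observed value.
    p_var = sum(v <= obs_var for v in sim_vars) / iterations
    return {
        "observed_mean": obs_mean,
        "observed_variance": obs_var,
        "simulated_means": sim_means,
        "simulated_variances": sim_vars,
        "p_mean_two_tailed": p_mean,
        "p_variance_one_tailed": p_var,
    }


# Made-up 4 x 4 matrix of pairwise differences (not real data).
toy_diff = [
    [0.10, 0.30, 0.20, 0.25],
    [0.35, 0.12, 0.28, 0.22],
    [0.18, 0.27, 0.15, 0.31],
    [0.24, 0.21, 0.33, 0.11],
]
toy_result = randomization_test(toy_diff, iterations=200)
```

With the real data, `diff` would be a 94 x 94 matrix built from Column D, and the defaults reproduce the 10,000 iterations used in the macro.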


- - - - - - - - - PHYLOGENY PERMUTATION MODEL

The permutation of the phylogeny was conducted with a set of scripts written and run in 4th Dimension (4D Inc., San Jose, CA). 4th Dimension is a relational database with a proprietary programming language in which many database functions are given to the programmer 'for free'. For example, 'ORDER BY' does a sort on a field in a table. These built-in functions appear in ALL CAPS in the scripts described below.

Tables are referenced in brackets (e.g., [Alleles]). Fields within tables are referenced by the table and the field name (e.g., [Alleles]Orig_Position). Array index positions are designated by curly brackets ('{ }'). Comments in the code are denoted by '//'. Local variables are denoted with a leading '$'.

Relevant tables created in 4D for this project are:

[Alleles] - one record for each allele - data imported
-- Field: Allele_ID
-- Field: Allele Label
-- Field: Locus - A1 or A2
-- Field: Orig_Position - this is 1-24 with the first 11 from A1 and the next 13 from A2 - position in the existing population

[Distances] - one record for every unique pair of alleles - reference table derived from Alleles data
-- Field: Distance - distance between the pair of alleles
-- Field: PairID - identifier for the pair combination

[Individuals] - one record for each individual for which we have a genotype - data imported
-- Field: L1A1 - which allele is in the L1A1 position
-- Field: L1A2 - which allele is in the L1A2 position
-- Field: L2A1 - which allele is in the L2A1 position
-- Field: L2A2 - which allele is in the L2A2 position
-- Field: Sex
-- Field: ID - year and band number of individual

[Sim_Data] - one record for every simulated pairing - data derived from combination of Individuals, Distances, Alleles
-- Field: PairID - name of the pair ID - used to get the distance from table of [Distances]
-- Field: Type - Name of the simulation type: MHC or MSAT - see DD_Simulation
-- Field: Value - Calculated distance between the pair

[Trials] - one record for every trial run
-- Field: Trial ID 
-- Field: L1_Alleles_in_L1 - how many original L1 alleles are in L1 for this trial
-- Field: Dist_Mean - mean of the mean distance between all dyads in the trial
-- Field: Dist_Min
-- Field: Dist_Max
-- Field: Dist_Sum
-- Field: Old. . .  - 24 fields starting with 'Old', one for each original location - records the randomized alleles at each location

The phylogeny permutation model took the 24 alleles of Ocle-DAB1 and Ocle-DAB2 and permuted them into new locations in the existing skeleton of the phylogeny. A total of 6.204 x 10^23 permutations are possible; we examined a stratified random sample of 12,000 permutations, with 1,000 from each of the 12 degrees of monophyly. Within each degree of monophyly, permutations varied naturally in the location of the two most common alleles. For each of the 12,000 permutations, we recorded the degree of monophyly, the distance between the two most common alleles, and the resulting average divergence of the MHC alleles within each individual bird.
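
The counting and sampling described above can be illustrated in Python. This is a sketch, not the 4th Dimension code: it assumes that the degree of monophyly is the number of original locus-1 alleles that land in the 11 locus-1 positions (the L1_Alleles_in_L1 field, which can take the 12 values 0 to 11), and it represents alleles by their original positions 0 to 23.

```python
import math
import random

N_ALLELES = 24
N_LOCUS1 = 11  # original positions 1-11 belong to the first locus

# 24 alleles in 24 positions gives 24! orderings, about 6.204 x 10^23.
TOTAL_PERMUTATIONS = math.factorial(N_ALLELES)


def degree_of_monophyly(perm):
    """Count original locus-1 alleles sitting in locus-1 positions (assumed definition)."""
    return sum(1 for allele in perm[:N_LOCUS1] if allele < N_LOCUS1)


def stratified_sample(target_per_level, max_draws, seed=0):
    """Draw unstratified random permutations and keep up to a target number per level."""
    rng = random.Random(seed)
    strata = {level: [] for level in range(N_LOCUS1 + 1)}
    for _ in range(max_draws):
        perm = list(range(N_ALLELES))
        rng.shuffle(perm)
        level = degree_of_monophyly(perm)
        if len(strata[level]) < target_per_level:
            strata[level].append(perm)
    return strata


# Small demonstration; the study used a target of 1,000 per level.
demo_strata = stratified_sample(target_per_level=3, max_draws=2000)
```

In a run like this the extreme levels tend to fill slowly or not at all, because few random permutations place nearly all or nearly none of the locus-1 alleles in locus-1 positions.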

The program was written to generate phylogeny permutations in a purely random, unstratified way; particular permutations were extracted from this until we reached the desired stratification of 1,000 permutations for each of the 12 levels of monophyly. However, the very lowest and highest values of monophyly remained rare (much less than the target of 1,000 permutations) even after 2,000,000 total permutations. Thus, we created one additional script that generated random permutations with particularly low or high levels of monophyly.
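
One way to obtain permutations at a chosen level of monophyly directly, without waiting for them to arise by chance, is sketched below in Python. This is an illustration of the general idea only and should not be read as the algorithm in the authors' additional script; as above, it assumes that the level of monophyly is the number of original locus-1 alleles placed in the 11 locus-1 positions.

```python
import random


def permutation_with_monophyly(level, rng, n_locus1=11, n_alleles=24):
    """Build a random permutation with exactly `level` locus-1 alleles in locus-1 positions."""
    if not 0 <= level <= n_locus1:
        raise ValueError("level must be between 0 and n_locus1")
    locus1 = list(range(n_locus1))             # alleles originally at locus 1
    locus2 = list(range(n_locus1, n_alleles))  # alleles originally at locus 2
    rng.shuffle(locus1)
    rng.shuffle(locus2)
    # Fill the locus-1 positions with `level` locus-1 alleles plus locus-2 alleles.
    in_locus1 = locus1[:level] + locus2[:n_locus1 - level]
    # Everything left over goes to the locus-2 positions.
    in_locus2 = locus1[level:] + locus2[n_locus1 - level:]
    rng.shuffle(in_locus1)
    rng.shuffle(in_locus2)
    return in_locus1 + in_locus2


demo_rng = random.Random(42)
demo_low = permutation_with_monophyly(0, demo_rng)
demo_high = permutation_with_monophyly(11, demo_rng)
```

Because the subsets and their arrangement are all chosen at random, every permutation at the requested level is equally likely under this construction.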

The 5 scripts are:

<DD_BS_Setup_Reference_Arrays.txt> - Takes the imported data in Individuals, Distances, Alleles and puts them into arrays.

<DD_BS_Distribute_Alleles.txt> - Simulation core.  Makes N number of trial records composed of randomized distribution of alleles in the phylogenetic framework.

<DD_BS_Distribute_Special.txt>  - Same as <DD_BS_Distribute_Alleles.txt>, but modified to handle the rarity of very low or high monophyly.

<DD_Rtn_Mean_Dist.txt> - Code that looks up pairwise distances between alleles and then calculates the average distance between 4 alleles within an individual.

<DD_BS_Calc_means_from_trial.txt> - Once the trials are made, this calculates the mean of mean distances between alleles within individuals in the population.

In total, then, there are five 4th Dimension scripts.
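
The distance calculations performed by the last two scripts can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the 4th Dimension code: it assumes the within-individual average is taken over all 6 unordered pairs among the 4 alleles, and that two copies of the same allele have a distance of 0. The allele labels and distances are made up.

```python
from itertools import combinations
from statistics import mean


def pair_key(allele_a, allele_b):
    """Order-independent key for a pair of alleles, like PairID in [Distances]."""
    return tuple(sorted((allele_a, allele_b)))


def mean_within_individual(alleles, distances):
    """Average distance over all unordered pairs of an individual's 4 alleles."""
    values = []
    for allele_a, allele_b in combinations(alleles, 2):
        if allele_a == allele_b:
            values.append(0.0)  # assumption: identical alleles have distance 0
        else:
            values.append(distances[pair_key(allele_a, allele_b)])
    return mean(values)


def mean_of_means(individuals, distances):
    """Population-level mean of the within-individual mean distances."""
    return mean(mean_within_individual(alleles, distances) for alleles in individuals)


# Made-up distances and genotypes (L1A1, L1A2, L2A1, L2A2), not real data.
toy_distances = {
    ("a", "b"): 0.1,
    ("a", "c"): 0.2,
    ("b", "c"): 0.3,
}
toy_individuals = [("a", "b", "c", "c"), ("a", "a", "b", "b")]
toy_first = mean_within_individual(toy_individuals[0], toy_distances)
toy_population = mean_of_means(toy_individuals, toy_distances)
```

In the permutation model, the same calculation would be repeated for each trial after the alleles have been reassigned to new positions in the phylogeny.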


- - - - - - - - - END OF ReadMe FILE








                                    


