Comparing ddRAD-seq and sdRAD-seq
Notes started: 13 September 2016

So, sdRAD and ddRAD have significant allele frequency differences. What to do about it?
Outline of the approach:
1. Analyze the two datasets together in one stacks run
	-Variance in coverage across individuals in combined analysis
	-Private alleles in the two datasets
	-#reads/individual for loci
2. Analyze each dataset separately in separate stacks runs
	-#reads/individual
	-overall levels of heterozygosity
	-levels of polymorphism
	-are the loci the same?
3. in silico digestion of the genome, assuming different types of error.

Overall: try to distinguish between allelic dropout, shearing bias, and PCR bias, if possible




#############################################LAB BOOK###########################################
#####Friday, 14 October 2016
I wrote a quick little subsetting function in R to subset the vcf files rather than deal with the plink BS
Now I'll take the things I've already run, use the subsets to subset those, and re-do the analysis.
1. Coverage
	> TukeyHSD(aov(log(lc.comp$AvgCovRatio+1)~lc.comp$LibraryPrep*lc.comp$Assembly))#use this!
	  Tukey multiple comparisons of means
		95% family-wise confidence level

	Fit: aov(formula = log(lc.comp$AvgCovRatio + 1) ~ lc.comp$LibraryPrep * lc.comp$Assembly)

	$`lc.comp$LibraryPrep`
					   diff         lwr         upr p adj
	sdRAD-ddRAD -0.03182711 -0.03928312 -0.02437109     0

	$`lc.comp$Assembly`
						  diff         lwr         upr p adj
	Together-Alone -0.04790026 -0.05669206 -0.03910845     0

	$`lc.comp$LibraryPrep:lc.comp$Assembly`
										  diff          lwr          upr     p adj
	sdRAD:Alone-ddRAD:Alone       -0.021023595 -0.032383659 -0.009663531 0.0000118
	ddRAD:Together-ddRAD:Alone     0.002094985 -0.016315944  0.020505915 0.9913092
	sdRAD:Together-ddRAD:Alone    -0.107643717 -0.125241228 -0.090046206 0.0000000
	ddRAD:Together-sdRAD:Alone     0.023118580  0.006827704  0.039409457 0.0015201
	sdRAD:Together-sdRAD:Alone    -0.086620122 -0.101985756 -0.071254487 0.0000000
	sdRAD:Together-ddRAD:Together -0.109738702 -0.130857746 -0.088619659 0.0000000

	> TukeyHSD(aov(log(lc.comp$AvgCovTotal)~lc.comp$LibraryPrep*lc.comp$Assembly))#use this!
	  Tukey multiple comparisons of means
		95% family-wise confidence level

	Fit: aov(formula = log(lc.comp$AvgCovTotal) ~ lc.comp$LibraryPrep * lc.comp$Assembly)

	$`lc.comp$LibraryPrep`
					  diff        lwr         upr p adj
	sdRAD-ddRAD -0.1022779 -0.1064995 -0.09805627     0

	$`lc.comp$Assembly`
						 diff        lwr        upr p adj
	Together-Alone 0.04921168 0.04423373 0.05418964     0

	$`lc.comp$LibraryPrep:lc.comp$Assembly`
										  diff         lwr          upr     p adj
	sdRAD:Alone-ddRAD:Alone       -0.113786883 -0.12021900 -0.107354762 0.0000000
	ddRAD:Together-ddRAD:Alone    -0.003228407 -0.01365276  0.007195946 0.8564623
	sdRAD:Together-ddRAD:Alone    -0.024045719 -0.03400951 -0.014081927 0.0000000
	ddRAD:Together-sdRAD:Alone     0.110558476  0.10133451  0.119782445 0.0000000
	sdRAD:Together-sdRAD:Alone     0.089741164  0.08104107  0.098441257 0.0000000
	sdRAD:Together-ddRAD:Together -0.020817312 -0.03277501 -0.008859612 0.0000457

#####Thursday, 13 October 2016
Did anovas to compare the in silico digest results, but I'm not sure that's the best approach.
Attempting to pull out loci from the coverage analysis that are in the subsets, but apparently they're different?
> dim(omap)
[1] 92973     5
> dim(o.cov)
[1] 250425     15
> dim(o.cov[o.cov$SNP %in% omap$SNP,])
[1] 2222   15
> dim(omap[omap$SNP %in% o.cov$SNP,])
[1] 2207    5

This is just frustrating.

It seems that the vcf and plink files have different SNPs.
> dim(orad[orad$SNP %in% omap$SNP,])
[1] 2222   70

maybe I took plink files from the wrong place?

Maybe I should just subset the vcf file and ignore the HWE concerns

#####Wednesday, 12 October 2016
The 20 remaining NAs are in gap regions of the genome.
The question is: are outliers more likely to be in coding regions?
All SNPs from drad:
        contig  five_prime_UTR             gap            gene three_prime_UTR 
          28576             962              20           37435            2116 
	Should I just look locus-by-locus? probably..
	
Returning to the in silico results...

reading in the subsetted "raw" from PLINK from last week...now I just need to do fsts with plink files.
I want to characterize the number of loci fixed for different alleles!

summary(as.numeric(bsub.fst$Fst))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-1.9890 -0.0386  0.0122  0.0116  0.0873  0.5456     604 

#####Tuesday, 11 October 2016
Today I worked mostly on identifying the genomic locations of SNPs 
(though I need to still work with it because 57827 of 69309 SNPs aren't being matched to a genomic region)

Also ran the null in silico digest with sdRAD n = 60, ddRAD n = 400.

#####Monday, 10 October 2016
Running the in silico with null distribution and non-skewed AFS.

Then I'll run it with the different numbers of individuals.

Some of the Fst weirdness seems to have come from loci that are polymorphic in one dataset but not the other.
In both:
	68008 are polymorphic
	1216 are fixed for different alleles
	15627 are fixed in one pop but not the other.
	summary(fsts.both$Fst)
	> summary(fsts.both$Fst[!is.na(fsts.both$Fst)])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-2.135000 -0.040860  0.005249  0.004975  0.074550  0.959200 
	 
I randomly sampled 120 dRAD individuals and compared 60 and 60.
> summary(drad.tes.fst$Fst[!is.na(drad.tes.fst$Fst)])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.218600  0.000000  0.002123  0.006264  0.009188  0.227800 


	

#####Thursday, 6 October 2016
OK, so maybe the Fst calculation in gwsca_biallelic_vcf was missing a parentheses..re-running it to check.

#####Wednesday, 5 October 2016
I am unhappy with the Fst calculations I've got from R. 
It doesn't seem right that there should be so many ==1 or < 0

Maybe run gwsca_biallelic_vcf? See if that gives the same distribution? 

Meanwhile, I'm thinking about things still to-do:
1. Choose one SNP per locus
2. Decide what to do, if anything, about loci at the same location
3. Compare hardy weinberg equilibrium between sdRAD and ddRAD
4. What about comparisons of the different ddRAD plates?
5. Add coverage differences to in silico digestion.

The gwsca_biallelic_vcf provides a drastically different set of Fsts for orad-drad in the 'both' analysis:
summary(gwsca$orad.drad[gwsca$orad.drad!=-1])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
2.000e-08 4.731e-04 1.391e-03 3.390e-03 3.399e-03 1.366e-01 

Ran popgen/prune_snps.cpp...retained:
	32910 SNPs in both
	29631 SNPs in drad
	92973 SNPs in orad

Then plink:
in stacks/
	plink --file drad --hwe 0.001 --noweb --allow-no-sex --write-snplist
	plink --file drad --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
	plink --file drad --extract plink.snplist --recodeA --out subset --noweb --allow-no-sex
in stacks_both/
	plink --file both --hwe 0.001 --noweb --allow-no-sex --write-snplist
	plink --file both --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
	plink --file both --extract plink.snplist --recodeA --out subset --noweb --allow-no-sex
in stacks_orad/
	plink --file orad --hwe 0.001 --noweb --allow-no-sex --write-snplist
	plink --file orad --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
	plink --file orad --extract plink.snplist --recodeA --out subset --noweb --allow-no-sex
	
#####Tuesday, 4 October 2016
I've decided to log-transform the coverage statistics and then use aov() and TukeyHSD() to compare them.
At least for now...I need to ask Adam re: normality assumptions (they're not really normal)

Also adding a coverage threshold to the fst calculation--must be in 50% of individuals. I'll see if that changes it.

Re: in-silico digestion...what if I remove the random shearing step and have every restriction site have a RAD locus?
-> or just choose a random number and that's where the length comes from
^^implementing this one and I'll see what happens

#####Wednesday, 28 September 2016
Neither single-digest nor double digest had any polymorphic restriction site events.
sdigest had 7308 PCR duplication events
ddigest had 67122 PCR duplication events

The Fsts from these don't have the same crazy pattern.

Running it with pcr_duplicate_rate ranging from 0.01 to 0.1 (although maybe I need to stop at 0.05 because that would be 100% w/ 20 cycles.)

#####Tuesday, 27 September 2016
I made the insilico digestion more object-oriented and it feels so much better! Now it's running, but it's sort of slow.
We'll see if it works the way I think it should.

In the meantime, I'm looking at the data again. 
->The ones with Fst=1 between the two datasets do actually have different alleles fixed 
->if I only keep loci with coverage between 3 and 50 (or 10 and 50 or 3 and 20), there are still a bunch fixed for different alleles
summary(od.both.fst[od.both.fst$SNP %in% cov.pass,"Fst"])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-3.44800 -0.02332  0.02860  0.16000  0.20130  1.00000       72 
^Common suggestion for eliminating PCR duplicates is not effective.

#####Monday, 26 September 2016
Today I'm adding the polymorphic restriction sites and the pcr duplication. 
For each individual at each locus:
1. which alleles does the individual have? 
	if genrand() < pop_af, al1 = 1 else al1 = 2.
	pop_af assigned to the locus using genrand()
2. Does the individual have a polymorphic restriction site?
	10% of loci are not polymorphic restriction sites
	mutation rate = 4*ne*uniform_distribution(0.00000001,0.00000001)
	mutation rate is unique to each locus *****IS THIS CORRECT?
	poissonrand(mutation rate*6) to determine # sites
		if(poissonrand(mut*6)==1) genrand() to choose which one is ommitted
		if(poissonrand(mut*6)>1) both are ommitted
3. PCR duplicates, assuming 2% of reads per PCR cycle are wrong (based on literature).
	Assuming reads are on average evenly distributed among the loci, the duplication rate * number of genotype calls will be affected
	pull a number from a poisson distribution with the pcr duplication rate per cycle as the mean..if it returns number > 0, then there's a duplication event
	if the allele is missing, the other one is taken. If both are missing, the locus is missing.
4. Output genotypes in vcf format.

Right now it's really not object-oriented, but I would like to change that. 

#####Friday, 23 September 2016
The output was funky because I'd done start-end instead of end-start to calculate the length.
The mean fragment length for the single digest is 498.5, but the distribution is skewed and still contains large fragments. (max = 19920). I think that's probably OK, but I should check my lab notebook.
Now I need to only keep the loci with the restriction site at one end.

The double digest provides a lovely distribution of fragments 250-700 bp, since that's what I selected for.

Use a poisson distribution for PCR duplicates, with mean number of copies in reads as mean (http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/)
^^this is small! 0.000625


#####Thursday, 22 September 2016
Added single-digest to the in-silico digest
That included simulating shearing, so I'm trying out a process where:
	(1) if the mean fragment size > 500 with the new fragment, then If
	(2) randomly break up the fragment into x pieces, where x=fragment length/500
	(3) if there are multiple shears, I do subsequent shears on the longest fragments.
I'm going to see if this gives me a somewhat normal distribution of fragment sizes centered around 500.
I think this incorporates shearing bias without having to give it an explicit rate, but I'm not 100% sure.
I need to think about that.

Something about the output isn't right but I need to take a break.

 
#####Wednesday, 21 September 2016
I narrowed down my focus to only loci with missingness > 0.5, which reduced the number of loci 
Now able to do the analysis, and the Ref/Alt and variance in coverage are significantly different among plates
->suggests PCR duplicates are the culprit

in silico digest: after trying to figure out why it's been infinite-looping, I finally figured out that .find() returns -1 if it doesn't find a match
Now, I need to keep fragments only in the 250-700 range (keep it broad for now)
What do I want to do with it? 
	-Add allelic dropout
	-Add PCR duplicates
	-Add genotyping errors?
	
So, I need to implement sampling. First, I'm going to test it on the whole genome, make sure it will run through the entire fasta file correctly.
^That worked! Success. Now, do I want to keep the sequences? Or just continue as-is and when I sample arbitrarily give individuals a value of 0 or 1 to represent SNPs?
I think for now I'll sample arbitrarily.
I also need to do both the double and single digest to compare them.

#####Tuesday, 20 September 2016
Looking at the different ddRAD plates, and it is not clear what is going on.
I need to use locus as a random effect, but this takes over the computer (too many loci)

#####Friday, 16 September 2016
Summary so far:
-Assembling them together gets weird results, including elevated Fsts, lower coverage for sdRAD
-sdRAD has lower Alt/Ref ratios than ddRAD
-sdRAD has higher variance in coverage and fewer heterozygotes
-generally, loci shared among separate analyses have same ref/alt.

RE: Fst..if they're assembled separately, FST is centered arond 0 and a lot are negative. 
This is also true if you take the ones that were analyzed together but only keep the loci shared between the two separate assemblies.
If you just prune for coverage (50%  in each pop) you get a WEIRD distribution with lots of Fst=1.

populations was killed...trying:
populations -b 3 -P ./results/stacks_both/ -M ./sca_popmap_sdvsdd.txt -r 0.5 -a 0.05 --vcf --plink -m 3 -t 4

#####Thursday, 15 September 2016
Running populations on 'both'
sca_popmap_sdvsdd.txt

populations -b 3 -P ./results/stacks_both/ -M ./sca_popmap_sdvsdd.txt -r 0.5 -a 0.05 --fstats --vcf --plink -m 3

Meanwhile...
in-silico digestion: while reading fasta file, I'm digesting each sequence as it comes (rather than storing them)
Outputting: Fragment	StartBP	EndBP	FragmentLength	StartEnz	EndEnz
Enzyme codes: 0 = start of sequence, 1 = PstI, 2 = MboI, 3 = end of sequence
Only searching for 5'-3' enzyme recognition sequence

Analysis of allele dropout vs. PCR bias: sdRAD has lower heterozygosity and lower ratios than ddRAD and higher variance when compared to a subsample of ddRAD (males and females only)

49893 SNPs are the same in ddRAD and sdRAD analyses (assembled separately)...
weird issue: some SNPs are found in multiple RAD loci in ddRAD?
#####Wednesday, 14 September 2016
Is the total number of reads per individual different when assembled together vs separately?
	> wilcox.test(oc$NumReadsTogether,oc$NumReadsAlone,paired=T,alternative="less")

		Wilcoxon signed rank test with continuity correction

	data:  oc$NumReadsTogether and oc$NumReadsAlone
	V = 0, p-value = 8.357e-12
	alternative hypothesis: true location shift is less than 0

	> wilcox.test(dc$NumReadsTogether,dc$NumReadsAlone,paired=T,alternative="greater")

		Wilcoxon signed rank test with continuity correction

	data:  dc$NumReadsTogether and dc$NumReadsAlone
	V = 73920, p-value < 2.2e-16
	alternative hypothesis: true location shift is greater than 0

I should plot the by-locus coverage too.
OK--maybe put all of them together so it's a comparison of the two tests.
For the by-locus coverage, they all seem pretty similar whether assembled together or separately.
	> wilcox.test(loc.cov$oRAD.AvgCovRatio,o.cov$AvgCovRatio)#less

		Wilcoxon rank sum test with continuity correction

	data:  loc.cov$oRAD.AvgCovRatio and o.cov$AvgCovRatio
	W = 3.0348e+10, p-value < 2.2e-16
	alternative hypothesis: true location shift is not equal to 0

	> wilcox.test(loc.cov$dRAD.AvgCovRatio,d.cov$AvgCovRatio)#greater

		Wilcoxon rank sum test with continuity correction

	data:  loc.cov$dRAD.AvgCovRatio and d.cov$AvgCovRatio
	W = 1.1963e+10, p-value < 2.2e-16
	alternative hypothesis: true location shift is not equal to 0

	> wilcox.test(loc.cov$AvgCovTotal.x,o.cov$AvgCovTotal)#less

		Wilcoxon rank sum test with continuity correction

	data:  loc.cov$AvgCovTotal.x and o.cov$AvgCovTotal
	W = 3.7692e+10, p-value = 0.04538
	alternative hypothesis: true location shift is not equal to 0

	> wilcox.test(loc.cov$AvgCovTotal.y,d.cov$AvgCovTotal)#greater

		Wilcoxon rank sum test with continuity correction

	data:  loc.cov$AvgCovTotal.y and d.cov$AvgCovTotal
	W = 1.0705e+10, p-value < 2.2e-16
	alternative hypothesis: true location shift is not equal to 0

	BUT statistically, sdRAD has lower coverage when assembled together and ddRAD has lower coverage when assembled separately.
	
sdRAD has lower coverage than ddRAD
	> wilcox.test(o.cov$AvgCovTotal,d.cov$AvgCovTotal,"less")

		Wilcoxon rank sum test with continuity correction

	data:  o.cov$AvgCovTotal and d.cov$AvgCovTotal
	W = 7379800000, p-value < 2.2e-16
	alternative hypothesis: true location shift is less than 0
but sdRAD is less skewed toward reference allele than ddRAD
	> wilcox.test(o.cov$AvgCovRatio,d.cov$AvgCovRatio,"less")

		Wilcoxon rank sum test with continuity correction

	data:  o.cov$AvgCovRatio and d.cov$AvgCovRatio
	W = 8592100000, p-value = 0.002174
	alternative hypothesis: true location shift is less than 0
..regardless of whether they're assembled together or separately. (can I do this analysis somehow more elegantly?)
	> summary(aov(lc.comp$AvgCovRatio~lc.comp$LibraryPrep*lc.comp$Assembly))
											 Df    Sum Sq   Mean Sq F value Pr(>F)    
	lc.comp$LibraryPrep                       1 1.385e+09 1.385e+09  1213.1 <2e-16 ***
	lc.comp$Assembly                          1 1.760e+08 1.760e+08   154.1 <2e-16 ***
	lc.comp$LibraryPrep:lc.comp$Assembly      1 3.688e+08 3.688e+08   322.9 <2e-16 ***
	Residuals                            923467 1.055e+12 1.142e+06                   
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
	5 observations deleted due to missingness
	> summary(aov(lc.comp$AvgCovTotal~lc.comp$LibraryPrep*lc.comp$Assembly))
											 Df    Sum Sq Mean Sq F value   Pr(>F)    
	lc.comp$LibraryPrep                       1 3.464e+04   34637   7.837  0.00512 ** 
	lc.comp$Assembly                          1 1.111e+06 1110605 251.273  < 2e-16 ***
	lc.comp$LibraryPrep:lc.comp$Assembly      1 2.880e+05  287958  65.150 6.95e-16 ***
	Residuals                            923467 4.082e+09    4420                     
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
	5 observations deleted due to missingness
	
Variance in coverage:
	oRAD has higher variance in coverage..
	> wilcox.test(log(o.cov$CovVariance+1),log(d.cov$CovVariance+1),"greater")

		Wilcoxon rank sum test with continuity correction

	data:  log(o.cov$CovVariance + 1) and log(d.cov$CovVariance + 1)
	W = 8914700000, p-value < 2.2e-16
	alternative hypothesis: true location shift is greater than 0

	->due to more loci? subsample both of them-->still sig, but not as much (probably real??)
	->could be due to PCR duplicates rather than allele dropout--what does heterozygosity do?
	
ddRAD has higher proportion of heterozygotes at loci.
	> wilcox.test(o.cov$PropHet,d.cov$PropHet,"less")

		Wilcoxon rank sum test with continuity correction

	data:  o.cov$PropHet and d.cov$PropHet
	W = 7296700000, p-value < 2.2e-16
	alternative hypothesis: true location shift is less than 0

Assembled together:
	> wilcox.test(bo.cov$CovVariance,bd.cov$CovVariance,paired=T,"less")

		Wilcoxon signed rank test with continuity correction

	data:  bo.cov$CovVariance and bd.cov$CovVariance
	V = 1727900000, p-value < 2.2e-16
	alternative hypothesis: true location shift is less than 0
	
	> wilcox.test(bo.cov$PropHet,bd.cov$PropHet,paired=T,"less")

		Wilcoxon signed rank test with continuity correction

	data:  bo.cov$PropHet and bd.cov$PropHet
	V = 908260000, p-value < 2.2e-16
	alternative hypothesis: true location shift is less than 0
	
Assembled together, sdRAD has lower variance in coverage but still has fewer heterozygotes.

Suggesting that if they are assembled together, the PCR duplicates issue in sdRAD is ameliorated?? 
	> summary(aov(lc.comp$PropHet~lc.comp$LibraryPrep*lc.comp$Assembly))
											 Df Sum Sq Mean Sq  F value Pr(>F)    
	lc.comp$LibraryPrep                       1     98   98.04 3577.429 <2e-16 ***
	lc.comp$Assembly                          1    141  141.39 5159.532 <2e-16 ***
	lc.comp$LibraryPrep:lc.comp$Assembly      1      0    0.17    6.377 0.0116 *  
	Residuals                            489229  13407    0.03                    
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
	3 observations deleted due to missingness
	> summary(aov(log(lc.comp$CovVariance+1)~lc.comp$LibraryPrep*lc.comp$Assembly))
											 Df Sum Sq Mean Sq F value   Pr(>F)    
	lc.comp$LibraryPrep                       1    130  129.50  163.37  < 2e-16 ***
	lc.comp$Assembly                          1    263  263.15  331.98  < 2e-16 ***
	lc.comp$LibraryPrep:lc.comp$Assembly      1     19   18.94   23.89 1.02e-06 ***
	Residuals                            489215 387782    0.79                     
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
	17 observations deleted due to missingness


In-silico digestion: segmentation fault recorded overnight
->could this be due to maxing out memory by storing too many seqs? 
	->probably not, but checking it out by writing directly to the file.
		->maxed out the storage space, need to find another solution.
	Process each sequence as I read it in from fasta file--don't need to store the fasta
	print out the length of each fragment
	*I didn't want to deal with processing each sequence right now, for now I'm just printing out the length of each fragment into the output file.
	**Rather than dealing with sequences, what if i deal with fragment sizes and identity of enzyme on either side?

	Output:
		23407086 fragments on LG1
		9912669 fragments on LG10
		11493655 fragments on LG11
		13251395 fragments on LG12
		8483744 fragments on LG13
		13101037 fragments on LG14
		10552769 fragments on LG15
		9810463 fragmentson LG16
		9684039 fragments on LG17
		7908134 fragments on LG18
		7424705 fragments on LG19
		21572244 fragments on LG2
		
	
#####Tuesday, 13 September 2016
Starting with the dataset when it was assembled together:
for each locus, calculate average coverage per individual for the two datasets.
Wrote a function to calculate coverage statistics for subsets of individuals (e.g. orad and drad).
It also reports # of heterozygotes

This way I can compare the two. I can do a paired t-test maybe?

	> t.test(loc.cov$oRAD.AvgCovRatio,loc.cov$dRAD.AvgCovRatio,paired=T)

		Paired t-test

	data:  loc.cov$oRAD.AvgCovRatio and loc.cov$dRAD.AvgCovRatio
	t = -28.323, df = 301960, p-value < 2.2e-16
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 -102.99975  -89.66692
	sample estimates:
	mean of the differences 
		      -96.33334 

	> summary(loc.cov$oRAD.AvgCovRatio)
	    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
	   0.000    1.060    3.035    8.521    7.641 3348.000        5 
	> summary(loc.cov$dRAD.AvgCovRatio)
	     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
	     0.01      2.38      5.85    104.90     13.78 221300.00 
     
     
Unsure why there are some with such huge numbers--ref is 576 and alt is 0.0026 (so it makes sense)
	> t.test(loc.cov$oRAD.AvgCovRatio,loc.cov$dRAD.AvgCovRatio,paired=T,alternative="less")

		Paired t-test

	data:  loc.cov$oRAD.AvgCovRatio and loc.cov$dRAD.AvgCovRatio
	t = -28.323, df = 301960, p-value < 2.2e-16
	alternative hypothesis: true difference in means is less than 0
	95 percent confidence interval:
	      -Inf -90.73871
	sample estimates:
	mean of the differences 
		      -96.33334 
		    
How about heterozygosity?
	> summary(loc.cov$oRAD.PropHet)
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
	0.00000 0.02222 0.13560 0.18190 0.26000 1.00000       5 
	> summary(loc.cov$dRAD.PropHet)
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
	0.00000 0.08857 0.17670 0.22440 0.31690 1.00000 
	> t.test(loc.cov$oRAD.PropHet,loc.cov$dRAD.PropHet,paired=T)

		Paired t-test

	data:  loc.cov$oRAD.PropHet and loc.cov$dRAD.PropHet
	t = -108.21, df = 301960, p-value < 2.2e-16
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 -0.04323838 -0.04169985
	sample estimates:
	mean of the differences 
		    -0.04246912 

Total coverage per locus?
	> summary(loc.cov$AvgCovTotal.y)
	    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
	   3.523    7.750   10.850   13.480   15.240 2732.000 
	> summary(loc.cov$AvgCovTotal.x)
	    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
	    2.75     7.46     9.27    14.44    12.04 33000.00        5 
	> t.test(loc.cov$AvgCovTotal.x,loc.cov$AvgCovTotal.y,paired=T)

		Paired t-test

	data:  loc.cov$AvgCovTotal.x and loc.cov$AvgCovTotal.y
	t = 6.1149, df = 301960, p-value = 9.675e-10
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 0.6491353 1.2615620
	sample estimates:
	mean of the differences 
		      0.9553486 
		      

Proportion missing?
	> t.test(loc.cov$oRAD.PropMiss,loc.cov$dRAD.PropMiss,paired=T)

		Paired t-test

	data:  loc.cov$oRAD.PropMiss and loc.cov$dRAD.PropMiss
	t = 98.967, df = 301970, p-value < 2.2e-16
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 0.03499156 0.03640553
	sample estimates:
	mean of the differences 
		     0.03569854 

	> summary(loc.cov$oRAD.PropMiss)
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
	 0.0000  0.1333  0.2167  0.2449  0.3333  1.0000 
	> summary(loc.cov$dRAD.PropMiss)
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
	0.00000 0.08333 0.17970 0.20920 0.30990 0.77860 

Private alleles/heterozygosity
	> dim(loc.cov[(loc.cov$AvgCovRef.x == 0 & loc.cov$AvgCovAlt.y == 0) | (loc.cov$AvgCovAlt.x == 0 & loc.cov$AvgCovRef.y == 0),])
	[1] 2042   19
	> dim(loc.cov[loc.cov$NumHet.x==0,])
	[1] 72307    19
	> dim(loc.cov[loc.cov$NumHet.y==0,])
	[1] 14436    19
	> dim(loc.cov[loc.cov$NumHet.x==0 & loc.cov$NumHet.y==0,])
	[1] 10031    19
	> dim(loc.cov[loc.cov$NumPresent.x == 0,])
	[1]  5 19
	
Coverage per individual (rather than per locus):
	> tapply(as.numeric(ind.cov$AvgCovTot), as.factor(ind.cov$Method),summary)
	$dRAD
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
	  4.537  11.410  13.520  13.630  15.570  40.290 

	$oRAD
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
	  8.584  11.940  14.150  15.180  17.250  34.520 

	> t.test(as.numeric(ind.cov$AvgCovTot)~as.factor(ind.cov$Method))

	Welch Two Sample t-test

	data:  as.numeric(ind.cov$AvgCovTot) by as.factor(ind.cov$Method)
	t = -2.1119, df = 69.224, p-value = 0.03831
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 -3.0138505 -0.0858908
	sample estimates:
	mean in group dRAD mean in group oRAD 
			  13.62667           15.17654 

Starting to work on the in silico digest as well
->beginning with just digesting it with the restriction sites.
Q: they are palindromic, so do I not need to use the reverse complement? Should I do the 3'->5' sequence?
	Note: this is eating up the entire CPU on the virtual machine.


