Selection Components Analysis Log File
This is being started on 27 January 2016, so it's a bit late, and I've already filled my lab notebook with >50 pages.
Outline of the approach:
1. Run Stacks, prune dataset, calculate Fsts among groups, compare those Fst distributions to null distributions from simulation models.
	Programs utilized:
	-> ref_map.pl (Stacks)
	-> populations (Stacks)
	biallelic:
		-> scripts/convert_snps.R 
		-> infer_maternal_contribution/infer_mat_vcf/infer_mat_vcf 
		-> merge_vcfs.R 
		-> gwsca_biallelic/gwsca_biallelic_vcf
		-> gwsca_biallelic_analysis.R
		-> annotate top 5% outliers	
			[-> blastn RAD seqs]
			-> extract 5kb region surrounding RAD seq and blastn
		-> Monnahan/Kelly analysis
			-> vcf_to_sca_monnahan
			-> ML.2016.pipefish.py
			-> gwsca_biallelic_analysis.R
	#NOT PART OF THE PAPER:
	haplotypes:
		-> convert_matches 
		-> infer_maternal_contribution 
		-> gwsca_haplotypes
		-> R to compare to model (from sca_simulation)
		relatedness/parentage:
			(make haplotypes_files.txt)
			-> haplotypes_to_cervus
			-> prune_cervus_genotypes.R
			-> CERVUS 
			-> calculate_relatedness 
			-> band_sharing




#############################################LAB BOOK###########################################
#####Wednesday, 16 November 2016
Trying to figure out the GO results - it's a different format now, so that's a bit tricky.
Also it seems like very few have biology 2nd level GO annotations, so that's confusing.


#####Thursday, 3 November 2016
Comparing the simulated maternal alleles with different error rates.
> Anova(lm(as.numeric(inf.af$AlleleFreq)~inf.af$Type*as.factor(inf.af$ErrorRate)))
Anova Table (Type II tests)

Response: as.numeric(inf.af$AlleleFreq)
                                        Sum Sq     Df F value Pr(>F)
inf.af$Type                                  0      1  0.0922 0.7614
as.factor(inf.af$ErrorRate)                  0      5  0.0195 0.9998
inf.af$Type:as.factor(inf.af$ErrorRate)      0      5  0.0077 1.0000
Residuals                                15582 753768 

Still no bias! Let's keep increasing the error rate.

OK, once we get to ~30% we start to see some error.
---
Now I'm re-doing the blast2go stuff, which means I need to re-do the extracting of the RAD regions and re-do the blastx searches.
in SCA/programs/extract_sequence_part/:
./extract_sequence_part -f ../../../scovelli_genome/SSC_integrated.fa -i ../../results/biallelic_outliers/rad_region/all.shared_extract.txt -o ../../results/biallelic_outliers/rad_region/all.shared_extract.fasta
(36 loci)
./extract_sequence_part -f ../../../scovelli_genome/SSC_integrated.fa -i ../../results/biallelic_outliers/rad_region/all.unique_extract.txt -o ../../results/biallelic_outliers/rad_region/all.unique_extract.fasta
(649 loci)
Need to re-blast...best way to do this? at home?


#####Thursday, 27 October 2016
> wilcox.test(fm.sig$FemAllele1Freq[fm.sig$Chi.p.adj<=0.05],fm.sig$RefFreq[fm.sig$Chi.p.adj<=0.05],paired=T,alternative = "greater")

	Wilcoxon signed rank test with continuity correction

data:  fm.sig$FemAllele1Freq[fm.sig$Chi.p.adj <= 0.05] and fm.sig$RefFreq[fm.sig$Chi.p.adj <= 0.05]
V = 114910, p-value = 4.924e-06
alternative hypothesis: true location shift is greater than 0

> wilcox.test(fm.sig$MalAllele1Freq[fm.sig$Chi.p.adj<=0.05],fm.sig$RefFreq[fm.sig$Chi.p.adj<=0.05],paired=T,alternative = "less")

	Wilcoxon signed rank test with continuity correction

data:  fm.sig$MalAllele1Freq[fm.sig$Chi.p.adj <= 0.05] and fm.sig$RefFreq[fm.sig$Chi.p.adj <= 0.05]
V = 70438, p-value = 9.628e-09
alternative hypothesis: true location shift is less than 0

> wilcox.test(mo.sig$FemAllele1Freq[mo.sig$Chi.p.adj<=0.05],mo.sig$RefFreq[mo.sig$Chi.p.adj<=0.05],paired=T,alternative = "less")

	Wilcoxon signed rank test with continuity correction

data:  mo.sig$FemAllele1Freq[mo.sig$Chi.p.adj <= 0.05] and mo.sig$RefFreq[mo.sig$Chi.p.adj <= 0.05]
V = 678, p-value = 0.2216
alternative hypothesis: true location shift is less than 0

> wilcox.test(mo.sig$MomAllele1Freq[mo.sig$Chi.p.adj<=0.05],mo.sig$RefFreq[mo.sig$Chi.p.adj<=0.05],paired=T,alternative = "less")

	Wilcoxon signed rank test with continuity correction

data:  mo.sig$MomAllele1Freq[mo.sig$Chi.p.adj <= 0.05] and mo.sig$RefFreq[mo.sig$Chi.p.adj <= 0.05]
V = 521, p-value = 0.01867
alternative hypothesis: true location shift is less than 0

#####Friday, 2 September 2016
I've figured out the issue--it wasn't standardized. So I exported everything with Level 3 Biological Process. 

I can't check the error rate because the files are on the external at work, but I can update the figure legends.

#####Wednesday, 31 August 2016
Reacquainting myself with the status of the SCA manuscript...
I've delved into Kelly's python script to understand how the test of successful mothers vs adults is done:
	it's simply a test of differing allele frequencies using measured population af as input, but assuming successful moms have a different freq than the rest of the population
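A minimal sketch of that kind of test, to keep the logic straight. This is NOT Kelly's code: it is a generic likelihood-ratio test for a difference in allele frequency between two groups, assuming simple binomial sampling of allele counts. The function names and the example counts are made up for illustration.

```python
import math

def binom_loglik(k, n, p):
    """Log-likelihood of seeing k copies of an allele among n sampled alleles
    (binomial coefficient dropped because it cancels in the ratio)."""
    ll = 0.0
    # Only evaluate log() for terms with non-zero counts, which avoids log(0).
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def allele_freq_lrt(k1, n1, k2, n2):
    """Null: both groups share one allele frequency. Alternative: each group has its own."""
    p_null = (k1 + k2) / (n1 + n2)
    ll_null = binom_loglik(k1, n1, p_null) + binom_loglik(k2, n2, p_null)
    ll_alt = binom_loglik(k1, n1, k1 / n1) + binom_loglik(k2, n2, k2 / n2)
    lrt = max(0.0, 2.0 * (ll_alt - ll_null))
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lrt / 2.0))
    return lrt, p_value

# Hypothetical counts: 10/100 copies in one group vs 40/100 in the other.
lrt, p_value = allele_freq_lrt(10, 100, 40, 100)
```

The one-degree-of-freedom chi-square approximation is an assumption here; the real script may use a different likelihood and a different null.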
Looking into the GO analyses: although the LRT and Fst analyses have overlapping SNPs in the FM comparisons, they don't have overlapping bars in the bio bar graph, but they do in bio2.
The blast results seem to be the same, though, for the same genes--so why are they so different? I'm going to have to re-do the blast2go analysis I think.
I need to update the supplement.
#####Monday, 25 July 2016
It seems that perhaps the mapping results weren't the ones I'd output before. 
And I need to put all the shared ones together in one output.
Another idea: include "NAs" in plot.
I need to fix this at home, so I'll just work on writing for now.

#####Sunday, 24 July 2016
I ran Blast2Go on all of the outlier blast results. Output the 'mapping' results. 

#####Friday, 22 July 2016
THE REASON IS THAT THERE ARE MULTIPLE SNPS PER LOCUS! Duh.

#####Thursday, 21 July 2016
Ran gwsca_biallelic_vcf.

Now re-doing the analysis.
gwsca:
	> dim(aj.out)
	[1] 563  26
	> dim(fm.out)
	[1] 458  26
	> dim(mo.out)
	[1] 346  26
	> aj.top1[1]
	[1] 0.0104507
	> fm.top1[1]
	[1] 0.0382764
	> mo.top1[1]
	[1] 0.0388301
	> length(levels(factor(c(as.character(aj.fm$Locus),as.character(aj.mo$Locus),
	+ as.character(fm.mo$Locus)))))
	[1] 147
	> dim(aj.unique)
	[1] 508  26
	> dim(fm.unique)
	[1] 399  26
	> dim(mo.unique)
	[1] 263  26
	> dim(shared.out)
	[1]  1 26
	> dim(aj.fm)
	[1] 40 26
	> dim(aj.mo)
	[1] 51 26
	> dim(fm.mo)
	[1] 58 26

Monnahan:
	> dim(hd[hd$bh_0<=0.05,])#87
	[1] 87 35
	> dim(hd[hd$bh_2<=0.05,])#0
	[1]  0 35
	> dim(hd[hd$bh_3<=0.05,])#19
	[1] 19 35
	
> length(fm.both.out)
[1] 22
> length(mo.both.out)
[1] 5

It doesn't seem to have changed anything. So re-running it all didn't actually change the fact that the Monnahan analysis has loci that seemingly aren't in the other dataset despite them originating from the same thing.

#####Wednesday, 20 July 2016
Pulling out 5kb regions around the outliers from both model 0 (viability selection btwn males and females) and model 3 (sexual selection) from the LRT. 
Need to also ID the number of SNPs per locus for the ones that aren't in gw.sum. So I'm going to re-run both of the analyses.........funnnnn
Ran convert_snps, infer_mat_vcf, and merge_vcfs.
Also re-ran the Monnahan analysis without the PRM077-OFF077 father-offspring pair.

#####Tuesday, 19 July 2016
Successfully identified outliers! And there are some that overlap the viability and sexual selection analyses.

#####Monday, 18 July 2016
The Kelly approach isn't working right now...getting this error:
	Traceback (most recent call last):
	  File "ML.2016.pipefish.py", line 265, in <module>
		snpX=cols[1]
	IndexError: list index out of range
Due to the empty line at the beginning of the file. Now it's running.
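A small sketch of a defensive way to read the input so a stray blank line can't trigger the IndexError again. This is not the actual code in ML.2016.pipefish.py; the function name, the tab delimiter and the example row are assumptions for illustration.

```python
def read_snp_rows(lines, min_cols=2):
    """Yield each line split on tabs, skipping blank lines and lines with too few fields."""
    for line in lines:
        if not line.strip():
            # A blank line has no second field, so cols[1] would raise IndexError.
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < min_cols:
            continue
        yield cols

# Hypothetical input with an empty first line, like the file that broke the run.
example = ["\n", "LG1\t1234\tA\n", "   \n"]
rows = list(read_snp_rows(example))
```

With a real file this would be called as read_snp_rows(open(path)), where path is whatever input the script expects.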

Re-running populations on the 'both' dataset:
	populations -b 3 -P ./results/stacks_both/ -t 3 -M ./sca_popmap_both.txt -s -r 0.5 -p 3 -a 0.05 --fstats --vcf --genomic --plink


#####Thursday, 14 July 2016
Ubuntu had frozen yesterday, so today I quit and re-started ./run_refmap_both.sh

#####Wednesday, 13 July 2016
> summary(cov$VarianceInDepthPerInd)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
      1.4       7.6      15.3     216.3      29.9 2886000.0 

#####Tuesday, 12 July 2016
Mapped high LD regions on the Fst plot

Forgot that there were two PRM035 in PstI dataset, so I'm re-aligning those (and renamed them PRM351 and PRM352). Then I can run refmap.

	bowtie2 --sensitive -x ../../scovelli_genome/ssc_chromonome -S orad_PRM351.sam -t -U orad_PRM351.fq -p 4
	Time loading reference: 00:00:00
	Time loading forward index: 00:00:02
	Time loading mirror index: 00:00:01
	Multiseed full-index search: 00:02:38
	3714166 reads; of these:
	  3714166 (100.00%) were unpaired; of these:
		236508 (6.37%) aligned 0 times
		2678649 (72.12%) aligned exactly 1 time
		799009 (21.51%) aligned >1 times
	93.63% overall alignment rate
	Time searching: 00:02:41
	Overall time: 00:02:42
	
	bowtie2 --sensitive -x ../../scovelli_genome/ssc_chromonome -S orad_PRM352.sam -t -U orad_PRM352.fq -p 4
	Time loading reference: 00:00:01
	Time loading forward index: 00:00:01
	Time loading mirror index: 00:00:00
	Multiseed full-index search: 00:01:56
	2706820 reads; of these:
	  2706820 (100.00%) were unpaired; of these:
		142579 (5.27%) aligned 0 times
		2030999 (75.03%) aligned exactly 1 time
		533242 (19.70%) aligned >1 times
	94.73% overall alignment rate
	Time searching: 00:01:58
	Overall time: 00:01:58

And ran Stacks on orad and both datasets.

#####Monday, 11 July 2016
Selection differentials: standardize entire population, then calculate means for each group.
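Sketch of the standardize-then-average idea, assuming one trait and one group label per individual. The function name and toy values are hypothetical, and I'm using the sample standard deviation (n-1) to match what R's sd() would give.

```python
from statistics import mean, stdev

def standardized_group_means(trait_values, groups):
    """Z-score the trait across the whole population, then average z within each group."""
    mu = mean(trait_values)
    sd = stdev(trait_values)  # sample SD (n - 1 denominator)
    z_scores = [(x - mu) / sd for x in trait_values]
    by_group = {}
    for z, g in zip(z_scores, groups):
        by_group.setdefault(g, []).append(z)
    return {g: mean(zs) for g, zs in by_group.items()}

# Toy example: four individuals, two mated and two unmated.
result = standardized_group_means([1.0, 2.0, 3.0, 4.0],
                                  ["unmated", "unmated", "mated", "mated"])
```

Because the population mean is 0 after standardizing, the mean z of the mated group can be read directly as a standardized selection differential under this setup.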

Linkage disequilibrium: finding the groups of 50 loci that have a mean pairwise LD > 0.4.
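Rough sketch of the window scan, assuming the pairwise LD values are already in a square symmetric matrix with loci in map order and that the groups are sliding windows of consecutive loci (an assumption; non-overlapping blocks would be a step of `window` instead of 1). Names are hypothetical.

```python
def high_ld_windows(ld, window=50, threshold=0.4):
    """Return (start_index, mean_pairwise_ld) for every window whose mean LD exceeds threshold.

    ld is a square list-of-lists; only pairs i < j inside each window are averaged.
    """
    n = len(ld)
    hits = []
    for start in range(0, n - window + 1):
        total = 0.0
        pairs = 0
        for i in range(start, start + window):
            for j in range(i + 1, start + window):
                total += ld[i][j]
                pairs += 1
        if pairs and total / pairs > threshold:
            hits.append((start, total / pairs))
    return hits

# Tiny toy matrix with window=2 so each window holds a single pair.
toy_ld = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
toy_hits = high_ld_windows(toy_ld, window=2, threshold=0.4)
```

This recomputes every window from scratch, which is fine for a sketch but slow for a whole linkage group.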

Meanwhile, aligning PstI to new genome
#sarah@sarah-vb:~/sf_ubuntushare/SCA/scripts$ ./run_bowtie_orad.sh > ../align_psti.log 2>&1

#####Thursday, 7 July 2016
I've been working the past few days on characterizing the weirdness in the dataset. There is something weird going on.
Many SNPs have oddly high heterozygosities and those SNPs are not those with the lowest likelihood ratios, nor with the lowest coverage. But they go away if I prune for HWE using PLINK, so that's something.

I've also been writing code to adapt my vcf file to work with John Kelly's newest program. It is fairly straightforward but there are a few small things that took some finagling. Including renaming the individuals (which I had to do with Notepad++)...and it's only using data from the 131 pregnant males & their offspring and the 57 females. I could alter this further to include all of the males and non pregnant males but I want to get this working first. I think it will use all of the adults vs. the offspring to infer viability selection. 

#####Tuesday, 5 July 2016
The coverage analysis shows that I have a bunch of loci doing weird things (high coverage, all hets)..why would this be? I'm looking at the likelihood ratios to try to figure it out, and I'll do some pruning.

Also need to look at some randomly chosen loci to look at the distribution of each allele (is it binomial?).

#####Thursday, 29 June 2016
For some reason the stack depths associated with individuals in the SCA dataset are 0 for many loci....there was a bug in Stacks v. 1.39.

Calculated selection differentials using raw means (matedu-allu). There are no significant differences between mated and unmated using t-test. But what if I restrict it to those used in RADseq?

Running Batemanater on Adam's dataset from the 2001 paper.
	Mean Mating Success of Females (males)=1
	Mean Reproductive Success Females (males) = 16.40
	Sex Ratio = 0.442623
	Estimated Number Males in Population = 55
	Bootstrap Replicates=1000
	Data:
		1	23
		1	14
		1	16
		2	34
		4	54
		3	47
		2	38
	
	Results Estimate Mating Success Distribution:
		Mean Male Mating Success: 	1.2593
		Standard Deviation in MS: 	1.5592
		Opp. for Sexual Selection:	1.5331
		Std. Dev. for Simulation: 	4.0838
	Results Estimate St. Dev. RS and Bateman Grad.
		Mean MS:    	1.253
		Mean RS:    	20.652
		StDev MS:   	1.548
		StDev RS:   	24.143
		BateGrad:   	15.355
		BG,no 0s:   	13.662
		BG':           	0.931
		Is:              	1.547
		I:               	1.386
		S'max:         	1.151
		RS incr./mate:	12.005
		StDev RS param.	3.767
	Results Estimate and Bootstrap St Dev MS, RS and BG
		Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
		MeanMS         	1.26	1.26	1.26		1.26	1.26
		MeanRS         	20.65	20.65	20.65		20.65	20.65
		Std.Dev.MS     	1.56	1.55	1.58		1.55	1.58
		Std.Dev.RS     	24.46	22.15	26.45		22.48	26.32
		OppSexSel(Is)  	1.54	1.51	1.57		1.52	1.56
		OppSelec(I)    	1.43	1.17	1.66		1.20	1.64
		BatemanGradient	15.55	13.78	16.93		14.19	16.89
		BateGrad(no0s) 	14.21	9.74	17.56		10.73	17.45
		Standardized BG	0.94	0.83	1.03		0.86	1.02
		S'max            	1.17	1.03	1.27		1.06	1.27
		SDmsForEstim.  	4.13	3.93	4.22		3.99	4.22
		RS incr./mate 	13.07	7.18	18.50		8.27	18.30
		StDev RS param	3.27	0.69	7.24		0.89	6.84

		
		
#####Wednesday, 28 June 2016
The vcf_to_cervus program wasn't writing the data out correctly, which is why I ran into issues. I think I've fixed it--I'm re-running it, so we'll see. 

#####Thursday, 23 June 2016
After the Evolution meeting, I've got some new ideas:
	1. Odd LD patterns might be due to a segregating inversion (thanks Molly!)
		-look into papers re: segregating inversions
		-do the inversions match regions of differentiation between males and females?
	2. I might need to use the likelihood approach for the SCA because both SCA talks at Evolution used likelihood-based approaches.
	3. I should do an in-silico digestion of the genome to get estimates of allelic dropout etc. for the ddRAD/sdRAD paper.
	
#####Thursday, 16 June 2016
Sent raw data to Cresko lab via OneDrive

Thinking about how to analyze allele dropout:
	stack depth in matches files. For each locus, count the depth and then for each locus get the distribution of depths.
	
	I'd consider using the vcf files because they're easier to parse, but they look like this:
		GT:DP:AD:GL    0/0:0:0,0:.,4.16,.       0/1:8:5,3:.,11.09,.     ./.:0:.,.:.,.,. 0
		where some individuals get genotyped but have an allele depth of 0?? doesn't make any sense. So I think the matches file will be better.
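Quick sketch for flagging the odd vcf entries described above (a called genotype with a depth of 0). It only assumes the FORMAT string and per-sample fields look like the example line; the function name is hypothetical.

```python
def called_with_zero_depth(format_field, sample_field):
    """True if the sample has a called genotype (no '.' in GT) but DP is 0."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    gt = values.get("GT", "./.")
    dp = values.get("DP", ".")
    called = "." not in gt
    return called and dp.isdigit() and int(dp) == 0

# The three sample fields from the example line above.
fmt = "GT:DP:AD:GL"
flags = [called_with_zero_depth(fmt, s)
         for s in ["0/0:0:0,0:.,4.16,.", "0/1:8:5,3:.,11.09,.", "./.:0:.,.:.,.,."]]
```

Counting these per locus would show how widespread the problem is before deciding between the vcf and the matches files.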

Parentage analysis after talking with Adam:
	Calculate selection differentials as mated vs non-mated females and their traits
	Calculate bivariate mean to get intercept (mean_x*beta + intercept = mean_y) to plot Bateman gradient
	Run CERVUS on SNPs.
		-> Need to convert SNPs to CERVUS format. Adapting haplotypes_to_cervus to deal with vcf.
		It says there are 48351 loci in 384 individuals..these are the MAF > 0.1 loci. I need to use the other vcf.
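Sketch of the intercept calculation for the Bateman gradient line: a line with slope beta that passes through the bivariate mean has intercept = mean_y - beta*mean_x. The function name is hypothetical; the example plugs in the MeanMS, MeanRS and BatemanGradient estimates from the "females sans zeros" Batemanater output recorded in this log, purely as an illustration.

```python
def intercept_through_means(mean_x, mean_y, slope):
    """Intercept of the line with the given slope that passes through (mean_x, mean_y)."""
    return mean_y - slope * mean_x

# MeanMS = 1.421, MeanRS = 26.895, BatemanGradient = 21.540 (females, no zeros).
mean_ms, mean_rs, bateman_gradient = 1.421, 26.895, 21.540
intercept = intercept_through_means(mean_ms, mean_rs, bateman_gradient)

def predicted_rs(mating_success):
    """Reproductive success predicted by the line at a given mating success."""
    return intercept + bateman_gradient * mating_success
```

Two points from predicted_rs() (for example at mating success 0 and at the maximum observed) are enough to draw the line on the plot.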
	
	



#####Monday, 13 June 2016
Populations is finished running, but it gave the "Warning: Unable to find allele depths for datum" message again.

Don't care, doing:
vcftools --keep adults_list.txt --vcf batch_1.vcf --out batch_1.adults_MAF1 --recode
48351 loci.

Ran the LD script.

I'm trying to find out info about the regions that have high LD across the linkage groups but the sumstats file doesn't seem to match up with the locus IDs in ld file. 
Oh! LD file had Chr.BP.Locus not Chr.Locus.BP. My bad.
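Small sketch for converting the LD file IDs so they match the other ordering. It assumes the three parts are joined with "." and that the last two are always BP and Locus; the function name and example ID are made up.

```python
def to_chr_locus_bp(ld_id):
    """Convert 'Chr.BP.Locus' to 'Chr.Locus.BP'."""
    # rsplit from the right so a '.' inside the chromosome name is left alone.
    chrom, bp, locus = ld_id.rsplit(".", 2)
    return ".".join([chrom, locus, bp])

converted = to_chr_locus_bp("LG1.12345.678")
```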



#####Thursday, 9 June 2016
It appears that the dataset with maf 0.1 cutoff will have 27730 loci.

#####Wednesday, 8 June 2016
I discovered that my offspring file for CERVUS contained mismatch errors between offspring and their putative parents, so I fixed that and need to re-run CERVUS. GAHHHH nothing can ever be done only 3 times.

For the LD analysis for Clay:
	vcftools --keep adults_list.txt --vcf batch_1.vcf --out batch_1.adults.vcf --recode
to keep only the adults and re-run the LD analysis.

Now I've done that but should probably re-run populations with the maf 0.1 also...I'll do the LD figs for this set too though.
	populations -b 1 -P ./ -M ../../sca_popmap_ddrad.txt -a 0.1 -t 3 -r 0.5 -p 3 --vcf --fstats --plink

Re-running Batemanater:
		Mean mating success of females (actually males) = 0.93596059
		Mean reproductive success of females (actually males) = 16.2967
		Sex ratio: 87/(87+203) = 0.3
		Estimated number of males (actually females) = 379
		Bootstrap Replicates: 1000
	Alarm Clock Results:
		Mean Male Mating Success: 	2.1839
		Standard Deviation in MS: 	2.6336
		Opp. for Sexual Selection:	1.4542
		Std. Dev. for Simulation: 	6.9064
	Earth Results:
		Mean of the Top 10 Solutions:
		Mean MS:    	2.179
		Mean RS:    	38.030
		StDev MS:   	2.624
		StDev RS:   	49.813
		BateGrad:   	17.663
		BG,no 0s:   	17.890
		BG':           	1.012
		Is:              	1.452
		I:               	1.719
		S'max:         	1.219
		RS incr./mate:	19.954
		StDev RS param.	27.561
	Left-Pointing Hand:
		Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
		MeanMS         	2.18	2.18	2.18		2.18	2.18
		MeanRS         	38.03	38.03	38.03		38.03	38.03
		Std.Dev.MS     	2.63	2.62	2.64		2.62	2.64
		Std.Dev.RS     	48.29	39.17	50.22		42.08	49.97
		OppSexSel(Is)  	1.45	1.44	1.46		1.44	1.46
		OppSelec(I)    	1.62	1.06	1.75		1.23	1.73
		BatemanGradient	17.41	14.41	17.96		15.39	17.93
		BateGrad(no0s) 	17.36	11.16	18.51		13.20	18.44
		Standardized BG	1.00	0.83	1.03		0.88	1.03
		S'max            	1.20	0.99	1.24		1.06	1.24
		SDmsForEstim.  	6.89	6.78	6.94		6.81	6.94
		RS incr./mate 	18.61	8.88	20.37		11.39	20.27
		StDev RS param	22.22	4.38	31.76		8.94	29.93
		
#####Tuesday, 7 June 2016
I need to do statistics to compare the band sharing results. ANOVA, probably.

Which Bateman gradient info do I present? Do I present the final table for females or the estimate from the actual data? The actual data aren't within the confidence intervals, so I think I probably want the estimates.

Question re: plotting: can I plot the estimated bateman gradient instead of the lm()??


#####Monday, 6 June 2016
I fixed up the Fst and SNPstats figures and wrote up methods and (very brief) results & sent them to Clay.
There may be more analyses to explore but I didn't do them.

Is there a way to calculate expected band sharing? Not easily.

Plotting band sharing results: boxplots of father-offspring and putatively unrelated, but what about assigned mothers?

#####Friday, 3 June 2016
I have a few goals for the parentage analysis:
1. How many alleles per locus? What is the distribution of minimum allele frequencies? What do the allele frequency distributions look like?
2. Do the ones that are higher in CERVUS have different allele frequency distributions?
	Higher SNP sets: gen100_7, gen150_7, gen150_6, gen200_5, gen200_3, gen200_9, gen300_10, gen300_2, gen300_9
	The average number of alleles per locus is 4.509 for all sets, with a min of 4.15 and a max of 4.945. 
	Observed Het doesn't follow the same pattern either...although 1600 has the most median observed het with smallest variance
	Observed Het:
		$`50`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.08640 0.09034 0.09495 0.09593 0.09961 0.10970 

		$`100`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.07406 0.07952 0.08983 0.08834 0.09424 0.10580 

		$`150`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.07237 0.08260 0.08636 0.08794 0.09170 0.10480 

		$`200`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.07562 0.08536 0.09141 0.08951 0.09352 0.09855 

		$`300`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.07868 0.08572 0.08932 0.08895 0.09180 0.10110 

		$`400`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.07593 0.08461 0.09310 0.08955 0.09422 0.09624 

		$`800`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.08361 0.08548 0.08752 0.08794 0.09074 0.09266 

		$`1600`
		   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
		0.08743 0.08776 0.08804 0.08813 0.08850 0.08904 
	In 150 loci, set 7 is the one with the highest assignment rate and the highest observed het. 
	
	The most polymorphic loci are contained in the sets with more loci (simple sampling of loci)
	
	
To run band sharing on all pairwise combinations I need to create a file with two columns, which I did in R.
Now the question is: Is there a difference in the band sharing among unrelated individuals and fathers and offspring?

	t.test(all.bs$Shared~all.bs$combo)

			Welch Two Sample t-test

	data:  all.bs$Shared by all.bs$combo
	t = 17.664, df = 133.07, p-value < 2.2e-16
	alternative hypothesis: true difference in means is not equal to 0
	95 percent confidence interval:
	 0.03107706 0.03891432
	sample estimates:
	mean in group father-offspring        mean in group unrelated 
						 0.9451931                      0.9101974 
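For the record, a sketch of how the two-column file of all pairwise combinations could be generated (I did it in R; this is just the same idea in Python). The function name and the individual IDs are hypothetical.

```python
from itertools import combinations

def pairwise_rows(ids):
    """One tab-separated row per unordered pair of individuals."""
    return ["{}\t{}".format(a, b) for a, b in combinations(ids, 2)]

pair_rows = pairwise_rows(["PRM001", "OFF001", "FEM001"])
```

For n individuals this gives n*(n-1)/2 rows, so the file grows quickly with sample size.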


#####Wednesday, 1 June 2016
I'm re-running Batemanater but saving a log so I can generate a figure like in Kenyon's paper with observed and estimated mating successes.
Also going to re-run it with non-matched females as zeros.

When calculating reproductive success average, do I include non-mated males? 

OK, I calculated an estimate for the number of females using Pollock et al (1990):
nhat=(n1+1)(n2+1)/(m2+1)-1
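Sketch of the population-size estimate as a function, using Chapman's bias-corrected form of the Lincoln-Petersen estimator, (n1+1)(n2+1)/(m2+1) - 1. Here n1 is the number in the first sample, n2 the number in the second sample, and m2 the number in the second sample that were also in the first. The function name and the example counts are made up, not my real data.

```python
def chapman_estimate(n1, n2, m2):
    """Chapman's bias-corrected Lincoln-Petersen estimate of population size."""
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1

# Hypothetical counts: 9 in each sample, 4 seen in both.
n_hat = chapman_estimate(9, 9, 4)
```

How the pipefish collections map onto n1, n2 and m2 is a separate decision that this sketch doesn't settle.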

Females sans zeros:
	Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
	MeanMS         	1.421	1.158	1.684		1.211	1.632
	MeanRS         	26.895	19.737	34.684		20.684	33.263
	StDevMS     	0.607	0.375	0.772		0.419	0.761
	StDevRS     	16.822	10.086	20.415		10.767	19.780
	OppSexSel(Is)  	0.182	0.105	0.259		0.113	0.250
	OppSelec(I)    	0.391	0.215	0.563		0.235	0.533
	BatemanGradient	21.540	14.087	34.064		15.293	31.500
	StandardizedBG	1.138	0.795	1.708		0.870	1.560
	S'max         	0.486	0.291	0.642		0.330	0.612
	
Results:
Mean Male Mating Success: 	2.1839
Standard Deviation in MS: 	2.6244
Opp. for Sexual Selection:	1.4441
Std. Dev. for Simulation: 	6.9064

Mean of the Top 10 Solutions:
Mean MS:    	2.180
Mean RS:    	38.030
StDev MS:   	2.620
StDev RS:   	49.424
BateGrad:   	17.603
BG,no 0s:   	17.771
BG':           	1.009
Is:              	1.447
I:               	1.692
S'max:         	1.213
RS incr./mate:	19.536
StDev RS param.	26.466

 
Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
MeanMS         	2.18	2.18	2.18		2.18	2.18
MeanRS         	38.03	38.03	38.03		38.03	38.03
Std.Dev.MS     	2.63	2.62	2.64		2.62	2.64
Std.Dev.RS     	48.17	38.54	50.14		41.91	49.96
OppSexSel(Is)  	1.45	1.44	1.46		1.44	1.46
OppSelec(I)    	1.61	1.03	1.74		1.22	1.73
BatemanGradient	17.35	14.16	17.96		15.37	17.91
BateGrad(no0s) 	17.24	10.64	18.50		13.16	18.41
Standardized BG	0.99	0.81	1.03		0.88	1.03
S'max            	1.20	0.98	1.24		1.06	1.23
SDmsForEstim.  	6.90	6.78	6.94		6.81	6.94
RS incr./mate 	18.41	8.25	20.27		11.60	20.16
StDev RS param	22.14	3.29	31.21		7.12	29.93

Estimate with female data with zeros:
 
Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
MeanMS         	0.474	0.298	0.684		0.333	0.649
MeanRS         	8.965	5.175	13.579		5.737	12.702
StDevMS     	0.758	0.567	0.925		0.590	0.901
StDevRS     	15.955	10.588	20.200		11.441	19.390
OppSexSel(Is)  	2.563	1.492	4.421		1.652	3.917
OppSelec(I)    	3.167	1.881	5.347		2.005	4.857
BatemanGradient	19.464	15.542	23.401		16.157	22.666
StandardizedBG	1.028	0.954	1.113		0.968	1.092
S'max         	1.647	1.259	2.160		1.302	2.051

And estimate for the males:
Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
MeanMS         	0.936	0.901	0.966		0.906	0.961
MeanRS         	14.611	13.291	15.818		13.507	15.635
StDevMS     	0.245	0.183	0.299		0.195	0.292
StDevRS     	9.678	8.888	10.456		8.982	10.349
OppSexSel(Is)  	0.069	0.036	0.110		0.041	0.104
OppSelec(I)    	0.439	0.343	0.552		0.359	0.535
BatemanGradient	15.611	14.307	16.757		14.513	16.593
StandardizedBG	1.000	1.000	1.000		1.000	1.000
S'max         	0.262	0.189	0.331		0.203	0.322

#####Tuesday, 31 May 2016
I ran band_sharing.  Band-sharing is ~0.95 for all of them.
...what about the relatedness analysis? I'll try running it, see what happens.
The code was clearly not fixed. It was a little bit broken.

Also made a lovely heatmap of LG1, but I'm working on getting data for all the LGs to get some for comparison.

#####Monday, 30 May 2016
Parentage analysis...matched same 27 females to offspring, using 1642 loci.
Now to run Batemanater
	Bootstrap Reps: 1000
		In the box:
			2	62
			1	28
			1	16
			1	27
			1	16
			1	12
			2	24
			3	55
			1	19
			1	14
			2	52
			2	27
			1	3
			1	15
			2	55
			1	15
			1	26
			1	17
			2	28
		Output:	
			Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
			MeanMS         	1.421	1.158	1.684		1.211	1.632
			MeanRS         	26.895	19.789	34.316		20.737	33.211
			StDevMS     	0.607	0.375	0.772		0.419	0.761
			StDevRS     	16.822	10.055	20.681		11.027	20.108
			OppSexSel(Is)  	0.182	0.105	0.256		0.113	0.250
			OppSelec(I)    	0.391	0.216	0.558		0.238	0.525
			BatemanGradient	21.540	13.846	34.452		15.284	32.079
			StandardizedBG	1.138	0.792	1.713		0.859	1.622
			S'max         	0.486	0.286	0.635		0.328	0.604
	Part II.
		Mean mating success of females (actually males) = 0.93596059
		Mean reproductive success of females (actually males) = 16.2967
		Sex ratio: 87/(87+203) = 0.3
		Estimated number of males (actually females) = 100 (guess)
		Bootstrap Replicates: 1000
		Results:
			Mean Male Mating Success: 	2.1839
			Standard Deviation in MS: 	2.6422
			Opp. for Sexual Selection:	1.4637
			Std. Dev. for Simulation: 	6.9391
	
		Mean of the Top 10 Solutions:
			Mean MS:    	2.179
			Mean RS:    	38.029
			StDev MS:   	2.626
			StDev RS:   	49.448
			BateGrad:   	17.941
			BG,no 0s:   	18.474
			BG':           	1.028
			Is:              	1.463
			I:               	1.702
			S'max:         	1.239
			RS incr./mate:	20.372
			StDev RS param.	22.450
		
		Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
		MeanMS         	2.18	2.18	2.18		2.18	2.18
		MeanRS         	38.03	38.03	38.03		38.03	38.03
		Std.Dev.MS     	2.58	1.97	2.66		2.32	2.66
		Std.Dev.RS     	47.71	35.34	50.24		39.03	50.06
		OppSexSel(Is)  	1.41	0.81	1.49		1.13	1.48
		OppSelec(I)    	1.60	0.87	1.76		1.06	1.74
		BatemanGradient	17.69	15.08	18.24		16.27	18.17
		BateGrad(no0s) 	17.99	12.61	19.04		15.00	18.93
		Standardized BG	1.01	0.86	1.04		0.93	1.04
		S'max            	1.20	0.89	1.25		0.97	1.25
		SDmsForEstim.  	6.53	2.71	6.94		4.12	6.94
		RS incr./mate 	19.47	10.76	20.69		13.69	20.58
		StDev RS param	19.73	2.19	28.84		5.84	27.74
	
	
	
#####Thursday, 26 May 2016
Running CERVUS. simulating:
10000 offspring, 57 mothers, prop. sampled 0.05, prop. loci typed 1, prop. loci mistyped 0.01, min typed loci 0.5*number of loci, strict% 95, relaxed% 90.

#####Thursday, 12 May 2016
I'm trying to figure out the snpstats output. There doesn't seem to be any pattern in the p-values for the G analysis.

I extracted the coverage info from the ref_map file using grep and got an average of 12.45638 +/- 3.054524

Working on the LD analysis..I think I need to make a heatmap?

#####Wednesday, 11 May 2016
The LD program output says 203514388 pairwise LD calculations performed...but file didn't save (misspelled folder name), so I'm re-running.
Ran gwsca_haplotypes.

Meanwhile, I'm working on SNPstats1. Need a population ID file:
	SNPstats1 and statistics1 require a text file identifying the individuals to be
	included in each population. This file contains (either tab-separated or by line) the
	number of populations, number of individuals in each population, and column number in
	the call file for the individuals in each population. Individuals do not have to be in order
	in the call file, and not all individuals in the call file need to be included. Comments are
	allowed only at the end of this file. Example:
	3
	6
	8
	5
	7 8 9 10 21 22
	1 2 3 4 5 6 11 12
	13 14 15 16 17
and a genotype call file:
	The first line of this file has the total number of nucleotide sites
	and then the number of individuals in the file, separated by a tab or spaces. Each
	subsequent line is a single nucleotide position, with the following tab-separated fields:
	any number of ID fields (strings); chromosome/linkage group name (a string); position
	(an integer); diploid genotype for each individual (integer from 0-10; see below).
	genotype prints one ID field by default (position within each RAD tag), and the other
	programs below expect one ID field by default; Stacks may output one or more ID fields.
	The positions must be ordered such that linkage groups are together, positions are in
	increasing order within each linkage group, and the linkage groups are ordered to match
	the LGnames.txt file (required by the programs genotype or collate). Example:
	24035 5
	tag6 groupI 426 5 5 5 0 6
	tag6 groupI 427 8 8 0 0 8
	tag6 groupI 428 10 9 10 8 8
	                Nucleotide 1
	Nucleotide 2    A   C   G   T
	           A    1   2   3   4
	           C    2   5   6   7
	           G    3   6   8   9
	           T    4   7   9  10

I wrote a program (vcf_to_snpstats) to convert batch_1.vcf to the two snpstats1 files. Hopefully it works!


gwsca_haplotypes had a lot of Fst > 1. Ran haplotypes_to_cervus.

Running coverage_from_stacks on SCA dataset. Also re-calculating error rates.
The coverage_from_stacks output is really different from before--went from almost 10X to 2X???

./programs/RADpopgen/SNPstats1 ./results/sexlinked/genotype_calls.txt ./results/sexlinked/populationID.txt -o ./results/sexlinked/snpstats1_out.txt
..hit a segmentation fault.
The populationID file had an extra space and the genotype_calls file had the first alleles as -84920281 or whatever a pointer is..I replaced that with 0 and it ran.
Ran on 67018 sites. It's really fast. Do I just plot the log10 of the p_val after G-geno? I guess so.

I'll do this once I've finished pruning the Cervus genotypes with prune_cervus_genotypes.R
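Sketch of the transform for the p-value plot, assuming the plan is to plot -log10(p) per site. The function name and the floor value are my own choices; the floor only keeps p = 0 from blowing up the log.

```python
import math

def neg_log10(p_values, floor=1e-300):
    """Return -log10(p) for each p-value, clamping tiny or zero values at floor."""
    return [-math.log10(max(p, floor)) for p in p_values]

scores = neg_log10([0.01, 1.0, 0.0])
```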

#####Tuesday, 10 May 2016
Troubleshooting both of these. Also running Blast2Go to re-make SCA figures.
Not sure why the C++ code is breaking...because I was not deleting pointers properly, for one!

Meanwhile, I'm going to work on the relatedness/parentage side of things. ..
running run_convert_matches.sh
Now I'm running infer_maternal_contribution..

#####Monday, 9 May 2016
Wrote R code to calculate LD but I'm not sure it's doing what I think it should do so I'm also writing a C++ program.
#####Sunday, 8 May 2016
> dim(aj.unique)
[1] 547  25
> dim(fm.unique)
[1] 442  25
> dim(mo.unique)
[1] 263  25

> dim(shared)
[1]  1 25
> length(levels(factor(c(as.character(aj.fm$Locus),as.character(aj.mo$Locus),
+ as.character(fm.mo$Locus)))))
[1] 147
> dim(aj.fm)
[1] 40 25
> dim(aj.mo)
[1] 51 25
> dim(fm.mo)
[1] 58 25


There aren't any chromosomes that seem to have extra male-female outliers.

#####Saturday, 7 May 2016
I ran gwsca_biallelic_vcf
 
> dim(adt.n)
[1] 63038    18
> dim(juv.n)
[1] 63639    18
> dim(fem.n)
[1] 55685    18
> dim(mal.n)
[1] 63906    18
> dim(mom.n)
[1] 43017    18

> dim(aj.prune)
[1] 60451    25
> dim(fm.prune)
[1] 53956    25
> dim(mo.prune)
[1] 38537    25

If you only keep the loci on the LGs:
> dim(mo.plot)
[1] 36796    25
> dim(fm.plot)
[1] 50009    25
> dim(aj.plot)
[1] 56087    25


#####Friday, 6 May 2016
69109 loci.

#####Thursday, 5 May 2016
Somehow I got these warnings again: Warning: Unable to find allele depths for datum
I wonder what that means...I'm trying populations with  -p 3 filter also..but I also may just re-run the whole pipeline on this computer and see what happens.

OK, I'm just re-running it on here. I don't know what's happening but I wonder if it had to do with matching the wrong files or something. Anyway, hopefully this fixes whatever the problem is/was.

#####Wednesday, 4 May 2016
NOTES TO SELF:
	**Run convert_snps.R in 64-bit version of R. Otherwise it crashes
	**Run infer_mat_vcf with the correct file listing--in this case, dad.kid.pairs.fullnames.txt
	**Make sure all programs are writing to legitimate file names & locations
	
Today I:
1. Re-ran convert_snps.R
2. Ran infer_mat_vcf with dad.kid.pairs.fullnames.txt
3. Ran merge_vcfs.R to merge batch_1.vcf and biallelic_maternal.vcf (from infer_mat_vcf)
4. gwsca analysis in R

Hmm...pruned snps:
> dim(adt.n)
[1] 11782    18
> dim(juv.n)
[1] 12216    18
> dim(fem.n)
[1] 11092    18
> dim(mal.n)
[1] 12213    18
> dim(mom.n)
[1] 717  18

maybe I should re-do populations without the filters.

#####Tuesday, 3 May 2016
Re-doing populations:
populations -b 1 -P ./ -M ../../sca_popmap_ddrad.txt -t 3 --vcf -m 5 -a 0.05 

requires stack depth of 5 to genotype individual. Now I don't have any cryptic error messages.

infer_mat_vcf reports 304632 SNPs. That's too many. 
populations -b 1 -P ./ -M ../../sca_popmap_ddrad.txt -a 0.05 -t 3 --vcf -m 5 -r 0.5 -p 30

let's see what that does.

According to infer_mat_vcf, this populations run yielded 15620 SNPs.

#####Monday, 2 May 2016
Running populations on work computer:
populations -b 1 -P ./ -M ../../sca_popmap_ddrad.txt -a 0.05 -t 3

(it was maxing out the memory on my home computer)

It finished, says it retained 214702 loci.

Removed PRM077-OFF077 from the dad.kid.list file...but forgot the --vcf flag with populations so I have to re-run it.

Hmm...lots of:
"Warning: Unable to find allele depths for datum 95285
Warning: Unable to find allele depths for datum 106385
Warning: Unable to find allele depths for datum 105901
Warning: Unable to find allele depths for datum 105916
Warning: Unable to find allele depths for datum 91350
Warning: Unable to find allele depths for datum 109831
Warning: Unable to find allele depths for datum 95285
Warning: Unable to find allele depths for datum 106385
Warning: Unable to find allele depths for datum 105901
Warning: Unable to find allele depths for datum 105916
Warning: Unable to find allele depths for datum 91350
Warning: Unable to find allele depths for datum 109831
Warning: Unable to find allele depths for datum 95285
Warning: Unable to find allele depths for datum 106385
Warning: Unable to find allele depths for datum 105901
Warning: Unable to find allele depths for datum 105916"

^^has to do with making a vcf I think. infer_vcf worked just fine.

#####Saturday, 30 April 2016
Stacks finished and the vcf file is empty. I'm running populations again. I've run convert_snps, which hopefully worked, and now I'm gonna try infer_mat_vcf. Hopefully everything isn't ruined.

#####################################NEW GENOME###########################################

#####Monday, 25 April 2016
I re-made the CERVUS figure. Also downloaded a bunch of papers and outlined the MS. I need to figure out how many haplotypes were identified by convert_matches.

#####Sunday, 24 April 2016
I ran Batemanator. First step is:
Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
MeanMS         	1.421	1.158	1.684		1.211	1.632
MeanRS         	25.842	19.158	33.105		20.000	32.053
StDevMS     	0.607	0.375	0.769		0.419	0.761
StDevRS     	15.756	7.406	20.207		9.534	19.450
OppSexSel(Is)  	0.182	0.103	0.256		0.113	0.237
OppSelec(I)    	0.372	0.144	0.544		0.177	0.509
BatemanGradient	19.794	10.205	30.048		12.090	28.133
StandardizedBG	1.088	0.597	1.592		0.722	1.482
S'max         	0.465	0.198	0.613		0.264	0.593

Then I added the demographic data
mean mating success of females = males = 0.93596
mean reproductive success females = males = 16.27072
sex ratio males/males+females = females/males+females = 0.30
estimated number males = females = 100
number clutches = 27

max sd ms = 30 
number of ms intervals = 200

 Estimate mating success:
Results:
Mean Male Mating Success: 	2.1839
Standard Deviation in MS: 	2.6614
Opp. for Sexual Selection:	1.4850
Std. Dev. for Simulation: 	6.9391

 Estimate SD and Bateman Gradient
Mean of the Top 10 Solutions:
Mean MS:    	2.180
Mean RS:    	37.969
StDev MS:   	2.624
StDev RS:   	49.214
BateGrad:   	17.894
BG,no 0s:   	18.407
BG':           	1.027
Is:              	1.459
I:               	1.691
S'max:         	1.236
RS incr./mate:	20.339
StDev RS param.	22.050

Final:
Variable       	Estimate	LowerCI	UpperCI	(95.0%)	LowerCI	UpperCI	(90.0%)
MeanMS         	2.18	2.18	2.18		2.18	2.18
MeanRS         	37.97	37.97	37.97		37.97	37.97
Std.Dev.MS     	2.61	2.32	2.66		2.39	2.66
Std.Dev.RS     	47.20	36.89	50.09		38.73	49.92
OppSexSel(Is)  	1.43	1.13	1.48		1.20	1.48
OppSelec(I)    	1.57	0.95	1.75		1.05	1.74
BatemanGradient	17.37	14.08	18.14		14.70	18.08
BateGrad(no0s) 	17.36	10.53	18.87		11.76	18.79
Standardized BG	1.00	0.81	1.04		0.84	1.04
S'max            	1.19	0.93	1.25		0.98	1.25
SDmsForEstim.  	6.65	4.15	6.94		4.68	6.94
RS incr./mate 	18.43	8.55	20.55		9.49	20.55
StDev RS param	18.46	1.64	28.61		2.00	27.52


#####Tuesday, 19 April 2016
I am running CERVUS on the 99% found in HWE dataset subsetted 10 times with 150 and 300 loci. I'm running the simulation with 10000 offspring, 57 candidate mothers, prop. sampled = 0.05, prop loci typed = 0.99, prop. loci mistyped = 0.01, minimum loci typed = 0.5*num loci, relaxed%=90,strict=95.
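
A rough sketch of the subsetting step described above, assuming the pruned loci are available as a plain list of IDs. The function name, the seed, and the placeholder IDs are hypothetical and are not taken from the scripts actually used.

```python
import random

def subsample_loci(locus_ids, n_loci, n_reps, seed=0):
    """Draw n_reps random subsets of n_loci loci (no repeats within a subset)."""
    rng = random.Random(seed)  # fixed seed so the subsets can be regenerated
    return [sorted(rng.sample(locus_ids, n_loci)) for _ in range(n_reps)]

# Placeholder IDs standing in for the loci that passed the HWE / 99% pruning
loci = list(range(1, 1643))
subsets_150 = subsample_loci(loci, 150, 10)
subsets_300 = subsample_loci(loci, 300, 10)
```

Each subset would then be written out as its own CERVUS genotype file.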

#####Saturday, 16 April 2016
All the blast runs were done so I did blast2go and made supplementary files and graphs.


#####Friday, 15 April 2016
FML I was wrong AGAIN!! 
Gametic selection: do heterozygous moms/dads have 50% heterozygous offspring?
Sexual selection: breeders vs non breeders (this one's ok)
Viability selection: adults-offspring and males-females.
**Adults-offspring assumes HWE.
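
The gametic selection question in the list above (do heterozygous parents have 50% heterozygous offspring?) could be framed as an exact binomial test per locus. This is only a sketch under that assumption; the function name is hypothetical and this is not necessarily how the comparison ends up being done.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(pr for pr in probs if pr <= cutoff))

# k = heterozygous offspring out of n offspring of heterozygous parents at one locus
```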

So I need to ditch the breeders-offspring thing and re-institute the males-females. I was right originally. alwefjaoivj;kljuiwerjao;ijfajf;oijf

So, I'm re-doing all the 5kb for blast extraction stuff.
1. subset_fasta_file
2. extract_sequence_part
3. cat together
4. blastx
***********************************
Last night I worked on the simulation model and figured out that the negative LD values pop up only when I use the empirical AFS. I'm wondering if maybe negative values are OK? So I'm testing to make sure genotypes are OK etc. by outputting the genotypes of all individuals at the first generation and then at the last generation. 

Summary stats:
> dim(aj.prune)
[1] 44937    32
> dim(bj.prune)
[1] 42382    32
> dim(mo.prune)
[1] 28435    32
> mean(aj.prune$ADULT.JUVIE)
[1] 0.001372477
> mean(mo.prune$FEM.MOM)
[1] 0.005170965
> mean(bj.prune$JUVIE.BREEDER)
[1] 0.0007825038
> length(levels(as.factor(c(as.character(aj.bj$Locus[aj.bj$LocID!=41888]),
+ as.character(aj.mo$Locus[aj.mo$LocID!=41888]),
+ as.character(bj.mo$Locus[bj.mo$LocID!=41888])))))
[1] 172
> length(levels(as.factor(c(as.character(aj.bj$Chrom[aj.bj$LocID!=41888]),
+ as.character(aj.mo$Chrom[aj.mo$LocID!=41888]),
+ as.character(bj.mo$Chrom[bj.mo$LocID!=41888])))))
[1] 122
> length(levels(as.factor(c(as.character(aj.bj$LocID[aj.bj$LocID!=41888]),
+ as.character(aj.mo$LocID[aj.mo$LocID!=41888]),
+ as.character(bj.mo$LocID[bj.mo$LocID!=41888])))))
[1] 136

> dim(mo.unique)
[1] 251  32
> length(levels(as.factor(mo.unique$LocID)))
[1] 206
> length(levels(factor(mo.unique$Chrom)))
[1] 182
> dim(bj.unique)
[1] 306  32
> length(levels(as.factor(bj.unique$LocID)))
[1] 260
> length(levels(factor(bj.unique$Chrom)))
[1] 222
> dim(aj.unique)
[1] 302  32
> length(levels(as.factor(aj.unique$LocID)))
[1] 258
> length(levels(factor(aj.unique$Chrom)))
[1] 222


For the supplemental files, I want to create files with RAD locus ID,Number of SNPs, Chromosome, First BP, RAD sequence, and blast and GO results.
#####Thursday, 14 April 2016
I started looking at the SNPs that are extreme outliers and noticed that the RAD loci they're on have many SNPs, so I started looking at that more formally. There is a significant difference, if I do a linear model with comparison (AO,BO,FM) as a blocking factor:
spr.lm<-lm(as.numeric(NumSNPs)~Comparison+SNPType,data=snps.per.rad)
#> anova(spr.lm)
#Analysis of Variance Table
#
#Response: as.numeric(NumSNPs)
#              Df   Sum Sq Mean Sq F value    Pr(>F)    
#Comparison     2      877   438.5  2.2405    0.1064    
#SNPType        2    10365  5182.6 26.4778 3.206e-12 ***
#Residuals  59097 11567337   195.7 

This either means the outliers are from more erroneous regions or from more polymorphic regions. 

Do they have a higher proportion of N SNPs?
AO and BO do, but Mo not really...and it's hard to say. Not sure how to statistically test this. 

Do they have higher likelihoods?<-these data aren't actually available for the catalogs. 

For outliers on loci with a bunch of SNPs, are most of the SNPs significant?
Nope. Most have <20% of SNPs as outliers
> summary(aj.out.dat$NumSNPsOut/aj.out.dat$NumSNPs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.04167 0.08333 0.11110 0.15760 0.20000 1.00000 
> summary(bj.out.dat$NumSNPsOut/bj.out.dat$NumSNPs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03846 0.08333 0.12500 0.15320 0.20000 1.00000 
> summary(mo.out.dat$NumSNPsOut/mo.out.dat$NumSNPs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03846 0.09091 0.12500 0.16100 0.20000 1.00000 

Adam says not to worry about this too much for the SCA paper, but to look into differences between ddRAD and sdRAD datasets (e.g. can we discern the type of error? Are rates the same?)

#####Wednesday, 13 April 2016
In revisiting Christiansen & Frydenberg, I realized that I should be focusing on the adult-offspring comparison for viability selection, the inferred moms-female comparison for sexual selection, and that if I want to address gametic selection I need to compare offspring and the inferred breeding population (inferred moms and pregnant males). So I need to calculate Fsts for the last set.
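
For reference, one common per-SNP Fst between two groups is (Ht - Hs) / Ht computed from allele frequencies. The sketch below assumes equal weighting of the two groups and is not necessarily the formula implemented in gwsca_biallelic_vcf.

```python
def fst_two_groups(p1, p2):
    """Fst = (Ht - Hs) / Ht for one biallelic SNP, equal weight per group.
    p1 and p2 are the frequencies of the same allele in the two groups."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                       # expected het., pooled
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-group het.
    return 0.0 if ht == 0 else (ht - hs) / ht
```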

I updated the gwsca_biallelic_vcf to accept a 'phenotype', which I've set MOMs and PREGGERs to "BREEDER" (others are 0).

Now that I'm using those three comparisons,
> dim(shared)<-this is one RAD locus, 4 SNPs
[1]  4 32
> dim(bj.unique)
[1] 306  32
> dim(mo.unique)
[1] 251  32
> dim(aj.unique)
[1] 302  32
> dim(aj.bj)#shared
[1] 133  32
> dim(aj.mo)
[1] 39 32
> dim(bj.mo)
[1]  9 32

The shared locus is 41888 on scaffold_985, and not all SNPs are significant in each analysis. 

I'm going to have to re-do all of the blasting...le sigh.
I'm doing this and have streamlined the process a bit by moving the files after making them.
#####Thursday, 7 April 2016
Investigating the blast results, starting with shared loci
This one is in all three:
 Chrom   Pos LocID                    Locus
50953  scaffold_985 22374 41888 scaffold_985.41888.22374
50960  scaffold_985 22430 41888 scaffold_985.41888.22430
50957  scaffold_985 22398 41888 scaffold_985.41888.22398
509531 scaffold_985 22374 41888 scaffold_985.41888.22374
509601 scaffold_985 22430 41888 scaffold_985.41888.22430
509571 scaffold_985 22398 41888 scaffold_985.41888.22398
50955  scaffold_985 22389 41888 scaffold_985.41888.22389
509572 scaffold_985 22398 41888 scaffold_985.41888.22398

Reformatted blast tables to make them supplementary. I need to figure out how to make GO charts.

#####Wednesday, 6 April 2016
I need to calculate error rate. Maybe run band sharing on my matches? I did a quick thing in R
	OFF016          OFF027       
	Mode :logical   Mode :logical  
	FALSE:20685     FALSE:18572    
	TRUE :30671     TRUE :32784    
	NA's :0         NA's :0        

	OFF032          PRM177       
	Mode :logical   Mode :logical  
	FALSE:27408     FALSE:17465    
	TRUE :23948     TRUE :33891    
	NA's :0         NA's :0        

TRUE means they match. But this doesn't account for some that are genotyped in one but not another. I might do a C++ program because I feel like I know better how to deal with that. 
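
A minimal sketch of the comparison with missing genotypes handled explicitly, assuming genotypes are stored as VCF-style strings and that "./." marks a missing call. The function name is hypothetical.

```python
def concordance(geno_a, geno_b, missing="./."):
    """Compare two genotype lists locus by locus, skipping loci missing in either.
    Returns (number matching, number compared)."""
    compared = matched = 0
    for a, b in zip(geno_a, geno_b):
        if a == missing or b == missing:
            continue  # typed in only one replicate: not counted as a mismatch
        compared += 1
        # Treat 0/1 and 1/0 as the same unphased genotype
        if sorted(a.split("/")) == sorted(b.split("/")):
            matched += 1
    return matched, compared
```

The mismatch rate between two runs of the same individual is then 1 - matched / compared.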

#####Tuesday, 5 April 2016
I hadn't blasted overlapping SNPs--just the unique snps. So now I'm looking at those. Need to run extract codes to make fasta files. Doing that and running blast on TIGGS database now.

Also the model is MUCH faster now that I'm not re-allocating memory all the time.

#####Monday, 4 April 2016
I'm running band_sharing to estimate error rates and calculate band sharing stats.
The simulation model is running now, but loci near each other all have identical Fst values, even though I ran it for 200 generations. 
PRM077 and OFF077 have a mean incompatibility of 0.95 or something so they clearly aren't actually father and offspring. I'm removing MOM077 from the analysis and re-doing gwsca.
Now I need to re-do the outlier analyses.
Ran extract_outlier_radloc.sh
Ran align_outliers_to_annotated.sh
	Time loading reference: 00:00:48
	Time loading forward index: 00:00:02
	Time loading mirror index: 00:00:01
	Multiseed full-index search: 00:00:00
	254 reads; of these:
	  254 (100.00%) were unpaired; of these:
		7 (2.76%) aligned 0 times
		205 (80.71%) aligned exactly 1 time
		42 (16.54%) aligned >1 times
	97.24% overall alignment rate
	Time searching: 00:00:51
	Overall time: 00:00:51
	Time loading reference: 00:02:32
	Time loading forward index: 00:00:10
	Time loading mirror index: 00:00:05
	Multiseed full-index search: 00:00:00
	192 reads; of these:
	  192 (100.00%) were unpaired; of these:
		2 (1.04%) aligned 0 times
		168 (87.50%) aligned exactly 1 time
		22 (11.46%) aligned >1 times
	98.96% overall alignment rate
	Time searching: 00:02:48
	Overall time: 00:02:48
	Time loading reference: 00:01:31
	Time loading forward index: 00:00:01
	Time loading mirror index: 00:00:01
	Multiseed full-index search: 00:00:01
	335 reads; of these:
	  335 (100.00%) were unpaired; of these:
		2 (0.60%) aligned 0 times
		305 (91.04%) aligned exactly 1 time
		28 (8.36%) aligned >1 times
	99.40% overall alignment rate
	Time searching: 00:01:34
	Overall time: 00:01:34

530 scaffolds, creating one file for each. Then edited the extract.sh files written by R to run extract_sequence_part for each comparison. Ran those and then created cat files and created fastas.

Now I need to run blastx on all of these.....doing the outlier loci on xsede with time of 24 hr, hopefully it doesn't run out of time..I changed it so that it's using 16 threads. batch job 6841006.

#####Sunday, 3 April 2016
After re-doing that, I'm re-making figures etc.
AJ: 44937 loci, 2247 5% outliers, 1390 unique
FM: 40334 loci, 2017 5% outliers, 1596 unique
MO: 28434 loci, 1423 5% outliers, 1093 unique
PJ: 45289 loci, 2265 5% outliers, 1427 unique
132 outliers shared for viability.
3 shared in all.
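
A sketch of how a top-5% outlier set can be pulled from a table of per-locus Fst values. How ties and rounding were handled in the actual R analysis isn't recorded here, so the counts above may not be reproduced exactly; the function name is hypothetical.

```python
def top_fraction_outliers(fst_by_locus, fraction=0.05):
    """Return the loci whose Fst falls in the top `fraction` of the distribution."""
    ranked = sorted(fst_by_locus.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = round(len(ranked) * fraction)
    return [locus for locus, _ in ranked[:n_keep]]
```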

Exported the RAD locus to files. Then created a fasta file for each outlier set using fasta_from_catalog (run using extract_outlier_radloc.sh in linux).

Used subset_fasta_file to extract scaffold-specific fasta files from allpaths_cms1.scaff.fa based on scaffold names in rad_region/top5_scaffolds.txt. There are 1340 scaffolds. Then used comparison-specific .sh files to run extract_sequence_part. Then cat-ed the extracted bits with comparison-specific cat scripts (mo_cat.sh in rad_region/mo). 
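
The region extraction can be sketched as below. This assumes the 5 kb window is centred on the SNP position (2500 bp each side) and is clamped to the scaffold ends; whether extract_sequence_part defines the window the same way is not stated here, and the function names are hypothetical.

```python
def window_around(pos, scaffold_len, flank=2500):
    """1-based inclusive (start, end) of a window centred on pos, clamped to the scaffold."""
    start = max(1, pos - flank)
    end = min(scaffold_len, pos + flank)
    return start, end

def extract_region(seq, pos, flank=2500):
    """Slice the window out of a scaffold sequence held as a string."""
    start, end = window_around(pos, len(seq), flank)
    return seq[start - 1:end]
```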

Moved all fasta files to the XSEDE database and ran blast on XSEDE. Also am running blast in blast2go.

Aligning the RAD loci (in rad_locus/) to the annotated genome with align_outliers_to_annotated.sh.
AJ:
Time loading reference: 00:00:48
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:01
Multiseed full-index search: 00:00:00
1270 reads; of these:
  1270 (100.00%) were unpaired; of these:
    16 (1.26%) aligned 0 times
    1076 (84.72%) aligned exactly 1 time
    178 (14.02%) aligned >1 times
98.74% overall alignment rate
Time searching: 00:00:51
Overall time: 00:00:51
FM:
Time loading reference: 00:00:51
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:07
Multiseed full-index search: 00:00:00
1369 reads; of these:
  1369 (100.00%) were unpaired; of these:
    18 (1.31%) aligned 0 times
    1149 (83.93%) aligned exactly 1 time
    202 (14.76%) aligned >1 times
98.69% overall alignment rate
Time searching: 00:01:00
Overall time: 00:01:00
MO:
Time loading reference: 00:00:56
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:01
Multiseed full-index search: 00:00:01
984 reads; of these:
  984 (100.00%) were unpaired; of these:
    10 (1.02%) aligned 0 times
    913 (92.78%) aligned exactly 1 time
    61 (6.20%) aligned >1 times
98.98% overall alignment rate
Time searching: 00:01:00
Overall time: 00:01:00
PJ:
Time loading reference: 00:04:29
Time loading forward index: 00:00:03
Time loading mirror index: 00:00:01
Multiseed full-index search: 00:00:01
1280 reads; of these:
  1280 (100.00%) were unpaired; of these:
    13 (1.02%) aligned 0 times
    1100 (85.94%) aligned exactly 1 time
    167 (13.05%) aligned >1 times
98.98% overall alignment rate
Time searching: 00:04:34
Overall time: 00:04:34

I ran CERVUS on all of the subsetted datasets and combined the data. Next, I need to calculate relatedness and error rates. Maybe I can include both in my thesis?

#####Saturday, 2 April 2016
I re-did the merge_vcfs.R code to combine vcfs and make them both only contain GT scores. Also did some minor edits to the gwsca_biallelic_vcf program to accommodate slightly different formatting and to keep the ID field.

#####Friday, 1 April 2016
So, problem: the vcftools --extract-FORMAT-info GT somehow creates duplicate entries for SNPs...the POS field is unique in batch_1.vcf but it's not in biallelic.gt.vcf. So, I need to figure out a way around this. I'm thinking of just gleaning the GT field from the vcf file in the gwsca_biallelic_vcf, but this will take some work. GRRRRRRRRRRRRRRRR
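
Gleaning the GT field directly from each VCF line could look roughly like this. It assumes tab-delimited VCF data lines with FORMAT in the ninth column; the function name is hypothetical and this is not the code in gwsca_biallelic_vcf.

```python
def gt_only(vcf_line):
    """Reduce each sample column of a VCF data line to its GT subfield."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")       # FORMAT column, e.g. GT:DP:AD
    gt_idx = fmt.index("GT")
    samples = [s.split(":")[gt_idx] for s in fields[9:]]
    return fields[:8] + ["GT"] + samples
```

Because each input line maps to exactly one output line, no duplicate POS entries are introduced by this step.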

I am running CERVUS on the 99% found in HWE dataset . I'm running the simulation with 10000 offspring, 57 candidate mothers, prop. sampled = 0.05, prop loci typed = 0.99, prop. loci mistyped = 0.01, minimum loci typed = 821, relaxed%=90,strict=95.

#####Friday, 25 March 2016
395 of my RAD loci that mapped to the annotated scovelli genome are in annotated regions. Of those, 6 are in the 5' UTR and 19 are in 3'UTR and the rest are just in gene/mRNA/CDS regions.

My grep command had been missing a character so the query sequence ID wasn't included. Now I'm re-running all my blast searches so I can compare those results to the alignment results.

#####Thursday, 24 March 2016
I was able to compare the mapped file to the gff file to see if any of my RAD loci map to annotated regions of the genome.

#####Wednesday, 23 March 2016
I ran subset_fasta_file to extract the 494 scaffolds that contain SCA outliers. I'm making one fasta file per scaffold so I can subset them using extract_sequence_part. It found 492 matches (so some didn't match? probably at least one empty line..) but that was because of a bug so I fixed it and got all of them.
Then I ran extract_sequence_part and then used cat and sed to merge all the files into one. Then I blasted (both blastn and blastx) the merged extracted sequence bits on putty.

To annotate my sam file with info from the gff file, I need to 

#####Tuesday, 22 March 2016
Something weird is going on with the locus identification. the map locus IDs don't quite match the locus IDs in gwsca. Quitting the plink map approach and switching to sumstats.tsv. That worked. There are 688 RAD loci containing outliers.
Using fasta_from_stacks_catalog I'm extracting the RAD loci so I can blast them and find them in the genome and try to identify what they are/what they're near.
On TIGGS I ran blastn search but it doesn't give good output.
So I'm trying xsede...in $SCRATCH/blastdb2 I did 
	$module load blast/2.2.29
	$update_blastdb.pl --decompress nt
Then I ran sca_blast.sh on putty.

I'm also going to align them to the annotated genome. Done! 
688 reads; of these:
  688 (100.00%) were unpaired; of these:
    11 (1.60%) aligned 0 times
    548 (79.65%) aligned exactly 1 time
    129 (18.75%) aligned >1 times
98.40% overall alignment rate


#####Monday, 21 March 2016
Using a 1% cutoff threshold I identified 356 A-J outliers, 384 F-M outliers, and 383 M-O outliers.
Of the A-J outliers, 332 are unique; F-M has 180 unique; and M-O has 195 unique loci.


#####Thursday, 10 March 2016
To compare stacks and GATK, I'm going to do the gwsca_biallelic_vcf on filtered_bi.vcf.
First I need to create an ind info map.
	grep "CHROM" filtered_bi.vcf > gatk.ind.info.txt
I should probably also do infer_mat_vcf. 
vcftools --vcf filtered_bi.vcf --keep extract_from_vcf_gatk.txt --extract-FORMAT-info GT --out ./gatk_gwsca/non_fam
filtered_bi.vcf has 30087 sites.
vcftools --vcf gatk_maternal.vcf --extract-FORMAT-info GT --out gatk_maternal
then used merge_vcfs.R to merge them.

To investigate some of what's going on:
grep "CHROM" filtered_bi.vcf > hets_w_1read.txt
grep "0/1:1,0:1:0:35,0,0" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:73:73,0,73" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:37:39,0,37" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:31:74,0,31" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:31:71,0,31" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:31:74,0,31" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:1:76,0,1" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:29:74,0,29" filtered_bi.vcf >> hets_w_1read.txt
grep "0/1:0,1:1:19:39,0,19" filtered_bi.vcf >> hets_w_1read.txt
These are 
scaffold_70	86152
scaffold_209	276764
scaffold_343	42436
scaffold_1247	6286
scaffold_1247	6286
scaffold_1247	6286
scaffold_1247	49375
scaffold_1247	49375
I've changed filter_gatk.sh adding --setFilteredGtToNocall to the VariantFiltration and --minFilteredGenotypes 100 to SelectVariants
	...it gives me some warnings about --filterName allelenum
	I had to change some of the logical statements so that --filterName marks BAD ones (not good)
		so that it can remove them.
		
	Getting error "ERROR MESSAGE: Argument depthhas a bad value. Invalid expression used (DP < 1 || DP => 100). Please see the JEXL docs for correct syntax."
Doing SelectVariants before VariantFiltration worked! Before pruning to biallelic sites, AF refers to a list rather than a double, so the filter expression breaks.
Other problem: --minFilteredGenotypes doesn't work the way I thought/want. I need another way to remove those loci not present in a certain number of individuals (75% or something=223 individuals)..not sure that's gonna work. The filtering steps didn't remove the weird ones. 
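
One way around this would be to count called genotypes per site outside of GATK. A sketch, assuming a tab-delimited VCF where missing genotypes have a GT beginning with "."; the function names are hypothetical and 223 is the threshold mentioned above.

```python
def n_called(vcf_line):
    """Number of samples with a non-missing GT on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    return sum(1 for s in fields[9:] if not s.split(":")[0].startswith("."))

def keep_well_covered(vcf_lines, min_called=223):
    """Yield header lines plus data lines genotyped in at least min_called samples."""
    for line in vcf_lines:
        if line.startswith("#") or n_called(line) >= min_called:
            yield line
```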

#***CERVUS***#
Ran prune_cervus_genotypes and recovered 1642 loci in HWE and in 99% of individuals. I subsampled those to run Cervus a bunch of times.

	

#####Wednesday, 9 March 2016
#***Haplotypes gwsca***#
It seems that inferring mother alleles in the haplotype dataset is going to be tricky--too many of the haplotypes don't match up? I guess. Using a threshold of 25 alleles only yields ~8500 loci.

#***Parentage***#
To do the parentage analyses, I had to create a list of haplotype files.
ls results/haplotypes/*_haplotypes.txt > results/haplotypes/haplotypes_files.txt
Then I edited it in Notepad++ to have haplotypes_file_name,ind_id, pop_id format.
Then ran haplotypes_to_cervus.

#***GATK***#
vcftools --vcf filtered_output.vcf --min-alleles 2 --max-alleles 2 --out filtered_bi --recode
Kept 33315 of 33615 loci.
Copied that file to monnahan. Then ran het_v_depth_mod.py.
Now I've got 9 that are called heterozygotes with a read depth of 1...
m = 1 and het, ['0/1', '1,0', '1', '0', '35,0,0'] 10049
m = 1 and het, ['0/1', '0,1', '1', '73', '73,0,73'] 16864
m = 1 and het, ['0/1', '0,1', '1', '37', '39,0,37'] 21063
m = 1 and het, ['0/1', '0,1', '1', '31', '74,0,31'] 33909
m = 1 and het, ['0/1', '0,1', '1', '31', '71,0,31'] 33909
m = 1 and het, ['0/1', '0,1', '1', '31', '74,0,31'] 33909
m = 1 and het, ['0/1', '0,1', '1', '1', '76,0,1'] 34037
m = 1 and het, ['0/1', '0,1', '1', '29', '74,0,29'] 34037
m = 1 and het, ['0/1', '0,1', '1', '19', '39,0,19'] 36218
Could this be vcftools being unreliable? Maybe I should pull out those 33315 biallelic loci using GATK?
I found this on GATK:
java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V input.vcf \
   -o output.vcf \
   --restrictAllelesTo BIALLELIC
	
I added that to filter_gatk.sh and ran it (commenting out the previous filtering). I'll see if that makes het_v_depth work any better. 
This has the same problem:
m = 1 and het, ['0/1', '1,0', '1', '0', '35,0,0'] 9410
m = 1 and het, ['0/1', '0,1', '1', '73', '73,0,73'] 15584
m = 1 and het, ['0/1', '0,1', '1', '37', '39,0,37'] 19348
m = 1 and het, ['0/1', '0,1', '1', '31', '74,0,31'] 30949
m = 1 and het, ['0/1', '0,1', '1', '31', '71,0,31'] 30949
m = 1 and het, ['0/1', '0,1', '1', '31', '74,0,31'] 30949
m = 1 and het, ['0/1', '0,1', '1', '1', '76,0,1'] 31075
m = 1 and het, ['0/1', '0,1', '1', '29', '74,0,29'] 31075
GT:AD:DP:GQ:PL, where AD is the allele depth for each allele and DP is the approximate read depth (maybe this is what's being used??)
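
A small sketch for flagging these calls directly from the sample field, assuming the GT:AD:DP:GQ:PL layout described above. The function name is hypothetical and this is not het_v_depth_mod.py.

```python
def het_at_depth_one(sample_field):
    """True if a GT:AD:DP:GQ:PL sample entry is a heterozygote with DP == 1."""
    parts = sample_field.split(":")
    if len(parts) < 3:
        return False  # missing or truncated entry such as "./."
    gt, dp = parts[0], parts[2]
    alleles = gt.replace("|", "/").split("/")
    is_het = len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]
    return is_het and dp == "1"
```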


#####Tuesday, 8 March 2016
This is something in the calling of maternal alleles, I think.
I went back over infer_mat_vcf and I think the way I'd been assigning maternal alleles was just wrong. I've re-done it so hopefully it's better (maybe it'll fix the problems??). I'll need to re-do the haplotypes infer_maternal_contribution and the model now. I think this has fixed the problem!!!

What is the missing symbol in haplotypes?? I think it's "0".

I added allelic dropout to the model so that for every individual that is sampled, their loci are assigned random numbers. If a locus gets a random number that is less than the error rate (0.01), that locus's maternal allele gets assigned as the paternal allele, so there's an increase in homozygotes. Running that with 200 generations now.
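
The dropout step described above, sketched in Python for reference. The simulation's own data structures are not shown in this log, so the (maternal, paternal) tuple representation and the function name are assumptions.

```python
import random

def apply_dropout(genotypes, error_rate=0.01, rng=None):
    """genotypes: list of (maternal, paternal) allele pairs for one sampled individual.
    Each locus draws a random number; below error_rate the maternal allele is
    replaced by the paternal allele, so the locus is scored as a homozygote."""
    rng = rng or random.Random()
    out = []
    for maternal, paternal in genotypes:
        if rng.random() < error_rate:
            maternal = paternal
        out.append((maternal, paternal))
    return out
```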

#####Monday, 7 March 2016
Could it be in the vcf-merge file? Where does it arise? 

#####Friday, 4 March 2016
I have no idea what's causing the weird pattern in ADULT-JUVIE comparison. It's only happening in the gwsca_biallelic_vcf Fst calculations, it's not in the calculations from Stacks. This is a bit concerning. I wonder if there's a bug in my program?

The haplotypes are doing it to a small extent too--they just don't have as high Fsts to start with and the group is located in a different part of the axis (probably because it's in a different order).

This is something in my program. I don't know what but it's there. I re-did it in R and it works just fine. I should de-bug the program but I think I'm just going to use R, at least for now. I should figure out what's going wrong.

I ran GATK's VariantFiltration module to filter the gatk dataset. It doesn't restrict it to biallelic loci, though, so I might have to do that myself.
#####Thursday, 3 March 2016
There are 133 dad-kid pairs. Ran infer_mat_vcf. Now I need to merge the batch_1.vcf and the biallelic_maternal.
Then ran extract_gt_info.sh to run vcftools to extract the relevant info.
For some reason vcftools isn't recognizing genotypes in biallelic_maternal.vcf. There was a space before the first header entry and this broke it. jeeez. I changed this in the infer_mat_vcf code so now this problem shouldn't come up.
Then merge_vcfs.R
Need to create ind_info_vcf.txt
OK, I want to include only one instance of individuals that were sequenced twice (OFF016,OFF027,OFF032,PRM177). I'll want to do some analysis with them to estimate error rates and see how reliably the parentage analysis etc works but for the biallelic gwsca I don't want to include them. So going back to infer_mat_vcf in the pipeline.
Now I've run infer_mat_vcf,vcftools, merge_vcfs,gwsca_biallelic_vcf, generate_empiricalAFS
And I've started the sca simulation.
Meanwhile, looking at the gwsca results:
> dim(aj.prune)
[1] 34084    31
> dim(fm.prune)
[1] 40182    31
> dim(mo.prune)
[1] 40093    31
OK, something weird is happening in the adult-offspring comparison. I can't explain it but there's a big bunch of loci that have high Fsts in the middle of the set......

I also started run_gatk.sh with the subset of individuals (not including duplicated ones, so excluding OFF016-1, OFF027-1, OFF032-1, and PRM177-1). pre_gatk didn't need to be run because it's run on each individual separately anyway.

Also running convert_matches using run_convert_matches.sh. And now running infer_maternal_contribution. Now running gwsca_haplotypes. Let's see if weird patterns pop up there too. 


#####Wednesday, 2 March 2016
The ref_map script ran populations as:
populations -b 1 -P ./results/stacks -s -t 2 -r 0.5 -a 0.05 --fstats --vcf --plink -p 3 -M ./sca_popmap_ddrad.txt
And it resulted in 51356 SNPs (I think--that's the number of lines in batch_1.sumstats.tsv).
These seem like reasonable filters--minor allele frequency of 0.05, present in 3 populations (of PRM,NPM,FEM,OFF), and present in half of each population. 
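
To make the filter logic concrete, here is a sketch of the three conditions as described in the note above. It paraphrases the intent of -r, -p, and -a and is not Stacks' actual implementation; the function name and the example counts are made up.

```python
def passes_filters(called_by_pop, size_by_pop, maf,
                   min_pops=3, min_prop=0.5, min_maf=0.05):
    """A locus passes if it is genotyped in at least min_prop of the individuals
    in at least min_pops populations and its minor allele frequency >= min_maf."""
    pops_present = sum(
        1 for pop, called in called_by_pop.items()
        if called / size_by_pop[pop] >= min_prop
    )
    return pops_present >= min_pops and maf >= min_maf

# Made-up counts for illustration only
sizes = {"PRM": 100, "NPM": 50, "FEM": 60, "OFF": 100}
called = {"PRM": 60, "NPM": 10, "FEM": 40, "OFF": 70}
```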

I'll move forward with the biallelic analysis. convert_snps is done!
I need to make a dad.kid.pairs.fullnames.txt file.

#####Tuesday, 1 March 2016
So...ddRAD and oRAD apparently have very different allelic dropout/error rates. So I'm going to re-do the SCA with just the ddRAD individuals. 
I should re-do this all from the beginning, including the stacks analysis.
Yayyyyyyy
I re-installed stacks to have newest version (v1.37) and re-made a popmap with the 384 ddRAD individuals.
I'm running ref_map.pl using run_refmap_ddrad.sh


#####Monday, 29 February 2016
OLD INFO FROM UP ABOVE:
1. Run Stacks, prune dataset, calculate Fsts among groups, compare those Fst distributions to null distributions from simulation models.
Programs utilized:
->ref_map.pl (Stacks)
->populations (Stacks)
->prune_for_coverage_oRADddRAD.R
->process_alleles_files (?)
->scripts/convert_snps.R | infer_maternal_contribution/infer_mat_vcf/infer_mat_vcf | vcftools | merge_vcfs.R | gwsca_biallelic | generate_empiricalAFS.R | sca_simulation | gwsca_biallelic_analysis.R
->convert_matches | infer_maternal_contribution | gwsca_haplotypes
->convert_matches | prune_cervus_genotypes.R | CERVUS or calculate_relatedness or band_sharing
->R to compare to model (from sca_simulation)
->mom_female_fst
2. Run GATK pipeline, prune dataset, and follow Monnahan et al. approach to analyze selection components analysis.
->scripts/pre_gatk.sh
->scripts/run_gatk.sh
->filter_vcf (but this didn't work right)

For #1: Problem is that Fsts are elevated relative to model by quite a bit. 
Solution 1: Maybe using haplotypes was not the best approach. So, let's look at biallelic SNPs
Programs utilized:
->ref_map.pl (Stacks)
->populations (Stacks)
->scripts/convert_snps.R (R) | infer_maternal_contribution/infer_mat_vcf/infer_mat_vcf | gwsca_biallelic
->R to compare to model (from sca_simulation)

Solution 2: Maybe there really are elevated Fsts due to increased relatedness. Need to calculate pairwise relatedness and estimate parentage
->haplotypes_to_cervus | CERVUS
->calculate_relatedness
->band_sharing

For #2: I ran into the problem that my filter_vcf file somehow messed up the format of the vcf so it was no longer readable.
Also, I hadn't included all of my samples.
Solution 1: Maybe GATK pipeline was messing things up? So I tried to convert my bowtie alignments and the stacks output
^This was not the problem.
Programs I used:
->scripts/run_bowite_for_gatk.sh
->scripts/sam_to_bam.sh
->scripts/sort_bam.sh
->scripts/test_bwtgatk.sh
->convert_vcf_for_sca
->subset_plink_file

Solution 2: Re-run GATK to start from scratch and then use vcftools to filter vcf
(vcftools)

#THIS ENTRY
It turns out the Moms weren't specified correctly in the ind.info.vcf.txt file! They were missing the individual identifier and a few were even marked as males. I'm going to re-run the analysis and then run it with ind.info.datasets.txt, where there are three "status" options: ddRAD, oRAD, or MOM.

Somehow five PRM records seem to be missing from the biallelic.gt.vcf: PRM040,PRM063,PRM116,PRM141, and PRM180. Are they found in the output from convert_snps?? They are in convert_snps.R. Maybe they're not getting kept by infer_mat_vcf?? Oh, they probably don't have offspring in the dataset. Yep! That's it. I should probably include them. 

I added 30 PRM and 3 OFF to the extract_from_vcf file.
vcftools --vcf ../stacks/batch_1.pruned.vcf --keep extract_from_vcf.txt --extract-FORMAT-info GT --out fem
Then merge_vcfs.R to re-write biallelic.gt.vcf

Added those individuals to ind.info.vcf.txt and ind.info.datasets.txt. Now re-running gwsca_biallelic_vcf with both cases.
Did that. The Fsts between the two datasets are pretty high...

#####Sunday, 28 February 2016
I don't need to change convert_snps.R, I think I just need to use my batch_1.pruned.vcf file in infer_mat_vcf.
I'm running that analysis now. Had to change the way files were being named because I must have changed the dad.kid.pairs.txt file to just be PRM001 and OFF001 instead of sample_PRM001_align. Anyway, now it's running. I should be back on-track soon!  
Now that I've done infer_mat_vcf, I need to merge files. First, I had to add at least one line with ## (I did ##fileformat=VCFv4.0)
vcftools --vcf biallelic_maternal.vcf --extract-FORMAT-info GT --out biallelic_maternal
vcftools --vcf ../stacks/batch_1.pruned.vcf --keep extract_from_vcf.txt --out fem --recode-INFO-all --recode
vcftools --vcf fem.recode.vcf --extract-FORMAT-info GT --out fem
Then I'm using merge_vcfs.R to combine the maternal one with females.

Well, this didn't completely eliminate the male-female issue in biallelic Fst comparisons....frackkkkkk

#####Saturday, 27 February 2016
Making progress on pruning...counting the number of times "./.:0:.,.:.,.,." occurs in each line for each group.
Used grep to get the vcf header, saved it as a file batch_1.vcf.header.txt. Then subsetted vcf files using ind.info.ddRAD.txt and ind.info.psti.txt.
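
A minimal Python sketch of the missing-genotype count described above. It assumes a tab-delimited VCF with sample columns starting at field 10; the function name, the toy header/record, and the group layout are hypothetical, not the actual R code used.

```python
MISSING = "./.:0:.,.:.,.,."  # missing-genotype string quoted in this entry

def count_missing_by_group(header_line, record_line, groups):
    """Count missing genotypes per group for one VCF record.

    groups maps a group name to a set of sample IDs (hypothetical layout).
    """
    samples = header_line.rstrip("\n").split("\t")[9:]  # sample names
    calls = record_line.rstrip("\n").split("\t")[9:]    # genotype fields
    counts = {name: 0 for name in groups}
    for sample, call in zip(samples, calls):
        if call == MISSING:
            for name, members in groups.items():
                if sample in members:
                    counts[name] += 1
    return counts

# Toy example (made-up record, not real data)
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFEM001\tFEM002\tPRM001"
record = ("un\t12\t5\tG\tA\t.\tPASS\tNS=1\tGT:DP:AD:GL\t1/1:7:0,7:.,.,.\t"
          + MISSING + "\t" + MISSING)
groups = {"ddRAD": {"FEM001", "PRM001"}, "psti": {"FEM002"}}
print(count_missing_by_group(header, record, groups))
```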

Then kept the loci in both sets, and there are 122845. Wrote those to text file stacks/LociToKeep.txt and to stacks/CatalogIDsToKeep.txt and I subsetted the vcf in stacks/batch_1.pruned.vcf

I modified process_alleles_files to accept a whitelist and only keep the loci in the whitelist. Re-running that with run_process_alleles_files.sh in VirtualBox. I'm not sure this program is necessary.

I also modified convert_matches (which outputs the "haplotypes.txt" files) to accept a whitelist and am running it with matches.txt in VirtualBox.

calculate_relatedness was running but very slowly, and I'm going to have to re-start it later anyway because I have to prune the loci. So I stopped it. I might also switch to doing the relatedness calculations in R, I think it might be faster. 


#####Friday, 26 February 2016
Working on pruning the batch_1.vcf file in R to keep loci only in both groups of sequencing libraries.

#####Thursday, 25 February 2016
I'm going to try pruning for HWE first, and then I'll see if that fixes it.
According to my calculations, all biallelic SNPs adhere to Hardy-Weinberg expectations...which seems improbable.
But it's the result so I'm going to try comparing males and females from psti lane only (maybe there's a difference between ddRAD and oRAD)
vcftools --vcf ../stacks/batch_1.vcf --keep extract_psti_vcf.txt --out psti --recode-INFO-all --recode
vcftools --vcf psti.recode.vcf --extract-FORMAT-info GT --out psti
This resulted in a file named psti.GT.FORMAT
grep 'CHROM' psti.GT.FORMAT > ind.info.psti.txt
and then I manually edited it in Notepad++ to get it into the correct format and ran gwsca_biallelic_vcf.
This actually looks basically right. Let's look at Fsts from ddRAD only..
vcftools --vcf ../stacks/batch_1.vcf --remove extract_psti_vcf.txt --out ddRAD --recode-INFO-all --recode
vcftools --vcf ddRAD.recode.vcf --extract-FORMAT-info GT --out ddRAD
grep 'CHROM' ddRAD.GT.FORMAT > ind.info.ddRAD.txt
and then I manually edited it in Notepad++ to get it into the correct format and ran gwsca_biallelic_vcf.
This also looks normal.
Now let's compare a random subset of ddRAD that is the same number as PstI (30 females, 28 males)
vcftools --vcf ../stacks/batch_1.vcf --keep extract_ddRADsub.txt --out ddRADsub --recode-INFO-all --recode
vcftools --vcf ddRADsub.recode.vcf --extract-FORMAT-info GT --out ddRADsub
I already have the correct info file.
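
One way to draw the size-matched ddRAD subset (30 females, 28 males), sketched in Python. The ID lists below are placeholders and the seed is arbitrary; only the idea of sampling without replacement within each sex comes from the entry above. vcftools --keep reads one individual ID per line, so the result can be written out that way.

```python
import random

def balanced_subset(ids_by_sex, n_by_sex, seed=1):
    """Sample n IDs without replacement from each sex (seeded so it can be redone)."""
    rng = random.Random(seed)
    chosen = []
    for sex, n in n_by_sex.items():
        chosen.extend(sorted(rng.sample(ids_by_sex[sex], n)))
    return chosen

# Placeholder ID lists; the real ones would come from ind.info.ddRAD.txt
ddrad = {
    "female": [f"FEM{i:03d}" for i in range(1, 61)],
    "male": [f"PRM{i:03d}" for i in range(1, 61)],
}
subset = balanced_subset(ddrad, {"female": 30, "male": 28})
print("\n".join(subset[:3]))  # one ID per line, as in a --keep file
```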


#####Wednesday, 24 February 2016
There is a negative relationship between Fst and N in both males and females. And a dearth of low-Hs values for low-N values. 
The loci with Fsts in the gap range (0.025-0.05) have low N in males, which causes them to be lost during pruning.
I improved the pruning pipeline to have loci present in 75% of all individuals and then make sure they're present in 50% of each population, plus the polymorphic and allele frequency requirements. But this isn't helping!!
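
The coverage part of that pruning rule, as a rough Python sketch. The data layout (a dict of genotype lists per population, with None for missing) is an assumption, and the polymorphism and allele-frequency requirements are not included.

```python
def keep_locus(calls_by_pop, overall_min=0.75, per_pop_min=0.50):
    """Return True if a locus passes the two coverage thresholds.

    calls_by_pop: dict of population -> list of genotypes, None = missing.
    """
    total = sum(len(calls) for calls in calls_by_pop.values())
    present = sum(g is not None for calls in calls_by_pop.values() for g in calls)
    if total == 0 or present / total < overall_min:
        return False  # fails the 75%-of-all-individuals rule
    for calls in calls_by_pop.values():
        if not calls or sum(g is not None for g in calls) / len(calls) < per_pop_min:
            return False  # fails the 50%-within-each-population rule
    return True

# Good overall coverage, but males are mostly missing, so the locus is dropped
locus = {"FEM": ["0/1"] * 30, "MAL": ["0/0"] * 4 + [None] * 6}
print(keep_locus(locus))
```

A locus like the toy one here passes the overall threshold and is still dropped because of low N in one sex, which is the kind of loss described for the gap-range loci.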
The high-Fst group seems to have high homozygosity compared to the others and Hs is skewed in males but not in females. Could these simply just not be in HWE? If I want to address this I'll need different output from gwsca_biallelic_vcf. 
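
To spot-check HWE on a handful of the high-Fst loci by hand, a chi-square test on genotype counts could look like the sketch below. This is a generic one-degree-of-freedom test for a biallelic locus, not necessarily what my earlier HWE calculation did; the function name and example counts are made up.

```python
import math

def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for a biallelic locus."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

print(hwe_chisq(25, 50, 25))  # counts exactly at HWE proportions
print(hwe_chisq(50, 0, 50))   # no heterozygotes at all
```

With small N per group the test has little power, so "everything passes" may partly reflect sample size rather than true equilibrium.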
Meanwhile, I'm working on looking at the actual SNPs and sequences. Merging snps, tags, and sumstats files in R. Assuming I don't crash everything.
I'm having trouble matching up the tags, snps, and sumstats files. Sumstats has the catalog ID, which should be the Locus ID reported in the tags and snps files, as per a Google Groups post from Julian Catchen on 10/25/15:

	Each RAD locus is considered a single object. The RAD or GBS locus was 
	created by the restriction enzyme shearing the DNA and the subsequent 
	sequencing of the downstream bases. Given 100bp Illumina reads, each RAD 
	locus should be 100bp in length (slightly shorter if inline barcodes 
	were used). 

	If there are multiple SNPs in that locus, they are all recorded to that 
	same locus, and the haplotype, the combination of all those SNPs, is the 
	RAD locus. Each RAD locus in each individual has its own local ID, 
	however, the Catalog ID is the population-wide representative of each 
	locus (after they have been lined up across individuals). This is the ID 
	reported in the catalog files and all downstream exports. All SNPs are 
	reported by the catalog ID they originated from, and the haplotype is 
	the full RAD locus, so it also is reported with the same catalog ID. 
		https://groups.google.com/forum/#!searchin/stacks-users/locus$20id$20in$20catalog.tags/stacks-users/UmpYi34jddg/-DPDWXyRBQAJ

So...what is going on??


#####Tuesday, 23 February 2016
The simulation finished running the 200 generations with the inferred moms getting dad alleles occasionally--did it affect Fsts?? It looks like it reduced the percentage of outliers from 47% to 24% in the MOM-FEM comparison. Question: Have I been running it with 200 generations only?? Yes, that's what I ran the model paper with.

I think I'm going to have to adjust the relatedness program to empty out its allele containers after several hundred loci so that it can handle thousands of haplotypes...it's crashed and I don't know why, but I'm thinking it's out of memory. I updated it to be similar to band_sharing and hopefully now it will run smoothly. 

I'm also re-running simulation model with an error rate of 0.03.

What I want to discuss with Adam:
-should I have allele dropout in the 'actual' individuals, not just the inferred maternal alleles?
-Any ideas as to what could be causing the weird male-female pattern?
-What else to do??
-Band sharing: do these seem like reasonable values? 
-relatedness (if I get it to run): What can we learn from this?
After talking to Adam, I need to focus on the male-female FST weirdness. 

It comes from the pruning step where fem.n<-sum.list$FEM[sum.list$FEM$N>100& !is.na(sum.list$FEM$Hs),]
if I change the N threshold to be 87, the gap is not as obvious but it's still there. WHY????
>tapply(weirdsum$Hs,factor(weirdsum$Pop),summary)
$FEM
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09764 0.45010 0.46470 0.45970 0.47940 0.50000 

$MAL
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09717 0.22480 0.25270 0.25760 0.27960 0.49990 

> tapply(regsum$Hs,factor(regsum$Pop),summary)
$FEM
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09614 0.17790 0.28340 0.29880 0.42850 0.50000 

$MAL
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09557 0.18100 0.28750 0.30200 0.43220 0.50000 

So what am I removing that opens up the gap? 

Meanwhile, I'm figuring out what's going on with het_v_depth. It looks like vcftools filtering steps somehow messed up the calls?? The top is the out.vcf bit and the bottom is genotype_output.vcf.
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	FEM001	FEM002	FEM004	FEM005	FEM006	FEM007	FEM008	FEM009	FEM010	FEM011	FEM012	FEM013	FEM014	FEM015	FEM016	FEM017	FEM018	FEM019	FEM020	FEM021	FEM022	FEM023	FEM024	FEM025	FEM026	FEM027	FEM028	FEM029	FEM030	FEM031	FEM032	FEM033	FEM034	FEM035	FEM036	FEM037	FEM038	FEM039	FEM040	FEM041	FEM042	FEM043	FEM044	FEM045	FEM046	FEM047	FEM048	FEM049	FEM050	FEM051	FEM052	FEM053	FEM054	FEM054-1	FEM055	FEM056	FEM057	FEM058	FEM059	FEM060	FEM061	FEM062	FEM063	FEM064	FEM065	FEM066	FEM067	FEM068	FEM069	FEM070	FEM071	FEM072	FEM073	FEM074	FEM075	FEM076	FEM077	FEM078	FEM079	FEM080	FEM081	FEM082	FEM083	FEM084	FEM085	FEM086	FEM087	NPM005	NPM006	NPM007	NPM008	NPM010	NPM011	NPM012	NPM1128	PRM001	PRM002	PRM003	PRM005	PRM006	PRM007	PRM009	PRM010	PRM011	PRM012	PRM013	PRM014	PRM015	PRM016	PRM017	PRM018	PRM019	PRM022	PRM023	PRM024	PRM025	PRM026	PRM027	PRM028	PRM029	PRM030	PRM031	PRM032	PRM033	PRM034	PRM035	PRM035-2	PRM036	PRM037	PRM038	PRM039	PRM040	PRM041	PRM042	PRM043	PRM044	PRM045	PRM046	PRM047	PRM048	PRM049	PRM050	PRM051	PRM052	PRM053	PRM054	PRM055	PRM056	PRM057	PRM058	PRM059	PRM060	PRM061	PRM062	PRM063	PRM064	PRM065	PRM066	PRM067	PRM068	PRM069	PRM070	PRM071	PRM072	PRM073	PRM074	PRM075	PRM076	PRM077	PRM078	PRM079	PRM080	PRM081	PRM082	PRM083	PRM084	PRM085	PRM086-23	PRM086R	PRM087	PRM088	PRM089	PRM090	PRM091	PRM092	PRM093	PRM094	PRM095	PRM096	PRM097	PRM098	PRM099	PRM100	PRM101	PRM102	PRM103	PRM104	PRM105	PRM106	PRM107	PRM108	PRM109	PRM110	PRM111	PRM112	PRM113	PRM114	PRM115	PRM116	PRM117	PRM118	PRM119	PRM120	PRM121	PRM122	PRM123	PRM124	PRM125	PRM126	PRM127	PRM128	PRM129	PRM130	PRM131	PRM132	PRM133	PRM134	PRM135	PRM135-1	PRM136	PRM137	PRM138	PRM139	PRM140	PRM141	PRM142	PRM143	PRM144	PRM145	PRM146	PRM147	PRM148	PRM149	PRM150	PRM151	PRM152	PRM153	PRM154	PRM155	PRM156	PRM157	PRM158	PRM159	PRM160	PRM161	PRM162	PRM163	PRM164	PRM165	PRM166	PRM167	PRM168	PRM169	PRM170	PRM171	PRM172	PRM173	PRM174	PRM175	PRM176	PRM177	PRM177-1	PRM178	PRM179	PRM180	PRM181	PRM182	
PRM183	PRM184	PRM185	PRM186	PRM187	PRM188	PRM189	ROFF016	ROFF027	ROFF032OFF001	OFF004	OFF005	OFF006	OFF007	OFF008	OFF009	OFF010	OFF011	OFF012	OFF013	OFF014	OFF015	OFF016	OFF017	OFF018	OFF020	OFF022	OFF024	OFF025	OFF026	OFF027	OFF028	OFF029	OFF030	OFF031	OFF032	OFF033	OFF034	OFF035	OFF036	OFF037	OFF038	OFF039	OFF041	OFF042	OFF043	OFF044	OFF045	OFF046	OFF047	OFF049	OFF050	OFF051	OFF052	OFF053	OFF054	OFF055	OFF056	OFF057	OFF058	OFF059	OFF060	OFF061	OFF064	OFF066	OFF067	OFF068	OFF070	OFF071	OFF072	OFF073	OFF074	OFF075	OFF076	OFF077	OFF078	OFF079	OFF080	OFF081	OFF083	OFF084	OFF085	OFF086	OFF08623	OFF088	OFF089	OFF090	OFF091	OFF092	OFF093	OFF094	OFF095	OFF096	OFF097	OFF100	OFF101	OFF102	OFF103	OFF105	OFF106	OFF110	OFF111	OFF112	OFF113	OFF114	OFF115	OFF117	OFF118	OFF119	OFF120	OFF121	OFF122	OFF123	OFF124	OFF125	OFF126	OFF127	OFF134	OFF135	OFF136	OFF137	OFF138	OFF139	OFF140	OFF142	OFF143	OFF144	OFF145	OFF146	OFF149	OFF150	OFF151	OFF152	OFF153	OFF154	OFF155	OFF156	OFF157	OFF158	OFF159	OFF160	OFF161	OFF165	OFF166	OFF167	OFF168	OFF169	OFF170	OFF171	OFF172	OFF173	OFF174	OFF175	OFF176	OFF177	OFF178	OFF179	OFF181	OFF182	OFF183	OFF184	OFF185	OFF186	OFF187	OFF188	OFF189	
scaffold_1247	49375	.	G	A	18960.5	.	AC=293;AF=0.921;AN=318;BaseQRankSum=-7.310e-01;ClippingRankSum=0.727;DP=577;FS=20.813;InbreedingCoeff=-0.0605;MLEAC=297;MLEAF=0.934;MQ=50.49;MQRankSum=-7.310e-01;QD=30.42;ReadPosRankSum=0.736;SOR=0.012	GT:AD:DP:GQ:PL	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	0/0:2,0:2:0:0,0,5	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:2,0:2:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:1,1:2:39:39,0,76	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:1,4:5:67:162,0,67	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:0,2:2:67:78,0,67	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:1,2:3:29:81,0,29	./.:0,0:0:.:.	1/1:0,3:3:9:130,9,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:128,9,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:82,6,0	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	0/1:1,3:4:67:123,0,67	./.:2,0:2:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	1/1:0,11:11:33:495,33,0	./.:0,0:0:.:.	./.:1,0:1:.:.	0/1:1,5:6:27:207,0,27	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	1/1:0,1:1:3:45,3,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	
./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:1,2:3:78:78,0,78	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	0/1:0,4:4:27:165,0,27	1/1:0,5:5:15:225,15,0	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	0/1:0,1:1:1:76,0,1	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	1/1:0,4:4:12:180,12,0	1/1:0,4:4:12:180,12,0	1/1:0,5:5:18:263,18,0	1/1:0,4:4:12:172,12,0	1/1:0,4:4:12:180,12,0	1/1:0,5:5:15:225,15,0	0/0:2,0:2:0:0,0,8	./.:1,0:1:.:.	1/1:0,4:4:12:180,12,0	./.:0,0:0:.:.	1/1:0,5:5:15:225,15,0	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,3:3:9:131,9,0	./.:0,0:0:.:.	1/1:0,3:3:9:134,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,1:1:6:70,6,0	1/1:0,5:5:15:205,15,0	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	./.:1,0:1:.:.	0/1:1,3:4:36:123,0,36	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	0/1:0,3:3:33:115,0,33	./.:1,0:1:.:.	1/1:0,4:4:12:172,12,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:2,0:2:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:79,6,0	1/1:0,4:4:12:180,12,0	1/1:0,2:2:9:115,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:130,9,0	1/1:0,5:5:15:225,15,0	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	0/1:1,4:5:30:165,0,30	./.:0,0:0:.:.	1/1:0,4:4:12:180,12,0	./.:0,0:0:.:.	1/1:0,3:3:9:131,9,0	./.:1,0:1:.:.	0/1:0,3:3:33:123,0,33	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,1:1:3:45,3,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	
./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:86,6,0	./.:2,0:2:.:.	./.:1,0:1:.:.	0/1:0,3:3:33:123,0,33	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,4:4:12:180,12,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	1/1:0,3:3:9:132,9,0	1/1:0,4:4:12:179,12,0	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	1/1:0,4:4:12:180,12,0	./.:1,0:1:.:.	./.:2,0:2:.:.	./.:0,0:0:.:.	1/1:0,1:1:3:39,3,0	./.:0,0:0:.:.	1/1:0,4:4:12:180,12,0	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	1/1:0,1:1:3:45,3,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	0/1:0,1:1:29:74,0,29	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,1:1:6:85,6,0	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	1/1:0,7:7:21:315,21,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:122,9,0	1/1:0,1:1:3:45,3,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,4:4:12:180,12,0	0/1:1,2:3:36:81,0,36	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	1/1:0,1:1:6:85,6,0	./.:0,0:0:.:.	./.:2,0:2:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	1/1:0,3:3:9:135,9,0	./.:2,0:2:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,6:6:18:259,18,0	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	1/1:0,5:5:15:224,15,0	1/1:0,3:3:9:135,9,0	./.:1,0:1:.:.	0/0:2,0:2:0:0,0,12	./.:1,0:1:.:.	1/1:0,1:1:3:45,3,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	
./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,4:4:12:180,12,0	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,6:6:18:262,18,0	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,3:3:9:135,9,0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:2,0:2:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:86,6,0	./.:1,0:1:.:.	./.:1,0:1:.:.	0/0:2,0:2:0:0,0,6	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,1:1:3:45,3,0	0/1:0,3:3:26:97,0,26	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:1,0:1:.:.	./.:0,0:0:.:.	1/1:0,3:3:9:133,9,0	1/1:0,2:2:6:90,6,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:0,0:0:.:.	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,1:1:3:38,3,0	./.:0,0:0:.:.	1/1:0,1:1:3:45,3,0	./.:1,0:1:.:.	./.:0,0:0:.:.	./.:1,0:1:.:.	1/1:0,1:1:3:45,3,0	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:135,9,0	./.:0,0:0:.:.	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0:.:.
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	FEM001	FEM002	FEM004	FEM005	FEM006	FEM007	FEM008	FEM009	FEM010	FEM011	FEM012	FEM013	FEM014	FEM015	FEM016	FEM017	FEM018	FEM019	FEM020	FEM021	FEM022	FEM023	FEM024	FEM025	FEM026	FEM027	FEM028	FEM029	FEM030	FEM031	FEM032	FEM033	FEM034	FEM035	FEM036	FEM037	FEM038	FEM039	FEM040	FEM041	FEM042	FEM043	FEM044	FEM045	FEM046	FEM047	FEM048	FEM049	FEM050	FEM051	FEM052	FEM053	FEM054	FEM054-1	FEM055	FEM056	FEM057	FEM058	FEM059	FEM060	FEM061	FEM062	FEM063	FEM064	FEM065	FEM066	FEM067	FEM068	FEM069	FEM070	FEM071	FEM072	FEM073	FEM074	FEM075	FEM076	FEM077	FEM078	FEM079	FEM080	FEM081	FEM082	FEM083	FEM084	FEM085	FEM086	FEM087	NPM005	NPM006	NPM007	NPM008	NPM010	NPM011	NPM012	NPM1128	OFF001	OFF004	OFF005	OFF006	OFF007	OFF008	OFF009	OFF010	OFF011	OFF012	OFF013	OFF014	OFF015	OFF016	OFF017	OFF018	OFF020	OFF022	OFF024	OFF025	OFF026	OFF027	OFF028	OFF029	OFF030	OFF031	OFF032	OFF033	OFF034	OFF035	OFF036	OFF037	OFF038	OFF039	OFF041	OFF042	OFF043	OFF044	OFF045	OFF046	OFF047	OFF049	OFF050	OFF051	OFF052	OFF053	OFF054	OFF055	OFF056	OFF057	OFF058	OFF059	OFF060	OFF061	OFF064	OFF066	OFF067	OFF068	OFF070	OFF071	OFF072	OFF073	OFF074	OFF075	OFF076	OFF077	OFF078	OFF079	OFF080	OFF081	OFF083	OFF084	OFF085	OFF086	OFF08623OFF088	OFF089	OFF090	OFF091	OFF092	OFF093	OFF094	OFF095	OFF096	OFF097	OFF100	OFF101	OFF102	OFF103	OFF105	OFF106	OFF110	OFF111	OFF112	OFF113	OFF114	OFF115	OFF117	OFF118	OFF119	OFF120	OFF121	OFF122	OFF123	OFF124	OFF125	OFF126	OFF127	OFF134	OFF135	OFF136	OFF137	OFF138	OFF139	OFF140	OFF142	OFF143	OFF144	OFF145	OFF146	OFF149	OFF150	OFF151	OFF152	OFF153	OFF154	OFF155	OFF156	OFF157	OFF158	OFF159	OFF160	OFF161	OFF165	OFF166	OFF167	OFF168	OFF169	OFF170	OFF171	OFF172	OFF173	OFF174	OFF175	OFF176	OFF177	OFF178	OFF179	OFF181	OFF182	OFF183	OFF184	OFF185	OFF186	OFF187	OFF188	OFF189	PRM001	PRM002	PRM003	PRM005	PRM006	PRM007	PRM009	PRM010	PRM011	PRM012	PRM013	PRM014	PRM015	PRM016	PRM017	PRM018	PRM019	PRM022	PRM023	PRM024	PRM025	PRM026	PRM027	PRM028	PRM029	PRM030	
PRM031	PRM032	PRM033	PRM034	PRM035	PRM035-2	PRM036	PRM037	PRM038	PRM039	PRM040	PRM041	PRM042	PRM043	PRM044	PRM045	PRM046	PRM047	PRM048	PRM049	PRM050	PRM051	PRM052	PRM053	PRM054	PRM055	PRM056	PRM057	PRM058	PRM059	PRM060	PRM061	PRM062	PRM063	PRM064	PRM065	PRM066	PRM067	PRM068	PRM069	PRM070	PRM071	PRM072	PRM073	PRM074	PRM075	PRM076	PRM077	PRM078	PRM079	PRM080	PRM081	PRM082	PRM083	PRM084	PRM085	PRM086-23	PRM086R	PRM087	PRM088	PRM089	PRM090	PRM091	PRM092	PRM093	PRM094	PRM095	PRM096	PRM097	PRM098	PRM099	PRM100	PRM101	PRM102	PRM103	PRM104	PRM105	PRM106	PRM107	PRM108	PRM109	PRM110	PRM111	PRM112	PRM113	PRM114	PRM115	PRM116	PRM117	PRM118	PRM119	PRM120	PRM121	PRM122	PRM123	PRM124	PRM125	PRM126	PRM127	PRM128	PRM129	PRM130	PRM131	PRM132	PRM133	PRM134	PRM135	PRM135-1	PRM136	PRM137	PRM138	PRM139	PRM140	PRM141	PRM142	PRM143	PRM144	PRM145	PRM146	PRM147	PRM148	PRM149	PRM150	PRM151	PRM152	PRM153	PRM154	PRM155	PRM156	PRM157	PRM158	PRM159	PRM160	PRM161	PRM162	PRM163	PRM164	PRM165	PRM166	PRM167	PRM168	PRM169	PRM170	PRM171	PRM172	PRM173	PRM174	PRM175	PRM176	PRM177	PRM177-1	PRM178	PRM179	PRM180	PRM181	PRM182	PRM183	PRM184	PRM185	PRM186	PRM187	PRM188	PRM189	ROFF016	ROFF027	ROFF032	
scaffold_1247	49375	.	G	A	18960.54	.	AC=293;AF=0.921;AN=318;BaseQRankSum=-7.310e-01;ClippingRankSum=0.727;DP=577;FS=20.813;InbreedingCoeff=-0.0605;MLEAC=297;MLEAF=0.934;MQ=50.49;MQRankSum=-7.310e-01;QD=30.42;ReadPosRankSum=0.736;SOR=0.012	GT:AD:DP:GQ:PL	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:1,0:1	0/0:2,0:2:0:0,0,5	./.:1,0:1	./.:1,0:1	./.:2,0:2	1/1:0,3:3:9:135,9,0	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	0/1:1,1:2:39:39,0,76	./.:1,0:1	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	0/1:1,4:5:67:162,0,67	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	0/1:0,2:2:67:78,0,67	./.:1,0:1	./.:0,0:0	./.:0,0:0	0/1:1,2:3:29:81,0,29	./.:0,0:0	1/1:0,3:3:9:130,9,0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:128,9,0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:82,6,0	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	0/1:1,3:4:67:123,0,67	./.:2,0:2	1/1:0,1:1:3:45,3,0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	1/1:0,3:3:9:135,9,0	./.:1,0:1	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	1/1:0,4:4:12:180,12,0	0/1:1,2:3:36:81,0,36	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	1/1:0,1:1:6:85,6,0	./.:0,0:0	./.:2,0:2	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	./.:1,0:1	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:1,0:1	1/1:0,2:2:6:90,6,0	
./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	1/1:0,3:3:9:135,9,0	./.:2,0:2	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	1/1:0,6:6:18:259,18,0	./.:1,0:1	1/1:0,3:3:9:135,9,0	1/1:0,5:5:15:224,15,0	1/1:0,3:3:9:135,9,0	./.:1,0:1	0/0:2,0:2:0:0,0,12	./.:1,0:1	1/1:0,1:1:3:45,3,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	1/1:0,4:4:12:180,12,0	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:1,0:1	1/1:0,6:6:18:262,18,0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:1,0:1	1/1:0,3:3:9:135,9,0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:2,0:2	./.:0,0:0	./.:1,0:1	1/1:0,2:2:6:90,6,0	1/1:0,4:4:12:180,12,0	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	1/1:0,2:2:6:86,6,0	./.:1,0:1	./.:1,0:1	0/0:2,0:2:0:0,0,6	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,1:1:3:45,3,0	0/1:0,3:3:26:97,0,26	./.:0,0:0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:135,9,0	./.:0,0:0	./.:1,0:1	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:1,0:1	./.:0,0:0	1/1:0,3:3:9:133,9,0	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	1/1:0,1:1:3:38,3,0	./.:0,0:0	1/1:0,1:1:3:45,3,0	./.:1,0:1	./.:0,0:0	./.:1,0:1	1/1:0,1:1:3:45,3,0	1/1:0,4:4:12:180,12,0	1/1:0,3:3:9:135,9,0	./.:0,0:0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:1,0:1	1/1:0,11:11:33:495,33,0	./.:0,0:0	./.:1,0:1	0/1:1,5:6:27:207,0,27	./.:1,0:1	./.:1,0:1	1/1:0,3:3:9:135,9,0	./.:0,0:0	1/1:0,1:1:3:45,3,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:135,9,0	./.:1,0:1	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:1,0:1	./.:0,0:0	./.:1,0:1	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	
0/1:1,2:3:78:78,0,78	./.:1,0:1	1/1:0,2:2:6:90,6,0	0/1:0,4:4:27:165,0,27	1/1:0,5:5:15:225,15,0	./.:1,0:1	./.:0,0:0	1/1:0,3:3:9:135,9,0	1/1:0,2:2:6:90,6,0	0/1:0,1:1:1:76,0,1	./.:1,0:1	./.:0,0:0	1/1:0,3:3:9:135,9,0	1/1:0,4:4:12:180,12,0	1/1:0,4:4:12:180,12,0	1/1:0,5:5:18:263,18,0	1/1:0,4:4:12:172,12,0	1/1:0,4:4:12:180,12,0	1/1:0,5:5:15:225,15,0	0/0:2,0:2:0:0,0,8	./.:1,0:1	1/1:0,4:4:12:180,12,0	./.:0,0:0	1/1:0,5:5:15:225,15,0	./.:1,0:1	1/1:0,2:2:6:90,6,0	./.:0,0:0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:1,0:1	1/1:0,3:3:9:131,9,0	./.:0,0:0	1/1:0,3:3:9:134,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0	1/1:0,3:3:9:135,9,0	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,1:1:6:70,6,0	1/1:0,5:5:15:205,15,0	1/1:0,2:2:6:90,6,0	1/1:0,1:1:3:45,3,0	./.:1,0:1	0/1:1,3:4:36:123,0,36	./.:1,0:1	./.:1,0:1	1/1:0,3:3:9:135,9,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	0/1:0,3:3:33:115,0,33	./.:1,0:1	1/1:0,4:4:12:172,12,0	./.:1,0:1	./.:0,0:0	./.:2,0:2	./.:0,0:0	1/1:0,2:2:6:79,6,0	1/1:0,4:4:12:180,12,0	1/1:0,2:2:9:115,9,0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	1/1:0,3:3:9:130,9,0	1/1:0,5:5:15:225,15,0	./.:1,0:1	./.:0,0:0	1/1:0,3:3:9:135,9,0	0/1:1,4:5:30:165,0,30	./.:0,0:0	1/1:0,4:4:12:180,12,0	./.:0,0:0	1/1:0,3:3:9:131,9,0	./.:1,0:1	0/1:0,3:3:33:123,0,33	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:1,0:1	1/1:0,1:1:3:45,3,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:1,0:1	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:86,6,0	./.:2,0:2	./.:1,0:1	0/1:0,3:3:33:123,0,33	./.:1,0:1	./.:1,0:1	./.:0,0:0	./.:0,0:0	./.:1,0:1	./.:1,0:1	1/1:0,4:4:12:180,12,0	./.:0,0:0	./.:1,0:1	./.:0,0:0	./.:0,0:0	1/1:0,2:2:6:90,6,0	./.:0,0:0	1/1:0,3:3:9:132,9,0	1/1:0,4:4:12:179,12,0	1/1:0,3:3:9:135,9,0	./.:0,0:0	1/1:0,4:4:12:180,12,0	./.:1,0:1	./.:2,0:2	./.:0,0:0	1/1:0,1:1:3:39,3,0	./.:0,0:0	1/1:0,4:4:12:180,12,0	./.:0,0:0	./.:0,0:0	
1/1:0,2:2:6:90,6,0	./.:1,0:1	./.:1,0:1	1/1:0,2:2:6:90,6,0	1/1:0,2:2:6:90,6,0	./.:1,0:1	1/1:0,1:1:3:45,3,0	./.:0,0:0	./.:0,0:0	./.:0,0:0	0/1:0,1:1:29:74,0,29	./.:1,0:1	./.:0,0:0	1/1:0,1:1:6:85,6,0	./.:1,0:1	1/1:0,3:3:9:135,9,0	./.:0,0:0	1/1:0,7:7:21:315,21,0	1/1:0,2:2:6:90,6,0	1/1:0,3:3:9:122,9,0
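
Note on comparing the two records above: the sample columns are in a different order in the two files (PRM before OFF in the top header, OFF before PRM in the bottom one), and the bottom record drops the trailing ":.:." from missing calls, so eyeballing column by column is misleading. A small Python sketch that compares genotypes by sample name instead; the function names are hypothetical and it assumes each header and record sits on a single tab-delimited line.

```python
def genotypes_by_sample(header_line, record_line):
    """Map sample name -> GT subfield for one VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")[9:]
    if len(samples) != len(fields):
        # A mismatch here would also flag fused names or a lost tab
        raise ValueError(f"{len(samples)} sample names but {len(fields)} genotype fields")
    return {s: f.split(":")[0] for s, f in zip(samples, fields)}

def mismatched_calls(before, after):
    """Samples present in both records whose GT differs."""
    shared = before.keys() & after.keys()
    return sorted(s for s in shared if before[s] != after[s])

# Toy records with reordered samples and trimmed missing fields
fixed = "c\t1\t.\tG\tA\t9\t.\t.\tGT:AD:DP:GQ:PL\t"
h1 = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFEM001\tPRM001\tOFF001"
r1 = fixed + "./.:0,0:0:.:.\t1/1:0,2:2:6:90,6,0\t0/1:1,1:2:39:39,0,76"
h2 = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFEM001\tOFF001\tPRM001"
r2 = fixed + "./.:0,0:0\t0/1:1,1:2:39:39,0,76\t1/1:0,2:2:6:90,6,0"
print(mismatched_calls(genotypes_by_sample(h1, r1), genotypes_by_sample(h2, r2)))
```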


#####Monday, 22 February 2016
Ok, so what if I check band sharing with calculations in R? Plus manually counting for a small number of loci? Just to see if my program is doing it correctly? I can count at least a few comparisons for the 10 loci one. 
Counting the 10 loci confirms that the relatedness values seem to be correct. 
But I think I want to verify that in R. I checked it in R and it matches.
I couldn't find the genotypes.txt file to play around with so I'm re-writing the genotypes file from haplotypes using haplotypes_to_cervus.

I added an option to the infer_maternal_allele part of the sca_simulation to opt in to allele dropout. If a random number is below the error rate (0.01), then the mother gets assigned the paternal allele no matter what. I didn't change the offspring alleles but maybe I should...we'll see what this does, I guess.
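
The dropout rule in sketch form (Python rather than the simulation's own code). The function name and arguments are hypothetical; it assumes the simulation knows both the true maternal allele and the paternal allele for the offspring.

```python
import random

def assign_inferred_maternal_allele(true_maternal, paternal, error_rate=0.01, rng=random):
    """With probability error_rate the inferred mother gets the paternal allele."""
    if rng.random() < error_rate:
        return paternal  # dropout: wrong allele assigned no matter what
    return true_maternal

# Rough check that the realised error rate is near 0.01
rng = random.Random(42)
n = 100_000
wrong = sum(assign_inferred_maternal_allele("A", "G", 0.01, rng) == "G" for _ in range(n))
print(wrong / n)
```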

I'm calculating relatedness on the 22934 haplotypes that are present in 90% of individuals (but not HWE-pruned). Also running the band_sharing program on the same file (/parentage/PolymorphicIn90PercInds.txt)

Then I'll also do it on the HWE pruned set, and see what distributions I get.


#####Wednesday, 17 February 2016
The number of bands shared just seems improbable. It's incredibly frustrating.

#####Tuesday, 16 February 2016
Band sharing resulted in sharing = 1 and 0 incompatible loci for the biallelic dataset. That just doesn't seem right.
And relatedness isn't working right either. Why is this so difficult???
OK, so the relatedness program issue was because of loci with only one allele...which was shared so the denominator of rxy was 0. I think that's fixed, now we'll see what the values look like.
That's fixed, but now I'm getting unusual values. With 1600 loci, the majority of the values are less than 0, and most father-offspring pairs have negative relatedness values.
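
For reference, a sketch of the zero-denominator guard with numerator and denominator summed over loci before dividing. It uses a Queller & Goodnight style estimator purely for illustration; that choice, the function name, and the data layout are assumptions, and the actual program may use a different estimator (e.g. Lynch & Ritland).

```python
def qg_relatedness(x, y, freqs):
    """Relatedness of y to x, summed over loci before dividing.

    x, y: per-locus genotypes as 2-tuples of alleles (None = missing).
    freqs: per-locus dict of allele -> population frequency.
    Returns None when the summed denominator is 0 (e.g. only fixed loci).
    """
    num = den = 0.0
    for gx, gy, p in zip(x, y, freqs):
        if gx is None or gy is None:
            continue
        for allele in gx:
            px = gx.count(allele) / 2  # frequency of this allele within x
            py = gy.count(allele) / 2  # frequency of this allele within y
            num += py - p[allele]
            den += px - p[allele]
    if den == 0:
        return None
    return num / den

freqs = [{"A": 1.0}, {"A": 0.2, "B": 0.8}]  # first locus is monomorphic
x = [("A", "A"), ("A", "A")]
print(qg_relatedness(x, x, freqs))
print(qg_relatedness(x, [("A", "A"), ("B", "B")], freqs))
print(qg_relatedness(x[:1], x[:1], freqs[:1]))  # monomorphic locus only
```

The fixed locus contributes 0 to both sums, so it no longer causes a divide-by-zero on its own. Negative values come out whenever the pair shares fewer alleles than expected from the population frequencies.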

#####Monday, 15 February 2016
I finished running all of the CERVUS combinations yesterday (up to 1600 loci), so now I need to analyze those data.
I'm extracting delta, num assignments, and percentage of assignments for each one. 
I think it would also be useful to see how many times females are assigned and if each time they're assigned to the same offspring...still figuring out how to get that info.
So I think I've got decent summaries. Though I'm not sure I'm really graphically representing info in the best way possible. But I'll show it to Adam.
I'm going to show Adam:
	-gwsca_biallelic plot
	-Structure plot
	-CERVUS summary info
	-Batemanator output and selection differentials
And ask what to do. (sort of)

OK, so after talking with Adam:
1. Batemanator results and selection differentials help for the proof-of-principle thing.
2. Re-make histograms for 100,200,1600 loci or something--number of assignments and MS
3. Relatedness should be summed over all loci
4. Calculate band-sharing as well (this generates the distribution of values for male-offspring pairs)
5. Add simulation of offspring allele dropout to the gwsca biallelic simulation
6. Calculate proportion of loci in which fathers and offspring are compatible.

I re-made the CERVUS histograms (saved as parentage/CERVUS_incremental_snps/CervusMaternitySummary.png).
I'm re-running the calculate_relatedness with 1600 loci.
Calculating band-sharing and the proportion of incompatible father-offspring genotypes using band_sharing program.
I'm getting really high band-sharing numbers (~0.95), which I guess suggests most loci are homozygous for the same allele in fathers and offspring. 
I'm trying this on the biallelic.vcf file to see what it looks like for biallelic snps.
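
A sketch of how band sharing and the proportion of incompatible loci could be computed for one father-offspring pair. The definition used here, 2 * shared bands / (father bands + offspring bands), is one common form and is an assumption about what band_sharing does; names and data are hypothetical.

```python
def band_sharing(father, offspring):
    """Return (band-sharing score, proportion of incompatible loci).

    Genotypes are per-locus tuples of alleles; None marks a missing locus.
    A locus is incompatible when father and offspring share no allele.
    """
    shared = total_f = total_o = 0
    incompatible = scored = 0
    for gf, go in zip(father, offspring):
        if gf is None or go is None:
            continue
        bands_f, bands_o = set(gf), set(go)  # distinct alleles = bands
        common = bands_f & bands_o
        shared += len(common)
        total_f += len(bands_f)
        total_o += len(bands_o)
        scored += 1
        if not common:
            incompatible += 1
    if scored == 0:
        return None, None
    return 2 * shared / (total_f + total_o), incompatible / scored

dad = [("A", "A"), ("A", "B"), ("A", "A")]
kid = [("A", "A"), ("A", "A"), ("B", "B")]
print(band_sharing(dad, kid))
```

Under this definition, loci where both individuals are homozygous for the same allele each score 1, so a dataset dominated by such loci would give values near 1.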

Something is really not working with the relatedness calculations. It's spitting out a bunch of -1.#IND r-values. I'm not sure when or why it's happening...at first I thought it was because I was including missing loci, but that's not it.

#####Sunday, 14 February 2016
I calculated the selection differentials for various morphometric traits using relative mating success, number of offspring, and number of non-reduced offspring based on the parentage analysis from gen1600_8.

                        SVL SnoutLength SnoutDepth    TailLength      BandNum MeanBandArea
Mating Success    0.1282916   0.1634555  0.1188022 -0.0003594433  0.006399823   0.02001239
Total Num Embryos 0.2685957   0.2396699  0.2319460 -0.0104568167 -0.052075048   0.16136224
Surviving Embryos 0.2858035   0.2461369  0.2421699 -0.0132749882 -0.038329322   0.16094581
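
For the record, one standard way to get a standardized selection differential is the covariance between relative fitness and the standardized trait. The Python sketch below shows that calculation; whether the values in the table used exactly this standardization (and population rather than sample variance) is an assumption, and the example numbers are made up.

```python
import math

def selection_differential(trait, fitness):
    """Covariance of relative fitness with the standardized trait."""
    n = len(trait)
    mean_w = sum(fitness) / n
    mean_z = sum(trait) / n
    sd_z = math.sqrt(sum((z - mean_z) ** 2 for z in trait) / n)
    if mean_w == 0 or sd_z == 0:
        raise ValueError("need non-zero mean fitness and trait variance")
    rel_w = [w / mean_w for w in fitness]  # relative fitness has mean 1
    return sum((w - 1.0) * (z - mean_z) / sd_z for w, z in zip(rel_w, trait)) / n

svl = [1.0, 2.0, 3.0, 4.0]    # made-up trait values
mates = [0, 1, 1, 2]           # made-up mating success
print(selection_differential(svl, mates))
```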


#####Saturday, 13 February 2016
I worked on the relatedness code. I switched to using Lynch & Ritland (1999) formulas, but it's not giving me reasonable output.
I'm not sure what to do about that...maybe there's a program to use to calculate relatedness?
Maybe I shouldn't do the weighting??

#####Friday, 12 February 2016
In the simulation model I had calculated Fst as 1-(hs1+hs2)/(2*ht), whereas in my calculations I'd done (ht-avg(hs))/ht...which I believe is mathematically identical, but I changed it just to be safe.
I also output the allele frequencies and re-ran gwsca_biallelic_vcf. Next I'm going to extract the allele frequency spectrum from the biallelic data and run the model with that AFS, just to be sure that I'm using the best possible model.
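
A quick Python check that the two ways of writing Fst in this entry agree. The function names and the toy allele frequencies are made up, and ht is computed from the unweighted mean allele frequency, which assumes equal group sizes.

```python
def fst_simulation_form(hs1, hs2, ht):
    """Fst written as 1 - (hs1 + hs2) / (2 * ht)."""
    return 1.0 - (hs1 + hs2) / (2.0 * ht)


def fst_calculation_form(hs1, hs2, ht):
    """Fst written as (ht - mean(hs)) / ht; note the parentheses around the numerator."""
    hs_avg = (hs1 + hs2) / 2.0
    return (ht - hs_avg) / ht


def expected_het(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


# Toy allele frequencies in two groups (made up)
p1, p2 = 0.2, 0.6
hs1, hs2 = expected_het(p1), expected_het(p2)
ht = expected_het((p1 + p2) / 2.0)

fst_a = fst_simulation_form(hs1, hs2, ht)
fst_b = fst_calculation_form(hs1, hs2, ht)
print(fst_a, fst_b)
```

Both forms reduce to (ht - mean hs) / ht, so any difference between the simulation and the empirical Fsts has to come from somewhere else.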

#####Thursday, 11 February 2016
I think I figured out the indexing problem, though it's a little broken at the moment.
Fixed it! Now it's done and it runs much more quickly! A few minutes tops! Woo hoo!!
So the biallelic pipeline is no longer convert_snps.R | infer_mat_vcf | gwsca_biallelic, it's now
stacks/batch_1 | infer_mat_vcf | vcftools --extract-FORMAT-info GT | merge_vcfs.R | gwsca_biallelic_vcf
I have to fix a few bugs...first the alleles weren't being designated correctly, then the Fst values are off.
Some of the allele2s are NULL...I don't think I check for that scenario when calculating Fsts.
It's mostly fixed but for some reason the NONPREG group is having frequencies > 1...it's saying there are 45 instances of allele 1!! But the ind info file is correct and there are only 8 non pregnant males and only 8 individuals marked as NONPREG...so something in the code. It looks like I hadn't cleared the freq counters...did that, we'll see if it's fixed.

Re-zeroing the frequency counts seems to have fixed it. Though some of the max Fst values were a little high (0.8).
I didn't output the allele frequencies so I'm not sure I can prune based on major allele freq.
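
A Python sketch of the counter issue, assuming the bug was that allele counts carried over from one locus to the next. The data layout and names are hypothetical; the real program is the C++ gwsca_biallelic_vcf.

```python
from collections import Counter


def group_allele_freqs(loci, groups):
    """loci: one list of genotypes per locus (tuple of two alleles, or None if missing),
    in the same individual order as `groups`. Returns one {group: {allele: freq}} per locus."""
    results = []
    for genotypes in loci:
        # Fresh counters for every locus; reusing the previous locus's counters inflates the counts
        counts = {g: Counter() for g in set(groups)}
        for geno, grp in zip(genotypes, groups):
            if geno is not None:
                counts[grp].update(geno)
        freqs = {}
        for grp, c in counts.items():
            total = sum(c.values())
            freqs[grp] = {a: n / total for a, n in c.items()} if total else {}
        results.append(freqs)
    return results


# Toy data (made up): two NONPREG males and one female, two loci
groups = ["NONPREG", "NONPREG", "FEM"]
loci = [
    [("A", "A"), ("A", "G"), ("G", "G")],
    [("C", "T"), None, ("C", "C")],
]
freqs = group_allele_freqs(loci, groups)
print(freqs)
```

With 8 non-pregnant males, no allele can be counted more than 16 times at one locus, so a count of 45 is a quick red flag for this kind of carry-over.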

OK, so I got rid of any NAs and any with Fst < 0 and kept those above certain N cutoffs.

#####Wednesday, 10 February 2016
For some reason R does not like me asking if sum(allelefreqs)=="1". So I just removed that sanity check and will assume that the expected allele frequencies always sum to 1. 
This resulted in 11202 loci. Then I'm pruning to 97% coverage, which yields 3119 loci.
Now I just need to sample! Done. Now just need to run through CERVUS...perfect thing to do while watching TV ;)
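
On the sum-to-1 sanity check at the top of this entry: one plausible reason an exact comparison fails is floating-point rounding (this is a guess, I didn't confirm it was the cause in R). A Python sketch of a tolerance-based check that could replace the exact test; the function name and tolerance are made up.

```python
import math


def freqs_sum_to_one(freqs, tol=1e-8):
    """True if the allele frequencies sum to 1 within an absolute tolerance."""
    return math.isclose(math.fsum(freqs), 1.0, abs_tol=tol)


# Ten alleles at frequency 0.1: adding them one at a time does not give exactly 1.0
allele_freqs = [0.1] * 10
running_total = 0.0
for f in allele_freqs:
    running_total += f

print(running_total == 1.0)            # exact comparison
print(freqs_sum_to_one(allele_freqs))  # comparison with a tolerance
```

Keeping the check with a tolerance would still catch loci where the frequencies are genuinely wrong.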

Meanwhile, worked on the gwsca_biallelic code. I can do one locus at a time in the vcf format so I don't have to store much of anything. I just need to assign indices for the popstats? or something. I am not sure how this is going to work yet... I should also do a biallelic sanity check. I need to figure out a way to index the individuals in the pop stats one.
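
A rough Python sketch of the one-locus-at-a-time idea: read the VCF line by line, look up each sample's group from its column, and keep only the allele counts for the current record. The sample names, group labels and function name are hypothetical, and it assumes biallelic records with a GT entry in FORMAT.

```python
def stream_allele_counts(handle, group_of):
    """Yield (chrom, pos, {group: [ref_count, alt_count]}) for one VCF record at a time."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]          # sample names, in column order
            continue
        gt_index = fields[8].split(":").index("GT")
        counts = {}
        for name, entry in zip(samples, fields[9:]):
            group = group_of.get(name)
            if group is None:
                continue                  # sample not in the individual info
            gt = entry.split(":")[gt_index].replace("|", "/")
            tally = counts.setdefault(group, [0, 0])
            for allele in gt.split("/"):
                if allele == "0":
                    tally[0] += 1
                elif allele == "1":
                    tally[1] += 1         # "." (missing) is simply not counted
        yield fields[0], int(fields[1]), counts


# Toy VCF (made up) with three samples
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "FEM001", "NPM001", "NPM002"]
rec1 = ["scaffold_1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "0/0", "./."]
rec2 = ["scaffold_1", "250", ".", "C", "T", ".", "PASS", ".", "GT:DP", "1/1:12", "0/1:8", "1/1:5"]
vcf_lines = ["##fileformat=VCFv4.2"] + ["\t".join(x) for x in (header, rec1, rec2)]
group_of = {"FEM001": "FEM", "NPM001": "NONPREG", "NPM002": "NONPREG"}

records = list(stream_allele_counts(vcf_lines, group_of))
print(records)
```

Because the sample order is fixed by the #CHROM line, the index of each individual in the group lookup only has to be worked out once.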

#####Tuesday, 9 February 2016
Working on pruning for Hardy-Weinberg equilibrium...I think I figured it out. I'm calculating the Chi-Squared test statistic myself and then using 1-pchisq(x,df) to calculate the p-value based on df=num alleles -1.
Now I need to choose those with p>0.05 and then take those loci from the main dataframe.
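
A Python sketch of the same test for the simplest case, a biallelic locus, where df = 1. It is not the R code I used; the function name and genotype counts are made up. For df = 1 the upper tail of the chi-squared distribution can be written with erfc, so no extra packages are needed.

```python
import math


def hwe_chisq_biallelic(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions at a biallelic locus (df = 1)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)               # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability for df = 1, i.e. the 1 - pchisq(x, 1) step
    p_value = math.erfc(math.sqrt(chisq / 2.0))
    return chisq, p_value


# Toy genotype counts (made up)
in_hwe = hwe_chisq_biallelic(25, 50, 25)    # exactly at HW proportions
no_hets = hwe_chisq_biallelic(50, 0, 50)    # complete heterozygote deficit
print(in_hwe, no_hets)
```

Loci with a p-value above 0.05 would be the ones to keep.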

Thinking about the biallelic gwsca, I could read in the first ~20000 columns, save those and save the index of the last column, and do the calculations, empty the vectors, and read in those columns to tmp and then do the next 20000...I think this could work...except I have to go through each individual I guess? That could make it more complicated. But I think it's still do-able...
PROBLEM! The individuals don't have the locus IDs that they're missing! This would be easier in vcf format...
I have vcf format from the infer_maternal_contribution!
I think the solution is going to be to re-do gwsca_biallelic using vcf input
Got all the vcf ind names with 
sarah@sarah-vb:~/sf_ubuntushare/SCA/results/biallelic$ grep '#CHROM' biallelic_maternal.vcf > ind.info.bi.vcf.txt
sarah@sarah-vb:~/sf_ubuntushare/SCA/results/biallelic$ grep '#CHROM' ../stacks/batch_1.vcf > all_seqd.vcf.txt
manually removed all PRM and OFF, keeping only FEM and NPM
sarah@sarah-vb:~/sf_ubuntushare/SCA/results/biallelic$ vcftools --vcf ../stacks/batch_1.vcf --keep extract_from_vcf.txt --out fem --recode-INFO-all --recode
*Had to remove ##Includes inferred maternal alleles header *
vcf-merge is not working!!! I don't know why.
In the fem.vcf file, the header with FORMAT info said AD field should have 1 number but it has 2. I just changed that to say that it should have "." numbers.
Now checking them with vcf-validator.

vcf-sort biallelic_maternal.vcf > maternal.vcf
bgzip maternal.vcf
tabix -p vcf maternal.vcf.gz
vcf-sort fem.recode.vcf > fem.vcf
bgzip fem.vcf
tabix -p vcf fem.vcf.gz
vcf-merge fem.vcf.gz maternal.vcf.gz > merged.vcf
STILL NOT WORKING!
Probably because some individuals were duplicated, so I'm just doing this in R
Found some loci that have the same scaffold and position but a different REF, so I'm just removing those from the dataset because that's too confusing for me. Re-wrote a vcf (biallelic.vcf) with 312755 loci and 565 genotypes (including inferred mothers).
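
A Python sketch of the filter for loci with the same scaffold and position but different REF alleles (I did this step in R; the tuple layout and function name here are hypothetical).

```python
from collections import defaultdict


def drop_conflicting_ref(records):
    """records: iterable of (chrom, pos, ref, payload) tuples.
    Drops every record at a site where more than one REF allele was seen."""
    records = list(records)
    refs = defaultdict(set)
    for chrom, pos, ref, _ in records:
        refs[(chrom, pos)].add(ref)
    return [r for r in records if len(refs[(r[0], r[1])]) == 1]


# Toy records (made up): the site at scaffold_2:77 has REF = A in one file and REF = T in the other
records = [
    ("scaffold_1", 100, "A", "fem"),
    ("scaffold_1", 100, "A", "maternal"),
    ("scaffold_2", 77, "A", "fem"),
    ("scaffold_2", 77, "T", "maternal"),
]
kept = drop_conflicting_ref(records)
print(kept)
```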

Using Notepad++ I created a ind.info.vcf.txt file with the individual information.

Now I just have to re-write the program...



**could I do this with the assigned maternity rather than the inferred allele?


#####Monday, 8 February 2016
Well, I accidentally permanently deleted parse_structure_output. Just FYI. I'll have to rewrite it at some point.

So to run CERVUS a bunch of times, I need to test to see if loci are in Hardy-Weinberg equilibrium first.
I'm going to start with the full genotypes file, and start playing around with that. 
This is going OK, although I need to account for non-alphabetical order of genotype calls..or sort it first? maybe that will work...

gwsca_biallelic crashed again--after OFF026, I'm pretty sure it's just running out of memory.

#####Sunday, 7 February 2016
The structure run had finished, so I zipped all of the output files (from structure/Results/ directory) and uploaded them to Structure Harvester. Downloaded the structure harvester output.
in the scovelli_popgen/saltwater/programs is a program called parse_structure_output. Ran that in Linux for all of my results files.
It appears that K=2 is the best. It has the highest L'(K) and the highest Delta(K) values.
Using R I plotted the structure output.

Also, gwsca_biallelic is not finding non-biallelic alleles with the updated new.snps.txt files. I'm going to re-start it in "Release" mode because then it should run faster. And I'll see if it runs out of memory more quickly. In release mode it is finding a few loci with more than two alleles, but not very many. Probably Ns or something. 

Meanwhile, started in on the relatedness code. I'm writing it on my home computer for now. I need to ask Adam about a few things re: calculating relatedness.
 

#####Saturday, 6 February 2016
I'm debugging gwsca_biallelic. It looks like the 'new.snps.txt' files aren't actually reduced? And maybe something's wrong with the way they're output.
So when I merge snps and matches the dataframe becomes huge, but then I make it smaller using keep.snp and pruning for stack depth<-becomes larger because there are duplicate entries for "LocusID".
Two things:
	1. I'm going to remove any from snps that have a type = "U"...that suggests that the likelihood ratio was too low to call het/hom so it's not worth including.
	2. If Type="O" I need to recode the genotypes (I haven't been doing that)<-this is critical for calculating allele frequencies.
Maybe this will help? Really there shouldn't be ANY loci that aren't biallelic.
I don't know if it will help with the memory issues, but I worry about trying to break it up and messing up the calculations. I could probably calculate allele frequencies (or at least count the number of alleles) and not have to save everything to file...

#####Friday, 5 February 2016
gwsca_biallelic crashed--it either ran out of memory or deleted too many alleles. It's difficult to say.
So I'm going to try pruning the sumstats file to just biallelic loci. Well, actually, they're all biallelic...so there must be something buggy about my program. I'm re-starting it in debug mode. 


#####Thursday, 4 February 2016
I re-started gwsca_biallelic with the pre-pruned files.

GATK: finished running! Now I'm doing the Monnahan pipeline but it's a bit confusing. 
I'm renaming the scripts so that they're in order and my fixes/updates are in them. Done!
Ran het_v_depth successfully, and then also the first file (bigVCF_v1.py). But the next files require a file that's in a format I don't know and I don't really feel like trying to figure it out from the code (because it's really not obvious) so I emailed John Kelly. I asked him several questions, including what to do with the females.
John Kelly pointed out that I have an excess of heterozygotes (lots near 0.6). Could this explain elevated Fsts? I don't have a good answer...Why might there be elevated heterozygosity?

KINGROUP is not working correctly? Maybe? Or it's just buggy in general? I can't tell.
I should look up when it was last cited.
Adam recommends calculating relatedness myself, and just looking at the point estimates.
Also he thinks we should do a parentage/relatedness paper.

ALSO! I should do a structure analysis to see if females are different than others.
populations -b 1 -P ./results/stacks -M ./results/stacks/sca.null.map.txt -W ./results/stacks/shared_loci.txt -t 3 --plink
Then to prune:
../Nerophis_ophidion/results/plink --file ./results/stacks/batch_1.plink --r2 --ld-window-r2 0.2 --hardy --geno 0.05 --maf 0.05 --max-maf 0.95 --noweb --allow-no-sex --write-snplist --out ./results/stacks/pruned
[this removes loci with LD (r2) > 0.2, loci not in HWE, loci with allele frequencies outside 0.05-0.95, and loci present in <95% of individuals. Leaves 3041 loci]
../Nerophis_ophidion/results/plink --file ./results/stacks/batch_1.plink --extract ./results/stacks/pruned.snplist --noweb --allow-no-sex --recode --recode-structure --out ./results/stacks/pruned
Running STRUCTURE on 3041 pruned loci in 443 individuals
	missing value = 0 (unless specified with --output-missing-genotype)
	one row per individual
	row 1 = locus name
	row 2 = map distance?
	column 1 = individual id
	column 2 = sampling location
Now it's running with 10,000 burn-in, 10,000 MCMC, admixture, and correlated frequencies;
batch job with K=1 through K=4 with 10 iterations.





#####Wednesday, 3 February 2016
Although KINGROUP accepts CERVUS format files, it doesn't if you want to run it with 1000 loci. So I wrote a C++ program to re-format CERVUS files in KINSHIP format (cervus_to_kinship). It seems to have worked! It replaces "0" with "/".

I ran CERVUS with the 99% present loci (394) but it only assigned 10% of offspring, and the critical delta was 0?? I wonder if ~400 loci is too many. 

**Not all of these loci are polymorphic!! If they're "consensus" then they're not polymorphic...**

In KINGROUP v2 I'm running Pairwise Relatedness, Maximum Likelihood method from Goodnight & Queller (1999), calculating allele frequencies, displaying p-values and half matrix, and sorting descending by ID (it's very slow to respond).

Re-ran CERVUS with 1744 polymorphic loci found in 98% of individuals, which resulted in a 19% assignment rate (30 offspring assigned). And all those not assigned were excluded (Delta = 0, LOD < 0).

For gwsca_biallelic, it occurred to me that I could prune the input files before the C++ program in R so that I only have the reference SNPs..this prunes the input file from 4,848,705 to 230,650 SNPs.

#####Tuesday, 2 February 2016
VirtualBox crashed while GATK HaplotypeCaller was running OFF113...so I created rerun_haplotypecaller.sh and re-ran it.
Organized the significant and moderately-significant maternity results in a table to print and put in lab notebook to discuss with Adam.
OK, so I need to run CERVUS with another set of loci to see if I can reproduce the same results. Then I need to start ramping up the number of loci included in the analysis until CERVUS breaks. So I'm using R to subset each of the subsetted files and also write one to file that has loci present in 98% of the individuals. 
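
A Python sketch of the coverage filter (the actual subsetting was done in R). It assumes genotypes are held as one list of calls per locus and that "0" marks a missing call; both are assumptions for illustration, as are the names.

```python
def loci_with_coverage(genotypes, min_prop=0.98, missing="0"):
    """genotypes: dict mapping locus name -> list of calls, one per individual.
    Returns the loci typed in at least `min_prop` of individuals."""
    keep = []
    for locus, calls in genotypes.items():
        typed = sum(1 for c in calls if c != missing)
        if calls and typed / len(calls) >= min_prop:
            keep.append(locus)
    return keep


# Toy data (made up): 50 individuals, locus_b is missing in 2 of them (96% coverage)
genotypes = {
    "locus_a": ["1"] * 50,
    "locus_b": ["1"] * 48 + ["0"] * 2,
    "locus_c": ["2"] * 49 + ["0"],
}
kept_loci = loci_with_coverage(genotypes)
print(kept_loci)
```

Lowering min_prop step by step is one way to ramp up the number of loci until CERVUS stops coping.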

Also, downloaded KINGROUP v.2, which is a java program to analyze pairwise relatedness. It accepts up to 1000 loci in CERVUS format, so I can use my subsetting in R to create a useful file.
	[1] "genotypes14  had no loci present in 98% of individuals"
	[1] "genotypes19  had no loci present in 98% of individuals"
	[1] "genotypes27  had no loci present in 98% of individuals"
	[1] "genotypes28  had no loci present in 98% of individuals"
	[1] "genotypes29  had no loci present in 98% of individuals"
	[1] "genotypes30  had no loci present in 98% of individuals"
	[1] "genotypes31  had no loci present in 98% of individuals"
	[1] "genotypes32  had no loci present in 98% of individuals"
	[1] "genotypes35  had no loci present in 98% of individuals"
	[1] "genotypes37  had no loci present in 98% of individuals"
	[1] "genotypes43  had no loci present in 98% of individuals"
	[1] "genotypes44  had no loci present in 98% of individuals"
	[1] "genotypes45  had no loci present in 98% of individuals"
	[1] "genotypes47  had no loci present in 98% of individuals"
	[1] "genotypes54  had no loci present in 98% of individuals"
	[1] "genotypes56  had no loci present in 98% of individuals"
	[1] "genotypes57  had no loci present in 98% of individuals"
	[1] "genotypes58  had no loci present in 98% of individuals"
	[1] "genotypes59  had no loci present in 98% of individuals"
	[1] "genotypes6  had no loci present in 98% of individuals"
	[1] "genotypes60  had no loci present in 98% of individuals"
	[1] "genotypes61  had no loci present in 98% of individuals"
	[1] "genotypes62  had no loci present in 98% of individuals"
	[1] "genotypes7  had no loci present in 98% of individuals"
	[1] "genotypes8  had no loci present in 98% of individuals"

	
...meanwhile biallelic SCA is still chugging along....


#####Monday, 1 February 2016
Figuring out CERVUS simulations:
"you should not simulate the number of offspring in your actual analysis"
"The average number of candidate parents per offspring should be estimated"<-so this should be one?
"Prop. sampled...you should not set this parameter to one unless you are certain that you have sampled all observed candidate parents and there is no possibility that there are candidate parents which have eluded observation."
"Prop. loci typed...allows for missing data and should be an average value...simulation should select the calculated value by default."
"Prop. loci mistyped...by default the proportion of loci mistyped is also used as the error rate in the likelihood calculations (both by the simulation and in actual parentage analysis)...default value is 0.01."
"Minimum typed loci. By default this parameter is set to half the total number of loci."
set 10000 offspring, 1000 candidate mothers, .25 prop sampled, typed .75 and mistyped .02 and minimum typed = 1500

It keeps giving me a floating point error...

I'm going to sub-sample 100-200 loci all of which are present in at least 75% of individuals.
Using R, I found 143 loci in genotypes0.txt that are present in 98% of individuals. 

Ran CERVUS but noticed that the output said that most known fathers weren't typed...because the offspring file specified offspring as their own father. Manually fixing that file.

Cervus results:

	Mother given known father:

	Level       Confidence (%)  Critical Delta  Assignments        Assignment Rate  
	                                            Observed Expected  Observed Expected
	Strict               95.00            5.88       20 (     14)      13%    (  9%)
	Relaxed              90.00            4.50       25 (     15)      16%    ( 10%)
	Unassigned                                      130 (    140)      84%    ( 90%)
	Total                                           155 (    155)     100%    (100%)


	**** Number of individuals tested ****

	Offspring (total):                                           160
	  Tested (typed at 72 or more loci):                         159
		Known father typed at 72 or more loci:                   155
		Known father typed at fewer than 72 loci:                  4
	  Not tested (typed at fewer than 72 loci):                    1

	Candidate mothers (total):                                    87
	  Tested (typed at 72 or more loci):                          87
	  Not tested (typed at fewer than 72 loci):                    0
	  Average number of candidate mothers per offspring:          87
	  Average proportion of sampled candidate mothers:             1.0000


	**** Files ****

	Input
	  Offspring file:                 offspring.txt
	  Candidate mother file:          candidate_parents.txt
	  Genotype file:                  genotypes0.pruned.txt
	  Allele frequency file:          genotypes0.pruned_allelefreqs.alf
	  Simulation data file:           genotypes0.pruned_simulation.sim

	Output
	  Parentage summary file:         genotypes0.pruned_maternity.txt
	  Parentage data file:            genotypes0.pruned_maternity.csv

...now what??


#####Sunday, 31 January 2016
I wrote a program to parse the genotypes.txt file into multiple smaller files, each with 3000 loci. 
This seemed to work, although I noticed some of the alleles are coded as "consensus"...but since all individuals are compared to the same "consensus" sequence then this is just as good as having the haplotype sequence. We'll see what happens.
To run CERVUS, I need to run the allele frequency thing first. So I'm doing that with genotypes0.txt
Then ran maternity analysis simulation with 10000 offspring, 1000 mothers, sampled .25, typed 0.25 loci and mistyped 0.05, with minimum typed loci = 3000. Confidence using delta and relaxed Confidence level = 90% and strict = 95%.
Finally, I ran the maternity analysis...but 0% assigned! Is this right or is there something about the simulation that I did wrong??
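
A Python sketch of the file-splitting step from the start of this entry. The program I actually wrote is separate; this version assumes a wide table with ID column(s) first and one column per locus, which may not match the real genotypes.txt layout, and the names are made up.

```python
def split_genotype_table(header, rows, chunk_size=3000, n_id_cols=1):
    """Split a wide table into narrower tables of at most `chunk_size` loci each.
    The ID column(s) are repeated in every chunk."""
    chunks = []
    for start in range(n_id_cols, len(header), chunk_size):
        stop = min(start + chunk_size, len(header))
        cols = list(range(n_id_cols)) + list(range(start, stop))
        chunk_header = [header[i] for i in cols]
        chunk_rows = [[row[i] for i in cols] for row in rows]
        chunks.append((chunk_header, chunk_rows))
    return chunks


# Toy table (made up): 5 loci split into chunks of 2
header = ["ID", "L1", "L2", "L3", "L4", "L5"]
rows = [
    ["OFF001", "a", "b", "c", "d", "e"],
    ["FEM001", "f", "g", "h", "i", "j"],
]
chunks = split_genotype_table(header, rows, chunk_size=2)
print([h for h, _ in chunks])
```

Each chunk would then be written out as genotypes0.txt, genotypes1.txt, and so on.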

#####Saturday, 30 January 2016
CERVUS can't handle all of my loci. So I need to split the file...maybe rewrite the haplotypes_to_cervus to output a certain number of loci per file and just run a bunch of different files? 

#####Friday, 29 January 2016
So today haplotypes_to_cervus finished running, but problem: CERVUS didn't accept the input! 
It turns out I misunderstood their instructions. I need three files:
	1. File with just offspring INFO (Off ID, Known Parent ID, etc.)
	2. File with just candidate parent INFO (just a list of parent ids)
	3. File with ALL genotypes
So I modified haplotypes_to_cervus and am re-running it.

GATK is still running.
So is gwsca_biallelic. It's finding loci that are not biallelic and removing them. Hopefully it's not *all* of the loci. Also, this is freaking slow. I don't know if there is but there should seriously be a better way to go about doing this. Or something faster, IDK. Maybe there's a different way to index vectors (or arrays) than I've been using.


#####Thursday, 28 January 2016
When I arrived this morning, ./convert_matches had finished running to generate new sample_*_haplotypes.txt files.
GATK had finished running.
gwsca_biallelic was still running through the first individual's file! And then it ran out of memory and crashed.

So now I'm running haplotypes_to_cervus with the new haplotype files...
	and thinking about a way to fix the gwsca_biallelic issue. Maybe I need a different format, or I need a list 
	of SNPs, or maybe I should create a whitelist of loci.
	
OK, gwsca_biallelic:
Can I have a reference and then not store individual info?
Another problem: MOMs have different "SNPID"s than population individuals: 
	MOMs use CatID.Pos, whereas the others have CatID.Col, where column is the position in the RAD locus not the	
	reference basepair.

	I can fix that with batch_1.sumstats.tsv, plus that may help prune some loci since the loci have to have 
	minor allele frequency >= 0.05.
	
**Note: haplotypes_to_cervus needed me to re-enter the name for sample_NPM1128_haplotypes.txt and sample_PRM087
**Also, the ./results/parentage directory didn't exist so I had to re-run haplotypes_to_cervus.

So I don't think I'll be able to append to files the way I did with infer_maternal_contribution, although I guess
	it could be possible. Instead I'm creating the reference up-front and then disregarding any loci that aren't
	in the reference. So far it's successfully read in the reference (sans allele info), so that's good. 
	Ran into a few other small issues debugging but those were mostly me being sloppy. We'll see what else pops up.
	It's finding a lot of loci not in the reference...
	
NOW GATK!
So I was able to run vcftools and filter by allele frequency (between 0.05 and 0.95) and depth (between 1 and 100)
	and restrict it to biallelic loci. This resulted in only 1662 loci! f***
Anyway, when I ran het_v_depth.py it didn't produce any depths. It looks like this might be because of a modification
	I made to the python code. When I changed that, I got an error. So now I have to figure that out.
That is due to the fact that the GATK output sometimes has 5 fields and sometimes it has 7. Sometimes it includes PGT and PID
	I didn't run HaplotypeCaller with -doNotRunPhysicalPhasing. FML.
	Re-running HaplotypeCaller and GenotypeGVCFs using ./scripts/run_haplotypecaller.sh
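
An alternative to re-running would be to parse each sample entry by its FORMAT keys instead of by position, so that records with and without PGT and PID are read the same way. A Python sketch of that idea; I haven't checked how het_v_depth.py actually reads the fields, and the function names and example values are made up.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value from one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # VCF allows trailing fields to be dropped, so pad with the missing value "."
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def depth_of(format_field, sample_field):
    """Read depth (DP) for one sample, or None if it is absent or missing."""
    dp = parse_sample(format_field, sample_field).get("DP", ".")
    return int(dp) if dp != "." else None


# Toy entries (made up): one with 5 fields, one with 7 (PGT and PID included)
five = depth_of("GT:AD:DP:GQ:PL", "0/1:5,4:9:99:120,0,150")
seven = depth_of("GT:AD:DP:GQ:PGT:PID:PL", "0/1:5,4:9:99:0|1:1200_A_G:120,0,150")
missing = depth_of("GT:AD:DP:GQ:PL", "./.")
print(five, seven, missing)
```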
