# SNPcalling.md

The raw data is available on the High Capacity Storage of Otago University. Contact dutoit.ludovic@gmail.com for access.


## Quality control

The data are single-end 100 bp reads across two lanes (confusingly called lanes 2 and 3 rather than 1 and 2). To understand the structure, the adaptor content and the barcodes, I subset the two big *.gz* files to 250,000 reads each (1,000,000 lines, at four lines per FASTQ record).

```
#!/bin/sh
# in source_files
zcat  SQ1109_CDRH1ANXX_s_2_fastq.txt.gz | head -n 1000000 > lane2_sample.fq
zcat  SQ1109_CDRH1ANXX_s_3_fastq.txt.gz | head -n 1000000 > lane3_sample.fq
```
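The line count above follows from the FASTQ layout of four lines per read. As a minimal sketch of the same subsetting in Python (the helper `take_reads` and the in-memory test data are hypothetical and not part of the original workflow):

```python
import gzip
import io
from itertools import islice

LINES_PER_RECORD = 4  # header, sequence, "+", quality


def take_reads(handle, n_reads):
    """Return the first n_reads FASTQ records from a text handle as a list of lines."""
    return list(islice(handle, n_reads * LINES_PER_RECORD))


# Tiny in-memory gzip file standing in for a real lane file (made-up reads).
records = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
compressed = gzip.compress(records.encode())

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    subset = take_reads(handle, 2)

print(len(subset) // LINES_PER_RECORD)  # number of reads kept
```

For the real files, `take_reads(handle, 250000)` would read the same 1,000,000 lines as the `head` call.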

I then check the quality of the sequencing run using FastQC: [metadata/lane2_sample_fastqc.html](metadata/lane2_sample_fastqc) and [metadata/lane3_sample_fastqc](metadata/lane3_sample_fastqc).

```
#!/bin/sh
fastqc *fq
```


Sequence quality is acceptable, but lane 2 shows a noticeable number of errors in the first and last few bases.

There is a lot of adaptor contamination towards the read ends, so I first remove the adaptor and trim all reads to a common length of 70 bp, a compromise between the number of reads kept and the number of bases per read. Reads shorter than 70 bp are removed.

```
#!/bin/sh
module load cutadapt
cutadapt --length  70  -a AGATCGGAAGAGC  -m 70  -o trimmed_lane_2.fastq  SQ1109_CDRH1ANXX_s_2_fastq.txt.gz

cutadapt --length  70  -a AGATCGGAAGAGC  -m 70  -o trimmed_lane_3.fastq  SQ1109_CDRH1ANXX_s_3_fastq.txt.gz
fastqc trimmed_*
```
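To make the combined effect of `-a`, `--length 70` and `-m 70` concrete, here is a simplified sketch of the per-read logic. It is an illustration only: it looks for exact adaptor matches, whereas cutadapt also tolerates mismatches and partial adaptors at the read end. The function name `trim_read` is hypothetical.

```python
ADAPTER = "AGATCGGAAGAGC"  # adaptor passed to cutadapt with -a
TARGET_LENGTH = 70         # value used for both --length and -m


def trim_read(seq, adapter=ADAPTER, length=TARGET_LENGTH):
    """Return the trimmed read, or None if it ends up shorter than `length`."""
    # Remove the adaptor and everything after it (exact matches only here).
    pos = seq.find(adapter)
    if pos != -1:
        seq = seq[:pos]
    # Shorten to the common length, like --length.
    seq = seq[:length]
    # Discard reads that are too short, like -m.
    return seq if len(seq) >= length else None


clean = "A" * 100
contaminated = "C" * 50 + ADAPTER + "G" * 37
print(len(trim_read(clean)))    # 70
print(trim_read(contaminated))  # None
```

Every surviving read is therefore exactly 70 bp long, which is why all reads end up at a common length.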

## SNP calling

### Demultiplexing

I first extract barcodes from the .key file of the sequencing platform.

```
#!/bin/sh
##Key provided in folder
cat 190830_D00390_0499_BCDRH1ANXX.SQ1109.all.ApeKI.ApeKI.key | grep -E "CDRH1ANXX\s+2" | cut -f 3-4   > barcodes_lane2.txt
cat 190830_D00390_0499_BCDRH1ANXX.SQ1109.all.ApeKI.ApeKI.key | grep -E "CDRH1ANXX\s+3" | cut -f 3-4   > barcodes_lane3.txt
cd ..
```
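The same extraction can be sketched in Python. This assumes the key file is tab-separated with flowcell, lane, barcode and sample name in the first four columns, which is what the `grep` pattern and `cut -f 3-4` suggest; the example rows are made up. Unlike the `grep` pattern, which would also match any lane value that merely starts with 2, this compares the lane field exactly.

```python
def extract_barcodes(key_lines, flowcell, lane):
    """Return (column 3, column 4) pairs for rows matching flowcell and lane."""
    pairs = []
    for line in key_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        # Assumed layout: flowcell, lane, barcode, sample name, ...
        if fields[0] == flowcell and fields[1] == str(lane):
            pairs.append((fields[2], fields[3]))
    return pairs


# Made-up rows for illustration only; real barcodes and names come from the key file.
example_key = [
    "Flowcell\tLane\tBarcode\tSample\n",
    "CDRH1ANXX\t2\tACGTA\tsampleA\n",
    "CDRH1ANXX\t3\tTTGCA\tsampleA\n",
    "CDRH1ANXX\t2\tGGATC\tsampleB\n",
]
for barcode, sample in extract_barcodes(example_key, "CDRH1ANXX", 2):
    print(f"{barcode}\t{sample}")
```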
 
I then create different folders to deal with samples sequenced across the two different lanes before concatenating them together.

```
#!/bin/sh
mkdir raw2 samples2 raw3 samples3 samples_concatenated
cd raw2
ln -s ../source_files/trimmed_lane_2.fastq .
cd ..

cd raw3
ln -s ../source_files/trimmed_lane_3.fastq .
cd ..
```

I am now ready to run process_radtags to demultiplex.

```
#!/bin/sh
process_radtags  -p raw2/ -o ./samples2/ -b source_files/barcodes_lane2.txt -e ApeKI -r -c -q --inline-null
process_radtags  -p raw3/ -o ./samples3/ -b source_files/barcodes_lane3.txt -e ApeKI -r -c -q --inline-null

```
It seems to work ok, keeping around 95% of reads.

```
Outputing details to log: './samples3/process_radtags.raw3.log'

103112587 total sequences
  1967776 barcode not found drops (1.9%)
  1865275 low quality read drops (1.8%)
  1151696 RAD cutsite not found drops (1.1%)
 98127840 retained reads (95.2%)


Outputing details to log: './samples2/process_radtags.raw2.log'

101814869 total sequences
  2131984 barcode not found drops (2.1%)
  1684523 low quality read drops (1.7%)
  1463586 RAD cutsite not found drops (1.4%)
 96534776 retained reads (94.8%)
```
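As a quick sanity check on these logs, the drop categories and the retained reads should add up to the total. A small sketch, using the lane 3 numbers copied from the log above (the parsing helper `parse_counts` is hypothetical):

```python
import re

LANE3_LOG = """\
103112587 total sequences
  1967776 barcode not found drops (1.9%)
  1865275 low quality read drops (1.8%)
  1151696 RAD cutsite not found drops (1.1%)
 98127840 retained reads (95.2%)
"""


def parse_counts(log_text):
    """Map each category label to its read count."""
    counts = {}
    for line in log_text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.+?)(?:\s+\(.*\))?$", line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_counts(LANE3_LOG)
total = counts.pop("total sequences")
# Every read should fall into exactly one category.
assert sum(counts.values()) == total
print(f"{100 * counts['retained reads'] / total:.1f}% retained")  # 95.2% retained
```

The lane 2 numbers pass the same check.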

I now concatenate samples in one single folder.

```
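#python
import os

# Check that every demultiplexed sample from lane 3 also exists for lane 2
lane2_files = set(os.listdir("samples2"))
for sample in os.listdir("samples3"):
  print(sample)
  if sample.endswith("gz") and sample not in lane2_files:
    raise Exception("Sample missing from samples2: " + sample)

# Concatenate the two lanes for each sample
for sample in os.listdir("samples3"):
  if sample.endswith("gz"):
    os.system("cat samples2/"+sample+" samples3/"+sample+" > samples_concatenated/"+sample)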
```
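Concatenating the compressed files with `cat` is safe because a gzip stream may contain several members, and readers decompress them one after the other. A minimal demonstration with made-up reads:

```python
import gzip

# Two independently compressed chunks standing in for the lane 2 and lane 3 files
# (made-up reads, for illustration only).
lane2 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
lane3 = gzip.compress(b"@read2\nTTGA\n+\nIIII\n")

# Byte-level concatenation, which is what `cat a.gz b.gz > c.gz` does.
combined = lane2 + lane3

# gzip.decompress reads all members of a multi-member stream.
text = gzip.decompress(combined).decode()
print(text.count("@read"))
```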

We now have clean sample files and we are ready to run the alignment.


### Alignment and variant calling

First, I align every sample with BWA. Here is one example command:


```bash
#!/bin/sh
bwa mem -t 8 $bwa_db $src/${sample}.fq.gz  |   samtools view -b | samtools sort --threads 4 > ${sample}.bam
```
The complete list of commands is in [realign.sh](realign.sh).
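A list of per-sample commands like this can be generated rather than written by hand. The sketch below is one way to do it and is not necessarily how realign.sh was produced; the reference name `reference.fa` and the source directory are placeholders, and `bwa_command` is a hypothetical helper.

```python
import shlex


def bwa_command(sample, bwa_db, src, threads=8, sort_threads=4):
    """Build the alignment pipeline for one sample as a shell command string."""
    fastq = shlex.quote(f"{src}/{sample}.fq.gz")
    bam = shlex.quote(f"{sample}.bam")
    return (
        f"bwa mem -t {threads} {shlex.quote(bwa_db)} {fastq} | "
        f"samtools view -b | "
        f"samtools sort --threads {sort_threads} > {bam}"
    )


# Placeholder paths; the sample names are taken from elsewhere in this document.
for name in ["LU700_F08", "MC950_F10"]:
    print(bwa_command(name, "reference.fa", "samples_concatenated"))
```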

Then I run ref_map.pl for a quick run to identify low-quality individuals:

```
#!/bin/sh
ref_map.pl --samples alignment --popmap popmap_all.txt -T 8  -o output_refmap
```


After manually checking samples and output files, we remove three low-quality individuals (LU700_F08, MC950_F10 and TW900_F06) with less than 500 samples, the blank, and two misidentified individuals (BO1050_F05 and EL960_F10) that are likely from a different species.
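Removing these individuals amounts to dropping their rows from the population map. A minimal sketch, assuming the usual two-column popmap layout (sample name, tab, population); the example rows and population labels are made up, and `filter_popmap` is a hypothetical helper:

```python
# Samples named in the text above; the blank's identifier is not given there,
# so it would need to be added to this set by hand.
EXCLUDED = {"LU700_F08", "MC950_F10", "TW900_F06", "BO1050_F05", "EL960_F10"}


def filter_popmap(lines, excluded=EXCLUDED):
    """Keep popmap rows (sample<TAB>population) whose sample is not excluded."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        sample = line.split("\t")[0].strip()
        if sample not in excluded:
            kept.append(line)
    return kept


# Made-up popmap rows; the population labels are placeholders.
popmap = ["LU700_F08\tLU\n", "LU700_F09\tLU\n", "EL960_F10\tEL\n", "EL960_F11\tEL\n"]
print("".join(filter_popmap(popmap)), end="")
```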


I then run populations on the cleaned set of samples:

```
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdataNONEG.txt  --vcf --structure --plink --treemix --max-obs-het 0.65 -r 0.75 -O allwithref
```
```
50706 variants remained
```

[output_files/standard_withref](output_files/standard_withref)



## Re-filtering populations individually

To do population-level analyses at the finest level of structure, we re-run some SNP filtering as below.

All of these extra files are in [output_files/subpopwithref](output_files/subpopwithref), along with the catalog reference.

### Extra attempt for BL/MC/TW/WH

I run one more round of SNP filtering focused on four localities that will be used to search for a gene of interest: BL, MC, TW and WH.

```bash
cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "BL|MC|TW|WH" | wc -l
#94 samples
cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "BL|MC|TW|WH" > popmap_allNOOUTLIERandlowdata_4sites.txt

mkdir 4sitesfilteringwithref
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_4sites.txt  --vcf --structure --plink --treemix --max-obs-het 0.65 -r 0.75 -O 4sitesfilteringwithref  # then filter it without

```


```
Removed 785512 loci that did not pass sample/population constraints from 811445 loci.
Kept 25933 loci, composed of 1597508 sites; 25294 of those sites were filtered, 38276 variant sites remained.
```
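The `grep -E "BL|MC|TW|WH"` filter matches those codes anywhere in a popmap line. A slightly stricter sketch keeps only rows whose sample name starts with one of the site codes. This assumes sample names begin with the site code, as in the names quoted earlier (for example MC950_F10); the example rows and population labels are made up.

```python
SITES = ("BL", "MC", "TW", "WH")


def subset_popmap(lines, sites=SITES):
    """Keep popmap rows whose sample name starts with one of the site codes."""
    return [line for line in lines if line.split("\t")[0].startswith(sites)]


# Made-up rows; sample names follow the pattern seen in this document.
popmap = ["MC950_F10\tMC\n", "LU700_F08\tLU\n", "TW900_F06\tTW\n", "EL960_F10\tEL\n"]
subset = subset_popmap(popmap)
print(len(subset), "samples")
```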

```bash
mkdir 4sitesfilteringwithrefonesnppersite
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_4sites.txt  --vcf --structure --plink --treemix --max-obs-het 0.65 -r 0.75 -O 4sitesfilteringwithrefonesnppersite --write-single-snp
```



```
Removed 785512 loci that did not pass sample/population constraints from 811445 loci.
Kept 25933 loci, composed of 1597508 sites; 25294 of those sites were filtered, 17687 variant sites remained.
```
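Comparing the two runs gives a rough idea of SNP density. Since `--write-single-snp` keeps only the first SNP of each locus, the variant count from that run should correspond to the number of loci carrying at least one retained SNP. A small sketch that pulls the numbers from the two summary lines above (the parsing helper `kept_and_variants` is hypothetical):

```python
import re

ALL_SNPS = ("Kept 25933 loci, composed of 1597508 sites; 25294 of those sites "
            "were filtered, 38276 variant sites remained.")
SINGLE_SNP = ("Kept 25933 loci, composed of 1597508 sites; 25294 of those sites "
              "were filtered, 17687 variant sites remained.")

PATTERN = re.compile(r"Kept (\d+) loci.*?(\d+) variant sites remained")


def kept_and_variants(line):
    """Return (loci kept, variant sites remaining) from a populations summary line."""
    match = PATTERN.search(line)
    return int(match.group(1)), int(match.group(2))


loci, all_variants = kept_and_variants(ALL_SNPS)
_, single_variants = kept_and_variants(SINGLE_SNP)

# With one SNP written per locus, the variant count equals the number of variable loci.
print(f"{single_variants / loci:.1%} of kept loci are variable")
print(f"{all_variants / single_variants:.2f} SNPs per variable locus on average")
```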




### Extra attempt for LU, SI

I run one more round of SNP filtering focused on four localities that will be used to search for a gene of interest: LU_F, LU_V, SI_F and SI_V.

```bash
cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "LU|SI" | wc -l
#46 samples
#rename manually the sample column

cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "LU|SI" > popmap_allNOOUTLIERandlowdata_LUSI.txt

mkdir LUSIfilteringwithrefonesnppersite
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_LUSI.txt  --vcf --structure --plink --genepop --treemix --max-obs-het 0.65 -r 0.75 --write-single-snp -O LUSIfilteringwithrefonesnppersite  # then filter it without
```

```
Removed 785601 loci that did not pass sample/population constraints from 811445 loci.
Kept 25844 loci, composed of 1593060 sites; 22977 of those sites were filtered, 12678 variant sites remained.
```

```bash
mkdir LUSIfilteringwithref
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_LUSI.txt  --vcf --structure --plink --genepop --treemix --max-obs-het 0.65 -r 0.75 -O LUSIfilteringwithref
```
```
Removed 785601 loci that did not pass sample/population constraints from 811445 loci.
Kept 25844 loci, composed of 1593060 sites; 22977 of those sites were filtered, 20300 variant sites remained.

```
### Extra attempt for EL, BO

I run one more round of SNP filtering focused on the EL and BO localities, which will be used to search for a gene of interest.

```bash
cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "EL|BO" | wc -l
#rename manually the sample columns
#46 samples
cat popmap_allNOOUTLIERandlowdata.txt | grep  -E "EL|BO" > popmap_allNOOUTLIERandlowdata_EL_BO.txt

mkdir ELBOfilteringwithref
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_EL_BO.txt  --vcf --structure --plink --genepop --treemix --max-obs-het 0.65 -r 0.75 -O ELBOfilteringwithref  # then filter it without
```

```
Kept 27605 loci, composed of 1700920 sites; 24817 of those sites were filtered, 22646 variant sites remained.
```

```bash
mkdir ELBOfilteringwithrefonesnppersite
populations -P output_refmap/ -M popmap_allNOOUTLIERandlowdata_EL_BO.txt  --vcf --structure --plink --genepop --treemix --max-obs-het 0.65 -r 0.75 -O ELBOfilteringwithrefonesnppersite --write-single-snp
```

```
Kept 27605 loci, composed of 1700920 sites; 24817 of those sites were filtered, 13696 variant sites remained.
```

