Gene exchange between two divergent species of the fungal human pathogen, Coccidioides

Table of Contents

1 About this document

The README.html file was exported from README.org, which is in org-mode format and can be opened with the emacs text editor. This document contains the code used to generate the figures and to calculate the statistics for the paper "Gene exchange between two divergent species of the fungal human pathogen, Coccidioides". This document is a literate programming document, which means that the code used to generate the figures was created from this document by "tangling" it. The document shows the code used to create the figures along with comments and annotations to enable the code to be interpreted.

To create the figures, run the file make-figures.R. The code requires that bedtools is installed and that some R packages are installed. See external dependencies for details.

Note that the statistics have the quotes from the paper that they were used for:

"This is an example of how a quote is formatted"

CSS stylesheet for the HTML file is Fabrice Niessen https://github.com/fniessen/org-html-themes

2 Data provided in Dryad submission

2.1 Meta data (cocci-metadata.csv)

The column "population" is a list of the geographical origin of the sample. "population2" is assigned based on the angsd analysis (Figure S3).

head(read.csv("cocci-metadata.csv"))
species sample population population2
immitis B0727_Argentina LAm CA/LAm
immitis Coahuila_1 LAm CA/LAm
immitis Guerrero_1 LAm CA/LAm
immitis Michoacan_2 LAm CA/LAm
immitis San_Diego_1_14 CA CA/LAm
immitis San_Diego_1_16 CA CA/LAm

2.2 GFF file

Downloaded annotations from Ensembl fungi. Filtered using AWK for gene annotations and renamed the chromosomes

ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/gff3/fungi_ascomycota1_collection/coccidioides_immitis_rs_gca_000149335

sed s/supercont/Supercontig_/ < \
    Coccidioides_immitis_rs.ASM14933v2.40.gff3 | \
    awk '$3==\"gene\" {print $0}' > \
    Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3

2.3 Int-HMM output

2.3.1 hmm_tables_united

These give the location of admixed tracks (tracts) for all individuals for a particular combination of reference genome, species, recombination rate, and marker frequency cutoff.

i_p_i_hmm_hap_pab_01_6_all_30_tracks translates to "introgression into immitis from posadasii with a 1e-6 recombination rate and a 30% frequency required for marker selection"

The files ending in _unftrk contain tracts defined without merging small tracks, etc. as defined in the paper.

2.3.2 hmm_tables_no_filter

These give the HMM output at a per marker level, giving the probability that the marker is shared or not shared between species.

2.3.3 angsd.Rdata

This is a workspace containing the results of the PCAngsd analysis. See below for description of the file and the code used to interpret it.

2.4 immitis.genome

immitis.genome is a file giving the chrom lengths of the contigs for the immitis reference.

2.5 data

This is an empty directory that intermediate files from the code are created in.

2.6 Neafsey_SupTable1.csv

3 External dependencies

3.1 R packages

install.packages(c("ggplot2", "ggthemes", "ggpubr", 
                   "stringr", "scales", "reshape2", "plyr",
                   "gggenes", "car", "readr", "broom"))

I used Vennerable to make the R package. I installed it according to this git repository.

https://github.com/js229/Vennerable

R.Version()
$platform
[1] "x86_64-apple-darwin15.6.0"

$arch
[1] "x86_64"

$os
[1] "darwin15.6.0"

$system
[1] "x86_64, darwin15.6.0"

$status
[1] ""

$major
[1] "3"

$minor
[1] "5.1"

$year
[1] "2018"

$month
[1] "07"

$day
[1] "02"

$`svn rev`
[1] "74947"

$language
[1] "R"

$version.string
[1] "R version 3.5.1 (2018-07-02)"

$nickname
[1] "Feather Spray"

3.2 Bedtools

bedtools --version
bedtools v2.27.1

4 Code ran on cluster

Int-HMM requires an SQL database and the SLURM scheduling engine to run, which means that it is difficult to create a reproducible document for running the code. In the sections I below, I document the steps and code used to generate the Int-HMM output.

4.1 Mapping

code-from-cluster/map.py is a script that writes sbatch files that will use BWA to map read, Picard to call duplicates, and GATK to realign indels. The output is a deduplicated, sorted, and realigned bam file with attendent index and the samtools idxstats file. The shell script code-from-cluster/run_map.sh calls map.py with the appropriate fastq locations and reference files. The output from run_map.sh is passed to sbatch:

sh run_map.sh | xargs -I {} sbatch {}

This code was used to map the data for use by angsd. A different script (with the same code and steps) was used to map the data for Int-HMM.

4.1.1 Command line arguments

python map.py --help
usage: map.py [-h] -f FASTQ [FASTQ ...] -g GENOME -s SBATCH -b BAM

Map fastq files, dedup, realign them

optional arguments:
  -h, --help            show this help message and exit
  -f FASTQ [FASTQ ...], --fastq FASTQ [FASTQ ...]
                        Folder(s) containing the fastq files. Expects pairs of
                        fastq files in the form: <SAMPLE>_<SRR-
                        number>_R1_001.fastq.gz <SAMPLE>_<SRR-
                        number>_R2_001.fastq.gz.
  -g GENOME, --genome GENOME
                        Reference fasta. Needs bwa index already built.
  -s SBATCH, --sbatch SBATCH
                        Folder to put sbatch scripts and output.
  -b BAM, --bam BAM     Folder to put bam output.

4.1.2 Summary of pipeline

The commands are run in a temporary file. The output is the BAM, bam index, and the stats file.

  1. cat :: If you supply multiple folders containing fastq files that have the same name, the script will cat them together before passing them to BWA
  2. bwa mem :: Maps the reads. This call is based on David Turissini's scripts.
    • -M " Mark shorter split hits as secondary (for Picard compatibility). "
    • -R '@RG\\tID:{samp_name}\\tSM:{samp_name}\\tPL:illumina\\tLB:lib1\\tPU:unit1' Assigns the read group tags.
    • -k 10 "Minimum seed length. Matches shorter than INT will be missed."
  3. samtools sort :: The output of BWA is piped directly to be sorted
  4. Picard/MarkDuplicates :: Mark the duplicates
  5. GATK/RealignerTargetCreator ::
  6. GATK/IndelRealigner ::
  7. samtools index/idxstats ::

4.2 RepeatMasker

Repeat masker was run with the options:

-pa 8
Uses 8 cores
-species "fungi"
Uses the fungal database
-gff
Emit a GFF file

I masked repetitive regions in angsd by converting the GFF output to BED format:

cd /pine/scr/m/a/maxwell3/2018-cocci-introgression/
bedtools merge -i repeats-posadasii/posadasii_rmscc_3488_ref.fasta.out.gff > \
         repeats-posadasii/posadasii_rmscc_3488_ref.fasta.out.bed
bedtools merge -i repeats-immitis/immitis_rs_ref.fasta.out.gff > \
         repeats-immitis/immitis_rs_ref.fasta.out.bed

And then using bedtools to take the complement:

bedtools complement -i immitis_rs_ref.fasta.out.bed -g imm_ref.genome | awk '{print $1"\t"$2+1"\t"$3}' > imm_not_variable.angsd
bedtools complement -i posadasii_rmscc_3488_ref.fasta.out.bed -g pos_ref.genome | awk '{print $1"\t"$2+1"\t"$3}' > pos_not_variable.angsd
./angsd sites index /pine/scr/m/a/maxwell3/2018-cocci-introgression/repeats-posadasii/pos_not_variable.angsd
./angsd sites index /pine/scr/m/a/maxwell3/2018-cocci-introgression/repeats-immitis/imm_not_variable.angsd

4.3 PCAngsd

I created a list of BAM files using the following code:

imm=/pine/scr/m/a/maxwell3/2018-cocci-introgression/immitis/imm_ref/
pos=/pine/scr/m/a/maxwell3/2018-cocci-introgression/posadasii/imm_ref/

ls $imm | grep "realigned\.bam$" | awk -v base="$imm" '{print base$0}' > imm.bamlist
ls $pos | grep "realigned\.bam$" | awk -v base="$pos" '{print base$0}' > pos.bamlist
cat imm.bamlist pos.bamlist > all.bamlist

I then used code-from-cluster/run_angsd_glf.sh and code-from-cluster/angsd-glf.py to calculate the genotype likelihoods using angsd (angsd version: 0.921-11-g20b0655 (htslib: 1.8-37-gcaed43e) build(Jul 3 2018 12:06:22)).

PCAngsd (version 0.95) was run using code-from-cluster/run_angsd_pca.sh and code-from-cluster/angsd-pca.py.

The script code-from-cluster/plot-angsd-pca-display.R was used to read in the MAP admixture and PCA results and create angsd.Rdata used below.

4.4 IntHMM

4.4.1 Find markers from VCF

The script code-from-cluster/get_haploid_site_all_introgress_pop_from_vcf.pl was used to find the markers for the HMM. See the comments in the code itself for how it was run.

4.4.2 Calculate admixture probabilities

The script code-from-cluster/hmm_introgress_all_haploid_master.pl was called multiple times to submit jobs to run the HMM itself. For example, this runs the HMM on immitis using the immitis reference, a sequencing error rate of 0.01, an assumed recombination rate of 1e-5, a marker cut-off of 0.30, and using the table sites_all_i_p_i_30.

cd /proj/matutelb/projects/programs/
module load perl
perl hmm_introgress_all_haploid_master.pl i_p_i immitis .01 1e-5 .30 sites_all_i_p_i_30

The table sites_all_p_i_i_30 was used for looking at introgression into posadasii with an immitis reference.

4.4.3 Assemble tracks

The script code-from-cluster/hmm_genotype_tracks_haploid_master_parallel.pl was called multiple times to join the admixed sites identified by the HMM into tracks. For example, this code assembles tracks for immitis using the immitis reference for a recombination rate of 1e-6.

cd /proj/matutelb/projects/programs/
module load perl
perl hmm_genotype_tracks_haploid_master_parallel.pl i_p_i 6_all_30 hap immitis immitis_unique_repeat_regions

4.4.4 Get tables from SQL database

I used the script code-from-cluster/get_hmm_tables.R to copy the tables from the SQL database into CSV files.

5 Figures

5.1 Setup code

This code reads in the metadata and loads packages that are shared across multiple pieces of code used to create the figures.

library(ggplot2)
library(ggthemes)
library(ggpubr)
library(stringr)
library(scales)
library(reshape2)
library(plyr)

meta <- read.csv("./cocci-metadata.csv")

## The number of individuals in each population
meta_pop_counts  <- count(meta, c("species", "population2"))
colnames(meta_pop_counts) <- c("species", "population2", "pop_size")

## Read in the GFF file and extract the gene info
all_genes <-
 readr::read_delim("Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3",
                      delim="\t",
                      col_names=c("supercontig", "X2", "X3", 
                                  "start", "stop", "score", 
                                  "strand", "foo", "info")) %>%
    subset(X3 == "gene") %>% 
    tidyr::extract(col="info", ## Extract gene info and description from the GFF
                   into=c("description", "gene_id"),
                   regex="description=(.*);gene_id=(.*);") 

5.2 Calculate coverage and regions

The goal of this is to calculate the allele frequency of introgressions both across whole species and within populations defined by angsd. The strategy is to create bedgraphs of introgressed regions for each species/population, then use bedtools unionbg to allow the bedgraphs to be compared.

Create BEDs out of the tracks by filtering for shared ancestry greater than 500bp with less than 5% repetitive sequence The files being read are for the immitis reference at 10^-6 recomb rate.

source("01-setup.R")

filter_tracks = function(track_file){
  track_file %>%
    subset(genotype == "homo_2") %>%  ## Admixed regions only
    subset(track_len > 500) %>% ## Require 500bp length
    subset(per_repeat < 5) ## Remove >5% repetitive
}

write_bed = function(track_file, out_name){
  track_file %>% 
  {.[,c("supercontig", "start", "end", "ind", "introgress_snps")]} %>%
    transform(strand = ".") %>%
    write.table(out_name, quote=FALSE, sep="\t",
                row.names=FALSE, col.names=FALSE)
}

## Create the bedfile for the location of the tracts for immitis
imm_tracks <-  read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>%
  filter_tracks()

## And for posadasii
pos_tracks  <- read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>%
  filter_tracks()

Concatenate the tracts from the two different species, adding a column giving which species it belongs to, then save a CSV file for the concatenated tracks. This is needed for Figure 5.

all_tracks <- list(posadasii=pos_tracks, immitis=imm_tracks) %>%
  ldply(.id="species", function(x) x)
write.csv(all_tracks, "data/all-tracks-recomb6-imm-ref.csv")

5.2.1 Write bedgraphs of introgressions at the species level

This creates bedgraphs of the coverage of the introgressed tracts across the species. Note that the -bga tag reports regions with zero coverage. This is needed for plotting the coverage using the geom_step() method in Figure 3.

 ## Write the bed files
 imm_tracks %>% write_bed("data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6")
 pos_tracks %>% write_bed("data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6")

 ## Sort the bed files
 system("bedtools sort -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6 > data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6")
 system("bedtools sort -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6 > data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6")
 system("rm data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6 data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6")
 ## Create bedgraphs of the coverage of introgressions
 system("bedtools genomecov -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 -bga -g immitis.genome > data/i_p_i_hmm_hap_pab_01_6-all.bg")
 system("bedtools genomecov -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 -bga -g immitis.genome > data/p_i_i_hmm_hap_pab_01_6-all.bg")

track_cov <- list(
  immitis = read.table("data/i_p_i_hmm_hap_pab_01_6-all.bg",col.names=c("supercontig", "start", "end", "cov")),
  posadasii = read.table("data/p_i_i_hmm_hap_pab_01_6-all.bg",col.names=c("supercontig", "start", "end", "cov"))) %>%
  ldply(.id="species", function(x) x) %>%
  transform(n_ind = ifelse(species=="immitis", 27, 51),
            bin = cut_width(start/1e6, 0.1))
write.csv(track_cov, "data/all-coverage-recomb6-imm-ref.csv")

This is what each bedgraph looks like

head -5 data/p_i_i_hmm_hap_pab_01_6-all.bg
Supercontig_3.1 0 37941 0
Supercontig_3.1 37941 38556 3
Supercontig_3.1 38556 38557 1
Supercontig_3.1 38557 39116 2
Supercontig_3.1 39116 39202 3

5.2.2 Write bedgraphs of introgressions at the population level

## For posadasii
for(pop in c("AZ", "Guatemala", "Mex/TX")){
  ## Merge with metadata, take subset by the population
  to_write = pos_tracks %>% 
    merge(meta, by.x="ind", by.y="sample") %>% 
    subset(population2 == pop)
  pop = gsub("/", "_", pop) # Allow pop names to be written
  ## Write a bed of the introgressions, sort it
  out_name = str_interp("data/p_i_i_hmm_hap_pab_01_6_all_30_tracks-${pop}.bed6")
  to_write %>% write_bed("data/tmp.bed6")
  system(str_interp("bedtools sort -i data/tmp.bed6 > ${out_name}"))
  system("rm data/tmp.bed6")
  ## Write a begraph of the introgressions
  out_name2 = str_interp("data/p_i_i_hmm_hap_pab_01_6-${pop}.bg")
  system(str_interp("bedtools genomecov -i ${out_name} -bg -g immitis.genome > ${out_name2}"))}

## For immitis
for(pop in c("CA/LAm", "WA")){
  to_write = imm_tracks %>% 
    merge(meta, by.x="ind", by.y="sample") %>% 
    subset(population2 == pop)
  pop = gsub("/", "_", pop)
  out_name = str_interp("data/i_p_i_hmm_hap_pab_01_6_all_30_tracks-${pop}.bed6")
  out_name2 = str_interp("data/i_p_i_hmm_hap_pab_01_6-${pop}.bg")
  to_write %>% write_bed("data/tmp.bed6")
  system(str_interp("bedtools sort -i data/tmp.bed6 > ${out_name}"))
  system("rm data/tmp.bed6")
  system(str_interp("bedtools genomecov -i ${out_name} -bg -g immitis.genome > ${out_name2}"))}

Take the union of all the bedgraphs across species and populations

system("bedtools unionbedg -header -g immitis.genome -i data/p_i_i_hmm_hap_pab_01_6-all.bg data/p_i_i_hmm_hap_pab_01_6-AZ.bg data/p_i_i_hmm_hap_pab_01_6-Guatemala.bg data/p_i_i_hmm_hap_pab_01_6-Mex_TX.bg -names all AZ Guatemala Mex_TX > data/p_i_i_populations_union.bg")
system("bedtools unionbedg -header -g immitis.genome -i data/i_p_i_hmm_hap_pab_01_6-all.bg data/i_p_i_hmm_hap_pab_01_6-CA_LAm.bg data/i_p_i_hmm_hap_pab_01_6-WA.bg -names all CA_LAm WA > data/i_p_i_populations_union.bg")

This is what the union bedgraphs look like.

head -5 data/i_p_i_populations_union.bg
chrom start end all CA_LAm WA
Supercontig_3.1 0 3120 0 0 0
Supercontig_3.1 3120 14924 4 0 4
Supercontig_3.1 14924 15976 6 1 5
Supercontig_3.1 15976 29398 0 0 0

5.2.3 Merge introgressed tracts to look at frequency

bedtools merge merges overlapping intervals into a single interval. To be "overlap", I'm requiring them to overlap by 250 bp (that's what the -d flag does). This is what the data looks like. The fourth column is the individual. The fifth is the number of introgressed snps (not used). The last is the strand.

head -5 data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6
Supercontig_3.1 37941 38556 Phoenix_8 3 0
Supercontig_3.1 37941 38556 B10757_Nevada 3 0
Supercontig_3.1 37941 38557 Tucson_17 4 0
Supercontig_3.1 38557 39202 B1249_Guatemala 5 0
Supercontig_3.1 38557 39202 B0858_Guatemala 5 0

The -o distinct flag applies the distinct operation to the column specified by -c. distinct prints a comma separated list.

system("bedtools merge -c 4 -d -250 -o distinct -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 > data/p_i_i_hmm_hap_pab_01_6_all_30_regions.bed6")
system("bedtools merge -c 4 -d -250 -o distinct -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 > data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6")
tail -3 data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6
Supercontig_3.6 239890 240423 SJV_10,SJV_7,SJV_8,WA_202,WA_205,WA_211,WA_212,Washington_1
Supercontig_3.6 3066444 3067046 SJV_11,San_Joaquin_Valley_11
Supercontig_3.6 3343338 3345375 B0727_Argentina,Michoacan_2,SJV_4,SJV_6,SJV_8

So the results give unique intervals shared across particular sets of individuals.

5.2.4 Calculate intersections of introgressions and genes

This finds genes that overlap with introgressions as defined above. The -F 0.1 tag requires that 10% of the gene's length be overlapped by the introgression.

system("bedtools intersect -a data/p_i_i_hmm_hap_pab_01_6_all_30_regions.bed6 -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/posadasii_introgressed_genes.bed")
system("bedtools intersect -a data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6 -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/immitis_introgressed_genes.bed")

5.3 Calculate marker frequency

The HMM computes the likelihood that a SNP is shared between species. This code creates a table of the frequency of markers called "shared" (homo_1) or "not shared" (homo_2) by the HMM.

source("01-setup.R")

table_regex="(.)_(.)_(.)_(.*)_hmm_hap_pab_01_([5679])_all_30\\.csv"

## hmm_tables_no_filter contains the output of the HMM assembled into
## tracts but without any filter on probability of p_homo_1 or
## p_homo_2
to_read <- str_match(dir("hmm_tables_no_filter"), table_regex) %>%
  as.data.frame() %>%
  subset(!is.na(V1))  %>% #Remove non-matching files
  transform(V6 = as.numeric(as.character(V6))) # V6 is the group for the recombination rate

colnames(to_read) <- c("table", "species", "src", "reference",
                       "sample", "recomb")
head(to_read)
table species src reference sample recomb
i_p_i_B0727_Argentina_hmm_hap_pab_01_5_all_30.csv i p i B0727_Argentina 5
i_p_i_B0727_Argentina_hmm_hap_pab_01_6_all_30.csv i p i B0727_Argentina 6
i_p_i_B0727_Argentina_hmm_hap_pab_01_7_all_30.csv i p i B0727_Argentina 7
i_p_i_B0727_Argentina_hmm_hap_pab_01_9_all_30.csv i p i B0727_Argentina 9
i_p_i_Coahuila_1_hmm_hap_pab_01_5_all_30.csv i p i Coahuila_1 5
i_p_i_Coahuila_1_hmm_hap_pab_01_6_all_30.csv i p i Coahuila_1 6
all_introgressed <- to_read %>%
  subset(recomb == 6) %>%  ## Restrict to only 10^-6 recombination rate
  subset(reference == "i") %>% ## Restrict to only the immitis reference
  ddply(.(table, species, src, sample, reference, recomb), ## Read in the data, keeping track of meta data
        with,
        read.csv(file.path("hmm_tables_no_filter", table))[,c("supercontig","pos","p_homo_1", "p_homo_2", "genotype")],
        .progress="text") %>%
  plyr::mutate(species = ifelse(species == "i", "immitis", "posadasii")) ## Add column for species

 ## This command counts the instances of a marker being assigned a
 ## particular genotype for each strain, etc.
marker_freq <- dcast(all_introgressed,
                   pos+supercontig+recomb+reference+src+species~genotype)

## Scale the marker frequency for the two genotypes by the species
## population size
marker_freq$pop_size = ifelse(marker_freq$species == "immitis", 27, 51) 
marker_freq$homo_1 <- marker_freq$homo_1/marker_freq$pop_size
marker_freq$homo_2 <- marker_freq$homo_2/marker_freq$pop_size
head(marker_freq)
pos supercontig recomb reference src species homo_1 homo_2 pop_size
3120 Supercontig_3.1 6 i p immitis 0.555555555555556 0.148148148148148 27
5987 Supercontig_3.2 6 i p immitis 0.666666666666667 0.037037037037037 27
5998 Supercontig_3.2 6 i p immitis 0.703703703703704 0 27
5999 Supercontig_3.2 6 i p immitis 0.703703703703704 0 27
8821 Supercontig_3.3 6 i p immitis 0.555555555555556 0.296296296296296 27
11476 Supercontig_3.4 6 i i posadasii 0.117647058823529 0.0196078431372549 51

Since the markers were chosend such that there must be a 30% difference in frequency between the species, there are (almost) no markers that are assigned "introgressed" with greater than 70% frequency. I'm not sure what's with the three markers with greater than 70% frequency, but I'm ignoring them because they are a very small portion of the total and they are all close to 70%.

summary(marker_freq$homo_2 >= 0.7)
   Mode   FALSE    TRUE 
logical  561787       3
## Melt the site frequency so that the "variable" is the genotype and
## the "value" is the frequency
marker_freq2 = melt(marker_freq, id.vars=c("pos", "supercontig", "recomb", 
                                       "reference", "src", "pop_size", 
                                       "species"))
save(marker_freq2, file="data/marker-freq-recomb6-imm-ref.Rdata")

In the table, 'variable' is the genotype assigned by the HMM and 'value' is the frequency of markers assigned that value in the

head(marker_freq2)
pos supercontig recomb reference src pop_size species variable value
3120 Supercontig_3.1 6 i p 27 immitis homo_1 0.555555555555556
5987 Supercontig_3.2 6 i p 27 immitis homo_1 0.666666666666667
5998 Supercontig_3.2 6 i p 27 immitis homo_1 0.703703703703704
5999 Supercontig_3.2 6 i p 27 immitis homo_1 0.703703703703704
8821 Supercontig_3.3 6 i p 27 immitis homo_1 0.555555555555556
11476 Supercontig_3.4 6 i i 51 posadasii homo_1 0.117647058823529

5.4 Introgressed genes

source("01-setup.R")

The strategy here is to use the coverage created by the bedgraphs to find genes that were introgressed. There are two bedgraph files created using the bedtools unionbg command:

head -3 data/p_i_i_populations_union.bg
chrom	start	end	all	AZ	Guatemala	Mex_TX
Supercontig_3.1	0	37941	0	0	0	0
Supercontig_3.1	37941	38556	3	3	0	0

5.4.1 Intersect bedgraph unions with GFF files

Intersect the coverage with genes, requiring the gene to be 10% covered by the bedgraph line. Since the bedgraph has line with no introgression coverage, this contains some genes with no coverage The result has the same columns as the bedgraph and the GFF concatenated.

system("bedtools intersect -a data/i_p_i_populations_union.bg -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/immitis_introgressed_genes.bed")
system("bedtools intersect -a data/p_i_i_populations_union.bg -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/posadasii_introgressed_genes.bed")

5.4.2 Read and process the lists of introgressed genes

library(readr)
imm_genes =read_delim("data/immitis_introgressed_genes.bed", "\t",
                      col_names = c("supercontig", "cov_start", "cov_end", "species", "CA/LAm", "WA",
                                    "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>%
  # Extract the gene info from the GFF info string
  tidyr::extract(col="gene_info", into=c("description", "gene_id"), 
                 regex="description=(.*);gene_id=(.*);") %>%
  # Restrict to relevant columns
  {.[,c("supercontig", "cov_start", "cov_end", "species", 
        "CA/LAm", "WA","gene_id", "description")]} %>%
  # Melt to put the population counts (e.g "species", "CA/LAm", etc.) in "variable" columne
  melt(id.vars=c("supercontig", "cov_start", 
                 "cov_end","gene_id", "description")) %>%
  # Convert the variable column for merging
  mutate(variable = as.character(variable)) %>%
  # Edit the variable to annotate with species name
  {.$variable[.$variable == "species"] = "immitis"; .} 

## Same steps are repeated for the posadasii data. 
## Probably should have written a function or something
pos_genes =read_delim("data/posadasii_introgressed_genes.bed", "\t",
                      col_names = c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX",
                                    "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>%
  tidyr::extract(col="gene_info", into=c("description", "gene_id"),
                 regex="description=(.*);gene_id=(.*);") %>%
                 {.[,c("supercontig", "cov_start", "cov_end", "species", 
                       "AZ", "Guatemala", "Mex/TX","gene_id", "description")]} %>%
  melt(id.vars=c("supercontig", "cov_start", "cov_end", "gene_id", "description"))%>%
  mutate(variable = as.character(variable)) %>%
  {.$variable[.$variable == "species"] = "posadasii"; .}

## Create a df with the counts of each population and species
pop_counts = rbind(count(meta, "population2"), 
                   data.frame(population2=c("immitis", "posadasii"), 
                              freq=c(27,51)))
colnames(pop_counts) = c("population", "population_count")

## Concatenate the imm and pos data, then annotate with the population counts
gene_introgression = rbind(imm_genes, pos_genes)
colnames(gene_introgression)[c(6,7)]=c("population", "introgression_count")
gene_introgression = merge(gene_introgression, pop_counts)


## Same steps are repeated for the posadasii data. 
## Probably should have written a function or something
pos_genes =read_delim("data/posadasii_introgressed_genes.bed", "\t",
                      col_names = c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX",
                                    "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>%
  tidyr::extract(col="gene_info", into=c("description", "gene_id"),
                 regex="description=(.*);gene_id=(.*);") %>%
                 {.[,c("supercontig", "cov_start", "cov_end", "species", 
                       "AZ", "Guatemala", "Mex/TX","gene_id", "description")]} %>%
  melt(id.vars=c("supercontig", "cov_start", "cov_end", "gene_id", "description"))%>%
  mutate(variable = as.character(variable)) %>%
  {.$variable[.$variable == "species"] = "posadasii"; .}

## Create a df with the counts of each population and species
pop_counts = rbind(count(meta, "population2"), 
                   data.frame(population2=c("immitis", "posadasii"), 
                              freq=c(27,51)))
colnames(pop_counts) = c("population", "population_count")

## Concatenate the imm and pos data, then annotate with the population counts
gene_introgression = rbind(imm_genes, pos_genes)
colnames(gene_introgression)[c(6,7)]=c("population", "introgression_count")
gene_introgression = merge(gene_introgression, pop_counts)
write.csv(gene_introgression, "data/gene_introgression.csv")
head(gene_introgression)
population supercontig cov_start cov_end gene_id description introgression_count population_count
AZ Supercontig_3.1 0 37941 CIMG_12556 hypothetical protein 0 36
AZ Supercontig_3.1 0 37941 CIMG_12557 hypothetical protein 0 36
AZ Supercontig_3.1 0 37941 CIMG_12558 hypothetical protein 0 36
AZ Supercontig_3.1 0 37941 CIMG_12559 hypothetical protein 0 36
AZ Supercontig_3.1 0 37941 CIMG_12560 hypothetical protein 0 36
AZ Supercontig_3.1 0 37941 CIMG_12561 hypothetical protein 0 36

5.5 Figure 2 - Histograms of size of shared ancestry/introgressions

Figure 2 shows the size of shared and introgressed haplotypes calculated from the Int-HMM output.

source("01-setup.R")

all_shared <- list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"),
                   posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>%
  ldply(.id="species", function(x) x) %>%
  subset(genotype == "homo_2") %>% ## "homo_2" = "shared"
  subset(per_repeat < 5) ## Filter to less than 5% repetitive

"We characterized the general characteristics of shared ancestry between species. The majority of shared haplotypes are small. The average size of the shared haplotype in C. immitis is 64 bp (range = 1-63,135bp) and 46bp in C. posadasii (range 1-36,960) using 10-6 M/bp as the recombination rate."

all_shared %>% 
    ddply(.(species), plyr::summarize, 
          mean_size = mean(track_len),
          max_size = max(track_len),
          min_size = min(track_len))
species mean_size max_size min_size
immitis 64.86435722377 63135 1
posadasii 46.1725482371125 36960 1

"The average size of a C. immitis to C. posadasii introgression was 3.4 kb. In the reciprocal direction, C. posadasii to C. immitis, the average size of an introgression was 4.2 kb."

introgression_mean_size <- all_shared %>% 
  subset(track_len > 500) %>% 
  ddply(.(species), 
        plyr::summarize, mean_size = mean(track_len)) %>%
  transform(species = paste("C.", species))
species mean_size
C. immitis 4193.22263222632
C. posadasii 3398.53901373283
## Plot of the size of shared ancestry tracts
all_shared_plot <- all_shared %>%
  transform(species = paste("C.", species)) %>%
  ggplot(aes(x=log10(track_len)))+
  geom_histogram(bins=15, fill=NA, col="black")+
  facet_wrap(~species,scale="free_y")+theme_base()+
  theme(plot.background=element_rect(fill="white", colour=NA),
        strip.text = element_text(face="italic"))+
  xlab("log10 shared haplotype size (bp)")+
  scale_y_continuous(labels = function(x) format(x, scientific = TRUE))

## Plot size of introgressed tracts
introgression_size_plot <- all_shared %>%
  subset( track_len > 500) %>% ## This filters to 500bp
  transform(species = paste("C.", species)) %>%
  ggplot(aes(x=track_len))+
  geom_histogram(bins=15, fill=NA, col="black")+
  facet_wrap(~species,scale="free_y")+theme_base()+
  theme(plot.background=element_rect(fill="white", colour=NA),
        strip.text = element_text(face="italic"))+
  geom_vline(data=introgression_mean_size,
             aes(xintercept=mean_size),
             lty=3, col="blue")+
  xlab("Introgressed haplotype size (bp)")

ggsave("Figure2.pdf",
       ggarrange(all_shared_plot, 
                 introgression_size_plot, 
                 nrow=2, labels=c("A", "B")),
       height=7, width=7)

Figure2.pdf

5.6 Figure 3 - Location and frequency of introgressions

source("01-setup.R")

track_cov=read.csv("data/all-coverage-recomb6-imm-ref.csv", row.names=1) %>%
  transform(track_len = end-start)

tracks =
  track_cov %>%
  subset(cov != 0)

"Consistent with our expectations, the majority of introgressions are at low frequency (mean allele frequency 14.1% and 16.9% for C. immitis and C. posadasii, respectively; Figure 3), however we did find some introgressions at high frequency (see below). There is not a significant difference in the mean introgressed allele frequency between the species (Permutation test; P = 0.0899)."

The allele frequency of introgressions was weighted by the size of the introgression.

# Weight by the size of the introgression to compute the means
track_stats = tracks %>%
  ddply(.(species), 
        plyr::summarize,
        mean = weighted.mean(cov/n_ind, track_len)) %>%
  transform(species=paste("C.", species))
species mean
C. immitis 0.14145629036292
C. posadasii 0.169600587482152

Tested significance of the difference in allele frequency by permutation. Note that since it's stochastic that the P-value varies slightly from what is reported in the paper.

randomize_species = function(x){
  x$species = sample(x$species, nrow(x), replace=TRUE)
  x}
samples_species = rdply(.n=10000, 
      .expr = {
        randomize_species(tracks) %>% 
          ddply(.(species), 
                plyr::summarize,
                mean = weighted.mean(cov/n_ind, track_len))},
      .progress = "text") %>%
  dcast(.n~species)

the_ecdf = ecdf(samples_species$immitis - samples_species$posadasii)
the_ecdf(0.1414563 - 0.1696006) # see allele freqs above
0.0918
track_loc <- track_cov %>%
  transform(supercontig=gsub("Supercontig_", "", supercontig)) %>%
  transform(species=paste("C.", species)) %>% 
  ggplot(aes(x=start/1e6, y=cov/n_ind))+
  geom_step(size=0.3, alpha=0.5)+
  facet_grid(species~supercontig,
             space="free_x",scale="free_x")+
  theme_base(14)+xlab("Position (Mb)")+
  ylab("Introgressed region frequency")+
  ylim(0,1.05)+geom_hline(data=track_stats, aes(yintercept=mean), col="blue", lty=3)+
  theme(plot.background=element_rect(fill="white", colour=NA),
        strip.text.y=element_text(face="italic"))

track_freq = track_cov %>%
  transform(supercontig=gsub("Supercontig_", "", supercontig)) %>%
  transform(species=paste("C.", species)) %>%
  ggplot(aes(cov/n_ind))+
  facet_grid(species~"")+
  geom_histogram(bins=30, aes(y=..count../100))+coord_flip()+
  theme_base(14)+ylab("Count (x100)")+xlim(0,1.05)+
  xlab("")+geom_vline(data=track_stats, aes(xintercept=mean), col="blue", lty=3)+
  theme(plot.background=element_rect(fill="white", colour=NA),
        strip.text.y=element_text(face="italic"))
track_plot <- cowplot::plot_grid(track_loc, track_freq, ncol=2,
                                 rel_widths=c(1,0.3),labels=c("A", "B"))+
  theme(plot.background=element_rect(fill="white", colour=NA))
ggsave("Figure3.pdf", track_plot, height=3.5, width=10)

Figure3.pdf

5.7 Figure 4 - Example of an introgression

This code defines a function that uses the data from the HMM to produce a summary graphic of a region around a gene of interest. I chose this one because it is a fixed introgression. We can call it fixed because the HMM extends the introgression tracts to overlap at the gene.

source("01-setup.R")
library(gggenes)


all_tracks = read.csv("data/all-tracks-recomb6-imm-ref.csv", row.names=1)
load("data/marker-freq-recomb6-imm-ref.Rdata")
track_cov = read.csv("data/all-coverage-recomb6-imm-ref.csv", row.names=1) %>% transform(freq = cov/n_ind)

plot_region = function(st = 101423,
                       en = 115414,
                       flank = 0,
                       contig = "Supercontig_3.2",
                       sp = "posadasii",
                       GENE = "CIMG_09822",
                       tc = track_cov,
                       sf = marker_freq2,
                       the_meta = meta,
                       genes = all_genes,
                       the_out = "tmp.pdf"){
  high_freq_genes = subset(genes, 
                           (start > st - flank) & 
                            (stop < en + flank) & 
                             (supercontig == contig))
  high_freq_genes = subset(genes, 
                           (start > st - flank) & 
                             (stop < en + flank) & 
                             (supercontig == contig))
  high_freq_genes$direction <- ifelse(high_freq_genes$strand == "+", 1, -1)
  high_freq_region_bg = subset(tc, 
                               (start > st - flank) & 
                                 (end < en + flank) & 
                                 (species == sp) & 
                                 (supercontig == contig))
  high_freq_sites = subset(sf, (pos > st - flank) &
                             (pos < en + flank) & 
                             (species == sp) & 
                             (supercontig == contig)) %>%
    transform(variable = factor(variable, 
                                levels=c("homo_1", "homo_2"),
                                labels=c("not shared", "shared")))
  sample_order = the_meta %>% 
    subset(species == sp) %>% 
    with(as.character(sample)[order(as.character(population2, population))])
  high_freq_region = subset(all_tracks, 
                            (start > st - flank) & 
                              (end < en + flank) & 
                              (species == sp) & 
                              (supercontig == contig)) %>% 
    subset(select = c(ind, start, end))
  not_in_data = sample_order[!(sample_order %in% high_freq_region$ind)]
  if( length(not_in_data) > 0){
    high_freq_region = rbind(high_freq_region, data.frame(ind = not_in_data, start=NA, end=NA))
  }
  high_freq_region = high_freq_region %>%
    merge(the_meta, by.x="ind", by.y="sample") %>%
    transform(ind = factor(as.character(ind), 
                           levels=sample_order))
  #high_freq_genes = subset(all_genes, 
  #                         (start > (st - flank)) & 
  #                         (stop < (en + flank)) & 
  #                         (supercontig == contig))
  plot_coords = subset(genes, gene_id == GENE) %>% 
  {.[1,c("start", "stop")]} %>% 
    as.numeric()
  high_1 = high_freq_region %>%
    ggplot(aes(x=start, xend=end,y=ind,yend=ind,col=population2))+
    geom_segment(size=1)+
    ylab("Isolate")+
    xlab("Supercontig 3.2 coordinate")+
    xlim(st-flank,en+flank)+
    theme_base()+
    facet_grid(population2~., scale="free_y", space="free_y")+
    geom_vline(xintercept=plot_coords, lty=3)+
    scale_color_discrete("")+
    theme(plot.background=element_rect(fill="white", colour=NA),
          axis.text.y = element_text(size=6),
          legend.position ="bottom",
          strip.background =element_blank(),
          strip.text=element_blank(),
          panel.grid.major.y=element_line(size=0.1,color="grey70", linetype="dotted"))
  high_2 = high_freq_region_bg %>%
    ggplot(aes(x=start,y=cov/n_ind))+geom_step()+
    ylab("Introgressed region\nfrequency")+
    xlab("")+
    xlim(st-flank,en+flank)+theme_base()+
    geom_vline(xintercept=plot_coords, lty=3)+
    theme(plot.background=element_rect(fill="white", colour=NA))
  high_3 = high_freq_sites %>%
    ggplot(aes(x=pos,y=value,col=variable))+geom_point(alpha=0.5)+
    ylab("Marker\nfrequency")+
    xlab("")+scale_color_manual("Inferred genotype", values=c("blue", "darkorange"))+
    xlim(st-flank,en+flank)+theme_base()+theme(legend.position="top")+
    geom_vline(xintercept=plot_coords, lty=3)+
    theme(plot.background=element_rect(fill="white", colour=NA),
          strip.text.x = element_text(size = 12))
  high_4 = ggplot(high_freq_genes, 
                  aes(xmin=start,xmax=stop,x=start+(stop-start)/2,y=direction,
                      forward=direction,
                      label=gene_id))+
    geom_gene_arrow(arrowhead_height = unit(3, "mm"), arrowhead_width = unit(1, "mm"))+
    theme_base()+geom_text(aes(y=direction+0.5), size=1.75)+xlim(st-flank,en+flank)+
    geom_vline(xintercept=plot_coords, lty=3)+
    theme(plot.background=element_rect(fill="white", colour=NA),
          legend.pos="bottom",
          axis.text.y=element_blank())+xlab("")+ylab("")+ylim(-1.5,1.5)
  high_p <- cowplot::plot_grid(high_3, high_2, high_4, high_1 , nrow=4,
                               rel_heights=c(0.5, 0.3, 0.3, 1),
                               align="vh", axis="l",
                               labels = c("A", "B","C","D"))
  ggsave(the_out, high_p, height=9, width=6)}



plot_region(st=2945000,en=2945000,flank=20000,
            contig="Supercontig_3.5", sp="posadasii",GENE="CIMG_09462",
            the_out="Figure4.pdf")

Figure4.pdf

5.8 Figure S2 - Principle components analysis

We ran the program PCAngsd on our cluster

source("01-setup.R")
## This was created by plot-angsd-pca-display.R

load("angsd.Rdata")

The following are in the Rdata file:

  • PCA_X = results of the PCA
  • admix_X = results of the admixture mapping
  • read_admix and read_pca were the functions used to read in the data
  • meta, meta_imm, meta_pos are metadata files
ls()
 [1] "admix_both"      "admix_imm"       "admix_pos"       "all_genes"      
 [5] "meta"            "meta_imm"        "meta_pop_counts" "meta_pos"       
 [9] "PCA_both"        "PCA_imm"         "PCA_pos"         "read_admix"     
[13] "read_pca"

Each PCA object contains:

  • $PCA a data.frame with the first four PCs and the metadata added
  • $values a vector with the eigenvectors of the SVD
str(PCA_both)
List of 2
 $ PCA   :'data.frame':	78 obs. of  8 variables:
  ..$ PC1        : num [1:78] -0.155 -0.153 -0.154 -0.155 -0.155 ...
  ..$ PC2        : num [1:78] -0.006966 -0.000137 0.001286 0.006966 0.001831 ...
  ..$ PC3        : num [1:78] 0.00147 -0.00141 0.00324 -0.00115 -0.00553 ...
  ..$ PC4        : num [1:78] 0.025 -0.113 -0.109 -0.129 -0.146 ...
  ..$ species    : Factor w/ 2 levels "immitis","posadasii": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ sample     : Factor w/ 78 levels "730332_Guatemala",..: 4 10 15 17 30 31 32 33 34 35 ...
  ..$ population : Factor w/ 6 levels "AZ","CA","CO/NV",..: 4 4 4 4 2 2 2 2 2 2 ...
  ..$ population2: Factor w/ 5 levels "AZ","CA/LAm",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ values: num [1:78] 73.73 4.22 3.44 2.79 1.33 ...
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
colors=gg_color_hue(6)
names(colors) = levels(PCA_both$PCA$population)

## Calculate the percent variance explained for each eigenvector
## Sum across the eigenvalues and then normalize.
both_PC1 = percent(PCA_both$values[1]/sum(PCA_both$values))
both_PC2 = percent(PCA_both$values[2]/sum(PCA_both$values))
imm_PC1 = percent(PCA_imm$values[1]/sum(PCA_imm$values))
imm_PC2 = percent(PCA_imm$values[2]/sum(PCA_imm$values))
pos_PC1 = percent(PCA_pos$values[1]/sum(PCA_pos$values))
pos_PC2 = percent(PCA_pos$values[2]/sum(PCA_pos$values))

pca_1 <- ggplot(PCA_both$PCA,
                aes(x=PC1, y=PC2, col=population, pch=species))+
  geom_point(size=2, alpha=0.75)+
  scale_shape_manual(values=c(1,17))+
  theme_base()+
  theme(plot.background=element_rect(fill="white", colour=NA))+
  scale_color_manual("location", values=colors)+
  xlab(str_interp("PC1 (${both_PC1})"))+
  ylab(str_interp("PC2 (${both_PC2})"))
pca_2 <- ggplot(PCA_imm$PCA,
                aes(x=PC1, y=PC2, col=population, pch=species))+
  geom_point(size=2, alpha=0.75)+
  scale_shape_manual(values=c(1))+
  theme_base()+
  theme(plot.background=element_rect(fill="white", colour=NA))+
  theme(legend.position = "none")+
  scale_color_manual("location", values=colors[levels(PCA_imm$PCA$population)])+
  xlab(str_interp("PC1 (${imm_PC1})"))+
  ylab(str_interp("PC2 (${imm_PC2})"))
pca_3 <- ggplot(PCA_pos$PCA,
                aes(x=PC1, y=PC2, col=population, pch=species))+
  geom_point(size=2, alpha=0.75)+
  scale_shape_manual(values=c(17))+
  theme_base()+
  theme(plot.background=element_rect(fill="white", colour=NA))+
  theme(legend.position = "none")+
  scale_color_manual("location", values=colors[levels(PCA_pos$PCA$population)])+
  xlab(str_interp("PC1 (${pos_PC1})"))+
  ylab(str_interp("PC1 (${pos_PC2})"))

out <- ggarrange(pca_1,
                 ggarrange(pca_2, pca_3, nrow=2, labels=c("B", "C")),
                 ncol = 2, labels="A",
                 common.legend=TRUE,
                 legend="right", widths=c(1.75,1))
ggsave("FigureS2.pdf",out,width=9,height=5)

5.9 Figure S3 - Admixture plots

source("01-setup.R")
## This was created by plot-angsd-pca-display.R

load("angsd.Rdata")

$K is the name of the eigenvector $value is the percentage of the genome explained by that eigenvector

str(admix_both)
'data.frame':	390 obs. of  6 variables:
 $ sample     : Factor w/ 78 levels "Phoenix_1","Phoenix_2",..: 59 59 59 59 59 60 60 60 60 60 ...
 $ K          : Factor w/ 5 levels "K 1","K 2","K 3",..: 1 4 3 2 5 1 5 3 2 4 ...
 $ value      : num  9.99e-05 9.99e-05 2.73e-04 2.77e-04 9.99e-01 ...
 $ species    : Factor w/ 2 levels "immitis","posadasii": 2 2 2 2 2 2 2 2 2 2 ...
 $ population : Factor w/ 6 levels "AZ","CA","CO/NV",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ population2: Factor w/ 5 levels "AZ","CA/LAm",..: 3 3 3 3 3 3 3 3 3 3 ...
get_max_ordering <- function(x){
  rankings = tapply(x$value, x$K, max) # Find the maximum explained in an individual by that K
  order_order = names(rankings)[order(rankings)] # Order the Ks by the max
  x_cast = dcast(x, sample~K, value.var="value") # data.frame of samples x K values
  the_order = splat(order)( x_cast[,rev(order_order)] ) # pass the order function each of the K value columns to get the ordering
  x_cast$sample[the_order] # Return samples in order of the K values
  }

sample_labels2 = unlist(dlply(admix_both, .(population2), get_max_ordering))

out2 <- admix_both %>%
  # This puts the factors in the right order, labels the species correctly
  # and puts the K values in a pleasing order.
  transform(population2 = factor(population2,
                                 levels=c("AZ","CA/LAm","Guatemala","Mex/TX","WA"),
                                 labels=c("AZ","CA/LAm","Guat.","Mexico/TX","WA")),
            species=factor(species, levels=c("immitis", "posadasii"),
                           labels=c("imm.", "pos.")),
            sample = factor(sample, levels=sample_labels2),
            K=factor(K, levels=paste("K", c(3,4,1,5,2)))) %>% 
  ggplot(aes(x=sample, y=value,fill=K))+
  theme_bw()+geom_bar(stat="identity")+coord_flip()+
  scale_fill_discrete("")+
  facet_grid(species+population2~.,
             scale="free_y", space="free_y")+
  theme_base(13)+theme(plot.background=element_rect(fill="white", colour=NA),
                       strip.background=element_rect(fill="white", colour="black"))+
  xlab("")+ylab("% admixture")
ggsave("FigureS3.pdf", out2, width=7, height=12)

5.10 Figure S4 - Compare effect of recombination rate on introgression size, etc.

This code reads in the admixture tracks for each individual and recombination rate.

source("01-setup.R")

## Read in all the recombination rates 
read_string = "hmm_tables_united/${prefix}_hmm_hap_pab_01_${recomb}_all_30_tracks.csv"
to_read <- expand.grid(prefix=c("p_i_i", "i_p_i"),
                       recomb=c(5,6,7,9))
tracks <- ddply(to_read, .(prefix, recomb), with,
                read.csv(str_interp(read_string)))

## Summarize mean length, count, and total length of introgressions by
## recombination rate
track_summaries <- tracks %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat  < 5) %>% 
    ddply(.(ind, prefix, recomb),
          plyr::summarize,
          avg_len_introgression = mean(track_len),
          count_introgression = length(track_len),
          sum_introgression = sum(track_len))
track_summaries2 <- merge(track_summaries, meta, by.x="ind", by.y="sample")

## Numbers are for the immitis reference
imm_ref = sum(c(8482323,5260912,4323945,3952524,3469364,3458857))
track_summaries2$percent_genome = with(track_summaries2,
                                       100*sum_introgression/imm_ref)

## Transform and re-annotate the levels with nice names
comb_melt <- melt(track_summaries2,
                  measure.vars=c("avg_len_introgression",
                                 "count_introgression",
                                 "percent_genome")) %>%
    transform(variable = factor(variable,
                                levels=c("percent_genome",
                                         "avg_len_introgression",
                                         "count_introgression"),
                                labels=c("Introgressed\ngenome (%)",
                                         "Average introgression\ntract length (bp)",
                                         "Number of\nintrogression tracts")))

Plot the statistics.

compare_by_recomb <- comb_melt %>%
    subset(recomb != 9) %>% # Exclude 10^-9 recomb rate. Not analyzed in paper.
    transform( recomb = paste0("10^-", recomb)) %>%
  ggplot(aes(x=recomb,
             y=value,
             group=ind))+
  geom_line(alpha=0.5)+geom_point(alpha=0.5)+
  facet_grid(variable~species, scale="free_y")+theme_bw()+
  xlab("recomb. rate (M/bp)")+theme_base()+ylab("")
ggsave("FigureS4.pdf", compare_by_recomb, height=7, width=7)

5.11 Figure S5 - Plot jaccard stat for different recomb. rates

The goal of this plot is to look at how similar the introgressions identified by the HMM at different rates of assumed recombination. The method is to compute the jaccard statistic for each individual for the three different comparisons of recombination rate and to vary the percent of overlap required for the features to count as "the same"

source("01-setup.R")

## A list of individuals for each species
imm_indiv = as.character(unique(read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_5_all_30_tracks.csv")$ind))
pos_indiv = as.character(unique(read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_5_all_30_tracks.csv")$ind))

## Read in the tracks for each recombination rate. Restrict to introgressions.
imm_data5 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_5_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)
imm_data6 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)
imm_data7 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_7_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)

pos_data5 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_5_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)
pos_data6 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)
pos_data7 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_7_all_30_tracks.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    subset(per_repeat < 5)

This uses the bedtools jaccard function to calculate the jaccard stat. The flag -r requires a reciprocal overlap based on the percent overlap required by -f.

## write a bed file based on the track files, then sort it
write_bed = function(df, n){
    df %>%
        {.[,c("supercontig", "start", "end", "ind")]} %>%
        write.table("tmp.bed",quote=FALSE, sep="\t",
                    row.names=FALSE, col.names=FALSE)
    system(str_interp("bedtools sort -i tmp.bed > ${n}"))
    system("rm tmp.bed")
}

## Calculate the jaccard stat between two track data frames
calc_jacc = function(A,B,f){
    write_bed(A, "data/A.bed")
    write_bed(B, "data/B.bed")
    read.table(text=system(str_interp("bedtools jaccard -r -f ${f} -a data/A.bed -b data/B.bed"), intern=T),header=T,sep="\t")
}

Make data frame that contain the jaccard statistic between each of the recombination rates for each of the individuals.

## C. immitis
jacc_imm56 = expand.grid(ind=imm_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(imm_data5, ind == x$ind)
        B = subset(imm_data6, ind == x$ind)
        calc_jacc(A,B,x$f)
    }, .inform=T)
jacc_imm67 = expand.grid(ind=imm_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(imm_data7, ind == x$ind)
        B = subset(imm_data6, ind == x$ind)
        calc_jacc(A,B,x$f)
    }, .inform=T)
jacc_imm57 = expand.grid(ind=imm_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(imm_data5, ind == x$ind)
        B = subset(imm_data7, ind == x$ind)
        calc_jacc(A,B,x$f)
    }, .inform=T)
## C. posadasii
jacc_pos56 = expand.grid(ind=pos_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(pos_data5, ind == x$ind)
        B = subset(pos_data6, ind == x$ind)
        calc_jacc(A,B,x$f)
    }, .inform=T)
jacc_pos67 = expand.grid(ind=pos_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(pos_data7, ind == x$ind)
        B = subset(pos_data6, ind == x$ind)
        calc_jacc(A,B,x$f)
    }, .inform=T)
jacc_pos57 = expand.grid(ind=pos_indiv,
                         f=c(seq(0.001, 0.95, by=0.1),1)) %>%
    ddply(.(ind, f), function(x){
        A = subset(pos_data5, ind == x$ind)
        B = subset(pos_data7, ind == x$ind)
        calc_jacc(A,B,x$f)
        }, .inform=T)
plot_jacc = function(d){
    for_plot = d
    for_text = subset(transform(d,x=0), (jaccard < 0.75) & (f == 0.001))
    ggplot(for_plot,
           aes(x=f,y=jaccard,group=ind))+
        geom_line(alpha=0.5)+
        geom_text(data=for_text,
                  aes(x=x,y=jaccard,label=ind),size=2)+
        ylim(0,1)+theme_base()+xlab("% overlap required")+
        xlim(-0.2,1)+ylab("jaccard stat")+
        theme(plot.background = element_blank())}

jaccard_plot = ggpubr::ggarrange(
            plot_jacc(jacc_imm56), plot_jacc(jacc_imm67), plot_jacc(jacc_imm57),
            plot_jacc(jacc_pos56), plot_jacc(jacc_pos67), plot_jacc(jacc_pos57),
            labels=c("A", "B", "C", "D", "E", "F"))

ggsave("FigureS5.pdf", height=6, width=12)

At the most stringent cutoff (100% overlap required), the mean Jaccard statistic comparing a recombination rate of 10-5 and 10-7 was 0.55 for C. immitis and 0.54 for C. posadasii. If a 50% overlap was required, the mean statistic was 0.82 for C. immitis and C. posadasii. Finally, if only a 0.1% overlap was required, the statistic was 0.89 and 0.85 for C. immitis and C. posadasii, respectively.

print("100% overlap required")
print("C. immitis")
mean(subset(jacc_imm57, f==1)$jaccard)
print("C. posadasii")
mean(subset(jacc_pos57, f==1)$jaccard)
print("50% overlap required")
print("C. immitis")
mean(subset(jacc_imm57, f==0.501)$jaccard)
print("C. posadasii")
mean(subset(jacc_pos57, f==0.501)$jaccard)
print("0.1% overlap required")
print("C. immitis")
mean(subset(jacc_imm57, f==0.001)$jaccard)
print("C. posadasii")
mean(subset(jacc_pos57, f==0.001)$jaccard)
[1] "100% overlap required"
[1] "C. immitis"
[1] 0.7639479
[1] "C. posadasii"
[1] 0.6840271
[1] "50% overlap required"
[1] "C. immitis"
[1] 0.880334
[1] "C. posadasii"
[1] 0.8644842
[1] "0.1% overlap required"
[1] "C. immitis"
[1] 0.8822519
[1] "C. posadasii"
[1] 0.8789823

5.12 Figure S6 - Correlation between species

The strategy for correlation is to use bedtools to make windows in the genome, the to count intersections of those windows with introgression tracts across all individuals for the two species, then finally to look at the correlation between the two.

source("01-setup.R")

get_correlation = function(window_size){
    system(str_interp("bedtools makewindows -g immitis.genome -w $[i]{window_size} -s $[i]{window_size/2} > data/bedtools_windows.bed"))
    window_counts_imm = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "imm"))
    window_counts_pos = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "pos"))
    comb = merge(window_counts_imm, window_counts_pos, by=c("supercontig", "start", "end"))
    with(comb, cor(imm,pos))}

window_sizes = seq(10000,500000,10000)

correlations = data.frame(window_size = window_sizes,
                r=sapply(window_sizes, get_correlation))


corr_plot = ggplot(correlations, aes(x=window_size,y=r))+
    geom_point(pch=1)+geom_line()+theme_bw(14)+
    geom_hline(yintercept=0.54, lty=2)+
    ylab("Pearson's r")+xlab("Window size (bp)")
ggsave("FigureS6.pdf", corr_plot, height=4, width=4)

FigureS6.pdf

5.13 Figure S7, Table S3, S4, S5 - Venn diagram with Neafsey, potentially adaptive genes

source("01-setup.R")

gene_introgression = read.csv("data/gene_introgression.csv")

### For comparison to our data
neafsey <- read.csv("Neafsey_SupTable1.csv", header=FALSE) %>%
  transform( neafsey = str_trim( V1 ))

intro_genes = gene_introgression %>%
  ## Restrict to lines with some introgression coverage
  subset(introgression_count > 0) %>%
  ## If there are multiple depths of coverage for a gene in a population,
  ## then return only the deepest
  ddply(.(population, supercontig, gene_id, description),
        function(x){
          subset(x, introgression_count == max(introgression_count))[1,]
        }) %>%
  transform(species = c("AZ"="posadasii", "Guatemala" = "posadasii", "Mex/TX"="posadasii",
                        "CA/LAm"="immitis", "WA"="immitis", 
                        "immitis" = "immitis", "posadasii" = "posadasii")[as.character(population)])

5.13.1 Write table S3 (all introgressed genes)

intro_genes %>%
  transform(neafsey = gene_id %in% neafsey$neafsey) %>%
  write.csv("TableS3.csv")

5.13.2 Test for correlation between introgressed genes in this study and Neafsey's

fisher.test(all_genes$gene_id %in% neafsey$neafsey,
            all_genes$gene_id %in% intro_genes$gene_id)

	Fisher's Exact Test for Count Data

data:  all_genes$gene_id %in% neafsey$neafsey and all_genes$gene_id %in% intro_genes$gene_id
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  7.077592 10.150718
sample estimates:
odds ratio 
  8.478441

5.13.3 Figure S7 (Venn diagram)

library(Vennerable)
pdf("FigureS7.pdf")
plot(Venn(list(Neafsey = unique(as.character(neafsey$neafsey)),
               `This study`=unique(as.character(intro_genes$gene_id)))))
dev.off()
2

5.13.4 Table S4 - Write table of genes introgressed at high frequency globally

The rule here is that the frequency should be greater than 50% if the population is greater than 10 individuals. Otherwise require it to be 100%

high_freq_genes = subset(gene_introgression,
       ((population_count < 10) & (population_count == introgression_count)) |
       ((population_count > 10) & (introgression_count/population_count > 0.5)))

high_freq_global = high_freq_genes %>% 
  subset( population %in% c("immitis", "posadasii")) %>%
  ## If there are multiple depths of coverage for a gene in a population,
  ## then return only the deepest
  ddply(.(population, supercontig, gene_id, description),
        function(x){
          subset(x, introgression_count == max(introgression_count))[1,]
          })
## Write table. Order by population and then frequency. Make the values nice looking.
high_freq_global = high_freq_global[order(high_freq_global$population, 
                                          high_freq_global$introgression_count/high_freq_global$population_count,
                                          decreasing = TRUE),]
high_freq_global %>%
  summarize(species=population,
          loc = paste0(supercontig,": ",scales::comma(cov_start),"-",scales::comma(cov_end)),
          freq = paste0( scales::percent(introgression_count/population_count), 
                         " (", introgression_count,"/",population_count,")"),
          gene_id=gene_id,
          description=description) %>% write.csv("TableS4.csv")

print("Number of introgressed genes at high frequency")
print(length(unique(high_freq_global$gene_id))) # 55
print(" ... in immitis")
print(nrow(subset(high_freq_global, population == "immitis"))) #3
print(" ... in posadasii")
print(nrow(subset(high_freq_global, population == "posadasii"))) #57
[1] "Number of introgressed genes at high frequency"
[1] 55
[1] " ... in immitis"
[1] 3
[1] " ... in posadasii"
[1] 54

5.13.5 Table S5 - Write table of genes introgressed at high frequency in particular populations

high_freq_local = high_freq_genes %>%
  subset(!(population %in% c("immitis", "posadasii"))) %>%
  transform(species = c("AZ"="posadasii", "Guatemala" = "posadasii", "Mex/TX"="posadasii",
                        "CA/LAm"="immitis", "WA"="immitis")[as.character(population)]) %>% 
  ## If there are multiple depths of coverage for a gene in a population,
  ## then return only the deepest
  ddply(.(species, population, supercontig, gene_id, description),
        function(x){
          subset(x, introgression_count == max(introgression_count))[1,]
        })
## Write the table as above.
high_freq_local = high_freq_local[order(high_freq_local$population, 
                                        high_freq_local$introgression_count/high_freq_local$population_count,
                                          decreasing = TRUE),]
high_freq_local %>%
  summarize(population=population,
            loc = paste0(supercontig,": ",scales::comma(cov_start),"-",scales::comma(cov_end)),
            freq = paste0( scales::percent(introgression_count/population_count), 
                           " (", introgression_count,"/",population_count,")"),
            gene_id=gene_id,
            description=description) %>% write.csv("TableS5.csv")
high_freq_local_only = subset(high_freq_local, !(gene_id %in% high_freq_global$gene_id))
print("Number of introgressed genes found locally at high frequency")
print(length(unique(high_freq_local_only$gene_id))) # 161
hf_pos = subset(high_freq_local_only, species == "posadasii")
hf_imm = subset(high_freq_local_only, species == "immitis")
print(" ... in immitis")
print(length(unique(hf_imm$gene_id))) #31
print(" ... in posadasii")
print(length(unique(hf_pos$gene_id))) #78
[1] "Number of introgressed genes found locally at high frequency"
[1] 106
[1] " ... in immitis"
[1] 31
[1] " ... in posadasii"
[1] 78

A list of genes reciprocally introgressed at high frequency.

found_in_both = intersect(hf_imm$gene_id, hf_pos$gene_id)
subset(high_freq_local_only, gene_id %in% found_in_both)[,-1]
population supercontig cov_start cov_end gene_id description introgression_count population_count species
WA Supercontig_3.3 60716 61367 CIMG_08918 hypothetical protein 5 5 immitis
Mex/TX Supercontig_3.5 2932794 2933405 CIMG_12350 SacI domain-containing protein 9 11 posadasii
Guatemala Supercontig_3.2 102921 103597 CIMG_09822 multidrug resistance protein 4 4 posadasii
CA/LAm Supercontig_3.5 2934433 2935000 CIMG_12350 SacI domain-containing protein 13 22 immitis
CA/LAm Supercontig_3.2 103059 113538 CIMG_09822 multidrug resistance protein 12 22 immitis
AZ Supercontig_3.3 50529 65520 CIMG_08918 hypothetical protein 19 36 posadasii

5.13.6 Calculations of total introgressions

### Read in and process the data
all_cov = list("immitis" = read_delim("data/i_p_i_populations_union.bg", 
                                      col_names = c("supercontig", "start", "stop", "species", "CA/LAm", "WA"),
                                      delim='\t', skip=1),
               "posadasii" = read_delim("data/p_i_i_populations_union.bg", 
                                        col_names = c("supercontig", "start", "stop", "species", "AZ", "Guatemala", "Mex/TX"),
                                        delim='\t', skip=1)) %>%
  ## By melting each, I can concatenate b/c the columns like 'species' 'CA' etc. are in the variable column
  lapply(melt, id.vars=c("supercontig", "start" ,"stop")) %>%
  ldply(function(x) x, .id="species")
colnames(all_cov)[5] = "cov_type"
colnames(all_cov)[6] = "introgression_count"
all_cov$cov_type = as.character(all_cov$cov_type)
all_cov$cov_type[all_cov$cov_type == "species"] = as.character(all_cov$species[all_cov$cov_type == "species"])
all_cov = merge(all_cov, pop_counts, by.x="cov_type", by.y="population")
head(all_cov)
cov_type species supercontig start stop introgression_count population_count
AZ posadasii Supercontig_3.1 0 37941 0 36
AZ posadasii Supercontig_3.1 37941 38556 3 36
AZ posadasii Supercontig_3.1 38556 38557 1 36
AZ posadasii Supercontig_3.1 38557 39116 0 36
AZ posadasii Supercontig_3.1 39116 39202 1 36
AZ posadasii Supercontig_3.1 39202 39263 2 36

This displays the total number of introgressed regions

all_cov %>%
  subset(introgression_count > 0) %>%
  transform(freq = round(introgression_count/population_count,4)) %>%
  {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>%
  #transform(strand = ".") %>%
  write.table("data/all-introgressions.bed6", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)
print("Total")
system("bedtools sort -i data/all-introgressions.bed6 | bedtools merge -c 4 -o distinct | wc -l")

print("... immitis")
all_cov %>%
  subset(introgression_count > 0) %>%
  subset(species == "immitis") %>%
  transform(freq = round(introgression_count/population_count,4)) %>%
  {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>%
  #transform(strand = ".") %>%
  write.table("data/all-introgressions-immitis.bed6", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)
system("bedtools sort -i data/all-introgressions-immitis.bed6 | bedtools merge -c 4 -o distinct | wc -l")

print("... posadasii")
all_cov %>%
  subset(introgression_count > 0) %>%
  subset(species == "posadasii") %>%
  transform(freq = round(introgression_count/population_count,4)) %>%
  {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>%
  #transform(strand = ".") %>%
  write.table("data/all-introgressions-posadasii.bed6", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)
system("bedtools sort -i data/all-introgressions-posadasii.bed6 | bedtools merge -c 4 -o distinct | wc -l")
[1] "Total"
     416
[1] "... immitis"
     142
[1] "... posadasii"
     345

5.13.7 Calculate potentially adaptive introgressions

high_freq_introgressions = all_cov %>%
  subset(((population_count < 10) & (population_count == introgression_count)) |
         ((population_count > 10) & (introgression_count/population_count > 0.5)))
## Format as a BED, and write
high_freq_introgressions %>%
  transform(freq = round(introgression_count/population_count,4)) %>%
  {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>%
    #transform(strand = ".") %>%
    write.table("data/high-frequency-introgressions.bed6", quote=FALSE, sep="\t",
                row.names=FALSE, col.names=FALSE)
all_genes %>%
  {.[,c("supercontig", "start", "stop", "gene_id", "description", "strand")]} %>%
  write.table("data/genes.bed", quote=FALSE, sep="\t",
              row.names=FALSE, col.names=FALSE)

"In total, we found 6 and 32 introgressions in C. immitis and C. posadasii, respectively, at greater than 50% frequency in the species."

There are 99 distinct potentially adaptive introgressions across both species.

system("bedtools sort -i data/high-frequency-introgressions.bed6 | bedtools merge -c 4,5 -o collapse > data/high-frequency-introgressions-merged.bed6")
print("High frequency in immitis")
system("grep 'immitis' data/high-frequency-introgressions-merged.bed6 | wc -l") #6
print("High frequency in immitis population")
system("grep 'WA\\|CA/LAm' data/high-frequency-introgressions-merged.bed6 | grep -v 'immitis' | wc -l") #25
print("High frequency in posadasii")
system("grep 'posadasii' data/high-frequency-introgressions-merged.bed6 | wc -l") #32
print("High frequency in posadasii population")
system("grep 'AZ\\|Guatemala\\|Mex/TX' data/high-frequency-introgressions-merged.bed6 | grep -v 'posadasii' | wc -l") 
[1] "High frequency in immitis"
       6
[1] "High frequency in immitis population"
      25
[1] "High frequency in posadasii"
      32
[1] "High frequency in posadasii population"
      41

5.14 Figure S8 - Amount of introgression in population

source("01-setup.R")
all_tracks = read.csv("data/all-tracks-recomb6-imm-ref.csv")
track_summaries = 
  ddply(all_tracks,
        .(ind),
        plyr::summarize,
        avg_len_introgression = mean(track_len),
        count_introgression = length(track_len),
        sum_introgression = sum(track_len))

track_summaries2 = merge(track_summaries, meta, by.x="ind", by.y="sample")

immitis_genome_size=sum(8482323,5260912,4323945,3952524,3469364,3458857)
track_summaries2$percent_genome = with(track_summaries2,
                                  100*sum_introgression/immitis_genome_size)

comb_melt <- melt(track_summaries2,
                  measure.vars=c("avg_len_introgression",
                                 "count_introgression",
                                 "percent_genome")) %>%
  ## Make the labels nice.
    transform(variable = factor(variable, levels=c("percent_genome",
                                                   "avg_len_introgression",
                                                   "count_introgression"),
                                labels=c("Introgressed\ngenome (%)",
                                         "Average introgression\ntract length (bp)",
                                         "Number of\nintrogression tracts")))

Test for differences between the populations. Nothing significant at p = 0.008 cutoff.

ddply(comb_melt, .(variable,species, population2),
      summarize,
      mean = round(mean(value),2),
      sd = round(sd(value),2))
library(broom)
dlply(comb_melt, .(variable, species),
      function(x) aov(lm(value~population2, data=x))) %>%
  ldply(tidy)
Introgressed              
genome (%) immitis population2 1 0.183410940632791 0.183410940632791 6.27945643000321 0.0190891407977407
Introgressed              
genome (%) immitis Residuals 25 0.730202298070159 0.0292080919228064 nil nil
Introgressed              
genome (%) posadasii population2 2 0.245040687686517 0.122520343843258 3.2646935133915 0.0468441021317963
Introgressed              
genome (%) posadasii Residuals 48 1.80138701545892 0.0375288961553942 nil nil
Average introgression              
tract length (bp) immitis population2 1 14959699.5439941 14959699.5439941 7.45067757294761 0.011445035810654
Average introgression              
tract length (bp) immitis Residuals 25 50195768.7657519 2007830.75063008 nil nil
Average introgression              
tract length (bp) posadasii population2 2 2122651.2603628 1061325.6301814 2.90925101637238 0.0641846478960499
Average introgression              
tract length (bp) posadasii Residuals 48 17510909.1522223 364810.607337964 nil nil
Number of              
introgression tracts immitis population2 1 14.0121212121212 14.0121212121212 0.311476116571823 0.581739941893402
Number of              
introgression tracts immitis Residuals 25 1124.65454545455 44.9861818181818 nil nil
Number of              
introgression tracts posadasii population2 2 607.189542483662 303.594771241831 2.19194673891886 0.122756106877944
Number of              
introgression tracts posadasii Residuals 48 6648.22222222222 138.50462962963 nil nil
panelA <- comb_melt %>%
    subset(variable == "Introgressed\ngenome (%)") %>%
    transform(species = paste("C.",species)) %>% 
    ggplot(aes(x=population2,
               y=value))+
    geom_boxplot(fill=NA)+
    geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+
    facet_grid(.~species, scale="free")+theme_bw()+
    scale_fill_discrete(guide=FALSE)+
    theme_base()+ylab("Introgressed\ngenome (%)")+xlab("")+
    theme(plot.background = element_rect(colour = NA),
          strip.text.x=element_text(face="italic"))
panelB <- comb_melt %>%
  subset(variable == "Average introgression\ntract length (bp)") %>%
  transform(species = paste("C.",species)) %>% 
  ggplot(aes(x=population2,
             y=value))+
  geom_boxplot(fill=NA)+geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+
  facet_grid(.~species, scale="free")+theme_bw()+
  scale_fill_discrete(guide=FALSE)+
  theme_base()+ylab("Average introgression\ntract length (bp)")+xlab("")+
  theme(plot.background = element_rect(colour = NA),
        strip.text.x=element_blank())
panelC <- comb_melt %>%
  subset(variable == "Number of\nintrogression tracts") %>%
  transform(species = paste("C.",species)) %>% 
  ggplot(aes(x=population2,
             y=value))+
  geom_boxplot(fill=NA)+geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+
  facet_grid(.~species, scale="free")+theme_bw()+
  scale_fill_discrete(guide=FALSE)+
  theme_base()+ylab("Number of\nintrogression tracts")+xlab("")+
  theme(plot.background = element_rect(colour = NA),
        strip.text.x=element_blank())
compare_pops = ggarrange(panelA, panelB, panelC, labels=c("A", "B", "C"), nrow=3, heights = c(1,0.9,0.9))
ggsave("FigureS8.pdf", compare_pops, height=7,width=7)

6 Statistics

6.1 Percentage of tracts merged during filtering

"We reapplied this filter 4 times. This consolidation decreased the number of introgressions per individual by an average of 24% per individual in C. immitis (minimum 13% maximum 35%) and 12% in C. posadasii (minimum 2%, maximum 25%)."

The code reads in the filtered and the unfiltered track tables then looks at the ratio of their lengths.

source("01-setup.R")
## For immitis
unfiltered_tracks = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_unftrk.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>% # Introgressions
    ddply(.(ind), plyr::summarize,
          count_unfiltered = length(track_len))
filtered_tracks = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv")%>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    ddply(.(ind), plyr::summarize,
          count_filtered = length(track_len))
d = merge(filtered_tracks, unfiltered_tracks)
print("C. immitis")
print(with(d,summary((count_unfiltered - count_filtered)/count_unfiltered)))

## For posadasii
unfiltered_tracks = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_unftrk.csv") %>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    ddply(.(ind), plyr::summarize,
          count_unfiltered = length(track_len))
filtered_tracks = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")%>%
    subset(genotype == "homo_2") %>%
    subset(track_len > 500) %>%
    ddply(.(ind), plyr::summarize,
          count_filtered = length(track_len))
d = merge(filtered_tracks, unfiltered_tracks)
print("C. posadasii")
print(with(d,summary((count_unfiltered - count_filtered)/count_unfiltered)))
[1] "C. immitis"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1282  0.1892  0.2328  0.2390  0.2830  0.3580 
[1] "C. posadasii"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02632 0.08664 0.11613 0.12042 0.15035 0.24771 

6.2 Introgressions aren't uniformly distributed

This is a chi-squared test against a uniform distribution with bin sizes of 1MB.

"Introgressions in each species of Coccidioides are circumscribed to particular and distinct portions of their genomes. We divided the genome into 500kB bins and compared the observed number of introgressions in each bin to a null model of a uniform distribution. In each species, we rejected the null model (Figure 3A; C. posadasii into C. immitis: chi-259 =3188.0, P < 0.00001; C. immitis into C. posadasii: chi-259 =4,530.0, P < 0.00001)."

source("01-setup.R")
window_size = 500000

system(str_interp("bedtools makewindows -g immitis.genome -w $[i]{window_size} -s $[i]{window_size/2} > data/bedtools_windows.bed"))

window_counts_imm = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "imm"))

window_counts_pos = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "pos"))

the_counts = merge(window_counts_imm, window_counts_pos, by=c("supercontig", "start", "end"))

print("Chi-squared statistics")
print("C. immitis")
print(sum((the_counts$imm - mean(the_counts$imm))^2/mean(the_counts$imm)))
print("C. posadasii")
print(sum((the_counts$pos - mean(the_counts$pos))^2/mean(the_counts$pos)))
[1] "Chi-squared statistics"
[1] "C. immitis"
[1] 5428.56
[1] "C. posadasii"
[1] 8158.621

6.3 Test for differences in admixture

source("01-setup.R")

all_shared <- list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"),
                   posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>%
  ldply(.id="species", function(x) x) %>%
  subset(genotype == "homo_2") %>% ## "homo_2" = "shared"
  subset(per_repeat < 5) ## Filter to less than 5% repetitive

immitis_genome_size=sum(8482323,5260912,4323945,3952524,3469364,3458857)

all_shared_stats_by_indiv <- all_shared %>% 
  ddply(.(species, ind), plyr::summarize, 
        avg_len_introgression = mean(track_len),
        count_introgression = length(track_len),
        sum_introgression = sum(track_len)) %>%
  transform( percent_genome = 100*sum_introgression/immitis_genome_size) %>%
  merge(meta, by.x="ind", by.y="sample")

introgression_stats_by_indiv <- all_shared %>% 
  subset(track_len > 500) %>% 
  ddply(.(species, ind), plyr::summarize, 
        avg_len_introgression = mean(track_len),
        count_introgression = length(track_len),
        sum_introgression = sum(track_len))%>%
  transform( percent_genome = 100*sum_introgression/immitis_genome_size) %>%
  merge(meta, by.x="ind", by.y="sample")

6.3.1 Posadasii contains more admixed material

"Consistent with the results from our D-statistic calculations, the levels of introgression between the two reciprocal directions are not equivalent. Haplotypes with shared ancestry covered significantly more of the C. posadasii genome on average than the genome of C. immitis (0.9% and 0.54%, respectively; One Way ANOVA: F1,76 =74.8, P = 6.26 10-13)."

ANOVA for percent genome admixed

admix_anova = aov(lm(percent_genome~species.x, data=all_shared_stats_by_indiv))
print(summary(admix_anova))
            Df Sum Sq Mean Sq F value   Pr(>F)    
species.x    1  3.501   3.501   74.82 6.26e-13 ***
Residuals   76  3.556   0.047                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Calculate the mean percent of the genome admixed

with(all_shared_stats_by_indiv,
     tapply(percent_genome, species.x, mean))
immitis 0.542139414521459
posadasii 0.987449352722918

6.3.2 Posadasii contains more introgressed material

"On average, C. immitis individuals carry slightly less introgressed material from C. posadasii than C. posadasii does from C. immitis (0.43% and 0.73% of their genomes, respectively; One-Way ANOVA: F1,76 =41.2, P = 1.08 10-8)."

introgression_anova = aov(lm(percent_genome~species.x, 
                             data=introgression_stats_by_indiv))
print(summary(introgression_anova))
            Df Sum Sq Mean Sq F value   Pr(>F)    
species.x    1  1.604  1.6036   41.17 1.08e-08 ***
Residuals   76  2.960  0.0389                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv,
     tapply(percent_genome, species.x, mean))
immitis 0.436171478931884
posadasii 0.737559654746496

6.3.3 No sig. difference in the variance of introgression

"We found no significant difference in the variance in the percent of the genome introgressed between species (Levene's test for homogeneity of variance: F1,76 = 0.005, P = 0.94)."

library(car)
print(leveneTest(percent_genome~species.x, data=introgression_stats_by_indiv))
Loading required package: carData
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1   0.005 0.9436
      76

6.3.4 Most tracts are small

"Next, we compared the introgression haplotype sizes in each species. We found that the vast majority (98% for both species) of introgressed haplotypes are small (less than 25kb, which is ~0.1% of the genome)"

This gives the percentage of introgressed tracts that are less than 25,000bp

list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"),
                   posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>%
  ldply(.id="species", function(x) x) %>%
  subset(genotype == "homo_2") %>% ## "homo_2" = "shared"
  subset(per_repeat < 5) %>%
  subset(track_len > 500) %>%
  transform(over_25000 = track_len>25000) %>%
  ddply(.(species), plyr::summarize, under_25000 = round(1-sum(over_25000)/length(over_25000),3))
immitis 0.982
posadasii 0.976

6.3.5 Immitis has larger introgressed haploblocks

"The mean haplotype size differs between the two species (One-Way ANOVA, F1,76 = 11.07, P = 0.0014). The average size of a C. immitis to C. posadasii introgression was 3.4 kb. In the reciprocal direction, C. posadasii to C. immitis, the average size of an introgression was 4.2 kb."

introgression_anova_size = aov(lm(avg_len_introgression~species.x, 
                             data=introgression_stats_by_indiv))
print(summary(introgression_anova_size))
            Df   Sum Sq  Mean Sq F value  Pr(>F)   
species.x    1 12348147 12348147   11.07 0.00136 **
Residuals   76 84789029  1115645                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv,
     tapply(avg_len_introgression, species.x, mean))
immitis 4229.75693737504
posadasii 3393.42010492857

6.3.6 Posadasii has a greater number of introgressions

The average number of introgressed tracts in an isolate was 30.1 and 62.8 for C. immitis and C. posadasii, respectively, and this difference was statistically significant (One-Way ANOVA, F1,76 = 171, P < 1e-10)

introgression_anova_number = aov(lm(count_introgression~species.x, 
                             data=introgression_stats_by_indiv, population))
print(summary(introgression_anova_number))
            Df Sum Sq Mean Sq F value  Pr(>F)    
species.x    1   1547  1546.9   75.26 5.6e-13 ***
Residuals   76   1562    20.6                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv,
     tapply(count_introgression, species.x, mean))
immitis 30.1111111111111
posadasii 62.8235294117647

6.3.7 Washington immitis has less introgression than other immitis

"Coccidioides immitis from Washington also shows the least amount of introgression from C. posadasii (Figure S3)."

introgression_anova_immitis <-
  aov(lm(percent_genome~population2, 
         data=subset(introgression_stats_by_indiv, species.x == "immitis")))
print(summary(introgression_anova_immitis))
            Df Sum Sq Mean Sq F value Pr(>F)  
population2  1 0.1834 0.18341   6.279 0.0191 *
Residuals   25 0.7302 0.02921                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(subset(introgression_stats_by_indiv, species.x == "immitis"),
     tapply(percent_genome, population2, mean))
AZ nil
CA/LAm 0.47546350276166
Guatemala nil
Mex/TX nil
WA 0.263286574080871

Author: Colin S. Maxwell

Created: 2018-11-02 Fri 12:48

Validate