Gene exchange between two divergent species of the fungal human pathogen, Coccidioides
Table of Contents
- 1. About this document
- 2. Data provided in Dryad submission
- 3. External dependencies
- 4. Code ran on cluster
- 5. Figures
- 5.1. Setup code
- 5.2. Calculate coverage and regions
- 5.3. Calculate marker frequency
- 5.4. Introgressed genes
- 5.5. Figure 2 - Histograms of size of shared ancestry/introgressions
- 5.6. Figure 3 - Location and frequency of introgressions
- 5.7. Figure 4 - Example of an introgression
- 5.8. Figure S2 - Principle components analysis
- 5.9. Figure S3 - Admixture plots
- 5.10. Figure S4 - Compare effect of recombination rate on introgression size, etc.
- 5.11. Figure S5 - Plot jaccard stat for different recomb. rates
- 5.12. Figure S6 - Correlation between species
- 5.13. Figure S7, Table S3, S4, S5 - Venn diagram with Neafsey, potentially adaptive genes
- 5.13.1. Write table S3 (all introgressed genes)
- 5.13.2. Test for correlation between introgressed genes in this study and Neafsey's
- 5.13.3. Figure S7 (Venn diagram)
- 5.13.4. Table S4 - Write table of genes introgressed at high frequency globally
- 5.13.5. Table S5 - Write table of genes introgressed at high frequency in particular populations
- 5.13.6. Calculations of total introgressions
- 5.13.7. Calculate potentially adaptive introgressions
- 5.14. Figure S8 - Amount of introgression in population
- 6. Statistics
- 6.1. Percentage of tracts merged during filtering
- 6.2. Introgressions aren't uniformly distributed
- 6.3. Test for differences in admixture
- 6.3.1. Posadasii contains more admixed material
- 6.3.2. Posadasii contains more introgressed material
- 6.3.3. No sig. difference in the variance of introgression
- 6.3.4. Most tracts are small
- 6.3.5. Immitis has larger introgressed haploblocks
- 6.3.6. Posadasii has a greater number of introgressions
- 6.3.7. Washington immitis has less introgression than other immitis
1 About this document
The README.html
file was exported from README.org
, which is in org-mode
format and can be opened with the emacs text editor. This document
contains the code used to generate the figures and to calculate the
statistics for the paper "Gene exchange between two divergent species
of the fungal human pathogen, Coccidioides". This document is a
literate programming document, which means that the code used to
generate the figures was created from this document by "tangling"
it. The document shows the code used to create the figures along with
comments and annotations to enable the code to be interpreted.
To create the figures, run the file make-figures.R. The code
requires that bedtools
is installed and that some R packages are
installed. See external dependencies for details.
Note that the statistics have the quotes from the paper that they were used for:
"This is an example of how a quote is formatted"
CSS stylesheet for the HTML file is Fabrice Niessen https://github.com/fniessen/org-html-themes
2 Data provided in Dryad submission
2.1 Meta data (cocci-metadata.csv
)
The column "population" is a list of the geographical origin of the sample. "population2" is assigned based on the angsd analysis (Figure S3).
head(read.csv("cocci-metadata.csv"))
species | sample | population | population2 |
---|---|---|---|
immitis | B0727_Argentina | LAm | CA/LAm |
immitis | Coahuila_1 | LAm | CA/LAm |
immitis | Guerrero_1 | LAm | CA/LAm |
immitis | Michoacan_2 | LAm | CA/LAm |
immitis | San_Diego_1_14 | CA | CA/LAm |
immitis | San_Diego_1_16 | CA | CA/LAm |
2.2 GFF file
Downloaded annotations from Ensembl fungi. Filtered using AWK for gene annotations and renamed the chromosomes
ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/gff3/fungi_ascomycota1_collection/coccidioides_immitis_rs_gca_000149335
sed s/supercont/Supercontig_/ < \ Coccidioides_immitis_rs.ASM14933v2.40.gff3 | \ awk '$3==\"gene\" {print $0}' > \ Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3
2.3 Int-HMM output
2.3.1 hmm_tables_united
These give the location of admixed tracks (tracts) for all individuals for a particular combination of reference genome, species, recombination rate, and marker frequency cutoff.
i_p_i_hmm_hap_pab_01_6_all_30_tracks
translates to "introgression
into immitis from posadasii with a 1e-6 recombination rate and a 30%
frequency required for marker selection"
The files ending in _unftrk
contain tracts defined without merging
small tracks, etc. as defined in the paper.
2.3.2 hmm_tables_no_filter
These give the HMM output at a per marker level, giving the probability that the marker is shared or not shared between species.
2.3.3 angsd.Rdata
This is a workspace containing the results of the PCAngsd analysis. See below for description of the file and the code used to interpret it.
2.4 immitis.genome
immitis.genome
is a file giving the chrom lengths of the contigs for
the immitis reference.
2.5 data
This is an empty directory that intermediate files from the code are created in.
2.6 Neafsey_SupTable1.csv
This is table S1 from "Population genomic sequencing of Coccidioides fungi reveals recent hybridization and transposon control." by Neafsey and co-workers.
3 External dependencies
3.1 R packages
install.packages(c("ggplot2", "ggthemes", "ggpubr", "stringr", "scales", "reshape2", "plyr", "gggenes", "car", "readr", "broom"))
I used Vennerable to make the R package. I installed it according to this git repository.
https://github.com/js229/Vennerable
R.Version()
$platform [1] "x86_64-apple-darwin15.6.0" $arch [1] "x86_64" $os [1] "darwin15.6.0" $system [1] "x86_64, darwin15.6.0" $status [1] "" $major [1] "3" $minor [1] "5.1" $year [1] "2018" $month [1] "07" $day [1] "02" $`svn rev` [1] "74947" $language [1] "R" $version.string [1] "R version 3.5.1 (2018-07-02)" $nickname [1] "Feather Spray"
3.2 Bedtools
bedtools --version
bedtools v2.27.1
4 Code ran on cluster
Int-HMM requires an SQL database and the SLURM scheduling engine to run, which means that it is difficult to create a reproducible document for running the code. In the sections I below, I document the steps and code used to generate the Int-HMM output.
4.1 Mapping
code-from-cluster/map.py is a script that writes sbatch files that will use BWA to
map read, Picard to call duplicates, and GATK to realign indels. The
output is a deduplicated, sorted, and realigned bam file with
attendent index and the samtools idxstats file. The shell script
code-from-cluster/run_map.sh calls map.py
with the appropriate fastq locations
and reference files. The output from run_map.sh
is passed to sbatch
:
sh run_map.sh | xargs -I {} sbatch {}
This code was used to map the data for use by angsd. A different script (with the same code and steps) was used to map the data for Int-HMM.
4.1.1 Command line arguments
python map.py --help
usage: map.py [-h] -f FASTQ [FASTQ ...] -g GENOME -s SBATCH -b BAM Map fastq files, dedup, realign them optional arguments: -h, --help show this help message and exit -f FASTQ [FASTQ ...], --fastq FASTQ [FASTQ ...] Folder(s) containing the fastq files. Expects pairs of fastq files in the form: <SAMPLE>_<SRR- number>_R1_001.fastq.gz <SAMPLE>_<SRR- number>_R2_001.fastq.gz. -g GENOME, --genome GENOME Reference fasta. Needs bwa index already built. -s SBATCH, --sbatch SBATCH Folder to put sbatch scripts and output. -b BAM, --bam BAM Folder to put bam output.
4.1.2 Summary of pipeline
The commands are run in a temporary file. The output is the BAM, bam index, and the stats file.
- cat :: If you supply multiple folders containing fastq files that have the same name, the script will cat them together before passing them to BWA
- bwa mem :: Maps the reads. This call is based on David Turissini's scripts.
-M
" Mark shorter split hits as secondary (for Picard compatibility). "-R '@RG\\tID:{samp_name}\\tSM:{samp_name}\\tPL:illumina\\tLB:lib1\\tPU:unit1'
Assigns the read group tags.-k 10
"Minimum seed length. Matches shorter than INT will be missed."
- samtools sort :: The output of BWA is piped directly to be sorted
- Picard/MarkDuplicates :: Mark the duplicates
- GATK/RealignerTargetCreator ::
- GATK/IndelRealigner ::
- samtools index/idxstats ::
4.2 RepeatMasker
Repeat masker was run with the options:
-pa 8
- Uses 8 cores
-species "fungi"
- Uses the fungal database
-gff
- Emit a GFF file
I masked repetitive regions in angsd by converting the GFF output to BED format:
cd /pine/scr/m/a/maxwell3/2018-cocci-introgression/ bedtools merge -i repeats-posadasii/posadasii_rmscc_3488_ref.fasta.out.gff > \ repeats-posadasii/posadasii_rmscc_3488_ref.fasta.out.bed bedtools merge -i repeats-immitis/immitis_rs_ref.fasta.out.gff > \ repeats-immitis/immitis_rs_ref.fasta.out.bed
And then using bedtools to take the complement:
bedtools complement -i immitis_rs_ref.fasta.out.bed -g imm_ref.genome | awk '{print $1"\t"$2+1"\t"$3}' > imm_not_variable.angsd bedtools complement -i posadasii_rmscc_3488_ref.fasta.out.bed -g pos_ref.genome | awk '{print $1"\t"$2+1"\t"$3}' > pos_not_variable.angsd ./angsd sites index /pine/scr/m/a/maxwell3/2018-cocci-introgression/repeats-posadasii/pos_not_variable.angsd ./angsd sites index /pine/scr/m/a/maxwell3/2018-cocci-introgression/repeats-immitis/imm_not_variable.angsd
4.3 PCAngsd
I created a list of BAM files using the following code:
imm=/pine/scr/m/a/maxwell3/2018-cocci-introgression/immitis/imm_ref/ pos=/pine/scr/m/a/maxwell3/2018-cocci-introgression/posadasii/imm_ref/ ls $imm | grep "realigned\.bam$" | awk -v base="$imm" '{print base$0}' > imm.bamlist ls $pos | grep "realigned\.bam$" | awk -v base="$pos" '{print base$0}' > pos.bamlist cat imm.bamlist pos.bamlist > all.bamlist
I then used code-from-cluster/run_angsd_glf.sh and
code-from-cluster/angsd-glf.py to calculate the genotype
likelihoods using angsd (angsd version: 0.921-11-g20b0655 (htslib: 1.8-37-gcaed43e) build(Jul 3 2018 12:06:22)
).
PCAngsd (version 0.95) was run using code-from-cluster/run_angsd_pca.sh and code-from-cluster/angsd-pca.py.
The script code-from-cluster/plot-angsd-pca-display.R was used to
read in the MAP admixture and PCA results and create angsd.Rdata
used below.
4.4 IntHMM
4.4.1 Find markers from VCF
The script code-from-cluster/get_haploid_site_all_introgress_pop_from_vcf.pl was used to find the markers for the HMM. See the comments in the code itself for how it was run.
4.4.2 Calculate admixture probabilities
The script code-from-cluster/hmm_introgress_all_haploid_master.pl
was called multiple times to submit jobs to run the HMM itself. For
example, this runs the HMM on immitis using the immitis reference, a
sequencing error rate of 0.01, an assumed recombination rate of 1e-5,
a marker cut-off of 0.30, and using the table sites_all_i_p_i_30
.
cd /proj/matutelb/projects/programs/
module load perl
perl hmm_introgress_all_haploid_master.pl i_p_i immitis .01 1e-5 .30 sites_all_i_p_i_30
The table sites_all_p_i_i_30
was used for looking at introgression
into posadasii with an immitis reference.
4.4.3 Assemble tracks
The script code-from-cluster/hmm_genotype_tracks_haploid_master_parallel.pl was called multiple times to join the admixed sites identified by the HMM into tracks. For example, this code assembles tracks for immitis using the immitis reference for a recombination rate of 1e-6.
cd /proj/matutelb/projects/programs/
module load perl
perl hmm_genotype_tracks_haploid_master_parallel.pl i_p_i 6_all_30 hap immitis immitis_unique_repeat_regions
4.4.4 Get tables from SQL database
I used the script code-from-cluster/get_hmm_tables.R to copy the tables from the SQL database into CSV files.
5 Figures
5.1 Setup code
This code reads in the metadata and loads packages that are shared across multiple pieces of code used to create the figures.
library(ggplot2) library(ggthemes) library(ggpubr) library(stringr) library(scales) library(reshape2) library(plyr) meta <- read.csv("./cocci-metadata.csv") ## The number of individuals in each population meta_pop_counts <- count(meta, c("species", "population2")) colnames(meta_pop_counts) <- c("species", "population2", "pop_size") ## Read in the GFF file and extract the gene info all_genes <- readr::read_delim("Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3", delim="\t", col_names=c("supercontig", "X2", "X3", "start", "stop", "score", "strand", "foo", "info")) %>% subset(X3 == "gene") %>% tidyr::extract(col="info", ## Extract gene info and description from the GFF into=c("description", "gene_id"), regex="description=(.*);gene_id=(.*);")
5.2 Calculate coverage and regions
The goal of this is to calculate the allele frequency of
introgressions both across whole species and within populations
defined by angsd. The strategy is to create bedgraphs of introgressed
regions for each species/population, then use bedtools unionbg
to
allow the bedgraphs to be compared.
Create BEDs out of the tracks by filtering for shared ancestry greater than 500bp with less than 5% repetitive sequence The files being read are for the immitis reference at 10^-6 recomb rate.
source("01-setup.R") filter_tracks = function(track_file){ track_file %>% subset(genotype == "homo_2") %>% ## Admixed regions only subset(track_len > 500) %>% ## Require 500bp length subset(per_repeat < 5) ## Remove >5% repetitive } write_bed = function(track_file, out_name){ track_file %>% {.[,c("supercontig", "start", "end", "ind", "introgress_snps")]} %>% transform(strand = ".") %>% write.table(out_name, quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) } ## Create the bedfile for the location of the tracts for immitis imm_tracks <- read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>% filter_tracks() ## And for posadasii pos_tracks <- read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>% filter_tracks()
Concatenate the tracts from the two different species, adding a column giving which species it belongs to, then save a CSV file for the concatenated tracks. This is needed for Figure 5.
all_tracks <- list(posadasii=pos_tracks, immitis=imm_tracks) %>% ldply(.id="species", function(x) x) write.csv(all_tracks, "data/all-tracks-recomb6-imm-ref.csv")
5.2.1 Write bedgraphs of introgressions at the species level
This creates bedgraphs of the coverage of the introgressed tracts
across the species. Note that the -bga
tag reports regions with
zero coverage. This is needed for plotting the coverage using the
geom_step()
method in Figure 3.
## Write the bed files imm_tracks %>% write_bed("data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6") pos_tracks %>% write_bed("data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6") ## Sort the bed files system("bedtools sort -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6 > data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6") system("bedtools sort -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6 > data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6") system("rm data/p_i_i_hmm_hap_pab_01_6_all_30_tracks.bed6 data/i_p_i_hmm_hap_pab_01_6_all_30_tracks.bed6") ## Create bedgraphs of the coverage of introgressions system("bedtools genomecov -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 -bga -g immitis.genome > data/i_p_i_hmm_hap_pab_01_6-all.bg") system("bedtools genomecov -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 -bga -g immitis.genome > data/p_i_i_hmm_hap_pab_01_6-all.bg") track_cov <- list( immitis = read.table("data/i_p_i_hmm_hap_pab_01_6-all.bg",col.names=c("supercontig", "start", "end", "cov")), posadasii = read.table("data/p_i_i_hmm_hap_pab_01_6-all.bg",col.names=c("supercontig", "start", "end", "cov"))) %>% ldply(.id="species", function(x) x) %>% transform(n_ind = ifelse(species=="immitis", 27, 51), bin = cut_width(start/1e6, 0.1)) write.csv(track_cov, "data/all-coverage-recomb6-imm-ref.csv")
This is what each bedgraph looks like
head -5 data/p_i_i_hmm_hap_pab_01_6-all.bg
Supercontig_3.1 | 0 | 37941 | 0 |
Supercontig_3.1 | 37941 | 38556 | 3 |
Supercontig_3.1 | 38556 | 38557 | 1 |
Supercontig_3.1 | 38557 | 39116 | 2 |
Supercontig_3.1 | 39116 | 39202 | 3 |
5.2.2 Write bedgraphs of introgressions at the population level
## For posadasii for(pop in c("AZ", "Guatemala", "Mex/TX")){ ## Merge with metadata, take subset by the population to_write = pos_tracks %>% merge(meta, by.x="ind", by.y="sample") %>% subset(population2 == pop) pop = gsub("/", "_", pop) # Allow pop names to be written ## Write a bed of the introgressions, sort it out_name = str_interp("data/p_i_i_hmm_hap_pab_01_6_all_30_tracks-${pop}.bed6") to_write %>% write_bed("data/tmp.bed6") system(str_interp("bedtools sort -i data/tmp.bed6 > ${out_name}")) system("rm data/tmp.bed6") ## Write a begraph of the introgressions out_name2 = str_interp("data/p_i_i_hmm_hap_pab_01_6-${pop}.bg") system(str_interp("bedtools genomecov -i ${out_name} -bg -g immitis.genome > ${out_name2}"))} ## For immitis for(pop in c("CA/LAm", "WA")){ to_write = imm_tracks %>% merge(meta, by.x="ind", by.y="sample") %>% subset(population2 == pop) pop = gsub("/", "_", pop) out_name = str_interp("data/i_p_i_hmm_hap_pab_01_6_all_30_tracks-${pop}.bed6") out_name2 = str_interp("data/i_p_i_hmm_hap_pab_01_6-${pop}.bg") to_write %>% write_bed("data/tmp.bed6") system(str_interp("bedtools sort -i data/tmp.bed6 > ${out_name}")) system("rm data/tmp.bed6") system(str_interp("bedtools genomecov -i ${out_name} -bg -g immitis.genome > ${out_name2}"))}
Take the union of all the bedgraphs across species and populations
system("bedtools unionbedg -header -g immitis.genome -i data/p_i_i_hmm_hap_pab_01_6-all.bg data/p_i_i_hmm_hap_pab_01_6-AZ.bg data/p_i_i_hmm_hap_pab_01_6-Guatemala.bg data/p_i_i_hmm_hap_pab_01_6-Mex_TX.bg -names all AZ Guatemala Mex_TX > data/p_i_i_populations_union.bg") system("bedtools unionbedg -header -g immitis.genome -i data/i_p_i_hmm_hap_pab_01_6-all.bg data/i_p_i_hmm_hap_pab_01_6-CA_LAm.bg data/i_p_i_hmm_hap_pab_01_6-WA.bg -names all CA_LAm WA > data/i_p_i_populations_union.bg")
This is what the union bedgraphs look like.
head -5 data/i_p_i_populations_union.bg
chrom | start | end | all | CA_LAm | WA |
Supercontig_3.1 | 0 | 3120 | 0 | 0 | 0 |
Supercontig_3.1 | 3120 | 14924 | 4 | 0 | 4 |
Supercontig_3.1 | 14924 | 15976 | 6 | 1 | 5 |
Supercontig_3.1 | 15976 | 29398 | 0 | 0 | 0 |
5.2.3 Merge introgressed tracts to look at frequency
bedtools merge
merges overlapping intervals into a single
interval. To be "overlap", I'm requiring them to overlap by 250 bp
(that's what the -d
flag does). This is what the data looks
like. The fourth column is the individual. The fifth is the number of
introgressed snps (not used). The last is the strand.
head -5 data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6
Supercontig_3.1 | 37941 | 38556 | Phoenix_8 | 3 | 0 |
Supercontig_3.1 | 37941 | 38556 | B10757_Nevada | 3 | 0 |
Supercontig_3.1 | 37941 | 38557 | Tucson_17 | 4 | 0 |
Supercontig_3.1 | 38557 | 39202 | B1249_Guatemala | 5 | 0 |
Supercontig_3.1 | 38557 | 39202 | B0858_Guatemala | 5 | 0 |
The -o distinct
flag applies the distinct
operation to the column
specified by -c
. distinct
prints a comma separated list.
system("bedtools merge -c 4 -d -250 -o distinct -i data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 > data/p_i_i_hmm_hap_pab_01_6_all_30_regions.bed6") system("bedtools merge -c 4 -d -250 -o distinct -i data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6 > data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6")
tail -3 data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6
Supercontig_3.6 | 239890 | 240423 | SJV_10,SJV_7,SJV_8,WA_202,WA_205,WA_211,WA_212,Washington_1 |
Supercontig_3.6 | 3066444 | 3067046 | SJV_11,San_Joaquin_Valley_11 |
Supercontig_3.6 | 3343338 | 3345375 | B0727_Argentina,Michoacan_2,SJV_4,SJV_6,SJV_8 |
So the results give unique intervals shared across particular sets of individuals.
5.2.4 Calculate intersections of introgressions and genes
This finds genes that overlap with introgressions as defined
above. The -F 0.1
tag requires that 10% of the gene's length be
overlapped by the introgression.
system("bedtools intersect -a data/p_i_i_hmm_hap_pab_01_6_all_30_regions.bed6 -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/posadasii_introgressed_genes.bed") system("bedtools intersect -a data/i_p_i_hmm_hap_pab_01_6_all_30_regions.bed6 -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/immitis_introgressed_genes.bed")
5.3 Calculate marker frequency
The HMM computes the likelihood that a SNP is shared between species. This code creates a table of the frequency of markers called "shared" (homo_1) or "not shared" (homo_2) by the HMM.
source("01-setup.R") table_regex="(.)_(.)_(.)_(.*)_hmm_hap_pab_01_([5679])_all_30\\.csv" ## hmm_tables_no_filter contains the output of the HMM assembled into ## tracts but without any filter on probability of p_homo_1 or ## p_homo_2 to_read <- str_match(dir("hmm_tables_no_filter"), table_regex) %>% as.data.frame() %>% subset(!is.na(V1)) %>% #Remove non-matching files transform(V6 = as.numeric(as.character(V6))) # V6 is the group for the recombination rate colnames(to_read) <- c("table", "species", "src", "reference", "sample", "recomb")
head(to_read)
table | species | src | reference | sample | recomb |
---|---|---|---|---|---|
i_p_i_B0727_Argentina_hmm_hap_pab_01_5_all_30.csv | i | p | i | B0727_Argentina | 5 |
i_p_i_B0727_Argentina_hmm_hap_pab_01_6_all_30.csv | i | p | i | B0727_Argentina | 6 |
i_p_i_B0727_Argentina_hmm_hap_pab_01_7_all_30.csv | i | p | i | B0727_Argentina | 7 |
i_p_i_B0727_Argentina_hmm_hap_pab_01_9_all_30.csv | i | p | i | B0727_Argentina | 9 |
i_p_i_Coahuila_1_hmm_hap_pab_01_5_all_30.csv | i | p | i | Coahuila_1 | 5 |
i_p_i_Coahuila_1_hmm_hap_pab_01_6_all_30.csv | i | p | i | Coahuila_1 | 6 |
all_introgressed <- to_read %>% subset(recomb == 6) %>% ## Restrict to only 10^-6 recombination rate subset(reference == "i") %>% ## Restrict to only the immitis reference ddply(.(table, species, src, sample, reference, recomb), ## Read in the data, keeping track of meta data with, read.csv(file.path("hmm_tables_no_filter", table))[,c("supercontig","pos","p_homo_1", "p_homo_2", "genotype")], .progress="text") %>% plyr::mutate(species = ifelse(species == "i", "immitis", "posadasii")) ## Add column for species ## This command counts the instances of a marker being assigned a ## particular genotype for each strain, etc. marker_freq <- dcast(all_introgressed, pos+supercontig+recomb+reference+src+species~genotype) ## Scale the marker frequency for the two genotypes by the species ## population size marker_freq$pop_size = ifelse(marker_freq$species == "immitis", 27, 51) marker_freq$homo_1 <- marker_freq$homo_1/marker_freq$pop_size marker_freq$homo_2 <- marker_freq$homo_2/marker_freq$pop_size
head(marker_freq)
pos | supercontig | recomb | reference | src | species | homo_1 | homo_2 | pop_size |
---|---|---|---|---|---|---|---|---|
3120 | Supercontig_3.1 | 6 | i | p | immitis | 0.555555555555556 | 0.148148148148148 | 27 |
5987 | Supercontig_3.2 | 6 | i | p | immitis | 0.666666666666667 | 0.037037037037037 | 27 |
5998 | Supercontig_3.2 | 6 | i | p | immitis | 0.703703703703704 | 0 | 27 |
5999 | Supercontig_3.2 | 6 | i | p | immitis | 0.703703703703704 | 0 | 27 |
8821 | Supercontig_3.3 | 6 | i | p | immitis | 0.555555555555556 | 0.296296296296296 | 27 |
11476 | Supercontig_3.4 | 6 | i | i | posadasii | 0.117647058823529 | 0.0196078431372549 | 51 |
Since the markers were chosend such that there must be a 30% difference in frequency between the species, there are (almost) no markers that are assigned "introgressed" with greater than 70% frequency. I'm not sure what's with the three markers with greater than 70% frequency, but I'm ignoring them because they are a very small portion of the total and they are all close to 70%.
summary(marker_freq$homo_2 >= 0.7)
Mode FALSE TRUE logical 561787 3
## Melt the site frequency so that the "variable" is the genotype and ## the "value" is the frequency marker_freq2 = melt(marker_freq, id.vars=c("pos", "supercontig", "recomb", "reference", "src", "pop_size", "species")) save(marker_freq2, file="data/marker-freq-recomb6-imm-ref.Rdata")
In the table, 'variable' is the genotype assigned by the HMM and 'value' is the frequency of markers assigned that value in the
head(marker_freq2)
pos | supercontig | recomb | reference | src | pop_size | species | variable | value |
---|---|---|---|---|---|---|---|---|
3120 | Supercontig_3.1 | 6 | i | p | 27 | immitis | homo_1 | 0.555555555555556 |
5987 | Supercontig_3.2 | 6 | i | p | 27 | immitis | homo_1 | 0.666666666666667 |
5998 | Supercontig_3.2 | 6 | i | p | 27 | immitis | homo_1 | 0.703703703703704 |
5999 | Supercontig_3.2 | 6 | i | p | 27 | immitis | homo_1 | 0.703703703703704 |
8821 | Supercontig_3.3 | 6 | i | p | 27 | immitis | homo_1 | 0.555555555555556 |
11476 | Supercontig_3.4 | 6 | i | i | 51 | posadasii | homo_1 | 0.117647058823529 |
5.4 Introgressed genes
source("01-setup.R")
The strategy here is to use the coverage created by the bedgraphs to find genes that were introgressed. There are two bedgraph files created using the bedtools unionbg command:
head -3 data/p_i_i_populations_union.bg
chrom start end all AZ Guatemala Mex_TX Supercontig_3.1 0 37941 0 0 0 0 Supercontig_3.1 37941 38556 3 3 0 0
5.4.1 Intersect bedgraph unions with GFF files
Intersect the coverage with genes, requiring the gene to be 10% covered by the bedgraph line. Since the bedgraph has line with no introgression coverage, this contains some genes with no coverage The result has the same columns as the bedgraph and the GFF concatenated.
system("bedtools intersect -a data/i_p_i_populations_union.bg -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/immitis_introgressed_genes.bed") system("bedtools intersect -a data/p_i_i_populations_union.bg -b Coccidioides_immitis_rs.ASM14933v2.40-contigs-renamed.gff3 -F 0.1 -wa -wb -sortout > data/posadasii_introgressed_genes.bed")
5.4.2 Read and process the lists of introgressed genes
library(readr) imm_genes =read_delim("data/immitis_introgressed_genes.bed", "\t", col_names = c("supercontig", "cov_start", "cov_end", "species", "CA/LAm", "WA", "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>% # Extract the gene info from the GFF info string tidyr::extract(col="gene_info", into=c("description", "gene_id"), regex="description=(.*);gene_id=(.*);") %>% # Restrict to relevant columns {.[,c("supercontig", "cov_start", "cov_end", "species", "CA/LAm", "WA","gene_id", "description")]} %>% # Melt to put the population counts (e.g "species", "CA/LAm", etc.) in "variable" columne melt(id.vars=c("supercontig", "cov_start", "cov_end","gene_id", "description")) %>% # Convert the variable column for merging mutate(variable = as.character(variable)) %>% # Edit the variable to annotate with species name {.$variable[.$variable == "species"] = "immitis"; .} ## Same steps are repeated for the posadasii data. ## Probably should have written a function or something pos_genes =read_delim("data/posadasii_introgressed_genes.bed", "\t", col_names = c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX", "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>% tidyr::extract(col="gene_info", into=c("description", "gene_id"), regex="description=(.*);gene_id=(.*);") %>% {.[,c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX","gene_id", "description")]} %>% melt(id.vars=c("supercontig", "cov_start", "cov_end", "gene_id", "description"))%>% mutate(variable = as.character(variable)) %>% {.$variable[.$variable == "species"] = "posadasii"; .} ## Create a df with the counts of each population and species pop_counts = rbind(count(meta, "population2"), data.frame(population2=c("immitis", "posadasii"), freq=c(27,51))) colnames(pop_counts) = c("population", "population_count") ## Concatenate the imm and pos data, then annotate with the population counts gene_introgression = rbind(imm_genes, pos_genes) colnames(gene_introgression)[c(6,7)]=c("population", "introgression_count") gene_introgression = merge(gene_introgression, pop_counts) ## Same steps are repeated for the posadasii data. ## Probably should have written a function or something pos_genes =read_delim("data/posadasii_introgressed_genes.bed", "\t", col_names = c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX", "X1", "X2", "X3", "gene_start", "gene_end", "X4", "strand", "X5", "gene_info")) %>% tidyr::extract(col="gene_info", into=c("description", "gene_id"), regex="description=(.*);gene_id=(.*);") %>% {.[,c("supercontig", "cov_start", "cov_end", "species", "AZ", "Guatemala", "Mex/TX","gene_id", "description")]} %>% melt(id.vars=c("supercontig", "cov_start", "cov_end", "gene_id", "description"))%>% mutate(variable = as.character(variable)) %>% {.$variable[.$variable == "species"] = "posadasii"; .} ## Create a df with the counts of each population and species pop_counts = rbind(count(meta, "population2"), data.frame(population2=c("immitis", "posadasii"), freq=c(27,51))) colnames(pop_counts) = c("population", "population_count") ## Concatenate the imm and pos data, then annotate with the population counts gene_introgression = rbind(imm_genes, pos_genes) colnames(gene_introgression)[c(6,7)]=c("population", "introgression_count") gene_introgression = merge(gene_introgression, pop_counts) write.csv(gene_introgression, "data/gene_introgression.csv")
head(gene_introgression)
population | supercontig | cov_start | cov_end | gene_id | description | introgression_count | population_count |
---|---|---|---|---|---|---|---|
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12556 | hypothetical protein | 0 | 36 |
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12557 | hypothetical protein | 0 | 36 |
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12558 | hypothetical protein | 0 | 36 |
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12559 | hypothetical protein | 0 | 36 |
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12560 | hypothetical protein | 0 | 36 |
AZ | Supercontig_3.1 | 0 | 37941 | CIMG_12561 | hypothetical protein | 0 | 36 |
5.5 Figure 2 - Histograms of size of shared ancestry/introgressions
Figure 2 shows the size of shared and introgressed haplotypes calculated from the Int-HMM output.
source("01-setup.R") all_shared <- list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"), posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>% ldply(.id="species", function(x) x) %>% subset(genotype == "homo_2") %>% ## "homo_2" = "shared" subset(per_repeat < 5) ## Filter to less than 5% repetitive
"We characterized the general characteristics of shared ancestry between species. The majority of shared haplotypes are small. The average size of the shared haplotype in C. immitis is 64 bp (range = 1-63,135bp) and 46bp in C. posadasii (range 1-36,960) using 10-6 M/bp as the recombination rate."
all_shared %>%
ddply(.(species), plyr::summarize,
mean_size = mean(track_len),
max_size = max(track_len),
min_size = min(track_len))
species | mean_size | max_size | min_size |
---|---|---|---|
immitis | 64.86435722377 | 63135 | 1 |
posadasii | 46.1725482371125 | 36960 | 1 |
"The average size of a C. immitis to C. posadasii introgression was 3.4 kb. In the reciprocal direction, C. posadasii to C. immitis, the average size of an introgression was 4.2 kb."
introgression_mean_size <- all_shared %>% subset(track_len > 500) %>% ddply(.(species), plyr::summarize, mean_size = mean(track_len)) %>% transform(species = paste("C.", species))
species | mean_size |
---|---|
C. immitis | 4193.22263222632 |
C. posadasii | 3398.53901373283 |
## Plot of the size of shared ancestry tracts all_shared_plot <- all_shared %>% transform(species = paste("C.", species)) %>% ggplot(aes(x=log10(track_len)))+ geom_histogram(bins=15, fill=NA, col="black")+ facet_wrap(~species,scale="free_y")+theme_base()+ theme(plot.background=element_rect(fill="white", colour=NA), strip.text = element_text(face="italic"))+ xlab("log10 shared haplotype size (bp)")+ scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) ## Plot size of introgressed tracts introgression_size_plot <- all_shared %>% subset( track_len > 500) %>% ## This filters to 500bp transform(species = paste("C.", species)) %>% ggplot(aes(x=track_len))+ geom_histogram(bins=15, fill=NA, col="black")+ facet_wrap(~species,scale="free_y")+theme_base()+ theme(plot.background=element_rect(fill="white", colour=NA), strip.text = element_text(face="italic"))+ geom_vline(data=introgression_mean_size, aes(xintercept=mean_size), lty=3, col="blue")+ xlab("Introgressed haplotype size (bp)") ggsave("Figure2.pdf", ggarrange(all_shared_plot, introgression_size_plot, nrow=2, labels=c("A", "B")), height=7, width=7)
5.6 Figure 3 - Location and frequency of introgressions
source("01-setup.R") track_cov=read.csv("data/all-coverage-recomb6-imm-ref.csv", row.names=1) %>% transform(track_len = end-start) tracks = track_cov %>% subset(cov != 0)
"Consistent with our expectations, the majority of introgressions are at low frequency (mean allele frequency 14.1% and 16.9% for C. immitis and C. posadasii, respectively; Figure 3), however we did find some introgressions at high frequency (see below). There is not a significant difference in the mean introgressed allele frequency between the species (Permutation test; P = 0.0899)."
The allele frequency of introgressions was weighted by the size of the introgression.
# Weight by the size of the introgression to compute the means track_stats = tracks %>% ddply(.(species), plyr::summarize, mean = weighted.mean(cov/n_ind, track_len)) %>% transform(species=paste("C.", species))
species | mean |
---|---|
C. immitis | 0.14145629036292 |
C. posadasii | 0.169600587482152 |
Tested significance of the difference in allele frequency by permutation. Note that since it's stochastic that the P-value varies slightly from what is reported in the paper.
randomize_species = function(x){ x$species = sample(x$species, nrow(x), replace=TRUE) x} samples_species = rdply(.n=10000, .expr = { randomize_species(tracks) %>% ddply(.(species), plyr::summarize, mean = weighted.mean(cov/n_ind, track_len))}, .progress = "text") %>% dcast(.n~species) the_ecdf = ecdf(samples_species$immitis - samples_species$posadasii) the_ecdf(0.1414563 - 0.1696006) # see allele freqs above
0.0918
track_loc <- track_cov %>% transform(supercontig=gsub("Supercontig_", "", supercontig)) %>% transform(species=paste("C.", species)) %>% ggplot(aes(x=start/1e6, y=cov/n_ind))+ geom_step(size=0.3, alpha=0.5)+ facet_grid(species~supercontig, space="free_x",scale="free_x")+ theme_base(14)+xlab("Position (Mb)")+ ylab("Introgressed region frequency")+ ylim(0,1.05)+geom_hline(data=track_stats, aes(yintercept=mean), col="blue", lty=3)+ theme(plot.background=element_rect(fill="white", colour=NA), strip.text.y=element_text(face="italic")) track_freq = track_cov %>% transform(supercontig=gsub("Supercontig_", "", supercontig)) %>% transform(species=paste("C.", species)) %>% ggplot(aes(cov/n_ind))+ facet_grid(species~"")+ geom_histogram(bins=30, aes(y=..count../100))+coord_flip()+ theme_base(14)+ylab("Count (x100)")+xlim(0,1.05)+ xlab("")+geom_vline(data=track_stats, aes(xintercept=mean), col="blue", lty=3)+ theme(plot.background=element_rect(fill="white", colour=NA), strip.text.y=element_text(face="italic")) track_plot <- cowplot::plot_grid(track_loc, track_freq, ncol=2, rel_widths=c(1,0.3),labels=c("A", "B"))+ theme(plot.background=element_rect(fill="white", colour=NA)) ggsave("Figure3.pdf", track_plot, height=3.5, width=10)
5.7 Figure 4 - Example of an introgression
This code defines a function that uses the data from the HMM to produce a summary graphic of a region around a gene of interest. I chose this one because it is a fixed introgression. We can call it fixed because the HMM extends the introgression tracts to overlap at the gene.
source("01-setup.R") library(gggenes) all_tracks = read.csv("data/all-tracks-recomb6-imm-ref.csv", row.names=1) load("data/marker-freq-recomb6-imm-ref.Rdata") track_cov = read.csv("data/all-coverage-recomb6-imm-ref.csv", row.names=1) %>% transform(freq = cov/n_ind) plot_region = function(st = 101423, en = 115414, flank = 0, contig = "Supercontig_3.2", sp = "posadasii", GENE = "CIMG_09822", tc = track_cov, sf = marker_freq2, the_meta = meta, genes = all_genes, the_out = "tmp.pdf"){ high_freq_genes = subset(genes, (start > st - flank) & (stop < en + flank) & (supercontig == contig)) high_freq_genes = subset(genes, (start > st - flank) & (stop < en + flank) & (supercontig == contig)) high_freq_genes$direction <- ifelse(high_freq_genes$strand == "+", 1, -1) high_freq_region_bg = subset(tc, (start > st - flank) & (end < en + flank) & (species == sp) & (supercontig == contig)) high_freq_sites = subset(sf, (pos > st - flank) & (pos < en + flank) & (species == sp) & (supercontig == contig)) %>% transform(variable = factor(variable, levels=c("homo_1", "homo_2"), labels=c("not shared", "shared"))) sample_order = the_meta %>% subset(species == sp) %>% with(as.character(sample)[order(as.character(population2, population))]) high_freq_region = subset(all_tracks, (start > st - flank) & (end < en + flank) & (species == sp) & (supercontig == contig)) %>% subset(select = c(ind, start, end)) not_in_data = sample_order[!(sample_order %in% high_freq_region$ind)] if( length(not_in_data) > 0){ high_freq_region = rbind(high_freq_region, data.frame(ind = not_in_data, start=NA, end=NA)) } high_freq_region = high_freq_region %>% merge(the_meta, by.x="ind", by.y="sample") %>% transform(ind = factor(as.character(ind), levels=sample_order)) #high_freq_genes = subset(all_genes, # (start > (st - flank)) & # (stop < (en + flank)) & # (supercontig == contig)) plot_coords = subset(genes, gene_id == GENE) %>% {.[1,c("start", "stop")]} %>% as.numeric() high_1 = high_freq_region %>% ggplot(aes(x=start, xend=end,y=ind,yend=ind,col=population2))+ geom_segment(size=1)+ ylab("Isolate")+ xlab("Supercontig 3.2 coordinate")+ xlim(st-flank,en+flank)+ theme_base()+ facet_grid(population2~., scale="free_y", space="free_y")+ geom_vline(xintercept=plot_coords, lty=3)+ scale_color_discrete("")+ theme(plot.background=element_rect(fill="white", colour=NA), axis.text.y = element_text(size=6), legend.position ="bottom", strip.background =element_blank(), strip.text=element_blank(), panel.grid.major.y=element_line(size=0.1,color="grey70", linetype="dotted")) high_2 = high_freq_region_bg %>% ggplot(aes(x=start,y=cov/n_ind))+geom_step()+ ylab("Introgressed region\nfrequency")+ xlab("")+ xlim(st-flank,en+flank)+theme_base()+ geom_vline(xintercept=plot_coords, lty=3)+ theme(plot.background=element_rect(fill="white", colour=NA)) high_3 = high_freq_sites %>% ggplot(aes(x=pos,y=value,col=variable))+geom_point(alpha=0.5)+ ylab("Marker\nfrequency")+ xlab("")+scale_color_manual("Inferred genotype", values=c("blue", "darkorange"))+ xlim(st-flank,en+flank)+theme_base()+theme(legend.position="top")+ geom_vline(xintercept=plot_coords, lty=3)+ theme(plot.background=element_rect(fill="white", colour=NA), strip.text.x = element_text(size = 12)) high_4 = ggplot(high_freq_genes, aes(xmin=start,xmax=stop,x=start+(stop-start)/2,y=direction, forward=direction, label=gene_id))+ geom_gene_arrow(arrowhead_height = unit(3, "mm"), arrowhead_width = unit(1, "mm"))+ theme_base()+geom_text(aes(y=direction+0.5), size=1.75)+xlim(st-flank,en+flank)+ geom_vline(xintercept=plot_coords, lty=3)+ theme(plot.background=element_rect(fill="white", colour=NA), legend.pos="bottom", axis.text.y=element_blank())+xlab("")+ylab("")+ylim(-1.5,1.5) high_p <- cowplot::plot_grid(high_3, high_2, high_4, high_1 , nrow=4, rel_heights=c(0.5, 0.3, 0.3, 1), align="vh", axis="l", labels = c("A", "B","C","D")) ggsave(the_out, high_p, height=9, width=6)} plot_region(st=2945000,en=2945000,flank=20000, contig="Supercontig_3.5", sp="posadasii",GENE="CIMG_09462", the_out="Figure4.pdf")
5.8 Figure S2 - Principle components analysis
We ran the program PCAngsd on our cluster
source("01-setup.R") ## This was created by plot-angsd-pca-display.R load("angsd.Rdata")
The following are in the Rdata file:
- PCA_X = results of the PCA
- admix_X = results of the admixture mapping
- read_admix and read_pca were the functions used to read in the data
- meta, meta_imm, meta_pos are metadata files
ls()
[1] "admix_both" "admix_imm" "admix_pos" "all_genes" [5] "meta" "meta_imm" "meta_pop_counts" "meta_pos" [9] "PCA_both" "PCA_imm" "PCA_pos" "read_admix" [13] "read_pca"
Each PCA object contains:
- $PCA a data.frame with the first four PCs and the metadata added
- $values a vector with the eigenvectors of the SVD
str(PCA_both)
List of 2 $ PCA :'data.frame': 78 obs. of 8 variables: ..$ PC1 : num [1:78] -0.155 -0.153 -0.154 -0.155 -0.155 ... ..$ PC2 : num [1:78] -0.006966 -0.000137 0.001286 0.006966 0.001831 ... ..$ PC3 : num [1:78] 0.00147 -0.00141 0.00324 -0.00115 -0.00553 ... ..$ PC4 : num [1:78] 0.025 -0.113 -0.109 -0.129 -0.146 ... ..$ species : Factor w/ 2 levels "immitis","posadasii": 1 1 1 1 1 1 1 1 1 1 ... ..$ sample : Factor w/ 78 levels "730332_Guatemala",..: 4 10 15 17 30 31 32 33 34 35 ... ..$ population : Factor w/ 6 levels "AZ","CA","CO/NV",..: 4 4 4 4 2 2 2 2 2 2 ... ..$ population2: Factor w/ 5 levels "AZ","CA/LAm",..: 2 2 2 2 2 2 2 2 2 2 ... $ values: num [1:78] 73.73 4.22 3.44 2.79 1.33 ...
gg_color_hue <- function(n) { hues = seq(15, 375, length = n + 1) hcl(h = hues, l = 65, c = 100)[1:n] } colors=gg_color_hue(6) names(colors) = levels(PCA_both$PCA$population) ## Calculate the percent variance explained for each eigenvector ## Sum across the eigenvalues and then normalize. both_PC1 = percent(PCA_both$values[1]/sum(PCA_both$values)) both_PC2 = percent(PCA_both$values[2]/sum(PCA_both$values)) imm_PC1 = percent(PCA_imm$values[1]/sum(PCA_imm$values)) imm_PC2 = percent(PCA_imm$values[2]/sum(PCA_imm$values)) pos_PC1 = percent(PCA_pos$values[1]/sum(PCA_pos$values)) pos_PC2 = percent(PCA_pos$values[2]/sum(PCA_pos$values)) pca_1 <- ggplot(PCA_both$PCA, aes(x=PC1, y=PC2, col=population, pch=species))+ geom_point(size=2, alpha=0.75)+ scale_shape_manual(values=c(1,17))+ theme_base()+ theme(plot.background=element_rect(fill="white", colour=NA))+ scale_color_manual("location", values=colors)+ xlab(str_interp("PC1 (${both_PC1})"))+ ylab(str_interp("PC2 (${both_PC2})")) pca_2 <- ggplot(PCA_imm$PCA, aes(x=PC1, y=PC2, col=population, pch=species))+ geom_point(size=2, alpha=0.75)+ scale_shape_manual(values=c(1))+ theme_base()+ theme(plot.background=element_rect(fill="white", colour=NA))+ theme(legend.position = "none")+ scale_color_manual("location", values=colors[levels(PCA_imm$PCA$population)])+ xlab(str_interp("PC1 (${imm_PC1})"))+ ylab(str_interp("PC2 (${imm_PC2})")) pca_3 <- ggplot(PCA_pos$PCA, aes(x=PC1, y=PC2, col=population, pch=species))+ geom_point(size=2, alpha=0.75)+ scale_shape_manual(values=c(17))+ theme_base()+ theme(plot.background=element_rect(fill="white", colour=NA))+ theme(legend.position = "none")+ scale_color_manual("location", values=colors[levels(PCA_pos$PCA$population)])+ xlab(str_interp("PC1 (${pos_PC1})"))+ ylab(str_interp("PC1 (${pos_PC2})")) out <- ggarrange(pca_1, ggarrange(pca_2, pca_3, nrow=2, labels=c("B", "C")), ncol = 2, labels="A", common.legend=TRUE, legend="right", widths=c(1.75,1)) ggsave("FigureS2.pdf",out,width=9,height=5)
5.9 Figure S3 - Admixture plots
source("01-setup.R") ## This was created by plot-angsd-pca-display.R load("angsd.Rdata")
$K is the name of the eigenvector $value is the percentage of the genome explained by that eigenvector
str(admix_both)
'data.frame': 390 obs. of 6 variables: $ sample : Factor w/ 78 levels "Phoenix_1","Phoenix_2",..: 59 59 59 59 59 60 60 60 60 60 ... $ K : Factor w/ 5 levels "K 1","K 2","K 3",..: 1 4 3 2 5 1 5 3 2 4 ... $ value : num 9.99e-05 9.99e-05 2.73e-04 2.77e-04 9.99e-01 ... $ species : Factor w/ 2 levels "immitis","posadasii": 2 2 2 2 2 2 2 2 2 2 ... $ population : Factor w/ 6 levels "AZ","CA","CO/NV",..: 4 4 4 4 4 4 4 4 4 4 ... $ population2: Factor w/ 5 levels "AZ","CA/LAm",..: 3 3 3 3 3 3 3 3 3 3 ...
get_max_ordering <- function(x){ rankings = tapply(x$value, x$K, max) # Find the maximum explained in an individual by that K order_order = names(rankings)[order(rankings)] # Order the Ks by the max x_cast = dcast(x, sample~K, value.var="value") # data.frame of samples x K values the_order = splat(order)( x_cast[,rev(order_order)] ) # pass the order function each of the K value columns to get the ordering x_cast$sample[the_order] # Return samples in order of the K values } sample_labels2 = unlist(dlply(admix_both, .(population2), get_max_ordering)) out2 <- admix_both %>% # This puts the factors in the right order, labels the species correctly # and puts the K values in a pleasing order. transform(population2 = factor(population2, levels=c("AZ","CA/LAm","Guatemala","Mex/TX","WA"), labels=c("AZ","CA/LAm","Guat.","Mexico/TX","WA")), species=factor(species, levels=c("immitis", "posadasii"), labels=c("imm.", "pos.")), sample = factor(sample, levels=sample_labels2), K=factor(K, levels=paste("K", c(3,4,1,5,2)))) %>% ggplot(aes(x=sample, y=value,fill=K))+ theme_bw()+geom_bar(stat="identity")+coord_flip()+ scale_fill_discrete("")+ facet_grid(species+population2~., scale="free_y", space="free_y")+ theme_base(13)+theme(plot.background=element_rect(fill="white", colour=NA), strip.background=element_rect(fill="white", colour="black"))+ xlab("")+ylab("% admixture") ggsave("FigureS3.pdf", out2, width=7, height=12)
5.10 Figure S4 - Compare effect of recombination rate on introgression size, etc.
This code reads in the admixture tracks for each individual and recombination rate.
source("01-setup.R") ## Read in all the recombination rates read_string = "hmm_tables_united/${prefix}_hmm_hap_pab_01_${recomb}_all_30_tracks.csv" to_read <- expand.grid(prefix=c("p_i_i", "i_p_i"), recomb=c(5,6,7,9)) tracks <- ddply(to_read, .(prefix, recomb), with, read.csv(str_interp(read_string))) ## Summarize mean length, count, and total length of introgressions by ## recombination rate track_summaries <- tracks %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) %>% ddply(.(ind, prefix, recomb), plyr::summarize, avg_len_introgression = mean(track_len), count_introgression = length(track_len), sum_introgression = sum(track_len)) track_summaries2 <- merge(track_summaries, meta, by.x="ind", by.y="sample") ## Numbers are for the immitis reference imm_ref = sum(c(8482323,5260912,4323945,3952524,3469364,3458857)) track_summaries2$percent_genome = with(track_summaries2, 100*sum_introgression/imm_ref) ## Transform and re-annotate the levels with nice names comb_melt <- melt(track_summaries2, measure.vars=c("avg_len_introgression", "count_introgression", "percent_genome")) %>% transform(variable = factor(variable, levels=c("percent_genome", "avg_len_introgression", "count_introgression"), labels=c("Introgressed\ngenome (%)", "Average introgression\ntract length (bp)", "Number of\nintrogression tracts")))
Plot the statistics.
compare_by_recomb <- comb_melt %>% subset(recomb != 9) %>% # Exclude 10^-9 recomb rate. Not analyzed in paper. transform( recomb = paste0("10^-", recomb)) %>% ggplot(aes(x=recomb, y=value, group=ind))+ geom_line(alpha=0.5)+geom_point(alpha=0.5)+ facet_grid(variable~species, scale="free_y")+theme_bw()+ xlab("recomb. rate (M/bp)")+theme_base()+ylab("") ggsave("FigureS4.pdf", compare_by_recomb, height=7, width=7)
5.11 Figure S5 - Plot jaccard stat for different recomb. rates
The goal of this plot is to look at how similar the introgressions identified by the HMM at different rates of assumed recombination. The method is to compute the jaccard statistic for each individual for the three different comparisons of recombination rate and to vary the percent of overlap required for the features to count as "the same"
source("01-setup.R") ## A list of individuals for each species imm_indiv = as.character(unique(read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_5_all_30_tracks.csv")$ind)) pos_indiv = as.character(unique(read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_5_all_30_tracks.csv")$ind)) ## Read in the tracks for each recombination rate. Restrict to introgressions. imm_data5 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_5_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) imm_data6 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) imm_data7 = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_7_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) pos_data5 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_5_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) pos_data6 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5) pos_data7 = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_7_all_30_tracks.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% subset(per_repeat < 5)
This uses the bedtools jaccard
function to calculate the jaccard
stat. The flag -r
requires a reciprocal overlap based on the percent
overlap required by -f
.
## write a bed file based on the track files, then sort it write_bed = function(df, n){ df %>% {.[,c("supercontig", "start", "end", "ind")]} %>% write.table("tmp.bed",quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) system(str_interp("bedtools sort -i tmp.bed > ${n}")) system("rm tmp.bed") } ## Calculate the jaccard stat between two track data frames calc_jacc = function(A,B,f){ write_bed(A, "data/A.bed") write_bed(B, "data/B.bed") read.table(text=system(str_interp("bedtools jaccard -r -f ${f} -a data/A.bed -b data/B.bed"), intern=T),header=T,sep="\t") }
Make data frame that contain the jaccard statistic between each of the recombination rates for each of the individuals.
## C. immitis jacc_imm56 = expand.grid(ind=imm_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(imm_data5, ind == x$ind) B = subset(imm_data6, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T) jacc_imm67 = expand.grid(ind=imm_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(imm_data7, ind == x$ind) B = subset(imm_data6, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T) jacc_imm57 = expand.grid(ind=imm_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(imm_data5, ind == x$ind) B = subset(imm_data7, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T) ## C. posadasii jacc_pos56 = expand.grid(ind=pos_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(pos_data5, ind == x$ind) B = subset(pos_data6, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T) jacc_pos67 = expand.grid(ind=pos_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(pos_data7, ind == x$ind) B = subset(pos_data6, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T) jacc_pos57 = expand.grid(ind=pos_indiv, f=c(seq(0.001, 0.95, by=0.1),1)) %>% ddply(.(ind, f), function(x){ A = subset(pos_data5, ind == x$ind) B = subset(pos_data7, ind == x$ind) calc_jacc(A,B,x$f) }, .inform=T)
plot_jacc = function(d){ for_plot = d for_text = subset(transform(d,x=0), (jaccard < 0.75) & (f == 0.001)) ggplot(for_plot, aes(x=f,y=jaccard,group=ind))+ geom_line(alpha=0.5)+ geom_text(data=for_text, aes(x=x,y=jaccard,label=ind),size=2)+ ylim(0,1)+theme_base()+xlab("% overlap required")+ xlim(-0.2,1)+ylab("jaccard stat")+ theme(plot.background = element_blank())} jaccard_plot = ggpubr::ggarrange( plot_jacc(jacc_imm56), plot_jacc(jacc_imm67), plot_jacc(jacc_imm57), plot_jacc(jacc_pos56), plot_jacc(jacc_pos67), plot_jacc(jacc_pos57), labels=c("A", "B", "C", "D", "E", "F")) ggsave("FigureS5.pdf", height=6, width=12)
At the most stringent cutoff (100% overlap required), the mean Jaccard statistic comparing a recombination rate of 10-5 and 10-7 was 0.55 for C. immitis and 0.54 for C. posadasii. If a 50% overlap was required, the mean statistic was 0.82 for C. immitis and C. posadasii. Finally, if only a 0.1% overlap was required, the statistic was 0.89 and 0.85 for C. immitis and C. posadasii, respectively.
print("100% overlap required") print("C. immitis") mean(subset(jacc_imm57, f==1)$jaccard) print("C. posadasii") mean(subset(jacc_pos57, f==1)$jaccard) print("50% overlap required") print("C. immitis") mean(subset(jacc_imm57, f==0.501)$jaccard) print("C. posadasii") mean(subset(jacc_pos57, f==0.501)$jaccard) print("0.1% overlap required") print("C. immitis") mean(subset(jacc_imm57, f==0.001)$jaccard) print("C. posadasii") mean(subset(jacc_pos57, f==0.001)$jaccard)
[1] "100% overlap required" [1] "C. immitis" [1] 0.7639479 [1] "C. posadasii" [1] 0.6840271 [1] "50% overlap required" [1] "C. immitis" [1] 0.880334 [1] "C. posadasii" [1] 0.8644842 [1] "0.1% overlap required" [1] "C. immitis" [1] 0.8822519 [1] "C. posadasii" [1] 0.8789823
5.12 Figure S6 - Correlation between species
The strategy for correlation is to use bedtools to make windows in the genome, the to count intersections of those windows with introgression tracts across all individuals for the two species, then finally to look at the correlation between the two.
source("01-setup.R") get_correlation = function(window_size){ system(str_interp("bedtools makewindows -g immitis.genome -w $[i]{window_size} -s $[i]{window_size/2} > data/bedtools_windows.bed")) window_counts_imm = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "imm")) window_counts_pos = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "pos")) comb = merge(window_counts_imm, window_counts_pos, by=c("supercontig", "start", "end")) with(comb, cor(imm,pos))} window_sizes = seq(10000,500000,10000) correlations = data.frame(window_size = window_sizes, r=sapply(window_sizes, get_correlation)) corr_plot = ggplot(correlations, aes(x=window_size,y=r))+ geom_point(pch=1)+geom_line()+theme_bw(14)+ geom_hline(yintercept=0.54, lty=2)+ ylab("Pearson's r")+xlab("Window size (bp)") ggsave("FigureS6.pdf", corr_plot, height=4, width=4)
5.13 Figure S7, Table S3, S4, S5 - Venn diagram with Neafsey, potentially adaptive genes
source("01-setup.R") gene_introgression = read.csv("data/gene_introgression.csv") ### For comparison to our data neafsey <- read.csv("Neafsey_SupTable1.csv", header=FALSE) %>% transform( neafsey = str_trim( V1 )) intro_genes = gene_introgression %>% ## Restrict to lines with some introgression coverage subset(introgression_count > 0) %>% ## If there are multiple depths of coverage for a gene in a population, ## then return only the deepest ddply(.(population, supercontig, gene_id, description), function(x){ subset(x, introgression_count == max(introgression_count))[1,] }) %>% transform(species = c("AZ"="posadasii", "Guatemala" = "posadasii", "Mex/TX"="posadasii", "CA/LAm"="immitis", "WA"="immitis", "immitis" = "immitis", "posadasii" = "posadasii")[as.character(population)])
5.13.1 Write table S3 (all introgressed genes)
intro_genes %>% transform(neafsey = gene_id %in% neafsey$neafsey) %>% write.csv("TableS3.csv")
5.13.2 Test for correlation between introgressed genes in this study and Neafsey's
fisher.test(all_genes$gene_id %in% neafsey$neafsey, all_genes$gene_id %in% intro_genes$gene_id)
Fisher's Exact Test for Count Data data: all_genes$gene_id %in% neafsey$neafsey and all_genes$gene_id %in% intro_genes$gene_id p-value < 2.2e-16 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 7.077592 10.150718 sample estimates: odds ratio 8.478441
5.13.3 Figure S7 (Venn diagram)
library(Vennerable) pdf("FigureS7.pdf") plot(Venn(list(Neafsey = unique(as.character(neafsey$neafsey)), `This study`=unique(as.character(intro_genes$gene_id))))) dev.off()
2
5.13.4 Table S4 - Write table of genes introgressed at high frequency globally
The rule here is that the frequency should be greater than 50% if the population is greater than 10 individuals. Otherwise require it to be 100%
high_freq_genes = subset(gene_introgression, ((population_count < 10) & (population_count == introgression_count)) | ((population_count > 10) & (introgression_count/population_count > 0.5))) high_freq_global = high_freq_genes %>% subset( population %in% c("immitis", "posadasii")) %>% ## If there are multiple depths of coverage for a gene in a population, ## then return only the deepest ddply(.(population, supercontig, gene_id, description), function(x){ subset(x, introgression_count == max(introgression_count))[1,] }) ## Write table. Order by population and then frequency. Make the values nice looking. high_freq_global = high_freq_global[order(high_freq_global$population, high_freq_global$introgression_count/high_freq_global$population_count, decreasing = TRUE),] high_freq_global %>% summarize(species=population, loc = paste0(supercontig,": ",scales::comma(cov_start),"-",scales::comma(cov_end)), freq = paste0( scales::percent(introgression_count/population_count), " (", introgression_count,"/",population_count,")"), gene_id=gene_id, description=description) %>% write.csv("TableS4.csv")
print("Number of introgressed genes at high frequency") print(length(unique(high_freq_global$gene_id))) # 55 print(" ... in immitis") print(nrow(subset(high_freq_global, population == "immitis"))) #3 print(" ... in posadasii") print(nrow(subset(high_freq_global, population == "posadasii"))) #57
[1] "Number of introgressed genes at high frequency" [1] 55 [1] " ... in immitis" [1] 3 [1] " ... in posadasii" [1] 54
5.13.5 Table S5 - Write table of genes introgressed at high frequency in particular populations
high_freq_local = high_freq_genes %>% subset(!(population %in% c("immitis", "posadasii"))) %>% transform(species = c("AZ"="posadasii", "Guatemala" = "posadasii", "Mex/TX"="posadasii", "CA/LAm"="immitis", "WA"="immitis")[as.character(population)]) %>% ## If there are multiple depths of coverage for a gene in a population, ## then return only the deepest ddply(.(species, population, supercontig, gene_id, description), function(x){ subset(x, introgression_count == max(introgression_count))[1,] }) ## Write the table as above. high_freq_local = high_freq_local[order(high_freq_local$population, high_freq_local$introgression_count/high_freq_local$population_count, decreasing = TRUE),] high_freq_local %>% summarize(population=population, loc = paste0(supercontig,": ",scales::comma(cov_start),"-",scales::comma(cov_end)), freq = paste0( scales::percent(introgression_count/population_count), " (", introgression_count,"/",population_count,")"), gene_id=gene_id, description=description) %>% write.csv("TableS5.csv")
high_freq_local_only = subset(high_freq_local, !(gene_id %in% high_freq_global$gene_id)) print("Number of introgressed genes found locally at high frequency") print(length(unique(high_freq_local_only$gene_id))) # 161 hf_pos = subset(high_freq_local_only, species == "posadasii") hf_imm = subset(high_freq_local_only, species == "immitis") print(" ... in immitis") print(length(unique(hf_imm$gene_id))) #31 print(" ... in posadasii") print(length(unique(hf_pos$gene_id))) #78
[1] "Number of introgressed genes found locally at high frequency" [1] 106 [1] " ... in immitis" [1] 31 [1] " ... in posadasii" [1] 78
A list of genes reciprocally introgressed at high frequency.
found_in_both = intersect(hf_imm$gene_id, hf_pos$gene_id)
subset(high_freq_local_only, gene_id %in% found_in_both)[,-1]
population | supercontig | cov_start | cov_end | gene_id | description | introgression_count | population_count | species |
---|---|---|---|---|---|---|---|---|
WA | Supercontig_3.3 | 60716 | 61367 | CIMG_08918 | hypothetical protein | 5 | 5 | immitis |
Mex/TX | Supercontig_3.5 | 2932794 | 2933405 | CIMG_12350 | SacI domain-containing protein | 9 | 11 | posadasii |
Guatemala | Supercontig_3.2 | 102921 | 103597 | CIMG_09822 | multidrug resistance protein | 4 | 4 | posadasii |
CA/LAm | Supercontig_3.5 | 2934433 | 2935000 | CIMG_12350 | SacI domain-containing protein | 13 | 22 | immitis |
CA/LAm | Supercontig_3.2 | 103059 | 113538 | CIMG_09822 | multidrug resistance protein | 12 | 22 | immitis |
AZ | Supercontig_3.3 | 50529 | 65520 | CIMG_08918 | hypothetical protein | 19 | 36 | posadasii |
5.13.6 Calculations of total introgressions
### Read in and process the data all_cov = list("immitis" = read_delim("data/i_p_i_populations_union.bg", col_names = c("supercontig", "start", "stop", "species", "CA/LAm", "WA"), delim='\t', skip=1), "posadasii" = read_delim("data/p_i_i_populations_union.bg", col_names = c("supercontig", "start", "stop", "species", "AZ", "Guatemala", "Mex/TX"), delim='\t', skip=1)) %>% ## By melting each, I can concatenate b/c the columns like 'species' 'CA' etc. are in the variable column lapply(melt, id.vars=c("supercontig", "start" ,"stop")) %>% ldply(function(x) x, .id="species") colnames(all_cov)[5] = "cov_type" colnames(all_cov)[6] = "introgression_count" all_cov$cov_type = as.character(all_cov$cov_type) all_cov$cov_type[all_cov$cov_type == "species"] = as.character(all_cov$species[all_cov$cov_type == "species"]) all_cov = merge(all_cov, pop_counts, by.x="cov_type", by.y="population")
head(all_cov)
cov_type | species | supercontig | start | stop | introgression_count | population_count |
---|---|---|---|---|---|---|
AZ | posadasii | Supercontig_3.1 | 0 | 37941 | 0 | 36 |
AZ | posadasii | Supercontig_3.1 | 37941 | 38556 | 3 | 36 |
AZ | posadasii | Supercontig_3.1 | 38556 | 38557 | 1 | 36 |
AZ | posadasii | Supercontig_3.1 | 38557 | 39116 | 0 | 36 |
AZ | posadasii | Supercontig_3.1 | 39116 | 39202 | 1 | 36 |
AZ | posadasii | Supercontig_3.1 | 39202 | 39263 | 2 | 36 |
This displays the total number of introgressed regions
all_cov %>% subset(introgression_count > 0) %>% transform(freq = round(introgression_count/population_count,4)) %>% {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>% #transform(strand = ".") %>% write.table("data/all-introgressions.bed6", quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) print("Total") system("bedtools sort -i data/all-introgressions.bed6 | bedtools merge -c 4 -o distinct | wc -l") print("... immitis") all_cov %>% subset(introgression_count > 0) %>% subset(species == "immitis") %>% transform(freq = round(introgression_count/population_count,4)) %>% {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>% #transform(strand = ".") %>% write.table("data/all-introgressions-immitis.bed6", quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) system("bedtools sort -i data/all-introgressions-immitis.bed6 | bedtools merge -c 4 -o distinct | wc -l") print("... posadasii") all_cov %>% subset(introgression_count > 0) %>% subset(species == "posadasii") %>% transform(freq = round(introgression_count/population_count,4)) %>% {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>% #transform(strand = ".") %>% write.table("data/all-introgressions-posadasii.bed6", quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) system("bedtools sort -i data/all-introgressions-posadasii.bed6 | bedtools merge -c 4 -o distinct | wc -l")
[1] "Total" 416 [1] "... immitis" 142 [1] "... posadasii" 345
5.13.7 Calculate potentially adaptive introgressions
high_freq_introgressions = all_cov %>% subset(((population_count < 10) & (population_count == introgression_count)) | ((population_count > 10) & (introgression_count/population_count > 0.5))) ## Format as a BED, and write high_freq_introgressions %>% transform(freq = round(introgression_count/population_count,4)) %>% {.[,c("supercontig", "start", "stop", "cov_type", "freq", "introgression_count")]} %>% #transform(strand = ".") %>% write.table("data/high-frequency-introgressions.bed6", quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE) all_genes %>% {.[,c("supercontig", "start", "stop", "gene_id", "description", "strand")]} %>% write.table("data/genes.bed", quote=FALSE, sep="\t", row.names=FALSE, col.names=FALSE)
"In total, we found 6 and 32 introgressions in C. immitis and C. posadasii, respectively, at greater than 50% frequency in the species."
There are 99 distinct potentially adaptive introgressions across both species.
system("bedtools sort -i data/high-frequency-introgressions.bed6 | bedtools merge -c 4,5 -o collapse > data/high-frequency-introgressions-merged.bed6") print("High frequency in immitis") system("grep 'immitis' data/high-frequency-introgressions-merged.bed6 | wc -l") #6 print("High frequency in immitis population") system("grep 'WA\\|CA/LAm' data/high-frequency-introgressions-merged.bed6 | grep -v 'immitis' | wc -l") #25 print("High frequency in posadasii") system("grep 'posadasii' data/high-frequency-introgressions-merged.bed6 | wc -l") #32 print("High frequency in posadasii population") system("grep 'AZ\\|Guatemala\\|Mex/TX' data/high-frequency-introgressions-merged.bed6 | grep -v 'posadasii' | wc -l")
[1] "High frequency in immitis" 6 [1] "High frequency in immitis population" 25 [1] "High frequency in posadasii" 32 [1] "High frequency in posadasii population" 41
5.14 Figure S8 - Amount of introgression in population
source("01-setup.R") all_tracks = read.csv("data/all-tracks-recomb6-imm-ref.csv") track_summaries = ddply(all_tracks, .(ind), plyr::summarize, avg_len_introgression = mean(track_len), count_introgression = length(track_len), sum_introgression = sum(track_len)) track_summaries2 = merge(track_summaries, meta, by.x="ind", by.y="sample") immitis_genome_size=sum(8482323,5260912,4323945,3952524,3469364,3458857) track_summaries2$percent_genome = with(track_summaries2, 100*sum_introgression/immitis_genome_size) comb_melt <- melt(track_summaries2, measure.vars=c("avg_len_introgression", "count_introgression", "percent_genome")) %>% ## Make the labels nice. transform(variable = factor(variable, levels=c("percent_genome", "avg_len_introgression", "count_introgression"), labels=c("Introgressed\ngenome (%)", "Average introgression\ntract length (bp)", "Number of\nintrogression tracts")))
Test for differences between the populations. Nothing significant at p = 0.008 cutoff.
ddply(comb_melt, .(variable,species, population2), summarize, mean = round(mean(value),2), sd = round(sd(value),2)) library(broom) dlply(comb_melt, .(variable, species), function(x) aov(lm(value~population2, data=x))) %>% ldply(tidy)
Introgressed | |||||||
genome (%) | immitis | population2 | 1 | 0.183410940632791 | 0.183410940632791 | 6.27945643000321 | 0.0190891407977407 |
Introgressed | |||||||
genome (%) | immitis | Residuals | 25 | 0.730202298070159 | 0.0292080919228064 | nil | nil |
Introgressed | |||||||
genome (%) | posadasii | population2 | 2 | 0.245040687686517 | 0.122520343843258 | 3.2646935133915 | 0.0468441021317963 |
Introgressed | |||||||
genome (%) | posadasii | Residuals | 48 | 1.80138701545892 | 0.0375288961553942 | nil | nil |
Average introgression | |||||||
tract length (bp) | immitis | population2 | 1 | 14959699.5439941 | 14959699.5439941 | 7.45067757294761 | 0.011445035810654 |
Average introgression | |||||||
tract length (bp) | immitis | Residuals | 25 | 50195768.7657519 | 2007830.75063008 | nil | nil |
Average introgression | |||||||
tract length (bp) | posadasii | population2 | 2 | 2122651.2603628 | 1061325.6301814 | 2.90925101637238 | 0.0641846478960499 |
Average introgression | |||||||
tract length (bp) | posadasii | Residuals | 48 | 17510909.1522223 | 364810.607337964 | nil | nil |
Number of | |||||||
introgression tracts | immitis | population2 | 1 | 14.0121212121212 | 14.0121212121212 | 0.311476116571823 | 0.581739941893402 |
Number of | |||||||
introgression tracts | immitis | Residuals | 25 | 1124.65454545455 | 44.9861818181818 | nil | nil |
Number of | |||||||
introgression tracts | posadasii | population2 | 2 | 607.189542483662 | 303.594771241831 | 2.19194673891886 | 0.122756106877944 |
Number of | |||||||
introgression tracts | posadasii | Residuals | 48 | 6648.22222222222 | 138.50462962963 | nil | nil |
panelA <- comb_melt %>% subset(variable == "Introgressed\ngenome (%)") %>% transform(species = paste("C.",species)) %>% ggplot(aes(x=population2, y=value))+ geom_boxplot(fill=NA)+ geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+ facet_grid(.~species, scale="free")+theme_bw()+ scale_fill_discrete(guide=FALSE)+ theme_base()+ylab("Introgressed\ngenome (%)")+xlab("")+ theme(plot.background = element_rect(colour = NA), strip.text.x=element_text(face="italic")) panelB <- comb_melt %>% subset(variable == "Average introgression\ntract length (bp)") %>% transform(species = paste("C.",species)) %>% ggplot(aes(x=population2, y=value))+ geom_boxplot(fill=NA)+geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+ facet_grid(.~species, scale="free")+theme_bw()+ scale_fill_discrete(guide=FALSE)+ theme_base()+ylab("Average introgression\ntract length (bp)")+xlab("")+ theme(plot.background = element_rect(colour = NA), strip.text.x=element_blank()) panelC <- comb_melt %>% subset(variable == "Number of\nintrogression tracts") %>% transform(species = paste("C.",species)) %>% ggplot(aes(x=population2, y=value))+ geom_boxplot(fill=NA)+geom_dotplot(binaxis="y",stackdir = "center", fill="darkred")+ facet_grid(.~species, scale="free")+theme_bw()+ scale_fill_discrete(guide=FALSE)+ theme_base()+ylab("Number of\nintrogression tracts")+xlab("")+ theme(plot.background = element_rect(colour = NA), strip.text.x=element_blank()) compare_pops = ggarrange(panelA, panelB, panelC, labels=c("A", "B", "C"), nrow=3, heights = c(1,0.9,0.9)) ggsave("FigureS8.pdf", compare_pops, height=7,width=7)
6 Statistics
6.1 Percentage of tracts merged during filtering
"We reapplied this filter 4 times. This consolidation decreased the number of introgressions per individual by an average of 24% per individual in C. immitis (minimum 13% maximum 35%) and 12% in C. posadasii (minimum 2%, maximum 25%)."
The code reads in the filtered and the unfiltered track tables then looks at the ratio of their lengths.
source("01-setup.R") ## For immitis unfiltered_tracks = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_unftrk.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% # Introgressions ddply(.(ind), plyr::summarize, count_unfiltered = length(track_len)) filtered_tracks = read.csv("hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv")%>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% ddply(.(ind), plyr::summarize, count_filtered = length(track_len)) d = merge(filtered_tracks, unfiltered_tracks) print("C. immitis") print(with(d,summary((count_unfiltered - count_filtered)/count_unfiltered))) ## For posadasii unfiltered_tracks = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_unftrk.csv") %>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% ddply(.(ind), plyr::summarize, count_unfiltered = length(track_len)) filtered_tracks = read.csv("hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")%>% subset(genotype == "homo_2") %>% subset(track_len > 500) %>% ddply(.(ind), plyr::summarize, count_filtered = length(track_len)) d = merge(filtered_tracks, unfiltered_tracks) print("C. posadasii") print(with(d,summary((count_unfiltered - count_filtered)/count_unfiltered)))
[1] "C. immitis" Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1282 0.1892 0.2328 0.2390 0.2830 0.3580 [1] "C. posadasii" Min. 1st Qu. Median Mean 3rd Qu. Max. 0.02632 0.08664 0.11613 0.12042 0.15035 0.24771
6.2 Introgressions aren't uniformly distributed
This is a chi-squared test against a uniform distribution with bin sizes of 1MB.
"Introgressions in each species of Coccidioides are circumscribed to particular and distinct portions of their genomes. We divided the genome into 500kB bins and compared the observed number of introgressions in each bin to a null model of a uniform distribution. In each species, we rejected the null model (Figure 3A; C. posadasii into C. immitis: chi-259 =3188.0, P < 0.00001; C. immitis into C. posadasii: chi-259 =4,530.0, P < 0.00001)."
source("01-setup.R") window_size = 500000 system(str_interp("bedtools makewindows -g immitis.genome -w $[i]{window_size} -s $[i]{window_size/2} > data/bedtools_windows.bed")) window_counts_imm = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/i_p_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "imm")) window_counts_pos = read.table(text=system("bedtools intersect -c -F 0.51 -a data/bedtools_windows.bed -b data/p_i_i_hmm_hap_pab_01_6_all_30_tracks_sorted.bed6",intern=TRUE),sep="\t", col.names = c("supercontig", "start", "end", "pos")) the_counts = merge(window_counts_imm, window_counts_pos, by=c("supercontig", "start", "end")) print("Chi-squared statistics") print("C. immitis") print(sum((the_counts$imm - mean(the_counts$imm))^2/mean(the_counts$imm))) print("C. posadasii") print(sum((the_counts$pos - mean(the_counts$pos))^2/mean(the_counts$pos)))
[1] "Chi-squared statistics" [1] "C. immitis" [1] 5428.56 [1] "C. posadasii" [1] 8158.621
6.3 Test for differences in admixture
source("01-setup.R") all_shared <- list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"), posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>% ldply(.id="species", function(x) x) %>% subset(genotype == "homo_2") %>% ## "homo_2" = "shared" subset(per_repeat < 5) ## Filter to less than 5% repetitive immitis_genome_size=sum(8482323,5260912,4323945,3952524,3469364,3458857) all_shared_stats_by_indiv <- all_shared %>% ddply(.(species, ind), plyr::summarize, avg_len_introgression = mean(track_len), count_introgression = length(track_len), sum_introgression = sum(track_len)) %>% transform( percent_genome = 100*sum_introgression/immitis_genome_size) %>% merge(meta, by.x="ind", by.y="sample") introgression_stats_by_indiv <- all_shared %>% subset(track_len > 500) %>% ddply(.(species, ind), plyr::summarize, avg_len_introgression = mean(track_len), count_introgression = length(track_len), sum_introgression = sum(track_len))%>% transform( percent_genome = 100*sum_introgression/immitis_genome_size) %>% merge(meta, by.x="ind", by.y="sample")
6.3.1 Posadasii contains more admixed material
"Consistent with the results from our D-statistic calculations, the levels of introgression between the two reciprocal directions are not equivalent. Haplotypes with shared ancestry covered significantly more of the C. posadasii genome on average than the genome of C. immitis (0.9% and 0.54%, respectively; One Way ANOVA: F1,76 =74.8, P = 6.26 10-13)."
ANOVA for percent genome admixed
admix_anova = aov(lm(percent_genome~species.x, data=all_shared_stats_by_indiv)) print(summary(admix_anova))
Df Sum Sq Mean Sq F value Pr(>F) species.x 1 3.501 3.501 74.82 6.26e-13 *** Residuals 76 3.556 0.047 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Calculate the mean percent of the genome admixed
with(all_shared_stats_by_indiv, tapply(percent_genome, species.x, mean))
immitis | 0.542139414521459 |
posadasii | 0.987449352722918 |
6.3.2 Posadasii contains more introgressed material
"On average, C. immitis individuals carry slightly less introgressed material from C. posadasii than C. posadasii does from C. immitis (0.43% and 0.73% of their genomes, respectively; One-Way ANOVA: F1,76 =41.2, P = 1.08 10-8)."
introgression_anova = aov(lm(percent_genome~species.x, data=introgression_stats_by_indiv)) print(summary(introgression_anova))
Df Sum Sq Mean Sq F value Pr(>F) species.x 1 1.604 1.6036 41.17 1.08e-08 *** Residuals 76 2.960 0.0389 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv, tapply(percent_genome, species.x, mean))
immitis | 0.436171478931884 |
posadasii | 0.737559654746496 |
6.3.3 No sig. difference in the variance of introgression
"We found no significant difference in the variance in the percent of the genome introgressed between species (Levene's test for homogeneity of variance: F1,76 = 0.005, P = 0.94)."
library(car)
print(leveneTest(percent_genome~species.x, data=introgression_stats_by_indiv))
Loading required package: carData Levene's Test for Homogeneity of Variance (center = median) Df F value Pr(>F) group 1 0.005 0.9436 76
6.3.4 Most tracts are small
"Next, we compared the introgression haplotype sizes in each species. We found that the vast majority (98% for both species) of introgressed haplotypes are small (less than 25kb, which is ~0.1% of the genome)"
This gives the percentage of introgressed tracts that are less than 25,000bp
list(immitis = read.csv("./hmm_tables_united/i_p_i_hmm_hap_pab_01_6_all_30_tracks.csv"), posadasii = read.csv("./hmm_tables_united/p_i_i_hmm_hap_pab_01_6_all_30_tracks.csv")) %>% ldply(.id="species", function(x) x) %>% subset(genotype == "homo_2") %>% ## "homo_2" = "shared" subset(per_repeat < 5) %>% subset(track_len > 500) %>% transform(over_25000 = track_len>25000) %>% ddply(.(species), plyr::summarize, under_25000 = round(1-sum(over_25000)/length(over_25000),3))
immitis | 0.982 |
posadasii | 0.976 |
6.3.5 Immitis has larger introgressed haploblocks
"The mean haplotype size differs between the two species (One-Way ANOVA, F1,76 = 11.07, P = 0.0014). The average size of a C. immitis to C. posadasii introgression was 3.4 kb. In the reciprocal direction, C. posadasii to C. immitis, the average size of an introgression was 4.2 kb."
introgression_anova_size = aov(lm(avg_len_introgression~species.x, data=introgression_stats_by_indiv)) print(summary(introgression_anova_size))
Df Sum Sq Mean Sq F value Pr(>F) species.x 1 12348147 12348147 11.07 0.00136 ** Residuals 76 84789029 1115645 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv, tapply(avg_len_introgression, species.x, mean))
immitis | 4229.75693737504 |
posadasii | 3393.42010492857 |
6.3.6 Posadasii has a greater number of introgressions
The average number of introgressed tracts in an isolate was 30.1 and 62.8 for C. immitis and C. posadasii, respectively, and this difference was statistically significant (One-Way ANOVA, F1,76 = 171, P < 1e-10)
introgression_anova_number = aov(lm(count_introgression~species.x, data=introgression_stats_by_indiv, population)) print(summary(introgression_anova_number))
Df Sum Sq Mean Sq F value Pr(>F) species.x 1 1547 1546.9 75.26 5.6e-13 *** Residuals 76 1562 20.6 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(introgression_stats_by_indiv, tapply(count_introgression, species.x, mean))
immitis | 30.1111111111111 |
posadasii | 62.8235294117647 |
6.3.7 Washington immitis has less introgression than other immitis
"Coccidioides immitis from Washington also shows the least amount of introgression from C. posadasii (Figure S3)."
introgression_anova_immitis <- aov(lm(percent_genome~population2, data=subset(introgression_stats_by_indiv, species.x == "immitis"))) print(summary(introgression_anova_immitis))
Df Sum Sq Mean Sq F value Pr(>F) population2 1 0.1834 0.18341 6.279 0.0191 * Residuals 25 0.7302 0.02921 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
with(subset(introgression_stats_by_indiv, species.x == "immitis"),
tapply(percent_genome, population2, mean))
AZ | nil |
CA/LAm | 0.47546350276166 |
Guatemala | nil |
Mex/TX | nil |
WA | 0.263286574080871 |