This is for the genetic stock identification using rubias with the Northern Fulmar dataset: reference samples were collected at four breeding colonies and mixture samples are bycatch from the Alaskan longline fisheries NOAA Observer Program.
The background for this analysis is that I have large differences in sample sizes for my fulmar baseline colony dataset and that I know this can cause biased results in assignments - the colonies with smaller sample sizes often lose out.
After a few tests, I determined that downsampling the large colonies changes the distribution of bycatch assignments, with an increased share of the bycatch assigned to the small colonies. This seems to validate the idea that unequal baseline sample sizes are biasing bycatch assignments.
Another set of tests identified 58 (the number of samples for the second-smallest baseline colony) as a much better fit for downsampling than the smallest baseline (36 samples). Downsampling all four colonies to 36 samples eliminated too much of the genotype information.
When downsampling, the problem was that depending on what set.seed was used, different bycatch birds were assigned to colony-of-origin or left unassigned at the 90% probability of assignment threshold.
My plan here is to iterate over multiple set.seed values to generate a dataset that combines the subsampled datasets generated by each downsampled set.seed analysis.
Input bycatch genotypes were cleaned up in 04-tidy-bycatch-genos.Rmd
prior to this analysis.
Before I can perform the rubias assignment…
Mixture samples:
Baseline samples:
517 samples in the ref_samples df 517 in the baseline_no_hi_missers_clean_hwe df as well
Here is where I will take a random slice of the baseline_ids from the colonies with many more samples - the Pribs and the Semidis.
baseline_ids %>%
group_by(repunit) %>%
tally()
NA
Chagulak has the fewest number of samples, with 36, but that removes so much information when I tested so let’s equalize to St. Matthew, with 58 samples.
This means I need to split out the Chagulak samples prior to downsampling because there are < 58 samples.
I wrapped the iterative downsampling into a function. I’ll use that with 100 reps, keeping the indivs and colony assignments for 90% probability of assignment.
nreps <- 100 # 100 iterations
set.seed(321)
combined_ds_result <- lapply(1:nreps, function(x) downsample_baseline(base_ids = baseline_ids, n_dsample = 58, base_genos = baseline, no_hi_missers = no_hi_missers)[[1]]) %>%
bind_rows(.id = "iter")
summarize those results a bit
combined_ds_result %>%
arrange(indiv) %>%
group_by(indiv) %>% # if these are unique, there should never be multiple entries for the same sample with different collections
add_tally(name = "total") %>% # the maximum total number is 100 (the number of reps)
ungroup() %>%
group_by(indiv, collection) %>%
add_tally(name = "n_colony") %>% # this is the number of reps in which this indiv was assigned to this colony
mutate(prop_colony = n_colony/total) %>% # the proportion of assignments of this indiv to this colony
ungroup() %>%
select(indiv, collection, prop_colony) %>%
unique() %>%
ggplot(aes(x = prop_colony)) +
geom_density()
Now take that result and determine which indivs have assignment to a single colony 90% of the time
ds_bycatch_colonies <- combined_ds_result %>%
arrange(indiv) %>%
group_by(indiv) %>% # if these are unique, there should never be multiple entries for the same sample with different collections
add_tally(name = "total") %>% # the maximum total number is 100 (the number of reps)
ungroup() %>%
group_by(indiv, collection) %>%
add_tally(name = "n_colony") %>% # this is the number of reps in which this indiv was assigned to this colony
mutate(prop_colony = n_colony/total) %>% # the proportion of assignments of this indiv to this colony
ungroup() %>%
filter(prop_colony > 0.1) %>% # keep only indivs that were assigned to this colony 90% of the time
select(indiv, collection) %>%
unique()
# just the colony assignments that are > 90%
keepers <- ds_bycatch_colonies %>%
group_by(indiv) %>%
tally() %>%
filter(n == 1) %>%
select(-n) %>%
left_join(ds_bycatch_colonies)
Joining, by = "indiv"
# taking just those birds with a top colony assignment > 90% and adding back on the info about the PofZ
downsampled_bycatch_dataset <- keepers %>%
left_join(., combined_ds_result) %>%
select(indiv, collection, PofZ, z_score) %>% # take the top PofZ per indiv/colony
group_by(indiv) %>%
mutate(rank = rank(-PofZ, ties.method = "min")) %>%
mutate(top_rank = min(rank)) %>%
filter(rank == top_rank) %>%
#tally() # there is only a single entry per indiv-collection
select(-rank, -top_rank) %>%
left_join(., meta, by = c("indiv" = "NMFS_DNA_ID")) %>%
select(SAMPLE_ID, indiv, collection, PofZ, z_score) %>%
rename(NMFS_DNA_ID = indiv) %>%
ungroup()
Joining, by = c("indiv", "collection")
# save that output
downsampled_bycatch_dataset %>%
write_csv("csv_outputs/nofu_bycatch_downsampled_06122020.csv")
# calculate proportion of the retained bycatch that comes from each colony
downsampled_bycatch_dataset %>%
group_by(collection) %>%
tally() %>%
ungroup() %>%
mutate(prop = n/sum(n))
NA
Take the samples that were assigned to colonies at >90%.
The cumulative NMFS ids that we want to use in this self-assignment:
# make one big df for self-assignment
combined_df <- bind_rows(baseline, no_hi_missers)
# allele frequencies
kg2 <- combined_df %>%
select(NMFS_DNA_ID, locus, allele) %>%
mutate(Chrom = "GTseq") %>%
mutate(Pos = as.integer(factor(locus, levels = unique(locus)))) %>%
rename(Locus = locus,
Allele = allele) %>%
select(NMFS_DNA_ID, Chrom, Locus, Pos, Allele) %>%
ungroup()
# get the allele freqs
kg_ckmr_markers <- kg2 %>%
filter(!is.na(Allele)) %>% # it is vital to filter out the NAs at this stage
group_by(Chrom, Locus, Pos, Allele) %>%
summarise(counts = n()) %>%
group_by(Locus, Pos) %>%
mutate(Freq = counts / sum(counts)) %>%
select(-counts) %>%
mutate(AlleIdx = 1,
LocIdx = 1) %>%
reindex_markers(.)
`summarise()` has grouped output by 'Chrom', 'Locus', 'Pos'. You can override using the `.groups` argument.
# summary stats
kg_ckmr_markers %>%
group_by(Locus) %>%
tally() %>%
arrange(n) %>%
summarise(mean(n))
Remove the 8 loci that are out of HWE in three or more of the four populations.
# loci that deviate
outliers <- read_csv("csv_outputs/loci_out_hwe.csv")
Parsed with column specification:
cols(
Locus = [31mcol_character()[39m,
n = [32mcol_double()[39m
)
outliers$Locus <- gsub("_1", "", outliers$Locus)
# replace the weird characters in the locus name to match
combined_df$locus <- gsub(":", "_", combined_df$locus)
combined_df$locus <- gsub("-", "_", combined_df$locus)
Check the number of samples
library(lubridate)
# perform self-assignment on baseline and bycatch samples
sa_fulmar_years <- self_assign(reference = rubias_temporal_genos, gen_start_col = 5)
Summary Statistics:
1307 Individuals in Sample
133 Loci: scaffold10064_72410_72710.1, scaffold1073_6683_6983.1, scaffold11643_6155_6455.1, scaffold11659_1129_1429.1, scaffold11671_31645_31945.1, scaffold12151_88327_88627.1, scaffold12549_7519_7819.1, scaffold12572_95987_96287.1, scaffold12669_859_1159.1, scaffold12702_30730_31030.1, scaffold12724_15413_15713.1, scaffold12724_22667_22967.1, scaffold12724_4243_4543.1, scaffold12850_47111_47411.1, scaffold12907_16926_17226.1, scaffold12968_40534_40834.1, scaffold13158_16102_16402.1, scaffold13635_62902_63202.1, scaffold13913_12506_12806.1, scaffold14169_98394_98694.1, scaffold14302_10932_11232.1, scaffold14413_3901_4201.1, scaffold14476_28059_28359.1, scaffold14624_104543_104843.1, scaffold1506_64767_65067.1, scaffold15072_25695_25995.1, scaffold15458_19677_19977.1, scaffold15799_15980_16280.1, scaffold15856_64760_65060.1, scaffold16535_98016_98316.1, scaffold16631_31251_31551.1, scaffold16928_88219_88519.1, scaffold17092_25910_26210.1, scaffold18623_5641_5941.1, scaffold1915_31691_31991.1, scaffold19203_31036_31336.1, scaffold19438_69367_69667.1, scaffold1960_103029_103329.1, scaffold19776_1877_2177.1, scaffold20701_3279_3579.1, scaffold21348_18338_18638.1, scaffold21570_152862_153162.1, scaffold21894_2131_2431.1, scaffold21958_14622_14922.1, scaffold22707_13937_14237.1, scaffold23574_25017_25317.1, scaffold23733_33119_33419.1, scaffold24280_32394_32694.1, scaffold24335_14436_14736.1, scaffold24517_21781_22081.1, scaffold25430_21993_22293.1, scaffold25460_35667_35967.1, scaffold2649_12219_12519.1, scaffold26984_8747_9047.1, scaffold27163_13781_14081.1, scaffold2844_22566_22866.1, scaffold28505_21263_21563.1, scaffold28953_18028_18328.1, scaffold30583_3628_3928.1, scaffold30801_4180_4480.1, scaffold31213_3358_3658.1, scaffold31218_16437_16737.1, scaffold31581_33016_33316.1, scaffold32217_1745_2045.1, scaffold3263_8980_9280.1, scaffold3336_3819_4119.1, scaffold34234_7230_7530.1, scaffold34665_436_736.1, scaffold3469_2040_2340.1, scaffold34858_40611_40911.1, scaffold35021_6298_6598.1, scaffold3580_28673_28973.1, scaffold3748_3065_3365.1, scaffold37901_363_663.1, scaffold38389_5270_5570.1, scaffold38395_29487_29787.1, scaffold38398_6375_6675.1, scaffold3849_18022_18322.1, scaffold38904_5603_5903.1, scaffold38924_25817_26117.1, scaffold38997_7977_8277.1, scaffold391_998_1298.1, scaffold39244_11532_11832.1, scaffold39311_6143_6443.1, scaffold39606_9134_9434.1, scaffold39903_54057_54357.1, scaffold40040_9181_9481.1, scaffold40205_75784_76084.1, scaffold40437_4313_4613.1, scaffold41025_7850_8150.1, scaffold41491_6238_6538.1, scaffold42223_10380_10680.1, scaffold42370_758_1058.1, scaffold42423_2710_3010.1, scaffold42976_9097_9397.1, scaffold4335_28587_28887.1, scaffold43607_111864_112164.1, scaffold4364_33806_34106.1, scaffold44117_5212_5512.1, scaffold44250_17845_18145.1, scaffold44418_8527_8827.1, scaffold44855_55004_55304.1, scaffold44924_15256_15556.1, scaffold4595_103685_103985.1, scaffold46053_2758_3058.1, scaffold46494_5800_6100.1, scaffold46866_4310_4610.1, scaffold47695_45286_45586.1, scaffold478_2849_3149.1, scaffold48136_11747_12047.1, scaffold48982_19639_19939.1, scaffold51601_2552_2852.1, scaffold5538_31526_31826.1, scaffold5692_44_344.1, scaffold5965_24143_24443.1, scaffold6092_12052_12352.1, scaffold62_7494_7794.1, scaffold6266_78615_78915.1, scaffold6871_35099_35399.1, scaffold6960_6239_6539.1, scaffold7374_35025_35325.1, scaffold7567_99171_99471.1, scaffold7925_21844_22144.1, scaffold8143_77151_77451.1, scaffold8667_42523_42823.1, scaffold8861_15537_15837.1, scaffold8861_3177_3477.1, scaffold8910_49471_49771.1, scaffold90_29714_30014.1, scaffold9194_91800_92100.1, scaffold9205_103949_104249.1, scaffold9516_126012_126312.1, scaffold9981_29196_29496.1
46 Reporting Units: Semidi_2004, Semidi_2001, Chagulak_2002, Chagulak_2004, Pribilof_2004, Pribilof_2002, StMatthew_2002, Semidi_2003, Semidi_2017, Semidi_1993, Pribilof_2003, Pribilof_2014, StMatthew_2016, Semidi_2016, Chagulak_2016, Pribilof_2016, StMatthew_2014, Chagulak_2014, Semidi_2014, Chagulak_2017, Pribilof_2017, StMatthew_2017, Pribilof_2015, Chagulak_2015, Semidi_2015, StMatthew_2015, StMatthew_2012, Chagulak_2012, Semidi_2012, Pribilof_2012, Pribilof_2011, StMatthew_2011, Semidi_2011, Chagulak_2011, Semidi_2010, Pribilof_2010, Chagulak_2010, StMatthew_2010, Semidi_2009, Pribilof_2009, StMatthew_2009, Chagulak_2009, Pribilof_2018, Semidi_2018, StMatthew_2018, Chagulak_2018
46 Collections: Semidi_2004, Semidi_2001, Chagulak_2002, Chagulak_2004, Pribilof_2004, Pribilof_2002, StMatthew_2002, Semidi_2003, Semidi_2017, Semidi_1993, Pribilof_2003, Pribilof_2014, StMatthew_2016, Semidi_2016, Chagulak_2016, Pribilof_2016, StMatthew_2014, Chagulak_2014, Semidi_2014, Chagulak_2017, Pribilof_2017, StMatthew_2017, Pribilof_2015, Chagulak_2015, Semidi_2015, StMatthew_2015, StMatthew_2012, Chagulak_2012, Semidi_2012, Pribilof_2012, Pribilof_2011, StMatthew_2011, Semidi_2011, Chagulak_2011, Semidi_2010, Pribilof_2010, Chagulak_2010, StMatthew_2010, Semidi_2009, Pribilof_2009, StMatthew_2009, Chagulak_2009, Pribilof_2018, Semidi_2018, StMatthew_2018, Chagulak_2018
1.61% of allelic data identified as missing
# summarize repunit results
sa_to_repu_years <- sa_fulmar_years %>%
group_by(indiv, collection, repunit) %>%
top_n(1, scaled_likelihood) # just the top assignment for each sample
# summary of assignments without a likelihood threshold
sa_to_repu_years %>%
filter(scaled_likelihood > 0.9) %>%
group_by(repunit, inferred_repunit) %>%
tally()
NA
sa_to_repu_years %>%
filter(scaled_likelihood > 0.9) %>%
group_by(repunit, inferred_repunit) %>%
ggplot(aes(x = collection, fill = inferred_collection)) +
geom_bar(stat = "count") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, hjust = 0.9)
) #+
#scale_fill_manual(values = colors)