libraries:
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/
Conflicts with tidy packages --------------------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(rubias)
library(stringr)
The guys at DFO have been using GSI_SIM to run some tests on a big data set of theirs. This seemed like it would be a great chance to squash their data into rubias to see how it works. Huge baseline with 28,000 individuals and 302 SNPs…
Just for our records (i.e. these chunks are not set to evaluate), the data are in a GSI_SIM file, which I have turned into a space-delimited file that will be appropriate for rubias
like so:
# in: /Users/eriq/Downloads/mco_Apr6
le2unix bco_GSI_SIM_Mar_31_2017.txt | awk '
BEGIN {printf("sample_type repunit collection indiv")}
NF==1 {printf(" %s %s.1", $1, $1);}
$1=="POP" {if(go==0) printf("\n"); go=1; pop=$2; next;}
NF>100 {printf("reference lumped %s %s", pop, $1); for(i=2;i<=NF;i++) printf(" %s", $i); printf("\n");}
' > bco.txt
The read it in with readr and format it to be appropriate for rubias. Note that we don’t have any reporting unit designations from them, so we just put everyone into a single reporting unit called “lumped.” I’m curious to see if having only a sinble reporting unit breaks anything.
After reading it we compress it into an RDS file that only takes up 2.4 Mb. Way better than the 35 Mb text file format it was before.
bco <- read_delim("~/Downloads/mco_Apr6/bco.txt", delim = " ", progress = FALSE, na = "0") %>%
mutate(repunit = factor(repunit, levels = unique(repunit)),
collection = factor(collection, levels = unique(collection)))
# save it into development/data
saveRDS(bco, file = "development/data/bco_baseline.rds", compress = "xz")
Now we can just read that into R easily:
bco <- readRDS("development/data/bco_baseline.rds")
As it turns out, I think that having a single reporting unit does break things. So, for now, I will just arbitrarily break it into two (A-M) and (N-Z).
bco2 <- bco %>%
mutate(repunit = ifelse(str_detect(collection, "^[A-M]"), "first", "second")) %>%
mutate(repunit = factor(repunit))
I want to do self-assignment and then plot the resulting matrix as a raster using ggplot. Should not be hard, though it takes about two minutes.
# in: /Users/eriq/Downloads/mco_Apr6
le2unix bco_GSI_SIM_Mar_31_2017.txt | awk '
BEGIN {printf("sample_type repunit collection indiv")}
NF==1 {printf(" %s %s.1", $1, $1);}
$1=="POP" {if(go==0) printf("\n"); go=1; pop=$2; next;}
NF>100 {printf("reference lumped %s %s", pop, $1); for(i=2;i<=NF;i++) printf(" %s", $i); printf("\n");}
' > bco.txt
Now, from that we want to compute the mean scaled likelihood from each collection to each inferred collection
sa <- self_assign(bco2, 5)
Summary Statistics:
28005 Individuals in Sample
302 Loci: Oki_RAD100331_48, Oki_RAD100507_58, Oki_RAD101032_66, Oki_RAD101136_60, Oki_RAD101377_37, Oki_RAD101607_49, Oki_RAD102175_55, Oki_RAD102786_61, Oki_RAD104180_61, Oki_RAD104335_44, Oki_RAD104779_63, Oki_RAD105692_35, Oki_RAD106749_40, Oki_RAD107732_42, Oki_RAD109706_63, Oki_RAD111748_36, Oki_RAD112339_67, Oki_RAD112734_32, Oki_RAD115799_69, Oki_RAD13015_68, Oki_RAD14549_57, Oki_RAD15228_35, Oki_RAD17843_62, Oki_RAD20524_31, Oki_RAD23085_53, Oki_RAD23108_39, Oki_RAD27348_46, Oki_RAD27801_45, Oki_RAD345_59, Oki_RAD35869_61, Oki_RAD3702_58, Oki_RAD37037_59, Oki_RAD37101_64, Oki_RAD37278_54, Oki_RAD37493_51, Oki_RAD37537_45, Oki_RAD37979_59, Oki_RAD39712_67, Oki_RAD40740_40, Oki_RAD41030_31, Oki_RAD42227_43, Oki_RAD42296_68, Oki_RAD42356_106, Oki_RAD42611_52, Oki_RAD42662_33, Oki_RAD43051_33, Oki_RAD43472_33, Oki_RAD44444_52, Oki_RAD45691_45, Oki_RAD45746_54, Oki_RAD45971_66, Oki_RAD46141_169, Oki_RAD46160_48, Oki_RAD46744_47, Oki_RAD4695_38, Oki_RAD47313_50, Oki_RAD49257_48, Oki_RAD49488_47, Oki_RAD49978_70, Oki_RAD49991_69, Oki_RAD51585_47, Oki_RAD51713_42, Oki_RAD51912_66, Oki_RAD53121_66, Oki_RAD53155_32, Oki_RAD53655_42, Oki_RAD53705_68, Oki_RAD55090_49, Oki_RAD55249_49, Oki_RAD55327_41, Oki_RAD55690_46, Oki_RAD56094_43, Oki_RAD57956_47, Oki_RAD58946_67, Oki_RAD59556_32, Oki_RAD60246_68, Oki_RAD62768_58, Oki_RAD62917_33, Oki_RAD63267_39, Oki_RAD63468_42, Oki_RAD64084_65, Oki_RAD65234_35, Oki_RAD65345_34, Oki_RAD65466_31, Oki_RAD65902_30, Oki_RAD66265_54, Oki_RAD66324_35, Oki_RAD66397_30, Oki_RAD66663_68, Oki_RAD66922_64, Oki_RAD66976_58, Oki_RAD66994_58, Oki_RAD68033_63, Oki_RAD68125_68, Oki_RAD68190_55, Oki_RAD68238_42, Oki_RAD69161_64, Oki_RAD69779_33, Oki_RAD70461_61, Oki_RAD70496_44, Oki_RAD70600_60, Oki_RAD70812_52, Oki_RAD71346_48, Oki_RAD71442_69, Oki_RAD72460_47, Oki_RAD72503_39, Oki_RAD72979_40, Oki_RAD73094_68, Oki_RAD73130_59, Oki_RAD73526_35, Oki_RAD75608_42, Oki_RAD75909_38, Oki_RAD76045_68, Oki_RAD77210_64, Oki_RAD77342_49, Oki_RAD77803_60, Oki_RAD78773_67, Oki_RAD78988_32, Oki_RAD81387_37, Oki_RAD85448_97, Oki_RAD85949_47, Oki_RAD86669_50, Oki_RAD87446_62, Oki_RAD87621_67, Oki_RAD8763_51, Oki_RAD88551_51, Oki_RAD88786_44, Oki_RAD90555_50, Oki_RAD90652_58, Oki_RAD90772_47, Oki_RAD91362_68, Oki_RAD91470_66, Oki_RAD91750_33, Oki_RAD92243_63, Oki_RAD92616_68, Oki_RAD92875_31, Oki_RAD92978_42, Oki_RAD93028_59, Oki_RAD94260_66, Oki_RAD95534_53, Oki_RAD95743_39, Oki_RAD95780_30, Oki_RAD96072_42, Oki_RAD97168_52, Oki_RAD97325_35, Oki_RAD97903_41, Oki_RAD97993_40, Oki_RAD98485_66, Oki_RAD9927_35, Oki_RAD99298_47, Oki_RAD99659_52, Oki_RAD99813_32, Oki100771_83, Oki100884_210, Oki100974_293, Oki101119_1006, Oki101419_103, Oki101554_359, Oki101770_525, Oki102267_166, Oki102414_499, Oki102457_67, Oki102801_511, Oki102867_667, Oki103271_161, Oki103577_70, Oki103713_182, Oki104515_99, Oki104519_45, Oki104569_261, Oki105105_245, Oki105115_49, Oki1051321_169, Oki105235_460, Oki106172_60, Oki106313_353, Oki106419_292, Oki106479_278, Oki107336_45, Oki107607_213, Oki107974_46, Oki108505_331, Oki109243_480, Oki109525_359, Oki109651_152, Oki109874_122, Oki109894_418, Oki110078_191, Oki110381_77, Oki111681_407, Oki113457_324, Oki114315_360, Oki114448_101, Oki114587_309, Oki116362_411, Oki116865_244, Oki117043_374, Oki117144_64, Oki117286_291, Oki117815_369, Oki118152_314, Oki118175_264, Oki118654_330, Oki120024_226, Oki122593_430, Oki123205_88, Oki123470_92, Oki123921_90, Oki125998_324, Oki126160_142, Oki127760_301, Oki128302_547, Oki128757_232, Oki128851_185, Oki129870_552, Oki130295_48, Oki130524_184, Oki131460_243, Oki131802_368, Oki94903_192, Oki95318_100, Oki96127_66, Oki96158_278, Oki96222_70, Oki96376_63, Oki97954_228, Okiafp410_66, Okianp_168, Okiarp_105, OkiaspAT_273, OkibcAKal_274, Okicarban_107, Okigdh189_132, Okigh183_160, Okigshpx152_132, Okihsc71p313_136, Okihsf1b_85, Okihsp90B_83, Okiitpa_85, OkimapK3p_93, Okinips_100, Okipigh33_33, Okipoop5265_175, Okirpo2j235_212, OkiSClkF2R2_67, Okiserpin328_119, Okispf30119_107, OkiSWS1op38_141, Okiu6257_187, Ots_104048_143, Ots_105401_242, Ots_106747_266, Ots_107806_821, Ots_108390_387, Ots_110201_250, Ots_110381_445, Ots_118938_311, Ots_127236_107, Ots_128693_382, Ots_129303b_554, Ots_130720_411, Ots_96500_124, Ots_apoc1_832, Ots_aspat_229, Ots_Cath_D141_134, Ots_CCR7_182, Ots_CD59_2_553, Ots_Chin30up_495, Ots_crRAD17324_192, Ots_crRAD17527_50, Ots_crRAD18336_135, Ots_crRAD20376_46, Ots_crRAD21752_87, Ots_crRAD26541_169, Ots_crRAD27247_97, Ots_crRAD2806_95, Ots_crRAD30341_147, Ots_crRAD32203_124, Ots_crRAD33491_135, Ots_crRAD44588_193, Ots_crRAD46081_199, Ots_crRAD53756_119, Ots_crRAD57366_94, Ots_crRAD57520_52, Ots_crRAD73823_115, Ots_Est803_263, Ots_ETIF1A_269, Ots_FARSLA_224, Ots_hnRNPL_58, Ots_il_1racp_174, Ots_mapKpr_377, Ots_MHC2_185, Ots_nkef_148, Ots_parp3_45, Ots_PGK_54_568, Ots_RAD3470_45, Ots_RAD3513_104, Ots_stk6_521, Ots_TAPBP_579, Ots_u07_07_240, Ots_u202_249, Ots_U5121_459
2 Reporting Units: first, second
216 Collections: Aaltanhash_R, Ahnuhati_R, Albreda_R, Alouette_R_S, Atnarko_R, Avola_Cr, Barriere_R, Barriere_R_East, Beaver_Cr, Bella_Bella, Berners_R, Bessette_Cr, Big_QualicuM_R, Bing_Cr_Hatchery, Birch_Island_Channel, Birkenhead_R, Blue_R, Bonaparte_R, Bonneville_Hatchery, BriM_R, Bulkley_R, Canoona_R, Capilano_R, Chapman_Cr, Chase_R, Chehalis_R, Chilko_R, Chilliwack_R, Chilqua_Cr, Chown_Brk, Clackamus_R, Clearwater_Cr, Coal_Cr, Coates_Cr, Coghlan_Cr, Coldwater_R, Conuma_R, Cook_Cr_TOMF, Copper_Cr, CoquitlaM_R, Cowichan_R, Cypre_R, Damshilgwit_Cr, Datlamen_Cr, Deena_Cr, Docee_R, Drake_Cr, Dungeness_H, DuteaU_Cr, Eagle_Cr_Hatchery, Eagle_R, Elwha_R, Evelyn_Cr, Fennell_Cr, Ford_ArM_Lake, French_Cr, GastineaU_Channel, Gates_Cr, Gilttoyees_Cr, Gitnadoix, Glenlion_R, GoldstreaM_R, Goodspeed_R, Grizzly_R, Hagensborg_Sl, Harbour_Cr, Harris_Cr, Hartley_Bay_Cr, Heydon_Cr, Hicks_Cr, Hidden_Falls, Homathko_R, Honna_R, Hugh_Cr, Hugh_Smith_Lake, Inch_Cr, Indian_Cr, Issaquah_Cr, Jenny_Inlet_West_Cr, Johnston_Cr, Jones_Cr, Kainet_Cr, Kakweiken_R, Kanaka_Cr, Kasiks_R, Kawkawa_Cr, Kennedy_R_Up, Keogh_R, Khutze_R, Kilbella_R, Kiltuish_R, Kiskosh_Cr, Kitasoo_Cr, Kitimat_R, KitsumkaluM_R, KlukshU_R, Kootowis_Cr, Kwalate_Cr, Lachmach_R, Lemieux_Cr, Lewis_R_Hatch, Little_Campbell_R, Loon_Lk_Cr, Louis_Cr, Lyon_Cr, Mamin_R, MamquaM_R, Mann_Cr, Marblemount_H, Margaret_Cr, Martin_R, McClinton_Cr, McKinley_Cr, McLaughlin_Bay_Cr, McNeil_R, McNomee_Cr, Mercer_Cr, Meziadin_R, Minter_Cr, Momich_R, Morice_R, Nathan_Cr, Necleetsconnay_R, Neechanze_R, NehaleM_Hatchery, Nekite_Ch, Nicomekl_R, Nisqually_R, Nitinat_R, Nooksack_SF, Norrish_Cr, Oona_R, Pallant_Cr, Paril_R, Peach_Cr, Phillips_R, Pig_Channel, Poole_Cr, Post_Cr, Puntledge_R, Puyallup_R, Quaal_R, Quartcha_Cr, Quatlena_R, Queets_R, Quilcene_R, Quillayute, QuinsaM_R, Raft_R, Red_Cabin_Cr, Reflection_Lake, Robertson_Cr, Roscoe_Cr, Rosewall_Cr, Salloomt_R, Salmon_Cr, Salmon_R_LWFR, Salmon_R_SOTH, Salmon_R_TOMF, Sandy_Hatchery, Sangan_R, Scud_R, Senn_Cr, Serpentine_R, Seymour_R_GSMN, Seymour_R_JNST, Shale_H, Shaw_Cr, Sheemahant_R, Shovelnose_Cr, Shuswap_R_Middle, Siddle_Cr, Siletz_R, Siltcoos_Lk, Silver_Cr_NCST, Silverdale_Cr, Sinmax_Cr, Siuslaw_R, Skykomish_R, Slamgeesh_R, Smokehouse_Cr, Snohimish_R, Snootli_Cr, Sooke_R, Sorenson_Cr, Stave_R, Stillaguamish_R, Sustut_R, Tahkenitch_Lk, Tankeeah_R, Ten_Mile_Lk, Tenderfoot_Cr, Thornton_Cr, Thorsen_Cr_CCST, Tlell_R, Toboggan_Cr, Tranquil_Cr, Trask_Hatchery, Trent_R, Tseax_R, TumtuM_Cr, Tyler_Cr, Umpqua_R, Upper_Birkenhead_R, Verrett_(Stikine), Wakeman_R, Wap_Cr, White_R, Whitman_Lake, Whonnock, Willapa_R, Worth_Cr, Yakoun_R, Yaquina_R, Zolzap_Cr, Zymacord_R
8.94185167974971% of allelic data identified as missing
Maybe we can make an interactive heat map:
getwd()
[1] "/Users/eriq/Documents/git-repos/rubias"
We will set alpha_collection
to 0.01 which gives us about 4 to 6 populations with a fair number of individuals in it for each mixture (the rest are typically 0). If we do 600 of these, then each population should have a chance to have appreciable number at least a few times. We need to write a simulation function that we have more control over!
big_ugly <- assess_reference_loo(reference = bco2,
gen_start_col = 5,
reps = 600,
mixsize = 100,
seed = 5,
alpha_repunit = 100,
alpha_collection = 0.01)
We can make a big ol’ bunch of scatterplots like this:
g <- ggplot(big_ugly, aes(x = n/100, y = post_mean)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "blue") +
facet_wrap(~collection, ncol = 12)
ggsave(g, filename = "scatters.pdf", width = 15, height = 40)