libraries:

library(tidyverse)
library(rubias)
library(stringr)

Introduction

The guys at DFO have been using GSI_SIM to run some tests on a big data set of theirs. This seemed like it would be a great chance to squash their data into rubias to see how it works. Huge baseline with 28,000 individuals and 302 SNPs…

Getting Data into R

Just for our records (i.e. these chunks are not set to evaluate), the data are in a GSI_SIM file, which I have turned into a space-delimited file that will be appropriate for rubias like so:

# in: /Users/eriq/Downloads/mco_Apr6
le2unix bco_GSI_SIM_Mar_31_2017.txt | awk '
  BEGIN {printf("sample_type repunit collection indiv")} 
  NF==1 {printf(" %s %s.1", $1, $1);} 
  $1=="POP" {if(go==0) printf("\n"); go=1; pop=$2; next;} 
  NF>100 {printf("reference lumped %s %s", pop, $1); for(i=2;i<=NF;i++) printf(" %s", $i); printf("\n");}
'  > bco.txt 

The read it in with readr and format it to be appropriate for rubias. Note that we don’t have any reporting unit designations from them, so we just put everyone into a single reporting unit called “lumped.” I’m curious to see if having only a sinble reporting unit breaks anything.

After reading it we compress it into an RDS file that only takes up 2.4 Mb. Way better than the 35 Mb text file format it was before.

bco <- read_delim("~/Downloads/mco_Apr6/bco.txt", delim = " ", progress = FALSE, na = "0") %>%
  mutate(repunit = factor(repunit, levels = unique(repunit)),
         collection = factor(collection, levels = unique(collection)))

# save it into development/data
saveRDS(bco, file = "development/data/bco_baseline.rds", compress = "xz")

Now we can just read that into R easily:

bco <- readRDS("development/data/bco_baseline.rds")

As it turns out, I think that having a single reporting unit does break things. So, for now, I will just arbitrarily break it into two (A-M) and (N-Z).

bco2 <- bco %>%
  mutate(repunit = ifelse(str_detect(collection, "^[A-M]"), "first", "second")) %>%
  mutate(repunit = factor(repunit))

Running Some Simulations

We will set alpha_collection to 0.01 which gives us about 4 to 6 populations with a fair number of individuals in it for each mixture (the rest are typically 0). If we do 600 of these, then each population should have a chance to have appreciable number at least a few times. We need to write a simulation function that we have more control over!

big_ugly <- assess_reference_loo(reference = bco2, 
                                          gen_start_col = 5, 
                                          reps = 600, 
                                          mixsize = 100, 
                                          seed = 5,
                                          alpha_repunit = 100, 
                                          alpha_collection = 0.01)

Plot those simulations

We can make a big ol’ bunch of scatterplots like this:

g <- ggplot(big_ugly, aes(x = n/100, y = post_mean)) + 
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "blue") + 
  facet_wrap(~collection, ncol = 12)

ggsave(g, filename = "scatters.pdf", width = 15, height = 40)
LS0tCnRpdGxlOiAiVHJ5aW5nIG91dCBERk8ncyBkYXRhIgpvdXRwdXQ6IAogIGh0bWxfbm90ZWJvb2s6CiAgICB0b2M6IHRydWUKLS0tCgoKYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9CiMgc2V0IHRoZSB3b3JraW5nIGRpcmVjdG9yeSBhbHdheXMgdG8gdGhlIHByb2plY3QgZGlyZWN0b3J5IChvbmUgbGV2ZWwgdXApCmtuaXRyOjpvcHRzX2tuaXQkc2V0KHJvb3QuZGlyID0gbm9ybWFsaXplUGF0aChycHJvanJvb3Q6OmZpbmRfcnN0dWRpb19yb290X2ZpbGUoKSkpIApgYGAKCmxpYnJhcmllczoKYGBge3IgbG9hZC1saWJzfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShydWJpYXMpCmxpYnJhcnkoc3RyaW5ncikKYGBgCiMjIEludHJvZHVjdGlvbgoKVGhlIGd1eXMgYXQgREZPIGhhdmUgYmVlbiB1c2luZyBHU0lfU0lNIHRvIHJ1biBzb21lIHRlc3RzIG9uIGEgYmlnIGRhdGEgc2V0IG9mIHRoZWlycy4KVGhpcyBzZWVtZWQgbGlrZSBpdCB3b3VsZCBiZSBhIGdyZWF0IGNoYW5jZSB0byBzcXVhc2ggdGhlaXIgZGF0YSBpbnRvIHJ1YmlhcyB0byBzZWUgaG93IAppdCB3b3Jrcy4gIEh1Z2UgYmFzZWxpbmUgd2l0aCAyOCwwMDAgaW5kaXZpZHVhbHMgYW5kIDMwMiBTTlBzLi4uCgojIyBHZXR0aW5nIERhdGEgaW50byBSCgpKdXN0IGZvciBvdXIgcmVjb3JkcyAoaS5lLiB0aGVzZSBjaHVua3MgYXJlIG5vdCBzZXQgdG8gZXZhbHVhdGUpLCB0aGUgZGF0YSBhcmUgaW4gYSBHU0lfU0lNIGZpbGUsIHdoaWNoIEkgaGF2ZSB0dXJuZWQgaW50byBhIHNwYWNlLWRlbGltaXRlZCBmaWxlIAp0aGF0IHdpbGwgYmUgYXBwcm9wcmlhdGUgZm9yIGBydWJpYXNgIGxpa2Ugc286CmBgYHtzaCBjcnVuY2gtaXQsIGV2YWw9RkFMU0V9CiMgaW46IC9Vc2Vycy9lcmlxL0Rvd25sb2Fkcy9tY29fQXByNgpsZTJ1bml4IGJjb19HU0lfU0lNX01hcl8zMV8yMDE3LnR4dCB8IGF3ayAnCiAgQkVHSU4ge3ByaW50Zigic2FtcGxlX3R5cGUgcmVwdW5pdCBjb2xsZWN0aW9uIGluZGl2Iil9IAogIE5GPT0xIHtwcmludGYoIiAlcyAlcy4xIiwgJDEsICQxKTt9IAogICQxPT0iUE9QIiB7aWYoZ289PTApIHByaW50ZigiXG4iKTsgZ289MTsgcG9wPSQyOyBuZXh0O30gCiAgTkY+MTAwIHtwcmludGYoInJlZmVyZW5jZSBsdW1wZWQgJXMgJXMiLCBwb3AsICQxKTsgZm9yKGk9MjtpPD1ORjtpKyspIHByaW50ZigiICVzIiwgJGkpOyBwcmludGYoIlxuIik7fQonICA+IGJjby50eHQgCmBgYAoKVGhlIHJlYWQgaXQgaW4gd2l0aCByZWFkciBhbmQgZm9ybWF0IGl0IHRvIGJlIGFwcHJvcHJpYXRlIGZvciBydWJpYXMuICAgTm90ZSB0aGF0IHdlIGRvbid0IGhhdmUKYW55IHJlcG9ydGluZyB1bml0IGRlc2lnbmF0aW9ucyBmcm9tIHRoZW0sIHNvIHdlIGp1c3QgcHV0IGV2ZXJ5b25lIGludG8gYSBzaW5nbGUgcmVwb3J0aW5nCnVuaXQgY2FsbGVkICJsdW1wZWQuIiAgSSdtIGN1cmlvdXMgdG8gc2VlIGlmIGhhdmluZyBvbmx5IGEgc2luYmxlIHJlcG9ydGluZyB1bml0IGJyZWFrcyBhbnl0aGluZy4KCkFmdGVyIHJlYWRpbmcgaXQgd2UgY29tcHJlc3MgaXQgaW50byBhbiBSRFMgZmlsZSB0aGF0IG9ubHkgdGFrZXMgdXAgMi40IE1iLiAgV2F5IGJldHRlciB0aGFuCnRoZSAzNSBNYiB0ZXh0IGZpbGUgZm9ybWF0IGl0IHdhcyBiZWZvcmUuCmBgYHtyIHJlYWQtaXQsIGV2YWw9RkFMU0V9CmJjbyA8LSByZWFkX2RlbGltKCJ+L0Rvd25sb2Fkcy9tY29fQXByNi9iY28udHh0IiwgZGVsaW0gPSAiICIsIHByb2dyZXNzID0gRkFMU0UsIG5hID0gIjAiKSAlPiUKICBtdXRhdGUocmVwdW5pdCA9IGZhY3RvcihyZXB1bml0LCBsZXZlbHMgPSB1bmlxdWUocmVwdW5pdCkpLAogICAgICAgICBjb2xsZWN0aW9uID0gZmFjdG9yKGNvbGxlY3Rpb24sIGxldmVscyA9IHVuaXF1ZShjb2xsZWN0aW9uKSkpCgojIHNhdmUgaXQgaW50byBkZXZlbG9wbWVudC9kYXRhCnNhdmVSRFMoYmNvLCBmaWxlID0gImRldmVsb3BtZW50L2RhdGEvYmNvX2Jhc2VsaW5lLnJkcyIsIGNvbXByZXNzID0gInh6IikKYGBgCgpOb3cgd2UgY2FuIGp1c3QgcmVhZCB0aGF0IGludG8gUiBlYXNpbHk6CmBgYHtyIHJlYWxseS1yZWFkLWl0fQpiY28gPC0gcmVhZFJEUygiZGV2ZWxvcG1lbnQvZGF0YS9iY29fYmFzZWxpbmUucmRzIikKYGBgCkFzIGl0IHR1cm5zIG91dCwgSSB0aGluayB0aGF0IGhhdmluZyBhIHNpbmdsZSByZXBvcnRpbmcgdW5pdCBkb2VzIGJyZWFrIHRoaW5ncy4gIFNvLCBmb3Igbm93LCBJIHdpbGwKanVzdCBhcmJpdHJhcmlseSBicmVhayBpdCBpbnRvIHR3byAoQS1NKSBhbmQgKE4tWikuCmBgYHtyIGJyZWFrLWx1bXBzfQpiY28yIDwtIGJjbyAlPiUKICBtdXRhdGUocmVwdW5pdCA9IGlmZWxzZShzdHJfZGV0ZWN0KGNvbGxlY3Rpb24sICJeW0EtTV0iKSwgImZpcnN0IiwgInNlY29uZCIpKSAlPiUKICBtdXRhdGUocmVwdW5pdCA9IGZhY3RvcihyZXB1bml0KSkKYGBgCgojIyBSdW5uaW5nIFNvbWUgU2ltdWxhdGlvbnMKCldlIHdpbGwgc2V0IGBhbHBoYV9jb2xsZWN0aW9uYCB0byAwLjAxIHdoaWNoIGdpdmVzIHVzIGFib3V0IDQgdG8gNiBwb3B1bGF0aW9ucyB3aXRoIAphIGZhaXIgbnVtYmVyIG9mIGluZGl2aWR1YWxzIGluIGl0IGZvciBlYWNoIG1peHR1cmUgKHRoZSByZXN0IGFyZSB0eXBpY2FsbHkgMCkuICBJZgp3ZSBkbyA2MDAgb2YgdGhlc2UsIHRoZW4gZWFjaCBwb3B1bGF0aW9uIHNob3VsZCBoYXZlIGEgY2hhbmNlIHRvIGhhdmUgYXBwcmVjaWFibGUgbnVtYmVyCmF0IGxlYXN0IGEgZmV3IHRpbWVzLiAgIFdlIG5lZWQgdG8gd3JpdGUgYSBzaW11bGF0aW9uIGZ1bmN0aW9uIHRoYXQgd2UgaGF2ZSBtb3JlIGNvbnRyb2wgb3ZlciEKCmBgYHtyIGJpZy1zZXQtby1zaW1zfQpiaWdfdWdseSA8LSBhc3Nlc3NfcmVmZXJlbmNlX2xvbyhyZWZlcmVuY2UgPSBiY28yLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZ2VuX3N0YXJ0X2NvbCA9IDUsIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICByZXBzID0gNjAwLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbWl4c2l6ZSA9IDEwMCwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNlZWQgPSA1LAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBhbHBoYV9yZXB1bml0ID0gMTAwLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgYWxwaGFfY29sbGVjdGlvbiA9IDAuMDEpCmBgYAoKIyMgUGxvdCB0aG9zZSBzaW11bGF0aW9ucwoKV2UgY2FuIG1ha2UgYSBiaWcgb2wnIGJ1bmNoIG9mIHNjYXR0ZXJwbG90cyBsaWtlIHRoaXM6CmBgYHtyIG1ha2Utc2NhdHRlcnN9CmcgPC0gZ2dwbG90KGJpZ191Z2x5LCBhZXMoeCA9IG4vMTAwLCB5ID0gcG9zdF9tZWFuKSkgKyAKICBnZW9tX3BvaW50KCkgKwogIGdlb21fYWJsaW5lKGludGVyY2VwdCA9IDAsIHNsb3BlID0gMSwgbGluZXR5cGUgPSAiZGFzaGVkIiwgY29sb3VyID0gImJsdWUiKSArIAogIGZhY2V0X3dyYXAofmNvbGxlY3Rpb24sIG5jb2wgPSAxMikKCmdnc2F2ZShnLCBmaWxlbmFtZSA9ICJzY2F0dGVycy5wZGYiLCB3aWR0aCA9IDE1LCBoZWlnaHQgPSA0MCkKCgpgYGAK